Want to be as smart as Google’s BERT or Facebook’s LLaMA? Well then, you should keep reading this blog, as it was used to help train them.
For all the attention being paid to the current generation of AI chatbots built on large language models, such as ChatGPT, most of us know little about the text used to train them.
Now, The Washington Post has lifted the cover off this black box. Working with the Allen Institute for AI, it analyzed Google’s C4 data set, “a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs,” including Google’s T5 and Facebook’s LLaMA.
It then categorized all of those websites (journalism, entertainment, etc.) and ranked them by how many "tokens" each site contributed to the data set, tokens being the small bits of text the models use to process otherwise disorganized information.
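To make that ranking concrete, here is a minimal sketch of the general idea, not the Post's actual methodology: tally tokens per website domain, then sort sites by their share. The sample records, URLs, and the whitespace "tokenizer" below are all placeholder assumptions; real data sets like C4 rely on subword tokenizers.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical (url, text) pairs standing in for records in a web-scraped data set.
records = [
    ("https://example-news.com/story-1", "Breaking news about language models today"),
    ("https://example-blog.net/post-42", "A personal take on AI and training data"),
    ("https://example-news.com/story-2", "More reporting on how chatbots are built"),
]

def count_tokens(text: str) -> int:
    # Naive whitespace split as a stand-in for a real subword tokenizer.
    return len(text.split())

# Tally tokens per website (domain), then rank sites by their contribution.
tokens_per_site = Counter()
for url, text in records:
    tokens_per_site[urlparse(url).netloc] += count_tokens(text)

for site, n in tokens_per_site.most_common():
    print(site, n)
```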
In addition to analyzing all these sites, it