“…It’s true that LLMs have been fed enormous amounts of information from the internet, so the idea that they could replace a search engine seems natural at first. To build ChatGPT, OpenAI started out by doing the same thing that everyone else (Google, Anthropic) does when building an LLM: they obtained a snapshot of much of the text available on the internet at the time. The data used for training GPT-3 (the LLM that powered ChatGPT when it was first launched in late 2022) included all of Wikipedia, many websites sourced from Reddit links, an undisclosed number of books (likely numbering in the hundreds of thousands or more), and a great deal of the news, blogs, recipes, flame wars, and the rest of the mess that makes up the modern internet.

But, crucially, that doesn’t mean that ChatGPT or any other LLM actually has all of that information inside itself. Instead, the software engineers training the LLM first break down the text into small chunks called tokens, usually around the size of a single word. Then they feed the tokenized text into the LLM, which analyzes the connections between the tokens. All the LLM knows about are tokens and the connections between them—and all it knows how to do is generate new strings of tokens in response to whatever input is given to it. So in one sense, ChatGPT and other LLMs are text-prediction generators: give ChatGPT text, in the form of a question or conversation, and it will try to respond in a manner similar to the text it was trained on—namely, the entire internet.

“Think of ChatGPT as a blurry JPEG of all the text on the Web,” wrote the science fiction author Ted Chiang. “It retains much of the information on the Web, in the same way that a JPEG retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation...”
“In other words: ChatGPT is a text generation engine that speaks in the smeared-out voice of the internet as a whole. All it knows how to do is emulate that voice, and all it cares about is getting the voice right. In that sense, it’s not making a mistake when it hallucinates, because all ChatGPT can do is hallucinate. It’s a machine that only does one thing. There is no notion of truth or falsehood at work in its calculations of what to say next. All that’s there is a blurred image of online language usage patterns. It is the internet seen through a glass, darkly.”
“— More Everything Forever: AI Overlords, Space Empires, and Silicon Valley's Crusade to Control the Fate of Humanity by Adam Becker”
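To make the excerpt's token-prediction idea concrete, here is a minimal toy sketch in Python. It is emphatically not how GPT-3 or ChatGPT is actually built: it "tokenizes" a tiny invented corpus by splitting on whitespace, counts which token follows which, and then generates new text purely from those counts. The corpus and function names are made up for illustration; real LLMs use subword tokenizers and neural networks rather than bigram counts, but the generate-the-next-token loop is the same in spirit.

```python
# Toy sketch of "all it knows are tokens and the connections between them":
# tokenize a small corpus, record which tokens follow which, then generate
# new token strings from those statistics alone.
import random
from collections import defaultdict, Counter

# Invented training text for illustration only.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog . the dog chased the cat ."
)

# "Tokenize": here we simply split on whitespace; real tokenizers use subword units.
tokens = corpus.split()

# Learn the "connections between tokens": count which token follows each token.
follows = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    follows[current][nxt] += 1

def generate(prompt: str, length: int = 12, seed: int = 0) -> str:
    """Extend the prompt one token at a time, sampling each next token in
    proportion to how often it followed the previous token in the corpus."""
    rng = random.Random(seed)
    out = prompt.split()
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:  # token never seen during "training": stop generating
            break
        candidates, weights = zip(*options.items())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(out)

print(generate("the cat"))
```

Run it and the output reads like plausible corpus-flavored text, yet nothing in the loop checks whether any generated sentence is true; that is the point the excerpt makes about hallucination being the model's only mode of operation.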
