A key moment from Karen Hao's Empire of AI
Information is only as good as its source. We take that for granted with people, news outlets and other authorities. But the opacity around A.I. systems’ data and programming is complicating that connection.
I’m about midway through Karen Hao’s hefty history of OpenAI and the modern artificial intelligence industry, Empire of AI. It’s a fascinating and informative read, full of insider details about the birth of this industrial behemoth and the storytelling and myth-making around it.
I wanted to highlight a moment from the book for this week’s newsletter that felt salient and worth discussing as we continue to chew over A.I. — what it is and how it is or isn’t reshaping our world and working lives.
The book traces OpenAI as it grew from an idealistic project with a set of core principles into a $500 billion juggernaut that has since been sanding down some of those principles in its rapid pursuit of growth. The product that made the biggest splash and basically kicked off the A.I. arms race was ChatGPT, released in November 2022 to great fanfare. But a key moment in the underlying system’s development happened years before, when the company released an earlier model, GPT-3, to much less attention and hype, in 2020.
Its second-generation model, GPT-2, had been meticulously fed only high-quality information: data, books and articles. And since it hadn’t been associated with any major cases of harm, the company was feeling more bullish about its creation.
As developers worked on GPT-3, new pressures came to the fore, pressures that had existed before but would soon come to define the A.I. industry: the push to hyperscale the models, the datasets, the compute, and thus the company (and, as it turned out, the industry itself). This was due in part to a new infusion of investment from Microsoft of around $1 billion, Hao notes.
The biggest difference between OpenAI’s third-generation model and its predecessor was not the programming but the scale of the data it was fed and the compute behind it. Hao writes that the new model was scaled up so much that the “outcome would appear to many as beyond a difference of degree to a difference in kind.”
Much like the social media companies and their algorithms before it, the A.I. industry thrives on opacity about how its products work: the data they have ingested and the advanced programming that makes the systems go.
But Hao does a nice job of pulling back the curtain, just a bit, on what’s going on behind the scenes. She writes that the GPT-2 model, the selective one, had been trained in part on articles and websites that had been shared on Reddit, as part of a massive data pull. The only selection criterion? That they had received at least three upvotes on the platform.
For the next model, OpenAI decided to be less selective. Instead, programmers used an even wider array of Reddit links, a scrape of Wikipedia, and a mysterious library-type dataset called Books2, the details of which still appear to be murky. The size of these files, and the massive trove of works they contain, is hard to fathom.
Hao says sources told her that Books2 contained published books ripped from a digital library of torrented books and scholarly articles. But even that was not enough data! The OpenAI team then added a dataset known as Common Crawl, “a sprawling data dump with petabytes, or millions of gigabytes, of text, regularly scraped from all over the web,” which the team had previously avoided because of the poor quality of the information included therein.
This is highly technical stuff, which Hao does an admirable job of distilling for a general audience. But I include it — and found it interesting — because it starts to answer what A.I. systems really are when you look under the hood, past the fireworks you get on your screen.
Hao writes that these changes to GPT-3, namely the push to scale, were the pivot that led the industry to where it is today:
In the coming months, Amodei and Altman would clash over how and when to release GPT-3; Altman would win out, pushing the model into the world on an accelerated timeline. Years before ChatGPT, these two decisions — the one to explode GPT-3's size and the one to quickly release it — would change the course of AI development. It would set off a rapid acceleration of AI advancement, sparking fierce competition between companies and countries. It would fuel an unprecedented expansion of surveillance capitalism and labor exploitation. It would, by virtue of the sheer resources required, consolidate the development of the technology to a degree never seen before, locking out the rest of the world from participating. It would accelerate the vicious cycle of universities, unable to compete, losing PhD students and professors to industry, atrophying independent academic research, and spelling the beginning of the end of accountability. It would amplify the environmental impacts of AI to an extent that, in the absence of transparency or regulation, neither external experts nor governments have been able to fully tabulate to this day.
ChatGPT has made OpenAI famous as an answer bot, presenting itself as an authority with detailed responses, solutions, and proposals for problems small, large and unanswerable. It never sheds that tone of authority, even when offering up laughably wrong answers to simple queries.
But where does it get its information and authority? For anyone in the business of giving answers in written form, there’s a cardinal rule. Show your work. Disclose your sources. The idea, of course, is that answers are meaningless if they are not based on trustworthy sources.
And in ChatGPT’s case, Hao’s anecdotes show where at least some of those answers derive from: a bunch of slop on the Internet. A recent study found that a full 40 percent of references cited by A.I. tools like ChatGPT and Google’s AI Overviews come from Reddit, followed by Wikipedia. As much as Reddit posts can be a good resource for advice and news, that kind of heavy reliance feels precarious, particularly for a system in the business of answering Big Questions with supposed expertise.
I know someone with a mysterious and chronic medical issue who has been active in Reddit communities for others with that issue, as part of their search to understand the condition, what makes it better and what makes it worse. When, on a whim, my friend plugged a query about their symptoms into ChatGPT, it fed back a response that sounded a lot like their experience. They asked the bot for its source, and voilà: ChatGPT had cited one of their own Reddit posts to answer their question.
Perhaps this kind of information diet is not all that surprising for a system that is excellent at producing slop. But the A.I. chatbots are selling a kind of magic, one they appear capable of when they succeed at performing a credible simulacrum of human intelligence. Just like social media algorithms, though, these systems are the products of a) people, who have b) programmed them and c) fed them vast amounts of information. And until we know more about all of those things, particularly the sourcing, ethics and methodology that underpin them — information you would have some idea of from any traditional source of authority, like a doctor or a newspaper — all we’re left with is a bit of a mirage on our screens.
Here’s what else we’re reading this week:
On that note… “AI isn’t magic; it’s a pyramid scheme of human labor”: a well-reported Guardian feature about domestic data annotators and writing trainers for Google’s A.I. products.
A new report published by the Aspen Institute calls for a series of checks and balances to be instituted amid the A.I. boom, including roles that elevate workers’ voices into decision- and policy-making positions. “We need both sides of the coin: corporate regulation that puts workers and the public in the driver’s seat, and sectoral bargaining—especially in the tech sector—that gives workers industry-wide influence over how AI is deployed.”
Annie Lowrey in The Atlantic on how the job market is starting to resemble the worst parts of app-based dating, namely the sense of infinite scroll, the malaise, and the contradictory lack of engagement and opportunities in a sea of seemingly infinite options.
Still, a lot of job applicants never end up in a human-to-human process. The impossibility of getting to the interview stage spurs jobless workers to submit more applications, which pushes them to rely on ChatGPT to build their résumés and respond to screening prompts. (Harris told me he does this; he used ChatGPT pretty much every day in college, and finds its writing to be more “professional” than his own.) And so the cycle continues: The surge in same-same AI-authored applications prompts employers to use robot filters to manage the flow. Everyone ends up in Tinderized job-search hell.
And in more warning signs about the labor market, jobless claims soared to their highest level in nearly four years this week.