What a Single AI Benchmark Result Actually Costs
A conversation with Gavin Leech, co-author of The Scaling Era
We at Pathways to Progress were thrilled to host Gavin Leech, co-author of The Scaling Era: An Oral History of AI, 2019–2025 with Dwarkesh Patel, for an author Q&A session. Gavin is a machine learning researcher and co-founder of the research consultancy group Arb.
The questions that follow were asked by members of our Scaling Era reading group during a conversation with Gavin on February 15th. The conversation covers the current constraints on continued AI progress, the costs of large language model evaluations, the future of job automation, the next frontiers in AI, and more.
The Data Constraint Isn’t Dead
In your book, you describe different eras within the scaling phase of AI development—early on, the focus seemed to be on increasing data and parameters, and more recently on compute. Where do you think we are now? What is the main “low-hanging fruit” in terms of scaling?
If you ask the more optimistic people—especially those working in major labs—they will say the data constraint has effectively been removed. Their argument is that compute can now be converted into data through synthetic data generation, reinforcement learning environments, and related techniques. In their view, data is no longer scarce because we can generate as much as investors are willing to fund.
I am skeptical of that claim. I still think pretraining—meaning learning by imitating and generalizing from high-quality human data—remains more valuable. Last year we saw some disappointing large pretraining runs, most notably GPT-4.5: the model hallucinated less and produced slightly cleaner outputs, but the gains were modest, and current frontier models are actually smaller in pretraining scale than those earlier efforts. So even with developments like large-scale code generation and other advances, I am not convinced that the data constraint has disappeared.
There was something distinctive about the earlier period when models were trained on ten times more human data each year. That rapid growth seems harder to replicate now.
Other constraints have also emerged. Electricity became a major bottleneck, especially as energy costs rose. A gigawatt is roughly the output of a large nuclear power plant, and companies such as Amazon have attempted to secure dedicated energy sources. Until recently, the need for a concentrated electricity supply limited scaling. With systems like Gemini 2.0, labs have learned to run a single training process across multiple data centers separated by kilometers. This likely reduces the electricity bottleneck, though it remains technically difficult. Even physical transformers—the large industrial electrical components—have been genuine constraints. They are a small industry, built to order rather than commoditized, and they require a special “grain-oriented” steel.
Meanwhile, the demand for human data has not disappeared. There are now companies worth billions of dollars dedicated solely to producing high-quality human annotations and content; skilled contributors can be paid substantial sums, and subject-matter experts in regions like the Global South can earn very strong incomes through this work. Labs are also paying for data after the fact: Anthropic settled a copyright infringement class action brought by authors for around $1.5 billion in 2025. So despite claims to the contrary, human data remains a significant part of the scaling process.
The Hidden Price Tag on Frontier Breakthroughs
You mentioned earlier that you disagree with Dario’s view—particularly the idea that we might reach something like a “Nobel-level” model within twelve months. Where does your worldview differ from his?
To begin with, the claim is frustratingly vague. In principle, there are ways to spend very large amounts of money to push performance on specific benchmarks.
For example, consider the ARC-AGI benchmark, which consists of relatively simple pattern-completion grid tasks. Last year, OpenAI reportedly spent about $3 million—not on a full training run, but on a single evaluation run—just to evaluate around 1,000 grids, or roughly $3,000 per puzzle. That illustrates how compute spending can be used to push performance on narrow measures.
This is not to dismiss the broader point. The fact that money can be converted into some proxy for intelligence at all is historically unprecedented and potentially world-changing. But it is important to be precise about what is being purchased.
I have also been told that one of OpenAI’s recent physics results—apparently involving a new gluon interaction that seems to be correct—required roughly $250,000 of compute to produce that single result. So, if a lab were determined enough and willing to burn, say, $100 million to prove a point, perhaps they could achieve something impressive.
What I question is the more casual claim that we are on the verge of creating a “country of geniuses in a data center” at relatively modest cost—something like $10 million. That framing glosses over how expensive these frontier pushes actually are.
The Verifiability Test for AI Breakthroughs
How can someone stay informed about what is actually happening in AI without simply deferring to the authority of lab directors or prominent figures?
You should not defer to them automatically. Lab leaders have strong incentives, and they have been wrong before. Their claims need justification like anyone else’s.
A useful shift over the past eighteen months is that performance now depends heavily on how much compute and orchestration you apply to a task. If you are willing to spend very large sums—say, a million dollars on a single benchmark—you can achieve striking results. This is not just about making a model “think longer.” It can involve running thousands of parallel instances on the same task, monitoring their progress, terminating the weaker trajectories, and cloning the most promising ones. This approach—often called population-based learning—can dramatically boost performance on narrowly defined tasks.
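The orchestration Gavin describes can be sketched in miniature. The code below is a toy illustration, not anything a lab has published: the task, the mutation step, and the scoring function are all placeholders standing in for "an agent working on the problem" and "an automatic verifier."

```python
import random

def population_search(init, mutate, score, pop_size=8, rounds=20):
    """Toy population-based search: run many candidates in parallel,
    periodically terminate the weakest and clone the strongest."""
    population = [init() for _ in range(pop_size)]
    for _ in range(rounds):
        # Each candidate takes one step (stand-in for an agent working on the task).
        population = [mutate(c) for c in population]
        # Rank candidates by the automatic verifier / scoring function.
        population.sort(key=score, reverse=True)
        survivors = population[: pop_size // 2]
        # Clone promising trajectories to replace the terminated ones.
        population = survivors + [
            random.choice(survivors) for _ in range(pop_size - len(survivors))
        ]
    return max(population, key=score)

# Toy demo: candidates are numbers, "work" is a random step,
# and the verifier scores closeness to a target value of 30.
random.seed(0)
best = population_search(
    init=lambda: 0.0,
    mutate=lambda x: x + random.uniform(-1, 3),
    score=lambda x: -abs(x - 30.0),
)
```

The point of the sketch is only the shape of the loop: when a cheap scoring signal exists, throwing more parallel candidates at a task reliably converts compute into benchmark performance.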
From the outside, much of this remains opaque. Even people paying close attention do not have full visibility into internal lab processes. So instead of trying to track every rumor or paper, use a simple rule of thumb when evaluating claims.
When someone says, “Our system can do X at a superhuman level,” ask:
Is there an automatic verifier?
Is there software that can reliably and objectively check whether the output is correct?
Does the task have a clear right or wrong answer that can be evaluated at scale?
If the answer is yes, the claim is more plausible. Mathematics, for example, has formal proof verification systems. Coding has unit tests, which are imperfect but workable as automated checks. In these domains, scaling compute and search can yield real performance gains.
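For coding, the "automatic verifier" is often just a test suite. The hypothetical snippet below (the task and candidate functions are invented for illustration) shows why verifiability matters: given several model-generated candidates, a verifier can mechanically filter the correct ones, so sampling more candidates converts compute into accuracy.

```python
# Two hypothetical "model-generated" candidates for the same task:
# return the sorted distinct values of a list.
def candidate_a(xs):
    return sorted(set(xs))      # correct

def candidate_b(xs):
    return sorted(xs)           # buggy: forgets to deduplicate

def verify(candidate):
    """Automatic verifier: objective, cheap, and runnable at scale."""
    cases = [
        ([3, 1, 2, 3], [1, 2, 3]),
        ([], []),
        ([5, 5, 5], [5]),
    ]
    return all(candidate(inp) == expected for inp, expected in cases)

# Keep only the candidates that pass the verifier.
passing = [f.__name__ for f in (candidate_a, candidate_b) if verify(f)]
```

No equivalent filter exists for "write a truly great poem," which is why the same compute-scaling trick does not transfer to unverifiable domains.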
If the answer is no—if there is no cheap, reliable automatic verifier—then skepticism is warranted. Writing a truly great poem has no objective test. Proposing a new physical theory involves intuition and conceptual coherence that cannot be mechanically scored in the same way. In such cases, claims of dramatic breakthroughs are harder to validate.
Can Code-Writing Ability Unlock Everything Else?
So far, model progress seems jagged. Given that pattern, do you think it is realistic that one day a lab such as OpenAI or Anthropic could complete a training run and suddenly reach a dramatically higher level of capability?
With current models and current development processes, progress is indeed jagged. Capabilities improve unevenly, and systems sometimes get worse at specific tasks. Even trends in areas like prose quality are ambiguous; it is not always clear whether changes represent improvement.
The key question is whether exceptional software-writing ability is sufficient to unlock everything else. Current models are already very strong at producing software. If that capability were enough to automate AI research itself—designing better reinforcement learning environments, constructing training pipelines, and iterating on successors—then the trajectory could change.
For example, reinforcement learning environments are themselves software. In principle, a system could help build the environments used to train its successor. That would still require extensive reinforcement learning, which remains expensive and resource-intensive. The central issue is whether a system could compress billions of dollars of AI research effort into a much smaller compute budget. It is not obvious that this is possible.
I do not want to claim that human cognition is fundamentally magical. What humans do may ultimately be sophisticated interpolation over a rich training distribution, or the result of an extremely powerful evolutionary optimization process that produced strong internal representations. It is plausible—though somewhat humbling—that large language models, or related systems, could eventually match or exceed that process. The open question is whether their current strengths in software are enough to trigger that kind of self-accelerating dynamic.
Software Engineering Will Be Automated in Three Years. Most Jobs Won’t
In a recent podcast, Ilya Sutskever suggested that the scaling era may be ending and that we are returning to a research-driven era. Do you think scaling will continue to drive improvements—especially on verifiable tasks like math and coding—while progress on less verifiable tasks depends more on new research ideas? More broadly, what do you make of his claim?
I am genuinely uncertain.
My current best guess is that software engineering will be heavily automated in the next three years, while most other professions will not. Software engineering is unusual because much of the relevant context already exists in digital form—source files, documentation, repositories, and searchable databases. That makes it especially amenable to large-scale models trained on digitized corpora. Very few other jobs have that degree of structured, accessible context.
The open question is whether being extremely good at writing software is sufficient to produce a generally superhuman system. I do not know the answer. It might be enough, particularly if strong software ability enables systems to improve their own training environments or infrastructure.
Synthetic data is another uncertainty. If models increasingly generate their own training data, that process could compound—humans, after all, learn from each other. It is possible that sufficiently large-scale synthetic training continues to yield gains. I suspect that if someone were willing to spend a trillion dollars on scaling—more data, more compute, larger systems—it would probably continue to produce improvements.
In practice, however, no one is yet prepared to scale at that level. So while scaling may still work in principle, financial and practical constraints make it less certain that brute-force expansion alone will drive the next phase.
There’s a 15% Chance the Bubble Pops in 18 Months
Some of the arguments about AI companies’ long-term goals and profitability seem vague. There are claims about enormous returns—such as AI capturing a large share of global GDP or replacing wage income—but those projections can feel hand-wavy. At the same time, much of the current buildout is debt-financed.
How do you think about the sustainability of this model? And realistically, how much of the economic returns could AI companies capture?
One purpose of my book is to suggest that we should largely take these companies at their word. Their stated goal is to build a recursively self-improving system capable of solving essentially every other problem. In the case of OpenAI, this was originally framed as a nonprofit mission. Now it is a for-profit company pursuing many different initiatives, sometimes in ways that appear incoherent. Still, their public statements consistently describe extremely ambitious outcomes.
They openly talk about global growth rates of 50 percent per year at some stage, or compressing two centuries of scientific progress into a decade. If that were to happen, it would indeed be transformative—much of the last two centuries of scientific innovation has been profoundly beneficial.
There is also a moral dimension. I do not think most of the founders are motivated solely by personal wealth. For example, Sam Altman had no equity in OpenAI for many years. The co-founders of Anthropic have pledged to give away much of their wealth, though whether those pledges are legally binding is another matter.
Beyond profit, there is a race dynamic. If you believe this technology will be the most important development in history, you may not trust rivals to control it. OpenAI was, in part, founded because some individuals did not trust Demis Hassabis or others to dominate the field. There are documented legal disputes showing these tensions. So the motivations are a mix of expected profits, legacy, real utility, and a struggle over whose values shape the resulting systems.
It is a potent combination of incentives.
What paths do you see for AI companies actually capturing durable returns? And do you think we are in a bubble?
I think there is perhaps a 15 percent chance that the bubble pops within eighteen months. If we see several consecutive years of underwhelming model releases, investor confidence could collapse. The current system depends heavily on continued belief in rapid progress. Investor trust is foundational; if that disappears, the structure weakens quickly.
So far, every frontier model has effectively been a loss-maker. A company might spend $200 million on research and development, deploy the model for a few months, and then replace it with a new version. That cycle is unusual in industrial history at this scale.
If a bubble does burst, it would likely slow progress. The analogy we use in the book is the dot-com era. There was nothing fundamentally wrong with the internet companies; they were directionally correct but early. They overbuilt infrastructure—such as laying vast amounts of optical fiber across the United States—then many went bankrupt. Later firms acquired that infrastructure cheaply, and the internet still transformed the world.
Similarly, a correction in AI investment might delay progress by several years—perhaps five or six—but it would not eliminate the underlying technological trajectory.
Your Digital Twin Is Coming
Looking ahead six months to a year, what developments do you expect? What comes next?
Right now, most AI agents are relatively uninteresting outside of coding. For non-technical tasks—such as booking a hotel—the experience is functional but not transformative. What interests me more is the idea of a “digital twin”: a model trained on a large personal corpus that can act on your behalf, communicate with others, and selectively share information under defined permissions.
At present, most models are not deeply customized. I expect that to change. Custom fine-tuned models—ones that genuinely reflect an individual’s style, preferences, and values—could become significant within the next year. Instead of interacting with a generic corporate model, you would use one shaped to resemble you.
There are practical reasons for this. If you care about aesthetics, taste, or intellectual style, a system that understands your preferences can recommend music, filter information, or draft content in ways that align with you. Current models are already strong at pattern recognition and can infer aspects of your style from conversation alone. But fine-tuning on a personal corpus would make that alignment more precise.
There are early signs of movement in this direction. For example, Thinking Machines Lab—a newer lab founded by researchers previously at OpenAI and Anthropic—has released tools focused on making fine-tuning more efficient and accessible. If their strategy centers on personalized models, they are early to that space, and it is not yet widely recognized. Simile is another startup in this space. Their goal is to simulate all 8 billion people on earth.
Another likely development is AI systems that proactively guide users. Rather than responding only when prompted, they would initiate interaction—suggesting actions, structuring your day, or providing behavioral coaching. Even simple schedulers already exist that allow models to check in at specific times. The next step is more integrated assistance: acting as a trainer, teacher, productivity coach, or even a tool for managing attention or ADHD-like challenges.
On Skepticism, Adoption, and What Comes Next
There seems to be a strong narrative about AI coming from the people building these systems—many of whom have clear stakes in their success. At the same time, anecdotally, there also appears to be significant resistance among ordinary people.
For example, at a high school I’m familiar with, it is socially uncool to use AI. In computer science classes, students who rely on AI to generate code are looked down upon. It feels like there is a real cultural tension. I was not around for earlier technological transitions, but it is hard to imagine high school students reacting this way to personal computers or smartphones. Is this backlash unique to AI? What explains it?
First, skepticism is not limited to the general public. Many academics, even in 2026, remain unconvinced. In part, this may reflect early experiences: people tried the models two years ago, found them unimpressive, and have not revisited them since. Rapid progress means that impressions can quickly become outdated.
Among young people, there is also a cultural element. I teach teenagers, and perhaps 15 to 20 percent take pride in writing everything by hand. Whether or not they believe AI systems are capable, they see reliance on them as socially undesirable. That reaction is not surprising; new tools often clash with identity and norms, especially when the norms are associated with older generations or corporate culture.
More broadly, backlash to AI is not historically unusual. Throughout the history of technology, people have resisted adoption. Electric light had to compete with gas lighting. Trains had to overcome public fear. Major innovations routinely require persuasion, marketing, and social normalization. Resistance is a standard human response to large-scale change.
At the same time, AI appears to be one of the fastest-adopted technologies in history, likely even faster than smartphones. There is also evidence that usage is underreported. For example, workplace studies have shown that when people are asked whether their coworkers use AI—rather than whether they themselves do—reported usage rises significantly. One Microsoft study suggested that roughly half of workplace AI use was undisclosed at the time.
Looking at past technological transitions, are there high-level lessons that generalize? And do you think people today are applying those lessons when they try to encourage AI adoption or shape public attitudes toward these tools?
One common observation is that new technologies are overestimated in the short run and underestimated in the long run. In the AI context, this leads to predictions that within three years we will have total job automation or widespread social collapse. Those short-term forecasts are often exaggerated.
At the same time, many people struggle to imagine what strong AI would actually mean. They extrapolate from current systems by adding a modest multiplier—assuming future systems will simply be 4× or 10× better. Historically, that kind of linear extrapolation sometimes made sense. For example, in 1900 it would have been reasonable to imagine that trains in 2000 would be that much faster—and in some places, they are.
But AI may not fit that pattern. If full automation or recursive self-improvement were to occur, the change would not be incremental. It could fundamentally alter economic structures, labor markets, and even the physical landscape—for example, if the world became heavily reorganized around compute infrastructure. Many people find it difficult to conceptualize that scale of transformation.
Regarding backlash, it is entirely possible for AI to be both the fastest-adopted technology in history and the subject of a highly organized counter-movement. There are numerous reasons people might oppose it: concerns about plagiarism, job displacement, aesthetic degradation of creative work, erosion of small online communities, misinformation, and increasing centralization of economic power. In extreme scenarios, critics worry about existential displacement—creating systems that no longer require humans at all.
These concerns are not abstract. Political responses are already emerging. For example, Bernie Sanders has expressed support for measures such as pausing AI development or limiting data center expansion.
You can follow Gavin's work on his website, Twitter, or via his RSS feed.
If you are interested in joining future reading groups, subscribe to our newsletter and follow us on Twitter.



