Issue · May 19, 2026
Scaling's slowdown and the trouble with evals
The scaling-has-stalled argument resurfaced, and underneath it a quieter fight over whether any evaluation can really promise a model is safe.
Reading the slowdown
Gary Marcus has made the scaling-has-stalled argument since 2022, and he restated the case this week. He points to the flattening returns from raw pretraining and reads them as confirmation: you can’t buy your way to intelligence one token at a time. He and his critics agree the pretraining curve has bent. Where they split is what the bend means.
Nathan Lambert grants the flattening and locates the action elsewhere. By his account, most of this year’s gains arrived after pretraining was finished, from reinforcement learning and from spending more compute while the model answers. Pretraining leveling off, he argues, was expected by anyone building, and says little about where the whole system is headed.
Zvi Mowshowitz raises a methodological objection. The end of scaling, he notes, has been called several times since GPT-2, and each call has so far been wrong. His challenge to the plateau camp is to name in advance the result that would prove them wrong. He says they have not.
What turns on the answer is money. If the limit is in pretraining, spending shifts toward data and the post-training layer. If it is in the approach itself, much of today’s investment is aimed at the wrong thing. The two readings point in opposite directions, and the field has not settled which one it believes.
What an eval can promise
The evaluation fight got sharper this week, and the new edge is legal. Once a passing score can be written into a regulation, real money rides on the question of what that score actually proves.
METR published a critique of its own task suite. The suite measures whether a model can finish jobs that take a human from a few minutes to most of a workday. METR notes that the tasks have coverage holes, and that a model clearing them in the lab can still fail inside a live product. Its reading is narrow: a score describes how a model did on one specific set of tasks and not much more. Stanford HAI extends the point to policy. A measurement carries error bars, and writing one into law as a guarantee assumes a precision the tools lack.
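To make the error-bar point concrete, here is a minimal sketch in Python. It is not METR's or Stanford HAI's method. It assumes a hypothetical eval scored as a simple pass rate over independent tasks, and it uses the standard Wilson score interval to put a rough 95% range around that rate. The function names, the 42-of-50 result, and the 0.80 threshold are all invented for illustration. Sampling noise is also only one source of error; the coverage holes METR describes are not captured here at all.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


def clears_threshold(passes: int, trials: int, threshold: float) -> bool:
    """Hypothetical strict rule: pass only if the whole interval is above the threshold."""
    low, _ = wilson_interval(passes, trials)
    return low > threshold


# Hypothetical numbers: 42 of 50 tasks passed, against an invented 0.80 threshold.
low, high = wilson_interval(42, 50)
print(f"point estimate: {42 / 50:.2f}, interval: ({low:.2f}, {high:.2f})")
print("clears 0.80 under the strict rule:", clears_threshold(42, 50, 0.80))
```

Under these assumptions the point estimate sits above the threshold while the lower end of the interval sits below it, which is the gap a statute would have to decide how to treat.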
The case for thresholds runs the other way. Its proponents argue that a rough test you can run on every frontier model beats a perfect one that never arrives, and that waiting for better measurement means measuring nothing. Neither side disputes the other’s facts. They disagree about who should carry the risk while the tools are still crude, and that question gets settled in legislatures rather than labs.
Sources
- Where the gains went after pretraining · Nathan Lambert (Interconnects)
- On the latest round of 'scaling is over' takes · Zvi Mowshowitz (Don't Worry About the Vase)
- Deep learning is still hitting a wall · Gary Marcus (Gary Marcus)
- What our task suite can and cannot tell you · METR
- What eval scores don't prove · Stanford HAI