What Robin Li Said Two Months Ago Just Came True in Silicon Valley

Meta once had an internal leaderboard with a rather clever name: Claudeonomics. 85,000 employees, ranked by token consumption — the more you burned, the higher you ranked. For a while it was a badge of honor. In 30 days, the company burned through 60 trillion tokens; the person at the top burned 281 billion in a single month. What happened next is easy to guess: plenty of people started running pointless model jobs just to climb the chart.

Looking at those numbers, I thought of something Robin Li said at Baidu’s developer conference in May — a line many people dismissed as concept-peddling at the time: “Tokens are not necessarily the endgame. They represent cost, not revenue; they measure input, not output.” Less than two months later, Silicon Valley validated that line one company at a time.

This disease has a name: Goodhart’s Law. When a measure becomes a target, it ceases to be a good measure. Veteran programmers watched its last outbreak — grading engineers by lines of code, which taught everyone to stuff repositories with long, terrible code. The token leaderboard is the AI edition of lines-of-code reviews: it measures activity volume, and says nothing about whether the work is any good.

Then the whole industry paid its tuition. Amazon shut down its internal chart-climbing board, KiroRank, with executives warning employees “don’t use AI for the sake of using AI.” Uber gave 5,000 engineers AI tools and burned the entire annual budget in four months; its COO later admitted there was no clear linear relationship between token consumption and valuable shipped products. Microsoft was blunter: after finding engineers burning $2,000 of tokens a month, its official line was that burning tokens had become more expensive than the employees themselves, and it hit pause. On July 1, Palantir CEO Alex Karp said it plainly on CNBC: the token-based pricing model, in his view, is thoroughly broken. From boom to backlash, the whole tokenmaxxing craze lasted six months.

Figure: Overseas backlash against token-maximalism

The books look even stranger: token unit prices have fallen more than 90% since 2023, yet corporate AI bills have doubled. Bain ran the numbers: the per-token price halved over a year while consumption rose 450%. Companies are spending more than ever, and can explain less than ever about what they bought.
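To make the arithmetic behind those figures concrete, here is a minimal sketch. It assumes spend is simply unit price times volume, and that "rose 450%" means consumption reached 5.5 times its starting level. The function names are illustrative, and the calculation ignores model mix, discounts, and anything else a real bill would include.

```python
def spend_multiplier(price_change: float, volume_change: float) -> float:
    """Return new_spend / old_spend for fractional changes in price and volume.

    price_change = -0.5 means the unit price halved;
    volume_change = 4.5 means consumption rose 450%.
    Assumes spend = unit_price * volume and nothing else.
    """
    return (1 + price_change) * (1 + volume_change)


def volume_multiplier_needed(target_spend_multiplier: float, price_change: float) -> float:
    """Return how many times volume must grow to hit a spend multiplier at a given price change."""
    return target_spend_multiplier / (1 + price_change)


# Figures quoted above: price halved, consumption up 450%.
bain_case = spend_multiplier(-0.5, 4.5)
print(f"Spend multiplier: {bain_case:.2f}x")  # prints "Spend multiplier: 2.75x"

# If prices fell 90% and the bill still doubled, volume must have grown about 20x.
needed = volume_multiplier_needed(2.0, -0.9)
print(f"Volume growth needed: about {needed:.1f}x")
```

Under these assumptions, a halved price is swamped by a 5.5-fold rise in volume, which is why falling unit prices and rising bills are not a contradiction.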

The failures come in many shapes, but the root cause is one thing: tokens are an input-side number, and input-side numbers simply cannot answer “what got done.” It’s like trying to compute a team’s output from its overtime hours. What this round of tuition bought is a lesson: the yardstick has to move to the output side.

So what do you count on the output side? Robin Li’s answer back then was DAA — Daily Active Agents: how many agents actually worked for people today and actually delivered results. It’s the counterpart of DAU, the metric everyone in mobile internet understands — except instead of counting “how many people opened the app,” it counts “how much work got finished.” In his words: “What matters is how many agents are working for humans and delivering results. That is closer to value, and closer to the essence, than meaningless token consumption.”
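The difference between the two yardsticks can be sketched in a few lines. The source does not give a formal definition of DAA, so everything here is an assumption made for illustration: the `AgentRun` record, its `delivered_result` flag, and the rule that an agent counts as active on a day if at least one of its runs delivered a result.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AgentRun:
    """One hypothetical log record for a single agent run."""
    agent_id: str
    day: date
    tokens_used: int
    delivered_result: bool  # assumed flag: the task finished with a usable result


def total_tokens(runs: list[AgentRun], day: date) -> int:
    """Input-side metric: every token burned that day, useful or not."""
    return sum(r.tokens_used for r in runs if r.day == day)


def daily_active_agents(runs: list[AgentRun], day: date) -> int:
    """Output-side metric: distinct agents with at least one delivered result that day."""
    return len({r.agent_id for r in runs if r.day == day and r.delivered_result})


today = date(2024, 1, 1)  # arbitrary example date
runs = [
    AgentRun("agent-a", today, 1_000, True),
    AgentRun("agent-a", today, 2_000, True),     # same agent, counted once
    AgentRun("agent-b", today, 900_000, False),  # heavy token burn, nothing delivered
    AgentRun("agent-c", today, 500, True),
]

print(total_tokens(runs, today))         # prints 903500
print(daily_active_agents(runs, today))  # prints 2
```

In this toy data, the run that burns the most tokens contributes nothing to the agent count, while the cheapest run does. How "delivered a result" gets verified in practice is the hard part, and this sketch does not address it.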

Last month, this yardstick got its weightiest piece of supporting evidence yet. I went back to the transcript of Nadella’s appearance on the Possible podcast. His exact words: “Microsoft has 20 million AI agents running right now.” Twenty million agents, inside one company. More interesting is the complaint that followed: he personally runs about 100 coding agents at once and admits the cognitive load is brutal, which is why these agents must be “fully inspectable, fully auditable,” given identities, sandboxes, and policies, and managed like employees.

Figure: DAA scale forecast for global AI companies

When a CEO starts worrying about how many of his 20 million agents are actually doing real work, the dashboard he needs is, in effect, counting DAA.

Of course, Goodhart’s Law spares no metric. Will DAA eventually get gamed too? Hard to say — I haven’t fully thought this through. But the two yardsticks differ structurally: gaming tokens takes a few extra API calls, so faking is nearly free; DAA counts completed task loops, and to fake “delivered a result” you pretty much have to do the work. Whether a metric deserves to be a yardstick isn’t about how precise it is — it’s about how expensive it is to fake. Lines of code lost on that count. So did tokens.

This round, the person who said it clearly ahead of time is the same person who, two months ago, was accused of just talking concepts.