Anyone who has ever worked in a company has seen this play before.
A new metric is introduced, usually as a KPI or dressed up as one.
Then the game begins.
Last year, Amazon built an internal AI leaderboard called KiroRank. It measured and ranked employees based on how well they used AI to perform their jobs.
The problem surfaced when managers discovered that employees were programming AI assistants to perform useless, mundane tasks in an effort to drive up their scores. This in itself was only a moderate problem. The real problem appeared in Amazon’s financial statements.
AI models work with tokens, which are small chunks of data. When a model processes input, sentences, words, and images are broken down into tokens and processed on GPUs.
This avalanche of new processing led to a significant spike in token usage and an accompanying increase in costs.
The leaderboard was part of a push to get 80% of Amazon developers to use AI in any given week. It also came in the shadow of thousands of layoffs aimed at cutting costs and keeping pace with a $200 billion investment in AI.
David Treadwell, a senior vice president at Amazon, recently said that while the leaderboards were built with good intentions, they led to token maxing, a Silicon Valley trend in which employee performance is judged solely by AI usage.
Treadwell told employees, “Don’t use AI just for the sake of using AI. Use AI to solve customer problems, use AI to solve business problems, and use AI to innovate.”
All of this could have been predicted, right?
It’s hard to miss the irony in this. Amazon somehow found a way to make AI more expensive than hiring people.
It’s interesting that Amazon discovered the power of incentives, but it’s also interesting how closely this case echoes both history and fiction.
This episode has the flavor of the classic science fiction short story The Midas Plague, published in 1954 by Frederik Pohl. The story is set in a future world where robots overproduce everything and create unparalleled abundance. As a result, each person is required to consume more and is given a consumption quota. Eventually, a man who programs his own robots to do his consuming for him, and so meets his quota, is hailed as a genius.
More than a century ago, the British in colonial India discovered the power of perverse incentives. They set bounties on cobras to reduce the cobra population. Eventually, authorities noticed an increase in the number of snakes and discovered a large cobra farm where thousands of cobras were being bred to collect the bounty.
In education, teachers who are evaluated by test scores often end up “teaching to the test” rather than helping students fundamentally understand the content. This produces the exact result that tests were designed to prevent: students who can pass without actually understanding the material.
Evaluating fast food workers on speed risks compromising other important metrics such as quality and safety.
These metrics also overlook the importance of other outcomes. For example, Chick-fil-A ranks first in customer satisfaction and has the highest profit margin per store, despite having one of the slowest fast food lines in the industry.
The important thing was to make money, right?
You have to admit that there’s something terribly funny about Amazon creating a set of incentives for using AI, and then canceling their plans as soon as people respond to those incentives by using AI. Isn’t that the pinnacle of corporate stupidity?
Their team could have anticipated this mistake simply by referring to Goodhart’s Law, which states that once a measure becomes a target, it is no longer a good measure (this explains the previous examples).
It was painful to read this story because it reminded me of my time as a financial analyst. I spent countless hours moving money around for executives, all to keep the KPIs green (even when they should have been red). I wondered, “Was this really the point of this program?” My time was being sucked away from activities that added real value, just to create a cover.
Companies should think carefully about how they benchmark these AI initiatives. History has taught us time and time again that human nature gravitates toward gaming the system and taking shortcuts.
Don’t get the point wrong
Yes, AI tools have many great uses.
The real point is that outcome-based implementations tend to produce better results in the long run. Consider that 95% of generative AI pilots (test runs of AI programs within enterprises) fail. This is primarily because organizations measure success by adoption rates rather than by what the usage actually achieves. The problem is compounded by vendor integrations that fail outright.
According to Neil Dahl, an IBM Consulting global managing partner, “Rather than systematically asking, ‘How can technology improve my company?’ some people were spraying and praying.”
Law clerks should not be asking themselves how much AI they have used. They should ask whether they found more errors, reduced document review time, or surfaced case files faster. However, many employees do not intuitively realize this themselves, especially when they are artificially forced to use AI.
This distinction between how much AI is used and what it actually produces is a key differentiator between leaderboards and actual strategy.
Determining key performance indicators is not an easy task and is worthy of discussion and study. However, by working to get this approach right, companies can save significant amounts of time and money and avoid bad headlines like this one.
