Surge AI's CEO says AI companies are prioritizing flash over substance.
“I'm worried that we're optimizing for AI slop instead of building an AI that advances us as a species, that cures cancer, that solves poverty, that understands the universe, that actually advances all these big grand problems,” Edwin Chen said on Sunday's episode of Lenny's Podcast.
“We're basically teaching the model to chase dopamine instead of the truth,” he added.
Chen founded AI training startup Surge in 2020 after working at Twitter, Google, and Meta. Surge runs the gig platform Data Annotation, which says it pays 1 million freelancers to train AI models. Surge competes with data labeling startups like Scale AI and Mercor, and counts Anthropic as a customer.
On Sunday's podcast, Chen said companies are prioritizing AI slop because of industry leaderboards.
“Currently, the industry is dominated by terrible leaderboards like LMArena,” he said, referring to popular online leaderboards where people can vote on which AI response is better.
“They haven't read carefully or checked the facts,” he said. “They skim through these answers for two seconds and choose the one that looks the flashiest.”
He added: “We're literally optimizing the model for the type of people who buy tabloids at the grocery store.”
Still, the Surge CEO said AI labs have to pay attention to these leaderboards, as they may be asked about their rankings during sales meetings.
Like Chen, research scientists have criticized benchmarks for overvaluing superficial traits.
Dean Valentine, co-founder and CEO of AI security startup ZeroPath, said in a blog post in March that most recent advances in AI models feel haphazard.
Valentine said that since the release of Anthropic's 3.5 Sonnet in June 2024, he and his team have been evaluating the performance of various models that claimed to offer “some improvements.” He said none of the new models his team tried made a “material difference” on the company's internal benchmarks or in developers' ability to find new bugs.
The models may have been “more fun to talk about” but “did not reflect economic utility or generality,” he added.
In their February paper “Can We Trust AI Benchmarks?”, researchers from the European Commission’s Joint Research Centre concluded that there are major problems with today’s evaluation approaches.
The researchers said benchmarks are “fundamentally shaped by cultural, commercial, and competitive dynamics, often prioritizing cutting-edge performance at the expense of broader societal concerns.”
Companies have been accused of “gaming” these benchmarks.
Meta released two new models in its Llama family in April, saying they delivered “better results” than comparably sized models from Google and French AI lab Mistral. It then faced accusations of gaming a benchmark.
LMArena said it “should have been more clear” that Meta had submitted a “customized” version of Llama 4 Maverick to perform better in test formats.
“Meta's interpretation of the policy was inconsistent with what we expect from model providers,” LMArena said in an X post.
