As global tech giants and capital flood into the AI agent space, OpenAI co-founder and AI guru Andrej Karpathy has issued a deeply counterintuitive warning: the cutting edge of agent capabilities now resides with independent developers and entrepreneurs, not with deep-pocketed companies. At the same time, major AI labs are trying to compensate for the underlying models’ structural deficiencies in end-to-end inference through ecosystem positioning, turning the contest from “competing on model parameters” into a melee over “locking down the workflow ecosystem.”
In a recent internal sharing session, Karpathy pinpointed what he sees as the industry’s core mistake: people are forcing agents to do work while ignoring the need to understand the underlying model first. Drawing on his first-hand experience at OpenAI, he revealed that in 2016 the company tried to use reinforcement learning to make agents perform routine tasks such as booking airline tickets and ordering food. The project failed completely and cost the company a full five years.
“The technology simply wasn’t ready at the time. The only hammer the team had was reinforcement learning. The right thing to do at that point was to completely forget about AI agents and focus all our energy on building language models,” Karpathy emphasized. His core logic follows three steps. First, stop fantasizing about agents that do everything and understand the underlying model first. Second, recognize the realities of the industry: the demo is easy, but the product takes 10 years to build. Third, understand that the agent itself is not the product; the underlying model is the real core. If the foundation is solid, agency will emerge naturally.
This argument finds subtle support in the latest moves by tech giants in the field of AI for Science (AI4S). On June 30th, Anthropic and OpenAI both made major bets in the AI4S space, and their choices revealed the very shortcomings of the underlying models that Karpathy had warned about.
Anthropic released Claude Science, a scientific research agent workbench, explicitly describing it as “not a new model.” Instead, it integrates existing functionality through workflows to handle scientists’ daily research processes. The workbench connects to over 60 scientific databases, comes with pre-built toolkits for genomics, protein structure, and chemistry, and pairs a main AI assistant with a fact checker; together they break down tasks and cross-validate results, much like a project manager. Technically, the workbench calls external vertical models via the MCP protocol to perform specific calculations, while Claude itself handles only natural language understanding, task decomposition, and result interpretation.
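The division of labor described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not Anthropic’s implementation: the `Step` type, the `TOOLS` registry, and the two toy functions are hypothetical stand-ins for external vertical models, and plain function calls stand in for requests that would travel over MCP in the described design. The general-purpose model’s own role (turning a request into a plan and interpreting the results) is represented only by the hand-written `plan` list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str      # name of the specialist tool to call (hypothetical)
    argument: str  # input passed to that tool


def gc_fraction(sequence: str) -> float:
    """Toy stand-in for an external genomics model."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    seq = sequence.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def residue_count(sequence: str) -> float:
    """Toy stand-in for an external protein-structure model."""
    return float(len(sequence))


# Hypothetical registry: the orchestrator only knows tool names,
# not how each specialist computes its answer.
TOOLS: Dict[str, Callable[[str], float]] = {
    "genomics": gc_fraction,
    "protein_structure": residue_count,
}


def run_plan(plan: List[Step]) -> Dict[str, float]:
    """Dispatch each step of the plan to its specialist tool."""
    results: Dict[str, float] = {}
    for step in plan:
        if step.tool not in TOOLS:
            raise KeyError(f"no specialist tool registered for {step.tool!r}")
        results[step.tool] = TOOLS[step.tool](step.argument)
    return results


def cross_validate(plan: List[Step], results: Dict[str, float]) -> bool:
    """Fact-checker role: recompute every step and compare with the reported value."""
    return all(TOOLS[s.tool](s.argument) == results[s.tool] for s in plan)


# In the described design the plan would come from the language model;
# here it is written by hand.
plan = [Step("genomics", "ATGC"), Step("protein_structure", "MKTAYIAK")]
results = run_plan(plan)
print(results, cross_validate(plan, results))
```

The point of the sketch is the shape of the architecture: the numerical work lives entirely in the registered tools, so the orchestrating model’s reliability matters mainly in planning and interpretation.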
Meanwhile, OpenAI launched GeneBench-Pro, an evaluation benchmark covering 10 fields including genomics and quantitative biology. Its test data shows that across 129 real-world scientific research workflow problems, even the most powerful model, GPT-5.6 Sol, achieved only a 28.7% end-to-end pass rate in the Max inference setting. Claude Opus 4.8, the best performing non-GPT model, had a pass rate of only 16.0%.
This data reveals a critical flaw that OpenAI has dubbed the “notice-to-act gap.” Although models can notice anomalies in the data and identify local diagnostic signals, they cannot translate that awareness into downstream methodological adjustments or make the correct analytical decisions that should follow. According to Wu Hao, founder and CEO of Lumitech, general-purpose large language models face three structural shortcomings in the life sciences. Among them: text tokenization rules cannot simply be applied to biological phenomena, and biological data is riddled with unknown and missing values.
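One way to make the notice-to-act gap concrete is to score noticing and acting separately for each problem. The Python sketch below is an illustration only: the `Outcome` record, the `notice_to_act_gap` function, and the four sample outcomes are hypothetical and are not taken from GeneBench-Pro, whose actual scoring method is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Outcome:
    noticed: bool  # the model flagged the anomaly in the data
    acted: bool    # the model correctly adjusted its downstream analysis


def notice_to_act_gap(outcomes: List[Outcome]) -> Tuple[float, float, float]:
    """Return (notice rate, notice-and-act rate, gap between the two)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    total = len(outcomes)
    notice_rate = sum(o.noticed for o in outcomes) / total
    # Acting only counts when the anomaly was also noticed.
    act_rate = sum(o.noticed and o.acted for o in outcomes) / total
    return notice_rate, act_rate, notice_rate - act_rate


# Made-up toy data: three of four anomalies are noticed,
# but only one leads to a correct methodological change.
sample = [
    Outcome(noticed=True, acted=True),
    Outcome(noticed=True, acted=False),
    Outcome(noticed=True, acted=False),
    Outcome(noticed=False, acted=False),
]
print(notice_to_act_gap(sample))
```

Under this framing, a model can look competent on the first number while its end-to-end pass rate tracks the much lower second number.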
The three major AI labs have adopted very different strategies in AI4S, reflecting their respective judgments about where the upper limit of model capability lies. Anthropic’s approach is the simplest: it aims to “own” the entire lane by using engineering to compensate for the model’s unreliability. Claude Science is available to Pro, Max, Team, and Enterprise subscribers. The company also recently launched a $30,000 grant program for 50 postdoctoral and graduate student projects, aimed at helping young scientists establish academic habits before becoming independent PIs.
OpenAI’s logic is to use GeneBench-Pro as the referee to define “what good AI4S looks like” and then use GPT-Rosalind (a specialized biological reasoning fine-tuned model launched 4 months ago) as the athlete to chase high scores. This model is available as a research preview for US enterprise customers subject to security review.
Google DeepMind has a unique trump card. It owns fundamental scientific models such as AlphaFold and AlphaGenome as its own assets, tightly bundled with Gemini for Science and integrated with over 30 life science databases. The models that other players can only access as external tools are part of Google’s own underlying infrastructure.
Karpathy offered a blunt assessment of this ecosystem competition. “When a paper is published about a new type of agent, teams at big labs also find it eye-opening, because they haven’t been developing it in that particular field in secret for five years,” he said. The implication is that the big labs are competing on roughly equal footing with everyone else. He also advised developers to take fresh inspiration from neuroscience, drawing on brain structures such as the hippocampus and thalamus to design memory, planning, and conflict resolution mechanisms for digital entities.
Notably, top-tier clients in the AI4S space are not yet locked in by a single giant. Pharmaceutical giant Novo Nordisk is simultaneously listed on Anthropic’s Claude Science case study customer list and OpenAI’s list of early Rosalind partners. The same client trying out solutions from multiple vendors in parallel shows that the market is still in an open competitive phase. No one company’s toolchain is powerful enough to convince scientists to migrate their complete workflows.
Karpathy’s warning and the actions of the giants point to the same reality: model capabilities are running into a ceiling at the “notice-to-act gap.” The traditional method of stacking computing power does not work in complex scenarios like scientific research, so engineering integration, ecosystem positioning, and data sovereignty have become the more practical breakthrough points. But as Karpathy pointed out, self-driving has already demonstrated a decade-long gulf between demo and product, and the same pattern applies to agents. They are easy to imagine and to demo, but making them work requires developers who are willing to put in 10 years.
For ordinary developers currently developing agents, Karpathy’s conclusion may serve as both a hard reality check and a sign of confidence: “You are at the forefront of this revolutionary technology.”
