Google DeepMind’s Philipp Schmid recently shared insight into why even experienced engineers face challenges when building AI agents. This talk, titled “Why (senior) engineers struggle with building AI agents,” focuses on five important “mental model conflicts” that arise when moving from traditional engineering practices to the world of AI agents.
Google DeepMind explains the pain of building AI agents — from an AI engineer
Visual TL;DR. The clash between the engineer's mindset and the agent's reality is what makes building AI agents a struggle. The diagram chains the talk's ideas: text is the new state, which leads to handing over control, which leads to treating errors as just inputs; that in turn leads both to replacing unit tests with evals and to the adapt-and-loop cycle.
Engineer thinking and agent reality: Traditional linear deterministic development versus stochastic adaptive agent development
Text is a new state: agents interpret and produce text for understanding and action.
Handover of control: Engineers must trust agents to make decisions and take actions
Errors are just input: Mistakes are learning opportunities for agents to improve and adapt
From unit testing to evaluation: Moving from rigorous code checking to comprehensive agent performance evaluation
The struggle to build AI agents: Senior engineers face mental model conflicts when building AI agents
Adaptation and loops: Agents observe, adapt, and repeat their actions based on feedback.
Engineer’s mindset and agent’s reality
Schmid begins by contrasting the deterministic nature of traditional software engineering with the probabilistic approach required for AI agents. In traditional software, engineers define explicit steps, write the code, test it rigorously, and deploy it. This process is linear and predictable. Building AI agents requires a different paradigm:
Define: Instead of strict definitions, agents are given instructions or goals.
Observe: Agents interact with the environment and receive feedback.
Adapt: Based on observations and feedback, agents adjust their behavior.
Loop: Repeating this cycle allows for continuous learning and improvement.
This fundamental difference in approach, Schmid explains, often leads engineers to try to code away the inherently probabilistic nature of AI, producing what he calls a "clash of mental models."
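The define, observe, adapt, loop cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: run_agent, toy_act, and toy_observe are hypothetical names, and the toy functions stand in for a real model and environment.

```python
from __future__ import annotations

from typing import Callable


def run_agent(
    goal: str,
    act: Callable[[str, list[str]], str],
    observe: Callable[[str], tuple[str, bool]],
    max_steps: int = 5,
) -> list[str]:
    """Give the agent a goal, then observe, adapt, and loop until done."""
    history: list[str] = []
    for _ in range(max_steps):
        action = act(goal, history)        # adapt: decide using all feedback so far
        feedback, done = observe(action)   # observe: the environment responds
        history.append(f"{action} -> {feedback}")
        if done:
            break
    return history


# Hypothetical stand-ins for a model and an environment.
def toy_act(goal: str, history: list[str]) -> str:
    # Raise the guess by one for every piece of feedback received.
    return str(len(history) + 1)


def toy_observe(action: str) -> tuple[str, bool]:
    return ("correct", True) if int(action) == 3 else ("too low", False)


history = run_agent("guess the number", toy_act, toy_observe)
for line in history:
    print(line)
```

The step budget (max_steps) is a design choice of this sketch: the loop is open-ended by nature, so something has to bound it.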
Key challenges and solutions
Schmidt identifies several key areas where engineers often encounter difficulties.
1. Text is the new state
Traditionally, software state is represented by discrete data structures and boolean values. For AI agents, especially those built on large language models (LLMs), text becomes the primary means of expressing information and intent. The trap is to reduce natural language instructions to simple boolean values, which discards their nuanced semantic meaning. The fix is to preserve that meaning by keeping the raw string and letting the agent interpret and process it downstream.
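A minimal sketch of the difference, assuming a hypothetical review step where a user replies to a proposed plan. ReviewState, lossy, and preserve are illustrative names, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class ReviewState:
    approved: bool   # coarse flag, kept only for cheap branching
    feedback: str    # raw user text, preserved for the agent downstream


def lossy(reply: str) -> bool:
    # The trap: the whole reply collapses into a single boolean.
    return reply.lower().startswith("yes")


def preserve(reply: str) -> ReviewState:
    # The fix: keep the raw string next to any derived flag.
    return ReviewState(approved=reply.lower().startswith("yes"), feedback=reply)


reply = "Yes, but only ship it to the US market first."
flag = lossy(reply)        # True, and the condition about the US market is gone
state = preserve(reply)

# The raw text travels with the state, so a later prompt can still use it.
prompt = f"User feedback on the plan: {state.feedback}\nRevise the plan accordingly."
print(flag)
print(prompt)
```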
2. Handover of control
In a microservice architecture, user intent is usually mapped to a specific route, and engineers instinctively hand-code these paths. Interactions with AI agents are more fluid and less deterministic. The trap is to treat the agent as a mere traffic controller that must follow a strict, predefined path. Instead, trust the agent as a dispatcher that can disambiguate intent. The key insight is to describe the outcome you want rather than the exact path to it, offering constraints and steps rather than a rigid route.
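To make the contrast concrete, here is a small sketch that sets a hand-coded route against a goal-and-constraints description handed to a model. The tool names, their descriptions, and the prompt wording are all assumptions made for illustration; no model is called.

```python
# Hypothetical tools an agent could dispatch to.
TOOLS = {
    "cancel_subscription": "Cancel the user's subscription. Use only when the "
                           "user clearly wants to stop the service.",
    "billing_support": "Answer questions about charges, invoices, and refunds.",
}


def keyword_route(message: str) -> str:
    # Hand-coded path: brittle, because one keyword decides the route.
    if "cancel" in message.lower():
        return "cancel_subscription"
    return "billing_support"


def build_dispatch_prompt(message: str) -> str:
    # Describe the goal and the constraints; leave the route to the agent.
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "Goal: resolve the user's request with the least disruptive action.\n"
        "Constraints: never cancel unless the user explicitly asks for it; "
        "ask a clarifying question if the intent is unclear.\n"
        f"Available tools:\n{tool_lines}\n"
        f"User message: {message}"
    )


message = "I was charged twice, should I cancel my card?"
route = keyword_route(message)            # misroutes to cancel_subscription
prompt = build_dispatch_prompt(message)   # would be sent to the model instead
print(route)
print(prompt)
```

The keyword router sends a billing question to cancellation, while the prompt leaves room for the model to weigh the whole message against the stated constraints.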
3. Errors are just input
Traditional software is often designed to fail fast, crashing as soon as an error occurs. This works well for deterministic systems but is counterproductive for AI agents. A fast failure on a minor schema error may cost only $0.50 and 5 minutes of debugging, but crashing the whole run at a late step (step 4 of 5) is unacceptable. The conflict arises when engineers treat every error as a critical failure. The fix is to treat errors as valuable input: catch them and feed them back into the agent's process so that it can self-correct and try a different approach.
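One way to picture this is a loop that catches the tool error and appends it to the conversation instead of crashing. The sketch below is built on assumptions: get_weather is a hypothetical tool, and fake_model is a stand-in for an LLM that corrects its call once it sees an error message.

```python
from __future__ import annotations

import json


def get_weather(args: dict) -> str:
    # Hypothetical tool with a strict schema.
    if "city" not in args:
        raise ValueError("missing required field 'city'")
    return f"Sunny in {args['city']}"


def fake_model(messages: list[str]) -> str:
    # Stand-in for an LLM: it fixes its call after seeing an error.
    if any(m.startswith("ERROR") for m in messages):
        return json.dumps({"city": "Berlin"})
    return json.dumps({"location": "Berlin"})  # wrong field name on the first try


def run_with_recovery(task: str, max_attempts: int = 3) -> tuple[str, list[str]]:
    messages = [task]
    for _ in range(max_attempts):
        raw = fake_model(messages)
        try:
            return get_weather(json.loads(raw)), messages
        except ValueError as exc:  # also covers json.JSONDecodeError
            # The error becomes input for the next attempt instead of a crash.
            messages.append(f"ERROR: {exc}. Fix the arguments and try again.")
    raise RuntimeError("agent did not recover within the attempt budget")


result, messages = run_with_recovery("What is the weather in Berlin?")
print(result)
print(messages)
```

The attempt budget still lets the run fail, but only after the agent has had a chance to correct itself.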
4. From unit tests to evaluation
Evaluating AI agents is very different from traditional software testing. Unit tests that rely on deterministic assertions are not sufficient. Schmid emphasizes the need to move to "evals", which are designed for non-deterministic output. This means running multiple trials per prompt to measure the distribution of outcomes. Negative cases matter too: testing whether the agent ignores irrelevant information is just as important as testing its core functionality. The focus should also be on the outcome rather than the specific path the agent took to reach it. In practice, this means measuring how often the agent succeeds and how reliable it is, rather than enforcing strict step-by-step compliance.
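A rough sketch of what such an eval might look like: repeat each prompt many times, score only the outcome, and include a negative case. Here flaky_agent is a hypothetical non-deterministic stand-in for a real agent, and the 0.9 success probability is an arbitrary assumption.

```python
import random
from typing import Callable


def flaky_agent(prompt: str, rng: random.Random) -> str:
    # Hypothetical agent: usually right on weather prompts, idle otherwise.
    if "weather" in prompt:
        return "sunny" if rng.random() < 0.9 else "I cannot help"
    return "no action"


def pass_rate(prompt: str, check: Callable[[str], bool],
              trials: int, seed: int = 0) -> float:
    """Run the same prompt many times and report the share of passing outcomes."""
    rng = random.Random(seed)
    passed = sum(check(flaky_agent(prompt, rng)) for _ in range(trials))
    return passed / trials


# Positive case: how often does the agent reach the right outcome?
positive = pass_rate("what is the weather?", lambda out: out == "sunny", trials=200)
# Negative case: does the agent stay idle on an irrelevant request?
negative = pass_rate("tell me a joke", lambda out: out == "no action", trials=200)

print(f"positive pass rate: {positive:.2f}")
print(f"negative pass rate: {negative:.2f}")
```

Instead of asserting a single exact output, a team would typically compare these rates against a threshold it considers acceptable.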
5. Agents will evolve, but APIs will not
A significant challenge lies in the mismatch between static APIs and dynamically evolving agents. Traditional APIs are often designed to a "human-grade" standard, relying on a developer to infer what clear, well-defined parameters mean. Agents, however, are literal and can hallucinate values for ambiguous parameters. The trap is to build APIs for agents as if the agents were human developers. The solution is to create "agent-aware" APIs that are explicit, verbose, and self-documenting. This means clearly describing the function and its expected behavior, including what happens if an item is not found, so that the agent has all the context it needs without guessing.
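As an illustration, a tool written for an agent might spell out its input format and its not-found behavior in the docstring, and return an instructive message rather than raising. The function, the SKU format, and the inventory data below are hypothetical.

```python
# Hypothetical inventory data for the example.
INVENTORY = {"sku-123": 4}


def get_stock(item_id: str) -> str:
    """Return the stock level for one item.

    Args:
        item_id: Exact SKU string, for example "sku-123". This is not a
            product name and not a search query.

    Returns:
        A sentence stating how many units are in stock. If the SKU does not
        exist, returns a message that says so and explains what to do next.
        This function does not raise for unknown SKUs.
    """
    if item_id not in INVENTORY:
        return (
            f"No item with SKU '{item_id}' exists. "
            "Ask the user for the exact SKU; do not guess."
        )
    return f"Item '{item_id}' has {INVENTORY[item_id]} units in stock."


print(get_stock("sku-123"))
print(get_stock("blue widget"))
```

Returning text for the not-found case is a choice made for this sketch: it hands the agent something it can act on in its next step.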
Summary: Trust but verify
Schmid concludes by summarizing the core principles for building effective AI agents.
Stop fighting the model: Accept that you are a dispatcher, not a programmer.
Preserve meaning: Treat text as the primary state, not just boolean values.
Design for recovery: Build agents that can learn from errors and adapt.
Evaluate, don't assert: Measure performance through multiple trials and LLM-as-a-judge assessments.
Expect to rebuild: Understand that agents evolve, and that agents and their underlying models will need to be rebuilt and improved over time.
The basic takeaway is that building AI agents requires thinking differently, accepting the probabilistic nature of these systems, and adapting traditional engineering methods accordingly.