The team ships a new agent feature, and it runs fine in the demo environment. A week later it goes to production, and routine cluster deployments roll through. Agents instantly forget what they were doing, drop half-completed tasks, and leave orphaned changes in their systems of record, frustrating customers. In effect, the feature has failed.
Most engineering teams spend far too much time chasing failures like this instead of building new products.
Organizations of all sizes deal with this scenario every day. Teams often blame the LLM first, assuming the model is hallucinating or has lost the thread. However, when I looked at the logs, in most cases the model was behaving as expected. Once the agent itself works, failures become harder to pinpoint, and teams must look at the execution layer beneath it.
Securing the execution layer: 4 ways to assess the durability of your AI agents
The resiliency of an agent system depends on the guarantees it provides on its inputs and execution order. If the agent’s work is long-running, distributed, and consequential, the right question to ask is: What kind of execution does the system guarantee? Teams can assess these capabilities using an execution maturity matrix with four independent axes, listed below. The matrix is not a single ladder, because real systems never mature so cleanly. The purpose of the maturity matrix is to show which capabilities are limiting the system. Enforcing these guarantees requires end-to-end operational automation for infrastructure provisioning, work routing, sandbox lifecycle management, capacity expansion, environment cleanup, and recovery when the infrastructure itself fails, without relying on manual remediation or specialized knowledge. For example, if a server crashes in the middle of a task, the system automatically moves the agent to a healthy server and resumes it exactly where it left off, rather than starting over.
- Execution durability: At the primitive end, state lives in memory. When a process crashes, the context and pending tool calls are destroyed. At the mature end, every step is persisted. The system knows what has happened, what is in progress, and what must happen next.
- Work duration: Primitive systems handle work that lasts a few seconds within a single open session. Mature systems support durable timers, durable waits that do not hold threads, polling, periodic jobs, human approvals, long-running tool calls, and resumable work. Agents and subagents can communicate across failures, allowing them to work safely for days or even months.
- Hosting and isolation: In mature systems, risky operations such as shell commands and CLI calls run in a provisioned, isolated environment with lifecycle management.
- Quality of Service (QoS): Systems without flow control experience unpredictable slowdowns, brownouts, and even outages during spikes in concurrency. Mature systems are designed to handle backpressure, priorities, fairness between callers and tenants, rate limits, quotas, fault isolation, and predictable degradation. The system can decide who gets how much capacity and when.
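To make the first axis concrete, here is a minimal sketch of what "every step is persisted" can look like, assuming a single process and a local JSON file as the store. The names StepJournal and run_workflow are hypothetical and invented for illustration; a production durable execution layer does considerably more (replication, versioning, concurrency control).

```python
import json
import os
from typing import Callable, Dict, List, Tuple

# A step is a name plus a function that receives the results of earlier steps.
Step = Tuple[str, Callable[[Dict[str, object]], object]]


class StepJournal:
    """Hypothetical record of completed steps, kept in a JSON file.

    Assumption: step results are JSON-serializable.
    """

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: Dict[str, object] = {}
        if os.path.exists(path):
            # A previous run left a journal behind: reload what already finished.
            with open(path, "r", encoding="utf-8") as f:
                self.results = json.load(f)

    def record(self, name: str, result: object) -> None:
        self.results[name] = result
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a half-written journal in place.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.results, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)


def run_workflow(steps: List[Step], journal: StepJournal) -> Dict[str, object]:
    """Run steps in order, skipping any that the journal says already completed."""
    for name, fn in steps:
        if name in journal.results:
            continue  # finished before the crash; do not repeat it
        result = fn(journal.results)
        journal.record(name, result)
    return journal.results
```

Rerunning run_workflow with a fresh StepJournal on the same path after a crash picks up at the first unrecorded step. One caveat of this sketch: a crash between a step finishing and its result being recorded causes that step to run again, so steps with side effects should be idempotent.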
Other dimensions such as security, identity, observability, and cost cut across all four axes rather than sitting within the matrix itself. Execution guarantees matter most when the stakes are high. If an agent has the authority to move funds, the approval path should be structurally enforced. A suggestion in the prompt is not enforcement.
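One way to picture structural enforcement of an approval path is a gate that is the only code path able to move funds and that refuses to act until an approval has been recorded. The sketch below is an in-memory illustration only; TransferGate, ApprovalRequired, and the account names are hypothetical, and it assumes approve() is reachable only from a human reviewer's channel, never from the agent's tool set.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


class ApprovalRequired(Exception):
    """Raised when execution is attempted without a recorded approval."""


@dataclass
class TransferGate:
    """Hypothetical gate: the only component allowed to change balances."""

    balances: Dict[str, int]
    pending: Dict[str, Tuple[str, str, int]] = field(default_factory=dict)
    approved: Set[str] = field(default_factory=set)

    def propose(self, src: str, dst: str, amount: int) -> str:
        """Agent-facing: record an intent to transfer. Moves nothing."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        request_id = uuid.uuid4().hex
        self.pending[request_id] = (src, dst, amount)
        return request_id

    def approve(self, request_id: str) -> None:
        """Human-facing (assumed): mark a pending request as approved."""
        if request_id not in self.pending:
            raise KeyError(request_id)
        self.approved.add(request_id)

    def execute(self, request_id: str) -> None:
        """Agent-facing: refuses unless an approval was recorded first."""
        if request_id not in self.approved:
            raise ApprovalRequired(request_id)
        src, dst, amount = self.pending[request_id]
        if self.balances.get(src, 0) < amount:
            raise ValueError("insufficient funds")
        # Consume the request so a single approval cannot be replayed.
        del self.pending[request_id]
        self.approved.discard(request_id)
        self.balances[src] -= amount
        self.balances[dst] = self.balances.get(dst, 0) + amount
```

No matter what the model outputs, calling execute() before approve() raises ApprovalRequired and leaves the balances untouched. The check lives in the code path rather than in the instructions given to the model.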
Where teams fall short
Most production agent systems are strongest at hosting and isolation. Many teams have some form of harness, cloud execution, sandboxing, and lifecycle management in place, but reliable execution is far less common.
An agent system is not just an agent loop but a collection of workflows: the control plane code that coordinates tools, manages state, and connects steps. Whether engineers write that code or agents generate it on the fly, much of it today is disposable and depends on ephemeral sessions staying alive. A durable execution layer underneath turns these workflows into durable automation, regardless of who created them. Completion is guaranteed, or the failure is explicit. Execution resumes after a crash, timers are durable, and subagents communicate across failures.
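As a rough illustration of a durable timer, the sketch below persists an absolute deadline instead of sleeping in a thread, so a restarted process can work out how much of the wait remains. DurableTimer and its file layout are hypothetical, and the sketch assumes a local file and a trustworthy wall clock. It uses wall-clock time because a monotonic clock does not carry over between processes.

```python
import json
import os
import time
from typing import Optional


class DurableTimer:
    """Hypothetical timer that stores its absolute deadline on disk."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> Optional[float]:
        if not os.path.exists(self.path):
            return None
        with open(self.path, "r", encoding="utf-8") as f:
            return float(json.load(f)["deadline"])

    def start(self, delay_seconds: float, now: Optional[float] = None) -> float:
        """Set the deadline once. Calling start() again after a restart
        returns the original deadline instead of resetting the wait."""
        existing = self._load()
        if existing is not None:
            return existing
        now = time.time() if now is None else now
        deadline = now + delay_seconds
        # Write to a temp file and rename it so the record is never half-written.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"deadline": deadline}, f)
        os.replace(tmp, self.path)
        return deadline

    def remaining(self, now: Optional[float] = None) -> float:
        """Seconds left until the deadline, or 0.0 if it has passed."""
        deadline = self._load()
        if deadline is None:
            raise RuntimeError("timer was never started")
        now = time.time() if now is None else now
        return max(0.0, deadline - now)
```

Because nothing blocks while the timer is pending, a scheduler can check remaining() whenever it wakes up, and a wait of days costs no thread. The optional now parameter exists only so the behavior can be exercised without real waiting.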
Focus
Agent systems are changing rapidly, and if every application has to implement its own durability, retry, timer, recovery, versioning, and throttling logic, your team will either move too slowly or fail in production. Often it does both.
Teams should stop rebuilding reliability primitives in every new agent codebase. You need to spend more time working on your product and less time on the machinery needed to keep it running in the face of failure.
The agent framework defines the behavior of the agent. A durable execution layer makes its work resilient and scalable even if the underlying infrastructure fails. As models take on more consequential work, the execution layer determines whether they can do that work safely and whether the guarantees needed to sustain it are in place.

Max Fateev, CTO and Co-Founder, Temporal
Max is Temporal’s CTO and co-founder. He is a 20-year veteran of AWS, Google, and Uber with experience as an engineering leader, leading the development of the SQS Replicated Message Store and Simple Workflow service at AWS, and then co-developing Cadence (predecessor to Temporal) at Uber. Today, millions of Temporal workflows run every day for highly reliable and highly scalable workloads, from Stripe to Datadog to Snapchat.
