Human-centered AI for SRE: Multi-agent incident response without losing control

AI News


A growing body of recent research and industry commentary suggests that a shift is underway in how organizations approach site reliability engineering. Rather than handing the pager off to machines, practitioners are designing multi-agent AI systems that work alongside on-call engineers, narrowing the search space and automating the tedious steps of incident investigation while leaving decisions to humans.

In a blog post detailing multi-agent incident response, Al Haqboian, co-founder of OpsWorker, an agent-based AI coworker-as-a-service company, argues that the real value of AI in incident management lies in orchestration. Haqboian describes a pattern in which specialized agents (for logs, metrics, runbooks, and so on) are coordinated by a supervisor layer that decides which agent works and in what order. According to Haqboian, the goal is not to replace humans outright, but to reduce the cognitive load on engineers by proposing hypotheses, formulating queries, and curating relevant context.
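The pattern can be sketched in a few lines: specialized agents each expose a narrow investigation capability, and a supervisor decides which agents run and in what order, collecting their hypotheses for a human to review. The agent names, findings, and data below are illustrative assumptions, not OpsWorker's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    agent: str
    hypothesis: str
    evidence: str

def log_agent(incident: str) -> Finding:
    # A real agent would query a log backend; this returns canned output.
    return Finding("logs", "error spike in checkout service",
                   "5xx rate rose 40x at 09:12")

def metrics_agent(incident: str) -> Finding:
    return Finding("metrics", "memory pressure on node pool",
                   "RSS near limit on 3/5 pods")

def runbook_agent(incident: str) -> Finding:
    return Finding("runbooks", "matches 'pod OOM' runbook",
                   "suggested step: check recent deploys")

AGENTS: dict[str, Callable[[str], Finding]] = {
    "logs": log_agent,
    "metrics": metrics_agent,
    "runbooks": runbook_agent,
}

def supervisor(incident: str, order: list[str]) -> list[Finding]:
    """Run specialized agents in a chosen order and collect their
    hypotheses for a human engineer to review: agents propose,
    the human decides."""
    return [AGENTS[name](incident) for name in order if name in AGENTS]

findings = supervisor("INC-1234: checkout latency",
                      ["logs", "metrics", "runbooks"])
for f in findings:
    print(f"[{f.agent}] {f.hypothesis} ({f.evidence})")
```

The key design choice is that the supervisor only sequences and aggregates; no agent acts on the system, which keeps the human as the sole decision point.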

AI teammate responsible for incident response © OpsWorker

The blog post summarizes the approach succinctly: the AI agent should propose hypotheses, queries, and remediation options for human judgment and approval. This framing aligns closely with a recent academic paper by Zefang Liu published on arXiv, which uses a tabletop framework of backdoor and compromise scenarios to study how teams of language-model agents work together during simulated cyber incidents.

Liu’s experiments compared centralized, decentralized, and hybrid team structures and found that homogeneous centralized and hybrid structures achieved the highest success rates. In contrast, decentralized teams of domain specialists lacked a leader and struggled to reach consensus. Liu’s findings suggest that autonomous agents working together without coordination can cause more disruption without solving problems any faster; for SREs, a supervisor or orchestrator may prove the better approach. However, mixed teams of domain specialists sometimes struggled more than homogeneous teams of generalists, even with a supervisor present, possibly because the specialists disagreed about priorities and could not converge on a single course of action.

The OpsWorker blog post indirectly addresses this issue by emphasizing explicit role design and structured handoffs, giving each agent clear tools and responsibilities to reduce the risk of deadlock.

Although this experiment verified technical feasibility, it became clear that a significant gap to production remained. Agents are great technical investigators, but they lack the safety management, reliability engineering, and operational maturity required for production incident response.
– Al Haqboian

Cloud consultancy EverOps recently published an article on how LLMs are transforming SRE operations without replacing engineers, supporting this view. The company reports that while only a minority of surveyed SRE professionals believe AI will replace their jobs within two years, a clear majority see AI as a tool that will make their jobs easier. Real-world use cases, the article states, focus on log ingestion and anomaly detection, triage automation, alert clustering, and search-based access to internal knowledge repositories. EverOps also highlights the gap between promise and performance, citing a ClickHouse experiment that tested several advanced language models in real-world root-cause-analysis scenarios, where autonomous analysis fell short of human investigation.
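Alert clustering, one of the use cases listed above, can be illustrated with a toy version: grouping near-duplicate alert messages by text similarity. The threshold and alerts here are made-up assumptions, and plain string similarity is a simple stand-in for the statistical or ML-based clustering production tools use.

```python
from difflib import SequenceMatcher

def cluster_alerts(alerts: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily assign each alert to the first cluster whose
    representative (first member) is sufficiently similar."""
    clusters: list[list[str]] = []
    for alert in alerts:
        for cluster in clusters:
            if SequenceMatcher(None, alert, cluster[0]).ratio() >= threshold:
                cluster.append(alert)
                break
        else:
            clusters.append([alert])  # no match: start a new cluster
    return clusters

alerts = [
    "HighErrorRate checkout-svc pod-7f3a",
    "HighErrorRate checkout-svc pod-9c1b",
    "DiskPressure node-12",
]
for group in cluster_alerts(alerts):
    print(len(group), group[0])
```

Here the two checkout-service alerts collapse into one cluster, which is the point of the technique: the on-call engineer sees one grouped signal instead of a page per pod.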

OpsWorker’s blog post shares that caution, emphasizing evaluation and safety. It recommends testing multi-agent configurations against realistic incidents and granting agents the least privileges necessary. Haqboian suggests deploying these agent technologies in stages, starting with read-only access and moving to controlled agent actions only after carefully verifying behavior. He also argues for guardrails and careful tool integration rather than clever prompting in incident situations, consistently calling for human supervision and warning of the risk of hallucinations when agents operate tools.
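A minimal sketch of that staged rollout, under assumptions of my own (the stage names, tool methods, and approval hook are hypothetical, not OpsWorker's API): read-only tools are always available, while mutating tools are gated by both the deployment stage and an explicit human approval callback.

```python
from enum import Enum
from typing import Callable

class Stage(Enum):
    READ_ONLY = 1            # initial deployment: agents may only observe
    CONTROLLED_ACTIONS = 2   # later stage: actions allowed with approval

class GuardedTools:
    def __init__(self, stage: Stage, approve: Callable[[str], bool]):
        self.stage = stage
        self.approve = approve  # human-in-the-loop approval hook

    def query_logs(self, query: str) -> str:
        # Read-only tools are permitted in every stage.
        return f"results for {query!r}"

    def restart_deployment(self, name: str) -> str:
        # Mutating tools are gated twice: by stage and by a human.
        if self.stage is not Stage.CONTROLLED_ACTIONS:
            return "denied: agent is in read-only stage"
        if not self.approve(f"restart {name}?"):
            return "denied: human rejected the action"
        return f"restarted {name}"

# Stage one: even with an agent asking, mutations are refused outright.
tools = GuardedTools(Stage.READ_ONLY, approve=lambda prompt: False)
print(tools.query_logs("5xx checkout"))
print(tools.restart_deployment("checkout-svc"))
```

Encoding the guardrail in the tool layer, rather than in the prompt, matches the post's argument that tool integration beats clever prompting: a hallucinating agent can ask for anything, but the wrapper only ever executes what the stage and the human allow.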

Amazon Web Services has published a detailed example of a multi-agent SRE assistant built on the Bedrock platform. The architecture closely mirrors the pattern in the OpsWorker blog post, with a supervisor coordinating four specialized agents for metrics, logs, topology, and runbooks, all connected to a synthetic Kubernetes backend. While the AWS example is vendor-focused and tied to specific services such as Bedrock and LangGraph, it shares the workflow-first mindset of the OpsWorker post.

Agentic SRE Architecture © AWS

Overall, these sources suggest that agentic SRE is maturing rapidly, but organizations are using it to augment rather than replace staff. OpsWorker’s blog post offers a thoughtful, detailed methodology for teams looking to integrate AI agents into their incident workflows while keeping human engineers in control.




