How to use AI agents for infrastructure management


Most organizations that have invested in AI tools for their infrastructure teams are not seeing the benefits they were promised.

Gartner predicts that global spending on AI-optimized IaaS will reach $37.5 billion in 2026. However, much of this spending is not delivering the expected return. A Gartner survey of 782 infrastructure and operations leaders found that only 28% of AI use cases in infrastructure and operations fully meet ROI expectations, and 20% fail completely. The cause is not a poorly chosen model but an incomplete implementation strategy. Simply changing AI vendors or allocating budget to more expensive tools will not solve this problem.

To realize the benefits of AI agents that automate infrastructure work, companies must tune the agents and provide them with business-specific data. This article explains how to give AI agents the data they need to succeed and how to address the serious security and operational concerns this technology can pose at the infrastructure layer.

Why do infrastructure AI agents perform poorly?

Engineers across organizations are treating AI agents as smarter search engines rather than properly integrating them into the platform. They throw every incident, error, and configuration issue at whichever AI agent is at hand and expect it to magically solve the problem. Most of the time, the result is a generic response. It may be technically correct, sound authoritative, and seem helpful, yet still be wrong for the environment and potentially disrupt operations.

AI agents can write infrastructure code, design configurations, and reason about complex problems. However, they have structural blind spots that prompting cannot overcome, because they are limited by their training data. The models behind general-purpose tools such as Claude Code and GitHub Copilot are trained on publicly available data. By default, these agents do not know how a particular company operates. This includes:

  • Naming conventions.
  • System constraints.
  • Internal service topology.
  • Custom abstractions.
  • Compliance policies.
  • Architectural decisions.
  • Post-mortem analyses.
  • Runbooks containing operationally important details.

Engineers can spend hours modifying and tweaking these AI agents to integrate them with their systems, negating the expected productivity gains. This is the gap that CIOs and executives must fill when evaluating AI tools for their infrastructure teams. Choosing an AI agent is only half the battle. The agent’s success depends on how the organization supplies it with organizational knowledge.

How to give infrastructure knowledge to AI agents

There are three approaches that companies can use to feed information about their infrastructure to AI agents.

1. Tribal knowledge

Experienced engineers write business-specific instructions into their prompts from memory, such as “This company uses…” This works only when the engineer happens to remember the correct information. The method is unreliable and does not scale: engineers can get important details wrong, and new team members lack the necessary knowledge.

2. Static documentation

Engineers can tell the AI where to find documentation (perhaps in a Markdown file) that describes internal standards. They can also paste its contents into every conversation with the model. However, this is a manual process, and depending on how fast the team moves, the documentation can quickly become outdated.

More importantly, organizational knowledge is more than just a few documents. It consists of valuable knowledge scattered across git repositories, Notion pages, Confluence pages, Slack threads, and Zoom transcripts. Many of these sources overlap or contradict each other, so the burden of copying and pasting them into every interaction with the AI is unsustainable.

3. Context-aware search pipeline

In reality, a single document may cover many different topics. It’s inefficient to provide an AI agent with all the details when it only needs information about the task at hand. Enterprises should implement retrieval-augmented generation (RAG) with two pipelines: one for ingestion and one for retrieval.

The ingestion pipeline captures your company’s documents wherever they reside and breaks them into chunks of data. A vector database stores, manages, and indexes this data. The retrieval pipeline receives queries from engineers and sends them to the Model Context Protocol (MCP) server. The MCP server converts the query into an embedding and performs a semantic search against the vector database to retrieve the relevant data. The LLM then combines this specific operational context with its general knowledge to generate a response.

Figure: How the RAG pipeline works. RAG requires a retrieval pipeline and an ingestion pipeline to work.
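
The two pipelines can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, `VectorStore` stands in for a real vector database, and all names, file names, and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # list of (doc_id, chunk_text, vector)

    def add(self, doc_id: str, chunk: str) -> None:
        self.rows.append((doc_id, chunk, embed(chunk)))

    def search(self, query: str, k: int = 2):
        query_vector = embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[2]), reverse=True
        )
        return [(doc_id, chunk) for doc_id, chunk, _ in ranked[:k]]


def ingest(store: VectorStore, doc_id: str, text: str) -> None:
    """Ingestion pipeline: split a document into paragraph chunks and index them."""
    for chunk in text.split("\n\n"):
        if chunk.strip():
            store.add(doc_id, chunk.strip())


def retrieve_context(store: VectorStore, query: str, k: int = 2):
    """Retrieval pipeline: embed the query and return the closest chunks."""
    return [chunk for _, chunk in store.search(query, k)]


store = VectorStore()
ingest(
    store,
    "naming.md",
    "Buckets are named team-env-purpose.\n\nClusters are named region-env-index.",
)
ingest(store, "runbook.md", "Restart the ingress controller before rotating certificates.")

top = retrieve_context(store, "how are buckets named", k=1)
print(top)
```

Only the chunk that matches the question is handed to the model, rather than the whole document it came from.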

Kubernetes controllers can automate document ingestion by continuously running the pipelines and synchronizing changed documents and resources. For most infrastructure teams, Kubernetes is where their workloads already reside, so there’s no need to introduce another orchestration layer.

Note that RAG has several moving parts, so the infrastructure is somewhat more complex. Data quality is also important, as poorly structured data can lead to unreliable results.

Data may also become outdated. If someone updates a source document and the old version remains in the vector database, RAG will retrieve inconsistent information. Engineers must design pipelines to remove old data, not just add new data.
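
One way to keep an index free of stale data is to reconcile it against the current sources on every run. The sketch below assumes a plain dictionary as the index and a content hash to detect changes; the function names and data layout are illustrative, not part of any specific vector database API.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync(index: dict, sources: dict) -> dict:
    """Reconcile the index with the current source documents.

    index:   doc_id -> {"hash": str, "chunks": list of str}
    sources: doc_id -> current document text
    """
    report = {"added": [], "updated": [], "removed": []}

    # Remove entries whose source document no longer exists.
    for doc_id in list(index):
        if doc_id not in sources:
            del index[doc_id]
            report["removed"].append(doc_id)

    # Add new documents and replace the chunks of changed ones.
    for doc_id, text in sources.items():
        digest = fingerprint(text)
        entry = index.get(doc_id)
        if entry is not None and entry["hash"] == digest:
            continue  # unchanged, nothing to do
        report["updated" if entry is not None else "added"].append(doc_id)
        chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
        index[doc_id] = {"hash": digest, "chunks": chunks}

    return report


index = {}
first = sync(index, {"runbook.md": "Use port 8080.", "old.md": "Deprecated."})
second = sync(index, {"runbook.md": "Use port 9090."})
print(first)
print(second)
```

Replacing all chunks of a changed document, rather than appending new ones, is what prevents the old and new versions from coexisting in the index.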

Prevent security risks with infrastructure AI agents

As AI agents become embedded in infrastructure, they become a top security and compliance concern. Three key security areas that businesses need to address early on are:

  • Authorization and access control. Agents are more than just passive tools. They have constant access to sensitive company data. As a result, agents should be treated like human employees with privileged access, since the blast radius of a mistake is just as large. An agent should be able to make changes to the infrastructure cluster, but it should not have access to the cloud billing system. It should be able to open pull requests, but it should not be able to merge its work into production without human approval.
  • Guardrails. These are important safeguards that limit what agents can and cannot do. Agents should not complete high-stakes actions without human involvement. This may include actions such as deploying databases, deleting data, and performing financial transactions.
  • Observability. AI reasoning is non-deterministic. Inputs, outputs, and LLM reasoning are unpredictable. Agents may invoke tools that engineers did not expect. Even if you ask an agent the same question twice, you may get different answers. For these reasons, your team must have observability into your agents. Existing observability tools can be extended to AI agents, covering their behavior and providing a unified view across tool calls and model inputs and outputs. This should be treated as a non-negotiable requirement, not an afterthought.

Figure: Agentic AI presents several security risks that can have devastating effects at the infrastructure layer.
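
The three areas above can be combined in a single authorization gate. The following is a rough sketch under stated assumptions: the action names, the `Gate` class, and the approval flow are hypothetical, and a real deployment would enforce these rules in the platform’s own access-control system rather than in application code alone.

```python
from dataclasses import dataclass, field

# Hypothetical action names; anything not listed here is denied outright.
ALLOWED_ACTIONS = {
    "read_cluster_state",
    "open_pull_request",
    "merge_to_production",
    "delete_data",
}
# High-stakes actions that need a named human approver.
NEEDS_HUMAN_APPROVAL = {"merge_to_production", "delete_data"}


@dataclass
class Gate:
    """Checks every agent action and records the decision for later review."""

    audit_log: list = field(default_factory=list)

    def authorize(self, agent: str, action: str, approved_by: str = None) -> str:
        if action not in ALLOWED_ACTIONS:
            decision = "denied"  # access control
        elif action in NEEDS_HUMAN_APPROVAL and approved_by is None:
            decision = "pending_approval"  # guardrail
        else:
            decision = "allowed"
        # Observability: log every request, including denied ones.
        self.audit_log.append(
            {
                "agent": agent,
                "action": action,
                "approved_by": approved_by,
                "decision": decision,
            }
        )
        return decision


gate = Gate()
print(gate.authorize("infra-agent", "open_pull_request"))
print(gate.authorize("infra-agent", "merge_to_production"))
print(gate.authorize("infra-agent", "merge_to_production", approved_by="alice"))
print(gate.authorize("infra-agent", "read_cloud_billing"))
```

Logging denied and pending requests, not just successful ones, gives the team a record of what the agent attempted as well as what it did.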

Operational challenges when scaling AI agents to fit your infrastructure

Two major operational challenges that engineers must prepare for when using AI agents for infrastructure development are context window constraints and cost.

Context window limitations

Ultimately, agents will be dealing with large amounts of data from a variety of sources. If engineers keep piling this data into the AI agent’s context window, the agent will quickly fail. A larger context does not produce better results. Rather, it can lead to decreased performance, increased costs, and inaccurate responses, rendering the system useless.

To prevent this, each interaction with the MCP server should start from a completely new context. The MCP server retrieves only the information relevant to the specific task, regardless of when that information was retrieved or created.
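
A fresh, bounded context per task might look like the sketch below. It assumes a `retrieve` callable that returns chunks ranked by relevance and uses a character budget as a crude stand-in for a token budget; both the function names and the sample data are hypothetical.

```python
def build_context(task: str, retrieve, max_chars: int = 2000) -> list:
    """Build a new context for one task; nothing carries over between calls."""
    context = []
    used = 0
    for chunk in retrieve(task):  # assumed to be ranked, most relevant first
        if used + len(chunk) > max_chars:
            break  # stop before exceeding the budget
        context.append(chunk)
        used += len(chunk)
    return context


def fake_retrieve(task: str) -> list:
    """Hypothetical retriever returning three ranked chunks for a task."""
    return [f"fact about {task}", "x" * 50, "y" * 50]


dns_context = build_context("dns", fake_retrieve, max_chars=70)
tls_context = build_context("tls", fake_retrieve, max_chars=70)
print(len(dns_context), len(tls_context))
```

Because the context list is created inside the function, data retrieved for one task cannot leak into the next.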

Cost

Running multiple systems simultaneously quickly increases the cost of an agentic AI system. A single query can trigger a multi-step reasoning chain that calls multiple tools and burns through tokens. Model routing allows engineers to route different types of requests to agents running different models.

Routing works better when the agents handle it themselves, deciding which models to use for which tasks. For simple tasks such as data summarization and classification, they can use inexpensive models and save more powerful models for complex reasoning.
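
The simplest form of model routing is a static lookup table, sketched below. This is a simplification of the approach described above, where agents choose the model themselves; the model names, task types, and prices are all hypothetical placeholders.

```python
# Hypothetical task types mapped to hypothetical model tiers.
MODEL_BY_TASK = {
    "summarize": "small-model",
    "classify": "small-model",
    "root_cause_analysis": "large-model",
}
DEFAULT_MODEL = "large-model"  # unknown tasks fall back to the stronger model

# Made-up prices in cents per 1,000 tokens, for illustration only.
PRICE_PER_1K_TOKENS = {"small-model": 1, "large-model": 10}


def route(task_type: str) -> str:
    """Pick a model tier for a task type."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)


def estimate_cost_cents(task_type: str, tokens: int) -> float:
    """Estimate the cost of a request under the illustrative price table."""
    return tokens * PRICE_PER_1K_TOKENS[route(task_type)] / 1000


print(route("summarize"), estimate_cost_cents("summarize", 2000))
print(route("root_cause_analysis"), estimate_cost_cents("root_cause_analysis", 2000))
```

Falling back to the stronger model for unrecognized tasks trades some cost for safety; a team more concerned about spend could reverse that default.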

IT leadership blueprint

For IT leaders making or defending agentic AI investments within their infrastructure, an architecture that truly delivers on its promise must include:

  • Multiple specialized agents. Instead of a single monolithic AI agent, use multiple AI agents, each scoped to a domain with clear responsibilities.
  • An MCP server. Companies need to integrate this server into the tools their engineers are already using.
  • A system context layer. This provides AI agents with enterprise knowledge and operational guidance.
  • A vector database. It stores the data that has been broken out from your company’s resources and documents.
  • Agent memory. Memory allows agents to learn from their experiences.
  • Guardrails. Prioritize guardrails around the actions that affect production systems, and include human-in-the-loop strategies.
  • An observability setup. Leaders need complete visibility into system performance and associated costs.

Wisdom Ekpotu is a DevOps engineer and technical writer focused on building infrastructure using cloud-native technologies.


