Harness the power of GenAI observability to overcome the challenges of early AI adoption

AI News


Perspectives from the Broadcom AIOps and Observability Team

  • AI is already impacting society and businesses in countless ways. Enterprises are embracing AI and autonomous agents and working quickly to operationalize them.
  • Still, many challenges to adoption remain, including in the critical area of GenAI observability.
  • By understanding and addressing observability challenges at each layer of the AI stack, companies can reduce risk and move toward AI success faster.

In a frequently cited Wall Street Journal essay,1 Marc Andreessen famously declared that “software is eating the world.” Today, that insight seems equally applicable to artificial intelligence. AI is not a narrow add-on for enterprises. It is rapidly becoming an integral part of the products and services companies offer, as well as of their internal processes.

It’s a truism that a technology company’s core competency lies in disrupting the status quo with innovative technology and turning industry-wide change into competitive advantage. For companies experimenting with, deploying, and deriving value from AI, the core questions are familiar:

  • How can we minimize and manage risk, quickly and at scale?
  • What is the shortest path to value that fits our organization?

AI is in the first wave of disruption. Most organizations didn’t initially budget for AI. Only a select few have formally trained their teams (programmers, QA, users, business owners) on its usage, risks, and possibilities. And without a doubt, every company is urgently pivoting into the world of AI, as entire industries leap from proof of concept to production. In the near future, managers and shareholders will want to know the return on investment.

“84% of executives believe they need to leverage artificial intelligence (AI) to achieve their growth goals.”2

“While 38% are piloting agents, only 11% of organizations have agents in production.”3

As we move from proof of concept to operational AI, barriers to adoption still exist. At the same time, expectations for positive return on investment from AI are increasing.

Challenges of AI implementation

In the Deloitte report cited above, analysts note that one of the challenges is “companies trying to wrap cutting-edge GenAI into broken legacy processes rather than redesigning them.” Many companies face a similar set of challenges:

  • Usage escalation: Token consumption is skyrocketing even before buyers and users fully understand pricing models, and the AI resources IT supports (applications, APIs, infrastructure, networks) are being stretched well beyond their intended limits. User behavior is even reshaping corporate policies in unexpected ways.
  • Identifying and vetting AI use cases: In some companies, CXOs note that a quick pivot from pilot to production means use cases are not fully vetted before dedicated teams begin implementation. As a result, organizations may learn the wrong lessons from well-intentioned efforts and projects may fail prematurely.
  • Organizational challenges: Adopting AI goes beyond specific tools and technologies, and so do the challenges. Other challenges involve human aspects such as collaboration, organizational behavior, and organizational culture. Like any disruptive technology, AI has the potential to extend the limits of, or completely replace, proven, well-tested, long-standing processes. Deploying AI requires a high degree of collaboration between teams with data to discuss and interpret so people can challenge assumptions and course-correct appropriately.
  • Latency, cost overruns, inadequate risk management, and more: This broad topic spans multiple areas, including GenAI observability, governance, and risk. Tech teams should be able to answer questions such as:
    • What specific prompt did the user enter?
    • Which application was called?
    • What data or content resources did the AI utilize?
    • What is the output delivered to the user in response to the prompt?
    • How many input and/or output tokens were consumed?
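As a minimal sketch, the answers to these questions can be captured as one structured record per GenAI request. The field names below are assumptions for illustration, not a standard schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class GenAIRequestRecord:
    """One observability record per GenAI request (hypothetical schema)."""
    prompt: str                      # what the user entered
    application: str                 # which application was called
    retrieved_sources: list = field(default_factory=list)  # data/content the AI used
    output: str = ""                 # response delivered to the user
    input_tokens: int = 0            # tokens consumed by the prompt/context
    output_tokens: int = 0           # tokens consumed by the completion

    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

# Example: log a record for a single request.
record = GenAIRequestRecord(
    prompt="Summarize Q3 incident reports",
    application="support-copilot",
    retrieved_sources=["incident-db", "runbook-wiki"],
    input_tokens=1250,
    output_tokens=340,
)
print(asdict(record))         # structured, queryable fields
print(record.total_tokens())  # 1590
```

Emitting one such record per request gives teams a queryable basis for the performance, cost, and quality analyses discussed below.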

Answering these questions can help organizations avoid common pitfalls with disruptive technologies. To maximize ROI opportunities, leaders must take a holistic approach to these challenges. GenAI observability spans three key pillars, as explained below.

Three Pillars of GenAI Observability

Observability is the ability to understand the internal state of a system from the data it generates externally. Given the complexity of GenAI and the wealth of operational data it produces, IT teams are well positioned to detect these challenges before they impact the business or users.

1. Performance: From latency to operational resonance

With the advent of GenAI, performance monitoring and management must extend beyond traditional measures of uptime and throughput. Teams need to understand where time is being spent. When a performance issue occurs in a GenAI system, you need more precise information to determine whether it is caused by model inference, vector retrieval, or somewhere in the orchestration layer. Even small delays in AI at scale can lead to intolerable user experiences and increased costs.
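A minimal sketch of this kind of stage-level breakdown, using only the standard library (the stage names and sleep durations are placeholders; a production system would typically use a distributed-tracing SDK such as OpenTelemetry):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record wall-clock time spent in one stage of the request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Simulated GenAI request: each stage sleeps in place of real work.
with stage("vector_retrieval"):
    time.sleep(0.02)
with stage("model_inference"):
    time.sleep(0.05)
with stage("orchestration"):
    time.sleep(0.01)

slowest = max(timings, key=timings.get)
print(f"slowest stage: {slowest}")  # points the team at the real bottleneck
```

Per-stage timings like these let a team see immediately whether latency is dominated by inference, retrieval, or orchestration, rather than guessing from an end-to-end number.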

Conversely, when observability is lacking, teams may make the mistake of over-provisioning for simple tasks, especially when they are moving quickly between model versions and have little tolerance for performance risk.

2. Cost: Managing the token economy

Organizations will naturally be sensitive to rising costs, especially while the value of AI is not yet established or fully monetized. When the teams responsible for expense management rely on traditional usage reporting, their picture of spend is perpetually out of date, and cost reporting becomes an after-the-fact exercise.

The leadership mandate: Accurately understanding AI costs in near real time requires a robust observability framework that can track input/output tokens, cache hits, and vector database usage. This enables teams to understand usage patterns from a cost perspective, respond faster to changing conditions, and present budget requirements for investment opportunities with greater confidence. This matters most in companies where decisions about AI initiatives are made at the board or executive level while the cost impact is felt at the IT or operational level. Better AI cost tracking can help close that gap.
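As an illustration, per-request cost can be derived directly from the tracked token counts. The rates and cache discount below are invented numbers for the sketch, not any vendor's real pricing:

```python
# Hypothetical rates: dollars per 1,000 tokens (NOT real vendor pricing).
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015
CACHE_DISCOUNT = 0.5   # assume cached input tokens cost half price

def request_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate the cost of one request from observed token counts."""
    uncached = input_tokens - cached_input_tokens
    cost = (uncached / 1000) * PRICE_PER_1K_INPUT
    cost += (cached_input_tokens / 1000) * PRICE_PER_1K_INPUT * CACHE_DISCOUNT
    cost += (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return round(cost, 6)

# 10,000 input tokens (6,000 served from cache) and 2,000 output tokens:
print(request_cost(10_000, 2_000, cached_input_tokens=6_000))  # 0.051
```

Aggregating this per-request figure by team, application, or model version turns raw token telemetry into the near-real-time cost picture described above.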

3. Decision quality: Taming probabilistic outputs

Unlike traditional code, AI is probabilistic. To ensure accuracy and reduce hallucinations and bias, IT teams must move beyond familiar, proven quality and performance management approaches to new AI-centric best practices. These include:

  • Reference tracking: Reduce hallucinations and bias by grounding responses in external data.
  • Call-path tracing: Distinguish “good” from “bad” execution paths to tune agent behavior.
  • Intermediate query tracking: Important not only for optimization and performance profiling, but also from an audit and governance perspective.
  • Feedback loop: Quantify user interactions to determine a confidence score for each AI response.
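The feedback-loop idea can be sketched as a toy scoring function: convert accumulated thumbs-up/down signals into a confidence score per response pattern. The smoothing constants are assumptions chosen for illustration:

```python
def confidence_score(upvotes, downvotes, prior=0.5, prior_weight=4):
    """Smoothed confidence in an AI response pattern from user feedback.

    Uses additive (Laplace-style) smoothing so that a single early vote
    does not swing the score all the way to 0.0 or 1.0.
    """
    total = upvotes + downvotes
    return (upvotes + prior * prior_weight) / (total + prior_weight)

print(confidence_score(0, 0))   # no feedback yet -> 0.5 (the prior)
print(confidence_score(9, 1))   # mostly positive -> high confidence
print(confidence_score(1, 9))   # mostly negative -> flag for review
```

Low-scoring response patterns can then be routed to human review or excluded from automation, closing the loop between user feedback and decision quality.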

AI full-stack architecture: A strategic perspective

Gathering data to measure the performance, cost, and quality of AI, especially GenAI, requires IT teams to deeply understand all five layers of a modern AI stack. This may seem like an almost insurmountable requirement, but with modern AIOps and observability technologies, IT operations teams can readily gather the data they need.

In other words, the challenge facing IT teams is to stitch together the full range of AI-related data in a way that yields the insight to efficiently manage AI infrastructure against expectations for performance, cost, and decision quality. The table below gives a high-level overview of what to observe at each layer.

| Architecture layer | Strategic focus | What to observe |
| --- | --- | --- |
| Hardware | Foundation and compute | GPU/TPU utilization and network throughput |
| LLM (the brain) | Core intelligence | Latency profiles, version comparisons, token consumption |
| Augmentation | Domain expertise | Vector DB accuracy scores and RAG (retrieval-augmented generation) performance |
| Orchestration | Enforcement and agency | Agent identity, security protocols, and autonomous decision paths |
| Application | User experience | End-user latency and third-party library dependencies |
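One way to put this table to work is to verify that every layer of the stack is actually emitting its expected signals. The layer and signal names below follow the table; the coverage check itself is a hypothetical sketch:

```python
# Expected telemetry per layer, following the table above.
EXPECTED_SIGNALS = {
    "hardware": {"gpu_utilization", "network_throughput"},
    "llm": {"latency_profile", "token_consumption"},
    "augmentation": {"vector_db_accuracy", "rag_performance"},
    "orchestration": {"agent_identity", "decision_paths"},
    "application": {"end_user_latency", "library_dependencies"},
}

def coverage_gaps(received):
    """Return, per layer, the expected signals that are not arriving."""
    return {
        layer: sorted(expected - received.get(layer, set()))
        for layer, expected in EXPECTED_SIGNALS.items()
        if expected - received.get(layer, set())
    }

# Example: orchestration traces are missing autonomous decision paths.
incoming = {
    "hardware": {"gpu_utilization", "network_throughput"},
    "llm": {"latency_profile", "token_consumption"},
    "augmentation": {"vector_db_accuracy", "rag_performance"},
    "orchestration": {"agent_identity"},
    "application": {"end_user_latency", "library_dependencies"},
}
print(coverage_gaps(incoming))  # {'orchestration': ['decision_paths']}
```

A gap report like this makes blind spots visible before they surface as unexplained latency, cost, or quality issues.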

The way forward

In the race for AI success, the companies with the biggest AI models or the biggest investments won’t necessarily win. To successfully navigate the disruptive transition brought about by AI and achieve the desired outcomes, companies must elevate the role of IT operations and better leverage AI-related observability data. That data gives these organizations the information they need to:

  • Make smart decisions that balance risk across performance, cost, and decision quality as they move from pilot to production
  • Avoid learning the wrong lessons from early failures of AI initiatives
  • Bridge the vision set by the board and corporate directors with the IT department’s on-the-ground understanding
  • Gain confidence in realizing the value of their AI initiatives

By looking at AI holistically, addressing the challenges above, and moving AI observability from a technical afterthought to a core business function, leaders can position themselves to accelerate transformation across the enterprise.

Resources:

  1. To learn more about Broadcom’s AIOps and observability, visit broadcom.com/aiops.
  2. Click here to learn more about DX operational observability.

Sources:

1 – Why Software Is Eating the World, Wall Street Journal, 2011
2 – AI: Building for Scale, Accenture, 2019
3 – Technology Trends 2026, Deloitte, 2026


