Accelerate token production in AI factories using integrated services and real-time AI

In today’s AI factory environment, performance is no longer theoretical. It is economic, competitive, and existential. A 1% decrease in available GPU time can mean millions of tokens lost per hour. A few minutes of congestion can cascade into hours of recovery. Oversubscribed rack-level power can lead to throttling and diminishing tokens per watt, quietly eroding the output of large factories. As AI factories scale to thousands of GPUs running a variety of mission-critical workloads, the costs of unpredictable congestion, power constraints, long-tail latency, and limited visibility rise sharply.

Operations teams and administrators need more than just dashboards. They need flexibility and foresight.

NVIDIA announced NVIDIA Mission Control, a unified software stack for AI factories built on the NVIDIA Reference Architecture that codifies NVIDIA best practices in a single control plane. Mission Control 3.0 adds architectural flexibility, multi-organization isolation, intelligent power orchestration, and predictive AIOps to detect operational anomalies and maximize token generation.

Flexible software that unlocks speed

NVIDIA Mission Control 3.0 delivers new agility through a layered, API-driven architecture built on modular services, improving on previous tightly coupled stacks that required synchronized releases and complex validation across hardware platforms. New components such as Domain Power Services, which provides a management plane for automated network management and power optimization, extend the Mission Control stack by bringing additional modular services into a single control plane.

The combination of open components and modular design enables rapid support for the latest NVIDIA hardware while allowing OEM system providers and independent software vendors (ISVs) to integrate Mission Control capabilities directly into their ecosystems. This gives businesses more flexibility and choice in their own software stacks, making it easier to customize solutions to fit their unique business and technology challenges.

Isolation in a multi-tenant world

One of the technical challenges many organizations face is supporting the separation of multiple organizations within a centralized AI factory. As AI factories evolve from research and experimentation to production-grade, mission-critical environments, shared infrastructure across multiple teams will require strong organizational isolation and secure multi-tenancy.

The enhanced Mission Control control plane transforms the AI factory management stack into a software-defined, virtualized architecture. Mission Control services are decoupled from the physical management nodes and deployed on a Kernel-based Virtual Machine (KVM) platform using automation provided by NVIDIA. Compute racks and management nodes are dedicated to each organization, but network switches are shared and therefore need additional isolation for multi-tenancy. The shared NVIDIA Spectrum-X Ethernet fabric is logically segmented using VXLAN, and NVIDIA Quantum InfiniBand is segmented using partition keys (PKeys).
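The segmentation described above can be pictured as a per-organization mapping of identifiers. The following is a minimal sketch, not Mission Control's actual provisioning logic: `TenantSegment`, `allocate_segments`, and the starting values are hypothetical names and numbers chosen for illustration. It relies only on two well-known facts: a VXLAN network identifier (VNI) is 24 bits wide, and an InfiniBand PKey is 16 bits wide with the top bit marking full membership.

```python
from dataclasses import dataclass

VNI_MAX = 2**24 - 1          # VXLAN network identifiers are 24 bits wide
PKEY_BASE_MAX = 0x7FFE       # 15-bit PKey base; 0x7FFF is the default partition
FULL_MEMBER_BIT = 0x8000     # top bit of a 16-bit PKey marks full membership


@dataclass(frozen=True)
class TenantSegment:
    org: str
    vni: int    # Ethernet (VXLAN) segment for this organization
    pkey: int   # InfiniBand partition key, full-membership form


def allocate_segments(orgs, first_vni=10_000, first_pkey_base=0x0010):
    """Assign each organization a unique VNI and PKey (hypothetical scheme)."""
    segments = {}
    for offset, org in enumerate(sorted(set(orgs))):
        vni = first_vni + offset
        pkey_base = first_pkey_base + offset
        if vni > VNI_MAX or pkey_base > PKEY_BASE_MAX:
            raise ValueError("identifier space exhausted")
        segments[org] = TenantSegment(org, vni, pkey_base | FULL_MEMBER_BIT)
    return segments


segments = allocate_segments(["research", "production", "finance"])
for seg in segments.values():
    print(f"{seg.org}: VNI {seg.vni}, PKey {seg.pkey:#06x}")
```

Because every organization receives its own VNI and PKey, traffic on the shared switches stays within that organization's segment on both fabrics.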

This architecture reduces the physical management infrastructure footprint, establishes hard tenant isolation, and creates a secure foundation for a multi-organization AI factory. Operators can onboard multiple organizations onto shared infrastructure instead of purchasing and operating multiple clusters, which lowers total cost of ownership while giving each organization strong isolation and self-service.

Power: the invisible constraint

Another concern for AI factory token production is the fixed power envelope, which is set by constraints such as utility contracts and regulatory compliance. Although performance improves with each GPU generation, facility power is inherently limited by the existing data center infrastructure and the available grid capacity. The challenge is clear: how can operators increase token output and rack density without exceeding power limits?

Power management in previous versions of Mission Control helped organizations responsibly manage complex power considerations, but it was reactive: jobs were scheduled first, and power policies were applied afterward. This was a big step toward balancing power and performance, but managing power at scale required a more dynamic solution, especially across mixed Slurm and Kubernetes environments. Mission Control 3.0 delivers that evolution.

By incorporating Domain Power Services directly into Mission Control, power becomes a first-class scheduling primitive that helps organizations optimize token generation according to power policies. This power management service enables power-aware workload placement across traditional Slurm workloads and Kubernetes-native workloads orchestrated by NVIDIA Run:ai, which is integrated and included in the Mission Control stack. Domain Power Services also supports MAX-P and MAX-Q profiles for training and inference, and provides rack- and topology-aware reservation steering by leveraging Mission Control and facility building management system integration.
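To make the idea of power as a scheduling primitive concrete, here is a minimal sketch of power-aware placement. It is not the Domain Power Services algorithm, which the article does not describe: `Rack`, `Job`, `place`, and all kilowatt figures are hypothetical. The sketch simply prefers the performance profile (MAX-P) and falls back to the efficiency profile (MAX-Q) when no rack has enough headroom.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    budget_kw: float            # power limit set by the facility operator
    committed_kw: float = 0.0   # power already promised to running jobs

    @property
    def headroom_kw(self) -> float:
        return self.budget_kw - self.committed_kw


@dataclass(frozen=True)
class Job:
    name: str
    max_p_kw: float   # estimated draw under the performance profile
    max_q_kw: float   # estimated draw under the efficiency profile


def place(job, racks):
    """Return (rack name, profile), preferring MAX-P, or None if nothing fits."""
    for profile, demand in (("MAX-P", job.max_p_kw), ("MAX-Q", job.max_q_kw)):
        candidates = [r for r in racks if r.headroom_kw >= demand]
        if candidates:
            # Choose the rack with the most headroom to spread the load.
            rack = max(candidates, key=lambda r: r.headroom_kw)
            rack.committed_kw += demand
            return rack.name, profile
    return None


racks = [Rack("rack-a", budget_kw=120.0, committed_kw=60.0),
         Rack("rack-b", budget_kw=120.0, committed_kw=30.0)]
jobs = [Job("train-1", 80.0, 68.0),
        Job("infer-1", 70.0, 58.0),
        Job("train-2", 80.0, 68.0)]
placements = {job.name: place(job, racks) for job in jobs}
print(placements)
```

In this run, the first job fits at MAX-P, the second fits only at MAX-Q, and the third is left unplaced rather than pushing a rack over its budget, which is the behavior a reactive, schedule-first approach cannot guarantee.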

In one example in which NVIDIA ran the MAX-Q profile, Domain Power Services allowed a data center to run at 85% power with only a 7% throughput loss. This was achieved by dynamically applying the integrated power profiles provided by Mission Control.
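A short calculation shows what those two figures imply for efficiency. The sketch below assumes that "85% power" and "7% throughput loss" are both measured against the same full-power baseline; the variable names are illustrative.

```python
power_fraction = 0.85              # power drawn relative to the baseline
throughput_fraction = 1.0 - 0.07   # throughput retained relative to the baseline

# Throughput per unit of power, relative to the baseline.
relative_tokens_per_watt = throughput_fraction / power_fraction
gain_percent = (relative_tokens_per_watt - 1.0) * 100

print(f"relative tokens per watt: {relative_tokens_per_watt:.3f}")
print(f"efficiency gain: {gain_percent:.1f}%")
```

Under that assumption, tokens per watt improve by roughly 9%. If the facility is power-limited and the freed power can host additional GPUs that scale linearly (a further assumption), aggregate token output under the same power envelope rises by the same ratio.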

This integration lets data center operators define facility constraints and lets AI practitioners confidently choose performance or efficiency modes to match workload priorities. Governance remains centralized, while the added flexibility lets teams tune the AI factory for the best performance per watt and performance per dollar.

From dashboards to real-time decision making

In addition to providing new services for dynamic power management, Mission Control version 3.0 enhances existing anomaly detection capabilities by integrating with NVIDIA AIOps Collector and Platform Stacks (NACPS) for AI-powered predictive anomaly detection. At the core of NACPS is an AI cluster model, a graph-based representation of infrastructure and workloads that creates a topology-aware view across GPUs, NVIDIA NVLink scale-up, NVIDIA Spectrum-X Ethernet or NVIDIA Quantum InfiniBand East-West scale-out, and NVIDIA BlueField DPU North-South networking. This view is combined with the job topology of the cluster model.

NACPS combines unsupervised online machine learning on metrics, natural language processing (NLP)-based log analysis to detect unknown issues, supervised learning trained on labeled incidents, and deterministic rule-based guardrails.
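The article does not say which models NACPS uses, so the following is only a stand-in for the "unsupervised online machine learning on metrics" idea: a streaming z-score detector built on Welford's running mean and variance. The class name, threshold, warm-up length, and temperature samples are all hypothetical.

```python
import math


class OnlineZScore:
    """Streaming mean/variance (Welford's algorithm) with a z-score test."""

    def __init__(self, threshold: float = 4.0, warmup: int = 10):
        self.threshold = threshold
        self.warmup = warmup
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> bool:
        """Score x against history, then fold it in. True means anomalous."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) / std > self.threshold
        if not anomalous:  # keep outliers from contaminating the baseline
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return anomalous


detector = OnlineZScore()
# Hypothetical GPU temperature samples in degrees Celsius.
samples = [61, 62, 60, 61, 63, 62, 61, 60, 62, 61, 62, 61, 88, 62]
flags = [detector.update(s) for s in samples]
print([s for s, f in zip(samples, flags) if f])
```

The detector needs no labeled incidents and updates in constant time per sample, which is why this family of methods suits continuous telemetry. A production system would layer it with the log analysis, supervised models, and rule-based guardrails the article lists.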

Telemetry is continuously streamed to NACPS from GPUs, switches, hosts, network interface cards (NICs), and schedulers. Events and anomalies are automatically correlated across layers to enable contextual root cause analysis while reducing alert noise. The system understands relationships instead of individual metrics.
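One simple way to picture cross-layer correlation is to walk a topology graph upward from each alerting component and look for the lowest element they share. This is a sketch under stated assumptions, not the NACPS correlation method: the `PARENT` map, component names, and the lowest-common-ancestor heuristic are all illustrative.

```python
from collections import Counter

# Hypothetical topology: each component points to its upstream parent.
PARENT = {
    "gpu-0": "node-1", "gpu-1": "node-1",
    "gpu-2": "node-2", "gpu-3": "node-2",
    "gpu-4": "node-3",
    "node-1": "leaf-sw-1", "node-2": "leaf-sw-1",
    "node-3": "leaf-sw-2",
}


def ancestors(component):
    """Return the component followed by everything upstream of it."""
    chain = [component]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def probable_root(alerting):
    """Pick the lowest component that sits on the path of every alert."""
    if not alerting:
        return None
    counts = Counter(c for a in alerting for c in ancestors(a))
    shared = [c for c in ancestors(alerting[0]) if counts[c] == len(alerting)]
    return shared[0] if shared else None


print(probable_root(["gpu-0", "gpu-1"]))   # alerts confined to one node
print(probable_root(["gpu-0", "gpu-3"]))   # alerts spanning two nodes
print(probable_root(["gpu-0", "gpu-4"]))   # alerts with nothing in common
```

Collapsing several GPU-level alerts into a single suspect node or switch is one way a topology-aware view can cut alert noise while pointing toward a root cause.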

When an anomaly is detected, Mission Control can trigger automated remediation workflows through automatic hardware recovery, which works in conjunction with NVIDIA Base Command Manager and Slurm, or with NVIDIA Run:ai for Kubernetes workloads.

This system does more than just monitor your infrastructure. It understands it and acts on it.

Operators no longer need to track symptoms. They gain foresight.

A different kind of KPI: utilization and token production

As AI factory operations continue to evolve, operations teams need to consider different types of KPIs. While traditional data centers were optimized for utilization, AI factories need to be optimized for token production.

To optimize an AI factory for token production, companies should consider metrics such as token production per GPU and per rack, and token production per watt and per megawatt. Any inefficiency directly reduces overall token output. If congestion in the network fabric is not detected and mitigated, if a single rack unexpectedly exceeds its power constraints, or if a compute node experiences an anomaly during a job, the AI factory loses token generation and potential revenue.
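These metrics are straightforward to derive from a measurement window. The sketch below is illustrative only: `token_kpis`, its parameters, and the input figures are hypothetical and do not come from any real deployment or from a Mission Control API.

```python
def token_kpis(tokens, seconds, gpus, racks, avg_power_kw):
    """Derive per-GPU, per-rack, and per-power token rates for one window."""
    tokens_per_second = tokens / seconds
    energy_kwh = avg_power_kw * seconds / 3600
    return {
        "tokens_per_s_per_gpu": tokens_per_second / gpus,
        "tokens_per_s_per_rack": tokens_per_second / racks,
        "tokens_per_s_per_watt": tokens_per_second / (avg_power_kw * 1000),
        "tokens_per_s_per_megawatt": tokens_per_second / (avg_power_kw / 1000),
        "tokens_per_kwh": tokens / energy_kwh,
    }


# Illustrative figures only, not measurements from any real system.
kpis = token_kpis(tokens=1_440_000_000, seconds=3600, gpus=64, racks=4,
                  avg_power_kw=80.0)
for name, value in kpis.items():
    print(f"{name}: {value:,.1f}")
```

Tracking these values over time makes a drop in output visible in the same units as the business cares about, whether the cause is fabric congestion, a power cap, or a failing node.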

However, if the AI factory is operating intelligently, every megawatt can be accurately converted into tokens to maximize production.

Get started with Mission Control

Mission Control 3.0 is designed to minimize inefficiencies and increase token production for AI factory operators. Transform your infrastructure from a passive platform into an active participant in performance optimization by correlating telemetry across domains, intelligently adjusting power, modularizing your architecture for agility, and powering autonomous remediation with AI.

Resources

Stay tuned for the latest release notes and implementation guides for NVIDIA Mission Control 3.0.

You can also check out an on-demand replay of the NVIDIA GTC 2026 session with Eli Lilly & Company to hear first-hand insights on building and deploying high-performance AI infrastructure with powerful, intelligent software.


