Each wave of AI has created new scaling laws. Pre-training scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems. Post-training expanded utility through instruction tuning and GPU rebalancing for generative inference. Test-time scaling improved reasoning by giving the model more generated tokens to think with.
Now, agentic AI and reinforcement learning scale action. The model takes more steps, calls more tools, runs more evaluations, and interacts with execution environments to complete tasks.
This post explains how NVIDIA Vera CPUs help AI factories scale agentic AI and reinforcement learning by reducing CPU execution time, increasing task throughput, and raising overall AI factory output, enabling agents to think smarter and longer.


Why the CPU becomes more important in the agentic era
GPUs remain essential for model inference and training. However, across agentic AI, reinforcement learning, and data-intensive AI services, much of the execution around the model happens on the CPU, including:
- Running sandboxed code and tools
- Data retrieval and data processing
- Calculating the result
- Scheduling and orchestration
A concrete turn of the loop looks like this:
- A prompt initiates generation. It can come from the user, from reasoning tokens, or from the result of a previous turn. Example: “hello.c must be compiled and run.”
- The GPU generates the parameters for a tool call that runs on the CPU. Example: gcc -o hello hello.c ; ./hello
- The CPU executes the tool call and produces results that are either fed back to the GPU to update weights during reinforcement learning or used by the agent to generate the next prompt. Example: Output: ‘Hello, world!’, task returned (0), success.
- The GPU generates reasoning tokens based on the results. Example: “Hmm, looks like it worked!”
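The loop above can be sketched in a few lines of Python. This is a minimal illustration, not NVIDIA's implementation: fake_model_step is a hypothetical stand-in for GPU generation, and the tool call runs a Python one-liner rather than the gcc command from the example so that the sketch works without a C compiler.

```python
import subprocess
import sys
import time


def fake_model_step(prompt):
    """Hypothetical stand-in for GPU generation: returns a tool call or None."""
    if prompt.startswith("Output:"):
        return None  # the model has seen a tool result and stops
    # The post's example tool call is `gcc -o hello hello.c ; ./hello`.
    # This sketch substitutes a Python one-liner to stay self-contained.
    return [sys.executable, "-c", "print('Hello, world!')"]


def run_tool(argv, timeout_s=30.0):
    """CPU-side step: execute the tool call and time it."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    elapsed = time.perf_counter() - start
    return proc.returncode, proc.stdout.strip(), elapsed


def agent_loop(prompt, max_turns=4):
    """Alternate model steps and tool calls until the model stops."""
    tool_seconds = 0.0
    results = []
    for _ in range(max_turns):
        call = fake_model_step(prompt)
        if call is None:
            break
        code, out, elapsed = run_tool(call)
        tool_seconds += elapsed
        results.append((code, out))
        # The tool result becomes the prompt for the next turn.
        prompt = f"Output: {out!r} (exit code {code})"
    return results, tool_seconds
```

Calling agent_loop("hello.c must be compiled and run.") returns the tool results together with the wall-clock time spent in tool execution, which is the portion of the request that lands on the CPU.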
As agents become more capable, they take more steps, call more tools, and perform more checks, so total CPU time across the request increases.
This puts the CPU on the critical path. It is no longer just a host processor that feeds the GPU. It determines latency, accelerator utilization, and AI factory output per watt and per dollar.
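A back-of-the-envelope model makes the critical-path argument concrete. The sketch below assumes a strictly serial request in which every step is GPU generation followed by CPU tool execution. The step count and per-step times are hypothetical, chosen only for illustration, and real systems overlap many requests.

```python
def request_profile(steps, gpu_s_per_step, cpu_s_per_step):
    """Serial model: each step is GPU generation followed by CPU tool execution."""
    gpu_s = steps * gpu_s_per_step
    cpu_s = steps * cpu_s_per_step
    latency_s = gpu_s + cpu_s
    return {
        "latency_s": latency_s,
        "cpu_share": cpu_s / latency_s,          # fraction of the request spent on the CPU
        "accelerator_busy": gpu_s / latency_s,   # fraction spent generating tokens
    }


# Hypothetical numbers: 10 steps, 0.5 s of generation per step.
baseline = request_profile(steps=10, gpu_s_per_step=0.5, cpu_s_per_step=0.5)
faster_cpu = request_profile(steps=10, gpu_s_per_step=0.5, cpu_s_per_step=0.25)
```

Under these assumptions, halving CPU time per step cuts request latency from 10 s to 7.5 s and raises the share of the request spent on the accelerator from one half to two thirds.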
Over the past decade, much of the data center CPU market has been optimized around cloud economics: more cores, more virtual machines, and lower cost per core. While this remains important for general-purpose cloud services, per-core performance has not increased at the same rate.
The end of Moore’s Law has made this worse, limiting the performance gains of each CPU generation, even as GPU architectures and workloads have benefited from continuous cycles of co-optimization.
AI factories shift the metric from cores per dollar to tokens per dollar: from the number of CPU cores a data center can rent out to the amount of AI output it can generate.
This requires a new CPU design point for AI factories:
- High core count, to run thousands of agents, RL environments, sandboxes, and services simultaneously.
- High performance per core, because each agent step is gated by sequential execution.
- Energy-efficient memory bandwidth, to keep data moving without bottlenecking the CPU infrastructure.
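The first two requirements can be seen in a small concurrency sketch: many independent tasks run side by side, yet each one is still a sequential program whose duration depends on a single core. The names run_task and run_many are hypothetical, and a plain subprocess stands in for a real sandbox; it provides no isolation.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    """Run one hypothetical agent task as a separate OS process."""
    proc = subprocess.run(
        [sys.executable, "-c", f"print({task_id} * {task_id})"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return task_id, proc.returncode, proc.stdout.strip()


def run_many(task_ids, max_workers=None):
    """Run tasks concurrently, with at most one in-flight task per available core."""
    workers = max_workers or os.cpu_count() or 1
    # Threads only wait on child processes, so the real work runs in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, task_ids))
```

More cores raise the number of tasks in flight, while faster cores shorten each individual task. The two are complementary rather than interchangeable.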


NVIDIA Vera CPU: Built for AI agents
NVIDIA Vera CPUs are designed for the realities of modern workloads, with fast per-core performance, high concurrency, and power-efficient memory bandwidth to keep your AI factory running.
Vera CPUs combine 88 NVIDIA Olympus cores with up to 1.2 TB/s of LPDDR5X memory bandwidth to keep the cores fed through tool invocations, sandbox execution of both native code and languages like Python and JavaScript, data retrieval, data processing, and orchestration.
The key requirement is sustained per-core performance. Unlike a CPU hosting cloud virtual machines, a CPU socket in an AI factory stays fully loaded, doing the work of many concurrent agents. When core speed holds under high system load, tasks complete sooner, which returns results faster and frees resources to serve the next request.
For agents, this means lower latency across multi-step requests. For reinforcement learning, this means more completed evaluations and more data from each training window, helping the model reach its quality targets faster. For AI factories, fast cores mean accelerators spend less time waiting for orchestration, tool execution, or data movement.
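The reinforcement learning point reduces to simple arithmetic. The sketch below assumes each worker runs evaluations back to back and counts only those that finish inside the window. The window length and per-evaluation times are hypothetical, and the worker count of 88 assumes one worker per Olympus core.

```python
def completed_evaluations(window_s, eval_s, workers):
    """Evaluations that finish within the window, with each worker running them back to back."""
    return int(window_s // eval_s) * workers


# Hypothetical 60 s training window with one worker per core.
slower = completed_evaluations(window_s=60, eval_s=2.0, workers=88)
faster = completed_evaluations(window_s=60, eval_s=1.5, workers=88)
```

In this model, shortening each evaluation from 2.0 s to 1.5 s raises completed evaluations per window from 2,640 to 3,520 without adding a single core.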
To achieve this, cores, memory subsystems, and fabrics must be designed together for branch-heavy code, high-bandwidth data movement, and predictable performance under load.
It starts with the custom NVIDIA Olympus cores inside the Vera CPU.


NVIDIA Olympus Core and Memory Subsystem
NVIDIA Olympus cores deliver up to 50% higher instructions per cycle (IPC) than NVIDIA Grace. They combine a wide front end, advanced branch prediction, deep out-of-order instruction scheduling, and specialized memory prefetching to maintain high throughput in branch-heavy and memory-bound agent code.
Olympus uses neural branch prediction to reduce stalls in branch-heavy code. When combined with other prediction mechanisms, it can maintain two branches per cycle without penalty, preserving throughput for deep software stacks such as PyTorch, graph workloads, and scripting engines.
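To make "branch-heavy" concrete, here is a toy stack-machine interpreter of the kind that sits at the heart of a scripting engine. It is purely illustrative and says nothing about how Olympus predicts branches. The opcode names are invented for this sketch. The point is that the branch taken on each iteration depends on the program data, not on a fixed pattern.

```python
def run_bytecode(program):
    """Interpret a tiny stack-machine program given as (opcode, operand) pairs."""
    stack = []
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        # Dispatch: which branch runs depends on the data in `program`.
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "jump_if_zero":
            if stack.pop() == 0:
                pc = arg  # data-dependent control flow
                continue
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return stack


arithmetic = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]
```

Running run_bytecode(arithmetic) evaluates (2 + 3) * 4. Production interpreters execute this kind of dispatch loop many times per script, which is why prediction quality matters for scripting workloads.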
Olympus also includes a 10-wide decode unit and a deep out-of-order engine designed to maintain high instructions per cycle. Large buffers and advanced instruction scheduling help the core keep moving forward as code paths, dependencies, and memory access patterns change.
To maintain high IPC under load, the cores must be continuously fed with data. Vera CPUs deliver up to 1.2 TB/s of LPDDR5X memory bandwidth and sustain over 90% of peak memory bandwidth under load. They also have 40% lower peak memory latency than x86 CPUs, so Olympus cores are fed on time through data retrieval, analysis, sandbox execution, and orchestration.
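The figures quoted in this post translate into a per-core bandwidth budget. The sketch below uses only numbers stated here (1.2 TB/s, 88 cores, over 90% sustained). It assumes decimal units (1 TB = 1000 GB) and an even split across cores, which is a simplification.

```python
PEAK_BW_TBPS = 1.2         # from the post: up to 1.2 TB/s of LPDDR5X bandwidth
CORES = 88                 # from the post: 88 Olympus cores
SUSTAINED_FRACTION = 0.90  # from the post: "over 90%", used here as a lower bound


def bandwidth_per_core_gbps(total_tbps, cores):
    """Average bandwidth per core in GB/s, assuming an even split."""
    return total_tbps * 1000 / cores


peak_per_core = bandwidth_per_core_gbps(PEAK_BW_TBPS, CORES)
sustained_tbps = PEAK_BW_TBPS * SUSTAINED_FRACTION
sustained_per_core = bandwidth_per_core_gbps(sustained_tbps, CORES)
```

That works out to roughly 13.6 GB/s per core at peak and at least about 12.3 GB/s per core sustained under load.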
Olympus also adds a new graph prefetcher built for the indirect memory access patterns common in graph analytics and agent memory traversal. Combined with high memory bandwidth per core, Vera CPUs deliver over 3x the performance of x86-based architectures on graph traversal workloads.
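An indirect memory access pattern is one where the address of the next load comes from the value of a previous load. Breadth-first search over a graph stored in compressed sparse row (CSR) form is a standard example. The sketch below is a generic illustration with a made-up four-node graph, not a description of how the prefetcher works.

```python
from collections import deque


def bfs_order(row_ptr, col_idx, source):
    """Breadth-first traversal over a graph stored in CSR form."""
    visited = [False] * (len(row_ptr) - 1)
    visited[source] = True
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        # Indirect access: row_ptr[v] says where to look in col_idx,
        # and each col_idx entry is itself an index into `visited`.
        for e in range(row_ptr[v], row_ptr[v + 1]):
            w = col_idx[e]
            if not visited[w]:
                visited[w] = True
                queue.append(w)
    return order


# Made-up graph with edges 0->1, 0->2, 1->3, 2->3.
row_ptr = [0, 2, 3, 4, 4]
col_idx = [1, 2, 3, 3]
```

Because the neighbor indices are data, a simple stride-based prefetcher cannot guess which entries of visited will be touched next. That is the gap a prefetcher aware of this pattern aims to close.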
NVIDIA Scalable Coherency Fabric (SCF) connects all cores and the unified cache across a monolithic mesh, delivering predictable latency and 50% faster data movement between cores compared to CPUs that fragment compute across the die. For reinforcement learning and agentic AI, that predictability helps keep the evaluation loop moving even under full load.
As shown in Figure 4, the combination of Olympus cores, NVIDIA SCF, and the LPDDR5X memory subsystem enables Vera CPUs to deliver over 1.8x higher sandbox performance than competing products across agent workloads at full load.


System efficiency
Agentic AI puts increasing pressure not only on performance but also on infrastructure efficiency. As AI factories scale to thousands of CPUs, memory power can become a significant contributor to platform power, cooling demands, and operating costs.
The Vera CPU pairs its architecture with high-bandwidth SOCAMM LPDDR5X memory to reduce memory power compared to traditional DDR server designs. LPDDR5X subsystems typically consume less than 30 watts, while DDR5 configurations consume well over 100 watts. MRDIMM-based systems can draw even more memory power.
With a configurable TDP range of 250W to 450W, Vera CPUs reduce total CPU and memory subsystem power while providing the bandwidth needed for agent inference and reinforcement learning environments. For AI factories, this translates into higher performance per watt, lower operating costs, and more efficient use of power and cooling infrastructure.
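The memory power figures above scale quickly across a fleet. The sketch below treats the quoted numbers as bounds (less than 30 W for LPDDR5X, well over 100 W for DDR5) and assumes they apply per CPU socket, which this post does not state explicitly. The fleet size of 1,000 CPUs is hypothetical.

```python
LPDDR5X_MEMORY_W = 30  # from the post: typically less than 30 W (upper bound)
DDR5_MEMORY_W = 100    # from the post: well over 100 W (lower bound)


def memory_power_gap_kw(num_cpus, ddr5_w=DDR5_MEMORY_W, lpddr5x_w=LPDDR5X_MEMORY_W):
    """Lower-bound estimate of fleet-wide memory power difference, in kW."""
    return num_cpus * (ddr5_w - lpddr5x_w) / 1000


gap_kw = memory_power_gap_kw(num_cpus=1000)
```

Under these assumptions, a fleet of 1,000 CPUs would draw at least 70 kW less memory power with LPDDR5X, before counting any cooling overhead.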


AI factory CPU for agents
The era of agentic AI requires a shift in CPU design from maximizing cores per dollar to maximizing AI factory output per watt and per dollar. NVIDIA Vera CPUs are built for agents, combining fast per-core performance, high concurrency, and power-efficient memory bandwidth. With custom Olympus cores, LPDDR5X memory, and NVIDIA Scalable Coherency Fabric, Vera CPUs deliver over 1.8x higher agent sandbox performance than traditional x86 architectures, helping AI factories run more tool calls, return more evaluations, and keep accelerators busy.
Learn more about Vera CPU, NVIDIA Vera Rubin NVL2, and Vera CPU benchmarks from Phoronix.
Relative performance is based on measured data and is subject to change. Performance of the NVIDIA Vera CPU with LPDDR5X is compared against the latest x86 CPUs.
