Scaling long-context inference on OCI with WEKA’s Augmented Memory Grid

OCI + WEKA partnership

More and more organizations are choosing to run AI inference in their own environments to protect sensitive data, reduce long-term costs, avoid dependence on third-party APIs, and have more control over model selection, uptime, and operations. But as you move forward, your initial infrastructure choices will determine how much AI you can actually deliver at scale.

Long-context agentic AI workloads share a common bottleneck: unnecessary recomputation. When the system runs out of memory and KV cache entries are evicted, the prefill has to be repeated, which wastes GPU cycles, increases latency, and reduces throughput. Solving that problem at scale is what brings OCI and WEKA together.

In 2025, OCI and WEKA published a joint blog post showing that WEKA’s Augmented Memory Grid on OCI H100 infrastructure delivered nearly 20x faster time to first token (TTFT) at a 128K context length compared to baseline vLLM. At SC25, WEKA announced the commercial availability of Augmented Memory Grid on NeuralMesh through Oracle Cloud Marketplace, with OCI as the exclusive launch partner.

This post is the next chapter. We move from initial validation on OCI bare metal H100 to production-like workload testing to demonstrate what this partnership can do at scale.

What this benchmarking effort seeks to achieve

The first phase of OCI + WEKA testing proved that Augmented Memory Grid works. This phase was about proving what it can do. Specifically, our team focused on:

Validation at scale: We confirm that Augmented Memory Grid sustains its benefits beyond synthetic TTFT tests, in production-like LLM serving at cluster scale (72 GPUs across 9 nodes).
Inference economics in practice: We measure how Augmented Memory Grid changes serving density and throughput once DRAM is no longer sufficient, especially for long-context, cache-dependent workloads on OCI infrastructure.
Reference architecture: We demonstrate that OCI bare metal H100 infrastructure can support a validated, cost-effective LLM serving architecture with Augmented Memory Grid.
Beyond steady state: We test operationally important behaviors such as cache persistence and SLO stability under high concurrency. These are the conditions that reveal how the system behaves when pushed, not just warmed up.

System and workload details

Cluster configuration:

9-node OCI bare metal H100 cluster, 8 GPUs per node – 72 GPUs total
Multiple TP4 instances for MiniMax-M2.5-NVFP4
16x Gen4 NVMe drives per node (3.84 TiB each), pooled into a unified Augmented Memory Grid tier
287 TiB of NVMe capacity available through Augmented Memory Grid, versus up to 8.64 TiB of DRAM available in the baseline
2x 200Gb RDMA NICs per node
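
To make the scale of this configuration concrete, here is a minimal Python sketch that derives a few aggregate numbers from the figures listed above. It rests on two assumptions that the configuration list does not state: that every GPU belongs to a TP4 instance (so the instance count is an upper bound), and that NIC bandwidth is taken at line rate. The list also does not say how the 287 TiB figure relates to raw drive capacity, so the sketch simply reports both.

```python
# Figures taken from the cluster configuration above.
NODES = 9
GPUS_PER_NODE = 8
TP_DEGREE = 4              # tensor-parallel width of each model instance
DRIVES_PER_NODE = 16
DRIVE_TIB = 3.84
USABLE_NVME_TIB = 287.0    # capacity reported as available through the grid
DRAM_TIB = 8.64
NICS_PER_NODE = 2
NIC_GBIT_PER_S = 200

total_gpus = NODES * GPUS_PER_NODE

# Assumption: every GPU is part of a TP4 instance, so this is an upper bound.
max_tp4_instances = total_gpus // TP_DEGREE

# Raw drive capacity before any overhead; the source reports 287 TiB as available.
raw_nvme_tib = NODES * DRIVES_PER_NODE * DRIVE_TIB

# How much larger the NVMe tier is than the DRAM tier.
nvme_to_dram_ratio = USABLE_NVME_TIB / DRAM_TIB

# Assumption: line-rate bandwidth, converted from gigabits to gigabytes per second.
node_rdma_gbyte_per_s = NICS_PER_NODE * NIC_GBIT_PER_S / 8

print(f"GPUs in cluster:        {total_gpus}")
print(f"TP4 instances (max):    {max_tp4_instances}")
print(f"Raw NVMe capacity:      {raw_nvme_tib:.2f} TiB")
print(f"NVMe-to-DRAM ratio:     {nvme_to_dram_ratio:.1f}x")
print(f"RDMA per node (peak):   {node_rdma_gbyte_per_s:.0f} GB/s")
```

The ratio is the number to keep in mind for the rest of the post: the NVMe tier offers roughly 33 times the cache capacity of the DRAM baseline.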

Workload definition:

Each simulated “user” = 100,000 tokens input + 100 tokens response per turn
Tests were configured to maximize potential cache hit rates, which separates the effect of offloading from the effect of recomputation.
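
A 100,000-token context per user is what makes this workload memory-bound. The sketch below uses the standard KV cache sizing formula for multi-head or grouped-query attention (one key vector and one value vector per token, per layer, per KV head). The model dimensions in it are hypothetical placeholders chosen only to show the arithmetic; they are not the published dimensions of MiniMax-M2.5, and the result should not be read as a measurement from this benchmark.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """KV cache size for one sequence under standard attention:
    2 (key + value) x tokens x layers x KV heads x head dim x bytes per value."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Workload figure from the text: 100,000 input tokens + 100 response tokens.
TOKENS_PER_TURN = 100_000 + 100

# HYPOTHETICAL model shape, for illustration only.
LAYERS = 60
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # 16-bit KV cache entries

per_user_bytes = kv_cache_bytes(
    TOKENS_PER_TURN, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE
)
print(f"KV cache per user: {per_user_bytes / 2**30:.1f} GiB")
```

Whatever the real dimensions are, the footprint grows linearly with context length and with the number of users, which is why a fixed DRAM pool fills up quickly under this workload.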

Configurations compared:

Baseline: HBM + DRAM only (standard vLLM serving)
Augmented Memory Grid: HBM + NVMe only
Augmented Memory Grid full stack: HBM + DRAM + NVMe

Results: Three important benchmarks

The results were clear across all aspects tested. Here are three that most directly tell our story.

1. 10x increase in concurrent users

With DRAM only, the number of concurrent users hits a hard limit at about 600. Augmented Memory Grid exceeded 5,000 in unlimited testing.

Augmented Memory Grid changes serving density at the GPU level. More users per GPU means more return on every OCI bare metal investment. The same cluster handles dramatically more demand without adding a single node.

When DRAM becomes saturated, the system does more than refuse additional users; it starts to degrade the experience of the users it already serves. Cache entries are evicted, TTFT rises unpredictably, and SLO attainment drops. Users wait longer, and the variability makes consistent quality of service hard to maintain. Augmented Memory Grid avoids that cliff completely. Offloading the KV cache to NVMe over RDMA preserves cache hits as concurrency grows, keeping TTFT stable and SLOs intact even when load rises far beyond what DRAM can support.

2. 10x higher token throughput

Augmented Memory Grid reached approximately 2 million tokens per second, compared to less than 200,000 tokens per second for DRAM only, a 10x increase in raw output.

In practice, product teams running real-time AI features (search, summarization, or code assist) on OCI that were hitting a throughput ceiling of 200,000 tokens per second now have headroom for roughly 1.8 million more tokens per second. Same infrastructure, same OCI investment, but fundamentally different limits.

3. More than 7x more tokens served

During the same test period, Augmented Memory Grid served 5 billion tokens, compared to 700 million for DRAM only, a 7x increase in total volume. For organizations running agentic workflows on OCI, each session can consume anywhere from 500,000 to millions of tokens, quickly saturating DRAM and silently burning GPU capacity on recomputation. Augmented Memory Grid changes that equation, serving 5 billion tokens in the same footprint that was previously limited to 700 million.

The graph below shows this over the course of a 1-hour test with 2,400 users. At the inflection point where the DRAM cache saturates, baseline response time rises and stays unstable, while Augmented Memory Grid (AMG) remains stable, and the gap in completed requests widens from there to the 7x difference.

Why this helps stabilize SLOs

One of the most important lessons from our testing is that the performance difference between a DRAM-only cache and Augmented Memory Grid is not gradual. It appears at the point where the DRAM cache becomes saturated.

Below that point, the two systems can look similar. Beyond it, DRAM-only configurations hit a wall: more cache misses, more recomputation, lower throughput, and less predictable latency. Augmented Memory Grid continues to scale because the active cache working set can reside in a much larger tier.
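
The abruptness of that wall can be reproduced with a toy model. The sketch below is a simplified illustration, not the eviction logic of vLLM or Augmented Memory Grid: it assumes a least-recently-used (LRU) cache that holds whole sessions, and users who take turns in strict round-robin order. The function name and the capacities used are hypothetical.

```python
from collections import OrderedDict


def hit_rate(num_users: int, capacity: int, rounds: int = 20) -> float:
    """Warm hit rate of an LRU cache holding `capacity` sessions while
    `num_users` sessions take turns in round-robin order."""
    cache = OrderedDict()
    hits = 0
    lookups = 0
    for r in range(rounds):
        for user in range(num_users):
            hit = user in cache
            if hit:
                cache.move_to_end(user)        # mark as most recently used
            else:
                cache[user] = None             # miss: context is rebuilt, then cached
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used session
            if r > 0:                          # ignore the cold first round
                lookups += 1
                hits += hit
    return hits / lookups


# Illustrative capacities: a small tier of 600 sessions and a large one of 5,000.
for users in (500, 600, 601, 2400):
    print(f"{users:>5} users | small tier: {hit_rate(users, 600):.2f}"
          f" | large tier: {hit_rate(users, 5000):.2f}")
```

Under this access pattern the small tier serves every request from cache up to 600 users, then drops to a 0.00 hit rate at 601, because each session is evicted just before its next turn. Real traffic is less regular, so the real drop is less absolute, but the shape is the same: a cliff at saturation rather than a slope.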

This is why Augmented Memory Grid is not just a throughput story. It is also an SLO story.

If the cache misses for a user’s session, the system must rebuild the context. For short prompts, that may be acceptable. For agentic workflows with 100,000-token inputs, multi-turn coding sessions, or long project histories, the rebuild becomes visible to the user as a pause. At scale, these pauses become SLO risks.

Augmented Memory Grid extends the cache layer to improve persistence, mitigating conditions that lead to cache evictions and recomputations. This makes concurrency performance more predictable, especially for workloads where demand spikes, context is large, and users expect interactive responsiveness.

Operational lessons from the field

The test also reinforced several operational realities.

First, production inference is a full-stack problem. This work spanned Kubernetes, GPUDirect Storage (GDS), RDMA, vLLM, OCI bare metal GPUs, and Augmented Memory Grid. Results depend not only on raw device performance, but also on the complete data path between cache, network, runtime, and GPU.

Second, both cache capacity and cache movement matter. Capacity alone is not enough if the system cannot retrieve the KV cache fast enough to avoid GPU stalls.

Third, operational value extends beyond a single request. By decoupling the KV cache from local GPU memory and storing it in a high-performance token warehouse, Augmented Memory Grid allows any host to serve a session with its cache hits intact, reducing the need for strict session stickiness, improving load balancing, and simplifying scaling.
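
The point about cache movement comes down to a simple comparison: fetching a cached context only helps if it is faster than recomputing it. Here is a minimal sketch of that break-even, using the 2x 200 Gb NICs and 100,000-token prompts from this post. The KV bytes per token and the prefill rate are hypothetical placeholders, and line-rate bandwidth is an optimistic upper bound, so the printed times are illustrative rather than measured.

```python
def load_seconds(kv_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to fetch a cached context over the network."""
    return kv_bytes / bandwidth_bytes_per_s


def recompute_seconds(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the same context by re-running prefill."""
    return prompt_tokens / prefill_tokens_per_s


def break_even_bandwidth(kv_bytes_per_token: float,
                         prefill_tokens_per_s: float) -> float:
    """Minimum bandwidth (bytes/s) at which loading beats recomputing."""
    return kv_bytes_per_token * prefill_tokens_per_s


# From the text: 2x 200 Gb RDMA NICs per node and 100,000-token prompts.
NODE_BANDWIDTH = 2 * 200e9 / 8        # bytes/s at line rate (upper bound)
PROMPT_TOKENS = 100_000

# HYPOTHETICAL values, for illustration only.
KV_BYTES_PER_TOKEN = 200_000
PREFILL_TOKENS_PER_S = 10_000.0

kv_bytes = PROMPT_TOKENS * KV_BYTES_PER_TOKEN
t_load = load_seconds(kv_bytes, NODE_BANDWIDTH)
t_recompute = recompute_seconds(PROMPT_TOKENS, PREFILL_TOKENS_PER_S)
min_bandwidth = break_even_bandwidth(KV_BYTES_PER_TOKEN, PREFILL_TOKENS_PER_S)

print(f"Load from cache tier: {t_load:.2f} s")
print(f"Recompute via prefill: {t_recompute:.2f} s")
print(f"Break-even bandwidth: {min_bandwidth / 1e9:.1f} GB/s")
```

If the effective data path delivers less than the break-even bandwidth, extra capacity stops paying off, which is why the full path from NVMe through the network to the GPU has to be tuned together.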

Summary of key statistics

Metric	DRAM baseline	Augmented Memory Grid	Improvement
Maximum concurrent users	~600	5,000+	10x
Requests completed (2,400-user test)	~6,700	47,000+	7x
Tokens served	700M	5B	7x
Token throughput	<200K tokens/sec	~2 million tokens/sec	10x
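
As a quick consistency check on these figures, the following sketch recomputes the ratios from the table and divides tokens served by requests completed. The table values carry qualifiers ("~", "+", "<") that the sketch drops, so every result is approximate, and ratios built from a "+" figure are a floor rather than an exact value.

```python
# (DRAM baseline, Augmented Memory Grid) pairs as reported in the table,
# with the "~", "+", and "<" qualifiers dropped.
reported = {
    "concurrent_users": (600, 5_000),
    "requests_completed": (6_700, 47_000),
    "tokens_served": (700e6, 5e9),
    "tokens_per_second": (200e3, 2e6),
}

ratios = {metric: amg / base for metric, (base, amg) in reported.items()}
for metric, ratio in ratios.items():
    print(f"{metric:<20} {ratio:.1f}x")

# Workload definition from the post: 100,000 input + 100 response tokens per turn.
TOKENS_PER_TURN = 100_000 + 100

tokens_per_request = {
    label: reported["tokens_served"][i] / reported["requests_completed"][i]
    for i, label in enumerate(("DRAM baseline", "Augmented Memory Grid"))
}
for label, value in tokens_per_request.items():
    print(f"{label}: {value:,.0f} tokens per completed request")
```

Both configurations come out within about 7 percent of the 100,100 tokens per turn defined in the workload, so the request counts and token totals in the table are consistent with each other.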

What this means for OCI customers

For OCI customers building long-context inference services, these results demonstrate a validated architecture to get more from the same GPU infrastructure.

The value goes beyond faster time to first token (TTFT). It includes more users per cluster, more tokens served per test window, higher throughput, better cache persistence, and more predictable behavior as concurrency increases. These are the characteristics customers need as they move from experimentation to production-scale AI applications.

This is especially important for agentic AI. Agents rely on memory. They often need to retain project history, previous tool calls, documentation, code, instructions, and intermediate reasoning. Repeatedly evicting and recomputing that context increases infrastructure costs and degrades the user experience. Long-context AI becomes more practical to operate at scale when that context persists and can be reused efficiently.

Conclusion and next steps

OCI and WEKA have moved from initial validation to production-scale proof points for long-context inference.

In the first phase, we showed that Augmented Memory Grid can reduce TTFT by avoiding unnecessary prefills. This phase shows the broader implications: higher concurrency, higher throughput, more tokens served, and more consistent performance when DRAM-only caches reach their limits.

Using WEKA Augmented Memory Grid on NeuralMesh with OCI bare metal H100 infrastructure, customers can extend KV caches beyond local memory, reduce recomputation, and run more inference in the same GPU footprint.

To get started using Augmented Memory Grid with OCI, visit the Oracle Cloud Marketplace solution.

Blog co-authored with WEKA: