Google Cloud announces Virgo Network to power next-generation AI data centers

AI News


Google Cloud introduced a new AI-era data center networking architecture designed to support the rapidly growing scale and complexity of modern machine learning workloads. The company says traditional network designs are reaching their limits as the size and computational demands of underlying AI models continue to grow exponentially.

According to Google, the next decade of AI will require fundamental changes in physical cloud infrastructure, especially networking. To address this, the company has developed the Virgo Network, a large-scale AI data center fabric that is built on a “campus as a computer” philosophy and forms the core of its AI hypercomputer infrastructure.

Google explained that traditional network architectures struggle to address four key constraints of modern AI workloads. These are large-scale requirements spanning multiple data centers, rapidly increasing bandwidth demands driven by model training, synchronous traffic bursts that tax network buffers, and stringent low-latency requirements for real-time inference.

The company said that “even a single ‘straggler’ node can degrade the performance of the entire cluster,” emphasizing the importance of deterministic and resilient network behavior in AI training environments.

To overcome these challenges, Google is moving away from general-purpose networking to a specialized multi-tier architecture that separates workloads into distinct domains. These include scale-up interconnects for tightly coupled accelerator communication, east-west scale-out fabrics for distributed training across pods, and north-south Jupiter front-end networks for storage and compute access across the data center.
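The three-domain split described above can be sketched as a simple traffic classifier. This is an illustrative model only: the tier names follow the article, but the function, its parameters, and the routing rules are hypothetical assumptions, not Google's actual implementation.

```python
from enum import Enum

class NetworkTier(Enum):
    SCALE_UP = "scale-up interconnect"        # tightly coupled accelerator traffic
    SCALE_OUT = "east-west scale-out fabric"  # distributed training across pods
    FRONT_END = "north-south front-end"       # storage and compute access

def classify_traffic(src_pod: str, dst_pod: str, dst_is_storage: bool) -> NetworkTier:
    """Map a flow to one of the three domains (hypothetical rules)."""
    if dst_is_storage:
        # Storage and general compute access rides the front-end network.
        return NetworkTier.FRONT_END
    if src_pod == dst_pod:
        # Accelerators in the same pod use the scale-up interconnect.
        return NetworkTier.SCALE_UP
    # Cross-pod training traffic goes east-west over the scale-out fabric.
    return NetworkTier.SCALE_OUT

print(classify_traffic("pod-a", "pod-a", False))  # NetworkTier.SCALE_UP
print(classify_traffic("pod-a", "pod-b", False))  # NetworkTier.SCALE_OUT
```

Separating the decision per flow, rather than per cluster, is what lets each tier be upgraded independently, as the next paragraph notes.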

This decoupled structure is designed to support faster innovation cycles while enabling independent upgrades across network layers, reducing bottlenecks and improving overall system resiliency.

At the heart of this architecture is the Virgo Network, a flat, two-layer non-blocking fabric that connects up to 134,000 chips with a reported bisection bandwidth of 47 petabits per second. The system is designed to deliver up to 4x more bandwidth per accelerator compared to previous generations, while reducing latency by approximately 40%.
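The published figures allow a quick back-of-envelope check of what that bisection bandwidth implies per chip. The arithmetic below is an illustration based only on the two numbers reported in the article; it is not a figure Google has stated.

```python
# Back-of-envelope: bisection bandwidth per chip at full scale.
chips = 134_000
bisection_bps = 47e15  # 47 petabits per second

per_chip_gbps = bisection_bps / chips / 1e9
print(f"~{per_chip_gbps:.0f} Gb/s of bisection bandwidth per chip")  # ~351 Gb/s
```

Dividing evenly across all 134,000 chips gives on the order of 350 Gb/s of bisection bandwidth each, which is consistent with the class of per-accelerator links used in large training fabrics.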

Google said this design enables more predictable performance for both training and inference workloads, especially in large-scale distributed AI systems.

The company also emphasized reliability as a core design principle. Given the scale of modern AI clusters, hardware failures are inevitable, making fault isolation and rapid recovery essential. Virgo Network includes independent switching planes to prevent local failures from impacting the entire cluster.

Additionally, Google highlighted advances in observability and automation, including sub-millisecond telemetry, congestion detection, and automatic identification of performance bottlenecks such as “stragglers” and system “hangs.” These features are designed to reduce average recovery time and maximize training efficiency.
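Straggler detection of the kind mentioned above is often done by comparing each node's training-step time against the cluster-wide distribution. The sketch below is a minimal stand-in, not Google's tooling: the function name, the median-based rule, and the 1.5x threshold are all assumptions for illustration.

```python
from statistics import median

def find_stragglers(step_times_ms: dict[str, float], factor: float = 1.5) -> list[str]:
    """Flag nodes whose step time exceeds `factor` times the cluster median.

    Hypothetical heuristic: in synchronous training, the slowest node gates
    every step, so outliers against the median are the natural suspects.
    """
    m = median(step_times_ms.values())
    return sorted(node for node, t in step_times_ms.items() if t > factor * m)

# Example: node-2 takes ~2.4x the typical step time and is flagged.
times = {"node-0": 101.0, "node-1": 99.5, "node-2": 240.0, "node-3": 100.2}
print(find_stragglers(times))  # ['node-2']
```

A real system would work from streaming telemetry rather than a static snapshot, but the core idea, comparing each worker against a robust cluster baseline, is the same.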

Ultimately, Google described the Virgo Network as the foundational layer of its AI hypercomputer strategy, enabling integrated computing across large-scale AI systems. The company said this architecture aims to provide the scale, latency control, and resiliency needed in the new agentic AI era.
