New Google TPUs double the efficiency of your AI infrastructure

LAS VEGAS — The Google TPU refresh announced this week lays the foundation for cost and power efficiency improvements for cloud providers’ AI infrastructure.

Google introduced two separate Tensor Processing Unit (TPU) chips with its eighth-generation refresh. This marks the first time the product line has been split since its 2015 launch. The TPU 8t is designed for training AI models and features 9,600 chips per pod, with 2x the memory bandwidth and 4x the network bandwidth per chip compared to previous-generation TPUs. The TPU 8t also includes 2 PB of shared high-bandwidth memory per pod.

In addition to computing power, the TPU 8t is designed with increased reliability in mind, said Amin Vahdat, Google’s senior vice president and chief technologist for AI and infrastructure.

“When we talk about our systems… we don’t just have 9,600 chips working on a problem; we often have tens of thousands, dare I say more, of them working together literally on the nanosecond scale,” Vahdat said at a media event at the Google Cloud Next conference this week. “What this means is that if any chip fails, the computation stops.”

The new TPU 8t system targets not only high throughput, but also consistent “goodput” of 97% or higher, a measure of the share of computing time spent on useful, productive work. Google achieved this by improving the way the system automatically detects failed inter-chip interconnect (ICI) links and reroutes around them without interrupting jobs, and by reconfiguring hardware around failures without human intervention.
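To put the 97% figure in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes goodput is simply productive time divided by total allocated time, which may not match how Google measures it, and the function names `goodput` and `max_lost_hours` are hypothetical.

```python
# Illustrative only: assumes goodput = productive time / total allocated time.
# The function names here are hypothetical and not part of any Google API.

def goodput(productive_hours: float, total_hours: float) -> float:
    """Return the fraction of allocated time spent on useful work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= productive_hours <= total_hours:
        raise ValueError("productive_hours must be between 0 and total_hours")
    return productive_hours / total_hours


def max_lost_hours(total_hours: float, target: float = 0.97) -> float:
    """Return how many hours can be lost while still meeting the target."""
    return total_hours * (1.0 - target)


if __name__ == "__main__":
    week = 7 * 24  # hours in a one-week training run
    print(f"Lost-time budget at 97%: {max_lost_hours(week):.2f} hours")
    print(f"Goodput with 163 productive hours: {goodput(163, week):.3f}")
```

Under that assumption, a one-week job can lose only about five hours to failures, restarts and recovery before it falls below the 97% target.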

The TPU 8i is designed to support AI inference, using a new custom ICI layer called Boardfly to break through the AI “memory wall,” where computational demands outpace the speed and capacity of chip memory, a long-standing problem in AI infrastructure. The new ICI in TPU 8i doubles the memory bandwidth of the previous generation and reduces the distance between chips in the pod. This allows them to operate as an integrated unit with the low latency required by the mixture-of-experts models used for inference.

The default method of connecting chips did not favor latency, Vahdat said. “Previous generations of TPUs supported throughput and bandwidth. They were very good at passing large amounts of data,” he said. “But in the age of agents, what we really care about is latency, or the minimum amount of time it takes to retrieve data.”

Larry Carvalho, principal consultant at Robust Cloud, said breaking the “memory wall” could be a major competitive shift for Google in AI chips.

“Memory is in short supply, but vendors who optimize memory can deliver AI at scale without having to deal with supply chain issues,” Carvalho said. “With the rise of AI computing for inference, this could be a big differentiator for Google.”