LAS VEGAS — The Google TPU refresh announced this week lays the foundation for cost and power efficiency improvements for cloud providers’ AI infrastructure.
Google introduced two separate Tensor Processing Unit (TPU) chips with its 8th generation refresh. This marks the first time the product line has been split since its 2015 launch. The TPU 8t is designed for training AI models and features 9,600 chips per pod with 2x the memory bandwidth and 4x the network bandwidth per chip compared to previous generation TPUs. Google TPU 8t also includes 2 PB of shared high-bandwidth memory per pod.
In addition to computing power, the TPU 8t is designed with increased reliability in mind, said Amin Vahdat, Google’s senior vice president and chief technologist for AI and infrastructure.
“When we talk about our systems… we don’t just have 9,600 chips working on a problem; we often have tens of thousands, dare I say more, of them working together literally on the nanosecond scale,” Vahdat said at a media event at the Google Cloud Next conference this week. “What this means is that if any chip fails, the computation stops.”
The new TPU 8t system targets not only high throughput but also consistent “goodput,” a measure of useful, productive computing time, of 97% or higher. Google achieved this by improving how the system automatically detects failed inter-chip interconnect (ICI) links and reroutes around them without interrupting jobs, and how it reconfigures hardware around failures without human intervention.
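As a rough illustration of what a 97% goodput target implies, the sketch below treats goodput as productive time divided by total scheduled time. The function names, the single "lost hours" bucket and the sample figures are assumptions made for illustration; they are not Google's definition or measurement method.

```python
def goodput(total_hours: float, lost_hours: float) -> float:
    """Return the fraction of scheduled time spent on useful work.

    `lost_hours` is a hypothetical catch-all for time lost to failures,
    restarts and recomputation.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= lost_hours <= total_hours:
        raise ValueError("lost_hours must be between 0 and total_hours")
    return (total_hours - lost_hours) / total_hours


def max_lost_hours(total_hours: float, target: float = 0.97) -> float:
    """Return how many hours can be lost while still meeting `target`."""
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return total_hours * (1 - target)


if __name__ == "__main__":
    month = 30 * 24  # a 30-day training run, in hours (illustrative)
    print(f"goodput with 18 h lost: {goodput(month, 18):.1%}")
    print(f"budget at 97% target: {max_lost_hours(month):.1f} h")
```

Under these assumptions, a 30-day job can lose less than one day in total to failures and recovery before it falls below the 97% mark.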
The TPU 8i is designed to support AI inference. It uses a new custom ICI layer called Boardfly to break through the AI “memory wall,” a long-standing problem in AI infrastructure in which computational demands outpace the speed and capacity of chip memory. The new ICI in TPU 8i doubles the memory bandwidth of the previous generation and reduces the distance between chips in the pod. This allows the chips to operate as an integrated unit with the low latency required by the mixture-of-experts models used for inference.
“The default method of connecting [chips] didn’t support latency. Previous generations of TPUs supported throughput and bandwidth. They were very good at passing large amounts of data,” Vahdat said. “But in the age of agents, what we really care about is latency, or the minimum amount of time it takes to retrieve data.”
Larry Carvalho, principal consultant at Robust Cloud, said breaking the “memory wall” could be a major competitive shift for Google in AI chips.
“Memory is in short supply, but vendors who optimize memory can deliver AI at scale without having to deal with supply chain issues,” Carvalho said. “With the rise of AI computing for inference, this could be a big differentiator for Google.”
Comparison of Nvidia GPU and Google TPU
Google officials described this week’s TPU update in many of the same terms Nvidia used when it announced its Vera Rubin system for AI inference in January, such as optimizing inference separately from model training. But Gartner analyst Chirag Dekate said the two systems are designed for different forms of performance optimization.
“They’re actually operating in two different trade-off areas,” Dekate said. “Nvidia is designing things that can be deployed across broader domains, broader ecosystems, such as neoclouds and hyperscalers. Google’s TPUs are [to be] mainly managed and provided by Google or experts who understand TPU architecture.”
Dekate said Nvidia needs to balance general-purpose GPU and CPU systems to handle a wide range of potential workloads, but Google TPUs, which started out as application-specific integrated circuits (ASICs), are much more specialized for specific calculations performed during AI training and inference.
“Nvidia GTC was focused on creating token factories, not necessarily AI factories,” he said. “ASICs always [perform] better than any general-purpose architecture. That is the reality.”
Specifically, “Nvidia is following a scale-up philosophy with NVLink 6, which is designed to provide maximum flexibility and ultra-low latency within a single rack environment,” said Ron Westfall, analyst at HyperFrame Research. “Multiple racks can be linked together via InfiniBand to achieve petabytes of total memory, but that data must pass through traditional network protocols, introducing unavoidable delays.
“In contrast, Google’s single-machine philosophy allows 9,600 TPUs to function as a unified entity within a single global address space,” Westfall said. “Because this interconnect is integrated directly into the silicon, Google can pool 2 PB of memory into a single superpod, avoid the performance bottlenecks typically associated with standard data center networking [and] operate with a level of cohesion that traditional clusters cannot achieve.”
What’s the result for enterprise IT buyers? Because most enterprises access AI chips through cloud providers rather than running them in-house, the new Google TPUs will make setting up AI infrastructure services much more power- and cost-efficient, Dekate said.
“Energy is constrained, especially in the United States and Europe,” he said. “The market discussion will move from the amount of tokens you generate to the usefulness of the tokens and intelligence per dollar and intelligence per watt. So it’s actually much more about power efficiency, cost efficiency and the value you generate per token.”
Still, Google isn’t the only vendor stepping up competition in AI chips. AWS also announced this week that it has signed a 5-gigawatt data center agreement with Anthropic to train and deploy Claude models on AWS Trainium chips.
“Google TPU is primarily for Google, with some use by Anthropic,” Carvalho said. “Meanwhile, Amazon Trainium powers Anthropic’s workloads across the data centers built on it. This is a win-win for both Amazon and Anthropic.”
Google executives predict CPU revival
Google Cloud this week committed to supporting Nvidia Vera Rubin systems alongside its TPUs, and added support for its latest Axion custom Arm CPUs, which launched in January. The company claims the Axion CPUs deliver 100% better price-performance than general-purpose x86 CPUs. TPU 8i systems also support Axion CPUs.
“There’s a lot of general-purpose computing involved in running an AI agent,” Vahdat said. “They’re creating sandboxes and virtual machines to build code, run it, see the results, and find the next set of outputs. So general-purpose computing will come back.”
At the same time, what Vahdat called the “age of specialization” will continue for AI infrastructure.
“We’re going to find additional workloads that might require our own chips,” he said. “At a time when general-purpose CPU performance is only really increasing by 5% per year, pursuing entirely new workloads requires specialization. So two chips could become even more.”
Beth Pariseau, senior news writer at Informa TechTarget, is an award-winning veteran of IT journalism. Have a tip? Email her or connect on LinkedIn.