Google powers AI training with A3 virtual machines built on Nvidia’s H100 GPUs

Machine Learning


Google Cloud is expanding its portfolio of virtual machines for training and running artificial intelligence and machine learning models with the launch of the A3 supercomputer.

Announced today at Google I/O, the Google Compute Engine A3 supercomputer is purpose-built to train and serve cutting-edge AI models, including the models driving advances in the fast-moving field of generative AI, the company said.

Cutting-edge AI and machine learning models require massive amounts of computing power provided by purpose-built infrastructure, Roy Kim, director of product management, and Chris Kleban, group product manager at Google, explained in a co-authored blog post. With the A3 supercomputer, Google Cloud combines Nvidia Corp.’s new H100 graphics processing units with its own networking advancements to give customers powerful infrastructure for their AI workloads, ensuring they have access to the most capable GPUs available, Kim and Kleban said.

A single A3 supercomputer VM is powered by eight H100 GPUs built on Nvidia’s Hopper architecture, delivering three times faster compute than the previous-generation A100 chip. The GPUs are linked via NVSwitch and NVLink 4.0, providing 3.6 terabytes per second of bisection bandwidth across them, and are paired with Intel Corp.’s 4th Gen Xeon Scalable processors, which offload management tasks.
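The 3.6 TB/s bisection figure can be sanity-checked with a quick back-of-the-envelope calculation. One assumption not stated in the article: each H100 exposes roughly 900 GB/s of NVLink 4.0 bandwidth, the figure Nvidia quotes for Hopper.

```python
# Hedged sanity check of the quoted 3.6 TB/s bisection bandwidth.
# Assumption (not from the article): ~900 GB/s of NVLink 4.0 bandwidth per H100.
nvlink_bw_per_gpu_gb = 900   # GB/s per GPU over NVLink 4.0
gpus_per_vm = 8

# Bisection bandwidth: cut the 8 GPUs into two halves of 4; the traffic
# crossing the cut is bounded by the 4 GPUs on one side of it.
bisection_gb = (gpus_per_vm // 2) * nvlink_bw_per_gpu_gb
print(bisection_gb / 1000, "TB/s")  # → 3.6 TB/s
```

Under that assumption, 4 × 900 GB/s reproduces the 3.6 TB/s the announcement cites.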

The A3 supercomputer is the first GPU instance to leverage Google’s custom-designed infrastructure processing units, co-developed with Intel Corp., which accelerate data transfers by allowing them to bypass the CPU host. According to Google, this increases network bandwidth by up to 10 times over previous-generation A2 VMs.

These instances also use Google’s intelligent Jupiter data center networking fabric and can scale to 26,000 interconnected GPUs, delivering up to 26 exaflops of AI performance. As a result, A3 VMs significantly reduce the time and cost of training large-scale machine learning models, Google said. Additionally, when enterprises move from training to serving models, A3 VMs can deliver a 30-times improvement in inference performance compared to A2 VMs.
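Dividing those two scale figures gives the per-GPU throughput the claim implies, which is a useful plausibility check:

```python
# Hedged sanity check of the article's scale figures:
# 26,000 GPUs delivering up to 26 exaflops implies the per-GPU rate below.
total_exaflops = 26
gpu_count = 26_000

petaflops_per_gpu = total_exaflops * 1000 / gpu_count  # 1 exaflop = 1000 petaflops
print(petaflops_per_gpu, "PFLOPS per GPU")  # → 1.0 PFLOPS per GPU
```

Roughly 1 petaflop per GPU is in line with the low-precision throughput Nvidia quotes for the H100, so the two headline numbers are mutually consistent.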

Beyond raw performance, Google Cloud also offers flexible deployment options for A3 VMs. For example, customers can choose to deploy them on Google Cloud’s Vertex AI platform to build machine learning models on fully managed infrastructure purpose-built for high-performance training. Vertex AI was recently updated with new generative AI capabilities and enhanced support for large language model development.

Alternatively, customers who want to build their own customized software stacks can deploy the A3 supercomputer on Google Compute Engine or Google Kubernetes Engine, the company said. This allows teams to train and serve advanced foundation models while benefiting from automatic scaling, workload orchestration and automatic upgrades.
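As a sketch of the Compute Engine route, requesting an A3 instance would look something like the following `gcloud` invocation. The `a3-highgpu-8g` machine type (eight H100s) and the zone shown are assumptions not taken from the article; check Google Cloud’s current documentation for the exact names and available regions.

```shell
# Hedged sketch: request an A3 VM on Google Compute Engine.
# The machine type (a3-highgpu-8g, i.e. 8x H100) and the zone are
# assumptions; verify both against the current Google Cloud docs.
gcloud compute instances create a3-demo \
    --machine-type=a3-highgpu-8g \
    --zone=us-central1-a
```

Teams wanting orchestration on top of this would instead create a GKE node pool with the same machine type and let Kubernetes handle scaling and scheduling.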

Image: Google Cloud
