Over the years, Nvidia has dominated many machine learning benchmarks, and now it has broken two more records.
MLPerf, the AI benchmark suite often referred to as the “Olympics of machine learning,” has released a new set of training tests to enable more and better comparisons between competing computer systems. One of the new tests covers fine-tuning of large language models, the process of taking an already-trained model and training it further on specialized knowledge to suit it to a particular purpose. The other covers graph neural networks, the type of machine learning behind some literature databases, fraud detection in financial systems, and social networks.
Despite participation from computers built around AI accelerators from Google and Intel, systems powered by Nvidia's Hopper architecture once again dominated the results. One system, containing 11,616 Nvidia H100 GPUs (the largest collection to date), topped every one of the nine benchmarks and set records in five of them, including the two new ones.
“Just throwing hardware at a problem doesn't necessarily make it better.” —Dave Salvator, Nvidia
The 11,616-H100 system is “the largest we've ever done,” says Dave Salvator, director of accelerated computing products at Nvidia. It completed the GPT-3 training trial in under 3.5 minutes. By comparison, a 512-GPU system took about 51 minutes. (Note that the GPT-3 task is not a complete training run, which could take weeks and cost millions of dollars. Instead, the computers train on a representative portion of the data, to an agreed-upon point well short of completion.)
Compared to Nvidia's largest entrant in the GPT-3 benchmark last year, a 3,584-H100 system, the 3.5-minute result represents a 3.2-fold improvement. You might expect that given the difference in the systems' size, but Salvator explains that in AI computing this isn't necessarily the case. “You don't necessarily get improvements when you just throw hardware at a problem,” he says.
“We've got essentially linear scaling,” Salvator says, meaning that doubling the number of GPUs cuts training time in half. “[That] represents a great achievement from our engineering team,” he adds.
Competitors are also approaching linear scaling: In this round, Intel deployed a system with 1,024 GPUs that ran the GPT-3 task in 67 minutes, compared with 224 minutes for a computer one-quarter that size six months ago. Google's largest GPT-3 entry used 12 times as many TPU v5p accelerators as its smallest entry and ran the task nine times as fast.
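One rough way to compare those claims is a scaling efficiency: the speedup actually achieved divided by the factor of extra hardware, where 1.0 means perfectly linear scaling. Here is a quick Python sketch using the figures above; the formula is an illustration, not an official MLPerf metric.

```python
# Back-of-the-envelope check, using the figures quoted above; the "efficiency"
# formula is just an illustration, not an official MLPerf metric.
def scaling_efficiency(small_chips, small_minutes, big_chips, big_minutes):
    speedup = small_minutes / big_minutes  # how much faster the big system ran
    scale_up = big_chips / small_chips     # how much more hardware it used
    return speedup / scale_up              # 1.0 would be perfectly linear

# Intel: 1,024 accelerators in 67 minutes vs. a quarter-sized (256-chip) system in 224 minutes.
print(f"{scaling_efficiency(256, 224, 1024, 67):.2f}")  # ~0.84

# Google: 12x the TPU v5p chips ran the task 9x as fast (relative units).
print(f"{scaling_efficiency(1, 9, 12, 1):.2f}")         # 0.75
```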
Linear scaling will be especially important for upcoming “AI factories” that will house more than 100,000 GPUs, Salvator said, adding that he expects one such data center to be up and running this year, with another using Nvidia's next-generation architecture, Blackwell, coming online in 2025.
Nvidia's winning streak continues
Even though Nvidia used the same Hopper architecture as in last year's training results, it continued to drive training times down, thanks entirely to software improvements, Salvator says. “Typically, we get a 2 to 2.5x [boost] from software changes alone after a new architecture is released,” he says.
For GPT-3 training, Nvidia recorded a 27 percent improvement over its June 2023 MLPerf result. Several software changes were behind the gains, Salvator says. For example, Nvidia engineers tuned Hopper's use of lower-precision, 8-bit floating-point math by trimming unnecessary conversions between 8-bit and 16-bit numbers and by better targeting which layers of a neural network can tolerate the lower-precision format. They also found a more intelligent way to adjust the power budget of each chip's compute engines, and they sped up communication among GPUs in a way Salvator likened to “buttering your toast while it's still in the toaster”; in other words, overlapping GPU-to-GPU communication with computation that is still in progress.
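Nvidia's FP8 work relies on its Transformer Engine library and Hopper hardware, but the basic idea of letting only precision-tolerant layers run in a lower-precision format can be sketched with stock PyTorch autocast. The snippet below is an analogy only (bfloat16 on CPU), not Nvidia's MLPerf code.

```python
# Analogy only: autocast runs precision-tolerant ops (the matrix multiplies in
# nn.Linear) in a lower-precision format while leaving sensitive ops in full
# precision, avoiding needless back-and-forth conversions. Nvidia's MLPerf runs
# do the same kind of per-layer targeting, but with FP8 on Hopper GPUs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)  # autocast chooses the precision op by op

print(y.dtype)    # torch.bfloat16 for the matmul outputs
```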
Additionally, the company implemented a scheme called flash attention. Invented in the Stanford laboratory of SambaNova founder Chris Ré, flash attention is an algorithm that speeds up transformer networks by minimizing writes to memory. When it first appeared in MLPerf submissions, flash attention shaved as much as 10 percent from training times. (Intel also used a version of flash attention, but not for GPT-3; instead it applied the algorithm to fine-tuning, one of the new benchmarks.)
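A minimal PyTorch sketch shows what flash attention buys (this is illustrative, not any MLPerf submission code): naive attention materializes the full attention-score matrix in memory, while PyTorch's fused scaled_dot_product_attention can dispatch to a FlashAttention-style kernel on supported GPUs and compute the same result in small on-chip tiles, with far fewer memory writes.

```python
# Minimal sketch, not Nvidia's MLPerf code: naive attention writes the full
# sequence-by-sequence score matrix to memory; a fused FlashAttention-style
# kernel computes the same result in small on-chip tiles instead.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, sequence length, head dimension)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# Naive attention: a 1,024 x 1,024 score matrix is materialized per head.
scores = (q @ k.transpose(-2, -1)) / (64 ** 0.5)
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: on supported GPUs, PyTorch 2.x can dispatch this call to a
# FlashAttention kernel, avoiding the intermediate score matrix entirely.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-4))  # same math, fewer memory writes
```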
Using other software and network tricks, Nvidia achieved an 80 percent speedup on Stable Diffusion, a text-to-image test, compared with its November 2023 submission.
New benchmarks
MLPerf keeps up with trends in the AI industry by adding new benchmarks and upgrading older ones, and this year saw the addition of fine-tuning and graph neural networks.
Fine-tuning takes an already-trained LLM and specializes it for use in a particular domain. Nvidia, for example, took a 43-billion-parameter model and trained it on the GPU maker's own design files and documentation to create ChipNeMo, an AI intended to make its chip designers more productive. At the time, the company's chief scientist, Bill Dally, said that training an LLM is like giving it a liberal arts education, while fine-tuning is like sending it to graduate school.
The MLPerf benchmark takes a pretrained Llama-2-70B model and challenges systems to fine-tune it on a dataset of government documents, with the goal of generating more accurate document summaries.
There are several ways to do such fine-tuning. MLPerf chose one called low-rank adaptation (LoRA), which trains only a small portion of the LLM's parameters, reducing the load on hardware by a third and cutting memory and storage use compared with other methods, according to the organization.
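The gist of LoRA can be shown in a few lines of PyTorch (a hypothetical sketch, not the MLPerf reference implementation): the pretrained weight matrix stays frozen, and only two small low-rank matrices are trained, so the trainable-parameter count is a tiny fraction of the full layer's.

```python
# Hypothetical sketch of the LoRA idea, not the MLPerf reference implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base projection plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # roughly 0.4 percent of the layer
```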
The other new benchmark involved graph neural networks (GNNs), which tackle problems that can be represented by very large sets of interconnected nodes, such as social networks and recommender systems. Compared with other AI tasks, GNNs demand a great deal of communication between the nodes of the computer running them.
The benchmark trains a GNN on a database that shows relationships among academic authors, papers, and institutions, a graph with 547 million nodes and 5.8 billion edges. The neural network is then trained to predict the right label for each node in the graph.
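At toy scale, node classification with a GNN looks like the following PyTorch sketch. The sizes and model here are made up for illustration and bear no resemblance to the 547-million-node benchmark graph: each node's features are averaged with its neighbors', and a small network then predicts a label for every node.

```python
# Toy node-classification sketch in PyTorch; dimensions and model are invented.
import torch
import torch.nn as nn

num_nodes, feat_dim, num_classes = 100, 16, 4
x = torch.randn(num_nodes, feat_dim)                # one feature vector per node
edge_index = torch.randint(0, num_nodes, (2, 500))  # random (source, destination) edges
labels = torch.randint(0, num_classes, (num_nodes,))

# Dense adjacency with self-loops, row-normalized so each node averages its neighbors.
adj = torch.zeros(num_nodes, num_nodes)
adj[edge_index[1], edge_index[0]] = 1.0
adj = adj + torch.eye(num_nodes)
adj = adj / adj.sum(dim=1, keepdim=True)

model = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, num_classes))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for step in range(100):
    h = adj @ x                                            # one round of neighbor aggregation
    loss = nn.functional.cross_entropy(model(h), labels)  # predict a label for every node
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.3f}")
```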
Future battles
The 2025 training round will likely feature head-to-head contests comparing new accelerators from AMD, Intel, and Nvidia. AMD's MI300 series launched about six months ago; an upgraded version with more memory, the MI325X, is scheduled for the end of 2024; and the next-generation MI350 is planned for 2025. Intel says its Gaudi 3, which becomes generally available to computer makers later this year, will appear in MLPerf's upcoming inference benchmarks. Intel executives have said the new chip can beat the H100 at training LLMs. But any win may be short-lived, as Nvidia has unveiled a new architecture, Blackwell, also due later this year.