Nvidia's Blackwell Ultra Dominates MLPerf Inference


The machine learning field is moving fast, and the yardsticks used to measure its progress are having to race to keep up. MLPerf, the twice-yearly machine learning competition sometimes called “the Olympics of AI,” has introduced three new benchmark tests that reflect new directions in the field.

“It has been extremely difficult lately to keep track of what is happening in the field,” says Miro Hodak, AMD engineer and co-chair of the MLPerf Inference working group. “You can see that the models are getting progressively bigger. In the last two rounds, we've introduced the biggest models to date.”

The chips that tackled these new benchmarks came from the usual suspects: Nvidia, AMD, and Intel. Nvidia topped the charts, introducing its new Blackwell Ultra GPU, packaged in a GB300 rack-scale design. AMD put up a strong performance, introducing its latest MI325X GPU. Intel showed that inference can still be run on CPUs with its Xeon submissions, but also entered the GPU game with an Intel Arc Pro submission.

New benchmarks

In the previous round, MLPerf introduced its largest benchmark yet, a large language model based on Llama3.1-405B. This round, the organizers topped themselves again, introducing a benchmark based on the DeepSeek R1 671B model. That is more than 1.5 times the number of parameters of the previous largest benchmark.
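The size comparison can be checked with a line of arithmetic. The sketch below takes the parameter counts from the model names (405 billion for the Llama 3.1 model, 671 billion for DeepSeek R1); the function and constant names are illustrative only.

```python
# Parameter counts in billions, taken from the model names quoted in the text.
LLAMA_3_1_PARAMS_B = 405
DEEPSEEK_R1_PARAMS_B = 671


def size_ratio(new_params_b: float, old_params_b: float) -> float:
    """Return how many times larger the new model is than the old one."""
    return new_params_b / old_params_b


ratio = size_ratio(DEEPSEEK_R1_PARAMS_B, LLAMA_3_1_PARAMS_B)
print(f"DeepSeek R1 has about {ratio:.2f}x the parameters of Llama3.1-405B")
```

The ratio comes out at roughly 1.66, which is consistent with the "more than 1.5 times" figure.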

As a reasoning model, DeepSeek R1 works through several chain-of-thought steps when approaching a query. This means that much more of the computation happens during inference than in normal LLM operation, which makes this benchmark even more challenging. Reasoning models are claimed to be the most accurate, making them the technique of choice for science, math, and complex programming queries.

In addition to the largest LLM benchmark yet, MLPerf also introduced the smallest, based on Llama3.1-8B. Taran Iyengar, MLPerf Inference task force chair, explains that there is growing industry demand for low-latency yet high-accuracy inference. Small LLMs can supply this and are well suited to tasks such as text summarization and edge applications.

This brings the total count of LLM-based benchmarks to a confusing four: the new, smallest Llama3.1-8B benchmark; the existing Llama2-70B benchmark; the Llama3.1-405B benchmark introduced in the previous round; and the largest, the new DeepSeek R1 model. If nothing else, this signals that LLMs are not going anywhere.

In addition to the many LLMs, this round of MLPerf Inference included a new speech-to-text model based on Whisper-large-v3. This benchmark is a response to the growing number of voice-enabled applications, such as smart devices and speech-based AI interfaces.

The MLPerf Inference competition has two broad categories: “closed,” where the reference neural network model must be used without modification, and “open,” where some modifications to the model are allowed. Within those, there are several subcategories related to how the tests are run and on what kind of infrastructure. For the sake of sanity, we will focus on the “closed” data center server results.

Nvidia leads

Unsurprisingly, the best performance per accelerator on each benchmark, at least in the “server” category, was achieved by an Nvidia GPU-based system. Nvidia also unveiled the Blackwell Ultra, which topped the charts in the two largest benchmarks: Llama3.1-405B and DeepSeek R1 reasoning.

The Blackwell Ultra is a more powerful iteration of the Blackwell architecture, featuring significantly more memory capacity, attention-layer acceleration, 1.5x the AI compute, and faster memory and connectivity compared to standard Blackwell. It is intended for larger AI workloads, like the two benchmarks it was tested on.

In addition to the hardware improvements, Dave Salvator, director of accelerated computing products at Nvidia, attributes the success of Blackwell Ultra to two key changes. The first is the use of Nvidia's own 4-bit floating-point number format, NVFP4. “It can provide comparable accuracy to formats like BF16,” says Salvator, while using much less computing power.

The second is so-called disaggregated serving. The idea behind disaggregated serving is that the inference workload has two main parts: prefill, where the query (“Please summarize this report”) and its entire context window (the report) are loaded into the LLM, and generation/decoding, where the output is actually calculated. These two stages have different requirements. Prefill is compute heavy, while generation/decoding is much more dependent on memory bandwidth. Salvator says that by assigning different groups of GPUs to the two stages, Nvidia achieves a performance gain of nearly 50%.
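The two-stage split described above can be sketched as a toy scheduler. This is only a structural illustration, not Nvidia's serving stack: the class names, the round-robin dispatch, and the list standing in for a key-value cache are all assumptions made for the example, and no real model is run.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Request:
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)   # stand-in for cached attention state
    output: list = field(default_factory=list)
    served_by: list = field(default_factory=list)  # records which workers touched it


class PrefillWorker:
    """Compute-heavy stage: ingests the whole prompt in one pass."""

    def __init__(self, name):
        self.name = name

    def run(self, request):
        request.kv_cache = list(request.prompt_tokens)
        request.served_by.append(self.name)


class DecodeWorker:
    """Bandwidth-bound stage: emits one token per step, extending the cache."""

    def __init__(self, name):
        self.name = name

    def run(self, request):
        for step in range(request.max_new_tokens):
            token = f"tok{step}"  # placeholder for a sampled token
            request.output.append(token)
            request.kv_cache.append(token)
        request.served_by.append(self.name)


class DisaggregatedServer:
    """Routes each request through a prefill pool, then a separate decode pool."""

    def __init__(self, prefill_names, decode_names):
        self._prefill = cycle([PrefillWorker(n) for n in prefill_names])
        self._decode = cycle([DecodeWorker(n) for n in decode_names])

    def serve(self, request):
        next(self._prefill).run(request)  # stage 1 on one group of devices
        next(self._decode).run(request)   # stage 2 on a different group
        return request


server = DisaggregatedServer(prefill_names=["p0", "p1"], decode_names=["d0"])
first = server.serve(Request(["Please", "summarize", "this", "report"], max_new_tokens=2))
second = server.serve(Request(["Hello"], max_new_tokens=1))
print(first.served_by, first.output)
print(second.served_by, second.output)
```

Because the pools are separate, each can be sized for its own bottleneck; in a real system the cache handoff between the stages is the costly part that this toy omits.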

AMD is right behind

AMD's newest accelerator chip, the MI355X, launched in July. The company offered results for it only in the “open” category, where software modifications to the model are permitted. Like the Blackwell Ultra, the MI355X features 4-bit floating-point support and expanded high-bandwidth memory. Mahesh Balasubramanian, senior director of data center GPU product marketing at AMD, says the MI355X beat its predecessor, the MI325X, by a factor of 2.7 in the open Llama2-70B benchmark.

AMD's “closed” submissions included systems powered by AMD MI300X and MI325X GPUs. The more advanced MI325X system performed similarly to systems built with Nvidia H200s on the Llama2-70B, mixture-of-experts, and image generation benchmarks.

This round also included the first hybrid submission, in which both AMD MI300X and MI325X GPUs were used for the same inference task, the Llama2-70B benchmark. The use of hybrid GPUs is important because new GPUs are arriving at a yearly cadence, and the older models, deployed en masse, are not going anywhere. Being able to spread workloads across different types of GPUs is an essential step.
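One simple way to spread a workload across mixed hardware is to split requests in proportion to each device type's throughput. The sketch below shows that idea only; it is not how the hybrid submission was built, and the relative throughput figures are made-up placeholders, not measured results.

```python
def split_requests(total_requests, throughput_by_gpu):
    """Split a batch across GPU types in proportion to their relative throughput.

    Uses the largest-remainder method so the shares always sum to the total.
    """
    total_throughput = sum(throughput_by_gpu.values())
    exact = {
        gpu: total_requests * rate / total_throughput
        for gpu, rate in throughput_by_gpu.items()
    }
    shares = {gpu: int(value) for gpu, value in exact.items()}
    leftover = total_requests - sum(shares.values())
    # Hand remaining requests to the GPU types with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda gpu: exact[gpu] - shares[gpu], reverse=True)
    for gpu in by_remainder[:leftover]:
        shares[gpu] += 1
    return shares


# Hypothetical relative throughputs, chosen only for illustration.
relative_throughput = {"MI300X": 1.0, "MI325X": 1.3}
plan = split_requests(100, relative_throughput)
print(plan)
```

A static split like this ignores queueing and request length, so a production scheduler would more likely balance load dynamically.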

Intel enters the GPU game

In the past, Intel has steadfastly maintained that machine learning does not require a GPU. Indeed, submissions using Intel's Xeon CPUs performed on par with the Nvidia L4 on the object detection benchmark, but trailed on the recommender system benchmark.

This round, for the first time, an Intel GPU also made a showing. The Intel Arc Pro was first released in 2022. The MLPerf submission featured a graphics card called the Maxsun Intel Arc Pro B60 Dual 48G Turbo, which contains two GPUs and 48 GB of memory. The system performed on par with Nvidia's L40 on the small LLM benchmark and trailed it on the Llama2-70B benchmark.
