Why AI Inference Primarily Stays on the CPU

Sponsored Feature: Training AI models requires enormous amounts of computational power and high-bandwidth memory. Because model training can be parallelized, with data chopped into relatively small pieces and processed by a large number of fairly modest floating-point units, GPUs were the natural device on which the AI revolution began.

Although there are some custom ASICs that can perform the required bulk matrix computations using various types of SRAM or DRAM memory, GPUs remain the preferred AI training device. Since GPUs are ubiquitous and their computational frameworks are well developed and easily accessible, there is every reason to believe that GPUs will continue to be the computational engine of choice for AI training at most companies.

No wonder GPU acceleration has become commonplace for HPC simulation and modeling workloads. Other workloads in the data center, such as virtual desktop infrastructure, data analytics, and database management systems, can be accelerated with the exact same iron used to run AI training.

But AI inference, where a relatively complex AI model is boiled down to a set of weights used to compute predictions on new data that was not part of the original training set, is another matter altogether. For very sound technical and economic reasons, in many cases AI inference should (and will) remain on the same server CPU where the application being augmented with AI algorithms is already running.

Nothing beats free AI inference

There is plenty of debate about whether inference should stay on the CPU or move to accelerators in the server chassis, or even across the network to banks of GPUs or custom ASICs running as inference accelerators.

First, an external inference engine adds complexity (you have to buy something that can break) and increases the attack surface between the application and its inference engine, potentially increasing security risks. And no matter what, an external inference engine adds latency, especially for workloads that run across the network, as many hyperscalers and cloud builders do.

To be sure, previous generations of server CPUs did not deliver high inference throughput on mixed-precision integer or floating-point data, but for many inference jobs that did not matter, because they did not need much throughput in the first place. That is why 70 percent of inference in the data center, across hyperscalers, cloud builders, and other types of companies, still runs on Intel® Xeon® CPUs. For heavy inference jobs, however, the throughput of server-class CPUs could not compete with GPUs and custom ASICs.

Until now.

As previously explained, in the “Sapphire Rapids” 4th Generation Intel® Xeon® processors, the Intel Advanced Matrix Extensions (AMX) matrix math accelerator inside each “Golden Cove” core greatly boosts the performance of the low-precision math that underpins AI inference (read more about the accelerators built into Intel’s latest Xeon CPUs here).

SAPPHIRE RAPIDS VECTOR MATRIX THROUGHPUT

The AMX unit can handle 2,048 8-bit integer (INT8) operations per cycle per core. That is 24X the INT8 throughput of the plain vanilla AVX-512 vector units used in the “Skylake” CPUs, and 8X the INT8 throughput of the much more efficient AVX-512 units with Vector Neural Network Instructions (VNNI) found in the “Cascade Lake” and “Ice Lake” CPUs. The Golden Cove cores support running the AVX-512 units with VNNI and the AMX units simultaneously, which works out to 32X the INT8 throughput for inference workloads.
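As a rough sanity check, the per-core figures implied by those ratios can be back-computed from the 2,048 ops/cycle number. The sketch below is illustrative arithmetic based only on the ratios quoted above, not a benchmark or an official specification.

```python
# Illustrative arithmetic only: back-computing per-core INT8 throughput
# from the figures quoted in the text (2,048 ops/cycle for AMX, 24X over
# plain AVX-512, 8X over AVX-512 with VNNI). Not a benchmark.

AMX_INT8_OPS_PER_CYCLE = 2048  # per core, per the article

avx512_plain_implied = AMX_INT8_OPS_PER_CYCLE / 24  # "Skylake"-era AVX-512
avx512_vnni_implied = AMX_INT8_OPS_PER_CYCLE / 8    # "Cascade Lake"/"Ice Lake" VNNI

print(f"Plain AVX-512 (implied): ~{avx512_plain_implied:.0f} INT8 ops/cycle/core")
print(f"AVX-512 with VNNI (implied): {avx512_vnni_implied:.0f} INT8 ops/cycle/core")
print(f"AMX: {AMX_INT8_OPS_PER_CYCLE} INT8 ops/cycle/core")
```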

The secret of the AMX unit is that it is included in every Golden Cove core across all 52 variants of the Sapphire Rapids CPUs in the SKU stack. Based on the integer performance of these cores (AVX-512 and AMX performance not included), the Sapphire Rapids Xeons deliver price/performance on par with or slightly better than the prior-generation Xeon SP processors. Put another way, the AMX units are essentially free, since they are included in every CPU and offer additional performance at no incremental cost compared to Ice Lake. It is hard to get inference cheaper than free, especially when you need a CPU to run the application in the first place.

Stack the flops

Theoretical performance is important, but what matters is how real-world AI inference applications can take advantage of the new AMX units in the Golden Cove cores.

Let’s take a closer look at how inference performance has evolved from the “Broadwell” Xeon E7, launched in June 2016, through the next four generations of Xeon SP processors. This particular graph shows the interplay between processor throughput and watts consumed per 1,000 images processed per second.

SAPPHIRE RAPIDS XEON RESNET INFERENCE

See [A17, A33] at https://edc.intel.com/content/www/us/en/products/performance/benchmarks/4th-generation-intel-xeon-scalable-processors/. Results may vary.

In this case, the tests run on these five generations of servers use the ResNet-50 model on top of the TensorFlow framework for image recognition. Over that span, image processing throughput has increased from about 300 images per second to over 12,000 images per second, which is more than a 40X improvement.
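For reference, here is a minimal sketch of what that class of workload looks like in code, using the stock Keras ResNet-50 weights on a CPU. This is only an illustration of the workload type, not the benchmark configuration behind the numbers above.

```python
# Minimal sketch of ResNet-50 image-recognition inference on a CPU with
# TensorFlow/Keras. Illustrative only; not the benchmark setup quoted above.
import numpy as np
from tensorflow.keras.applications import resnet50

model = resnet50.ResNet50(weights="imagenet")  # pretrained FP32 model

# A batch of dummy 224x224 RGB images stands in for real input data.
batch = np.random.rand(32, 224, 224, 3).astype("float32") * 255.0
batch = resnet50.preprocess_input(batch)

predictions = model.predict(batch, verbose=0)
print(resnet50.decode_predictions(predictions, top=1)[0])
```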

Also, the power consumed per 1,000 images per second has dropped even more than this graph indicates. It takes three and a third of the 24-core Broadwell E7 processors running at FP32 precision to process 1,000 images per second; at 165 watts per chip, that works out to roughly 550 watts for the load. Powered by AMX units using a mix of BF16 and INT8 processing, a Sapphire Rapids chip burns under 75 watts to do the same work. That gives Sapphire Rapids more than 7.3X better performance per watt than the Broadwell CPU from five generations back.
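The arithmetic behind that claim is straightforward; the sketch below simply reproduces it from the figures stated in the text.

```python
# Reproducing the performance-per-watt arithmetic quoted above.
# All inputs are the figures stated in the text, not measured data.

broadwell_chips_per_1k_img_s = 10 / 3  # "three and a third" 24-core E7 chips
broadwell_watts_per_chip = 165
broadwell_watts = broadwell_chips_per_1k_img_s * broadwell_watts_per_chip  # ~550 W

sapphire_rapids_watts = 75  # "under 75 watts" for the same 1,000 images/sec

perf_per_watt_gain = broadwell_watts / sapphire_rapids_watts
print(f"Broadwell: ~{broadwell_watts:.0f} W per 1,000 images/sec")
print(f"Sapphire Rapids: <{sapphire_rapids_watts} W per 1,000 images/sec")
print(f"Perf/watt improvement: ~{perf_per_watt_gain:.1f}X")  # ~7.3X
```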

What about other workloads? Let’s see. Here is how a 56-core Sapphire Rapids Xeon SP-8480+ CPU running at 2GHz stacks up against the previous generation 40-core Ice Lake Xeon SP-8380 CPU running at 2.3GHz on image classification, natural language processing (transformer), image segmentation, and object detection models running on top of the PyTorch framework:

SAPPHIRE RAPIDS VS ICE LAKE: VARIOUS INFERENCE WORKLOADS

See [A17, A33] at https://edc.intel.com/content/www/us/en/products/performance/benchmarks/4th-generation-intel-xeon-scalable-processors/. Results may vary.

As the chart shows, this is a comparison of FP32 processing on the AVX-512 units of the Ice Lake chip against BF16 processing on the AMX units of the Sapphire Rapids chip. Halving the precision between the two platforms would, in theory, double the throughput between these two generations. The relative performance of the two chips (core count times clock speed) yields another 21.7 percent more performance. The remaining gain, which works out to 3.5X to 7.8X of the 5.7X to 10X shown above, comes from the use of the AMX units.
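On the software side, running a PyTorch model in BF16 on the CPU, so that the runtime can route matrix math to AMX where the hardware supports it, can be as simple as wrapping inference in autocast. The snippet below is a minimal sketch using a stock torchvision ResNet-50; it is not the tuned configuration used for the benchmark results above.

```python
# Minimal sketch of BF16 inference on a CPU with PyTorch autocast.
# On CPUs with AMX, the BF16 matrix math can be dispatched to the AMX units
# by the underlying oneDNN kernels. Illustrative only, not a benchmark setup.
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
images = torch.randn(32, 3, 224, 224)  # dummy input batch

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(images)

print(output.shape)  # torch.Size([32, 1000])
```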

The real test, of course, is how the inference performance of the AMX units built into Sapphire Rapids compares to an outboard accelerator. Here is a two-socket server with Xeon SP-8480+ processors compared to an Nvidia “Ampere” A10 GPU accelerator:

SAPPHIRE RAPIDS VS NVIDIA A10: VARIOUS INFERENCE WORKLOADS

See [A218] at https://edc.intel.com/content/www/us/en/products/performance/benchmarks/4th-generation-intel-xeon-scalable-processors/. Results may vary.

The two Sapphire Rapids processors deliver 90 percent better performance than the A10 for natural language inference on the BERT-Large model, and outperform the A10 by 1.5X to 3.5X on the other workloads.

An A10 GPU accelerator probably costs somewhere between $3,000 and $6,000 today, and it sits either far away on the PCI-Express 4.0 bus or, worse, even further away across Ethernet or InfiniBand networks in dedicated inference servers accessed over the network, adding a round trip from the application server. And even though Nvidia’s newer “Lovelace” L40 GPU accelerator can do more work, the AMX units come built into the Sapphire Rapids CPUs by default and require no add-ons.

Sponsored by Intel.


