Originally, PyTorch used an eager mode, in which each operation that makes up a model runs independently as soon as it is reached. PyTorch 2.0 introduced torch.compile to make PyTorch code run faster than the default eager mode. In contrast to eager mode, torch.compile pre-compiles the entire model into a single graph in a way that is optimal for execution on a given hardware platform. AWS has optimized the PyTorch torch.compile feature for AWS Graviton3 processors. This optimization improves Hugging Face model inference performance by up to 2x (based on the geometric mean of the performance improvement for 33 models) and TorchBench model inference performance by up to 1.35x (based on the geometric mean of the performance improvement for 45 models) compared to default eager mode inference across multiple natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheel and the AWS Graviton PyTorch Deep Learning Container (DLC).
In this blog post, we explain how to optimize torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve inference performance, and the resulting speedups.
What is torch.compile and why use it?
In eager mode, operators in a model are executed as soon as they are encountered. It is the default mode because it is easy to use and well suited to machine learning (ML) researchers. However, eager mode incurs runtime overhead because of redundant kernel launches and memory reads. In torch compile mode, by contrast, operators are first composed into a graph and then fused with one another wherever possible, which reduces and localizes the overall cost of memory reads and kernel launches.
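To make the distinction concrete, the following minimal sketch (not taken from the post's benchmark scripts; the toy model and shapes are arbitrary) runs the same module once in eager mode and once through torch.compile with the inductor backend:

```python
import torch
import torch.nn as nn

# A small example model; the layers are arbitrary and only for illustration.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
).eval()

x = torch.randn(8, 512)

# Eager mode: each operator runs as soon as it is reached.
with torch.no_grad():
    eager_out = model(x)

# Compile mode: torch.compile captures the model into a graph and hands it
# to the inductor backend, which fuses operators where possible.
compiled_model = torch.compile(model, backend="inductor")
with torch.no_grad():
    compiled_out = compiled_model(x)  # the first call triggers compilation

print(torch.allclose(eager_out, compiled_out, atol=1e-5))
```

The first call to the compiled module pays the one-time graph capture and compilation cost; subsequent calls reuse the compiled kernels.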
The goal of the AWS Graviton team was to optimize the torch.compile backend for the Graviton3 processor. PyTorch eager mode was already optimized for Graviton3 with Arm Compute Library (ACL) kernels, used via oneDNN (also known as MKLDNN). The question, then, was how to reuse those kernels in torch.compile mode to get the best of both graph compilation and kernel-optimized performance.
Results
The AWS Graviton team reused the ACL kernels and extended the torch inductor and oneDNN primitives to optimize compiled-mode performance on the Graviton3 processor. Starting with PyTorch 2.3.1, the optimization is available in the torch Python wheel and the AWS Graviton DLC. See the Running inference section later in this post for installation, runtime configuration, and how to run the tests.
To demonstrate the performance improvements, we used TorchBench's NLP, CV, and recommendation models, as well as Hugging Face's most downloaded NLP models across question answering, text classification, token classification, translation, zero-shot classification, summarization, feature extraction, text generation, text2text generation, fill-mask, and sentence similarity tasks, covering a range of customer use cases.
First, we measured the latency in milliseconds (msec) of TorchBench model inference in eager mode, which is marked as 1.0 by the dotted red line in the following graph. Then we compared the improvement from torch.compile for the same model inference; the normalized results are plotted in the graph. Across the 45 models we benchmarked, we see a 1.35x latency improvement (geometric mean across the 45 models).
Image 1: Improving PyTorch model inference performance with torch.compile on AWS Graviton3-based c7g instances using the TorchBench framework. Reference eager mode performance is marked as 1.0 (higher is better)
Similar to the preceding TorchBench inference performance graph, we started by measuring the Hugging Face NLP model inference latency (in milliseconds) in eager mode, which is marked as 1.0 in the following graph by the dotted red line. Then we compared the improvement from torch.compile for the same model inference; the normalized results are plotted in the graph. Across the 33 models benchmarked, we see roughly a 2x performance improvement (geometric mean across the 33 models).
Image 2: Improving Hugging Face NLP model inference performance with torch.compile on AWS Graviton3-based c7g instances using Hugging Face example scripts. Reference eager mode performance is marked as 1.0 (higher is better)
Running inference
Starting with PyTorch 2.3.1, optimizations are available in the torch Python wheel and the AWS Graviton PyTorch DLC. In this section, we show how to run inference in eager and torch.compile modes using the torch Python wheel and benchmark scripts from the Hugging Face and TorchBench repositories.
To successfully run the scripts and reproduce the speedup numbers discussed in this post, you need an instance from the Graviton3 family (c7g/r7g/m7g/hpc7g). This post uses a c7g.4xl (16 vCPU) instance. The instance, AMI details, and required torch library versions are described in the following snippet:
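The snippet with the exact AMI ID and version pins is not reproduced here; as a stand-in, the following quick check (a sketch; only the 2.3.1 version floor comes from this post) confirms on the instance that you are on an aarch64 machine with a new enough torch wheel:

```python
import platform

import torch

# The Graviton optimizations described in this post ship starting with PyTorch 2.3.1.
print("machine:", platform.machine())                      # expect "aarch64" on Graviton
print("torch:", torch.__version__)                         # expect 2.3.1 or later
print("mkldnn available:", torch.backends.mkldnn.is_available())
```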
The general-purpose runtime tunings implemented for eager mode inference are equally applicable to torch.compile mode, so to further improve torch.compile performance on AWS Graviton3 processors, set the following environment variables:
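The list of variables is not reproduced above; the following is a sketch of the general-purpose Graviton tunings documented in the AWS Graviton Technical Guide (the specific variables and values are drawn from that guide rather than from this post, so confirm them there). They are normally exported in the shell before launching Python; setting them through os.environ is equivalent as long as it happens before torch is imported.

```python
import os

# Use the bfloat16 fast-math GEMM kernels (oneDNN/ACL) to accelerate fp32 inference.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

# Enable transparent huge pages for tensor allocations to reduce allocation latency.
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"

# Cache allocations/primitives to avoid redundant reallocation.
os.environ["LRU_CACHE_CAPACITY"] = "1024"

# One OpenMP thread per vCPU (16 on the c7g.4xl used in this post).
os.environ["OMP_NUM_THREADS"] = "16"

import torch  # import torch only after the environment is configured
```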
TorchBench benchmark script
TorchBench is a collection of open-source benchmarks used to evaluate the performance of PyTorch. We benchmarked 45 models using the scripts in the TorchBench repository. The following code shows how to run the scripts in eager mode and in compile mode with the inductor backend:
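The TorchBench commands themselves are not reproduced above; the repository's run.py and its help output are the authoritative reference for the exact flags. As a self-contained illustration of what the two modes measure, the sketch below times eager and inductor-compiled inference for bert-base-uncased, one of the workloads TorchBench covers (the model choice, warm-up count, and iteration count are assumptions for illustration):

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer


def benchmark(model, inputs, iters=50):
    # Warm up first; for the compiled model the first call includes compilation time.
    with torch.no_grad():
        for _ in range(10):
            model(**inputs)
        start = time.perf_counter()
        for _ in range(iters):
            model(**inputs)
    return (time.perf_counter() - start) / iters * 1000  # average latency in msec


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()
inputs = tokenizer("Graviton3 inference benchmark example", return_tensors="pt")

eager_ms = benchmark(model, inputs)
compiled_ms = benchmark(torch.compile(model, backend="inductor"), inputs)
print(f"eager: {eager_ms:.2f} ms, compile: {compiled_ms:.2f} ms")
```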
After the inference run completes successfully, the script saves the results in JSON format. Below is a sample output.
Hugging Face benchmark script
The Google T5 Small Text Translation model is one of the 33 Hugging Face models that we benchmarked. We use it as an example to demonstrate how to run inference in eager and compile modes. The additional configuration and APIs required to run it in compile mode are highlighted in the script. Save the following script as google_t5_small_text_translation.py.
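The original script is not reproduced here; the following is a minimal sketch of what it could look like, assuming the public transformers T5 API and an illustrative -m/--mode switch between the two modes (the flag names, warm-up count, and default iteration count are assumptions, not the post's exact interface). The torch.compile call is the compile-mode addition referred to above, and the torch.profiler block produces the operator-level breakdown discussed after the run steps.

```python
import argparse

import torch
from torch.profiler import ProfilerActivity, profile, record_function
from transformers import T5Model, T5Tokenizer


def run_inference(mode: str, num_iters: int) -> None:
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5Model.from_pretrained("t5-small").eval()

    input_ids = tokenizer(
        "Studies have shown that owning a dog is good for you",
        return_tensors="pt",
    ).input_ids
    decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids

    if mode == "compile":
        # Compile-mode addition: wrap the model with the inductor backend.
        model = torch.compile(model, backend="inductor")

    with torch.no_grad():
        # Warm up so compile mode pays its one-time compilation cost up front.
        for _ in range(10):
            model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

        # Profile the steady-state iterations and print the operator breakdown.
        with profile(activities=[ProfilerActivity.CPU]) as prof:
            with record_function("t5_small_inference"):
                for _ in range(num_iters):
                    model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="T5-small eager vs compile inference")
    parser.add_argument("-m", "--mode", choices=["eager", "compile"], default="eager")
    parser.add_argument("-n", "--iterations", type=int, default=100)
    args = parser.parse_args()
    run_inference(args.mode, args.iterations)
```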
To run the script, follow these steps:
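Assuming the sketch above and its illustrative -m flag, running the benchmark comes down to invoking `python google_t5_small_text_translation.py -m eager` for the eager baseline and then `python google_t5_small_text_translation.py -m compile` for compile mode, with OMP_NUM_THREADS and the other environment variables from the runtime configuration section already exported.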
After the inference run completes successfully, the script prints a torch profiler output that includes a breakdown of the torch operator latency. Below is a sample output from the torch profiler:
What's next?
Next, we will extend torch inductor CPU backend support to compile Llama models, and add support for fused GEMM kernels to enable fusion optimization of torch inductor operators on AWS Graviton3 processors.
Conclusion
In this post, we showed how to optimize torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve the inference performance of your PyTorch models, and the resulting speedups. Give it a try! If you need support with ML software on Graviton, see the AWS Graviton Technical Guide or file an issue on GitHub.
About the Author
Sunita Nadampalli is a Software Development Manager and AI/ML expert at AWS, leading AWS Graviton software performance optimizations for AI/ML and HPC workloads, and is passionate about open source software development and delivering high-performance, sustainable software solutions for SoCs based on the Arm ISA.