Accelerating PyTorch Inference with torch.compile on AWS Graviton Processors



Originally, PyTorch used an eager mode in which each PyTorch operation that forms a model is executed independently as soon as it arrives. In PyTorch 2.0, torch.compile was introduced to make PyTorch code faster than the default eager mode. In contrast to eager mode, torch.compile pre-compiles the entire model into a single graph in a way that is optimal for execution on a given hardware platform. AWS has optimized the PyTorch torch.compile feature for AWS Graviton3 processors. This optimization improves Hugging Face model inference performance by up to 2x (based on the geometric mean of performance improvement for 33 models) and TorchBench model inference performance by up to 1.35x (based on the geometric mean of performance improvement for 45 models) compared to default eager mode inference across multiple natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Starting with PyTorch 2.3.1, these optimizations are available in the Torch Python wheel and the AWS Graviton PyTorch Deep Learning Container (DLC).

In this blog post, we explain how to optimize the performance of torch.compile on AWS Graviton3-based EC2 instances, how to use the optimizations to improve inference performance, and the resulting speedups.

What is torch.compile and why use it?

In eager mode, operators in a model are executed as soon as they are encountered. This mode is the default mode because it is easy to use and is well suited for machine learning (ML) researchers. However, eager mode incurs runtime overhead due to redundant kernel launches and memory read overhead. In contrast, in torch compilation mode, operators are first composed into a graph, and one operator is merged with another to reduce and localize the total overhead of memory reads and kernel launches.
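To make the contrast concrete, the following is a minimal sketch (using a hypothetical TinyMLP module, not one of the benchmarked models) of running the same model in eager mode and then pre-compiling it with torch.compile, which uses the inductor backend by default:

import torch
import torch.nn as nn

# A toy model used only to illustrate the API (hypothetical, not from the benchmarks)
class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP().eval()
x = torch.randn(32, 128)

with torch.no_grad():
    eager_out = model(x)                   # eager mode: operators run one by one

    compiled_model = torch.compile(model)  # capture the model into a graph
    compiled_out = compiled_model(x)       # first call triggers compilation

    # Compiled mode should produce (numerically close) identical results
    torch.testing.assert_close(eager_out, compiled_out, rtol=1e-3, atol=1e-3)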

The goal of the AWS Graviton team was to optimize the torch.compile backend for the Graviton3 processor. PyTorch eager mode was already optimized for the Graviton3 processor with Arm Compute Library (ACL) kernels that use oneDNN (also known as MKLDNN). The question then was how to reuse these kernels in torch.compile mode to get the best of both graph compilation and kernel optimized performance.

Results

The AWS Graviton team extended the torch inductor and oneDNN primitives to reuse the ACL kernels and optimize compiled-mode performance on the Graviton3 processor. Starting with PyTorch 2.3.1, this optimization is available in the Torch Python wheel and the AWS Graviton DLC. See the Running inference section that follows for information about installation, runtime configuration, and how to run the tests.

To demonstrate the performance improvements, we used TorchBench's NLP, CV, and recommendation models as well as Hugging Face's most downloaded NLP models across question answering, text classification, token classification, translation, zero-shot classification, summarization, feature extraction, text generation, Text2Text generation, Fill-Mask, and sentence similarity tasks, covering a range of customer use cases.

First, we measured the latency in milliseconds (msec) for TorchBench model inference in Eager mode, which is marked as 1.0 by the dotted red line in the following graph. Then, we compared the improvement from torch.compile for the same model inference. Normalized results are plotted in the graph. Across the 45 models we benchmarked, we see a 1.35x improvement in latency (geometric mean of the 45 models).

Image 1: Improving PyTorch model inference performance with torch.compile on AWS Graviton3 based c7g instances using TorchBench framework. Reference eager mode performance is marked as 1.0 (higher is better)

Similar to the preceding TorchBench inference performance graph, we started by measuring the Hugging Face NLP model inference latency (in milliseconds) in Eager mode, which is marked as 1.0 in the following graph by the dotted red line. We then compared the improvement from torch.compile for the same model inference. Normalized results are plotted in the graph. Across the 33 models benchmarked, we see roughly a 2x performance improvement (geometric mean of the 33 models).

Image 2: Using the Hugging Face example script, running torch.compile on an AWS Graviton3-based c7g instance results in better performance for Hugging Face NLP model inference. The performance of the reference eager mode is marked as 1.0 (higher is better).

Running inference

Starting with PyTorch 2.3.1, optimizations are available in the torch Python wheel and the AWS Graviton PyTorch DLC. In this section, we show how to run inference in eager and torch.compile modes using the torch Python wheel and benchmark scripts from the Hugging Face and TorchBench repositories.

To successfully run the scripts and reproduce the speedup numbers discussed in this post, you need an instance from the Graviton3 family of hardware (c7g/r7g/m7g/hpc7g). This post used a c7g.4xl (16 vcpu) instance. The instance, AMI details, and required torch library versions are listed in the following snippet:

Instance: c7g.4xl instance
Region: us-west-2
AMI: ami-05cc25bfa725a144a (Ubuntu 22.04/Jammy with 6.5.0-1017-aws kernel)

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1
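
Before running the benchmarks, it's worth confirming that the expected wheel was installed and that the oneDNN (MKLDNN) backend is available. The following is a quick sanity-check sketch (the exact version string will vary with your build):

import platform
import torch

print(torch.__version__)                      # expect 2.3.1 per the install step above
print(platform.machine())                     # expect 'aarch64' on a Graviton instance
print(torch.backends.mkldnn.is_available())   # oneDNN (MKLDNN) support should report True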

The general-purpose runtime tunings implemented for eager mode inference are equally applicable to torch.compile mode, so to further improve torch.compile performance on AWS Graviton3 processors, set the following environment variables:

# Enable the fast math GEMM kernels, to accelerate fp32 inference with bfloat16 gemm
export DNNL_DEFAULT_FPMATH_MODE=BF16

# Enable Linux Transparent Huge Page (THP) allocations,
# to reduce the tensor memory allocation latency
export THP_MEM_ALLOC_ENABLE=1

# Set LRU Cache capacity to cache the primitives and avoid redundant
# memory allocations
export LRU_CACHE_CAPACITY=1024
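
These are ordinary process environment variables, so they can also be set programmatically instead of exported in the shell. The sketch below assumes they are set before torch (and therefore oneDNN) is imported, so the settings are picked up when the backend initializes:

import os

# Apply the same runtime tunings from Python (assumption: set them
# before importing torch so oneDNN picks them up at initialization)
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"
os.environ["LRU_CACHE_CAPACITY"] = "1024"

import torch  # imported after the environment is configured

print(torch.__config__.show())  # shows how this torch build was configured (oneDNN, etc.)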

TorchBench benchmark script

TorchBench is a collection of open-source benchmarks used to evaluate the performance of PyTorch. We benchmarked 45 models using the scripts in the TorchBench repository. The following code shows how to run the scripts in eager mode and in compile mode with the inductor backend:

# Set OMP_NUM_THREADS to number of vcpus, 16 for c7g.4xl instance
export OMP_NUM_THREADS=16

# Install the dependencies
sudo apt-get install -y libgl1-mesa-glx
sudo apt-get install -y libpangocairo-1.0-0
python3 -m pip install psutil numpy transformers pynvml numba onnx onnxruntime scikit-learn timm effdet gym doctr opencv-python h5py==3.10.0 python-doctr

# Clone pytorch benchmark repo
git clone https://github.com/pytorch/benchmark.git
cd benchmark
# PyTorch benchmark repo doesn't have any release tags. So,
# listing the commit we used for collecting the performance numbers
git checkout 9a5e4137299741e1b6fb7aa7f5a6a853e5dd2295

# Setup the models
python3 install.py

# Collect eager mode performance using the following command. The results will be
# stored at .userbenchmark/cpu/metric-<timestamp>.json.
python3 run_benchmark.py cpu --model BERT_pytorch,hf_Bert,hf_Bert_large,hf_GPT2,hf_Albert,hf_Bart,hf_BigBird,hf_DistilBert,hf_GPT2_large,dlrm,hf_T5,mnasnet1_0,mobilenet_v2,mobilenet_v3_large,squeezenet1_1,timm_efficientnet,shufflenet_v2_x1_0,timm_regnet,resnet50,soft_actor_critic,phlippe_densenet,resnet152,resnet18,resnext50_32x4d,densenet121,phlippe_resnet,doctr_det_predictor,timm_vovnet,alexnet,doctr_reco_predictor,vgg16,dcgan,yolov3,pytorch_stargan,hf_Longformer,timm_nfnet,timm_vision_transformer,timm_vision_transformer_large,nvidia_deeprecommender,demucs,tts_angular,hf_Reformer,pytorch_CycleGAN_and_pix2pix,functorch_dp_cifar10,pytorch_unet --test eval --metrics="latencies,cpu_peak_mem"

# Collect torch.compile mode performance with inductor backend
# and weights pre-packing enabled. The results will be stored at
# .userbenchmark/cpu/metric-<timestamp>.json
python3 run_benchmark.py cpu --model BERT_pytorch,hf_Bert,hf_Bert_large,hf_GPT2,hf_Albert,hf_Bart,hf_BigBird,hf_DistilBert,hf_GPT2_large,dlrm,hf_T5,mnasnet1_0,mobilenet_v2,mobilenet_v3_large,squeezenet1_1,timm_efficientnet,shufflenet_v2_x1_0,timm_regnet,resnet50,soft_actor_critic,phlippe_densenet,resnet152,resnet18,resnext50_32x4d,densenet121,phlippe_resnet,doctr_det_predictor,timm_vovnet,alexnet,doctr_reco_predictor,vgg16,dcgan,yolov3,pytorch_stargan,hf_Longformer,timm_nfnet,timm_vision_transformer,timm_vision_transformer_large,nvidia_deeprecommender,demucs,tts_angular,hf_Reformer,pytorch_CycleGAN_and_pix2pix,functorch_dp_cifar10,pytorch_unet --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"

After the inference run is completed successfully, the script saves the results in JSON format. Below is a sample output.

{
  "name": "cpu",
  "environ": {
    "pytorch_git_version": "d44533f9d073df13895333e70b66f81c513c1889"
  },
  "metrics": {
    "BERT_pytorch-eval_latency": 56.3769865,
    "BERT_pytorch-eval_cmem": 0.4169921875
  }
}
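
Because both runs write their metrics to .userbenchmark/cpu/metric-<timestamp>.json in this format, a small helper such as the sketch below (the two file paths are placeholders for your actual timestamped result files) can compute the per-model eager-to-compile speedup:

import json

# Placeholder paths: substitute the actual metric-<timestamp>.json files
# produced by the eager and torch.compile runs
EAGER_JSON = ".userbenchmark/cpu/metric-eager.json"
COMPILE_JSON = ".userbenchmark/cpu/metric-compile.json"

def load_latencies(path):
    with open(path) as f:
        metrics = json.load(f)["metrics"]
    # Keep only the latency entries, keyed by model name
    return {k.replace("-eval_latency", ""): v
            for k, v in metrics.items() if k.endswith("-eval_latency")}

eager = load_latencies(EAGER_JSON)
compiled = load_latencies(COMPILE_JSON)

for model in sorted(eager.keys() & compiled.keys()):
    print(f"{model}: {eager[model] / compiled[model]:.2f}x speedup")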

Hugging Face benchmark script

The Google T5 Small Text Translation model is one of the 33 Hugging Face models that we benchmarked. We use it as an example model to demonstrate how to run inference in eager and compiled modes. The additional configuration and APIs required to run in compiled mode are included in the script. Save the following script as google_t5_small_text_translation.py.

import argparse
from transformers import T5Tokenizer, T5Model
import torch
from torch.profiler import profile, record_function, ProfilerActivity
import torch._inductor.config as config
config.cpp.weight_prepack = True
config.freezing = True

def test_inference(mode, num_iter):
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5Model.from_pretrained("t5-small")

    input_ids = tokenizer(
        "Studies have been shown that owning a dog is good for you", return_tensors="pt"
    ).input_ids  # Batch size 1
    decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

    if (mode == 'compile'):
        model = torch.compile(model)

    with torch.no_grad():
        for _ in range(50):
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

        with profile(activities=[ProfilerActivity.CPU]) as prof:
            with record_function("model_inference"):
                for _ in range(num_iter):
                    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

def main() -> None:
    global m, args
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        "-m",
        "--mode",
        choices=["eager", "compile"],
        default="eager",
        help="Which test to run.",
    )
    parser.add_argument(
        "-n",
        "--number",
        type=int,
        default=100,
        help="How many iterations to run.",
    )
    args = parser.parse_args()
    test_inference(args.mode, args.number)

if __name__ == "__main__":
    main()

To run the script, follow these steps:

# Set OMP_NUM_THREADS to 4 because the script runs
# inference sequentially and doesn't need a large number of vcpus
export OMP_NUM_THREADS=4

# Install the dependencies
python3 -m pip install transformers

# Run the inference script in eager mode.
# The number of iterations is set to 1 here just to show the torch profiler
# output; for the benchmarking, we used 1000 iterations.
python3 google_t5_small_text_translation.py -n 1 -m eager

# Run the inference script in torch compile mode
python3 google_t5_small_text_translation.py -n 1 -m compile

After the inference run completes successfully, the script prints a torch profiler output that includes a breakdown of the torch operator latency. Below is a sample output from the torch profiler:


# Torch profiler output for the eager mode run on c7g.xl (4vcpu)
---------------    ------------  -----------  ------------  -----------  ------------  ------------
Name                 Self CPU %   Self CPU     CPU total %   CPU total   CPU time avg    # of Calls
---------------    ------------  -----------  ------------  -----------  ------------  ------------
aten::mm            40.71%         12.502ms       40.71%      12.502ms     130.229us            96
model_inference     26.44%         8.118ms       100.00%      30.708ms      30.708ms             1
aten::bmm            6.85%         2.102ms         9.47%       2.908ms      80.778us            36
aten::matmul         3.73%         1.146ms        57.26%      17.583ms     133.205us           132
aten::select         1.88%       576.000us         1.90%     583.000us       0.998us           584
aten::transpose      1.51%       464.000us         1.83%     563.000us       3.027us           186
---------------    ------------  -----------  ------------  -----------  ------------  -------------
Self CPU time total: 30.708ms

# Torch profiler output for the compile mode run for the same model on the same instance
------------------------- ----------  -----------  ------------  ------------  ------------  ------------
Name                      Self CPU %    Self CPU    CPU total %    CPU total   CPU time avg   # of Calls
------------------------- ----------  -----------  ------------  ------------  ------------  ------------
mkldnn::_linear_pointwise   37.98%       5.461ms        45.91%       6.602ms      68.771us            96
Torch-Compiled Region       29.56%       4.251ms        98.53%      14.168ms      14.168ms             1
aten::bmm                   14.90%       2.143ms        21.73%       3.124ms      86.778us            36
aten::select                 4.51%     648.000us         4.62%     665.000us       1.155us           576
aten::view                   3.29%     473.000us         3.29%     473.000us       1.642us           288
aten::empty                  2.53%     364.000us         2.53%     364.000us       3.165us           115
-------------------------  ---------  -----------  ------------  ------------  ------------ -------------
Self CPU time total: 14.379ms
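
In this single-iteration sample, the self CPU time drops from 30.708 ms in eager mode to 14.379 ms in compile mode, roughly a 2.1x reduction, and the aten::mm calls from eager mode are replaced by mkldnn::_linear_pointwise kernels, which is where the ACL-backed oneDNN primitives are picked up.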

What's next?

Next, we will extend torch inductor CPU backend support to compile Llama models, and add support for fused GEMM kernels to enable torch inductor operator fusion optimization on AWS Graviton3 processors.

Conclusion

In this post, we covered how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve the inference performance of your PyTorch models, and the resulting speedups. Give it a try! If you need support with ML software on Graviton, see the AWS Graviton Technical Guide or file an issue on GitHub.


About the Author

Sunita Nadampalli is a Software Development Manager and AI/ML expert at AWS, leading AWS Graviton software performance optimizations for AI/ML and HPC workloads. She is passionate about open source software development and about delivering high-performance, sustainable software solutions for SoCs based on the Arm ISA.


