As generative artificial intelligence (AI) inference becomes increasingly important for enterprises, customers are exploring how to scale their generative AI operations and how to integrate generative AI models into existing workflows. Model optimization has emerged as a key step for organizations to balance cost-efficiency and responsiveness to improve productivity. However, price and performance requirements vary significantly across use cases. For chat applications, minimizing latency is critical to deliver an interactive experience, while real-time applications such as recommendations require maximizing throughput. Navigating these trade-offs is a major challenge in rapidly adopting generative AI, as different optimization techniques must be carefully selected and evaluated.
To overcome these challenges, we are happy to introduce the Inference Optimization Toolkit, a fully managed model optimization feature in Amazon SageMaker. This new feature delivers up to 2x higher throughput while reducing costs by up to 50% for generative AI models such as Llama 3, Mistral, and Mixtral. For example, with a Llama 3-70B model, you can now achieve up to 2400 tokens/sec on an ml.p5.48xlarge instance, compared to 1200 tokens/sec previously without optimization.
The inference optimization toolkit uses the latest generative AI model optimization techniques, including compilation, quantization, and speculative decoding, to reduce the time it takes to optimize generative AI models from months to hours, achieving the best price-performance ratio for your use case. For compilation, the toolkit uses the Neuron compiler to optimize the model's computational graph for specific hardware, such as AWS Inferentia, to speed up execution time and reduce resource utilization. For quantization, the toolkit uses Activation-aware Weight Quantization (AWQ) to efficiently reduce the size and memory footprint of your model while maintaining quality. For speculative decoding, the toolkit uses a faster draft model to predict candidate outputs in parallel, improving inference speed for long text generation tasks. For more information about each technique, see Optimizing Model Inference with Amazon SageMaker. For more information and benchmark results on popular open source models, see Achieve up to 2X Higher Throughput While Reducing Generative AI Inference Costs by up to 50% in Amazon SageMaker with the New Inference Optimization Toolkit – Part 1.
This post shows you how to get started with the model inference optimization toolkit supported by Amazon SageMaker JumpStart and the Amazon SageMaker Python SDK. SageMaker JumpStart is a fully managed model hub where you can explore, fine-tune, and deploy popular open-source models with just a few clicks. You can use pre-optimized models or create your own custom optimizations. Alternatively, you can achieve this using the SageMaker Python SDK, as shown in the following notebook. For a complete list of supported models, see Optimizing Model Inference with Amazon SageMaker.
Using pre-optimized models with SageMaker JumpStart
The Inference Optimization Toolkit provides pre-optimized models that are optimized for best-in-class price performance at scale without compromising accuracy. You can choose a configuration based on the latency and throughput requirements of your use case and deploy it with one click.
Take the Meta-Llama-3-8b model from SageMaker JumpStart as an example. On the model page, under Deployment configuration, you can expand the model configuration options, select the number of concurrent users, and deploy the optimized model.

Deploying pre-optimized models using the SageMaker Python SDK
You can also use the SageMaker Python SDK to deploy pre-optimized generative AI models with just a few lines of code. The following code uses the ModelBuilder class to build the model for a SageMaker JumpStart model. ModelBuilder is a class in the SageMaker Python SDK that gives you fine-grained control over various aspects of deployment, such as instance type, network isolation, and resource allocation. You can use it to convert framework models (such as XGBoost or PyTorch) or inference specifications into SageMaker-compatible models and create deployable model instances. For more information, see Creating Models Using ModelBuilder in Amazon SageMaker.
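The following is a minimal sketch of this step; the JumpStart model ID, sample payloads, and IAM role ARN are illustrative placeholders that you should replace with your own values:

```python
from sagemaker.serve import ModelBuilder, SchemaBuilder

# Sample request and response that ModelBuilder uses to derive the serving schema
sample_input = {
    "inputs": "Hello, I am a language model,",
    "parameters": {"max_new_tokens": 128},
}
sample_output = [{"generated_text": "Hello, I am a language model, and I can help you with..."}]

# Build a ModelBuilder object for a SageMaker JumpStart model
# (model ID and role ARN are placeholders)
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-8b",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    role_arn="<your-sagemaker-execution-role-arn>",
)
```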
Use the following code to list the available pre-benchmarked configurations:
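As a sketch, assuming your version of the SageMaker Python SDK exposes the display_benchmark_metrics() helper on ModelBuilder:

```python
# Display the pre-benchmarked configurations for this JumpStart model, including
# instance type, config name, and latency/throughput at each concurrency level
model_builder.display_benchmark_metrics()
```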

Choose the right instance_type and config_name from the list based on your requirements for the number of concurrent users, latency, and throughput. The listing shows the latency and throughput at various concurrency levels for each instance type and configuration name. If the configuration name is lmi-optimized, the configuration has been pre-optimized by SageMaker. Then, call .build() to run the optimization job. After the job is complete, you can deploy the model to an endpoint and test its predictions. See the following code:
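The following sketch shows these steps; the configuration name, instance type, and prompt are illustrative values:

```python
# Pin the deployment to a pre-benchmarked configuration
model_builder.set_deployment_config(
    config_name="lmi-optimized",      # pre-optimized configuration (example)
    instance_type="ml.g5.12xlarge",   # example instance type from the listing
)

# Build the deployable model and create a real-time endpoint
optimized_model = model_builder.build()
predictor = optimized_model.deploy(accept_eula=True)  # EULA acceptance for gated models

# Test the endpoint with a sample prompt
print(predictor.predict({
    "inputs": "What is generative AI?",
    "parameters": {"max_new_tokens": 64},
}))
```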
Creating a Custom Optimization Using the Inference Optimization Toolkit
In addition to creating pre-optimized models, you can also create custom optimizations based on the instance type you select. The following table shows a complete list of available combinations. In the next sections, we first discuss compiling on AWS Inferentia, and then we explore other optimization techniques for GPU instances.
| Instance type | Optimization technique | Configuration |
| --- | --- | --- |
| AWS Inferentia | Compilation | Neuron Compiler |
| GPU | Quantization | AWQ |
| GPU | Speculative decoding | SageMaker-provided or bring your own (BYO) draft model |
Compiling from SageMaker JumpStart
To compile, we choose the same Meta-Llama-3-8b model from SageMaker JumpStart and choose Optimize on the model page. On the optimization configuration page, you can select ml.inf2.8xlarge as the instance type. Then, specify an output Amazon Simple Storage Service (Amazon S3) location for the optimized artifacts. For a large model like Llama 2 70B, the compilation job can take more than an hour, so we recommend using the Inference Optimization Toolkit to perform ahead-of-time compilation. That way, you only need to compile once.

Compiling with the SageMaker Python SDK
In the SageMaker Python SDK, you can configure compilation by passing environment variable overrides through the compilation_config attribute of the .optimize() function. For more information, see the tutorial on ahead-of-time compilation of models in LMI NeuronX.
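A minimal sketch follows, reusing the model_builder object from earlier; the instance type, environment variable names and values, and S3 output path are example assumptions, and the valid options are documented for the LMI NeuronX container:

```python
# Run an ahead-of-time compilation job for AWS Inferentia (Neuron) and store
# the compiled artifacts in Amazon S3 so you only need to compile once
compiled_model = model_builder.optimize(
    instance_type="ml.inf2.8xlarge",  # example Neuron instance type
    compilation_config={
        "OverrideEnvironment": {
            # Example LMI NeuronX options; adjust per the LMI NeuronX documentation
            "OPTION_TENSOR_PARALLEL_DEGREE": "2",
            "OPTION_N_POSITIONS": "4096",
            "OPTION_DTYPE": "fp16",
        }
    },
    output_path="s3://<your-bucket>/compiled/",  # placeholder S3 location
)

# Deploy the compiled model after the optimization job completes
predictor = compiled_model.deploy(accept_eula=True)
```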
Quantization and Speculative Decoding from SageMaker JumpStart
When optimizing your model on a GPU, ml.g5.12xlarge is the default deployment instance type for Llama-3-8b. You can choose quantization, speculative decoding, or both as your optimization options. Quantization uses AWQ to reduce your model weights to a low-bit (INT4) representation. Finally, you can provide an output S3 URL to store the optimized artifacts.
Speculative decoding allows you to improve latency and throughput by using a draft model provided by SageMaker, bringing your own draft model from the public Hugging Face model hub, or uploading from your own S3 bucket.

Once the optimization job is complete, you can deploy the model or run further evaluation jobs on the optimized model. In the SageMaker Studio UI, you can choose to use the default example dataset or provide your own dataset using an S3 URI. At the time of writing, the performance evaluation option is only available from the Amazon SageMaker Studio UI.

Quantization and Speculative Decoding with the SageMaker Python SDK
The following SageMaker Python SDK code snippet shows how to configure quantization by setting the quantization_config attribute in the .optimize() function.
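As a sketch, assuming the same model_builder object from earlier and an example option name for AWQ:

```python
# Run a quantization job that reduces the model weights to INT4 using AWQ
quantized_model = model_builder.optimize(
    instance_type="ml.g5.12xlarge",    # default GPU instance type for Llama-3-8b
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq",  # example option name for AWQ quantization
        },
    },
    output_path="s3://<your-bucket>/quantized/",  # placeholder S3 location
)

# Deploy the quantized model once the optimization job completes
predictor = quantized_model.deploy(accept_eula=True)
```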
For speculative decoding, set the speculative_decoding_config attribute to use either the SageMaker-provided draft model or your own draft model. You might need to adjust the GPU utilization based on the sizes of the draft and target models so that both fit on the instance for inference.
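The following sketch shows both options; the configuration keys and the S3 draft-model path are illustrative:

```python
# Option 1: use the draft model provided by SageMaker
optimized_model = model_builder.optimize(
    instance_type="ml.g5.12xlarge",
    speculative_decoding_config={
        "ModelProvider": "sagemaker",  # SageMaker-provided draft model
    },
)

# Option 2: bring your own draft model from Amazon S3 (keys are illustrative)
# optimized_model = model_builder.optimize(
#     instance_type="ml.g5.12xlarge",
#     speculative_decoding_config={
#         "ModelProvider": "custom",
#         "ModelSource": "s3://<your-bucket>/draft-model/",
#     },
# )

# Deploy the optimized model once the job completes
predictor = optimized_model.deploy(accept_eula=True)
```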
Conclusion
Optimizing generative AI models for inference performance is key to delivering cost-effective and responsive generative AI solutions. With the release of the Inference Optimization Toolkit, you can now optimize your generative AI models using modern techniques such as speculative decoding, compilation, and quantization to achieve up to 2X higher throughput and reduce costs by up to 50%. This allows you to achieve the best price-performance balance for your specific use case with just a few clicks in SageMaker JumpStart or a few lines of code using the SageMaker Python SDK. The Inference Optimization Toolkit significantly simplifies the model optimization process, helping enterprises accelerate their adoption of generative AI and capitalize on more opportunities to improve business outcomes.
For more information, see Optimizing Model Inference with Amazon SageMaker and Achieve up to 2X Higher Throughput and Reduce Costs by up to 50% with Generative AI Inference in Amazon SageMaker Using the New Inference Optimization Toolkit – Part 1.
About the Authors
James Wu, Senior AI/ML Specialist Solutions Architect
Saurabh Trikhande, Senior Product Manager
Rishab Ray Chowdhury, Senior Product Manager
Kumara Swami Bora, Front-End Engineer
Alwyn (Chiyun) Chao, Senior Software Development Engineer
Seiran, Senior Software Development Engineer
