Today, we are excited to announce that Meta Llama 3 inference is now available on AWS Trainium and AWS Inferentia-based instances in Amazon SageMaker JumpStart. The Meta Llama 3 models are a collection of pre-trained and fine-tuned generative text models. Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances, powered by AWS Trainium and AWS Inferentia2, provide the most cost-effective way to deploy Llama 3 models on AWS, with up to 50% lower cost to deploy than comparable Amazon EC2 instances. They not only reduce the time and cost of training and deploying large language models (LLMs), but also give developers easier access to high-performance accelerators that meet the scalability and efficiency needs of real-time applications such as chatbots and AI assistants.
In this post, we demonstrate how easy it is to deploy Llama 3 on AWS Trainium and AWS Inferentia-based instances in SageMaker JumpStart.
Meta Llama 3 model on SageMaker Studio
SageMaker JumpStart provides access to publicly available and proprietary foundation models (FMs). Foundation models are onboarded and maintained from third-party and proprietary providers. As such, they are released under different licenses, as designated by the model source. Be sure to review the license of any FM that you use. You are responsible for reviewing and complying with the applicable license terms, and for making sure they are acceptable for your use case, before downloading or using the content.
You can access the Meta Llama 3 FMs through SageMaker JumpStart on the Amazon SageMaker Studio console and through the SageMaker Python SDK. This section describes how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single, web-based visual interface with access to purpose-built tools for all machine learning (ML) development steps, from preparing data to building, training, and deploying your ML models. For more information on how to start and set up SageMaker Studio, see Getting Started with SageMaker Studio.
On the SageMaker Studio console, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane. If you are using SageMaker Studio Classic, see Open and use JumpStart in Studio Classic to navigate to the SageMaker JumpStart models.

From the SageMaker JumpStart landing page, you can search for “Meta” in the search box.

Choose the Meta model card to list all the models from Meta on SageMaker JumpStart.

You can also search for “neuron” to find related model variants. If you don't see your Meta Llama 3 model, try updating your SageMaker Studio version by shutting down and restarting SageMaker Studio.

No-code deployment of Llama 3 Neuron models with SageMaker JumpStart
Choose a model card to view details about the model, including its license, the data used for training, and how to use it. You can also find two buttons, Deploy and Preview notebooks, which help you deploy the model.

When you choose Deploy, you see the page shown in the following screenshot. The top section of the page displays the End User License Agreement (EULA) and the terms of use, which you must accept.
After you accept the policies, provide your endpoint settings and choose Deploy to deploy the model endpoint.

Alternatively, you can deploy through the sample notebook by choosing Open notebook. The sample notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.
Deploying Meta Llama 3 on AWS Trainium and AWS Inferentia using SageMaker JumpStart SDK
In SageMaker JumpStart, the Meta Llama 3 models are pre-compiled for a variety of configurations to avoid runtime compilation during deployment and fine-tuning. The Neuron Compiler FAQ provides details about the compilation process.
There are two ways to deploy Meta Llama 3 on AWS Inferentia and Trainium-based instances using the SageMaker JumpStart SDK. You can deploy the model with two lines of code for simplicity, or take more control of the deployment configuration. The following code snippet shows the simpler mode of deployment.
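A minimal sketch of the simple deployment path, assuming the SageMaker Python SDK is installed and that AWS credentials and an execution role are already configured. The wrapper function `deploy_llama3_neuron` is a hypothetical helper introduced here for illustration; the two essential lines are the `JumpStartModel(...)` constructor and the `deploy(...)` call.

```python
def deploy_llama3_neuron(model_id="meta-textgenerationneuron-llama-3-8b",
                         accept_eula=False):
    """Deploy a JumpStart Llama 3 Neuron model with its default settings.

    Hypothetical helper: it refuses to deploy unless the caller has
    explicitly accepted the model's EULA.
    """
    if not accept_eula:
        raise ValueError("Set accept_eula=True after reading the model's EULA.")

    # Imported lazily so this module loads even without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    # The two essential lines: create the model, then deploy it.
    model = JumpStartModel(model_id=model_id)
    return model.deploy(accept_eula=accept_eula)
```

Calling `predictor = deploy_llama3_neuron(accept_eula=True)` would create a real, billable endpoint on the model's default instance type, so run it only in an account where that is intended.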
To perform inference on these models, you must set the argument accept_eula to True as part of the model.deploy() call. This means that you have read and accepted the EULA of the model. The EULA can be found in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/.
The default instance type for Meta Llama-3-8B is ml.inf2.24xlarge. Other model IDs supported for deployment are:
meta-textgenerationneuron-llama-3-70b
meta-textgenerationneuron-llama-3-8b-instruct
meta-textgenerationneuron-llama-3-70b-instruct
SageMaker JumpStart has preselected configurations to help you get started, listed in the following table. For more information on how to further optimize these configurations, see Advanced Deployment Configurations.
| Llama-3 8B and Llama-3 8B Instruct | ||||
| Instance type | OPTION_N_POSITIONS | OPTION_MAX_ROLLING_BATCH_SIZE | OPTION_TENSOR_PARALLEL_DEGREE | OPTION_DTYPE |
| ml.inf2.8xlarge | 8192 | 1 | 2 | BF16 |
| ml.inf2.24xlarge (default) | 8192 | 1 | 12 | BF16 |
| ml.inf2.24xlarge | 8192 | 12 | 12 | BF16 |
| ml.inf2.48xlarge | 8192 | 1 | 24 | BF16 |
| ml.inf2.48xlarge | 8192 | 12 | 24 | BF16 |
| Llama-3 70B and Llama-3 70B Instruct | ||||
| ml.trn1.32xlarge | 8192 | 1 | 32 | BF16 |
| ml.trn1.32xlarge (default) | 8192 | 4 | 32 | BF16 |
The following code shows how to customize deployment configurations such as sequence length, tensor parallelism, and maximum rolling batch size.
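A hedged sketch of a customized deployment, assuming the same SDK setup as for the simple path. The environment variable names come from the configuration table; the helper names `build_neuron_env` and `deploy_custom`, the lowercase `"bf16"` spelling, and the choice of the 70B defaults are assumptions made for illustration.

```python
def build_neuron_env(n_positions=8192, tensor_parallel_degree=32,
                     max_rolling_batch_size=4, dtype="bf16"):
    """Build the container environment for a Neuron deployment.

    Environment variables must be strings, so numeric values are converted.
    Defaults mirror the ml.trn1.32xlarge (default) row for Llama-3 70B.
    """
    return {
        "OPTION_DTYPE": dtype,  # assumed lowercase spelling of BF16
        "OPTION_N_POSITIONS": str(n_positions),  # sequence length
        "OPTION_TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),
        "OPTION_MAX_ROLLING_BATCH_SIZE": str(max_rolling_batch_size),
    }


def deploy_custom(model_id="meta-textgenerationneuron-llama-3-70b",
                  instance_type="ml.trn1.32xlarge",
                  accept_eula=False, **env_options):
    """Deploy with an explicit instance type and environment (hypothetical helper)."""
    if not accept_eula:
        raise ValueError("Set accept_eula=True after reading the model's EULA.")

    # Imported lazily so this module loads even without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(
        model_id=model_id,
        env=build_neuron_env(**env_options),
        instance_type=instance_type,
    )
    return model.deploy(accept_eula=accept_eula)
```

For example, `deploy_custom(accept_eula=True, max_rolling_batch_size=1)` would request the other ml.trn1.32xlarge configuration from the table. Combinations outside the pre-compiled configurations may trigger compilation at deployment time or fail, so staying close to the table is the safer starting point.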
Now that you have deployed the Meta Llama 3 neuron model, you can call the endpoint to perform inference from the model.
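As an illustration, invoking the endpoint might look like the following sketch. It assumes `predictor` is the object returned by `model.deploy()`, and that the endpoint accepts a JSON payload with an `inputs` prompt and a `parameters` dictionary; the helper names and the specific parameter values are assumptions chosen for illustration.

```python
def build_payload(prompt, max_new_tokens=64, top_p=0.9, temperature=0.6):
    """Assemble a text-generation request (assumed payload schema)."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,  # cap on generated tokens
            "top_p": top_p,                    # nucleus sampling threshold
            "temperature": temperature,        # sampling randomness
        },
    }


def generate(predictor, prompt, **parameters):
    """Send the prompt to the endpoint and return the raw response."""
    return predictor.predict(build_payload(prompt, **parameters))
```

A call such as `generate(predictor, "I believe the meaning of life is")` returns whatever the endpoint responds with; the exact response structure depends on the serving container, so inspect it before parsing.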
For more information about parameters in the payload, see Advanced Parameters.
For more information about passing parameters to control text generation, see Fine-tune and Deploy Llama 2 Models Cost-Effectively with Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium.
Clean up
When you have finished and no longer need the deployed resources, you can delete them using the following code:
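A minimal cleanup sketch, assuming `predictor` is the object returned by `model.deploy()`. The wrapper name `cleanup` is hypothetical; `delete_model()` and `delete_endpoint()` are the SageMaker Python SDK predictor methods that remove the model and the endpoint.

```python
def cleanup(predictor):
    """Delete the SageMaker model and endpoint behind a predictor."""
    # Remove the model resource created for the deployment.
    predictor.delete_model()
    # Remove the endpoint so it stops incurring charges.
    predictor.delete_endpoint()
```

Running `cleanup(predictor)` once per deployed endpoint is enough. Afterward, the endpoint should no longer appear in the SageMaker console.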
Conclusion
Deploying Meta Llama 3 models on AWS Inferentia and AWS Trainium using SageMaker JumpStart offers the lowest-cost way to deploy large-scale generative AI models like Llama 3 on AWS. These models, including the variants Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B, and Meta-Llama-3-70B-Instruct, use AWS Neuron for inference on AWS Trainium and Inferentia. AWS Trainium and Inferentia offer up to 50% lower cost to deploy than comparable EC2 instances.
In this post, we demonstrated how to use SageMaker JumpStart to deploy a Meta Llama 3 model to AWS Trainium and AWS Inferentia. You can deploy these models through the SageMaker JumpStart console and Python SDK, providing flexibility and ease of use. We look forward to seeing how you use these models to build interesting generative AI applications.
To get started using SageMaker JumpStart, see How to Get Started with Amazon SageMaker JumpStart. For more examples of deploying models to AWS Trainium and AWS Inferentia, see our GitHub repository. For more information about how to deploy Meta Llama 3 models on GPU-based instances, see Meta Llama 3 models now available in Amazon SageMaker JumpStart.
About the authors
Shinfan is a Senior Applied Scientist.
Rachna Chadha is a Principal Solutions Architect, AI/ML.
Chin Lan is a Senior SDE, ML System.
Pinak Panigrahi is a Senior Solutions Architect at Annapurna ML.
Christopher Witten is a Software Development Engineer.
Kamran Khan leads BD/GTM for Annapurna ML.
Ashish Ketan is a Senior Applied Scientist.
Pradeep Cruz is a Senior SDM.
