Today, we are excited to announce that Meta Llama 3 foundational models are now available for deploying and running inference through Amazon SageMaker JumpStart. Llama 3 models are a collection of pre-trained and fine-tuned generative text models.
This post explains how to discover and deploy Llama 3 models via SageMaker JumpStart.
What is Metalrama 3?
Llama 3 comes in two parameter sizes (8B and 70B with a context length of 8K) to support a wide range of use cases with improved inference, code generation, and instruction follow-up. Llama 3 uses a decoder-only transformer architecture and a new tokenizer that improves model performance at 128k size. Additionally, Meta has improved the post-training procedure, significantly reducing the false rejection rate, improving alignment, and increasing the diversity of model responses. You can now combine the performance of Llama 3 with the benefits of MLOps control using Amazon SageMaker features such as SageMaker Pipelines, SageMaker Debugger, and container logs. Additionally, the models are deployed in his secure AWS environment under the control of a VPC, which helps provide data security.
What is SageMaker JumpStart?
SageMaker JumpStart allows you to choose from a wide selection of publicly available foundation models. An ML practitioner can deploy the underlying model from a network-isolated environment to his dedicated SageMaker instance and customize the model for model training and deployment using SageMaker. You can now discover and deploy Llama 3 models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK. This will allow you to derive model performance and MLOps control using his SageMaker features such as SageMaker Pipelines, SageMaker Debugger, and container logs. Models are deployed in a secure environment in AWS and under the control of a VPC, which helps provide data security. Llama 3 models are currently available for deployment and inference in Amazon SageMaker Studio. us-east-1 (Northern Virginia), us-east-2 (Ohio), us-west-2 (Oregon), eu-west-1 (Ireland) and ap-northeast-1 (Tokyo) AWS Region.
discover the model
The base model is accessible through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. This section describes how to discover models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface with access to dedicated tools for all ML development steps, from data preparation to building, training, and deploying ML models. can be executed. For more information about how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.
SageMaker Studio provides access to SageMaker JumpStart, which includes pre-trained models, notebooks, and pre-built solutions. Pre-built automated solutions.

From the SageMaker JumpStart landing page, you can easily discover different models by browsing through different hubs named after model providers. Llama 3 models can be found on Meta Hub. If you don't see your Llama 3 model, try shutting down and restarting to update your version of SageMaker Studio. For more information, see Shut down and update Studio Classic apps.

You can find the Llama 3 model by searching for “Meta-llama-3” in the search box on the top left.

[メタ ハブ]You can find all meta models available in SageMaker JumpStart by clicking .

Clicking on a model card opens the corresponding model details page, from which you can easily deploy the model.

Deploy the model
when choosing expand Once you accept the EULA terms, deployment will begin.

You can monitor the progress of the deployment on the page that appears after you click the Deploy button.

Alternatively, you can choose open notebook Deploy through a sample notebook. The sample notebook provides end-to-end guidance on how to deploy models for inference and clean up resources.
To deploy using a notebook, first, model_id. You can deploy any of the selected models to SageMaker using the following code.
By default accept_eula is set to False. You must manually accept the EULA to successfully deploy the endpoint. This constitutes your acceptance of the User License Agreement and Terms of Use. The license agreement is also available on the Llama website. This will deploy the model to SageMaker with default configurations including the default instance type and default His VPC configuration. You can change these configurations by specifying non-default values. JumpStartModel. For more information, please see the following documentation:
The following table lists all Llama 3 models available in SageMaker JumpStart and model_idsthe default instance type and maximum number of total tokens (the sum of the number of input tokens and the number of generated tokens) supported for each of these models.
| Model name | model id | Maximum total number of tokens | Default instance type |
| Metalrama-3-8B | Metatext Generation-Rama-3-8B | 8192 | ml.g5.12xlarge |
| Metalrama-3-8B-Instructions | Metatext Generation-Rama-3-8B-Instruction | 8192 | ml.g5.12xlarge |
| Metalrama-3-70B | Metatext Generation-Rama-3-70b | 8192 | ml.p4d.24xlarge |
| Meta-Rama-3-70B-Instructions | metatext generation-rama-3-70b-instruction | 8192 | ml.p4d.24xlarge |
perform inference
After you deploy your model, you can run inference against the deployed endpoints through SageMaker predictors. A fine-tuned instruction model (Llama 3: 8B Instructions and 70B Instructions) accepts the history of chats between the user and the chat assistant and generates subsequent chats. Pre-trained models (Llama 3: 8B and 70B) require a string prompt and perform text completion on the provided prompt.
Inference parameters control the text generation process at the endpoint. The maximum number of new tokens controls the size of the output produced by the model. This is not the same as the number of words, because the model's vocabulary is not the same as the English vocabulary, and each token may not be an English word. The temperature parameter controls the randomness of the output. The higher the temperature, the more creative and hallucinogenic output you will get. All inference parameters are optional.
Example prompt for 70B model
The Llama 3 model can be used for text completion of any text. Through text generation, you can perform various tasks such as question answering, language translation, and sentiment analysis. The input payload to the endpoint looks like the following code.
Below is a sample example prompt and the text generated by the model.All output is generated using inference parameters {"max_new_tokens":64, "top_p":0.9, "temperature":0.6}.
The following example shows how to use an Llama 3 model with small-shot in-context learning, which provides training samples available to the model. This process performs inference only on the deployed model and does not change the model weights.
Example prompts for the 70B-Instruct model
In the Llama 3 instruction model, which is optimized for interaction use cases, the input to the instruction model endpoint is the previous history between the chat assistant and the user. You can ask questions related to the conversation so far. You can also provide system configuration, such as personas, that define the behavior of your chat assistant. The input payload format is the same as the basic pretrained model, but the input text must be formatted in the following way:
This instruction template optionally system Add rolls and include as many alternating rolls as you want in your turn-based history. The final role should always be: assistant Ends with two new line breaks.
Now consider some examples of prompts and responses from the model. In the following example, a user asks the assistant a simple question.
In the following example, a user is having a conversation with an assistant about tourist attractions in Paris. The user then asks about the first option recommended by her chat assistant.
The following example sets the configuration of the system.
cleaning
Once your notebook has finished running, be sure to delete any resources you created during the process so that billing will stop. Use the following code:
conclusion
In this post, you learned how to get started with Llama 3 models in SageMaker Studio. You now have access to four of his Llama 3 basic models containing billions of parameters. The base model is pre-trained, reducing training and infrastructure costs and also allowing customization for your use case. Check out SageMaker JumpStart for SageMaker Studio to get started today.
About the author
Kyle Ulrich I'm an Applied Scientist II at AWS.
Shinfan I'm a senior applied scientist at AWS.
Chin Lan I'm a senior software development engineer at AWS.
Haotian An I am a software development engineer II at AWS.
Christopher Witten I am a software development engineer II at AWS.
tyler osterberg I am a software development engineer at AWS.
Manan Shah I'm a software development manager at AWS.
Jonathan Guinegani I'm a senior software development engineer at AWS.
adrianna simmons I'm a senior product marketing manager at AWS.
Joon Won I'm a senior product manager at AWS.
Ashish Ketan I'm a senior applied scientist at AWS.
Rachna Chadha I am a Principal Solutions Architect for AI/ML at AWS.
Deepak Rupakula I am a Principal GTM Specialist at AWS.
