Configure the model in code/serving.properties.
- To deploy Voxtral-Mini, use the following code:
option.model_id=mistralai/Voxtral-Mini-3B-2507
option.tensor_parallel_degree=1
- To deploy Voxtral-Small, use the following code:
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
- Open and run Voxtral-vLLM-BYOC-SageMaker.ipynb to deploy the endpoint and test the text, audio, and function call functionality.
Docker container configuration
The GitHub repository contains the complete Dockerfile. The following code snippet highlights important parts.
# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest
# Set environment variables for SageMaker
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
ENV HF_HOME=/tmp/hf_home
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
# Install audio processing dependencies
RUN pip install --no-cache-dir \
"mistral_common>=1.8.1" \
librosa>=0.10.2 \
soundfile>=0.12.1 \
pydub>=0.25.1
This Dockerfile installs the required audio processing libraries (mistral_common for tokenization; librosa, soundfile, and pydub for audio processing) and sets the SageMaker environment variables used for model loading and caching. This approach separates infrastructure from business logic: the container stays generic, and SageMaker dynamically injects the model-specific code (model.py and serving.properties) from Amazon S3 at runtime, so you can deploy different Voxtral variants without rebuilding the container.
Model configuration
The complete model configuration is in the serving.properties file, located in the code folder. The following code snippet highlights the key configurations.
# Model configuration
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
option.dtype=bfloat16
# Voxtral-specific settings (as per official documentation)
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
option.trust_remote_code=true
# Audio processing (Voxtral specifications)
option.limit_mm_per_prompt=audio:8
option.mm_processor_kwargs={"audio_sampling_rate": 16000, "audio_max_length": 1800.0}
# Performance optimizations (vLLM v0.10.0+ features)
option.enable_chunked_prefill=true
option.enable_prefix_caching=true
option.use_v2_block_manager=true
This configuration file applies Voxtral-specific settings that follow Mistral's official recommendations for vLLM server deployment: it sets the appropriate tokenization modes, configures audio processing (up to 8 audio files per prompt, each up to 30 minutes long), and enables vLLM v0.10.0+ performance features such as chunked prefill and prefix caching. The modular design supports seamless switching between Voxtral-Mini and Voxtral-Small by changing only model_id and tensor_parallel_degree, while the caching and prefill options help maintain efficient memory utilization and improve inference performance.
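The inference handler in model.py reads these values at startup through a load_serving_properties() helper (shown in the next section). As an illustration only, such a helper could parse the key=value format along the following lines; the actual implementation and file path in the repository may differ.
# Illustrative sketch, not the repository implementation: parse
# serving.properties into a flat dictionary of option keys and values.
from pathlib import Path

def load_serving_properties(path="/opt/ml/model/serving.properties"):  # path is an assumption
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

# Example: config.get("option.model_id") -> "mistralai/Voxtral-Small-24B-2507"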
Custom inference handler
The complete custom inference code is in the model.py file located in your code folder. The following code snippet highlights key features.
# FastAPI app for SageMaker compatibility
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
model_engine = None
# vLLM Server Initialization for Voxtral
def start_vllm_server():
    """Start vLLM server with Voxtral-specific configuration"""
    config = load_serving_properties()
    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000"
    ]
    vllm_server_process = subprocess.Popen(cmd, env=vllm_env)
    server_ready = wait_for_server()
    return server_ready
@app.post("/invocations")
async def invoke_model(request: Request):
    """Handle chat, transcription, and function calling"""
    request_data = await request.json()
    # Transcription requests
    if "transcription" in request_data:
        audio_source = request_data["transcription"]["audio"]
        return transcribe_audio(audio_source)
    # Chat requests with multimodal support
    messages = format_messages_for_openai(request_data["messages"])
    tools = request_data.get("tools")
    # Generate via vLLM OpenAI client
    response = openai_client.chat.completions.create(
        model=model_config["model_id"],
        messages=messages,
        tools=tools if supports_function_calling() else None
    )
    return response
This custom inference handler creates a FastAPI-based server that integrates directly with the vLLM server for optimized Voxtral performance. The handler processes multimodal content, including base64-encoded audio and audio URLs, dynamically loads model settings from the serving.properties file, and supports advanced features such as function calling for Voxtral-Small deployments.
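The complete translation logic lives in model.py; as a rough sketch of the idea (the field names below are assumptions, not the repository's exact code), a helper like format_messages_for_openai() might map the notebook's audio items to the audio_url content parts that vLLM's OpenAI-compatible server accepts.
# Illustrative sketch: convert {"type": "audio", "path": ...} items into
# OpenAI-style audio_url parts; base64 payloads could be wrapped in a data URL.
def format_messages_for_openai(messages):
    formatted = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            formatted.append(message)  # plain text messages pass through unchanged
            continue
        parts = []
        for item in content:
            if item.get("type") == "audio":
                parts.append({"type": "audio_url", "audio_url": {"url": item["path"]}})
            else:
                parts.append(item)  # text parts are already in the expected shape
        formatted.append({"role": message["role"], "content": parts})
    return formatted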
SageMaker deployment code
The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook in the Voxtral-vllm-byoc folder orchestrates the entire deployment process for both Voxtral models.
import boto3
import sagemaker
from sagemaker.model import Model
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-s3-bucket"
# Upload model artifacts to S3
byoc_config_uri = sagemaker_session.upload_data(
    path="./code",
    bucket=bucket,
    key_prefix="voxtral-vllm-byoc/code"
)
# Configure custom container image
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/voxtral-vllm-byoc:latest"
# Create SageMaker model
voxtral_model = Model(
    image_uri=image_uri,
    model_data={
        "S3DataSource": {
            "S3Uri": f"{byoc_config_uri}/",
            "S3DataType": "S3Prefix",
            "CompressionType": "None"
        }
    },
    role=role,
    env={
        'MODEL_CACHE_DIR': '/opt/ml/model',
        'TRANSFORMERS_CACHE': '/tmp/transformers_cache',
        'SAGEMAKER_BIND_TO_PORT': '8080'
    }
)
# Deploy to endpoint
predictor = voxtral_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # For Voxtral-Small
    container_startup_health_check_timeout=1200,
    wait=True
)
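Depending on your SageMaker Python SDK version and defaults, you may need to attach a JSON serializer and deserializer to the returned predictor so that the dictionary payloads in the following sections are sent and parsed as JSON; a minimal example looks like this.
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send dict payloads as JSON and parse JSON responses back into Python dicts
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()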
Model use cases
The Voxtral models support a variety of text and speech-to-text use cases, and the Voxtral-Small model additionally supports tool use with voice input. For the complete code, see the GitHub repository. This section provides code snippets for the use cases the models support.
Text only
The following code shows a basic text-based conversation with the model. Users submit text queries and receive structured responses.
payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"
        }
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
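Because the handler forwards chat requests to vLLM's OpenAI-compatible server, the response typically follows the chat completion schema. Assuming a JSON-deserialized response in that shape, the generated text can be read as follows.
# Assumes an OpenAI-style chat completion returned as a Python dict
assistant_text = response["choices"][0]["message"]["content"]
print(assistant_text)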
Transcription only
The following example focuses on audio-to-text transcription, setting the temperature to 0 for deterministic output. The model accepts an audio file URL or a Base64-encoded audio file and returns the transcribed text without any additional interpretation.
payload = {
    "transcription": {
        "audio": "https://audiocdn.frenchtoday.com/file/ft-public-files/audiobook-samples/AMPFE/AMP%20FE%20Ch%2002%20Story%20Slower.mp3",
        "language": "fr",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)
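To transcribe a local recording instead of a URL, you can Base64-encode the file and place the encoded string in the same audio field. The exact accepted format is defined in model.py, so treat the following as an illustrative sketch with a placeholder file name.
import base64

# Read a local audio file and Base64-encode it (file name is a placeholder;
# check model.py for the exact encoding contract the handler expects)
with open("meeting_recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcription": {
        "audio": audio_b64,
        "language": "en",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)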
Understanding text and audio
The following code combines both text instructions and audio input for multimodal processing. The model can follow specific text commands while analyzing a provided audio file in one inference pass, enabling more complex interactions such as guided transcription and speech analysis tasks.
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you summarise this audio file"
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
Using tools
The following code demonstrates the function call functionality that allows the model to interpret voice commands and run predefined tools. This example shows a weather query with voice input, where the model automatically calls the appropriate function and returns a structured result.
# Define weather tool configuration
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specific location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use."
                }
            },
            "required": ["location", "format"]
        }
    }
}
# Mock weather function
def mock_weather(location, format="celsius"):
    """Always returns sunny weather at 25°C/77°F"""
    temp = 77 if format.lower() == "fahrenheit" else 25
    unit = "°F" if format.lower() == "fahrenheit" else "°C"
    return f"It's sunny in {location} with {temp}{unit}"
# Test payload with audio
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/patrickvonplaten/audio_samples/resolve/main/fn_calling.wav"
                }
            ]
        }
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
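The endpoint returns the model's tool call rather than a final answer, so the caller executes the requested function locally (here, mock_weather). Assuming an OpenAI-style tool_calls structure in the JSON-deserialized response, the follow-up step might look like this.
import json

# Illustrative sketch: extract the tool call and run the matching local function
tool_call = response["choices"][0]["message"]["tool_calls"][0]
if tool_call["function"]["name"] == "get_current_weather":
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    print(mock_weather(args["location"], args.get("format", "celsius")))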
Strands agent integration
The following example shows how to integrate Voxtral with the Strands Agents framework to create intelligent agents that can use multiple tools. The agent automatically selects and runs the appropriate tool (such as the calculator, file read, or shell prebuilt Strands tools) based on the user's query, enabling complex multi-step workflows through natural language interactions.
# SageMaker integration with Strands agents
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel
from strands_tools import calculator, current_time, file_read, shell
model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": endpoint_name,
        "region_name": "us-west-2",
    },
    payload_config={
        "max_tokens": 1000,
        "temperature": 0.7,
        "stream": False,
    }
)
agent = Agent(model=model, tools=[calculator, current_time, file_read, shell])
response = agent("What is the square root of 12?")
Clean up
After you finish experimenting with this example, delete the SageMaker endpoint you created in the notebook to avoid unnecessary costs.
# Delete SageMaker endpoint
print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("Endpoint deleted successfully")
Conclusion
In this post, we demonstrated how to self-host Mistral's open source Voxtral model on SageMaker using a BYOC approach. We created a production-ready system using the latest vLLM framework and official Voxtral optimizations for both Mini and Small model versions. The solution supports the full range of Voxtral's capabilities, including text-only conversations, voice transcription, advanced multimodal understanding, and function invocation directly from voice input. This flexible architecture allows you to switch between Voxtral-Mini and Voxtral-Small models through a simple configuration update without rebuilding the container.
Take your multimodal AI applications to the next level by trying out the complete code from the GitHub repository to host your Voxtral model in SageMaker and start building your own voice-enabled applications. Visit Mistral's official website for detailed features, performance benchmarks, and technical specifications to unlock Voxtral's full potential. Finally, explore the Strands Agents framework to seamlessly create agent applications that can run complex workflows.
About the author
Dr. Hou Yin is a Senior Specialist Solutions Architect for GenAI at AWS, working with model providers to onboard the latest and most intelligent AI models onto the AWS platform. With deep expertise in generative AI, ASR, computer vision, NLP, and time series predictive models, she works closely with customers to design and build cutting-edge ML and GenAI applications.