Today, Amazon SageMaker AI introduced OpenAI-compatible API support for real-time inference endpoints. If you use OpenAI SDK, LangChain, or Strands Agent, you can now call your model on SageMaker AI by changing just the endpoint URL. No custom clients, SigV4 wrappers, or code rewrites required.
overview
With this release, SageMaker AI endpoints are now /openai/v1 A path that accepts chat completion requests and returns unchanged responses from the container, including streaming. OpenAI endpoints are enabled for all endpoints and inference components using the standard SageMaker AI API and SDK.
SageMaker AI routes based on the endpoint name in the URL, so you can use any OpenAI-compatible client out of the box. You can now create time-limited bearer tokens for your endpoints and use them with OpenAI clients.
For a working example with deployment and invocation, see the accompanying notebook on GitHub.
“We run an AI coding agent that uses multiple LLM providers through an LLM gateway (Bifrost) that speaks the OpenAI Chat Completion Protocol. The bearer token feature allows us to add SageMaker as a drop-in OpenAI-compatible inference endpoint (without custom SigV4 signing), so it works natively with our gateway, the Vercel AI SDK, and standard OpenAI clients.” Giorgio Piatti (AI/ML) Engineer – Caffeine.AI) says
use case
Agent workflows on owned infrastructure
When you build multi-step AI agents using frameworks like Strands Agent or LangChain, you can run their entire workflow on your own SageMaker AI endpoint. The agent calls the model using the same OpenAI-compatible interface it was built with, but the inference runs on a dedicated GPU instance in your account.
Hosting multiple models through a single interface
If you want to run multiple models (for example, Llama for general tasks, a fine-tuned Mistral for domain-specific work, and a smaller model for classification), you can host them all on a single SageMaker AI endpoint using the inference component. Each model has its own resource allocation, and all models can be called through the same OpenAI SDK. You don’t need to write separate API clients or routing logic in your application code.
Deliver fine-tuned models without changing code
If you want to fine-tune open source models for specific use cases, you can deploy them to SageMaker AI and call them through the same OpenAI-compatible interfaces that your applications already use. The only change is the endpoint URL. The rest of the application (SDK calls, streaming logic, prompt format) remains the same.
Solution overview
In this post we will cover:
- How bearer token authentication works with SageMaker AI endpoints.
- Deploying and invoking endpoints for a single model.
- Deploying and invoking inference components for multi-model deployment.
- Integration with Strands Agent framework.
Prerequisites
To proceed with this tutorial you will need:
- An AWS account with permissions to create SageMaker AI endpoints.
- SageMaker Python SDK (
pip install sagemaker). - OpenAI Python SDK (
pip install openai). - Models stored in Amazon Simple Storage Service (Amazon S3). For example, Qwen3-4B, which I downloaded from Hugging Face.
- An AWS Identity and Access Management (IAM) execution role to create the endpoint.
AmazonSageMakerFullAccesspolicy. - IAM execution role
sagemaker:CallWithBearerTokenandsagemaker:InvokeEndpointPermission to call the endpoint.
Authentication with bearer token
SageMaker AI OpenAI compatible endpoints use bearer token authentication. The SageMaker Python SDK includes a token generator that creates time-limited tokens (valid for up to 12 hours) from your existing AWS credentials. No additional secrets or API keys are required.
The token contains role or user credentials and requires the following: sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint Action authority.
Generate a token
Generate a token using the following Python script.
The token generator uses AWS credentials available in your environment: IAM user credentials, an instance profile on Amazon Elastic Compute Cloud (Amazon EC2), or an AWS IAM Identity Center (SSO) session.
of generate_token The function generates a time-limited bearer token for authenticating with the SageMaker API. By default, tokens are valid for 12 hours, but you can override this. expiry parameters using timedelta Values are between 1 second and 12 hours. This function accepts an optional region. aws_credentials_providerand expiration date. If no AWS Region is specified, reverts to the AWS Region. AWS_REGION environmental variables. If no credential provider is specified, the default AWS credential chain, which searches multiple sources including environment variables, is used to resolve the credentials. ~/.aws/credentials, ~/.aws/configcontainer credentials, instance profiles. See the Boto3 Credentials documentation for the complete resolution order.
Auto-refresh tokens for long-running applications
For applications that run continuously, you can implement an automatic update pattern using: httpx Ensures that a new token is generated for each request.
IAM permissions
The IAM role or user that calls the endpoint must have the following permissions:
As a best practice, always limit. Resource to a specific endpoint ARN InvokeEndpoint Rather than using wildcards. Bearer tokens generated from this role have the same level of access, so the narrow scope policy limits the scope of the explosion if the token is accidentally exposed. note that CallWithBearerToken Wildcard ("*") for Resource field. Resource level limits are not supported.
How tokens work
The bearer token is a base64 encoded SigV4 signed URL. when making a call generate_tokenthe SageMaker AI SDK constructs requests to SageMaker AI services. CallWithBearerToken Execute the action, sign it locally with your AWS credentials, and encode the resulting signed URL as a portable token string. No network calls are made during token generation. Signing is done entirely on the client side. When you present this token to the SageMaker AI endpoint, the service decodes it, validates the SigV4 signature, verifies that the token has not expired, and verifies that the original IAM identity has the necessary permissions. The token lifetime is the lesser of the expiration value and the remaining lifetime of the AWS credentials used to sign the token.
Security best practices: The bearer token contains the same authorization as the underlying AWS credentials used to generate it. Treat tokens with the same care as credentials. Limit the scope of the IAM role used for token generation to the minimum necessary privileges. sagemaker:InvokeEndpoint and sagemaker:CallWithBearerToken Only target endpoint ARNs that the caller needs to access. Do not generate tokens from roles with extended privileges, such as those granted by . AdministratorAccess or SageMakerFullAccess Managed policy.
Do not store tokens on disk, in environment variables, in configuration files, in databases, or in distributed caches. Do not log tokens and only send them over encrypted communication protocols such as HTTPS. Generating a token is a local operation with no network overhead, so we recommend that you generate a new token at the time of use or use the auto-renew feature. httpx.Auth The pattern shown in the previous example. This avoids the risk of token leakage and allows you to use your tokens with maximum expiry time remaining. As a best practice, set the token expiration time to the shortest duration required by your workload.
Deploy a single model endpoint
A single model endpoint hosts one model and handles requests directly. The following example deploys Qwen3-4B using the SageMaker AI vLLM Deep Learning Container. ml.g6.2xlarge Examples.
Note: SageMaker AI endpoints incur charges during service, regardless of traffic. For more information, see the Amazon SageMaker AI pricing page.
The endpoint transitions as follows: InService The status will be displayed within a few minutes. Once you’re ready, it’s compatible with both standard SageMaker AI. /invocations Paths and OpenAI Compatible Paths /openai/v1/chat/completions.
Call endpoint for a single model
Once the endpoint is a service, call it using the OpenAI Python SDK. The base URL follows this format:
of model Fields are passed to the container. SageMaker AI routes requests based on the endpoint name in the URL, so you can leave this field empty or set it to match the model name the container expects.
Deploy the inference component endpoint
Inference components allow a single endpoint to host multiple models, each with dedicated computing resources. For inference components, the model is associated with the component rather than the endpoint configuration.
You can create additional inference components on the same endpoint to host multiple models with independent scaling and resource allocation.
Call the inference component
To call a specific inference component, include its name in the URL path.
The following example shows two inference components on a shared endpoint. Each component is targeted to a separate OpenAI client that shares a connection pool.
shared httpx.Client Enables both OpenAI client instances to reuse the same TLS session and connection pool.
Integration with Strands agent
Strands Agents is an open source SDK for building AI agents. Strands Agents supports OpenAI-compatible model providers, so you can now run multi-agent workflows entirely on your own SageMaker AI infrastructure. This gives you the flexibility of an agent application that can control dedicated endpoints. No data leaves your account, and you can choose exactly which model versions your agents run.
cleaning
To avoid ongoing charges, delete the endpoint and associated resources when you’re done. SageMaker AI endpoints incur costs while in service regardless of whether they are receiving traffic.
conclusion
With OpenAI-compatible API support, Amazon SageMaker AI removes the integration barrier between where most AI applications currently reside and the infrastructure they need to scale. You can keep your existing code, use OpenAI-compatible frameworks, and run inference on dedicated endpoints with the necessary GPU, scaling, and data residency controls. First, deploy your model to a SageMaker AI real-time endpoint using a supported container, install the SageMaker Python SDK, and specify the OpenAI client in the endpoint URL. For more information, see Use SageMaker AI with OpenAI-compatible APIs. Amazon SageMaker AI Developer Guideor open the Amazon SageMaker AI console and create your first endpoint.
About the author
