Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints

Today, Amazon SageMaker AI introduced OpenAI-compatible API support for real-time inference endpoints. If you use the OpenAI SDK, LangChain, or Strands Agents, you can now call your model on SageMaker AI by changing only the endpoint URL. No custom clients, SigV4 wrappers, or code rewrites are required.

Overview

With this release, SageMaker AI endpoints expose an /openai/v1 path that accepts chat completion requests and returns the container's responses unchanged, including streaming responses. The OpenAI-compatible path is enabled for all endpoints and inference components created with the standard SageMaker AI APIs and SDKs.

SageMaker AI routes based on the endpoint name in the URL, so you can use any OpenAI-compatible client out of the box. You can now create time-limited bearer tokens for your endpoints and use them with OpenAI clients.

For a working example with deployment and invocation, see the accompanying notebook on GitHub.

“We run an AI coding agent that uses multiple LLM providers through an LLM gateway (Bifrost) that speaks the OpenAI Chat Completion Protocol. The bearer token feature allows us to add SageMaker as a drop-in OpenAI-compatible inference endpoint (without custom SigV4 signing), so it works natively with our gateway, the Vercel AI SDK, and standard OpenAI clients,” says Giorgio Piatti, AI/ML Engineer at Caffeine.AI.

Use cases

Agent workflows on owned infrastructure

When you build multi-step AI agents using frameworks like Strands Agents or LangChain, you can run the entire workflow on your own SageMaker AI endpoint. The agent calls the model through the same OpenAI-compatible interface it was built with, but the inference runs on a dedicated GPU instance in your account.

Hosting multiple models through a single interface

If you want to run multiple models (for example, Llama for general tasks, a fine-tuned Mistral for domain-specific work, and a smaller model for classification), you can host them all on a single SageMaker AI endpoint using inference components. Each model has its own resource allocation, and all models can be called through the same OpenAI SDK. You don’t need to write separate API clients or routing logic in your application code.

Deliver fine-tuned models without changing code

If you want to fine-tune open source models for specific use cases, you can deploy them to SageMaker AI and call them through the same OpenAI-compatible interfaces that your applications already use. The only change is the endpoint URL. The rest of the application (SDK calls, streaming logic, prompt format) remains the same.

Solution overview

In this post we will cover:

How bearer token authentication works with SageMaker AI endpoints.
Deploying and invoking endpoints for a single model.
Deploying and invoking inference components for multi-model deployment.
Integration with the Strands Agents framework.

Prerequisites

To proceed with this tutorial you will need:

An AWS account with permissions to create SageMaker AI endpoints.
SageMaker Python SDK (pip install sagemaker).
OpenAI Python SDK (pip install openai).
Models stored in Amazon Simple Storage Service (Amazon S3), for example Qwen3-4B downloaded from Hugging Face.
An AWS Identity and Access Management (IAM) execution role with the AmazonSageMakerFullAccess policy to create the endpoint.
An IAM role or user with the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint permissions to call the endpoint.

Authentication with bearer token

SageMaker AI OpenAI-compatible endpoints use bearer token authentication. The SageMaker Python SDK includes a token generator that creates time-limited tokens (valid for up to 12 hours) from your existing AWS credentials. No additional secrets or API keys are required.

The token carries the identity of the IAM role or user that generated it. That identity requires the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint permissions.

Generate a token

Generate a token using the following Python script.

from sagemaker.core.token_generator import generate_token
from datetime import timedelta

token = generate_token(region="us-west-2", expiry=timedelta(minutes=5))

The token generator uses AWS credentials available in your environment: IAM user credentials, an instance profile on Amazon Elastic Compute Cloud (Amazon EC2), or an AWS IAM Identity Center (SSO) session.

The generate_token function generates a time-limited bearer token for authenticating with the SageMaker API. By default, tokens are valid for 12 hours, but you can override this with the expiry parameter, using a timedelta value between 1 second and 12 hours. The function accepts an optional region, aws_credentials_provider, and expiry. If no AWS Region is specified, it falls back to the AWS_REGION environment variable. If no credential provider is specified, the credentials are resolved through the default AWS credential chain, which searches multiple sources including environment variables, ~/.aws/credentials, ~/.aws/config, container credentials, and instance profiles. See the Boto3 Credentials documentation for the complete resolution order.
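
The defaults and bounds described above can be checked before a token is ever requested. The following is a minimal sketch of such a pre-flight check, using only the standard library. The helper name resolve_token_settings and its constants are hypothetical and are not part of the SageMaker Python SDK; the sketch only mirrors the behavior described in this post (AWS_REGION fallback, 12-hour default, 1 second to 12 hours range).

```python
import os
from datetime import timedelta
from typing import Optional

MAX_EXPIRY = timedelta(hours=12)   # upper bound and default described above
MIN_EXPIRY = timedelta(seconds=1)  # lower bound described above


def resolve_token_settings(
    region: Optional[str] = None,
    expiry: Optional[timedelta] = None,
) -> tuple[str, timedelta]:
    """Hypothetical pre-flight check; not the SDK's own implementation."""
    # Fall back to the AWS_REGION environment variable when no Region is passed.
    resolved_region = region or os.environ.get("AWS_REGION")
    if not resolved_region:
        raise ValueError("Pass region explicitly or set AWS_REGION")

    # Default to the 12-hour maximum, then enforce the stated bounds.
    resolved_expiry = MAX_EXPIRY if expiry is None else expiry
    if not MIN_EXPIRY <= resolved_expiry <= MAX_EXPIRY:
        raise ValueError("expiry must be between 1 second and 12 hours")
    return resolved_region, resolved_expiry
```

Validating the values up front makes a misconfigured Region or an out-of-range expiry fail in your own code with a clear message, rather than at the first request.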

Auto-refresh tokens for long-running applications

For applications that run continuously, you can implement an automatic refresh pattern with httpx that generates a new token for each request.

import httpx
from sagemaker.core.token_generator import generate_token

class SageMakerAuth(httpx.Auth):
    def __init__(self, region: str):
        self.region = region

    def auth_flow(self, request):
        request.headers["Authorization"] = f"Bearer {generate_token(region=self.region)}"
        yield request

http_client = httpx.Client(auth=SageMakerAuth(region="us-west-2"))

IAM permissions

The IAM role or user that calls the endpoint must have the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:endpoint/<endpoint-name>"
        },
        {
            "Effect": "Allow",
            "Action": "sagemaker:CallWithBearerToken",
            "Resource": "*"
        }
    ]
}

As a best practice, always scope the Resource for InvokeEndpoint to specific endpoint ARNs rather than using wildcards. Bearer tokens generated from this role carry the same level of access, so a narrowly scoped policy limits the blast radius if a token is accidentally exposed. Note that CallWithBearerToken requires a wildcard ("*") in the Resource field; resource-level restrictions are not supported for this action.
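
To keep the InvokeEndpoint statement narrow as the number of endpoints grows, the policy document can be generated rather than edited by hand. The following is a minimal sketch: build_invoke_policy is a hypothetical helper name, and the account ID and endpoint name in the usage section are placeholders. It assumes the standard endpoint ARN layout (arn:aws:sagemaker:region:account-id:endpoint/name).

```python
import json


def build_invoke_policy(region: str, account_id: str, endpoint_names: list[str]) -> dict:
    """Build a policy that allows invoking only the listed endpoints."""
    endpoint_arns = [
        f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{name}"
        for name in endpoint_names
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Scoped to specific endpoint ARNs, never a wildcard.
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arns,
            },
            {
                # This action does not support resource-level restrictions.
                "Effect": "Allow",
                "Action": "sagemaker:CallWithBearerToken",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and endpoint name, for illustration only.
    policy = build_invoke_policy("us-west-2", "111122223333", ["my-endpoint"])
    print(json.dumps(policy, indent=4))
```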

How tokens work

The bearer token is a base64-encoded, SigV4-signed URL. When you call generate_token, the SageMaker AI SDK constructs a request to the SageMaker AI service for the CallWithBearerToken action, signs it locally with your AWS credentials, and encodes the resulting signed URL as a portable token string. No network calls are made during token generation; signing is done entirely on the client side. When you present this token to a SageMaker AI endpoint, the service decodes it, validates the SigV4 signature, verifies that the token has not expired, and verifies that the originating IAM identity has the necessary permissions. The token lifetime is the lesser of the expiry value and the remaining lifetime of the AWS credentials used to sign the token.
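
The lifetime rule in the last sentence is easy to get wrong when tokens are signed with temporary credentials. The following is a minimal sketch of that arithmetic. The function effective_token_lifetime is a hypothetical name for illustration, and the sketch assumes that credentials without an expiration time (such as long-term IAM user keys) do not shorten the token.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def effective_token_lifetime(
    expiry: timedelta,
    credentials_expire_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> timedelta:
    """Return the lesser of the requested expiry and the remaining credential lifetime."""
    if credentials_expire_at is None:
        # Assumption: credentials with no expiration do not cap the token.
        return expiry
    now = now or datetime.now(timezone.utc)
    # Already-expired credentials leave no usable lifetime.
    remaining = max(credentials_expire_at - now, timedelta(0))
    return min(expiry, remaining)
```

For example, a token requested with a 12-hour expiry from role credentials that expire in one hour is usable for one hour, so a long-running process still needs to regenerate tokens after its credentials rotate.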

Security best practices: The bearer token carries the same authorization as the underlying AWS credentials used to generate it. Treat tokens with the same care as credentials. Scope the IAM role used for token generation to the minimum necessary privileges: sagemaker:InvokeEndpoint and sagemaker:CallWithBearerToken only, targeting only the endpoint ARNs that the caller needs to access. Do not generate tokens from roles with broad privileges, such as those granted by the AdministratorAccess or AmazonSageMakerFullAccess managed policies.

Do not store tokens on disk, in environment variables, in configuration files, in databases, or in distributed caches. Do not log tokens, and send them only over encrypted transports such as HTTPS. Generating a token is a local operation with no network overhead, so we recommend that you generate a new token at the time of use or use the auto-refresh httpx.Auth pattern shown in the previous example. This reduces the risk of token leakage and means each token is used with its maximum remaining validity. As a best practice, set the token expiry to the shortest duration your workload requires.
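
One way to enforce the "do not log tokens" guidance is to mask anything that looks like a bearer token before a log record is written. The following is a minimal sketch using the standard logging module. The names redact_bearer_tokens and BearerTokenRedactor are hypothetical, and the regular expression is an assumption about what a base64-style token looks like in an Authorization header; adjust it to your own log formats.

```python
import logging
import re

# Matches "Bearer <token>" so that only the token part is masked.
_BEARER_PATTERN = re.compile(r"(Bearer\s+)[A-Za-z0-9._~+/=-]+")


def redact_bearer_tokens(text: str) -> str:
    """Replace any bearer token in text with a fixed marker."""
    return _BEARER_PATTERN.sub(r"\1[REDACTED]", text)


class BearerTokenRedactor(logging.Filter):
    """Logging filter that masks bearer tokens before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then mask the result.
        record.msg = redact_bearer_tokens(record.getMessage())
        record.args = None
        return True
```

Attach the filter to your handlers, for example handler.addFilter(BearerTokenRedactor()), so that records propagated from library loggers pass through it as well. Redaction is a safety net, not a substitute for keeping tokens out of log statements in the first place.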

Deploy a single model endpoint

A single model endpoint hosts one model and handles requests directly. The following example deploys Qwen3-4B using the SageMaker AI vLLM Deep Learning Container on an ml.g6.2xlarge instance.

Note: SageMaker AI endpoints incur charges while they are in service, regardless of traffic. For more information, see the Amazon SageMaker AI pricing page.

import boto3
import sagemaker
import time
from sagemaker.core.helper.session_helper import Session
from sagemaker.core.helper.session_helper import get_execution_role

# AWS configuration
REGION = "us-west-2"

# Automatically resolve account ID and default SageMaker execution role
session = Session(boto_session=boto3.Session(region_name=REGION))
ACCOUNT_ID = boto3.client("sts", region_name=REGION).get_caller_identity()["Account"]
EXECUTION_ROLE = get_execution_role(sagemaker_session=session)

# HF Model ID
MODEL_HF_ID = "Qwen/Qwen3-4B"

# SageMaker vLLM Deep Learning Container
VLLM_IMAGE = f"763104351884.dkr.ecr.{REGION}.amazonaws.com/vllm:0.20.2-gpu-py312-cu130-ubuntu22.04-sagemaker"

# Instance type (1x NVIDIA L4 GPU)
INSTANCE_TYPE = "ml.g6.2xlarge"

sagemaker_client = boto3.client("sagemaker", region_name=REGION)

print(f"Region: {REGION}")
print(f"Account ID: {ACCOUNT_ID}")
print(f"Execution role: {EXECUTION_ROLE}")
print(f"Model HF ID: {MODEL_HF_ID}")

# Unique suffix so resource names do not collide across runs
TIMESTAMP = str(int(time.time()))
SME_MODEL_NAME = f"openai-compat-sme-model-{TIMESTAMP}"
SME_ENDPOINT_CONFIG_NAME = f"openai-compat-sme-epc-{TIMESTAMP}"
SME_ENDPOINT_NAME = f"openai-compat-sme-ep-{TIMESTAMP}"

print(f"Timestamp suffix: {TIMESTAMP}")
print(f"Model: {SME_MODEL_NAME}")
print(f"Endpoint config: {SME_ENDPOINT_CONFIG_NAME}")
print(f"Endpoint: {SME_ENDPOINT_NAME}")

sagemaker_client.create_model(
    ModelName=SME_MODEL_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    PrimaryContainer={
        "Image": VLLM_IMAGE,
        "Environment": {
            "HF_MODEL_ID": MODEL_HF_ID,
            "SM_VLLM_TENSOR_PARALLEL_SIZE": "1",
            "SM_VLLM_MAX_NUM_SEQS": "4",
            "SM_VLLM_ENABLE_AUTO_TOOL_CHOICE": "true",
            "SM_VLLM_TOOL_CALL_PARSER": "hermes",
            "SAGEMAKER_ENABLE_LOAD_AWARE": "1",
        },
    },
)
print(f"Model created: {SME_MODEL_NAME}")

sagemaker_client.create_endpoint_config(
    EndpointConfigName=SME_ENDPOINT_CONFIG_NAME,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": SME_MODEL_NAME,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
        }
    ],
)
print(f"Endpoint configuration created: {SME_ENDPOINT_CONFIG_NAME}")

sagemaker_client.create_endpoint(
    EndpointName=SME_ENDPOINT_NAME,
    EndpointConfigName=SME_ENDPOINT_CONFIG_NAME,
)
print(f"Endpoint creation initiated: {SME_ENDPOINT_NAME}")

print("Waiting for endpoint to reach InService status (this takes 5-10 minutes)...")
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=SME_ENDPOINT_NAME,
    WaiterConfig={"Delay": 30, "MaxAttempts": 40},
)
print(f"Endpoint is InService: {SME_ENDPOINT_NAME}")

The endpoint reaches InService status within a few minutes. Once it is ready, it serves both the standard SageMaker AI /invocations path and the OpenAI-compatible /openai/v1/chat/completions path.
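
The waiter call above gives up after Delay × MaxAttempts, which is 30 seconds × 40, or roughly 20 minutes. The same bounded wait can be sketched as a generic helper for status checks that you poll by hand. The name wait_for_status and its parameters are hypothetical and are not part of boto3; the sleep function is injectable so the helper can be exercised without real delays.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "InService",
    failed: str = "Failed",
    delay_seconds: float = 30.0,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns target, reports failure, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        if status == target:
            return status
        if status == failed:
            raise RuntimeError(f"Resource entered {failed} state")
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before the next status check
    raise TimeoutError(f"Status was not {target} after {max_attempts} attempts")
```

To use it, pass a zero-argument callable that returns the current status string, such as a lambda wrapping a describe call on your SageMaker client. Bounding the number of attempts prevents a stuck deployment from blocking a script indefinitely.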

Call the single model endpoint

Once the endpoint is InService, call it using the OpenAI Python SDK. The base URL follows this format:

https://runtime.sagemaker.<region>.amazonaws.com/endpoints/<endpoint-name>/openai/v1

from openai import OpenAI
from sagemaker.core.token_generator import generate_token

REGION = "us-west-2"

sme_base_url = f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1"

client = OpenAI(
    base_url=sme_base_url,
    api_key=generate_token(region=REGION)
)

print(f"Base URL: {sme_base_url}")

stream = client.chat.completions.create(
    model="",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how transformers work in machine learning, in three sentences."},
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()

The model field is passed through to the container. SageMaker AI routes requests based on the endpoint name in the URL, so you can leave this field empty or set it to the model name that the container expects.
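
The streaming loop above consumes chunks that the OpenAI SDK has already parsed. For clients that read the HTTP response directly, the following is a minimal sketch of extracting text from the stream. It assumes the container emits the usual OpenAI-style server-sent events, where each event is a line of the form "data: {json}" and the stream ends with "data: [DONE]". The function name iter_stream_text is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent-event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Iterating over choices instead of indexing the first element also tolerates chunks whose choices list is empty, which some servers send for usage statistics.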

Deploy an inference component endpoint

Inference components allow a single endpoint to host multiple models, each with dedicated computing resources. For inference components, the model is associated with the component rather than the endpoint configuration.

IC_MODEL_NAME = f"openai-compat-ic-model-{TIMESTAMP}"
IC_ENDPOINT_CONFIG_NAME = f"openai-compat-ic-epc-{TIMESTAMP}"
IC_ENDPOINT_NAME = f"openai-compat-ic-ep-{TIMESTAMP}"
IC_NAME = f"openai-compat-ic-qwen3-4b-{TIMESTAMP}"

print(f"Model: {IC_MODEL_NAME}")
print(f"Endpoint config: {IC_ENDPOINT_CONFIG_NAME}")
print(f"Endpoint: {IC_ENDPOINT_NAME}")
print(f"Inference comp: {IC_NAME}")

sagemaker_client.create_model(
    ModelName=IC_MODEL_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    PrimaryContainer={
        "Image": VLLM_IMAGE,
        "Environment": {
            "HF_MODEL_ID": MODEL_HF_ID,
            "SM_VLLM_TENSOR_PARALLEL_SIZE": "1",
            "SM_VLLM_MAX_NUM_SEQS": "4",
            "SM_VLLM_ENABLE_AUTO_TOOL_CHOICE": "true",
            "SM_VLLM_TOOL_CALL_PARSER": "hermes",
            "SAGEMAKER_ENABLE_LOAD_AWARE": "1",
        },
    },
)
print(f"Model created: {IC_MODEL_NAME}")

sagemaker_client.create_endpoint_config(
    EndpointConfigName=IC_ENDPOINT_CONFIG_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
        }
    ],
)
print(f"Endpoint configuration created: {IC_ENDPOINT_CONFIG_NAME}")

sagemaker_client.create_endpoint(
    EndpointName=IC_ENDPOINT_NAME,
    EndpointConfigName=IC_ENDPOINT_CONFIG_NAME,
)
print(f"Endpoint creation initiated: {IC_ENDPOINT_NAME}")

print("Waiting for endpoint to reach InService status (this takes 5-10 minutes)...")
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=IC_ENDPOINT_NAME,
    WaiterConfig={"Delay": 30, "MaxAttempts": 40},
)
print(f"Endpoint is InService: {IC_ENDPOINT_NAME}")

sagemaker_client.create_inference_component(
    InferenceComponentName=IC_NAME,
    EndpointName=IC_ENDPOINT_NAME,
    VariantName="variant1",
    Specification={
        "ModelName": IC_MODEL_NAME,
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 1024,
            "NumberOfCpuCoresRequired": 2,
            "NumberOfAcceleratorDevicesRequired": 1,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
print(f"Inference component creation initiated: {IC_NAME}")

print("Waiting for inference component to reach InService status...")
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=IC_NAME)
    status = desc["InferenceComponentStatus"]
    if status == "InService":
        print(f"Inference component is InService: {IC_NAME}")
        break
    elif status == "Failed":
        raise RuntimeError(f"Inference component failed: {desc.get('FailureReason', 'unknown')}")
    time.sleep(30)

You can create additional inference components on the same endpoint to host multiple models with independent scaling and resource allocation.
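
Before adding another component, it helps to check whether the endpoint has enough accelerators left. The following is a rough sketch of that check. It assumes each copy of a component reserves whole accelerator devices for itself, and it ignores memory, CPU cores, and how copies are packed across instances. ComponentPlan, accelerators_needed, and fits_on_endpoint are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class ComponentPlan:
    """Hypothetical summary of one inference component's requirements."""
    name: str
    accelerators_per_copy: int  # NumberOfAcceleratorDevicesRequired
    copy_count: int = 1         # RuntimeConfig CopyCount


def accelerators_needed(plans: list[ComponentPlan]) -> int:
    """Total accelerator devices requested across all components and copies."""
    return sum(p.accelerators_per_copy * p.copy_count for p in plans)


def fits_on_endpoint(
    plans: list[ComponentPlan],
    accelerators_per_instance: int,
    instance_count: int = 1,
) -> bool:
    """Rough check that requested accelerators do not exceed the endpoint's total."""
    return accelerators_needed(plans) <= accelerators_per_instance * instance_count
```

With the single-GPU ml.g6.2xlarge instance used in this post, one component that requires one accelerator fills the instance, so under these assumptions a second component would need a larger instance type or a higher instance count.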

Call the inference component

To call a specific inference component, include its name in the URL path.

https://runtime.sagemaker.<region>.amazonaws.com/endpoints/<endpoint-name>/inference-components/<inference-component-name>/openai/v1
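
Both URL shapes in this post, endpoint-level and inference-component-level, can be produced by one small helper so the format lives in a single place. The following is a minimal sketch; openai_base_url is a hypothetical function name, and the names in the usage comments are placeholders.

```python
from typing import Optional
from urllib.parse import quote


def openai_base_url(
    region: str,
    endpoint_name: str,
    inference_component: Optional[str] = None,
) -> str:
    """Build the OpenAI-compatible base URL for an endpoint or one of its components."""
    # Percent-encode names so unexpected characters cannot alter the path.
    path = f"/endpoints/{quote(endpoint_name, safe='')}"
    if inference_component:
        path += f"/inference-components/{quote(inference_component, safe='')}"
    return f"https://runtime.sagemaker.{region}.amazonaws.com{path}/openai/v1"


# Endpoint-level URL:
#   openai_base_url("us-west-2", "my-endpoint")
# Inference-component URL:
#   openai_base_url("us-west-2", "my-endpoint", "my-component")
```

The returned string is what you pass as base_url when constructing an OpenAI client.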

The following example targets one inference component on a shared endpoint. Each component gets its own OpenAI client, and all of the clients share a single connection pool through a common httpx.Client.

import httpx
from openai import OpenAI
from sagemaker.core.token_generator import generate_token

shared_http = httpx.Client()

client_a = OpenAI(
    base_url=(
        f"https://runtime.sagemaker.{REGION}.amazonaws.com"
        f"/endpoints/{IC_ENDPOINT_NAME}/inference-components/{IC_NAME}/openai/v1"
    ),
    api_key=generate_token(region=REGION),
    http_client=shared_http,
)

response = client_a.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "What is 42 * 3? Reply with the number."}],
)
print(f"Response: {response.choices[0].message.content}")
print(f"Connection pool active: shared_http is reusable across multiple IC clients")

The shared httpx.Client lets every OpenAI client instance that is built on it reuse the same TLS sessions and connection pool.

Integration with Strands Agents

Strands Agents is an open source SDK for building AI agents. Strands Agents supports OpenAI-compatible model providers, so you can now run multi-agent workflows entirely on your own SageMaker AI infrastructure. This gives you the flexibility of an agent framework together with the control of dedicated endpoints. No data leaves your account, and you can choose exactly which model versions your agents run.

from openai import AsyncOpenAI
from strands import Agent, tool
from strands.models.openai import OpenAIModel
from sagemaker.core.token_generator import generate_token

@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression."""
    # Demo only: eval() executes arbitrary Python, so restrict or replace it before production use.
    return str(eval(expression))

strands_client = AsyncOpenAI(
    base_url=f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1",
    api_key=generate_token(region=REGION),
)

model = OpenAIModel(client=strands_client, model_id="", params={"temperature": 0.7})

coder = Agent(
    model=model,
    system_prompt=(
        "You are an expert Python developer. Write clean, well-documented "
        "Python code with type hints. Output ONLY the code, no explanation."
    ),
    tools=[calculator],
)

reviewer = Agent(
    model=model,
    system_prompt=(
        "You are a senior code reviewer. Review Python code for correctness, "
        "performance, and PEP 8 style. Give a concise review with specific suggestions."
    ),
    tools=[calculator],
)
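
The calculator tool above passes model-generated text to eval, which will run any Python expression it is given. The following is a minimal sketch of a restricted alternative built on the standard ast module. It accepts only numbers and the basic arithmetic operators listed in the tables, and rejects everything else. The name safe_calculate is hypothetical; to use it as a tool, you would wrap it the same way the calculator function is wrapped above.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> str:
    """Evaluate an arithmetic expression made only of numbers and basic operators."""

    def _eval(node: ast.AST):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, and any other syntax are refused.
        raise ValueError("Unsupported expression")

    return str(_eval(ast.parse(expression, mode="eval").body))
```

Exponentiation is left out on purpose, because a short expression with nested powers can consume large amounts of CPU and memory.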

Clean up

To avoid ongoing charges, delete the endpoint and associated resources when you’re done. SageMaker AI endpoints incur costs while in service regardless of whether they are receiving traffic.

import boto3
sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

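# Replace each empty string with the name of the resource you created.
# Delete in this order: inference component, endpoint, endpoint configuration, model.
sagemaker_client.delete_inference_component(InferenceComponentName="")
sagemaker_client.delete_endpoint(EndpointName="")
sagemaker_client.delete_endpoint_config(EndpointConfigName="")
sagemaker_client.delete_model(ModelName="")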

Conclusion

With OpenAI-compatible API support, Amazon SageMaker AI removes the integration barrier between where most AI applications live today and the infrastructure they need to scale. You can keep your existing code, use OpenAI-compatible frameworks, and run inference on dedicated endpoints with the GPU, scaling, and data residency controls you need. To get started, deploy your model to a SageMaker AI real-time endpoint using a supported container, install the SageMaker Python SDK, and point your OpenAI client at the endpoint URL. For more information, see Use SageMaker AI with OpenAI-compatible APIs in the Amazon SageMaker AI Developer Guide, or open the Amazon SageMaker AI console and create your first endpoint.