Amazon SageMaker AI Async Inference now supports inline request payloads


Today, we are announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now send inference payloads directly in the request body of the InvokeEndpointAsync API, which eliminates the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each call.

For payloads up to 128,000 bytes, this removes a network round trip, simplifies client-side code, and reduces the operational footprint of asynchronous inference workloads.

In this post, we explain the motivation behind this feature, detail the before and after customer experience, and show you how to start using inline payloads today.

Background: How asynchronous inference used to work

You can use Amazon SageMaker AI Async Inference to queue and process inference requests asynchronously. This is suitable for workloads that have large payloads, variable traffic, or can tolerate delays of seconds to minutes. It supports autoscaling to zero, making it cost-effective for bursty or batch-type workloads.

Previously, the workflow required two steps for each call:

  1. Upload the input payload to an Amazon S3 bucket.
  2. Invoke the endpoint, passing the S3 object URI as InputLocation.

The endpoint processes requests asynchronously and writes the output to the configured S3 output location. Clients either poll that location or receive Amazon Simple Notification Service (Amazon SNS) notifications when the result is ready.

This two-step pattern works well for large payloads (images, audio, multi-MB documents). However, for customers whose input payloads were small (a few kilobytes) but needed longer processing times than real-time inference allows, the mandatory S3 dependency added unnecessary complexity.
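The polling side of the workflow described above can be sketched as follows. This is a minimal illustration, not an official client: split_s3_uri and wait_for_output are hypothetical helper names, the timeout and interval defaults are arbitrary, and s3_client is assumed to behave like boto3.client("s3").

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key


def wait_for_output(s3_client, output_location, timeout_s=300.0, interval_s=5.0):
    """Poll S3 until the async result object exists, then return its bytes.

    Assumption: the caller can list the output bucket, so a missing object
    surfaces as NoSuchKey rather than an access-denied error.
    """
    bucket, key = split_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Result not written yet; retry until the deadline passes.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"No output at {output_location} after {timeout_s}s"
                ) from None
            time.sleep(interval_s)
```

A caller would pass the OutputLocation value returned by the invocation, for example wait_for_output(boto3.client("s3"), response["OutputLocation"]). For production traffic, SNS notifications are usually a better fit than polling.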

New feature: Inline payloads with the Body parameter

With today’s release, InvokeEndpointAsync accepts a new Body parameter. When Body is present, the payload is sent inline with the API request itself and no S3 upload is required.

Key details:

Aspect | Detail
New parameter | Body (raw bytes), capped at 128,000 bytes.
Maximum inline size | 128,000 bytes (raw payload).
Mutual exclusivity | Body and InputLocation are mutually exclusive. The API rejects requests that set both.
Output behavior | No change. Output is written to the S3 OutputLocation.
Endpoint compatibility | Designed to work with existing async endpoints. No changes to the model or container are expected.
Error handling | Size and mutual exclusivity violations return a synchronous ValidationError response.
Availability | Available in 31 commercial AWS Regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, TPE, CPT, MXP, TLV).
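The size cap and the mutual exclusivity rule can also be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: MAX_INLINE_BYTES and build_async_request are hypothetical names introduced here, and the service remains the authority on what it accepts.

```python
MAX_INLINE_BYTES = 128_000  # inline limit stated in this post


def build_async_request(endpoint_name, content_type, body=None, input_location=None):
    """Return keyword arguments for invoke_endpoint_async after local checks."""
    # Exactly one input source may be set: Body or InputLocation.
    if (body is None) == (input_location is None):
        raise ValueError("Provide exactly one of body or input_location")

    request = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        if not isinstance(body, (bytes, bytearray)):
            raise TypeError("body must be raw bytes")
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"Inline payload is {len(body)} bytes; limit is {MAX_INLINE_BYTES}"
            )
        request["Body"] = bytes(body)
    else:
        request["InputLocation"] = input_location
    return request
```

The result can be passed straight to the client, for example sagemaker_runtime.invoke_endpoint_async(**build_async_request("my-async-endpoint", "application/json", body=payload)). A local check like this only saves a round trip on obviously invalid requests; the ValidationError response from the API should still be handled.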

Before and after: customer experience

The change is easiest to see in code. The following two examples make the same asynchronous call to the same endpoint. The first uses the previously required S3 upload step, and the second replaces it with the inline Body parameter.

Before: Upload to S3 first, then invoke

import boto3, json, uuid

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")

payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")

# 1. Upload the request payload to S3 (extra latency + cost)
input_key = f"async-input/{uuid.uuid4()}.json"
s3.put_object(Bucket="my-async-bucket", Key=input_key, Body=payload)
input_location = f"s3://my-async-bucket/{input_key}"

# 2. Invoke the endpoint
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation=input_location,
    ContentType="application/json",
)

print(response["OutputLocation"])

This approach requires:

  • An S3 client and a provisioned input bucket.
  • AWS Identity and Access Management (IAM) s3:PutObject permission for the caller.
  • A naming scheme (such as UUIDs) to avoid key collisions.
  • A cleanup strategy for old input objects.

After: Send payload inline

import boto3, json

sagemaker_runtime = boto3.client("sagemaker-runtime")

payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")

# One call, no S3 upload, no input bucket needed
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    Body=payload,
    ContentType="application/json",
)

print(response["OutputLocation"])

No S3 client, no uuid, no input bucket, no IAM grants on input paths, and no cleanup of old objects.

Customer benefits

Sending the payload inline removes network hops and dependencies from each request. This leads to five tangible benefits:

  • Lower latency. One network round trip and one S3 PUT are removed per request. For fan-out workloads, these savings add up significantly.
  • Simpler architecture. You avoid input bucket provisioning, lifecycle policies, cross-account access patterns, and caller IAM s3:PutObject permissions on the input path.
  • Fewer failure paths. A request is a single API call: it is either enqueued or it is not.
  • Lower cost. S3 PUT charges for input uploads are removed on all inline calls.
  • Immediate validation feedback. Size errors and mutual exclusivity errors are returned synchronously.

When to use each approach

Inline payloads are usually the simpler choice for small payloads, but InputLocation still has its place. Use the following table to determine which path fits your workload.

Scenario | Recommended approach
Payload <= 128,000 bytes (JSON prompt, structured data) | Inline Body. Simpler, and avoids one network round trip and S3 PUT charges.
Payload > 128,000 bytes (images, audio, large documents) | InputLocation. Upload to S3 first.
Mixed workload with variable payload size | Branch on size. Use Body when the payload is small and InputLocation when it is large.
Input data needs to persist in S3 for auditing or replay | InputLocation. Keep the input in a bucket.
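For the mixed-workload row, branching on size might look like the following sketch. The function name invoke_async_auto, the async-input/ key prefix, and the input_bucket argument are illustrative assumptions. The two client arguments are expected to behave like boto3.client("sagemaker-runtime") and boto3.client("s3").

```python
import uuid

MAX_INLINE_BYTES = 128_000  # inline limit stated in this post


def invoke_async_auto(runtime_client, s3_client, endpoint_name, payload,
                      input_bucket, content_type="application/json"):
    """Send small payloads inline; stage larger ones in S3 first."""
    if len(payload) <= MAX_INLINE_BYTES:
        # Small payload: a single API call, no S3 upload.
        return runtime_client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            Body=payload,
            ContentType=content_type,
        )

    # Large payload: fall back to the upload-then-invoke pattern.
    key = f"async-input/{uuid.uuid4()}"
    s3_client.put_object(Bucket=input_bucket, Key=key, Body=payload)
    return runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{input_bucket}/{key}",
        ContentType=content_type,
    )
```

With this shape, callers keep one entry point, and the input bucket is only touched when a payload exceeds the inline limit.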

Getting started

For a complete tutorial, see the sample code notebook.

Before you begin, make sure you have the following:

  • An existing Amazon SageMaker AI asynchronous inference endpoint (verify it with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).
  • The latest AWS SDK for Python (Boto3), installed and configured with your credentials.
  • The IAM permission sagemaker:InvokeEndpointAsync.
  • An S3 output bucket configured for the asynchronous endpoint (for example, my-output-bucket).
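The endpoint prerequisite can also be checked from Python. The following sketch assumes sagemaker_client behaves like boto3.client("sagemaker"), and check_async_endpoint is a hypothetical helper name. It relies on DescribeEndpoint returning an AsyncInferenceConfig block for asynchronous endpoints.

```python
def check_async_endpoint(sagemaker_client, endpoint_name):
    """Return the configured S3 output path of an in-service async endpoint.

    Raises RuntimeError if the endpoint is not InService or has no
    asynchronous inference configuration.
    """
    desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)

    status = desc.get("EndpointStatus")
    if status != "InService":
        raise RuntimeError(f"Endpoint {endpoint_name} is {status}, not InService")

    async_config = desc.get("AsyncInferenceConfig")
    if not async_config:
        raise RuntimeError(f"Endpoint {endpoint_name} is not an async endpoint")

    # OutputConfig holds the S3 location where results are written.
    return async_config.get("OutputConfig", {}).get("S3OutputPath")
```

For example, check_async_endpoint(boto3.client("sagemaker"), "my-async-endpoint") returns the output path to monitor, or raises if the endpoint is not ready.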

Note: Following this guide uses billable AWS resources. SageMaker AI asynchronous inference endpoints incur charges for instance hours, and S3 buckets incur charges for storage and requests. To avoid recurring charges, please follow the cleanup steps after completing the tutorial.

Steps

Inline payload support is available now. To use it:

  1. Update the AWS SDK. Install or upgrade Boto3 to the latest version: pip install --upgrade boto3.
  2. Verify the installation: pip show boto3.
  3. Replace the calling code. In your application, replace the S3 upload plus InputLocation pattern with the Body parameter, as shown in the preceding code example.
  4. Test the call by invoking the InvokeEndpointAsync API with the Body parameter.
  5. Check that the response contains the OutputLocation field.
  6. Poll or monitor the S3 OutputLocation to verify that the inference results were written successfully.

You don’t need to change your endpoint configuration, model container, or output S3 setup.
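Beyond pip show boto3, one way to confirm that the installed SDK knows about the new parameter is to inspect the client's service model. This is a hedged sketch: supports_inline_body is a hypothetical helper, and it assumes the parameter appears as Body in the SDK's model of InvokeEndpointAsync, matching the code examples in this post.

```python
def supports_inline_body(runtime_client):
    """Return True if the SDK's InvokeEndpointAsync model has a Body member.

    runtime_client is expected to behave like
    boto3.client("sagemaker-runtime").
    """
    service_model = runtime_client.meta.service_model
    operation = service_model.operation_model("InvokeEndpointAsync")
    input_shape = operation.input_shape
    return input_shape is not None and "Body" in input_shape.members
```

If this returns False after an upgrade, the environment is likely still importing an older Boto3 or botocore installation.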

Cleaning up

To avoid ongoing charges, delete the resources used in this tutorial.

  1. If the SageMaker AI endpoint was created for testing, delete it.
    aws sagemaker delete-endpoint --endpoint-name my-async-endpoint

  2. Delete the output S3 bucket if you no longer need it. Warning: Deleting an S3 bucket permanently deletes the objects in that bucket. Make sure you back up any inference results you need to keep.
    aws s3 rb s3://my-output-bucket --force

  3. Delete any IAM policies created specifically for this tutorial.

Conclusion

Inline payload support for SageMaker AI asynchronous inference eliminates a common point of friction in asynchronous inference workflows: the required S3 upload for each request. For inference payloads that fit within 128,000 bytes, you can now make a single API call and let SageMaker AI handle the rest.

This feature is designed to be backward compatible. Existing InputLocation workflows continue to work unchanged. Both inline and S3 inputs are processed the same way once a request is accepted, and the model receives the same request regardless of the input source.

To get started, update the AWS SDK and try the Body parameter of the SageMaker AI InvokeEndpointAsync API. For more information about asynchronous inference, see the Amazon SageMaker AI Asynchronous Inference documentation.


About the authors

Dan Ferguson

Dan is a Solutions Architect at AWS based in New York, USA. Dan is a machine learning services expert dedicated to helping customers integrate ML workflows efficiently, effectively, and sustainably.

blues one

Bruce is a software development engineer on the SageMaker AI Inference DataPlane team at AWS. He builds infrastructure to power real-time and asynchronous inference for SageMaker AI customers.


