As organizations increasingly adopt basic models (FMS) for artificial intelligence and machine learning (AI/ML) workloads, it becomes important to efficiently manage large-scale inference operations. Amazon Bedrock supports two common types of large-scale inference patterns: Real-time and batch inference for use cases with processing large datasets that do not require immediate results.
Amazon Bedrock Batch Inference is a cost-effective solution that offers a 50% discount compared to on-demand processing, making it ideal for large amounts of time-impressive workloads. However, implementing batch inference at scale has its own challenges, such as managing input formatting and job quotas, coordinating concurrent executions, and handling post-processing tasks. Developers need a robust framework to streamline these operations.
In this post, we present a flexible and scalable solution that simplifies your batch inference workflow. This solution offers a highly scalable approach to managing your FM batch inference needs, such as generating embeddings of millions of documents or performing custom evaluation or completion tasks on large datasets.
Solution overview
The following diagram provides a detailed overview of the broader automated workflow. It details three main phases: preprocessing the input dataset (fast formatting), running parallel executions of batch inference jobs, and post-processing for analyzing the model output.

This solution provides a flexible and scalable framework for simplifying batch orchestration. Given a simple configuration input, the step will work the state machine deployed in this AWS Cloud Development Kit (AWS CDK) stack, handling dataset preprocessing, parallel batch jobs launch, and output postprocessing.
A particular use case uses 2.2 million rows of data from the open source dataset SimpleCot. The SimpleCot dataset of embracing faces is a collection of diverse task-oriented examples designed to demonstrate and train chain chain (COT) inference in language models. This dataset includes a wide range of problem types, including reading comprehension, mathematical inference, logical deduction, and natural language processing (NLP) tasks. The dataset consists of each entry that contains a task description, a question, a correct answer, and a detailed description of the inference process.
The following diagram illustrates the solution architecture.

The Amazon Bedrock Batch Orchestration pattern uses scalable, serverless components to cover key architectural considerations specific to batch processing workflows.
- File format and storage – Job inputs must be structured as JSONL files stored in Amazon Simple Storage Service (Amazon S3) Bucket. Each row represents a single input record that matches the API request structure of that FM or provider. For example, Anthropic's Claude model has a different JSON structure compared to Amazon Titan Text Embeddings v2. There are also quotas to consider. At the time of writing, there are at least 1,000 records and a maximum of 50,000 records per batch. Based on the requirements of your use case, you can use service quotas to request an increase in quota.
- Step Functional State Machine – A robust control flow system is required for orchestrating asynchronous, long-term jobs. Our architecture uses step functionality to coordinate the entire process, and Amazon DynamoDB maintains individual jobs and their state inventory. Again, there are important allocation considerations. For example, the maximum total of in-progress and submitted batch inference jobs using the Amazon Titan Text Embeddings V2 base model is currently 20 per AWS region. Using Map Workflow State, the Step function helps maximize throughput by controlling the completion status of job submissions and monitoring.
- Post-processing – Finally, I recommend running a light process on batch output (and even Amazon S3 JSONL files) to parse the response and combine the output with the original input. For example, when generating text embeddings, you need a mechanism to map the output vector to the source text. These configurable AWS Lambda functions are triggered as part of the step function workflow after the batch result arrives in Amazon S3.
In the next section, you will proceed to deploy the AWS CDK stack to your AWS environment.
Prerequisites
Complete the following prerequisite steps:
- Install the node and NPM.
- Install the AWS CDK:
- Clone the GitHub repository into a local development environment.
Deploy the solution
Install the required packages with the following code:npm i
Please check prompt_templates.py File and add a new prompt template prompt_id_to_template For your desired use case.
prompt_id_to_template The key is a key prompt_id (You can associate a specific job with a specific prompt). The format key for the prompt string template must also be present in the input file. For example, consider the following prompt template:
You must ensure that the input dataset has a column for each format key (for example, source (in the previous example code).
Prompt templates are not used to embed model-based jobs. Deploy the AWSCDK stack with the following code:npm run cdk deploy
Note the AWS Cloud Formation Output that indicates the name of the bucket and step functionality workflow.
Job input structure
You can use the face dataset you hug as the input dataset or point directly to the Amazon S3 dataset (CSV or Parquet formats are supported at the time of writing). The source and model type of the input dataset (text generation or embedding) determines the structure of the step function input.
Hugging face data set
For face datasets to hug, see the dataset ID (for example, w601sxs/simpleCoT) and splitting (for example train), and your dataset will pull directly from hugging the facehub.

question_answering Prompt Template prompt_templates.py The format key is called source To match the names of the appropriate columns in the referenced dataset (see example above). Use this prompt to generate the rationale and answer for each of the 2.2 million rows in the dataset. See the following code:
There is also an optional key max_num_jobs (To limit the total number of jobs that can be useful for testing on small scales) max_records_per_batch.
Amazon S3 Data Set
Upload the input CSV or Parquet file to an S3 bucket and copy the S3 URI. for example:aws s3 cp topics.csv s3://batch-inference-bucket-
Open the Step Function State machine in the Step Function Console and send the input with the following structure: You need to supply s3_uri For S3 data sets.
For example, for a human model with Amazon S3 input, use the following code:
prompt_id of joke_about_topic Map to prompt template prompt_templates.pythere is a format key topicmust be one of the columns in the input CSV file.
Generate a batch embedding
To generate embeddings in a model such as Amazon Titan Text Embeddings v2, prompt_idbut you need to make sure there is a column called in your input CSV file input_text With the text you want to embed. for example:
Step Function Workflow
The following diagram shows an example of a successful execution of a step function workflow.

Once the step makes the state machine work, the following steps are completed:
- A preprocessing input dataset for preparing batch job input for a specific model ID and prompt template.
BaseProcessorAbstract classes can be quickly scaled for other model providers, such as Metalama 3 and Amazon Nova. - Adjust batch jobs in an event-driven way. Maintains an internal inventory of jobs in the DynamoDB table and continues updating when Amazon Bedrock issues events related to job status changes. These updates are sent to the step function as follows Wait for task token callback Integration patterns. Use the SFN map to ensure that the maximum capacity of concurrent jobs is maintained until the records are processed.
- Perform simultaneous post-processing of batch output, perform some write analysis, and merge the model response into the original input data using the RecordID field as the join key. The output data will vary depending on the type of model you are using. For text-based models, the output string is in a new column called
response.
Monitors the state machine when running a job. The maximum number of concurrent jobs is controlled by AWS CDK context variables cdk.json (key: maxConcurrentJobs). The path to the resulting parquet file is aggregated into the output from the run.
The output Parquet file contains the same columns as the input file, along with the generated response.
For text generation models, the output string is in a new column called responseas shown in the following screenshot of the sample output.

For embedded models, the output (list of floats) is in a new column called embeddingas shown in the following screenshot.

There are no guaranteed SLAs in the batch inference API. The runtime depends on the demand for the desired model at the time of the request. For example, to process 2.2 million records in the SimpleCot dataset, it was categorized into 45 individual processing jobs that were run, with up to 20 concurrent jobs at a given time. Anthropic's Claude Haiku 3.5 experiment us-east-1 In the region, individual jobs took an average of 9 hours to run, with a total of approximately 27 hours of end-to-end processing time.
cleaning
You can run it to clean up stack resources to avoid any additional costs cdk destroy.
Conclusion
In this post, we outlined a serverless architecture for performing large batch processing using Amazon Bedrock batch inference. We investigated the use of solutions for a variety of use cases, including large-scale data labeling and embedding generation. It also allows for the generation of large amounts of synthetic data from teacher models used to train student models as part of the model distillation process.
This solution is published on GitHub Repo. I can't wait to see how this architecture works for use cases.
About the author
Swagat Kulkarni He is a senior solution architect at AWS and an active generator AI practitioner. He is passionate about using cloud-native services and machine learning to help customers solve real challenges. Swagut has provided impactful solutions that enable innovation and scale in a strong background driving digital transformation across diverse industries. Outside of work, I enjoy traveling, reading and cooking.
Evan Diwelld A data and machine learning engineer with AWS Professional Services, helping AWS customers develop and deploy ML solutions across a variety of industries. Before joining AWS, he received an MS from Carnegie Mellon University, where he was researching at the intersection of advanced manufacturing and AI. Outside of work, he enjoys mountain biking and rock climbing.
Shreyas Subramanian A leading data scientist, helping customers by using generative AI and solving business challenges using AWS services such as Amazon Bedrock and AgentCore. Dr. Subramanian contributes to cutting-edge research in deep learning, agent AI, basic models and optimization techniques. In his current role on Amazon, Dr. Subramanian works with a variety of science leaders and research teams both inside and outside of Amazon, making the best use of customers to help them solve business problems by making the most of their latest algorithms and technologies. Outside of AWS, Dr. Subramanian is an expert reviewer in AI papers and funding through organizations such as Neurip, ICML, ICLR, NASA, and NSF.
