Cost-effective, large-scale multilingual audio transcription using Parakeet-TDT and AWS Batch

Machine Learning


Many organizations are archiving large media libraries, analyzing contact center recordings, preparing training data for AI, or processing on-demand video for closed captioning. As data volumes increase significantly, the cost of managed automatic speech recognition (ASR) services can quickly become the primary constraint to scalability.

To address this cost and scalability challenge, we use the NVIDIA Parakeet-TDT-0.6B-v3 model deployed on GPU-accelerated instances via AWS Batch. Parakeet-TDT’s token-and-duration transducer architecture predicts text tokens and their durations simultaneously, intelligently skipping silence and redundant processing. This enables inference speeds that are orders of magnitude faster than real time, so you can transcribe at scale while paying only for short bursts of compute rather than for the full length of the audio: less than 1 cent per hour of audio, based on the benchmarks described in this post.

This post describes building a scalable, event-driven transcription pipeline that automatically processes audio files uploaded to Amazon Simple Storage Service (Amazon S3), and shows how you can use Amazon EC2 Spot Instances and buffered streaming inference to further reduce costs.

Model features

Parakeet-TDT-0.6B-v3, released in August 2025, is an open-source multilingual ASR model that provides high accuracy across 25 European languages with automatic language detection and flexible licensing under CC-BY-4.0. According to metrics published by NVIDIA, the model maintains an average word error rate (WER) of 6.34% on clean audio and 11.66% at 0 dB SNR, and supports up to 3 hours of audio in local attention mode.

The 25 supported languages include Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, and Ukrainian. This reduces the need for separate models and language-specific configurations when serving multilingual European markets. For deployment on AWS, the model requires a GPU-enabled instance with a minimum of 4 GB of VRAM, although 8 GB provides better performance. G6 instances (NVIDIA L4 GPUs) offer the best price-performance for this inference workload. The model also performs well on G5 (A10G) and G4dn (T4) instances, and on P5 (H100) or P4 (A100) instances for maximum throughput.

Solution architecture

The process begins when you upload an audio file to your S3 bucket. This triggers an Amazon EventBridge rule that sends the job to AWS Batch. AWS Batch provisions GPU-accelerated compute resources, and the provisioned instances pull container images containing pre-cached models from Amazon Elastic Container Registry (Amazon ECR). The inference script downloads the file, processes it, and uploads a timestamped JSON transcript to an output S3 bucket. The architecture scales to zero when idle, so costs are only incurred during active compute.
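Under the hood, the EventBridge rule forwards the uploaded object's bucket and key to the Batch job so the inference script knows which file to fetch. A minimal sketch of that mapping (the environment variable names here are illustrative, not taken from the repository):

```python
import json

def build_batch_overrides(event: dict) -> dict:
    """Map an S3 "Object Created" EventBridge event to AWS Batch
    container overrides that tell the inference script which
    object to process."""
    detail = event["detail"]
    return {
        "environment": [
            {"name": "INPUT_BUCKET", "value": detail["bucket"]["name"]},
            {"name": "INPUT_KEY", "value": detail["object"]["key"]},
        ]
    }

# Example EventBridge payload, abbreviated to the fields used above
event = {
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "audio-input-bucket"},
        "object": {"key": "uploads/call-0001.wav"},
    },
}
overrides = build_batch_overrides(event)
print(json.dumps(overrides))
```

In the deployed solution, EventBridge performs this mapping declaratively through an input transformer on the Batch job target; the Python version above only illustrates the data flow.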

For more information about the general architecture components, see our previous post, Whisper Audio Transcription with AWS Batch and AWS Inferentia.

Figure 1. Event-driven audio transcription pipeline using Amazon EventBridge and AWS Batch

Prerequisites

  1. Create an AWS account if you don’t already have one and sign in. Create a user with full administrative privileges using AWS IAM Identity Center, as described in Adding a User.
  2. Install the AWS Command Line Interface (AWS CLI) on your local development machine and create an administrator user profile as described in Setting Up the AWS CLI.
  3. Install Docker on your local machine.
  4. Clone the GitHub repository to your local machine.

Building a container image

The repository contains a Dockerfile that builds a streamlined container image optimized for inference performance. The image uses Amazon Linux 2023 as a base, installs Python 3.12, and pre-caches the Parakeet-TDT-0.6B-v3 model during the build to avoid download delays at runtime.

FROM public.ecr.aws/amazonlinux/amazonlinux:2023

WORKDIR /app

# Install system dependencies, Python 3.12, and ffmpeg
RUN dnf update -y && \
    dnf install -y gcc-c++ python3.12-devel tar xz && \
    ln -sf /usr/bin/python3.12 /usr/local/bin/python3 && \
    python3 -m ensurepip && \
    python3 -m pip install --no-cache-dir --upgrade pip && \
    dnf clean all && rm -rf /var/cache/dnf

# Install Python dependencies and pre-cache the model
COPY ./requirements.txt requirements.txt
RUN pip install -U --no-cache-dir -r requirements.txt && \
    rm -rf ~/.cache/pip /tmp/pip* && \
    python3 -m compileall -q /usr/local/lib/python3.12/site-packages

COPY ./parakeet_transcribe.py parakeet_transcribe.py

# Cache model during build to eliminate runtime download
RUN python3 -c "from nemo.collections.asr.models import ASRModel; \
    ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v3')"

CMD ["python3", "parakeet_transcribe.py"]
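The parakeet_transcribe.py script invoked by the container downloads the audio file, runs inference, and uploads a timestamped JSON transcript. A sketch of what assembling that output might look like (the schema shown is illustrative; the repository defines the actual format):

```python
import json

def to_transcript_json(segments, source_key):
    """Assemble segment-level hypotheses (text plus start/end times in
    seconds, as produced by NeMo's timestamped transcription) into the
    JSON document uploaded to the output bucket. Field names here are
    an assumption, not the repository's schema."""
    return json.dumps({
        "source": source_key,
        "segments": [
            {"start": round(start, 2), "end": round(end, 2), "text": text}
            for (start, end, text) in segments
        ],
        "text": " ".join(text for (_, _, text) in segments),
    }, ensure_ascii=False)

doc = to_transcript_json(
    [(0.0, 3.48, "Hello and welcome."), (3.48, 7.1, "Today we discuss ASR.")],
    "uploads/call-0001.wav",
)
print(doc)
```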

Push to Amazon ECR

The repository includes an updateImage.sh script that handles environment discovery (CodeBuild or EC2), builds the container image, optionally creates an ECR repository, enables vulnerability scanning, and pushes the image. Run it with:

./updateImage.sh

Deploying the solution

This solution uses an AWS CloudFormation template (deployment.yaml) to provision your infrastructure. The buildArch.sh script automates the deployment by discovering the AWS Region, gathering VPC, subnet, and security group information, and deploying the CloudFormation stack.

./buildArch.sh

Under the hood, the script runs:

aws cloudformation deploy --stack-name batch-gpu-audio-transcription \
  --template-file ./deployment.yaml \
  --capabilities CAPABILITY_IAM \
  --region ${AWS_REGION} \
  --parameter-overrides VPCId=${VPC_ID} SubnetIds="${SUBNET_IDS}" \
  SGIds="${SecurityGroup_IDS}" RTIds="${RouteTable_IDS}"

The CloudFormation template creates an AWS Batch compute environment with G6 and G5 GPU instances, a job queue, a job definition that references the ECR image, and input and output S3 buckets with EventBridge notifications enabled. It also creates an EventBridge rule that triggers a Batch job on S3 uploads, an Amazon CloudWatch agent configuration for GPU, CPU, and memory monitoring, and IAM roles with least-privilege policies. AWS Batch lets you select an Amazon Linux 2023 GPU image by specifying ImageType: ECS_AL2023_NVIDIA in the compute environment configuration.

Alternatively, you can deploy directly from the AWS CloudFormation console using the launch link provided in the repository’s README.

Configuring Spot Instances

Amazon EC2 Spot Instances can further reduce costs by running your workloads on unused EC2 capacity at discounts of up to 90%, depending on the instance type. To enable Spot Instances, modify the compute environment in deployment.yaml:

DefaultComputeEnv:
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: MANAGED
    State: ENABLED
    ComputeResources:
      AllocationStrategy: SPOT_PRICE_CAPACITY_OPTIMIZED
      Type: SPOT
      BidPercentage: 100
      InstanceTypes:
        - "g6.xlarge"
        - "g6.2xlarge"
        - "g5.xlarge"
      MinvCpus: !Ref DefaultCEMinvCpus
      MaxvCpus: !Ref DefaultCEMaxvCpus
      # ... remaining configuration unchanged

You can enable this by setting UseSpotInstances=Yes in --parameter-overrides when running aws cloudformation deploy. The SPOT_PRICE_CAPACITY_OPTIMIZED allocation strategy selects the Spot Instance pools that are least likely to be interrupted, at the lowest possible price. Diversifying the instance types (g6.xlarge, g6.2xlarge, g5.xlarge) increases Spot availability. Setting MinvCpus: 0 scales the environment to zero when idle, so there is no cost between workloads. ASR jobs are stateless and idempotent, making them a natural fit for Spot Instances: when an instance is reclaimed, AWS Batch automatically retries the job (up to two retries, as configured in the job definition).
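Idempotency here means a retried job can simply overwrite its own earlier, possibly partial output instead of creating duplicates. One way to achieve that is to derive the output key deterministically from the input key (a sketch; the naming scheme is illustrative, not the repository's):

```python
def output_key_for(input_key: str) -> str:
    """Derive a deterministic output key from the input key, so a job
    retried after a Spot interruption writes to the same location as
    the original attempt."""
    filename = input_key.rsplit("/", 1)[-1]   # drop any prefix
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    return f"transcripts/{stem}.json"

print(output_key_for("uploads/call-0001.wav"))  # transcripts/call-0001.json
```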

Memory management for long audio

The Parakeet-TDT model’s memory consumption increases linearly with audio length. The Fast Conformer encoder must generate and store a feature representation of the complete audio signal, creating a direct dependency in which doubling the audio length roughly doubles VRAM usage. According to the model card, with full attention the model can handle up to 24 minutes of audio on 80 GB of VRAM.

NVIDIA addresses this with a local attention mode that supports up to 3 hours of audio on an 80 GB A100:

# Enable local attention for long audio
asr_model.change_attention_model("rel_pos_local_attn", [128, 128])
asr_model.change_subsampling_conv_chunking_factor(1)  # 1 = auto-select
asr_model.transcribe(["input_audio.wav"])

This may have a small accuracy impact, so we recommend testing it in your use case.

Buffered streaming inference

Use buffered streaming inference for audio longer than 3 hours, or to cost-effectively process long audio on standard hardware such as g6.xlarge. Based on the NVIDIA NeMo streaming inference example, this technique processes audio in overlapping chunks rather than loading the complete context into memory.

To maintain transcription quality at chunk boundaries, we configure 20-second chunks with 5-second left context and 3-second right context (changing these parameters can reduce accuracy, so experiment to find the optimal configuration; reducing chunk_secs will increase processing time).
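The sliding-window arithmetic behind this configuration can be sketched as follows (a simplified illustration of the chunking logic, not the NeMo implementation; Parakeet models expect 16 kHz audio):

```python
SAMPLE_RATE = 16_000  # Hz

def chunk_windows(total_samples, chunk_secs=20, left_secs=5, right_secs=3,
                  sr=SAMPLE_RATE):
    """Yield (buffer_start, buffer_end, chunk_start, chunk_end) sample
    indices for each sliding-window step: the encoder sees the full
    [left context | chunk | right context] buffer, but only the chunk
    frames are decoded, keeping memory usage constant."""
    chunk, left, right = chunk_secs * sr, left_secs * sr, right_secs * sr
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        buf_start = max(0, start - left)          # left context, clipped
        buf_end = min(end + right, total_samples)  # right context, clipped
        yield buf_start, buf_end, start, end
        start = end  # advance the window by one chunk

# A 50-second file at 16 kHz yields three windows (20 s + 20 s + 10 s)
windows = list(chunk_windows(50 * SAMPLE_RATE))
print(len(windows))  # 3
```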

# Streaming inference loop
while left_sample < audio_batch.shape[1]:
    # add samples to buffer
    chunk_length = min(right_sample, audio_batch.shape[1]) - left_sample

    # [Logic to manage buffer and flags omitted for brevity]
    buffer.add_audio_batch_(...)

    # Encode using full buffer [left-chunk-right]
    encoder_output, encoder_output_len = asr_model(
        input_signal=buffer.samples,
        input_signal_length=buffer.context_size_batch.total(),
    )

    # Decode only chunk frames (constant memory usage)
    chunk_batched_hyps, _, state = decoding_computer(...)

    # Advance sliding window
    left_sample = right_sample
    right_sample = min(right_sample + context_samples.chunk, audio_batch.shape[1])

Processing audio with a fixed chunk size decouples VRAM usage from the total length of the audio, allowing a single g6.xlarge instance to process a 10-hour file with the same memory footprint as a 10-minute file.

Figure 2. Buffered streaming inference processes audio in overlapping chunks with constant memory usage.

To enable buffered streaming, deploy with the EnableStreaming=Yes parameter:

aws cloudformation deploy \
    --stack-name batch-gpu-audio-transcription \
    --template-file ./deployment.yaml \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides EnableStreaming=Yes \
    VPCId=your-vpc-id SubnetIds=your-subnet-ids SGIds=your-sg-ids RTIds=your-rt-ids

Testing and monitoring

To validate our solution at scale, we ran an experiment using 1,000 identical 50-minute audio files from NASA’s preflight crew press conferences, distributed across 100 g6.xlarge instances processing 10 files each.

Figure 3. 1,000 jobs in the batch-gpu-audio-transcription-jq queue running across 100 g6.xlarge instances simultaneously.

This deployment includes an Amazon CloudWatch agent configuration that collects GPU utilization, power consumption, VRAM usage, CPU usage, memory consumption, and disk usage at 10-second intervals. These metrics appear under the CWAgent namespace, letting you build dashboards for real-time monitoring.

Performance and cost analysis

To verify the efficiency of the architecture, we benchmarked the system using several long-form audio files.

The Parakeet-TDT-0.6B-v3 model achieved a raw inference speed of 0.24 seconds per minute of audio. However, the complete pipeline also includes overhead: loading the model into memory, downloading the audio, pre-processing the input, and post-processing the output. Because this overhead is fixed, longer audio files yield the best cost efficiency, since more of the job’s duration is spent on actual processing.

Benchmark results (g6.xlarge):

  • Audio length: 3 hours 25 minutes (205 minutes)
  • Total job duration: 100 seconds
  • Effective processing speed: 0.49 seconds per minute of audio

Cost breakdown

You can estimate the cost per minute of audio processing based on the pricing in the us-east-1 region for g6.xlarge instances.

| Pricing model | Cost per hour (g6.xlarge)* | Cost per minute of audio |
| --- | --- | --- |
| On-Demand | ~$0.805 | **$0.00011** |
| Spot Instance | ~$0.374 | **$0.00005** |

*Prices are estimates based on US-East-1 rates at the time of writing. Spot prices vary by availability zone and are subject to change.
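The per-minute figures follow directly from the benchmark arithmetic: the hourly instance price divided by 3,600 seconds, multiplied by the effective seconds of compute per minute of audio:

```python
def cost_per_audio_minute(hourly_price: float, secs_per_audio_minute: float) -> float:
    """Estimate compute cost per minute of audio from the instance's
    hourly price and the effective processing speed (seconds of
    compute consumed per minute of audio)."""
    return hourly_price / 3600 * secs_per_audio_minute

# 0.49 s of compute per audio minute, from the benchmark above
on_demand = cost_per_audio_minute(0.805, 0.49)  # g6.xlarge On-Demand
spot = cost_per_audio_minute(0.374, 0.49)       # representative Spot price
print(f"${on_demand:.5f} On-Demand, ${spot:.5f} Spot per audio minute")
```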

This comparison highlights the economics of large-scale transcription and the cost advantage of a self-hosted approach over managed API services for high-volume workloads.

Cleanup

To avoid future charges, delete the resources created by this solution.

  1. Empty all S3 buckets (input, output, logs).
  2. Delete the CloudFormation stack.

aws cloudformation delete-stack --stack-name batch-gpu-audio-transcription

  3. Delete the ECR repository and container images if no longer needed.

For detailed cleanup instructions, see the cleanup section of the repository’s README.

Conclusion

In this post, you learned how to build an audio transcription pipeline that processes audio at scale for cents per hour. By combining NVIDIA’s Parakeet-TDT-0.6B-v3 model with AWS Batch and EC2 Spot Instances, you can transcribe across 25 European languages with automatic language detection, at a fraction of the cost of alternative solutions. Buffered streaming inference extends this capability to audio of arbitrary length on standard hardware, and the event-driven architecture automatically scales from zero to handle variable workloads.

To get started, check out the sample code in our GitHub repository.


About the author

Gleb Geinke

Gleb Geinke is a deep learning architect at the AWS Generative AI Innovation Center. Gleb works directly with enterprise customers to design and scale innovative generative AI solutions that address complex business challenges.

Justin Leto

Justin Leto is a Global Principal Solutions Architect on the Private Equity team at AWS. Justin is the author of Data Engineering with Generative and Agentic AI on AWS, published by APRESS.

Yusong Wang

Yusong Wang is a Principal High Performance Computing (HPC) Specialist Solutions Architect at AWS with over 20 years of experience ranging from national research institutions to large financial companies.

Brian Maguire

Brian Maguire is a Principal Solutions Architect at Amazon Web Services, focused on helping customers build their ideas in the cloud. Brian is the co-author of Scalable Data Streaming with Amazon Kinesis.
