Voice AI is transforming how we interact with technology, making conversational interactions more natural and intuitive than ever before. At the same time, AI agents are becoming increasingly sophisticated, capable of understanding complex queries and taking autonomous actions on our behalf. As these trends converge, we see the emergence of intelligent AI voice agents that can engage in human-like conversations while performing a wide range of tasks.
In this series of posts, you will learn how to use foundation models in Amazon Bedrock to build intelligent AI voice agents with Pipecat, an open source framework for voice and multimodal conversational AI agents. The series includes high-level reference architectures, best practices, and code samples to guide your implementation.
Approaches to building an AI voice agent
There are two general approaches to building a conversational AI agent:
- Using cascaded models: In this post (Part 1), we walk through the cascaded models approach and dive into the individual components of a conversational AI agent. With this approach, voice input passes through a series of architectural components before a voice response is sent back to the user. This approach is also sometimes referred to as a pipeline or component-model voice architecture.
- Using a unified speech-to-speech foundation model: In Part 2, we learn how Amazon Nova Sonic, a state-of-the-art, unified speech-to-speech foundation model, combines speech understanding and speech generation in a single architecture to enable real-time, human-like voice conversations.
Common Use Cases
AI voice agents can handle a variety of use cases, including, but not limited to:
- Customer support: AI voice agents can handle customer inquiries 24/7, providing instant answers and routing complex issues to a human when needed.
- Outbound calling: AI agents can follow up on leads with personalized outreach, schedule appointments, and hold natural conversations.
- Virtual assistants: Voice AI can power personal assistants that help users manage tasks and answer questions.
Architecture: Building AI voice agents using cascaded models
Building agentic voice AI applications with the cascaded models approach requires orchestrating multiple architectural components, including several machine learning and foundation models.

Figure 1: Overview of the architecture of a voice AI agent using Pipecat
These components include:
WebRTC Transport: Enables real-time audio streaming between client devices and application servers.
Voice Activity Detection (VAD): Use Silero VAD to detect speech, with noise suppression to remove background noise and improve audio quality.
Automatic Speech Recognition (ASR): Use Amazon Transcribe for accurate, real-time speech-to-text conversion.
Natural Language Understanding (NLU): Use latency-optimized inference on Amazon Bedrock with models such as Amazon Nova Pro to interpret user intent.
Tool execution and API integration: Integrate backend services and data sources via Pipecat Flows, and leverage the tool-use capabilities of foundation models to perform actions or retrieve information for Retrieval Augmented Generation (RAG).
Natural Language Generation (NLG): Generate coherent responses using Amazon Nova Pro on Amazon Bedrock, balancing quality and latency.
Text-to-Speech (TTS): Use Amazon Polly with generative voices to convert text responses into lifelike speech.
Orchestration framework: Pipecat orchestrates these components, providing a modular, Python-based framework for real-time, multimodal AI agent applications.
Best Practices for Building an Effective AI Voice Agent
The development of highly responsive AI voice agents requires focusing on latency and efficiency. Best practices continue to emerge, but consider the following implementation strategies to achieve natural, human-like interactions:
Minimize conversational latency: To maintain a natural conversational flow, use latency-optimized inference for foundation models (FMs) such as Amazon Nova Pro.
Choose an efficient foundation model: Prioritize smaller, faster foundation models that can deliver quick responses while maintaining quality.
Implement prompt caching: Use prompt caching to optimize for both speed and cost efficiency, especially in complex scenarios that require knowledge retrieval.
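As a sketch of how this can look in practice, the Amazon Bedrock Converse API lets you insert cache checkpoints into the prompt so that a long, static system prompt is cached across turns. The payload below shows the general shape; the model ID and prompt text are placeholders, and you should confirm cache-point support for your chosen model in the Bedrock documentation:

```python
# Illustrative Converse API request body with a prompt-cache checkpoint.
# The static system prompt before the cachePoint marker is cacheable;
# only the user turn varies between requests.
def build_converse_request(system_prompt: str, user_text: str) -> dict:
    return {
        "modelId": "amazon.nova-pro-v1:0",  # placeholder model ID
        "system": [
            {"text": system_prompt},
            # Content before this marker is eligible for caching.
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_text}]}
        ],
    }

request = build_converse_request(
    "You are a helpful voice agent for Example Corp...",  # static instructions
    "What are your opening hours?",
)
# A boto3 bedrock-runtime client would send this via client.converse(**request).
```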
Implement Text-to-Speech (TTS) fillers: Before long-running operations, play natural filler phrases (such as "Let me look that up for you") while the system makes tool calls or longer foundation-model calls, maintaining user engagement.
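A minimal way to implement this pattern is with Python's asyncio: start the slow operation in the background, speak the filler immediately, then speak the result once it is ready. The tool call here is a stand-in that simply sleeps; in a real agent it would be a Bedrock or backend API call:

```python
import asyncio

async def speak(text: str, spoken: list[str]) -> None:
    # Stand-in for streaming text to the TTS service.
    spoken.append(text)

async def slow_tool_call() -> str:
    # Stand-in for a long-running tool or foundation-model call.
    await asyncio.sleep(0.05)
    return "Your appointment is confirmed for 3pm."

async def answer_with_filler(spoken: list[str]) -> None:
    # Kick off the slow call in the background...
    task = asyncio.create_task(slow_tool_call())
    # ...and play the filler while it runs.
    await speak("Let me look that up for you.", spoken)
    spoken.append(await task)

spoken: list[str] = []
asyncio.run(answer_with_filler(spoken))
# spoken -> ["Let me look that up for you.", "Your appointment is confirmed for 3pm."]
```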
Create a robust audio input pipeline: Integrate components such as noise suppression to provide clear audio and improve speech recognition results.
Start simple and iterate: Begin with a basic conversational flow before progressing to complex agentic systems that handle multiple use cases.
Plan for regional availability: Latency-optimized inference and prompt caching may be available only in certain AWS Regions. Evaluate the trade-off between these advanced capabilities and choosing a Region geographically closer to your end users.
Implementation example: Build your own AI voice agent in minutes
This post provides a sample application on GitHub that demonstrates the concepts covered. Using Pipecat, its accompanying state management framework Pipecat Flows, Amazon Bedrock, and Web Real-Time Communication (WebRTC) capabilities from Daily, you can create a working voice agent you can try in minutes.
Prerequisites
The following prerequisites are required to set up the sample application:
- Python 3.10+
- An AWS account with appropriate AWS Identity and Access Management (IAM) permissions for Amazon Bedrock, Amazon Transcribe, and Amazon Polly
- Access to foundation models on Amazon Bedrock
- Access to a Daily API key
- A modern web browser with WebRTC support (such as Google Chrome or Mozilla Firefox)
Implementation procedure
Once you have completed the prerequisites, you can start setting up the sample voice agent.
- Clone the repository:

```shell
git clone https://github.com/aws-samples/build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock
cd build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock/part-1
```

- Set up the environment:

```shell
cd server
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Configure your API keys in `.env`:

```
DAILY_API_KEY=your_daily_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region
```

- Start the server:

```shell
python server.py
```

- Connect via browser at http://localhost:7860 and grant microphone access
- Start a conversation with your AI voice agent
Customizing your voice AI agent
To customize, you can start by:
- Modifying flow.py to change the conversation logic
- Adjusting model selection in bot.py for your latency and quality needs

For more information, see the Pipecat Flows documentation and the README of the code sample on GitHub.
Clean up
The instructions above are for setting up the application in your local environment. The application uses AWS services and Daily through IAM and API credentials. For security, and to avoid unanticipated costs, remove these credentials once you are finished and make sure they can no longer be used to access those services.
Accelerate your voice AI implementation
To accelerate AI voice agent implementations, the AWS Generative AI Innovation Center (GAIIC) partners with customers to identify high-value use cases and develop proof-of-concept (PoC) solutions.
Customer testimonial: InDebted
InDebted, a global fintech transforming the consumer debt industry, is working with AWS to develop voice AI prototypes.
“We believe AI-powered voice agents represent a pivotal opportunity to enhance the human touch in customer engagement in financial services. By integrating AI-enabled voice technology into our operations, our goal is to provide faster, more intuitive access to support that adapts to customers' needs and improves the performance of our contact centre operations,”
says Mike Zhou, Chief Data Officer at InDebted.
By working with AWS and leveraging Amazon Bedrock, organizations like InDebted can create safe, adaptive voice AI experiences that meet regulatory standards while still delivering real, human-centered impact, even in the most challenging financial conversations.
Conclusion
Building intelligent AI voice agents is now more accessible than ever, thanks to open source frameworks like Pipecat and powerful foundation models with latency-optimized inference and prompt caching in Amazon Bedrock.
In this post, we learned about two general approaches to building AI voice agents and dove deep into the cascaded models approach and its key components. These components work together to create an intelligent system that can understand, process, and respond to human speech naturally. By leveraging these rapid advancements in generative AI, you can create sophisticated, responsive voice agents that deliver real value to your users and customers.
To get started with your own voice AI project, try the code samples on GitHub, or contact your AWS account team to learn more about engaging with the AWS Generative AI Innovation Center (GAIIC).
You can also learn about building AI voice agents with Amazon Nova Sonic, a unified speech-to-speech foundation model, in Part 2.
About the authors
Adithya Suresh is a Deep Learning Architect at the AWS Generative AI Innovation Center, where he partners with technology and business teams to build innovative generative AI solutions that address real-world challenges.
Daniel Wirjo is a Solutions Architect at AWS, focused on fintech and SaaS startups. A former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.
Karan Singh is a Generative AI Specialist at AWS, where he works with top-tier third-party foundation model and agentic framework providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise generative AI challenges.
Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific region. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers' adoption of generative AI.
