Cloudflare is the perfect place to build real-time voice agents

The way we interact with AI is fundamentally changing. Text-based interfaces like ChatGPT have shown what is possible, but in terms of interaction that is only the beginning. Humans don't communicate by text alone: we talk, we show things, and we interrupt and clarify in real time. Voice AI brings these natural interaction patterns to your applications.

Today, we're excited to announce new capabilities that make it easier than ever to build real-time, voice-enabled AI applications on Cloudflare's global network. These new features form a complete platform for developers building the next generation of conversational AI experiences, and they can also serve as building blocks for more sophisticated AI agents running across the platform.

We are launching:

  • Cloudflare Realtime Agents – A runtime for orchestrating voice AI pipelines at the edge

  • Pipe raw WebRTC audio as PCM in Workers – WebRTC audio can now be connected directly to AI models or to existing complex media pipelines

  • Workers AI WebSocket support – Real-time AI inference with models such as PipeCat's smart-turn-v2

  • Deepgram on Workers AI – Speech-to-text and text-to-speech running in over 330 cities around the world

Why is real-time AI important?

Building voice AI applications is difficult today. You need to coordinate multiple services, such as speech-to-text, language models, and text-to-speech, while managing complex audio pipelines.


Building production voice AI requires orchestrating a complex symphony of technologies. You need low-latency speech recognition, an intelligent language model that can understand context and handle interruptions, and natural-sounding speech synthesis, and all of this needs to happen in under 800 milliseconds. This latency budget is unforgiving. Every millisecond counts: 40ms for microphone input, 300ms for transcription, 400ms for LLM inference, 150ms for text-to-speech. Poor infrastructure choices or additional delays from distant servers can turn a delightful experience into a frustrating one.
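
As a rough illustration of how tight this budget is, the sketch below adds up the four example figures quoted above and compares them with the 800 ms target. The stage labels and the helper `budget_report` are illustrative names, not part of any Cloudflare API. Taken at face value, the example figures sum to 890 ms, so any extra network delay pushes the experience further past the target.

```python
# Per-stage latency figures quoted in the text, in milliseconds.
STAGE_LATENCY_MS = {
    "microphone input": 40,
    "transcription": 300,
    "LLM inference": 400,
    "text-to-speech": 150,
}
TARGET_MS = 800  # end-to-end target quoted in the text


def budget_report(stages: dict[str, int], target_ms: int) -> tuple[int, int]:
    """Return (total latency, headroom); negative headroom means over budget."""
    total = sum(stages.values())
    return total, target_ms - total


total_ms, headroom_ms = budget_report(STAGE_LATENCY_MS, TARGET_MS)
print(f"total={total_ms} ms, headroom={headroom_ms} ms")
```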

That's why we're building real-time AI tools: we want to make real-time voice AI as easy to deploy as a static website. We're also witnessing a critical inflection point, where conversational AI is moving from experimental demos to production-ready systems that can scale globally. If you're already a developer in the real-time AI ecosystem, we want to build the best building blocks for you to get the lowest latency by leveraging the 330+ data centers Cloudflare has built.

Introducing Cloudflare Realtime Agents

Cloudflare Realtime Agents is a simple runtime for orchestrating voice AI pipelines that runs on Cloudflare's global network, as close to your users as possible. Instead of managing complex infrastructure yourself, you can focus on building great conversational experiences.


Here is what happens when a user connects to your voice AI application:

  1. WebRTC connection – Audio streams from the user's device to the nearest Cloudflare location over WebRTC, using the Cloudflare RealtimeKit mobile or web SDKs

  2. AI pipeline orchestration – A pre-configured pipeline runs: speech-to-text → LLM → text-to-speech, with support for interruption detection and turn-taking

  3. Your configured runtime options, callbacks, and tools run

  4. Response delivery – The generated audio streams back to the user with minimal latency

The magic lies in how this is designed as composable building blocks. You are not locked into a rigid pipeline: you can configure data flows, add tees and joins, and control exactly how your AI agent works.
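
To make the idea of composable stages and tees concrete, here is a minimal, library-free sketch. It is not the Realtime Agents API: `compose`, `tee`, and the stand-in stages are hypothetical names chosen for illustration, and every stage is simplified to a function from text to text.

```python
from typing import Callable, Iterable

# A stage is any function from text to text. Real stages would wrap
# speech-to-text, an LLM, or text-to-speech; these stand-ins are hypothetical.
Stage = Callable[[str], str]


def compose(stages: Iterable[Stage]) -> Stage:
    """Chain stages so the output of one becomes the input of the next."""
    stage_list = list(stages)

    def run(value: str) -> str:
        for stage in stage_list:
            value = stage(value)
        return value

    return run


def tee(sink: Callable[[str], None]) -> Stage:
    """Pass the value through unchanged while also handing a copy to `sink`."""

    def stage(value: str) -> str:
        sink(value)
        return value

    return stage


transcript_log: list[str] = []

pipeline = compose([
    str.strip,                         # stand-in for transcript cleanup
    tee(transcript_log.append),        # side branch: record the transcript
    lambda text: f"You said: {text}",  # stand-in for the LLM step
    str.upper,                         # stand-in for the final output stage
])

print(pipeline("  hello there  "))
print(transcript_log)
```

Because each stage has the same shape, inserting, removing, or reordering steps only changes the list passed to `compose`.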

Take a look at MyTextHandler from the diagram above, for example. It is a component that takes in text and returns text, inserted after speech-to-text and before text-to-speech.

class MyTextHandler extends TextComponent {
	env: Env;

	constructor(env: Env) {
		super();
		this.env = env;
	}

	async onTranscript(text: string) {
		const { response } = await this.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
			prompt: "You are a wikipedia bot, answer the user query:" + text,
		});
		this.speak(response!);
	}
}

Agents are JavaScript classes that extend RealtimeAgent and initialize a pipeline consisting of various text-to-speech, speech-to-text, text-to-text, and even speech-to-speech transformations.

export class MyAgent extends RealtimeAgent {
	constructor(ctx: DurableObjectState, env: Env) {
		super(ctx, env);
	}

	async init(agentId: string, meetingId: string, authToken: string, workerUrl: string, accountId: string, apiToken: string) {
		// Construct your text processor for generating responses to text
		const textHandler = new MyTextHandler(this.env);
		// Construct a Meeting object to join the RTK meeting
		const transport = new RealtimeKitTransport(meetingId, authToken, [
			{
				media_kind: 'audio',
				stream_kind: 'microphone',
			},
		]);
		const { meeting } = transport;

		// Construct a pipeline to take in meeting audio, transcribe it using
		// Deepgram, and pass our generated responses through ElevenLabs to
		// be spoken in the meeting
		await this.initPipeline(
			[transport, new DeepgramSTT(this.env.DEEPGRAM_API_KEY), textHandler, new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY), transport],
			agentId,
			workerUrl,
			accountId,
			apiToken,
		);

		// The RTK meeting object is accessible to us, so we can register handlers
		// on various events like participant joins/leaves, chat, etc.
		// This is optional
		meeting.participants.joined.on('participantJoined', (participant) => {
			textHandler.speak(`Participant Joined ${participant.name}`);
		});
		meeting.participants.joined.on('participantLeft', (participant) => {
			textHandler.speak(`Participant Left ${participant.name}`);
		});

		// Make sure to actually join the meeting after registering all handlers
		await meeting.rtkMeeting.join();
	}

	async deinit() {
		// Add any other cleanup logic required
		await this.deinitPipeline();
	}
}

View the complete example in the developer documentation and get your own Realtime Agent running. You can also view Realtime Agents on your dashboard.

What makes Realtime Agents powerful is its flexibility:

  • Many AI provider options – Use models on Workers AI, OpenAI, Anthropic, or any provider through AI Gateway

  • Multiple I/O modes – Accept audio and/or text and respond with audio and/or text

  • Stateful coordination – Maintain context throughout the conversation without managing complex state yourself

  • Speed and flexibility – Use RealtimeKit to manage WebRTC sessions and UI for faster development, or, for more control over your stack, connect directly using any standard WebRTC client or raw WebSockets

  • Integrate with the Cloudflare Agents SDK

During the open beta starting today, the Cloudflare Realtime Agents runtime is free to use and works with a variety of AI models:

  • Speech and audio: Integration with platforms such as ElevenLabs and Deepgram.

  • LLM inference: Flexible options to use large language models through Cloudflare Workers AI and AI Gateway, connect to third-party models such as OpenAI, Gemini, Grok, and Claude, or bring your own custom models.

Pipe raw WebRTC audio as PCM in Workers

For developers who need the most flexibility in their applications beyond Realtime Agents, we are exposing the raw WebRTC audio pipeline directly to Workers.

WebRTC audio in Workers works by leveraging Cloudflare's Realtime SFU, which converts WebRTC audio in the Opus codec to PCM and streams it to any WebSocket endpoint you specify. This means you can use Workers to implement:

  • Live transcription – Stream audio from video calls directly to transcription services

  • Custom AI pipelines – Send audio to AI models without setting up complex infrastructure

  • Recording and processing – Save, audit, or analyze audio streams in real time
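
As a small example of the "analyze" case, the sketch below computes the loudness of one PCM chunk as it might arrive over the WebSocket. The sample format is an assumption (16-bit signed little-endian, single channel); check the Realtime SFU documentation for the actual framing. `pcm16_rms` is an illustrative helper, not a Cloudflare API.

```python
import math
import struct


def pcm16_rms(chunk: bytes) -> float:
    """RMS level of 16-bit signed little-endian PCM, scaled to the range 0..1."""
    sample_count = len(chunk) // 2
    if sample_count == 0:
        return 0.0
    # Ignore a trailing odd byte, if any, so unpack sees whole samples only.
    samples = struct.unpack(f"<{sample_count}h", chunk[: sample_count * 2])
    mean_square = sum(s * s for s in samples) / sample_count
    return math.sqrt(mean_square) / 32768.0


silence = struct.pack("<4h", 0, 0, 0, 0)
half_scale = struct.pack("<4h", 16384, -16384, 16384, -16384)

print(pcm16_rms(silence))
print(pcm16_rms(half_scale))
```

A level like this can drive simple decisions, such as skipping silent chunks before sending audio to a transcription service.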


WebSockets vs. WebRTC for voice AI

Both WebSockets and WebRTC can handle audio for AI services, but they work best in different situations. WebSockets are ideal for server-to-server communication, work fine when you don't need ultra-fast responses, and are great for testing and experimentation. However, when you are building apps that need real-time conversations with low latency, WebRTC is the better option.

WebRTC has several advantages that make it well suited to live audio streaming. It uses UDP rather than TCP, which prevents audio delays caused by a lost packet holding up the entire stream (head-of-line blocking, a common topic on this blog). WebRTC's Opus audio codec automatically adjusts to network conditions and handles packet loss gracefully. WebRTC also includes built-in features such as echo cancellation and noise reduction, which you would have to build separately on top of WebSockets.
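
Head-of-line blocking is easy to see in a toy model. The sketch below uses made-up arrival times and two hypothetical helpers to compare when each audio packet becomes usable under in-order delivery (TCP-like) and under independent delivery (UDP-like). It is a simplification: a real WebRTC stack would typically conceal or skip the late packet rather than wait for it.

```python
def in_order_delivery(arrival_ms: list[float]) -> list[float]:
    """Delivery times when each packet must wait for all earlier ones (TCP-like)."""
    delivered = []
    latest = 0.0
    for arrival in arrival_ms:
        latest = max(latest, arrival)
        delivered.append(latest)
    return delivered


def independent_delivery(arrival_ms: list[float]) -> list[float]:
    """Delivery times when each packet is usable as soon as it arrives (UDP-like)."""
    return list(arrival_ms)


# Five audio packets; the second one is lost and its retransmission only
# arrives at 250 ms. All numbers are invented for illustration.
arrivals = [20.0, 250.0, 60.0, 80.0, 100.0]

print(in_order_delivery(arrivals))     # later packets are stuck behind the second
print(independent_delivery(arrivals))  # only the second packet is late
```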

With this feature, you can use WebRTC for client-to-server communication and leverage Cloudflare to convert it to familiar WebSockets for server-to-server communication and backend processing.

The power of Workers + WebRTC

Once WebRTC audio is converted to WebSockets, you get PCM audio at the original sample rate, and from there you can perform any task on or off the Cloudflare developer platform:

  • Resample audio and send it to different AI providers

  • Run WebAssembly-based audio processing

  • Build complex applications with Durable Objects, Alarms, and other Workers primitives

  • Deploy containerized processing pipelines with Workers Containers
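
Resampling is often the first of these steps, because AI providers may expect a different sample rate from the one the audio arrives at. The sketch below is a minimal linear-interpolation resampler in plain Python; the function name and the 48 kHz to 16 kHz rates are assumptions for illustration. It applies no low-pass filter, so a production pipeline that downsamples should use a proper resampling library to avoid aliasing.

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation between neighbouring input samples."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


ramp_48k = [float(n) for n in range(12)]  # 12 samples at an assumed 48 kHz
ramp_16k = resample_linear(ramp_48k, 48_000, 16_000)
print(ramp_16k)
```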

The WebSocket works bidirectionally, so data sent back over the WebSocket becomes available as a WebRTC track on the Realtime SFU, ready to be consumed within WebRTC.

To illustrate this setup, we've made a simple WebRTC application demo that uses the ElevenLabs API for text-to-speech.

Visit the Realtime SFU developer documentation to learn how to get started.

Real-time AI inference using WebSockets

WebSockets provide the backbone of real-time AI pipelines because they are a low-latency, bidirectional primitive with ubiquitous support in developer tooling, especially for server-to-server communication. While HTTP works well for many use cases such as chat and batch inference, real-time voice AI needs persistent, low-latency connections when talking to AI inference servers. To support real-time AI workloads, Workers AI now supports WebSocket connections for select models.

Launching with PipeCat SmartTurn V2

The first model with WebSocket support is PipeCat's smart-turn-v2 turn detection model, a critical component of natural conversation. A turn detection model determines when the speaker has finished talking and it is appropriate for the AI to respond. Getting this right is the difference between an AI that constantly interrupts and one that feels natural to talk to.

Below is an example of how to call smart-turn-v2 running on Workers AI.

"""
Cloudflare AI WebSocket Inference - With PipeCat's smart-turn-v2
"""

import asyncio
import websockets
import json
import numpy as np

# Configuration
ACCOUNT_ID = "your-account-id"
API_TOKEN = "your-api-token"
MODEL = "@cf/pipecat-ai/smart-turn-v2"

# WebSocket endpoint
WEBSOCKET_URL = f"wss://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}?dtype=uint8"

async def run_inference(audio_data: bytes) -> dict:
    async with websockets.connect(
        WEBSOCKET_URL,
        additional_headers={
            "Authorization": f"Bearer {API_TOKEN}"
        }
    ) as websocket:
        await websocket.send(audio_data)
        
        response = await websocket.recv()
        result = json.loads(response)
        
        # Response format: {'is_complete': True, 'probability': 0.87}
        return result

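def generate_test_audio():
    # Clip to the valid uint8 range before casting so that out-of-range
    # values saturate instead of wrapping around
    noise = np.random.normal(128, 20, 8192)
    noise = np.clip(noise, 0, 255).astype(np.uint8)

    return noise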

async def demonstrate_inference():
    # Generate test audio
    noise = generate_test_audio()
    
    try:
        print("\nTesting noise...")
        noise_result = await run_inference(noise.tobytes())
        print(f"Noise result: {noise_result}")
        
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    asyncio.run(demonstrate_inference())
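
Once a result comes back, the agent still has to decide whether to start speaking. The sketch below shows one way to do that with the response format from the comment above. The helper `should_respond` and the 0.5 default threshold are assumptions for illustration, as is the reading of `probability` as the model's confidence that the turn is complete.

```python
def should_respond(result: dict, threshold: float = 0.5) -> bool:
    """Decide whether the agent should start speaking, given one inference result."""
    is_complete = bool(result.get("is_complete"))
    probability = float(result.get("probability", 0.0))
    return is_complete and probability >= threshold


print(should_respond({"is_complete": True, "probability": 0.87}))
print(should_respond({"is_complete": False, "probability": 0.12}))
print(should_respond({"is_complete": True, "probability": 0.87}, threshold=0.9))
```

Raising the threshold makes the agent wait longer before replying, which trades responsiveness for fewer interruptions.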

Deepgram on Workers AI

On Wednesday, we announced that Deepgram's speech-to-text and text-to-speech models are available on Workers AI. This means:

  • Low latency – Speech recognition happens at the edge, close to users, on the same network as your Workers

  • WebRTC audio processing – Without leaving the Cloudflare network

  • State-of-the-art audio ML models – Powerful, capable, and fast audio models, available directly through Workers AI

  • Global reach – Automatically leverage Cloudflare's global network in over 330 cities

Deepgram is a popular choice for voice AI applications. By building your voice AI systems on the Cloudflare platform, you get access to powerful models and the lowest-latency infrastructure to give your application a natural, responsive experience.

Are you interested in other real-time AI models running on Cloudflare?

If you're developing AI models for real-time applications, we want to run them on Cloudflare's network. Whether you have your own proprietary models or need ultra-low-latency inference at scale with open source models, reach out to us.

All of these features are available now.

Want to pick the brains of the engineers who built this? Join them for live demos, Q&A, and technical deep dives at Cloudflare Connect in Las Vegas. Explore the full schedule and register.



