Inference layer designed for agents

AI models are changing rapidly. The best model for agentic coding today might be a completely different model from a different provider three months from now. Additionally, real-world use cases often require calling more than one model. A customer support agent might use a fast, inexpensive model to classify a user’s messages, a large reasoning model to plan its actions, and a lightweight model to carry out individual tasks.

This means you need access to all models without being financially and operationally tied to a single provider. You also need the right systems in place to monitor costs across providers, stay reliable when one of your providers has an outage, and manage latency no matter where your users are.

These challenges are always present when building with AI, but they become even more pressing when you are building agents. A simple chatbot might make one inference call per user prompt. An agent might chain together 10 calls to complete one task, and suddenly a single slow provider adds 500 ms instead of 50 ms. A single failed request is no longer just a retry; it can trigger a cascade of downstream failures.
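To make the arithmetic concrete, here is a minimal sketch of how per-call overhead and failure rates compound across a sequential chain. The function names are hypothetical, and the 99% per-call success rate is an illustrative assumption rather than a measured figure.

```typescript
// Added latency when every call in a sequential chain pays the same per-call overhead.
function chainedOverheadMs(calls: number, overheadPerCallMs: number): number {
  return calls * overheadPerCallMs;
}

// Probability that every call in the chain succeeds, assuming independent
// failures and no retries (an illustrative simplification).
function chainSuccessRate(calls: number, perCallSuccessRate: number): number {
  return Math.pow(perCallSuccessRate, calls);
}

const chatbotOverheadMs = chainedOverheadMs(1, 50); // one call per prompt: 50 ms
const agentOverheadMs = chainedOverheadMs(10, 50);  // ten chained calls: 500 ms
const agentSuccess = chainSuccessRate(10, 0.99);    // roughly 0.904

console.log({ chatbotOverheadMs, agentOverheadMs, agentSuccess });
```

Under these assumptions, a provider that looks fine for a single call (99% success, 50 ms overhead) leaves roughly one agent task in ten failing somewhere along a ten-call chain.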

Since launching AI Gateway and Workers AI, we’ve seen incredible adoption from developers building AI-powered applications on Cloudflare, and we’ve been shipping quickly to keep up. Over the past few months, we’ve updated our dashboard to add a zero-setup default gateway, automatic retries on upstream failure, and more granular logging controls. Today, we’re making Cloudflare a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.

One catalog, one integration endpoint

Starting today, you can call third-party models using the same AI.run() binding that you already use with Workers AI. If you’re using Workers, switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or another provider is a one-line change.

const response = await env.AI.run('anthropic/claude-opus-4-6', {
  input: 'What is Cloudflare?',
}, {
  gateway: { id: "default" },
});
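Because the model is just a string argument, routing each step of an agent to a different model can stay small. The sketch below assumes a hypothetical mapping from the customer support steps mentioned earlier to the two model IDs that appear in this post. `AgentStep`, `MODEL_FOR_STEP`, `AiBinding`, and `runStep` are illustrative names rather than part of the Workers AI API, and input shapes vary by model, so the caller supplies them.

```typescript
// Steps taken from the customer support example earlier in the post.
type AgentStep = "classify" | "plan" | "execute";

// Hypothetical assignment of models to steps. Both IDs appear in this post;
// substitute whatever the model catalog lists for your use case.
const MODEL_FOR_STEP: Record<AgentStep, string> = {
  classify: "@cf/moonshotai/kimi-k2.5",
  plan: "anthropic/claude-opus-4-6",
  execute: "@cf/moonshotai/kimi-k2.5",
};

// Minimal structural type covering only the call shape shown in this post.
interface AiBinding {
  run(
    model: string,
    inputs: Record<string, unknown>,
    options?: Record<string, unknown>,
  ): Promise<unknown>;
}

// Runs one agent step against the model assigned to it, through the default gateway.
async function runStep(
  ai: AiBinding,
  step: AgentStep,
  inputs: Record<string, unknown>,
): Promise<unknown> {
  return ai.run(MODEL_FOR_STEP[step], inputs, { gateway: { id: "default" } });
}
```

Inside a Worker, this would be called as `runStep(env.AI, "plan", { input: "..." })`. Changing which model handles a step is then an edit to one entry in the mapping.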

For those who don’t use Workers, we plan to release REST API support in the coming weeks, so you can access the full model catalog from any environment.

We’re also excited to announce that you now have access to over 70 models from over 12 providers, all through one API, with one line of code to switch between them and one set of credits to pay for them. And we’re rapidly expanding this.

You can browse our model catalog to find the best model for your use case, from open source models hosted on Cloudflare Workers AI to proprietary models from leading model providers. We’re excited to expand access to models from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu, who will serve their models through AI Gateway. In particular, we are expanding our model offering to include image, video, and audio models so that you can build multimodal applications.

Accessing all your models through one API also means you can manage all your AI spend in one place. Most companies today call an average of 3.5 models, which means no single provider has a comprehensive view of their AI usage. AI Gateway gives you one central place to monitor and manage your AI spending.

By including custom metadata in your requests, you can get cost breakdowns for the attributes that matter most to you, such as spend by free and paid users, individual customers, or specific workflows within your app.

const response = await env.AI.run('@cf/moonshotai/kimi-k2.5',
  {
    prompt: 'What is AI Gateway?'
  },
  {
    metadata: { "teamId": "AI", "userId": 12345 }
  }
);
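To illustrate the kind of breakdown that metadata makes possible, here is a minimal sketch that groups spend by a metadata attribute. The `RequestRecord` shape, the `costUsd` field, and the sample figures are assumptions made for illustration; they are not AI Gateway’s actual log schema or real data.

```typescript
// Hypothetical per-request record: a cost plus the custom metadata sent with the request.
interface RequestRecord {
  costUsd: number;
  metadata: Record<string, string | number>;
}

// Sums cost per distinct value of one metadata key (for example "teamId").
function spendBy(records: RequestRecord[], key: string): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const record of records) {
    const raw = record.metadata[key];
    // Requests that did not carry this key are grouped together.
    const group = raw === undefined ? "(untagged)" : String(raw);
    totals[group] = (totals[group] ?? 0) + record.costUsd;
  }
  return totals;
}

// Made-up sample records, reusing the metadata keys from the example above.
const sampleRecords: RequestRecord[] = [
  { costUsd: 0.25, metadata: { teamId: "AI", userId: 12345 } },
  { costUsd: 0.5, metadata: { teamId: "AI", userId: 67890 } },
  { costUsd: 0.125, metadata: { teamId: "Support", userId: 12345 } },
];

console.log(spendBy(sampleRecords, "teamId")); // { AI: 0.75, Support: 0.125 }
console.log(spendBy(sampleRecords, "userId")); // { '12345': 0.375, '67890': 0.5 }
```

The same grouping works for any attribute you attach, such as a plan tier or a workflow name.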

AI Gateway allows you to access models from all providers through one API. However, there may be times when you need to run a model that is fine-tuned with your own data or optimized for a specific use case. To that end, we’re working to enable users to bring their own models to Workers AI.

The overwhelming majority of our traffic comes from dedicated instances for enterprise customers running custom models on our platform, and we want to offer this to more customers. To do this, we’re using Replicate’s Cog technology, which helps containerize machine learning models.

Cog is designed to be very simple. All you need to do is declare your dependencies in a cog.yaml file and write your inference code in a Python file. Cog abstracts away the hard parts of packaging ML models, such as CUDA dependencies, Python versions, and weight loading.

Example cog.yaml file:

build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"

Example predict.py file, which contains a function to set up the model and a function that runs when an inference request (a prediction) is received:

from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(self,
            image: Path = Input(description="Image to enlarge"),
            scale: float = Input(description="Factor to scale image by", default=1.5)
    ) -> Path:
        """Run a single prediction on the model"""
        # ... pre-processing ...
        output = self.net(input)
        # ... post-processing ...
        return output

You can then run cog build to build the container image and push the Cog container to Workers AI. We will deploy and serve the model for you, and you access it through the regular Workers AI API.

We are working on some big projects that will allow us to offer this to more customers, including customer-facing APIs, Wrangler commands that let you push your own containers, and faster cold starts with GPU snapshots. We’ve been testing this internally with Cloudflare teams and with external customers who are helping guide our vision. If you are interested in becoming a design partner, please contact us. Soon, anyone will be able to package their own models and use them through Workers AI.

Fast path to first token

Using Workers AI models with AI Gateway is especially powerful if you’re building live agents. In this case, the user’s perception of speed is determined by the time to first token, meaning how quickly the agent starts responding, rather than by how long the complete response takes. Even if the total inference takes 3 seconds, getting the first token 50 milliseconds sooner can make the difference between an agent that feels responsive and one that feels sluggish.
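The distinction between time to first token and total time can be made concrete with a small sketch. It assumes your client records a timestamp as each streamed chunk arrives; `ChunkArrival` and `streamTimings` are hypothetical names, and the timestamps are made up to match the 3-second example.

```typescript
// One streamed chunk and the client-side time (in ms) at which it arrived.
interface ChunkArrival {
  atMs: number;
  text: string;
}

// Computes time to first token and total time from recorded chunk arrivals.
// Returns null when no chunk was received.
function streamTimings(
  requestSentAtMs: number,
  chunks: ChunkArrival[],
): { timeToFirstTokenMs: number; totalMs: number } | null {
  const first = chunks[0];
  const last = chunks[chunks.length - 1];
  if (first === undefined || last === undefined) return null;
  return {
    timeToFirstTokenMs: first.atMs - requestSentAtMs,
    totalMs: last.atMs - requestSentAtMs,
  };
}

// Two made-up streams that both finish after 3 seconds but start 50 ms apart.
const slowerStart = streamTimings(0, [
  { atMs: 250, text: "Hel" },
  { atMs: 3000, text: "lo" },
]);
const fasterStart = streamTimings(0, [
  { atMs: 200, text: "Hel" },
  { atMs: 3000, text: "lo" },
]);

console.log(slowerStart, fasterStart);
```

Both streams report the same `totalMs`, so only `timeToFirstTokenMs` captures the difference a user would notice.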

Cloudflare’s network of data centers in 330 cities around the world means AI Gateway sits close to both users and inference endpoints, minimizing network time before streaming begins.

Workers AI also hosts open source models in its public catalog, which now includes large models built specifically for agents, such as Kimi K2.5, as well as real-time audio models. When you call these Cloudflare-hosted models via AI Gateway, your code and inference run on the same global network, which removes extra hops across the public internet and gives your agents the lowest possible latency.

Build reliability with automatic failover

When you build agents, speed is not the only thing users care about; they also care about reliability. Each step in an agent workflow depends on the previous step, so reliable inference is critical: the failure of one call can affect the entire downstream chain.

Through AI Gateway, if you are calling a model that is available on multiple providers and one provider goes down, you are automatically routed to another available provider without having to write your own failover logic.
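For comparison, here is a minimal sketch of the kind of hand-written failover logic this replaces. It is not how AI Gateway is implemented; `callWithFailover` and the provider names in the usage note are hypothetical, and a production version would also need timeouts and a policy for which errors are worth failing over on.

```typescript
// A provider is represented as a name plus a function that performs the inference call.
interface Provider<T> {
  name: string;
  call: () => Promise<T>;
}

// Tries each provider in order and returns the first successful result.
// Throws only if every provider fails.
async function callWithFailover<T>(
  providers: Provider<T>[],
): Promise<{ provider: string; result: T }> {
  const errors: string[] = [];
  for (const { name, call } of providers) {
    try {
      return { provider: name, result: await call() };
    } catch (err) {
      // Record the failure and fall through to the next provider.
      errors.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

A caller would pass an ordered list such as `[{ name: "primary", call: ... }, { name: "secondary", call: ... }]` and receive the result along with the name of the provider that served it.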

If you are building long-running agents with the Agents SDK, your streaming inference calls are also resilient to disconnections. AI Gateway buffers streaming responses as they are generated, independently of the agent’s lifetime. If the agent is interrupted in the middle of inference, it can reconnect to AI Gateway and retrieve the response. There is no need to make a new inference call or pay for the same output tokens twice. Combined with the Agents SDK’s built-in checkpointing, end users will never notice.
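One way to picture the buffering idea is an append-only buffer that readers can resume from an offset. The sketch below is a conceptual illustration only: `BufferedResponse` and its methods are hypothetical and do not describe AI Gateway’s implementation or API.

```typescript
// Holds generated chunks independently of any reader, so a reader that
// disconnects can resume from the last offset it saw.
class BufferedResponse {
  private chunks: string[] = [];
  private finished = false;

  // Called by the producer as the model generates output.
  append(chunk: string): void {
    this.chunks.push(chunk);
  }

  // Called by the producer when generation is complete.
  finish(): void {
    this.finished = true;
  }

  // Returns everything from `offset` onward, plus the offset to resume from next time.
  readFrom(offset: number): { chunks: string[]; nextOffset: number; done: boolean } {
    const chunks = this.chunks.slice(offset);
    return { chunks, nextOffset: offset + chunks.length, done: this.finished };
  }
}

const buffered = new BufferedResponse();
buffered.append("Cloud");
buffered.append("flare ");
const beforeDisconnect = buffered.readFrom(0); // the agent reads two chunks, then drops

// Generation continues while the agent is away.
buffered.append("is ");
buffered.append("fast.");
buffered.finish();

// On reconnect, the agent resumes from its saved offset instead of starting over.
const afterReconnect = buffered.readFrom(beforeDisconnect.nextOffset);
console.log(beforeDisconnect.chunks.concat(afterReconnect.chunks).join(""));
```

Because the reader only asks for chunks past its saved offset, nothing is generated twice.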

The Replicate team has officially joined our AI Platform team, to the point that we no longer even consider ourselves separate teams. We’ve been hard at work integrating Replicate with Cloudflare. This includes bringing all Replicate models to AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. You’ll soon be able to access the models you love on Replicate through AI Gateway, and host the models you deploy on Replicate in Workers AI as well.

To get started, check out the documentation for AI Gateway or Workers AI. To learn more about building agents on Cloudflare, see the Agents SDK.
