AWS Trainium + Cerebras CS-3 solution, deployed in AWS data centers and accessed via Amazon Bedrock, accelerates AI inference
Key points
- Fastest Inference Coming Soon: AWS and Cerebras are partnering to deliver the fastest AI inference available through Amazon Bedrock, launching in the coming months.
- Industry-leading speed and performance: Featuring prefill-optimized AWS Trainium and decode-optimized Cerebras CS-3, this innovative integrated system provides unparalleled performance and speed for AI inference.
- Pioneering cloud collaboration: AWS is the first cloud provider for Cerebras’ decoupled inference solution, available exclusively through Amazon Bedrock.
Amazon Web Services, Inc. (AWS), an Amazon.com, Inc. company (NASDAQ: AMZN), and Cerebras Systems today announced a partnership to deliver the fastest AI inference available for generative AI applications and LLM workloads in the coming months. The solution, deployed in AWS data centers and delivered through Amazon Bedrock, combines servers powered by AWS Trainium, Cerebras CS-3 systems, and Elastic Fabric Adapter (EFA) networking. Later this year, AWS will also offer leading open source LLMs and Amazon Nova models on Cerebras hardware.
This press release features multimedia. Read the full release here: https://www.businesswire.com/news/home/20260313406341/en/
Amazon is deploying the Cerebras Wafer Scale Engine in AWS data centers. Lightning-fast inference will be available through Amazon Bedrock, bringing industry-leading performance to one of the largest hyperscale clouds.
“While inference is where AI delivers real value to customers, speed remains a critical bottleneck for demanding workloads such as real-time coding assistance and interactive applications,” said David Brown, vice president of Compute & ML Services at AWS. “What we’re building with Cerebras solves that problem. By splitting the inference workload between Trainium and CS-3 and connecting them with Amazon’s Elastic Fabric Adapter, each system does what it does best. The result will be orders of magnitude faster, higher-performance inference than what’s available today.”
“By partnering with AWS to build a disaggregated inference solution, we will be able to deliver the fastest inference to our global customer base,” said Andrew Feldman, founder and CEO of Cerebras Systems. “Every company around the world will now be able to benefit from blazingly fast inference within their existing AWS environment.”
How it works: decomposing inference
The Trainium + CS-3 solution enables “inference decomposition,” a technique that separates AI inference into two stages: prompt processing (“prefill”) and output generation (“decode”). The two stages have very different computational profiles. Prefill is naturally parallel, compute-intensive, and needs only moderate memory bandwidth. Decode, on the other hand, is serial in nature, light on compute, and heavily memory-bandwidth-bound. Decode typically accounts for the majority of inference time, since each output token must be generated in sequence.
Because the two stages pose different computational challenges, each benefits from a different compute architecture, with low-latency, high-bandwidth EFA networking linking the stages. By deliberately decomposing the inference problem, with Trainium optimized for prefill and the Cerebras CS-3 optimized for decode, the solution addresses each challenge with specialized hardware.
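As a rough conceptual illustration only (not the actual Bedrock/Trainium/CS-3 implementation), the toy NumPy sketch below shows the two stages described above: prefill processes all prompt tokens in one parallel pass and builds a key/value cache, while decode then generates tokens one at a time, rereading that cache at every step. All names here (prefill, decode_step, toy_attention) are hypothetical.

```python
# Conceptual sketch of the prefill/decode split; names are hypothetical.
import numpy as np

D = 64  # toy model dimension

def toy_attention(q, K, V):
    # Scaled dot-product attention of one query against the cached keys/values.
    scores = q @ K.T / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def prefill(prompt_embeddings):
    # Prefill: process every prompt token in parallel (compute-bound) and
    # materialize the key/value cache that decode will reuse.
    K = prompt_embeddings.copy()   # stand-in for projected keys
    V = prompt_embeddings.copy()   # stand-in for projected values
    last_hidden = toy_attention(prompt_embeddings[-1], K, V)
    return last_hidden, (K, V)

def decode_step(hidden, kv_cache):
    # Decode: generate one token at a time (memory-bandwidth-bound), reading
    # the whole KV cache for each new token and appending to it.
    K, V = kv_cache
    K = np.vstack([K, hidden])
    V = np.vstack([V, hidden])
    next_hidden = toy_attention(hidden, K, V)
    return next_hidden, (K, V)

prompt = np.random.randn(128, D)     # 128 prompt tokens, handled at once
hidden, cache = prefill(prompt)      # stage 1: prefill
for _ in range(16):                  # stage 2: sequential decode
    hidden, cache = decode_step(hidden, cache)
```

The sketch makes the asymmetry visible: prefill touches every prompt token in a single batched pass, while decode loops token by token and scans the growing cache each iteration, which is why decode dominates latency and benefits most from very high memory bandwidth.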
Built on the AWS Nitro System, the foundation of AWS’s secure and high-performance cloud infrastructure, this new solution ensures that Cerebras CS-3 systems and Trainium-powered instances operate with the same security, isolation, and operational consistency that customers expect from AWS.
AWS Trainium for prefill, Cerebras CS-3 for decoding
Trainium is Amazon’s purpose-built AI chip, designed to deliver scalable performance and cost efficiency for training and inference across a wide range of generative AI workloads. Two of the world’s leading AI labs, Anthropic and OpenAI, are building on Trainium. Anthropic has named AWS its primary training partner and uses Trainium to train and deploy its models, while OpenAI consumes 2 gigawatts of Trainium capacity through AWS infrastructure to support the demands of stateful runtime environments, frontier models, and other advanced workloads. Since its recent release, Trainium3 has been widely adopted, with organizations across a variety of industries committing to significant capacity.
The Cerebras CS-3 is the world’s fastest AI inference system, achieving thousands of times the memory bandwidth of the fastest GPUs. Reasoning models now account for most inference compute, generating far more tokens per request as they “think through” a problem, which makes accelerating this part of the workflow all the more important. OpenAI, Cognition, Mistral, and others use Cerebras to accelerate their most demanding workloads, especially agentic coding, where developer productivity is limited by inference speed.
In the disaggregated solution, the CS-3 is dedicated entirely to accelerating decode, dramatically increasing the volume of output tokens it can generate at high speed. Trainium handles prefill, the CS-3 handles decode, and the high-speed EFA networking that connects them lets each processor deliver maximum token throughput for its part of the workload.
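To make the disaggregated flow concrete, here is a minimal sketch that models the handoff described above, assuming a simple in-process queue as a stand-in for the EFA link between a prefill worker (the Trainium role) and a decode worker (the CS-3 role). The names and structure are illustrative only and do not describe the production system.

```python
# Conceptual sketch of disaggregated serving; the queue stands in for EFA.
import queue
import numpy as np

D = 64
efa_link = queue.Queue()  # stand-in for low-latency, high-bandwidth networking

def prefill_worker(prompt_embeddings):
    # "Trainium" role: process the full prompt in parallel, then hand the
    # KV cache and last hidden state across the link to the decode side.
    kv_cache = (prompt_embeddings.copy(), prompt_embeddings.copy())
    last_hidden = prompt_embeddings[-1]
    efa_link.put((last_hidden, kv_cache))

def decode_worker(num_tokens):
    # "CS-3" role: pull the cache off the link and generate tokens serially.
    hidden, (K, V) = efa_link.get()
    outputs = []
    for _ in range(num_tokens):
        scores = hidden @ K.T / np.sqrt(D)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        hidden = weights @ V
        K = np.vstack([K, hidden])
        V = np.vstack([V, hidden])
        outputs.append(hidden)
    return outputs

prefill_worker(np.random.randn(128, D))  # prefill stage on one device
tokens = decode_worker(16)               # decode stage on the other
```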
About Amazon Web Services
Amazon Web Services (AWS) is guided by a focus on our customers, a rapid pace of innovation, a commitment to operational excellence, and a long-term mindset. For nearly two decades, AWS has built one of the fastest-growing enterprise technology businesses in history by democratizing technology and making cloud computing and generative AI accessible to organizations of all sizes and industries. Millions of customers trust AWS to accelerate innovation, transform their businesses, and shape their future. With the most comprehensive AI capabilities and global infrastructure footprint, AWS enables builders to turn big ideas into reality. For more information, visit aws.amazon.com and follow @AWSNewsroom.
About Cerebras Systems
Cerebras Systems builds the world’s fastest AI infrastructure. We are a team of pioneering computer architects, computer scientists, AI researchers, and engineers of all kinds. We believe that fast AI changes the world, so we’ve come together to make AI incredibly fast through innovation and invention. Our flagship technology, the Wafer Scale Engine 3 (WSE-3), is the world’s largest and fastest AI processor. At 56x the size of the largest GPUs, the WSE-3 delivers inference and training more than 20x faster than competitors while using a fraction of the power per unit of compute. Leading enterprises, research institutions, and governments on four continents choose Cerebras to run their AI workloads. Cerebras solutions are available on-premises and in the cloud. For more information, visit cerebras.ai or follow us on LinkedIn, X, and Threads.
This press release contains forward-looking statements, including statements regarding our products and the anticipated benefits of the transactions described herein. These statements involve risks and uncertainties that could cause actual results to differ materially. Neither we nor any other person accepts responsibility for the accuracy or completeness of any forward-looking statements. The forward-looking statements contained in this press release relate only to events and information as of the date of this press release. Cerebras undertakes no obligation to update or revise any forward-looking statements, whether as a result of new information, future events or otherwise, except as otherwise required by law.
View source version on businesswire.com: https://www.businesswire.com/news/home/20260313406341/ja/
Media contact
pr@zmcommunications.com
