Amazon powers artificial intelligence with custom Trainium chips designed specifically for machine learning – Copyright AFP Mark Felix
Moises Avila
As billions of dollars are poured into artificial intelligence (AI), tech giant Amazon is working to step out of Nvidia’s shadow with a custom Trainium chip designed specifically for machine learning.
Annapurna Labs, an Amazon subsidiary in Austin, Texas, was testing the lifespan of its latest-generation Trainium when AFP recently visited the facility.
Texas has emerged as the El Dorado of the US tech world, attracting investment with cheap energy, relaxed regulations, tax incentives and affordable real estate for large data centers.
Amid a deafening roar, UltraServers equipped with 144 Trainium AI accelerator chips were running at Annapurna during routine pre-delivery inspections.
Amazon Web Services (AWS), the cloud computing arm of the e-commerce giant, long relied on outside chip suppliers but began designing its own after acquiring Israeli startup Annapurna Labs in 2015.
Graviton and Inferentia chips first appeared in 2018, with the former used for general cloud computing and the latter used to power AI models.
The first Trainium debuted in 2020, followed by a second generation billed as offering significantly improved performance.
The Trainium 3 chip, which went live in December, is touted as delivering double the performance of the second generation despite being smaller than a credit card.
Christopher King, director of the Annapurna Institute in Austin, claimed that the latest Trainium chips can reduce the cost of developing and running generative AI models by as much as 40% compared to using graphics processing units (GPUs), which are currently considered the “gold standard” for AI.
– Failure is not an option –
In addition to pricing its Trainium chips competitively, AWS is trying to make reliability a selling point because data centers need to run nonstop for long stretches.
Mark Carroll, Annapurna’s head of engineering, said developing AI requires hundreds of thousands of chips working simultaneously for weeks.
“If there is a failure or unavailability at this stage, you will have to go back or start over,” Carroll said.
Unlike other big AI processor companies, AWS doesn’t sell its own chips.
Instead, AWS uses Trainium only in its own data centers and leases the computing capacity to customers.
The institute said AWS chose to customize the chip to harmonize with its software, particularly the Bedrock platform, which allows customers to choose from a wide range of competing AI models, including those from Anthropic, OpenAI and other rivals.
Trainium is positioned as a cost-saving option in an AI market that is considered “supply constrained” due to insatiable demand for high-performance GPUs made by competitors such as industry leaders Nvidia and AMD.
Trainium 3 is only a few months old, but Annapurna is already designing a new generation of chips.
A release date for Trainium 4 has not yet been announced, but Carroll said it will have six times the processing power of its predecessor.
As Google, Microsoft, OpenAI, Meta and other technology rivals race to develop ever-better AI models, pressure is mounting for chips that make the technology smarter, faster and cheaper while consuming less power.
Nvidia began manufacturing its industry-leading Rubin graphics processing units less than a year after releasing its then-top-of-the-line Blackwell.
The first version of Trainium took about 18 months to create, but the second generation was ready in nine months, and Annapurna is “trying to keep up that pace,” Carroll said.
