Inference is giving AI chip startups a second chance



AI adoption is reaching an inflection point as the focus shifts from training new models to delivering services. For AI startups vying for a piece of Nvidia’s pie, it’s now or never.

Compared to training, inference is a far more diverse workload, and that diversity gives chip startups an opening to carve out their own niche. Large-scale batch inference demands a different mix of compute, memory capacity, and bandwidth than an interactive AI assistant or coding agent.

Because of this, inference deployments are becoming increasingly heterogeneous: some parts of the pipeline are better served by GPUs, while others favor more specialized silicon.

Nvidia’s $20 billion acquisition of Groq in December is a prime example. The startup’s SRAM-heavy architecture meant that, with enough of its LPUs ganged together, Groq could churn out tokens faster than any GPU. But limited compute and aging process technology meant the chips couldn’t scale efficiently to every workload.

Nvidia, for its part, addressed this by shifting the computationally intensive prefill portion of the inference pipeline onto GPUs while keeping the bandwidth-bound decode phase on the shiny new LPUs.
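To make that split concrete, here's a minimal Python sketch of how a disaggregated serving loop might route work. The Request class and the gpu_forward and lpu_step stand-ins are our own illustration, not Nvidia's or Groq's actual software:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache: object = None                    # built by prefill, read by decode
    output_tokens: list[int] = field(default_factory=list)

def gpu_forward(tokens):
    # Stand-in for a real prefill kernel: one big, highly parallel pass over
    # the whole prompt that builds the KV cache. Compute-bound, so GPUs fit.
    return list(tokens)

def lpu_step(kv_cache, generated):
    # Stand-in for a real decode kernel: each step re-reads the weights and
    # KV cache to emit a single token. Bandwidth-bound, so SRAM-rich parts fit.
    return (len(kv_cache) + len(generated)) % 50_000

def serve(req: Request) -> list[int]:
    req.kv_cache = gpu_forward(req.prompt_tokens)       # prefill on the GPU
    for _ in range(req.max_new_tokens):                 # decode on the LPU
        req.output_tokens.append(lpu_step(req.kv_cache, req.output_tokens))
    return req.output_tokens

print(serve(Request(prompt_tokens=[1, 2, 3], max_new_tokens=4)))
```

The point of the split is that each phase hits a different hardware bottleneck, so routing them to different silicon avoids paying for compute you can't feed, or bandwidth you can't use.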

This combination is not unique to Nvidia. The week after GTC, AWS announced its own disaggregated computing platform, which uses custom Trainium accelerators for prefill and dinner-plate-sized wafer-scale accelerators from Cerebras Systems for decode.

Even Intel got in on the fun, announcing a reference design that pairs one of its GPUs (presumably the part teased last Northern Hemisphere fall) for prefill with AI chip startup SambaNova’s RDUs for decode.

So far, most of the wins for AI chip startups have come on the decode side of the equation. SRAM isn’t particularly capacious, but it is incredibly fast, so with enough chips (or, in Cerebras’ case, a big enough chip) it’s great for accelerating decode. But startups aren’t confined to that regime.
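A bit of back-of-the-envelope arithmetic shows why bandwidth rules decode: at batch size one, every generated token has to stream the full weight set past the compute units. The bandwidth figures below are illustrative round numbers of our choosing, not any vendor's spec:

```python
# At batch size 1, decode throughput is capped by how fast the weights can be
# streamed from memory, not by raw FLOPS.

def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       bandwidth_tb_s: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

# A 70B-parameter model at FP16 (2 bytes/param) on roughly 3 TB/s of HBM:
print(f"{max_tokens_per_sec(70, 2, 3):.0f} tok/s")    # ~21 tok/s per stream

# The same model spread across enough SRAM for, say, 80 TB/s aggregate:
print(f"{max_tokens_per_sec(70, 2, 80):.0f} tok/s")   # ~571 tok/s per stream
```

Hence the pattern: pile on SRAM bandwidth and single-stream token rates climb, which is exactly the niche Groq and Cerebras carved out.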

This week, Lumai detailed an optical inference accelerator that uses light rather than electrons to perform the matrix multiplications at the heart of most machine learning workloads, at a fraction of the power of purely digital architectures.

Lumai expects its next-generation Iris Tetra system to deliver an exaOPS of AI performance on a 10kW power budget by 2029.
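For a sense of scale, that target works out to roughly 100 TOPS per watt. The quick arithmetic below is ours, and note that the precision those operations are counted at isn't specified here:

```python
# Lumai's stated target: one exaOPS (1e18 ops/s) within a 10 kW envelope.
exa_ops = 1e18       # operations per second
power_w = 10_000     # watts
print(f"{exa_ops / power_w / 1e12:.0f} TOPS/W")   # -> 100 TOPS/W
```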

Technically, the chip uses a hybrid electro-optical architecture, but most of the computations done during inference are handled by the chip’s optical tensor core.

The company is initially positioning the chip as a standalone replacement for GPUs for compute-intensive inference workloads such as batch processing. In the long term, the company also plans to use the optical accelerator as a prefill processor.

The architecture is still in its infancy, but it can already run billion-parameter models such as Llama 3.1 8B and 70B, and the UK-based startup has made enough progress to open its chips up to neoclouds and hyperscalers for evaluation.

That said, not all AI chip startups are keen on using different chips for prefill and decode. Earlier this week, Tenstorrent announced its RISC-V-based Galaxy Blackhole computing platform, and suffice it to say that its CEO Jim Keller is not a fan of disaggregation.

“Companies across the industry are working together to build accelerators. CPUs run code, GPUs accelerate CPUs, TPUs accelerate GPUs, LPUs accelerate TPUs, and so on. This creates complex solutions that don’t scale as AI models and use cases change. At Tenstorrent, we thought something more general and simple would work,” he said in a statement. ®


