DeepSeek’s new model significantly reduces inference costs • The Register

Chinese AI darling DeepSeek is back with a new open-weight large language model that promises performance comparable to the best proprietary LLMs in the US. Perhaps more importantly, it also promises significantly lower inference costs and expanded support for Huawei's Ascend family of AI accelerators.

Announced Friday, DeepSeek V4 is available for download from popular model repositories such as Hugging Face, via the company's API, and in two new flavors. The first is a smaller 284-billion-parameter Flash mixture-of-experts (MoE) model with 13 billion active parameters; the larger of the two is a 1.6-trillion-parameter model, of which 49 billion parameters are active at any given time.
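The active-versus-total parameter split quoted above is the hallmark of mixture-of-experts designs: a router sends each token to a handful of experts, so only a fraction of the weights do any work per token. Here's a minimal sketch in NumPy; the layer sizes, expert count, and top-k value are illustrative, not DeepSeek's actual configuration.

```python
# Toy mixture-of-experts routing: only top-k experts run per token,
# so active parameters are a small fraction of the total.
# All dimensions below are illustrative, not DeepSeek's layer sizes.
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, d_model, d_ff = 8, 2, 16, 64
# Each expert is a small two-layer MLP.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route token x to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]               # indices of chosen experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top):
        w1, w2 = experts[idx]
        out += gate * (np.maximum(x @ w1, 0) @ w2)  # ReLU MLP expert
    return out

token = rng.standard_normal(d_model)
y = moe_forward(token)

total_params = n_experts * 2 * d_model * d_ff
active_params = top_k * 2 * d_model * d_ff
print(f"total {total_params}, active {active_params} "
      f"({active_params / total_params:.0%} per token)")
```

The same logic scales up: routing each token to a couple of experts out of many is how a 1.6-trillion-parameter model can touch only 49 billion weights per token.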

V4-Pro was trained on 33 trillion tokens and, if DeepSeek is to be believed, outperforms all open-weight LLMs while matching the best proprietary models in the West across its suite of benchmarks.

How DeepSeek says its V4 model stacks up against the competition.

Of course, these claims should be taken with a grain of salt. DeepSeek has made a name for itself with its strong-performing V3 and R1 families of models, but strong scores on off-the-shelf benchmarks don't guarantee a model will hold up in real-world applications.

DeepSeek V4-Pro ought to be much better than the company's previous efforts: the new model has nearly a trillion more parameters than V3 and uses more active parameters during inference. And, as DeepSeek demonstrated with V3, large frontier models can be trained using less compute than previously thought. Still, benchmarks don't tell the whole story.

Under the hood, DeepSeek V4 introduces several architectural changes that, according to its developers, should translate to significantly lower serving costs.

The first is fairly simple: DeepSeek is now shipping a second, smaller Flash model alongside its flagship. The Flash model requires less infrastructure to run and delivers a more responsive user experience at lower cost; smaller models simply cost less to serve.

This isn't a new strategy per se, but it's one DeepSeek is only now adopting for its own models.

An even bigger, more meaningful change is in how DeepSeek calculates attention. A model's attention mechanism governs how prompt tokens are turned into the key-value pairs used to generate output tokens.

In a paper published alongside the new model, DeepSeek researchers describe a hybrid attention mechanism that combines two techniques, sparse attention and heavily compressed attention, to reduce both the compute required during inference and the memory required for the KV cache used to track model state.

The latter element is key to DeepSeek V4's efficiency, as these caches can grow quite large, and inference providers tend to offload them to system memory or flash storage to avoid cold-start penalties. A more heavily compressed KV cache requires less memory and storage in large-scale inference deployments.

Combined, these techniques allow the model to support a 1-million-token context window while using 9.5 to 13.7 times less memory than DeepSeek V3.2.
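To see why the KV cache dominates memory at these context lengths, some back-of-envelope arithmetic helps. The dimensions below are hypothetical stand-ins, not DeepSeek's published figures, comparing a plain multi-head-attention cache against a compressed per-token latent:

```python
# Back-of-envelope KV-cache sizing for a long-context decoder.
# Model dimensions here are illustrative, not DeepSeek's actual figures.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # Two tensors (K and V) per layer, each context_len x (heads * head_dim).
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Plain multi-head attention at FP16, 1M-token context.
plain = kv_cache_bytes(n_layers=60, n_kv_heads=64, head_dim=128,
                       context_len=1_000_000, bytes_per_elem=2)

# A compressed cache: per token, store one small latent vector instead
# of full per-head keys and values (512-dim latent at FP16, hypothetical).
compressed = 60 * 512 * 1_000_000 * 2

print(f"plain:      {plain / 2**30:,.0f} GiB")
print(f"compressed: {compressed / 2**30:,.0f} GiB "
      f"({plain / compressed:.1f}x smaller)")
```

These particular numbers compare against uncompressed multi-head attention; the article's 9.5 to 13.7x figure is relative to V3.2, which already compressed its cache. Either way, at million-token contexts the cache, not the weights, is what fills the memory.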

To further trim memory usage, DeepSeek continues its tradition of using lower-precision data types: DeepSeek V3 was one of the first open-weight models trained in FP8.

Both V4 models use a combination of FP8 and FP4 precision; specifically, the developers applied quantization-aware training to the MoE experts' weights.

As you might expect, FP4 requires half the memory of FP8 to store model weights, a significant savings if you can live with the loss in accuracy.
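The memory arithmetic behind the precision choice is straightforward. Taking the article's 1.6-trillion-parameter figure for V4-Pro, each halving of precision halves the weight footprint:

```python
# Weight storage at different precisions for a 1.6T-parameter model.
# The parameter count is from the article; the arithmetic is the point.
params = 1.6e12

tb = {bits: params * bits / 8 / 1e12 for bits in (16, 8, 4)}
for bits, size in tb.items():
    print(f"FP{bits}: {size:.1f} TB of weights")
```

At FP4, the weights fit in 0.8 TB instead of 1.6 TB at FP8, which translates directly into fewer accelerators needed to hold the model.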

DeepSeek’s architectural improvements are not limited to inference. In V4, model developers introduced a new optimizer called Muon. This is designed to speed up convergence and improve training stability.
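Muon is a published optimizer design: momentum on the gradient, followed by an approximate orthogonalization of the update matrix via a Newton-Schulz iteration. A minimal sketch of that idea follows; the iteration coefficients match the commonly circulated recipe, but this is an illustration of the technique, not DeepSeek's training code.

```python
# Minimal sketch of a Muon-style update: momentum accumulation, then an
# orthogonalization of the 2D update matrix via Newton-Schulz iteration.
# Coefficients follow the commonly published quintic recipe; this is an
# illustration, not DeepSeek's actual implementation.
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately flatten g's singular values toward 1 (orthogonalize)."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)       # normalize by Frobenius norm
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    buf = momentum * buf + grad              # momentum accumulation
    update = newton_schulz_orth(buf)         # orthogonalized direction
    return w - lr * update, buf

rng = np.random.default_rng(2)
w = rng.standard_normal((8, 8))
buf = np.zeros_like(w)
w, buf = muon_step(w, grad=rng.standard_normal((8, 8)), buf=buf)
print("weight norm after one step:", round(float(np.linalg.norm(w)), 3))
```

The appeal of the orthogonalized update is that it keeps the step well-conditioned across all directions of the weight matrix, which is where the claimed convergence and stability gains come from.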

Homegrown models on homegrown hardware

Perhaps the most interesting, but least detailed, element of the new models is the hardware they run on. DeepSeek V3 was heavily optimized for Nvidia's Hopper GPUs; V4 is validated to work on both Nvidia and Huawei accelerators.

The DeepSeek V4 paper mentions the chips only in passing, saying: "We apply this fine-grained EP [expert parallelism] scheme on both Nvidia GPU and Ascend NPU platforms."

To be clear, this doesn't mean the model was trained entirely on Huawei hardware, only that DeepSeek has verified the Chinese telecom giant's AI accelerators can run it.

DeepSeek likely used a combination of Nvidia GPUs for pre-training and Huawei accelerators for reinforcement learning, the latter being an inference-heavy post-training step used to teach the model new skills, behaviors, and reasoning chains. However, the paper doesn't directly address this.

Inference generally poses a lower barrier to entry for new chipmakers. But at one point, DeepSeek also tried to train its models on Huawei silicon. That effort was reportedly derailed by flaky chips, glacial interconnects, and an immature software stack, ultimately sending DeepSeek back to Nvidia hardware.

Finally, given V4's use of 4-bit data types, some might assume DeepSeek has gotten its hands on Nvidia's Blackwell accelerators, which the GPU giant isn't permitted to sell in China. But Blackwell isn't strictly necessary.

Hopper GPUs lack hardware acceleration for FP4, but they can still use the data type in a weight-only fashion. This approach offers no floating-point performance benefit, but it does reduce the memory footprint and bandwidth required for both training and inference, making it a worthwhile tradeoff for many use cases.
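Weight-only quantization of this sort can be sketched in a few lines: weights are stored as 4-bit integer codes plus a scale, and are expanded back to FP16 immediately before the matmul, so no FP4 arithmetic units are needed. The symmetric per-tensor scheme below is a common illustrative choice, not DeepSeek's actual kernel:

```python
# Weight-only 4-bit quantization sketch: weights live in memory as 4-bit
# codes and are expanded to FP16 just before the matmul, so no FP4
# arithmetic hardware is required. Scheme and scale are illustrative.
import numpy as np

def quantize_4bit(w):
    """Symmetric per-tensor quantization to 4-bit integer codes [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Expand back to FP16 for the actual compute.
    return q.astype(np.float16) * np.float16(scale)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4)).astype(np.float16)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)

x = rng.standard_normal(4).astype(np.float16)
# The matmul runs in FP16 even though storage was 4-bit codes.
y = x @ w_hat
print("max quantization error:", float(np.abs(w - w_hat).max()))
```

Real serving kernels pack two 4-bit codes into each byte and typically use per-group rather than per-tensor scales; the packing is where the 2x memory saving over FP8 actually comes from.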

Setting prices

DeepSeek V4 is currently in preview, with both basic and tuned versions of the model available for download or via the API.

Not surprisingly, the company is offering API access to the smaller Flash model at the discounted rate of $0.14 per million input tokens (uncached) and $0.28 per million output tokens.

The larger Pro model is much more expensive at $1.74 per million input tokens and $3.48 per million output tokens, but it is still a fraction of what Western AI vendors charge for access to their top-of-the-line models. For reference, OpenAI charges $5 per million input tokens and $30 per million output tokens for GPT-5.5. ®
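At those rates, the cost gap is easy to put in concrete terms. The workload below (2 million input tokens, 500,000 output tokens) is hypothetical; the per-million-token rates are the ones quoted above:

```python
# Cost of a sample workload at the quoted API rates.
# Rates are USD per million (input, output) tokens, from the article;
# the workload itself is a hypothetical example.
rates = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "DeepSeek V4-Pro":   (1.74, 3.48),
    "GPT-5.5":           (5.00, 30.00),
}

in_tok, out_tok = 2_000_000, 500_000   # hypothetical workload

costs = {model: in_tok / 1e6 * rin + out_tok / 1e6 * rout
         for model, (rin, rout) in rates.items()}

for model, cost in costs.items():
    print(f"{model}: ${cost:.2f}")
```

On this workload, the Pro model comes in at roughly a fifth of GPT-5.5's price, and Flash at under a fiftieth.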


