Chinese AI darling DeepSeek's now-infamous R1 research report was published in the journal Nature this week, along with new information on the compute resources needed to train the model. Unfortunately, some people got the wrong idea about how expensive it was to create.
The disclosures led some to believe that China's AI darling had actually managed to train the model at a cost of just $294,000. In reality, the true cost of training the model was about 20 times that. At least.
The confusion stems from supplementary information released alongside the original January paper, in which the AI model developer revealed that it had used 512 GPUs running for a total of 198 hours to train the preliminary R1-Zero release, plus an additional 80 hours to complete R1.
Together with about 5,000 GPU hours to generate the supervised fine-tuning dataset used in the training process, the entire effort came in at a hair under $300,000.
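The headline number can be roughly reproduced from the figures above. The short Python sketch below does the sums, assuming the $2 per GPU-hour H800 rental rate on which the cost figures are based; the variable and function names are our own, and the hour counts are the rounded values quoted in the text.

```python
# Figures quoted in the text; names are illustrative only.
GPUS = 512
R1_ZERO_HOURS = 198           # wall-clock hours for the R1-Zero run
R1_HOURS = 80                 # additional wall-clock hours to finish R1
SFT_DATA_GPU_HOURS = 5_000    # GPU hours spent generating the fine-tuning dataset
RATE_USD_PER_GPU_HOUR = 2     # assumed H800 rental rate


def rl_phase_cost():
    """Return (total GPU hours, rental cost in USD) for the RL phase."""
    gpu_hours = GPUS * (R1_ZERO_HOURS + R1_HOURS) + SFT_DATA_GPU_HOURS
    return gpu_hours, gpu_hours * RATE_USD_PER_GPU_HOUR


gpu_hours, cost = rl_phase_cost()
print(f"{gpu_hours:,} GPU hours -> ${cost:,}")
# prints "147,336 GPU hours -> $294,672"
```

Because the inputs are rounded, the result lands within a few hundred dollars of the widely quoted $294,000 rather than matching it exactly.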
But that's not what actually happened. Never mind the fact that $300,000 won't buy anywhere near 512 H800s (these estimates are based on GPU lease rates rather than actual hardware costs); the researchers aren't talking about end-to-end model training.
Instead, the paper focuses on the application of reinforcement learning (RL), which was used to imbue DeepSeek's existing V3 base model with “reasoning” or “thinking” capabilities.
In other words, by the time they reached the RL phase detailed in this paper, they had already done about 95% of the work.
There are several ways to approach reinforcement learning, but in a nutshell, it is a post-training process that typically involves reinforcing step-by-step reasoning by rewarding the model for correct answers, which encourages more accurate responses in the future.
The paper centers very clearly on the application of Group Relative Policy Optimization (GRPO), the specific reinforcement learning technique used in training the model. The headlines touting the $294,000 training cost, by contrast, appear to have confused reinforcement learning, which takes place after pre-training, with the far more costly pre-training process used to create DeepSeek V3.
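For readers wondering what the "group relative" part of GRPO refers to, here is a minimal Python sketch of that one step: several answers are sampled for the same prompt, and each answer's reward is scored relative to the mean and spread of its own group. This is an illustration, not DeepSeek's code. The function name, the example rewards, and the choice of population standard deviation are our assumptions, and the rest of the method (a clipped policy-gradient update and a KL penalty) is left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    eps guards against division by zero when every answer gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt:
# 1.0 means the final answer was correct, 0.0 means it was wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers come out positive, wrong ones negative
```

Answers that beat their group's average get a positive score and are reinforced; those below average are discouraged. No separately trained value model is needed to judge them, which is part of the technique's appeal.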
How do we know? Because DeepSeek's research team disclosed how much compute it used to train the base model. According to the V3 paper, DeepSeek V3 was trained on 2,048 H800 GPUs for about two months. In total, the model required 2.79 million GPU hours at an estimated cost of $5.58 million.
Since you can't have R1 without first building V3, the actual cost of the model was closer to $5.87 million. Whether these figures have been intentionally understated to cast Western model developers as frivolous hype fiends remains a subject of heated debate.
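The sums behind those figures are simple enough to check. The Python sketch below assumes a $2 per GPU-hour rental rate and the rounded numbers quoted in the text; the variable names are our own.

```python
# Figures quoted in the text; names are illustrative only.
V3_GPU_HOURS = 2_790_000      # pre-training compute for DeepSeek V3
V3_GPUS = 2_048               # H800s used for the V3 run
RATE_USD_PER_GPU_HOUR = 2     # assumed H800 rental rate
RL_COST_USD = 294_000         # the widely reported R1 figure

v3_cost = V3_GPU_HOURS * RATE_USD_PER_GPU_HOUR   # pre-training only
total_cost = v3_cost + RL_COST_USD               # V3 plus the RL phase
ratio = total_cost / RL_COST_USD                 # how far off the headline is
wall_clock_days = V3_GPU_HOURS / V3_GPUS / 24    # sanity check on "two months"

print(f"V3 pre-training: ${v3_cost:,}")
print(f"V3 + RL phase:   ${total_cost:,}")
print(f"Multiple of the headline figure: {ratio:.1f}x")
print(f"Wall-clock time on {V3_GPUS:,} GPUs: {wall_clock_days:.0f} days")
```

Run as written, this gives $5,580,000 for pre-training, $5,874,000 in total, a multiple of about 20, and roughly 57 days of wall-clock time, which squares with the "about two months" claim.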
It is also worth pointing out that the cost figures are based on the assumption that those H800 GPUs could be rented for $2 an hour. Estimates put the purchase cost of the 256 GPU servers used to train the model somewhere north of $51 million. And that doesn't take into account research and development, data acquisition, data cleaning, or any false starts and wrong turns on the way to a successful model.
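To put the rental assumption in perspective, the Python sketch below compares the $51 million purchase estimate with the $2 per GPU-hour rental rate. It treats $51 million as a floor, infers eight GPUs per server from the 2,048 GPUs and 256 servers quoted in the text, and ignores power, networking, and staff; the variable names are our own.

```python
# Figures quoted in the text; names are illustrative only.
SERVERS = 256
GPUS = 2_048
PURCHASE_USD = 51_000_000     # lower bound ("north of $51 million")
RATE_USD_PER_GPU_HOUR = 2     # assumed H800 rental rate
V3_RENTAL_COST_USD = 5_580_000

gpus_per_server = GPUS // SERVERS
cost_per_server = PURCHASE_USD / SERVERS
cost_per_gpu = PURCHASE_USD / GPUS

# Hours each GPU would have to be rented at $2 before rental
# spending matches the purchase price.
break_even_hours = cost_per_gpu / RATE_USD_PER_GPU_HOUR
break_even_days = break_even_hours / 24

purchase_vs_rental = PURCHASE_USD / V3_RENTAL_COST_USD

print(f"{gpus_per_server} GPUs per server, ${cost_per_server:,.0f} per server")
print(f"Break-even after about {break_even_days:,.0f} days of continuous rental")
print(f"Hardware costs about {purchase_vs_rental:.1f}x the quoted V3 rental bill")
```

On these assumptions the hardware costs around nine times the quoted V3 training bill, and a rented GPU would need to run flat out for well over a year before buying became the cheaper option.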
Overall, the idea that DeepSeek was significantly cheaper or more efficient to train than Western models appears to be overblown. DeepSeek V3 and R1 are roughly comparable to Meta's Llama 4 in terms of compute. Llama 4 required between 2.38 million (Maverick) and 5 million (Scout) GPU hours to train, but was trained on between 22 trillion and 40 trillion tokens. DeepSeek V3 is bigger than Llama 4 Maverick, but used significantly fewer training tokens, at 14.8 trillion. In other words, Meta trained a slightly smaller model in slightly fewer GPU hours using significantly more training data. ®
