How distillation makes AI models smaller and cheaper

The Chinese AI company DeepSeek released a chatbot called R1 earlier this year, and it drew a huge amount of attention. Most of it focused on the fact that a relatively small, unknown company said it had built a chatbot that rivaled the performance of those from the world's best-known AI companies while using only a fraction of the computing power and cost. As a result, the stocks of many Western tech companies plummeted. Nvidia, which sells the chips that run leading AI models, lost more stock value in a single day than any company in history.

Some of that attention involved an element of accusation. Sources alleged that DeepSeek had obtained knowledge from OpenAI's proprietary o1 model, without permission, by using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had discovered a new, more efficient way to build AI.

But distillation, also called knowledge distillation, is a widely used tool in AI, a subject of computer science research going back a decade, and a technique that big tech companies use on their own models. “Distillation is one of the most important tools companies have today to make their models more efficient,” said Enric Boix-Adsera, a researcher who studies distillation at the University of Pennsylvania's Wharton School.

Dark knowledge

The idea for distillation began with a 2015 paper by three researchers at Google, including Geoffrey Hinton, the so-called godfather of AI and a 2024 Nobel laureate. At the time, researchers often ran ensembles of models (“many models glued together,” said Oriol Vinyals, a principal scientist at Google DeepMind and one of the paper's authors) to improve performance. “But running all the models in parallel was very tedious and expensive,” Vinyals said. “We were intrigued by the idea of distilling it into a single model.”

The researchers thought they might make progress by addressing a notable weak point in machine learning algorithms: wrong answers were all considered equally bad, no matter how wrong they were. In an image-classification model, for example, “confusing a dog with a fox was punished in the same way as confusing a dog with pizza,” Vinyals said. The researchers suspected that the ensemble models did contain information about which wrong answers were less bad than others. Perhaps a smaller “student” model could use that information from the large “teacher” model to grasp more quickly the categories it was supposed to sort pictures into. Hinton called this “dark knowledge,” invoking an analogy with cosmological dark matter.
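The weak point described above can be made concrete with a small sketch. It assumes the usual cross-entropy loss; the three classes and every probability below are invented for illustration and come from no real model. With a one-hot (“hard”) label, the penalty depends only on the probability given to the correct class, so the two mistakes cost the same. With a graded target of the kind a teacher might supply, the fox mistake costs less than the pizza mistake.

```python
import math

CLASSES = ["dog", "fox", "pizza"]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)

# The true label is "dog". Both predictions give "dog" only 20%,
# but one leans toward "fox" and the other toward "pizza".
predicts_fox = [0.2, 0.7, 0.1]
predicts_pizza = [0.2, 0.1, 0.7]

# Hard label: all of the probability mass sits on "dog".
hard_target = [1.0, 0.0, 0.0]
hard_loss_fox = cross_entropy(hard_target, predicts_fox)
hard_loss_pizza = cross_entropy(hard_target, predicts_pizza)

# Hypothetical graded target: "dog" is likely, "fox" is plausible,
# "pizza" is nearly impossible.
soft_target = [0.70, 0.29, 0.01]
soft_loss_fox = cross_entropy(soft_target, predicts_fox)
soft_loss_pizza = cross_entropy(soft_target, predicts_pizza)

print("hard label:", hard_loss_fox, hard_loss_pizza)
print("soft label:", soft_loss_fox, soft_loss_pizza)
```

The hard-label losses are identical, while the soft-label loss is lower for the prediction that confuses the dog with a fox.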

After discussing this possibility with Hinton, Vinyals developed a way to get the large teacher model to pass more information about the image categories to a smaller student model. The key was homing in on “soft targets” in the teacher model, where it assigns a probability to each possibility rather than giving a firm this-or-that answer. One model, for example, calculated that there was a 30% chance that an image showed a dog, 20% that it showed a cat, 5% that it showed a cow and 0.5% that it showed a car. By using these probabilities, the teacher model effectively revealed to the student that dogs are quite similar to cats, not so different from cows, and quite distinct from cars. The researchers found that this information helped the student learn to identify images of dogs, cats, cows and cars more efficiently. A big, complicated model could be reduced to a leaner one with barely any loss of accuracy.
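A minimal sketch of training against soft targets follows. It makes several assumptions the article does not spell out: a temperature-scaled softmax turns raw scores (logits) into probabilities, the student is scored by its KL divergence from the teacher, and the loss is multiplied by the squared temperature, a common convention. All logit values are hypothetical.

```python
import math

CLASSES = ["dog", "cat", "cow", "car"]

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the distribution q is from the reference p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss between teacher and student at a shared temperature."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(teacher_probs, student_probs)

# Hypothetical logits for one image of a dog.
teacher_logits = [4.0, 3.5, 2.0, -1.0]   # dog ~ cat > cow >> car
good_student = [3.8, 3.4, 2.1, -0.8]     # same ordering of similarities
poor_student = [4.0, -1.0, 2.0, 3.5]     # treats a car as dog-like

loss_good = distillation_loss(teacher_logits, good_student)
loss_poor = distillation_loss(teacher_logits, poor_student)
print("good student loss:", loss_good)
print("poor student loss:", loss_poor)
```

Both students rank “dog” first, so a hard label could not separate them. The soft-target loss does, because it also checks how the remaining probability is spread over the wrong answers.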

Explosive growth

The idea was not an immediate hit. The paper was rejected from a conference, and Vinyals, discouraged, turned to other topics. But distillation arrived at an important moment. Around this time, engineers were discovering that the more training data they fed into neural networks, the more effective those networks became. The size of models soon exploded, as did their capabilities, but the cost of running them climbed in step with their size.

Many researchers turned to distillation as a way to make smaller models. In 2018, for example, Google researchers unveiled a powerful language model called BERT. But BERT was big and costly to run, so the following year other developers distilled a smaller version named DistilBERT, which became widely used in business and research. Distillation gradually became ubiquitous, and it is now offered as a service by companies such as Google, OpenAI and Amazon. The original distillation paper, still published only on the arxiv.org preprint server, has now been cited more than 25,000 times.

Because distillation requires access to the innards of the teacher model, a third party cannot covertly distill data from a closed-source model like OpenAI's o1, as DeepSeek was thought to have done. That said, a student model can still learn quite a bit from a teacher model simply by prompting the teacher with specific questions and using its answers to train the student.
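The prompting approach just described can be outlined in a few lines. This is only a sketch under stated assumptions: `query_teacher` is a hypothetical stand-in that returns canned text rather than calling any real model or API, and the prompts and record format are invented for illustration.

```python
def query_teacher(prompt):
    """Hypothetical stand-in for a teacher model; returns canned answers."""
    canned_answers = {
        "What is 12 * 12?": "12 * 12 = 144.",
        "Is a fox more like a dog or a pizza?": "A fox is more like a dog.",
    }
    return canned_answers.get(prompt, "I am not sure.")

def build_distillation_dataset(prompts, teacher):
    """Pair each prompt with the teacher's answer as a training record."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]

prompts = [
    "What is 12 * 12?",
    "Is a fox more like a dog or a pizza?",
]
dataset = build_distillation_dataset(prompts, query_teacher)

for record in dataset:
    print(record["prompt"], "->", record["completion"])
```

A student model would then be fine-tuned on records like these. Only the teacher's text output is needed, not its internal probabilities.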

Meanwhile, other researchers continue to find new applications. In January, the NovaSky lab at the University of California, Berkeley, showed that distillation works well for training chain-of-thought reasoning models, which use multistep “thinking” to better answer complicated questions. The lab says its fully open source Sky-T1 model cost less than $450 to train and achieved results similar to those of a much larger open source model. “I was really surprised at how well the distillation went in this environment,” said Dacheng Li, a Berkeley doctoral student and co-student lead of the NovaSky team. “Distillation is a basic method of AI.”
