New technology makes training AI models leaner and faster | Massachusetts Institute of Technology News

Machine Learning


Training large-scale artificial intelligence models is expensive, not only in dollars but also in terms of time, energy, and computational resources. Traditionally, to get a smaller, faster model, you had to first train a larger model and then trim it, or train a smaller model from scratch and accept the performance penalty.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the Max Planck Institute for Intelligent Systems, the European Laboratory for Learning and Intelligent Systems (ELLIS), ETH Zurich, and Liquid AI have developed a new method that avoids this tradeoff entirely by compressing models during training rather than after.

The technique, called CompreSSM, targets a family of AI architectures known as state-space models, which power applications ranging from language processing to speech generation to robotics. By borrowing mathematical tools from control theory, the researchers can identify which parts of the model are pulling their weight and which are dead weight, then surgically remove the unnecessary components early in the training process.
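At their core, the state-space models the article describes update a hidden state with linear dynamics and read outputs from it. A minimal sketch of that recurrence, using illustrative matrices and sizes (none of these values come from the paper):

```python
# Minimal discrete linear state-space model: x_{t+1} = A x_t + B u_t, y_t = C x_t.
# A, B, C and the dimensions below are hypothetical, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_state, n_in, n_out = 8, 2, 2  # state, input, and output dimensions

A = 0.9 * np.eye(n_state) + 0.01 * rng.standard_normal((n_state, n_state))
B = rng.standard_normal((n_state, n_in))
C = rng.standard_normal((n_out, n_state))

def ssm_scan(u):
    """Run the linear recurrence over an input sequence u of shape (T, n_in)."""
    x = np.zeros(n_state)
    ys = []
    for u_t in u:
        x = A @ x + B @ u_t   # state update
        ys.append(C @ x)      # readout
    return np.stack(ys)

y = ssm_scan(rng.standard_normal((16, n_in)))
print(y.shape)  # (16, 2)
```

The state dimension (`n_state` here) is exactly the quantity CompreSSM shrinks: fewer internal states mean a cheaper recurrence at every step of training and inference.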

“This is essentially a technique for growing models leaner and faster as they train,” said Makram Chahine, a doctoral student in electrical engineering and computer science, a CSAIL affiliate, and lead author of the paper. “While learning, they also prune away the parts that are not useful.”

The key insight is that the relative importance of different components within these models stabilizes surprisingly early during training. Using a mathematical quantity called the Hankel singular value, which measures how much each internal state contributes to the overall behavior of the model, the team showed that it is possible to reliably rank which dimensions matter and which do not after only about 10 percent of the training process. Once these rankings are established, the less important components can be safely discarded, and the remaining 90 percent of training proceeds at the speed of a much smaller model.
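For a linear state-space model, Hankel singular values can be computed from the controllability and observability Gramians, and states can then be ranked and truncated. The sketch below shows this standard control-theory computation on a synthetic system; it is not CompreSSM's actual algorithm, and the matrices and the cutoff `k` are illustrative:

```python
# Ranking internal states by Hankel singular values (HSVs), the quantity
# the article describes. A, B, C and k are hypothetical, for illustration only.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(1)
n, m, p = 16, 2, 2
A = np.diag(rng.uniform(0.2, 0.95, n))  # stable diagonal dynamics
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))

# Controllability Gramian P solves P = A P A^T + B B^T
P = solve_discrete_lyapunov(A, B @ B.T)
# Observability Gramian Q solves Q = A^T Q A + C^T C
Q = solve_discrete_lyapunov(A.T, C.T @ C)

# Hankel singular values: square roots of the eigenvalues of P Q
hsv = np.sqrt(np.abs(np.linalg.eigvals(P @ Q)))
order = np.argsort(hsv)[::-1]  # most important states first

k = 4                          # keep only the top-k states (crude truncation)
keep = order[:k]
A_r = A[np.ix_(keep, keep)]
B_r, C_r = B[keep], C[:, keep]
print(hsv[order][:k])
```

In practice, exact state truncation requires a balanced realization; the direct index-based truncation above is a simplification that is most defensible for diagonal (decoupled) dynamics like those used in many modern state-space architectures.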

“What’s interesting about this work is that it turns compression from an afterthought into part of the learning process itself,” says senior author Daniela Rus, MIT professor and director of CSAIL. “Instead of training a large model and then figuring out how to make it smaller, CompreSSM allows the model to discover its own efficient structure as it learns. This is a fundamentally different way of thinking about building AI systems.”

The results were striking. On image classification benchmarks, compressed models trained up to 1.5 times faster while maintaining nearly the same accuracy as full-sized models. A compressed model reduced to about a quarter of the original state dimensionality achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared with 81.8 percent for a model trained from scratch at that smaller size. For Mamba, one of the most widely used state-space architectures, the method achieved a training speedup of about 4x, compressing a 128-dimensional model to about 12 dimensions while maintaining competitive performance.

“We capture most of the complex dynamics during the warm-up phase and retain only the most useful states, resulting in greater model performance,” Chahine says. “The compressed model performs at a higher level than a small model trained from scratch.”

CompreSSM differs from existing approaches in its timing. Traditional pruning techniques train a complete model and then remove parameters, meaning the full computational cost of training the large model must still be paid. Another common technique, knowledge distillation, requires training a large “teacher” model to completion and then training a second, smaller “student” model, essentially doubling the training effort. CompreSSM avoids both costs by making informed compression decisions mid-stream.

The team directly benchmarked CompreSSM against both alternatives. Compared to Hankel nuclear norm regularization, a recently proposed spectral method for promoting compact state-space models, CompreSSM was over 40 times faster while also achieving higher accuracy. The regularization approach required an expensive eigenvalue computation at each gradient step, slowing training by about 16 times, and the resulting model still performed poorly. Against knowledge distillation on CIFAR-10, CompreSSM retained a clear advantage at high compression ratios: as the state dimension shrank, the distilled model’s accuracy dropped significantly, while the model compressed with CompreSSM maintained near-peak performance. Moreover, because distillation requires a forward pass through both the teacher and the student at each training step, even the smaller student model took longer to train than the full-sized baseline.

By applying Weyl’s theorem, the researchers mathematically demonstrated that the importance of individual model states changes smoothly during training, and empirically showed that the relative ranking of those states is stable. Taken together, these findings give practitioners confidence that dimensions initially determined to be negligible will not suddenly become important later.
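The Weyl-type argument above bounds how much any singular value can move under a small perturbation: each one shifts by at most the spectral norm of the update. A short numerical illustration of that bound, using synthetic matrices (not the paper's experiments), where `E` plays the role of a small training update:

```python
# Numerical check of the Weyl bound: |sigma_i(X + E) - sigma_i(X)| <= ||E||_2.
# X and E are synthetic; E stands in for one small training update.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 10))
E = 1e-2 * rng.standard_normal((10, 10))  # a "small gradient step"

s_before = np.linalg.svd(X, compute_uv=False)
s_after = np.linalg.svd(X + E, compute_uv=False)
bound = np.linalg.norm(E, 2)              # spectral norm of the update

print(np.max(np.abs(s_after - s_before)) <= bound + 1e-12)  # True
```

Because each gradient step is small, the singular values, and hence the importance scores built from them, drift smoothly rather than jumping, which is what makes an early ranking trustworthy.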

The method also comes with a practical safety net. If the compression step causes unexpected performance degradation, the operator can revert to a previously saved checkpoint. “This frees people from having to define unintuitive energy thresholds and gives them control over how much they are willing to pay in terms of performance,” Chahine explains.

The technique has some practical limitations. CompreSSM works best on models whose internal state dimensions correlate strongly with overall performance, a correlation that varies by task and architecture. The method is particularly effective for multiple-input multiple-output (MIMO) models, where the relationship between state size and expressiveness is strongest. For single-input, single-output-per-channel architectures, the gains are more modest because these models are inherently less sensitive to changes in state dimension.

Although the theory applies most clearly to linear time-invariant systems, the team has developed an extension for input-dependent time-varying architectures, which are becoming increasingly popular. The family of state-space models has also been extended to architectures such as linear attention, which is gaining interest as an alternative to traditional transformers, so the potential applications are wide-ranging.

Chahine and his collaborators see this work as a stepping stone. The team has already demonstrated extensions to linear time-varying systems like Mamba, and future directions include pushing CompreSSM further into matrix-valued dynamic systems used in linear attention mechanisms, bringing the technology closer to the transformer architectures that underpin most of today’s largest AI systems.

“This had to be the first step, because this is where the theory is sound and the approach can remain principled,” says Chahine. “This is a stepping stone to expand to other architectures currently used in the industry.”

“The work of Chahine and his colleagues provides an interesting theory-based perspective on the compression of modern state-space models (SSMs),” said Antonio Orvieto, principal investigator at the ELLIS Institute Tübingen and independent group leader at the Max Planck Institute for Intelligent Systems, who was not involved in the study. “This method provides evidence that the state dimensionality of these models can be effectively reduced during training and that a control theory perspective can successfully guide this procedure. This work opens new avenues for future research, and the proposed algorithm may become a standard approach in pre-training large-scale SSM-based models.”

This research has been accepted as a conference paper at the 2026 International Conference on Learning Representations and will be presented later this month. This research was supported in part by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the U.S. Office of Naval Research.
