Among the many Chinese AI companies and institutes competing for market share and attention (no pun intended) in the global market, MiniMax stands out for its commitment to delivering frontier-level intelligence across various modalities such as text, coding, and video (through its Hailuo model series). They are often under permissive, enterprise-grade, standard open source licenses.
Now, MiniMax has released a new detailed technical report on the creation of its popular M2 series of language models (M2, M2.5, and M2.7), raising eyebrows among AI power users and developers around the world. This highlights the company’s numerous engineering innovations and smart approaches. Meanwhile, the company and its leaders have also hinted at an all-new sparse attention approach for the upcoming MiniMax M3 series of models, which will deliver up to 15.6x faster decoding (or LLM response) in long contexts (1 million tokens) by employing a custom second-order framework. In doing so, MiniMax has designed M3 to make the deployment of ultra-long context AI agents economically viable.
The M2 report is worth watching for any company working with AI models, especially those looking to fine-tune and train their own in-house. After all, MiniMax’s M2 series models often achieved world-leading benchmarks for open source AI performance upon release.
The title has since been snatched by several other Chinese institutes, including DeepSeek and Xiaomi, but MiniMax’s new report provides a blueprint that companies around the world can use to improve the performance of their AI models and agents.
As Hugging Face’s Adina Yacup said on X, “Beyond the benchmarks, they’ve done some very solid work on MoE efficiency and agent-oriented design. I’m excited to see where M3 goes next!”
attention dilemma
The core technical architecture of the M2 series relies on a dedicated Transformer layout for sparse Mixture-of-Experts (MoE) decoders used in many other state-of-the-art LLMs.
Although the underlying backbone stores a total of 229.9 billion parameters, we maintain an incredibly lean operational footprint by enabling just 9.8 billion parameters per token across 256 fine-grained experts.
However, to optimize routing and avoid standard load balancing problems, MiniMax implemented a sigmoid gate combined with a learnable expert-specific bias term, significantly reducing the reliance on restrictive auxiliary losses.
The most decisive engineering decision described in the M2 paper was the adherence to full multi-head attention with grouped query attention (GQA) across all 62 layers.
In large language models, “quadratic scaling” refers to the computationally expensive reality of standard full-attention mechanisms, where every token in a sequence must be mathematically connected to every other token. A real-world analogy would be attending a networking event and having to have deep conversations with everyone there, while also monitoring all the other conversations going on.
While this approach provides an incredibly complete context, the processing power and memory required grows exponentially with the square of the input length, creating severe hardware bottlenecks as the model attempts to incorporate hundreds of thousands of words.
quadratic quadratic scaling problem
“Quadratic” scaling introduces architectural shortcuts designed to avoid this exponential computational load. Second-order quadratic techniques, such as sliding window attention and compressed linear attention, only need to analyze local windows of nearby words or generate compressed summaries of broader text, rather than mapping all possible connections.
While these efficient methods significantly reduce hardware costs and allow models to process large numbers of documents at high speeds, they have historically introduced severe trade-offs in accuracy, often resulting in AI missing the “big picture” or failing to track distant context.
This mathematical dilemma defines the evolution of the architecture from MiniMax’s M2 to the upcoming M3 series. During the development of M2, the researchers rigorously tested quadratic quadratic shortcuts, but found that they disabled the model’s “multi-hop inference” (its ability to connect disparate cues across long documents), and the team had to absorb the huge computational cost of fully quadratic quadratic attention to maintain frontier-level intelligence.
In fact, they actively benchmarked efficient attentional options during pre-training, but deliberately discarded them. They extensively experimented with hybrid setups to interleave full attention with second-order architectures such as Lightning Attention and hybrid sliding window attention (SWA) configurations.
The empirical results were conclusive. At larger scales, linear and windowed attention variants showed severe inference deficits.
When evaluated over a 32K context window, the performance of the SWA variant was significantly worse than full attention, dropping from a baseline score of 90.0 to 72.0 on the RULER 128K complex word extraction task.
We found that the quadratic quadratic configuration is prone to memory limitations during training, lacks native prefix cache support, and does not coordinate smoothly with the multi-token prediction (MTP) module used for speculative decoding. It was considered that sufficient care was required to maintain multi-hop inference ability.
However, recognizing that quadratic scaling cannot be sustained indefinitely due to physical hardware limitations, MiniMax is designing the M3 series around a new quadratic quadratic framework to ultimately deliver both fast processing and uncompromised inference.
Receiving MiniMax Sparse Attendant (MSA) and sub-quadratic scaling
The next generation MiniMax-M3 breaks away from the computationally intensive limitations of the previous generation. As revealed by MiniMax’s engineering team under the banner “Something BIG is coming,” the M3 introduces “MiniMax Sparse Attendance” (MSA).
Unlike DeepSeek’s multi-head latent attention (MLA), which compresses keys and values into a low-dimensional latent space, MSA operates on the standard GQA backbone, but leverages block-level selection of actual uncompressed keys and values.
Elie Bakouch of AI training infrastructure and platform lab Prime Intellect posted on [compressed space]. ”
This solves the precision loss and prefix caching failures pointed out in the M2 paper. By dynamically filtering and selecting block-level sequences, MSA provides an architectural leap forward. Initial hardware profiling shows a 9.7x speedup in prefill latency at 1 million token sequence length and a significant 15.6x speedup during the decode phase compared to the full attention M2 architecture.
To understand why speeding up the “decoding stage” is so important, it helps to take a closer look at how AI actually reads and writes information. When interacting with AI, processing occurs in two different steps: prepopulation and decoding.
When you give an AI a prompt, whether it’s a short sentence or a huge 1,000-page document, it processes that entire chunk of text in parallel at once (known as a “prefill”). It basically “reads” the input all at once, building an initial understanding and establishing context.
To generate a response, the AI must enter a “decoding phase.” Examine the prompt to predict the first word of the response. You must see the prompt to predict the second word plus first word. Predicting the 100th word requires recalculating the context of the prompt and Previous 99 words just written. So, in practice, the reaction becomes more difficult to generate as it progresses, eventually requiring a complete review of all previous parts.
For a layperson, imagine reading a thick legal draft (pre-typing) and then having to write a summary report, and before writing every new word you have to quickly re-read the entire draft and everything you’ve written to make sure the next word makes sense (deciphering).
The decoding stage is the most severe computational bottleneck in text generation, as the AI must constantly and repeatedly look back to generate new steps forward. This is why AI models often type their answers word for word, and why they slow down significantly as conversations get longer.
So when this text states that the new architecture achieved a significant speedup of 15.6x during the decoding phase at a 1 million token sequence length, this means that the model found a structural shortcut to generate an answer nearly 16x faster per token. It directly solves the bottlenecks that typically cause AI chatbots to freeze or stutter when processing large amounts of information.
Evolution of the MiniMax M series and the birth of “Forge”
At the product level, MiniMax has consistently evolved its model from a simple text generation interface to an autonomous worker.
The M2 series pioneered an “interleaved thinking” protocol in which the model alternates between natural language planning traces and explicit tool calls within a single trajectory. Rather than removing intermediate thought chain blocks between execution turns, M2 adds the complete thought history directly to the conversational context. This plan persistence prevents state drift and allows the model to gracefully recover from runtime errors and modify its strategy based on environmental feedback.
To train these long-term workflows, MiniMax built Forge, a scalable agent-native reinforcement learning system. Forge separates execution into three independent modules: the agent side, the middleware abstraction layer (gateway server and data pool), and the training/inference engine.
As MiniMax engineer Olive Song explained on the ThurdAI podcast, “What we realized is that there’s a lot of potential for small models like this if you train reinforcement learning with a large number of environments and agents… but it’s not that easy,” adding that this environment training took up a significant portion of the team’s development timeline. To accommodate the extreme trajectory length differences common in multi-step agent environments, Forge implements two key engineering solutions.
-
Windowed FIFO scheduling: A training scheduler that maps sliding windows onto generation queues. This enables greedy, high-throughput fetching of completed tasks within a window, preventing cluster idle time, while strictly enforcing FIFO boundaries to maintain distribution stability and avoid gradient oscillations.
-
Combining prefix trees: Optimization to restructure batch training into tree computation. Completions that share the same conversation prefix are computed only once in the forward pass before branching. This eliminates redundant calculations and speeds up training by up to 40x with zero approximation errors.
This hardening infrastructure directly spawned the M2.7 checkpoint and moved the series toward “self-evolution.” Operating within an automated agent harness, M2.7 acts as an independent machine learning engineer. The model profiles its active training runs, diagnoses anomalies, reads logs, and automatically modifies its codebase and configuration.
According to MiniMax, M2.7 successfully handled 30% to 50% of its unique development workflows.
In OpenAI’s rigorous MLE Bench Lite suite of tests for autonomous ML research capabilities, the M2.7 achieved a medal winning rate of 66.6% across an independent 24-hour trial, effectively tying Google’s closed-weight Gemini 3.1 Pro.
The continued pace from M2 to M2.5, which famously completed 30% of internal tasks and 80% of newly committed code at MiniMax headquarters, highlights a broader vision.
As the MiniMax team noted during the introduction phase, “We believe that M2.5 offers virtually limitless possibilities for the development and operation of agents in the economy.”
With a technical report and MSA technical blog on the horizon codifying the success of the M2 generation, MiniMax is making it clear that the next frontier in AI is converting mini-activation footprints into maximum real-world intelligence.
