After proposing MeanFlow (MF) in May of this year, Kaiming He’s team has now introduced its successor: Improved MeanFlow (iMF). iMF addresses three core issues of the original MF: training stability, guidance flexibility, and architectural efficiency.
By reformulating the training objective as a more stable instantaneous-velocity loss, and by introducing flexible classifier-free guidance (CFG) and efficient in-context conditioning, iMF significantly improves model performance.
On the ImageNet 256×256 benchmark, the iMF-XL/2 model achieves an FID of 1.72 at 1-NFE (a single network function evaluation), a roughly 50% improvement over the original MF. This shows that a single-step generative model trained from scratch can achieve results comparable to multi-step diffusion models.
The first author is again Zhengyang Geng, first author of MeanFlow. Notably, co-first author Yiyang Lu is currently a second-year student in the Yao Class at Tsinghua University, and Kaiming He signs off as the last author.
Other collaborators include Adobe researchers Zongze Wu and Eli Shechtman, as well as J. Zico Kolter, head of CMU’s Machine Learning Department.
Rebuilding the prediction function: returning to a standard regression problem
The core improvement of iMF (Improved MeanFlow) is to rebuild the prediction function so that training reduces to a standard regression problem.
The original MeanFlow (MF) (left in the figure above) directly minimizes an average-velocity loss, where u_tgt is the target average velocity derived from the MeanFlow identity and the conditional velocity ε − x.
The problem is that the derived target u_tgt contains a derivative term of the network’s own predicted output. This self-referential target structure makes optimization very unstable and high-variance.
To address this, iMF constructs the loss in terms of instantaneous velocity, which stabilizes the entire training process.
Specifically, the network output remains the average velocity, but the training loss becomes an instantaneous-velocity loss, yielding stable, standard regression training.
First, iMF simplifies the input to a single noisy sample z and makes a clever internal change to how the prediction function is computed.
Specifically, when iMF computes the composite prediction function V (which represents the prediction of the instantaneous velocity), the tangent vector fed into the Jacobian–vector product (JVP) term is the velocity predicted by the network itself, rather than the external conditional velocity ε − x.
Through these steps, iMF removes the dependence of the composite prediction function V on the conditional velocity ε − x. The conditional velocity ε − x then serves purely as the stable regression target of the loss function.
Finally, iMF transforms the training process into a stable, standard regression problem, providing a solid optimization foundation for learning the average velocity.
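The reformulation above can be sketched in JAX. This is a minimal illustration under stated assumptions, not the authors’ implementation: the toy network `u_fn`, the linear interpolation path z = (1 − t)·x + t·ε, and the conditional velocity ε − x follow common flow-matching conventions and are assumptions here.

```python
import jax
import jax.numpy as jnp

def imf_loss(params, u_fn, x, eps, r, t):
    """Sketch of the iMF objective: regress a composite instantaneous-velocity
    prediction V onto the stable conditional velocity (eps - x).
    u_fn(params, z, r, t) -> predicted average velocity over [r, t]."""
    t_col = t[:, None]                      # broadcast time over feature dim
    z = (1.0 - t_col) * x + t_col * eps     # assumed linear interpolation path
    v_cond = eps - x                        # conditional (target) velocity

    # Tangent = the network's own instantaneous-velocity prediction (r = t),
    # replacing the external conditional velocity used by the original MF.
    v_self = u_fn(params, z, t, t)

    # JVP of u w.r.t. (z, t) along direction (v_self, 1); r is held fixed.
    u, du_dt = jax.jvp(lambda z_, t_arg: u_fn(params, z_, r, t_arg),
                       (z, t), (v_self, jnp.ones_like(t)))

    # MeanFlow identity: v = u + (t - r) * d/dt u  ->  composite prediction V.
    V = u + (t - r)[:, None] * du_dt
    return jnp.mean((V - v_cond) ** 2)      # standard regression loss

# Toy usage with a linear "network" (hypothetical, for shape-checking only).
def u_fn(params, z, r, t):
    return z * params["w"] + (t - r)[:, None] * params["b"]

x = jax.random.normal(jax.random.PRNGKey(0), (4, 8))
eps = jax.random.normal(jax.random.PRNGKey(1), (4, 8))
r = jnp.zeros(4)
t = jnp.full((4,), 0.7)
params = {"w": 0.5, "b": 0.1}
loss = imf_loss(params, u_fn, x, eps, r, t)
print(float(loss))
```

Note how the loss target is simply ε − x: no derivative of the network appears in the target, which is what makes this a plain regression.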
In addition to improving training objectives, iMF comprehensively enhances the practicality and efficiency of the MeanFlow framework through two major advances:
Flexible classifier-free guidance (CFG)
One major limitation of the original MeanFlow framework is that, to support single-step generation, the classifier-free guidance (CFG) scale must be fixed during training. This severely limits the ability to adjust the scale at inference time to trade off image quality and diversity.
iMF solves this problem by internalizing the guidance scale as a learnable condition.
Specifically, iMF feeds the guidance scale directly to the network as an input condition.
During training, the model randomly samples guidance scales from a power distribution biased toward smaller values. This lets the network learn the average velocity field under different guidance strengths, maximizing the flexibility of CFG at inference time.
Additionally, iMF extends this flexible conditioning to support CFG intervals, further increasing the model’s control over sample diversity.
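The scale-conditioning scheme might look like the following sketch. The power-law exponent `alpha` and the scale range `[w_min, w_max]` are illustrative assumptions, not the paper’s exact hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_guidance_scale(batch, w_min=1.0, w_max=5.0, alpha=3.0):
    """Sample CFG scales from a power distribution biased toward small values:
    w = w_min + (w_max - w_min) * u**alpha,  u ~ Uniform(0, 1).
    Larger alpha pushes more probability mass toward w_min."""
    u = rng.uniform(size=batch)
    return w_min + (w_max - w_min) * u ** alpha

# During training, each sample gets its own scale; the scale is then embedded
# and fed to the network as an extra condition, alongside time and class label.
w = sample_guidance_scale(1024)
print(w.min(), w.mean(), w.max())
```

Because the network has seen the whole range of scales during training, any scale in that range can be requested at inference with a single forward pass.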
Efficient in-context conditioning architecture
The original MF relies on the parameter-heavy adaLN-zero mechanism to handle multiple heterogeneous conditions (time steps, class labels, guidance scales, etc.).
As the number of conditions grows, simply summing all condition embeddings and passing them through adaLN-zero becomes inefficient and parameter-redundant.
iMF introduces improved in-context conditioning to solve this problem.
Its innovation lies in encoding all conditions (including time steps, class labels, and CFG scales) as multiple learnable tokens rather than a single vector. These condition tokens are concatenated with the image latent tokens along the sequence axis and fed together into the Transformer blocks for joint processing.
The biggest benefit of this architectural change is that iMF can remove the parameter-heavy adaLN-zero module entirely.
This lets iMF significantly shrink model size while improving performance: the iMF-Base model, for example, is reduced by roughly one third (from 133M to 89M parameters), greatly improving efficiency and design flexibility.
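Shape-wise, the conditioning change can be sketched as below. The token counts and dimensions are illustrative assumptions, and the random arrays stand in for learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 256, 768        # batch, image tokens, hidden dim (illustrative)

# Image latent tokens from the patchified latent (as in DiT-style models).
img_tokens = rng.standard_normal((B, N, D))

# Each condition (time step, class label, CFG scale) is encoded as one or
# more learnable tokens instead of a single vector summed into adaLN-zero.
# Random arrays here; a real model would use small MLPs or embedding tables.
time_tokens = rng.standard_normal((B, 1, D))
class_tokens = rng.standard_normal((B, 1, D))
cfg_tokens = rng.standard_normal((B, 1, D))

# Concatenate condition tokens with image tokens along the sequence axis and
# process everything jointly in the Transformer blocks -- no adaLN-zero needed.
seq = np.concatenate([time_tokens, class_tokens, cfg_tokens, img_tokens],
                     axis=1)
print(seq.shape)  # (2, 259, 768)
```

The conditions cost only a few extra sequence positions per sample, instead of a separate modulation network in every block.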
Experimental results
iMF performs strongly in the most demanding setting, 1-NFE generation on ImageNet 256×256.
At 1-NFE, iMF-XL/2 reaches an FID of 1.72, pushing the performance of single-step generative models to a new high.
Trained from scratch, iMF even outperforms many few-step models distilled from pre-trained multi-step models, demonstrating the strength of the iMF framework in from-scratch training.
The figure below shows 1-NFE (single function evaluation) generation results on ImageNet 256×256.
At 2-NFE, iMF’s FID reaches 1.54, further narrowing the gap with multi-step diffusion models (FID roughly 1.4–1.7).
One more thing
As mentioned above, the lead author of iMF is again Zhengyang Geng, a core member of the team behind the previous work MeanFlow (selected for an oral presentation at NeurIPS 2025).
He graduated from Sichuan University with a bachelor’s degree and is currently pursuing his PhD at CMU under the supervision of Professor Zico Kolter.
The co-lead author is Yiyang Lu, a second-year student in the Yao Class at Tsinghua University. He is currently researching computer vision at MIT under the direction of Professor Kaiming He. Previously, he studied robotics at Tsinghua’s Institute for Interdisciplinary Information Sciences under the guidance of Professor Huazhe Xu.
Part of the content of this paper was completed by them at MIT under the guidance of Professor Kaiming He.
The paper’s other authors include Adobe researchers Zongze Wu and Eli Shechtman, J. Zico Kolter, head of CMU’s Machine Learning Department, and Professor Kaiming He.
Among them, Zongze Wu received his bachelor’s degree from Tongji University and his PhD from the Hebrew University of Jerusalem. He is currently a researcher at Adobe Research in San Francisco.
Eli Shechtman is also at Adobe, as a senior principal scientist in Adobe Research’s image lab. He joined Adobe in 2007 and was a postdoctoral researcher at the University of Washington from 2007 to 2010.
J. Zico Kolter is the PhD supervisor of first author Zhengyang Geng. He is a professor in CMU’s Computer Science Department and head of its Machine Learning Department.
The last author of the paper is the renowned machine-learning scientist Kaiming He, currently a tenured associate professor at MIT.
His best-known works include:
