
Ai2 introduces Molmo 2, an open-source video understanding model built to prove that small, transparent AI systems can compete with closed, proprietary platforms for grounded video intelligence.
The Allen Institute for AI (Ai2) has released Molmo 2, an open-source video understanding model that aims to demonstrate that small open models can serve as a reliable alternative to large proprietary systems for enterprise video analysis.
Molmo 2 is designed to challenge the dominance of closed models in grounded vision, a key feature for video understanding that directly connects visual elements to language and reasoning. Ai2 said in a press release that Molmo 2 “takes Molmo's strength in grounded vision and extends it to video and multi-image understanding.” The institute added, “One of our core design goals was to fill a major gap in open models: grounding.”
Ai2 has released three variants: Molmo 2 8B, a Qwen 3-based model described as “the best all-around model for video grounding and QA”; Molmo 2 4B, optimized for more efficient deployment; and Molmo 2-O 7B, built on Ai2's own Olmo model. All variants support single-image, multi-image, and variable-length video input, enabling tasks such as video grounding, object tracking, and question answering.
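To illustrate what grounded video question answering with a model of this size might look like in practice, the sketch below samples frames from a clip and queries a Molmo 2 checkpoint. The repository ID ("allenai/Molmo2-8B"), the processor.process call, and the generate_from_batch method are assumptions modeled on the first Molmo release's Hugging Face interface rather than details confirmed for Molmo 2, and the frame-sampling helper is purely illustrative.

```python
# Minimal sketch: video QA with a Molmo 2 checkpoint via Hugging Face transformers.
# Assumptions: the repo ID "allenai/Molmo2-8B" is hypothetical, and processor.process /
# model.generate_from_batch mirror the remote-code interface of the original Molmo
# release; Molmo 2's actual API may differ.
import cv2
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

REPO_ID = "allenai/Molmo2-8B"  # hypothetical repo ID, not confirmed by Ai2


def sample_frames(video_path: str, num_frames: int = 8) -> list[Image.Image]:
    """Uniformly sample frames from a video file and return them as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames


processor = AutoProcessor.from_pretrained(
    REPO_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Pass the sampled frames as a multi-image prompt alongside a counting question.
frames = sample_frames("warehouse_clip.mp4")
inputs = processor.process(images=frames, text="How many forklifts appear in this clip?")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=128, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```

The uniform frame sampling here is the simplest possible strategy; real pipelines typically sample at a fixed frame rate or adaptively depending on clip length.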
According to Ai2, Molmo 2 outperforms earlier Molmo versions in accuracy, temporal understanding, and pixel-level grounding, and in some cases performs competitively with larger proprietary models such as Google's Gemini 3. Despite its smaller size, Molmo 2 outperformed Gemini 3 Pro and other open-weight competitors on video tracking benchmarks.
Ai2 pointed out that Molmo 2's biggest gains come in video grounding and video counting. But the institute acknowledged that challenges remain, saying: “These results highlight both progress and remaining headroom. Video grounding remains challenging, and no model yet reaches 40% accuracy.”
