
Credit: GitHub: https://github.com/yeliudev/videomind
Although artificial intelligence (AI) technology is evolving rapidly, AI models still struggle to understand long videos. A research team at The Hong Kong Polytechnic University (PolyU) has developed VideoMind, a novel video-language agent that enables AI models to perform long-video reasoning and question-answering tasks by emulating the way humans think.
The VideoMind framework incorporates an innovative Chain-of-LoRA strategy, built on Low-Rank Adaptation (LoRA), to reduce the demand for computing resources and power and to facilitate the application of generative AI in video analysis. The findings have been submitted to a leading AI conference.
Videos, especially those longer than 15 minutes, carry information that unfolds over time, such as the sequence of events, causality, coherence, and scene transitions. To understand video content, an AI model must therefore not only identify the objects present but also take into account how they change throughout the video. Because video frames occupy a large number of tokens, video understanding demands enormous computing capacity and memory, making it difficult for AI models to process long videos.
Professor Changwen Chen, Interim Dean of the PolyU Faculty of Computer and Mathematical Sciences and Chair Professor of Visual Computing, and his team have achieved a breakthrough in research on long-video reasoning with AI. In designing VideoMind, they introduced a role-based workflow that mirrors the human process of video understanding. The four roles included in the framework are:
- A planner, which coordinates all other roles for each query.
- A grounder, which localizes and retrieves relevant moments.
- A verifier, which checks the accuracy of the retrieved moments and selects the most reliable one.
- A responder, which generates the final query-aware answer.
This progressive approach to video understanding helps address the temporal reasoning challenge faced by most AI models.
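To make the workflow concrete, here is a rough Python sketch of how the four roles might chain together. The role functions are hypothetical stubs and the returned values are purely illustrative; this is not the actual VideoMind implementation.

```python
# Hypothetical sketch of a role-based video QA workflow; the four role
# functions below are illustrative stubs, not VideoMind's real code.

def plan(question: str) -> list[str]:
    """Planner: break the query into steps and decide which roles to invoke."""
    return ["ground", "verify", "respond"]

def ground(video: str, question: str) -> list[tuple[float, float]]:
    """Grounder: localize candidate moments (start, end in seconds) relevant to the query."""
    return [(62.0, 75.0), (310.0, 331.0)]

def verify(video: str, question: str, candidates: list[tuple[float, float]]) -> tuple[float, float]:
    """Verifier: re-inspect each candidate and keep the most reliable moment."""
    return candidates[0]

def respond(video: str, question: str, moment: tuple[float, float]) -> str:
    """Responder: generate the final query-aware answer from the selected moment."""
    return f"The relevant event occurs around {moment[0]:.0f}-{moment[1]:.0f} seconds."

def answer_video_question(video: str, question: str) -> str:
    steps = plan(question)                      # 1. plan the reasoning steps
    candidates = ground(video, question)        # 2. localize relevant moments
    best = verify(video, question, candidates)  # 3. verify and pick the best moment
    return respond(video, question, best)       # 4. answer based on that moment

print(answer_video_question("match.mp4", "When does the goal happen?"))
```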
Another core innovation of the VideoMind framework is its Chain-of-LoRA strategy. LoRA, or Low-Rank Adaptation, is a lightweight fine-tuning technique that has emerged in recent years for adapting an AI model to a specific application without full-parameter retraining. The team's pioneering Chain-of-LoRA strategy applies four lightweight LoRA adapters to a single unified model, each designed to perform one of the roles.
Using this strategy, the model dynamically activates role-specific LoRA adapters during inference via self-calling, switching seamlessly between the roles. This eliminates the need and cost of deploying multiple models while improving the efficiency and flexibility of a single model.
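As a rough illustration of what such adapter switching might look like in practice, the sketch below uses the Hugging Face peft library to attach several LoRA adapters to one frozen backbone and activate them by role. The backbone name, adapter paths, and prompt are placeholder assumptions, not the released VideoMind checkpoints (which build on Qwen2-VL).

```python
# Sketch of Chain-of-LoRA style role switching with Hugging Face peft.
# Model name, adapter paths, and prompt are placeholders, not the actual
# VideoMind weights (which are built on Qwen2-VL).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2-0.5B-Instruct"  # small text-only stand-in for the backbone

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

# Attach one lightweight LoRA adapter per role to the same frozen backbone.
model = PeftModel.from_pretrained(base_model, "path/to/planner_lora", adapter_name="planner")
model.load_adapter("path/to/grounder_lora", adapter_name="grounder")
model.load_adapter("path/to/verifier_lora", adapter_name="verifier")
model.load_adapter("path/to/responder_lora", adapter_name="responder")

def run_role(role: str, prompt: str) -> str:
    """Activate the LoRA adapter for `role` and generate a response."""
    model.set_adapter(role)  # switch roles without reloading or duplicating the model
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# The planner's output would decide which role to call next; here it is called directly.
print(run_role("planner", "Question: when does the goal happen in the match video?"))
```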
VideoMind is open source on GitHub and Hugging Face, and the related research is available on the arXiv preprint server, along with details of experiments assessing its effectiveness in temporally grounded video understanding across 14 diverse benchmarks. Comparing VideoMind with several state-of-the-art AI models, including GPT-4o and Gemini 1.5 Pro, the researchers found that its grounding accuracy surpassed all competitors on challenging tasks involving videos with an average duration of 27 minutes.
Notably, the team included two versions of VideoMind in the experiments: a smaller model with 2 billion (2B) parameters and a larger one with 7 billion (7B) parameters. The results showed that, even at the 2B size, VideoMind still performed comparably to many of the other 7B-size models.
Professor Chen said, “Humans switch among different thinking modes when understanding videos: they break down tasks, identify relevant moments, revisit them to confirm the details, and synthesize their observations into coherent answers. The whole process consumes only about 25 watts of power.
“Inspired by this, we designed a role-based workflow that allows AI to understand videos in a human-like way, and we adopted a chain-of-LoRA strategy to minimize the computing power and memory needed in the process.”
AI is at the heart of global technological development, but progress in AI models is constrained by insufficient computing power and excessive power consumption. Built on the unified, open-source Qwen2-VL model and extended with additional optimization tools, the VideoMind framework offers a viable solution to this bottleneck, lowering the cost and deployment threshold of the technology while reducing the power consumption of AI models.
Professor Chen said, “VideoMind not only overcomes the performance limitations of AI models in video processing, but also serves as a modular, scalable and interpretable multimodal reasoning framework. We envision it expanding the application of generative AI into fields such as intelligent surveillance, sports and entertainment video analysis, and video search engines.”
More information:
Ye Liu et al, VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning, arXiv (2025). DOI: 10.48550/arxiv.2503.13444
Provided by The Hong Kong Polytechnic University
Citation: Multimodal AI agent mimics human thinking for long video analysis and reasoning (June 10, 2025) retrieved from https://techxplore.com/news/2025-06-multi-modal-ai-agent-mimics.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
