Recent video-language models (VidLMs) perform impressively on a wide range of video-language tasks, but these multimodal models are not without drawbacks. For example, vision-language models tend to treat images as collections of objects, which makes it difficult for them to comprehend compositional and ordering relationships within a scene. This limitation suggests that such models' awareness of how objects interact, and their understanding of actions that require structured reasoning, may need to be improved. To test this hypothesis, the researchers begin by defining action knowledge as an understanding of the cause and effect of actions along the textual, visual, and temporal dimensions.
Researchers at UIUC and UNC introduced the Action Dynamics Benchmark (ActionBench) to measure a model's understanding of actions. ActionBench contains two challenging tasks: (1) distinguishing original videos from temporally reversed ones, and (2) identifying video captions in which action verbs have been replaced by their antonyms. The benchmark also includes a baseline task to minimize the impact of domain mismatch and to probe potential biases toward objects: the model must distinguish the original video captions from edited versions in which arbitrary objects have been replaced.
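To make the structure of these probing tasks concrete, below is a minimal Python sketch of how such positive/negative pairs could be constructed. The helper names, the antonym table, and the object list are illustrative assumptions, not the actual ActionBench data pipeline.

```python
import random

# Illustrative verb/antonym and object lists; the real ActionBench construction may differ.
ACTION_ANTONYMS = {"push": "pull", "open": "close", "lift": "drop", "enter": "exit"}
OBJECTS = ["book", "cup", "chair", "door"]

def reversed_video_task(frames):
    """Task 1: the model must distinguish the original frame order from the reversed clip."""
    return {"positive": frames, "negative": list(reversed(frames))}

def antonym_caption_task(caption):
    """Task 2: the model must prefer the true caption over one whose action verb
    has been replaced by its antonym."""
    edited = caption
    for verb, antonym in ACTION_ANTONYMS.items():
        if verb in caption.split():
            edited = caption.replace(verb, antonym)
            break
    return {"positive": caption, "negative": edited}

def object_baseline_task(caption):
    """Baseline: swap an object noun for an arbitrary one, to test whether
    object recognition alone explains the benchmark score."""
    words = caption.split()
    for i, w in enumerate(words):
        if w in OBJECTS:
            words[i] = random.choice([o for o in OBJECTS if o != w])
            break
    return {"positive": caption, "negative": " ".join(words)}
```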
Modern video-language foundation models perform close to random chance on the action-oriented probing tasks, yet perform very well on the object-oriented baseline task. This demonstrates that action knowledge is largely missing from current VidLMs: their remarkable performance on other benchmarks may stem from their object-recognition skills rather than a genuine grasp of actions. To address this weakness, the researchers propose a framework called PAXION (Patching Actions) that patches action knowledge into an existing VidLM while preserving its general vision-language (VL) capabilities. The Knowledge Patcher and the Knowledge Fuser are the two main components of PAXION.
Corroborating earlier findings, they found that the widely used video-text contrastive (VTC) objective needs to be modified, as it presents a significant barrier to patching in action knowledge. The Knowledge Patcher (KP), a lightweight Perceiver-based module attached to the frozen VidLM backbone, is used to add action-aware representations to the VidLM. To train it, the Discriminative Video Dynamics Modeling (DVDM) objective is introduced, which forces the model to learn the correlation between the action text (e.g., the word "falling") and the visual depiction of the action (e.g., a clip of a falling book). DVDM is inspired by dynamics modeling in robotics and reinforcement learning.
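The following is a minimal sketch of what a Perceiver-style Knowledge Patcher could look like: a small set of learnable latent queries cross-attends over frozen backbone video features to produce action-aware embeddings. Dimensions, layer choices, and pooling are illustrative assumptions, not the exact PAXION architecture.

```python
import torch
import torch.nn as nn

class KnowledgePatcher(nn.Module):
    """Perceiver-style patcher sketch: learnable latents cross-attend over
    frozen VidLM video features to yield an action-aware embedding."""

    def __init__(self, dim=768, num_latents=16, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, video_feats):          # video_feats: (B, T*patches, dim) from the frozen backbone
        B = video_feats.size(0)
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        attended, _ = self.cross_attn(q, video_feats, video_feats)
        latents = attended + self.ffn(attended)
        return latents.mean(dim=1)           # pooled action-aware video embedding, shape (B, dim)

# The backbone stays frozen; only the patcher's parameters are trained, e.g.:
# for p in backbone.parameters(): p.requires_grad_(False)
```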
DVDM comprises two new objectives, Video-Action Contrastive (VAC) and Action-Temporal Matching (ATM), which work alongside VTC without requiring a separate setup. The researchers construct discrimination tasks using action antonyms and reversed videos, with an emphasis on learning from data examples that involve key state transitions. They show that the interaction between the Knowledge Patcher and DVDM greatly improves performance on the ActionBench tasks. They then consider how the Knowledge Patcher, which focuses on understanding actions, can be incorporated into an existing VidLM for downstream tasks that require knowledge of both actions and objects.
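As a rough illustration of the idea behind these discriminative objectives, the sketch below contrasts each video with its action text against an antonym-text hard negative (VAC) and contrasts the action text against a temporally reversed clip (ATM). The exact formulation, negatives, and temperature in the paper may differ; this is only an assumed instantiation.

```python
import torch
import torch.nn.functional as F

def video_action_contrastive(video_emb, action_text_emb, antonym_text_emb, tau=0.07):
    """VAC sketch: pull the video embedding toward its action text and away from
    the antonym text used as a hard negative. Assumes same-dim embeddings."""
    v = F.normalize(video_emb, dim=-1)
    pos = F.normalize(action_text_emb, dim=-1)
    neg = F.normalize(antonym_text_emb, dim=-1)
    logits = torch.stack([(v * pos).sum(-1), (v * neg).sum(-1)], dim=-1) / tau
    target = torch.zeros(v.size(0), dtype=torch.long, device=v.device)  # index 0 = positive
    return F.cross_entropy(logits, target)

def action_temporal_matching(text_emb, video_emb, reversed_video_emb, tau=0.07):
    """ATM sketch: the action text should match the correctly ordered clip better
    than the temporally reversed one."""
    t = F.normalize(text_emb, dim=-1)
    fwd = F.normalize(video_emb, dim=-1)
    rev = F.normalize(reversed_video_emb, dim=-1)
    logits = torch.stack([(t * fwd).sum(-1), (t * rev).sum(-1)], dim=-1) / tau
    target = torch.zeros(t.size(0), dtype=torch.long, device=t.device)
    return F.cross_entropy(logits, target)

# These terms would be added to the standard VTC loss, e.g.:
# loss = vtc_loss + video_action_contrastive(...) + action_temporal_matching(...)
```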
To accomplish this, PAXION includes a Knowledge Fuser (KF) component, which uses cross-attention to blend the object-centric representation from the frozen backbone with the action-centric representation from the Knowledge Patcher. The researchers demonstrate that the fused PAXION representations improve knowledge of both objects and actions. Furthermore, their study shows that the Knowledge Fuser is important for preserving the model's object-related understanding while improving performance on downstream action- and temporally-oriented tasks.
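A minimal sketch of such cross-attention fusion is shown below: object-centric backbone tokens attend to the action-centric patcher latents, and a residual connection keeps the original object information intact. Layer sizes and the residual/normalization scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class KnowledgeFuser(nn.Module):
    """Cross-attention fusion sketch: backbone tokens (object-centric) query the
    Knowledge Patcher latents (action-centric); the residual keeps object content."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, backbone_feats, patcher_latents):
        # backbone_feats: (B, N, dim) tokens from the frozen VidLM
        # patcher_latents: (B, M, dim) tokens from the Knowledge Patcher
        fused, _ = self.cross_attn(backbone_feats, patcher_latents, patcher_latents)
        return self.norm(backbone_feats + fused)
```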
PAXION's robustness is further evaluated in zero-shot cross-domain transfer settings on the Moments-in-Time and Kinetics datasets. The researchers find that fusing PAXION with the backbone model enables effective transfer to new domains and increases robustness to domain shifts. This is the first study to rigorously analyze action knowledge and patch it into video-language foundation models.
Their three main contributions are:
1. They provide the Action Dynamics Benchmark (ActionBench) to test a video-language model's ability to recognize actions. Their analysis of three state-of-the-art video-language foundation models reveals that these models lack a basic grasp of action knowledge.
2. They propose a novel learning framework, PAXION, that adds the missing action knowledge to a frozen video-language foundation model without compromising the model's general vision-language abilities. The Perceiver-based Knowledge Patcher and the cross-attention-based Knowledge Fuser are the two main components of PAXION.
3. They propose the DVDM objective, which drives the model to encode the relationship between action text and the correct ordering of video frames, as an improvement over the commonly used VTC loss. Extensive experiments demonstrate that PAXION trained with DVDM improves joint understanding of objects and actions while remaining robust to domain shifts.
Check out the paper and code for more details. If you have any questions regarding the article above or feel we missed something, feel free to email Asif@marktechpost.com.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his Bachelor of Science in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is in image processing and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.
