MIT Researchers Propose New Multimodal Technique that Blends Machine Learning Methods to Learn More Like Humans

https://openreview.net/pdf?id=QPtMRyk5rb

Artificial intelligence is reshaping many of the applications we use every day, and audio and visual media are among the most visible examples. Think of AI-powered apps that generate funny videos or artistically stunning images, copy celebrity voices, or transcribe entire lectures with a single click. All of these models require huge corpora of data to train, and most successful systems rely on annotated datasets to learn.

The biggest challenge is storing this data, annotating it, and turning it into usable data points that a model can consume. That is easier said than done: every year, companies struggle to collect and create gold-standard data points.

Now, researchers at MIT, the MIT-IBM Watson AI Lab, IBM Research, and other institutions have developed a technique that addresses these problems by analyzing unlabeled audio and video data. The method has the potential to improve how current models are trained, and it is relevant to many kinds of models, such as speech recognition, transcription and speech generation engines, and object detection. It combines two self-supervised learning approaches: contrastive learning and masked data modeling. The underlying idea is to replicate how humans perceive and understand the world.


As MIT postdoctoral fellow Yuan Gong explained, much of how humans gather knowledge and learn from it happens without direct supervision, which is why self-supervised learning is essential. The goal is to enable the same process in machines so that they learn as many features as possible from unlabeled data. This pretraining provides a strong foundation that can then be built on with supervised learning or reinforcement learning, depending on the use case.

The technique is called the Contrastive Audio-Visual Masked Autoencoder (CAV-MAE). It uses a neural network to extract and map meaningful latent representations from audio and visual data. The model can be trained on large datasets of 10-second YouTube clips that contain both audio and video. The researchers claim that CAV-MAE outperforms previous approaches because it explicitly models the relationship between audio and visual data, which other methods do not.

The CAV-MAE method incorporates two approaches: masked data modeling and contrastive learning. Masked data modeling works as follows:

Take a video and its matching audio waveform.
Convert the audio to a spectrogram.
Mask 75% of the audio and video data.
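The masking step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: it assumes the spectrogram and the video frames have already been split into a flat sequence of patches, and the function name random_mask, the patch count of 16, and the fixed seed are all hypothetical choices made for the example.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into visible and masked sets.

    Hypothetical helper for illustration: the article only states that
    75% of the audio and video data is masked, not how patches are chosen.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Example: 16 patches, of which 12 are hidden from the encoder.
visible, masked = random_mask(num_patches=16, mask_ratio=0.75, seed=0)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

Only the visible patches would be passed to the encoder; the masked ones are what the model is later asked to reconstruct.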

The model then recovers the missing data through a joint encoder/decoder. The reconstruction loss, which measures the difference between the reconstructed prediction and the original audio-visual combination, is used to train the model. Contrastive learning, the second component, aims to map similar representations close to each other. It does this by associating relevant parts of the audio and video data, such as connecting mouth movements to spoken words.
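To make the two training signals concrete, here is a minimal sketch in plain Python. It rests on assumptions the article does not confirm: mean squared error stands in for the reconstruction loss, an InfoNCE-style loss in the audio-to-video direction only stands in for the contrastive objective, and the function names and the temperature value of 0.05 are hypothetical. How the two losses are weighted against each other is not covered in the article, so it is left out here.

```python
import math


def mse_on_masked(original, reconstructed, masked_idx):
    """Mean squared error over the masked patches only (assumed loss form)."""
    total, count = 0.0, 0
    for i in masked_idx:
        for o, r in zip(original[i], reconstructed[i]):
            total += (o - r) ** 2
            count += 1
    return total / count


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(audio_emb, video_emb, temperature=0.05):
    """InfoNCE-style loss: the i-th audio clip should match the i-th video clip.

    Simplification: only the audio-to-video direction is scored.
    """
    audio = [_normalize(a) for a in audio_emb]
    video = [_normalize(v) for v in video_emb]
    loss = 0.0
    for i, a in enumerate(audio):
        logits = [sum(p * q for p, q in zip(a, v)) / temperature for v in video]
        top = max(logits)  # subtract the max for numerical stability
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(audio)


# Toy data: two clips with 2-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
print("matched pairs give the lower loss:", matched < swapped)
```

The toy comparison at the end shows the intended behavior: embeddings of an audio clip and its own video score a much lower loss than embeddings paired with the wrong video.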

The researchers tested CAV-MAE against other models on an audio-video retrieval task and an audio-visual classification task. The results demonstrated that contrastive learning and masked data modeling are complementary methods. CAV-MAE surpassed previous techniques in event classification and remained competitive with models trained using industry-level computational resources. Moreover, incorporating multimodal data significantly improved fine-tuning of single-modality representations and performance on audio-only event classification tasks.

The MIT researchers believe CAV-MAE is a breakthrough in the advancement of self-supervised audio-visual learning. They envision use cases ranging from action recognition in sports, education, entertainment, automotive, and public safety to automatic speech recognition and audio-video generation across languages. Although the current method focuses on audio-visual data, the researchers recognize that human perception involves multiple senses beyond audio and visual cues, and they aim to extend the method to other modalities.

It will be interesting to see how this approach performs over time and whether existing models incorporate such techniques.

Researchers expect that as machine learning advances, techniques like CAV-MAE will become more valuable, allowing models to better understand and interpret the world.


Anant is a computer science engineer, currently working as a data scientist, with experience in financial and AI-as-a-service products. He is passionate about building AI-powered solutions that create better data points and solve everyday life problems in an effective and efficient way.
