Researchers at MIT, the MIT-IBM Watson AI Lab, and IBM Research have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine-learning models used in applications such as speech recognition and object detection. The work combines, for the first time, two architectures of self-supervised learning, contrastive learning and masked data modeling, in an effort to scale machine-learning tasks like event classification in single- and multimodal data without the need for annotation, mirroring how humans understand and perceive our world.
“Most of human knowledge is learned in a self-supervised way, because we don’t always get supervision signals, and we want machine-learning models to have the same ability,” says Yuan Gong, an MIT postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL).
“So, to put it another way, self-supervised learning often forms the foundation of an initial model, because it can learn from vast amounts of unlabeled data. Supervised learning or reinforcement learning can then be used to fine-tune the model for something specific,” says Jim Glass, an MIT senior research scientist and member of the MIT-IBM Watson AI Lab.
The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a type of neural network that learns to extract meaningful latent representations from acoustic and visual data and map them into a high-dimensional space, by training on large YouTube datasets of 10-second audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.
Joining Gong and Glass on the study are MIT graduate students Andrew Rouditchenko and Alexander H. Liu, David Harwath PhD ’18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.
A collaborative and coordinated approach
CAV-MAE works by “learning by prediction” and “learning by comparison,” says Gong. Masked data modeling, the prediction method, takes a video along with its coordinated audio waveform, converts the audio to a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed into separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference between the resulting reconstructed prediction and the original audio-visual combination (the reconstruction loss) is then used to train the model for better performance. An example of this would be covering part of a video of a piano and part of a spectrogram of piano music, and then asking the model to determine the masked inputs. Unfortunately, this method may not capture the association between the video and audio pair. Contrastive learning, on the other hand, leverages this association, but may discard some modality-unique information, such as the background in a video.
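To make the prediction branch concrete, here is a minimal PyTorch-style sketch of masked audio-visual data modeling. The module sizes, patch dimensions, and the use of learned mask tokens are illustrative assumptions for readability, not the published CAV-MAE architecture (which feeds only the unmasked tokens to the encoders before the joint encoder/decoder recovers the missing data).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAVPredictor(nn.Module):
    """Toy masked audio-visual prediction: mask 75% of both modalities,
    encode each separately, then reconstruct the missing patches jointly."""

    def __init__(self, dim=256, audio_patch=256, video_patch=768):
        super().__init__()
        self.audio_embed = nn.Linear(audio_patch, dim)   # spectrogram patches -> tokens
        self.video_embed = nn.Linear(video_patch, dim)   # video frame patches -> tokens
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.audio_enc = nn.TransformerEncoder(layer, num_layers=2)  # modality-specific
        self.video_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.joint = nn.TransformerEncoder(layer, num_layers=2)      # joint encoder/decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.audio_head = nn.Linear(dim, audio_patch)    # reconstruct raw audio patches
        self.video_head = nn.Linear(dim, video_patch)    # reconstruct raw video patches

    def _mask(self, tokens, ratio=0.75):
        # Replace a random 75 percent of tokens with a learned mask token.
        b, n, _ = tokens.shape
        masked = torch.rand(b, n, device=tokens.device) < ratio
        tokens = torch.where(masked.unsqueeze(-1), self.mask_token.expand(b, n, -1), tokens)
        return tokens, masked

    def forward(self, audio_patches, video_patches):
        a, a_masked = self._mask(self.audio_embed(audio_patches))
        v, v_masked = self._mask(self.video_embed(video_patches))
        a, v = self.audio_enc(a), self.video_enc(v)
        joint = self.joint(torch.cat([a, v], dim=1))     # recover missing data jointly
        n_a = audio_patches.shape[1]
        a_rec = self.audio_head(joint[:, :n_a])
        v_rec = self.video_head(joint[:, n_a:])
        # Reconstruction loss is computed only on the masked positions.
        return (F.mse_loss(a_rec[a_masked], audio_patches[a_masked]) +
                F.mse_loss(v_rec[v_masked], video_patches[v_masked]))
```

In practice, the inputs would be flattened spectrogram patches and image patches taken from the same 10-second clip.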
Contrastive learning aims to map representations that are similar close to each other. For example, the model tries to place video and audio data of different parrots close to each other and farther away from video and audio pairs of guitars playing. In a fashion similar to masked autoencoding, audio-visual pairs are passed into separate modality encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and computes the contrastive loss. In this way, contrastive learning tries to identify the parts of each audio or video that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the autoencoder learns to associate the speaker’s mouth movements with the spoken words, and then adjusts the model’s parameters so that those inputs are represented close to each other. Ultimately, the CAV-MAE method combines both techniques using multiple forward data streams with masking as a first step, modality-specific encoders, and layer normalization so that the representation strengths are similar.
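A minimal sketch of this comparison branch, assuming mean pooling and a symmetric InfoNCE-style objective; the temperature and the weighting used to combine it with the reconstruction loss are illustrative hyperparameters, not the paper's values.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_tokens, video_tokens, temperature=0.07):
    """Pull pooled embeddings of matching audio-video clips together and
    push mismatched pairs in the batch apart (symmetric InfoNCE)."""
    a = F.normalize(audio_tokens.mean(dim=1), dim=-1)    # (batch, dim) pooled audio
    v = F.normalize(video_tokens.mean(dim=1), dim=-1)    # (batch, dim) pooled video
    logits = a @ v.t() / temperature                     # pairwise cosine similarities
    targets = torch.arange(a.shape[0], device=a.device)  # true pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +     # audio -> video direction
                  F.cross_entropy(logits.t(), targets))  # video -> audio direction

# Combining both objectives, as described above, could look like:
#   total_loss = reconstruction_loss + lambda_c * contrastive_loss(a_tokens, v_tokens)
# where lambda_c is a weighting hyperparameter (an assumption in this sketch).
```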
“We [then] wanted to compare the proposed CAV-MAE with a model trained only with a masked autoencoder and a model trained only with contrastive learning, because we wanted to show that by combining the masked autoencoder and contrastive learning, we could get some performance improvement,” says Gong. “And the results support our hypothesis that there is a clear improvement.”
The researchers tested CAV-MAE, as well as their method without contrastive loss or a masked autoencoder, against other state-of-the-art techniques on audio-visual retrieval and audio-visual event classification tasks using the standard AudioSet (20K and 2M) and VGGSound datasets, which contain labeled, realistic short clips that can include multiple sounds. Audio-visual retrieval means that the model sees either the audio or visual component of a query pair and searches for the missing one; event classification involves identifying actions or sounds within the data, such as a person singing or a car driving.
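For illustration, audio-visual retrieval with a trained model can be as simple as ranking candidate clips of the missing modality by cosine similarity to the query embedding; the function below is a hypothetical sketch, not code from the paper.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, candidate_embs, top_k=5):
    """Given e.g. an audio query embedding, return the indices of the
    top-k most similar video candidate embeddings (or vice versa)."""
    q = F.normalize(query_emb, dim=-1)          # (dim,)
    c = F.normalize(candidate_embs, dim=-1)     # (num_candidates, dim)
    scores = c @ q                              # cosine similarity per candidate
    return torch.topk(scores, k=top_k).indices  # best-matching clips first
```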
Overall, they found that contrastive learning and masked data modeling are complementary methods. CAV-MAE was able to outperform previous techniques (with fully self-supervised pre-training) by about 2 percent on event classification performance compared with models of comparable computation, and, even more surprisingly, it matched or exceeded models trained with industry-level computational resources. The team's model also ranked similarly to models trained with only the contrastive loss. And surprisingly, the team says, incorporating multimodal data into CAV-MAE pre-training greatly improves the fine-tuning of single-modality representations via supervised learning (with some labeled data) and performance on audio-only event classification tasks. This indicates that, like humans, multimodal information provides an additional “soft-label” boost even for audio-only or visual-only tasks; for instance, it helps the model understand whether it is looking for an electric or an acoustic guitar, that is, a richer supervision signal.
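The audio-only fine-tuning step mentioned above could look roughly like the following, where `pretrained_audio_encoder` (a module mapping spectrogram patches to token embeddings, initialized from self-supervised pre-training) and the class count are assumptions for illustration.

```python
import torch.nn as nn

class AudioEventClassifier(nn.Module):
    """Reuse a pretrained audio encoder and fine-tune a small head
    on labeled clips for audio-only event classification."""

    def __init__(self, pretrained_audio_encoder, dim=256, num_classes=527):
        super().__init__()
        self.encoder = pretrained_audio_encoder   # weights from self-supervised pre-training
        self.head = nn.Linear(dim, num_classes)   # e.g., 527 event classes in AudioSet

    def forward(self, audio_patches):
        tokens = self.encoder(audio_patches)      # (batch, n_tokens, dim)
        return self.head(tokens.mean(dim=1))      # pooled clip-level logits
```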
“I think people will like the elegance of this model, which combines information from the different audio and visual streams, and it clearly works very well across these tasks,” says Glass.
Building on this, “one special thing is that our model can do both classification and retrieval, which is not common,” Gong adds. “Before this work, these methods were used separately, but after this work, I see that most audio-visual learning frameworks use the contrastive loss and the masked autoencoder, either implicitly or explicitly.”
Bringing self-supervised audio-visual learning into our world
The researchers see the contrastive audio-visual masked autoencoder (CAV-MAE) as an important milestone and a step forward for applications that are increasingly moving from single modality to multimodality and that require or leverage audio-visual fusion. They hypothesize that one day it could be used for action recognition in realms such as sports, education, entertainment, automotive, and public safety, and that it may eventually extend to other modalities. At the moment, “the fact that this only applies to audio-visual data may be a limitation, but we are targeting multimodal learning, which is a trend in machine learning,” says Gong. “As humans, we have more modalities than just hearing and sight, such as smell and touch, and this method could [potentially be] generalized to other, unexplored modalities.”
Such technologies will only become more valuable as machine-learning models continue to play an increasingly important role in our lives.
This research was supported by the MIT-IBM Watson AI Lab.