According to the researchers, their AI model, DenseAV, learns word meanings and sound locations just by watching videos, without any human input or text.
In their paper, researchers from MIT, Microsoft, Oxford and Google explained that DenseAV achieves this using only self-supervision from video.
To learn these patterns, the researchers use contrastive audio-video learning to associate specific sounds with the observable world. In this setup, the visual side of the model cannot gain insight from the audio side (and vice versa), forcing the algorithm to recognize objects in a meaningful way.
It learns by comparing pairs of audio and visual signals to determine which data is important, then evaluates which signals match and which don't. DenseAV is able to learn without labels, because understanding language and recognizing sounds makes it easier to predict what you see from what you hear.
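The matching game described above can be sketched as a generic contrastive loss. This is a minimal illustration, not DenseAV's actual objective: it assumes each clip has already been reduced to one audio vector and one visual vector, and the function names, the temperature value and the toy data are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_diagonal(sim):
    # Mean cross-entropy where the correct match for row i is column i.
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum - row[i]
    return total / len(sim)

def contrastive_loss(audio_embs, visual_embs, temperature=0.1):
    # audio_embs[i] and visual_embs[i] are assumed to come from the same clip.
    a = [normalize(x) for x in audio_embs]
    v = [normalize(x) for x in visual_embs]
    # Similarity of every audio clip to every visual clip in the batch.
    sim = [[dot(ai, vj) / temperature for vj in v] for ai in a]
    sim_t = [list(col) for col in zip(*sim)]
    # Symmetric: audio-to-video and video-to-audio matching.
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))

# Toy data (made up): two clips with 2-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
video_matched = [[0.9, 0.1], [0.1, 0.9]]
video_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = contrastive_loss(audio, video_matched)
loss_swapped = contrastive_loss(audio, video_swapped)
print(loss_matched < loss_swapped)
```

The loss is low when each sound is most similar to its own video and high when the pairs are mismatched, which is the signal that lets a model of this kind learn without labels.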
How does it work?
The idea for the process came to MIT doctoral student Mark Hamilton while he was watching the movie March of the Penguins. In one scene, a penguin falls and groans.
“It's almost clear that the moans are standing in for four-letter words. It was at this moment that we realized that maybe we need to use audio and video to learn language,” Hamilton said in an MIT news release.
His goal was to have the model learn language by predicting what it sees from what it hears, and vice versa: if you hear a voice say “Pick up your violin and start playing,” you're likely to see either a violin or a musician. This game of matching audio and video was repeated with different videos.
Once this was done, the researchers examined which pixels the model focused on when it heard a particular sound. If someone said “cat,” for example, the algorithm would start looking for cats in the video, and the pixels it selected showed what it thought a particular word meant.
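One way to picture this inspection step is as a similarity map between a sound and every location in a frame. The sketch below is an assumption-laden toy, not the researchers' code: the feature vectors are invented, the grid is 2 by 3 rather than a real image, and `similarity_map` and `argmax_2d` are hypothetical helper names.

```python
def similarity_map(audio_feature, pixel_features):
    # Dot product between one audio feature and the feature at each grid location.
    return [
        [sum(a * p for a, p in zip(audio_feature, pix)) for pix in row]
        for row in pixel_features
    ]

def argmax_2d(grid):
    # Return the (row, column) of the largest value in the grid.
    best = max(
        (value, r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
    )
    return best[1], best[2]

# Made-up feature for the spoken word "cat".
cat_audio = [1.0, 0.0]

# Made-up 2x3 grid of visual features; only the bottom-right cell resembles "cat".
pixel_features = [
    [[0.1, 0.8], [0.2, 0.7], [0.0, 0.9]],
    [[0.1, 0.9], [0.3, 0.6], [0.9, 0.1]],
]

heat = similarity_map(cat_audio, pixel_features)
peak = argmax_2d(heat)
print(peak)
```

The location with the highest score is where this toy model "looks" when it hears the word, which is the kind of evidence the researchers read to infer what a word means to the model.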
If DenseAV hears someone say “cat” and then hears a cat meow, it might identify an image of a cat in the shot both times. But does that mean the algorithm thinks the word “cat” and a cat's meow are the same thing?
The researchers investigated this by giving DenseAV a “bilateral brain” and found that one side naturally focuses on speech and the other on sounds like meowing, meaning that DenseAV actually learned to tell the two kinds of sound apart without any human intervention.
Why is this useful?
A huge amount of video content already exists, so a model like this could be trained on material such as instructional videos.
“Another interesting application would be understanding new languages where there is no written communication, such as dolphin and whale communication,” Hamilton said.
The team's next step is to create a system that can learn from either video-only or audio-only data, which would be useful in fields where one type of material is plentiful but the other is scarce.
