New algorithm discovers languages just by watching videos

Mark Hamilton, a PhD student in electrical engineering and computer science at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), wants to use machines to understand how animals communicate, so he started by creating a system that could learn human language “from scratch.”

“Funnily enough, a key inspiration was the movie 'March of the Penguins.' There's a scene where a penguin falls while crossing the ice and lets out a little groan as it gets up, and it's pretty clear that this groan is a stand-in for a four-letter word. That was the moment we thought maybe we need to use audio and video to learn language,” says Hamilton. “How can we get an algorithm to watch TV all day and then understand what we're saying from that?”

“Our model, DenseAV, aims to learn language by predicting what it sees from what it hears and vice versa. For example, if you hear someone say, 'bake the cake at 350 degrees,' they're likely looking at a cake or an oven. To succeed at this audio-video matching game with millions of videos, the model needs to learn what people are talking about,” Hamilton says.

The paper describing this research is available on the arXiv preprint server.

Hamilton and his colleagues trained DenseAV on this matching game, then examined which pixels the model looks at when it hears a sound. For example, if someone says “dog,” the algorithm immediately starts looking for dogs in the video stream. Seeing which pixels the algorithm selects reveals what it thinks the word means.

Interestingly, a similar search process occurs when DenseAV hears a dog barking: it searches for the dog in the video stream.

“This intrigued us: we wanted to see if the algorithm could recognize the difference between the word 'dog' and a dog barking,” says Hamilton. The team explored this by giving DenseAV a “two-sided brain.” They found that one side of DenseAV's brain naturally focuses on language, like the word “dog,” while the other side focuses on sounds, like a dog barking. This shows that DenseAV not only learned the meaning of words and the location of sounds, but also learned to distinguish between these types of cross-modal connections, all without human intervention or any knowledge of written language.

One area of application is learning from the vast amount of videos that are published on the Internet every day.

“We need systems that can learn from large amounts of video content, such as instructional videos,” says Hamilton. “Another interesting application is understanding new languages that have no written communication, such as dolphin and whale communication. We hope that DenseAV can help us understand these languages that have eluded human translation efforts from the start. Finally, we hope that we can use this method to discover patterns between other pairs of signals, such as seismic sounds made by the Earth and its geology.”


The team was faced with a daunting task: learning a language without text input. Their aim was to rediscover the meaning of language from a clean slate, without using pre-trained language models. This approach was inspired by how children learn by observing and listening to their environment to understand language.

To achieve this feat, DenseAV uses two main components that process audio and visual data separately. This separation makes it impossible for the algorithm to cheat by letting the visual side look at the audio or vice versa. It forces the algorithm to recognize objects and to create detailed, meaningful features for both the audio and the visual signals. DenseAV learns by comparing pairs of audio and visual signals to discover which signals match and which don't. This method, called contrastive learning, does not require labeled examples and allows DenseAV to uncover the important predictive patterns of language itself.
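The matching game behind contrastive learning can be illustrated with a toy loss function. The sketch below is a generic InfoNCE-style objective, not the loss used in the DenseAV paper, which this article does not specify; the function name `info_nce_loss` and the example scores are made up for illustration. It assumes that row i of a score table holds the similarity between audio clip i and every video frame in a batch, with the true pair on the diagonal.

```python
import math


def info_nce_loss(similarity):
    """Average cross-entropy of picking the true partner in each row.

    similarity[i][j] is a score between audio clip i and video frame j.
    Assumption for this sketch: the matching pairs sit on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(similarity):
        # Log-sum-exp over the row, shifted by the max for numerical stability.
        row_max = max(row)
        log_norm = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        # Loss is low when the true pair's score dominates the row.
        total += log_norm - row[i]
    return total / len(similarity)


# Hypothetical scores: true pairs score high in `matched`, low in `shuffled`.
matched = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
shuffled = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [5.0, 0.0, 0.0]]

print(info_nce_loss(matched) < info_nce_loss(shuffled))
```

Minimizing a loss of this kind pushes matching audio and video pairs together and mismatched pairs apart, which is why no human labels are needed: the pairing of sound and picture within each video is the only supervision.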

One big difference between DenseAV and previous algorithms is that earlier work relied on a single notion of similarity between sound and images. An entire audio clip, such as someone saying “the dog sat on the grass,” would be matched with an entire image of a dog. As a result, previous methods were unable to discover finer details, such as the connection between the word “grass” and the grass below the dog.

The team's algorithm finds and aggregates all possible matches between moments in the audio clip and pixels in the image, which not only improves performance but also allows the team to localize sounds in a way that previous algorithms could not.

“Traditional methods use a single class token, but our approach compares every pixel and every second of sound. This fine-grained method allows DenseAV to make finer connections and provide more accurate localization,” said Hamilton.
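The contrast between one pooled comparison and a dense, every-pixel-against-every-moment comparison can be sketched in a few lines. This is an illustrative toy, not the DenseAV implementation: the two-dimensional features, the function names, and the choice to take the best pixel for each time step and then average over time are all assumptions made for the example.

```python
def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def global_similarity(audio_feats, pixel_feats):
    """Coarse approach: pool each signal into one vector, compare once."""
    return dot(mean_vector(audio_feats), mean_vector(pixel_feats))


def dense_similarity(audio_feats, pixel_feats):
    """Dense approach: score every (time step, pixel) pair.

    Assumed aggregation: best-matching pixel per time step, averaged over time.
    """
    best_per_step = [max(dot(a, p) for p in pixel_feats) for a in audio_feats]
    return sum(best_per_step) / len(best_per_step)


def localize(audio_vec, pixel_feats):
    """Index of the pixel that best matches one moment of audio."""
    scores = [dot(audio_vec, p) for p in pixel_feats]
    return scores.index(max(scores))


# Hypothetical features: axis 0 stands for "dog", axis 1 for "grass".
audio = [[1.0, 0.0], [0.0, 1.0]]  # the words "dog", then "grass"
pixels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]  # dog, grass, grass, sky

print(dense_similarity(audio, pixels))
print(localize(audio[1], pixels))
```

Because the dense score is built from individual pixel matches, the same table of scores that drives training can also be read back to see where in the image a given word or sound landed, which a single pooled comparison cannot provide.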

The researchers trained DenseAV on AudioSet, which contains 2 million YouTube videos, and also created a new dataset to test how well the model could link sounds and images. In these tests, DenseAV outperformed other top models on tasks such as identifying objects from their names and sounds.

“Because previous datasets only supported coarse-grained evaluation, we created a dataset using semantic segmentation datasets, which provide pixel-perfect annotations that allow us to evaluate our model's performance accurately. We can prompt the algorithm with specific sounds or images and get those fine-grained localizations,” says Hamilton.

Due to the huge amount of data involved, the project took about a year to complete. According to the team, moving to a large transformer architecture presented challenges, as these models can easily overlook fine-grained details. Getting the model to focus on these details was a major hurdle.

Going forward, the team aims to create systems that can learn from large amounts of video-only or audio-only data. This is crucial for new domains where there is plenty of one mode but not both together. The team also plans to scale the approach up with larger backbones and possibly to integrate knowledge from language models to improve performance.

“Recognizing and segmenting visual objects in images, and environmental sounds and spoken words in audio recordings, are each difficult problems. Until now, researchers have relied on expensive human annotation to train machine learning models to accomplish these tasks,” said David Harwath, an assistant professor of computer science at the University of Texas at Austin, who was not involved in the research.

“Based on the insight that the things we see and touch often make sounds, and that we also use spoken language when talking about them, DenseAV makes great strides towards developing methods that can learn to solve these tasks simultaneously just by observing the world through sight and hearing. The model also makes no assumptions about the specific language being spoken, so in principle it can learn from data in any language. It will be exciting to see what we can learn by scaling DenseAV up to thousands or even millions of hours of video data across many languages.”

Additional authors are Andrew Zisserman, professor of computer vision engineering at the University of Oxford; John R. Hershey, Google AI Perception researcher; and William T. Freeman, professor of electrical engineering and computer science at MIT and CSAIL principal investigator.

For more information:
Mark Hamilton et al., “Separating the 'Chirp' from the 'Chat': Self-supervised Visual Grounding of Sound and Language,” arXiv (2024). arxiv.org/abs/2406.05629

Journal Information:
arXiv

Provided by Massachusetts Institute of Technology

This story is reprinted with permission from MIT News (web.mit.edu/newsoffice/), a popular site covering news about MIT research, innovation and education.

Citation: New algorithm discovers languages just by watching videos (June 11, 2024), retrieved June 11, 2024, from https://techxplore.com/news/2024-06-algorithm-language-videos.html

This document is subject to copyright. It may not be reproduced without written permission, except for fair dealing for the purposes of personal study or research. The content is provided for informational purposes only.
