Looking for a specific action in a video? This AI-based method will help you find it

The internet is filled with instructional videos teaching curious viewers everything from how to make the perfect pancakes to how to perform the life-saving Heimlich maneuver.

But pinpointing exactly when and where a particular action occurs in a long video is a tedious task. To streamline this process, scientists are trying to teach computers to do it. Ideally, users would just describe the action they're looking for, and the AI model would skip to that spot in the video.

However, teaching a machine learning model to do this typically requires large amounts of expensive, laborious, hand-labeled video data.

A new, more efficient approach from researchers at MIT and the MIT-IBM Watson AI Lab uses only video and automatically generated transcripts to train a model to perform this task, called spatiotemporal grounding.

The researchers teach a model to make sense of unlabeled video in two distinct ways: by looking at small details to figure out where objects are (spatial information), and by looking at the big picture to understand when an action occurs (temporal information).

Compared to other AI approaches, their technique more accurately identifies actions in long videos containing multiple activities. Interestingly, the researchers found that training on spatial and temporal information simultaneously improves the model's ability to identify each separately.

In addition to streamlining online learning and virtual training, the technology could be useful in healthcare settings, for example by quickly finding key moments in videos of diagnostic procedures.

“When we disentangle the challenge of encoding spatial and temporal information at once, and think of it as two experts working independently, we find that this is a clearer way of encoding information. Our model, which combines these two separate branches, delivers the best performance,” said Brian Chen, lead author of a paper on the technique.

Chen, who will graduate from Columbia University in 2023, conducted the research as a visiting student at the MIT-IBM Watson AI Lab. Chen's co-authors on the paper are James Glass, a senior research scientist at the MIT-IBM Watson AI Lab and head of the Spoken Language Systems Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kühne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and other researchers from MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Global and local learning

Researchers typically teach models to perform spatiotemporal grounding using videos in which humans have annotated the start and end times of specific tasks.

Not only is this data costly to generate, but it can also be difficult for humans to determine exactly what to label: if an action is “fry pancakes,” does the action start when the chef starts mixing the batter, or when he pours the batter into the pan?

“This time the task might be about cooking, and the next one might be about fixing a car. There's a huge range of domains that people need to annotate. But if we can learn all of them without labels, that's a more general solution,” Chen says.

In their approach, the researchers use unlabeled instructional videos and the accompanying text transcripts from websites such as YouTube as training data. This material requires no special preparation.

They split the training process into two parts: first, they teach the machine learning model to look at the entire video and understand what actions occur at a given time. This high-level information is called a global representation.

Second, they teach the model to focus on the specific region of a video where the action is happening. For example, in a large kitchen, the model might only need to focus on the wooden spoon the chef is using to mix the pancake batter, rather than the entire counter. This fine-grained information is called a local representation.
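
To make the two representations concrete, here is a minimal sketch in Python. It is not the researchers' model: the function names, the toy embedding vectors, and the use of cosine similarity are all assumptions made for illustration. The sketch compares a text query with one global vector per frame to decide when the action happens, then with local vectors for regions of that frame to decide where.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def ground(query, frame_vecs, region_vecs):
    """Return (frame_index, region_index) that best match the query.

    frame_vecs:  one global vector per frame (hypothetical embeddings)
    region_vecs: for each frame, a list of local vectors, one per region
    """
    # Temporal step: which frame matches the query best overall?
    t = max(range(len(frame_vecs)), key=lambda i: cosine(query, frame_vecs[i]))
    # Spatial step: within that frame, which region matches best?
    regions = region_vecs[t]
    r = max(range(len(regions)), key=lambda j: cosine(query, regions[j]))
    return t, r

# Toy data (made up): 3 frames, 2 regions per frame, 2-dimensional embeddings.
query = [1.0, 0.0]
frame_vecs = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
region_vecs = [
    [[0.0, 1.0], [0.1, 0.9]],
    [[0.2, 0.8], [1.0, 0.1]],
    [[0.5, 0.5], [0.4, 0.6]],
]
result = ground(query, frame_vecs, region_vecs)
print(result)
```

In a real system the vectors would come from trained video and text encoders; here they are hand-written so the two-step lookup is easy to follow.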

The researchers built additional components into their framework to mitigate misalignment between the narration and the video. For example, the chef might first talk about how to cook the pancakes and perform the action only later.
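
The article does not describe how these components work, so the following Python sketch shows only one simple way to tolerate a gap between speech and action: a windowed search, which is an assumption here and not the researchers' method. Given hypothetical per-frame match scores for a narrated sentence, it searches a few frames before and after the moment the sentence is spoken and returns the best-matching frame.

```python
def align_narration(scores, narrated_at, window):
    """Pick the frame that best matches a sentence near where it was spoken.

    scores:      per-frame match scores for the sentence (hypothetical values)
    narrated_at: index of the frame at which the sentence is spoken
    window:      how many frames before and after to search
    """
    lo = max(0, narrated_at - window)
    hi = min(len(scores), narrated_at + window + 1)
    # Best-scoring frame inside the search window.
    return max(range(lo, hi), key=lambda i: scores[i])

# The chef talks about the step at frame 1 but performs it around frame 4.
scores = [0.1, 0.2, 0.1, 0.3, 0.8, 0.7, 0.2]
best = align_narration(scores, narrated_at=1, window=3)
print(best)
```

A wider window tolerates a longer delay between the narration and the action, at the cost of a higher chance of matching an unrelated moment.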

To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. In contrast, most AI techniques are trained on few-second clips that someone has trimmed to show just a single action.

A new benchmark

But when the researchers came to evaluate their approach, they couldn't find an effective benchmark to test their model on these longer, uncut videos, so they created one.

To build the benchmark dataset, the researchers devised a new annotation technique that was effective at identifying multi-step actions: Rather than drawing boxes around important objects, they had users mark the intersections of objects, such as the points where a knife blade cuts a tomato.

“This makes it more clearly defined and speeds up the annotation process, reducing human effort and costs,” Chen says.

Additionally, having multiple people annotate points on the same video can better capture actions that occur over time, such as a stream of pouring milk, as not all annotators will mark the exact same points in the liquid flow.
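
As a rough illustration of how point annotations from several people could be used to score a model, consider the Python sketch below. The scoring rule is an assumption, not the benchmark's published metric: a predicted point counts as a hit if it lands within a chosen pixel radius of any annotator's point. The coordinates and function names are made up.

```python
import math

def is_hit(pred, annotated_points, radius):
    """True if the predicted point is within `radius` pixels of any annotated point."""
    return any(math.dist(pred, p) <= radius for p in annotated_points)

def pointing_accuracy(preds, annotations, radius):
    """Fraction of frames in which the prediction hits an annotated point."""
    hits = sum(is_hit(p, pts, radius) for p, pts in zip(preds, annotations))
    return hits / len(preds) if preds else 0.0

# Frame 1: three annotators marked different points along a stream of milk.
# Frame 2: two annotators marked where a knife meets a tomato.
annotations = [
    [(100, 50), (102, 70), (101, 90)],
    [(200, 120), (205, 118)],
]
preds = [(103, 72), (150, 40)]  # hypothetical model outputs, one per frame
acc = pointing_accuracy(preds, annotations, radius=10)
print(acc)
```

Because any annotator's point can produce a hit, a prediction anywhere along the stream of milk is accepted, which reflects the benefit of collecting several annotations per video.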

The researchers used this benchmark to test their approach and found that it pinpointed actions more accurately than other AI techniques.

Their method also excels at focusing on human-object interactions. For example, if the action is “serve pancakes,” many other approaches might focus only on the primary object, such as the pancakes piled on the counter. Instead, their method focuses on the actual moment when the chef flips the pancakes onto the plate.

Next, the researchers plan to enhance their approach so that the model can automatically detect when the text and narration do not match and switch its focus from one modality to the other. They also hope to extend their framework to audio data, as there is usually a strong correlation between actions and the sounds that objects make.

This research is funded by the MIT-IBM Watson AI Lab.
