OmniPredict AI helps cars accurately predict pedestrian behavior



You are standing at a busy intersection. A car slows down as you approach the curb. No driver looks back at you. Instead, software inside the vehicle decides what you are going to do next.

The research, published in the journal Computers and Electrical Engineering, describes one of the first efforts to use multimodal large language models to predict pedestrian behavior in real time. The same family of AI systems that powers advanced chatbots is now being aimed at making roads safer.

An overview of OmniPredict, which is based on the GPT-4o model and uses diverse contextual inputs. (Credit: Computers and Electrical Engineering)

“Cities are unpredictable. Pedestrians are unpredictable,” Saripari said. “Our new model offers a glimpse into a future where machines not only see what's happening, but also predict what humans will do.”

From reaction to anticipation on the street

Pedestrians remain the most vulnerable road users. Unlike drivers, they have no physical protection. For self-driving cars, predicting whether someone will cross the road can determine whether the car should slow down, stop, or keep moving.

Previous systems focused on reaction. A camera detected a person. An algorithm tracked their movement. The vehicle responded once the movement began. These approaches work in familiar situations, but often fail when the weather changes, the lighting changes, or people behave in unexpected ways.

OmniPredict takes a different path. It tries to understand intent rather than just reacting to movement.

“This opens the door to safer autonomous vehicle operation, fewer pedestrian-related accidents, and a shift from reacting to hazards to proactively preventing them,” Saripari said.

That shift could change the way a city feels. There would be no eye contact or hand signals at crosswalks. The vehicle would silently plan around your likely next move.

The OmniPredict pipeline. OmniPredict uses four input modalities: a scene context image, a local context image, a bounding box, and the ego vehicle's speed. (Credit: Computers and Electrical Engineering)

“There are fewer tense standoffs. There are fewer near misses,” Saripari said. “Roads may flow even more freely, all because vehicles understand not only motion but, most importantly, motives.”

How OmniPredict reads human behavior

OmniPredict is built on GPT-4o, a multimodal large language model designed to reason across images, text, and structured data. Rather than generating dialogue, the model interprets the scene and predicts behavior.

The system uses four inputs. It analyzes wide scene images to understand the environment. It studies close-up images of pedestrians. It reads bounding box data that describes each pedestrian's location and size. It also takes the speed of the vehicle itself into account.

The system is fed 16 past video frames. From these, OmniPredict predicts what will happen about one second later.
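The observation window described above can be sketched as a fixed-length buffer. This is only an illustrative sketch, not the authors' code: the class names, field names, and units below are assumptions, and only the four inputs and the 16-frame history come from the article.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

OBSERVATION_FRAMES = 16  # from the article: 16 past frames per prediction


@dataclass
class FrameObservation:
    """One time step of the four inputs (field names are hypothetical)."""
    scene_image: str                 # path or ID of the wide scene image
    pedestrian_crop: str             # path or ID of the close-up pedestrian image
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, assumed layout
    ego_speed_kmh: float             # speed of the vehicle itself, assumed unit


class ObservationWindow:
    """Keeps only the most recent OBSERVATION_FRAMES observations."""

    def __init__(self, size: int = OBSERVATION_FRAMES) -> None:
        # deque(maxlen=...) silently drops the oldest item when full
        self._frames: Deque[FrameObservation] = deque(maxlen=size)

    def push(self, obs: FrameObservation) -> None:
        self._frames.append(obs)

    def ready(self) -> bool:
        """True once a full history is available for a prediction."""
        return len(self._frames) == self._frames.maxlen

    def snapshot(self) -> List[FrameObservation]:
        """Oldest-to-newest copy of the current window."""
        return list(self._frames)
```

A caller would push one observation per video frame and request a prediction only once `ready()` returns true.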

The researchers asked the model to identify four specific behaviors. Will the pedestrian cross or not? Is the person walking or standing? Is the pedestrian partially obscured? Is the person looking in the direction of the vehicle?

To make this work, the spatial information was converted to text so that the model could reason about it in the same way it processes written instructions. Explicit prompts led the system to return structured answers rather than open-ended explanations.
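One way to picture this step is as a pair of helpers, one that writes bounding boxes and speeds out as plain text and one that validates a structured reply. This is a minimal sketch under stated assumptions: the prompt wording, the JSON reply format, and the key names are hypothetical and are not taken from the paper, and no model is called here.

```python
import json
from typing import Dict, Sequence, Tuple

# Hypothetical key names for the four behaviors the article lists.
BEHAVIOR_KEYS = ("crossing", "walking", "occluded", "looking")

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, assumed layout


def describe_track(bboxes: Sequence[Box], speeds_kmh: Sequence[float]) -> str:
    """Turn per-frame boxes and ego speeds into one text line per frame."""
    lines = []
    for i, ((x1, y1, x2, y2), v) in enumerate(zip(bboxes, speeds_kmh)):
        lines.append(
            f"frame {i}: box=({x1},{y1},{x2},{y2}) "
            f"size={x2 - x1}x{y2 - y1}px ego_speed={v:.1f}km/h"
        )
    return "\n".join(lines)


def build_prompt(track_text: str) -> str:
    """Explicit instructions that ask for a fixed, machine-readable answer."""
    keys = ", ".join(f'"{k}"' for k in BEHAVIOR_KEYS)
    return (
        "You are given a pedestrian track observed from a moving vehicle.\n"
        f"{track_text}\n"
        f"Answer with a JSON object containing only the boolean keys {keys}."
    )


def parse_answer(raw: str) -> Dict[str, bool]:
    """Validate the reply instead of trusting free-form text."""
    data = json.loads(raw)
    for key in BEHAVIOR_KEYS:
        if not isinstance(data.get(key), bool):
            raise ValueError(f"missing or non-boolean field: {key}")
    return {key: data[key] for key in BEHAVIOR_KEYS}
```

Rejecting replies that omit a field or return a non-boolean keeps open-ended explanations from leaking into downstream logic.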

Qualitative comparison of prediction performance of GPT4V-Pred and OmniPredict on four major pedestrian behaviors. (Credit: Computers and Electrical Engineering)

Unlike older neural networks that rely on memory states, the model evaluates all inputs together. Attention mechanisms allow it to weigh subtle cues, such as hesitations or changes in body orientation, before reaching a conclusion.

Tested against the toughest benchmarks

The team evaluated OmniPredict on two widely used datasets in pedestrian behavior research: JAAD and WiDEVIEW.

“JAAD contains over 82,000 annotated frames showing pedestrians in various traffic situations. WiDEVIEW was recorded on a university campus and includes daytime and late afternoon scenes,” Saripari told The Brighter Side of News.

“Without task-specific training, OmniPredict reached 67% accuracy in predicting whether a pedestrian would cross a road, a result that outperformed existing models by approximately 10%. The system also achieved the highest area under the curve score of all models tested,” he continued.
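For readers unfamiliar with the two metrics in that quote, here is a small sketch of how accuracy and area under the ROC curve are commonly computed for a binary cross or not-cross label. It is a generic illustration, not the study's evaluation code, and the function names are assumptions.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is fine
    for illustration but slow for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Accuracy depends on a decision threshold, while AUC summarizes how well the scores rank crossing cases above non-crossing ones across all thresholds, which is why papers often report both.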

Notably, some supervised systems trained on large datasets recorded higher detection rates. However, these systems required extensive training and fine-tuning. OmniPredict matched or exceeded their accuracy without such overhead.

“Performance was maintained even when we added contextual information such as partially hidden pedestrians or people facing the vehicle,” Saripari said.

Qualitative comparison of crossing predictions on the WiDEVIEW dataset. (Credit: Computers and Electrical Engineering)

The model also responded faster and generalized better across different road settings. This is an essential characteristic for real-world deployments.

Where the model succeeds and where it falls short

Performance depended heavily on how visible the pedestrian was. Smaller figures in the frame provide less visual detail, so accuracy dropped. Predictions improved as pedestrians occupied more of the image.

Ablation testing revealed which inputs mattered most. Removing the global scene image caused the greatest performance degradation. Removing bounding box data or vehicle speed also reduced accuracy. The findings suggest that understanding the entire environment is just as important as tracking people.
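An ablation of this kind can be sketched as a leave-one-out loop: score the system with every input, then again with each input removed, and compare. The harness below is a hypothetical illustration, not the study's code. The input names are assumptions, and `evaluate` stands in for whatever scoring routine a team actually uses.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical names for the four inputs described in the article.
INPUTS = ("scene_image", "pedestrian_crop", "bbox", "ego_speed")


def leave_one_out(
    evaluate: Callable[[FrozenSet[str]], float],
    inputs: Iterable[str] = INPUTS,
) -> Dict[str, float]:
    """Return the score drop caused by removing each input in turn.

    `evaluate` receives the set of inputs still enabled and returns a score
    such as accuracy. A larger drop means the removed input mattered more.
    """
    full = frozenset(inputs)
    baseline = evaluate(full)
    return {name: baseline - evaluate(full - {name}) for name in sorted(full)}
```

Sorting the resulting dictionary by value gives a simple ranking of which input the model leans on most.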

In difficult scenes with snow, wet pavement, or multiple pedestrians, OmniPredict often succeeded where other models failed. It picked up on head orientation and early movement cues that indicate intent.

Still, the system isn't perfect. Predictions could be inaccurate in dark shadows, under heavy occlusion, or when the person was riding a bicycle. The researchers note that without explicit signage, it remains difficult to distinguish between pedestrians and similar road users.

Beyond crosswalks and traffic lights

Road safety drove the research, but its implications extend further.

“We are opening the door to exciting applications,” Saripari said. “For example, a machine that can detect, recognize, and predict the outcome when a person displays threatening cues could have important implications.”

In military or emergency situations, the ability to read posture, stress signals, or hesitation could provide earlier warning and better situational awareness.

“Our goal with this project is not to replace humans, but to augment them with a smarter partner,” Saripari said.

The study also emphasizes transparency. When asked to explain its decisions, OmniPredict often provided clear reasons related to movement patterns and environmental context. That interpretability is important for trust in safety-focused AI systems.

Practical implications of the research

The findings suggest a shift in the way machines interact with people in shared spaces. Autonomous systems have the potential to reduce accidents and improve traffic flow by predicting actions rather than reacting to them. For researchers, this work shows that general-purpose multimodal models can compete with specialized systems without costly training.

In the future, this approach could lower the barrier to deploying advanced safety tools in different cities and environments. It could also impact how AI systems assist humans in high-risk situations, from disaster response to security operations. This research aims to create machines that more naturally adapt to human behavior by fusing perception and reasoning.





