360-degree video and virtual reality (VR) experiences are transforming viewers from passive observers to active participants immersed in the scene. But this change raises important questions. Where do people direct their attention in such an environment, and how is that attention shaped?
New research led by Assoc. Professor Aykut Erdem of the Department of Computer Engineering at Koc University, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, provides an innovative answer. The study was conducted in collaboration with researchers from the Vision Laboratory of the Faculty of Psychology at Boğaziçi University, Hacettepe University, and Japan’s National Institute of Advanced Industrial Science and Technology (AIST). Its most distinctive feature is that it predicts viewers’ attention by jointly analyzing visual and auditory information, rather than relying solely on visual cues.
In traditional video, the viewer’s line of sight is primarily guided by the camera’s framing. In contrast, 360-degree video shows the entire scene and lets viewers look in any direction at any time, which makes it far more difficult to predict where their attention will be directed.
This is where sound becomes an important element. Just as in everyday life, when we hear a sound we instinctively direct our attention toward its source. Yet many previous studies have addressed this phenomenon only to a limited extent, focusing mainly on visual data.
To address this gap, the research team developed a comprehensive dataset to examine how visual and auditory cues interact. The dataset contains 81 videos featuring different scenes, each presented under three audio conditions: no sound, mono sound, and spatial sound. Spatial sound is a technology that creates the sensation of sound coming from a specific direction, just as in the real world. By tracking the eye movements of more than 100 participants, the researchers were able to closely analyze how attention changes under the different auditory conditions.
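In eye-tracking studies of this kind, recorded gaze fixations are commonly turned into ground-truth saliency maps by accumulating fixation points and smoothing them with a Gaussian. The sketch below illustrates that general procedure only; the frame size, blur width, and example fixations are illustrative assumptions, not values from the study, and proper 360-degree (equirectangular) data would additionally require distortion-aware handling near the poles, which is omitted here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height=512, width=1024, sigma=20):
    """Convert (row, col) gaze fixation points into a ground-truth saliency map
    by accumulating fixations and blurring with a Gaussian.

    The frame size and sigma are illustrative choices, not values from the study.
    """
    fixation_map = np.zeros((height, width), dtype=np.float32)
    for r, c in fixations:
        if 0 <= r < height and 0 <= c < width:
            fixation_map[r, c] += 1.0
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    if saliency.max() > 0:
        saliency /= saliency.max()  # normalize to [0, 1]
    return saliency

# Example: a few hypothetical fixations from one participant on one frame
example_fixations = [(250, 500), (260, 510), (300, 800)]
saliency_map = fixations_to_saliency_map(example_fixations)
print(saliency_map.shape, saliency_map.max())
```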
The study also introduces two AI models tailored to the unique structure of 360-degree video data. The first model relies solely on visual information, while the second integrates audio into the analysis, allowing a more comprehensive understanding of attention; a rough sketch of this audio-visual idea follows below.
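To make the audio-visual idea concrete, here is a minimal sketch of the general approach: encode the video frame and the accompanying audio separately, fuse the two feature streams, and decode a per-pixel saliency map. The layer sizes and the simple concatenation-based fusion are assumptions for illustration; this is not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    """Illustrative audio-visual saliency sketch (not the authors' model):
    separate visual and audio encoders, simple concatenation fusion,
    and a decoder that outputs a single-channel saliency map."""
    def __init__(self):
        super().__init__()
        # Visual encoder: downsample an RGB frame into a spatial feature map
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Audio encoder: summarize a short clip (e.g. a mel spectrogram) as one vector
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decoder: map fused features back to a full-resolution saliency map
        self.decoder = nn.Sequential(
            nn.Conv2d(64 + 32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Sigmoid(),
        )

    def forward(self, frame, audio):
        v = self.visual_encoder(frame)                # (B, 64, H/4, W/4)
        a = self.audio_encoder(audio)                 # (B, 32, 1, 1)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])  # broadcast audio over space
        fused = torch.cat([v, a], dim=1)              # simple concatenation fusion
        return self.decoder(fused)                    # (B, 1, H, W)

# Example with random tensors standing in for a frame and an audio spectrogram
model = AudioVisualSaliencyNet()
frame = torch.randn(1, 3, 128, 256)   # equirectangular-style RGB frame
audio = torch.randn(1, 1, 64, 100)    # e.g. mel spectrogram of a short audio clip
saliency = model(frame, audio)
print(saliency.shape)  # torch.Size([1, 1, 128, 256])
```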
The results were surprising: incorporating audio significantly improved the model’s ability to predict viewer attention. In particular, when spatial sound was included, the model accurately identified not only visually salient regions but also regions that are not visually prominent yet attract attention because of sound.
Overall, this study shows that human attention can be modeled more accurately by considering how people distribute their focus across both visual and auditory stimuli. Beyond scientific contributions, this approach has strong potential to power a wide range of applications, from video compression and content creation to quality assessment and user experience design in immersive environments.
