KAIST launches AI that creates realistic dinosaur sounds

<(From left) Oh Hyun Bin, Yuta Takita, Toshimitsu Uesaka, Tae Hyun Oh, Yuki Mitsufuji>

When people see the scene in the movie “Jurassic Park” where a giant dinosaur walks towards them, they naturally imagine a heavy rumbling sound, as if the ground were shaking. This is because humans predict sound by considering not only the shape of an object, but also physical characteristics such as its size, weight, and speed of movement. However, existing video-to-audio generation AI mainly generates sound based on the category of objects and the scene information in the video, and does not sufficiently reflect how sound changes with physical properties such as weight and speed.

KAIST (President Lee Kwang-hyun) announced on May 26 that a joint research team consisting of Professor Oh Tae-hyun of KAIST’s School of Computing and co-researchers from POSTECH (President Kim Sung-geun) and Sony AI has developed an artificial intelligence (AI) technology called “PAVAS (Physics-Aware Video-to-Audio Synthesis)” that understands the physical conditions in a video and generates more realistic sound.

A major feature of this technology is that it is designed to let the AI infer physical information that is not directly visible, such as the mass and velocity of objects in a video. Ordinary videos do not provide accurate values for the weight or speed of an object, but the research team had the AI estimate them by analyzing the surrounding environment and the motion in the scene, and then reflected the estimates in the sound generation process.

In other words, the AI is designed not only to recognize what is visible, but also to understand the physical causes of why a given sound occurs.

In technical validation, the research team’s AI produced sounds that closely resembled those of real-world environments in scenes involving physical interactions such as collisions between objects. In particular, it produced more realistic audio in which the volume and tone change naturally as the mass and velocity of objects change.

In recent years, generative AI technology that simultaneously generates video and audio has progressed rapidly. Representative examples include Google’s “Veo 3” and ByteDance’s “Seedance 2.0.” However, in the actual production of movies, advertisements, and games, demand is much higher for post-production work, such as adding sound effects to match existing video scenes or supplementing audio, than for creating completely new videos.

While existing commercial AI models focus on generating video and audio together, PAVAS is differentiated by its ability to analyze the motion and collision characteristics of objects in the video and generate realistic sound effects that precisely match the scene.

The research team explained that this technology opens new possibilities in the field of “physical AI” (physically consistent generative AI). Physically consistent generative AI refers to AI that understands the physical laws and causal relationships of the real world, rather than just producing plausible results.

In the future, the technology is expected to provide a more immersive user experience in a wide range of fields such as automated audio production for content, augmented reality (AR) and virtual reality (VR) content, the metaverse, and robot simulation.

Professor Oh Tae-hyun said, “Existing generative AI has developed by increasing the scale of data and models, but this research is significant in that the AI was designed to directly understand physical quantities and causal relationships.” He added, “In the future, it has the potential to develop into a core foundational technology of next-generation multimodal AI that simultaneously understands and processes diverse information such as text, video, and audio.”

Hyun-Bin Oh, a student in POSTECH’s integrated master’s-Ph.D. program, is the lead author of this research, and KAIST Professor Tae-Hyun Oh and Sony AI researchers Yuta Takita, Toshimitsu Uesaka, and Yuki Mitsufuji are co-authors. The research was recognized for its excellence at CVPR 2026 (Computer Vision and Pattern Recognition 2026), the world’s most prestigious academic conference in the field of computer vision (image-based artificial intelligence technology), where only the top 0.88% of all papers are selected for oral presentation. The presentation will be held on June 6.

*Paper title: “PAVAS: Physics-Aware Video-to-Audio Synthesis”, arXiv: https://arxiv.org/abs/2512.08282

This research was supported by the Mid-Career Research Program of the Fundamental Research Program of the Ministry of Science and ICT, the Future Fusion Technology Pioneering Research Program of the Ministry of Science, ICT and Future Planning, the AGI Program of the Ministry of Science and ICT, and the KAIST InnoCORE Program.
