

Creating high quality audio for video content presents many technical and creative challenges and affects both beginners and experienced audio professionals. Producers often tackle issues like noise management, balancing dialogue and sound effects, meeting budget and time constraints, and maintaining creative consistency. Transform your artistic vision into a cohesive final product that accurately reflects visual dynamics, acoustic environment and timing.
To address these challenges, Introducing Alibaba's Tongyi Speech Lab ThinkSound, a new open source multimodal LLM that utilizes Chain of Inference for Advanced Audio Generation and Editing (COT). ThinkSound offers a structured, interactive approach to audio production that is specifically tailored to video content. Model available in 3 compact sizes – 1.3b, 724m, and 533m parameters – Supports audio generation from video, text-based audio editing, and interactive audio creation, even on edge devices.
ThinkSound mimics the multi-stage workflow of human sound designers and ensures that the generated audio remains contextually accurate, cohesive and high quality throughout production. This model first analyzes the visual dynamics of the video, logically interprets the corresponding acoustic attributes, and then synthesizes audio suitable for the context.
Through an innovative approach, ThinkSound allows users to create detailed and consistent soundscapes, refine audio generated through intuitive user interaction, edit specific audio segments using natural language instructions, effectively filling the gap between creative intent and automated audio production.
Furthermore, Alibaba research team introduced it Audio Cota large multimodal dataset with audio-specific COT annotations that enhance alignment between visual content, text descriptions, and sound synthesis.
Large-scale evaluations have demonstrated that ThinkSound achieves modern performance in video-to-audio generation, providing a contextually accurate, accurately timed soundscape. This model excels in traditional audio quality metrics and COT-based evaluations. Furthermore, with the MovieGen Audio Bench, a benchmark that evaluates the audio generation capabilities of movies, ThinkSound is significantly better than other major models.


ThinkSound can seamlessly integrate with a variety of video generation models to provide realistic narration and soundtracks for synthesized videos. Sleek audio generation features provide critical potential applications in movie and television sound design, audio post-production, immersive sound experiences in gaming and virtual reality.
ThinkSound is now open source available on Hugging Face, Github and Alibaba's Model Studio.


