Neil Zeghidour, CEO and co-founder of Gradium AI, recently spoke about the evolution of voice AI and the long-awaited "Her" moment, referring to the film in which an artificial intelligence achieves fully human-like conversational ability. Speaking at the AI Engineers event, Zeghidour explored the current state of voice AI, the challenges that remain, and potential future advances.

Voice AI's "Her" moment
Zeghidour framed his talk around the concept of truly conversational AI, akin to Samantha, the sentient operating system from the movie Her. He emphasized that although great progress has been made, the goal of seamless, natural, and empathetic human-AI interaction remains a work in progress. Current voice AI works, but it often falls short of the nuanced, fluid communication we expect from human conversation.
Gradium AI's mission and technology
Zeghidour introduced Gradium AI's mission to unlock the unrealized potential of voice AI by making fluid, natural speech the new interface for AI. The company trains speech models for a variety of applications, including speech-to-text (STT), text-to-speech (TTS), and speech-to-speech (S2S) translation. This spans complete voice agents as well as modular building blocks that can be integrated into a variety of products.
He elaborated on Gradium's approach, emphasizing the transition from research to production and the company's work on "Moshi." This includes an STT model with semantic voice activity detection (VAD), a customizable LLM handling context, inference, and function calling, and a streaming multilingual TTS with voice cloning capabilities. This comprehensive approach aims to overcome the limitations of existing cascaded systems.
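The cascaded pipeline being replaced can be sketched as three hand-offs between independent models. A minimal illustration follows; all function names here (transcribe, generate_reply, synthesize) are hypothetical placeholders, not Gradium APIs, and each body is a stub standing in for a real model call.

```python
# Sketch of a cascaded voice-agent turn: STT -> LLM -> TTS.
# Every stage boundary is a serialization point that adds latency,
# which is what end-to-end speech models aim to collapse.

def transcribe(audio: bytes) -> str:
    """Placeholder STT stage: would call a speech-to-text model."""
    return "book me a flight to Paris"

def generate_reply(text: str, history: list[str]) -> str:
    """Placeholder LLM stage: would call a context-aware language model."""
    history.append(text)  # maintain context across turns
    return "Sure - looking up flights to Paris for you."

def synthesize(text: str) -> bytes:
    """Placeholder TTS stage: would stream synthesized audio."""
    return text.encode("utf-8")

def voice_agent_turn(audio: bytes, history: list[str]) -> bytes:
    """One conversational turn through the full cascade."""
    user_text = transcribe(audio)
    reply_text = generate_reply(user_text, history)
    return synthesize(reply_text)
```

The point of the sketch is structural: because each stage waits on the previous one's output, the user-perceived delay is the sum of all three, motivating the end-to-end designs discussed later in the talk.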
Voice AI challenges: Latency and scalability
Much of Zeghidour's talk focused on persistent challenges in voice AI, primarily latency and scalability. He explained that current cascaded systems typically chain separate models for STT, LLM processing, and TTS, each of which adds delay. Latency in these systems can disrupt the natural flow of conversation, making interactions feel awkward and less human. He presented data showing that most current TTS models have a latency of more than 200 milliseconds, a significant bottleneck for real-time conversation.
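To see why stage-by-stage delays matter, a quick back-of-the-envelope latency budget helps. Only the ~200 ms TTS figure comes from the talk; the STT and LLM numbers below are assumed values for illustration.

```python
# Illustrative time-to-first-audio budget for a cascaded pipeline.
# Stages run sequentially, so their first-output delays add up.
stage_latency_ms = {
    "stt_first_transcript": 150,  # assumed value
    "llm_first_token": 250,       # assumed value
    "tts_first_audio": 200,       # figure cited in the talk
}

total_ms = sum(stage_latency_ms.values())
print(f"first-response latency ~= {total_ms} ms")  # prints 600 ms
```

Gaps between turns in human conversation are typically on the order of 200 ms, so a cascade that takes several times that before producing any audio is immediately perceptible as unnatural, which is the bottleneck the talk identifies.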
The presentation also touched on the need for models that can handle complex reasoning and contextual understanding. Zeghidour pointed out that while current AI can perform specific tasks, true conversational intelligence requires models that can maintain context across turns, understand user intent, and respond with some degree of empathy. He also raised the issue of scalability, noting that the computational resources required for advanced voice AI, especially at inference time, can be substantial, making cost and efficiency key factors.
Future Directions: End-to-end models and on-device inference
Zeghidour proposed that the future of voice AI lies in end-to-end models that bypass the intermediate steps of cascaded systems and process speech directly. He explained that this approach significantly reduces latency and improves the overall naturalness of interactions. As an example, he highlighted Gradium's "Phonon" model, which runs real-time inference on CPU, providing fast processing and personalization without extensive retraining.
He presented benchmarks comparing Phonon to other leading TTS models, demonstrating superior performance in word error rate (WER) and speaker similarity while operating with significantly lower latency on less demanding hardware. Running on-device means these advanced voice capabilities can be deployed to a wide range of devices, including smartphones, without relying on cloud infrastructure, and it also addresses privacy concerns.
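For readers unfamiliar with the WER metric used in these benchmarks: it is the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal, self-contained implementation of the standard definition:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words:
print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

Lower is better; a WER of 0.0 means the transcript matches the reference exactly, and values above 1.0 are possible when the hypothesis contains many insertions.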
Putting the "Her" moment into practice
Zeghidour concluded by inviting the audience to experience these advances in voice AI first-hand. He shared examples of how voice AI can create more natural and engaging user experiences, such as a chatbot demo for a travel agency. The demo showed how voice AI can understand complex requests, retrieve relevant information, and respond conversationally, mimicking human interaction better than ever before.
The presentation highlighted the continued effort to make the conversational AI envisioned in "Her" a reality, and emphasized that while the challenges are significant, advances by companies like Gradium AI are bringing that future closer. A focus on efficiency, scalability, and natural interaction is key to unlocking the true potential of voice AI.
