How developers can incorporate voice AI into telephony applications

The International Telecommunication Union (ITU) recommends a mouth-to-ear delay of less than 400 milliseconds to preserve natural conversation. "Mouth to ear" is the time it takes for words to leave a speaker's lips and reach the listener's ear. After hearing a response, a human typically takes another few hundred milliseconds to begin reacting. To mimic human interaction, then, an AI system must respond within a tight time frame. The AI's response makes the same journey in reverse, traveling back through the network to the original speaker. End to end, the full exchange should take about one second; anything longer starts to feel uncomfortable. In practice, most voice AI systems are only just approaching this standard, and they continue to improve as the underlying technology matures.
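To make the budget concrete, here is a minimal sketch of where those milliseconds might go in a single conversational turn. The component names and values are illustrative assumptions, not measurements from any real system; only the 400 ms ITU guideline and the roughly one-second overall target come from the text above.

```python
# Hypothetical latency budget for one voice-AI conversational turn.
# All component values below are illustrative assumptions.
budget_ms = {
    "caller_uplink": 100,     # caller's speech crosses the phone network
    "asr_finalize": 200,      # speech-to-text finishes transcribing
    "llm_first_token": 300,   # the model begins generating a reply
    "tts_first_audio": 200,   # first synthesized audio becomes available
    "return_downlink": 100,   # audio travels back to the caller
}

total_ms = sum(budget_ms.values())
print(f"total turn latency: {total_ms} ms")  # 900 ms in this sketch

# The overall interaction should stay around one second to feel natural.
assert total_ms <= 1000
```

Even in this optimistic sketch, the budget is nearly spent, which is why every stage of the pipeline has to shave latency rather than relying on any single optimization.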

Latency determines whether a real-time AI system can be effective, as cases of missing or delayed language support in medical settings have shown. For example, an Australian startup wanted to use an AI caller to check on the well-being of elderly Cantonese-speaking patients, a seemingly effective use of the technology. However, the long round-trip latency to its US-based voice AI infrastructure, combined with the lack of Cantonese text-to-speech (TTS), made the experience feel unnatural.

The fix for the delay problem is fundamentally an engineering one: drive latency down at every stage during development. That requires an end-to-end real-time flow in which streaming input and output happen concurrently, rather than having the LLM generate its full text output and only then hand it to TTS for synthesis.
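The streaming idea can be sketched with two chained generators: a hypothetical LLM that yields text chunks as they are produced, feeding a hypothetical TTS that synthesizes audio per chunk. Both functions and their delays are simulated assumptions for illustration; the point is that the first audio frame is ready after the first chunk, not after the full response.

```python
import time

def llm_stream(prompt):
    """Hypothetical LLM: yields text chunks as they are generated."""
    for chunk in ["Hello, ", "how can ", "I help ", "you today?"]:
        time.sleep(0.05)  # simulated per-chunk generation delay
        yield chunk

def tts_stream(text_chunks):
    """Hypothetical TTS: synthesizes each chunk as it arrives,
    instead of waiting for the complete text."""
    for chunk in text_chunks:
        yield f"<audio:{chunk.strip()}>"  # placeholder for an audio frame

start = time.monotonic()
first_audio_at = None
frames = []
for frame in tts_stream(llm_stream("check in on the patient")):
    if first_audio_at is None:
        first_audio_at = time.monotonic() - start
    frames.append(frame)

print(f"first audio after {first_audio_at * 1000:.0f} ms; "
      f"{len(frames)} frames total")
```

In this simulation the first frame arrives after roughly one chunk delay (about 50 ms), whereas a batch pipeline would wait for all four chunks (about 200 ms) before producing any sound. Real systems add sentence-boundary buffering and jitter handling, but the latency advantage of streaming is the same.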
