The advent of large language models (LLMs) has transformed spoken dialogue systems, but the optimal architecture for real-time, on-device voice agents remains an open question. Although end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to perform well on language understanding tasks despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes these traditional bottlenecks through architectural innovations and streaming optimizations. Our system integrates streaming (a) conversational speech recognition with mixture-of-experts, (b) a state-action augmented LLM, (c) text-to-speech synthesis, (d) a neural vocoder, and (e) speaker modeling. Implemented using MLX, ChipChat achieves sub-second response latency on a Mac Studio without a dedicated GPU, while preserving user privacy through fully on-device processing. Our results show that a strategically redesigned CS can overcome its historical latency limitations, offering a promising path forward for practical voice-based AI agents.
- † Thinking Machine Laboratory
- ** Work done while at Apple
- § Equal contribution
