
(Photo: Ph.D. candidate Se Jin Park)
- Se Jin Park, a researcher on KAIST Professor Yong Man Ro's team, has published SpeechSSM, a speech language model that can generate natural-sounding, long-duration speech.
- An efficient processing technique based on linear sequence modeling overcomes the limitations of existing speech language models and enables high-quality speech generation without time constraints.
- Its human-like ability to produce natural, long-duration speech is expected to be widely used in podcasts, audiobooks, and voice assistants.
Recently, speech language models (SLMs) have been in the spotlight as a next-generation technology that surpasses the limits of text-based language models by learning human speech directly, without text, to understand and generate both linguistic and non-linguistic information. However, existing models have shown significant limitations in generating the long-duration content needed for podcasts, audiobooks, and voice assistants. KAIST researchers have now overcome these limitations by developing “SpeechSSM,” which enables consistent, natural speech generation without time constraints.
KAIST (President Kwang Hyung Lee) announced on July 3rd that Ph.D. candidate Se Jin Park, from the research team of Professor Yong Man Ro of the School of Electrical Engineering, has developed “SpeechSSM,” a speech language model that can generate long-duration speech.
This research will be presented as an oral paper at ICML (International Conference on Machine Learning) 2025. The selection not only recognizes the strength of the research but also once again demonstrates KAIST's world-leading AI research capabilities.
The main advantage of a speech language model (SLM) is its ability to process speech directly, without intermediate text conversion. Because it draws on the unique acoustic characteristics of human speakers, it can quickly generate high-quality speech even in large-scale models.

However, existing models have struggled to maintain semantic and speaker consistency over long-duration speech. Breaking speech down into fine fragments to capture highly detailed information increases the “speech token resolution” and, with it, memory consumption.
To solve this problem, Se Jin Park developed SpeechSSM, a speech language model built on a hybrid state-space model and designed to efficiently process and generate long speech sequences.
This model employs a “hybrid structure” that alternates “attention layers,” which focus on recent information, with “recurrent layers,” which remember the overall narrative flow (long-term context). This allows the story to flow smoothly without losing consistency, even when speech is generated for a long time. Furthermore, memory usage and computational load do not increase sharply with input length, allowing for stable and efficient training and long-duration speech generation.
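The alternation of the two layer types can be illustrated with a toy sketch. This is not the SpeechSSM implementation: the class names, the scalar inputs, the decay value, and the distance-based attention score are all assumptions chosen only to show why memory stays bounded as the input grows.

```python
import math
from collections import deque
from typing import Deque, Iterable, List


class RecurrentLayer:
    """Toy linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.state = 0.0  # fixed-size summary of everything seen so far

    def step(self, x: float) -> float:
        self.state = self.decay * self.state + (1.0 - self.decay) * x
        return self.state


class LocalAttentionLayer:
    """Toy softmax attention restricted to the last `window` inputs."""

    def __init__(self, window: int = 4) -> None:
        # deque(maxlen=...) silently drops the oldest item, so memory is bounded.
        self.recent: Deque[float] = deque(maxlen=window)

    def step(self, x: float) -> float:
        self.recent.append(x)
        # Softmax over scores -|x - v|: closer recent values get more weight.
        weights = [math.exp(-abs(x - v)) for v in self.recent]
        total = sum(weights)
        return sum(w * v for w, v in zip(weights, self.recent)) / total


def run_hybrid(inputs: Iterable[float], n_blocks: int = 2) -> List[float]:
    """Pass each input through alternating recurrent and local-attention layers."""
    layers = []
    for _ in range(n_blocks):
        layers.append(RecurrentLayer())
        layers.append(LocalAttentionLayer())
    outputs = []
    for x in inputs:
        h = x
        for layer in layers:
            h = layer.step(h)
        outputs.append(h)
    return outputs
```

However long the input is, each recurrent layer stores a single state value and each attention layer stores at most `window` recent values, which is the property the paragraph above describes.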
SpeechSSM effectively handles unbounded speech sequences by splitting speech data into short, fixed-length units (windows), processing each unit independently, and then combining the results to create long speech.
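The windowing idea can be sketched as follows. The article does not give the window length, whether windows overlap, or how the pieces are joined, so the non-overlapping split and simple concatenation below are assumptions, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence


def split_into_windows(tokens: Sequence[int], window: int) -> List[List[int]]:
    """Split a token sequence into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [list(tokens[i:i + window]) for i in range(0, len(tokens), window)]


def process_in_windows(
    tokens: Sequence[int],
    window: int,
    process: Callable[[List[int]], List[int]],
) -> List[int]:
    """Process each window independently, then concatenate the results."""
    output: List[int] = []
    for chunk in split_into_windows(tokens, window):
        output.extend(process(chunk))
    return output
```

For example, `split_into_windows(list(range(7)), 3)` yields two full windows and one shorter final window, so the cost of each processing call depends on the window size rather than on the total length.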
Additionally, the audio generation stage uses a “non-autoregressive” audio synthesis model (SoundStorm). Rather than slowly producing one character or word at a time, it generates multiple parts at once, which makes it possible to produce high-quality speech quickly.
While existing speech language models were typically evaluated on short samples of about 10 seconds, Se Jin Park created new evaluation tasks for speech generation based on her self-built benchmark dataset, “LibriSpeech-Long,” which supports generating up to 16 minutes of speech.
In place of PPL (perplexity), an existing speech model evaluation metric that only indicates grammatical correctness, she proposed new evaluation metrics such as “SC-L (semantic coherence over time)” to assess how well content stays coherent over time.
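For reference, perplexity is conventionally computed as the exponential of the average negative log-probability a model assigns to each token. The sketch below shows only that standard definition; the function name and the plain list of probabilities are assumptions for illustration, and the SC-L metric is not reproduced here because the article does not define it.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum(log p_i)) over per-token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, meaning it is as uncertain as a uniform choice among four options. Because the score is computed token by token, it says little about whether a long passage stays on topic, which is the gap the new metrics are meant to fill.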

Through these new evaluations, it was confirmed that speech generated by SpeechSSM consistently featured the specific individuals mentioned in the initial prompt, and that new characters and events unfolded naturally and in context despite the long generation length. This contrasted sharply with existing models, which tended to lose track of the topic and fall into repetition during long-duration generation.

Ph.D. candidate Se Jin Park explained, “Because existing speech language models were limited in long-duration generation, our goal was to develop a speech language model capable of generating long-duration speech for real human use.” She added, “This research is expected to contribute significantly to the creation of various types of audio content and to voice AI fields such as voice assistants, by maintaining consistent content over long contexts and responding more efficiently and quickly in real time than existing methods.”
This research, with Se Jin Park as the first author, was conducted in collaboration with Google DeepMind and is scheduled to be presented as an oral presentation at ICML (International Conference on Machine Learning) 2025 on July 16th.
- Paper Title: Long-Form Speech Generation with Spoken Language Models
- DOI: 10.48550/arXiv.2412.18603
Ph.D. candidate Se Jin Park has demonstrated outstanding research capabilities as a member of Professor Yong Man Ro's MLLM (multimodal large language model) research team through her work integrating vision, speech, and language. Her achievements include a spotlight paper presentation at CVPR (Computer Vision and Pattern Recognition) 2024 and an Outstanding Paper Award at ACL (Association for Computational Linguistics) 2024.
