The Microsoft AI Team Announces NaturalSpeech 2: A State-of-the-Art TTS System with Latent Diffusion Models for Powerful Zero-Shot Speech Synthesis and Enhanced Expressive Prosody

Source: https://arxiv.org/abs/2304.09116

The goal of text-to-speech (TTS) is to produce high-quality, diverse speech that sounds like a real person speaking. Prosody, speaker identity (gender, accent, timbre, etc.), and speaking and singing styles all contribute to the richness of human speech. TTS systems have greatly improved clarity and naturalness with advances in neural networks and deep learning; some, such as NaturalSpeech, have reached human-level speech quality on single-speaker recording-studio benchmark datasets.

Because previous single-speaker recording-studio datasets lacked diversity, they could not capture the varied speaker identities, prosody, and styles of human speech. With few-shot or zero-shot techniques, however, TTS models can be trained on a large corpus to learn these variations and then generalize to a virtually unlimited range of unseen scenarios. Quantizing the continuous speech waveform into discrete tokens and modeling those tokens with an autoregressive language model is the common approach in today's large-scale TTS systems.
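The discrete-token baseline described above can be illustrated with a toy residual vector quantizer. This is a minimal sketch, not the codec from any real system: the codebooks are random, and the sizes (`DIM`, `N_CODES`, `N_QUANTIZERS`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_CODES, N_QUANTIZERS = 4, 16, 3  # hypothetical sizes

# Each quantizer stage has its own codebook; together they form a residual VQ.
codebooks = rng.standard_normal((N_QUANTIZERS, N_CODES, DIM))

def rvq_encode(vec):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual, tokens = vec.copy(), []
    for book in codebooks:
        idx = int(np.argmin(((book - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        residual -= book[idx]
    return tokens

def rvq_decode(tokens):
    """Reconstruct by summing the chosen codebook entries from every stage."""
    return sum(codebooks[q][idx] for q, idx in enumerate(tokens))

vec = rng.standard_normal(DIM)
tokens = rvq_encode(vec)   # one frame -> N_QUANTIZERS discrete tokens
approx = rvq_decode(tokens)
```

Note the trade-off the article mentions: every frame of audio costs `N_QUANTIZERS` tokens, so high-fidelity codecs with many residual stages produce very long token sequences for the autoregressive acoustic model.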

New research from Microsoft introduces NaturalSpeech 2, a TTS system that uses latent diffusion models to deliver expressive prosody, strong robustness, and, most importantly, powerful zero-shot capability for speech synthesis. The researchers first trained a neural audio codec, whose encoder transforms the speech waveform into a sequence of latent vectors and whose decoder recovers the original waveform from them. A diffusion model then generates these latent vectors, conditioned on prior vectors obtained from a phoneme encoder, a duration predictor, and a pitch predictor.
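The overall pipeline can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: the "encoder" and "decoder" here are trivial reshapes, and all names and shapes are invented for illustration; a real codec uses learned neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, FRAMES = 8, 40  # hypothetical sizes

def codec_encoder(waveform):
    """Stand-in for the codec encoder: waveform -> continuous latent vectors."""
    # A real encoder is a learned conv net; here we just fold samples into frames.
    return waveform.reshape(FRAMES, LATENT_DIM)

def codec_decoder(latents):
    """Stand-in for the codec decoder: latent vectors -> waveform."""
    return latents.reshape(-1)

# Training target: latents extracted from real speech by the codec encoder.
waveform = rng.standard_normal(FRAMES * LATENT_DIM)
latents = codec_encoder(waveform)

# At inference, a diffusion model (conditioned on phoneme, duration, and pitch
# priors) would generate latents like these, and the decoder turns them into audio.
reconstructed = codec_decoder(latents)
```

The key point is that the acoustic model operates on short sequences of continuous latent vectors rather than long streams of discrete tokens.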


The key design decisions described in the paper are summarized below.

  • In previous work, speech is typically quantized with many residual quantizers to guarantee reconstruction quality in the neural codec. The resulting discrete token sequences are very long, which places a heavy burden on the acoustic model (an autoregressive language model). NaturalSpeech 2 instead uses continuous vectors rather than discrete tokens, which shortens the sequence while preserving fine-grained information for accurate speech reconstruction.
  • The autoregressive language model is replaced with a diffusion model, which generates all latent vectors non-autoregressively.
  • In-context learning via speech prompting. The team designed a speech-prompting mechanism that facilitates in-context learning in both the diffusion model and the pitch/duration predictors, improving zero-shot capability by encouraging the generated speech to follow the characteristics of the speech prompt.
  • NaturalSpeech 2 is more reliable and stable than autoregressive predecessors, since it requires only a single acoustic model (the diffusion model) rather than two-stage token prediction. With its duration/pitch prediction and non-autoregressive generation, it can also be extended to non-speech styles such as singing.
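The non-autoregressive diffusion idea from the list above can be sketched as a toy denoising loop: starting from Gaussian noise, every frame of the latent sequence is refined in parallel toward a target, conditioned on a prompt embedding. This is a deliberately simplified illustration with invented names and a hand-written "denoiser"; the real model predicts noise with a learned network conditioned on phoneme, duration, pitch, and the speech prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAMES, DIM, STEPS = 40, 8, 50  # hypothetical sizes

# Stand-in for the clean latents a trained model would recover.
target = rng.standard_normal((FRAMES, DIM))

def denoise_step(z, t, prompt_emb):
    """Toy denoiser: nudge the whole latent sequence toward the target.
    prompt_emb stands for the few-second speech-prompt embedding that a real
    model would condition on to match speaker identity and prosody."""
    return z + 0.2 * (target - z)

prompt_emb = rng.standard_normal(DIM)       # embedding of the speech prompt
z = rng.standard_normal((FRAMES, DIM))      # start from pure Gaussian noise
for t in reversed(range(STEPS)):
    z = denoise_step(z, t, prompt_emb)      # all frames refined in parallel
```

Unlike an autoregressive model, no frame waits on a previously generated frame, which is what makes the single-stage, non-autoregressive pipeline more stable and applicable to styles like singing.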

To demonstrate the effectiveness of this architecture, the researchers trained NaturalSpeech 2 with 400 million parameters on 44,000 hours of speech data. They then used it to synthesize utterances in zero-shot scenarios (with speech prompts of only a few seconds) across different speaker identities, prosody, and styles such as singing. Experiments show that NaturalSpeech 2 outperforms previous strong TTS systems, producing natural-sounding speech in zero-shot settings and achieving prosody more similar to both the speech prompt and the ground-truth recording. It also attains naturalness (in terms of CMOS) on par with or better than ground-truth speech on the LibriTTS and VCTK test sets. The results further show that singing voices with novel timbres can be generated from a short singing prompt or, interestingly, from only a speaking prompt, unlocking truly zero-shot singing voice synthesis.

In the future, the team plans to investigate efficient methods such as consistency models to accelerate the diffusion model, and large-scale speaking and singing training to enable stronger mixed speaking/singing capabilities.


Check out the paper and project page. Don't forget to join our 20,000+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions about the article above or feel we missed something, email us at Asif@marktechpost.com.


Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. Her passion lies in exploring new advancements in technology and their practical applications.


