Beginner's Guide to VibeVoice – KDnuggets

Image by Author | Canva

# Introduction

Open source AI is having a moment. With advances in large language models, general machine learning, and now speech technology, open source models are rapidly narrowing the gap with proprietary systems. One of the most exciting entrants in this space is VibeVoice, Microsoft's open source voice stack. This model family is designed for natural, expressive, and interactive conversations, with quality comparable to first-class commercial products.

In this article, we will explore VibeVoice, download the model, and run inference in Google Colab using a GPU runtime. We will also cover troubleshooting for common issues that can occur during model inference.

# Introducing VibeVoice

VibeVoice is a next-generation text-to-speech (TTS) framework for creating expressive, long-form, multi-speaker audio such as podcasts and dialogues. Unlike traditional TTS systems, it excels at scalability, speaker consistency, and natural turn-taking.

Its core innovation is a pair of continuous acoustic and semantic tokenizers running at 7.5 Hz, combined with a large language model (Qwen2.5-1.5B) and a diffusion head for generating high-fidelity audio. This design allows it to synthesize up to 90 minutes of speech with up to four distinct speakers, surpassing previous systems.
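To get a feel for why the low frame rate matters for long audio, here is a rough back-of-the-envelope calculation. It assumes one frame per tokenizer step and ignores text tokens and voice prompts. The 50 Hz figure is a hypothetical comparison point, not a number from VibeVoice.

```python
def frame_count(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Figures quoted in the article: 90 minutes at 7.5 Hz.
vibevoice_frames = frame_count(90, 7.5)
# Hypothetical 50 Hz tokenizer, shown only for contrast.
comparison_frames = frame_count(90, 50)

print(vibevoice_frames)   # 40500
print(comparison_frames)  # 270000
```

Under these assumptions, 90 minutes of audio is about 40,500 frames, a sequence length that a language model backbone can plausibly handle in one pass.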

VibeVoice is available as an open source model on Hugging Face, with community-maintained code that makes it easy to experiment with and use.

Image from VibeVoice

# Getting Started with VibeVoice-1.5B

In this guide, you will learn how to clone the VibeVoice repository and run the demo by providing a text file to generate natural multi-speaker audio. Setup and audio generation take only about 5 minutes.

// 1. Clone the Community Repository and Install Dependencies

First, clone the community version of the VibeVoice repository (vibevoice-community/VibeVoice) and install the required Python packages. We will also install the Hugging Face Hub library so that we can download the model using its Python API.

Note: Before starting a Colab session, make sure the runtime type is set to a T4 GPU.

!git clone -q --depth 1 https://github.com/vibevoice-community/VibeVoice.git /content/VibeVoice
%pip install -q -e /content/VibeVoice
%pip install -q -U huggingface_hub
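Before going further, it can help to confirm that the runtime actually has a GPU attached. The helper below is a small sketch using only the standard library. The function name gpu_visible is our own, and it simply checks whether the nvidia-smi tool is present and lists at least one device.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools on this runtime
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints False in Colab, switch the runtime type to a GPU before running inference.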

// 2. Download the Model Snapshot from Hugging Face

Download the model repository using the Hugging Face snapshot API. This will download all the files from the microsoft/VibeVoice-1.5B repository.

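from huggingface_hub import snapshot_download

# Download every file in the model repository into a local folder.
# local_dir_use_symlinks is left out: recent huggingface_hub releases
# deprecate it and already write regular files into local_dir.
snapshot_download(
    repo_id="microsoft/VibeVoice-1.5B",
    local_dir="/content/models/VibeVoice-1.5B",
)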
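Once the download finishes, a quick sanity check is to count the files and total size in the local folder. This is a minimal sketch; summarize_dir is a helper name we made up, and the count may include small metadata files that the Hub library writes alongside the model.

```python
from pathlib import Path


def summarize_dir(path: str) -> tuple[int, float]:
    """Return (file count, total size in GB) for everything under path."""
    root = Path(path)
    if not root.is_dir():
        return 0, 0.0  # nothing downloaded yet, or wrong path
    files = [p for p in root.rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / 1e9


count, size_gb = summarize_dir("/content/models/VibeVoice-1.5B")
print(f"{count} files, {size_gb:.2f} GB")
```

A result of 0 files usually means the download failed or the path is different from the one passed to snapshot_download.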

// 3. Create a Transcript with Speakers

Create a text file within Google Colab. To do this, use the %%writefile magic function and provide the content. Below is a sample conversation between two speakers about KDnuggets.

%%writefile /content/my_transcript.txt
Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
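If you write your own transcript, it may be worth checking that every line follows the "Speaker N: text" pattern used in the sample. The sketch below is our own validator based on that pattern; the demo script's actual parser may be more lenient or stricter.

```python
import re

# Assumed line format, taken from the sample transcript: "Speaker <number>: <text>"
LINE_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_transcript(text: str) -> list[tuple[int, str]]:
    """Split a transcript into (speaker number, utterance) pairs."""
    segments = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {line_no} is not 'Speaker N: text': {line!r}")
        segments.append((int(match.group(1)), match.group(2)))
    return segments


sample = """Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
"""
segments = parse_transcript(sample)
speakers = {speaker for speaker, _ in segments}
print(len(segments), "segments,", len(speakers), "unique speakers")
```

For the sample transcript this reports 4 segments and 2 unique speakers.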

// 4. Run Inference (Multi-Speaker)

Next, run the demo Python script from the VibeVoice repository. The script requires a model path, a text file path, and the speaker names.

Run #1: Map Speaker 1 → Alice, Speaker 2 → Frank

!python /content/VibeVoice/demo/inference_from_file.py \
  --model_path /content/models/VibeVoice-1.5B \
  --txt_path /content/my_transcript.txt \
  --speaker_names Alice Frank

You will see output like the following. The model runs on CUDA and generates audio with Alice and Frank as the two speakers. It also prints a generation summary that you can use for analysis.

Using device: cuda
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
  2. Speaker 2
     Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
  3. Speaker 1
     Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
  4. Speaker 2
     Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....

Speaker mapping:
  Speaker 2 -> Frank
  Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
==================================================
GENERATION SUMMARY
==================================================
Input file: /content/my_transcript.txt
Output file: ./outputs/my_transcript_generated.wav
Speaker names: ['Alice', 'Frank']
Number of unique speakers: 2
Number of segments: 4
Prefilling tokens: 368
Generated tokens: 118
Total tokens: 486
Generation time: 28.27 seconds
Audio duration: 15.47 seconds
RTF (Real Time Factor): 1.83x
==================================================
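The RTF line in the summary appears to be generation time divided by audio duration. That is our reading of the numbers above, not something taken from the script's source. A small sketch to reproduce it:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


# Values copied from the generation summary above.
rtf = real_time_factor(28.27, 15.47)
print(f"RTF: {rtf:.2f}x")  # RTF: 1.83x
```

Under this definition, a value above 1 means generation is slower than real time: on this run, each second of audio took roughly 1.8 seconds to produce.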

Play audio in your notebook:

Next, use IPython's Audio display to listen to the generated audio in Colab.

from IPython.display import Audio, display
out_path = "/content/outputs/my_transcript_generated.wav"
display(Audio(out_path))
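If you want to double-check the file without playing it, you can read its duration with Python's built-in wave module. This sketch assumes the output is a standard PCM-encoded WAV file, and wav_duration_seconds is a helper name of our own.

```python
import os
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed as frame count / sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


out_path = "/content/outputs/my_transcript_generated.wav"
if os.path.exists(out_path):
    print(f"{wav_duration_seconds(out_path):.2f} seconds")
else:
    print("No file at", out_path)
```

The value should be close to the "Audio duration" line in the generation summary.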


It took 28 seconds to generate the audio, and it sounds clear, natural, and smooth. I love it!

Try again with different voice actors.

Run #2: Try different voices (Speaker 1 → Mary, Speaker 2 → Carter)

!python /content/VibeVoice/demo/inference_from_file.py \
  --model_path /content/models/VibeVoice-1.5B \
  --txt_path /content/my_transcript.txt \
  --speaker_names Mary Carter

The generated audio was even better, with a smooth transition between the initial background music and the speakers.

Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
  2. Speaker 2
     Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
  3. Speaker 1
     Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
  4. Speaker 2
     Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....

Speaker mapping:
  Speaker 2 -> Carter
  Speaker 1 -> Mary
Speaker 1 ('Mary') -> Voice: en-Mary_woman_bgm.wav
Speaker 2 ('Carter') -> Voice: en-Carter_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B

Tip: If you are not sure which names are available, the script prints “Available voices:” on startup.

The common ones are:

en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman

# Troubleshooting

// 1. Demo Scripts Missing from the Repository?

The official Microsoft VibeVoice repository has been pulled and reset. Community reports indicate that some code and demos have been removed or are no longer accessible at their original location. If you find that the inference examples are missing from the official repository, check the community mirror, which preserves the original demos and instructions: https://github.com/vibevoice-community/VibeVoice

// 2. Slow Generation or CUDA Errors in Colab

Make sure you are on a GPU runtime: Runtime → Change runtime type → Hardware accelerator: GPU (T4 or any available GPU).

// 3. CUDA OOM (Out of Memory)

There are several steps you can take to reduce memory load. Start by shortening the input text and reducing the generation length. If the script allows it, consider lowering the audio sample rate or adjusting the internal chunk size. Set the batch size to 1, and choose a smaller model variant if one is available.
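One simple way to shorten the input is to split a long transcript into smaller pieces and generate each piece separately. The sketch below shows the splitting step only; chunk_lines is a hypothetical helper, and the right chunk size depends on your GPU.

```python
def chunk_lines(lines: list[str], max_lines: int) -> list[list[str]]:
    """Group non-empty transcript lines into chunks of at most max_lines."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    kept = [line for line in lines if line.strip()]
    return [kept[i:i + max_lines] for i in range(0, len(kept), max_lines)]


transcript = [
    "Speaker 1: Have you read the latest article on KDnuggets?",
    "Speaker 2: Yes, it's one of the best resources for data science and AI.",
    "Speaker 1: I like how KDnuggets always keeps up with the latest trends.",
    "Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.",
]
chunks = chunk_lines(transcript, max_lines=2)
print(len(chunks), "chunks")  # 2 chunks
```

Each chunk can then be written to its own text file and passed to the demo script through --txt_path. Keep in mind that separate runs do not share context, so pacing and tone may vary slightly between the resulting audio files.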

// 4. No audio or missing output folder

The script usually prints the final output path to the console. Scroll up to find the exact location, or search for the file:

!find /content -name "*generated.wav"
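The same search can be done from Python, which is handy if you want to pass the result straight to the audio player. This is a small sketch using pathlib, with a helper name of our own. It lists the newest files first.

```python
from pathlib import Path


def find_generated_wavs(root: str) -> list[Path]:
    """Return files ending in 'generated.wav' under root, newest first."""
    base = Path(root)
    if not base.is_dir():
        return []
    matches = [p for p in base.rglob("*generated.wav") if p.is_file()]
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


for path in find_generated_wavs("/content"):
    print(path)
```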

// 5. Voice Name Not Found?

Copy the exact name listed under “Available voices”, or use the alias names shown in the demo (Alice, Frank, Mary, Carter). They correspond to the .wav assets.
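To see how a short alias could line up with a voice file, here is an illustrative matcher. It assumes the file names follow a language-Name_description pattern, as in the list printed by the script. The demo's real matching logic may differ, and match_voice is a hypothetical helper.

```python
from typing import Optional

# Voice names copied from the script's "Available voices" output.
AVAILABLE_VOICES = [
    "en-Alice_woman", "en-Carter_man", "en-Frank_man", "en-Mary_woman_bgm",
    "en-Maya_woman", "in-Samuel_man", "zh-Anchen_man_bgm", "zh-Bowen_man",
    "zh-Xinran_woman",
]


def match_voice(alias: str, voices: list[str]) -> Optional[str]:
    """Return the first voice whose name part equals alias, ignoring case."""
    for voice in voices:
        # "en-Alice_woman" -> rest "Alice_woman" -> name "Alice"
        _, _, rest = voice.partition("-")
        name = rest.split("_")[0]
        if name.lower() == alias.lower():
            return voice
    return None


print(match_voice("Alice", AVAILABLE_VOICES))  # en-Alice_woman
print(match_voice("Bob", AVAILABLE_VOICES))    # None
```

If an alias returns None here, pick one of the names from the printed list instead.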

# Final Thoughts

Many projects choose open source stacks like VibeVoice over paid APIs for some compelling reasons. First and foremost, it is easy to integrate, offers flexibility in customization, and is suitable for a wide range of applications. Additionally, its GPU requirements are surprisingly light, which can be a huge advantage in resource-constrained environments.

VibeVoice is open source, which means we can expect better frameworks in the future that allow for faster generation, even on CPUs.

Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models. Currently, he focuses on content creation and writes technical blogs on machine learning and data science technologies. Abid holds a master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using graph neural networks for students struggling with mental illness.


