Having a computer read to you is nothing new. But having a computer read aloud in a voice that is realistic enough to be worth listening to is a more recent development. And being able to add your own voice to the digital world of text-to-speech takes the possibilities to another level. Lifelike computer-generated voices open up multiple applications for business, and brands are keen to explore the chatty side of having high-quality digital audio assets.
If you haven’t listened to a computer-generated text-to-speech voice in a while, it is worth having a listen to hear the latest algorithms in action. Some of the most impressive examples of text-to-speech I’ve heard can be found on the examples page of generative audio AI developer Suno.ai.
Deep learning and, more recently, generative AI tools have put rocket fuel into speech synthesis, and the resulting computer voices are now eerily believable. Under the hood, there’s a lot going on to transform text input into speech that doesn’t sound robotic and is more appealing to human listeners.
And when the audio is well understood, there is a stronger connection in the listener’s brain. Such neural coupling is greatly diminished in the absence of communication, such as when listening to a foreign language that we do not understand, write Greg J. Stephens, Lauren J. Silbert, and Uri Hasson of Princeton University, US, in a study titled “Speaker-listener neural coupling underlies successful communication” [PDF].
No more limits
With access to high-quality digital voices, companies can engage audiences in ways that were not possible in the early days of computer-generated speech. Early digital voice has its roots in silicon designs such as the voice synthesis processor (VSP), an LPC-10 decoding chip manufactured by Texas Instruments in the late 1970s and 1980s. Memory constraints required highly compressed output, which limited the market to niche applications.
Today, the situation is quite different, and vocal models are convincing enough to overdub narration mistakes in recorded audio. There are also many YouTube videos showing how to use text-to-speech applications to make quick fixes. In fact, in the not-too-distant future, microphones may be of little use if the narrator’s voice signature is recorded and converted into a digital model.
Aesop’s fable “The North Wind and the Sun” is famous among linguists because readers will have pronounced most of the phonemes of English by the time they reach the end of the text. “The Boy Who Cried Wolf,” which is twice as long as “The North Wind and the Sun” but has fewer word repetitions, is another popular text for linguistic analysis. These examples hint at how voice cloning and text-to-speech algorithms work.
For users who want to recreate their own voice, recording themselves reading such passages aloud can provide a training set rich in phonemes (the acoustic building blocks of spoken language). Unwanted artifacts and background noise can be removed through preprocessing before audio segmentation and feature extraction are performed.
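The preprocessing, segmentation, and feature extraction steps described above can be illustrated with a minimal sketch. It is not how any particular voice cloning product works. It assumes a synthetic tone padded with silence as a stand-in for a recording, and it uses two classic hand-crafted features (short-time energy and zero-crossing rate), whereas modern systems typically learn richer representations. All function names, the frame sizes, and the silence threshold are hypothetical choices made for illustration.

```python
import math

def make_tone(freq_hz, seconds, sample_rate=16000, amplitude=0.5):
    """Synthesize a sine tone as a stand-in for a recorded voice clip."""
    n = int(seconds * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

def split_into_frames(samples, frame_len, hop_len):
    """Segment the signal into overlapping, fixed-length frames."""
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame (signal energy)."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs (a rough pitch cue)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def trim_silence(frames, energy_threshold=1e-4):
    """Crude preprocessing step: drop frames that are too quiet to be speech."""
    return [f for f in frames if frame_energy(f) >= energy_threshold]

sample_rate = 16000
# 0.1 s of silence, 0.2 s of a 200 Hz tone, then 0.1 s of silence.
signal = [0.0] * 1600 + make_tone(200, 0.2, sample_rate) + [0.0] * 1600

frames = split_into_frames(signal, frame_len=400, hop_len=160)  # 25 ms frames, 10 ms hop
voiced = trim_silence(frames)
features = [(frame_energy(f), zero_crossing_rate(f)) for f in voiced]

print(f"{len(frames)} frames in total, {len(voiced)} kept after trimming silence")
```

With a real recording, the list of samples would come from an audio file rather than `make_tone`, and the per-frame features would be paired with a transcript before any model training takes place.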
Alternatively, the algorithm can be trained on existing audio recordings that are compared with their transcripts to learn how different sounds relate to different words. Once the model is built, natural language processing is applied to match the incoming text with the corresponding speech elements.
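As a rough illustration of matching incoming text with speech elements, the sketch below looks words up in a tiny pronunciation lexicon and reports which phonemes a passage covers. The lexicon is hand-written for this example using ARPAbet-style symbols with stress marks omitted, and the function names are hypothetical. Neural models such as the ones discussed in this article generally learn the mapping from data rather than using a fixed lookup table, so treat this as a simplification of the classical approach.

```python
import re

# Tiny hand-written lexicon (illustrative only, ARPAbet-style symbols, no stress marks).
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "north": ["N", "AO", "R", "TH"],
    "wind": ["W", "IH", "N", "D"],
    "and": ["AE", "N", "D"],
    "sun": ["S", "AH", "N"],
}

def text_to_phonemes(text, lexicon=TOY_LEXICON):
    """Return (phoneme sequence, words missing from the lexicon)."""
    words = re.findall(r"[a-z']+", text.lower())
    phonemes, unknown = [], []
    for word in words:
        if word in lexicon:
            phonemes.extend(lexicon[word])
        else:
            unknown.append(word)  # a real system would fall back to learned rules
    return phonemes, unknown

def phoneme_coverage(phonemes):
    """Distinct phonemes seen, useful for judging how rich a script is."""
    return sorted(set(phonemes))

phonemes, unknown = text_to_phonemes("The North Wind and the Sun")
print(" ".join(phonemes))
print("distinct phonemes:", phoneme_coverage(phonemes))
print("words not in lexicon:", unknown)
```

Running the coverage check over a full reading passage, with a complete pronunciation dictionary in place of the toy one, is one way to see why phoneme-rich texts make useful recording scripts.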
Business text-to-speech applications
Reflecting differences in how each of us pronounces words, phonemes as they are spoken vary in frequency spectrum, timing, and signal energy. There are now hundreds of different text-to-speech models for users to choose from, including the voices of celebrities such as Snoop Dogg and Gwyneth Paltrow, as well as the option to duplicate your own voice, as mentioned above.
Turning to popular apps, and having just mentioned Snoop Dogg and Gwyneth Paltrow, it’s fitting to highlight Speechify, which includes both stars in its list of digital voices. Investors include Richard Branson, and the company’s text-to-speech application is trusted by teams at Apple and Google, according to the Speechify website.
But for business users who want to generate audio for training videos, create audiobooks, or read documents and other company information in a realistic human voice, there are multiple software options to consider.
NaturalReader can be run directly from your browser and has free and paid tiers. The text-to-speech app also supports a long list of spoken languages, including variants such as French (Canada) and Portuguese (Brazil). Testing the app for this article showed that you can also add an accent to the voice, for example by choosing a German voice to read a document written in English.
Digital voice giants such as Nuance and IBM offer powerful text-to-speech applications for businesses. One option here is for companies to develop a brand voice, a digital asset that listeners associate with the company or product. Branded audio is big news in marketing, and WPP’s April 2023 acquisition of sonic branding agency Amp, which expands the global advertising giant’s generative AI design offerings, highlights this trend.
Open-source text-to-speech AI models should also be considered. Returning to Suno.ai, the developer has made the model weights for its transformer-based text-to-speech algorithm (BARK) available on Hugging Face, a repository of AI and machine learning tools, and it is interesting to explore the possibilities opened up by the latest research.
Try open-source generative AI text-to-speech yourself
Previously, I wrote about how an “Airbnb for GPUs” helps reduce the cost of running generative AI models. Suno.ai’s BARK is one of the pre-built options available in the Monster API, making it easy for users to see what they can do with generative AI text-to-speech.
Code demo: Suno.ai’s BARK generative AI text-to-speech model has a demo page where users can experiment with different prompts and commands to see what the next-generation research tool can do.
Another option is to launch BARK yourself. For example, I used the demo link available on Suno.ai’s GitHub page to get the system up and running on a free Google Colab GPU instance. And if commercial text-to-speech apps are like laps on an exciting racecourse, open-source options let users roam freely beyond the barriers.
BARK’s authors warn that model output is uncensored and intended for research purposes. One major feature difference is that BARK can generate non-speech sounds such as laughs, sighs, gasps, throat clearing, and other noises (the developers hope to find many more as they delve deeper into the technology’s capabilities).
And surprisingly, generative AI text-to-speech tools can also sing, at least in some cases. You can make BARK more musical by wrapping song lyrics in musical note symbols (♪) in the text prompt. Other prompt cues include [MAN] and [WOMAN], which bias the model towards male or female speakers. BARK can also speak Hindi, Japanese, Korean, Russian, Chinese, and other languages.
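For readers who want to try these prompt cues, here is a hedged sketch. The `build_bark_prompt` helper is hypothetical and simply assembles a string, and where it places the speaker cue and sound tags within the prompt is an assumption rather than a documented requirement. The `synthesize` function follows the usage shown in BARK’s README (`preload_models`, `generate_audio`, `SAMPLE_RATE`) and assumes that the `bark` package and SciPy are installed. It downloads model weights on first use and runs far faster on a GPU, so it is defined but not called here.

```python
def build_bark_prompt(text, speaker=None, sing=False, sounds=()):
    """Assemble a prompt string using the cues described above.

    speaker: None, "MAN" or "WOMAN" (biases the voice)
    sing:    wrap the text in musical notes to nudge the model towards singing
    sounds:  non-speech tags such as "laughs", "sighs" or "gasps"
    """
    if speaker not in (None, "MAN", "WOMAN"):
        raise ValueError("speaker must be None, 'MAN' or 'WOMAN'")
    body = f"♪ {text} ♪" if sing else text
    parts = []
    if speaker:
        parts.append(f"[{speaker}]")  # placement at the start is an assumption
    parts.append(body)
    parts.extend(f"[{tag}]" for tag in sounds)
    return " ".join(parts)

def synthesize(prompt, out_path="bark_generation.wav"):
    """Generate speech with BARK and save it as a WAV file (not run in this sketch)."""
    # Imported lazily so the prompt helper works without the heavy dependencies.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()                  # downloads and caches the model weights
    audio = generate_audio(prompt)    # NumPy array of audio samples
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path

spoken = build_bark_prompt("Text-to-speech has come a long way", speaker="WOMAN", sounds=["laughs"])
sung = build_bark_prompt("The north wind and the sun", sing=True)
print(spoken)
print(sung)
```

On a machine with the dependencies available, calling `synthesize(spoken)` would write the generated audio to `bark_generation.wav`. Output varies from run to run, so expect to experiment with the prompt wording.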
This is an impressive list of achievements in a talent-packed space, and it’s still only 2023. Hold on tight, as text-to-speech applications for business are only just starting to take off.
