Microsoft AI can clone voice from only three seconds of audio

VALL-E can synthesize a speaker's voice from as little as a three-second audio sample


Tech Desk January 16, 2023

Microsoft's new text-to-speech AI, VALL-E, can clone a voice, including its tone and pitch, using only a three-second snippet of audio.

Microsoft describes VALL-E as a "neural codec language model." The underlying system is complex, but using it is simple: the user supplies a short audio sample and the text to be spoken.

The program's creators are optimistic that it can be used for high-quality text-to-speech applications such as speech editing and audio content creation. VALL-E is built on EnCodec, an audio codec that Meta announced in October 2022.

VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyzes how a person sounds and, using EnCodec, breaks that information into discrete components. It then uses its training data to match what it knows about how that voice would sound if it spoke another phrase.
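
The idea of breaking audio into discrete components can be illustrated with a toy example. The Python sketch below is not EnCodec's or VALL-E's actual code: the two-number "frames", the four-entry `CODEBOOK`, and the function names are all made up for illustration. It only shows the general principle of mapping each continuous audio frame to the index of its nearest entry in a codebook, which turns a signal into a sequence of integers that a language model could work with.

```python
# Toy illustration only: this is NOT EnCodec's or VALL-E's real code.
# A tiny hand-written codebook stands in for a learned one.

def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def tokenize(frames, codebook):
    """Map each continuous audio frame to one discrete code (an integer)."""
    return [nearest_code(frame, codebook) for frame in frames]

def detokenize(codes, codebook):
    """Map discrete codes back to their (approximate) frame vectors."""
    return [codebook[code] for code in codes]

# Hypothetical values chosen for the example.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]

codes = tokenize(frames, CODEBOOK)
print(codes)                        # prints [0, 1, 2, 3]
print(detokenize(codes, CODEBOOK))  # approximate reconstruction of the frames
```

A real codec learns its codebooks from data and uses far larger vectors, but the output has the same shape: a list of integer codes rather than a raw waveform.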

VALL-E's speech-synthesis capabilities were trained on an audio library assembled by Meta that contains 60,000 hours of English-language speech from more than 7,000 speakers. For a good result, the voice in the three-second sample has to closely match a voice in the training data.

The samples provided by Microsoft demonstrate that the program can generate variations in voice tone by changing the random seed used in the generation process. VALL-E can also imitate the acoustic environment of the sample audio, for example reproducing how a voice sounds over a phone line.
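
How a random seed can change the output is easy to show in miniature. The Python sketch below is an assumption-laden toy, not VALL-E's generation code: the `CODES` list, the `WEIGHTS` probabilities, and the `sample_codes` function are hypothetical stand-ins for a model that picks each next audio code by sampling from a probability distribution. The same seed always reproduces the same sequence, while a different seed will usually produce a different one.

```python
import random

# Hypothetical next-code probabilities; a real model would compute these
# from the text and the acoustic prompt at every step.
CODES = [0, 1, 2, 3]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]

def sample_codes(seed, length=8):
    """Draw a sequence of codes using a generator seeded with `seed`."""
    rng = random.Random(seed)  # separate generator, so runs are repeatable
    return rng.choices(CODES, weights=WEIGHTS, k=length)

first = sample_codes(seed=1)
repeat = sample_codes(seed=1)
other = sample_codes(seed=2)

print(first == repeat)  # prints True: same seed, same sequence
print(first)
print(other)            # a different seed will usually give a different sequence
```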

Many news sites already use machine-powered narration services, but conventional speech-generating programs require a large amount of input audio. More importantly, the resulting voice does not sound human and cannot convey emotion or inflection. VALL-E, by contrast, delivers better and more accurate results from very little input. The program does, however, carry a risk of misuse, such as spoofing voice identification or impersonating a specific speaker.

COMMENTS (1)

Sean Harris | 1 year ago
Great tech. I am curious about what improvements will be made in the near future. There is a Ukrainian start-up that does pretty much the same thing, but more accurately. To clone a voice, Respeecher's AI-powered speech synthesis software needs examples of that voice. Usually 30 minutes of quality recordings are enough, and the software then analyzes the voice until it can produce a clone. The difference between VALL-E and Respeecher's voice cloning tech is that Respeecher's can convey all the nuances of a particular voice, including tone, accent and emotions. It also offers two types of technology: one for B2B clients and one for individual clients. B2B customers can clone someone's voice solely with permission, and those using the Voice Marketplace get access to a library of voices with permissions, which they can use on a subscription basis. Find out more in this case study: www.respeecher.com/case-studies/respeecher-synthesized-younger-luke-skywalkers-voice-disneys-mandalorian