Technology

AI constructs accurate face portraits using voice recordings

Researchers have created AI software that uses short voice clips of speakers to generate an accurate portrait.

Tech Desk April 05, 2022

AI researchers have created a program that generates a portrait of a person from just a short recording of that person speaking. Scientists at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) first published a paper on the AI algorithm, called Speech2Face, in 2019; it produced surprisingly accurate results.

The researchers designed and trained a neural network using millions of YouTube videos of people talking. During training, the AI learnt the correlation between how individuals sound and how they look, particularly with regard to gender, ethnicity and age.

The study involved minimal human intervention in training the algorithm, which learnt on its own from a trove of videos, determining the correlation between a speaker's voice and their appearance. To analyze the accuracy of the portraits the AI was constructing, the researchers built a 'face decoder' that created a standardized reconstruction of a person's face from a still frame, ignoring irrelevant features such as the lighting in the picture. This let the scientists easily compare the voice reconstructions with the actual facial features of the speaker.
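To make the comparison step concrete, here is a minimal sketch of how a feature vector predicted from a voice clip could be scored against one taken from a still frame. The article does not say which metric the researchers used. Cosine similarity is an assumption here, and the vectors and the names `features_from_image`, `features_from_voice` and `features_unrelated` are made up purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional face features (real systems use far more dimensions).
features_from_image = [0.9, 0.1, 0.4, 0.3]    # assumed: taken from a still frame
features_from_voice = [0.8, 0.2, 0.5, 0.2]    # assumed: predicted from the same speaker's voice
features_unrelated = [-0.7, 0.9, -0.2, 0.1]   # assumed: predicted from a different speaker

match_score = cosine_similarity(features_from_image, features_from_voice)
mismatch_score = cosine_similarity(features_from_image, features_unrelated)

# A higher score means the voice-based features sit closer to the image-based ones.
print(match_score > mismatch_score)
```

In a setup like this, a score close to 1 would suggest the voice-based reconstruction captured similar facial features to the reference image, while a low or negative score would point to a mismatch.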

[Figure: for each input speech sample, the original image (reference frame), the reconstruction from image, and the reconstruction from audio.]

While the results were strikingly accurate, the AI performed poorly when factors such as accent, spoken language and voice pitch came into play, leading to “speech-face mismatches” in which the guessed age, gender or ethnicity was incorrect. Speakers with high-pitched voices were generally identified by the software as female, while those with low-pitched voices were categorized as male. Similarly, an Asian man speaking English rather than Chinese led to an incorrect ethnicity guess.

[Figure: examples of speech-face mismatches: (a) gender mismatch, (b) ethnicity mismatch, (c) age mismatch (old to young), (d) age mismatch (young to old).]

In their paper, the researchers wrote that “Our reconstructed faces may also be used directly, to assign faces to machine-generated voices used in home devices and virtual assistants.”

While the paper acknowledged the study's ethical concerns and the limitations of drawing its data from only a selection of videos on the internet, the researchers are optimistic that “a more comprehensive view of voice-face correlations can open up new research opportunities and applications.”


