Columbia engineers teach robot to lip-sync by watching itself and humans
Columbia University engineers have built a robotic face that can learn to move its lips in sync with speech and singing by watching itself in a mirror and then observing humans in online videos, aiming to make humanoid robots appear less "uncanny" in face-to-face interaction. In a study published in Science Robotics, the team describes a two-step "observational learning" approach rather than hand-programmed rules for facial motion.
In the 1970s, Masahiro Mori, a robotics professor at the Tokyo Institute of Technology, published a paper on how he imagined people would react to robots that looked and acted almost human. In particular, he hypothesised that a person's response to a humanlike robot would abruptly shift from empathy to disgust as it approached, but failed to attain, a lifelike appearance. This descent into eeriness is known as the uncanny valley.
"We used AI in this project to train the robot, so that it learned how to use its lips correctly," said Hod Lipson, James and Sally Scapa Professor of Innovation in the Department of Mechanical Engineering and director of Columbia’s Creative Machines Lab.
First, a robotic face driven by 26 motors generated thousands of random expressions while facing a mirror, learning how motor commands change its visible mouth shapes.
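To make that mirror stage concrete, here is a minimal sketch of the idea in code, assuming the mouth shape is summarized as 2-D lip landmarks extracted from the mirror image and that a simple ridge-regularized linear model stands in for the learned "motor-to-face" mapping; the landmark count, the `observe_mouth` stand-in and the modelling choice are illustrative assumptions, not details from the Science Robotics paper.

```python
# Illustrative sketch (not the authors' code): learn a forward "motor-to-face" model
# by motor babbling in front of a mirror. Assumed details: 26 motor commands in [0, 1],
# mouth shape summarized as 20 2-D lip landmarks from a landmark detector.
import numpy as np

rng = np.random.default_rng(0)
N_MOTORS, N_LANDMARKS = 26, 20
OUT_DIM = 2 * N_LANDMARKS

# Hidden face "physics"; stands in for the real hardware plus the mirror camera.
_TRUE_W = rng.normal(size=(N_MOTORS, OUT_DIM))

def observe_mouth(commands):
    """Stand-in for running a lip-landmark detector on the mirror image (noisy)."""
    return commands @ _TRUE_W + 0.01 * rng.normal(size=OUT_DIM)

# Motor babbling: thousands of random expressions, each paired with what the mirror shows.
commands = rng.uniform(0.0, 1.0, size=(5000, N_MOTORS))
landmarks = np.stack([observe_mouth(c) for c in commands])

# Fit the forward model with ridge-regularized least squares: landmarks ≈ commands @ W_forward.
lam = 1e-3
W_forward = np.linalg.solve(commands.T @ commands + lam * np.eye(N_MOTORS),
                            commands.T @ landmarks)

print("mean landmark error:", np.abs(commands @ W_forward - landmarks).mean())
```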
Next, the system watched recordings of people talking and singing and learned how human mouth movements relate to emitted sounds.
"That learning is a sort of motor-to-face kind of model," added Lipson.
"Then, using that learned information of how it moves and how humans move, it could sort of combine these together and learn how to move its motors in response to various sounds and different audio."
With both models combined, the robot could translate incoming audio into coordinated motor actions and lip-sync across a range of languages and contexts without understanding the audio's meaning, the researchers said.
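Continuing the same assumptions, the sketch below shows one way the two learned mappings could be chained at run time: a second (assumed) regressor predicts target lip landmarks from per-frame audio features, and the robot then inverts its own forward model by least squares to find motor commands that reproduce those landmarks. The feature dimensions, the linear models and the inversion step are illustrative, not the authors' implementation.

```python
# Illustrative sketch continuing the previous one: turn audio into motor commands by
# chaining (audio -> target lip landmarks) with the inverse of the learned forward model.
import numpy as np

rng = np.random.default_rng(1)
N_MOTORS, OUT_DIM, N_AUDIO = 26, 40, 64            # 64-dim per-frame audio features (assumed)

# Stand-ins for the two learned models:
W_forward = rng.normal(size=(N_MOTORS, OUT_DIM))   # motor commands -> lip landmarks (mirror stage)
W_audio = rng.normal(size=(N_AUDIO, OUT_DIM))      # audio features -> lip landmarks (human videos)

def audio_to_motor_commands(audio_frames):
    """Map a (T, N_AUDIO) sequence of audio feature frames to (T, N_MOTORS) motor commands."""
    # 1) Predict the mouth shape a human speaker would make for each audio frame.
    target_landmarks = audio_frames @ W_audio
    # 2) Invert the robot's own forward model: find commands whose predicted landmarks
    #    best match the targets (least squares), then clip to the valid actuator range.
    commands, *_ = np.linalg.lstsq(W_forward.T, target_landmarks.T, rcond=None)
    return np.clip(commands.T, 0.0, 1.0)

frames = rng.normal(size=(100, N_AUDIO))           # e.g., 100 frames of mel-spectrogram features
print(audio_to_motor_commands(frames).shape)       # -> (100, 26)
```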
They also demonstrated the robot putting its articulation skills to use by singing a song called "metalman" from its AI-generated debut album, hello world_.
The results are not perfect: the team reported difficulty with plosive sounds such as "B" and puckering sounds such as "W," and said performance should improve as the robot is exposed to more human speech.
Lipson said their lip motion research is part of a broader push toward more natural robot communication in applications such as entertainment, education and care settings.
"I guarantee you, before long, these robots are going to look so human. People will start connecting them, and it's going to be an incredibly powerful and disruptive technology," he added.