New Procedural Speech Animation From Disney Research Could Make for More Realistic VR Avatars


A new paper authored by researchers from Disney Research and several universities describes a new approach to procedural speech animation based on deep learning. The system samples audio recordings of human speech and uses them to automatically generate matching mouth animation. The method has applications ranging from increased efficiency in animation pipelines to making social VR interactions more convincing by animating the speech of avatars in real-time.

Researchers from Disney Research, University of East Anglia, California Institute of Technology, and Carnegie Mellon University have authored a paper titled A Deep Learning Approach for Generalized Speech Animation. The paper describes a system which has been trained with a ‘deep learning / neural network’ approach, using eight hours of reference footage (2,543 sentences) from a single speaker to teach the system the shape the mouth should make during various units of speech (called phonemes) and combinations thereof.

Below: The face on the right is the reference footage. The left face is overlaid with a mouth generated from the system based only on the audio input, after training with the video.

The trained system can then be used to analyze audio from any speaker and automatically generate the corresponding mouth shapes, which can then be applied to a face model for automated speech animation. The researchers say the system is speaker-independent and can “approximate other languages.”

We introduce a simple and effective deep learning approach to automatically generate natural looking speech animation that synchronizes to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real-time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches.
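The sliding window idea described in the abstract can be illustrated with a small sketch. This is not the researchers' implementation: the paper's predictor is a trained neural network, while the `toy_predictor` lookup table below is a hypothetical stand-in, and averaging the overlapping window outputs is an assumption made here for illustration. All function and variable names are invented for this example.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one animation frame of mouth-shape parameters

# Hypothetical "mouth openness" per phoneme label; a real system would
# learn this mapping (and its context dependence) from training footage.
TOY_OPENNESS = {"sil": 0.0, "m": 0.0, "aa": 1.0}


def toy_predictor(window_labels: Sequence[str]) -> List[Frame]:
    """Stand-in for a trained model: one output frame per input label."""
    return [[TOY_OPENNESS.get(label, 0.5)] for label in window_labels]


def sliding_window_predict(
    phonemes: Sequence[str],
    predict_window: Callable[[Sequence[str]], List[Frame]],
    window: int,
    pad: str = "sil",
) -> List[Frame]:
    """Predict a window of frames at every position and average overlaps."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    n = len(phonemes)
    if n == 0:
        return []
    half = window // 2
    # Pad both ends so a full window can be centred on every frame.
    padded = [pad] * half + list(phonemes) + [pad] * half
    sums: List[Frame] = []
    counts = [0] * n
    for center in range(n):
        preds = predict_window(padded[center:center + window])
        for offset, frame in enumerate(preds):
            t = center + offset - half  # frame index this prediction covers
            if not 0 <= t < n:
                continue  # prediction falls in the padding; discard it
            if not sums:
                sums = [[0.0] * len(frame) for _ in range(n)]
            for k, value in enumerate(frame):
                sums[t][k] += value
            counts[t] += 1
    # Each frame was predicted by several overlapping windows; blend them.
    return [[v / counts[t] for v in sums[t]] for t in range(n)]


if __name__ == "__main__":
    labels = ["m", "aa", "aa", "m"]  # one phoneme label per video frame
    for label, frame in zip(labels, sliding_window_predict(labels, toy_predictor, 3)):
        print(label, frame)
```

With a learned predictor in place of the lookup table, each window's output would depend on the neighbouring phonemes, which is how an approach like this could capture the coarticulation effects the abstract mentions.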

Creating speech animation which matches an audio recording for a CGI character is typically done by hand by a skilled animator. And while this system falls short of the sort of high fidelity speech animation you’d expect from major CGI productions, it could certainly be used as an automated first-pass in such productions or used to add passable speech animation in places where it might otherwise be impractical, such as NPC dialogue in a large RPG, or for low budget projects that would benefit from speech animation but don’t have the means to hire an animator (instructional/training videos, academic projects, etc.).

In the case of VR, the system could be used to make social VR avatars more realistic by animating the avatar’s mouth in real-time as the user speaks. True mouth tracking (optical or otherwise) would be the most accurate method for animating an avatar’s speech, but a procedural speech animation system like this one could be a practical stopgap if / until mouth tracking hardware becomes widespread.


Some social VR apps are already using various systems for animating mouths; Oculus also provides a lip sync plugin for Unity which aims to animate avatar mouths based on audio input. However, this new system based on deep learning appears to provide significantly higher detail and accuracy in speech animation than other approaches that we’ve seen thus far.


Ben is the world's most senior professional analyst solely dedicated to the XR industry, having founded Road to VR in 2011—a year before the Oculus Kickstarter sparked a resurgence that led to the modern XR landscape. He has authored more than 3,000 articles chronicling the evolution of the XR industry over more than a decade. With that unique perspective, Ben has been consistently recognized as one of the most influential voices in XR, giving keynotes and joining panel and podcast discussions at key industry events. He is a self-described "journalist and analyst, not evangelist."
  • Is there any voice over work Mila Kunis won’t do? (lol)

    Seriously though, this is very nice tech. I don’t think it could be used in real-time though. It has to process speech to text, then find the correct morphs to match both the volume and the text. That’s quite a delay, even if the code is written well. You don’t have to look further than any voice to text software, used in anything, anywhere. There’s always a long pause as it processes the audio file. That would make conversations in real-time a bit brutal. It would be like everyone was drunk or very distracted.

    That said, what a liberating tool this would be for animators! I’d love to use this with my animation tools. It would also be a big boon to VR capture/acting tools, like Mindshow. Those guys should be alerted to this idea right away.

  • Cool, like it. The problem is that the brain is very sensitive in detecting things that appear non-natural on humans. So, while to me it seems perfect on the cartoon avatar, on the realistic avatar I can easily spot that it is a simulation; the lips appear unnatural. This means that we still have a very long road to go. Anyway, it’s an amazing result.

    • Mike

      Personally I like the “uncanny valley”.

  • VRgameDevGirl

    I would love to get my hands on this software!!!!!