Facebook Reality Labs, the company’s AR/VR R&D group, published detailed research on a method for hyper-realistic real-time virtual avatars, expanding on prior work which the company calls ‘Codec Avatars’.

Facebook Reality Labs has created a system capable of animating virtual avatars in real-time with unprecedented fidelity from compact hardware. From just three standard cameras inside the headset, which capture the user’s eyes and mouth, the system is able to represent the nuances of a specific individual’s complex facial gestures more accurately than previous methods.

More than just attaching cameras to a headset, the thrust of the research is the technique that uses the incoming images to drive a virtual representation of the user.

The solution relies heavily on machine learning and computer vision. “Our system runs live in real-time and it works for a wide range of expressions, including puffed-in cheeks, biting lips, moving tongues, and details like wrinkles that are hard to be precisely animated for previous methods,” says one of the authors.

Facebook Reality Labs published a technical video summary of the work to coincide with SIGGRAPH 2019.

The group also published their full research paper, which dives even deeper into the methodology and math behind the system. The work, ‘VR Facial Animation via Multiview Image Translation’, was published in ACM Transactions on Graphics, which is self-described as the “foremost peer-reviewed journal in graphics.” The paper is authored by Shih-En Wei, Jason Saragih, Tomas Simon, Adam W. Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh.

(a) The ‘Training’ headset, with nine cameras. (b) The ‘Tracking’ headset with three cameras; camera positions shared with the Training headset circled in red. | Image courtesy Facebook Reality Labs

The paper explains how the project involved the creation of two separate experimental headsets, a ‘Training’ headset and a ‘Tracking’ headset.


The Training headset is bulkier and uses nine cameras, which allow it to capture a wider range of views of the subject’s face and eyes. This makes it easier to find the ‘correspondence’ between the input images and a previously captured digital scan of the user (that is, deciding which parts of the input images represent which parts of the avatar). The paper says that this process is “automatically found through self-supervised multiview image translation, which does not require manual annotation or one-to-one correspondence between domains.”

Once correspondence is established, the more compact ‘Tracking’ headset can be used. The alignment of its three cameras mirrors three of the nine cameras on the ‘Training’ headset; the views of these three cameras are better understood thanks to the data collected from the ‘Training’ headset, which allows the input to accurately drive animations of the avatar.
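The two-headset idea can be illustrated with a toy sketch in Python. This is not the paper's method: the paper relies on learned image translation, while this sketch swaps in simple least squares on synthetic scalar data. Every name and number here apart from the nine and three camera counts (`SHARED_CAMS`, `simulate_views`, `estimate_code`, `fit_tracker`, `track`, the gains and the noise level) is a hypothetical stand-in. It shows only the data flow: labels recovered from all nine views supervise a model that later sees just the three shared views.

```python
import random
from statistics import fmean

N_TRAIN_CAMS = 9          # cameras on the 'Training' headset (from the article)
SHARED_CAMS = (0, 1, 2)   # hypothetical indices of the three views the 'Tracking' headset keeps

def simulate_views(code, gains, rng, noise=0.01):
    """Toy camera model: each camera reads gain * expression code, plus noise."""
    return [g * code + rng.gauss(0.0, noise) for g in gains]

def estimate_code(views, gains):
    """Stage 1 (training rig): least-squares estimate of the code from all nine views.
    Assumes the gains are known; a stand-in for the correspondence step."""
    return sum(g * v for g, v in zip(gains, views)) / sum(g * g for g in gains)

def fit_tracker(frames, labels):
    """Stage 2: fit code ~ slope * mean(shared views) + intercept, using stage-1 labels."""
    feats = [fmean(f[i] for i in SHARED_CAMS) for f in frames]
    mx, my = fmean(feats), fmean(labels)
    slope = (sum((x - mx) * (y - my) for x, y in zip(feats, labels))
             / sum((x - mx) ** 2 for x in feats))
    return slope, my - slope * mx

def track(views3, model):
    """Runtime (tracking rig): predict the code from the three shared views only."""
    slope, intercept = model
    return slope * fmean(views3) + intercept

rng = random.Random(0)
gains = [rng.uniform(0.5, 1.5) for _ in range(N_TRAIN_CAMS)]
true_codes = [rng.uniform(-1.0, 1.0) for _ in range(200)]
frames = [simulate_views(c, gains, rng) for c in true_codes]
labels = [estimate_code(f, gains) for f in frames]   # nine-view labels
model = fit_tracker(frames, labels)                  # three-view model

test_code = 0.3
test_views = simulate_views(test_code, gains, rng)
pred = track([test_views[i] for i in SHARED_CAMS], model)
print("absolute error:", abs(pred - test_code))
```

The design point mirrored here is that the three-camera model never needs the camera gains or hand-made labels; it only needs the labels that the richer nine-camera capture made possible.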

The paper focuses heavily on the accuracy of the system. Prior methods create lifelike output, but the match between the user’s actual face and its representation breaks down in key areas, especially with extreme expressions and in the relationship between what the eyes are doing and what the mouth is doing.

Image courtesy Facebook Reality Labs

The work is especially impressive when you take a step back and look at what’s actually happening here: for a user whose face is largely obscured by a headset, extremely close camera shots are being used to accurately rebuild an unobscured view of the face.

As impressive as it is, the approach still has major hurdles preventing mainstream adoption. The reliance on both a detailed preliminary scan of the user and the initial need to use the ‘Training’ headset would necessitate something along the lines of ‘scanning centers’ where users could go to have their avatar scanned and trained (might as well capture a custom HRTF while you’re at it!). Until VR is a significant part of the way society communicates, it seems unlikely that such centers would be viable. However, advanced sensing technologies and continued improvements in automatic correspondence building atop this work could eventually lead to a viable in-home process.



Ben is the world's most senior professional analyst solely dedicated to the XR industry, having founded Road to VR in 2011—a year before the Oculus Kickstarter sparked a resurgence that led to the modern XR landscape. He has authored more than 3,000 articles chronicling the evolution of the XR industry over more than a decade. With that unique perspective, Ben has been consistently recognized as one of the most influential voices in XR, giving keynotes and joining panel and podcast discussions at key industry events. He is a self-described "journalist and analyst, not evangelist."
  • Billy Jackson

    I see D&D groups celebrating :P

  • Rogue Transfer

    The other major hurdle to this approach is the need for two high-end GPUs per person (one to encode, one to decode) – according to the linked paper. That’s not really feasible when you also need most of your GPU just to render a VR game, with little spare to do any of this.

    • Mouthwash

      Then gpu progress will eventually result in this being feasible. Awesome.

  • cataflic

    ok… in 2027…ah ah ah….

  • Ted Joseph

    Keep it coming. I can’t wait until the release of the “visor” or set of glasses that will either do AR or VR, and will replace the cellphone. Hopefully I live long enough to be able to use this tech daily…

  • Ted Joseph

    I am not sure if the weight add is worth this. They could scan the face a different way, or outside of the headset then just use that scan when you talk etc with additional eye tracking. After purchasing the Quest (like the tech, hate the weight – even with Vive pro strap added) I believe comfort (lower weight, better fit etc.) is a must have.

  • The Bard

    A simple approach would be to put 1 camera on table, in front of the user. This here is expensive.

    • aasd

      um no. if you think just a webcam equals this tech then youre dumb.

      • brandon9271

        Yeah, not even a stereo camera pair or Kinect could do this. Maybe some photogrammetry could scan you and “deep fake” AI could “learn” your face from videos.. But webcam? NO :)

        • Jerald Doerr

          Ahhhh give it time… A combination of different things can make this not so difficult… Also resolution of the model is not that big of deal.. You can get away with a lot simply using different image maps correctly.

          https://youtu.be/MMa2oT1wMIs

      • gothicvillas

        He is

    • Jerald Doerr

      Or just adding 2 more cameras on the lower left and right outsides sides of Tracking unit (B)

    • Andrew Jakobs

      and how do you capture the eyes etc. which are covered by the headset?

  • Amazing job! I love this research. Of course it is not ready for consumer market yet and of course there is still some uncanny valley effect, but it’s impressive nonetheless

  • There’s almost enough pixels there to just use the camera’s output directly on a spherical face. It wouldn’t look very realistic, but it would convey feeling.

  • Andrew Jakobs

    Just wow… this just shows they are really doing good work for future headsets.

  • david vincent

    Impressive but who wants to keep his ugly head when you could have any avatar you want ?