Facebook Publishes New Research on Hyper-realistic Virtual Avatars

Sep 2, 2019

Facebook Reality Labs, the company’s AR/VR R&D group, published detailed research on a method for hyper-realistic real-time virtual avatars, expanding on prior work which the company calls ‘Codec Avatars’.

Facebook Reality Labs has created a system capable of animating virtual avatars in real-time with unprecedented fidelity from compact hardware. Using just three standard cameras inside the headset, which capture the user’s eyes and mouth, the system is able to represent the nuances of a specific individual’s complex facial gestures more accurately than previous methods.

More than just sticking cameras onto a headset, the thrust of the research is the technical magic behind using the incoming images to drive a virtual representation of the user.

The solution relies heavily on machine learning and computer vision. “Our system runs live in real-time and it works for a wide range of expressions, including puffed-in cheeks, biting lips, moving tongues, and details like wrinkles that are hard to be precisely animated for previous methods,” says one of the authors.

Facebook Reality Labs published a technical video summary of the work to coincide with SIGGRAPH 2019.

The group also published its full research paper, which dives even deeper into the methodology and math behind the system. The work, ‘VR Facial Animation via Multiview Image Translation’, was published in ACM Transactions on Graphics, which describes itself as the “foremost peer-reviewed journal in graphics.” The paper is authored by Shih-En Wei, Jason Saragih, Tomas Simon, Adam W. Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh.

(a) The ‘Training’ headset, with nine cameras. (b) The ‘Tracking’ headset with three cameras; camera positions shared with the Training headset circled in red. | Image courtesy Facebook Reality Labs

The paper explains how the project involved the creation of two separate experimental headsets, a ‘Training’ headset and a ‘Tracking’ headset.

The ‘Training’ headset is bulkier and uses nine cameras, which allow it to capture a wider range of views of the subject’s face and eyes. This makes it easier to find the ‘correspondence’ between the input images and a previously captured digital scan of the user (deciding which parts of the input images represent which parts of the avatar). The paper says that this correspondence is “automatically found through self-supervised multiview image translation, which does not require manual annotation or one-to-one correspondence between domains.”

Once correspondence is established, the more compact ‘Tracking’ headset can be used. Its three cameras sit in the same positions as three of the nine cameras on the ‘Training’ headset; the views from these three cameras are better understood thanks to the data collected with the ‘Training’ headset, which allows the input to accurately drive animations of the avatar.
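To make the two-headset idea concrete, here is a toy sketch of the second stage only. It is not the paper's method: the paper uses learned image translation on real camera images, while this example substitutes plain linear least squares on synthetic numbers. The camera indices, feature sizes, and function names (`capture`, `fit_regressor`) are all hypothetical, and the sketch simply assumes that the ‘Training’ stage has already paired each frame with an avatar expression code.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TRAINING_CAMS = 9          # cameras on the 'Training' headset
SHARED_CAMS = [0, 1, 2]      # hypothetical indices of the three shared positions
CODE_DIM = 4                 # toy size of an avatar expression code
FEAT_DIM = 6                 # toy size of one camera's feature vector

# Toy assumption: each camera observes a fixed linear function of the code.
camera_maps = rng.normal(size=(N_TRAINING_CAMS, CODE_DIM, FEAT_DIM))

def capture(codes, cams, noise=0.01):
    """Simulate noisy per-camera features for the given expression codes."""
    views = [codes @ camera_maps[c] for c in cams]
    feats = np.concatenate(views, axis=1)
    return feats + noise * rng.normal(size=feats.shape)

def fit_regressor(features, codes):
    """Least-squares map from camera features to expression codes."""
    weights, *_ = np.linalg.lstsq(features, codes, rcond=None)
    return weights

# Assume correspondence is already established: each training frame has a code.
train_codes = rng.normal(size=(200, CODE_DIM))
train_feats = capture(train_codes, SHARED_CAMS)   # only the shared views
weights = fit_regressor(train_feats, train_codes)

# 'Tracking' headset: the same three camera positions, new expressions.
new_codes = rng.normal(size=(50, CODE_DIM))
tracking_feats = capture(new_codes, SHARED_CAMS)
predicted = tracking_feats @ weights
mean_abs_error = float(np.mean(np.abs(predicted - new_codes)))
```

The point of the sketch is only that a mapping fitted on the shared camera positions carries over to the smaller headset, because both headsets see the face from those same three viewpoints.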

The paper focuses heavily on the accuracy of the system. Prior methods create lifelike output, but how closely the representation matches the user’s actual face breaks down in key areas, especially with extreme expressions and in the relationship between what the eyes are doing and what the mouth is doing.

The work is especially impressive when you take a step back and look at what’s actually happening here: for a user whose face is largely obscured by a headset, extremely close camera shots are being used to accurately rebuild an unobscured view of the face.

As impressive as it is, the approach still has major hurdles preventing mainstream adoption. The reliance on both a detailed preliminary scan of the user and the initial need to use the ‘Training’ headset would necessitate something along the lines of ‘scanning centers’ where users could go to have their avatar scanned and trained (might as well capture a custom HRTF while you’re at it!). Until VR is a significant part of the way society communicates, it seems unlikely that such centers would be viable. However, advanced sensing technologies and continued improvements in automatic correspondence building atop this work could eventually lead to a viable in-home process.

Billy Jackson

I see D&D groups celebrating :P
Rogue Transfer

The other major hurdle to this approach is the need for two high-end GPUs per person(one to encode, one to decode) – according to the linked paper. That’s not really feasible when you also need most of your GPU just to render a VR game, with little spare to do any of this.
- Mouthwash
  
  Then gpu progress will eventually result in this being feasible. Awesome.
cataflic

ok… in 2027…ah ah ah….
Ted Joseph

Keep it coming. I can’t wait until the release of the “visor” or set of glasses that will either do AR or VR, and will replace the cellphone. Hopefully I live long enough to be able to use this tech daily…
Ted Joseph

I am not sure if the weight add is worth this. They could scan the face a different way, or outside of the headset then just use that scan when you talk etc with additional eye tracking. After purchasing the Quest (like the tech, hate the weight – even with Vive pro strap added) I believe comfort (lower weight, better fit etc.) is a must have.
The Bard

A simple approach would be to put 1 camera on table, in front of the user. This here is expensive.
- aasd
  
  um no. if you think just a webcam equals this tech then youre dumb.
  - brandon9271
    
    Yeah, not even a stereo camera pair or Kinect could do this. Maybe some photogrammetry could scan you and “deep fake” AI could “learn” your face from videos.. But webcam? NO :)
    - Jerald Doerr
      
      Ahhhh give it time… A combination of different things can make this not so difficult… Also resolution of the model is not that big of deal.. You can get away with a lot simply using different image maps correctly.
      
      https://youtu.be/MMa2oT1wMIs
  - gothicvillas
    
    He is
- Jerald Doerr
  
  Or just adding 2 more cameras on the lower left and right outsides sides of Tracking unit (B)
- Andrew Jakobs
  
  and how do you capture the eyes etc. which are covered by the headset?
TonyVT SkarredGhost

Amazing job! I love this research. Of course it is not ready for consumer market yet and of course there is still some uncanny valley effect, but it’s impressive nonetheless
Walter Sharrow

There’s almost enough pixels there to just use the camera’s output directly on a spherical face. It wouldn’t look very realistic, but it would convey feeling.
Andrew Jakobs

Just wow… this just shows they are really doing good work for future headsets.
david vincent

Impressive but who wants to keep his ugly head when you could have any avatar you want ?
