Facebook Reality Labs, the company’s R&D division, has been leading the charge on making virtual reality avatars realistic enough to cross the dreaded ‘uncanny valley’. New research from the group aims to support novel facial expressions so that your friends will accurately see your silly faces in VR.

Most avatars used in virtual reality today are more cartoon than human, largely as a way to avoid the ‘uncanny valley’ problem—where more ‘realistic’ avatars become increasingly visually off-putting as they get near, but not near enough, to how a human actually looks and moves.

The Predecessor: Codec Avatars

The ‘Codec Avatar’ project at Facebook Reality Labs aims to cross the uncanny valley by using a combination of machine learning and computer vision to create hyper-realistic representations of users. By training the system to understand what a person’s face looks like and then tasking it with recreating that look based on inputs from cameras inside of a VR headset, the project has demonstrated some truly impressive results.

Recreating typical facial poses with enough accuracy to be convincing is already a challenge, but then there’s a myriad of edge cases to deal with, any of which can throw the whole system off and drive the avatar right back into the uncanny valley.

The big challenge, Facebook researchers say, is that it’s “impractical to have a uniform sample of all possible [facial] expressions” because there are simply so many different ways that one can contort their face. Ultimately this means there are gaps in the system’s example data, leaving it confused when it sees something new.

The Successor: Modular Codec Avatars

Image courtesy Facebook Reality Labs

Researchers Hang Chu, Shugao Ma, Fernando De la Torre, Sanja Fidler, and Yaser Sheikh from the University of Toronto, Vector Institute, and Facebook Reality Labs, propose a solution in a newly published research paper titled Expressive Telepresence via Modular Codec Avatars.

While the original Codec Avatar system looks to match an entire facial expression from its dataset to the input that it sees, the Modular Codec Avatar system divides the task by individual facial features—like each eye and the mouth—allowing it to synthesize the most accurate pose by fusing the best match from several different poses in its dataset.

In Modular Codec Avatars, a modular encoder first extracts information inside each single headset-mounted camera view. This is followed by a modular synthesizer that estimates a full face expression along with its blending weights from the information extracted within the same modular branch. Finally, multiple estimated 3D faces are aggregated from different modules and blended together to form the final face output.
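The per-module pipeline described above—encode each camera view, synthesize a full-face estimate plus blending weights per module, then fuse—can be sketched roughly as follows. This is a toy illustration only: all names are hypothetical, the "networks" are placeholder functions, and the mesh is a handful of vertices rather than the dense geometry the real system uses.

```python
import numpy as np

MODULES = ["left_eye", "right_eye", "mouth"]
N_VERTS = 4  # tiny toy mesh; the real system works with a dense 3D face mesh

def encode(view: np.ndarray) -> np.ndarray:
    """Stand-in for the learned modular encoder: camera view -> latent code."""
    return view.mean(axis=0)  # placeholder for a deep network

def synthesize(code: np.ndarray, n_verts: int):
    """Stand-in for the modular synthesizer: latent -> (full-face mesh, weights)."""
    mesh = np.tile(code[:3], (n_verts, 1))      # full-face 3D estimate
    weights = np.abs(code[:1]).repeat(n_verts)  # per-vertex blending confidence
    return mesh, weights

def fuse(views: dict) -> np.ndarray:
    """Aggregate the per-module face estimates into one blended output face."""
    meshes, weights = [], []
    for name in MODULES:
        code = encode(views[name])
        mesh, w = synthesize(code, N_VERTS)
        meshes.append(mesh)
        weights.append(w)
    w = np.stack(weights)                 # (modules, verts)
    w = w / w.sum(axis=0, keepdims=True)  # normalize blend weights per vertex
    return (np.stack(meshes) * w[:, :, None]).sum(axis=0)

views = {m: np.random.rand(8, 3) for m in MODULES}  # fake camera crops
face = fuse(views)
print(face.shape)  # (4, 3): one blended 3D position per vertex
```

The key structural idea is that each module proposes an entire face, and the normalized weights decide, per vertex, which module's proposal dominates—so the mouth module naturally wins around the mouth, the eye modules around the eyes.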

The goal is to improve the range of expressions that can be accurately represented without needing to feed the system more training data. You could say that the Modular Codec Avatar system is designed to be better at making inferences about what a face should look like than the original Codec Avatar system, which relied more on direct comparison.

The Challenge of Representing Goofy Faces

One of the major benefits of this approach is improving the system’s ability to recreate novel facial expressions which it wasn’t trained against in the first place—like when people intentionally contort their faces in ways which are funny specifically because people don’t normally make such faces. The researchers called out this particular benefit in their paper, saying that “making funny expressions is part of social interaction. The Modular Codec Avatar model can naturally better facilitate this task due to stronger expressiveness.”

They tested this by making ‘artificial’ funny faces by randomly shuffling face features from completely different poses (i.e. left eye from {pose A}, right eye from {pose B}, and mouth from {pose C}) and looked to see if the system could produce realistic results given the unexpectedly dissimilar feature input.
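The shuffling procedure itself is simple to picture. A minimal sketch, with made-up pose labels standing in for the real captured expressions:

```python
import random

# Hypothetical stand-ins for captured poses; each pose contributes one
# state per facial feature.
poses = {
    "A": {"left_eye": "wide open", "right_eye": "wide open", "mouth": "smile"},
    "B": {"left_eye": "squint",    "right_eye": "squint",    "mouth": "frown"},
    "C": {"left_eye": "closed",    "right_eye": "closed",    "mouth": "open"},
}

def shuffled_input(poses, rng=random):
    """Draw each facial feature from an independently chosen pose,
    producing a feature combination never seen as a whole face."""
    return {feat: poses[rng.choice(list(poses))][feat]
            for feat in ("left_eye", "right_eye", "mouth")}

print(shuffled_input(poses))  # one possible draw mixes all three poses
```

Because each feature is sampled independently, the resulting combination is very unlikely to exist anywhere in the training set—which is exactly what makes it a good stress test of the system's ability to generalize.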

Image courtesy Facebook Reality Labs

“It can be seen [in the figure above] that Modular Codec Avatars produce natural flexible expressions, even though such expressions have never been seen holistically in the training set,” the researchers say.

As the ultimate challenge for this aspect of the system, I’d love to see its attempt at recreating the incredible facial contortions of Jim Carrey.

Eye Amplification

Beyond making funny faces, the researchers found that the Modular Codec Avatar system can also improve facial realism by negating the difference in eye pose that is inherent in wearing a headset.

In practical VR telepresence, we observe users often do not open their eyes to the full natural extent. This may be due to muscle pressure from wearing the headset and display light sources near the eyes. We introduce an eye amplification control knob to address this issue.

This allows the system to subtly modify the eyes to be closer to how they would actually look if the user wasn’t wearing a headset.
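One plausible reading of such a "control knob" (this is an interpretation for illustration, not the paper's actual formulation) is a gain applied to the measured eyelid openness before the avatar is decoded, clamped so the lid never opens past its natural maximum:

```python
def amplify_eyes(openness: float, knob: float = 1.25) -> float:
    """Scale measured eyelid openness (in [0, 1]) toward fully open.

    knob >= 1 counteracts the mild squint induced by headset pressure
    and nearby display light; the result is clamped to [0, 1].
    """
    return min(1.0, max(0.0, openness * knob))

print(amplify_eyes(0.6))  # 0.75: a mildly squinted eye opened up a bit
print(amplify_eyes(0.9))  # clamped to 1.0
```

A single scalar like this keeps the correction user-tunable: set the knob to 1.0 and the avatar reproduces the eyes exactly as the cameras see them.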

Image courtesy Facebook Reality Labs

– – – – –

While the idea of recreating faces by fusing together features from disparate pieces of example data isn’t itself entirely new, the researchers say that “instead of using linear or shallow features on the 3D mesh [like prior methods], our modules take place in latent spaces learned by deep neural networks. This enables capturing of complex non-linear effects, and producing facial animation with a new level of realism.”

The approach is also an effort to make this kind of avatar representation a bit more practical. The training data necessary to achieve good results with Codec Avatars requires first capturing the real user’s face across many complex facial poses. Modular Codec Avatars achieve similar results with greater expressiveness on less training data.


It’ll still be a while before anyone without access to a face-scanning lightstage will be able to be represented so accurately in VR, but with continued progress it seems plausible that one day users could capture their own face model quickly and easily through a smartphone app and then upload it as the basis for an avatar which crosses the uncanny valley.



Ben is the world's most senior professional analyst solely dedicated to the XR industry, having founded Road to VR in 2011—a year before the Oculus Kickstarter sparked a resurgence that led to the modern XR landscape. He has authored more than 3,000 articles chronicling the evolution of the XR industry over more than a decade. With that unique perspective, Ben has been consistently recognized as one of the most influential voices in XR, giving keynotes and joining panel and podcast discussions at key industry events. He is a self-described "journalist and analyst, not evangelist."
  • duck

    Thumbs up
    But I have a better suggestion.
    Get hold of some actors (the kind who play in dramas, TV series, or movies)—they are the real experts at making body expressions; that’s what they’re taught in acting school. That’s why all the movies you see are actually fake, with faked expressions, yet your mind believes them to be real.
    So a general database of all these expressions could be generated with 3D objects and then used to artificially generate expressions.

    • benz145

      From my reading of a few of the research papers on Codec Avatars, my understanding is that getting enough accuracy to cross the uncanny valley isn’t currently feasible from a general dataset. So basically it’s currently too hard to do this sort of facial reconstruction without a large dataset of the specific user you are modeling.

      I think a big part of the Codec Avatars project is not that they simply want realistic avatars, they want ones which can accurately replicate a specific person’s unique facial movements. For instance, a general model may be able to be mapped to my face, but it would not likely be accurate enough so that people who know me wouldn’t be weirded out (since my micro-expressions wouldn’t be represented accurately).

      It’s similar to if you dressed someone up like me they could probably trick people who never met me into believing that they are me. But as soon as they talked to someone who knows me, that person would immediately realize it wasn’t me based on other cues.

      • Ad

        Yeah, the goal here is clearly to be the way people hang out in the future with some Facebook VR, although based on their convention speeches and basic reason I think this is all work toward getting it into AR glasses.

    • sfmike

      Shows what you don’t know about acting.

    • The point of this is to move away from that way of doing things. You can’t ask actors to make a collection of all the things you can do with your face, including the goofy faces you make with your friends.

  • Amazing work. The idea of going modular, if it works, is great, because it can reduce the training set and be more flexible. There are some downsides, like the fact that the system can now also produce faces that look “weird” because we don’t make them often, or that it doesn’t cope perfectly with the fact that all the parts of the face are correlated with each other in pose.

  • Ad

    They’re obviously gunning with headsets with internal face cameras so they can set up a Facebook VR, and a Facebook AR as soon as possible. And then we’re all pretty much screwed.

    • marcandrdsilets

      we’re all pretty much screwed?

      • Ad

        In terms of them having a VR and AR monopoly, of taking Facebook the social media service and making it a lot bigger and more expansive, and eroding things like privacy and competition.

        • HA! This idiot still thinks he has “privacy”! You must live in a fantasy world. There is no privacy, not for me, not for you, not for anyone. “Privacy” is one of those ethereal concepts like “freedom” that Americans are always screaming about, but can’t really define when you ask them to, that was lost long ago but you still hold onto it as though you actually know what it is.

          • Not Ad

            Oh it does exist, just not for oversharing chumps who want things to “just work” without them having to gain any knowledge on the subject, aka laymen.

            This is an extreme example but hacker groups stay afloat because their members are quite skilled at keeping everything they do private, even from each other. You don’t have to be that good to have some level of privacy in today’s world but you do need some knowledge and skill, online tutorials and anonymous browser tabs won’t cut it.

    • Jesus, quit being so overly-dramatic.

  • TechPassion

    Would be good for business meetings etc.

  • Duane Aakre


    It feels like they’ve been showing off all this fancy stuff for years and years. Time to actually start rolling it out!

    I mean, their latest headsets have nothing more than the five-year-old original Rift except slightly higher-resolution panels. Nothing. Where’s the eye tracking or face tracking or whatever?

    Good VR is starting to feel a lot like flying cars – it will always be three years further into the future.

    • marcandrdsilets

      What about inside-out tracking? Hand tracking? USB connectivity? Mobile VR?

      • Duane Aakre

        None of those fundamentally change the experience from what was available with the original rift (I count the original Rift as being when the Rift got touch controllers) from a user’s perspective. Okay, it is nice to get rid of the cable with mobile, but we are still stuck with avatars unable to have anything but a frozen expression or something that just changes randomly at the whim of a programmer.

        They just need to roll out eye-tracking sensors with their next headset, even if it only works 80% of the time. If the hardware is there, I think the engineers will figure out how to make it better over time through software updates. If the sensors aren’t included, there is 0% chance it improves over the life cycle of the product.

        • marcandrdsilets

          Do you understand how complex realtime facial capture is? A system that is so efficient it requires no post-processing and achieves a plausible result in VR?

    • Blaexe

      This tech never got promised for 2020. But we got hand tracking earlier than expected.

      It’s research. You can’t really predict it.

      • Rogue Transfer

        Yeah, most research fails to deliver something suitable for a product.

  • Timothy Bank

    I don’t know about you, but this opens the door for all sorts of virtual actors. Once the training is complete with your face, then it can be applied to any other face and even virtual ones! I’d love to show up to a meeting in VR and be wearing my dark elf face.

  • Rogue Transfer

    How much processing is required by this? Previously, Codec Avatars required two high-end GPUs per face (one to encode, one to decode).

    I think the problem with Facebook Research Labs is that they don’t have the John Carmack game-optimised mindset. Instead of trying to think up optimised fast routines, they tend to brute force it with AI and averaging multiple training models results together.

    They get deeper and deeper into perfection, but never get something applicable to real-time VR with minimal overhead. After all, while having a chat in VR between two people is laudable, most want this with groups of people (e.g. VRChat or during multiplayer). Imagine how long it’ll take for GPUs to get enough processing power to do 8 people—a total of 9 Titan-level GPUs per player, or more currently. Never mind running a game alongside it!

    It’s going to take a very long time to get there at the roughly 20% increase every year or two we see from GPUs. Never on standalone, and since that’s clearly Facebook’s future for their VR, I doubt we’ll see any of this research transpire into something ‘real’.

    It reminds me of their DeepFocus, most of their full body tracking research – all beyond feasible, due to huge processing requirements.

    • Lord Piccus

      No, each person’s computer handles capturing his/her own face. That’s the complex part and once that’s done it’s sent over the network to the other participants as animation data which is easy to display. So if they get it down to one pc capturing one face with enough processing power to spare for rendering the vr environment they’ve managed it.