Prototype Meta Headset Includes Custom Silicon for Photorealistic Avatars on Standalone

May 4, 2022

Researchers at Meta Reality Labs have created a prototype VR headset with a custom-built accelerator chip specially designed to handle AI processing to make it possible to render the company’s photorealistic Codec Avatars on a standalone headset.

Since long before the company changed its name, Meta has been working on its Codec Avatars project, which aims to make nearly photorealistic avatars in VR a reality. Using a combination of on-device sensors (like eye-tracking and mouth-tracking) and AI processing, the system animates a detailed recreation of the user in a realistic way, in real-time.

Or at least that’s how it works when you’ve got high-end PC hardware.

Early versions of the company’s Codec Avatars research were backed by the power of an NVIDIA Titan X GPU, which monstrously dwarfs the power available in something like Meta’s latest Quest 2 headset.

But the company has moved on to figuring out how to make Codec Avatars possible on low-powered standalone headsets, as evidenced by a paper published alongside last month’s 2022 IEEE CICC conference. In the paper, Meta reveals it created a custom chip built with a 7nm process to function as an accelerator specifically for Codec Avatars.

Specially Made

According to the researchers, the chip is far from off the shelf. The group designed it with an essential part of the Codec Avatars processing pipeline in mind: specifically, analyzing the incoming eye-tracking images and generating the data needed for the Codec Avatars model. The chip measures a mere 1.6mm × 1.6mm, for a footprint of 2.56mm².

“The test-chip, fabricated in 7nm technology node, features a Neural Network (NN) accelerator consisting of a 1024 Multiply-Accumulate (MAC) array, 2MB on-chip SRAM, and a 32bit RISC-V CPU,” the researchers write.

In turn, they also rebuilt part of the Codec Avatars AI model to take advantage of the chip’s specific architecture.

“By re-architecting the Convolutional [neural network] based eye gaze extraction model and tailoring it for the hardware, the entire model fits on the chip to mitigate system-level energy and latency cost of off-chip memory accesses,” the Reality Labs researchers write. “By efficiently accelerating the convolution operation at the circuit-level, the presented prototype [chip] achieves 30 frames per second performance with low-power consumption at low form factors.”
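To get a feel for the two constraints the researchers describe (the whole model living in on-chip SRAM, and the MAC array keeping up with 30 frames per second), here is a rough back-of-the-envelope sketch. Only the 1024 MAC units, the 2MB of SRAM, and the 30 FPS target come from the quoted paper. The clock speed, the 8-bit weights, and the layer shapes are hypothetical placeholders chosen purely for illustration, and the check ignores the memory needed for intermediate activations.

```python
from dataclasses import dataclass
from typing import List

# Figures quoted from the paper.
MAC_UNITS = 1024
SRAM_BYTES = 2 * 1024 * 1024
TARGET_FPS = 30

# Assumptions for illustration only; neither value appears in the article.
ASSUMED_CLOCK_HZ = 100_000_000   # hypothetical accelerator clock
ASSUMED_BYTES_PER_WEIGHT = 1     # hypothetical 8-bit quantized weights


@dataclass(frozen=True)
class ConvLayer:
    """One convolution layer, described by its shape only."""
    in_channels: int
    out_channels: int
    kernel: int       # square kernel size (3 means 3x3)
    out_height: int
    out_width: int

    @property
    def weights(self) -> int:
        # One weight per (input channel, output channel, kernel position).
        return self.in_channels * self.out_channels * self.kernel ** 2

    @property
    def macs(self) -> int:
        # Every output pixel reuses the full weight set once.
        return self.weights * self.out_height * self.out_width


# Hypothetical small CNN; NOT the model from the paper.
HYPOTHETICAL_MODEL: List[ConvLayer] = [
    ConvLayer(1, 16, 3, 64, 64),
    ConvLayer(16, 32, 3, 32, 32),
    ConvLayer(32, 64, 3, 16, 16),
    ConvLayer(64, 128, 3, 8, 8),
]


def weight_bytes(layers: List[ConvLayer]) -> int:
    """Storage needed for all weights under the assumed precision."""
    return sum(layer.weights for layer in layers) * ASSUMED_BYTES_PER_WEIGHT


def fits_in_sram(layers: List[ConvLayer]) -> bool:
    """True if the weights alone fit on-chip (activations are ignored)."""
    return weight_bytes(layers) <= SRAM_BYTES


def macs_per_frame(layers: List[ConvLayer]) -> int:
    """Multiply-accumulate operations needed to process one frame."""
    return sum(layer.macs for layer in layers)


def peak_mac_budget_per_frame() -> int:
    """Upper bound on MACs per frame if every unit is busy every cycle."""
    return MAC_UNITS * ASSUMED_CLOCK_HZ // TARGET_FPS


if __name__ == "__main__":
    print("weight bytes:", weight_bytes(HYPOTHETICAL_MODEL))
    print("fits in SRAM:", fits_in_sram(HYPOTHETICAL_MODEL))
    print("MACs per frame:", macs_per_frame(HYPOTHETICAL_MODEL))
    print("peak budget per frame:", peak_mac_budget_per_frame())
```

Real accelerators rarely hit peak utilization, so the budget figure is best read as a ceiling rather than a prediction.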

The prototype headset is based on Quest 2 | Image courtesy Meta Reality Labs

By accelerating an intensive part of the Codec Avatars workload, the chip not only speeds up the process but also reduces the power it consumes and the heat it generates. It does this more efficiently than a general-purpose CPU thanks to its custom design, which in turn informed the rearchitected software design of the eye-tracking component of Codec Avatars.

One Part of a Pipeline

But the headset’s general-purpose CPU (in this case, Quest 2’s Snapdragon XR2 chip) doesn’t get to take the day off. While the custom chip handles part of the Codec Avatars encoding process, the XR2 handles the decoding process and renders the actual visuals of the avatar.
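The division of labor can be pictured as a two-stage pipeline. The sketch below is a toy illustration only: every name in it is hypothetical, the "encoder" is a crude darkest-pixel pupil finder standing in for the chip's neural network, and the "decoder" just produces a text description where the XR2 would render an avatar.

```python
from dataclasses import dataclass
from typing import List

# Grayscale eye image as rows of values from 0.0 (dark) to 1.0 (bright).
# Assumed to be at least 2x2 so positions can be normalized.
Image = List[List[float]]


@dataclass(frozen=True)
class GazeCode:
    """Compact result handed from the encoder stage to the decoder stage."""
    x: float  # -1.0 (left edge) to 1.0 (right edge)
    y: float  # -1.0 (top edge) to 1.0 (bottom edge)


def encode_on_accelerator(eye_image: Image) -> GazeCode:
    """Placeholder for the custom chip's role: eye image in, small code out.

    Uses the darkest pixel as a crude pupil estimate. The real chip runs a
    convolutional neural network instead.
    """
    height, width = len(eye_image), len(eye_image[0])
    row, col = min(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: eye_image[rc[0]][rc[1]],
    )
    return GazeCode(x=2 * col / (width - 1) - 1, y=2 * row / (height - 1) - 1)


def decode_and_render_on_soc(code: GazeCode) -> str:
    """Placeholder for the XR2's role: turn the code into avatar output."""
    horizontal = "left" if code.x < 0 else "right" if code.x > 0 else "center"
    vertical = "up" if code.y < 0 else "down" if code.y > 0 else "center"
    return f"avatar eyes: {horizontal}/{vertical}"


def run_frame(eye_image: Image) -> str:
    """One frame: the encoder stage feeds the decoder stage."""
    return decode_and_render_on_soc(encode_on_accelerator(eye_image))


if __name__ == "__main__":
    looking_up_right = [
        [1.0, 1.0, 0.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
    ]
    print(run_frame(looking_up_right))
```

The point of the split is that only the small code crosses between the two stages, so the general-purpose chip never has to touch the raw eye-tracking images.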

The work must have been quite multidisciplinary, as the paper credits 12 researchers, all from Meta’s Reality Labs: H. Ekin Sumbul, Tony F. Wu, Yuecheng Li, Syed Shakib Sarwar, William Koven, Eli Murphy-Trotzky, Xingxing Cai, Elnaz Ansari, Daniel H. Morris, Huichu Liu, Doyun Kim, and Edith Beigne.

It’s impressive that Meta’s Codec Avatars can run on a standalone headset, even if a specialty chip is required. But one thing we don’t know is how well the visual rendering of the avatars is handled. The underlying scans of the users are highly detailed and may be too complex to render on Quest 2 in full. It’s not clear how much the ‘photorealistic’ part of the Codec Avatars is preserved in this instance, even if all the underlying pieces are there to drive the animations.

– – — – –

The research represents a practical application of the new compute architecture that Reality Labs’ Chief Scientist, Michael Abrash, recently described as a necessary next step for making the sci-fi vision of XR a reality. He says that moving away from highly centralized processing to more distributed processing is critical for the power and performance demands of such headsets.

One can imagine a range of XR-specific functions that could benefit from chips specially designed to accelerate them. Spatial audio, for instance, is desirable in XR across the board for added immersion, but realistic sound simulation is computationally expensive (not to mention power hungry!). Positional-tracking and hand-tracking are a critical part of any XR experience—yet another place where designing the hardware and algorithms together could yield substantial benefits in speed and power.

Fascinated by the cutting edge of XR science? Check out our archives for more breakdowns of interesting research.

Sofian

Looks more like 1.6cm² to me.
- asdf
  
  1.6 mm * 1.6 mm = 2.56 mm^2 (correctly computed in the paper)
kontis

If this ASIC isn’t just a generic NN accelerator, but something only for Codec Avatars then Meta is basically building a video game hardcoded into actual transistors of a device.

This is a software lockdown on a level far beyond what Apple or any console maker ever tried in history.

They would have a solid excuse, but it would also be (suspiciously) quite a convenient anti-interoperability solution.

EU wouldn’t be able to tell them to make it work with other devices and apps, like they are trying to do with iMessage. Nice plan you have here, Zuck.
- Christian Schildwaechter
  
  I doubt that is in any way that fancy. I haven’t read the paper, but this looks basically like a kind of “Mobile Real-Time Deepfake”, where a detailed texture of an existing person is projected onto the image/model of another one, with a DNN making it look more realistic/less creepy. You’d need a decent 3D image of the user’s face, something you can create with a modern phone within a few seconds by moving your head around and then either combining the image with data from a depth sensor or photogrammetry. Then you have to extract some of the facial movement in real time, obviously eye and mouth movement, but also minor details like raising a brow etc.
  
  Based on these facial movements and a 3D representation of the facial/bone structure derived from the user’s initial face scan, you now reconstruct the facial pose on a structurally similar avatar, and in a last step you now “deep fake” the user’s stored static face image onto the live dynamic avatar. The parts for this all exist, there is no need for some Meta special sauce, all you need is software that can do the initial scans and create a user-specific base avatar, and then enough DNN computational power to apply it in real time. Which is why I’d guess that this is in fact mostly just a generic NN accelerator.
  - benz145
    
    I’m far from an expert but I’ve read the papers. I don’t think that Codec Avatars is fundamentally related to deepfaking. As far as I know, the deep-fake method is functionally a 2D method; I’m not sure how easy it would be to translate it into true 3D.
    
    This article and the associated paper are a pretty clear look at the Codec Avatars approach: https://www.roadtovr.com/facebook-expands-on-hyper-realistic-virtual-avatar-research/
    
    As far as I understand, the method is actually altering (animating) the geometry of the rendered face.
    - Christian Schildwaechter
      
      [I posted this answer six days ago, but it got stuck in the Disqus “pending” limbo, most likely due to me including a YouTube link. As I have included these and images on RoadToVR comments in the past, I didn’t expect moderation or at least someone checking if there are pending comments. And with this being an answer to Ben’s answer to me, I assumed the chances of it getting approved were above average. Nothing has happened, even though Ben has been active on Disqus since then. So I repost it now without the link ~~and delete the original comment within the seven day edit limit~~ (couldn’t). The video is titled “Talking Head Projection, Architecture Machine Group, MIT, 1979” and is worth watching.]
      
      TL;DR:
      – Codec Avatars uses similar technology to deep fakes, what makes the Meta approach special is working on incomplete facial scans
      – realistic telepresence research has been ongoing for more than 40 years
      – the neural processor works as an encoder for (currently) eye tracking, its advantages are specialization for the model and efficiency, it isn’t completely new tech
      
      Original comment:
      
      I’m not an expert either and picked the term deepfake primarily because most people will know that this is about perspectively projecting an image on top of another for which no proper 3D model exists, otherwise it would be just regular 3D texture mapping. By now I have read the introductions to both papers and skimmed the rest and am pretty sure that this is very closely related to the technology behind deepfaking. The 2019 paper even references it, but properly as “In recent years, end-to-end regression from images has proven effective for high-quality and real-time facial motion estimation.”
      
      A deep fake usually uses two parts, an encoder and a decoder. The trained encoder takes a complex image like a video frame, and derives from it a reduced, abstract version, e.g. eye, nose and mouth position plus head rotation in 3D. This representation resides only in the neural network, it isn’t exported as information that could be used directly on a rendered avatar.
      
      The decoder takes that abstract representation and applies a previously trained model to it, e.g. the face of an actor or VR user where eye, nose and mouth position is known and basically translates these into the 3D position on top of the original image. This isn’t really limited to 2D sources, researchers “deep faked” lung cancer into volumetric CT scans that fooled radiologists. Deep fake isn’t even a particular technology, it broadly refers to projecting images onto others using neural networks creating genuine looking results.
      
      Codec avatars do something similar. The focus of the 2019 paper isn’t the image transfer itself, but how to generate a proper source image inside a VR HMD, where it is impossible to see the whole face (which is usually required), and the partial facial image is often recorded by IR cameras. Their solution is to split the encoder into separate parts and create a special decoder that can handle the incomplete data and still produce the complete face. The encoder doesn’t have to deal with the head position, as this information comes from the HMD sensors, so it only has to precisely detect eye, mouth, cheek etc. motion. Their decoder uses a pre-trained model based on nine images from different perspectives, but as I just skimmed over it, I’m not sure whether this aids with the partial face scans or is simply a way to reduce the amount of required images from the user.
      
      The 2022 paper simply describes integrating a low power neural processor into a mobile SoC that can hold the complete encoder model adapted for recognizing the partial facial image coming from multiple IR cameras. In the current implementation it only derives the eye gaze, the video showing eye and mouth tracking was done with the conventional PC powered tracking HMD they used before.
      
      It was never my intention to claim that these aren’t significant results. Both the incomplete face tracking and running the trained neural network on a small power budget are hard problems, and it wouldn’t be trivial for others to achieve similar results. I was mostly reacting to another claim that Meta does something that nobody else could ever do (here interpreted as a tool to lock in users), even though we know that both the Quest hardware and software are based largely on components developed by others and available to others too.
      
      Meta did a lot of work, packaged it well, sold it at an incredible price and invests a lot into the platform; it’s just that this isn’t magic sauce but good research and engineering that others can achieve too. People have been working on this stuff for literally decades. In 1979 the Architecture Machine Group (precursor to the Media Lab) at MIT demonstrated a telepresence system where the live video image of the participants was back-projected into a hollow human face shaped form, with the head being rotated and tilted by motors. This resulted in a massively more immersive representation, the presence that Meta today is aiming for too with much better technology. Fittingly they called the system “Talking Heads”, and it is no coincidence that the cover of the 1980 “Remain in Light” album by the Talking Heads was also created by the Architecture Machine Group, featuring portraits of the band members with their faces overlaid with stylized computer graphics.
  - James Cobalt
    
    This isn’t even that fancy. This chip is for eye tracking only.
- Lucidfeuer
  
  AND, the only reason they’d want to do this is so that they can infringe, violate, track, exploit and sell the fuck out of your most precise facial features.
Nothing to see here

Meta is making rapid progress towards a future that none of us want. The only thing that makes their current social experiences barely tolerable is that our avatars are anonymous and other users can’t see our facial expressions or any other personal information. Can I be the only one in the metaverse who seeks solitary experiences away from creepy interactions with strangers?
- Andrew Jakobs
  
  Who says none of us wants this? Some people really think they are speaking for everybody because they think it isn’t something they like.
  - Lucidfeuer
    
    Are you at least paid to be such a shill of the worst company in tech?
    - Andrew Jakobs
      
      Oh please get your head out of your ass. Just because some of you really hate Meta doesn’t mean everybody does; if that were the case, nobody would have bought a Quest and nobody would be using any of Meta’s products, but reality says otherwise. Yeah, there are parts of Meta that I don’t like, but you just can’t say their hardware is crap, as the Quest is one of the best headsets on the market (for real consumers).
      - Lucidfeuer
        
        That’s not the point. I have an Oculus and a very different opinion. However, correlating the relatively niche use of these with the toxic monopoly every sane person hates is irrelevant to the fact that NO: the same way most people besides shill looneys despise and reject Horizon, they reject having their physiognomic data violated, tracked and exploited by this company.
- Timothy Bank
  
  It is a CODEC and therefore can be applied to anyone/anything. It does not have to be me. It could easily be someone else that reacts just like me in real time. I could see a whole new market where you could buy a likeness to your favorite celebrity and they get royalties. Other creators could make custom people that you could wear like skins. If you were in a business meeting, you could have a high quality scan made of you that always makes you look really good even if you are in your PJs with bedhead. What worries me is the people who impersonate someone else with their likeness then use an AI voice modulator that makes them sound like you. There needs to be some sort of system that certifies the wearer and maybe that is achieved through the cameras on the headset.
  - Guest
    
    This company already impersonates your friends. Do some screen grabs of your friends and bring them next time you see them. Really.
- Hivemind9000
  
  Strangers, sure. But what about family, friends and business colleagues? I think we’ll see this come out first in Cambria (maybe not the first generation), which is targeted at professional/business users (and enthusiasts I guess). Post-pandemic business is becoming increasingly virtualized, and tech like this will make it easier to have more natural interactions with people in a virtual work space.
  
  And as Timothy says, for other interactions it doesn’t necessarily have to be your exact face.
- MeowMix
  
  This is likely for virtual meetings and other work use cases. Real life avatars will be needed for business and enterprise uses; they won’t be cool with Cartoon Executives.
  
  Something that has to be repeated over and over again: not everything in VR is about gaming and casual socializing. Meta is banking on XR being the next computing platform, and not just the next gaming platform.
- gothicvillas
  
  How about visiting in VR your family who are on the other side of the pond? Would you choose photorealistic or cartoon avatars?
- ShaneMcGrath
  
  I wouldn’t use any sort of photorealistic avatar feature either, for security/privacy reasons, as some people in their particular line of work may need a more anonymous approach. But I’m sure you won’t have to; there will be other options available.
  It only becomes a problem if they start forcing you to use them for everything. Hopefully they learned their lesson from the backlash over forcing people to have a Facebook account for new Oculus headsets.
eadVrim

The rest of the body and the room could be 3D scanned by the MR headset cameras. Teleportation is no longer science fiction :)
xyzs

RISC-V :)
Good to see that Meta favors this amazing ISA.
Lucidfeuer

So it’s like a proprietary “secure” enclave to infringe, violate, track, exploit and sell your most precise morphological features, which no other company has done before. Thankfully, Cambria will not even be niche given the price, and since this is Facebook this is mostly vaporware, like their wack lightfield conference screen. Kudos for the optimised face reconstruction and articulation however.
silvaring

Powerful, and amazing, but very scary tech.
Bryan Jones

Well I’m excited.
James Cobalt

Based on the comments, people may not be fully reading or are reading too much into the headline, “Prototype [HMD] Includes [Chip] for Photorealistic Avatars on Standalone”.

This chip is a POC for advanced eye tracking at low energy use. It’s not for tracking or rendering a face. It isn’t even for tracking the eyebrows to interpret facial expressions. What this paper details is a more energy-efficient and lower-latency approach to capture what the eyes are looking at and get that information over to an avatar. This is important for a battery- and CPU-limited device. It is one of the first steps towards offering photorealistic avatars on standalone HMDs.
Marsel Khisamutdinov

1.6mm * 1.6mm = 2.56mm²
