One Possible Solution

One option for overcoming this challenge could be to sense the user’s face rather than see it.
Way back in 2017, I demoed a face-tracking technology from a company called MindMaze. Instead of cameras, the company’s prototype used an array of electrodes in the headset’s facepad to measure facial muscle activity.

This early prototype used eight electrodes, and thus created eight streams of data corresponding to the movement of my face. Even without personal calibration, the system was able to accurately match a range of face motions.

Although it wasn’t nearly as precise as what we see today on Vision Pro, the combination of machine learning progress over the last eight years, the potential to use significantly more electrodes, and the potential for personal calibration leads me to believe this solution could be a viable path to face-tracking without giving cameras a direct line of sight to the face.
Even with a more advanced version of this electrode-based system, it could still be challenging to achieve realistic mouth motions. To help on that front, lip-sync prediction based on audio input (and using personal calibration) could boost accuracy further.
Of course, this approach (if it works at all) only works as long as XR headsets remain in the ‘goggles’ era, wherein the headset maintains significant contact around the user’s eyes. As we approach fully featured XR ‘glasses’, yet another solution for accurate face-tracking will be needed!