Stunning View Synthesis Algorithm Could Have Huge Implications for VR Capture

As far as live-action VR video is concerned, volumetric video is the gold standard for immersion. And for static scene capture, the same holds true for photogrammetry. But both methods have limitations that detract from realism, especially when it comes to ‘view-dependent’ effects like specular highlights and lensing through translucent objects. Research from Thailand’s Vidyasirimedhi Institute of Science and Technology shows a stunning view synthesis algorithm that significantly boosts realism by handling such lighting effects accurately.

Researchers from the Vidyasirimedhi Institute of Science and Technology in Rayong, Thailand published work earlier this year on a real-time view synthesis algorithm called NeX. Its goal is to use just a handful of input images of a scene to synthesize new frames that realistically portray the scene from arbitrary viewpoints between the real images.

Researchers Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn write that the work builds on a technique called multiplane image (MPI) representation. Compared to prior methods, they say their approach better models view-dependent effects (like specular highlights) and produces sharper synthesized imagery.
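To give a rough idea of how the MPI-plus-basis approach works, here is a minimal, hypothetical sketch in NumPy. It is not the authors’ implementation: the array shapes, the placeholder basis function, and the random parameters are illustrative assumptions, and in the real system the per-pixel values and the global basis are learned per scene (the basis by a small neural network). The core idea is that each plane pixel stores an alpha value, a base color, and a few coefficients that weight shared view-dependent basis functions, and the planes are alpha-composited for the current viewing direction.

```python
import numpy as np

# Hypothetical sketch of NeX-style MPI rendering (shapes and values are
# illustrative; the real system learns these parameters per scene).
#
# Each MPI plane pixel stores:
#   alpha     -- opacity used for compositing
#   k0        -- view-independent base color (RGB)
#   k[1..N]   -- RGB coefficients for N global view-dependent basis functions
#
# Per-pixel color for viewing direction v:  c(v) = k0 + sum_n k_n * H_n(v)

D, H, W, N = 16, 64, 64, 4                       # planes, height, width, basis count

rng = np.random.default_rng(0)
alpha = rng.uniform(0.0, 1.0, (D, H, W, 1))      # stand-in for learned opacities
k0    = rng.uniform(0.0, 1.0, (D, H, W, 3))      # stand-in for learned base colors
k     = rng.uniform(-0.1, 0.1, (D, H, W, N, 3))  # stand-in for basis coefficients

def basis(view_dir, n_basis):
    """Placeholder for the learned global basis H_n(v) (an MLP in the paper)."""
    x, y, z = view_dir
    feats = np.array([x, y, z, x * y, y * z, x * z, x * x - y * y, 3 * z * z - 1.0])
    return feats[:n_basis]

def render(view_dir):
    """Evaluate view-dependent colors, then alpha-composite planes back to front."""
    h = basis(view_dir, N)                            # (N,)
    color = k0 + np.einsum("dhwnc,n->dhwc", k, h)     # c(v) = k0 + sum_n k_n * H_n(v)
    out = np.zeros((H, W, 3))
    for d in range(D - 1, -1, -1):                    # back-to-front "over" compositing
        out = color[d] * alpha[d] + out * (1.0 - alpha[d])
    return np.clip(out, 0.0, 1.0)

frame = render(np.array([0.0, 0.0, 1.0]))             # synthesize a view along +z
print(frame.shape)                                    # (64, 64, 3)
```

As the paper’s conclusion (quoted below) suggests, part of what makes the real system fast and sharp is that some of these parameters, such as the base color and high-frequency texture, are optimized explicitly rather than predicted implicitly by a network.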

On top of those improvements, the team has highly optimized the system, allowing it to run easily at 60Hz—a claimed 1000x improvement over the previous state of the art. And I have to say, the results are stunning.

Though the system isn’t yet highly optimized for this use case, the researchers have already tested it with a VR headset, rendering with stereo depth and full 6DOF movement.

The researchers conclude:

Our representation is effective in capturing and reproducing complex view-dependent effects and efficient to compute on standard graphics hardware, thus allowing real-time rendering. Extensive studies on public datasets and our more challenging dataset demonstrate state-of-art quality of our approach. We believe neural basis expansion can be applied to the general problem of light-field factorization and enable efficient rendering for other scene representations not limited to MPI. Our insight that some reflectance parameters and high-frequency texture can be optimized explicitly can also help recovering fine detail, a challenge faced by existing implicit neural representations.

You can find the full paper at the NeX project website, which includes demos you can try for yourself right in the browser. There are also WebVR-based demos that work with PC VR headsets in Firefox, but unfortunately they don’t work in Quest’s browser.

Notice the reflections in the wood and the complex highlights in the pitcher’s handle! View-dependent details like these are very difficult for existing volumetric and photogrammetric capture methods.

Volumetric video capture that I’ve seen in VR usually gets very confused by these sorts of view-dependent effects, often having trouble determining the appropriate stereo depth for specular highlights.

Photogrammetry, or ‘scene scanning’, typically ‘bakes’ the scene’s lighting into textures, which often makes translucent objects look like cardboard (since lighting highlights don’t move correctly as you view the object from different angles).

The NeX view synthesis research could significantly improve the realism of volumetric capture and playback in VR going forward.

  • kontis

    Great stuff.

    Reflections and highlights worked very well in light field methods known and used since the 90s, but there are perspective discontinuity artifacts, area size limits, large file size and capturing challenges.

    If they can solve or mitigate these issues with this solution then it may really have great potential.

    A reminder that it’s only research, not some “almost ready app”, as many people interpret news like this:

    Our model is optimized independently for each scene.
    For a scene with 17 input photos of resolution 1008 × 756, the training took around 18 hours using a single NVIDIA V100 with a batch size of 1.

    • Jan Ciger

      Yeah, that last detail pretty much rules it out as anything practical. Because that’s a single scene and effectively a single point of view – if that already requires 18 hours of training for relatively low-res images, the idea of using this for frame interpolation inside of some game is a non-starter …

      • benz145

        Rules it out as impractical for this specific implementation. But this is just research, and it already made a 1000x optimization in rendering time compared to the prior method. Expect this kind of thing to improve further and at a rapid pace.

      • Christian Schildwaechter

        Google Jump was introduced in 2015 as a method to create 3D scenes and videos for VR, basically a rig with a lot of GoPros plus custom software. They used it to collect image data that then allowed them to recreate volumetric scenes. The resulting scenes were used in Google Expeditions, an educational VR program that allowed teachers to visit locations in VR with their students, where every student was located at a different position in the scene and could describe their specific observations to others.

        These different positions weren’t recorded, but calculated from the Google Jump source material, and according to Google you basically needed one of their data centers to be able to construct the scene in a reasonable time. As this targeted Google Cardboard, the view had to be prerendered for each fixed position, with rotational tracking only, but even this very limited option at a very high computational price was useful back then.

        And now you can do a similar scene reconstruction in days on a single high-end GPU and move a virtual camera through the calculated scene in real time. It may not be particularly useful for VR game performance optimization at this point, but there are a lot of use cases where being able to examine a static volumetric scene that was automatically created from a limited number of images could be extremely useful. Just imagine creating photorealistic virtual house tours without having to manually correct all the reflective surfaces that photogrammetry couldn’t catch, or being able to examine a properly lit crime scene later by walking around in a reconstruction built out of images taken there.

        Their current camera rig uses 16 cameras at 800×480 facing in the same direction, with 20 – 62 images used per scene, but it shouldn’t be a problem to scale this up to higher resolutions or a wider view if you are willing to use more hardware and/or time for the rendering process. For the virtual house or crime scene reconstructions it wouldn’t matter if the initial calculation took a week or more; it would already be a very practical technology today.

        • sfmike

          Sadly Google has given up on 3D and VR as they see no short term profits in it and quarterly profits are what make American corporations tick.

    • VRFriend

      Dude, they provide code and exhaustive details on how to run it all on your own machine, with your own photos. Literally, it is an “app” by your own reasoning.

  • ViRGiN

    If it says it could, it won’t. Simple as that.

  • Kevin White

    Saw this on Two Minute Papers the other day. Great channel to follow for keeping up with machine learning techniques, usually as applicable to visuals and photorealism.

  • Amazing research! I’m especially impressed by the speed this algorithm runs at.

  • Wild Dog

    Photorealistic.

  • Ad

    I want VR cutscenes like this so much. Also we should start teaching people about VR misinformation right now because it’s coming.

  • Janosch Obenauer

    It’s already absolutely amazing tech but will probably take some more time. And with new papers on neural radiance fields coming out almost every week I have high hopes it’s not gonna be too long…

  • Emad Khan

    Someone add this to Unity and Unreal right now!!!!