July 1, 2020

Introducing neural supersampling for real-time rendering

By: Lei Xiao

Real-time rendering in virtual reality presents a unique set of challenges, chief among them the need to support photorealistic effects, achieve higher resolutions, and reach higher refresh rates than ever before. To address these challenges, researchers at Facebook Reality Labs developed DeepFocus, a rendering system we introduced in December 2018 that uses AI to create ultra-realistic visuals in varifocal headsets. This year at the virtual SIGGRAPH Conference, we’re introducing the next chapter of this work, which unlocks a new milestone on our path to create future high-fidelity displays for VR.


Our SIGGRAPH technical paper, entitled “Neural Supersampling for Real-time Rendering,” introduces a machine learning approach that converts low-resolution input images to high-resolution outputs for real-time rendering. This upsampling process uses neural networks, trained on scene statistics, to restore sharp details while saving the computational overhead of rendering these details directly in real-time applications.

Our approach is the first learned supersampling method that achieves 16x supersampling of rendered content with high spatial and temporal fidelity, outperforming prior work by a large margin.

Animation comparing the rendered low-resolution color input to the 16x supersampling output produced by the introduced neural supersampling method.

What’s the research about?

To reduce the rendering cost for high-resolution displays, our method works from an input image that has one-sixteenth as many pixels as the desired output, that is, one quarter of the resolution along each axis. For example, if the target display has a resolution of 3840×2160, then our network starts with a 960×540 input image rendered by a game engine, and upsamples it to the target display resolution as a post-process in real time.
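The arithmetic behind the 16x figure can be checked with a short sketch. The function name and the `scale_per_axis` parameter are illustrative and are not taken from the paper; the only assumption is that the reduction is split evenly across width and height.

```python
def low_res_input_size(target_width, target_height, scale_per_axis=4):
    """Return the (width, height) of the low-resolution input for a given target.

    scale_per_axis=4 means 4x fewer pixels horizontally and 4x fewer vertically,
    which gives 4 * 4 = 16x fewer pixels overall.
    """
    if target_width % scale_per_axis or target_height % scale_per_axis:
        raise ValueError("target resolution must be divisible by the per-axis scale")
    return target_width // scale_per_axis, target_height // scale_per_axis


target = (3840, 2160)
low = low_res_input_size(*target)
pixel_ratio = (target[0] * target[1]) // (low[0] * low[1])
print(low, pixel_ratio)  # (960, 540) 16
```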

While there has been a tremendous amount of research on learned upsampling for photographic images, none of it speaks directly to the unique needs of rendered content such as images produced by video game engines. This is due to the fundamental differences in image formation between rendered and photographic images. In real-time rendering, each sample is a point in both space and time. That is why the rendered content is typically highly aliased, producing jagged lines and other sampling artifacts seen in the low-resolution input examples in this post. This makes upsampling for rendered content both an anti-aliasing and an interpolation problem, in contrast to the denoising and deblurring problem that is well studied in existing super-resolution research by the computer vision community. The fact that the input images are highly aliased and that information is completely missing at the pixels to be interpolated presents significant challenges for producing high-fidelity and temporally coherent reconstruction for rendered content.
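To illustrate why point samples produce aliasing, here is a toy sketch in plain Python. It compares one sample per pixel against an average of 16 samples per pixel on a hypothetical diagonal edge. The scene function, the image size, and the box filter are assumptions made for illustration only; they are not the rendering or anti-aliasing setup used in the paper.

```python
def inside_shape(x, y):
    """Hypothetical scene: the region below the diagonal edge y = 0.35 * x is covered."""
    return 1.0 if y < 0.35 * x else 0.0


def point_sampled(width, height):
    """Take one sample at each pixel center, so every pixel is fully on or off."""
    return [[inside_shape(px + 0.5, py + 0.5) for px in range(width)]
            for py in range(height)]


def supersampled(width, height, n=4):
    """Average n * n evenly spaced samples per pixel (a simple box filter)."""
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            total = 0.0
            for sy in range(n):
                for sx in range(n):
                    total += inside_shape(px + (sx + 0.5) / n, py + (sy + 0.5) / n)
            row.append(total / (n * n))
        image.append(row)
    return image


aliased = point_sampled(8, 4)    # values are only 0.0 or 1.0: a jagged staircase
reference = supersampled(8, 4)   # edge pixels hold fractional coverage values
```

The point-sampled image contains no partial coverage along the edge, which is the information a supersampling method has to reconstruct.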

Example rendering attributes used as input to the neural supersampling method — color, depth, and dense motion vectors — rendered at a low resolution.

On the other hand, in real-time rendering, we can have more than the color imagery produced by a camera. As we showed in DeepFocus, modern rendering engines also give auxiliary information, such as depth values. We observed that, for neural supersampling, the additional auxiliary information provided by motion vectors proved particularly impactful. The motion vectors define geometric correspondences between pixels in sequential frames. In other words, each motion vector points to a subpixel location where a surface point visible in one frame could have appeared in the previous frame. These values are normally estimated by computer vision methods for photographic images, but such optical flow estimation algorithms are prone to errors. In contrast, the rendering engine can produce dense motion vectors directly, thereby giving a reliable, rich input for neural supersampling applied to rendered content.
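As a rough illustration of how motion vectors relate pixels across frames, the sketch below warps a previous frame into alignment with the current one using bilinear sampling. The sign convention (each vector is the offset from a current-frame pixel to its location in the previous frame), the list-of-rows image layout, and the border clamping are all assumptions made here; engines differ on these details, and this is not the network's actual warping code.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a list-of-rows image at subpixel (x, y), clamping at the borders."""
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp_previous_frame(previous, motion):
    """Align the previous frame to the current frame.

    Assumed convention: motion[y][x] = (dx, dy) is the offset from pixel (x, y)
    in the current frame to the same surface point in the previous frame.
    """
    height, width = len(previous), len(previous[0])
    return [[bilinear_sample(previous, x + motion[y][x][0], y + motion[y][x][1])
             for x in range(width)]
            for y in range(height)]


# A one-row frame whose content moved half a pixel to the right between frames.
previous = [[0, 10, 20, 30]]
motion = [[(-0.5, 0.0)] * 4]
warped = warp_previous_frame(previous, motion)
print(warped)  # [[0.0, 5.0, 15.0, 25.0]]
```

This sketch ignores disocclusions, where a surface visible in the current frame was hidden in the previous one and the warped value is therefore not meaningful.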

Our method is built upon the above observations, and combines the additional auxiliary information with a novel spatio-temporal neural network design that is aimed at maximizing the image and video quality while delivering real-time performance.

At inference time, our neural network takes as input the rendering attributes (color, depth map and dense motion vectors per frame) of both current and multiple previous frames, rendered at a low resolution. The output of the network is a high-resolution color image corresponding to the current frame. The network is trained with supervised learning. At training time, each low-resolution input frame is paired with a reference image rendered at the target high resolution with anti-aliasing methods, which serves as the target image for training optimization.
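One way to picture the inference inputs is a rolling buffer that holds the attributes of the current frame plus a fixed number of previous frames. The sketch below is a minimal illustration of that bookkeeping only. The `Frame` and `FrameHistory` names are hypothetical, and `num_previous` is a free parameter because this post does not state how many previous frames are used.

```python
from collections import deque, namedtuple

# Low-resolution rendering attributes for one frame (field names are illustrative).
Frame = namedtuple("Frame", ["color", "depth", "motion"])


class FrameHistory:
    """Keeps the current frame and the most recent `num_previous` frames."""

    def __init__(self, num_previous):
        # The oldest frame is dropped automatically once the buffer is full.
        self.frames = deque(maxlen=num_previous + 1)

    def push(self, frame):
        self.frames.append(frame)

    def is_ready(self):
        """True once enough frames have been rendered to fill the buffer."""
        return len(self.frames) == self.frames.maxlen

    def network_inputs(self):
        """Return the buffered frames ordered from current to oldest."""
        return list(reversed(self.frames))


history = FrameHistory(num_previous=2)
for i in range(5):
    # Strings stand in for the actual low-resolution image buffers.
    history.push(Frame(color=f"c{i}", depth=f"d{i}", motion=f"m{i}"))

colors = [frame.color for frame in history.network_inputs()]
print(colors)  # ['c4', 'c3', 'c2']
```

In a real pipeline the buffered entries would be image tensors, and the previous frames would typically be aligned to the current one using the motion vectors before being passed to the network.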

Example results. From top to bottom: the rendered low-resolution color input, the 16x supersampling result by the introduced method, and the target high-resolution image rendered offline.

Example results. From left to right: the rendered low-resolution color input, the 16x supersampling result by the introduced method, and the target high-resolution image rendered offline.

What’s next?

Neural rendering has great potential for AR/VR. While the problem is challenging, we would like to encourage more researchers to work on this topic. As AR/VR displays reach toward higher resolutions, faster frame rates, and enhanced photorealism, neural supersampling methods may be key for reproducing sharp details by inferring them from scene data, rather than directly rendering them. This work points toward a future for high-resolution VR that isn’t just about the displays, but also the algorithms required to practically drive them.

Read the full paper: Neural Supersampling for Real-time Rendering, Lei Xiao, Salah Nouri, Matt Chapman, Alexander Fix, Douglas Lanman, Anton Kaplanyan, ACM SIGGRAPH 2020.