Original paper. How NVIDIA does it. Notes taken from the UTwente slides.
Learn ray-tracing. Like actually; this is the core of this method. I can definitely see it applied in gaming. fucking mega. Ray-Tracing in One Weekend.
PyTorch Implementation. Official Implementation Github.
Explanatory Video from Google (MIT), by the actual authors themselves. Fucking mega.
The concept itself sounds awesome. Let's see how it looks on paper. Also, Steven pointed out how this must have been a catalyst of all the fly-through videos we see today.
Before we start, NeRF is data hungry. It simply memorizes the world, but it's unable to generalize. If we give it a set of drums, it will know that set and not another one.
NERF
- Generating rays from camera poses
- Using an MLP to predict color and density for each point in space
- Rendering novel views of a scene with differentiable volumetric rendering
- Applying NeRF techniques to 3D reconstruction tasks
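The first step in the list above (rays from camera poses) can be sketched roughly like this. It is a minimal NumPy sketch, assuming a pinhole camera, a 4x4 camera-to-world matrix `c2w`, and a camera that looks down its own -z axis. The function name `get_rays` and the convention are my assumptions, not something from the slides.

```python
import numpy as np

def get_rays(height, width, focal, c2w):
    """Return per-pixel ray origins and unit directions in world space.

    Assumptions (hypothetical helper): pinhole camera, c2w is a 4x4
    camera-to-world matrix, the camera looks down its -z axis.
    """
    # Pixel grid: i runs along the width (x), j along the height (y).
    i, j = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64), indexing="xy")
    # Pixel -> direction in camera coordinates.
    dirs = np.stack([(i - 0.5 * width) / focal,
                     -(j - 0.5 * height) / focal,
                     -np.ones_like(i)], axis=-1)
    # Rotate the directions into world space and normalize them.
    rays_d = dirs @ c2w[:3, :3].T
    rays_d /= np.linalg.norm(rays_d, axis=-1, keepdims=True)
    # Every ray starts at the camera center (the translation part of c2w).
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape).copy()
    return rays_o, rays_d

# Identity pose: camera at the origin, looking down -z.
rays_o, rays_d = get_rays(4, 6, focal=5.0, c2w=np.eye(4))
```

Each pixel gets one origin and one direction, so a point on the ray is `rays_o + t * rays_d` for a depth `t`.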
Rendering
Rendering is the process of generating a photorealistic or non-photorealistic image from input data such as 3D models.
- Forward rendering means taking a scene defined with geometry, materials, cameras, and lights, and generating an image.
- Inverse rendering starts from the rendered image, compares it to some form of ground truth image, and updates the parameters of the scene elements with the loss that compares the images.
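To make the forward/inverse distinction concrete, here is a toy sketch (not NeRF itself). The "scene" is just one unknown RGB albedo, the forward renderer multiplies it by a known shading map, and inverse rendering recovers the albedo by gradient descent on the image loss. The renderer, the shading map, and all the numbers are made up for illustration.

```python
import numpy as np

def render(albedo, shading):
    """Forward rendering: scene parameters -> image of shape (H, W, 3)."""
    return shading[..., None] * albedo

rng = np.random.default_rng(0)
shading = rng.uniform(0.2, 1.0, size=(8, 8))          # known lighting term
target = render(np.array([0.8, 0.3, 0.1]), shading)   # "ground truth" image

# Inverse rendering: start from a guess and update it with the image loss.
albedo = np.full(3, 0.5)
lr = 0.5
for step in range(200):
    residual = render(albedo, shading) - target
    loss = np.mean(residual ** 2)
    # Analytic gradient of the mean squared error w.r.t. each albedo channel.
    grad = 2.0 * np.sum(residual * shading[..., None], axis=(0, 1)) / residual.size
    albedo -= lr * grad
```

NeRF follows the same loop, except that the scene parameters are the MLP weights and the renderer is the volume rendering integral.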
Volumetric Rendering
Radiance
Radiance $L$ is the amount of light (differential energy) that you can see per unit area, per unit solid angle $\omega$, at wavelength $\lambda$.
- $L(\mathbf{x}, \omega, \lambda)$ (position, direction, wavelength)
Radiance along an unblocked ray is constant (energy conservation). The "Light Field" is the radiance for every possible ray. I still don't understand what this all means..
Novel View Synthesis:
Input of NeRF is 5D data
- input is a single continuous 5D coordinate
- spatial location $(x, y, z)$
- and viewing direction $(\theta, \phi)$
- We can omit the wavelength in the inputs. We only need 5 parameters.
- output is the volume density $\sigma$ and view-dependent emitted radiance $(r, g, b)$ at that spatial location.
- Volume density represents how much "stuff" is at a specific 3D point $(x, y, z)$. If the density is high, the point is likely part of a solid object; if it is zero, the point is empty space.
- Radiance is simply the color and brightness of the light emitted from a specific point $(x, y, z)$ in a specific direction $(\theta, \phi)$.
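Now that we have the color and density $(r, g, b, \sigma)$, we calculate the color of every camera ray using the following, where $\sigma(t)$ and $c(t)$ are the density and color at depth $t$ along the ray, and $t_n$, $t_f$ are the near and far bounds:

$$C = \int_{t_n}^{t_f} T(t)\,\sigma(t)\,c(t)\,dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(s)\,ds\right)$$

Explaining the integral: as a camera ray travels through the "cloud", it collects color ($c$) from every point. However, that color is weighted by the density ($\sigma$) at that spot and the transmittance ($T$).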
- Transmittance is essentially "how much light can still reach the camera from this point without being blocked by objects in front of it". If the ray hits a high-density wall, the transmittance for anything behind that wall drops to zero.
- The probability of no hits before $t$ is the exponential of the negative integral over density up until $t$: $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(s)\,ds\right)$.
- Volume density $\sigma(t)$ gives the probability that the ray stops in a small interval $dt$ around $t$, namely $\sigma(t)\,dt$.
- If a ray traveling through the scene hits a particle at $t$, we return this radiance/color $c(t)$.
Absorption = the medium blocks light and doesn't let it get through
Scattering = the medium redirects light in different directions, like frosted glass
Emission = the medium itself produces light (e.g. fire or a glowing gas)
Volumetric Formulation for NeRF
$p(t) = T(t)\,\sigma(t)$ is the probability density function (PDF) that represents the probability that a ray stops at $t$: it travels unblocked up to $t$ and then hits something there.
Expected color along a ray is a convex combination of colors $c_i$: $C \approx \sum_i w_i c_i$, with weights $w_i \ge 0$ that sum to at most 1. By now I should be able to understand what each of those terms represents.
To implement NeRF in practice, we can't solve a continuous integral exactly on a computer. Instead, we discretize the ray into segments to numerically estimate the accumulated color.
Computing the color for a set of rays through the pixels of an image yields a rendered image.
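The discretized version can be sketched as follows. For each segment with density $\sigma_i$ and length $\delta_i$, the opacity is $\alpha_i = 1 - e^{-\sigma_i \delta_i}$, the transmittance is $T_i = \prod_{j<i}(1 - \alpha_j)$, and the ray color is $\sum_i T_i \alpha_i c_i$. This is a minimal NumPy sketch for a single ray. The function name `render_ray` and the toy scene are my own.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals, t_far):
    """Numerically estimate the color of one ray.

    sigmas: (N,) densities, colors: (N, 3) RGB, t_vals: (N,) sorted sample
    depths, t_far: end of the last segment.
    """
    deltas = np.diff(t_vals, append=t_far)       # segment lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)      # chance of stopping in each segment
    # T_i: probability of getting through all earlier segments unblocked.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                     # w_i = T_i * alpha_i
    return weights @ colors, weights

# Toy ray: empty space, then a dense red wall, then a dense blue wall behind it.
t_vals = np.array([2.0, 3.0, 4.0, 5.0])
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
color, weights = render_ray(sigmas, colors, t_vals, t_far=6.0)
```

The red wall takes almost all of the weight, so the blue wall behind it contributes nothing. That is the transmittance dropping to zero.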
From the author: we first lay down points evenly and those points are supposed to tell us where the stuff is. If we get more opaque values at the beginning of the ray, then we know something is there and we can focus on that area. And so we "train" the model to understand this. (I say "train" because an MLP is really small. It's more about optimizing the weights)
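The "focus on that area" step can be sketched as inverse-transform sampling. Treat the weights from the evenly spaced (coarse) pass as a piecewise-constant PDF over depth bins and draw extra samples from it. This is a rough NumPy sketch. The function name `sample_fine`, the small `1e-5` floor, and the toy weights are my assumptions.

```python
import numpy as np

def sample_fine(bin_edges, weights, n_fine, rng):
    """Draw n_fine depths from the piecewise-constant PDF defined by weights.

    bin_edges: (N + 1,) depth bin boundaries, weights: (N,) coarse weights.
    """
    # Small floor so empty bins keep the CDF strictly increasing.
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_fine)
    # Invert the CDF: linear interpolation is exact for a piecewise-constant PDF.
    return np.sort(np.interp(u, cdf, bin_edges))

rng = np.random.default_rng(0)
bin_edges = np.linspace(2.0, 6.0, 9)               # 8 coarse bins of width 0.5
coarse_weights = np.array([0.0, 0.0, 0.0, 0.0, 0.9, 0.05, 0.0, 0.0])
t_fine = sample_fine(bin_edges, coarse_weights, n_fine=64, rng=rng)
```

Nearly all of the new samples land between depth 4 and 5, where the coarse pass found something opaque.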
Neural Networks as representations
Basically, slap a coordinate-based MLP to give us the (r,g,b) for each coordinate $(x, y, z)$. Instead of saving colors in a grid, you train a neural network to behave like a lookup table: you give it a coordinate, and it "remembers" the color and density at that spot.
The Spectral Bias Problem
A standard neural network has a natural "Spectral Bias", which means it is biased toward learning smooth, low-frequency functions first. When you try to train a basic MLP to represent an image or a 3D scene directly from raw coordinates (x,y,z), the results tend to be blurry. The network can easily learn the general "blobs" of color but struggles to learn sharp edges, fine textures, or small details. This is why the "Naive Approach" fails.
The solution in this case is to make use of the positional encoding concept from transformers (the sin and cos functions with exponentially increasing frequencies) to map the low-dimensional input coordinates into a much higher-dimensional space before they enter the network. The lower frequencies (e.g. $2^0 \pi$) help the network understand the overall shape, while the higher frequencies (e.g. $2^{L-1} \pi$) provide the "hooks" necessary for the network to anchor sharp details and complex textures. This is such a nice application of this concept..
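A minimal NumPy sketch of that encoding, assuming frequencies $2^k \pi$ for $k = 0, \dots, L-1$ applied to each coordinate separately. The function name is mine, and the output ordering (all sines, then all cosines, per coordinate) is just one possible layout.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Map coordinates (..., D) to features (..., 2 * D * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi    # 2^0*pi, ..., 2^(L-1)*pi
    angles = p[..., None] * freqs                    # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*p.shape[:-1], -1)

points = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 3))
encoded = positional_encoding(points, num_freqs=10)    # 3 coords -> 60 features
at_origin = positional_encoding(np.zeros(3), num_freqs=2)
```

With 10 frequencies, a 3D point becomes a 60-dimensional vector. At the origin every sine is 0 and every cosine is 1.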
Up until now, NeRF = volume rendering + coordinate-based network
Including the ray direction in the input to the MLP allows for capturing and rendering view-dependent effects (e.g., shiny surfaces).
So we're going to store the values along the ray, and put it all together. The loss function is your classical MSE.
Safe to say that when I'm using lots of pictures, I haven't really mastered the concept. But this concept is so vast and includes so many terms like SfM, ray tracing, ray marching, etc. It's very advanced stuff.
A quick summary of NN representations in NeRF. From Gemini.
- Ray Sampling: For every pixel in a target image, a camera ray is cast into the scene.
- Point Evaluation: Multiple points along that ray are sampled. Each 3D point $(x, y, z)$ and the viewing direction $(\theta, \phi)$ are passed through the Positional Encoding and then the MLP.
- Output Generation: The MLP outputs the color $(r, g, b)$ and density $\sigma$ for each point.
- Volume Rendering: These values are integrated along the ray to calculate the final pixel color.
- Loss Calculation: The rendered color is compared against the ground truth using a squared error loss.
- Optimization: The network uses gradient descent to update its weights until it can accurately reproduce the scene from any of the input camera views.
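The steps above can be strung together in a forward-pass-only sketch. Everything here is an assumption made for illustration: a tiny randomly initialized MLP, the direction fed in as a 3D unit vector instead of two angles, made-up layer sizes and sample counts, and random "ground truth" colors. The optimization step is left out because plain NumPy has no automatic differentiation; in practice that part is usually handled by a framework such as PyTorch or JAX.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(p, num_freqs):
    """Positional encoding: (..., D) -> (..., 2 * D * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = p[..., None] * freqs
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

def init_mlp(sizes):
    """Random (weight, bias) pairs for a small fully connected network."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)              # ReLU hidden layers
    w, b = params[-1]
    return x @ w + b                                # raw (r, g, b, sigma)

def render_rays(params, rays_o, rays_d, near, far, n_samples):
    # Point evaluation: evenly spaced samples along each ray.
    t = np.linspace(near, far, n_samples)
    pts = rays_o[:, None, :] + t[None, :, None] * rays_d[:, None, :]   # (R, N, 3)
    dirs = np.broadcast_to(rays_d[:, None, :], pts.shape)
    out = mlp(params, np.concatenate([encode(pts, 4), encode(dirs, 2)], axis=-1))
    # Output generation: squash color into [0, 1], keep density non-negative.
    rgb = 1.0 / (1.0 + np.exp(-out[..., :3]))
    sigma = np.maximum(out[..., 3], 0.0)
    # Volume rendering: alpha compositing along each ray.
    alpha = 1.0 - np.exp(-sigma * (t[1] - t[0]))
    ones = np.ones_like(alpha[:, :1])
    trans = np.cumprod(np.concatenate([ones, 1.0 - alpha[:, :-1]], axis=1), axis=1)
    weights = trans * alpha
    return np.sum(weights[..., None] * rgb, axis=1)                     # (R, 3)

# Ray sampling: 5 made-up rays from the origin with random unit directions.
rays_o = np.zeros((5, 3))
rays_d = rng.normal(size=(5, 3))
rays_d /= np.linalg.norm(rays_d, axis=-1, keepdims=True)

params = init_mlp([3 * 2 * 4 + 3 * 2 * 2, 32, 32, 4])   # 36 encoded inputs
pred = render_rays(params, rays_o, rays_d, near=2.0, far=6.0, n_samples=16)

# Loss calculation: squared error against (here random) ground-truth colors.
target = rng.uniform(size=(5, 3))
loss = np.mean((pred - target) ** 2)
```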
Notes from class
Structure from Motion (SfM): SfM aims to reconstruct the 3D structure of a scene and the camera's motion from a sequence of images. This is how the camera motion (the poses NeRF needs as input) is estimated.
The mathematics is not on the exam