This post reviews the paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" by Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, et al.
Goal
Take a set of input images of a scene captured from different angles (multi-view input images)
➡️ Optimize NeRF: fit a neural representation of the 3D scene
➡️ Render new views of the scene with smooth transitions

Network architecture
"How the single point would be viewed from a certain angle?"
Input: Single-pixel point(x, y, z, θ, φ), Viewing angle
Network(F): Simple fully connected layers(9 layers)
Output(r, g, b, σ-volume density): color, volume density(Is there something actually here?)
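To make the input/output interface concrete, here is a minimal PyTorch sketch of this kind of network. The `TinyNeRF` class, its layer count, and its widths are illustrative assumptions, not the paper's exact architecture (which uses 8 hidden layers of width 256 with a skip connection, plus a separate color branch).

```python
# Minimal sketch of a NeRF-style MLP: position -> density, (feature + view direction) -> color.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, pos_dim=3, dir_dim=3, hidden=256):
        super().__init__()
        # Trunk depends only on position (x, y, z).
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # volume density σ
        # Color depends on the trunk feature and the viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3),                    # r, g, b
        )

    def forward(self, xyz, view_dir):
        feat = self.trunk(xyz)
        sigma = torch.relu(self.sigma_head(feat))         # density must be non-negative
        rgb = torch.sigmoid(self.color_head(torch.cat([feat, view_dir], dim=-1)))
        return rgb, sigma

# Query: 3D points seen from given directions -> (r, g, b) and σ per point.
model = TinyNeRF()
rgb, sigma = model(torch.rand(4, 3), torch.rand(4, 3))
```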

Key Idea
- Overfit the network to the one particular scene we're interested in (an unusual concept: one trained model per scene)
How does it all work?
1. Sample points along each camera ray (one ray per pixel) using stratified random sampling (see the sampling sketch after this list)

2. Send each 5D input (x, y, z, θ, φ) through the network to get its color and volume density

3. Volume rendering
- Using techniques from classical volume rendering, the color of any ray passing through the scene is composited from the sampled colors and densities (see the compositing sketch after this list):

C(r) = ∫_{tn}^{tf} T(t) · σ(r(t)) · c(r(t), d) dt,   where   T(t) = exp( −∫_{tn}^{t} σ(r(s)) ds )

- C(r): the expected color of camera ray r
- camera ray r(t) = o + t·d, with origin o and direction d
- t: ray distance from the point of view
- tn and tf: near and far bounds of the integration
- T(t): accumulated transmittance, i.e. the probability that the ray travels from tn to t without hitting anything
4. Rendering Loss
Optimize the scene representation by minimizing the squared error between the synthesized pixel colors (both fine and coarse renderings) and the ground-truth colors of the observed images (see the loss function in the compositing sketch below).
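Step 1 in code: a short sketch of stratified sampling along a ray, assuming rays parameterized as r(t) = o + t·d with scalar near/far bounds. The function names and shapes are my own choices, not the authors' code.

```python
# Step 1 sketch: stratified sampling of t values along each camera ray.
import torch

def stratified_samples(t_near, t_far, n_samples, n_rays):
    # Split [t_near, t_far] into n_samples even bins and draw one random t per bin.
    bins = torch.linspace(t_near, t_far, n_samples + 1)
    lower, upper = bins[:-1], bins[1:]
    u = torch.rand(n_rays, n_samples)
    return lower + u * (upper - lower)                 # (n_rays, n_samples)

def ray_points(origins, dirs, t_vals):
    # r(t) = o + t*d for every sampled t on every ray.
    return origins[:, None, :] + t_vals[..., None] * dirs[:, None, :]
```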
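Steps 3 and 4 in code: a sketch of the numerical version of the compositing integral (alpha compositing of the network's (rgb, σ) outputs along each ray) and the photometric loss. Again, names and tensor shapes are assumptions for illustration.

```python
# Steps 3-4 sketch: discrete volume rendering along rays, plus the rendering loss.
import torch

def composite(rgb, sigma, t_vals):
    # rgb: (rays, samples, 3), sigma: (rays, samples), t_vals: (rays, samples)
    deltas = t_vals[:, 1:] - t_vals[:, :-1]                       # distances between samples
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                      # opacity of each segment
    # Transmittance T_i = prod_{j<i} (1 - alpha_j): light surviving up to sample i.
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans
    expected_color = (weights[..., None] * rgb).sum(dim=1)        # C(r) per ray
    return expected_color, weights

def rendering_loss(pred_coarse, pred_fine, gt_rgb):
    # Total squared error of both renderings against the ground-truth pixel colors.
    return ((pred_coarse - gt_rgb) ** 2).sum() + ((pred_fine - gt_rgb) ** 2).sum()
```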

Trick #1: Positional encoding
A plain MLP fed with raw low-dimensional coordinates tends to learn only low-frequency (blurry) functions and misses fine geometric and texture detail.
So, before feeding the coordinates to the network, they map each one to a higher-dimensional space using sin and cos functions at increasing frequencies.
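A short sketch of this encoding, assuming the frequencies 2^k · π used in the paper (L = 10 frequencies for positions, L = 4 for directions); the function name and shapes are illustrative.

```python
# Positional encoding sketch: p -> (sin(2^0 πp), cos(2^0 πp), ..., sin(2^{L-1} πp), cos(2^{L-1} πp))
import math
import torch

def positional_encoding(x, n_freqs):
    # x: (..., dim) raw coordinates -> (..., dim * 2 * n_freqs) encoded features
    freqs = (2.0 ** torch.arange(n_freqs)) * math.pi     # 2^k * π
    angles = x[..., None] * freqs                        # (..., dim, n_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)
```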
Trick #2: Hierarchical volume sampling
They first sample a set of Nc locations along each ray using stratified sampling and evaluate them with the coarse network.
Then, guided by the coarse network's output, they sample Nf additional points concentrated around the parts of the ray where something is actually visible, and evaluate them with the fine network.
They take the squared error between the rendered and true pixel colors for both the coarse and fine renderings.
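A sketch of the second stage: treat the coarse network's normalized compositing weights as a piecewise-constant PDF along the ray and draw the Nf extra samples by inverse-transform sampling. This is simplified relative to the official implementation; `sample_fine` and the use of coarse bin edges are my assumptions.

```python
# Hierarchical sampling sketch: draw Nf new t values where the coarse weights are large.
import torch

def sample_fine(bin_edges, weights, n_fine):
    # bin_edges: (rays, n_coarse + 1) edges of the coarse bins, weights: (rays, n_coarse)
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-10)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[:, :1]), cdf], dim=-1)    # (rays, n_coarse + 1)
    u = torch.rand(cdf.shape[0], n_fine)                            # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    lo, hi = cdf.gather(-1, idx - 1), cdf.gather(-1, idx)
    t_lo, t_hi = bin_edges.gather(-1, idx - 1), bin_edges.gather(-1, idx)
    frac = (u - lo) / (hi - lo + 1e-10)                             # position inside the chosen bin
    return t_lo + frac * (t_hi - t_lo)                              # (rays, n_fine) new t values
```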

Reference
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., Ng, R. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV 2020. https://arxiv.org/abs/2003.08934