From NeRF to 3D Gaussian Splatting, the Frontiers of Text-to-3D

Mr. For Example
6 min read · Nov 19, 2023


Text-to-3D: generating countless 3D meshes by simply providing a few text descriptions. It was commonly dismissed as wild sci-fi imagination just a few years ago, and now it’s almost within our reach!

This article walks you through the powerhouses behind these technical miracles and the current state-of-the-art Text-to-3D methods.

Neural Radiance Field (NeRF)

NeRF is a popular approach to inverse rendering in which a volumetric raytracer is combined with a neural mapping from spatial coordinates to color and volumetric density.

Originally, NeRF was found to work well for “classic” 3D reconstruction tasks: many images of a scene are provided as input to a model, and a NeRF is optimized to recover the geometry of that specific scene, which allows for novel views of that scene from unobserved angles to be synthesized.
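To make that mapping concrete, here is a minimal sketch of NeRF’s volume-rendering quadrature along a single ray. It is a sketch of the standard formulation rather than any specific codebase, and the `nerf_mlp` callable is a hypothetical stand-in for the actual network:

```python
import numpy as np

def render_ray(nerf_mlp, origin, direction, t_near=2.0, t_far=6.0, n_samples=64):
    """Approximate the volume-rendering integral along one camera ray.

    nerf_mlp(points, view_dir) -> (sigma, rgb): any callable mapping 3D
    sample points (plus a view direction) to density and color.
    """
    # Sample depths along the ray and the corresponding 3D points.
    t = np.linspace(t_near, t_far, n_samples)
    points = origin[None, :] + t[:, None] * direction[None, :]

    sigma, rgb = nerf_mlp(points, direction)  # shapes (n,), (n, 3)

    # alpha_i = 1 - exp(-sigma_i * delta_i); T_i = prod_{j<i} (1 - alpha_j)
    # is the transmittance, i.e. how much light survives to sample i.
    delta = np.diff(t, append=t_far + 1e10)
    alpha = 1.0 - np.exp(-sigma * delta)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = trans * alpha

    # The pixel color is the transmittance-weighted sum of sample colors.
    return (weights[:, None] * rgb).sum(axis=0)
```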

3D Gaussian Splatting (3DGS)

3DGS is, at its core, a rasterization technique. That means:

  1. Have data describing the scene.
  2. Draw the data on the screen.

The rasterization involves the following steps, sketched in code after this list:

  1. Project each Gaussian into 2D from the camera perspective.
  2. Sort the Gaussians by depth.
  3. For each pixel, iterate over each Gaussian front-to-back, blending them together.
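Here is a deliberately naive NumPy sketch of that loop, using isotropic Gaussians and a hypothetical `cam.project` helper; the real 3DGS rasterizer uses anisotropic 2D covariances, tile-based sorting, and a CUDA kernel:

```python
import numpy as np

def splat(means3d, colors, opacities, radii2d, cam, H, W):
    """Toy front-to-back splatting of isotropic 2D Gaussians."""
    # 1. Project each Gaussian center into the image plane.
    xy, depth = cam.project(means3d)      # (N, 2) pixel coords, (N,) depths

    # 2. Sort the Gaussians by depth, front to back.
    order = np.argsort(depth)

    image = np.zeros((H, W, 3))
    transmittance = np.ones((H, W))       # how much light still passes per pixel

    # 3. For each pixel, blend the Gaussians together front to back.
    ys, xs = np.mgrid[0:H, 0:W]
    for i in order:
        d2 = (xs - xy[i, 0]) ** 2 + (ys - xy[i, 1]) ** 2
        alpha = opacities[i] * np.exp(-0.5 * d2 / radii2d[i] ** 2)
        image += (transmittance * alpha)[..., None] * colors[i]
        transmittance *= 1.0 - alpha
    return image
```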

Procedure of the 3DGS algorithm:

  1. Use the Structure from Motion (SfM) method to estimate a point cloud from a set of images
  2. Convert each point to a 3D Gaussian; however, only position and color can be inferred from the SfM data, so the remaining parameters must be trained to yield high-quality results (a toy initialization sketch follows this list)
  3. The training procedure uses Stochastic Gradient Descent, similar to a neural network, but without the layers.
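A toy sketch of that initialization step (the real implementation stores color as spherical-harmonics coefficients and sets scales from nearest-neighbor distances; the constants below are placeholders):

```python
import numpy as np

def init_gaussians(sfm_points, sfm_colors):
    """One Gaussian per SfM point: position and color come from the data,
    everything else starts at a generic value and is learned."""
    n = len(sfm_points)
    return {
        "position": sfm_points.copy(),                # from SfM
        "color": sfm_colors.copy(),                   # from SfM
        "scale": np.full((n, 3), 0.01),               # placeholder, trained
        "rotation": np.tile([1.0, 0, 0, 0], (n, 1)),  # identity quaternion
        "opacity": np.full(n, 0.1),                   # placeholder, trained
    }
```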

The training steps are:

  1. Rasterize the gaussians to an image using Differentiable Gaussian Rasterization
  2. Calculate the loss based on the difference between the rasterized image and ground truth image
  3. Adjust the gaussian parameters according to the loss
  4. Apply automated densification and pruning

Steps 1–3 are conceptually pretty straightforward. Step 4 involves the following:

  • If the gradient is large for a given Gaussian (i.e. it’s too wrong), split or clone it
  • If the Gaussian is small, clone it
  • If the Gaussian is large, split it
  • If the alpha of a Gaussian gets too low, remove it

This procedure helps the Gaussians better fit fine-grained details while pruning unnecessary Gaussians; the code sketch below puts all four training steps together.

It’s also essential that the rasterizer is differentiable, so that it can be trained with stochastic gradient descent. However, this is only relevant for training: the trained Gaussians can also be rendered with a non-differentiable approach.
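Putting steps 1–4 together, the loop looks roughly like this. It is a PyTorch-style sketch: `rasterize`, the `gaussians` container with its `clone`/`split`/`prune` methods, and the threshold values are hypothetical stand-ins for the paper’s implementation:

```python
import torch

def train(gaussians, views, optimizer, steps=30_000,
          grad_thresh=2e-4, size_thresh=0.01, alpha_min=0.005):
    for step in range(steps):
        view = views[step % len(views)]

        # 1. Differentiable rasterization of the current Gaussians.
        rendered = rasterize(gaussians, view.camera)

        # 2. Loss against the ground-truth photo (3DGS uses L1 + D-SSIM;
        #    plain L1 here for brevity).
        loss = (rendered - view.image).abs().mean()

        # 3. Adjust the Gaussian parameters according to the loss.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # 4. Periodic densification and pruning.
        if step % 100 == 0:
            for g in list(gaussians):
                if g.position.grad.norm() > grad_thresh:   # "too wrong"
                    if g.scale.max() < size_thresh:
                        gaussians.clone(g)                 # small: clone
                    else:
                        gaussians.split(g)                 # large: split
                if g.opacity < alpha_min:
                    gaussians.prune(g)                     # transparent: remove
```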

Dreamers of Text-to-3D

First Dreamer

About a year ago, DreamFusion: Text-to-3D using 2D Diffusion came out as the first work to leverage a pretrained 2D text-to-image diffusion model as a prior for optimizing a randomly initialized 3D model (a NeRF) via gradient descent, such that its 2D renderings from random angles look like images generated by the 2D diffusion model.

One of the main contributions of DreamFusion is that it introduced the Score Distillation Sampling (SDS) loss.

The SDS loss ensures that the 3D model looks like a good image when rendered from random angles. It achieves this as follows (sketched after this list):

  1. The 3D model is specified as a differentiable image parameterization (DIP), where a differentiable generator g transforms parameters θ to create an image x = g(θ).
  2. A rendered image of the 3D model is passed through the denoiser U-Net, and the result is backpropagated to update the parameters θ. In practice, the SDS loss omits the U-Net Jacobian term, because it is expensive to compute and including it did not produce more realistic samples.
  3. The “score” in Score Distillation Sampling refers to the negative predicted noise, scaled according to the timestep.
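For reference, the resulting SDS gradient as given in the DreamFusion paper, where w(t) is a timestep-dependent weighting, ε̂_φ is the U-Net’s noise prediction conditioned on the text prompt y, and the U-Net Jacobian has already been dropped:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\big(\phi, \mathbf{x} = g(\theta)\big)
\triangleq \mathbb{E}_{t,\epsilon}\!\left[
    w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{x}_t;\, y,\, t) - \epsilon\big)\,
    \frac{\partial \mathbf{x}}{\partial \theta}
\right]
```

And a minimal PyTorch-style sketch of one SDS update; the `diffusion.add_noise`, `diffusion.predict_noise`, and `diffusion.weight` methods are hypothetical stand-ins for whichever diffusion wrapper is used:

```python
import torch

def sds_step(render_fn, theta, diffusion, text_emb, optimizer):
    """One SDS update: render, add noise, denoise, then push the residual
    (eps_hat - eps) back through the renderer, skipping the U-Net Jacobian."""
    x = render_fn(theta)                      # x = g(theta), differentiable
    t = torch.randint(20, 980, (1,))          # random diffusion timestep
    eps = torch.randn_like(x)
    x_t = diffusion.add_noise(x, eps, t)      # forward diffusion process

    with torch.no_grad():                     # U-Net Jacobian term omitted
        eps_hat = diffusion.predict_noise(x_t, t, text_emb)

    # Treat w(t) * (eps_hat - eps) as dL/dx and backpropagate to theta.
    x.backward(gradient=diffusion.weight(t) * (eps_hat - eps))
    optimizer.step()
    optimizer.zero_grad()
```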

More Dreamers

On 28 Sep 2023, approximately one and a half months after the 3DGS paper dropped, two works were published that share similar design principles, both built on 3DGS and the training procedure at the core of DreamFusion.

One is DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

DreamGaussian observed that the Gaussians generated in the first stage often look blurry and lack detail, even with longer SDS training. This can be explained by the ambiguity of the SDS loss: since each optimization step may provide inconsistent 3D guidance, it is hard for the algorithm to correctly densify under-reconstructed regions or prune over-reconstructed regions, as it would in ordinary reconstruction. This observation led to the design of the mesh extraction and texture refinement stages.

Below are some generated results compared to other methods. Note that parts of the model are blurry where no consistent reference image exists at the corresponding camera angle, since this method also incorporates an MSE loss between the generated image and the input reference image during the UV-space texture refinement stage.

Another paper is GSGEN: Text-to-3D using Gaussian Splatting

To keep the shape of the generated 3D Gaussians while refining their appearance, GSGEN, instead of baking the Gaussians into a 3D mesh and refining the appearance directly in pixel space as DreamGaussian does, introduced compactness-based densification as a supplement to the positional gradient-based split in 3DGS.

The problem compactness-based densification tries to solve: due to the stochastic nature of the SDS loss, a small gradient threshold is prone to being misled by occasional large stochastic gradients, generating an excessive number of Gaussians, whereas a large threshold leads to a blurry appearance.
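The sketch below is one plausible reading of the compactness idea, not GSGEN’s actual code: for each Gaussian, look at its nearest neighbors, and wherever two Gaussians overlap (centers closer than the sum of their radii), spawn a new Gaussian between them, so under-reconstructed regions get filled without relying on noisy per-step SDS gradients:

```python
import numpy as np
from scipy.spatial import cKDTree

def compactness_densify(positions, radii, colors, k=3):
    """Hypothetical sketch of compactness-based densification."""
    tree = cKDTree(positions)
    _, idx = tree.query(positions, k=k + 1)     # +1: the first hit is self
    new_pos, new_rad, new_col = [], [], []
    for i, neighbors in enumerate(idx):
        for j in neighbors[1:]:
            if j <= i:                          # count each pair only once
                continue
            dist = np.linalg.norm(positions[i] - positions[j])
            if dist < radii[i] + radii[j]:      # the two Gaussians overlap
                new_pos.append((positions[i] + positions[j]) / 2)
                new_rad.append(min(radii[i], radii[j]))
                new_col.append((colors[i] + colors[j]) / 2)
    return np.array(new_pos), np.array(new_rad), np.array(new_col)
```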

For the appearance refinement stage to work, two additional losses are also introduced:

  1. Prune unnecessary Gaussians
    An extra loss regularizes opacity, with a weight proportional to each Gaussian’s distance from the center, and Gaussians with opacity smaller than a threshold αmin are removed periodically.
  2. Ensure geometry consistency
    An extra loss penalizes Gaussians that deviate significantly from the positions obtained during the preceding geometry optimization.

The loss function in the appearance refinement stage therefore combines the SDS loss with these two regularizers.
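A hedged LaTeX reconstruction from the descriptions above (the λ weights, the distance weighting w(·), the scene center c, and the anchor positions μᵢ^geom are my notation, not necessarily the paper’s):

```latex
\mathcal{L}_{\mathrm{refine}}
= \mathcal{L}_{\mathrm{SDS}}
+ \lambda_{\alpha} \sum_i w\!\big(\lVert \boldsymbol{\mu}_i - \mathbf{c} \rVert\big)\, \alpha_i
+ \lambda_{\mu} \sum_i \big\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_i^{\mathrm{geom}} \big\rVert^2
```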

Same Dream

It is not hard to see that the overall designs of both DreamGaussian and GSGEN fall into these two steps:

  1. Use a 2D diffusion model and 3DGS to generate coarse 3D Gaussians with the desired 3D shape
  2. Refine the 3D appearance details

This 1–2 combo follows a core principle of designing any complex automated system: if we can’t make the system smarter, then we make the task stupider!

In the future, we expect to see more research on each of these two steps.

