AI Text-to-3D Generation: Meta 3D Gen, OpenAI Shap-E, and more



The ability to generate 3D digital assets from text prompts is one of the most exciting recent developments in AI and computer graphics. With the 3D digital assets market predicted to grow from $28.3 billion in 2024 to $51.8 billion by 2029, text-to-3D AI models are poised to play a major role in revolutionizing content creation across industries like gaming, film, and e-commerce. But how exactly do these AI systems work? In this article, we take a closer look at the technical details behind text-to-3D generation.

Challenges of 3D Generation

Generating 3D assets from text is a much more complex task than generating 2D images. A 2D image is essentially a grid of pixels, whereas a 3D asset needs geometry, textures, materials, and often animations, all represented in three-dimensional space. This added dimensionality and complexity makes the generation task much harder.

The main challenges in generating 3D from text are:

  • Representing 3D geometry and structure
  • Generating consistent textures and materials across 3D surfaces
  • Ensuring physical plausibility and consistency across multiple viewpoints
  • Capturing fine details and overall structure at the same time
  • Generating assets that can be easily rendered or 3D printed

To address these challenges, text-to-3D systems leverage several key technologies and techniques.

Key Components of a Text-to-3D System

Most state-of-the-art text-to-3D generation systems share several core components.

  1. Text Encoding: Converting the input text prompt into a numerical representation
  2. 3D Representation: How 3D shape and appearance are represented
  3. Generative Model: The core AI model that produces the 3D asset
  4. Rendering: Converting the 3D representation into 2D images for visualization

Let's take a closer look at each one.

Text Encoding

The first step is to convert the input text prompt into a numerical representation that an AI model can process, which is typically done using large-scale language models such as BERT or GPT.
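
As a minimal sketch, assuming a BERT encoder as mentioned above (many systems use other encoders, such as CLIP's text tower), a prompt can be turned into a single embedding vector with the Hugging Face Transformers library by mean-pooling the token embeddings:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

prompt = "A red sports car"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**tokens)

# Mean-pool the per-token embeddings into one prompt embedding
text_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)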

3D Representation

There are several common ways to represent 3D geometry in AI models.

  1. Voxel Grid: A 3D array of values representing occupancy or features
  2. Point Cloud: A set of points in 3D space
  3. Mesh: Vertices and faces that define a surface
  4. Implicit Functions: A continuous function that defines a surface (e.g., a signed distance function)
  5. Neural Radiance Field (NeRF): A neural network representing density and color throughout 3D space

Each has tradeoffs in terms of resolution, memory usage, and ease of generation. Many recent models use implicit functions or NeRFs, as they enable high-quality results with reasonable computational requirements.

For example, we can represent a simple sphere as a signed distance function:

import numpy as np

def sphere_sdf(x, y, z, radius=1.0):
    # Negative inside the sphere, zero on the surface, positive outside
    return np.sqrt(x**2 + y**2 + z**2) - radius

# Evaluate the SDF at a 3D point
point = [0.5, 0.5, 0.5]
distance = sphere_sdf(*point)
print(f"Distance to sphere surface: {distance}")

Generative Model

The core of any text-to-3D system is a generative model that generates a 3D representation from a text embedding. Most state-of-the-art models use some variation of a diffusion model similar to those used in 2D image generation.

Diffusion models work by gradually adding noise to the data and learning to reverse this process. In the case of 3D generation, this process occurs in the space of a chosen 3D representation.

Simplified pseudocode for the diffusion model's training procedure looks like this:

import torch
import torch.nn.functional as F

def diffusion_training_step(model, x_0, text_embedding):
    # Sample a random timestep
    t = torch.randint(0, num_timesteps, (1,))
    # Add noise to the input according to the noise schedule
    noise = torch.randn_like(x_0)
    x_t = add_noise(x_0, noise, t)
    # Predict the noise from the noisy input, timestep, and text condition
    predicted_noise = model(x_t, t, text_embedding)
    # Compute loss
    loss = F.mse_loss(noise, predicted_noise)
    return loss

# Training loop (num_timesteps, add_noise, and encode_text are assumed
# to be provided by the surrounding framework)
for batch in dataloader:
    x_0, text = batch
    text_embedding = encode_text(text)
    loss = diffusion_training_step(model, x_0, text_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

During generation, we start from pure noise and iteratively denoise it, conditioned on the text embedding.
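
Below is a minimal, framework-agnostic sketch of this sampling loop. It assumes a denoising model with the same signature as in the training pseudocode above and a standard DDPM-style noise schedule; it is illustrative rather than any specific system's implementation.

import torch

@torch.no_grad()
def generate_3d(model, text_embedding, shape, num_timesteps=1000):
    # A common default DDPM noise schedule (real systems vary)
    betas = torch.linspace(1e-4, 0.02, num_timesteps)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(shape)  # start from pure noise
    for t in reversed(range(num_timesteps)):
        predicted_noise = model(x_t, torch.tensor([t]), text_embedding)
        # Estimate the mean of the less-noisy sample x_{t-1}
        coef = betas[t] / torch.sqrt(1.0 - alphas_cumprod[t])
        mean = (x_t - coef * predicted_noise) / torch.sqrt(alphas[t])
        if t > 0:
            x_t = mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
        else:
            x_t = mean
    return x_t  # denoised 3D representation (e.g., a latent or feature grid)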

Rendering

To visualize results and compute losses during training, the 3D representation must be rendered into 2D images. This is typically done with differentiable rendering techniques, which allow gradients to flow back through the rendering process.

For mesh-based representations, a rasterization-based renderer such as PyTorch3D's can be used:

import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRasterizer,
    MeshRenderer, SoftPhongShader, PointLights, TexturesVertex,
    look_at_view_transform,
)

def render_mesh(vertices, faces, image_size=256):
    # Place a camera looking at the origin
    R, T = look_at_view_transform(dist=2.7, elev=10, azim=0)
    cameras = FoVPerspectiveCameras(R=R, T=T)

    # Build a rasterization-based renderer with a simple Phong shader
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(
            cameras=cameras,
            raster_settings=RasterizationSettings(image_size=image_size),
        ),
        shader=SoftPhongShader(cameras=cameras, lights=PointLights()),
    )

    # Wrap the raw tensors in a Meshes object with plain white per-vertex colors
    textures = TexturesVertex(verts_features=torch.ones_like(vertices))
    mesh = Meshes(verts=vertices, faces=faces, textures=textures)
    return renderer(mesh)

# Example usage
vertices = torch.rand(1, 100, 3)  # Random vertices
faces = torch.randint(0, 100, (1, 200, 3))  # Random faces
rendered_images = render_mesh(vertices, faces)

For implicit representations like NeRFs, views are typically rendered using ray marching and volume rendering techniques.
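
As a rough sketch of the idea, here is volume rendering along a single ray, assuming a hypothetical nerf_model that maps 3D points to densities and colors (not any particular library's API):

import torch

def render_ray(nerf_model, ray_origin, ray_direction, near=0.1, far=4.0, num_samples=64):
    # Sample points along the ray
    t_vals = torch.linspace(near, far, num_samples)
    points = ray_origin + t_vals[:, None] * ray_direction  # (num_samples, 3)

    # Query the field: densities (num_samples,) and RGB colors (num_samples, 3)
    densities, colors = nerf_model(points)

    # Alpha-composite along the ray (standard volume rendering)
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])
    alphas = 1.0 - torch.exp(-densities * deltas)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas + 1e-10]), dim=0
    )[:-1]
    weights = alphas * transmittance
    return (weights[:, None] * colors).sum(dim=0)  # final RGB for this ray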

Putting it all together: The text-to-3D pipeline

Now that we've covered the main components, let's look at how they come together in a typical text-to-3D generation pipeline.

  1. Text Encoding: The input prompt is encoded into a dense vector representation using a language model.
  2. Initial Generation: A diffusion model conditioned on the text embedding generates an initial 3D representation (e.g., a NeRF or implicit function).
  3. Multi-view Consistency: The model renders multiple views of the generated 3D asset and enforces consistency between viewpoints.
  4. Refinement: Additional networks refine the geometry, add textures, and enhance details.
  5. Final Output: The 3D representation is converted into the required format (e.g., a textured mesh) for use in downstream applications.

Here's a simplified example of what this might look like in code (DiffusionModel, RefinerNetwork, and DifferentiableRenderer below are placeholders for real implementations):

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

class TextTo3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.text_encoder = BertModel.from_pretrained('bert-base-uncased')
        # Placeholder modules standing in for the actual generative components
        self.diffusion_model = DiffusionModel()
        self.refiner = RefinerNetwork()
        self.renderer = DifferentiableRenderer()

    def forward(self, text_prompt):
        # Encode text: tokenize, run BERT, and mean-pool the token embeddings
        tokens = self.tokenizer(text_prompt, return_tensors='pt')
        text_embedding = self.text_encoder(**tokens).last_hidden_state.mean(dim=1)

        # Generate initial 3D representation
        initial_3d = self.diffusion_model(text_embedding)

        # Render multiple views
        views = self.renderer(initial_3d, num_views=4)

        # Refine based on multi-view consistency
        refined_3d = self.refiner(initial_3d, views)

        return refined_3d

# Usage
model = TextTo3D()
text_prompt = "A red sports car"
generated_3d = model(text_prompt)

Top Text-to-3D Asset Generation Models

3DGen (Meta)

3DGen is designed to address the problem of generating 3D content such as characters, props, and scenes from a text description.

3DGen supports Physically Based Rendering (PBR), essential for realistic relighting of 3D assets in real-world applications. It also enables generative retexturing of previously generated or artist-created 3D shapes with a new text input. The pipeline integrates two core components: Meta 3D AssetGen and Meta 3D TextureGen, which handle text-to-3D generation and text-to-texture generation, respectively.
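
As a purely illustrative sketch of how these two stages chain together (asset_gen and texture_gen below are hypothetical stand-ins, not Meta's actual API):

def meta_3d_gen_style_pipeline(prompt, retexture_prompt=None):
    # Stage 1: text -> textured mesh with PBR material maps (hypothetical AssetGen stand-in)
    mesh = asset_gen(prompt)
    # Stage 2: refine the textures, or retexture an existing mesh from a new prompt
    # (hypothetical TextureGen stand-in)
    mesh = texture_gen(mesh, retexture_prompt or prompt)
    return mesh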

Meta 3D AssetGen

Meta 3D AssetGen (Siddiqui et al., 2024) is responsible for the initial generation of 3D assets from text prompts. This component generates a 3D mesh with textures and PBR material maps in about 30 seconds.

Meta 3D TextureGen

Meta 3D TextureGen (Bensadoun et al., 2024) refines the textures generated by AssetGen and can also generate new textures for existing 3D meshes from additional text descriptions. This stage takes about 20 seconds.

Point-E (OpenAI)

Point-E, developed by OpenAI, is another notable text-to-3D generative model. Unlike DreamFusion (Google's earlier text-to-3D model), which produces a NeRF representation, Point-E produces a 3D point cloud.

Key features of Point-E:

a) Two-Stage Pipeline: Point-E first uses a text-to-image diffusion model to generate a synthetic 2D view, then uses this image to condition a second diffusion model that generates a 3D point cloud (see the sketch after this list).

b) Efficiency: Point-E is designed to be computationally efficient, generating 3D point clouds within seconds on a single GPU.

c) Color Information: The model generates colored point clouds, capturing both geometric and appearance information.
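
Here is a purely illustrative sketch of the two-stage idea; the function names are hypothetical stand-ins, not the actual point-e API:

def point_e_style_pipeline(prompt):
    # Stage 1: text -> synthetic 2D view (hypothetical text-to-image diffusion model)
    synthetic_view = text_to_image_diffusion(prompt)
    # Stage 2: image -> colored 3D point cloud (hypothetical image-conditioned diffusion model)
    point_cloud = image_to_pointcloud_diffusion(synthetic_view)
    return point_cloud  # e.g., an (N, 6) array of xyz positions plus RGB colors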

Limitations:

  • Lower fidelity compared to mesh-based and NeRF-based approaches
  • Point clouds require additional processing in many downstream applications

Shap-E (OpenAI)

OpenAI has introduced Shap-E, which builds on Point-E to generate 3D meshes instead of point clouds, addressing some of Point-E's limitations while remaining computationally efficient.

Key features of Shap-E:

a) Implicit Representation: Shap-E learns to generate implicit representations (such as signed distance functions) of 3D objects.

b) Mesh Extraction: The model uses a differentiable implementation of the marching cubes algorithm to convert the implicit representation into a polygonal mesh (a standard version of this idea is sketched after this list).

c) Texture Generation: Shap-E can also generate textures for 3D meshes, resulting in visually appealing output.
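
To illustrate the mesh-extraction step, here is a minimal sketch using the standard (non-differentiable) marching cubes implementation from scikit-image, applied to the sphere SDF defined earlier; Shap-E itself uses its own differentiable variant:

import numpy as np
from skimage import measure

# Evaluate the sphere SDF on a regular 3D grid
grid = np.linspace(-1.5, 1.5, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 1.0

# Extract the zero level set (the surface) as a triangle mesh
verts, faces, normals, values = measure.marching_cubes(sdf, level=0.0)
print(f"Extracted mesh with {len(verts)} vertices and {len(faces)} faces")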

Advantages:

  • Fast generation time (seconds to minutes)
  • Direct mesh output for rendering and downstream applications
  • Ability to generate both geometry and textures

GET3D (NVIDIA)

Developed by researchers at NVIDIA, GET3D is another powerful text-to-3D generative model that focuses on generating high-quality textured 3D meshes.

Key features of GET3D:

a) Explicit Surface Representation: Unlike DreamFusion and Shap-E, GET3D directly generates an explicit surface representation (mesh) without any intermediate implicit representation.

b) Texture Generation: This model includes a differentiable rendering technique for learning and generating high-quality textures for 3D meshes.

c) GAN-based Architecture: GET3D uses a generative adversarial network (GAN) approach, which allows fast generation once the model is trained (see the sketch below).
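
As a purely illustrative sketch of GAN-style inference (Get3DStyleGenerator is a hypothetical stand-in, not NVIDIA's actual API), generation amounts to sampling latent codes and running a single forward pass, which is what makes inference fast:

import torch

# Hypothetical generator standing in for a trained GET3D-style model
generator = Get3DStyleGenerator()

z_geometry = torch.randn(1, 512)  # latent code controlling shape
z_texture = torch.randn(1, 512)   # latent code controlling appearance

# A single forward pass produces an explicit textured mesh
vertices, faces, texture = generator(z_geometry, z_texture)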

Advantages:

  • High quality geometry and textures
  • Fast inference time
  • Direct integration with 3D rendering engines

Limitations:

  • Requires 3D training data, which may be scarce for some object categories

Conclusion

AI text-to-3D generation represents a fundamental change in how 3D content is created and manipulated. Leveraging advanced deep learning techniques, these models can create complex, high-quality 3D assets from simple text descriptions. As the technology continues to evolve, we expect to see increasingly sophisticated, high-performance text-to-3D systems that will revolutionize industries ranging from games and film to product design and architecture.


