Why 3D computer vision is harder than traditional machine learning



When viewed from the side, a hat looks like a hat; when viewed from above, it is a flat disc. For humans, the equivalence is obvious. For machine learning models trained on 2D images, it is a fundamental failure mode. Phani Harish Wajjala, a lead machine learning engineer heading content understanding for a large avatar marketplace, spends his days solving exactly these kinds of problems. His team is building AI that determines what a 3D asset is, whether it violates platform policies, and how it should be displayed to users.

In this interview, Phani details why 3D computer vision is still harder than 2D computer vision, how his team compensates for limited training data, and what he sees as the next frontier for avatar-scale AI.

Phani, how would you explain 3D computer vision to a non-expert, and why is it important for virtual environments and digital marketplaces?

I usually explain it by comparison with photos. In the early days, computer vision relied on hand-crafted features like SIFT and HOG to identify objects regardless of their position in the image. Then deep learning, starting with AlexNet, changed everything: we realized that if we threw enough internet data at a model, it could learn to handle those variations naturally.

But 3D is a different animal. We don't have the same huge datasets, and the computation is harder too. In 2D, a photo is just a photo. In 3D, you have a “frame of reference” problem. Viewed from the side, a hat looks like a hat; viewed from above, it may look like a flat disc. If the model doesn't realize that those two views are the same object, it breaks.
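The frame-of-reference problem can be made concrete with a toy example (the data here is hypothetical: a “hat” approximated as a flat brim plus a short crown). The same set of 3D points produces radically different 2D projections depending on where the camera sits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "hat": a flat brim (a disc at height 0) plus a crown (a short cylinder).
n = 500
r = np.sqrt(rng.uniform(0, 1, n))                    # radii uniform over the disc
a = rng.uniform(0, 2 * np.pi, n)
brim = np.stack([r * np.cos(a), np.zeros(n), r * np.sin(a)], axis=1)
h = rng.uniform(0, 0.5, n)
crown = np.stack([0.4 * np.cos(a), h, 0.4 * np.sin(a)], axis=1)
hat = np.concatenate([brim, crown])                  # columns: (x, height, depth)

def extent(points_2d):
    """Width and height of the bounding box of a 2D orthographic projection."""
    return points_2d.max(axis=0) - points_2d.min(axis=0)

top_view = hat[:, [0, 2]]   # camera directly above: the height axis is lost
side_view = hat[:, [0, 1]]  # camera at the side: the depth axis is lost

print("top view extent:", extent(top_view))   # a wide, round disc; no height info
print("side view extent:", extent(side_view)) # wide but short: a hat silhouette
```

From above, the projection is just a disc; from the side, it has the silhouette of a hat. A 2D model sees two unrelated shapes unless something teaches it they are the same object.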

For a marketplace like Roblox, this matters because we don't just sell images; we sell things that have to work: how a shirt drapes, whether an item fits. If the system can't understand 3D geometry, we can't effectively moderate, price, or recommend the asset.

You oversee the “Content Understanding” pod. Can you elaborate on what that specifically means in the context of the Avatar Marketplace?

Avatar Marketplace allows users to create and sell 3D assets. My pod's job is basically to make sense of the large influx of content.

First, there are the basics: safety and policy. We have to make sure assets don't violate any rules before anyone sees them. Next is taxonomy: understanding what an item actually is, so the catalog can be organized.

But beyond that, our “content intelligence” platform generates signals that drive the economy. We look at:

Fit and compatibility: is this shirt suitable for this body type?

Market indicators: estimated quality, uniqueness, and potential demand.

Rich features: items tagged with style, theme, and material for recommendations.

Composition: we don't just look at individual products. We look at the entire outfit to understand the overall “vibe” of the avatar and its relationship to pop culture trends.

You mentioned the lack of 3D data. In the world of LLMs and 2D image generators, we often hear about “trillions of parameters” and massive datasets. How do you build high-performance AI for 3D understanding when you don't have an internet's worth of training data?

When you don't have the data, you can't just throw everything into one giant model. Instead, we use “model cascades” that connect small, specialized models with an LLM.

Take asset fit as an example. You don't need a supercomputer to tell whether a shirt will clip through a body. We first run a fast geometric check (using a standard mannequin as an anchor), which rules out the obvious misfits. Then we pass the subtle cases to the LLM to judge the nuances.
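The cascade pattern can be sketched in a few lines. Everything below is hypothetical — the `Asset` fields, the thresholds, and the `llm_fit_check` stub stand in for real geometric tests and model calls — but it shows the control flow: a cheap filter settles the obvious cases, and only the ambiguous middle gets escalated to the expensive model.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    # Hypothetical signal: how far (in mannequin units) the mesh pokes inside
    # the standard mannequin surface. A real system would compute this from geometry.
    max_penetration: float

def geometric_fit_check(asset, reject_above=0.25, accept_below=0.05):
    """Stage 1: fast geometric filter. Returns a verdict or 'uncertain'."""
    if asset.max_penetration > reject_above:
        return "misfit"
    if asset.max_penetration < accept_below:
        return "fit"
    return "uncertain"

def llm_fit_check(asset):
    """Stage 2 (stub): in production this would be an expensive LLM call
    that reasons about the subtle cases; here it's a simple placeholder."""
    return "fit" if asset.max_penetration < 0.15 else "misfit"

def cascade(assets):
    results, escalated = {}, 0
    for a in assets:
        verdict = geometric_fit_check(a)
        if verdict == "uncertain":
            verdict = llm_fit_check(a)   # only ambiguous assets pay this cost
            escalated += 1
        results[a.name] = verdict
    return results, escalated

batch = [Asset("tight_tee", 0.01), Asset("giant_cube", 0.9), Asset("flowy_cape", 0.1)]
results, escalated = cascade(batch)
print(results, f"escalated: {escalated}/{len(batch)}")
```

The design point is economic rather than algorithmic: the expensive model's latency and cost are paid only for the fraction of assets the cheap check cannot decide.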

We do the same for classification. A shoe isn't always a shoe; sometimes it's a shoe-themed “hat” that users wear on their heads. A standard classifier lacks that context, so we use a pipeline: find the best camera angle and render the asset on the mannequin, extract the geometry, and compare against user search signals to see how the community describes it. Finally, an LLM weighs all of those signals and determines what the item actually is.

You manage a team at the intersection of AI research and product engineering, which are often two completely different cultures. How do you structure a “pod” so that research actually makes it into production?

We build our team around strong ML engineering principles. We don't treat research and engineering as separate silos.

First, we focus on translation: taking product requirements and converting them into quantifiable mathematical objectives. If you can't measure it, you can't build it. Second, we modularize the architecture. Because “content understanding” covers so many different problems, we separate the heavy infrastructure from the experimental parts.

That lets us follow a “Fail Fast” philosophy. Team members can build prototype modules, test them, and ship or retire them without breaking the whole system. We expect things to change constantly, so we build systems to be replaceable from day one.

Generative AI is clearly a hot topic. As we build this kind of understanding system, do you see a future where “content understanding” and “content generation” are fused?

I believe understanding is a prerequisite for good generation. Roblox is always trying to make creation easier, but creating an avatar asset is harder than creating a pretty 3D model.

Today we have layered clothing that deforms to the body. But we want to move toward a future where creators can generate assets that behave the way they would in the real world: clothes that fold, mugs you can drink from, guitars you can play. For AI to create an object that actually functions correctly, it must first deeply understand what that object is.

Your work directly impacts the marketplace. How does improving “content understanding” actually change the experience for users and creators on Roblox?

The biggest benefit is removing friction when users do something unexpected.

For example, after we opened up UGC creation in 2024, we saw a huge wave of creators making “speech bubbles”: static meshes containing text. They didn't fit existing categories like “hats” and “shirts”, so they got buried.

We introduced an “open set recognition” system: essentially, a model that can detect clusters of entirely new item types it has never seen before. Having identified those bubbles, we're now launching a dedicated “prop” hierarchy (for speech bubbles, auras, and companions). Instead of fighting the system to sell their items, creators now have a proper home for them.
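A minimal sketch of the open-set idea, under toy assumptions (2-D embeddings, hand-picked centroids, and an arbitrary threshold; real systems use high-dimensional learned features): items far from every known category centroid are outliers, and a dense cluster of outliers suggests a brand-new category, like the speech-bubble wave.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical category centroids in a toy 2-D embedding space.
centroids = {"hat": np.array([0.0, 0.0]), "shirt": np.array([5.0, 0.0])}

def nearest(emb):
    """Closest known category and the distance to it."""
    name = min(centroids, key=lambda k: np.linalg.norm(emb - centroids[k]))
    return name, np.linalg.norm(emb - centroids[name])

# Incoming assets: mostly hats and shirts, plus a tight cluster far from both
# known centroids, standing in for a never-before-seen item type.
known = np.concatenate([rng.normal(c, 0.3, (20, 2)) for c in centroids.values()])
novel = rng.normal([2.5, 6.0], 0.3, (20, 2))
batch = np.concatenate([known, novel])

THRESHOLD = 1.5  # beyond this, nothing in the taxonomy explains the item
outliers = np.array([emb for emb in batch if nearest(emb)[1] > THRESHOLD])

# A *dense* group of outliers signals a new category worth adding to the
# taxonomy, rather than a handful of unrelated oddballs.
spread = outliers.std(axis=0).mean() if len(outliers) else float("inf")
print(f"{len(outliers)} outliers, cluster spread ~ {spread:.2f}")
```

The key distinction from a closed-set classifier is the explicit “none of the above” outcome: low distance means “classify as usual”, high distance feeds a clustering step that can propose new categories to humans.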

Safety is clearly a top priority for Roblox. Identifying prohibited images is one thing, but how do you moderate 3D assets? How do the challenges differ when content has volume and physics?

It's infinitely trickier. In 2D, what you see is what you get. In 3D, bad actors can hide things inside the geometry or use specific angles to conceal violations. We have to use models to predict the “correct” views, capture images for the safety scanners, and convert animations to video to check for inappropriate movement.

The hardest case, though, is what I call a combination violation: two items that are fine worn separately but create a problem worn together. We've seen users try to spell slurs across different parts of the body, with one letter on a shirt and others on accessories. You can't catch that by looking at each item individually.

To solve this, we treat outfits as a graph of related assets. We use graph embeddings to analyze the final combined look and determine mathematically which assets simply cannot be equipped together.
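A deliberately simplified stand-in for that idea: the interview describes graph embeddings over the combined look, which we replace here with a direct scan of the assembled outfit's rendered text. All asset names and the blocklist entry are hypothetical; the point is that the check runs on the combination, not on each item alone.

```python
# Toy combination-violation check. Each (hypothetical) asset exposes the text
# glyph it renders at a body slot; scanning any single glyph finds nothing,
# but scanning the combined outfit does.

BLOCKLIST = {"badword"}  # placeholder for a real policy list

def combined_text(outfit):
    """Read the equipped outfit top-to-bottom and join its rendered glyphs."""
    order = ["head", "torso", "legs", "feet"]
    return "".join(outfit.get(slot, ("", ""))[1] for slot in order)

def violates(outfit):
    text = combined_text(outfit).lower()
    return any(bad in text for bad in BLOCKLIST)

# Each item is harmless on its own...
hat = ("bad_hat", "bad")
shirt = ("word_shirt", "word")
assert not any(bad in glyph for _, glyph in (hat, shirt) for bad in BLOCKLIST)

# ...but equipping them together spells a violation.
outfit = {"head": hat, "torso": shirt}
print(violates(outfit))  # prints True
```

In the real system the “combined look” is a learned embedding of the rendered outfit rather than a string, but the structure is the same: the unit of moderation is the graph of co-equipped assets, not the individual node.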

Looking at your career, you've worked for major technology companies. How is Roblox's engineering culture and technical challenges different from your previous experiences?

The sheer number of unsolved problems. At other large technology companies, innovation often happens at the margins: new ideas get explored as side projects or optimizations.

At Roblox, with 3D collaborative experiences at this scale, solutions typically don't exist yet. Off-the-shelf models don't work as-is. Trying new solutions here isn't just encouraged; it's an operational requirement. That creates a culture where everyone is always attempting something new.

Finally, looking ahead to 2026, what is the “North Star” for your team?

My goal is to move from a static catalog to a dynamic catalog. We want to minimize the effort it takes for creators to showcase their work and for users to find it.

Specifically, I want to change how people search. Today, you search for the keyword “blue shirt”. In the future, we want users to search by idea: they should be able to describe the avatar, the mood, the character, the specific aesthetic they want, and have our systems proactively discover and assemble the assets to achieve it.


