– Your research has long focused on helping machines move from recognizing objects to understanding three-dimensional scenes. Looking back over the past decade, what do you think has been the most important change in machine perception?
– There was clearly a major turning point around 2021, with new advances in AI, especially machine learning. The advent of large language models signaled a broader shift that was also being felt in 3D computer vision and perception.
One of the biggest changes is the move toward end-to-end generalist models. Previously, many tasks were handled by pipelines of multiple algorithms designed to work together. Today, we are increasingly replacing these pipelines with a single model that solves the same task end-to-end.
Ideally, that same model can be reused for different tasks. This has both advantages and disadvantages. On the positive side, knowledge about multiple tasks and issues can be concentrated within one model. If you continue to provide enough data, the model can continue to learn and improve across different domains.
The downside is that we are moving more and more towards giant black boxes. While there may be a single end-to-end model that generates the answer, we often lack the ability to clearly inspect it and understand what’s going on under the hood.
This is especially important in many real-world applications, such as industrial processes, manufacturing, autonomous systems, robotics, and autonomous driving. When something goes wrong, you need to know what happened and how to improve the specific component where the problem occurred. So one area of AI that has been around for a while is becoming especially important right now: explainability. It is an effort to open up the black box to better understand how different parts of the model work and exactly where they fail.
– Is explainability a particularly pressing concern in visual AI, for example in the generation of images and videos? In the case of text, inaccuracies are not always immediately obvious. But in images and videos, viewers tend to notice mistakes quickly, such as a hand with six fingers.
– Yes. However, that presents a somewhat different set of challenges. Although it overlaps with explainability, it is more closely related to making AI certifiable and secure.
Explainability becomes important when machine learning models are placed directly in the critical path of real-time applications. For example, if a machine learning model is making important decisions about how a car should behave and it makes the wrong decision, a dangerous situation could result. In cases like this, you need to understand exactly what happened.
In contrast, what you’re referring to is more closely related to generative AI in digital content creation. One of the central challenges is determining whether a particular piece of content was generated by AI. This is clearly something that should be addressed not only technically through better ways to identify and classify AI-generated content, but also through governance.
We need policies that establish clear guidelines for this type of data and are shared as widely as possible across industry and government. This also points to the importance of traceability: being able to identify when and how a particular image or video was produced. Watermarks are one example. Many approaches are being considered that would let image and video generation models mark the content they create, so that it can be identified as AI-generated. This could help limit some of the risks associated with deepfakes and related challenges.

– Have we reached a point where the average viewer can no longer reliably tell the difference between a fully AI-generated video and real footage?
– Yes, we’re very close to that point. This is exactly why these kinds of safety measures need to be in place.
There is also another problem. In some cases, the generated content may closely resemble material that the model has already seen. It is therefore important for commercially used models to be able to identify copyrighted material and to operate under clear policies in that regard. This is a top priority for companies like Google working in this space. Clearer policies are already in place for handling training data, but of course more can be done.
– The presentation showed interesting examples of how to turn videos and photos into immersive 3D environments. Which use cases do you think will become truly transformative first, such as navigation, remote collaboration, retail, or design?
– In fact, there are many applications that can be unlocked by this kind of technology. The goal is to use generative AI to further the creation of digital content. It’s not just about creating something that’s visually compelling; it’s about creating something that is geometrically faithful, something that truly captures three dimensions.
This is important for various use cases. As I mentioned in my talk, one of the key areas is creating immersive environments that people can actually move through, whether these environments are reconstructions of real locations or completely generated. This applies not only to games and mixed reality, but also to autonomous systems.
We are currently witnessing the emergence of so-called world models, systems that create interactive digital representations of 3D environments. While some applications are entertainment-oriented, in other cases these models can generate highly useful, customized data for training robotic systems, robotic arms, autonomous agents, or self-driving models to operate more safely and effectively.
In all of these applications, the third dimension is more than just a visual enhancement. It must be geometrically faithful: the underlying 3D structure must be preserved. Otherwise, you risk feeding noisy or misleading data into a robot learning how to navigate its environment or a self-driving car learning how to drive safely on real roads. If the geometry does not closely match the real world, failures can occur when the system is deployed.
– You have worked in both academia and industry. How has working across both worlds shaped your thinking about what brings real value to AI research?
– In recent years, there has been a real rebalancing in the relationship between academia and industry, especially in AI.
Traditionally, academia has been the main driver of innovation. Many disruptive ideas were born there, while industry focused on technology transfer, that is, taking the most promising ideas and turning them into products and applications.
In recent years, that balance has shifted. One reason is the very trend I mentioned earlier. Increasingly, we are training large, integrated end-to-end models across many tasks, which requires vast amounts of data and compute. Access to both has become important.
A lot of innovation is now happening in environments where that scale of data and computation is available, and those environments are often industrial rather than academic. This has changed the balance of research and innovation to some extent.
However, I call this a rebalancing rather than a replacement, because I don’t think the importance of either side has diminished. Both continue to play a fundamental role in the advancement of the field. Academia remains essential for pursuing disruptive and risky ideas. Industry, on the other hand, is often well positioned for cutting-edge development and scaling of large models.
Of particular interest right now is the rise of consortia and partnerships that bring together academia and industry. These collaborations are becoming increasingly important because they allow different institutions to pool their resources and reach the critical mass they need in terms of data and compute. So one consequence of recent AI developments is that collaboration itself has become more important.
– It was impressive to see how many different platforms, from mobile phones to smart glasses to headsets, are now able to interact with the 3D world. How does the 3D content experience change when you move from flat screens to headsets and smart glasses, and what does that transition allow you to do that a laptop screen can’t?
– The real value of immersive applications is that they enable fundamentally different kinds of user experiences.
These use cases are driven by the desire for a stronger sense of presence, especially in relation to parts of the physical world that are far away from us. Immersive experiences clearly add value when you want to connect with people who are not physically nearby. The same applies when we want to explore or learn about the world. Many concepts are easier to understand, more intuitive, and more convincing when presented through an immersive interface rather than a flat screen.
I am particularly interested in how AI can enable rich and meaningful experiences in areas such as tutoring and education. AI can help create tools that make complex ideas easier to understand, and potentially make those tools more widely available, including to people who otherwise wouldn’t have access to them. I believe that could be one of the true breakthroughs in AI and one of its most positive impacts.

– What is most important in driving widespread adoption of XR and spatial computing? Better algorithms, better hardware, or a richer developer ecosystem?
– Actually, it’s all of them together. They are closely interconnected.
On the hardware side, we’re talking about mechanical components and sensors, as well as displays. Displays are important to the immersive experiences we’ve been talking about, but they also need to be lightweight, high-resolution, energy-efficient, and practical. It’s a very demanding engineering problem.
Another very important component is the chipset. This is a mobile or embedded computing unit that can run these models and algorithms directly on the device, subject to strict constraints on battery life, latency, and accuracy.
Of course, you also need cutting-edge machine learning models and algorithms. And finally, there’s the developer ecosystem. That is also essential. We already know from the history of successful smartphone platforms how important an active and engaged developer community is. To build it, we need to provide the right tools, make the platform attractive, and give developers the freedom to be creative and turn their ideas into reality. That is another important challenge in this field.
– Looking ahead five years, what will it take to be confident that spatial AI and immersive 3D scene generation have moved definitively beyond impressive demos into true mainstream use?
– Given how quickly things are moving, it’s difficult to predict the future, even just five years out.
One area I would like to point out is autonomous agents. For now, one of their main limitations remains the way they interact with the physical world. We are reaching a stage where such agents understand the world fairly well and can move through it with increasing precision in terms of obstacle avoidance, dynamic control, and overall navigation. However, their ability to manipulate objects and interact meaningfully with the world remains limited.
I think the next big step will be to show that spatial intelligence can power a new generation of autonomous agents that can actually operate within the physical world and manipulate objects in it. This will open up many new applications and markets.
Of course, it’s not just about AI, or at least not just about spatial intelligence. We will also need better robotic tools to handle and interact with the environment. So what we really want is progress on both fronts at the same time.
