The capabilities of large AI models already exceed those of the average human in some areas, such as programming and mathematics. Anthropic reportedly now has AI writing nearly all of its code in-house, and Google’s Gemini Deep Think solved 5 of the 6 problems at IMO 2025, reaching gold-medal level.
But when it comes to visual reasoning, even the flagship Gemini 3 Pro performs only at the level of a three-year-old on BabyVision, a benchmark for basic visual reasoning abilities.
Why are large models strong at programming and mathematics but weak at visual reasoning? Because their “modes of thinking” are limited. Vision-language models (VLMs) must first convert visual input into language and then reason over text. But many visual tasks simply cannot be described accurately in words, which caps the model’s visual reasoning ability.
Andrew Dai, who worked at Google DeepMind for 14 years, teamed up with Yingfei Yang, a senior AI expert at Apple, to found a company called Elorian AI. The goal is to raise the visual reasoning capabilities of models from “child-level” to “adult-level” and enable models to think natively in “visual space,” aiming for AGI in the physical world.
Elorian AI raised $55 million in early-stage funding co-led by Striker Venture Partners, Menlo Ventures, and Altimeter, with participation from 49 Palms and top AI scientists including Jeff Dean.
Multimodal model pioneers want to give visual models reasoning power
Andrew Dai, a Chinese national, holds a BA in Computer Science from the University of Cambridge and a PhD in Machine Learning from the University of Edinburgh. He interned at Google during his PhD and joined Google in 2012, where he remained for 14 years before starting his own business.
Image source: Andrew Dai’s LinkedIn
Shortly after joining Google, he co-authored “Semi-supervised Sequence Learning” with Quoc V. Le, an early paper on language-model pre-training followed by supervised fine-tuning that laid groundwork for the GPT line of models. Another of his foundational papers, “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts,” paved the way for today’s mainstream MoE architectures.
Image source: Google
During his time at Google, he was deeply involved in training nearly every flagship model, from PaLM to Gemini 1.5 and Gemini 2.5. In 2023, at Jeff Dean’s arrangement, he took charge of Gemini’s data workstream (including synthetic data), leading a team that later grew to several hundred people.
Image source: Yingfei Yang’s LinkedIn
Yingfei Yang, who co-founded the business with Andrew Dai, spent four years at Google Research, focusing on multimodal representation learning. He then joined Apple, where he was responsible for research and development of multimodal models.
Image source: arxiv
His representative work, “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision,” advanced the field of multimodal representation learning.
Elorian AI’s co-founders also include Seth Neal, a former Harvard assistant professor and an expert in data and AI.
Why dwell on the groundbreaking papers written by Elorian AI’s co-founders? Because what they are attempting is not engineering-level optimization but a paradigm shift in the underlying architecture: upgrading AI from text-based to vision-based understanding.
The current reality is that while AI models perform well on text-based tasks, even the most advanced multimodal models still stumble on the most basic vision-based tasks.
For example: how exactly should certain parts be attached to a mechanical device so that it works more accurately and efficiently? Such spatial and physical tasks are easy for elementary-school students but extremely difficult for today’s multimodal models.
We can find clues in biology. In the human brain, vision is a foundation that supports many thought processes, and humans’ use of visual and spatial reasoning is far older than language-based logical reasoning.
For example, if you’re teaching someone how to get through a maze, explaining it verbally would be confusing, but using sketches will help them understand quickly.
Another example is that even birds, despite having no language, are able to recognize and reason about geographic features through vision, thereby achieving long-distance global migration. This is a strong signal that vision is probably the right evolutionary direction to truly facilitate the reasoning abilities of machines.
Now imagine imprinting this biological visual instinct into an AI’s genes from the very start of model building: a native multimodal model that can understand and process text, images, video, and audio simultaneously, and that is genuinely capable of visual understanding. Andrew Dai and his team want machines not just to “see” the world but to “understand” it, building native “synesthetes.”
In the view of Andrew Dai and his team, a deep understanding of the real physical world is the key to the next leap in machine intelligence, and ultimately to “visual AGI.”
VLM with post-reasoning is not the right path to visual reasoning
This is not the first team to attempt it. Andrew Dai’s former Gemini team was already among the strongest in the world in multimodal AI. But traditional multimodal models still rely mainly on the VLM (vision-language model) approach, whose logic is a “two-step” process: first convert visual input into language, then reason over the resulting text (possibly with the help of external tools).
However, this convert-then-reason approach has inherent limitations. For one, models are prone to hallucination; for another, many visual tasks cannot be accurately described in words.
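The two-step pipeline, and where it loses information, can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not any real model API; the point is only that detail a caption never verbalizes is invisible to the downstream text reasoner.

```python
# Toy illustration of the "convert vision to language, then reason over
# text" pipeline. All functions are trivial placeholders, not real models.

def caption_model(image):
    # Step 1: a captioner reduces a rich scene to a short sentence.
    # Exact geometry (angles, distances, contact points) rarely survives.
    return f"a scene containing {', '.join(sorted(image['objects']))}"

def language_model(prompt):
    # Step 2: a text-only reasoner can use only what the caption kept.
    return "lever" if "lever" in prompt else "unknown"

def two_step_vlm_answer(image, question):
    description = caption_model(image)                       # vision -> text
    prompt = f"Scene: {description}\nQuestion: {question}"   # text-only reasoning
    return language_model(prompt)

image = {"objects": {"panel", "lever"}, "lever_angle_deg": 37.5}
print(two_step_vlm_answer(image, "What should be pulled first?"))  # -> "lever"
# The caption never mentions lever_angle_deg, so any question about the
# lever's exact pose is unanswerable downstream, no matter how strong
# the language model is.
```

The information bottleneck sits entirely in step 1: once the scene is flattened into words, no amount of text-side reasoning can recover what was dropped.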
Likewise, visual generation models such as NanoBanana have excellent multimodal generation capabilities, but the ability to generate is not the ability to reason: the “thinking” behind their outputs still essentially relies on a language model rather than native visual reasoning.
If we want to develop models that truly understand the spatial, structural, and relational complexity of the visual world, we must perform disruptive innovations in the underlying technology.
So how to innovate? The founders of Elorian AI have worked in the multimodal field for many years. Their approach is to deeply integrate multimodal training with an entirely new architecture designed specifically for multimodal reasoning: abandoning the traditional practice of treating images as static input, and instead training models to directly interact with and manipulate visual representations, independently analyzing their internal structure, relationships, and physical constraints.
Of course, another core element is data, which is key to determining the performance and success of these models.
Andrew Dai says he places great importance on data quality, data mix, data sources, and data diversity. The team has also innovated at the data level, rebuilding reasoning chains in visual space and making extensive, fine-grained use of synthetic data.
Combining these efforts will create new AI systems that can move beyond simple visual “perception” to higher-order visual “reasoning.”
This AI system would be a foundation model for visual reasoning: a highly general model that performs exceptionally well on one specific class of capability, visual reasoning.
Since it is a general basic model, it should have a wide range of applications.
First, in robotics, such a model can serve as the underlying nerve center of powerful systems, giving robots the ability to operate autonomously in a variety of unfamiliar environments.
For example, a robot dispatched to handle a sudden safety failure in a hazardous environment must make accurate split-second decisions. Without a foundation model capable of fine-grained reasoning, no one would dare let it press buttons or pull levers at will. A robot with strong reasoning, by contrast, might think: “Before I operate this panel, shouldn’t I first pull this lever to engage the safety mechanism?”
In disaster management, models with visual reasoning could monitor and help prevent forest fires by analyzing satellite imagery. In engineering, they could accurately interpret complex drawings and system schematics. This ability matters because the physical world operates by fundamentally different laws than the world of pure code: you can’t design an airplane wing by typing a few lines of code.
For now, though, Elorian AI’s model and its capabilities exist only on paper. The company plans to release a state-of-the-art visual-reasoning model in 2026; only then can we test whether the results match the claims.
How will the physical world change when AI truly has “visual reasoning” capabilities?
Enabling AI to understand and act on the real physical world has taken many technological iterations.
From image recognition in the traditional CV era to image generative models/multimodal models and world models in the generative AI era, our understanding of the physical world has been continuously enhanced.
The basic model of visual reasoning has great potential to be taken a step further. Because if AI can achieve visual reasoning, it will be able to better understand the physical world and achieve higher levels of machine intelligence.
Imagine models with deep understanding and fine-grained manipulation capability being infused into the embodied-intelligence and AI-hardware industries, greatly expanding their range of applications. Robots could be deployed more reliably in industrial production and in medicine, and AI hardware, especially wearables, could become smarter personal assistants.
However, data remains the foundation of all of this. As Andrew Dai noted earlier, data quality, data mix, data sources, and data diversity together determine model performance.
In physical AI, unlike text-based large models, Chinese companies are approaching the world-leading level in both models and data. If they can leverage rich application scenarios and data to iterate faster, whether in embodied intelligence, AI hardware, or industrial, medical, and home applications, they have a real chance of reaching the top tier and building world-class companies.
This article is from the WeChat official account “Alpha Commune” (ID: alphastartups). The author is a discoverer of extraordinary entrepreneurs. Published with permission from 36Kr.
