In an era when large language models dominate headlines, a quiet but bold bet on “world models” is emerging as the next frontier in artificial intelligence. The vision, championed by General Intuition (GI) CEO Pim de Witte and backed by Khosla Ventures' largest seed investment since OpenAI, holds that spatiotemporal foundation models trained on a unique trove of human gameplay data will redefine how AI interacts with the physical and simulated world.
Pim de Witte recently sat down with Latent Space editor swyx at General Intuition's offices to discuss the underlying technology, the strategic advantage of the company's data, and the wide range of future applications for world models. The discussion reveals why this approach, distinct from LLMs yet complementary to them, represents a significant shift in AI's capabilities, moving beyond content generation toward active, intuitive understanding.
At its core, General Intuition is building agents that learn to perceive and act in their environments, mimicking human intuition. Unlike traditional video models, which simply predict a plausible next frame, world models face a much harder challenge. “What the world model does is it actually has to understand all the possibilities and outcomes… and it has to generate the next state based on the actions that the user takes,” De Witte explained. This action-conditioned generation is critical: it allows an AI not only to observe, but also to interact with a dynamic environment and predict the outcomes of its actions.
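The distinction De Witte draws can be illustrated with a minimal sketch. It is not GI's model: the grid world, the `MOVES` action set, and both function names are assumptions made up for this example. An action-free predictor can only extrapolate what it has seen, while an action-conditioned one returns a different next state for each action.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a frame: the agent's (x, y) position on a small grid.
State = Tuple[int, int]

# Hypothetical action set, chosen only for this illustration.
MOVES: Dict[str, Tuple[int, int]] = {
    "left": (-1, 0),
    "right": (1, 0),
    "up": (0, -1),
    "down": (0, 1),
    "stay": (0, 0),
}


def video_model_next(history: List[State]) -> State:
    """Action-free prediction: extrapolate the last observed motion."""
    if len(history) < 2:
        return history[-1]
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def world_model_next(state: State, action: str, size: int = 5) -> State:
    """Action-conditioned prediction: the next state depends on the action."""
    dx, dy = MOVES[action]
    x, y = state
    # Clamp to the grid so that walls are part of the modeled dynamics.
    return (min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1))


if __name__ == "__main__":
    history = [(1, 2), (2, 2)]
    print("video model:", video_model_next(history))
    for action in MOVES:
        print(f"world model, {action}:", world_model_next(history[-1], action))
```

In a real system both functions would be learned networks operating on pixels. The point of the sketch is only the difference in signature: the world model takes the action as an input.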
The foundation of GI's innovation is the vast dataset accumulated by De Witte's previous business, Medal. A game-clipping platform with 12 million users, Medal has amassed a staggering “3.8 billion clips of gaming's best moments and action,” resulting in one of the most distinctive and diverse datasets of peak human behavior. This trove of “episodic memory for simulation” provides an unparalleled resource for training AI. Importantly, the data preserves privacy: it maps actions to visual inputs and game outcomes without revealing users' personal data. That forward-looking design choice has turned out to be a goldmine for world model development.
GI's agents are purely vision-based and operate on a “frame in, action out” paradigm. De Witte demonstrated a model that predicts actions from raw pixels without reinforcement learning (RL), without fine-tuning, and with “no game state” as input. These agents exhibit “incredibly human-like” behavior, sometimes making the same mistakes humans make or even checking the scoreboard the way gamers do, which is evidence of the fidelity of their imitation learning.
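The “frame in, action out” interface can be sketched in a few lines. Everything here is an assumption for illustration: the `Frame` type, the action names, and above all the policy itself, which is a hand-written brightness heuristic standing in for the learned network described in the interview. What the sketch preserves is the contract: the policy sees only pixels and never queries game state.

```python
from typing import List

# A frame is a grid of grayscale pixel values (rows of ints in 0..255).
Frame = List[List[int]]


def policy(frame: Frame) -> str:
    """Map raw pixels to an action. No game state is consulted.

    Placeholder heuristic (an assumption, not a learned model):
    steer toward the brightest column of the frame.
    """
    width = len(frame[0])
    column_brightness = [sum(row[c] for row in frame) for c in range(width)]
    target = column_brightness.index(max(column_brightness))
    center = width // 2
    if target < center:
        return "turn_left"
    if target > center:
        return "turn_right"
    return "forward"


def run(frames: List[Frame]) -> List[str]:
    """Frame in, action out: one action per observed frame."""
    return [policy(frame) for frame in frames]


if __name__ == "__main__":
    frames = [
        [[255, 0, 0], [255, 0, 0]],  # bright on the left
        [[0, 9, 0]],                 # bright in the center
        [[0, 0, 7]],                 # bright on the right
    ]
    print(run(frames))
```

An imitation-learned agent would replace the heuristic with a network trained on recorded human frame and action pairs, while keeping the same pixels-only interface.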
De Witte's key insight is the difference between world models and simple video generation. A world model needs to account for actions, memory, and partial observability (smoke, occlusion, camera shake, and the like). This lets agents use human-like spatial reasoning to navigate, hide, and peek around corners, an essential capability for real-world applications. The ability to distill these huge policies into small models that run in real time makes them even more useful.
The company's ambitions go far beyond games. General Intuition has demonstrated the potential to move from arcade-style games to more realistic games and even real-world video. This means its models can process and predict actions in internet videos, laying the foundation for applications in robotics. De Witte envisions spatiotemporal foundation models powering most of the “interactions between atoms” in both simulations and the physical world by 2030, suggesting a future in which intelligent agents seamlessly navigate and manipulate their environments. He sees these models as complementary to large language models rather than as rivals, with LLMs handling symbolic reasoning and world models excelling at embodied spatial intelligence.
From running and reverse engineering RuneScape private servers to leading a frontier AI lab, De Witte's personal journey highlights an unconventional path to innovation. His self-taught approach to the fundamentals of deep learning, actively seeking out and mastering core concepts, reflects the dedication required to tackle such complex problems. The decision to decline a $500 million offer from OpenAI and remain independent was driven by recognition of the company's unique data moat and by a belief that it could lead this research on its own, a belief reinforced by Khosla Ventures' significant investment.
