COSIMO’s geometric video aims to solve AI’s perception problem



Irvine, CA – May 13, 2026 – In a move that challenges the fundamental principles of how machines perceive the world, Irvine-based startup COSIMO has announced a new technology it calls “geometric video.” The company claims that its new video format, purpose-built for artificial intelligence, can reduce costs, improve accuracy, and dramatically accelerate the deployment of real-world AI systems such as robotaxis and humanoid robots.

The announcement targets a core complaint in the multibillion-dollar “physical AI” industry. Despite massive investments in larger models, more powerful GPUs, and vast data centers, the timeline for truly autonomous systems continues to slip. As COSIMO’s presentation points out, delayed product launches and underwhelming demonstrations have become the norm, and “the physical AI industry has been reduced to anecdote.” The prevailing strategy in the industry is brute-force scaling, but COSIMO argues that the problem lies not in scale but in the source: the video itself.

New primitives for AI recognition

At the heart of COSIMO’s argument is a simple but profound observation: legacy video was never designed for AI. Digital video formats from MPEG to H.264 compress visual information for the human eye, prioritizing human perceptual quality over informational clarity for algorithms. For AI, this pixel-based data is inherently noisy and inefficient, forcing models to spend large amounts of compute just to infer fundamental properties of the physical world, such as shape, motion, and object persistence, from streams of colored dots.

This inefficiency led to an arms race. Companies developing autonomous systems are being forced to rely on increasingly complex and power-hungry solutions. These include sensor fusion, which combines camera data with other sensors such as LiDAR and radar, and the deployment of increasingly large neural networks on specialized and expensive hardware. Although these methods have brought progress, they have contributed to the ballooning costs and development schedules that have plagued the industry. Brute-force methods require more data, more powerful chips, and more energy, leading to cycles of increasing cost and complexity that have yet to deliver on the promise of widespread, reliable physical AI.

COSIMO proposes a paradigm shift by tackling the root of the problem. Rather than trying to build a better brain to interpret flawed data, the company created what it claims is better data for AI brains.

From pixels to pure geometry

COSIMO’s solution is a proprietary “physics engine” that converts raw sensor data into geometric video. Unlike traditional video, which captures the scene as a grid of pixels, geometric video encodes the geometry of the underlying shapes and their movement directly into the data stream. The company says the process “removes noise” and represents objects and their motion in a “deterministic, mathematically pure form.”

The technical core of this transformation is the COSIMO Deterministic Structural Transformation (DST) kernel, which generates a sparse geometric matrix (SGM). The company says the kernel is stateless, uses efficient fixed-point integer arithmetic, and requires no learned weights, suggesting it can operate at high speed with low computational overhead. By preprocessing the visual world into a geometric language grounded in mathematics and physics, the technology lets AI models represent reality in a cleaner, more direct way.
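COSIMO has not published the DST kernel, so the following is purely an illustrative sketch of the properties the company describes: a stateless, weight-free transform that uses only fixed-point integer arithmetic and reduces a dense pixel frame to a sparse geometric matrix. All names and thresholds here (toy_dst, FP_SHIFT, EDGE_THRESHOLD) are hypothetical.

```python
# Purely illustrative: COSIMO has not published the DST kernel.
# This toy transform only mimics the properties described in the
# announcement: stateless, no learned weights, fixed-point integer
# arithmetic, and a sparse geometric output instead of dense pixels.

FP_SHIFT = 8          # 8.8 fixed point: stored value = real value * 256
EDGE_THRESHOLD = 48   # hypothetical cutoff for "geometry worth keeping"

def toy_dst(frame):
    """Map a frame (2-D list of 0-255 ints) to a sparse geometric
    matrix: a list of (row, col, dir_x_fp, dir_y_fp) edge entries."""
    h, w = len(frame), len(frame[0])
    sgm = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = frame[r][c + 1] - frame[r][c - 1]  # integer gradients
            gy = frame[r + 1][c] - frame[r - 1][c]
            mag = abs(gx) + abs(gy)                 # L1 magnitude
            if mag >= EDGE_THRESHOLD:
                # Edge direction as a fixed-point unit vector,
                # computed with integer division only.
                sgm.append((r, c,
                            (gx << FP_SHIFT) // mag,
                            (gy << FP_SHIFT) // mag))
    return sgm

# A 6x6 frame with a vertical brightness step collapses from 36 raw
# pixel values to a short column of edge entries.
frame = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
edges = toy_dst(frame)
```

Even in this toy form, the output grows with the amount of scene geometry rather than with resolution, which is the kind of property COSIMO credits for its claimed efficiency gains.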

This fundamentally changes the task of AI. Rather than having to analyze millions of hours of pixelated video to learn the world’s physics from scratch, AI is fed a stream that already describes the world in terms of its essential geometric properties. This has the potential to significantly reduce the size and complexity of AI models needed for perceptual tasks.

The economics of performance

The true test of any new technology is its performance, and COSIMO has released a series of impressive benchmark numbers to back up its claims. The company tested geometric video against traditional video baselines on the UCF-101 dataset, a widely used academic benchmark for action recognition in video.

According to the company, five separate training runs on NVIDIA L4 hardware produced remarkable results: AI models trained on geometric video reportedly achieved a +12.4 percentage point accuracy improvement over those trained on traditional video, while using 78.5% fewer model parameters and 27x less GPU memory at inference. The company also highlighted a run on a five-year-old MacBook Pro, where the system processed each frame in 1.17 milliseconds while consuming less than 1 watt of power, demonstrating its potential for low-power edge devices.
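Taking the reported figures at face value, a quick back-of-the-envelope calculation (our arithmetic, not COSIMO’s; the 100-million-parameter baseline is a hypothetical round number) shows what they imply:

```python
# Back-of-the-envelope arithmetic on COSIMO's reported figures.
# The 100M-parameter baseline below is a hypothetical round number,
# not anything COSIMO has disclosed.

frame_latency_ms = 1.17                     # reported per-frame time
throughput_fps = 1000 / frame_latency_ms    # roughly 855 frames/sec

param_reduction = 0.785                     # reported 78.5% reduction
baseline_params = 100_000_000               # hypothetical baseline
geometric_params = baseline_params * (1 - param_reduction)  # ~21.5M
```

At roughly 855 frames per second, a standard 30 fps camera feed would occupy only a few percent of the claimed capacity, which is at least consistent with the sub-watt laptop figure the company cites.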

Perhaps most important to developers, results across the test runs reportedly clustered three times more tightly than the baseline. This suggests a level of stability and predictability that is often absent in the complex world of deep learning, where small changes can lead to unpredictable performance. COSIMO claims this stability makes the system “stable enough to be debugged like source code,” a claim that will resonate with any engineer struggling with the probabilistic nature of AI training. To support these claims, the company states that all test runs are public and cryptographically verified through its website.

If these numbers hold up in real-world applications, the economic impact would be substantial. COSIMO projects annual savings of $8 billion to $10 billion for Tier 1 physical AI companies across compute, storage, and power costs. At the micro level, it estimates savings of $2,700 per edge device. Beyond direct financial savings, the company claims its technology can cut time to market by 6 to 12 months, a valuable advantage in a competitive environment.

A potential paradigm shift in physical AI

COSIMO’s technology enters a field dominated by tech giants. NVIDIA’s Metropolis platform, Google’s Gemini-powered robotics initiative, and Tesla’s vision-only Full Self-Driving (FSD) system are all throwing vast resources at the challenge of AI perception. These established players have focused on building more powerful hardware, such as custom AI chips, and more sophisticated software models for processing traditional video and sensor data.

For example, Tesla is famously betting its future on using large numbers of cameras and powerful neural networks to solve autonomous driving, eschewing other sensors such as LiDAR. This vision-only approach relies on AI’s ability to extract all the necessary information from traditional video streams. Google and others often employ sensor fusion, which combines data from multiple sources to build more robust models of the world.

COSIMO’s geometric video does not necessarily replace these efforts; it reframes the problem. It can be thought of as a foundational layer that makes all subsequent processing more efficient. A self-driving car using geometric video could still employ sensor fusion, but the information from its cameras would be far cleaner and require less computation to interpret. By the same logic, Tesla’s vision-only approach could run on cheaper hardware and potentially achieve higher reliability.

By creating video primitives that are inherently machine-readable, COSIMO is betting that the industry will choose elegant, efficient solutions over today’s brute-force approaches. If geometric video delivers on its promise of higher accuracy with dramatically fewer resources, it could lower the barrier to entry for new players and allow established ones to reallocate significant capital from data centers to deployment. The industry has long been waiting for a breakthrough beyond incremental gains, and redefining the very data AI consumes could be that fundamental change.


