Perceptron’s Mk1 shocks with high-performance video analytics AI at 80-90% lower cost than Anthropic, OpenAI, and Google



AI that can see and understand what’s happening on video, especially live feeds, is understandably an attractive product for many businesses and organizations. In addition to acting as security “watchdogs” for sites and facilities, such AI models can pull the most provocative moments out of marketing videos and repurpose them for social channels, flag inconsistencies and gaffes in footage for removal, and analyze the body language and behavior of participants in controlled studies or of candidates applying for new roles.

There are currently some AI models that offer this kind of functionality, but they are far from mainstream. Two-year-old startup Perceptron Inc. is looking to change that. Today, the company announced the release of its flagship proprietary video analytics inference model, Mk1 (short for “Mark One”), priced at $0.15 per million input tokens and $1.50 per million output tokens via an application programming interface (API). That is approximately 80-90% less expensive than major proprietary competitors, including Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro.


Perceptron Mk1 cost Pareto chart. Credit: Perceptron

The company, led by co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, spent 16 months developing a “multimodal recipe” from scratch to deal with the complexity of the physical world.

This announcement heralds a new era in which models are expected to understand causal relationships, object dynamics, and physical laws with the same fluency that was once applied to grammar.

Interested users and potential enterprise customers can try it out for themselves at Perceptron’s public demo site.

Performance across spatial and video benchmarks

The model’s performance is backed by a set of industry-standard benchmarks focused on grounded understanding.


Perceptron Mk1 benchmark comparison table. Credit: Perceptron

In embodied spatial reasoning, Mk1 achieved a score of 85.1 on EmbSpatialBench, beating Google’s Gemini Robotics-ER 1.5 (78.4) and Alibaba’s Q3.5-27B (around 84.5).

On the specialized RefSpatialBench, Mk1’s score of 72.4 far outpaces competitors such as GPT-5m (9.0) and Sonnet 4.5 (2.2), highlighting a significant advantage in understanding referring expressions.


Perceptron Mk1 video benchmark comparison table. Credit: Perceptron

Video benchmarks show similar advantages. On EgoSchema’s “hard subset,” where inference from only the first and last frames performs poorly, Mk1 scored 41.4, comparable to Alibaba’s Q3.5-27B and significantly higher than Gemini 3.1 Flash-Lite (25.0).

On VSI-Bench, Mk1 reached 88.5, the highest score among the compared models, further validating its ability to handle real-world temporal reasoning tasks.

Market positioning and efficiency frontiers

Perceptron explicitly targets the “efficiency frontier,” a chart that plots average scores across embodied reasoning and video benchmarks against blended cost per million tokens.

Benchmark data shows Mk1 occupying a unique position: it matches or exceeds the performance of “frontier” models such as GPT-5 and Gemini 3.1 Pro while maintaining a cost profile close to their “Lite” or “Flash” variants.

Specifically, Perceptron Mk1 is priced at $0.15 per million input tokens and $1.50 per million output tokens. By comparison, on the “efficiency frontier” chart, GPT-5 carries a much higher blended cost (nearly $2.00) and Gemini 3.1 Pro sits around $3.00, while Mk1’s blended cost lands at the $0.30 level with a better inference score.
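
For readers who want to sanity-check those figures, the sketch below shows how a blended price can be derived from the published input and output rates. The 90/10 input-to-output token mix is an assumption chosen for illustration; the article does not say what weighting Perceptron’s chart uses.

```python
# Back-of-the-envelope check of the blended-cost figure cited above. The 90/10
# input-to-output token mix is an assumption for illustration only; Perceptron's
# efficiency-frontier chart does not state its weighting.

def blended_cost(input_price: float, output_price: float, input_share: float) -> float:
    """Weighted average price per million tokens for a given input-token share."""
    return input_price * input_share + output_price * (1.0 - input_share)

# Perceptron Mk1 list prices per million tokens, as reported above.
mk1_input, mk1_output = 0.15, 1.50

# A 90% input / 10% output mix lands near $0.29, close to the ~$0.30 level
# shown on the chart.
print(f"Mk1 blended: ${blended_cost(mk1_input, mk1_output, 0.9):.2f} per 1M tokens")
```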

This aggressive pricing strategy aims to make high-end physical AI available for large-scale industrial applications rather than just experimental research.

Architecture and temporal continuity

The technical core of Perceptron Mk1 is its ability to process video natively at up to 2 frames per second (FPS) within a 32K-token context window.

Unlike traditional visual language models (VLMs), which often treat video as a series of disjointed still images, Mk1 is designed with temporal continuity in mind.

This architecture allows the model to “watch” extended streams and maintain object identity even through occlusions, a key requirement for robotics and surveillance applications.

Developers can query the model for specific moments in long streams and receive structured time codes, streamlining the process of video clipping and event detection.
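
As a rough illustration of what such a query might look like, here is a minimal sketch that posts a long video and a natural-language prompt, then walks the returned timecodes. The endpoint URL, payload fields, and response schema are assumptions for illustration, not Perceptron’s documented API.

```python
# Minimal sketch of querying a long stream for timestamped events. The endpoint,
# payload fields, and response shape below are hypothetical; consult Perceptron's
# API documentation for the real interface.
import requests

API_URL = "https://api.perceptron.example/v1/analyze"  # placeholder endpoint

payload = {
    "model": "mk1",
    "video_url": "https://example.com/warehouse-feed.mp4",
    "prompt": "Return a timecode for every moment a forklift enters the frame.",
    "response_format": "timecodes",  # assumed option for structured output
}

resp = requests.post(API_URL, json=payload, headers={"Authorization": "Bearer <API_KEY>"})
resp.raise_for_status()

# Assumed response shape: a list of {"start": "00:04:12", "end": "00:04:27", "label": "..."}
for event in resp.json().get("events", []):
    print(event["start"], "-", event["end"], event.get("label", ""))
```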

Reasoning based on physical laws

The Mk1’s main differentiator is its “Physical Reasoning” capability. Perceptron defines this as high-precision spatial awareness that allows the model to understand the dynamics and physical interactions of objects in real-world settings.

For example, the model can analyze a scene and determine whether a basketball shot was released before or after the buzzer by jointly inferring the ball’s position in the air and the shot-clock reading.

This requires more than pattern recognition; the model must understand how objects move through space and time.

The model can point with “pixel precision” and count into the hundreds within dense, complex scenes. It can also read analog gauges and clocks, which have until now been difficult for purely digital vision systems to interpret reliably.
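
To make the pointing capability concrete, the sketch below shows one way a developer might overlay pixel-precision points returned by a vision model onto a frame. The JSON shape, with normalized coordinates and a text label per point, is an assumed format for illustration; Mk1’s actual output schema may differ.

```python
# Sketch of consuming "pixel-precision" pointing output. The normalized x/y plus
# label format below is an assumption for illustration, not Mk1's documented schema.
from PIL import Image, ImageDraw

frame = Image.open("gauge_panel.jpg")  # a single frame extracted from a video feed
points = [                              # hypothetical model output
    {"x": 0.42, "y": 0.31, "label": "pressure gauge, reads ~65 psi"},
    {"x": 0.77, "y": 0.58, "label": "analog clock, reads 10:07"},
]

draw = ImageDraw.Draw(frame)
width, height = frame.size
for p in points:
    px, py = p["x"] * width, p["y"] * height  # convert normalized coords to pixels
    draw.ellipse([px - 6, py - 6, px + 6, py + 6], outline="red", width=3)
    draw.text((px + 10, py - 10), p["label"], fill="red")

frame.save("gauge_panel_annotated.jpg")
```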

The model also appears to have substantial general world and historical knowledge. As a quick test, I uploaded an old public-domain Library of Congress film of the construction of a New York City skyscraper in 1906. Not only did Mk1 accurately describe the contents of the footage, including strange and atypical sights such as workers suspended from ropes, but it also quickly pinpointed a rough date (early 1900s) based on the footage’s appearance alone.

Perceptron Mk1 VentureBeat demo test screenshot


Physical AI developer platform

The model release comes with an enhanced developer platform designed to turn these high-level recognition capabilities into functional applications with minimal code.

The Perceptron SDK, available in Python, introduces several specialized features such as “focus,” “counting,” and “in-context learning.”

The focus feature automatically zooms into or crops specific areas of a frame based on natural-language prompts, such as detecting and locating personal protective equipment (PPE) on a construction site. The counting feature is optimized for crowded scenes, such as identifying and pointing out every puppy in a group or individual items of produce.

Additionally, the platform supports in-context learning, allowing developers to adapt Mk1 to specific tasks with just a few examples, such as showing the model an image of an apple and then instructing it to label all instances of “category 1” in a new scene.
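
The snippet below sketches what calls to those three features could look like in practice. The package name, client class, and method signatures are guesses for illustration only and are not taken from Perceptron’s documentation.

```python
# Illustrative-only sketch of the SDK features described above. The import path,
# client class, and method names are assumptions, not the documented Perceptron SDK.
from perceptron import Client  # assumed package and class name

client = Client(api_key="<API_KEY>")

# "Focus": zoom or crop to regions matching a natural-language description.
regions = client.focus(video="site_cam_07.mp4", prompt="workers not wearing hard hats")

# "Counting": enumerate instances in a dense scene, with a point per instance.
result = client.count(image="produce_bin.jpg", prompt="individual apples")
print(result.total, result.points[:3])

# In-context learning: a few labeled examples steer the model on a new scene.
examples = [("ref_apple_1.jpg", "category 1"), ("ref_apple_2.jpg", "category 1")]
labels = client.label(image="new_scene.jpg", examples=examples, target="category 1")
```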

Licensing strategy and Isaac series

Perceptron employs a dual-track strategy regarding model weights and licenses. The flagship Perceptron Mk1 is a closed-source model accessible via API and designed for enterprise-grade performance and security.

However, the company is also maintaining its “Isaac” series, which began with the launch of Isaac 0.1 in September 2025, as an open-weight alternative. Isaac 0.2-2b-preview, released in December 2025, is a 2-billion-parameter vision-language model with reasoning capabilities, aimed at edge and low-latency deployments.

Isaac model weights are published on the popular AI code sharing community Hugging Face, while Perceptron offers commercial licenses to companies that require maximum control over weights or on-premises deployment.

This approach allows the company to serve both the open-source community and specialized industry partners that require extra flexibility. The documentation states that the Isaac 0.2 model is specifically optimized for a time-to-first-token of under 200 ms, making it well suited to real-time edge devices.
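
For developers who want to experiment locally, the sketch below shows the usual Hugging Face loading pattern. The repository ID is a guess based on the naming above; check Perceptron’s Hugging Face page for the exact ID and the model card’s loading instructions.

```python
# Sketch of pulling the open-weight Isaac model from Hugging Face for local use.
# The repository ID is an assumption based on the naming in this article; the model
# card may require a different loading class or additional arguments.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "PerceptronAI/Isaac-0.2-2b-preview"  # assumed repository name

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# From here, a frame plus a text prompt can be packed by the processor and passed
# to model.generate(), following the usage example on the model card.
```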

Perceptron’s founding background and focus

Perceptron AI is a physical AI startup based in Bellevue, Washington, founded by Aghajanyan and Akshat Shrivastava, both former researchers at Meta’s Facebook AI Research (FAIR) lab.

Although the company’s public documents list the date of incorporation as November 2024, Perceptron AI’s Washington corporate filing records show an earlier foreign registration application dated October 9, 2024, with Shrivastava and Aghajanyan listed as governors.

In the founders’ launch posts in late 2024, Aghajanyan said he was leaving Meta after about six years and “joining forces” with Shrivastava to build AI for the physical world, while Shrivastava said the company grew out of his research on efficiency, multimodality, and new model architectures.

The company appears to be a direct continuation of their work on multimodal foundation models at Meta. In May 2024, Meta researchers announced Chameleon, a family of early-fusion models designed to understand and generate mixed sequences of text and images, work Perceptron later described as part of the lineage behind its own models.

A follow-up paper in July 2024, MoMa, explored more efficient early-fusion training of mixed-modal models and lists both Shrivastava and Aghajanyan among its authors. Perceptron says that research direction extends into “physical AI”: models that can process real-world video and other sensory streams for use cases such as robotics, manufacturing, geospatial analysis, security, and content moderation.

Partner ecosystem and future prospects

Mk1’s real-world impact is already being demonstrated through Perceptron’s partner network. Early adopters are using the model for a range of applications, such as automatically clipping live sports highlights, leveraging its temporal understanding to identify critical plays without human intervention.

In robotics, partners are converting teleoperation episodes into training data, effectively automating data labeling and cleaning for robotic arms and mobile units.

Other use cases include multimodal quality-control agents on manufacturing lines that detect defects and validate assembly steps in real time, and wearable assistants on smart glasses that provide context-aware help to users.

Aghajanyan said these releases are the culmination of research aimed at making AI work in the physical world, moving toward a future where “physical AI” becomes as ubiquitous as digital AI.


