Thinking Machines previews near real-time AI voice and video conversations using new ‘interaction models’

Will AI graduate from the era of “turn-based” chat?

All of us who now regularly use AI models in our work and personal lives know that the fundamental modes of interaction across text, images, audio, and video have not changed. That is, a human user provides input, waits anywhere from milliseconds to minutes (or sometimes hours or days for particularly difficult queries), and an AI model provides output.

But if AI is actually going to take on workloads that require natural interaction, it will need to do more than provide this kind of “turn-based” interactivity. Ultimately, it needs to respond more fluidly and naturally to human input, and even respond while it is still processing the next human input, whether in text or another format.

That, at least, seems to be the argument of Thinking Machines, a well-funded AI startup founded last year by former OpenAI chief technology officer Mira Murati and former OpenAI researcher and co-founder John Schulman.

Today, the company announced a research preview of what it calls “interaction models”: a new class of natively multimodal systems that treat interactivity as a first-class citizen of the model architecture rather than as an external software “harness.” According to the company, the approach delivers impressive gains on third-party benchmarks along with reduced latency.

However, these models are not yet available to the general public or even businesses. The company said in an announcement blog post: “In the coming months, we will be rolling out a limited research preview to gather feedback, with a broader release planned for later this year.”

“Full duplex” simultaneous input/output processing

At the heart of this announcement is a fundamental shift in the way AI perceives time and presence. Current frontier models typically experience reality in a single thread: perception freezes while the model waits for the user to finish their input, and only then does it begin processing and generating a response.

In a blog post, Thinking Machines researchers described the current situation as a limitation that forces humans to “warp themselves” to AI interfaces, batching their questions and collecting their thoughts as they would in an email.

To solve this “collaboration bottleneck,” Thinking Machines moved away from the standard alternating token sequence.

Instead, it uses a multi-stream, micro-turn design that processes 200 ms chunks of input and output simultaneously.
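To make the contrast with turn-based chat concrete, here is a minimal sketch of the micro-turn idea. It is not Thinking Machines’ implementation: the `MicroTurn` type, the `micro_turns` and `turn_based` helpers, and the sample streams are hypothetical, and only the 200 ms window comes from the announcement.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

CHUNK_MS = 200  # micro-turn length reported in the announcement


@dataclass
class MicroTurn:
    index: int
    start_ms: int
    heard: str   # input that arrived during this window
    spoken: str  # output emitted during the same window


def micro_turns(inputs: List[str], outputs: List[str]) -> Iterator[MicroTurn]:
    """Pair input and output chunks window by window (hypothetical helper)."""
    for i in range(max(len(inputs), len(outputs))):
        heard = inputs[i] if i < len(inputs) else ""
        spoken = outputs[i] if i < len(outputs) else ""
        yield MicroTurn(i, i * CHUNK_MS, heard, spoken)


def turn_based(inputs: List[str], outputs: List[str]) -> List[Tuple[str, str]]:
    """Contrast case: the model says nothing until all input has arrived."""
    return [("user", " ".join(inputs).strip()), ("model", " ".join(outputs).strip())]


# Illustrative streams: the model backchannels while the user is still talking.
user_stream = ["so the", "loop starts", "at one", ""]
model_stream = ["", "mm-hm", "", "should it start at zero?"]

timeline = list(micro_turns(user_stream, model_stream))
for t in timeline:
    print(f"{t.start_ms:>4} ms  heard={t.heard!r:14} spoken={t.spoken!r}")
```

In the micro-turn timeline, the model’s “mm-hm” shares a window with the user’s speech, whereas `turn_based` collapses the same exchange into two sequential blocks.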

This “full-duplex” architecture allows the model to listen, speak, and see in real time, so it can chime in while you are speaking or interject when it notices visual cues, such as a user writing a bug into a code snippet or a friend entering the video frame. Technically, the model relies on encoder-free early fusion.

Rather than relying on a large standalone encoder like Whisper for audio, the system takes in raw audio signals as dMel and images as 40×40 patches through a lightweight embedding layer, and co-trains all components from scratch within a single transformer.
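As an illustration of what feeding image patches through a lightweight embedding layer can look like, the sketch below splits a frame into 40×40 tiles and applies one linear projection to each. Only the patch size comes from the announcement. The `patchify` and `embed` helpers, the grayscale frame, the averaging weights, and the 4-dimensional output are assumptions chosen to keep the example small, and the dMel audio path is not shown.

```python
from typing import List

PATCH = 40  # patch side length reported in the announcement

Image = List[List[float]]  # grayscale rows, purely for illustration


def patchify(image: Image, patch: int = PATCH) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles


def embed(tile: List[float], weights: List[List[float]]) -> List[float]:
    """One linear projection, standing in for a lightweight embedding layer."""
    return [sum(w * x for w, x in zip(row, tile)) for row in weights]


# A 120 x 80 checkerboard frame yields 3 x 2 = 6 patches of 1,600 values each.
frame = [[float((r + c) % 2) for c in range(80)] for r in range(120)]
tiles = patchify(frame)

# Hypothetical weights: every output dimension simply averages the patch.
weights = [[1.0 / (PATCH * PATCH)] * (PATCH * PATCH) for _ in range(4)]
tokens = [embed(tile, weights) for tile in tiles]
print(len(tokens), "tokens of dimension", len(tokens[0]))
```

The point of the design is that nothing heavier than this projection sits between the raw signal and the transformer, so the representation is learned jointly with everything else rather than inherited from a separately trained encoder.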

A dual-model system

The research preview introduces TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts (MoE) model with 12 billion active parameters. Because real-time interaction requires near-instantaneous response times, which are often at odds with deep reasoning, the company built a two-part system.

  1. Interaction model: Maintains continuous interaction with the user and handles dialog management, presence, and immediate follow-ups.

  2. Background model: An asynchronous agent that handles sustained reasoning, web browsing, or complex tool calls, and streams its results to the interaction model, which weaves them naturally into the conversation.
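The division of labor between the two models can be sketched with Python’s `asyncio`. This is a toy under stated assumptions, not the company’s architecture: `background_model` and `interaction_model` are hypothetical coroutines, the sleep durations are arbitrary, and a queue stands in for whatever streaming channel the real system uses.

```python
import asyncio
from typing import List


async def background_model(task: str, results: asyncio.Queue) -> None:
    """Stand-in for the slow agent: reasoning, browsing, or tool calls."""
    await asyncio.sleep(0.05)  # pretend this is a long-running job
    await results.put(f"result for {task!r}")


async def interaction_model(user_chunks: List[str]) -> List[str]:
    """Stand-in for the fast model: it never blocks on the background job."""
    results: asyncio.Queue = asyncio.Queue()
    transcript: List[str] = []
    job = asyncio.create_task(background_model("chart request", results))

    for chunk in user_chunks:
        transcript.append(f"heard: {chunk}")
        await asyncio.sleep(0.02)   # one short tick of conversation
        while not results.empty():  # weave in finished work, if any
            transcript.append(f"said: {results.get_nowait()}")

    await job                       # collect anything that finished late
    while not results.empty():
        transcript.append(f"said: {results.get_nowait()}")
    return transcript


transcript = asyncio.run(interaction_model(["first", "second", "third", "fourth"]))
for line in transcript:
    print(line)
```

The conversational loop keeps consuming user chunks while the slow job runs, and the result is inserted into the transcript whenever it happens to arrive, which is the behavior the two-part design is meant to provide.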

This setup allows the AI to perform tasks like live translation and UI chart generation while continuing to listen to user feedback. The capability was demonstrated in the announcement video, in which the model generated a bar chart while describing typical human reaction times to various cues.

Superior performance on key benchmarks compared to fast interaction models from other leading AI labs

To demonstrate the effectiveness of this approach, the lab used FD-bench, a benchmark specifically designed to measure the quality of interaction rather than just raw intelligence. The results show that TML-Interaction-Small significantly outperforms existing real-time systems.

  • Responsiveness: It achieved a turn-taking latency of 0.40 seconds, compared to 0.57 seconds for Gemini-3.1-flash-live (minimal) and 1.18 seconds for GPT-realtime-2.0 (minimal).

  • Quality of interaction: On FD-bench V1.5 it scored 77.8, well ahead of its main competitors: GPT-realtime-2.0 (minimal) scored 46.8 and Gemini-3.1-flash-live (minimal) scored 54.3.

  • Visual proactivity: In specialized tests such as RepCount-A (counting physical repetitions in video) and ProactiveVideoQA, the Thinking Machines model engaged with the visual world while other frontier models remained silent or gave inaccurate answers.

| Metric | TML-Interaction-Small | GPT-realtime-2.0 (minimal) | Gemini-3.1-flash-live (minimal) |
| Turn-taking latency (seconds) | 0.40 | 1.18 | 0.57 |
| Interaction quality (average) | 77.8 | 46.8 | 54.3 |
| IFEval (VoiceBench) | 82.1 | 81.7 | 67.6 |
| HarmBench (refusal rate) | 99.0 | 99.5 | 99.0 |
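For readers who want the gaps as ratios, the short script below recomputes them from the reported figures. The numbers are copied from the company’s results; the `scores` dictionary and the `ratios_vs` helper are just illustrative scaffolding.

```python
from typing import Dict

# Figures as reported by Thinking Machines; lower latency is better.
scores: Dict[str, Dict[str, float]] = {
    "TML-Interaction-Small": {"latency_s": 0.40, "quality": 77.8},
    "GPT-realtime-2.0": {"latency_s": 1.18, "quality": 46.8},
    "Gemini-3.1-flash-live": {"latency_s": 0.57, "quality": 54.3},
}


def ratios_vs(baseline: str, rival: str) -> Dict[str, float]:
    """How many times faster, and how many times higher-scoring, the baseline is."""
    base, other = scores[baseline], scores[rival]
    return {
        "latency_advantage": other["latency_s"] / base["latency_s"],
        "quality_ratio": base["quality"] / other["quality"],
    }


comparison = {
    rival: ratios_vs("TML-Interaction-Small", rival)
    for rival in scores
    if rival != "TML-Interaction-Small"
}
for rival, r in comparison.items():
    print(f"{rival}: {r['latency_advantage']:.2f}x faster, "
          f"{r['quality_ratio']:.2f}x the quality score")
```

By this arithmetic, the model responds about 2.95 times faster than GPT-realtime-2.0 and posts about 1.66 times its interaction-quality score. Against Gemini-3.1-flash-live, the margins are about 1.43 times on both measures.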

Once the model is available, it could be a huge boon for companies

If Thinking Machines’ interaction models become available to the enterprise sector, they could represent a fundamental shift in how companies integrate AI into their operational workflows.

Native interaction models like TML-Interaction-Small could enable several enterprise capabilities that are currently impossible or very fragile with standard multimodal models.

Today’s enterprise AI must wait for a “turn” to complete before it analyzes data. In manufacturing and research settings, a native interaction model could monitor video feeds and proactively intervene the moment a safety violation or deviation from protocol is detected, without having to wait for workers to request feedback.

The model’s success on visual benchmarks such as RepCount-A (accurate repetition counting) and ProactiveVideoQA (answering questions as visual evidence is presented) suggests that it has the potential to serve as a real-time auditor for high-stakes physical tasks.

The main problem with voice-based customer service is the 1-2 second “processing” delay that is common in standard APIs in 2026. Thinking Machines’ model achieves a turn-taking latency of 0.40 seconds, which is about the same as the speed of natural human conversation.

Because the interaction model natively handles simultaneous audio, enterprise support bots built on it could listen to customer complaints, offer “backchannel” cues (like “I see” or “hmm”) without interrupting the user, and provide live translation that feels like a natural conversation rather than a series of disjointed recordings.

Standard LLMs have no internal clock. They only “know” the time if it is provided in a text prompt. The interaction model is natively time-aware, so it can manage time-sensitive requests such as “Remind me to check the temperature every 4 minutes” or “Warn me if this process takes longer than the last one.” That matters for industrial maintenance and pharmaceutical research, where timing is a critical variable.
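The announcement does not spell out how time-awareness works internally. One plausible reading, offered here purely as an assumption, is that a model consuming fixed 200 ms chunks can track elapsed time by counting them. The `ChunkClock` class and the `chunks_for` and `reminder_due` helpers below are hypothetical.

```python
CHUNK_MS = 200  # micro-turn length reported in the announcement


def chunks_for(minutes: float) -> int:
    """Number of fixed-length chunks that span the given duration."""
    return round(minutes * 60_000 / CHUNK_MS)


class ChunkClock:
    """Tracks elapsed time purely by counting processed chunks (an assumption)."""

    def __init__(self) -> None:
        self.chunks = 0

    def tick(self) -> None:
        self.chunks += 1

    @property
    def elapsed_s(self) -> float:
        return self.chunks * CHUNK_MS / 1000


def reminder_due(clock: ChunkClock, every_minutes: float) -> bool:
    """True on the chunk that completes each reminder period."""
    period = chunks_for(every_minutes)
    return clock.chunks > 0 and clock.chunks % period == 0


# Simulate eight minutes of conversation with a reminder every four minutes.
clock = ChunkClock()
reminders = 0
for _ in range(chunks_for(8)):
    clock.tick()
    if reminder_due(clock, every_minutes=4):
        reminders += 1
print(f"{clock.elapsed_s:.0f} s elapsed, {reminders} reminders fired")
```

Under this reading, “every 4 minutes” corresponds to every 1,200 chunks, with no timestamp ever appearing in a text prompt.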

Background on Thinking Machines

This release marks Thinking Machines’ second major milestone, following the October 2025 release of Tinker, a managed API for fine-tuning language models that gives researchers and developers control over their data and training methods, while Thinking Machines handles the infrastructure burden of distributed training.

The company says Tinker supports small and large open-weight models, including mixture-of-experts models, and that early users included groups at Princeton, Stanford, Berkeley, and Redwood Research.

When it launched in early 2025, Thinking Machines positioned itself as an AI research and product company that aims to make advanced AI systems “more widely understood, customizable, and generally capable.”

In July 2025, Thinking Machines announced that it had raised approximately $2 billion at a $12 billion valuation in a round led by Andreessen Horowitz with participation from Nvidia, Accel, ServiceNow, Cisco, AMD, and Jane Street. According to WIRED, this is the largest seed funding round in history.

According to an August 2025 report in The Wall Street Journal, after Mark Zuckerberg, CEO of rival technology company Meta, approached Murati about acquiring Thinking Machines Lab and she declined, Meta went after more than a dozen of the startup’s roughly 50 employees.

In March and April 2026, the company also became known for its computing ambitions. It announced an Nvidia partnership to deploy at least 1 gigawatt of next-generation Vera Rubin systems. It has since expanded its relationship with Google Cloud to use Google’s AI Hypercomputer infrastructure and Nvidia GB300 systems for model research, reinforcement learning workloads, frontier model training, and Tinker.

Business Insider reported that by April 2026, Meta had hired seven founding members from Thinking Machines, including Mark Jen and Yinghai Lu, and that another Thinking Machines researcher, Tianyi Zhang, had also moved to Meta. The same report states that Joshua Gross, who helped develop Thinking Machines’ flagship fine-tuning product, Tinker, has joined Meta Superintelligence Labs, and that despite the departures, the company has grown to about 130 employees.

But Thinking Machines wasn’t just losing talent. It hired Meta veteran Soumith Chintala, creator of PyTorch, as CTO and added notable technical talent such as Neal Wu. TechCrunch separately reported in April 2026 that Weiyao Wang, a veteran of Meta who worked on multimodal perception systems for eight years, had joined Thinking Machines, emphasizing that the flow of talent is not unidirectional.

Thinking Machines previously said its releases would include “significant open source components” to empower the research community. It is unclear whether these new interaction models fall under the same philosophy and release criteria.

But one thing is for sure: Thinking Machines believes that by making interactivity native to models, scaling them up will make them not only smarter but also more effective collaborators.


