Things you need to know about Gemini 2.5 Deep Think


Parallel thinking model
Image created with Imagen 4

Google has released Gemini 2.5 Deep Think, a new large reasoning model. It extends the model's “thinking time” by scaling inference-time compute. This approach took one version of the model to gold-medal standard at the International Mathematical Olympiad (IMO).

The newly released model is less capable than that version, but it still achieves bronze-level performance on the 2025 IMO benchmark.

A parallel approach to reasoning

The Deep Think method is partly inspired by human cognition. The model uses parallel thinking techniques to generate and explore many ideas at once. Instead of following a single path, it explores, in the words of Jack Rae, a research scientist at Google DeepMind, “the deeper chains of thought and parallel thinking that can be integrated into one another.”

For example, when solving a mathematical problem, the model might test solutions based on Rolle's theorem and Newton's inequality while also exploring a proof by contradiction. The system can modify or combine these different ideas before settling on a final answer. To make effective use of this extended thinking time, Google has developed a new reinforcement learning (RL) technique that encourages the model to use extended reasoning paths. (Unfortunately, Google has not shared any details about the new RL algorithm, which appears to be a key component behind Deep Think's strong performance on reasoning problems.)

https://www.youtube.com/watch?v=8eqo4j2bwkw

Under the hood of Gemini 2.5

The Gemini 2.5 family, including Deep Think, is built on a sparse mixture-of-experts (MoE) transformer architecture, which is also used in other reasoning models such as DeepSeek-R1. This design is key to its efficiency. Sparse MoE models learn to dynamically route each input token to a specialized subset of the model's parameters (“experts”) that have the skills needed to process it. This decouples the total capacity of the model from the computational cost of processing each token.
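To make the routing idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy illustration of sparse MoE in general, not Gemini's implementation, which Google has not published. The names `route_token`, `moe_layer`, and `make_expert`, the choice of 8 experts with 2 active per token, and the random router weights are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token_vec, router_weights, top_k):
    """Pick the top_k experts for one token and return (expert_index, gate) pairs.

    router_weights[e] is a hypothetical router vector for expert e.
    Gates are renormalized over the selected experts so they sum to 1.
    """
    logits = [sum(w * x for w, x in zip(weights, token_vec))
              for weights in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)
    return [(e, probs[e] / norm) for e in chosen]


def moe_layer(token_vec, router_weights, experts, top_k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = [0.0] * len(token_vec)
    for e, gate in route_token(token_vec, router_weights, top_k):
        expert_out = experts[e](token_vec)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output


def make_expert(scale):
    """Toy expert that scales its input; real experts are feed-forward networks."""
    return lambda vec: [scale * x for x in vec]


# Assumed toy sizes: 8 experts in total, 2 active per token.
NUM_EXPERTS, DIM, TOP_K = 8, 4, 2
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
experts = [make_expert(scale=e + 1) for e in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
routing = route_token(token, router_weights, TOP_K)
mixed = moe_layer(token, router_weights, experts, TOP_K)
print("selected experts and gates:", routing)
print("layer output:", mixed)
print(f"experts run per token: {TOP_K} of {NUM_EXPERTS}")
```

In this sketch only 2 of the 8 experts run for each token, so adding more experts grows the total parameter count without changing the per-token compute. That is the decoupling of capacity from cost described above.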

The model is natively multimodal and accepts text, images, audio, and video files within a one-million-token context window. It can generate text outputs of up to 192,000 tokens, which helps with problems that require very long reasoning chains. In comparison, Gemini 2.5 Pro's output capacity is 65,536 tokens.

Unfortunately, there is little to no detail on either the model architecture or the training techniques. From what we know, Google is primarily changing its post-training regime so that the model generates more consistent chain-of-thought (CoT) sequences. At the same time, it combines RL with multi-sampling techniques so that the model not only thinks longer but also samples multiple answers, adjusts them, and combines them to produce the final answer.
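Since Google has not described how Deep Think samples and combines answers, the sketch below uses a well-known stand-in: self-consistency, which draws several independent answers and keeps the most common one. It is not Deep Think's method, and it captures only the sampling and selection step, not the adjusting or merging of partial ideas. The `sample_answer` function is a hypothetical placeholder for a model call, and the 70% per-sample accuracy is an arbitrary assumption.

```python
import random
from collections import Counter


def sample_answer(problem, rng):
    """Hypothetical stand-in for one sampled reasoning path.

    A real system would call a model here. This toy solver returns the
    correct sum most of the time and an off-by-one answer otherwise.
    """
    a, b = problem
    correct = a + b
    if rng.random() < 0.7:  # assumed per-sample accuracy
        return correct
    return correct + rng.choice([-1, 1])


def parallel_think(problem, num_paths, sampler=sample_answer, seed=0):
    """Sample several independent answers and keep the most common one."""
    rng = random.Random(seed)
    answers = [sampler(problem, rng) for _ in range(num_paths)]
    votes = Counter(answers)
    best, count = votes.most_common(1)[0]
    return best, count / num_paths, votes


answer, agreement, votes = parallel_think((17, 25), num_paths=15)
print("votes:", dict(votes))
print("final answer:", answer, "agreement:", round(agreement, 2))
```

The agreement ratio doubles as a rough confidence signal: when the sampled paths disagree widely, the final answer deserves less trust. Sampling more paths also multiplies the inference cost, which is one reason this style of reasoning is expensive to serve.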

Performance on complex benchmarks

Gemini 2.5 Deep Think Benchmarks
Gemini 2.5 Deep Think performance on various key benchmarks (source: Google Blog)

Deep Think's performance is demonstrated on benchmarks that measure creative and strategic problem solving. On the USA Mathematical Olympiad, the model reached the 65th percentile of participants, a noticeable improvement over the 50th percentile achieved by Gemini 2.5 Pro.

It also achieves state-of-the-art performance on LiveCodeBench v6, a competitive coding benchmark, and on Humanity's Last Exam, which measures expertise across domains such as science and mathematics. These capabilities translate into practical applications that require iterative development. The examples shared by Google show that Deep Think can design graphics that are much more detailed and complex than those produced by previous versions of Gemini. Deep Think can improve both the aesthetics and the functionality of a website, and it excels at difficult coding problems where problem formulation and time complexity matter. (It also does well on Simon Willison's famous “Pelican on a Bicycle” test.)

Gemini 2.5 Deep Think Art Generation
Gemini 2.5 Deep Think can generate highly complex and detailed graphics (source: Google Blog)

How does this translate into real applications? That remains to be seen. I haven't had access to the model yet, but from the examples other users have shared on X, Deep Think seems impressively good at handling a single complicated prompt. (Note that in practice, you usually solve a task over several iterations, so Deep Think might be a good place to start, with smaller follow-up adjustments made on smaller, less expensive models such as Gemini 2.5 Flash or Pro.)

Access and Safety Considerations

Google is rolling out Deep Think in stages. For now, access is very limited. Google AI Ultra subscribers ($250 per month) get a fixed number of prompts per day with the model in the Gemini app. This version is integrated with tools such as Google Search and code execution.

Logan Kilpatrick, a product lead for Google AI Studio, suggested on X that the current limits are due to the enormous cost of running the model. This may mean that Gemini 2.5 Deep Think will become more widely available once Google works out how to optimize its inference infrastructure and run the model at scale.

A small group of mathematicians and academics will receive access to the full IMO gold-medal version to support their research. In the coming weeks, Google plans to release Deep Think through the Gemini API to a set of trusted testers. Google's tests show improved content safety and tone objectivity compared to Gemini 2.5 Pro, but the model is also more likely to refuse benign requests. Google says it is looking more deeply into the risks associated with this increased complexity through its frontier safety assessment.




