Things you need to know about Gemini 2.5 Deep Think

Parallel thinking model (image created with Imagen 4)

Google has released Gemini 2.5 Deep Think, a new large reasoning model that extends its “thinking time” by scaling inference-time compute. This approach brought a version of the model to gold-medal standard at the International Mathematical Olympiad (IMO).

The newly released model is less capable, but it still achieves bronze-level performance on the 2025 IMO benchmark.

A parallel approach to reasoning

The Deep Think method is partly inspired by human cognition. The model uses parallel thinking techniques to generate and explore many ideas at once. Instead of following a single path, it explores what Jack Rae, a research scientist at Google DeepMind, describes as “the deeper chains of thought and parallel thinking that can be integrated into one another.”

For example, when solving a mathematical problem, it might test solutions based on Rolle's theorem and Newton's inequality while also exploring a proof by contradiction. The system can modify or combine these different ideas before settling on a final answer. To make effective use of this extended thinking time, Google has developed a new reinforcement learning (RL) technique that encourages the model to use extended reasoning paths. (Unfortunately, Google has shared no details about the new RL algorithm, which appears to be a key component behind Deep Think's strong performance on reasoning problems.)

https://www.youtube.com/watch?v=8eqo4j2bwkw

Under the hood of Gemini 2.5

The Gemini 2.5 family, including Deep Think, is built on a sparse mixture-of-experts (MoE) transformer architecture, which is also used in other reasoning models such as DeepSeek-R1. This design is key to its efficiency. Sparse MoE models learn to dynamically route each input token to a specialized subset of the model's parameters (“experts”) that is best suited to process it. This decouples the total capacity of the model from the computational cost of processing each token.
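To make the routing idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not Gemini's implementation, whose details are not public. The router weights, the toy experts, the vector size, and the choice of `top_k=2` are all assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token_vec, router_weights, top_k=2):
    """Return [(expert_index, gate_weight), ...] for the top_k experts."""
    # One router logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the gates of the selected experts sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token_vec, router_weights, experts, top_k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token_vec)
    for idx, gate in route_token(token_vec, router_weights, top_k):
        expert_out = experts[idx](token_vec)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Toy setup (all values hypothetical): 8 experts, each a simple scaling function.
rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 8
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
routing = route_token(token, router_weights, top_k=2)
output = moe_layer(token, router_weights, experts, top_k=2)
print(routing, output)
```

In this toy setup only 2 of the 8 experts run for each token, so the per-token compute stays roughly constant even if more experts are added. That is the sense in which capacity and per-token cost are decoupled.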

The model is natively multimodal and accepts text, images, audio, and video files within a one-million-token context window. It can generate text output of up to 192,000 tokens, which helps with problems that require a very long reasoning chain. In comparison, Gemini 2.5 Pro's output capacity is 65,536 tokens.

Unfortunately, there is little to no detail on either the model architecture or the training techniques. From what we know, Google has primarily changed its post-training regime to allow the model to generate more consistent chain-of-thought (CoT) sequences. At the same time, it combines RL and multi-sampling techniques so that the model can not only think longer, but also sample multiple answers, adjust them, and combine them to generate the final answer.
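Since Google has not published how Deep Think samples and merges its answers, the following is only a rough sketch of the general idea, using a well-known stand-in technique: sample several reasoning paths in parallel and pick the answer with the highest confidence-weighted vote. The `sample_candidate` function is a hypothetical placeholder for a model call, and the toy "problem" (summing a list of numbers) is an assumption chosen so the example runs on its own.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sample_candidate(problem, seed):
    """Hypothetical stand-in for one reasoning path; returns (answer, confidence)."""
    rng = random.Random(seed)
    correct = sum(problem)
    # Toy behavior: most paths reach the right sum, some are off by one.
    answer = correct if rng.random() < 0.7 else correct + rng.choice([-1, 1])
    return answer, rng.uniform(0.5, 1.0)

def parallel_think(problem, num_paths=8):
    """Sample num_paths candidates concurrently, then combine them by weighted vote."""
    with ThreadPoolExecutor(max_workers=num_paths) as pool:
        candidates = list(
            pool.map(lambda seed: sample_candidate(problem, seed), range(num_paths))
        )
    votes = Counter()
    for answer, confidence in candidates:
        votes[answer] += confidence  # each path votes with its confidence
    best_answer, _ = votes.most_common(1)[0]
    return best_answer, candidates

answer, candidates = parallel_think([3, 5, 7, 11], num_paths=8)
print("candidates:", candidates)
print("selected:", answer)
```

Weighted voting is the simplest way to combine samples. The article describes something richer, in which candidate answers are also adjusted and merged, and this sketch does not attempt to model that.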

Performance on complex benchmarks

Gemini 2.5 Deep Think performance on various key benchmarks (source: Google Blog)

Deep Think's performance is demonstrated on benchmarks that measure creative and strategic problem solving. On the USA Mathematical Olympiad, the model reached the 65th percentile of participants, a noticeable improvement over the 50th percentile achieved by Gemini 2.5 Pro.

It also achieves state-of-the-art performance on LiveCodeBench v6, a competitive coding benchmark, and on Humanity's Last Exam, which measures expertise across domains such as science and mathematics. These capabilities translate into practical applications that require iterative development. The examples shared by Google show that Deep Think can design graphics that are much more detailed and complex than those produced by previous versions of Gemini. Deep Think can improve both the aesthetics and the functionality of a website, and it excels at hard coding problems where problem formulation and time complexity are important. (It also does well on Simon Willison's famous “pelican riding a bicycle” test.)

Gemini 2.5 Deep Think can generate highly complex and detailed graphics (source: Google Blog)

How does this translate to real applications? That remains to be seen. I haven't had access to the model yet, but from the examples other users have shared on X, Gemini 2.5 Deep Think seems impressively good at handling a complicated prompt in a single shot. (Note that in practice you usually solve a task over several iterations, so Deep Think might be a good place to start, and you may then be able to make small adjustments with smaller, less expensive models such as Gemini 2.5 Flash or Pro.)

I prompted Gemini 2.5 Deep Think, then asked it to improve the design several times.

The complete design and all the code and calculations came from AI, with each version running without errors. pic.twitter.com/zt8efsxitg

– Ethan Mollick (@emollick) August 2, 2025

Access and Safety Considerations

Google is rolling out Deep Think in stages. For now, access is very limited. Google AI Ultra subscribers ($250 per month) get a fixed number of prompts per day to the model in the Gemini app. This version is integrated with tools such as Google Search and code execution.

Logan Kilpatrick, a product lead for Google AI Studio, suggested on X that the current limits are due to the enormous cost of running the model. This may mean that Gemini 2.5 Deep Think will become more widely available once Google works out how to optimize its inference infrastructure and run the model at scale.

Great feedback, this is a big model and the release is constrained because it requires boatloads of compute at a time when the TPUs are already burning to keep up with big growth from Veo, Gemini 2.5 Pro, the AI Mode rollout, etc.

– Logan Kilpatrick (@Officiallogank) August 1, 2025

A small group of mathematicians and academics will receive access to the full IMO gold-medal version to support their research. In the coming weeks, Google plans to release Deep Think via the Gemini API to a set of trusted testers. Google's tests show improved content safety and tone objectivity compared to Gemini 2.5 Pro, but the model is also more likely to refuse benign requests. Google says it is taking a deeper look at the risks that come with this increased complexity through its frontier safety evaluations.
