Gemini 2.5 Pro vs. Claude 3.7 Sonnet (thinking) vs. Grok 3 (think)

Google dropped its best-ever creation, Gemini 2.5 Pro Experimental, on March 25. It is a stupidly incredible reasoning model shining on every benchmark, but guess what went wrong? Yes, you guessed it right: OpenAI. They released GPT-4o image generation, and everyone isekai’d to Ghibli world. Not that I’m complaining; I enjoyed it myself, but it completely overshadowed the release of Gemini 2.5 Pro, which is a much bigger deal.

This says more about how humans work than anything else. We’re a lot more hedonistic than we like to admit. Anyway, we now have the best from Google, and for FREE. Yes, you can access the model in Google AI Studio at no cost. Again, why is it not in the Gemini app, and why is AI Studio not available to mobile users? If Google wants to win over the masses, it needs to sort these things out. As an LLM user, I often find myself reaching for mobile apps, and not having these free offerings there substantially impacts adoption.
We saw how DeepSeek trended in the app stores after DeepSeek R1, yet Google has no comparable mobile presence for AI Studio despite owning Android. Crazy!

Table of Contents
- Notable improvements
- Shining on benchmarks
- Gemini 2.5 Pro in the real world
- Comparing performance with Claude 3.7 Sonnet and Grok 3 (thinking)
- Conclusion
TL;DR
If you’re in a hurry, here’s the summary.
- Finally, a great model from Google that is on par with the rest of the AI big boys.
- Improved coding performance, a one-million-token context window, and much better multilingual understanding.
- The model excels at raw coding performance and has drawn highly positive user feedback.
- It can solve most day-to-day coding tasks better than any other reasoning model.
- However, Grok 3 and Claude 3.7 Sonnet have more nuanced and streamlined reasoning.
Notable improvements
Gemini 2.5 Pro is an advanced reasoning model, building on Gemini 2.0 Flash’s architecture with enhanced post-training and, potentially, a larger parameter count.
What stands out is its colossal one-million-token context window. You can throw an entire codebase at it and still get reasonably good performance. The closest competitor is Claude 3.7 Sonnet’s 200k-token context window.
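If you want to try that yourself, here is a minimal sketch of what stuffing a codebase into a single prompt can look like. It assumes the google-generativeai Python SDK, a placeholder API key and project folder, and the experimental model ID that was current at the time of writing (check AI Studio for the exact name):

```python
# Minimal sketch (assumptions: google-generativeai SDK installed, placeholder
# API key and folder, and the experimental model ID current at launch).
import pathlib
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

# Concatenate every Python file in the project into one long prompt.
repo = pathlib.Path("./my_project")
code_dump = "\n\n".join(
    f"# FILE: {path}\n{path.read_text()}" for path in sorted(repo.rglob("*.py"))
)

response = model.generate_content(
    "Here is my codebase:\n" + code_dump + "\n\nSummarize the architecture and point out obvious bugs."
)
print(response.text)
```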

The new Gemini 2.5 Pro is much better at multilingual understanding. As noted by Omar, it even broke the LMSYS leaderboard for Spanish.

Shining on Benchmarks
Gemini 2.5 Pro has performed exceedingly well on all the major benchmarks, such as LMSYS Arena, LiveBench, GPQA, AIME, and SWE-bench Verified.

However, on the ARC-AGI benchmark, it scored on par with DeepSeek R1 but below Claude 3.7 Sonnet (thinking) and o3-mini-high.

Gemini 2.5 Pro also topped the WeirdML benchmark, which tests an LLM’s ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learning from feedback.

It’s also now the leading model on Paul Gauthier’s Aider Polyglot benchmark.

Gemini 2.5 Pro in the real world
Since the new Gemini 2.5 Pro was launched, people have been having a field day, especially in coding. Here are some great examples.
Xeophon tested Gemini 2.5 Pro, Claude 3.7 Sonnet, and GPT-4.5 against today’s Wordle problem.

Prompt to shader by Ethan Mollick.
Gemini 2.5 is a very impressive model so far.
For example, it is only the third LLM (after Grok 3 and o3-mini-high) to be able to pull off “create a visually interesting shader that can run in twigl-dot-app make it like the ocean in a storm,” and I think the best so far.
Philipp Schmid generated a flight simulator in Three.js in one go with Gemini 2.5 Pro.
Matthew Berman made a 3D Rubik’s cube solver.
A Rubik’s cube generator AND solver. I’ve tried this with Claude 3.7 thinking, DeepSeek etc and never came close to this.
Phil created a zombie game in a single shot.
Wow crazy, Gemini 2.5 Pro one-shotted this.
“Create a 3d zombie wave shooter without any downloads with Three.js which runs in the browser. The settings should be in an apocalyptic city which is procedural generated. Without the need of a server”
Karan Vaidya built an agent to build games end-to-end with Composio tools.
Gemini 2.5 Pro might just be the best model for vibe coding games!!!
We built an AI Agent and asked it to create Galaga, and in seconds, it generated a fully playable arcade shooter. How it works:
• Uses Pygame to create the game
• Generates, saves, and runs the game instantly
• Powered by Composio’s File and Shell Tool
Gemini 2.5 Pro vs. Claude 3.7 thinking vs. Grok 3 thinking
I have used and tested all of these models, and among reasoning models, Claude 3.7 Sonnet and Grok 3 have stood out the most; they are currently the best reasoning models on the market.
So, how does the new Gemini 2.5 Pro compare to them? I was as curious as the next person, so I ran three coding tests, plus a set of reasoning and math problems, to find out which one is the best.
Coding problems
1. Interactive Bouncing Balls in a 3D Cube
Prompt: Create a 3D scene using Three.js with a visible transparent cube containing 10 coloured balls. The balls should bounce off the cube walls and respond to mouse movement and camera position. Use OrbitControls and add ambient lighting. All code should be in a single HTML file.
Response from Grok 3 Think:
You can find the code it generated here: Link
Here’s the output of the program:
Grok 3’s result looked good at first: the balls moved the right way and everything showed up as expected. But after a short time they began to stick together and collapse into one another, the bouncing broke down, and the scene did not work the way I had hoped.
Response from Gemini 2.5:
You can find the code it generated here: Link
Here’s the output of the program:
Gemini 2.5 gave the most accurate result. The movement was smooth, and the way the balls bounced felt surprisingly close to real-world physics. Everything ran flawlessly from the start.
Response from Claude 3.7 Sonnet Thinking:
You can find the code it generated here: Link
Here’s the output of the program:
Claude 3.7 didn’t quite land it. The cube and balls looked great, but the ball movement felt off. They bounced at the start, then just froze. The interaction was missing, and the whole thing felt unfinished.
Summary
Gemini 2.5 gave the best result. The ball movement was smooth and realistic, and everything worked as expected. Grok 3 started off well, but the balls quickly began to stick together and stop bouncing, which broke the effect. Claude 3.7 had a nice setup, but the balls froze soon after starting, and the scene felt unfinished.
Gemini was the only one that stayed solid from start to finish.
2. Minecraft with Pygame
It’s time to get creative with the models. For this one, I wanted to see how well they could prototype a simple Minecraft-style game using Pygame.
The goal wasn’t to rebuild Minecraft from scratch but to generate a basic world made of blocks with the ability to move around and place or remove blocks—basically, something that looks and feels like Minecraft Lite.
Prompt:
Create a 2D Minecraft-style game in Pygame. Build a colourful, scrollable grid world with smooth camera movement using WASD keys. Left-click to place blocks, right-click to break them. Include at least three block types with different textures or colours. Add simple animations, a sky background, and basic sound effects for placing and breaking blocks. Everything should run in one Python file and feel playful and responsive.
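Before looking at the outputs, here is a rough idea of the core mechanics this prompt is asking for — a minimal sketch of my own (not any model’s output): a dictionary-backed grid, WASD camera scrolling, left-click to place, right-click to break, and number keys to switch block types. Sound, animations, and textures are left out, and the colours and sizes are arbitrary.

```python
# Minimal sketch of the core mechanics (my own, not a model's output).
import pygame

TILE, W, H = 32, 800, 600
BLOCKS = {1: (106, 190, 48), 2: (143, 86, 59), 3: (120, 120, 120)}  # grass, dirt, stone

pygame.init()
screen = pygame.display.set_mode((W, H))
clock = pygame.time.Clock()

world = {}                # {(grid_x, grid_y): block_id}
cam_x = cam_y = 0.0
current_block = 1

running = True
while running:
    dt = clock.tick(60) / 1000
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN and pygame.K_1 <= event.key <= pygame.K_3:
            current_block = event.key - pygame.K_0            # keys 1/2/3 switch block type
        elif event.type == pygame.MOUSEBUTTONDOWN:
            mx, my = event.pos
            cell = (int((mx + cam_x) // TILE), int((my + cam_y) // TILE))
            if event.button == 1:                             # left click: place
                world[cell] = current_block
            elif event.button == 3:                           # right click: break
                world.pop(cell, None)

    keys = pygame.key.get_pressed()                           # WASD camera scroll
    cam_x += (keys[pygame.K_d] - keys[pygame.K_a]) * 300 * dt
    cam_y += (keys[pygame.K_s] - keys[pygame.K_w]) * 300 * dt

    screen.fill((135, 206, 235))                              # sky background
    for (gx, gy), block_id in world.items():
        rect = (gx * TILE - cam_x, gy * TILE - cam_y, TILE, TILE)
        pygame.draw.rect(screen, BLOCKS[block_id], rect)
    pygame.display.flip()

pygame.quit()
```

Storing the world as a dictionary keyed by grid coordinates is what keeps the map effectively unbounded and the scrolling cheap; it is also the part the models most often get wrong when blocks fail to snap to a grid.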
Response from Grok 3 Think
You can find the code it generated here: Link
Grok gave me a working Minecraft-style sandbox, but the result was disappointing. Placing blocks had some delay, the movement felt a bit slow, and the overall design was not very appealing. It worked, but the experience didn’t feel smooth or polished.
There was no scrolling or proper camera movement, and the blocks didn’t snap to a grid, which made the building feel messy. It was usable but far from enjoyable.
Here is the output:
Response from Gemini 2.5
You can find the code it generated here: Link
Gemini absolutely nailed this one. It gave me a smooth, fully playable Minecraft-style game that felt more polished than the other two. The controls felt like they came straight out of an actual game.
I could move the camera with WASD, place blocks with the left mouse button, and remove them with the right. Switching between block types like Grass, Dirt, and Stone using the number keys worked perfectly, and even ESC to quit was built in.
Everything ran cleanly, the block placement felt satisfying, and the camera movement was smooth. It just felt right.
Here is the output:
Response from Claude 3.7 Sonnet Thinking
You can find the code it generated here: Link
Claude went all in on this one. It gave me a fully playable 2D Minecraft-style game that actually felt like a game. I could move around with WASD, place and break blocks with mouse clicks, and even switch between five different block types using the number keys. Its visual block selector at the bottom made it feel more like a real UI than just boxes on a screen.
And then came the extras. Smooth camera movement, particle effects when blocks broke, soft shadows, a gradient sky, and even terrain that looked like something you’d actually want to build on. Claude was not messing around.
Here is the output:
Summary
All three models managed to build a working Minecraft-style game, but the results varied. Grok 3’s version was functional but disappointing. Block placement had a noticeable delay, the movement felt slow, and the lack of grid snapping or camera controls made the experience rough.
Gemini 2.5 and Claude 3.7 both delivered strong, playable games with smooth controls and responsive mechanics. Gemini focused on clean execution and solid gameplay, while Claude added extra touches like particle effects, block switching, and UI enhancements.
In terms of overall quality, Gemini and Claude stood out equally, each offering a complete and enjoyable experience in their own way. Grok worked but didn’t match the level of the other two.
3. Build a Creative Web App
For the final challenge, I wanted to see how the models handle building a complete, interactive website. This one isn’t just about clean HTML and CSS — it needs structure, logic, storage, and a touch of design sense.
Prompt:
Build a modern, responsive “Task Tracker” web app using HTML, CSS, and JavaScript. The app should let users add, edit, delete, and complete tasks. Include a sidebar with filters like “All”, “Active”, and “Completed”. Use localStorage to save tasks between sessions. The layout should be split-screen, with the task list on the right and filters on the left. Style it with a clean, minimal UI. Use soft shadows, rounded corners, and a muted color palette. Add basic animations for task transitions (like fade or slide), and make sure it works well on both desktop and mobile. Everything should run in a single HTML file.
Response from Grok 3 Think
You can find the code it generated here: Link
Grok showed up for this one. It gave me a fully working task tracker with every feature I requested. Adding, editing, completing, deleting, and filtering tasks all worked precisely how they should.
The layout followed the split-screen style, and the design looked surprisingly polished. Soft shadows, clean fonts, and smooth interactions made it feel like something you could actually use.
Here is the output:
Response from Gemini 2.5
You can find the code it generated here: Link
Gemini completely outperformed the rest on this one. The flow of the app felt natural from start to finish. Every interaction was smooth, the UI was clean and modern, and the layout followed the split-screen structure perfectly. Adding, editing, deleting, and filtering tasks all worked without a hitch.
The transitions were subtle, the animations felt polished, and even the little UI details made the whole thing feel a step above the others. It just clicked.
Here is the output:
Response from Claude 3.7 Sonnet Thinking
You can find the code it generated here: Link
Claude’s output was pretty much on the same level as Grok’s. It checked all the boxes: task creation, editing, deleting, completing, and filtering. The layout followed the prompt well, and the split-screen design looked clean. It even had nice visual touches that made the app more enjoyable.
Here is the output:
Summary
Grok 3, Claude 3.7, and Gemini 2.5 all delivered solid task tracker apps that followed the prompt closely. They handled adding, editing, deleting, filtering, and saving tasks with ease. The split-screen layout was implemented well in all three, and the designs felt clean and easy to use.
Claude added a bit more visual polish, Grok kept things simple and smooth, and Gemini brought a well-structured layout with a strong sense of flow. Each had its own strengths, but overall, they all performed at the same level and produced results that felt complete and ready to use.
Complex Reasoning
1. Riddle to check cognitive bias
This is a classic riddle to check a model’s cognitive bias. Most models fail this simple question.
Prompt: A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to the hospital. When the doctor sees the boy, he says, “I can’t operate on this child; he is my son!” How is this possible?

Claude 3.7 Sonnet was able to get it right.

However, tweaking the same riddle a bit derails the model.
Prompt: The surgeon, who is the boy’s father, says, “I can’t operate on this child; he is my son”, who is the surgeon of this child. Be straightforward”.

Grok 3’s answer to the original prompt:
Therefore, the answer is: The doctor is the boy’s father. The woman who died in the accident was the boy’s mother and the doctor’s wife. The doctor cannot operate on the boy because he is his son, and medical ethics discourage doctors from treating close family members due to emotional involvement. This explains both the relationship and the doctor’s refusal to proceed with the surgery.
And for the second prompt, the one we tweaked to confuse the models:
While the surgeon says he cannot operate, likely due to ethical guidelines preventing doctors from treating close family members to avoid emotional bias, the question does not ask who will perform the surgery. Instead, it asks, “Who is the surgeon of this child?” This focuses on identity, not action. Since the problem defines the surgeon as the boy’s father and provides no indication of another surgeon, the straightforward answer is based on the given description. Thus, the surgeon of this child is the boy’s father.
Answer: the boy’s father
So, Grok 3 (thinking) was better at overall reasoning here; it was able to answer both versions correctly.
2. Playing Tic-tac-toe
In this test, I gave the models a partial Tic-tac-toe board state and asked them to find the next move that guarantees a win. First, Gemini 2.5 Pro’s response:

Okay, let's analyze the board:
X is in the Top-Left and Bottom-Left corners.
O is in the Middle-Left and Bottom-Middle.
To guarantee a win (100%), you need to create a situation where you have two possible winning lines simultaneously (a "fork"), and your opponent can only block one of them on their next turn.
Let's look at the empty squares and the potential threats:
Top-Middle: Placing an X here creates a threat on the top row (needs Top-Right) and a threat in the middle column (needs Center).
Top-Right: Placing an X here creates a threat on the top row (needs Top-Middle) and a threat on the diagonal (needs Center).
Center (Middle-Middle): Placing an X here creates three threats:
Middle column (needs Top-Middle)
Diagonal Top-Left to Bottom-Right (needs Bottom-Right)
Diagonal Top-Right to Bottom-Left (needs Top-Right)
Middle-Right: Placing an X here creates a threat in the middle row (needs Center) and the right column (needs Top-Right and Bottom-Right).
Bottom-Right: Placing an X here creates a threat on the bottom row (needs Bottom-Middle - blocked by O) and the diagonal (needs Center).
The Winning Move:
Place your 'X' in the Center (Middle-Middle) square.
Why this guarantees a win:
You place X in the Center.
You now have three potential winning lines:
Top-Left, Center, Bottom-Right (Diagonal)
Top-Right, Center, Bottom-Left (Diagonal)
Top-Middle, Center, Bottom-Middle (Middle Column - already blocked by O, so not a threat) - Correction: O is in Bottom-Middle, not Top-Middle. The threat is Top-Middle, Center, Bottom-Middle.
Let's re-evaluate the Center move threats:
Diagonal 1: X (Top-Left), X (Center), Empty (Bottom-Right) -> Threat 1
Diagonal 2: Empty (Top-Right), X (Center), X (Bottom-Left) -> Threat 2
Middle Column: Empty (Top-Middle), X (Center), O (Bottom-Middle) -> No threat here.
So, placing X in the Center creates two simultaneous winning threats:
Completing the diagonal with Bottom-Right.
Completing the other diagonal with Top-Right.
'O' can only block one of these squares on their next turn.
Whichever square 'O' doesn't block, you place your next 'X' there and win the game.
Therefore, place your 'X' in the Center.
Gemini was able to find one of the winning moves. Great indeed.
Claude 3.7 Sonnet’s response
You should place your X in the top-right corner (upper right position). This creates a "fork" - a situation where you have two potential winning lines:
You'll threaten to complete the top row (where you already have the top-left X)
You'll also threaten to complete the diagonal from top-right to bottom-left (where you already have the bottom-left X)
Since your opponent can only block one of these threats on their next turn, you'll be able to win on your subsequent move by completing whichever line they didn't block. This is a 100% guaranteed win strategy.
Grok 3 thinking’s response
Conclusion
Placing "X" in position 5 (middle-middle) ensures a 100% win. It creates two simultaneous threats (diagonals 1-5-9 and 3-5-7), and no matter where "O" places their next mark among the remaining empty squares (2, 3, 6, 9), "X" can win on the following move by completing the unblocked diagonal. Therefore, the optimal move is:
Position 5 (middle-middle square).
All three models found a winning move. Grok, however, was too verbose, and none of them identified both of the winning squares, 3 and 5.
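To double-check that 3 and 5 really are the only squares that guarantee a win from this position (X in the top-left and bottom-left, O in the middle-left and bottom-middle, X to move), here is a small brute-force minimax sketch of my own, with squares numbered 1–9 row by row:

```python
# Brute-force check of the Tic-tac-toe position (my own verification sketch).
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winner(b):
    for i, j, k in WINS:
        if b[i] != "." and b[i] == b[j] == b[k]:
            return b[i]
    return None

def best_outcome(b, player):
    """+1 if X can force a win from here, -1 if O can, 0 if best play is a draw."""
    w = winner(b)
    if w:
        return 1 if w == "X" else -1
    if "." not in b:
        return 0
    nxt = "O" if player == "X" else "X"
    scores = [best_outcome(b[:i] + player + b[i + 1:], nxt)
              for i, c in enumerate(b) if c == "."]
    return max(scores) if player == "X" else min(scores)

board = "X..O..XO."  # squares 1-9 row by row: X at 1 and 7, O at 4 and 8, X to move
winning = [i + 1 for i in range(9)
           if board[i] == "." and best_outcome(board[:i] + "X" + board[i + 1:], "O") == 1]
print(winning)  # -> [3, 5]
```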
3. Complex blood relationship
So, this is a very complex and somewhat vague blood relationship riddle that most likely wasn’t in the training data.
Prompt: Jeff has two brothers, and each of his brothers has three sisters. Each of the sisters has four step-brothers, and each step-brother has five step-sisters. So, how many siblings are in the extended family?
Gemini 2.5 Pro’s response
Let's break this down step-by-step:
Jeff's Biological Siblings:
Jeff has 2 brothers. So there are 1 (Jeff) + 2 = 3 brothers in total.
Each of Jeff's brothers has 3 sisters. Since brothers share the same sisters, Jeff also has these same 3 sisters.
So, the biological family consists of 3 brothers and 3 sisters.
The Step-Brothers:
Each of the 3 sisters has 4 step-brothers. They all share the same group of step-brothers.
So, there is one group of 4 step-brothers.
The Step-Sisters:
Each of the 4 step-brothers has 5 step-sisters. Again, they share the same group of step-sisters.
Importantly, these 5 step-sisters are distinct from Jeff's 3 biological sisters (otherwise, the number wouldn't be 5).
So, there is one group of 5 step-sisters.
Total Siblings in the Extended Family:
Biological Brothers: 3
Biological Sisters: 3
Step-Brothers: 4
Step-Sisters: 5
Total = 3 + 3 + 4 + 5 = 15
There are 15 siblings in the extended family.
Gemini could not solve it. Let’s see the other models.
Claude 3.7 Sonnet’s response
This problem is about counting family members while being careful about overlapping relationships.
Let's break it down:
"Jeff has two brothers" - So there are 3 brothers total (Jeff + 2 brothers)
"Each of his brothers has three sisters" - Since these are all siblings, there are 3 sisters total (shared by all brothers including Jeff)
"Each of the sisters has four step-brothers" - These step-brothers are different from Jeff and his brothers (otherwise they'd just be "brothers"). So there are 4 step-brothers.
"Each step-brother has five step-sisters" - The step-brothers already have the 3 sisters as step-sisters, so there must be 2 additional step-sisters to make 5 total.
Counting everyone:
3 brothers (Jeff and his 2 brothers)
3 sisters
4 step-brothers
2 additional step-sisters
That gives us 12 siblings in the extended family.
Perfect, correct answer.
Grok 3’s response
Let’s determine the total number of siblings in Jeff’s extended family based on the given information.
Step 1: Jeff’s Immediate Family
Jeff has two brothers, meaning there are three boys in his immediate family, including Jeff himself. The problem states that each of his brothers has three sisters. Since Jeff’s brothers are boys, their sisters must be the girls in the family, and there are three of them. From Jeff’s perspective, he has two brothers and three sisters, which is consistent: the three sisters are shared among all the brothers. Thus, Jeff’s immediate family consists of:
3 boys: Jeff and his two brothers.
3 girls: The three sisters.
This makes a total of 6 siblings in Jeff’s immediate family.
Step 2: The Sisters’ Step-Brothers
Next, each of the three sisters has four step-brothers. Step-brothers are not biological siblings but are related through a parent’s marriage (e.g., children of a step-parent from a previous relationship). Since each of the three sisters has the same four step-brothers, there are 4 step-brothers in total. These step-brothers are distinct from the boys in Jeff’s immediate family because they are step-siblings, not biological siblings. So, we now have:
6 siblings in Jeff’s immediate family (3 boys, 3 girls).
4 step-brothers, who are step-siblings to the sisters (and, by extension, to Jeff and his brothers, though the problem focuses on the sisters).
This brings the total to 6 + 4 = 10 individuals so far.
Step 3: The Step-Brothers’ Step-Sisters
Now, each of the four step-brothers has five step-sisters. These step-sisters are the girls who are step-siblings to the step-brothers, likely through a marriage involving one of the step-brothers’ parents. Since all four step-brothers share the same five step-sisters, there are 5 step-sisters in total. The question is: who are these five step-sisters, and are they distinct from the individuals already counted?
The three sisters from Jeff’s immediate family are step-sisters to the step-brothers because they are the biological daughters of the step-brothers’ step-parent (e.g., if Jeff’s mother married the step-brothers’ father). However, each step-brother has five step-sisters, not three. This suggests there are additional step-sisters beyond the original three. If the three sisters account for three of the step-sisters, there must be 2 additional step-sisters to reach a total of five per step-brother. These additional step-sisters must be distinct from the sisters in Jeff’s immediate family and are likely from another family connected through a marriage.
Step 4: Constructing the Extended Family
Let’s model the family structure:
Family A (Jeff’s immediate family): 3 boys (Jeff and his two brothers) + 3 girls (the sisters) = 6 siblings.
Family B: The step-brothers’ biological family, contributing 4 step-brothers.
Family C: Another family contributing additional step-sisters.
Suppose Jeff’s mother married a man (the step-father) who has 4 sons from a previous marriage—these are the 4 step-brothers. From the step-brothers’ perspective, their step-sisters include Jeff’s 3 sisters. To reach five step-sisters, there must be 2 more girls. These could be daughters from another marriage of the step-father (biological sisters to the step-brothers but step-sisters to Jeff’s family) or from a separate family connected through another marriage, such as the step-brothers’ biological mother remarrying a man with daughters. For simplicity, assume these 5 step-sisters are a distinct group of 5 girls who are step-siblings to the step-brothers, including the 3 sisters plus 2 others.
However, to avoid overlap and align with the problem’s intent, consider that the 5 step-sisters are a separate group of 5 girls from a third family (Family C), connected to the step-brothers through a different marriage (e.g., the step-brothers’ mother’s remarriage). This keeps the groups distinct:
Family A: 6 siblings.
Family B: 4 step-brothers.
Family C: 5 step-sisters.
Step 5: Total Siblings in the Extended Family
The extended family includes all unique individuals connected through these marriages:
6 from Jeff’s immediate family.
4 step-brothers, distinct from Family A.
5 step-sisters, distinct from both Family A and Family B, assuming they are a separate group connected to the step-brothers.
This totals 6 + 4 + 5 = 15 individuals. If the 5 step-sisters included the 3 sisters from Family A, we’d have only 2 additional girls (5 – 3 = 2), totaling 6 + 4 + 2 = 12. However, the problem’s structure suggests the step-sisters are a separate count of 5, not overlapping with the 3 sisters, as “each step-brother has five step-sisters” implies a consistent group of 5 distinct from other counts unless specified.
Final Verification
Jeff’s brothers have 3 sisters: the 3 girls in Family A.
Each sister has 4 step-brothers: the 4 from Family B.
Each step-brother has 5 step-sisters: the 5 from Family C.
With no overlap (the 3 sisters are step-sisters to the step-brothers, but the step-brothers’ 5 step-sisters are a separate group), the total is:
15 siblings in the extended family.
Final Answer
15
Though it didn’t answer correctly outright, it reasoned well and considered the right relationships.
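For reference, the intended counting (the one Claude’s answer follows) works out as:

$$\underbrace{3}_{\text{brothers}} + \underbrace{3}_{\text{sisters}} + \underbrace{4}_{\text{step-brothers}} + \underbrace{5 - 3}_{\text{additional step-sisters}} = 12$$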
Summary of reasoning abilities
Of all the thinkers, Grok 3 thinking is the clearest, followed by Claude 3.7 Sonnet and then Gemini 2.5 Pro.
Mathematics
1. GCD of an infinite series
Prompt: Find the GCD of this series set {n^99(n^60-1): n>1}
It is a difficult question, and I would not be able to solve it myself. So, let’s see if the models get it right. Previously, only DeepSeek R1 and o1-pro were able to do it.
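As a quick numerical sanity check of my own (a hint, not a proof), you can take the GCD of the first several terms of the set; whatever the running GCD settles on is a strong candidate for the answer:

```python
# Numerical sanity check (not a proof): GCD of the first several terms n^99 * (n^60 - 1).
from math import gcd
from functools import reduce

terms = (n**99 * (n**60 - 1) for n in range(2, 50))
print(reduce(gcd, terms))
```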
Gemini 2.5 Pro

Okay, Gemini 2.5 Pro got it right.
Grok 3, however, got it wrong.

2. Find (x-14)^10 where x is the number of vowels in the response
To be honest, it requires more reasoning than math, but let’s see how the models fare.
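Since any answer here is self-referential, the easiest way to grade one is to count the vowels in the model’s full response and see whether (x − 14)^10 matches the number it reports — a small hypothetical checker of my own:

```python
# Hypothetical checker: is a model's response self-consistent?
def vowel_count(text: str) -> int:
    return sum(ch in "aeiouAEIOU" for ch in text)

candidate = "..."  # paste the model's full response here
x = vowel_count(candidate)
print(f"vowels = {x}, (x - 14)^10 = {(x - 14) ** 10}")
```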

Claude 3.7 Sonnet was able to answer it correctly.

Grok 3
Completely failed to get the gist of the question.

Summary of Math Abilities
In raw mathematics, I find Gemini 2.5 Pro to be the strongest, but when a problem involves a bit of reasoning, it doesn’t do as well. Claude 3.7 Sonnet is the most balanced here, and Grok 3 comes third.
Conclusion
Google has finally dropped a great model that is competitive with the other frontier models, and it is a clear sign that its lineup will keep improving.
So, here are my final findings.
- Extremely good at coding.
- Not as good at raw reasoning as Grok 3 (thinking) and Claude 3.7 Sonnet (thinking).
- For raw math, it’s a great model.