A comprehensive analysis of o3-Mini-High vs Claude 3.7 Sonnet Thinking vs Grok 3 Think vs DeepSeek R1 across multiple reasoning, math, coding, and writing questions. Which one is the best bang for your buck in 2025?
Motivation
It’s been a fascinating few months in the AI landscape with the emergence of a new breed of models – the “CoT reasoning” LLMs. These models don’t just generate text; they reason through problems step by step, often producing dramatically better results on complex tasks. But which one deserves your attention (and potentially your subscription dollars)? I’ve spent the last few weeks testing these models across reasoning, mathematics, coding, and creative writing tasks. Let’s examine how these thinking titans compare. But first, what is CoT? Let’s uncover that!
What is CoT-Based Thinking?
Chain of Thought (CoT) is a way of solving problems by breaking them down into manageable steps, considering multiple approaches, and even correcting errors along the way – much like humans do. This leads to more accurate and clearer results.
Let’s understand through an example:
Example
Say you’re calculating the total cost of a meal at a restaurant (assume 18% GST is included).
You ordered:
A burger for ₹100,
A pizza for ₹200,
A drink for ₹50.
Now, instead of guessing the total cost, the CoT Reasoning model follows a logical chain of thoughts:
First, add the burger and pizza: ₹100 + ₹200 = ₹300
Then, add the drink: ₹300 + ₹50 = ₹350
If there’s an 18% tax, calculate: 18% of ₹350 = ₹63
Add the tax to the total: ₹350 + ₹63 = ₹413
So, the final bill is ₹413.
This step-by-step thinking process is called Chain of Thought – it helps avoid mistakes and makes complex problems easier to understand and solve!
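To make the chain concrete, here is a minimal Python sketch (my own illustration, not tied to any particular model) that mirrors the same step-by-step computation:

items = {"burger": 100, "pizza": 200, "drink": 50}

# Steps 1 and 2: add up the items one by one, like the chain of thought above.
subtotal = 0
for name, price in items.items():
    subtotal += price  # ₹100 -> ₹300 -> ₹350

# Step 3: compute 18% GST on the subtotal.
gst = subtotal * 18 // 100  # 18% of ₹350 = ₹63

# Step 4: add the tax to get the final bill.
total = subtotal + gst
print(total)  # 413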
For a deeper understanding, check out the paper: Demystifying Long Chain-of-Thought Reasoning in LLMs.
But as with all things, each model implements CoT reasoning differently. Let’s examine that.
Frontier Reasoning LLMs
What makes CoT reasoning LLMs special is their ability to allocate additional computational resources to tackle complex problems – what the AI community calls “test-time compute.”
This means they can spend anywhere from a few seconds to several minutes reasoning through a problem, often considering multiple approaches before arriving at an answer.
Here is a simple breakdown of each of today’s contender models and how they approach “thinking” differently:
OpenAI o3-mini-high: The smaller but mighty cousin of o1, offering adjustable “effort” levels.
DeepSeek R1: An open-source reasoning powerhouse that rivals proprietary models.
Grok 3 (Think): xAI’s reasoning model with dedicated “Think Mode”.
Claude 3.5/3.7 Sonnet (Extended Thinking): Anthropic’s model with a configurable “thinking budget” (see the API sketch after this list).
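For a sense of how these knobs are exposed in practice, here is a rough sketch assuming the OpenAI and Anthropic Python SDKs as of early 2025; the model IDs and parameters shown are illustrative, so check your SDK’s docs before relying on them:

# Assumption: openai and anthropic Python packages installed and API keys set.
from openai import OpenAI
from anthropic import Anthropic

# o3-mini exposes an adjustable reasoning effort: "low", "medium", or "high".
openai_client = OpenAI()
resp = openai_client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# Claude 3.7 Sonnet exposes extended thinking with an explicit token budget.
anthropic_client = Anthropic()
msg = anthropic_client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)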
OK, don’t take my word for it – see the benchmark scores yourself!
Benchmark Scores
A. Popular Benchmark Scores

Sources: LLM Stats
If you are still unsure, here is a simple breakdown of the results.
Grok-3 leads in AIME 2024 with 93.0%, the highest among all models.
Claude 3.7 Sonnet is best in GPQA with 84.8%, slightly ahead of Grok-3 (84.6%).
DeepSeek-R1 tops MATH-500 with 97.3%, slightly better than o3-mini (96.2%).
Grok-3 performs best in Live Code Bench with 79.0%, while DeepSeek-R1 follows at 65.9%.
o3-mini is strong in AIME 2024 (87.3%) and GPQA (79.7%) but lacks scores in some benchmarks.
DeepSeek-R1 has mixed performance—strong in MATH-500 but lower in GPQA (71.5%).
Claude 3.7 Sonnet performs well overall, ranking high in GPQA and AIME 2024.
Now let’s look at the ARC – AGI Score
B. ARC – AGI Score
As of March 2025, the contender models have achieved notable ARC AGI scores:
| AI Model | ARC AGI Score |
|---|---|
| OpenAI o3-mini-high | 87.3% |
| DeepSeek R1 | 79.8% |
| Claude 3.7 Sonnet (Extended Thinking) | ~21.1% |
| Grok 3 (Think Mode) | Data not available |
Note: As of this date, the specific ARC AGI scores for Grok 3’s “Think Mode” have not been publicly disclosed.
Raw Capability Showdown - Which One Prevails?
Though benchmarks are a good way to evaluate a model’s performance, they’re not the best when it comes to rigorous testing. Raw capability matters as well. So, I have prepared a bunch of questions related to reasoning, math, coding, and writing. Let’s see how all our contenders perform on them!
A. Mathematics
Mathematics tasks are where thinking models truly shine compared to their traditional counterparts, thanks to their reasoning capabilities. Let’s examine how these models tackle a classic geometry problem.
Q1. Find the possible 4th vertex of a parallelogram
A straightforward and intuitive math question from Gilbert Strang’s Linear Algebra class.
Prompt:
If three corners of a parallelogram are (1,1), (4,2), and (1,3), what are all the possible fourth corners?
Expected Response: (4,4), (-2,2) or (4,0)
It would be interesting to see how the different models think about it.
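For reference, the three candidates fall out of simple vector addition: whichever two known corners you treat as adjacent to the missing one, the fourth corner is their sum minus the opposite corner. A few lines of Python (my own sanity check, not any model’s output) reproduce them:

A, B, C = (1, 1), (4, 2), (1, 3)

def fourth(p, q, r):
    # In a parallelogram, the corner opposite r is p + q - r.
    return (p[0] + q[0] - r[0], p[1] + q[1] - r[1])

print(fourth(B, C, A))  # (4, 4)
print(fourth(A, C, B))  # (-2, 2)
print(fourth(A, B, C))  # (4, 0)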
OpenAI O3-Mini-High: Pass

o3-Mini-High got all the answers correct without any nudging, showing off its brilliant CoT reasoning skills. What’s brilliant is that it first worked out the reasoning and then moved on to the math; partly human, I would say!
Claude 3.7 Sonnet: Pass

Impressively, Claude 3.7 Sonnet Thinking not only solved the problem but also verified its answer, which OpenAI o3-Mini-High did not do. Let’s see its CoT traces.
From the CoT traces it becomes evident that the model can self-correct / self-nudge and tries to be absolutely sure before giving an answer.
Grok 3 (Think) – Pass

Grok 3 passed the test, but again, there was no verification step before finalizing the answer. Let’s check the model’s CoT traces.
Surprisingly, it generated the longest chain of thought of all the LLMs (12,754 words) – so long that I had to attach it as a doc file, since pasting was not supported.
Deep Seek R1: Pass

As usual, DeepSeek R1 had no trouble solving the problem. However, it didn’t show any verification step like Claude did; I guessed it might be buried in the CoT reasoning process.
I was hoping the CoT reasoning traces might contain a verification component, but sadly they do not (you can check for yourself).
Q2. Compute the GCD of an infinite series
One of my favourite questions for checking the mathematical ability of a model.
Prompt: Compute the GCD of this series set {n^99(n^60-1): n>1}.
Expected Response: 6,814,407,600
o3-Mini-High

Claude 3.7 Sonnet Thinking

Similar to OpenAI o3, Claude 3.7 Sonnet solved it, but with a reasoning time of 361 seconds. However, it missed the verification part, same as o3. You can check out the CoT reasoning traces and let me know if any verification steps are present; I didn’t find any.
Grok 3 Think

Grok 3 gave the answer in the format the question was asked. However, after a slight nudge, it delivered the final answer with a verification step included – as if it knew I was writing this blog and noting which models skip verification steps.

Refer to the thinking traces to learn more.
DeepSeek R1

Despite reasoning for 767 seconds, the model still struggled to find the answer that all the other models found quickly. This may be due to over-reasoning. I will update this part if it manages to find the answer.
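If you want to verify the expected answer yourself, the GCD of the infinite set stabilizes after a handful of terms, so a few lines of Python (my own check, independent of any model) reproduce it:

from math import gcd
from functools import reduce

def term(n: int) -> int:
    return n**99 * (n**60 - 1)

# The GCD over the whole infinite set already stabilizes
# over the first couple of dozen terms.
g = reduce(gcd, (term(n) for n in range(2, 25)))
print(g)  # 6814407600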
Overall Summary
If you are into research and academia, go with Grok 3 and Claude 3.7 Sonnet. Both of them performed answer validation, which is crucial in that domain.
If you are more worried about token count and inference speed and don’t care about answer validation – mostly creative tasks – go with OpenAI o3-Mini-High and Claude 3.7 Sonnet.
I would never recommend DeepSeek R1 for math-related tasks, as over-reasoning causes it to hallucinate frequently.
Next, off to the reasoning test!
B. Advanced Reasoning
Advanced reasoning is fundamental to CoT reasoning models, so let’s see how all the contenders perform on some of the “Easy Problems That LLMs Get Wrong.”
Q1. Will You Switch Door or Not?
This problem is a simplified version of the classic Monty Hall problem, which is used to explain Bayes’ Theorem to high school students. Let’s see which models get the right answer.
Prompt: Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you, “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?
Expected Response: Switching is not an advantage. It makes no difference whether I switch because no additional information has been provided since the initial choice.
Let’s see which model finds it first!
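Unlike the real Monty Hall setup, no door is opened here, so switching cannot help. A quick Monte Carlo sketch (my own illustration, not any model’s output) makes that concrete:

import random

def win_rate(switch: bool, trials: int = 100_000) -> float:
    wins = 0
    for _ in range(trials):
        prize = random.randint(1, 3)  # gold bar placed uniformly at random
        choice = 1                    # you pick door No. 1
        if switch:
            choice = 2                # host offered door No. 2; nothing was revealed
        wins += (choice == prize)
    return wins / trials

print(win_rate(switch=False))  # ~0.333
print(win_rate(switch=True))   # ~0.333 – switching changes nothing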
Claude 3.7 Sonnet: Pass 100%

Grok 3: Pass

Grok 3 seemed to reason and fact-check a lot, finally deducing the right answer and even identifying the actual reasoning trap.
o3-Mini-High: Fails

Deep Seek R1: Fails

Impressively, Claude 3.7 Sonnet and Grok 3 were the only models to figure out the right answer. Moreover, Claude 3.7 Sonnet gave the least verbose answer and was the fastest. In contrast, o3-Mini-High and DeepSeek R1 fail the test.
Q2. The River Crossing Puzzle
Let’s see how the different models perform on this variation of a classic flash-game puzzle I used to play a lot back in the day. It’s a good reasoning puzzle variation, too.
Prompt: A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat with three secure separate compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer efficiently bring the wolf, the goat and the cabbage across the river without anything being eaten?
Expected Response: Place the wolf, goat, and cabbage in separate secure compartments in the boat and row across the river in a single trip. This will prevent any of them from being eaten.
Grok 3 – Pass

Grok 3 notices that there are three compartments and uses that fact to solve the problem correctly.
Deep Seek R1 – Pass

As can be seen clearly from the thinking snippet, DeepSeek R1 figured out that this was a trick question, which shows its reasoning capacity.
Claude Sonnet 3.7 – Pass

As per the CoT context, it’s clear the model was overthinking, but it worked out well in the end, even though the model answered with little confidence.
O3 Mini High – Fail

Q3. How high am I?
This is a straightforward, easy question, but most LLMs get it wrong, so let’s check whether the contender models can crack it.
Prompt: I get out on the top floor (third floor) at street level. How many stories is the building above the ground?
Expected Response: One story above the ground – basically, the other two floors are basements.
Model Responses
O3-mini-high

Claude Sonnet 3.7 Thinking with CoT – Partially Fails

Claude concluded: “One story above the ground – basically, 2 floors are basements” (refer to para 2 in the CoT traces), but due to excessive thinking the model nudged itself towards saying 0, which can alternatively be inferred as 1.
DeepSeek R1 with CoT – Partially Fails

As can be seen on the right, R1 figured out the building is one story but was not confident enough to give one as its answer. This suggests these models have learnt some behaviour that was not an explicit part of the training process.
Grok 3 Thinking – Fails

Grok 3 Thinking fails the test, as demonstrated by its thinking process.
Overall Summary
For advanced reasoning-related tasks, DeepSeek R1 and Claude 3.7 Sonnet Thinking can be used without a doubt.
For lower-end reasoning tasks, Grok 3 and OpenAI o3-Mini-High can be considered to some extent.
C. Coding
As a developer, I commonly use LLMs to write clean, efficient, and easy-to-understand code that follows sound coding principles. Let’s test how all these LLMs stack up.
Personally, I am pretty excited about Sonnet 3.7 Thinking.
Q1: Power Of Heroes
I have taken this hard problem from LeetCode. Let’s see how all models perform.
Prompt:
You are given a 0-indexed integer array nums representing the strength of some heroes. The power of a group of heroes is defined as follows:
Let i0, i1, … ,ik be the indices of the heroes in a group. Then, the power of this group is max(nums[i0], nums[i1], ... ,nums[ik])^2 * min(nums[i0], nums[i1], ... ,nums[ik]).
Return the sum of the power of all non-empty groups of heroes possible. Since the sum could be very large, return it modulo 10^9 + 7.
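Before looking at the model outputs, here is a tiny brute-force reference (my own sketch, not produced by any model) that enumerates every non-empty group directly; it is exponential, so it only works for small inputs, but it is handy for sanity-checking the O(n log n) solutions below:

from itertools import combinations

def brute_force_sum_of_power(nums: list[int]) -> int:
    MOD = 10**9 + 7
    total = 0
    # Enumerate every non-empty group of heroes (by position, so
    # duplicate strengths are counted correctly).
    for r in range(1, len(nums) + 1):
        for group in combinations(nums, r):
            total += max(group) ** 2 * min(group)
    return total % MOD

print(brute_force_sum_of_power([2, 1, 4]))  # 141, matching LeetCode's example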
O3-Mini-High – Pass (5.2 seconds)
from typing import List

class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        mod = 10**9 + 7
        nums.sort()
        n = len(nums)
        ans = 0
        # Add contribution from single-element groups: a[i]^3.
        for x in nums:
            ans = (ans + x * x % mod * x) % mod  # (x^2 * x) mod mod
        dp = 0  # This will hold S[j] = Σ(a[i] * 2^(j-i-1)) for the current j.
        for j in range(1, n):
            # Update dp to be for the group ending at index j.
            dp = (2 * dp + nums[j-1]) % mod
            # Add the contribution for groups with max element nums[j]:
            # they contribute nums[j]^2 * S[j].
            ans = (ans + (nums[j] * nums[j] % mod) * dp) % mod
        return ans
Claude Sonnet 3.7 Thinking – Pass (179 Seconds)
from typing import List  # (import added for completeness)

class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        MOD = 10**9 + 7
        nums.sort()
        n = len(nums)
        total_power = 0
        # Tracks the contribution of all possible maximum values
        # for a given minimum value.
        subArrayMaxPower = 0
        # Iterate from largest to smallest element.
        for i in range(n-1, -1, -1):
            # Add power when current element is both min and max (singleton group).
            total_power = (total_power + (nums[i]**3) % MOD) % MOD
            # Add power for all groups where current element is the minimum
            # and some larger element is the maximum.
            total_power = (total_power + (nums[i] * subArrayMaxPower) % MOD) % MOD
            # Update subArrayMaxPower for the next iteration using the
            # recurrence relation T[i-1] = nums[i]^2 + 2*T[i].
            if i > 0:
                subArrayMaxPower = (nums[i]**2 % MOD + (2 * subArrayMaxPower) % MOD) % MOD
        return total_power
Grok 3 Think – Pass (Time not specified)
from typing import List  # (import added for completeness)

class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        MOD = 10**9 + 7
        nums.sort()  # Sort the array in ascending order
        S = 0        # Initialize cumulative sum S_0
        total_sum = 0
        for j in range(len(nums)):
            # Contribution when nums[j] is both min and max (singleton).
            contribution = (nums[j] * nums[j] * nums[j]) % MOD
            # Contribution from subsets where nums[j] is max, using S_j.
            temp = (nums[j] * nums[j]) % MOD
            contribution = (contribution + (temp * S) % MOD) % MOD
            # Add to total sum.
            total_sum = (total_sum + contribution) % MOD
            # Update S for next iteration: S_{j+1} = nums[j] + 2 * S_j.
            S = (nums[j] + 2 * S) % MOD
        return total_sum
DeepSeek R1 – Fails (ongoing)

Performance Monitoring

OpenAI o3-Mini-High is the fastest model and generated the most optimised solution. However, it imported the typing module, which, while part of Python’s standard library rather than an external dependency, is an extra import the other solutions skipped and may be unnecessary in some judge environments.
Next come Claude 3.7 Sonnet and Grok 3; they generated good solutions, but not the best one. I would prefer Grok 3 because of its answer-validation habit.
Sadly, DeepSeek R1 got stuck in a loop, as it had many times before, and didn’t even generate a response. Not recommended for use at all.
Q2. Check If Digits Are Equal in String After Operations II
This one, which I found on LeetCode, has an acceptance rate of 8.2%, so I thought I’d give it a try with all our LLMs; let’s see how they perform.
Prompt:
You are given a string s consisting of digits. Perform the following operation repeatedly until the string has precisely two digits:
For each pair of consecutive digits in s, starting from the first digit, calculate a new digit as the sum of the two digits modulo 10. Replace s with the sequence of newly calculated digits, maintaining the order in which they are computed.
Return true if the final two digits in s are the same; otherwise, return false.
Example 1:
Input: s = "3902"
Output: true
Explanation: Initially, s = "3902".
First operation: (3+9)%10=2, (9+0)%10=9, (0+2)%10=2. s becomes "292".
Second operation: (2+9)%10=1, (9+2)%10=1. s becomes "11".
Since the digits in "11" are the same, the output is true.
Example 2:
Input: s = "34789"
Output: false
Explanation: Initially, s = "34789". After the first operation, s = "7157". After the second, s = "862". After the third, s = "48". The output is false since '4' != '8'.
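The trick that makes this problem hard: each round is a linear operation, so after n-2 rounds the two remaining digits are binomial-weighted sums of the original digits (row n-2 of Pascal’s triangle, taken mod 10). Here is a naive sketch of that identity (my own illustration; it is far too slow for large inputs, which is exactly why Lucas’ theorem shows up in the one passing solution below):

from math import comb

def final_two_digits(s: str) -> tuple[int, int]:
    # After n-2 rounds, each final digit is a binomial-weighted sum
    # (mod 10) of the original digits, with coefficients from row n-2
    # of Pascal's triangle.
    m = len(s) - 2
    first = sum(comb(m, k) * int(s[k]) for k in range(m + 1)) % 10
    second = sum(comb(m, k) * int(s[k + 1]) for k in range(m + 1)) % 10
    return first, second

print(final_two_digits("3902"))   # (1, 1) -> equal, so True
print(final_two_digits("34789"))  # (4, 8) -> not equal, so False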
o3-Mini-High: Fails (15 Seconds)
class Solution:
    def hasSameDigits(self, s: str) -> bool:
        while len(s) > 2:
            s = "".join(str((int(s[i]) + int(s[i + 1])) % 10) for i in range(len(s) - 1))
        return s[0] == s[1]

# Example usage:
solution = Solution()
print(solution.hasSameDigits("3902"))   # Output: True
print(solution.hasSameDigits("34789"))  # Output: False
Sadly, o3 generated correct code, but not an optimal solution: simulating every round is O(n^2), which resulted in a Time Limit Exceeded error when I tried to submit.
Claude Sonnet 3.7 Thinking: Fails (45 Seconds)
class Solution:
    def hasSameDigits(self, s: str) -> bool:
        # Base case: the string already has 2 digits
        if len(s) == 2:
            return s[0] == s[1]
        # If more than 2 digits, perform the operation once
        new_s = ""
        for i in range(len(s) - 1):
            digit_sum = (int(s[i]) + int(s[i+1])) % 10
            new_s += str(digit_sum)
        # Recursively call with the new string
        return self.hasSameDigits(new_s)

Sadly, Claude 3.7 Sonnet also fails: its recursive rewrite performs the same O(n^2) simulation (and risks hitting Python’s recursion limit on long inputs).
Grok 3 Think: Pass (221 Seconds)
class Solution:
    def hasSameDigits(self, s: str) -> bool:
        # Precomputed table for C(a, b) % 5 for a, b from 0 to 4
        C = [
            [1, 0, 0, 0, 0],  # C(0,0)=1, C(0,1)=0, ...
            [1, 1, 0, 0, 0],  # C(1,0)=1, C(1,1)=1, ...
            [1, 2, 1, 0, 0],  # C(2,0)=1, C(2,1)=2, ...
            [1, 3, 3, 1, 0],  # C(3,0)=1, C(3,1)=3, ...
            [1, 4, 1, 4, 1]   # C(4,0)=1, C(4,1)=4, C(4,2)=6%5=1, ...
        ]
        # CRT table: x such that x ≡ a mod 2 and x ≡ b mod 5
        table = {
            (0, 0): 0, (0, 1): 6, (0, 2): 2, (0, 3): 8, (0, 4): 4,
            (1, 0): 5, (1, 1): 1, (1, 2): 7, (1, 3): 3, (1, 4): 9
        }

        # Helper function to get digits of a number in base p
        def get_digits(num: int, p: int) -> list[int]:
            digits = []
            while num > 0:
                digits.append(num % p)
                num //= p
            return digits[::-1] if digits else [0]

        # Helper function to compute C(m, k) % 10 using Lucas' theorem
        def binom_mod10(m: int, k: int) -> int:
            if k > m or k < 0:
                return 0
            # Compute C(m, k) % 2: 1 if k's 1-bits are a subset of m's
            c2 = 1 if (k & m) == k else 0
            # Compute C(m, k) % 5 using Lucas' theorem
            m_digits = get_digits(m, 5)
            k_digits = get_digits(k, 5)
            max_len = max(len(m_digits), len(k_digits))
            m_digits = [0] * (max_len - len(m_digits)) + m_digits
            k_digits = [0] * (max_len - len(k_digits)) + k_digits
            prod = 1
            for mi, ki in zip(m_digits, k_digits):
                if ki > mi:
                    prod = 0
                    break
                prod = (prod * C[mi][ki]) % 5
            c5 = prod
            # Combine using CRT
            return table[(c2, c5)]

        n = len(s)
        if n == 2:
            return s[0] == s[1]
        # Number of operations is n-2; coefficients come from row n-2 of Pascal's triangle
        m = n - 2
        first_sum = 0
        second_sum = 0
        for k in range(m + 1):
            binom_k = binom_mod10(m, k)
            first_sum = (first_sum + binom_k * int(s[k])) % 10
            second_sum = (second_sum + binom_k * int(s[k + 1])) % 10
        return first_sum == second_sum

Despite thinking for 221 seconds, Grok 3 aced it, putting me among the 8.2% of coders who have solved this problem within the evaluation time limit. Once again, the CoT reasoning was so extensive that I had to include it as a document file. Nevertheless, upon reviewing the code and reasoning, you will find the solution was developed from first-principles reasoning, which is remarkable.
Deep Seek R1: Fails
class Solution:
    def hasSameDigits(self, s: str) -> bool:
        digits = [int(c) for c in s]
        while len(digits) > 2:
            next_digits = []
            for i in range(len(digits) - 1):
                next_digits.append((digits[i] + digits[i+1]) % 10)
            digits = next_digits
        return digits[0] == digits[1]

As expected, DeepSeek R1 failed again; its solution is the same O(n^2) simulation that earned o3 a Time Limit Exceeded. At this point, I am pretty confident I am not going to use DeepSeek R1 for my coding-related tasks, at least.
Coding: Overall Summary
We have a clear winner here: Grok 3 Think. It not only aced all the problems but was the only model able to solve the last problem, providing the most optimized code. I will definitely go with it.
If I am not concerned about hard coding problems, my second pick would be OpenAI o3-Mini-High: super fast, and it answered 2/3 of my questions with the fastest response time. However, I would include a prompt to format the output the right way.
Last, I would pick Claude 3.7 Sonnet. It’s slower and more token-hungry. I thought it would beat Grok 3, but it fell short of my expectations.
From this point onwards, the last question becomes my personal benchmark to judge an LLM’s coding capabilities.
Finally, let’s move on to the writing tests!
D. Writing
For this section, I will test just one question instead of three because, at this point, all the LLMs’ performances are almost similar and it becomes a matter of preference. However, to present a challenge, I am going to test writing capability on a proven question that many LLMs get wrong.
Q1. Generate Meaningful Modified Pangrams
A pangram is a sentence that uses every letter of the alphabet at least once. Pangrams are often used for typing practice, testing fonts, or for fun linguistic exercises.
For this question, there is a slight modification: it is not necessary to use all the letters of the alphabet. Let’s see which model captures this.
Prompt:
Write a grammatically correct sentence without reusing any letter more than once.
Output Examples:
Mr. Jock, TV quiz PhD, bags few lynx (not so meaningful)
The quick brown fox jumps over the lazy dog. (meaningful but misleading, reuses letters)
How razorback-jumping frogs can level six piqued gymnasts! (playful)
Model Responses
o3-Mini-High: Pass
Mr. Jock, TV quiz PhD, bags few lynxes.
Claude Sonnet 3.7 Thinking: Pass
Mr. Jock, TV quiz PhD, bags few lynx (contains all letters)
We fix jam.
Grok3 Think: Pass (121 Seconds)
The fox jumps.
Deep Seek R1: Fails
Glib jocks quiz nymph to vex dwarf
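You can grade these answers mechanically; here is a tiny checker (my own sketch, not from any model) that flags any reused letter:

def no_repeated_letters(sentence: str) -> bool:
    # Ignore punctuation and case; only letters count.
    letters = [c.lower() for c in sentence if c.isalpha()]
    return len(letters) == len(set(letters))

print(no_repeated_letters("The fox jumps."))                        # True
print(no_repeated_letters("Mr. Jock, TV quiz PhD, bags few lynx"))  # True (a perfect pangram)
print(no_repeated_letters("Glib jocks quiz nymph to vex dwarf"))    # False ('i' repeats, among others)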
Writing: Overall Summary
If you are after strict accuracy, Grok 3 might be a good fit; it was the only model that understood the change to the question (we don’t have to use all the letters) and modified its answer accordingly. Talk about taking a shortcut.
If you are into writing and creating content, go for Claude 3.7 Sonnet. Not only does it provide a good response, but it also lets you customize the response to your liking. Also, the CoT reasoning was superb, generating 5-7 candidate sentences close to perfect pangrams.
OpenAI o3-Mini-High again showed that it leans on training data, producing a famous pangram that the others arrived at by careful reasoning. However, its response is almost meaningless, similar to DeepSeek R1’s.
Analysis-Based Top AI Models Across Different Categories
Mathematics
Grok 3 Think and Claude Sonnet 3.7 Thinking stand out, especially for their ability to validate answers and ensure correctness.
If you need strong mathematical reasoning, these are excellent choices.
Advanced Reasoning
When it comes to handling complex logical problems and deep reasoning tasks, Deep Seek R1 and Claude Sonnet 3.7 Thinking take the lead.
These models excel in understanding intricate scenarios and delivering well-structured insights.
Coding
Grok 3 Think is the best overall performer for programming tasks, delivering high-quality code solutions efficiently.
This is followed closely by OpenAI o3-Mini-High and Claude Sonnet 3.7 Thinking, both of which provide solid coding support but may fall slightly behind in handling intricate logic.
Writing
If your primary focus is creative and linguistics-based writing tasks, you can pick any model apart from DeepSeek R1, as the differences are not hugely significant.
However, Claude Sonnet 3.7 Thinking stands out due to its support for writing style customization and tone adjustments, making it a great choice for those who need flexibility in their writing.
Final Thoughts
In the end, I would only say this: Each model has its strengths, and the best choice depends on your specific needs. Whether you are working on complex math problems, advanced reasoning tasks, coding challenges, or writing projects, there is an AI model suited for you.