Grok 3 vs. DeepSeek R1

After much anticipation, xAI has finally released the third iteration of Grok. It is reportedly the smartest LLM in the world, becoming the first model to score above 1400 on Chatbot Arena.

But is it the new SOTA? Apparently yes.


But how does it stack up against the people's favourite, DeepSeek R1? Grok 3 was trained on that colossal 100k-H100 cluster, so the comparison isn't exactly fair to DeepSeek, but from a consumer's point of view I still wanted to see if it lived up to the hype. So, I put both models through their paces using my collection of reasoning problems, math challenges, coding tasks, and creative writing prompts.

So, let’s find out.


TL;DR

If you want to jump straight to the conclusion, here’s a quick summary of the findings comparing the models:

  • Reasoning and Math: DeepSeek R1 and Grok 3 models perform similarly on reasoning questions.
  • Coding: Grok 3 outperforms DeepSeek R1 considerably and writes much better code.
  • Creative Writing: Both models perform well in creative writing. DeepSeek R1 has a reputation as a great writer, but I prefer Grok 3's output.

Brief on Grok 3 AI Model

Grok 3 is the latest language model from xAI, reportedly trained with 10x the compute of its predecessor. It includes tools like DeepSearch for in-depth research and Big Brain Mode for handling complex tasks.

You can access the model for free with your X/Twitter account. Yes, it's free for a limited time, though with strict rate limits.

According to the official benchmarks shared by the xAI team at the launch event, Grok 3 appears to be a game-changer, outperforming all its competitors in almost every benchmark.

DeepSeek R1 vs. Grok 3

I’ve used Chatbot Arena to test both models side by side, as it is the only available and trusted third-party site that allows testing the early Grok 3 model.

Reasoning Problems

Here, we will check the reasoning capabilities of both models.

1. Russian Roulette

Let’s start the show with an interesting question:

Prompt: You are playing Russian roulette with a six-shooter revolver. Your opponent puts in five bullets, spins the chambers and fires at himself, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing at you. Should he spin again?

  • Response from Grok 3:
Grok 3 answering reasoning questions
  • Response from DeepSeek R1:
Deepseek r1 answering reasoning questions

Final Verdict: Both models answered the problem correctly with correct reasoning.
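The logic here can be sanity-checked with a quick Monte Carlo simulation (my own sketch, not either model's output): since five of the six chambers are loaded, the opponent's blank shot must have come from the single empty chamber, so without a re-spin the next chamber is guaranteed to be loaded, while a re-spin gives a 1-in-6 chance of survival.

```python
import random

def survival_rate(spin_again: bool, trials: int = 100_000) -> float:
    """Estimate the chance of surviving the second trigger pull."""
    survived = 0
    for _ in range(trials):
        # Cylinder: five loaded chambers (True) and one empty (False).
        chambers = [True] * 5 + [False]
        random.shuffle(chambers)
        # The opponent's shot was a blank, so it came from the empty chamber.
        empty = chambers.index(False)
        if spin_again:
            fired = random.choice(chambers)    # uniform over all six chambers
        else:
            fired = chambers[(empty + 1) % 6]  # cylinder advances by one
        if not fired:
            survived += 1
    return survived / trials

print(survival_rate(spin_again=True))   # ~0.167 (1 in 6)
print(survival_rate(spin_again=False))  # 0.0 -- the next chamber is always loaded
```

So the correct answer is that he should spin again, exactly as both models concluded.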

2. Olympic Athlete Siblings

Prompt: I am the sister of two Olympic athletes. But these two athletes are not my sisters. How is this possible?

  • Response from Grok 3:
Grok 3 answering reasoning questions
  • Response from DeepSeek R1:
Deepseek r1 answering reasoning questions

Final Verdict: Both models answered the problem correctly, with sound reasoning.

3. Find out the Liar

The first two questions were straightforward. Let’s conclude the reasoning test with a slightly trickier question:

Prompt: You meet three people: Haris, Antony, and Michael. Haris says, “At least one of us is a liar.” Antony says, “Haris is lying.” Michael says, “Antony is telling the truth.” Determine who is lying and who is telling the truth.

  • Response from Grok 3:
Grok 3 answering reasoning questions

Here, Grok 3 provided thorough reasoning, considering each case and breaking down the problem step by step to support its conclusion.

  • Response from DeepSeek R1:

Both models answered the problem correctly, but the reasoning of the Grok 3 model stands out to me.
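The puzzle is small enough to brute-force: check all eight truth-assignments and keep the ones where each person's statement matches their own truthfulness. A quick sketch (mine, not either model's output):

```python
from itertools import product

def consistent(haris: bool, antony: bool, michael: bool) -> bool:
    # A truth-teller's statement must be true; a liar's must be false.
    stmt_haris = not (haris and antony and michael)  # "At least one of us is a liar."
    stmt_antony = not haris                          # "Haris is lying."
    stmt_michael = antony                            # "Antony is telling the truth."
    return (haris == stmt_haris and antony == stmt_antony
            and michael == stmt_michael)

solutions = [t for t in product([True, False], repeat=3) if consistent(*t)]
print(solutions)  # [(True, False, False)] -> only Haris tells the truth
```

The unique consistent assignment is the one both models found: Haris is truthful, while Antony and Michael are lying.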

Summary of reasoning abilities

From this, we can see that both models have strong reasoning capabilities, as both answered all my reasoning questions correctly. Still, Grok 3's reasoning seems to edge out R1's at times.

Mathematics

Given how both models handled the reasoning questions, I expected them to pass the math questions as well.

Starting with an easy question, let’s see how they perform:

1. Number of people on the train

Prompt: There were some people on a train. 19 people get off the train at the first stop, and 17 people get on. Now, there are 63 people on the train. How many people were on the train to begin with?

  • Response from Grok 3:
Grok 3 solving math questions
  • Response from DeepSeek R1:
Deepseek r1 solving math questions

Final Verdict: Both models chose a similar approach and ended up with the correct answer.
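Working backwards confirms the answer: start from the 63 people now on board, undo the 17 who got on, and restore the 19 who got off.

```python
people_now, got_off, got_on = 63, 19, 17
original = people_now - got_on + got_off
print(original)  # 65
```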

2. Prime Number Theorem Approximation

Prompt: Estimate the number of prime numbers less than 10^8.

  • Response from Grok 3:
Grok 3 solving math questions
Grok solving math questions
  • Response from DeepSeek R1:
Deepseek r1 solving math questions
Deepseek solving math questions

Final Verdict: Both models correctly applied the Prime Number Theorem, but DeepSeek R1's answer was off by only ~1K, while Grok 3's was off by ~20K. So, DeepSeek R1 outperformed Grok 3 here.
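For reference, the true count is π(10^8) = 5,761,455. The plain PNT estimate x/ln x undershoots noticeably at this scale (the much closer answers likely came from the logarithmic-integral refinement), as a quick check shows:

```python
import math

N = 10**8
estimate = N / math.log(N)  # Prime Number Theorem: pi(x) ~ x / ln(x)
actual = 5_761_455          # known value of pi(10^8)
print(round(estimate))      # ~5,428,681
print(f"relative error: {abs(estimate - actual) / actual:.1%}")
```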

3. Count vowels and exponentiate

This is the most challenging problem for the models to crack. It is a modified version of a famous LLM question with a math twist: “Calculate the number of r’s in strawberry.”

Prompt: Calculate (x - 14)^10, where x is the number of vowels in the answer.

  • Response from Grok 3
Grok solving the math question
  • Response from DeepSeek R1:
Deepseek r1 solving math questions

Final Verdict: As expected, neither of the models could reach the solution.

However, OpenAI’s o1 model seems to have cracked this question.

Summary of Math Abilities

As for math, both models handled the first two, more manageable questions well, but both failed the tricky self-referential one. Based on these results, the two perform similarly, so it's hard to favour one over the other.

Coding

Now, let’s check how these models handle a tricky LeetCode question with an acceptance rate of just 12.8%: Find the Occurrence of First Almost Equal Substring. It was added recently, so it’s unlikely either model was trained on it.

You are given two strings s and pattern.

A string x is called almost equal to y if you can change at most one character in x to make it identical to y.

Return the smallest starting index of a substring in s that is almost equal to pattern. If no such index exists, return -1.

A substring is a contiguous non-empty sequence of characters within a string.

Example 1:
Input: s = "abcdefg", pattern = "bcdffg"
Output: 1

Explanation:
The substring s[1..6] == "bcdefg" can be converted to "bcdffg" by changing s[4] to "f".

Example 2:
Input: s = "ababbababa", pattern = "bacaba"
Output: 4

Explanation:
The substring s[4..9] == "bababa" can be converted to "bacaba" by changing s[6] to "c".

Example 3:
Input: s = "abcd", pattern = "dba"
Output: -1

Example 4:
Input: s = "dde", pattern = "d"
Output: 0

Constraints:
1 <= pattern.length < s.length <= 10^5
s and pattern consist only of lowercase English letters.

  • Response from Grok 3:
class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        n = len(pattern)

        # Helper function to check if two strings differ by at most one character
        def differsByAtMostOne(s1: str, s2: str) -> bool:
            if len(s1) != len(s2):
                return False

            # Count differences
            diff_count = 0
            for c1, c2 in zip(s1, s2):
                if c1 != c2:
                    diff_count += 1
                if diff_count > 1:
                    return False
            return True

        # Slide window of size n (pattern length) through s
        for i in range(len(s) - n + 1):
            # Get current substring
            substring = s[i:i+n]
            # Check if it differs by at most one character
            if differsByAtMostOne(substring, pattern):
                return i

        return -1

The code accomplished what was requested, but it hit Time Limit Exceeded on some test sets. So, while it solves the problem, this brute-force approach isn’t optimal.
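For comparison, the intended O(n + m) approach uses Z-arrays to precompute, for every position of s, how far it matches the pattern from the front and from the back; a window is almost equal when a matched prefix plus a matched suffix cover all but at most one character. A sketch of that idea (my own, not either model's output):

```python
def z_function(s: str) -> list:
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def min_starting_index(s: str, pattern: str) -> int:
    n, m = len(s), len(pattern)
    # pre[i]: longest common prefix of s[i:] and pattern ('#' caps it at m)
    pre = z_function(pattern + '#' + s)[m + 1:]
    # suf[i]: longest common suffix of s[:i+1] and pattern
    suf = z_function(pattern[::-1] + '#' + s[::-1])[m + 1:][::-1]
    for i in range(n - m + 1):
        # With <= 1 mismatch, the matched prefix and suffix around it
        # must together cover at least m - 1 characters of the window.
        if pre[i] + suf[i + m - 1] >= m - 1:
            return i
    return -1

print(min_starting_index("abcdefg", "bcdffg"))  # 1
```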

  • Response from DeepSeek R1:
class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        ans = self.find_almost_equal_substring(s, pattern)
        return ans

    def compute_z(self, s):
        n = len(s)
        Z = [0] * n
        Z[0] = 0  # The first character is the entire string, which is not considered here
        l, r = 0, 0
        for i in range(1, n):
            if i > r:
                l = r = i
                while r < n and s[r - l] == s[r]:
                    r += 1
                Z[i] = r - l
                r -= 1
            else:
                k = i - l
                if Z[k] < r - i + 1:
                    Z[i] = Z[k]
                else:
                    l = i
                    while r < n and s[r - l] == s[r]:
                        r += 1
                    Z[i] = r - l
                    r -= 1
        return Z

    def find_almost_equal_substring(self, s, pattern):
        m = len(pattern)
        n = len(s)
        if m == 0:
            return 0
        if n = len(Z_lcs):
                lcs_val = 0
            else:
                lcs_val = Z_lcs[pos_in_concat]
            lcs[n - 1 - i] = lcs_val

        min_index = float('inf')

        # Iterate over each possible j (position in pattern)
        for j in range(m):
            required_prefix = j
            required_suffix = m - j - 1

            # Iterate possible starting indices i in s
            for i in range(n - m + 1):
                # Check prefix condition
                if lcp[i] = n:
                    continue
                if lcs[suffix_start] < required_suffix:
                    continue
                # Check if current character is a mismatch
                if s[i + j] != pattern[j]:
                    if i < min_index:
                        min_index = i
                        # Early exit if possible
                        if min_index == 0:
                            return 0
                        # Break the inner loop as we're looking for the earliest i
                        break

        return min_index if min_index != float('inf') else -1

This was very unexpected from DeepSeek R1: the code failed the very first test case, and the code quality is poor as well.


Summary of Coding Abilities

Here, Grok 3 is the winner. It at least provided working code, though not the most optimal. DeepSeek R1, on the other hand, wrote code that couldn’t pass the very first test case, was unnecessarily long, and was poorly written.

Creative Writing

Prompt: Write a story: “A man buys an old typewriter from a flea market, only to realize that whatever he types on it comes true, but with a dark twist.” Summarize the entire story, with the twist, in three paragraphs.

Response from Grok 3

Grok 3 response

Response from DeepSeek R1:

Deepseek r1 response

Summary:

Even though DeepSeek R1 is often said to be the best model for creative writing:

Ranking models for creative writing

Here, I must say both did a great job crafting the story and wrapping up the entire twist within three paragraphs, but I prefer Grok 3’s response over DeepSeek R1’s.

Grok 3’s story simply had better flow.

Final Verdict!

Based on these observations, here’s my final verdict:

  • Grok 3 and the DeepSeek R1 model perform similarly for Reasoning and Mathematics. Choosing one over the other doesn’t seem to make much difference.
  • Grok 3 is the clear winner for Coding compared to the DeepSeek R1 model. DeepSeek R1 doesn’t come close regarding code quality or overall answers.
  • Both models are pretty strong for Creative Writing, but I prefer Grok 3’s responses. They feel more engaging, natural, and polished.

