Gemini 2.0 vs Flash vs OpenAI o1 and Claude 3.5 Sonnet
Google has finally woken up and dropped the bombshell that is Gemini 2.0, completing the AI trifecta. It has launched two new models: the smaller Gemini 2.0 Flash and the Gemini Experimental 1206. And there's a lot of excitement around them.
But how does it really compare to o1 and the Claude 3.5 Sonnet? After all the fanfare, I wanted to see if it lived up to the hype. So, I put all three models through their paces using my collection of reasoning problems, math challenges, coding tasks, and creative writing prompts.
So, let's find out if Gemini 2.0 models can truly compete with these AI heavyweights or if Google's latest offering is more flash than substance.
For easy understanding, I will refer to the Gemini Experimental 1206 as Gemini 1206 and the Gemini 2.0 Flash as Gemini Flash.
Table of Contents
Notes on Gemini 2.0
Complex Reasoning
Mathematics
Coding
Creative Writing
Final Verdict
TL;DR
Gemini 2.0 Flash and 1206 are Google's latest and greatest models.
Gemini 2.0 Flash is the first truly multimodal model, supporting image, video, and voice inputs as well as voice and image outputs.
Regarding raw reasoning and mathematical ability, OpenAI o1 tops the chart, followed by Gemini 1206 and Claude 3.5 Sonnet.
In coding, Claude is king, considering speed, capability, and message cap.
OpenAI o1 triumphs in creative writing as well.
Notes on Gemini 2.0 Flash
Gemini 2.0 Flash and 1206 are Google's newest models, with native tool calling and multi-modal abilities. They can process images and videos and, for the first time, generate images and speech, including editing images using natural language. A public release is due early next year.
What stood out most was the model's streaming ability, which allows it to stream audio and video in real-time.

This is the first truly multi-modal large language model; though we don’t know its size, it is smaller than Gemini 1.5 Pro. Yet it packs the best performance of all the existing Google models.
With a massive context window and raw capability, it can easily be the default choice for developers and users. So, let’s see how good it is compared to the incumbents.
Comparing the Reasoning abilities
Reasoning is one of the most sought-after skills of LLMs. Reasoning ability correlates directly to contextual awareness, a must-have for agentic reliability. A better reasoning model will also better handle real-world use cases necessary for agentic automation.
This section tests all three models’ reasoning ability regarding textbook and real-world problems.
1. Counting words in the response
Let’s start with a small problem.
Prompt: What’s the fourth word in your response to this prompt?
Response from Gemini 2.0 Flash:

Gemini Flash failed at this task in the zero-shot attempt. However, it got the correct answer when I asked it to recheck.
Responses from Gemini 1206:

However, Gemini 1206 was spot on in the first attempt.
These are the responses from OpenAI o1 and Claude 3.5 Sonnet.
OpenAI o1:

Claude 3.5 Sonnet:

Of all the models, o1 and 1206 were the only ones that got it right in the first attempt. While Gemini Flash required some nudging, Claude couldn’t solve it even after nudging.
2. What is C Doing in the Room?
Prompt: Five people (A, B, C, D, and E) are in a room. A is watching TV with B, D is sleeping, B is eating chow mein, and E is playing table tennis. Suddenly, a call comes on the telephone. B goes out of the room to pick up the call. What is C doing?
It's a classic but easy reasoning question. Let’s see who nails it.
Gemini 2.0 Flash response:

Gemini 1206

OpenAI o1:

Claude 3.5 Sonnet:

Only o1 got the puzzle right; the other models all gave the same incorrect answer. Even o1-preview failed this question, so o1 is undoubtedly a step up from o1-preview in complex reasoning.
3. Next move in Tic-tac-toe
This test will give the models multiple situations in a tic-tac-toe game and ask for the next move.
I created a tic-tac-toe situation and asked all these models to find the optimal position to ensure I would win. Here are the responses.
Gemini 2.0 Flash:

Gemini 1206 response

Gemini’s confidence is impressive, and I hoped it would get it right, but it couldn’t solve the problem. Let’s see how the other models respond.
OpenAI o1:

Claude 3.5 Sonnet:

Of all the models, o1 was the only one that successfully found the position that ensured victory. The reasoning behind the choice was also great.
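Model answers to board puzzles like this are easy to ground-truth in code. The sketch below brute-forces the game tree with minimax; the board encoding (a 9-character string of 'X', 'O', and spaces, indexed 0–8 row by row) is my own convention, not something from the prompt I gave the models.

```python
# Ground truth for tic-tac-toe puzzles via exhaustive minimax.
# Board: a 9-char string of 'X', 'O', or ' ', indexed 0-8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != ' ' and b[i] == b[j] == b[k]:
            return b[i]
    return None

def minimax(b, player):
    """Score of position b with `player` to move: +1 X win, -1 O win, 0 draw."""
    w = winner(b)
    if w:
        return 1 if w == 'X' else -1
    if ' ' not in b:
        return 0
    nxt = 'O' if player == 'X' else 'X'
    scores = [minimax(b[:i] + player + b[i + 1:], nxt)
              for i, c in enumerate(b) if c == ' ']
    return max(scores) if player == 'X' else min(scores)

def best_move(b, player):
    """Return a move index that guarantees the best achievable outcome."""
    nxt = 'O' if player == 'X' else 'X'
    moves = [i for i, c in enumerate(b) if c == ' ']
    score = lambda i: minimax(b[:i] + player + b[i + 1:], nxt)
    return (max if player == 'X' else min)(moves, key=score)

# Example position (an assumption for illustration): X on 0 and 4,
# O on 1 and 2, X to move -- X has a forced win here.
print(best_move("XOO X    ", 'X'))
```

Any position a model claims to solve can be fed through `best_move` and scored, removing the guesswork from grading these answers.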
Summary on Reasoning
In reasoning tasks, o1 is well ahead of Sonnet and the Gemini models. The second-best model is Gemini 1206. In terms of raw IQ for an LLM, OpenAI o1 is far superior.
Mathematics
Next is mathematics. Let’s examine the mathematical proficiency of these models on questions that trip up most LLMs.
1. 5.11-5.90 =?
I'm not sure why this is an interesting question, but many LLMs fail to answer it correctly, even when they get similar questions right.
Here’s the answer from Gemini 2.0 Flash

This seems to be a non-issue for Gemini 1206, o1 and Sonnet.
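For reference, the expected answer is −0.79. Binary floating point introduces a tiny rounding artifact in this subtraction, which is one plausible reason models stumble on it; exact decimal arithmetic confirms the result:

```python
from decimal import Decimal

# Binary floats give the right answer up to rounding noise:
print(round(5.11 - 5.90, 2))  # -0.79

# Exact decimal arithmetic, no rounding artifacts:
print(Decimal("5.11") - Decimal("5.90"))  # -0.79
```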

2. Finding the possible values of the fourth vertex
This is a problem from Gilbert Strang’s linear algebra. Let’s see the responses from all these models.
Prompt: If three corners of a parallelogram are (1,1), (4,2), and (1,3), what are all the possible fourth corners?
Response from Gemini 2.0 Flash
Let the given corners be A(1,1), B(4,2), and C(1,3), and let the fourth corner be D(x,y). There are three possible ways to arrange the parallelogram using the three given vertices:

Case 1: ABCD is a parallelogram. The diagonals bisect each other. Let M be the midpoint of AC: M = ((1+1)/2, (1+3)/2) = (1,2). Since M is also the midpoint of BD: ((4+x)/2, (2+y)/2) = (1,2). Equating x-coordinates: (4+x)/2 = 1 => x = -2. Equating y-coordinates: (2+y)/2 = 2 => y = 2. So D = (-2, 2).

Case 2: ABDC is a parallelogram. The diagonals are AD and BC. Midpoint of AD: ((1+x)/2, (1+y)/2). Midpoint of BC: ((4+1)/2, (2+3)/2) = (5/2, 5/2). Since the midpoints coincide: (1+x)/2 = 5/2 => x = 4, and (1+y)/2 = 5/2 => y = 4. So D = (4, 4).

Case 3: ADBC is a parallelogram. The diagonals are AB and CD. Midpoint of AB: ((1+4)/2, (1+2)/2) = (5/2, 3/2). Midpoint of CD: ((1+x)/2, (3+y)/2). Since the midpoints coincide: (1+x)/2 = 5/2 => x = 4, and (3+y)/2 = 3/2 => y = 0. So D = (4, 0).

Therefore, the possible fourth corners are (-2, 2), (4, 4), and (4, 0).
Impressive! Both the Gemini models were able to solve the problem successfully. Let’s see the responses from the other models.
OpenAI o1:

Claude 3.5 Sonnet:

Both o1 and Gemini solved the problem, while Sonnet could only find a single possible vertex.
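All three cases of this problem reduce to one vector identity: if P and Q are the corners adjacent to the missing vertex and R is the corner opposite it, then D = P + Q − R, because a parallelogram's diagonals bisect each other. A quick sketch confirming the three answers:

```python
# Fourth-vertex candidates of a parallelogram given three corners:
# pick each given corner in turn as the one opposite the missing
# vertex R, then D = P + Q - R (diagonals bisect each other).
A, B, C = (1, 1), (4, 2), (1, 3)

def fourth(p, q, r):
    return (p[0] + q[0] - r[0], p[1] + q[1] - r[1])

candidates = sorted({fourth(A, C, B), fourth(A, B, C), fourth(B, C, A)})
print(candidates)  # [(-2, 2), (4, 0), (4, 4)]
```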
3. Finding the GCD of a number series
Next, let’s see if the models can find the GCD of the set {n^99(n^60-1): n>1}.
Gemini 2.0 Flash:

Gemini 1206:

Both the Geminis failed to answer the question correctly.
OpenAI o1:

Claude 3.5 Sonnet:

Out of all the models, only o1 correctly solved the issue. Both Gemini 2.0 Flash and Claude 3.5 Sonnet confidently output wrong answers. Will a model admit something it doesn’t know instead of hallucinating or outputting wrong answers?
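Whatever closed form a model claims, the value itself can be checked empirically: fold gcd over the first few terms of the set and watch the running value stabilize. The range bound below is my own choice; the running gcd settles long before it.

```python
from math import gcd
from functools import reduce

# Running GCD of n^99 * (n^60 - 1) for n = 2, 3, ...; Python's big
# integers handle the ~300-digit terms exactly.
g = reduce(gcd, (n**99 * (n**60 - 1) for n in range(2, 30)))
print(g)
```

Comparing a model's stated answer against this number is a quick sanity check that doesn't require trusting anyone's number theory.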
Summary on Mathematics
If you need a model for solving math-related problems, o1 should be your default choice; it is the most mature and capable at math tasks. Gemini is a balanced option: not as strong as o1, but better than Sonnet.
Coding
Coding and writing are the most sought-after use cases of LLMs. Claude has established itself as the de facto king in this space. Let’s see how the new Gemini models stack up.
For this, I will use a LeetCode Hard question: the “Power of Heroes” problem. Both o1 and Claude had previously solved it; let’s see if Gemini can do the same.
You are given a 0-indexed integer array nums representing the strength of some heroes. The power of a group of heroes is defined as follows: let i0, i1, ..., ik be the indices of the heroes in a group. Then, the power of this group is max(nums[i0], nums[i1], ..., nums[ik])^2 * min(nums[i0], nums[i1], ..., nums[ik]). Return the sum of the power of all non-empty groups of heroes possible. Since the sum could be very large, return it modulo 10^9 + 7.

Example 1: Input: nums = [2,1,4]. Output: 141. Explanation: 1st group: [2] has power = 2^2 * 2 = 8. 2nd group: [1] has power = 1^2 * 1 = 1. 3rd group: [4] has power = 4^2 * 4 = 64. 4th group: [2,1] has power = 2^2 * 1 = 4. 5th group: [2,4] has power = 4^2 * 2 = 32. 6th group: [1,4] has power = 4^2 * 1 = 16. 7th group: [2,1,4] has power = 4^2 * 1 = 16. The sum of powers of all groups is 8 + 1 + 64 + 4 + 32 + 16 + 16 = 141.

Example 2: Input: nums = [1,1,1]. Output: 7. Explanation: A total of 7 groups are possible, and the power of each group will be 1. Therefore, the sum of the powers of all groups is 7.

Constraints: 1 <= nums.length <= 10^5, 1 <= nums[i] <= 10^9.
Gemini 2.0 Flash response
from typing import List

class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        nums.sort()
        n = len(nums)
        MOD = 10**9 + 7
        total_power = 0
        sum_min_power = 0
        pow2 = 1
        prev_pow2 = 1
        for i in range(n):
            num = nums[i]
            total_power = (total_power + num * num % MOD * num) % MOD
            total_power = (total_power + sum_min_power * num % MOD * num) % MOD
            sum_min_power = (sum_min_power + prev_pow2 * num) % MOD
            prev_pow2 = pow2
            pow2 = (pow2 * 2) % MOD
        return total_power
I expected the model to cruise through it, but it couldn’t clear all the test cases.

Gemini 1206:
However, 1206 was able to crack it, and its answer is even cleaner than those of o1 and Claude 3.5 Sonnet.
def sumOfPower(nums):
    """
    Calculates the sum of the power of all non-empty groups of heroes.

    Args:
        nums: A 0-indexed integer array representing the strength of heroes.

    Returns:
        The sum of the power of all non-empty groups of heroes modulo 10^9 + 7.
    """
    n = len(nums)
    mod = 10**9 + 7
    nums.sort()
    ans = 0
    pre_sum = 0
    for i in range(n):
        ans = (ans + (nums[i]**2) * (nums[i] + pre_sum)) % mod
        pre_sum = (pre_sum * 2 + nums[i]) % mod
    return ans
In comparison, these are the responses from o1 and Sonnet, solving the same questions.
OpenAI o1:
class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        MOD = 10**9 + 7
        nums.sort()
        R = 0
        total_sum = 0
        for i, x in enumerate(nums):
            x_mod = x % MOD
            x2 = (x_mod * x_mod) % MOD
            x3 = (x2 * x_mod) % MOD
            if i == 0:
                # Only one subset: {x}
                contribution = x3
            else:
                # contribution = x^3 + x^2 * R
                contribution = (x3 + x2 * R) % MOD
            total_sum = (total_sum + contribution) % MOD
            # Update R for the next iteration
            R = (2 * R + x_mod) % MOD
        return total_sum % MOD
Claude 3.5 Sonnet:
class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        MOD = 10**9 + 7
        nums.sort()
        # Running sum to track previous numbers
        R = 0
        # Final answer accumulator
        total = 0
        # Process each number
        for i, x in enumerate(nums):
            # Handle modulo arithmetic
            x_mod = x % MOD
            x2 = (x_mod * x_mod) % MOD  # x^2
            x3 = (x2 * x_mod) % MOD     # x^3
            if i == 0:
                # First number only contributes x^3
                contribution = x3
            else:
                # For each subsequent number:
                # Add x^3 (when alone) and x^2 * R (combinations with previous numbers)
                contribution = (x3 + x2 * R) % MOD
            # Add to total sum
            total = (total + contribution) % MOD
            # Update running sum for next iteration
            # New numbers will create twice as many combinations
            R = (2 * R + x_mod) % MOD
        return total
While this may not fully represent the model's coding capability, it suggests that Gemini still needs to improve in this area. I had expected Gemini 2.0 Flash to nail this on the first try.
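For reference, the accepted solutions above share one recurrence: process nums in sorted order, treat each element as the group maximum, and keep a doubling prefix sum of candidate minimums. A compact restatement (the function name is my own), checked against the problem's two examples:

```python
MOD = 10**9 + 7

def sum_of_power(nums):
    # With nums sorted, when nums[i] is a group's maximum, the sum of the
    # minimums over all groups it caps is nums[i] (the singleton group)
    # plus pre_sum = sum(nums[j] * 2^(i-1-j) for j < i).
    ans, pre_sum = 0, 0
    for x in sorted(nums):
        ans = (ans + x * x % MOD * (x + pre_sum)) % MOD
        pre_sum = (pre_sum * 2 + x) % MOD
    return ans

print(sum_of_power([2, 1, 4]))  # 141
print(sum_of_power([1, 1, 1]))  # 7
```

The doubling of pre_sum is what makes this O(n log n) overall: each earlier element's weight doubles because every new element doubles the number of subsets in which the earlier element can be the minimum.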
Summary on Coding
In coding, Claude gets stiff competition from Gemini 1206, and both beat o1 once speed and rate limits are factored in. Flash, for its size, punches above its weight.
Creative Writing
Creative writing is a personal preference. Each model has its own personality, which generally reflects the company's philosophy. This is something you only learn with enough use. For testing, however, I asked each model for a simple story.
Prompt: Write a story with a twist about a college guy’s routine life, which turned upside down when he encountered a mysterious woman.
I only provided the prompt without additional information regarding writing style and tone.
Here are the creations from all three models.
Response from OpenAI o1:
Response from Claude 3.5 Sonnet
Of all the responses, o1's was the most creative, with fewer ornamental words; it is natural and reads well. The Gemini response was also inventive, but it leaned on literary vocabulary. Claude’s version lacked creativity and didn't captivate the way the other two did.
Final Verdict
You already know the verdicts after going through the complete blog post.
For complex reasoning: OpenAI o1 > Gemini 1206 > Gemini Flash > Claude 3.5 Sonnet.
For mathematics: OpenAI o1 > Gemini 1206 > Gemini Flash > Claude 3.5 Sonnet.
For coding: Claude 3.5 Sonnet ~ Gemini 1206 > OpenAI o1 > Gemini Flash (o1 ranks lower only because of the message cap on the Plus tier).
For creative writing: OpenAI o1 > Gemini 1206 > Gemini Flash > Claude 3.5 Sonnet.
For technical writing: OpenAI o1 > Claude 3.5 Sonnet ~ Gemini 1206 > Gemini Flash. (Not tested here, but based on personal use.)
The game is still on; anything can happen at any time.
MCP Webinar
We’re hosting our first-ever MCP webinar, where we will discuss MCP security, tool authentication, and best practices for building and deploying MCP agents, and answer your questions. Join us on July 17, 2025. It'll be fun.

