On OpenAI o3-mini
![On OpenAI o3-mini: how good it is compared to Deepseek r1?](https://composio.dev/wp-content/uploads/2025/02/On-OpenAI-o3-mini-1024x576.png)
OpenAI launched its latest model, the o3-mini, last Friday. It is the first member of the o3 family of models.
There are two specialized variants of o3-mini: o3-mini-high, which spends more time reasoning for more in-depth answers, and o3-mini-low, which prioritizes speed for quicker responses. Along with o3-mini, OpenAI has also launched Deep Research, a feature that lets Pro subscribers run in-depth, multi-step research tasks on the web.
Indeed, they are not putting much effort into model naming.
![Meme for OpenAI naming scheme](https://composio.dev/wp-content/uploads/2025/02/gpt-naming-982x1024.jpeg)
Well, anyway. It appears to be a capable model for the cost.
According to benchmarks, o3-mini performs comparably to o1 while being roughly 15 times cheaper and about five times faster. This cost-efficiency is especially interesting given that o3-mini is cheaper than GPT-4o, yet has a stricter usage limit of 150 messages per day, unlike the effectively unrestricted GPT-4o.
Does this mean that OpenAI is subsidizing o3-mini for now?
Whatever the reason, this is the best available model in terms of cost-to-performance right now, as Deepseek r1's official (China-hosted) API is mostly unavailable.
It has scored better than o1 on FrontierMath, Codeforces, and GPQA. I have already written a comparison of OpenAI o1 and Deepseek r1, where o1 came out ahead in most reasoning tasks.
TL;DR
- You can see the chain of thought in o3-mini, unlike o1. However, the traces are not raw but summarized.
- It is the first reasoning model from OpenAI with official function-calling support.
- For reasoning and math tasks, the models are really great, especially o3-mini-high (of course). But Deepseek r1’s CoT is gold.
- For writing, I would still prefer Deepseek r1 as it doesn’t have a corporate-ish personality. But you can get most things done with the o3-minis.
A few notes on OpenAI o3-mini
This is the only model from OpenAI that shows a chain of thought. The CoT shown is not raw and natural like Deepseek’s, but a simplified version.
![](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-4.35.14-PM-1024x412.png)
The CoT is instead a summarized version of the actual trace, and I’m not sure of the reason behind it: perhaps to keep competitors from training on it, or perhaps the raw trace simply isn’t very readable. Either way, I feel this is much better than what we had with o1.
Also, it is the first reasoning model to have a function-calling feature. This is huge for AI agents and, of course, for Composio. If you haven’t checked us out, do check it out now. We might be something you have been looking for all along.
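For reference, here is a minimal sketch of what function calling with o3-mini can look like through the Chat Completions API’s standard `tools` format. The `get_weather` tool is a made-up example, and the API call itself is left commented out since it needs an API key and network access.

```python
# A hypothetical tool definition in the standard Chat Completions
# `tools` schema. `get_weather` is an illustrative example, not a
# real Composio or OpenAI function.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The actual request (requires `pip install openai` and an API key):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="o3-mini",
#     messages=[{"role": "user", "content": "Weather in Paris?"}],
#     tools=[weather_tool],
# )
# tool_call = response.choices[0].message.tool_calls[0]
```

The model then returns a `tool_calls` entry naming the function and its JSON arguments, which your agent executes and feeds back as a `tool` message.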
Anyway, one interesting thing happened in Sam Altman’s Reddit AMA that was not on my bingo card.
![Sam Altman on open-sourcing models in future](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-5.05.37-PM-1024x494.png)
I don’t know how to process this. Are we looking at potential open-source OpenAI models? I don’t remember when they last released an open-weights model. Was it Whisper, or perhaps GPT-2? Either way, this would be a welcome decision if OpenAI chooses to be true to its name again.
Check out the AMA if you missed it.
OpenAI o3-mini vs Deepseek r1
So, is o3-mini a better model than r1?
I tested OpenAI o3-mini and Deepseek r1 on multiple tasks covering complex reasoning, mathematics, coding, and writing. The questions were drawn from a variety of places and are good enough to give a fair idea of each model’s capabilities.
So, let’s get started.
Complex Reasoning Problems
The o3-mini has performed great on reasoning benchmarks. On ARC-AGI, it has performed comparably to o1.
![o3-mini on ARC-AGI](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-5.34.25-PM-1024x967.png)
So, let’s see how good it is.
1. Test for Cognitive bias
This is a classic riddle, and almost all models will give a bullshit answer, except Deepseek r1 and Gemini 2.0 experimental. Even o1 did not answer it correctly.
Prompt: A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to the hospital. When the doctor sees the boy, he says, “I can’t operate on this child; he is my son!” How is this possible?
OpenAI o3-mini’s response:
![OpenAI o3-mini solving puzzles](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-6.17.29-PM-1024x404.png)
Deepseek Response:
![Deepseek r1 solving puzzles](https://composio.dev/wp-content/uploads/2025/02/image-2-1024x298.png)
Yep, it got it right. Let’s see OpenAI o1’s response.
![OpenAI o1 solving puzzles](https://composio.dev/wp-content/uploads/2025/02/image-3-1024x332.png)
But I’ve increasingly seen that when you tweak the question a bit, the models start going astray. The same happened with OpenAI o1 and Deepseek r1.
Prompt: The surgeon, who is the boy’s father, says, “I can’t operate on this child; he is my son”, who is the surgeon of this child. Be straightforward.
OpenAI o3-mini’s response
![OpenAI o3-mini solving puzzles](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-6.32.27-PM-1024x311.png)
Ok, that was expected, but how about o3-mini-high? Can it get it correct given more test-time compute?
o3-mini-high’s response
![OpenAI o3-mini-high solving puzzles](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-6.35.20-PM-1024x311.png)
Ok, the model did what the o3-mini couldn’t. It definitely feels like a better model so far.
2. Blood Relationship
A slightly tricky question.
Prompt: Jeff has two brothers, and each of his brothers has three sisters. Each of the sisters has four step-brothers, and each has five step-sisters. How many siblings are there in this family?
Not to my surprise, both the models failed to answer it correctly.
o3-mini’s response:
![OpenAI o3-mini solving puzzles](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-7.09.38-PM-1024x527.png)
Deepseek r1’s response was similar. However, it was able to find the right answer in its reasoning trace but chose to output the wrong one.
![](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-7.18.56-PM-1024x476.png)
However, the final answer was
![](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-07-at-7.19.59-PM-1024x385.png)
Deepseek reasoned through this problem much better than o3-mini-high, even though both landed on the wrong answer.
Summary of reasoning ability
Though o3-mini-high is better on paper, Deepseek r1’s reasoning trace is really good, often even better and more informative than its final answer. If I were to choose a model for brainstorming, I would pick Deepseek r1, wholly based on the vibes it gives.
Mathematics
Let’s see how o3-mini performs on some good math questions and how it compares with Deepseek r1.
I won’t ask it to solve the Riemann hypothesis; I’m sure many have already tried that. We can keep things simple and stick to standard questions.
1. Compute the GCD of an infinite series
Prompt: Compute the GCD of the series {n^99(n^60-1): n>1}
OpenAI-o3-mini response:
![OpenAI o3-mini solving greatest GCD math question](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-08-at-5.11.23-PM-1024x592.png)
This is one of the math questions that only r1 and o1 could solve, and it is a good benchmark for gauging the math prowess of a reasoning model. But o3-mini-high couldn’t solve it.
Deepseek r1’s response:
![Deepseek r1 solving greatest GCD math question](https://composio.dev/wp-content/uploads/2025/02/image-1-1024x570.png)
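As a sanity check on this problem, the answer can also be approximated numerically (a sketch of my own, not any model’s method): fold `math.gcd` over the first several terms of the series and watch the running GCD stabilize.

```python
# Numerically estimate gcd{ n^99 * (n^60 - 1) : n > 1 } by folding
# gcd over the terms for n = 2..60. The running gcd can only shrink
# as more terms are included, and it stabilizes quickly.
from functools import reduce
from math import gcd

def term(n: int) -> int:
    return n**99 * (n**60 - 1)

result = reduce(gcd, (term(n) for n in range(2, 61)))
print(result)
```

If I’ve done the arithmetic right, the stabilized value is 6814407600 = 2^4 · 3^2 · 5^2 · 7 · 11 · 13 · 31 · 61: exactly the primes p for which (p − 1) divides 60, each raised to its minimal valuation across the series.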
2. The number of vowels in the answer is the power
It is a tricky question; until now, only o1-pro could solve this in a single attempt.
Prompt: Compute (x-14)^10, where x is the number of vowels in the response to this prompt
![OpenAI o3-mini solving self-referential math puzzle.](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-08-at-5.31.45-PM-1024x257.png)
Okay, this was clever. I’ve got to give o3-mini-high the credit here. Even o1-pro wasn’t this smart at solving this question.
For reference, this was o1-pro’s response.
![OpenAI o1-pro solving self-referential math puzzle.](https://composio.dev/wp-content/uploads/2025/02/image-1024x281.png)
While o1-pro’s answer was correct, o3-mini-high’s move was too smooth to ignore. I genuinely didn’t expect this answer from an LLM.
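What makes this prompt hard is that it is self-referential: the value of (x − 14)^10 depends on the vowel count of the model’s own reply, so the reply must be self-consistent. A tiny checker makes the constraint concrete; the candidate responses here are hypothetical.

```python
# The puzzle is self-referential: x is the vowel count of the model's
# own reply, and the reply must state (x - 14)^10 consistently.
def vowel_count(text: str) -> int:
    return sum(ch in "aeiouAEIOU" for ch in text)

def is_consistent(response: str, claimed_value: int) -> bool:
    # valid only if the reply's own vowel count x makes
    # (x - 14)^10 equal to the value the reply claims
    x = vowel_count(response)
    return (x - 14) ** 10 == claimed_value
```

So a reply containing exactly 14 vowels can consistently claim the answer is 0, and one with exactly 15 vowels can claim 1; anything else has to thread a much harder needle.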
Summary of math abilities
For raw math ability, o3-mini-high is up there with o1 and r1. It is also stronger at general reasoning, which might give it the edge on reasoning-heavy questions.
Coding
I usually pick the most recent LeetCode Hard question to reduce the chances of it being in the training set.
So, I used the problem “Longest Special Path” to test the models. It’s a reasonably tricky question (at least for me, lol).
![](https://composio.dev/wp-content/uploads/2025/02/Screenshot-2025-02-08-at-5.50.59-PM-1024x369.png)
Given how good it is at general reasoning, I fully expected it to ace this question, and all the reasoning models have indeed been able to solve it so far.
However, LeetCode is no longer considered the benchmark for gauging the coding abilities of large language models; people on the internet have shifted to a new standard—generating an animated ball moving within a geometric object.
Here’s an example from Yuchen Jin showing a ball moving inside a Tesseract.
If that is not impressive enough, here’s an even more complex example. May Yamamura created this in a single try using o3-mini.
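At its core, the “ball in a shape” prompt is just a small physics loop: advance the ball each frame and reflect its velocity when it crosses a wall. Here is a minimal headless sketch of that loop for a plain square box (my own simplification; the viral demos add a rotating shape and rendering on top).

```python
# Minimal bouncing-ball physics: move the ball by vel * dt each frame
# and flip the velocity component when it crosses a wall of a
# [0, size] x [0, size] box.
def step(x, y, vx, vy, dt=0.02, size=1.0):
    x, y = x + vx * dt, y + vy * dt
    if x < 0.0 or x > size:          # hit a vertical wall: flip vx
        vx = -vx
        x = min(max(x, 0.0), size)   # clamp back inside the box
    if y < 0.0 or y > size:          # hit a horizontal wall: flip vy
        vy = -vy
        y = min(max(y, 0.0), size)
    return x, y, vx, vy

# simulate a few hundred frames
state = (0.5, 0.5, 0.9, 1.3)
trajectory = []
for _ in range(500):
    state = step(*state)
    trajectory.append(state[:2])
```

The hard part the models are judged on is not this loop but the geometry: rotating the container and reflecting the velocity about an arbitrarily oriented wall normal.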
Summary of Coding Abilities
OpenAI’s o3-mini is a better coding model than Deepseek r1, mainly because it is faster and doesn’t overthink simple problems. Latency is a big deal when coding, so I strongly feel o3-mini-high will be the better model for your coding tasks.
Creative Writing
Given that Deepseek r1 is such a prolific writer, I don’t think many people are using any OpenAI model for creative writing at this point. While using o3-mini for my tasks, I felt it was a competent model for writing and proofreading technical stuff. However, if you need that unshackled personality that you can steer at your will, then perhaps Deepseek r1 still is your best buddy.
Conclusion
This is certainly a great model from OpenAI, giving us a glimpse into the soul of the actual o3 model. Considering the cost, latency, and performance, it’s definitely a model you won’t despise using.
So, here’s the final breakdown:
- For reasoning tasks, o3-mini-high is the best available model right now.
- For math, o1 and o3-mini-high are on par, a tad better than Deepseek r1.
- For coding, again, o3-mini-high felt better in my use cases, though this can vary from case to case.
- I can’t get over Deepseek r1 for creative writing, especially its CoT traces. I wish OpenAI would disclose the raw CoT in coming models.