GPT-4.1 vs. Deepseek v3 vs. Sonnet 3.7 vs. GPT-4.5

by Sunil Kumar Dash · Apr 17, 2025 · 7 min read
LLM

GPT-4.1 is here. As many speculated, the mysterious Quasar Alpha on OpenRouter was GPT-4.1, and Optimus Alpha was GPT-4.1 mini. And they are not open-source. OpenAI has been teasing open-sourcing some of its models, and everyone expected this to be an open-source launch, but I guess we have to wait.

Anyway, they launched three models: GPT-4.1, Mini, and Nano. The main attraction is that they are significantly cheaper while maintaining comparable performance with their peers. GPT-4.1 has a one-million-token context window, a first for OpenAI, and it performed excellently on their internal benchmarks.

However, the most important detail from the launch was the deprecation of GPT-4.5 in the API, a model released only a few months back. The reason given was that GPT-4.1 is better than GPT-4.5. GPT-4.1 is not directly accessible in the ChatGPT app, which suggests it has already been incorporated in some capacity into the existing GPT-4o.

This is the first time I've ever seen a product numbered 4.1 deprecate a 4.5.

This article will cover GPT-4.1 and compare its coding and creative writing abilities with those of Deepseek v3, GPT-4.5, and Claude 3.7 Sonnet.

Table of Contents

  • GPT-4.1: The good and the bad

  • How does it compare to GPT-4.5, Deepseek v3 0324, and Claude 3.7 Sonnet

  • Coding

    • Simon Willison's SVG test

    • Pygame Simulation

    • 3js Simulation

  • Creative Writing

    • Sisyphus's Advice

  • Conclusion

TL;DR

If you have somewhere else to be, here's a summary.

  • The new GPT-4.1 has a one-million-token context window and a 32k output-token limit.

  • GPT-4.1 scored 52.4% on Aider's Polyglot benchmark and 55% on SWE-bench Verified.

  • OpenAI committed a chart crime.

  • The model lacks the code generation capabilities of Deepseek v3 and Claude 3.7 Sonnet.

  • It doesn't excel at creative writing; GPT-4.5 is still the best at it.

GPT-4.1: The good and the bad

GPT-4.1 has a one-million-token context window and can generate up to 32k output tokens, a twofold jump from GPT-4o, which topped out at 16k.

As OpenAI claims, the model is specially tuned for coding and function-calling tasks. It scored 55% on SWE-bench Verified and 52.4% on Aider's Polyglot benchmark. The numbers are similar to those of Deepseek v3 0324 and Claude 3.5 Sonnet.

These numbers would've been impressive if Deepseek v3 and Gemini 2.5 Pro, which offer more for the price, weren't already available. As it stands, it's not enough to woo developers.

They didn't bother showing competitors

Deepseek offers the same performance at 1/9th of the price, and Gemini is more performant at 1/3rd of the price.

But worst of all, they committed a heinous chart crime.

What is that chart even supposed to mean? And how did it get approved? Crazy stuff is happening. This was more disappointing than the launch itself.

But anyway, there is one other thing that nobody is talking about. OpenAI released two evals, MRCR and Graphwalks. The MRCR dataset evaluates a model's performance on long-context retrieval, and it's much better and more nuanced than a typical needle-in-a-haystack test. Graphwalks, on the other hand, is used to evaluate multi-hop long-context reasoning.

How does it compare to GPT-4.5, Claude 3.7 Sonnet, and Deepseek v3 0324

OK, the immediate competitor to GPT-4.1 is Deepseek v3 0324, the latest checkpoint of the original v3. It offers similar coding performance at a much better price. I was also very interested in how much 4.1 offers over the outgoing GPT-4.5, especially in writing, because 4.5 has been the best model for me for creative writing tasks.

So, let's compare these four models on coding, reasoning, and writing, as most real-life use revolves around those.

Coding

1. Simon Willison's SVG test

It's an interesting test in which we ask the model to generate an SVG of a pelican riding a bicycle. It gives a very good idea of an LLM's general creative and coding capability.
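If you want to reproduce the test yourself, here's a minimal sketch that calls OpenAI's Chat Completions REST endpoint directly, assuming an `OPENAI_API_KEY` environment variable is set; the prompt wording is my paraphrase of the test, not Willison's exact text:

```python
import json
import os
import urllib.request

PROMPT = "Generate an SVG of a pelican riding a bicycle."

def build_request(model: str) -> dict:
    # Standard Chat Completions payload; swap in any model you want to test.
    return {"model": model, "messages": [{"role": "user", "content": PROMPT}]}

def run(model: str) -> str:
    # Requires network access and a valid OPENAI_API_KEY.
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(build_request(model)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example: save run("gpt-4.1") to pelican.svg and open it in a browser.
```

Save the returned markup to a `.svg` file and open it in a browser to judge the result.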

Here's the outcome from all the models.

GPT-4.1 seems to have done better than all the models, closely followed by Deepseek v3 0324.

2. Ball simulation

You may not like this test, but it's still a very good test of instruction following, code understanding, and code generation. All you need to do is tweak the prompt.

Prompt: Write a Python script that visually simulates five differently colored balls dynamically bouncing within multiple nested, rotating hexagons. Each hexagon rotates at a distinct speed, creating an engaging, mesmerizing pattern. All nested hexagons, except the outermost one, must have exactly one randomly selected missing wall, allowing balls to seamlessly pass between inner and outer hexagons. The balls initially start within the innermost hexagon. Implement realistic physics by incorporating gravity, friction, and angular momentum, ensuring natural and convincing interactions between the balls and the spinning walls. Provide adjustable parameters for gravity strength, friction level, rotation speeds, and hexagon sizes, allowing easy experimentation with the simulation's dynamics. Pay particular attention to the collisions between the balls and the rotating walls to ensure accuracy and realism. Make it minimal
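Before looking at the outputs, it's worth noting what the prompt actually demands: hexagon vertices recomputed each frame as the walls rotate, and a velocity reflection off a wall normal with a restitution factor. Here's a minimal, Pygame-free sketch of that core math, as illustrative scaffolding rather than any model's actual output:

```python
import math

def hexagon(cx: float, cy: float, radius: float, angle: float):
    """Vertices of a regular hexagon centred at (cx, cy), rotated by `angle` radians."""
    return [
        (cx + radius * math.cos(angle + i * math.pi / 3),
         cy + radius * math.sin(angle + i * math.pi / 3))
        for i in range(6)
    ]

def reflect(vx: float, vy: float, nx: float, ny: float, restitution: float = 1.0):
    """Reflect velocity (vx, vy) off a wall with unit normal (nx, ny).

    restitution=1.0 is a perfectly elastic bounce; lower values lose energy,
    one simple way to model the prompt's friction parameter.
    """
    dot = vx * nx + vy * ny
    return (vx - (1 + restitution) * dot * nx,
            vy - (1 + restitution) * dot * ny)
```

A full solution would advance each hexagon's `angle` per frame at its own speed, skip the randomly removed edge when checking collisions, and add the wall's tangential velocity before reflecting so the spinning walls impart momentum to the balls.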

Let's see how all these models performed.

3. 3JS Simulation

Prompt: write me the complete code in 3js for a metropolitan city landscape simulation. Make it browser-friendly so it can run lag-free

GPT-4.1

GPT-4.5

Deepseek v3 0324

Claude 3.7 Sonnet

Deepseek v3 0324 generated the best-looking simulation with more functionality, though a few features were not working. Claude 3.7 Sonnet was impressive as well.

However, the OpenAI models failed to generate working code and couldn't fix the problems even with repeated prompting. In fact, Deepseek v3 ended up debugging the GPT-4.1 and GPT-4.5 code. That's understandable for GPT-4.5, which was never a coding model, but GPT-4.1 was very bad.

Creative writing

This is my favourite. If you spend a lot of time talking to these models, you will know that each has a different personality, just like humans.

For this, we test the models by giving them a situation and asking them to narrate a 1,000-word story. The prompts are from EQ-Bench.

Sisyphus Gives Advice

Prompt: Write a comedic scene in the style of Terry Pratchett featuring Sisyphus, who has started an advice column while continuing his eternal boulder-pushing. Rather than being bitter about his punishment, Sisyphus has developed a comfortably zen emotional intelligence and wisdom from his centuries of repetitive labor. Not in the cliche ways you might expect -- he's actually very grounded, warm & connected with the everyday human plight. He dictates responses to "Crow," his feathered familiar who serves as both scribe and mail carrier. The piece should include Sisyphus thoughtfully answering several advice-seekers' letters about everyday problems (relationship troubles, work-life balance, dealing with difficult people, life purpose doubts) while periodically returning to his boulder-pushing duties. His advice should be surprisingly insightful and practical. The humor should emerge naturally from the messages he receives, from mundane to risque. Sisyphus should genuinely enjoy helping others, finding meaning in this new pursuit while maintaining his boulder routine. The humour should be *understated and not forced*. Contrary to expectation, almost none of his advice involves boulder rolling metaphors (though he does sneak *one* subtle in there), nor a treatise on stoicism. We are also going against the obvious fish-out-of-water trope; in fact, he's surprisingly a natural at this. In all of this , it's important to *show, don't tell*. The format should be largely epistolary, with the occasional interjection for scene setting. 1000 words.

GPT-4.1

GPT-4.5

Deepseek v3 0324

Claude 3.7 Sonnet

It was fascinating to see the outputs of all these models. You can try other prompts and see for yourself. In my limited testing, GPT-4.5 was the best one for creative writing. It has humour, nuance, and everything that makes prose good.

It was followed by Deepseek v3 and Claude 3.7 Sonnet, with GPT-4.1, surprisingly, the worst of the lot.

Conclusion

GPT-4.1, OpenAI's developer-focused model, doesn't make a compelling case for developers or general users. It is decent at tasks, but nothing about it will make you switch from Gemini 2.5 or Deepseek v3 0324; it has no cost or capability advantage over its immediate competitors.
