Notes on the new Deepseek v3

Deepseek released their flagship model, v3, a 671B mixture-of-experts model with 37B active parameters. It is currently the best open-source model, beating Llama 3.1 405B, Qwen, and Mistral, and the benchmarks put it on par with OpenAI GPT-4o and Claude 3.5 Sonnet. It is the first open model that performs on par with, and on some tasks better than, the big closed models.

TL;DR

  • Deepseek achieved remarkable performance while keeping training costs surprisingly low.
  • The model offers exceptional value, outperforming open-source and closed alternatives at its price point.
  • Their success stems from breakthrough engineering: an MoE architecture, FP8 mixed precision training, and a custom HAI-LLM training framework.
  • Deepseek excels at reasoning and math, surpassing GPT-4o and Claude 3.5 Sonnet.
  • For writing and coding tasks, Claude 3.5 Sonnet maintains a slight lead.

Remarks on Deepseek v3

Deepseek pre-trained this model on 14.8 trillion high-quality tokens, taking 2,788,000 GPU hours on an Nvidia H800 cluster and costing only around $6 million. In comparison, Llama 3.1 405B took roughly 11x that compute, 30,840,000 GPU hours, on a similar 15 trillion tokens. Check the full report here.
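
The headline figure is simple rental-price math: the report assumes roughly $2 per H800 GPU-hour, and 2,788,000 GPU-hours × ~$2/hour ≈ $5.6 million.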

For context, US export controls bar Chinese companies from buying Nvidia's top-tier GPUs; the H800 is a bandwidth-limited variant built to comply with those restrictions.

Karpathy Says

Karpathy's response to Deepseek v3 launch

Tim Dettmers, author of the famous paper QLoRA, praises the sheer engineering prowess of Deepseek.

Tim Dettmers's response to Deepseek v3 launch

Cheapest, most capable model

Deepseek v3 is also the cheapest model right now, considering its capabilities.

Deepseek price to performance ratio

Emad Mostaque, founder of Stability AI, remarks

Emad Mostaque's response to Deepseek v3 launch

The API pricing is as follows.

Deepseek pricing list

This unlocks a whole new world of possibilities—a GPT-4o and Claude 3.5 Sonnet-level model at a fraction of the cost is the ultimate holiday treat every AI developer has on their wishlist.

Summary of how Deepseek v3 was trained so efficiently

1. Model Architecture

  • They use a Mixture-of-Experts (MoE) architecture, where only 37B parameters are activated for each token out of the total 671B. This sparse activation significantly reduces compute requirements compared to dense models; see the routing sketch after this list.
  • They employ Multi-head Latent Attention (MLA), which compresses the Key-Value cache, reducing memory usage and enabling more efficient training.
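
To make the sparse-activation idea concrete, here is a minimal NumPy sketch of top-k expert routing. This is my own toy illustration, not Deepseek's actual routing code; all names and sizes are made up.

import numpy as np

def moe_forward(x, experts, router_w, k=2):
    # Score the token against every expert, but run it through only the
    # top-k: compute cost scales with k, not with the total expert count.
    scores = router_w @ x
    topk = np.argsort(scores)[-k:]
    gates = np.exp(scores[topk] - scores[topk].max())
    gates /= gates.sum()  # softmax over the chosen experts only
    return sum(g * (experts[i] @ x) for g, i in zip(gates, topk))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d))
y = moe_forward(rng.standard_normal(d), experts, router_w)
print(y.shape)  # (16,) -- only 2 of the 8 expert matrices were touched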

2. FP8 Mixed Precision Training

  • They implemented an FP8 mixed precision training framework, which reduces memory usage and accelerates training compared to higher-precision formats.
  • FP8 cuts the memory footprint by up to 50% compared to FP16, and even more versus FP32.
  • They use fine-grained quantization strategies and increased accumulation precision to maintain accuracy; a toy illustration follows this list.
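
Deepseek's actual FP8 kernels are custom GPU code, but as rough intuition for what "fine-grained quantization" means, here is a toy NumPy sketch (my own illustration) that quantizes a tensor in small blocks, each with its own scaling factor, instead of one scale for the whole tensor:

import numpy as np

def blockwise_quantize(x, block=128, levels=448.0):
    # One scaling factor per block of 128 values ("fine-grained"), rather
    # than a single scale per tensor. levels=448 loosely mirrors the largest
    # normal value of FP8 E4M3; rounding simulates the coarse grid.
    blocks = x.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / levels
    return np.round(blocks / scales), scales

def blockwise_dequantize(q, scales):
    return (q * scales).reshape(-1)

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = blockwise_quantize(x)
print(np.abs(x - blockwise_dequantize(q, s)).max())  # small reconstruction error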

3. Load Balancing Strategy

  • They pioneered an auxiliary-loss-free strategy for load balancing in the MoE architecture, which improves performance without the drawbacks of traditional auxiliary-loss methods; a toy sketch of the idea appears below.
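
The idea, roughly: add a per-expert bias to the routing scores when picking the top-k experts, then nudge that bias up for underused experts and down for overloaded ones, so balance emerges without an auxiliary loss term fighting the main objective. A simplified toy simulation (my own sketch, with made-up numbers):

import numpy as np

rng = np.random.default_rng(0)
n_experts, k, gamma = 8, 2, 0.01
bias = np.zeros(n_experts)
popularity = np.linspace(0.0, 2.0, n_experts)  # some experts naturally score higher

for step in range(500):
    counts = np.zeros(n_experts)
    for _ in range(256):  # one batch of tokens
        scores = rng.standard_normal(n_experts) + popularity
        # The bias only influences which experts get selected, not the gate
        # values, so no extra loss term is added to the training objective.
        counts[np.argsort(scores + bias)[-k:]] += 1
    bias -= gamma * np.sign(counts - counts.mean())  # cool down hot experts

print(counts)  # per-expert batch load ends up roughly uniform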

4. Training Framework

  • They developed a custom training framework called HAI-LLM with several optimizations:
    • DualPipe algorithm for efficient pipeline parallelism, reducing pipeline bubbles and overlapping computation and communication.
    • Efficient cross-node all-to-all communication kernels to fully utilize network bandwidth.
    • Careful memory optimizations to avoid using costly tensor parallelism.

Source: Perplexity | Deepseek

With all these advancements, Deepseek was able to pull off the remarkable feat of training such a large model for only ~$6 million.

Chain of thought with r1

Moreover, Deepseek has added a new deep think feature, incorporating the chain-of-thought (CoT) of Deepseek’s R1 series of models into v3 LLM.

Post-Training: Knowledge Distillation from DeepSeek-R1

You can enable this feature in the Deepseek chat. Though it’s not as good as o1, it still improves the reasoning abilities of the LLM to some extent.

Deepseek v3 vs OpenAI GPT-4o vs Claude 3.5 Sonnet

Deepseek has insane scores across major benchmarks, making every AI enthusiast go gaga over it. But how does it compare to GPT-4o and Claude 3.5 Sonnet in real-life use? Is the hype justified, or is this just another overhyped release?

To better understand how they compare, I tested all three models using my set of benchmark questions, focusing on four key areas: reasoning, math, coding, and creative writing.

To set the context straight, GPT-4o and Claude 3.5 Sonnet failed all the reasoning and math questions, while only Gemini 2.0 1206 and o1 managed to get them right. So I do not expect Deepseek v3 to get all the answers; whatever we get is an improvement.

So, let’s dive in.

Reasoning

Reasoning is one of the most critical aspects of an intelligent system, and so far, in my tests, o1 has been the top performer, followed by Gemini 2.0 1206. So, let’s see how Deepseek performs.

Start with this initial question.

1. Finding the fourth word of the response

Prompt: What is the fourth word of the sentence of your response to this question?

This was the response from Deepseek v3.

Deepseek v3 answering the puzzle

Response with Deepthink CoT enabled.

Deepseek v3 with deepthink enabled

With Deepthink enabled, the model's boosted reasoning answered the question correctly. The CoT reasoning works; even if it is not native, there is certainly a performance boost.

2. Count the number of words in the response

Prompt: Count the number of words in the response to this prompt.

Deepseek counting the number of words

It was a miss. But even GPT-4o and Claude 3.5 Sonnet failed at this.

3. How many ‘r’ in Strawberry?

This is a pretty dumb question, but GPT-4o has never gotten it right. Let’s see if Deepseek v3 does.

I would have been disappointed if Deepseek failed at this.
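
For the record, the answer is three, which takes one line of standard Python to verify:

print("strawberry".count("r"))  # 3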

4. Farmer and Sheep Problem

It’s a simple problem that could easily knock any LLM off its feet.

Prompt: A farmer stands with the sheep on one side of the river. A boat can carry only a single person and an animal. How can the farmer get himself and the sheep to the other side of the river with minimum trips?

Neither GPT-4o nor Claude 3.5 Sonnet could answer this simple question correctly; only o1 found the correct answer without any assistance. Let's see how Deepseek performs.

Deepseek answering farmer sheep problem

It did all the analysis to arrive at the wrong answer. Let’s see if there is any improvement with Deepthink enabled.

It thought for 30 seconds just to arrive at the same conclusion.

Deepseek answering farmer sheep problem with deepthink enabled

Summary on reasoning

It might not be as good as o1 at reasoning, but it definitely feels up there with Sonnet and GPT-4o; sometimes it even feels better than both. And again, for the price, it beats everyone.

Mathematics

1. 5.11-5.90 =?

It’s a straightforward question, but one that easily stumps even bigger LLMs.

Deepseek solving 5.11-5.90 =?
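
The correct answer is -0.79, which is easy to confirm (and a reminder that even Python's binary floats don't represent it exactly):

from decimal import Decimal

print(Decimal("5.11") - Decimal("5.90"))  # -0.79, exact decimal arithmetic
print(5.11 - 5.90)                        # close to -0.79, but not exact in binary floating point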

2. Find the possible 4th vertex of a parallelogram

Prompt: If three corners of a parallelogram are (1,1), (4,2), and (1,3), what are all the possible fourth corners?

This is a question from Gilbert Strang’s Linear Algebra. Only o1 and Gemini 2.0 1206 have aced this question from my tests. Both GPT-4o and 3.5 Sonnet can only find a single possible vertex.

Let’s see how Deepseek v3 performs.

Deepseek finding the possible 4th vertex of a parallelogram
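
For reference, there are three valid answers: (-2, 2), (4, 0), and (4, 4). Each choice of the corner opposite the missing vertex yields one candidate, since the diagonals of a parallelogram bisect each other. A quick check:

A, B, C = (1, 1), (4, 2), (1, 3)

# The missing vertex is the sum of the two adjacent corners minus the
# corner opposite it: e.g. D = A + C - B when B is opposite D.
candidates = [
    (A[0] + C[0] - B[0], A[1] + C[1] - B[1]),  # (-2, 2)
    (A[0] + B[0] - C[0], A[1] + B[1] - C[1]),  # (4, 0)
    (B[0] + C[0] - A[0], B[1] + C[1] - A[1]),  # (4, 4)
]
print(candidates)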

This was awesome. The model is better at mathematics than GPT-4o and Claude 3.5 Sonnet.

3. Finding the sum of integers

This is a slightly more difficult question, but it can cement Deepseek v3 as the best mathematics model among the three, ahead of GPT-4o and Claude 3.5 Sonnet.

Prompt: The greatest common divisor of two positive integers less than 100 equals 3. Their least common multiple is twelve times one of the integers. What is the largest possible sum of the two integers?

I enabled the Deepthink feature to give the model more firepower, and it didn’t disappoint.

At this point, it is clear that the model is better at math tasks than the other two.
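
The answer is 129 (the pair 93 and 36), which a brute-force search confirms:

from math import gcd, lcm  # math.lcm requires Python 3.9+

best = 0
for m in range(1, 100):
    for n in range(1, 100):
        # gcd must be 3 and the lcm must be 12 times one of the integers
        if gcd(m, n) == 3 and lcm(m, n) in (12 * m, 12 * n):
            best = max(best, m + n)
print(best)  # 129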

Summary of Math ability

The benchmark scores weren’t lying. The model is better at math tasks than GPT-4o and Claude 3.5 Sonnet.

Coding

For coding, I mostly use this LeetCode “Hard” question, which is relatively new and less likely to have appeared in LLM training data.

The “Super Heroes” problem is a relatively tricky dynamic programming problem of the sort used in recent competitive coding competitions.

The question:

You are given a 0-indexed integer array nums representing the strength of some heroes. The power of a group of heroes is defined as follows:

Let i0, i1, ... ,ik be the indices of the heroes in a group. Then, the power of this group is max(nums[i0], nums[i1], ... ,nums[ik])^2 * min(nums[i0], nums[i1], ... ,nums[ik]).
Return the sum of the power of all non-empty groups of heroes possible. Since the sum could be very large, return it modulo 10^9 + 7.

Example 1:

Input: nums = [2,1,4]
Output: 141
Explanation: 
1st group: [2] has power = 2^2 * 2 = 8.
2nd group: [1] has power = 1^2 * 1 = 1.
3rd group: [4] has power = 4^2 * 4 = 64.
4th group: [2,1] has power = 2^2 * 1 = 4.
5th group: [2,4] has power = 4^2 * 2 = 32.
6th group: [1,4] has power = 4^2 * 1 = 16.
7th group: [2,1,4] has power = 4^2 * 1 = 16.
The sum of powers of all groups is 8 + 1 + 64 + 4 + 32 + 16 + 16 = 141.

Example 2:

Input: nums = [1,1,1]
Output: 7
Explanation: A total of 7 groups are possible, and the power of each group will be 1. Therefore, the sum of the powers of all groups is 7.
 

Constraints:

1 <= nums.length <= 10^5
1 <= nums[i] <= 10^9

Something really interesting happened with this question. The first time around, the model completely bombed—it couldn’t pass a single test case. But when I asked it to try again, it came back with this perfect solution that beat every other submission!

I kept testing this repeatedly, and the same thing happened every time. Is the model this clever, or has it just seen this question before in its training? I’m betting it’s probably the latter.

The initial answer from the Deepseek v3

from typing import List

class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        MOD = 10**9 + 7  # the model's original answer never defined MOD
        nums.sort()
        n = len(nums)
        total = 0

        # Precompute the squares of each element
        power = [x * x % MOD for x in nums]

        # Precompute the prefix sum of squares
        prefix_power = [0] * n
        prefix_power[0] = power[0]
        for i in range(1, n):
            prefix_power[i] = (prefix_power[i-1] + power[i]) % MOD

        for i in range(n):
            # Number of subsets where nums[i] is the minimum is 2^(n - i - 1)
            num_subsets = pow(2, n - i - 1, MOD)

            # The maximum can be any element from i to n-1, so pair nums[i]
            # with the suffix sum of squares. (Multiplying the whole suffix
            # sum by 2^(n-i-1) overcounts, which is why this fails the tests.)
            if i == 0:
                sum_power = prefix_power[n-1]
            else:
                sum_power = (prefix_power[n-1] - prefix_power[i-1]) % MOD

            contribution = (nums[i] * sum_power) % MOD
            contribution = (contribution * num_subsets) % MOD
            total = (total + contribution) % MOD
        return total

Revised answer

from typing import List

class Solution:
    def sumOfPower(self, nums: List[int]) -> int:
        MOD = 10**9 + 7
        nums.sort()  # Sort so each element acts as the max of the groups it heads
        total = 0
        prev_sum = 0  # Sum of min values over all groups ending before `num`

        for num in nums:
            # With `num` as the max, the sum of min values across all groups
            # it can head is (prev_sum + num); its contribution is num^2 times that.
            total = (total + num * num * (prev_sum + num)) % MOD
            # Each earlier group can either include or exclude the next element,
            # so the sum of mins doubles, plus `num` for the new singleton.
            prev_sum = (prev_sum * 2 + num) % MOD

        return total

Deepseek solving LeetCode Hard
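
Running the revised solution on the two examples from the problem statement reproduces the expected outputs:

sol = Solution()
print(sol.sumOfPower([2, 1, 4]))  # 141
print(sol.sumOfPower([1, 1, 1]))  # 7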

Summary of coding ability

From what I’ve seen, this model comes really close to GPT-4o’s coding abilities, though Claude 3.5 Sonnet still has a slight edge over Deepseek v3. But here’s the thing: Deepseek’s pricing makes it incredibly compelling. Given the performance-to-cost ratio, it’s your best bet if you’re looking to deploy an LLM for user-facing applications.

Creative Writing

This is somewhat subjective and depends on your tone and personality preference. GPT-4o always adopts a rather corporate tone and tries hard to please you. In that regard, I always found Sonnet to be more humane with its own set of views and opinions.

So, what about Deepseek v3? In my brief testing, I found Deepseek’s output to be scarily similar to GPT-4o’s. The response pattern, the paragraph structuring, and at times even the word choice are nearly identical to GPT-4o.

For example, I asked both to improve a paragraph, and the outputs were almost identical.

From GPT-4o

OpenAI response to rephrasing a paragraph

From Deepseek v3

Deepseek response to rephrasing a paragraph

The similarities are way too great to ignore. They probably trained the model on a synthetic dataset generated by GPT-4o.

Summary of Creative Writing

Well, Deepseek is as good as GPT-4o; if you liked GPT-4o, you will be OK with Deepseek. Again, considering the cost, it is the better option overall. Personally, I still prefer o1 and Claude 3.5 Sonnet for writing and brainstorming.

Final Verdict

  • For reasoning, Deepseek v3 is the better model, followed by Claude 3.5 Sonnet and then OpenAI GPT-4o.
  • For math, again, Deepseek v3 > Claude 3.5 Sonnet > OpenAI GPT-4o.
  • For coding, Claude 3.5 Sonnet > Deepseek v3 ≥ OpenAI GPT-4o.
  • For creative writing, Claude 3.5 Sonnet > Deepseek v3 ≈ OpenAI GPT-4o.

Who should use Deepseek v3?

  • If most of your use cases involve GPT-4o, you can safely switch.
  • If you’re building applications on top of LLMs, Deepseek v3 is a no-brainer; the cost-to-performance ratio makes it ideal for client-facing AI applications.
  • The weights are open, so you can host it yourself, giving you more control over the LLM.