Reka Flash 3 vs QWQ 32b

by Harsh · Apr 4, 2025 · 12 min read

Reka Flash 3 and QwQ 32b are two of the best reasoning models in their segment. But which one is better for running on consumer hardware, with the strongest reasoning, coding, and creative writing ability?

So, I ran a few tests to find out which model is better for local use cases.

Table of Contents

  • Coding

  • Reasoning

  • Math

  • Creative Writing

    • Sci-fi story writing

    • Role Play

  • Conclusion

Coding

The first domain in which we will test both models is coding. Current LLMs are proficient at writing clean and optimized code. Let’s see how both models perform on two different tasks.

3D Simulation - JavaScript

A 3D simulation requires accurate particle physics calculations, proper collision detection logic, and performance optimizations for large particle counts. Most proprietary LLMs nowadays do pretty well here but rarely produce clean, concise code; let’s see how our contenders perform.
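Before looking at the outputs, here is a minimal sketch of the math core such a simulation needs: letter positions from a Fibonacci lattice, a Y-axis rotation, and depth-based brightness. Rendering (canvas/DOM) is omitted, and all names are illustrative, not taken from either model's output.

```javascript
// Core math for a rotating "sphere of letters". Rendering is omitted;
// every name here is illustrative, not from either model's output.
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".split("");

// Evenly distribute n points on a unit sphere (Fibonacci lattice).
function spherePoints(n) {
  const golden = Math.PI * (3 - Math.sqrt(5));
  return Array.from({ length: n }, (_, i) => {
    const y = 1 - (2 * i) / (n - 1);              // y runs from 1 to -1
    const r = Math.sqrt(Math.max(0, 1 - y * y));  // ring radius at that height
    const theta = golden * i;
    return { x: Math.cos(theta) * r, y, z: Math.sin(theta) * r };
  });
}

// Rotate a point around the Y axis by `angle` radians.
function rotateY({ x, y, z }, angle) {
  const c = Math.cos(angle), s = Math.sin(angle);
  return { x: c * x + s * z, y, z: -s * x + c * z };
}

// Map depth z in [-1, 1] to brightness: nearest = 1, farthest = 0 (grey).
const brightness = (z) => (z + 1) / 2;

// One animation frame: each letter gets a rotated position and a grey level.
function frame(angle) {
  return spherePoints(LETTERS.length).map((p, i) => {
    const q = rotateY(p, angle);
    return { letter: LETTERS[i], ...q, level: brightness(q.z) };
  });
}
```

A renderer would then draw each letter at its projected (x, y) with opacity proportional to `level`.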

Prompt: Create a JavaScript simulation of a rotating 3D sphere made up of letters.
The closest letters should be brighter, while the ones farthest away should appear grey.

Output: Reka Flash 3

Code link: 3d_rotating_letters_reka.html

How To Use:

  1. Save the code as an HTML file

  2. Open it in a modern web browser

Browser Output (Reka Flash 3)

The output is completely different from what was expected: no dark background, no functionality, and a very long generation time.

Despite multiple attempts, the model failed to generate the right code and kept repeating its previous mistakes. I suspect this was the result of overthinking.

However, when I tried the same prompt with numbers instead of letters, it generated a good output with all the functionality.

Reka 3D Generation Test 2: 3d_rotating_numbers_reka.html

Output: QwQ 32b

Code Link: JS (3d_rotating_letters_qwq.js), HTML (3d_rotating_letters_reka.html)

Browser Output (3D letter rotating simulation)

Though the model generated bad code on the first attempt, the second attempt produced a working output. The output is modular, polished, renders perfectly, and follows all the instructions in the prompt except the speed controls. It's definitely a go-to for small 3D scene rendering tasks.

My only complaint is that it doesn’t provide run instructions unless explicitly asked.

Let’s check the performance on LeetCode.

LeetCode - Problem #3463

I found this one while writing another blog, and it became my benchmark question for evaluating the coding capabilities of LLMs, as it requires first-principles thinking rather than library usage. Let’s see how Reka Flash 3 and QwQ 32b perform:

Note: This problem has only an 8.8% acceptance rate on LeetCode, and only Grok 3 was able to solve it. In short, it's a really hard problem.
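If 3463 follows its easier variant 3461 (same operation, larger constraints), the task is to repeatedly replace each adjacent digit pair with its sum mod 10 until two digits remain, then check whether they are equal. A naive O(n²) simulation of that operation, roughly the shape of a submission that times out, might look like this (a hedged sketch, not either model's actual code):

```javascript
// Naive simulation (hedged reconstruction): repeatedly collapse adjacent
// digit pairs with (sum % 10) until two digits remain. O(n^2) overall,
// which is what triggers TLE at this problem's constraints; an accepted
// solution needs number-theoretic shortcuts (binomial coefficients mod 10).
function digitsAreEqualNaive(s) {
  let d = [...s].map(Number);
  while (d.length > 2) {
    // Each pass shrinks the array by one element.
    d = d.slice(0, -1).map((x, i) => (x + d[i + 1]) % 10);
  }
  return d[0] === d[1];
}
```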

Prompt: 3463. Check If Digits Are Equal in String After Operations II

Output: Reka Flash 3

Results (Time Limit Exceeded error)

Time Limit Exceeded

The code passed all public test cases but failed the private test cases within the time limit.

Output: QwQ 32b

Results (Time Limit Exceeded error)

Time Limit Exceeded

As expected, both models failed to solve the question despite nudging toward first-principles reasoning, which shows the limits of their capability.

Final Thoughts

  • For small coding tasks, both models work well; however, they require explicit, careful prompting. For rendering-related tasks, QwQ is the better choice.

  • What holds Reka Flash 3 back is that its reasoning steps are embedded in the output itself rather than in a separate collapsible section, along with its frequent hallucinations.

  • Personally, I would use neither; I would go with Gemma / Llama, which are much better at coding and rendering tasks.

Let’s move on and check the models on the next task.

Reasoning

The second domain for testing is Reasoning. Modern LLMs excel at identifying details and solving problems. Let’s see how both models perform on my reasoning questions.

River Crossing Puzzle - Modified

This question is a modified version of a flash game I used to play as a kid. The idea is to ferry a wolf, a cow, and fodder across a river, but the farmer can carry only one at a time. If the wrong pair is left alone, one of the two food chains kicks in, so the question is how to get everything across.

The modified version simplifies the puzzle by introducing three secure compartments while retaining the original details. Spotting this change is the key to the solution: since the compartments are isolated, all three can be taken across in a single trip.

Let’s see which models will discover this.

Prompt: Crossing The River

A farmer wants to cross a river and take a wolf, a goat and a cabbage with him. He has a boat with three secure separate compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer efficiently bring the wolf, the goat and the cabbage across the river without anything being eaten?

Output: Reka Flash 3

Fantastic: it caught the intricate detail and solved the problem in 64.9 seconds (most of the time went to CoT reasoning, 4391 tokens). Impressive for an open-source model!

Now let’s check the response from QwQ 32b.

Output: QwQ 32b

The model failed to grasp the intricate details of the question and reverted to the classic logic puzzle answer, so it failed to generate the correct response, in 21.8 seconds (560 tokens).

Now, let’s test the models on another question.

Minimum Number of Weighings Needed

My uncle once asked me this question. It falls under binary decision testing (divide and conquer), logical deduction, and optimization. I failed to solve it; let’s see how both models perform.

Prompt: Minimum Weight Balancing

You are given 8 identical-looking balls. One of them is slightly heavier, but you don't know which one. You have a balance scale that allows you to compare the weight of two sets of balls. What is the minimum number of weights necessary to guarantee finding the heavier ball?

Solution

  1. Divide the eight balls into 3-3-2 groups.

  2. Weigh the two groups of 3 balls against each other.

  3. If one group is heavier, weigh any 2 balls from that group against each other.

  4. If one of those two balls is heavier, it is the odd ball; if they balance, the remaining ball of that group is.

  5. If the two groups of 3 balanced in step 2, weigh the 2 leftover balls against each other to find the heavier one.

Expected: Thus, two weighings suffice.
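The steps above can be verified exhaustively: simulate every possible heavy ball and count the weighings. A small sketch (the names are mine, not model output):

```javascript
// Verify the 3-3-2 strategy: for every possible heavy ball among 8,
// identify it and count the weighings used. Names are illustrative.

// Compare two groups of ball indices; `heavy` is the index of the
// heavier ball. Returns -1 (left heavier), 1 (right heavier), or 0.
function weigh(left, right, heavy) {
  const w = (g) => g.reduce((sum, b) => sum + (b === heavy ? 1.1 : 1), 0);
  const [a, b] = [w(left), w(right)];
  return a > b ? -1 : a < b ? 1 : 0;
}

// Apply the 3-3-2 strategy; returns { ball, weighings }.
function findHeavy(heavy) {
  let n = 0;
  const cmp = (l, r) => (n++, weigh(l, r, heavy));
  const first = cmp([0, 1, 2], [3, 4, 5]);
  if (first === 0) {                       // heavy ball is in {6, 7}
    return { ball: cmp([6], [7]) === -1 ? 6 : 7, weighings: n };
  }
  const g = first === -1 ? [0, 1, 2] : [3, 4, 5];
  const second = cmp([g[0]], [g[1]]);      // weigh two balls of the heavy triple
  const ball = second === -1 ? g[0] : second === 1 ? g[1] : g[2];
  return { ball, weighings: n };
}
```

Running `findHeavy` for every heavy-ball position finds the right ball in exactly two weighings each time.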

Output: Reka Flash 3 (Beautified)

Perfect: Reka Flash 3 nailed it. It not only solved the problem but also verified the answer across multiple scenarios, in 37.2 seconds (2452 tokens).

Let’s see how QwQ does.

Output: QwQ 32b (Beautified)

QwQ also solved the problem easily, though it didn’t verify its result across multiple scenarios. On the upside, it took only 12.8 seconds (2383 tokens).

Final Thoughts

  • For general-purpose and simple reasoning tasks, go for QwQ 32b, as it generates faster responses with less token usage. It is also good if you are low on budget.

  • However, if you need complex reasoning tasks solved, go for Reka Flash 3; although both are open source, it is noticeably better here. You can opt for QwQ on a tight budget, but results may not be optimal.

  • As for me, I would choose Grok 3 Think (still open), as its reasoning abilities are great!

Anyway, let’s move on and check the models on the next task.

Math

The third domain for testing is Math, which also involves reasoning. Modern LLMs are great at solving complex math problems. Let’s see if both models can compete. To keep it fair, I’ll test them on one simple and one complex problem.

Sequence Puzzle

It is a simple question for humans but a tough one for LLMs; many models have failed to get it right. Let’s see whether QwQ or Reka can crack it, reasoning steps and all!

Prompt: Sequence Puzzle

What is the next number in the sequence? 2, 6, 12, 20, 30,?

Output: Reka Flash 3

Nice! The model got the right answer in 16.4 seconds (984 tokens). Checking the CoT traces, I found it reasoned like a human rather than a mathematician, using simple math instead of complex formulas. This makes the explanation easy to understand!

Reka Flash 3 - Reasoning Steps
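For reference, the pattern is simple: the differences 4, 6, 8, 10 grow by 2, so the next term is 30 + 12 = 42; equivalently, the k-th term is k(k+1). A quick check:

```javascript
// The k-th term of 2, 6, 12, 20, 30, ... is k * (k + 1).
const term = (k) => k * (k + 1);   // 1*2=2, 2*3=6, 3*4=12, ...
const seq = [1, 2, 3, 4, 5, 6].map(term);
console.log(seq);                  // [2, 6, 12, 20, 30, 42]
```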

Output: QwQ 32b

QwQ also solved the problem but needed more time and tokens (39.8 s, 1785 tokens). This was because it reasoned like a mathematician, deriving a general formula as part of its verification process.

QwQ 32b → Reasoning Steps

Now, let’s test the models on a medium-complexity problem to probe their actual capabilities.

GCD of a Series Set

I have used this question before (as a hard one). The proprietary models all solved it; let’s see how the open-source ones do.

Prompt: Series GCD

Find the GCD of this series set: {n^99(n^60-1): n>1}

Output: Reka Flash 3

As expected, the model got it wrong. o1-style reasoning models get it right, so this fits the pattern: open-source models still have hallucination issues. Inspecting the CoT traces confirmed the cause, with traces full of "But wait" moments.

CoT Reasoning Steps

Output: QwQ 32b

Despite the problem being hard, QwQ got it right! Upon inspection, I noticed the model started to hallucinate but then corrected itself through a self-evaluation step—something missing in Reka Flash 3. 👇

QwQ 32b - Self-Evaluation Step
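Any claimed answer to this problem can at least be sanity-checked numerically: the gcd of the whole set divides the gcd of any finite sample, so computing the latter with BigInt bounds the answer from above. This sketch is mine, not either model's output:

```javascript
// Sanity check for { n^99 * (n^60 - 1) : n > 1 }: compute the gcd of the
// first few elements with BigInt. This only bounds the set's gcd from
// above (a finite sample's gcd is a multiple of it); it is not a proof.
const gcd = (a, b) => (b === 0n ? a : gcd(b, a % b));
const elem = (n) => n ** 99n * (n ** 60n - 1n);

let g = elem(2n);
for (let n = 3n; n <= 20n; n++) g = gcd(g, elem(n));
console.log(g.toString());  // gcd of the elements for n = 2..20
```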

Final Thoughts

  • For complex / higher-order math questions, go for QwQ 32b. It generated the right response for both questions. The only drawback is the cost (204 s, 9k+ tokens).

  • For simple math tasks, however, Reka Flash 3 seems the better model, as it uses fewer tokens and responds faster.

  • As for me, I would be happy with o1 / Grok 3 Think / Llama models.

Finally, let’s move on and check the models on the last task.

Creative Writing

The final evaluation task is creative writing. Most LLMs excel at it, so I expect Reka and QwQ to perform well, too. Let’s see how they handle different writing scenarios!

Sci-Fi Story Narration

I love sci-fi texts and narrations, so let’s generate one by giving both models the gist of a story and seeing how each unfolds the narration.

Prompt: Mars Colonization. Narrarate a story whose gist is: A person is sent on a mission to colonize a new planet (mars) but discovers that their ship has a dangerous malfunction due to certain happenings on earth. Keep it grounded and as close to reality as possible, but it must be sci-fi simultaneously.

Output: Reka Flash 3

Story Moral: The cost of determination in an era of fractured loyalty. Grounded in real tech (AI systems, radiation shielding) and human frailty, the story pits idealism against Earth’s self-destruction, asking: Can survival be worth the price of forgetting why we left?

(Note: "Termite" refers to a NASA-derived term for self-replicating code, a nod to real asteroid-defense jargon.)

Gist of Output

  • Though the model excelled at narration, it failed to stay grounded in the prompt.

  • The last note mentions NASA, but it's missing from the story.

  • Mars is mentioned only much later in the story.

  • In the epilogue, "The Burden", the narrator says, “Elena never learned who sabotaged them”, directly contradicting the connection required by the prompt.

  • A few spelling errors were also found.

  • Used 1485 tokens in 21.6 seconds.

Let's see how QwQ performs.

Output: QwQ 32b

Story Moral: A grounded sci-fi narrative exploring the tension between human ambition and institutional caution, emphasizing that space colonization’s greatest challenge may not be the void itself, but the choices made amid chaos on Earth.

Gist of Output

  • A great one; I was hooked.

  • For the most part, it is grounded in reality but sci-fi at the same time.

  • It had a clear ending and wove in real concerns that could drive a spacefaring push, such as climate disasters and political upheaval.

  • The story captures the context of Mars very early.

  • No spelling mistakes.

  • Used 2060 tokens in 33.0 seconds (21 seconds for reasoning).

Now, let’s test both models on another subtask

Role Play

For this test, I will create a story along the lines of Demon Slayer (a popular anime). Let’s see how elaborate, creative, and exceptional Reka Flash 3 and QwQ 32b can be.

Prompt: Demon Slayer Style

Roleplay as Kiyoharu and play the given scene

Scene

A quiet city suddenly faces an onslaught of the undead. Your character, a seemingly normal resident, now has to step up and lead the fight for survival. Do they succeed or fail? And what is the truth behind this strange occurrence?

Output: Reka Flash 3 (28.9 s, 1975 tokens)

The story delivers a nuanced character arc and emotional depth, aligning closely with the prompt’s intent of an “ordinary” character rising to lead.

Output: QwQ 32b (14.2 s, 1709 tokens)

Outcome: Success, but with sacrifice. The city is saved, but the origin of the outbreak remains a shadowy secret. Kiyoharu emerges as a leader, forever changed by the fragility of normalcy.

The story follows a more structured survival-action narrative with a stronger sense of closure.

I would rather read story 1, for the following reasons:

| Component | Story 1 (Reka Flash 3) | Story 2 (QwQ 32b) |
| --- | --- | --- |
| Character Consistency | The virus's origin is external, with the reveal remaining a cautionary tale. | Kiyoharu is a capable part-time librarian with mechanical skills, leading from the start. |
| Engagement & Immersion | Dynamic dialogue and cinematic framing enhance reader connection. | Kiyoharu evolves from a hesitant survivor to a strategic leader. |
| Leadership Development | A personal connection to the virus mystery through his father’s involvement. | Kiyoharu is portrayed as a leader early, with less development. |
| Truth Revelation | Kiyoharu is a nervous photography student who gradually steps up. | Kiyoharu is a capable part-time librarian with mechanical skills who has led from the start. |
| Conflict & Stakes | Focuses on personal dilemmas, guilt, and survival, with a haunting resolution. | Action-driven with clear mission objectives and a more resolved conclusion. |
| Ending Impact | Reflective, melancholic tone, showcasing quiet resilience. | Heroic, with a sense of closure, portraying Kiyoharu as a hardened survivor. |

Final Thoughts

  • Both models generated good stories, but Reka Flash 3 stands out for in-depth storytelling.

  • If explicit prompting is not your forte and you want a great role-play story, choose Reka Flash 3 over QwQ 32b.

  • Choose QwQ 32b for a more grounded story in terms of narration.

  • Overall, both models are capable of simple writing tasks.

Now, let’s move to the conclusion.

Conclusion

The table below summarizes model usage according to the task at hand:

| Task | Reka Flash 3 | QwQ 32b |
| --- | --- | --- |
| Coding | Complex coding | Easy coding |
| Reasoning | Complex reasoning | Simple reasoning |
| Math | Simple problems | Hard problems |
| Writing | In-depth storytelling | More grounded narration |

Both models handle multi-domain tasks well. Throughout my experiments, I was consistently amazed by the performance of these open-source models; they are often on par with proprietary ones, showing impressive capabilities.

So, each model has its own pros and cons depending on the domain and task. Choose wisely, as these models effectively bridge the gap between commercial and open-source.
