Claude Opus 4.5 vs. Gemini 3 Pro vs. GPT-5.1 Codex-Max: The SOTA coding model

Nov 28, 2025

Three flagship AI coding models launched within weeks of each other: Claude Opus 4.5 on November 24, Gemini 3 Pro on November 18, and GPT-5.1 Codex-Max on November 19. All three claim to be the best model for complex coding tasks and agentic workflows.

The benchmarks show they're neck-and-neck. I wanted to see what that means for actual development work. So I gave all three the same prompts for two complex problems in my observability platform, statistical anomaly detection and distributed alert deduplication, with the same codebase, the same requirements, and the same IDE setup.

I compared all these models on projects I was working on in my spare time. In the first test I also used the Tool Router (currently in beta), which doubles as a way to dogfood our own product. Check it out if you want to give your agents access to tools without worrying about context pollution. Read more on the Tool Router here.

The Official Benchmarks


Pricing Comparison (Per 1M Tokens)

| Model | Input Cost | Output Cost |
| --- | --- | --- |
| Claude Opus 4.5 | $5.00 | $25.00 |
| GPT-5.1 Codex | $1.25 | $10.00 |
| Gemini 3 Pro | $2.00 (<200K context) / $4.00 (≥200K) | $12.00 (<200K) / $18.00 (≥200K) |

Key Benchmark Scores:

  • SWE-bench Verified: Opus 4.5 leads at 80.9%, followed by GPT-5.1 Codex-Max at 77.9% and Gemini 3 Pro at 76.2%

  • Terminal-Bench 2.0: Gemini 3 Pro tops at 54.2%, demonstrating exceptional tool use capabilities

  • MMMU-Pro (Visual Reasoning): Gemini 3 Pro leads with superior multimodal understanding

  • WebDev Arena: Gemini 3 Pro reaches a 1487 Elo score for "vibe coding" capabilities

TL;DR:

  • Opus 4.5 (Claude): Outstanding at strategy and design, but its solutions tend to be elaborate, slower to integrate, and prone to practical hiccups once they hit the metal.

  • GPT-5.1 Codex: The most dependable for real-world development, integrates cleanly, handles edge cases, and produces code that holds up under load.

  • Gemini 3 Pro: Lean, fast, and cost-efficient, strong for prototyping and greenfield builds, though its outputs need some hardening for production-grade resilience.

How did I test this?

To really see what these models are made of, I threw all three at the exact same prompts and asked them to solve two problems that have bitten our production pipeline more than once: building a solid anomaly detection path, and hardening alert deduplication across multiple processors. These aren’t academic exercises; they’re the kind of tasks where clock drift, concurrency quirks, and partial crashes turn into 3 a.m. pages.

I ran everything in the same Cursor environment, so no model got special treatment. From there, I wasn’t just watching token counts tick upward; I paid attention to how well each model understood the shape of the system, whether its code actually plugged into the real project, and whether the output felt like something I could trust in production. It all came down to one key question:

Is this something I would confidently deploy in a real system?

Tooling notes:

  • Claude Code still delivers the most refined user experience overall, offering structured reasoning traces, step-by-step visibility, and helpful inline feedback during development.

  • GPT Codex CLI (v0.59) has levelled up significantly, now supporting streamed reasoning, stable session recovery, clearer accounting of cached tokens, and automatic context compaction, making long-running agent loops far more reliable.

  • Gemini 3 Pro (via the Node SDK) delivered the fastest completions and lowest cost per task, though the model tends to reveal less of its internal reasoning compared to GPT-5.1 or Claude.

Test 1: Statistical Anomaly Detection

The challenge: Build a system that learns baseline error rates, uses z-scores and moving averages, catches rate-of-change spikes, and handles 100k+ logs/minute with under 10ms latency.


Opus 4.5 Attempt

Time: 12m 11s | Cost: $1.28 | Diff: +2,981 lines across 9 files

https://github.com/VarshithKrishna14/Kompi/commit/aefae4e2a2e6e5befd8365d3f1c730ec564603be

Claude delivered a huge implementation: a full statistical anomaly detector with rolling snapshots, Welford-based state tracking, spike detection, serialisation logic, configuration structures, and extensive inline comments. On first read, it felt like production-ready engineering.

Then it hit the real system.

The first run immediately exposed a critical failure path. When the historical average approached zero, calculateSpikeRatio() produced astronomical values (e.g., 1e12 or worse), which were sent directly into .toFixed(2) without passing through the existing sanitisation layer. The result: hard runtime crashes from perfectly valid data.
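For context, the kind of guard that was missing looks roughly like this: an epsilon floor on the denominator and a cap on the ratio before anything downstream formats the value. This is a minimal sketch; the function and constant names are illustrative, not identifiers from the commit.

```typescript
// Illustrative guard: floor the denominator and cap the ratio so a near-zero
// baseline can't produce astronomical values that break downstream formatting.
const EPSILON = 1e-6;
const MAX_SPIKE_RATIO = 1_000;

function safeSpikeRatio(current: number, baseline: number): number {
  if (!Number.isFinite(current) || !Number.isFinite(baseline)) return 0;
  const ratio = current / Math.max(baseline, EPSILON);
  return Math.min(Math.max(ratio, 0), MAX_SPIKE_RATIO);
}
```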

State restoration made things even worse. The deserialize() function reloaded snapshots but didn’t recompute the means, variance terms, or sample counts. After a restart, the spike detector and z-score logic were working off a statistically incompatible internal state. No crash, no error, just silent corruption.

More fundamentally, the core design choice of using Welford's algorithm for a rolling window undermined the entire system: a Welford accumulator aggregates over every sample it has ever seen and never forgets, so the "rolling" baseline drifts instead of tracking recent behaviour. The output looked sophisticated, but the logic couldn't behave correctly under real production workload patterns.

GPT-5.1 Attempt

Time: 6m | Cost: ~$0.24 | Diff: ~+577 lines across 3 files

https://github.com/VarshithKrishna14/Kompi/commit/2afad7e6e7940b6204a0ff0d1c51ea1b912a362e

GPT-5.1 implemented a streaming statistical anomaly detector optimised for extremely high-throughput logging workloads (100k+ events/sec). Instead of heavy time-bucket structures or map-based rolling memory, this design uses a single-pass O(1) update loop with:

  • EWMA for online mean

  • EWMA of squared error-rate for stable variance

  • Rolling time window for short-term spikes

  • Hard defences against NaN, Infinity, invalid counts, and timestamp issues

Everything executes synchronously with constant-time updates, making it viable for a log ingestion pipeline running on a hot data path.
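To make that concrete, here is a minimal sketch of the update loop: an EWMA for the mean, an EWMA of squared deviation for the variance, and a z-score derived from both. The class name, smoothing factor, and thresholds are illustrative assumptions, not the code from the commit.

```typescript
// Minimal EWMA-based detector: every update is O(1) in time and memory.
class EwmaAnomalyDetector {
  private mean = 0;
  private variance = 0;
  private samples = 0;

  constructor(
    private readonly alpha = 0.1,       // smoothing factor for the moving baseline
    private readonly zThreshold = 3.0,  // flag anomalies beyond this many std devs
    private readonly warmup = 30        // minimum samples before alerting
  ) {}

  update(errorRate: number): { isAnomaly: boolean; zScore: number } {
    if (!Number.isFinite(errorRate) || errorRate < 0) {
      return { isAnomaly: false, zScore: 0 }; // reject NaN, Infinity, negative rates
    }
    this.samples++;
    const deviation = errorRate - this.mean;
    this.mean += this.alpha * deviation;
    // EWMA of squared deviation approximates a rolling variance in constant space.
    this.variance = (1 - this.alpha) * (this.variance + this.alpha * deviation * deviation);

    const stdDev = Math.sqrt(this.variance);
    const zScore = stdDev > 1e-6 ? (errorRate - this.mean) / stdDev : 0;
    const isAnomaly = this.samples >= this.warmup && Math.abs(zScore) >= this.zThreshold;
    return { isAnomaly, zScore };
  }
}
```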

Gemini 3 Pro Attempt

Time: ~5m 44s | Estimated Cost: ~$0.14 | Diff: +366 lines across 4 files

https://github.com/VarshithKrishna14/Kompi/commit/1d711ecdec0786bc0265afe3b73e36bab0b6f553

Gemini 3 Pro tackled the anomaly-detection problem with a stream-optimized, low-latency architecture. The implementation uses a constant-state EWMA (Exponentially Weighted Moving Average) model rather than sliding windows, ensuring O(1) memory usage regardless of throughput. High-volume logs are aggregated in memory and flushed to the detector at fixed intervals (e.g., 1s), so the system easily handles 100,000+ logs/minute with negligible overhead.
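A rough illustration of that aggregate-then-flush pattern is below. The interval, field names, and detector interface are assumptions for the sketch, not Gemini's exact implementation.

```typescript
// Count logs in memory and hand the detector one error-rate sample per interval,
// so the hot ingestion path does no per-log statistical work.
class ErrorRateAggregator {
  private total = 0;
  private errors = 0;

  constructor(
    private readonly detector: { update(errorRate: number): unknown },
    flushIntervalMs = 1_000
  ) {
    setInterval(() => this.flush(), flushIntervalMs);
  }

  record(isError: boolean): void {
    this.total++;
    if (isError) this.errors++;
  }

  private flush(): void {
    if (this.total === 0) return; // nothing arrived this interval
    this.detector.update(this.errors / this.total);
    this.total = 0;
    this.errors = 0;
  }
}
```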

Edge cases are rigorously handled:

  • Zero-Variance: Explicitly guarded to prevent infinite Z-scores during flatline periods.

  • Math Safety: Division operations use epsilon guarding (0.000001) to prevent zero-division errors.

  • Invalid Inputs: NaN and Infinity are gracefully filtered out without crashing the pipeline.

The test suite is comprehensive and deterministic:

  • Verifies regime adaptation (baseline shifting) using fast-adaptation alpha values.

  • Tests rate-of-change spikes (e.g., 6x jumps) independently of Z-scores.

  • Validates initialization convergence and mock database integration points.

Round 1 – Quick comparison

| Category | Opus 4.5 | GPT-5.1 Codex | Gemini 3 Pro |
| --- | --- | --- | --- |
| Integrated into the pipeline? | Yes, but incomplete wiring paths | Yes | Yes |
| State handling | Welford accumulator → never forgets | EWMA + bounded buffers | Pure EWMA, constant state |
| Restart behavior | Hydrates but recomputes nothing → silent corruption | Hydrates & recomputes cleanly | Hydrates & recomputes stdDev correctly |
| Edge-case safety | Crashes on big spike ratios | Fully guarded | Fully guarded |
| Bad input tolerance | Mixed | Safe | Safe |
| Tests | Nondeterministic (randomness) | Deterministic | Deterministic |
| Memory growth | Unbounded / drift | Stable | Fairly constant |
| Would it survive production? | No | Yes | Depends |

Tool Router Integration

Instead of preloading giant toolkits into MCP and bloating every session with dozens of unused capabilities, the Tool Router acts as an on-demand integration layer. This also helps us dogfood our own product.

If you’re curious about the underlying system, the full technical breakdown is documented here: https://composio.dev/blog/introducing-tool-router-(beta)

Before running the remaining tests and alerting workflows, I integrated through the Tool Router. One OAuth handshake per user gives the MCP client access to Slack, Jira, PagerDuty, or any other connected tool, with no manual wiring required. (For context, we first tried this setup in the Gemini version.)

The benefits we see in practice:

  • Seamless per-user integration: a single router manages many apps, and each session only exposes the tools the user actually connected.

  • Instant, code-free updates: newly connected services show up automatically, so agents can start using them immediately without redeployments or extra glue code.

  • Automated workflow routing: alerts and tasks from our anomaly detection pipeline can be sent through Slack, Jira, PagerDuty, or any connected tool effortlessly.

In our processor pipeline, we integrated Tool Router to handle alerting dynamically across Slack, Jira, and PagerDuty. Instead of manually wiring each service, we use a unified ComposioClient approach that initialises the MCP client with a single configuration:

```typescript
// One client, one configuration: the router only exposes the toolkits this
// user has actually connected (Slack, Jira, and PagerDuty here).
const composioClient = new ComposioClient({
  apiKey: process.env.COMPOSIO_API_KEY!,
  userId: 'tracer-system',
  toolkits: ['slack', 'jira', 'pagerduty'],
});

const mcpClient = await composioClient.createMCPClient();
```

Once initialised, any alerting workflow can call agents directly. For example, our anomaly detection triggers the log-anomaly-alert-agent, which automatically decides which tools to notify.
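For illustration, the glue from detector output to that agent might look roughly like the sketch below. It assumes the client returned by createMCPClient exposes a standard MCP-style callTool method; the function, severity rule, and argument shape here are hypothetical, so treat it as a shape rather than the Composio API.

```typescript
// Hypothetical glue code: forward an anomaly to the alerting agent.
// The callTool signature is an assumption about the MCP client surface.
async function dispatchAnomalyAlert(
  mcpClient: {
    callTool(req: { name: string; arguments: Record<string, unknown> }): Promise<unknown>;
  },
  anomaly: { service: string; zScore: number; errorRate: number }
): Promise<void> {
  await mcpClient.callTool({
    name: 'log-anomaly-alert-agent',
    arguments: {
      service: anomaly.service,
      severity: Math.abs(anomaly.zScore) > 5 ? 'critical' : 'warning',
      errorRate: anomaly.errorRate,
    },
  });
}
```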

Test 2: Distributed Alert Deduplication

The challenge: Implement distributed alert deduplication so multiple processors don’t fire duplicate alerts within a 5-second window, tolerating up to 3s of clock skew and processor crashes.

Opus 4.5 Take

Time: ~7m 1s (estimated) | Cost: ~$0.48 | Diff: +715 lines across 4 files

https://github.com/VarshithKrishna14/Kompi/commit/0bb9ad721afe00f74f01f67409551a9ff0b256a0

Opus 4.5 uses a three-layer deduplication architecture:

  • L1 cache for fast in-memory rejection of recent alerts

  • L2 advisory-locks + explicit DB query to coordinate across processors

  • L3 unique constraints on the alert table to enforce deduplication at the database level

Clock skew is addressed by relying on the database’s NOW() timestamp rather than local processor clocks. PostgreSQL advisory locks ensure that if a processor crashes while holding a lock, the lock is released automatically when the connection closes.
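For reference, that advisory-lock pattern looks roughly like this with node-postgres: hash the alert identity into a lock key, try the lock, then check recency against database time. This is a simplified sketch; the function, table, and column names are assumptions, not the commit's actual code.

```typescript
import { Pool } from 'pg';

// Illustrative advisory-lock dedup: only the processor holding the lock checks
// for a recent duplicate; a crashed processor's lock dies with its connection.
async function tryEmitAlert(pool: Pool, service: string, alertType: string): Promise<boolean> {
  const client = await pool.connect();
  try {
    // Derive a bigint lock key from the alert identity (hashtext is built into Postgres).
    const { rows: [lock] } = await client.query(
      'SELECT pg_try_advisory_lock(hashtext($1)::bigint) AS acquired',
      [`${service}:${alertType}`]
    );
    if (!lock.acquired) return false; // another processor is already handling it

    // Compare against database time (NOW()), so processor clock skew is irrelevant.
    const { rows: [recent] } = await client.query(
      `SELECT 1 FROM alerts
        WHERE service = $1 AND alert_type = $2
          AND created_at > NOW() - INTERVAL '5 seconds'
        LIMIT 1`,
      [service, alertType]
    );
    if (recent !== undefined) return false; // a recent duplicate already exists
    // In the real module, the alert row would be inserted here while the lock is held.
    return true;
  } finally {
    await client.query('SELECT pg_advisory_unlock_all()');
    client.release();
  }
}
```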

The test suite is well-sized (~493 lines) and covers cache hit/miss behaviour, concurrent lock acquisition, clock skew edge cases, and simulated processor crashes.

Where it falls short:

  • The L1 cache uses Math.abs(ageMs) to evaluate recency, but this fails to account for processor clock skew. While the L2 lock layer catches most cases, this L1 failure makes the fast path unreliable under skewed clocks.

  • The advisory lock key is derived only from service:alertType (no timestamp or recent history), which can cause excessive serialisation, with distinct alerts colliding unnecessarily.

  • The unique constraint (L3) blocks all duplicate active alerts rather than only duplicates within the 5-second window; this may suppress legitimate, slightly delayed alerts beyond that timeframe.

Overall: **a sophisticated architecture with a solid multi-tier deduplication strategy**, but still a prototype rather than a fully hardened production module.


GPT-5.1’s Take

Time: ~4m | Cost: ~$0.27 | Diff: +188 net lines across 2 files

https://github.com/VarshithKrishna14/Kompi/commit/406d262d9795d0a5e74582bb28a9215d609e4a4f

GPT-5.1 implemented distributed alert deduplication with a clean, production-oriented architecture built around a simple rule:

Only one processor should emit an alert for the same anomaly within a 5-second window, even with multiple nodes, clock skew, or crashes.

Architecture

GPT-5.1 introduced a new alert-dedup.ts module containing the pieces below (a simplified sketch follows the list):

  • AlertDeduplicator interface: shouldEmit(fingerprint, dedupWindowMs) → returns whether this processor is allowed to emit the alert.

  • DedupKeyValueStore abstraction: defines a single shared atomic operation, setIfNotExistsWithTTL(key, ttlMs). The backend (Redis, DynamoDB, etc.) ensures only one processor can create a key during the TTL window.

  • KeyValueStoreAlertDeduplicator: Production implementation that performs distributed coordination using the above atomic write.

  • InMemoryDedupKeyValueStore: Lightweight mock version used for local runs and unit tests.
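Here is a simplified sketch of that contract and its in-memory test double. It mirrors the names from the commit but is condensed for illustration.

```typescript
// The dedup decision rides on one atomic operation: "create this key with a TTL
// only if it doesn't already exist". Whichever processor wins the write emits.
interface DedupKeyValueStore {
  setIfNotExistsWithTTL(key: string, ttlMs: number): Promise<boolean>;
}

// In-memory stand-in for Redis/DynamoDB, suitable for local runs and unit tests.
class InMemoryDedupKeyValueStore implements DedupKeyValueStore {
  private readonly expiries = new Map<string, number>();

  async setIfNotExistsWithTTL(key: string, ttlMs: number): Promise<boolean> {
    const now = Date.now();
    const expiry = this.expiries.get(key);
    if (expiry !== undefined && expiry > now) return false; // still reserved
    this.expiries.set(key, now + ttlMs);
    return true;
  }
}

class KeyValueStoreAlertDeduplicator {
  constructor(private readonly store: DedupKeyValueStore) {}

  // True means this processor won the reservation and should emit the alert.
  shouldEmit(fingerprint: string, dedupWindowMs = 5_000): Promise<boolean> {
    return this.store.setIfNotExistsWithTTL(`alert:${fingerprint}`, dedupWindowMs);
  }
}
```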

Gemini 3 Pro’s Take

Time: 4m 02s | Cost: ~$0.11 | Diff: +103 lines across 2 files

https://github.com/VarshithKrishna14/Kompi/commit/4090823d34006500fed0e052c99af530076abef6

Gemini 3 Pro implemented distributed alert deduplication directly within the main processing path, making it the only version so far to be fully wired into the LogProcessor without additional scaffolding.

Architecture

Gemini introduced a centralised deduplication strategy based on PostgreSQL (a rough sketch of the core statement follows the list):

  • AlertDeduplicator interface (deduplicator.ts)

  • PostgresDeduplicator

  • InMemoryDeduplicator
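The heart of the Postgres variant is a single atomic INSERT … ON CONFLICT statement, roughly like the sketch below. The table name, columns, and helper function are assumptions for illustration, not the commit's code.

```typescript
import { Pool } from 'pg';

// Illustrative ON CONFLICT dedup: inserting the fingerprint and deciding whether
// it's a duplicate within the window happen in one atomic statement, on DB time.
// Assumed schema: CREATE TABLE alert_dedup (fingerprint TEXT PRIMARY KEY,
//                                           last_emitted_at TIMESTAMPTZ NOT NULL);
async function shouldEmit(pool: Pool, fingerprint: string, windowMs = 5_000): Promise<boolean> {
  const { rowCount } = await pool.query(
    `INSERT INTO alert_dedup (fingerprint, last_emitted_at)
          VALUES ($1, NOW())
     ON CONFLICT (fingerprint) DO UPDATE
           SET last_emitted_at = NOW()
         WHERE alert_dedup.last_emitted_at < NOW() - ($2::int * interval '1 millisecond')`,
    [fingerprint, windowMs]
  );
  // rowCount is 1 when we inserted a new row or refreshed an expired one,
  // and 0 when a recent duplicate exists (the conditional update did nothing).
  return rowCount === 1;
}
```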

Round 2 – Quick comparison

| Metric | Opus 4.5 | GPT-5.1 | Gemini 3 Pro |
| --- | --- | --- | --- |
| Integrated? | No (standalone subsystem, not plugged into pipeline) | Yes | Yes |
| Approach | 3-layer dedup (L1 cache + advisory locks + DB constraint) | KV-store atomic TTL (“reserve if free”) | PostgreSQL INSERT … ON CONFLICT atomic update |
| Critical bugs? | L1 cache not clock-skew safe; unnecessary lock serialization | None found | None, but tied strictly to Postgres |
| Cost | ~$0.48 | ~$0.27 | ~$0.11 |
| Time | ~7m | ~4m | ~4m |

The Cost

Total spend across both tests:

  • Opus 4.5: $1.76

  • GPT-5.1 Codex: $0.51 (~71% cheaper than Opus)

  • Gemini 3 Pro: $0.25 (~86% cheaper than Opus)

Opus was consistently the most expensive. The reason is straightforward: it generated far more code, ran longer chains of reasoning, and produced large comment blocks and support scaffolding that never landed in production. GPT-5.1 was far leaner: it solved the same problems in less than half the tokens and with far less backtracking. Gemini came in even cheaper, largely because it produced compact implementations with fewer files changed.


What I Actually Learned

Across both challenges, three very different personalities showed up:

Opus 4.5

Opus generated the most ambitious, elaborate engineering every time. Rolling statistics, advisory lock stacks, structured configuration, test coverage, comments, serialisation logic: you could mistake the output for a whitepaper implementation.

But once plugged into a running system, hidden failure modes surfaced immediately:

  • Stateful calculations that didn’t restore correctly

  • Edge paths that threw runtime errors

  • Logic that was architecturally elegant but not operationally safe

It thinks at design-document scale, but it requires another engineering pass to harden for production.

GPT-5.1 Codex

GPT-5.1 was the opposite: smaller changes, far more focused, and always wired directly into the running codebase.

This model:

  • Solved the problem in the fewest lines of code

  • Handled error cases proactively

  • Took real deployments into account (crashes, skew, dirty input)

  • Integrated immediately into the live processor code

Not the most beautiful architecture, but consistently the most deployable and far cheaper to run.

Gemini 3 Pro

Gemini landed in the middle: creative, compact, and technically solid. It implemented solutions that were:

  • Fast

  • Easy to reason about

  • Minimal in moving parts

  • Cheap in compute

It didn't produce architecture essays, but its solutions were easier to drop straight into a project, especially the dedup logic, which required almost no scaffolding. The downside is that some of its deeper edge cases had to be checked manually; its tests weren’t as exhaustive as the other two.

Why GPT-5.1 Stands Out

For production engineering, GPT-5.1 hit the sweet spot:

  • Minimal rewrite required

  • Zero critical bugs in either test

  • Fast iteration

  • One-pass integration

It wasn’t the cheapest (Gemini was), but GPT-5.1 produced the most production-ready code per dollar.

If you want a model that:

  • Writes code that compiles,

  • Runs on the first try,

  • Handles operational edge cases,

  • And fits into an existing codebase…

GPT-5.1 is the practical winner.

When to Use Opus 4.5

Opus is the model to call when you need deep architectural reasoning:

  • System design reviews

  • Technical write-ups

  • Planning modules or frameworks

  • Long-term maintainability discussions

It produces more infrastructure than is needed, but that’s because it’s thinking like a platform architect rather than a service engineer. If you have the time to refine and integrate, Opus provides the broadest view of the problem space.

Just expect to:

  • Wire things together manually

  • Fix runtime behaviour

  • Trim unnecessary complexity

When to Use Gemini 3 Pro

Gemini is the fastest and cheapest path to working code. It’s ideal when:

  • You want to move quickly

  • You’re okay with filling in the deep testing yourself

  • Simplicity matters more than formal architecture

Its solutions tend to be:

  • Straightforward

  • Operationally efficient

  • Very easy to deploy

Just be ready to audit some boundary conditions manually; it doesn’t always anticipate the same number of failure modes as GPT-5.1.

Bottom Line

Across these practical engineering scenarios, GPT-5.1 Codex stood out by producing solutions that were closest to “ready to deploy” with the least intervention. Claude Opus 4.5 consistently demonstrated the strongest architectural reasoning and long-horizon thinking, but its outputs usually required additional effort to integrate and stabilise. Gemini 3 Pro delivered fast, lightweight, and inexpensive solutions that worked well early but benefited from hardening when pushed into more demanding or distributed environments.

In other words:

  • Codex most often produced code that could be dropped straight into the system,

  • Claude provided the deepest engineering thinking, and

  • Gemini offered the quickest path to functional scaffolding at low cost.

Complete code is available on GitHub if you want to examine the implementations. Fair warning: it's an evaluation harness I built for this test, not production code.

These results reflect only what I observed in these particular test cases, but they highlight the practical trade-offs engineers may see when using the models in day-to-day development.
