Claude Sonnet 4.6 Hits 79.6% on SWE-bench, Within 1.2% of Opus 4.6
New Sonnet model closes gap with flagship on coding benchmarks, achieving industry-leading performance at mid-tier pricing.
Sonnet Reaches Flagship Territory
Claude Sonnet 4.6's 79.6% score on SWE-bench Verified puts it within striking distance of Opus 4.6's 80.8%—a gap of just 1.2 percentage points.
Historical Context
The rapid improvement in Sonnet-class models:
| Model | SWE-bench Verified | Date |
| --- | --- | --- |
| Sonnet 3.5 | 49.0% | Jun 2024 |
| Sonnet 4 | 72.7% | Mar 2025 |
| Sonnet 4.5 | 77.2% | Sep 2025 |
| Sonnet 4.6 | 79.6% | Feb 2026 |
In the 20 months from June 2024 to February 2026, Sonnet's SWE-bench Verified score climbed by more than 30 percentage points (49.0% → 79.6%).
Benchmark Details
SWE-bench Verified tests AI models on real GitHub issues:
- 500 curated problems from Python repositories
- Models must generate correct patches that pass the repositories' tests
- No training on the test data
- 79.6% is Sonnet 4.6's standard pass rate
- Scores are higher with extended thinking / Adaptive Thinking (high effort)
Competitive Landscape
| Model | SWE-bench Verified | Price (Input/Output) |
| --- | --- | --- |
| Opus 4.6 | 80.8% | $15/$75 |
| Sonnet 4.6 | 79.6% | $3/$15 |
| GPT-5.2 | ~76% | $1.75/$14 |
| Codex 5.3 | 56.8%* | $10/$30 |
*Codex is scored on a different benchmark variant (SWE-Bench Pro), so its number is not directly comparable.
What the Gap Means
For most development tasks, the difference between 79.6% and 80.8% is statistically indistinguishable:
- Both models solve roughly 4 of every 5 real-world bugs
- Run-to-run variance exceeds the 1.2-point gap
- The cost difference (5x) dwarfs the capability difference (1.2 points)
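The "statistically indistinguishable" claim can be sanity-checked with a simple binomial standard error, treating each of the benchmark's 500 problems as an independent pass/fail trial (a simplifying assumption; real runs also vary with sampling settings):

```python
import math

def pass_rate_se(p: float, n: int) -> float:
    """Standard error of a pass rate p measured over n independent problems."""
    return math.sqrt(p * (1 - p) / n)

n = 500                     # SWE-bench Verified problem count
sonnet, opus = 0.796, 0.808

se = pass_rate_se(sonnet, n)      # single-run standard error
gap = (opus - sonnet) * 100       # observed gap in percentage points

print(f"one-run standard error: {se * 100:.1f} points")
print(f"observed gap:           {gap:.1f} points")
```

With p ≈ 0.8 and n = 500, the standard error of a single run is about 1.8 percentage points, larger than the 1.2-point gap between the two models.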
Developer Perspectives
"I've been A/B testing Sonnet vs Opus for a week. Can't tell the difference on my codebase. But I sure can tell the difference in my bill." — Senior Engineer, YC startup
"For 99% of tickets, Sonnet 4.6 is Opus. That last 1% is when I escalate." — Tech Lead, Series B company
When Opus 4.6 Still Wins
Despite near-parity, Opus 4.6 pulls ahead on:
- Novel algorithm design
- Multi-step refactoring with many dependencies
- PhD-level scientific code
- Maximum accuracy requirements (regulatory, financial)
The Value Proposition
At current pricing:
- Running 100 SWE-bench problems costs roughly $7 with Sonnet 4.6
- The same problems cost roughly $35 with Opus 4.6
- That is 5x the cost for a 1.2-point (about 1.5% relative) improvement
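The same figures can be restated as cost per *solved* problem, folding in each model's pass rate. The ~$7 and ~$35 totals below are the article's estimates, not measured values:

```python
def cost_per_solve(total_cost: float, n_problems: int, pass_rate: float) -> float:
    """Dollars spent per successfully solved problem."""
    return total_cost / (n_problems * pass_rate)

# Article's estimates: ~$7 per 100 problems (Sonnet 4.6), ~$35 (Opus 4.6).
sonnet = cost_per_solve(7.0, 100, 0.796)   # ≈ $0.088 per solved problem
opus = cost_per_solve(35.0, 100, 0.808)    # ≈ $0.433 per solved problem

print(f"Sonnet 4.6: ${sonnet:.3f} per solve")
print(f"Opus 4.6:   ${opus:.3f} per solve")
print(f"ratio:      {opus / sonnet:.1f}x")
```

Per solved problem, Opus works out to roughly 4.9x the cost of Sonnet, since its slightly higher pass rate barely offsets the price difference.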
Conclusion
Sonnet 4.6 has effectively commoditized flagship-level coding performance. For most teams, the rational choice is Sonnet by default, Opus by exception.