Users Prefer Claude Sonnet 4.6 Over Opus 4.5 in Head-to-Head Tests

Mid-Tier Model Outperforms Previous Flagship

In what Anthropic calls a "generational leap," users preferred Claude Sonnet 4.6 over the previous flagship, Opus 4.5, in head-to-head preference tests.

Testing Results

Sonnet 4.6 vs Sonnet 4.5: 70% preferred Sonnet 4.6

Sonnet 4.6 vs Opus 4.5: 59% preferred Sonnet 4.6

Why Users Prefer Sonnet 4.6

Qualitative feedback highlighted three factors:

1. Better Instruction Following

"Sonnet 4.6 actually does what I ask. Opus would often 'improve' my request in ways I didn't want."

2. Fewer Hallucinations

"Less confident in wrong answers. When Sonnet 4.6 doesn't know something, it says so rather than making things up."

3. Reduced Over-Engineering

"Asked for a simple function, got a simple function. Not a framework with dependency injection and abstract interfaces."

Benchmark Context

This preference data aligns with benchmarks:

Metric       Sonnet 4.6    Opus 4.5
SWE-bench    79.6%         77.2%
OSWorld      72.5%         61.4%
GDPval-AA    1633 Elo      ~1550 Elo
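To put the GDPval-AA gap in perspective, the sketch below converts an Elo difference into an expected head-to-head win rate. It assumes GDPval-AA scores follow the conventional logistic Elo model with a 400-point scale, which the source does not confirm, and it treats the approximate Opus 4.5 figure (~1550) as exact. The function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# GDPval-AA ratings quoted in this article; ~1550 is treated as exactly 1550
# (an assumption made only so the arithmetic can be shown).
sonnet_46 = 1633
opus_45 = 1550

expected = elo_expected_score(sonnet_46, opus_45)
print(f"Expected Sonnet 4.6 win rate: {expected:.1%}")
```

Under these assumptions, the 83-point gap corresponds to an expected win rate of roughly 62%, which is in the same range as the 59% preference figure reported for Sonnet 4.6 over Opus 4.5.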

Pricing Implications

The preference data makes Sonnet 4.6 even more compelling:

Opus 4.5: $15/$75 per million input/output tokens

Sonnet 4.6: $3/$15 per million input/output tokens

Users get better perceived quality at 20% of the cost.
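The arithmetic behind the 20% figure can be sketched as follows. The prices are the ones quoted above, read as input/output dollars per million tokens. The monthly token volumes are made-up illustrative numbers, not data from the tests, and the `Pricing` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for the given token counts."""
        return (input_tokens * self.input_per_mtok
                + output_tokens * self.output_per_mtok) / 1_000_000


# Prices as quoted in this article.
OPUS_45 = Pricing(15.0, 75.0)
SONNET_46 = Pricing(3.0, 15.0)

# Hypothetical monthly workload (assumption, for illustration only).
input_tokens = 200_000_000
output_tokens = 40_000_000

opus_cost = OPUS_45.cost(input_tokens, output_tokens)
sonnet_cost = SONNET_46.cost(input_tokens, output_tokens)

print(f"Opus 4.5:   ${opus_cost:,.2f}")
print(f"Sonnet 4.6: ${sonnet_cost:,.2f}")
print(f"Ratio:      {sonnet_cost / opus_cost:.0%}")
```

Because the quoted input and output prices both differ by the same factor of five, the ratio stays at 20% regardless of the input/output mix of the workload.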

Enterprise Reaction

"We were planning an Opus 4.5 deployment for Q2. These results have us reconsidering. Why pay 5x for something users like less?" — CTO, Enterprise SaaS company

Opus 4.6 Still Has a Place

Anthropic notes that Opus 4.6 (the new flagship) still excels at:

PhD-level scientific reasoning (91.3% GPQA vs 74.1%)

Multi-agent coordination

Extreme long-context retrieval (76% vs 18% on MRCR)

But for most applications, Sonnet 4.6 appears to be the optimal choice.

What This Means

The AI industry is seeing compression between tiers: mid-tier models are reaching flagship-level performance while keeping mid-tier pricing. Anthropic's strategy of rapid iteration appears to be paying off.