Research | June 10, 2026

Claude Fable 5 Benchmarks: 80.3% on SWE-Bench Pro, 11 Points Ahead of the Field

Claude Fable 5 posts 80.3% on SWE-Bench Pro versus 69.2% for Opus 4.8, 58.6% for GPT-5.5, and 54.2% for Gemini 3.1 Pro, plus the top score on Cognition FrontierCode.

Claude Fable 5 did not just edge out the competition - it lapped it. Anthropic's new model scored 80.3% on SWE-Bench Pro, roughly 11 points ahead of the next-best frontier model, in benchmark results published alongside the June 9 launch.

The Numbers

The SWE-Bench Pro comparison, which measures real-world software engineering capability, breaks down as follows:

Model             SWE-Bench Pro
Claude Fable 5    80.3%
Claude Opus 4.8   69.2%
GPT-5.5           58.6%
Gemini 3.1 Pro    54.2%

The gap between Fable 5 and Anthropic's own Opus 4.8 (11.1 points) is larger than the 10.6-point gap between Opus 4.8 and GPT-5.5. Fable 5 also posted the highest score among frontier models on Cognition's FrontierCode eval, a separate independent measure of frontier coding ability.
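
The point gaps can be checked directly from the published scores. The following is a minimal sketch, not part of Anthropic's release: the `scores` dictionary simply copies the four SWE-Bench Pro figures quoted above, and the helper name `gap` is a hypothetical choice made for illustration.

```python
# Scores copied from the SWE-Bench Pro figures quoted in this article (percent).
scores = {
    "Claude Fable 5": 80.3,
    "Claude Opus 4.8": 69.2,
    "GPT-5.5": 58.6,
    "Gemini 3.1 Pro": 54.2,
}


def gap(a: str, b: str) -> float:
    """Percentage-point difference between two models, rounded to one decimal."""
    return round(scores[a] - scores[b], 1)


# Rank models from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)

# Gap between each model and the one ranked directly below it.
adjacent_gaps = {
    f"{hi} vs {lo}": gap(hi, lo) for hi, lo in zip(ranked, ranked[1:])
}

for pair, points in adjacent_gaps.items():
    print(f"{pair}: {points} points")
```

Calling `gap` with any two model names gives the spread between them, which makes it easy to compare non-adjacent pairs such as Opus 4.8 and Gemini 3.1 Pro.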

State of the Art Nearly Everywhere

Anthropic says Fable 5 is state-of-the-art on nearly all tested capability benchmarks, with the largest advantages on long and complex tasks. Andrej Karpathy, reviewing the results, called the release "a major-version-bump-deserving step change forward" and described the benchmarks as "SOTA on everything by a margin," noting the model is especially strong for "long problem-solving sessions on very difficult problems."

Beyond coding, the evaluation results extend to vision and long-context work:

  • Vision: state-of-the-art at extracting numbers from scientific figures and rebuilding web apps from screenshots. The model completed Pokémon FireRed using only vision.
  • Long context: support for millions of tokens. Using file-based memory, Fable 5 played Slay the Spire 3x better than Opus 4.8.

Why the Long-Task Lead Matters

Benchmark deltas of a point or two are common between frontier releases; an 11-point jump on SWE-Bench Pro is not. The pattern in the results - biggest gains on the longest, hardest tasks - suggests the improvement is concentrated precisely where agentic workloads live. Cursor CEO Michael Truell said as much: "Claude Fable 5 is the state of the art model on CursorBench. It's opened up a class of long-horizon problems that were out of reach."

Early real-world data points back this up. Stripe reported that a 50-million-line Ruby codebase migration, estimated at over two months for a team, was completed in one day.
