Claude Fable 5 Benchmarks: 80.3% SWE-Bench Pro, 11-Point Lead

Claude Fable 5 did not just edge out the competition - it lapped it. Anthropic's new model scored 80.3% on SWE-Bench Pro, roughly 11 points ahead of the next-best frontier model, in benchmark results published alongside the June 9 launch.

The Numbers

The results on SWE-Bench Pro, a benchmark of real-world software engineering capability, break down as follows:

Model              SWE-Bench Pro
Claude Fable 5     80.3%
Claude Opus 4.8    69.2%
GPT-5.5            58.6%
Gemini 3.1 Pro     54.2%

The gap between Fable 5 and Anthropic's own Opus 4.8 (11.1 points) is larger than the 10.6-point gap between Opus 4.8 and GPT-5.5, though smaller than the 15-point gap between Opus 4.8 and Google's Gemini 3.1 Pro. Fable 5 also posted the highest score among frontier models on Cognition's FrontierCode eval, a separate independent measure of frontier coding ability.
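For readers who want to check the point gaps themselves, here is a minimal Python sketch that uses only the four SWE-Bench Pro scores reported above. The `scores` dictionary and the `gap` helper are illustrative names, not part of any published tooling.

```python
# SWE-Bench Pro scores (percent) as reported in the article's table.
scores = {
    "Claude Fable 5": 80.3,
    "Claude Opus 4.8": 69.2,
    "GPT-5.5": 58.6,
    "Gemini 3.1 Pro": 54.2,
}


def gap(a: str, b: str) -> float:
    """Percentage-point gap between model a and model b, rounded to 0.1."""
    return round(scores[a] - scores[b], 1)


# Rank models from highest to lowest score, then print the gap
# between each model and the one directly below it.
ranked = sorted(scores, key=scores.get, reverse=True)
for higher, lower in zip(ranked, ranked[1:]):
    print(f"{higher} vs {lower}: {gap(higher, lower):.1f} points")
```

The same helper works for non-adjacent pairs, for example `gap("Claude Opus 4.8", "Gemini 3.1 Pro")`.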

State of the Art Nearly Everywhere

Anthropic says Fable 5 is state-of-the-art on nearly all tested capability benchmarks, with the largest advantages on long and complex tasks. Andrej Karpathy, reviewing the results, called the release "a major-version-bump-deserving step change forward" and described the benchmarks as "SOTA on everything by a margin," noting the model is especially strong for "long problem-solving sessions on very difficult problems."

Beyond coding, the evaluation results extend to other modalities:

Vision: state-of-the-art at extracting numbers from scientific figures and rebuilding web apps from screenshots. The model completed Pokémon FireRed using only vision.

Long context: support for millions of tokens. Using file-based memory, Fable 5 played Slay the Spire 3x better than Opus 4.8.

Why the Long-Task Lead Matters

Benchmark deltas of a point or two are common between frontier releases; an 11-point jump on SWE-Bench Pro is not. The pattern in the results - biggest gains on the longest, hardest tasks - suggests the improvement is concentrated precisely where agentic workloads live. Cursor CEO Michael Truell said as much: "Claude Fable 5 is the state of the art model on CursorBench. It's opened up a class of long-horizon problems that were out of reach."

Early real-world data points back this up. Stripe reported that a 50-million-line Ruby codebase migration, estimated at over two months for a team, was completed in one day.

Sources

Anthropic: Claude Fable 5 and Claude Mythos 5

Interconnects: Claude Fable 5 and new AI safety

VentureBeat analysis
