SWE-bench Explained
The AI Coding Benchmark That Actually Matters (2025 Guide)
What is SWE-bench?
SWE-bench (Software Engineering Benchmark) is a benchmark that evaluates how well AI models resolve real issues from GitHub repositories such as Django and scikit-learn. A model must analyze the codebase and generate a patch that passes the project's tests, which makes it one of the closest public measures of whether AI can handle real-world coding work.
How It Works
Real GitHub Issues
Researchers compiled 2,294 task instances from 12 popular Python projects. Each one is a real GitHub issue paired with the pull request that human developers wrote to resolve it.
AI Context
Models receive the issue description, repository codebase snapshot, and instructions to generate a patch.
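As a rough illustration of those inputs, the sketch below models one task as a small record and assembles a prompt from it. The field names `repo`, `base_commit`, and `problem_statement` mirror the public SWE-bench dataset; the `TaskInstance` class, the `build_prompt` helper, the sample values, and the prompt wording are all hypothetical, and real harnesses differ in how they supply repository context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One benchmark task (illustrative subset of fields)."""
    repo: str               # e.g. "django/django"
    base_commit: str        # commit the repository snapshot is checked out at
    problem_statement: str  # text of the GitHub issue


def build_prompt(task: TaskInstance, file_snippets: dict[str, str]) -> str:
    """Assemble a plain-text prompt: issue, selected files, and instructions."""
    parts = [
        f"Repository: {task.repo} at commit {task.base_commit}",
        "Issue:",
        task.problem_statement.strip(),
        "Relevant files:",
    ]
    for path, source in sorted(file_snippets.items()):
        parts.append(f"--- {path} ---\n{source.rstrip()}")
    parts.append("Respond with a unified diff that resolves the issue.")
    return "\n\n".join(parts)


# Made-up example values, not a real benchmark instance.
example_task = TaskInstance(
    repo="example-org/example-repo",
    base_commit="abc1234",
    problem_statement="Calling parse_date('') raises IndexError instead of ValueError.",
)
example_prompt = build_prompt(
    example_task,
    {"pkg/dates.py": "def parse_date(text):\n    return text.split('-')[0]\n"},
)
```

The model's reply would then be parsed as a diff and applied to the checked-out snapshot before any tests run.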
Auto-Grading
Each generated patch is applied to the repository and run against the project's tests. A task counts as solved only if the required tests pass; there is no partial credit.
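The all-or-nothing rule can be sketched as follows. In the public SWE-bench harness, each task lists tests the fix must turn from failing to passing (`FAIL_TO_PASS`) and tests that must keep passing (`PASS_TO_PASS`). The `is_resolved` and `resolve_rate` helpers and the test names here are hypothetical, and the real harness runs tests in isolated environments rather than taking a results dictionary.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True only if every required test passed after applying the patch.

    test_results maps test name -> True (passed) / False (failed).
    A test missing from the results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks resolved; each task is worth 0 or 1, never a fraction."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up test names for illustration.
fail_to_pass = ["test_empty_string_raises_value_error"]
pass_to_pass = ["test_parses_iso_date", "test_parses_us_date"]

full_fix = {
    "test_empty_string_raises_value_error": True,
    "test_parses_iso_date": True,
    "test_parses_us_date": True,
}
# Fixes the reported bug but breaks an existing test: counts as unsolved.
regressing_fix = {**full_fix, "test_parses_us_date": False}

outcomes = [
    is_resolved(full_fix, fail_to_pass, pass_to_pass),
    is_resolved(regressing_fix, fail_to_pass, pass_to_pass),
]
```

Under this rule a patch that fixes the reported issue but breaks an existing test scores the same as no patch at all.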
Two Versions
SWE-bench Full
2,294 problems from 12 Python repositories. May contain ambiguous requirements and flaky tests.
SWE-bench Verified
500 human-validated problems drawn from the full set, each with a clear description and reliable tests. The gold standard for current scoring.
Current Leaderboard (November 2025)
| Rank | Model | Score | Released |
|---|---|---|---|
| 🥇 | Claude Sonnet 4.5 | 77.2% | Sep 2025 |
| 🥈 | GPT-5.1 | 76.3% | Nov 2025 |
| 🥉 | Claude 3.5 Sonnet | 49.0% | Jun 2024 |
| 4 | GPT-4 Turbo | 38.0% | Apr 2024 |
Claude Sonnet 4.5's 77.2% is 28.2 percentage points higher than Claude 3.5 Sonnet's 49.0%.
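The gap quoted above is a difference in percentage points, which is not the same as a relative improvement. A short sketch using only scores from the table makes the distinction concrete (the variable and function names are illustrative):

```python
# Scores (percent of tasks resolved) copied from the leaderboard table.
sonnet_4_5 = 77.2
gpt_5_1 = 76.3
sonnet_3_5 = 49.0


def point_gap(newer: float, older: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(newer - older, 1)


def relative_gain(newer: float, older: float) -> float:
    """Improvement relative to the older score, as a percentage."""
    return round(100.0 * (newer - older) / older, 1)


generation_gap = point_gap(sonnet_4_5, sonnet_3_5)       # 28.2 points
generation_gain = relative_gain(sonnet_4_5, sonnet_3_5)  # 57.6 (percent, relative)
top_two_gap = point_gap(sonnet_4_5, gpt_5_1)             # 0.9 points
```

The top two entries are separated by less than one point, so small differences in evaluation setup could plausibly matter more than the ranking suggests.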
Why It Matters
Real Skills
Tests complex repository navigation and targeted fixes, not toy problems
Objective
Tests either pass or fail—no subjective evaluation
Predictive
High scores tend to correlate with developer satisfaction when using AI coding assistants
Limitations
1. Python-Only: Other languages (JavaScript, Rust, Go) may show different performance
2. Scope: Focuses on bug fixes, not greenfield development or architecture design
3. Edge Cases: Test suites may not catch all edge case bugs
The Bottom Line
SWE-bench Verified is the strongest public metric for evaluating AI coding capabilities, though it should be one factor among many when selecting an AI coding assistant.