SWE-bench Explained

The AI Coding Benchmark That Actually Matters (2025 Guide)

What is SWE-bench?

SWE-bench (Software Engineering Benchmark) is a benchmark that evaluates AI models' ability to resolve real issues from GitHub repositories such as Django and scikit-learn. A model must analyze the codebase and generate a fix that passes the project's tests, which makes it one of the closest available measures of whether AI can handle practical coding work.

How It Works

Real GitHub Issues

Researchers compiled 2,294 real issues from 12 popular Python projects, each one an actual problem that human developers resolved in a merged pull request.

AI Context

Models receive the issue description, repository codebase snapshot, and instructions to generate a patch.
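
To make those three inputs concrete, here is a minimal Python sketch of how one problem might be packaged for a model. The `TaskInstance` class, its field names, the `build_prompt` helper, and the example values are all hypothetical and chosen for illustration; they are not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical container for what the model is given for one problem."""
    repo: str               # repository name (illustrative value below)
    base_commit: str        # commit that identifies the codebase snapshot
    problem_statement: str  # text of the GitHub issue


def build_prompt(task: TaskInstance) -> str:
    """Assemble illustrative instructions asking the model for a patch."""
    return (
        f"Repository: {task.repo} at commit {task.base_commit}\n"
        f"Issue:\n{task.problem_statement}\n"
        "Task: produce a patch in unified diff format that resolves the issue."
    )


# Made-up example values, not a real benchmark problem.
example = TaskInstance(
    repo="example/project",
    base_commit="abc1234",
    problem_statement="Calling parse('') raises IndexError instead of returning None.",
)
prompt = build_prompt(example)
print(prompt)
```

In practice the repository snapshot is far too large to paste into a prompt, so real setups typically let the model search or browse the checked-out code; the sketch only shows the shape of the inputs.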

Auto-Grading

Each AI-generated fix is run against the project's test suite. A problem counts as solved only if the required tests pass; there is no partial credit.
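
The all-or-nothing rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the official grading code: the function names are hypothetical, and splitting the required tests into ones the fix must make pass (`fail_to_pass`) and ones that must keep passing (`pass_to_pass`) is an assumption about how the grading is organized that the text above does not spell out.

```python
def is_resolved(
    test_results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    test_results maps a test name to whether it passed. A test missing
    from the results is treated as a failure.
    """
    required = fail_to_pass + pass_to_pass
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: list[bool]) -> float:
    """Percentage of problems solved; each problem is simply solved or not."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# One patch fixes the target test but breaks an existing one: not solved.
broken = {"test_fix": True, "test_existing": False}
print(is_resolved(broken, ["test_fix"], ["test_existing"]))  # False

# Three of four problems solved.
print(resolve_rate([True, False, True, True]))  # 75.0
```

Because a single failing required test marks the whole problem unsolved, a patch that is almost right scores the same as no patch at all.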

Two Versions

SWE-bench Full (Original Version)

2,294 problems from 12 Python repositories. Some problems may have ambiguous requirements or flaky tests.

SWE-bench Verified (Industry Standard)

500 human-validated problems with clear descriptions. This is the version most current scores are reported on.

Current Leaderboard (November 2025)

Rank	Model	Score	Released
🥇	Claude Sonnet 4.5	77.2%	Sep 2025
🥈	GPT-5.1	76.3%	Nov 2025
🥉	Claude 3.5 Sonnet (upgraded)	49.0%	Oct 2024
4	GPT-4 Turbo	38.0%	Apr 2024

Claude Sonnet 4.5's 77.2% represents a 28.2-point improvement over Claude 3.5 Sonnet's 49.0%.
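
The 28.2 figure is a difference in percentage points, which is not the same as a relative improvement. The short Python sketch below works through both using the two scores from the table; the function names are hypothetical.

```python
def point_gain(new_score: float, old_score: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    return round(new_score - old_score, 1)


def relative_gain(new_score: float, old_score: float) -> float:
    """Improvement expressed as a percentage of the old score."""
    return round(100 * (new_score - old_score) / old_score, 1)


print(point_gain(77.2, 49.0))     # 28.2 percentage points
print(relative_gain(77.2, 49.0))  # 57.6 percent more problems solved
```

Read this way, the newer model solves roughly 58% more problems than the older one, even though the gap in the table is 28.2 points.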

Why It Matters

🔧

Real Skills

Tests complex repository navigation and targeted fixes, not toy problems

✅

Objective

Tests either pass or fail—no subjective evaluation

📊

Predictive

High scores correlate with developer satisfaction using AI coding assistants

Limitations

1. Python-Only: Other languages (JavaScript, Rust, Go) may show different performance
2. Scope: Focuses on bug fixes, not greenfield development or architecture design
3. Edge Cases: Test suites may not catch all edge case bugs

The Bottom Line

SWE-bench Verified is the strongest public metric for evaluating AI coding capabilities, though it should be one factor among many when selecting an AI coding assistant.
