SWE-bench Explained

The AI Coding Benchmark That Actually Matters (2025 Guide)

What is SWE-bench?

SWE-bench (Software Engineering Benchmark) is a benchmark that evaluates AI models' ability to resolve real issues from GitHub repositories such as Django and scikit-learn. Models must analyze a codebase and generate a patch that passes the project's test suite, which makes it one of the closest public measures of whether AI can actually code.

How It Works

1. Real GitHub Issues

Researchers compiled 2,294 real issues (mostly bug reports, plus some feature requests) from 12 popular Python projects. Each one is an actual production issue that human developers resolved.

2. AI Context

Models receive the issue description, repository codebase snapshot, and instructions to generate a patch.
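As a rough illustration of the inputs described above, the sketch below bundles an issue, a repository snapshot reference, and patch instructions into one prompt. The class name, field names, prompt wording, and the sample task are assumptions made for this example; real evaluation harnesses choose their own format and their own way of selecting files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical container for what a model is shown for one task."""
    instance_id: str        # illustrative identifier for the task
    repo: str               # e.g. "django/django"
    base_commit: str        # commit that pins the codebase snapshot
    problem_statement: str  # text of the GitHub issue


def build_prompt(task: TaskInstance, file_snippets: dict[str, str]) -> str:
    """Assemble issue text, selected source files, and instructions."""
    parts = [
        f"Repository: {task.repo} at commit {task.base_commit}",
        "Issue:",
        task.problem_statement.strip(),
        "Relevant files:",
    ]
    # Sort paths so the prompt is deterministic for a given input.
    for path, source in sorted(file_snippets.items()):
        parts.append(f"--- {path} ---")
        parts.append(source.rstrip())
    parts.append("Respond with a unified diff that resolves the issue.")
    return "\n".join(parts)


# Made-up task, used only to show the shape of the prompt.
example_task = TaskInstance(
    instance_id="example-001",
    repo="example/project",
    base_commit="abc123",
    problem_statement="Calling parse('') raises IndexError.\n",
)
example_prompt = build_prompt(
    example_task, {"project/parser.py": "def parse(s):\n    return s[0]\n"}
)
```

In practice the full repository is far too large to paste into a prompt, so how the relevant files get chosen is a large part of what separates one system from another.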

3. Auto-Grading

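Each AI-generated patch is applied to the repository and the project's tests are run. A problem counts as solved only if the tests tied to the issue now pass and the tests that passed before still pass. There is no partial credit.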
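The all-or-nothing grading rule can be sketched in a few lines. This is a minimal illustration, not the official harness: the function names and the test names are assumptions, and the two test lists stand for "tests the fix should turn green" and "tests that must stay green".

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True only if every required test passed after the patch."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test with no recorded result is treated as a failure.
    return all(results.get(name, False) for name in required)


def resolve_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks resolved; each task is worth the same."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical test names for a single task.
fixed = {"test_empty_input": True, "test_basic": True}
regressed = {"test_empty_input": True, "test_basic": False}

solved = is_resolved(fixed, ["test_empty_input"], ["test_basic"])
broken = is_resolved(regressed, ["test_empty_input"], ["test_basic"])
rate = resolve_rate([solved, broken, True, True])
```

Here the second patch fixes the issue but breaks an existing test, so it earns nothing, which is what "no partial credit" means in practice.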

Two Versions

SWE-bench Full (Original Version)

2,294 problems from 12 Python repositories. May contain ambiguous requirements and flaky tests.

SWE-bench Verified (Industry Standard)

A 500-problem subset of the full set, screened by human reviewers for clear issue descriptions and reliable tests. The gold standard for current scoring.

Current Leaderboard (November 2025)

Rank  Model              Score  Released
🥇    Claude Sonnet 4.5  77.2%  Sep 2025
🥈    GPT-5.1            76.3%  Nov 2025
🥉    Claude Sonnet 3.5  49.0%  Jun 2024
4     GPT-4 Turbo        38.0%  Apr 2024

Claude Sonnet 4.5's 77.2% is a 28.2-percentage-point improvement over Claude Sonnet 3.5's 49.0%.
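The arithmetic behind that comparison is easy to check. The sketch below computes the gap in percentage points, the relative gain, and the number of solved problems each score implies. The helper names are made up for this example, and the solved counts assume the scores were measured on the 500-problem Verified set.

```python
def point_gain(new_score: float, old_score: float) -> float:
    """Difference between two scores, in percentage points."""
    return round(new_score - old_score, 1)


def relative_gain(new_score: float, old_score: float) -> float:
    """Improvement expressed as a percentage of the old score."""
    return round(100 * (new_score - old_score) / old_score, 1)


def solved_count(score_percent: float, total: int = 500) -> int:
    """Problems solved, assuming a benchmark of `total` problems."""
    return round(score_percent / 100 * total)


gap = point_gain(77.2, 49.0)          # 28.2 percentage points
relative = relative_gain(77.2, 49.0)  # 57.6 percent more than the old score
newer_solved = solved_count(77.2)     # 386 of 500
older_solved = solved_count(49.0)     # 245 of 500
```

Percentage points and relative percentages are easy to confuse: the same jump is 28.2 points but roughly a 57.6% relative increase.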

Why It Matters

Real Skills

Tests complex repository navigation and targeted fixes, not toy problems

Objective

Tests either pass or fail—no subjective evaluation

Predictive

High scores correlate with developer satisfaction using AI coding assistants

Limitations

  1. Python-Only: Other languages (JavaScript, Rust, Go) may show different performance
  2. Scope: Focuses on bug fixes, not greenfield development or architecture design
  3. Edge Cases: Test suites may not catch all edge case bugs

The Bottom Line

SWE-bench Verified is the strongest public metric for evaluating AI coding capabilities, though it should be one factor among many when selecting an AI coding assistant.
