AI Model Benchmarks

Performance comparison of Claude, GPT, Codex, and Gemini models across major AI benchmarks. Updated February 2026.

SWE-bench leader: Claude Opus 4.5 (80.9%)
HumanEval leader: Codex 5.3 Ultra (98.1%)
GPQA leader: Claude Opus 4.5 (87.3%)
Largest context window: Gemini 3 Pro (1M tokens)

Coding Performance

SWE-bench Verified

Resolution of real-world GitHub issues

🥇 Claude Opus 4.5: 80.9%
🥈 Claude Opus 4.6: 80.8%
🥉 Codex 5.3 Ultra: 78.4%
GPT-5.1: 74.2%
Claude Sonnet 4.5: 73.5%
Gemini 3 Pro: 71.8%

HumanEval

Code generation accuracy

🥇 Codex 5.3 Ultra: 98.1%
🥈 Gemini 3 Pro: 97.8%
🥉 Claude Opus 4.5: 97.3%
Claude Sonnet 4.5: 95.8%
GPT-5.1: 94.2%

MBPP

Python programming tasks

🥇 Claude Opus 4.5: 96.1%
🥈 Codex 5.3: 95.7%
🥉 Gemini 3 Pro: 94.2%
GPT-5.1: 93.4%

Reasoning & Knowledge

GPQA Diamond

Graduate-level science questions

🥇 Claude Opus 4.5: 87.3%
🥈 GPT-5.1: 81.9%
🥉 Gemini 3 Pro: 79.4%
Claude Sonnet 4.5: 76.2%

MMLU

Massive Multitask Language Understanding

🥇 GPT-5.1: 92.4%
🥈 Gemini 3 Pro: 91.8%
🥉 Claude Opus 4.5: 90.7%
Claude Sonnet 4.5: 89.2%

ARC-AGI-2

Abstract reasoning challenges

🥇 Claude Opus 4.6: 68.8%
🥈 GPT-5.1: 62.3%
🥉 Gemini 3 Pro: 59.7%

Context Handling

Context Window Size

Maximum token capacity

🥇 Gemini 3 Pro: 1,000,000 tokens
🥈 GPT-5.1: 256,000 tokens
🥉 Claude Opus 4.5: 200,000 tokens

Context Recall

Recall accuracy at the model's maximum context length; a measurement sketch follows the rankings

🥇 Claude Opus 4.5: 98%
🥈 GPT-5.1: 96%
🥉 Gemini 3 Pro: 91%
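
Context recall is commonly measured with needle-in-a-haystack tests: plant a fact at a known depth inside long filler text and check whether the model can retrieve it. A minimal sketch, assuming you supply your own query_model function; real methodologies sweep many depths and context lengths and average the results.

```python
from typing import Callable

def needle_recall(query_model: Callable[[str], str],
                  filler_paragraphs: list[str], depth: float) -> bool:
    """Insert a 'needle' fact at a relative depth (0.0-1.0) in a long
    context and check whether the model retrieves it."""
    needle = "The secret launch code is 7F-DELTA-19."
    docs = list(filler_paragraphs)
    docs.insert(int(depth * len(docs)), needle)
    prompt = "\n\n".join(docs) + "\n\nWhat is the secret launch code?"
    return "7F-DELTA-19" in query_model(prompt)

# Sweep insertion depths at a fixed context length, then average:
# recall = mean(needle_recall(ask, filler, d) for d in (0, 0.25, 0.5, 0.75, 1))
```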

Speed & Latency

Time to First Token

Response latency from request to first streamed token; lower is better (a measurement sketch follows the rankings)

🥇 GPT-5.1: 1.8 seconds
🥈 Gemini 3 Pro: 2.4 seconds
🥉 Claude Opus 4.5: 3.2 seconds
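
You can reproduce time-to-first-token numbers against any provider with a streaming call: start a timer when the request is sent and stop at the first non-empty chunk. A minimal sketch, assuming a stream_chunks callable that you wire to your API of choice:

```python
import time
from typing import Callable, Iterable

def time_to_first_token(stream_chunks: Callable[[str], Iterable[str]],
                        prompt: str) -> float:
    """Seconds between sending a streaming request and receiving the
    first non-empty content chunk."""
    start = time.perf_counter()
    for chunk in stream_chunks(prompt):
        if chunk:  # skip empty keep-alive events
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any tokens arrived")

# Latency is noisy: report the median of repeated measurements rather
# than a single run.
```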

Understanding the Benchmarks

SWE-bench Verified

Tests AI models on real-world GitHub issues from popular open-source projects. Models must understand the codebase, identify the bug, and generate a correct fix. The "Verified" subset contains 500 human-validated issues.
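
To make the mechanics concrete, here is a minimal sketch of how a SWE-bench-style harness scores a single instance. The repo_dir, predicted_patch, and fail_to_pass names are illustrative, and the official harness runs each instance in an isolated container, but the core check is the same: apply the model's patch, then re-run the tests the issue originally broke.

```python
import subprocess

def evaluate_instance(repo_dir: str, predicted_patch: str,
                      fail_to_pass: list[str]) -> bool:
    """Score one SWE-bench-style instance: the model's patch counts as
    a resolution only if it applies cleanly and the issue's previously
    failing tests now pass."""
    # Apply the model-generated unified diff ("-" reads from stdin).
    apply = subprocess.run(["git", "apply", "-"], input=predicted_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return False  # patch failed to apply

    # Re-run only the tests that the original issue caused to fail.
    tests = subprocess.run(["pytest", "-q", *fail_to_pass], cwd=repo_dir)
    return tests.returncode == 0

# The benchmark score is the fraction of instances resolved.
```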

HumanEval

164 hand-written programming problems that evaluate code generation from natural language descriptions. Each problem includes unit tests that verify functional correctness.
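
HumanEval results are conventionally reported as pass@k: the probability that at least one of k sampled completions passes all of a problem's tests. The standard unbiased estimator draws n samples per problem, counts the c that pass, and computes pass@k = 1 - C(n-c, k) / C(n, k). The snippet below is a direct implementation of that formula, not any particular lab's evaluation code.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n samples of which c are correct,
    passes the problem's test cases."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 140 of them correct.
print(round(pass_at_k(n=200, c=140, k=1), 3))  # 0.7 (= c / n)
```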

GPQA Diamond

Graduate-level multiple-choice questions in physics, biology, and chemistry, designed to be "Google-proof": skilled non-experts score poorly even with unrestricted web access, and the questions are difficult even for PhD-level domain experts.

MMLU

Multiple-choice questions across 57 subjects, from elementary mathematics to professional law. Tests world knowledge and problem-solving ability across diverse domains.
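
Because MMLU is multiple-choice, scoring reduces to accuracy; one common convention is to average per-subject accuracy so that all 57 subjects weigh equally regardless of question count. A minimal sketch of that macro-average (illustrative, not any official harness):

```python
from collections import defaultdict

def mmlu_macro_accuracy(results: list[tuple[str, bool]]) -> float:
    """Macro-averaged accuracy over (subject, is_correct) records:
    each subject contributes equally, however many questions it has."""
    by_subject: dict[str, list[bool]] = defaultdict(list)
    for subject, correct in results:
        by_subject[subject].append(correct)
    per_subject = [sum(flags) / len(flags) for flags in by_subject.values()]
    return sum(per_subject) / len(per_subject)

# Two subjects: law at 50% and math at 100% -> macro average 75%.
print(mmlu_macro_accuracy([("law", True), ("law", False), ("math", True)]))
```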

Compare Models in Detail

Explore comprehensive comparisons between Claude and competing AI models.