AI Model Benchmarks
Comprehensive performance comparison of Claude, GPT, and Gemini across the major coding, reasoning, long-context, and latency benchmarks. Updated February 2026.
Coding Performance
SWE-bench Verified
Resolution of real-world GitHub issues
HumanEval
Code generation accuracy
MBPP
Python programming tasks
Reasoning & Knowledge
GPQA Diamond
Graduate-level science questions
MMLU
Massive multitask language understanding
ARC-AGI-2
Abstract reasoning challenges
Context Handling
Context Window Size
Maximum token capacity
Context Recall
Retrieval accuracy at the full context length
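One common way to measure recall at long context is a "needle in a haystack" probe: bury a unique fact at a chosen depth in filler text and ask the model to retrieve it. The sketch below is illustrative only; ask_model is a hypothetical stand-in for whatever inference call you use, and the needle, filler, and context size are made up.

def build_haystack(needle: str, depth: float, n_chars: int) -> str:
    # Repeat filler text to the target length and bury the needle at `depth`
    # (0.0 = start of the context, 1.0 = end).
    filler = "The quick brown fox jumps over the lazy dog. "
    haystack = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(depth * len(haystack))
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def recall_probe(ask_model, n_chars: int = 500_000) -> bool:
    # ask_model is a hypothetical callable: prompt string in, reply string out.
    needle = "The magic number for the recall test is 740317."
    prompt = build_haystack(needle, depth=0.5, n_chars=n_chars)
    prompt += "\n\nWhat is the magic number for the recall test?"
    return "740317" in ask_model(prompt)

Sweeping the depth parameter and context size produces the familiar recall heat maps reported for long-context evaluations.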
Speed & Latency
Time to First Token
Latency before the first output token arrives
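Time to first token is straightforward to measure against a streaming API. The sketch below uses the Anthropic Python SDK as one example; the model name is a placeholder for whichever model is under test, and ANTHROPIC_API_KEY is assumed to be set in the environment.

import time
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

start = time.perf_counter()
ttft_ms = None

with client.messages.stream(
    model="claude-sonnet-4-5",  # placeholder; substitute the model under test
    max_tokens=64,
    messages=[{"role": "user", "content": "Say hello."}],
) as stream:
    for _ in stream.text_stream:  # text deltas arrive as they are generated
        ttft_ms = (time.perf_counter() - start) * 1000
        break  # only the first chunk matters for TTFT

if ttft_ms is not None:
    print(f"Time to first token: {ttft_ms:.0f} ms")

In practice TTFT should be averaged over many requests, since it varies with prompt length, server load, and network conditions.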
Understanding the Benchmarks
SWE-bench Verified
Tests AI models on real-world GitHub issues from popular open-source projects. Models must understand code, identify bugs, and generate correct fixes. The "Verified" subset contains 500 carefully validated issues.
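In rough terms, evaluation checks out the repository at the issue's base commit, applies the model-generated patch, and re-runs the project's tests. The sketch below is a simplified illustration under assumed paths and commands, not the actual harness, which runs each instance in its own Docker container with instance-specific test commands.

import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str) -> bool:
    # Reset the repository to the state the issue was filed against.
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    # Apply the model-generated fix; a patch that does not apply scores zero.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False
    # The issue counts as resolved only if the previously failing tests now
    # pass and the rest of the suite still passes.
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    return tests.returncode == 0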
HumanEval
164 hand-written Python programming problems that evaluate code generation from natural language descriptions. Each problem includes test cases to verify functional correctness.
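Each problem looks roughly like the sketch below, modeled on the benchmark's opening task: the model sees only the signature and docstring, and its completion passes only if the accompanying tests run clean. The reference solution and test values shown here are illustrative.

def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer than threshold."""
    # A reference solution; the model only sees the signature and docstring.
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )

def check(candidate):
    # Functional-correctness tests in the style of the benchmark's test suites.
    assert candidate([1.0, 2.0, 3.9, 4.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.5) is False

check(has_close_elements)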
GPQA Diamond
Graduate-level multiple-choice questions in biology, physics, and chemistry. Designed to be "Google-proof": hard enough that even skilled non-experts with unrestricted web access struggle to answer them.
MMLU
Multiple-choice questions spanning 57 subjects, from elementary mathematics to professional law. Tests world knowledge and problem-solving ability across diverse domains.
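Both GPQA and MMLU are scored as multiple-choice accuracy. The sketch below shows a minimal harness under stated assumptions: ask_model is a hypothetical stand-in for an inference call, and the sample question is invented, not drawn from either benchmark.

QUESTION = {
    "question": "Which gas makes up most of Earth's atmosphere?",
    "options": {"A": "Oxygen", "B": "Nitrogen", "C": "Argon", "D": "Carbon dioxide"},
    "answer": "B",
}

def format_prompt(q: dict) -> str:
    # Present the question and lettered options, asking for a single letter.
    lines = [q["question"]]
    lines += [f"{letter}. {text}" for letter, text in q["options"].items()]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(q: dict, ask_model) -> bool:
    # ask_model is a hypothetical callable: prompt string in, reply string out.
    reply = ask_model(format_prompt(q)).strip().upper()
    return reply[:1] == q["answer"]

Reported benchmark numbers are simply the fraction of questions scored correct under a harness like this, though prompt format and answer extraction details vary between evaluations.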
Compare Models in Detail
Explore comprehensive comparisons between Claude and competing AI models.