o4-mini
[radar chart — all benchmarks: image not rendered]
family — openai-o
benchmark scores — as of Mar 2026
| Benchmark | Score | Category |
|---|---|---|
| MATH / AIME | 97.5% | reasoning |
| HumanEval / MBPP | 97.3% | coding |
| MMLU | 90.0% | general |
| LiveCodeBench | 80.2% | coding |
| GPQA (Diamond) | 77.6% | reasoning |
| Chatbot Arena (LMSYS) | 1391 (Elo) | general |
| SWE-bench | 68.1% | agentic, coding |
| SimpleQA | 20.2% | general |
| TAU-bench | — | agentic, multi-agent |
| GAIA | — | agentic, multi-agent |
| WebArena | — | agentic |
| MT-Bench | — | general |
| AgentBench | — | multi-agent |
| IFEval | — | general, agentic |
pricing — per 1M tokens via OpenRouter
Data unavailable
latency percentiles — time to first token (ms)
Data unavailable
model specifications
Data unavailable