o3

OpenAI
Best for: reasoning
Benchmarks (categories in parentheses)
MATH / AIME: 96.7% (reasoning)
SimpleQA: 96.7% (general)
MMLU: 92.9% (general)
GPQA (Diamond): 87.7% (reasoning)
HumanEval / MBPP: 87.4% (coding)
LiveCodeBench: 82.7% (coding)
SWE-bench: 71.7% (agentic, coding)
Chatbot Arena (LMSYS): 1424 (general)
TAU-bench (agentic, multiagent)
GAIA (agentic, multiagent)
WebArena (agentic)
MT-Bench (general)
AgentBench (multiagent)
IFEval (general, agentic)

Data unavailable
