SWE-bench
The model receives a real GitHub repo with a reported bug and must write the patch that fixes it — no hints about where the problem is.
Measures: ability to navigate complex codebases, understand broad context, plan multi-file edits and execute sequential actions autonomously.
HumanEval / MBPP
Isolated programming problems: given a description, generate the correct function. HumanEval uses Python; MBPP includes beginner-to-intermediate problems.
Measures: code synthesis in a single function. Does not test agentic reasoning or real file editing. Good for "can the model code at all?"
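To make the single-function format concrete, here is a minimal sketch of HumanEval-style evaluation: each task is a function signature plus docstring (the prompt), the model produces a completion, and grading runs hidden unit tests pass/fail. The task, tests, and helper names below are illustrative, not actual HumanEval items.

```python
# Illustrative HumanEval-style task: the prompt is a signature + docstring,
# and the completion is the model's continuation of that source text.
PROMPT = '''
def add_digits(n: int) -> int:
    """Return the sum of the decimal digits of a non-negative integer."""
'''

# A candidate completion, as a model might produce it.
COMPLETION = '''
    return sum(int(d) for d in str(n))
'''

def check(candidate) -> bool:
    # Grading is binary: the completion passes a small battery of asserts or it fails.
    try:
        assert candidate(0) == 0
        assert candidate(19) == 10
        assert candidate(305) == 8
        return True
    except AssertionError:
        return False

def run_task(prompt: str, completion: str) -> bool:
    namespace = {}
    exec(prompt + completion, namespace)  # real harnesses sandbox this step
    return check(namespace["add_digits"])
```

`run_task(PROMPT, COMPLETION)` returns `True` here; pass@1 is simply the fraction of tasks for which a single sampled completion passes.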
MMLU
57 knowledge domains: medicine, law, history, physics, math, ethics... The model answers multiple-choice questions at university/grad level.
Measures: breadth of factual knowledge. Great indicator of whether the model is a well-rounded generalist. Does not test deep reasoning.
GPQA
Questions created by PhDs in physics, biology, and chemistry — so hard that PhD-level non-experts score only about 34% even with unrestricted web access. Diamond = hardest subset.
Measures: deep and rigorous scientific reasoning. Signals advanced technical research capability. Does not directly test code or agentic behavior.
MATH / AIME
MATH: 12,500 competition problems (algebra, geometry, number theory, probability). AIME: the American Invitational Mathematics Examination, a US olympiad qualifier — 15 problems per exam, integer answers, no multiple choice.
Measures: step-by-step mathematical reasoning, logical rigor, and multi-step problem solving. Direct benchmark for reasoning models (o1, R1, etc).
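Scoring on these benchmarks typically ignores the chain of thought and exact-matches only a normalized final answer; for AIME that answer is an integer in 0–999. A hedged sketch (the "Answer:" marker convention is an assumption for illustration, not any benchmark's official format):

```python
def extract_answer(solution: str) -> str:
    # Assumed convention: the model ends its solution with "Answer: <value>".
    marker = "Answer:"
    return solution.rsplit(marker, 1)[-1].strip() if marker in solution else ""

def grade_aime(solution: str, gold: int) -> bool:
    # AIME answers are integers in 0..999, so grading is strict exact match.
    ans = extract_answer(solution)
    return ans.isdigit() and 0 <= int(ans) <= 999 and int(ans) == gold
```

For example, `grade_aime("...long derivation... Answer: 204", 204)` returns `True`, while any malformed or missing final answer grades as incorrect regardless of the reasoning shown.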
TAU-bench Agentic Multi-Agent
Simulates customer support agents with real tools (databases, APIs). The agent must solve multi-step tasks by conversing with a simulated user.
Measures: tool use, decision-making in reactive environments, policy following, and error resilience. Benchmark for production agents.
GAIA
Real-world tasks humans could solve in minutes but that require the agent to web search, read PDFs, run code, and combine multiple information sources.
Measures: orchestration of heterogeneous tools, long-term planning, and real-world grounding. Widely used to evaluate agentic assistants like Deep Research.
WebArena / VisualWebArena
The agent controls a real browser and performs tasks on simulated sites (Reddit, GitLab, e-commerce, maps). VisualWebArena adds visual perception (screenshots).
Measures: autonomous web navigation, real UI interaction, sequential action planning. Reference for computer-use agents and automation agents.
Chatbot Arena (LMSYS) General Assistant
Real humans compare two anonymous models in open-ended conversations and vote for the best. The Elo ranking is computed from thousands of real preferences.
Measures: real human preference in everyday use — writing, conversation, instruction following, tone. Best proxy for "which model do people actually prefer".
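The idea of turning pairwise votes into a leaderboard can be sketched with the standard Elo update rule. Note this is illustrative: the Arena leaderboard is actually fit with a Bradley-Terry model over all votes, not updated one vote at a time, and the model names and K-factor below are assumptions.

```python
def expected(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    # Standard Elo: the winner gains in proportion to how unexpected the win was.
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e)
    ratings[loser] -= k * (1.0 - e)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
for winner, loser in votes:
    update(ratings, winner, loser)
```

After these four votes `model_a` ends above `model_b`, and because each update is zero-sum the ratings always total the same amount.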
MT-Bench General Assistant
80 multi-turn conversations across 8 categories (writing, roleplay, extraction, reasoning, math, code, etc.). GPT-4 acts as a judge evaluating responses.
Measures: instruction following in real chat context and multi-turn coherence. Closer to real chatbot use than multiple-choice benchmarks.
LiveCodeBench
Competitive programming problems collected after model training cutoffs (LeetCode, Codeforces, AtCoder) — avoids training-data contamination.
Measures: ability to solve novel algorithm and data structure problems. More contamination-resistant than HumanEval because the problem pool is continuously refreshed with post-cutoff problems.
AgentBench
Suite of 8 environments: OS, database, web, games, shopping, and more. The model acts as an autonomous agent in each interactive environment.
Measures: generalization of agentic behavior across diverse environments — crucial for evaluating frameworks like AutoGPT, CrewAI and similar.
IFEval General Assistant Agentic
Instructions with explicitly verifiable constraints: "respond in under 100 words", "use exactly 3 sections with headings", "do not use the word X".
Measures: precise instruction and constraint following — critical skill for both assistants and agents receiving detailed system prompts.
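IFEval's defining property is that every constraint can be checked by a program rather than a judge. A hedged sketch of three checkers matching the examples above (not IFEval's actual implementation):

```python
import re

def under_n_words(text: str, n: int = 100) -> bool:
    # "respond in under 100 words"
    return len(text.split()) < n

def has_exactly_n_headed_sections(text: str, n: int = 3) -> bool:
    # "use exactly 3 sections with headings" — counts markdown-style headings.
    return len(re.findall(r"^#+ ", text, flags=re.MULTILINE)) == n

def avoids_word(text: str, word: str) -> bool:
    # "do not use the word X" — whole-word, case-insensitive.
    return re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE) is None

response = "# Intro\nShort answer.\n# Body\nDetails here.\n# End\nDone."
checks = [
    under_n_words(response),
    has_exactly_n_headed_sections(response),
    avoids_word(response, "banana"),
]
```

A response is scored per constraint, so partial compliance is visible: here all three checks pass, but swapping in a response containing "banana" would fail only the third.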
SimpleQA General Assistant
Short, verifiable factual questions about the real world. Designed to measure calibration: the model should not confidently hallucinate when it does not know.
Measures: factual honesty and confidence calibration. A model with a high hallucination rate will score poorly here even with a high MMLU score.
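The calibration incentive can be made concrete with a grading scheme in this spirit: each answer is marked correct, incorrect, or not attempted, and abstaining scores better than guessing wrong. The specific penalty weighting below is an assumption for illustration, not SimpleQA's official metric.

```python
def score(grades: list, wrong_penalty: float = 1.0) -> float:
    # Abstentions ("not_attempted") score zero, so an honest "I don't know"
    # beats a confident wrong answer, which costs points.
    correct = grades.count("correct")
    incorrect = grades.count("incorrect")
    return correct - wrong_penalty * incorrect

honest = ["correct", "not_attempted", "not_attempted", "correct"]
guesser = ["correct", "incorrect", "incorrect", "correct"]
```

Both models answer the same two questions correctly, but the honest model scores 2.0 while the guesser scores 0.0 — the gap is entirely due to calibration, not knowledge.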