SWE-bench
The model receives a real GitHub repo with a reported bug and must write the patch that fixes it — no hints about where the problem is.
Measures: ability to navigate complex codebases, understand broad context, plan multi-file edits and execute sequential actions autonomously.
HumanEval / MBPP
Isolated programming problems: given a description, generate the correct function. HumanEval uses Python; MBPP includes beginner-to-intermediate problems.
Measures: code synthesis in a single function. Does not test agentic reasoning or real file editing. Good for "can the model code at all?"
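To make the single-function format concrete, here is a minimal sketch of HumanEval-style evaluation: each task is a function signature plus docstring (the prompt), the model produces a completion, and grading runs hidden unit tests pass/fail. The task, tests, and helper names below are illustrative, not actual HumanEval items.

```python
# Illustrative HumanEval-style task: the prompt is a signature + docstring,
# and the completion is the model's continuation of that source text.
PROMPT = '''
def add_digits(n: int) -> int:
    """Return the sum of the decimal digits of a non-negative integer."""
'''

# A candidate completion, as a model might produce it.
COMPLETION = '''
    return sum(int(d) for d in str(n))
'''

def check(candidate) -> bool:
    # Grading is binary: the completion passes a small battery of asserts or it fails.
    try:
        assert candidate(0) == 0
        assert candidate(19) == 10
        assert candidate(305) == 8
        return True
    except AssertionError:
        return False

def run_task(prompt: str, completion: str) -> bool:
    namespace = {}
    exec(prompt + completion, namespace)  # real harnesses sandbox this step
    return check(namespace["add_digits"])
```

`run_task(PROMPT, COMPLETION)` returns `True` here; pass@1 is simply the fraction of tasks for which a single sampled completion passes.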
MMLU
57 knowledge domains: medicine, law, history, physics, math, ethics... The model answers multiple-choice questions at university/grad level.
Measures: breadth of factual knowledge. Great indicator of whether the model is a well-rounded generalist. Does not test deep reasoning.
GPQA
Questions created by PhDs in physics, biology, and chemistry — so hard that PhD-level non-experts score only about 34% even with unrestricted web access. Diamond = hardest subset.
Measures: deep and rigorous scientific reasoning. Signals advanced technical research capability. Does not directly test code or agentic behavior.
MATH / AIME
MATH: 12,500 competition problems (algebra, geometry, number theory, probability). AIME: the American Invitational Mathematics Examination, a US olympiad qualifier — 15 problems per exam, integer answers, no multiple choice.
Measures: step-by-step mathematical reasoning, logical rigor, and multi-step problem solving. Direct benchmark for reasoning models (o1, R1, etc).
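Scoring on these benchmarks typically ignores the chain of thought and exact-matches only a normalized final answer; for AIME that answer is an integer in 0–999. A hedged sketch (the "Answer:" marker convention is an assumption for illustration, not any benchmark's official format):

```python
def extract_answer(solution: str) -> str:
    # Assumed convention: the model ends its solution with "Answer: <value>".
    marker = "Answer:"
    return solution.rsplit(marker, 1)[-1].strip() if marker in solution else ""

def grade_aime(solution: str, gold: int) -> bool:
    # AIME answers are integers in 0..999, so grading is strict exact match.
    ans = extract_answer(solution)
    return ans.isdigit() and 0 <= int(ans) <= 999 and int(ans) == gold
```

For example, `grade_aime("...long derivation... Answer: 204", 204)` returns `True`, while any malformed or missing final answer grades as incorrect regardless of the reasoning shown.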
TAU-bench Agentic Multi-Agent
Simulates customer support agents with real tools (databases, APIs). The agent must solve multi-step tasks by conversing with a simulated user.
Measures: tool use, decision-making in reactive environments, policy following, and error resilience. Benchmark for production agents.
GAIA
Real-world tasks humans could solve in minutes but that require the agent to web search, read PDFs, run code, and combine multiple information sources.
Measures: orchestration of heterogeneous tools, long-term planning, and real-world grounding. Widely used to evaluate agentic assistants like Deep Research.
WebArena / VisualWebArena
The agent controls a real browser and performs tasks on simulated sites (Reddit, GitLab, e-commerce, maps). VisualWebArena adds visual perception (screenshots).
Measures: autonomous web navigation, real UI interaction, sequential action planning. Reference for computer-use agents and automation agents.
Chatbot Arena (LMSYS) General Assistant
Real humans compare two anonymous models in open-ended conversations and vote for the best. The Elo ranking is computed from thousands of real preferences.
Measures: real human preference in everyday use — writing, conversation, instruction following, tone. Best proxy for "which model do people actually prefer".
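The idea of turning pairwise votes into a leaderboard can be sketched with the standard Elo update rule. Note this is illustrative: the Arena leaderboard is actually fit with a Bradley-Terry model over all votes, not updated one vote at a time, and the model names and K-factor below are assumptions.

```python
def expected(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    # Standard Elo: the winner gains in proportion to how unexpected the win was.
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e)
    ratings[loser] -= k * (1.0 - e)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
for winner, loser in votes:
    update(ratings, winner, loser)
```

After these four votes `model_a` ends above `model_b`, and because each update is zero-sum the ratings always total the same amount.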
MT-Bench General Assistant
80 multi-turn conversations across 8 categories (writing, roleplay, extraction, reasoning, math, code, etc.). GPT-4 acts as a judge evaluating responses.
Measures: instruction following in real chat context and multi-turn coherence. Closer to real chatbot use than multiple-choice benchmarks.
LiveCodeBench
Competitive programming problems collected after model training cutoffs (LeetCode, Codeforces, AtCoder) — avoids training-data contamination.
Measures: ability to solve novel algorithm and data structure problems. More contamination-resistant than HumanEval because the problem pool is continuously refreshed with post-cutoff problems.
AgentBench
Suite of 8 environments: OS, database, web, games, shopping, and more. The model acts as an autonomous agent in each interactive environment.
Measures: generalization of agentic behavior across diverse environments — crucial for evaluating frameworks like AutoGPT, CrewAI and similar.
IFEval General Assistant Agentic
Instructions with explicitly verifiable constraints: "respond in under 100 words", "use exactly 3 sections with headings", "do not use the word X".
Measures: precise instruction and constraint following — critical skill for both assistants and agents receiving detailed system prompts.
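IFEval's defining property is that every constraint can be checked by a program rather than a judge. A hedged sketch of three checkers matching the examples above (not IFEval's actual implementation):

```python
import re

def under_n_words(text: str, n: int = 100) -> bool:
    # "respond in under 100 words"
    return len(text.split()) < n

def has_exactly_n_headed_sections(text: str, n: int = 3) -> bool:
    # "use exactly 3 sections with headings" — counts markdown-style headings.
    return len(re.findall(r"^#+ ", text, flags=re.MULTILINE)) == n

def avoids_word(text: str, word: str) -> bool:
    # "do not use the word X" — whole-word, case-insensitive.
    return re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE) is None

response = "# Intro\nShort answer.\n# Body\nDetails here.\n# End\nDone."
checks = [
    under_n_words(response),
    has_exactly_n_headed_sections(response),
    avoids_word(response, "banana"),
]
```

A response is scored per constraint, so partial compliance is visible: here all three checks pass, but swapping in a response containing "banana" would fail only the third.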
SimpleQA General Assistant
Short, verifiable factual questions about the real world. Designed to measure calibration: the model should not confidently hallucinate when it does not know.
Measures: factual honesty and confidence calibration. A model with a high hallucination rate will score poorly here even with a high MMLU score.
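The calibration incentive can be made concrete with a grading scheme in this spirit: each answer is marked correct, incorrect, or not attempted, and abstaining scores better than guessing wrong. The specific penalty weighting below is an assumption for illustration, not SimpleQA's official metric.

```python
def score(grades: list, wrong_penalty: float = 1.0) -> float:
    # Abstentions ("not_attempted") score zero, so an honest "I don't know"
    # beats a confident wrong answer, which costs points.
    correct = grades.count("correct")
    incorrect = grades.count("incorrect")
    return correct - wrong_penalty * incorrect

honest = ["correct", "not_attempted", "not_attempted", "correct"]
guesser = ["correct", "incorrect", "incorrect", "correct"]
```

Both models answer the same two questions correctly, but the honest model scores 2.0 while the guesser scores 0.0 — the gap is entirely due to calibration, not knowledge.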