2026-04-09 catalog agents benchmarks evaluation methodology tools

Agent Benchmarks

Benchmarks answer the question: “Does my agent actually work, or does it just look impressive in demos?” This page catalogs the benchmarks that matter for coding agents and AI toolkits.

Coding Agent Benchmarks

Benchmark	What it tests	Tasks	Key metric	Link
SWE-bench	Fix real GitHub issues	2,294 (full) / 300 (Verified)	% resolved	swebench.com
SWE-bench ES3	Extended real-world SWE tasks	Expanded set	% resolved	swe-bench.github.io
Terminal-Bench 2.0	Terminal agent tasks (kernel build, git servers, async debug, CTF)	100 validated	Pass rate (k=5 trials)	tbench.ai
Aider Polyglot	Code generation across 6 languages incl. Rust	225 (Exercism)	% passing tests	github.com/Aider-AI
PAC-1	Personal & trustworthy agents	43 tasks	Score / 43	bitgn.com/l/pac1-dev
PinchBench	LLM as OpenClaw coding agent	23 tasks, 8 categories	Averaged score	pinchbench.com

SWE-bench

The industry standard. Real GitHub issues from popular Python repos → agent must understand the issue, find relevant code, write a fix, pass tests. SWE-bench Verified (300 human-validated tasks) is the trusted subset.

Why it matters: closest to real engineering work. Not toy tasks — real bugs in Django, Flask, scikit-learn.

Current top agents: consistently improving — from 4% (Jan 2024) to 70%+ (mid 2026).

SWE-bench ES3

Extended version — broader range of tasks, more languages, harder edge cases. The next generation of SWE-bench.

PAC-1 (Personal & Trustworthy Agents)

BitGN benchmark for personal AI agents. 43 tasks, max score 43.0. DEV leaderboard is active — top agents score perfect 43/43.

We participate here. Focus: personal agent reliability, trustworthiness, real-world task completion.

PinchBench

Evaluates LLM models as OpenClaw coding agents across 8 categories:

Productivity — calendar, scheduling, time parsing
Research — stock prices, conferences, data extraction
Writing — blog posts, emails, content
Coding — scripts, file operations
Analysis — spreadsheets, PDFs, document processing
Email — inbox management, triage
Memory — long-term recall, context retrieval
Skills — ecosystem integration, skill discovery

Scoring: automated grading + LLM judge (nuanced quality). 23 real-world tasks. Results on pinchbench.com.

Key insight: tests practical agent capabilities (tool usage, multi-step reasoning, ambiguous instructions), not isolated LLM knowledge.

Terminal-Bench 2.0

The de-facto standard for terminal agent evaluation. 100 tasks validated by humans and LMs for solvability and realism. Tasks: kernel builds, git server setup, async debugging, COBOL modernization, security CTFs.

Harness: Harbor framework — rewritten from scratch for reliability and parallelism. Built-in agents: Claude Code, Codex, Terminus-2, Aider, Oracle (ground truth).

Running your own agent: BaseInstalledAgent (if CLI) or BaseAgent (full control via perform_task()). Custom agent wrapper is ~40 lines.

Practical notes:

Apple Silicon: x86 container conflicts — use Daytona/Modal instead of local Docker
Some tasks run 2+ hours (COBOL modernization)
Leaderboard requires k=5 trials minimum (high variance: DeepAgents showed 40.4% vs 44.9% between runs)
~$10-20 via Daytona for a full parallel run (32 trials/hour)

Aider Polyglot

225 Exercism tasks across 6 languages (Python, Rust, Go, JS, Java, C#). Simpler than TB2 — cargo test / pytest validation, no containers needed for Mac.

Format: 2 attempts per task. First solo, second with test feedback. Close to real agentic loop.

Why useful for agent comparison: headless mode for all major agents — rust-code -p, codex exec --full-auto, claude -p. Same dataset = comparable scores. Can filter --languages rust for Rust-only subset (~50 tasks).

Key metrics: pass rate attempt 1 / attempt 2, avg loop steps, token usage per task, edit accuracy (valid patch on first try).

Memory Benchmarks

Benchmark	What it tests	Memories	Key metric	Link
LongMemEval	Long-term memory in agents	22K+	Retrieval accuracy	arxiv

LongMemEval

Tests: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention (knowing when you don’t know).

Notable result: MemPalace scored 96.6%. Structure (spatial hierarchy) beat raw embeddings. Verbatim storage beat compression.

Skill Benchmarks

Benchmark	What it tests	Focus
Superpowers TDD enforcement	Does agent actually write tests first?	Process compliance
solo-factory /retro	Pipeline quality after full run	End-to-end methodology

These aren’t public leaderboards — they’re built-in evaluation in agent toolkits. solo-factory’s /retro skill evaluates pipeline effectiveness after a run. Superpowers enforces TDD and deletes code written before tests.

Our Participation

PAC-1 — active participation on DEV leaderboard
PinchBench — relevant for OpenClaw/Moltis ecosystem evaluation
SWE-bench — reference benchmark for coding agents
LongMemEval — reference for memory systems

Continuous Eval: The Missing Layer

Public benchmarks run once. Real systems degrade daily. The Zero Harness eval engine (NeuraLab) proposes eval as a permanent loop, not a one-time check:

Routing Eval

Daily control cases for agent routing. Not abstract benchmarks — real questions matching system roles (memory lookup, code implementation, research, browser-required flows). Catches architectural drift fast: e.g., browser-required cases routing to skill runner instead of browser agent.

Skills Eval

Formal gate before a skill is “ready”:

Has trigger + when_to_use / when_not_to_use
Has required_tools + fallback strategy
Has eval cases
Passes threshold (e.g., 0.8)
Doesn’t break routing around itself

A skill isn’t done because “it worked once.” It must pass formal checks. This maps directly to skill marketplace quality: skills.sh audits, solo-factory /skill-audit.

Browser Eval

14-day comparison of browser candidates across task types. Metrics: success rate, latency, determinism, token overhead, retry rate, composite score. Key finding: lightweight-first + controlled heavy fallback beats browser-first (which adds unnecessary complexity).

Why Continuous Eval Matters

Without permanent eval, degradation is invisible:

Routing drifts
New skill breaks old path
Browser candidate becomes unstable
Latency grows unnoticed

The agent continues to look confident externally. Eval is insurance against self-confident agents.

Next frontier: completion control — agent can’t self-declare “done” until facts, artifacts, and smoke-checks confirm it. Less spectacular than “autonomous agent that does everything.” Closer to production than an investor demo.

What Benchmarks Miss

Public benchmarks test isolated capabilities. They don’t test:

Pipeline quality — can the agent ship a complete product? (research → code → deploy → launch)
Harness effectiveness — does the harness prevent real mistakes?
Architectural drift — does routing degrade over days/weeks? (Zero Harness eval engine)
Compound learning — does the agent get better over time? (decision-traces-compound)
Cross-session memory — does context survive between runs? (agent-memory-architecture)
Completion honesty — does the agent actually finish, or just claim it did?
Real user value — did anyone actually use the thing it built?

The best benchmark is shipping real products. Everything else is a proxy.

Sources

NeuraLab / Zero Harness — practical eval engine for agent routing, skills, browser; continuous eval loop