Harness Engineering — Development in the Age of Agents

Synthesis of three key sources: OpenAI experiment (Ryan Lopopolo), Mitchell Hashimoto’s adoption journey, and Birgitta Böckeler’s analysis (Thoughtworks/Martin Fowler).


Definition

Harness Engineering — the discipline of designing environments, tools, and feedback loops so AI agents do reliable work. Humans steer, agents execute.

“When the agent struggles, we treat it as a signal: identify what is missing — tools, guardrails, documentation — and feed it back into the repository.” — OpenAI

“Harness engineering: anytime an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.” — Mitchell Hashimoto


Three Components of Harness (Böckeler/Fowler)

1. Context Engineering — context as code

Continuously improved knowledge base in the repository:

AGENTS.md           <- table of contents (~100 lines)
ARCHITECTURE.md     <- domain and layer map
docs/
├── design-docs/    <- design decisions (with verification status)
├── exec-plans/     <- active and completed plans
│   ├── active/
│   └── completed/
├── product-specs/  <- product specifications
├── references/     <- llms.txt for dependencies
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

Key OpenAI insight: a single monolithic AGENTS.md is an anti-pattern. It pollutes the context window, rots as the code changes, and cannot be validated mechanically.
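The table-of-contents style *can* be validated mechanically. A minimal sketch of such a check, assuming a hypothetical 120-line budget and standard markdown links (this is illustrative, not OpenAI's actual tooling):

```python
import re
from pathlib import Path

MAX_LINES = 120  # hypothetical budget: AGENTS.md is a ToC, not a manual

def lint_agents_md(path: str = "AGENTS.md") -> list[str]:
    """Check that AGENTS.md stays short and that its relative links resolve."""
    errors = []
    file = Path(path)
    text = file.read_text()
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        errors.append(
            f"{path}: {len(lines)} lines, budget is {MAX_LINES}; move detail into docs/"
        )
    # Every markdown link target that is not an external URL must exist on disk.
    for target in re.findall(r"\]\(([^)#]+)\)", text):
        if not target.startswith(("http://", "https://")):
            if not (file.parent / target).exists():
                errors.append(f"{path}: broken link -> {target}")
    return errors
```

Run as a pre-commit hook or CI step, this makes the "table of contents" property enforceable rather than aspirational.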

2. Architectural Constraints — boundaries + freedom

Hard boundaries + freedom inside:

“Enforce boundaries centrally, allow autonomy locally.” — OpenAI
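One way to enforce such boundaries centrally is a structural test that fails the build when a lower layer imports a higher one. A sketch, assuming hypothetical layer names (`domain`, `services`, `api`) as top-level packages:

```python
import ast
from pathlib import Path

# Hypothetical layering: lower numbers must not import from higher ones.
LAYERS = {"domain": 0, "services": 1, "api": 2}

def layer_of(module: str):
    """Map a dotted module name to its layer, or None if unlayered."""
    return LAYERS.get(module.split(".")[0])

def check_dependency_direction(src_root: str) -> list[str]:
    """Report every import that points from a lower layer to a higher one."""
    violations = []
    for path in Path(src_root).rglob("*.py"):
        rel = path.relative_to(src_root).with_suffix("")
        importer = layer_of(".".join(rel.parts))
        if importer is None:
            continue
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.Import):
                names = [a.name for a in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                imported = layer_of(name)
                if imported is not None and imported > importer:
                    violations.append(
                        f"{path}: layer {importer} imports {name} (layer {imported})"
                    )
    return violations
```

Inside each layer the agent is free to structure code as it likes; the test only guards the direction of dependencies.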

3. Garbage Collection — fighting entropy

Agents replicate existing patterns, including bad ones. Without GC, code degrades.

OpenAI: 20% of engineering time (Fridays) previously went to manually cleaning up “AI slop”. That doesn’t scale.


6 Steps of Adoption (Mitchell Hashimoto)

Step 1: Drop the chatbot

The chat interface (ChatGPT, Gemini web) is a dead end for serious development. Use an agent: an LLM that reads files, runs programs, and makes HTTP requests.

Step 2: Reproduce your own work

Do a task manually, then make the agent do the same work to the same quality. Painful, but it builds expertise.

Negative space value: understanding when not to use the agent saves the most time.

Step 3: End-of-day agents

Block 30 minutes at the end of the day to launch agent runs. Don’t try to do more during work hours; get more done in the off hours instead.

What works:

Step 4: Outsource slam dunks

Tasks where the agent almost certainly succeeds: let them run in the background. Turn off desktop notifications; the human decides when to context-switch.

“Turn off agent desktop notifications. Context switching is expensive.”

Step 5: Engineer the Harness

Every agent mistake -> engineering solution so it never happens again. Two mechanisms:

  1. AGENTS.md / CLAUDE.md — for simple problems (wrong commands, wrong APIs)
  2. Programmatic tools — scripts, screenshots, filtered tests

“Each line in that file is based on a bad agent behavior, and it almost completely resolved them all.”
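The second mechanism can be as small as a wrapper that keeps the agent’s context clean by showing it only failures. A sketch (the markers and default command are illustrative assumptions, not Hashimoto’s actual scripts):

```python
import subprocess

def run_filtered(cmd: list[str],
                 keep: tuple[str, ...] = ("FAIL", "ERROR", "error:")) -> int:
    """Run a command, print only the lines the agent needs, return its exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    for line in proc.stdout.splitlines() + proc.stderr.splitlines():
        if any(marker in line for marker in keep):
            print(line)
    return proc.returncode
```

The agent calls this instead of the raw test command, so a thousand lines of passing output never enter its context, while every failure (and its remediation message) does.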

Step 6: Always have an agent running

Goal: agent always running. If not, ask: “what could the agent be doing for me?”

Preference: slow, thoughtful models (Amp deep mode / GPT-5.2-Codex). 30+ min per task, but high quality. One agent, not parallel.


OpenAI Experiment: Numbers

Metric                 Value
Engineers              3 -> 7
Duration               5 months
Code                   ~1M lines
Pull requests          ~1,500
PRs/engineer/day       3.5
Human-written code     0 lines
Estimated speedup      ~10x
Max single Codex run   6+ hours

Autonomy Levels (achieved)

One prompt -> agent can:

  1. Validate codebase state
  2. Reproduce bug
  3. Record demo video
  4. Implement fix
  5. Validate fix via UI
  6. Record second video
  7. Open PR
  8. Respond to feedback (agent and human)
  9. Detect and fix build failures
  10. Escalate to human only when judgment needed
  11. Merge

Tools for Legibility


Practical Recommendations

  1. CLAUDE.md as table of contents — keep it to ~100 lines with links to deeper docs
  2. docs/ as system of record — design docs, execution plans, quality scores
  3. Custom linters with agent-friendly messages — remediation instructions right in the error
  4. Structural tests — check dependency direction, file sizes
  5. Doc-gardening — periodic agent for cleaning stale docs
  6. End-of-day agents — issue triage, research, background tasks
  7. Harness per project — each project = its own CLAUDE.md + docs/ + linters
  8. “Boring” tech — prefer stable, composable technologies
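Recommendation 3 in practice: the linter’s error message carries the fix, so the agent never has to guess. A sketch with a hypothetical rule and a hypothetical docs path (both are assumptions for illustration):

```python
import re
from pathlib import Path

# Hypothetical project rule: no direct print() in library code; use the logger.
RULE = re.compile(r"^\s*print\(")
REMEDIATION = (
    "Replace print() with logging.getLogger(__name__).info(...). "
    "See docs/design-docs/logging.md for the project convention."
)

def lint_file(path: Path) -> list[str]:
    """Emit errors that tell the agent exactly how to fix them."""
    errors = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if RULE.search(line):
            errors.append(f"{path}:{lineno}: direct print() call. Fix: {REMEDIATION}")
    return errors
```

The remediation text turns every lint failure into a self-contained instruction, which is what makes custom linters useful as harness components rather than just gatekeepers.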

Harness Health Checklist

Anti-Patterns


Predictions (Böckeler/Fowler)

  1. Harness as new service templates — organizations will create harness templates for main stacks
  2. Tech stack convergence — AI pushes toward fewer stacks, “AI-friendliness” as selection criterion
  3. Topology convergence — project structures will become more standard (stable data shapes, modular boundaries)
  4. Two worlds — greenfield with harness vs. retrofit on legacy (different approaches)

Further Reading

Curated from awesome-harness-engineering. Grouped by actionability for solo-factory workflow.

Context & Memory (improve pipeline context efficiency)

Long-Running Agents (improve /build, /pipeline)

Agent Design Patterns (audit skills against)

Evals & Quality (improve /retro, /skill-audit, /review)

Safe Autonomy (improve sandboxing, guardrails)

Multi-Agent (improve agent-teams, /swarm)


Sources