Claude Code anatomy — 98.4% infrastructure, 1.6% AI decisions
Key Takeaways
Liu et al. (VILA-Lab, April 2026) reverse-engineered Claude Code v2.1.88 and counted the bytes. Only 1.6% of the TypeScript codebase is LLM decision logic. The other 98.4% is operational infrastructure — context compaction, tool routing, permissions, session persistence, error recovery. This is the first empirical measurement of the harness engineering thesis: as base model capabilities converge, the moat moves from prompting tricks to deterministic scaffolding around the model.
The system funnels every interaction — slash commands, hooks, MCP calls, subagent spawns — into a single unified agentic loop formalized as S_{t+1} = compact(S_t ∪ {A_t, O_t}), where A_t = f_LLM(S_t). Seven components separate reasoning from execution. The LM is trusted to make local decisions; the harness enforces global safety, memory, and recovery.
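A minimal sketch of that update rule may make the loop easier to picture. Everything named here (`State`, `step`, the `max_events` cap, the `A:`/`O:` prefixes) is an illustrative assumption rather than Claude Code's actual API, and `compact` is a deliberately naive stand-in that just keeps the most recent events.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class State:
    """S_t: the transcript the model sees at step t (hypothetical shape)."""
    events: list[str] = field(default_factory=list)


def compact(state: State, max_events: int = 6) -> State:
    """Naive placeholder for compaction: keep only the newest events."""
    return State(events=state.events[-max_events:])


def step(
    state: State,
    f_llm: Callable[[State], str],
    execute: Callable[[str], str],
) -> State:
    """One loop iteration: S_{t+1} = compact(S_t ∪ {A_t, O_t})."""
    action = f_llm(state)          # A_t = f_LLM(S_t)
    observation = execute(action)  # O_t, produced by the harness
    merged = State(events=state.events + [f"A:{action}", f"O:{observation}"])
    return compact(merged)
```

Note that the model contributes exactly one line of this function. Everything else (execution, merging, compaction) is harness code, which is the proportion the paper measures.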
Architecture map
flowchart TB
User([User input]) --> Loop{"Unified<br/>agentic loop<br/>S_t+1 = compact(S_t ∪ {A_t, O_t})"}
subgraph Reasoning["1.6% — LLM decisions"]
LLM["f_LLM(S_t) → A_t"]
end
subgraph Harness["98.4% — operational infrastructure"]
Ctx["Context compaction<br/>5-layer pipeline"]
Perm["Deny-first<br/>permission gate<br/>G(A_t, S_t) = 1"]
Tools["Tool router<br/>hooks • skills •<br/>plugins • MCP"]
Mem["Session memory<br/>append-only JSONL<br/>+ worktree isolation"]
Rec["Error recovery<br/>+ subagent dispatch"]
end
Loop --> LLM
LLM --> Perm
Perm -->|approved| Tools
Perm -->|denied| User
Tools --> Ctx
Ctx --> Mem
Mem --> Loop
Rec -.->|fallback| Loop
classDef tiny fill:#fef3c7,stroke:#d97706,stroke-width:2px
classDef huge fill:#dbeafe,stroke:#2563eb,stroke-width:2px
class LLM tiny
class Ctx,Perm,Tools,Mem,Rec huge
The loop itself is trivial; the gravity is in the five boxes around it.
The five-layer context compaction pipeline
Bounded token windows are managed by a lazy-degradation ladder — cheapest operation first:
- Budget reduction — swap oversized raw outputs for reference pointers
- Snip — trim older, less relevant history
- Microcompact — fine-grained, cache-aware compression to preserve prompt caching economics
- Context collapse — read-time projection that merges messages visually without deleting
- Auto-compact — invoke the model itself to summarize state
This sequential graduation maximizes working memory while minimizing cache invalidation. It is context engineering operationalized — not a single CLAUDE.md file, but a whole compression economy.
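The lazy-degradation idea can be sketched as a loop that applies stages in cost order and stops as soon as the transcript fits the budget. This is a simplified illustration, not the paper's implementation: only three of the five stages are modeled (microcompact and context collapse depend on cache and rendering details that are out of scope here), the word-count tokenizer is a crude stand-in, and all function names and thresholds are assumptions.

```python
from typing import Callable

Messages = list[str]


def token_count(messages: Messages) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return sum(len(m.split()) for m in messages)


def budget_reduction(messages: Messages) -> Messages:
    """Swap oversized outputs (over 20 words, an assumed limit) for pointers."""
    return [m if len(m.split()) <= 20 else f"<ref:{i}>" for i, m in enumerate(messages)]


def snip(messages: Messages) -> Messages:
    """Drop the older half of the history."""
    return messages[len(messages) // 2:]


def auto_compact(messages: Messages) -> Messages:
    """Placeholder for a model-written summary of the whole state."""
    return ["<summary>"]


# Cheapest stage first, most expensive last.
LADDER: list[tuple[str, Callable[[Messages], Messages]]] = [
    ("budget_reduction", budget_reduction),
    ("snip", snip),
    ("auto_compact", auto_compact),
]


def compact_lazily(messages: Messages, budget: int) -> tuple[Messages, list[str]]:
    """Apply stages in order, stopping once the transcript fits the budget."""
    applied: list[str] = []
    for name, stage in LADDER:
        if token_count(messages) <= budget:
            break
        messages = stage(messages)
        applied.append(name)
    return messages, applied
```

The early `break` is the point of the design: a transcript that fits after pointer substitution never pays for a model-generated summary.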
Deny-first permissions and the 93% problem
Authorization runs on deny-first logic: explicit declarative rules → trust mode (strict planning → auto-execution) → optional ML classifier for intent safety. Action executes only if G(A_t, S_t) = 1. Permissions are deliberately not serialized across session resumption — safety-critical state is ephemeral by design.
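One way to read that evaluation order is as a function returning 0 or 1. The sketch below is an assumption-laden illustration: `Rule`, the glob patterns, the `mode` strings, and the `classifier` callable are all hypothetical names, and the real system's "ask the user" outcome is collapsed into 0 for brevity.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    pattern: str  # glob over the action string, e.g. "rm *"
    allow: bool   # True = allow rule, False = deny rule


def gate(
    action: str,
    rules: list[Rule],
    mode: str = "plan",
    classifier: Optional[Callable[[str], bool]] = None,
) -> int:
    """Return G(A_t, S_t): 1 if the action may execute, otherwise 0."""
    matches = [r for r in rules if fnmatchcase(action, r.pattern)]
    if any(not r.allow for r in matches):
        return 0  # an explicit deny always wins, even over a matching allow
    if any(r.allow for r in matches):
        return 1  # explicitly allowed by a declarative rule
    if mode != "auto":
        return 0  # no rule matched and trust mode is strict: default deny
    if classifier is not None and not classifier(action):
        return 0  # optional intent classifier vetoes the action
    return 1
```

The ordering matters more than the individual checks: deny rules are consulted before anything else, so no later layer can widen what an explicit deny has closed.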
The paper’s uncomfortable finding: ~93% of permission prompts are approved by users. The gate exists but approval fatigue compromises it. Deterministic guardrails solve the machine-side problem but create a human discipline problem. Any serious harness design has to account for this — more prompts ≠ more safety once fatigue saturates.
Extension mechanisms ranked by token cost
The paper maps four extension tiers by their default context footprint — matching our own lazy-load vs eager-load observation:
- Hooks — zero default context (fire on event, bring nothing into the prompt)
- Skills — minimal instruction sets (lazy-load on trigger)
- Plugins — medium-cost packaging
- MCP servers — high token budget due to complex schemas (eager-load on connect)
Implication: 150 skills are affordable, because each one costs almost nothing until triggered. Ten MCP servers, by contrast, can leave the context roughly half full before the agent does any work.
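The arithmetic behind that implication can be made explicit. The per-item token costs and the window size below are invented placeholders chosen only to illustrate the shape of the calculation; they are not measurements from the paper, and real MCP schemas vary widely.

```python
# Assumed context window size in tokens (illustrative, not from the paper).
CONTEXT_WINDOW = 200_000

# Assumed default footprint per item, i.e. tokens loaded before any work starts.
DEFAULT_COST = {
    "hook": 0,             # fires on events, adds nothing to the prompt
    "skill": 100,          # short description only; body lazy-loads on trigger
    "plugin": 1_000,       # medium-cost packaging
    "mcp_server": 10_000,  # full tool schemas eager-load on connect
}


def startup_footprint(counts: dict[str, int]) -> int:
    """Tokens consumed at session start for the given extension counts."""
    return sum(DEFAULT_COST[kind] * n for kind, n in counts.items())


def fraction_used(counts: dict[str, int]) -> float:
    """Share of the context window consumed before the first user turn."""
    return startup_footprint(counts) / CONTEXT_WINDOW
```

Under these assumed numbers, `fraction_used({"skill": 150})` comes to 7.5% of the window while `fraction_used({"mcp_server": 10})` comes to 50%. The gap comes from the load strategy rather than from the item count.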
Architectural contrast
| System | Safety approach | Execution model |
|---|---|---|
| Claude Code | Deny-first permissions + worktree isolation | Single Unix-like agentic loop, minimally invasive |
| SWE-Agent / OpenHands | Heavy Docker containerization | Isolated execution environment |
| LangGraph | Typed state graph with nodes and edges | Constrain cognitive paths |
| Aider | Git as safety net (rollback) | VCS-centric |
Claude Code’s bet: trust the model’s local reasoning, constrain only execution. This is the opposite of LangGraph-style cognitive scaffolding. The richer the model, the more the answer moves from “shape the thoughts” to “shape the environment.”
Validated on our own stack
We reached the same conclusion independently while building Agent-Bit for PAC1. Our Rust sgr-agent framework scored 93% on BitGN with Nemotron 120B (free), beating GPT-5.4 ($54/day, 77%), because the architecture carries the weight, not the model:
- Pipeline state machine before the LLM (classify → scan inbox → security check → ready) blocks 100% of obvious threats with zero tokens. Mirrors Claude Code’s deny-first gate, just earlier in the loop
- Trust metadata on read() ([path | trusted/untrusted]) is the micro-equivalent of the deny-first permission model: the annotation gives the LLM a safety hint without blocking reasoning
- 10 active + 5 deferred tools matches the paper’s hook/skill/plugin/MCP tiering: heavy schemas stay out of context until invoked
- FileBackend trait = Unix-like abstraction: same agent code runs against RPC (PAC1), local FS, or mock. Same “constrain execution, trust reasoning” bet Claude Code makes
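The trust annotation and the backend abstraction from the bullets above can be sketched together. This is a Python stand-in for what the note describes as a Rust trait. `MockBackend`, `annotated_read`, and the `TRUSTED_PREFIXES` rule are hypothetical names and policies introduced for illustration, not the sgr-agent API.

```python
from typing import Protocol


class FileBackend(Protocol):
    """Stand-in for the FileBackend trait: anything that can read a path."""

    def read(self, path: str) -> str: ...


class MockBackend:
    """In-memory backend; an RPC or local-FS backend would expose the same method."""

    def __init__(self, files: dict[str, str]) -> None:
        self.files = files

    def read(self, path: str) -> str:
        return self.files[path]


# Assumed trust policy: only paths under these prefixes count as trusted.
TRUSTED_PREFIXES = ("/workspace/",)


def annotated_read(backend: FileBackend, path: str) -> str:
    """Prefix file content with a [path | trusted/untrusted] header for the LLM."""
    label = "trusted" if path.startswith(TRUSTED_PREFIXES) else "untrusted"
    return f"[{path} | {label}]\n{backend.read(path)}"
```

The agent code only ever calls `annotated_read`, so swapping the backend changes where bytes come from without touching the reasoning path, and the header informs the model without blocking the read.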
Empirical parallel: paper says 98.4% infra / 1.6% AI logic. Our experience says architecture ≈ 80% of outcome, model ≈ 20%. Same direction, same moat.
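The pre-LLM pipeline from the first bullet above (classify → scan inbox → security check → ready) can be pictured as a small deterministic state machine. The stage names follow the note. The `BLOCKED` state, the substring markers, and the empty-task check are assumptions made for this sketch, and a real security check would be considerably richer.

```python
from enum import Enum, auto


class Stage(Enum):
    CLASSIFY = auto()
    SCAN_INBOX = auto()
    SECURITY_CHECK = auto()
    READY = auto()
    BLOCKED = auto()


# Hypothetical markers of an obvious injection attempt.
SUSPICIOUS = ("ignore previous instructions", "exfiltrate")


def run_pipeline(task: str, inbox: list[str]) -> Stage:
    """Walk the stages deterministically; no model call happens here."""
    stage = Stage.CLASSIFY
    texts: list[str] = []
    while stage not in (Stage.READY, Stage.BLOCKED):
        if stage is Stage.CLASSIFY:
            # Reject empty tasks outright, otherwise move on.
            stage = Stage.SCAN_INBOX if task.strip() else Stage.BLOCKED
        elif stage is Stage.SCAN_INBOX:
            # Gather everything the model would otherwise see.
            texts = [task] + inbox
            stage = Stage.SECURITY_CHECK
        elif stage is Stage.SECURITY_CHECK:
            hit = any(marker in text.lower() for text in texts for marker in SUSPICIOUS)
            stage = Stage.BLOCKED if hit else Stage.READY
    return stage
```

Only a `READY` result would hand control to the LLM, which is how the obvious cases get filtered at zero token cost.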
Structural trade-offs
The paper doesn’t only celebrate. It names the costs:
- +40.7% code complexity in projects heavily assisted by agent tooling (external studies cited) — agents optimize for task completion, not global coherence
- Subagent isolation means optimal local decisions without full global codebase awareness. Lossy compression accelerates this
- Human mastery atrophy — “developer’s neural connectivity and codebase comprehension demonstrably atrophy” with sustained delegation
- Approval fatigue (see above) — automated safety gates erode human oversight over time
The architectural tension: immediate capability amplification vs long-term preservation of codebase coherence and human mastery. Harness design has to make a deliberate bet on this axis, not just maximize task throughput.
Connections
- harness-engineering-summary — the thesis this paper validates empirically. Now we have a number: 98.4%
- context-engineering — the five-layer compaction is a concrete implementation of “context as code, not chat history”
- writing-claude-md — CLAUDE.md is the root node of the seven-component architecture; this paper shows what sits behind it
- agent-toolkit-landscape — paper explicitly frames Claude Code against LangGraph, SWE-Agent, OpenHands, Aider. Different safety bets
- agent-patterns-stream2 — the hook/skill/plugin/MCP tiering by token cost matches the lazy-load/eager-load observation
- agent-self-discipline — 93% approval = the human drift problem; deterministic gates do not solve discipline
- agent-sandboxing — worktree isolation + deny-first is the execution-level equivalent of VM sandboxing
- agent-mistake-fix-harness — fixing the harness when an agent fails assumes the harness is 98.4% of the system. Now measured
- decision-traces-compound — append-only JSONL transcripts are the memory substrate for trace-based learning
- agent-bit-pac1 — our independent validation: 93% on PAC1 with Nemotron 120B (free) because the Rust harness does 80% of the work. Same deny-first + tool tiering + Unix-like backend abstraction
- pac1-competition-retrospective — competition retrospective: architecture beats model — same lesson the Claude Code paper measures
- fff-agent-file-search — Kovalenko’s retrieval-layer thesis: compaction wipes the agent’s working set, so the tool (not the model) has to carry frecency/combo/git context
- conductor-parallel-agents — if the model is 1.6% of the system, multiplying model instances costs almost nothing; Conductor productizes that insight as a Mac GUI
References
- Liu, Zhao, Shang, Shen — Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems (arXiv:2604.14228)
- GitHub: VILA-Lab/Dive-into-Claude-Code
- Review: arxiviq.substack.com/p/dive-into-claude-code-the-design
- Target: Claude Code v2.1.88 (TypeScript source)