
Claude Code anatomy — 98.4% infrastructure, 1.6% AI decisions

Key Takeaways

Liu et al. (VILA-Lab, April 2026) reverse-engineered Claude Code v2.1.88 and counted the bytes. Only 1.6% of the TypeScript codebase is LLM decision logic. The other 98.4% is operational infrastructure — context compaction, tool routing, permissions, session persistence, error recovery. This is the first empirical measurement of the harness engineering thesis: as base model capabilities converge, the moat moves from prompting tricks to deterministic scaffolding around the model.

The system funnels every interaction (slash commands, hooks, MCP calls, subagent spawns) into a single unified agentic loop, formalized as S_{t+1} = compact(S_t ∪ {A_t, O_t}), where A_t = f_LLM(S_t) is the action the model chooses and O_t is the observation produced by executing it. Seven components separate reasoning from execution. The LLM is trusted to make local decisions; the harness enforces global safety, memory, and recovery.
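The update rule above can be sketched in a few lines of Python. This is a toy illustration, not Claude Code's implementation: the state is assumed to be a plain list of strings (so the union becomes an append), and `run_loop`, `toy_llm`, `toy_execute`, and `keep_last_four` are hypothetical names introduced only for this example.

```python
from typing import Callable, List

State = List[str]  # assumption: the state S_t is a flat list of transcript entries


def run_loop(
    state: State,
    f_llm: Callable[[State], str],      # A_t = f_LLM(S_t)
    execute: Callable[[str], str],      # runs A_t and returns the observation O_t
    compact: Callable[[State], State],  # keeps the state inside its budget
    max_steps: int = 10,
) -> State:
    """Iterate S_{t+1} = compact(S_t + [A_t, O_t]) until the model stops."""
    for _ in range(max_steps):
        action = f_llm(state)
        if action == "stop":
            break
        observation = execute(action)
        state = compact(state + [action, observation])
    return state


# Toy stand-ins so the loop can run without a model.
def toy_llm(state: State) -> str:
    return "stop" if len(state) >= 4 else f"step-{len(state)}"


def toy_execute(action: str) -> str:
    return f"ok:{action}"


def keep_last_four(state: State) -> State:
    return state[-4:]


final_state = run_loop(["user: hello"], toy_llm, toy_execute, keep_last_four)
```

The point of the sketch is the shape: the model call is one line, and everything else (execution, compaction, the stop condition) belongs to the harness.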

Architecture map

flowchart TB
    User([User input]) --> Loop{"Unified<br/>agentic loop<br/>S_t+1 = compact S_t ∪ A_t O_t"}
    subgraph Reasoning["1.6% — LLM decisions"]
        LLM["f_LLM(S_t) → A_t"]
    end
    subgraph Harness["98.4% — operational infrastructure"]
        Ctx["Context compaction<br/>5-layer pipeline"]
        Perm["Deny-first<br/>permission gate<br/>G A_t S_t = 1"]
        Tools["Tool router<br/>hooks • skills •<br/>plugins • MCP"]
        Mem["Session memory<br/>append-only JSONL<br/>+ worktree isolation"]
        Rec["Error recovery<br/>+ subagent dispatch"]
    end
    Loop --> LLM
    LLM --> Perm
    Perm -->|approved| Tools
    Perm -->|denied| User
    Tools --> Ctx
    Ctx --> Mem
    Mem --> Loop
    Rec -.->|fallback| Loop
    classDef tiny fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef huge fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    class LLM tiny
    class Ctx,Perm,Tools,Mem,Rec huge

The loop itself is trivial; the gravity is in the five boxes around it.

The five-layer context compaction pipeline

Bounded token windows are managed by a lazy-degradation ladder — cheapest operation first:

  1. Budget reduction — swap oversized raw outputs for reference pointers
  2. Snip — trim older, less relevant history
  3. Microcompact — fine-grained, cache-aware compression to preserve prompt caching economics
  4. Context collapse — read-time projection that merges messages visually without deleting
  5. Auto-compact — invoke the model itself to summarize state

This sequential graduation maximizes working memory while minimizing cache invalidation. It is context engineering operationalized — not a single CLAUDE.md file, but a whole compression economy.
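The "cheapest operation first" idea can be sketched as a list of stages that run only while the context is still over budget. This is a minimal sketch under stated assumptions, not the paper's pipeline: tokens are approximated by whitespace-separated words, only three of the five rungs are modeled (microcompact and context collapse depend on cache layout and read-time rendering, which a toy cannot represent), and every function name here is hypothetical.

```python
from typing import Callable, List, Tuple

Messages = List[str]


def count_tokens(messages: Messages) -> int:
    # Assumption: whitespace-separated words stand in for tokens.
    return sum(len(m.split()) for m in messages)


def budget_reduction(messages: Messages) -> Messages:
    # Rung 1: swap any oversized entry (over 50 words) for a short reference pointer.
    return [m if len(m.split()) <= 50 else f"<ref:{i}>" for i, m in enumerate(messages)]


def snip(messages: Messages) -> Messages:
    # Rung 2: drop the older half of the history and keep the recent half.
    return messages[len(messages) // 2:]


def auto_compact(messages: Messages) -> Messages:
    # Rung 5: placeholder for a model-written summary, the most expensive step.
    return [f"<summary of {len(messages)} messages>"]


LADDER: List[Callable[[Messages], Messages]] = [budget_reduction, snip, auto_compact]


def compact(messages: Messages, budget: int) -> Tuple[Messages, List[str]]:
    """Apply stages in order, stopping as soon as the budget is met."""
    applied: List[str] = []
    for stage in LADDER:
        if count_tokens(messages) <= budget:
            break
        messages = stage(messages)
        applied.append(stage.__name__)
    return messages, applied


history = ["short note", "word " * 200, "another short note", "final question here"]
light, light_steps = compact(history, budget=20)
heavy, heavy_steps = compact(history, budget=5)
```

With a budget of 20 only the first rung fires, because replacing the one oversized output is enough. With a budget of 5 the ladder runs all the way to the summary step, which is the behavior the lazy-degradation framing describes.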

Deny-first permissions and the 93% problem

Authorization runs on deny-first logic: explicit declarative rules → trust mode (strict planning → auto-execution) → optional ML classifier for intent safety. Action executes only if G(A_t, S_t) = 1. Permissions are deliberately not serialized across session resumption — safety-critical state is ephemeral by design.
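A deny-first gate of this kind might look like the sketch below. It is an illustration only: the glob-style rule syntax, the `Policy` fields, and the `gate` function are assumptions made for the example, and the optional ML classifier stage is left out. Returning `"allow"` corresponds to G(A_t, S_t) = 1.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List


@dataclass
class Policy:
    deny: List[str] = field(default_factory=list)   # hypothetical glob rules
    allow: List[str] = field(default_factory=list)  # hypothetical glob rules
    auto_execute: bool = False                      # hypothetical trust-mode switch


def gate(action: str, policy: Policy) -> str:
    """Return 'deny', 'allow', or 'ask' for an action string."""
    # Deny rules are checked first, so they win over any allow rule.
    if any(fnmatchcase(action, pattern) for pattern in policy.deny):
        return "deny"
    if any(fnmatchcase(action, pattern) for pattern in policy.allow):
        return "allow"
    # Nothing matched: the trust mode decides between auto-run and prompting.
    return "allow" if policy.auto_execute else "ask"


policy = Policy(deny=["rm *", "git push*"], allow=["git *", "ls*"])
decisions = {
    "git push origin main": gate("git push origin main", policy),
    "git status": gate("git status", policy),
    "curl example.com": gate("curl example.com", policy),
}
```

Here `git push origin main` matches both an allow rule and a deny rule, and the deny rule wins because it is evaluated first. Note that `Policy` is built fresh in memory; in keeping with the text, nothing in this sketch persists a decision across sessions.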

The paper’s uncomfortable finding: ~93% of permission prompts are approved by users. The gate exists but approval fatigue compromises it. Deterministic guardrails solve the machine-side problem but create a human discipline problem. Any serious harness design has to account for this — more prompts ≠ more safety once fatigue saturates.

Extension mechanisms ranked by token cost

The paper maps four extension tiers by their default context footprint — matching our own lazy-load vs eager-load observation:

Implication: 150 skills is fine. With 10 MCPs, the context is already half-full before the agent does any work.
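The arithmetic behind that implication can be made concrete with placeholder numbers. Every constant below is an assumption chosen for illustration, not a figure from the paper: a 200,000-token window, roughly 100 resident tokens per lazy-loaded skill, and roughly 10,000 resident tokens per eagerly loaded MCP server.

```python
CONTEXT_WINDOW = 200_000    # assumed window size in tokens
SKILL_STUB_TOKENS = 100     # assumed: lazy-loaded, only a short description is resident
MCP_SCHEMA_TOKENS = 10_000  # assumed: eager-loaded, full tool schemas are resident


def resident_fraction(count: int, tokens_each: int, window: int = CONTEXT_WINDOW) -> float:
    """Share of the context window occupied before the agent does any work."""
    return count * tokens_each / window


skills_share = resident_fraction(150, SKILL_STUB_TOKENS)  # 15,000 of 200,000 tokens
mcp_share = resident_fraction(10, MCP_SCHEMA_TOKENS)      # 100,000 of 200,000 tokens
```

Under these assumed values, 150 skills occupy 7.5% of the window while 10 MCP servers occupy 50%. The real per-item costs will differ; what carries over is that the eager-loaded tier scales with full schema size and the lazy-loaded tier scales with stub size.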

Architectural contrast

| System | Safety approach | Execution model |
|---|---|---|
| Claude Code | Deny-first permissions + worktree isolation | Single Unix-like agentic loop, minimally invasive |
| SWE-Agent / OpenHands | Heavy Docker containerization | Isolated execution environment |
| LangGraph | Typed state graph with nodes and edges | Constrain cognitive paths |
| Aider | Git as safety net (rollback) | VCS-centric |

Claude Code’s bet: trust the model’s local reasoning, constrain only execution. This is the opposite of LangGraph-style cognitive scaffolding. The richer the model, the more the answer moves from “shape the thoughts” to “shape the environment.”

Validated on our own stack

We saw the same imbalance independently while building Agent-Bit for PAC1. Our Rust sgr-agent framework scored 93% on BitGN with Nemotron 120B (free), beating GPT-5.4 ($54/day, 77%), because the architecture carries the weight, not the model:

Empirical parallel: paper says 98.4% infra / 1.6% AI logic. Our experience says architecture ≈ 80% of outcome, model ≈ 20%. Same direction, same moat.

Structural trade-offs

The paper doesn’t only celebrate. It names the costs:

The architectural tension: immediate capability amplification vs long-term preservation of codebase coherence and human mastery. Harness design has to make a deliberate bet on this axis, not just maximize task throughput.

Related