← Skills

/retro

After pipeline completes and needs quality assessment

install: npx skills add fortunto2/solo-factory/retro
phase: utility

This skill is self-contained — follow the phases below instead of delegating to other skills (/review, /audit, /build) or spawning Task subagents. Run all analysis directly.

# /retro

This skill is self-contained — follow the phases below instead of delegating to other skills (/review, /audit, /build) or spawning Task subagents. Run all analysis directly.

Post-pipeline retrospective. Parses pipeline logs, counts productive vs wasted iterations, identifies recurring failure patterns, scores the pipeline run, and suggests concrete patches to skills/scripts to prevent the same failures next time.

## Live Context
- Branch: !`git branch --show-current 2>/dev/null`
- Recent commits: !`git log --oneline -10 2>/dev/null`
- Modified files: !`git diff --name-only HEAD~5..HEAD 2>/dev/null | head -20`

## When to use

After a pipeline completes (or gets cancelled). This is the process quality check — `/review` checks **code quality**, `/retro` checks **pipeline process quality**.

Can also be used standalone on any project — with or without pipeline logs.

## MCP Tools (use if available)

- `session_search(query)` — find past pipeline runs and known issues
- `codegraph_explain(project)` — understand project architecture context
- `codegraph_query(query)` — query code graph for project metadata

If MCP tools are not available, fall back to Glob + Grep + Read.

## Phase 1: Locate Artifacts

1. **Detect project** from `$ARGUMENTS` or CWD:
   - If argument provided: use it as project name
   - Otherwise: extract from CWD basename (e.g., `~/projects/my-app` -> `my-app`)

2. **Find pipeline state file:** `.solo/pipelines/solo-pipeline-{project}.local.md` (project-local) or `~/.solo/pipelines/solo-pipeline-{project}.local.md` (global fallback)
   - If it exists: pipeline is still running or wasn't cleaned up — read YAML frontmatter for `project_root:`
   - If not: pipeline completed — use CWD as project root

3. **Verify artifacts exist (parallel reads):**
   - Pipeline log: `{project_root}/.solo/pipelines/pipeline.log`
   - Iter logs: `{project_root}/.solo/pipelines/iter-*.log`
   - Progress file: `{project_root}/.solo/pipelines/progress.md`
   - Plan-done directory: `{project_root}/docs/plan-done/`
   - Active plan: `{project_root}/docs/plan/`

4. **Determine analysis mode:**
   - If pipeline log exists: proceed with full log-based analysis (Phases 2-4)
   - If NO pipeline log: switch to **fallback mode** (see Fallback Analysis below)

5. **Count iter logs** (if they exist): `ls {project_root}/.solo/pipelines/iter-*.log | wc -l`
   - Report: "Found {N} iteration logs"

## Fallback Analysis (No Pipeline Logs)

If no pipeline logs exist, the retro can still provide value by analyzing:

1. **Git history:** `git log --oneline --since="1 week ago"` — commit frequency, patterns, conventional format
2. **Test results:** run test suite if configured in CLAUDE.md or package.json
3. **Build status:** run build if configured
4. **CLAUDE.md changes:** `git log --oneline -- CLAUDE.md` — how docs evolved
5. **Code quality metrics:** file counts, TODO/FIXME density, dead code indicators
6. **Project structure:** completeness of docs/, tests/, CI config

Skip Phases 2-4 and proceed directly to Phase 5 (Plan Fidelity) and Phase 6 (Git & Code Quality). Adjust Phase 7 scoring to weight available data more heavily.

## Phase 2: Parse Pipeline Log (quantitative)

Read `pipeline.log` in full. Parse line-by-line, extracting structured data from log tags:

**Log format:** `[HH:MM:SS] TAG | message`

**Extract by tag:**

| Tag | What to extract |
|-----|----------------|
| `START` | Pipeline run boundary — count restarts (multiple START lines = restarts) |
| `STAGE` | `iter N/M \| stage S/T: {stage_id}` — iteration count per stage |
| `SIGNAL` | `<solo:done/>` or `<solo:redo/>` — which stages got completion signals |
| `INVOKE` | Skill invoked — extract skill name, check for wrong names |
| `ITER` | `commit: {sha} \| result: {stage complete\|continuing}` — per-iteration outcome |
| `CHECK` | `{stage} \| {path} -> FOUND\|NOT FOUND` — marker file checks |
| `FINISH` | `Duration: {N}m` — total duration per run |
| `MAXITER` | `Reached max iterations ({N})` — hit iteration ceiling |
| `QUEUE` | Plan cycling events (activating, archiving) |
| `CIRCUIT` | Circuit breaker triggered (if present) |
| `CWD` | Working directory changes |
| `CTRL` | Control signals (pause/stop/skip) |

**Compute metrics:**

```
total_runs = count of START lines
total_iterations = count of ITER lines
productive_iters = count of ITER lines with "stage complete"
wasted_iters = total_iterations - productive_iters
waste_pct = wasted_iters / total_iterations * 100
maxiter_hits = count of MAXITER lines
plan_cycles = count of QUEUE lines with "Cycling"

per_stage = {
  stage_id: {
    attempts: count of STAGE lines for this stage,
    successes: count of ITER lines with "stage complete" for this stage,
    waste_ratio: (attempts - successes) / attempts * 100,
  }
}
```

## Phase 3: Parse Progress.md (qualitative)

Read `progress.md` and scan for error patterns:

1. **Unknown skill errors:** grep for `Unknown skill:` — extract which skill name was wrong
2. **Empty iterations:** iterations where "Last 5 lines" show only errors or session header (no actual work done)
3. **Repeated errors:** same error appearing in consecutive iterations -> spin-loop indicator
4. **Doubled signals:** `<solo:done/><solo:done/>` in same iteration -> minor noise (note but don't penalize)
5. **Redo loops:** count how many times build->review->redo->build cycles occurred

For each error pattern found, record:
- Pattern name
- First occurrence (iteration number)
- Total occurrences
- Consecutive streak (max)

## Phase 4: Analyze Iter Logs (sample-based)

Do NOT read all iter logs — could be 60+. Use smart sampling:

1. **First failed iter per pattern:** For each failure pattern found in Phase 3, read the first iter log that shows it
   - Strip ANSI codes when reading: `sed 's/\x1b\[[0-9;]*m//g' < iter-NNN-stage.log | head -100`

2. **First successful iter per stage:** For each stage that eventually succeeded, read the first successful iter log
   - Look for `<solo:done/>` in the output

3. **Final review iter:** Read the last `iter-*-review.log` (the verdict)

4. **Extract from each sampled log:**
   - Tools called (count of tool_use blocks)
   - Errors encountered (grep for `Error`, `error`, `Unknown`, `failed`)
   - Signal output (`<solo:done/>` or `<solo:redo/>` present?)
   - First 5 and last 10 meaningful lines (skip blank lines)

## Phase 5: Plan Fidelity Check

For each track directory in `docs/plan-done/` and `docs/plan/`:

1. **Read spec.md** (if exists):
   - Count acceptance criteria: total `- [ ]` and `- [x]` checkboxes
   - Calculate: `criteria_met = checked / total * 100`

2. **Read plan.md** (if exists):
   - Count tasks: total `- [ ]` and `- [x]` checkboxes
   - Count phases (## headers)
   - Check for SHA annotations (`<!-- sha:... -->`)
   - Calculate: `tasks_done = checked / total * 100`

3. **Compile per-track summary:**
   - Track ID, criteria met %, tasks done %, has SHAs

## Phase 6: Git & Code Quality (lightweight)

Quick checks only — NOT a full /review:

1. **Commit count and format:**
   ```bash
   git -C {project_root} log --oneline | wc -l
   git -C {project_root} log --oneline | head -30
   ```
   - Count commits with conventional format (`feat:`, `fix:`, `chore:`, `test:`, `docs:`, `refactor:`, `build:`, `ci:`, `perf:`)
   - Calculate: `conventional_pct = conventional / total * 100`

2. **Committer breakdown:**
   ```bash
   git -C {project_root} shortlog -sn --no-merges | head -10
   ```

3. **Test status** (if test command exists in CLAUDE.md or package.json):
   - Run test suite, capture pass/fail count
   - If no test command found, skip and note "no tests configured"

4. **Build status** (if build command exists):
   - Run build, capture success/fail
   - If no build command found, skip and note "no build configured"

## Phase 6.5: Context Degradation Analysis

Check for signs of context window problems during the pipeline run:

1. **Iteration quality curve:** Compare early iterations vs late iterations.
   - Did error rates increase over time? (sign of context degradation)
   - Did the agent start repeating itself or losing track of the plan?

2. **Observation masking usage:** Check if `scratch/` directory exists in project root.
   - If yes: good — agent was offloading large outputs
   - If no but iter logs show >100-line tool outputs: flag as waste source

3. **Plan recitation evidence:** In sampled iter logs, check if the agent re-read plan.md at task boundaries.
   - Absent recitation + task drift = context engineering gap

4. **CLAUDE.md bloat:** `wc -c {project_root}/CLAUDE.md`
   - >40,000 chars: WARN — attention dilution likely
   - >60,000 chars: RED — severe context budget pressure

Add findings to the report under `## Context Health`:
```markdown
## Context Health
- Iteration quality trend: {STABLE / DEGRADING / N/A}
- Observation masking: {USED / NOT USED / N/A}
- Plan recitation: {OBSERVED / ABSENT / N/A}
- CLAUDE.md size: {N} chars — {OK / WARN / BLOATED}
```

## Phase 7: Score & Report

Load scoring rubric from `${CLAUDE_PLUGIN_ROOT}/skills/retro/references/eval-dimensions.md`.
If plugin root not available, use the embedded weights:

**Scoring weights:**
- Efficiency (waste %): 25%
- Stability (restarts): 20%
- Fidelity (criteria met): 20%
- Quality (test pass rate): 15%
- Commits (conventional %): 5%
- Docs (plan staleness): 5%
- Signals (clean signals): 5%
- Speed (duration per track, normalized): 5%

**Note:** In fallback mode (no pipeline logs), redistribute Efficiency and Stability weights to Fidelity, Quality, and Commits.

**Generate report** at `{project_root}/docs/retro/{date}-retro.md`:

```markdown
# Pipeline Retro: {project} ({date})

## Overall Score: {N}/10

## Pipeline Efficiency

| Metric | Value | Rating |
|--------|-------|--------|
| Total iterations | {N} | |
| Productive iterations | {N} ({pct}%) | {emoji} |
| Wasted iterations | {N} ({pct}%) | {emoji} |
| Pipeline restarts | {N} | {emoji} |
| Max-iter hits | {N} | {emoji} |
| Total duration | {time} | {emoji} |
| Tracks completed | {N} | |
| Duration per track | {time/tracks} | {emoji} |

## Per-Stage Breakdown

| Stage | Attempts | Successes | Waste % | Notes |
|-------|----------|-----------|---------|-------|
| scaffold | | | | |
| setup | | | | |
| plan | | | | |
| build | | | | |
| deploy | | | | |
| review | | | | |

## Failure Patterns

### Pattern 1: {name}
- **Occurrences:** {N} iterations
- **Root cause:** {analysis}
- **Wasted:** {N} iterations
- **Fix:** {concrete suggestion with file reference}

### Pattern 2: ...

## Plan Fidelity

| Track | Criteria Met | Tasks Done | SHAs | Rating |
|-------|-------------|------------|------|--------|
| {track-id} | {N}% | {N}% | {yes/no} | {emoji} |

## Code Quality (Quick)

- **Tests:** {N} pass, {N} fail (or "not configured")
- **Build:** PASS / FAIL (or "not configured")
- **Commits:** {N} total, {pct}% conventional format

## Three-Axis Growth

| Axis | Score | Evidence |
|------|-------|----------|
| **Technical** (code, tools, architecture) | {0-10} | {what changed} |
| **Cognitive** (understanding, strategy, decisions) | {0-10} | {what improved} |
| **Process** (harness, skills, pipeline, docs) | {0-10} | {what evolved} |

If only one axis is served — note what's missing.

## Recommendations

1. **[CRITICAL]** {patch suggestion with file:line reference}
2. **[HIGH]** {improvement}
3. **[MEDIUM]** {optimization}
4. **[LOW]** {nice-to-have}

## Suggested Patches

### Patch 1: {file} — {description}

**What:** {one-line description}
**Why:** {root cause reference from Failure Patterns}

\```diff
- old line
+ new line
\```
```

**Rating guide (use these emojis):**
- GREEN = excellent
- YELLOW = acceptable
- RED = needs attention

## Phase 8: Interactive Patching

After generating the report:

1. **Show summary** to user: overall score, top 3 failure patterns, top 3 recommendations

2. **For each suggested patch** (if any), use `AskUserQuestion`:
   - Question: "Apply patch to {file}? {one-line description}"
   - Options: "Apply" / "Skip" / "Show diff first"

3. **If "Show diff first":** display the full diff, then ask again (Apply / Skip)

4. **If "Apply":** use Edit tool to apply the change directly

5. **After all patches processed:**
   - If any patches were applied: suggest committing with `fix(retro): {description}`
   - Do NOT auto-commit — just suggest the command

## Phase 9: CLAUDE.md Revision

After patching, revise the project's CLAUDE.md to keep it lean and useful for future agents.

### Steps:

1. **Read CLAUDE.md** and check size: `wc -c CLAUDE.md`
2. **Add learnings from this retro:**
   - Pipeline failure patterns worth remembering (avoid next time)
   - New workflow rules or process improvements
   - Updated commands or tooling changes
   - Architecture decisions that emerged during the pipeline run
3. **If over 40,000 characters — trim ruthlessly:**
   - Collapse completed phase/milestone histories into one line each
   - Remove verbose explanations — keep terse, actionable notes
   - Remove duplicate info (same thing explained in multiple sections)
   - Remove historical migration notes, old debugging context
   - Remove examples that are obvious from code or covered by skill/doc files
   - Remove outdated troubleshooting for resolved issues
4. **Verify result <= 40,000 characters** — if still over, cut least actionable content
5. **Write updated CLAUDE.md**, update "Last updated" date

### Priority (keep -> cut):
1. **ALWAYS KEEP:** Tech stack, directory structure, Do/Don't rules, common commands, architecture decisions
2. **KEEP:** Workflow instructions, troubleshooting for active issues, key file references
3. **CONDENSE:** Phase histories (one line each), detailed examples, tool/MCP listings
4. **CUT FIRST:** Historical notes, verbose explanations, duplicated content, resolved issues

### Rules:
- Never remove Do/Don't sections — critical guardrails
- Preserve overall section structure and ordering
- Every line must earn its place: "would a future agent need this to do their job?"
- Commit the update: `git add CLAUDE.md && git commit -m "docs: revise CLAUDE.md (post-retro)"`

## Phase 10: Factory Critic (optional)

**Run this phase only if `${CLAUDE_PLUGIN_ROOT}` is available** (i.e., solo-factory is installed). Skip if running as a standalone skill without the factory context.

After evaluating the project pipeline, step back and evaluate **the factory itself** — the skills, scripts, and pipeline logic that produced this result. Be a harsh critic.

### What to evaluate:

1. **Read the skills that were invoked** in this pipeline run (from INVOKE lines in pipeline.log):
   - For each skill: `${CLAUDE_PLUGIN_ROOT}/skills/{stage}/SKILL.md`
   - Did the skill have the right instructions for this project's needs?
   - Did it miss context it should have had?

2. **Read pipeline script** signal handling and stage logic:
   - `${CLAUDE_PLUGIN_ROOT}/scripts/solo-dev.sh`
   - Were there structural issues (wrong stage order, missing re-exec, broken redo)?

3. **Cross-reference with failure patterns** from Phase 3:
   - For each failure: was the root cause in the skill, the script, or the project?
   - Skills that caused waste = factory defects

### Score the factory (not the project):

```
Factory Score: {N}/10

Skill quality:
- {skill}: {score}/10 — {why}
- {skill}: {score}/10 — {why}

Pipeline reliability: {N}/10 — {why}

Missing capabilities:
- {what the factory couldn't do that it should have}

Top factory defects:
1. {defect} → {which file to fix} → {concrete fix}
2. {defect} → {which file to fix} → {concrete fix}
```

### Harness Evolution — think about the bigger picture

After scoring the factory, step back further and think about the **harness** — the entire system that guides agents (CLAUDE.md, docs/, linters, skills, templates). Ask:

1. **Context engineering:** Did the agent have everything it needed in-repo? Or did it struggle because knowledge was missing / scattered / stale?
   - Missing docs -> add to `docs/` or CLAUDE.md
   - Stale docs -> flag for doc-gardening
   - Knowledge only in your head -> encode it

2. **Architectural constraints:** Did the agent break module boundaries, produce inconsistent patterns, or ignore conventions?
   - Repeated boundary violations -> need a linter or structural test
   - Inconsistent patterns -> need golden principle in CLAUDE.md
   - Data shape errors -> need parse-at-boundary enforcement

3. **Decision traces:** What worked well that future agents should reuse? What failed that they should avoid?
   - Good patterns -> capture as precedent in docs or CLAUDE.md
   - Bad patterns -> encode as anti-pattern or lint rule
   - Think: "if another agent hits this same problem tomorrow, what should it find?"

4. **Skill gaps:** Which skills need better instructions? Which new skills should exist?
   - Skill that caused waste -> concrete SKILL.md patch
   - Missing capability -> new skill idea for evolution log

### Write to evolution log:

Append findings to `{project_root}/docs/evolution.md` (create if not exists). If `~/.solo/evolution.md` exists, append there as well for cross-project tracking.

```markdown
## {YYYY-MM-DD} | {project} | Factory Score: {N}/10

Pipeline: {stages run} | Iters: {total} | Waste: {pct}%

### Defects
- **{severity}** | {skill/script}: {description}
  - Fix: {concrete file:change}

### Harness Gaps
- **Context:** {what knowledge was missing or stale for the agent}
- **Constraints:** {what boundary violations or inconsistencies occurred}
- **Precedents:** {patterns worth capturing for future agents — good or bad}

### Missing
- {capability the factory lacked}

### What worked well
- {skill/pattern that performed efficiently}
```

**Rules:**
- Be brutally honest — if a skill is broken, say so
- Every defect must have a concrete fix (file + what to change)
- Track what works well too — don't regress good patterns
- Keep entries compact — this file accumulates over time

## Signal Output

**Output signal:** `<solo:done/>`

**Important:** `/retro` always outputs `<solo:done/>` — it never needs redo. Even if pipeline was terrible, the retro itself always completes.

## Edge Cases

- **No pipeline.log and no git history:** abort with clear message — "No pipeline log or git history found. Nothing to analyze."
- **No pipeline.log but git history exists:** switch to fallback mode (see Fallback Analysis)
- **Empty pipeline.log:** report "Pipeline log is empty — was the pipeline cancelled before any iteration?"
- **No iter logs:** skip Phase 4 sampling, note in report
- **No plan-done:** skip Phase 5, note "No completed plans found"
- **No test/build commands:** skip those checks in Phase 6, note in report
- **Pipeline still running:** warn user — "State file exists, pipeline may still be running. Retro on partial data."

## Reference Files

- `${CLAUDE_PLUGIN_ROOT}/skills/retro/references/eval-dimensions.md` — scoring rubric (8 axes, weights)
- `${CLAUDE_PLUGIN_ROOT}/skills/retro/references/failure-catalog.md` — known failure patterns and fixes
Sources