How I spent $250+ on an AI agent competition and what I learned
I scored 17 out of 104. Here’s how I got there and why I’m happy about it.
The setup
BitGN PAC1 is an agent benchmark. Your AI agent connects via API to a virtual workspace – emails, invoices, contacts, OTP verifications, social engineering attacks – and completes tasks. 104 tasks, scored automatically. No hand-holding.
I also organized a physical hub event in Gazipasa for it. 14 hours. The competition started, and I found myself simultaneously explaining to people what agents are, how the competition works, and – with one hand – frantically throwing submissions at the leaderboard while begging Claude to figure things out.
Not the ideal competitive setup.
The money pit
Before the competition, I spent about $50 testing models. I ran through 30+ different models across 6 providers. The best performer? Nemotron 120B – a free model via Cloudflare Workers AI. Seed-2.0-pro was decent too.
Then the competition started, and I switched to GPT-5.4. One run: $50. I did three. That’s $150 in two hours of panic.
After the competition, another $50 on weekend debugging sessions. Total damage: north of $250.
My over-engineered stack
I built everything in Rust, on top of my own libraries. Three layers:
- openai-oxide – OpenAI client with caching, websockets, realtime. I like everything about this layer.
- sgr-agent – Agent core with SGR patterns from Rinat (thanks to him for both the competition and the ideas – I studied the Python reference implementation).
- agent-bit – The competition agent itself.
What I built into the architecture:
- ONNX classifier (MiniLM-L6) for intent and security labels – runs before the LLM even sees the task
- 12-feature threat matrix, NLI DeBERTa for injection detection, trust graph, two state machines
- 15 hot-reloadable skills, hooks, and tools
- Agent loop – SGR + function calling hybrid
- OutcomeValidator with adaptive kNN on the output side
Was this over-engineered for a 2-hour competition? Absolutely. But this is the agent-mistake-fix-harness philosophy in action – every mistake becomes a permanent fix in the framework.
Architecture deep dive
Here’s how the layers connect:
```mermaid
flowchart TD
    A["CLI\nmain.rs"] --> B["Pipeline SM\npipeline.rs"]
    B -->|Block| X[DENIED]
    B -->|Ready| C["Agent Loop\nagent.rs"]
    C --> D["Workflow SM\nworkflow.rs"]
    D --> E[Answer]
    click A "https://github.com/fortunto2/agent-bit/blob/main/src/main.rs"
    click B "https://github.com/fortunto2/agent-bit/blob/main/src/pipeline.rs"
    click C "https://github.com/fortunto2/agent-bit/blob/main/src/agent.rs"
    click D "https://github.com/fortunto2/agent-bit/blob/main/src/workflow.rs"
```
Pipeline SM (pipeline.rs) — deterministic, no LLM, blocks threats before they reach the model:
```mermaid
stateDiagram-v2
    [*] --> New
    New --> Classified: ONNX classify
    Classified --> InboxScanned: read + annotate
    InboxScanned --> SecurityChecked: threat score
    SecurityChecked --> Ready: safe
    Classified --> Blocked: threat
    InboxScanned --> Blocked: threat
    SecurityChecked --> Blocked: threat
    Ready --> [*]
    Blocked --> [*]
```
Workflow SM (workflow.rs) — runs during agent loop, nudges the agent back on track:
```mermaid
stateDiagram-v2
    [*] --> Reading
    Reading --> Acting: first write()
    Acting --> Cleanup: first delete()
    Cleanup --> Done: answer()
    Reading --> Done: answer()
    Acting --> Done: answer()
```
Two state machines
Pipeline SM runs before the LLM. Pure functions, each transition consumes the current state and returns the next. The compiler won’t let you skip a stage:
```rust
// Each state owns its data. Transitions consume self → return next.
pub struct New { pub instruction: String }

pub struct Classified {
    pub instruction: String,
    pub intent: String,           // intent_inbox, intent_delete, ...
    pub intent_confidence: f32,
    pub instruction_label: String, // crm, injection, credential, ...
}

pub struct InboxScanned {
    pub inbox_files: Vec<InboxFile>, // content + SecurityAssessment each
    pub crm_graph: CrmGraph,         // petgraph (empty now — lookup_contact on demand)
}

// Pipeline short-circuits at any stage:
pub struct BlockReason {
    pub outcome: &'static str, // "DENIED", "CLARIFICATION"
    pub message: String,
    pub stage: &'static str,   // which stage blocked
}
```
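To make the consume-and-return mechanics concrete, here is a sketch of what one transition might look like. The `classify` signature and the confidence threshold are made up for illustration; the real stage calls the ONNX classifier.

```rust
// Hypothetical sketch of one typestate transition. `New` is consumed by
// value, so the only outcomes are the next state or a short-circuit block.
#[derive(Debug)]
pub struct New { pub instruction: String }

#[derive(Debug)]
pub struct Classified {
    pub instruction: String,
    pub intent: String,
    pub intent_confidence: f32,
}

#[derive(Debug)]
pub struct BlockReason {
    pub outcome: &'static str,
    pub message: String,
    pub stage: &'static str,
}

impl New {
    // Consuming `self` means the compiler rejects any use of `New` after
    // classification — a stage cannot be skipped or replayed.
    pub fn classify(self, intent: &str, confidence: f32) -> Result<Classified, BlockReason> {
        if confidence < 0.2 {
            return Err(BlockReason {
                outcome: "CLARIFICATION",
                message: format!("low-confidence intent for: {}", self.instruction),
                stage: "classify",
            });
        }
        Ok(Classified {
            instruction: self.instruction,
            intent: intent.to_string(),
            intent_confidence: confidence,
        })
    }
}
```

Trying to touch the `New` value after calling `classify` is a compile error, which is the whole point of the pattern.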
Workflow SM runs during the agent loop, tracking what the agent does and intervening when it drifts:
```rust
pub enum Phase { Reading, Acting, Cleanup, Done }

pub struct WorkflowState {
    phase: Phase,
    read_paths: Vec<String>,
    write_paths: Vec<String>,
    reads_since_write: usize,  // detect read-loops
    verification_only: bool,   // OTP oracle: zero mutations
    outbox_limit: usize,       // prevent over-processing
    hooks: SharedHookRegistry, // AGENTS.MD parsed hooks
}
```
Every tool call goes through the workflow. pre_action() can block it, post_action() injects follow-up messages. No extra LLM calls needed.
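A minimal sketch of that guard, with illustrative logic only — the real state machine tracks far more, but the shape is: `pre_action()` can veto, `post_action()` can nudge.

```rust
// Toy version of the workflow guard. Names follow the article; the
// thresholds and messages are made up for illustration.
#[derive(PartialEq)]
pub enum Phase { Reading, Acting, Cleanup, Done }

pub struct WorkflowState {
    pub phase: Phase,
    pub reads_since_write: usize,
}

impl WorkflowState {
    /// Runs before every tool call; returning Err blocks the call.
    pub fn pre_action(&mut self, tool: &str) -> Result<(), String> {
        if self.phase == Phase::Done {
            return Err(format!("task already answered; `{tool}` refused"));
        }
        Ok(())
    }

    /// Runs after every tool call; may inject a follow-up message that is
    /// appended to the tool output — no extra LLM call involved.
    pub fn post_action(&mut self, tool: &str) -> Option<String> {
        match tool {
            "read" => {
                self.reads_since_write += 1;
                // Detect read-loops: too many reads without any progress.
                (self.reads_since_write > 5)
                    .then(|| "HINT: stop reading, start acting".to_string())
            }
            "write" => {
                self.phase = Phase::Acting;
                self.reads_since_write = 0;
                None
            }
            _ => None,
        }
    }
}
```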
Tools: three crates, two loading modes
Tools are split across three Rust crates:
```mermaid
flowchart TD
    CORE["sgr-agent-core\nTool trait + FileBackend trait"]
    TOOLS["sgr-agent-tools\n14 reusable tools, generic over backend"]
    MW["agent-bit/tools.rs\n8 PAC1 tools + middleware wrappers"]
    PCM["PcmClient\nBitGN RPC + read cache"]
    CORE --> TOOLS --> MW --> PCM
    click CORE "https://github.com/fortunto2/rust-code/tree/master/crates/sgr-agent-core"
    click TOOLS "https://github.com/fortunto2/rust-code/tree/master/crates/sgr-agent-tools"
    click MW "https://github.com/fortunto2/agent-bit/blob/main/src/tools.rs"
    click PCM "https://github.com/fortunto2/agent-bit/blob/main/src/pcm.rs"
```
sgr-agent-core defines the traits. sgr-agent-tools implements them generically – same tool code works with any FileBackend: PcmClient (competition RPC), LocalFs (CLI agents), or MockFs (tests). agent-bit/tools.rs wraps them with PAC1-specific middleware (security scanning, workflow guards, AGENTS.MD hooks).
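A toy version of the split, with illustrative trait and method names — the real trait lives in sgr-agent-core and is richer than this:

```rust
// Hypothetical sketch: tools are generic over a storage trait, so the
// same ReadTool works against the competition RPC, a local filesystem,
// or an in-memory mock.
use std::collections::HashMap;

pub trait FileBackend {
    fn read(&self, path: &str) -> Result<String, String>;
}

/// In-memory backend — the kind of thing tests would use.
pub struct MockFs(pub HashMap<String, String>);

impl FileBackend for MockFs {
    fn read(&self, path: &str) -> Result<String, String> {
        self.0.get(path).cloned().ok_or_else(|| format!("not found: {path}"))
    }
}

/// A tool written once, reusable with any backend.
pub struct ReadTool<B: FileBackend> { pub backend: B }

impl<B: FileBackend> ReadTool<B> {
    pub fn execute(&self, path: &str, trusted: bool) -> Result<String, String> {
        // Prepend trust metadata, as the article's read tool does.
        let tag = if trusted { "trusted" } else { "untrusted" };
        Ok(format!("[{path} | {tag}]\n{}", self.backend.read(path)?))
    }
}
```

Swapping `MockFs` for a `PcmClient` or `LocalFs` implementation changes nothing in the tool code — that's what makes the crate reusable.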
All tools
sgr-agent-tools – 14 reusable tools, generic over FileBackend:
| Tool | Loading | What it does |
|---|---|---|
| read | always | Read file with trust metadata (`[path \| trusted/untrusted]`). Two modes: line slice and indentation expand |
| write | always | Write file with JSON auto-repair via llm_json. Supports ranged overwrite (start_line/end_line) |
| delete | always | Batch delete (single path or paths array) |
| search | always | Smart search: exact → name variants → fuzzy regex → Levenshtein on filenames. Auto-expands full content when ≤10 files match |
| list | always | Directory listing |
| tree | always | Directory tree with depth limit |
| read_all | always | Batch read entire directory in one call. This one tool cut steps from 185 to 43 |
| update_plan | always | Task checklist persisted to plan.md. `[x]` / `[~]` / `[ ]` format |
| eval | feature eval | JavaScript runtime via Boa engine. Pre-reads files by glob pattern, exposes as file_0..file_N. Dynamic calculations |
| shell | feature shell | Execute commands with timeout (2 min default, 10 min max). 100KB output cap |
| apply_patch | feature patch | Codex-compatible diff DSL. Saves tokens vs full write for small edits |
| mkdir | deferred | Create directory (LLM loads when needed) |
| move | deferred | Move/rename file |
| find | deferred | Find files by pattern and type |
agent-bit/tools.rs – 8 PAC1-specific tools (+ middleware wrappers over base read/write/delete/search):
| Tool | What it does |
|---|---|
| answer | Submit final answer with outcome (OK/DENIED/CLARIFICATION/UNSUPPORTED). OutcomeValidator checks answer via kNN embeddings before submitting |
| context | Get workspace date/time from harness |
| search_and_read | Search + read first match in one call (saves a round-trip) |
| date_calc | Date arithmetic: diff_days, add_days, next_birthday, compare, format. Uses chrono |
| grep_count | Count matching lines – for “how many” questions without reading all content |
| lookup_contact | On-demand CRM lookup by name/email. Replaced pre-loaded CRM graph (saved 25 RPCs at startup) |
| list_skills | Show available skills (re-exported from sgr-agent) |
| get_skill | Read a specific skill body (re-exported from sgr-agent) |
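As an aside, the date arithmetic behind a tool like date_calc is easy to get subtly wrong (leap years, month lengths). The real tool uses chrono; here is a std-only sketch of the diff_days idea built on Howard Hinnant's well-known days-from-civil algorithm:

```rust
// Convert a calendar date to a day count since 1970-01-01
// (Hinnant's days_from_civil algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;                                     // [0, 399]
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;             // [0, 146096]
    era * 146097 + doe - 719468
}

/// Signed day difference between two (year, month, day) dates.
pub fn diff_days(from: (i64, i64, i64), to: (i64, i64, i64)) -> i64 {
    days_from_civil(to.0, to.1, to.2) - days_from_civil(from.0, from.1, from.2)
}
```

Deterministic helpers like this are exactly what you want the LLM to call instead of doing calendar math in its head.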
Middleware pattern
Agent-bit doesn’t rewrite tools – it wraps them. Each wrapper adds a middleware chain:
```mermaid
flowchart LR
    subgraph READ["read middleware"]
        R1["base read\n+ trust metadata"] --> R2["guard_content\nsecurity scan"] --> R3["workflow\npost_action"]
    end
    subgraph WRITE["write middleware"]
        W0["workflow\npre_action"] --> W1["outbox inject\nsent:false"] --> W2["YAML fix\nauto-quote"] --> W3["base write\nJSON repair"] --> W4["hooks\nAGENTS.MD"] --> W5["workflow\npost_action"]
    end
    click R1 "https://github.com/fortunto2/rust-code/blob/master/crates/sgr-agent-tools/src/read.rs"
    click W3 "https://github.com/fortunto2/rust-code/blob/master/crates/sgr-agent-tools/src/write.rs"
```
```rust
// agent-bit wraps sgr-agent-tools with domain middleware:
pub struct ReadTool {
    inner: sgr_agent_tools::ReadTool<PcmClient>, // base tool
    workflow: SharedWorkflowState,               // phase tracking
}
// execute: inner.read() → guard_content() → workflow.post_action()
```
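The same chain can be sketched as plain function composition, each layer wrapping the next. Everything below is illustrative, not the real middleware — the "threat check" is a toy substring match:

```rust
// A tool is just a function from path to output; middleware wraps it.
type ToolFn = Box<dyn Fn(&str) -> Result<String, String>>;

/// Wrap a tool so its output passes through a security scan first.
fn with_guard(inner: ToolFn) -> ToolFn {
    Box::new(move |path: &str| {
        let out = inner(path)?;
        // A real guard scores 12 features; this is a toy check.
        if out.contains("IGNORE ALL PREVIOUS INSTRUCTIONS") {
            return Err(format!("blocked suspicious content in {path}"));
        }
        Ok(out)
    })
}

/// Wrap a tool so a post-action hint is appended to its output.
fn with_hint(inner: ToolFn, hint: &'static str) -> ToolFn {
    Box::new(move |path: &str| Ok(format!("{}\n{hint}", inner(path)?)))
}
```

The ordering is the design decision: guards run before hints so a blocked read never leaks content to the model.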
Why this split matters
Before the split, all tool logic was in agent-bit (1500+ lines). Now:
- sgr-agent-tools (crates.io) – reusable in any Rust agent. Zero PAC1 knowledge
- agent-bit/tools.rs – only domain middleware (security, workflow, hooks)
New agent project? cargo add sgr-agent-tools and you get read, write, search, eval, apply_patch – all with JSON repair, trust metadata, smart search cascade. Add your own middleware on top.
Skills: hot-reloadable domain knowledge
15 markdown files with YAML frontmatter. The classifier picks the right one based on ML intent + keywords:
```markdown
# skills/inbox-processing/SKILL.md
---
name: inbox-processing
triggers: [intent_inbox]
priority: 15
keywords: [inbox, queue, pending, process, review]
---
WORKFLOW (minimize steps):
1. Inbox messages already in context. Do NOT re-read.
2. Read channel files: docs/channels/*.txt
3. For EACH message: check channel trust, evaluate action
...
```
Selection logic handles a subtle bug I hit: when the ML classifier labeled a cleanup task as “injection” (false positive), the security skill hijacked the workflow and returned DENIED instead of deleting files. Fix: benign labels check intent first, security labels check security first.
```rust
// skills.rs — selection with hijack prevention
if is_security_label {
    // Security label → security skill first, intent as fallback
    registry.select(&[security_label, intent], instruction)
} else {
    // Benign label → intent first (prevents hijacking)
    registry.select(&[intent], instruction)
}
```
Skills are loaded from disk first (hot-reload – change markdown, agent picks it up next run) with compiled-in fallback via include_str! for deployment without disk dependency.
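The disk-first fallback itself is tiny. A sketch with an illustrative signature (the real loader also parses the YAML frontmatter):

```rust
// Disk wins (hot-reload: edit the markdown, the agent picks it up next
// run); the copy embedded at compile time is the deployment fallback.
use std::fs;

// The embedded copy would come from something like:
// const EMBEDDED: &str = include_str!("../skills/inbox-processing/SKILL.md");

pub fn load_skill(disk_path: &str, embedded: &'static str) -> String {
    fs::read_to_string(disk_path).unwrap_or_else(|_| embedded.to_string())
}
```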
Local ML: ONNX classifiers
Two ONNX models run locally before the LLM sees anything. Zero API cost, <10ms inference:
```mermaid
flowchart LR
    A["MiniLM-L6\n22M params\nintent + label"] --> C
    B["DeBERTa NLI\n22M params\ninjection scores"] --> C
    C["Feature Matrix\n12 features"] --> D["threat\n0.0 — 1.0"]
    click A "https://github.com/fortunto2/agent-bit/blob/main/src/classifier.rs"
    click C "https://github.com/fortunto2/agent-bit/blob/main/src/feature_matrix.rs"
```
The 12 features: ML confidence, structural injection score, sender trust, domain match, NLI scores, channel trust, and more. Everything feeds into sigmoid(weighted_sum) – one number the pipeline uses to block or pass.
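The scoring step itself is one line of math. A sketch with made-up weights (the real feature names and weights live in feature_matrix.rs):

```rust
// Weighted sum of normalized features, squashed into [0.0, 1.0].
pub fn threat_score(features: &[f64], weights: &[f64], bias: f64) -> f64 {
    let z: f64 = features.iter().zip(weights).map(|(f, w)| f * w).sum::<f64>() + bias;
    1.0 / (1.0 + (-z).exp()) // sigmoid
}
```

One scalar output means the pipeline only needs a single threshold to decide block vs pass.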
Plus an OutcomeValidator on the output side – adaptive kNN over ONNX embeddings of the agent’s answer, compared to prototype outcome descriptions. Catches when the agent says “DENIED” for a legitimate task or “OK” for a blocked one.
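A toy version of that nearest-prototype check — the real validator compares MiniLM ONNX embeddings; the two-dimensional vectors here are stand-ins:

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Return the outcome label whose prototype embedding is closest to the
/// embedded answer.
pub fn validate<'a>(answer_vec: &[f32], prototypes: &[(&'a str, Vec<f32>)]) -> &'a str {
    prototypes
        .iter()
        .max_by(|(_, a), (_, b)| cosine(answer_vec, a).total_cmp(&cosine(answer_vec, b)))
        .map(|(label, _)| *label)
        .expect("at least one prototype")
}
```

If the nearest prototype disagrees with the outcome the agent chose, that mismatch is the signal to re-check before submitting.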
Hooks: AGENTS.MD parser
The competition workspace has an AGENTS.MD file with rules like “When adding a card under /cards/, also update threads under /threads/.” The hooks system parses these into structured rules:
```rust
struct Hook {
    tool: String,          // "write"
    path_contains: String, // "cards/"
    message: String,       // "NEXT: update matching thread in threads/"
}
```
Every write() and delete() call checks the hook registry and injects follow-up instructions into the tool output. The agent sees them as part of the response and acts accordingly – no extra LLM call needed.
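A sketch of that matching step, using the Hook struct from above — the injection logic here is illustrative:

```rust
pub struct Hook {
    pub tool: String,
    pub path_contains: String,
    pub message: String,
}

/// After a tool call, append the message of every matching rule to the
/// tool output, so the follow-up reaches the LLM at zero extra cost.
pub fn apply_hooks(hooks: &[Hook], tool: &str, path: &str, output: String) -> String {
    let mut out = output;
    for h in hooks {
        if h.tool == tool && path.contains(&h.path_contains) {
            out.push('\n');
            out.push_str(&h.message);
        }
    }
    out
}
```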
What other participants did
After the competition, I looked at what the top scorers actually built. Different universe.
inozemtsev/bitgn, ai-babai/bitgn-env – Codex CLI with a simple wrapper and an agent instructions file. No custom frameworks. No ONNX classifiers. No state machines. 70-80 points.
(By the way, Codex has an official Rust version now. I found it later, and it helped me understand how to design tools better.)
What went wrong
1. Dev-Prod gap
On development tasks (43 total), I got 41/43 on Nemotron – a free model! In production (104 tasks), everything fell apart. I had hardcoded too many rules at the pre-LLM layer. While updating them in a real-time loop during the competition… well, you can imagine.
2. Dumb tools, too many steps
My agent was doing 185 steps per run. Average time: 229 seconds per task. Sum of trial times: 396 minutes. The problem: no batch tools. Every file read was a separate LLM round-trip.
3. Blind flying
I couldn’t see step counts or cost per run clearly. The leaderboard only appeared on Saturday. I was flying blind during the actual competition.
The weekend rebuild
I sat down on Saturday and Sunday and actually fixed things.
Tools: Created ReadAllTool – read an entire directory in one pass instead of 15 separate calls. Added EvalTool – run JavaScript dynamically via the Boa engine (also works with bash for local scripts). Extracted a shared tools package.
Observability: Set up Phoenix locally with OpenTelemetry. My sgr-agent had basic tracing, but I made it proper – every tool call, every LLM round-trip, every token count visible.
Results after the rebuild (numbers still improving – this is a snapshot, not the ceiling):
| Metric | Competition (Apr 11) | After rebuild (Apr 13) | Delta |
|---|---|---|---|
| Model | GPT-5.4 | GPT-5.4 | same |
| Parallelism | 104 | 104 | same |
| Score | 17/104 → 16.3% | 74/104 → 71.2% | +4.4x |
| Sum trial times | 396 min (23,806s) | 155 min (9,326s) | -61% |
| Avg time/task | 229s | 90s | -61% |
| Avg steps/task | 185.8 | 43.6 | -76% |
| Fastest task | 97s | 4s | -96% |
| Slowest task | 508s | 222s | -56% |
All 104 tasks run in parallel. Total wall-clock time: 3-4 minutes.
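The fan-out itself is nothing exotic. A sketch using std threads — the real agent likely drives this through an async runtime, but the shape is the same: spawn everything, then join:

```rust
use std::thread;

/// Run one closure per task id on its own thread and collect results.
pub fn run_all<F>(task_ids: Vec<u32>, run: F) -> Vec<(u32, String)>
where
    F: Fn(u32) -> String + Send + Copy + 'static,
{
    let handles: Vec<_> = task_ids
        .into_iter()
        .map(|id| thread::spawn(move || (id, run(id))))
        .collect();
    // Join in spawn order; wall clock ≈ the slowest single task.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

That last comment is the whole economics of the rebuild: with full parallelism, total wall-clock time is bounded by the slowest task, not the sum.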
Lessons
1. Ship simple first, optimize later. Codex CLI + good prompts = 70-80 points. My entire Rust pipeline = 17 points on competition day. The infrastructure I built is better now, but it wasn’t ready then.
2. Batch tools are not optional. ReadAllTool alone cut steps from 185 to 43. Each tool call = one LLM round-trip = 2-5 seconds. Multiply by 100+ tasks.
3. Observability from day one. I couldn’t debug what I couldn’t see. Phoenix + OTEL should have been there from the start, not bolted on after the disaster.
4. The dev-prod gap will get you. 41/43 in dev means nothing if prod has 2.5x more tasks with different patterns. Hardcoded rules are technical debt with compound interest.
5. Architecture pays off – eventually. My framework now powers multiple agents I’m building. The competition was an expensive stress test, but the sgr-agent ecosystem is stronger for it. A $250 tuition fee for a reusable agent core. This is the portfolio-approach – each project strengthens the next.
Am I happy?
17/104 during the competition was embarrassing. 74/104 two days later, with room to grow, is a different story. Every agent I build from here inherits what I fixed that weekend.
Links: agent-bit on GitHub | PAC1 Challenge | Telegram post
See also:
- agent-bit-pac1 – technical architecture deep-dive (SGR pipeline, tools, FileBackend trait)
- schema-guided-reasoning – the SGR pattern that powers the agent loop
- agent-benchmarks – how PAC1 compares to SWE-bench, PinchBench, and others
- cli-first-testing – why every project gets a CLI mirror
- project-openai-oxide – the OpenAI Rust client underneath