ERC3: AI Agents in Action
Enterprise RAG Challenge 3 shifted from document QA (ERC2) to agent-based API interaction. 103 tasks simulating a company employee: checking permissions, managing salaries, archiving projects, finding info in company wiki. Created by Rinat Abdullin.
Results
| Place | Participant | Score | Cost | Model |
|---|---|---|---|---|
| 1st | Aleksey (maddness) | 100% | $600 | Claude Opus |
| 2nd overall / 1st local | Ilya Rice | 62% | $100 | GPT o3s-120B (Cerebras) |
| 3rd local | Valera | 46% | $2 | Qwen (8 GPUs local) |
Key finding: simple ReAct with excellent prompts + composite tools beat multi-agent orchestration.
1st Place Architecture (100%)
Plain ReAct agent. No subagents, no orchestrator. Anthropic SDK direct.
Prompt structure (4 sections):
base_prompt(~1,800 tokens) — behavioral algorithm: collect context → act → respondrules(34 rules) — edge-case fixes from iterative testingtool_patches— supplementary tool descriptions beyond docstringsexamples— few-shot step-by-step for specific task types
Evolution system (3-agent loop):
- Executor runs tasks. On failure, logs trace
- Analyzer reviews trace, generates hypotheses
- Improver proposes prompt/rule/tool patches
Went from 68% → 90% in ~1.5 hours (78 generations). Plateau at ~90% — last 10% required manual analysis.
Composite tools were key: instead of making agent paginate manually (25+ API calls), wrap in code. find_employees_by_skill, find_projects_by_employee, calculate_workloads — agent calls one tool, gets all results. 20 base + 11 composite = 31 tools total.
Final stats: 103/103, 6.6 min wall-clock (5 workers), 5.8 tool calls/task avg.
2nd Place Architecture (62%, local models)
Plan ReAct agent. Single agent, all 20 tools as structured output enum.
Structured output over tool calling — Cerebras supports strict mode for structured output but NOT for function calling. Schema:
current_state → remaining_work[1..5] → next_action → function(tool enum)
Creates “double reasoning” — model reasons internally, then fills the step-by-step schema. Same SGR NextStep pattern.
Pre-execution step validator: separate LLM call before every tool call. Checks: right tool? Correct params? Following rules? On rejection, agent retries. Validator history is ephemeral — never stored in main conversation.
Context pre-loading (in code, not by agent):
whoami+ full profile + related entities fetched automatically- Context builder (lightweight LLM call) selects relevant blocks for this task
- Agent starts with rich context from step 1
Wiki rule extraction (one-time pipeline):
- Downloads all wiki pages → LLM extracts two rule sets (authenticated vs public)
- Response formatting rules loaded via special tool ONLY when agent is about to respond
- Cached by wiki hash per benchmark variant
Conversation compression: validator rejections are ephemeral, only latest plan kept, older plans dropped.
Key Patterns
Composite tools > raw API
Remove pagination from agent’s view entirely. Wrap multi-step API sequences in code. All top solutions converged on this independently.
Separate formatting rules from working rules
Load response formatting instructions only when agent is about to respond. Keeps context clean during exploration.
Pre-fetch everything you can in code
Don’t waste agent steps on whoami or profile lookups. Dynamic context selection (lightweight LLM call) picks relevant blocks.
Evolution has a plateau
3-agent evolution loop (executor → analyzer → improver) is effective up to ~90%. After that, manual analysis required. The “one change per cycle” rule from agent-patterns-stream2 applies here too.
Structured output vs function calling
- Pre-2025 models (Qwen 2.5, older Llamas): structured output better
- Post-2025 models: native function calling generally better
- Cerebras specifically: strict mode works for SO but NOT for FC
Simple > complex
Claude Code itself uses single-loop architecture. Ilya explicitly confirmed: for business tasks, workflow > agents. Use agents only where workflow can’t work.
Connections
- enterprise-rag-challenge — ERC2 results (RAG-focused, different from ERC3)
- schema-guided-reasoning — NextStep pattern used by 2nd place
- agent-patterns-stream2 — evolution fixed/mutable partition maps to evolution system
- agent-benchmarks — ERC3 as agent benchmark
- agent-toolkit-landscape — simple ReAct vs multi-agent frameworks
- harness-engineering-summary — composite tools = harness for agents
- context-engineering — pre-loading, rule extraction, context builder
Source: Community stream, Apr 2026. Platform: erc.timetoact-group.at