
ERC3: AI Agents in Action

Enterprise RAG Challenge 3 shifted from document QA (ERC2) to agent-based API interaction: 103 tasks simulating a company employee's work, such as checking permissions, managing salaries, archiving projects, and finding info in the company wiki. Created by Rinat Abdullin.

Results

| Place | Participant | Score | Cost | Model |
|---|---|---|---|---|
| 1st | Aleksey (maddness) | 100% | $600 | Claude Opus |
| 2nd overall / 1st local | Ilya Rice | 62% | $100 | gpt-oss-120B (Cerebras) |
| 3rd local | Valera | 46% | $2 | Qwen (8 local GPUs) |

Key finding: simple ReAct with excellent prompts + composite tools beat multi-agent orchestration.

1st Place Architecture (100%)

Plain ReAct agent. No subagents, no orchestrator. Anthropic SDK direct.
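The single-loop shape can be sketched as follows. This is a minimal illustration: `call_model` is a stub standing in for the Anthropic SDK messages call, and the tool names are hypothetical, not the challenge's real tools.

```python
# Plain ReAct loop: one agent, one flat tool registry, no orchestrator.
TOOLS = {
    "whoami": lambda args: {"employee": "demo"},
    "final_answer": lambda args: args["text"],
}

def call_model(history):
    # Stub: a real agent sends `history` to the model via the Anthropic
    # SDK and parses the tool call it returns. Here we finish at once.
    return {"tool": "final_answer", "args": {"text": "done"}}

def run_agent(task, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(history)
        result = TOOLS[step["tool"]](step["args"])
        if step["tool"] == "final_answer":
            return result
        history.append({"role": "tool", "content": str(result)})
    raise RuntimeError("step budget exhausted")
```

The whole control flow fits in one function; everything interesting lives in the prompt and the tool set.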

Prompt structure (4 sections):

Evolution system (3-agent loop):

  1. Executor runs tasks. On failure, logs trace
  2. Analyzer reviews trace, generates hypotheses
  3. Improver proposes prompt/rule/tool patches

Went from 68% to 90% in ~1.5 hours (78 generations), then plateaued at ~90%: the last 10% required manual analysis.
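The generation loop above can be sketched as below. All three roles are stubbed callables here; in the real system each is an LLM call, and `generations=78` matches the run described above.

```python
# Executor -> analyzer -> improver evolution loop (roles stubbed).
def evolve(prompt, tasks, run_task, analyze, improve, generations=78):
    best = prompt
    for _ in range(generations):
        # Executor: run every task, collect the failures (traces).
        failures = [t for t in tasks if not run_task(best, t)]
        if not failures:
            break
        hypotheses = analyze(failures)    # Analyzer: review failure traces
        best = improve(best, hypotheses)  # Improver: patch prompt/rules
    return best
```

A toy run, where success simply requires the word "fix" in the prompt:

```python
tasks = list(range(3))
result = evolve(
    "base prompt", tasks,
    run_task=lambda p, t: "fix" in p,
    analyze=lambda failures: ["missing rule"],
    improve=lambda p, hs: p + " fix",
)
```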

Composite tools were key: instead of making agent paginate manually (25+ API calls), wrap in code. find_employees_by_skill, find_projects_by_employee, calculate_workloads — agent calls one tool, gets all results. 20 base + 11 composite = 31 tools total.
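A composite tool is just pagination drained in code. A sketch, with `fetch_page` standing in for the paginated challenge API (the data and page size are invented for illustration):

```python
# Fake paginated API: 100 employee records, 10 per page.
def fetch_page(page, page_size=10):
    data = [{"id": i, "skill": "python" if i % 2 == 0 else "go"}
            for i in range(100)]
    start = page * page_size
    return data[start:start + page_size]

def find_employees_by_skill(skill):
    """Composite tool: drain every page and filter in code,
    so the agent makes one call and never sees pagination."""
    results, page = [], 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        results += [e for e in batch if e["skill"] == skill]
        page += 1
    return results
```

One tool call replaces a dozen raw page fetches plus the filtering steps the agent would otherwise have to reason through.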

Final stats: 103/103, 6.6 min wall-clock (5 workers), 5.8 tool calls/task avg.

2nd Place Architecture (62%, local models)

Plan-ReAct agent: a single agent with all 20 tools exposed as a structured-output enum.

Structured output instead of tool calling: Cerebras supports strict mode for structured output but NOT for function calling. Schema:

current_state → remaining_work[1..5] → next_action → function(tool enum)

This creates “double reasoning”: the model reasons internally, then fills the step-by-step schema. Same as the SGR NextStep pattern.
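The NextStep schema can be sketched as typed fields; with strict structured output the model must fill every field in order, which forces the visible current_state → plan → next_action chain. The tool names in the enum are illustrative, and plain dataclasses stand in for whatever schema library the real solution used.

```python
from dataclasses import dataclass, field
from typing import Literal

# Tool choice as a closed enum: strict structured output can enforce
# this even when the provider has no strict function calling.
Tool = Literal["whoami", "list_projects", "update_salary", "respond"]

@dataclass
class NextStep:
    current_state: str        # what the agent knows so far
    remaining_work: list[str] # the plan, constrained to 1..5 items
    next_action: str          # rationale for the very next call
    function: Tool            # which tool to invoke
    arguments: dict = field(default_factory=dict)

    def __post_init__(self):
        if not 1 <= len(self.remaining_work) <= 5:
            raise ValueError("remaining_work must have 1..5 items")
```

The field order is the point: the schema is filled top to bottom, so the reasoning fields are produced before the tool choice.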

Pre-execution step validator: separate LLM call before every tool call. Checks: right tool? Correct params? Following rules? On rejection, agent retries. Validator history is ephemeral — never stored in main conversation.
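The validate-then-execute gate can be sketched as below. `propose`, `validate`, and `execute` are stubs for the agent step, the separate validator LLM call, and the API call; the rejection feedback is passed back for the retry but never written into the main conversation.

```python
# Pre-execution validator gate: every tool call is vetted by a
# separate check before it runs; rejections stay ephemeral.
def guarded_call(propose, validate, execute, max_retries=3):
    feedback = None
    for _ in range(max_retries):
        call = propose(feedback)   # agent proposes a tool call
        verdict = validate(call)   # separate validator LLM call (stubbed)
        if verdict == "ok":
            return execute(call)
        feedback = verdict         # retry, carrying only the rejection
    raise RuntimeError("validator kept rejecting the call")
```

Toy usage, where the first proposal picks the wrong tool and the retry fixes it:

```python
calls = iter([{"tool": "update_salary"}, {"tool": "whoami"}])
result = guarded_call(
    propose=lambda feedback: next(calls),
    validate=lambda c: "ok" if c["tool"] == "whoami" else "wrong tool",
    execute=lambda c: c["tool"],
)
```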

Context pre-loading (in code, not by agent):

Wiki rule extraction (one-time pipeline):

Conversation compression: validator rejections are ephemeral, only latest plan kept, older plans dropped.
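Since validator rejections are never stored in the first place, the only compression rule left to apply to stored history is "keep the latest plan only". A sketch over a hypothetical message format with a `kind` tag:

```python
# Drop every stored plan except the most recent one; all other
# message kinds pass through untouched.
def compress(history):
    plans = [m for m in history if m["kind"] == "plan"]
    keep_plan = plans[-1] if plans else None
    return [m for m in history
            if m["kind"] != "plan" or m is keep_plan]
```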

Key Patterns

Composite tools > raw API

Remove pagination from agent’s view entirely. Wrap multi-step API sequences in code. All top solutions converged on this independently.

Separate formatting rules from working rules

Load response formatting instructions only when agent is about to respond. Keeps context clean during exploration.
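A sketch of the late-load idea: assemble the system prompt per step, appending the formatting rules only once the agent is about to answer. The rule strings are invented placeholders.

```python
WORKING_RULES = "Follow company policy when calling tools."
FORMAT_RULES = 'Answer as terse JSON: {"answer": ...}.'

def system_prompt(about_to_respond: bool) -> str:
    parts = [WORKING_RULES]
    if about_to_respond:
        # Formatting instructions enter the context only at answer time,
        # keeping exploration steps free of response-shape noise.
        parts.append(FORMAT_RULES)
    return "\n\n".join(parts)
```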

Pre-fetch everything you can in code

Don’t waste agent steps on whoami or profile lookups. Dynamic context selection (lightweight LLM call) picks relevant blocks.
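The pre-fetch step can be sketched as plain code that runs before the agent loop starts. `api` and `select_blocks` are stubs; in the real system the block selector is the lightweight LLM call mentioned above.

```python
# Build the agent's starting context in code: identity and profile are
# fetched up front, and a cheap classifier picks the relevant blocks.
def build_context(task, api, select_blocks):
    me = api["whoami"]()            # done in code, costs zero agent steps
    profile = api["get_profile"](me)
    blocks = select_blocks(task)    # lightweight LLM call in the real system
    return {"me": me, "profile": profile, "blocks": blocks}
```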

Evolution has a plateau

3-agent evolution loop (executor → analyzer → improver) is effective up to ~90%. After that, manual analysis required. The “one change per cycle” rule from agent-patterns-stream2 applies here too.

Structured output vs function calling

When the provider enforces strict schemas only for structured output (as Cerebras does here), encode the tool choice as an enum field inside the output schema rather than relying on native function calling.

Simple > complex

Claude Code itself uses single-loop architecture. Ilya explicitly confirmed: for business tasks, workflow > agents. Use agents only where workflow can’t work.

Source: Community stream, Apr 2026. Platform: erc.timetoact-group.at
