Experiment Results

We ran three comparative experiments to benchmark different approaches to agentic development. Each experiment tested multiple strategies on controlled tasks.

Experiment 1: Prompting Approaches

Task: Create a TypeScript utility function that validates and parses ISO 8601 date strings with timezone support.

Approaches Tested

Approach	Description
Minimal	”Create a TypeScript function to validate and parse ISO 8601 dates”
Context-Rich	Detailed requirements, edge cases, return type, pattern references
Spec-Driven + TDD	Write tests first, confirm failure, implement to pass, refactor

Results

Metric	Minimal	Context-Rich	Spec+TDD
Completeness	4/10	7/10	9/10
Edge Case Handling	2/10	6/10	9/10
Test Coverage	1/10	5/10	9/10
Code Quality	5/10	7/10	8/10
Est. Token Cost	~2,000	~5,000	~12,000

Key Findings

Minimal prompts produce superficially correct code that misses edge cases, has no tests, and assumes simple inputs.
Context-rich prompts significantly improve completeness and quality but still leave gaps in edge cases and testing unless specifically requested.
Spec-driven + TDD produces the most robust output. The test-first approach forces comprehensive coverage, and the spec provides unambiguous requirements. Token cost is higher, but the output is production-ready.

Experiment 2: Context Management Strategies

Simulated scenario: Working in a 500+ file codebase across a multi-step implementation.

Strategies Tested

Strategy	Description
Monolithic	Everything in one 300+ line agent configuration file
Hierarchical	Root (~50 lines) + domain configuration files + skills
Progressive + FIC	Minimal root (~30 lines) + phases with compaction + sub-agents

Results

Metric	Monolithic	Hierarchical	Progressive+FIC
Instruction Adherence	71%	89%	94%
Context Efficiency	Low (70-90% fill)	Medium (50-70%)	High (35-55%)
Error Rate	High	Low	Lowest
Setup Complexity	Low	Medium	High
Maintenance	High (one big file)	Medium	Low (modular)

Key Findings

Monolithic degrades rapidly beyond 200 lines. Important rules get “lost” in the noise. Only viable for small projects.
Hierarchical is the best balance for most teams. Auto-loading domain-specific configuration files provides relevant context without bloating every session.
Progressive + FIC achieves the highest quality but requires discipline. The three-phase workflow with compaction between phases keeps context consistently clean.

Recommendation: Start with Hierarchical. Adopt FIC practices (sub-agents for research, compaction between phases) as your team matures. For tool-specific setup of hierarchical context structures, see the Tool Configuration Reference.

Experiment 3: Multi-Agent Orchestration

Task: Implement a payment processing feature requiring research, planning, implementation, and review.

Patterns Tested

Pattern	Description
Single Agent	One agent handles everything
Hierarchical	Lead agent + research/implementation sub-agents
Pipeline	Sequential specialized agents with file-based handoff

Results

Metric	Single	Hierarchical	Pipeline
Token Efficiency	1x (baseline)	0.7x (-30%)	0.5x (-50%)
Output Quality	Degrades	High, consistent	Highest
Context Purity	Low	High	Maximum
Wall-Clock Time	Baseline	~0.7x (faster)	~1.2x (slower)
Coordination Overhead	None	Low	Medium
Information Preservation	Full	Good (some summary loss)	Moderate (lossy)

Key Findings

Single Agent works well for simple tasks but degrades on complex features. Context pollution from research and failed approaches reduces implementation quality.
Hierarchical is the best default pattern. The 30% token savings and consistent quality justify the minimal coordination overhead. Parallel research sub-agents significantly reduce wall-clock time.
Pipeline achieves the highest quality through maximum context purity but is slower due to sequential execution. Best for quality-critical implementations where correctness matters more than speed.

Architecture Decision Matrix

Project Size	Task Complexity	Recommended Pattern
Small (under 100 files)	Simple	Single Agent
Small	Complex	Hierarchical
Medium (100-500 files)	Any	Hierarchical
Large (500+ files)	Simple	Hierarchical
Large	Complex	Pipeline or Hybrid
Any	Quality-critical	Pipeline

Methodology Notes

Experiments were conducted using AI coding agents
Each approach was tested with the same base task and evaluated on consistent criteria
Scores are relative comparisons, not absolute quality measurements
Token costs are estimates based on typical patterns, not exact measurements
Results should be validated against your specific project context and tooling