Skip to content

Experiment Results

We ran three comparative experiments to benchmark different approaches to agentic development. Each experiment tested multiple strategies on controlled tasks.

Task: Create a TypeScript utility function that validates and parses ISO 8601 date strings with timezone support.

ApproachDescription
Minimal”Create a TypeScript function to validate and parse ISO 8601 dates”
Context-RichDetailed requirements, edge cases, return type, pattern references
Spec-Driven + TDDWrite tests first, confirm failure, implement to pass, refactor
MetricMinimalContext-RichSpec+TDD
Completeness4/107/109/10
Edge Case Handling2/106/109/10
Test Coverage1/105/109/10
Code Quality5/107/108/10
Est. Token Cost~2,000~5,000~12,000
  1. Minimal prompts produce superficially correct code that misses edge cases, has no tests, and assumes simple inputs.

  2. Context-rich prompts significantly improve completeness and quality but still leave gaps in edge cases and testing unless specifically requested.

  3. Spec-driven + TDD produces the most robust output. The test-first approach forces comprehensive coverage, and the spec provides unambiguous requirements. Token cost is higher, but the output is production-ready.

Experiment 2: Context Management Strategies

Section titled “Experiment 2: Context Management Strategies”

Simulated scenario: Working in a 500+ file codebase across a multi-step implementation.

StrategyDescription
MonolithicEverything in one 300+ line agent configuration file
HierarchicalRoot (~50 lines) + domain configuration files + skills
Progressive + FICMinimal root (~30 lines) + phases with compaction + sub-agents
MetricMonolithicHierarchicalProgressive+FIC
Instruction Adherence71%89%94%
Context EfficiencyLow (70-90% fill)Medium (50-70%)High (35-55%)
Error RateHighLowLowest
Setup ComplexityLowMediumHigh
MaintenanceHigh (one big file)MediumLow (modular)
  1. Monolithic degrades rapidly beyond 200 lines. Important rules get “lost” in the noise. Only viable for small projects.

  2. Hierarchical is the best balance for most teams. Auto-loading domain-specific configuration files provides relevant context without bloating every session.

  3. Progressive + FIC achieves the highest quality but requires discipline. The three-phase workflow with compaction between phases keeps context consistently clean.

Recommendation: Start with Hierarchical. Adopt FIC practices (sub-agents for research, compaction between phases) as your team matures. For tool-specific setup of hierarchical context structures, see the Tool Configuration Reference.

Task: Implement a payment processing feature requiring research, planning, implementation, and review.

PatternDescription
Single AgentOne agent handles everything
HierarchicalLead agent + research/implementation sub-agents
PipelineSequential specialized agents with file-based handoff
MetricSingleHierarchicalPipeline
Token Efficiency1x (baseline)0.7x (-30%)0.5x (-50%)
Output QualityDegradesHigh, consistentHighest
Context PurityLowHighMaximum
Wall-Clock TimeBaseline~0.7x (faster)~1.2x (slower)
Coordination OverheadNoneLowMedium
Information PreservationFullGood (some summary loss)Moderate (lossy)
  1. Single Agent works well for simple tasks but degrades on complex features. Context pollution from research and failed approaches reduces implementation quality.

  2. Hierarchical is the best default pattern. The 30% token savings and consistent quality justify the minimal coordination overhead. Parallel research sub-agents significantly reduce wall-clock time.

  3. Pipeline achieves the highest quality through maximum context purity but is slower due to sequential execution. Best for quality-critical implementations where correctness matters more than speed.

Project SizeTask ComplexityRecommended Pattern
Small (under 100 files)SimpleSingle Agent
SmallComplexHierarchical
Medium (100-500 files)AnyHierarchical
Large (500+ files)SimpleHierarchical
LargeComplexPipeline or Hybrid
AnyQuality-criticalPipeline
  • Experiments were conducted using AI coding agents
  • Each approach was tested with the same base task and evaluated on consistent criteria
  • Scores are relative comparisons, not absolute quality measurements
  • Token costs are estimates based on typical patterns, not exact measurements
  • Results should be validated against your specific project context and tooling