agent-ready-eval
Agent-Ready Evaluation
Evaluate how well a codebase supports autonomous agent execution based on the "How to Get Out of Your Agent's Way" principles.
Core Philosophy
Autonomous agents fail for predictable reasons—most are system design failures, not model failures. This evaluation checks whether infrastructure enables true autonomy: agents that run unattended, isolated, reproducible, and bounded by system constraints rather than human intervention.
Evaluation Process
1. Gather Evidence
Explore the codebase for indicators across all 12 principles. Key files to examine:
Environment & Isolation:
Dockerfile,docker-compose.yml,.devcontainer/Makefile,setup.sh,bootstrap.sh- CI configs (
.github/workflows/,.gitlab-ci.yml,Jenkinsfile) - Nix files,
devbox.json,flake.nix
Dependencies & State:
- Lockfiles (
package-lock.json,yarn.lock,Pipfile.lock,Cargo.lock,go.sum) - Database configs, migration files, seed scripts
.env.example, config templates
Execution & Interfaces:
- CLI entry points,
bin/scripts - API definitions, OpenAPI specs
- Background job configs (Sidekiq, Celery, Bull)
- Timeout/limit configurations
Quality & Monitoring:
- Test suites, benchmark files
- Logging configuration
- Cost tracking, rate limiting setup
2. Score Each Principle
Read evaluation-criteria.md for detailed scoring rubric.
Score each of the 12 principles 0-3:
- 3: Fully implemented with clear evidence
- 2: Partially implemented, room for improvement
- 1: Minimal awareness, significant gaps
- 0: No evidence
3. Generate Report
Output format:
# Agent-Ready Evaluation Report
**Overall Score: X/36** (Y%)
**Rating: [Excellent|Good|Needs Work|Not Agent-Ready]**
## Summary
[2-3 sentence assessment of overall agent-readiness]
## Principle Scores
| Principle | Score | Evidence |
|-----------|-------|----------|
| 1. Sandbox Everything | X/3 | [brief evidence] |
| 2. No External DB Dependencies | X/3 | [brief evidence] |
| 3. Clean Environment | X/3 | [brief evidence] |
| 4. Session-Independent Execution | X/3 | [brief evidence] |
| 5. Outcome-Based Instructions | X/3 | [brief evidence] |
| 6. Direct Low-Level Interfaces | X/3 | [brief evidence] |
| 7. Minimal Framework Overhead | X/3 | [brief evidence] |
| 8. Explicit State Persistence | X/3 | [brief evidence] |
| 9. Early Benchmarks | X/3 | [brief evidence] |
| 10. Cost Planning | X/3 | [brief evidence] |
| 11. Verifiable Output | X/3 | [brief evidence] |
| 12. Infrastructure-Bounded Permissions | X/3 | [brief evidence] |
## Top 3 Improvements
1. **[Highest impact improvement]**
- Current state: ...
- Recommendation: ...
- Impact: ...
2. **[Second improvement]**
...
3. **[Third improvement]**
...
## Strengths
- [What the codebase does well for agents]
## Detailed Findings
[Optional: deeper analysis of specific areas]
Rating Scale
- 30-36 (83-100%): Excellent - Ready for autonomous agent execution
- 24-29 (67-82%): Good - Minor improvements needed
- 18-23 (50-66%): Needs Work - Significant gaps to address
- 0-17 (<50%): Not Agent-Ready - Major architectural changes needed
Quick Checks
If time is limited, prioritize these high-signal indicators:
- Dockerfile exists? → Sandboxing potential
- Lockfiles present? → Reproducibility
- No external DB in default config? → Isolation
- CLI scripts in bin/ or Makefile? → Direct interfaces
- Tests with assertions? → Verifiable output