Evaluations
Real-world CLI tools evaluated against the CLI Agent Spec and scored on critical failure modes and proactive readiness dimensions.
Available Evaluations
| CLI | Version | Score | Readiness | Scope |
|---|---|---|---|---|
| docuseal | 1.0.3 | 0.72/3 | 7/15 [C] | All (71 of 71) |
| gh | 2.88.1 | 1.8/3 | 7/15 [D] | Critical (13 of 71) |
| gws | 0.22.5 | 1.23/3 | 7/15 [C] | Critical (22 of 71) |
| dokploy | 0.3.0 | 1.1/3 | 7/15 [C] | Critical (22 of 71) |
| langfuse | 0.0.10 | 1.4/3 | 9/15 [C] | Critical (22 of 71) |
| firecrawl | 1.18.1 | 0.48/3 | 7/15 [C] | Critical (22 of 71) |
| omd | 0.1.1 | 1.5/3 | 12/15 [B] | Critical (22 of 71) |
| shopify | @shopify/cli 4.1.0 | 0.6/3 | 6/15 [D] | Critical (22 of 71) |
| hevn | hevn-cli 0.1.0 | 0.9/3 | 7/15 [C] | Critical (22 of 71) |
How Evaluations Work
Each evaluation runs a CLI tool through the spec's failure mode checks and produces:
- Scorecard — per-failure-mode scores (0–3) with evidence
- Readiness score — proactive agent readiness across 5 dimensions
- Perspective reports — tailored output for runtime agents, agent developers, CLI authors, and issue trackers
Evaluations are generated by the /cli-agent-audit skill.
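
As a rough illustration of how the scorecard and readiness figures above might be rolled up, here is a minimal sketch. It is not the /cli-agent-audit implementation. It assumes the overall score is the mean of the per-failure-mode scores and that each of the 5 readiness dimensions is scored 0–3 (which would give the 15-point total). All class, function, check, and dimension names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

MAX_SCORE = 3  # per-failure-mode scores run from 0 to 3


@dataclass
class FailureModeResult:
    check_id: str   # hypothetical identifier for a failure mode check
    score: int      # 0-3, as recorded in the scorecard
    evidence: str   # short note backing the score


def overall_score(results: list[FailureModeResult]) -> float:
    """Average the per-check scores (assumed aggregation, not confirmed by the spec)."""
    for r in results:
        if not 0 <= r.score <= MAX_SCORE:
            raise ValueError(f"{r.check_id}: score {r.score} is outside 0-{MAX_SCORE}")
    return round(mean(r.score for r in results), 2)


def readiness_total(dimensions: dict[str, int]) -> str:
    """Sum the dimension scores, assuming each dimension is also scored 0-3."""
    for name, value in dimensions.items():
        if not 0 <= value <= MAX_SCORE:
            raise ValueError(f"{name}: score {value} is outside 0-{MAX_SCORE}")
    return f"{sum(dimensions.values())}/{MAX_SCORE * len(dimensions)}"


# Hypothetical results for an imaginary CLI
results = [
    FailureModeResult("interactive-prompt", 3, "offers a --yes flag"),
    FailureModeResult("unstructured-errors", 1, "errors are plain text on stderr"),
    FailureModeResult("pager-by-default", 0, "blocks waiting on a pager"),
    FailureModeResult("exit-codes", 2, "non-zero on failure, but undocumented"),
]
readiness = {"discovery": 3, "output": 2, "auth": 1, "errors": 1, "docs": 0}

print(f"{overall_score(results)}/{MAX_SCORE}")
print(readiness_total(readiness))
```

The letter grade shown next to each readiness total is left out on purpose: the table gives the same 7/15 total both a [C] and a [D], so the grade evidently depends on more than the sum.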