Evaluations

Real-world CLI tools evaluated against the CLI Agent Spec — scored across critical failure modes and proactive readiness dimensions.

Available Evaluations

CLI	Version	Score	Readiness	Scope
docuseal	1.0.3	0.72/3	7/15 [C]	All (71 of 71)
gh	2.88.1	1.8/3	7/15 [D]	Critical (13 of 71)
gws	0.22.5	1.23/3	7/15 [C]	Critical (22 of 71)
dokploy	0.3.0	1.1/3	7/15 [C]	Critical (22 of 71)
langfuse	0.0.10	1.4/3	9/15 [C]	Critical (22 of 71)
firecrawl	1.18.1	0.48/3	7/15 [C]	Critical (22 of 71)
omd	0.1.1	1.5/3	12/15 [B]	Critical (22 of 71)
shopify	@shopify/cli 4.1.0	0.6/3	6/15 [D]	Critical (22 of 71)
hevn	hevn-cli 0.1.0	0.9/3	7/15 [C]	Critical (22 of 71)

Each evaluation runs a CLI tool through the spec's failure mode checks and produces:

Scorecard — per-failure-mode scores (0–3) with evidence
Readiness score — proactive agent readiness across 5 dimensions
Perspective reports — tailored output for runtime agents, agent developers, CLI authors, and issue trackers

Evaluations are generated by the /cli-agent-audit skill.