Skip to content

Evaluations

Real-world CLI tools evaluated against the CLI Agent Spec — scored across critical failure modes and proactive readiness dimensions.

Available Evaluations

CLI Version Score Readiness Scope
docuseal 1.0.3 0.72/3 7/15 [C] All (71 of 71)
gh 2.88.1 1.8/3 7/15 [D] Critical (13 of 71)
gws 0.22.5 1.23/3 7/15 [C] Critical (22 of 71)
dokploy 0.3.0 1.1/3 7/15 [C] Critical (22 of 71)
langfuse 0.0.10 1.4/3 9/15 [C] Critical (22 of 71)
firecrawl 1.18.1 0.48/3 7/15 [C] Critical (22 of 71)
omd 0.1.1 1.5/3 12/15 [B] Critical (22 of 71)
shopify @shopify/cli 4.1.0 0.6/3 6/15 [D] Critical (22 of 71)
hevn hevn-cli 0.1.0 0.9/3 7/15 [C] Critical (22 of 71)

How Evaluations Work

Each evaluation runs a CLI tool through the spec's failure mode checks and produces:

  • Scorecard — per-failure-mode scores (0–3) with evidence
  • Readiness score — proactive agent readiness across 5 dimensions
  • Perspective reports — tailored output for runtime agents, agent developers, CLI authors, and issue trackers

Evaluations are generated by the /cli-agent-audit skill.