gh — CLI Agent Evaluation

Evaluated against the CLI Agent Spec — a specification defining 71 failure modes for CLI tools used under AI agent orchestration.

CLI version: 2.88.1
Evaluated: 2026-05-07
Scope: Critical failure modes (13 of 71)

Scores

Metric	Result
Failure mode score	1.8/3 — 4 passing · 7 partial · 2 failing
Readiness score	7/15 [D]
Observed bugs	3 confirmed during live evaluation
Worst gaps	§1 Exit Codes (0/3), §53 Credential Expiry (0/3)

gh exits 0 on every HTTP error — 401s, 404s, GraphQL failures all look identical to success; agents cannot detect failure without parsing stderr prose
When a token expires mid-session, gh prints a human-readable message and exits 0; the suggested recovery (gh auth login) requires a browser the agent cannot open
gh issue create with all flags provided creates a real resource immediately with no dry-run and no idempotency key — every retry on a failed call produces a duplicate

File	What it is
report-index.md	Full scorecard — all failure modes, readiness breakdown, links to all reports
report-issues.md	Concrete bugs and gaps agents will hit when using this CLI as-is
report-runtime.md	Compact operational brief — what to set, what to avoid, what to watch for
report-agent-dev.md	Integration guide — invocation invariants and per-gap workarounds for agent developers
report-dev.md	Fix list for CLI authors — what to implement, mapped to spec requirements
findings.md	Raw scorecard — one row per evaluated failure mode
issues.md	Observed bugs recorded during live evaluation
trace.md	Audit trail — exact check commands, exit codes, stdout/stderr per §N
environment.md	CLI environment profile — binary path, version, flags, timeout method
readiness.md	Proactive readiness scores across 5 dimensions

Generated by cli-agent-audit · CLI Agent Spec