Skip to content

gh — CLI Agent Evaluation

Evaluated against the CLI Agent Spec — a specification defining 71 failure modes for CLI tools used under AI agent orchestration.

CLI version: 2.88.1
Evaluated: 2026-05-07
Scope: Critical failure modes (13 of 71)

Scores

Metric Result
Failure mode score 1.8/3 — 4 passing · 7 partial · 2 failing
Readiness score 7/15 [D]
Observed bugs 3 confirmed during live evaluation
Worst gaps §1 Exit Codes (0/3), §53 Credential Expiry (0/3)

Key Findings

  • gh exits 0 on every HTTP error — 401s, 404s, GraphQL failures all look identical to success; agents cannot detect failure without parsing stderr prose
  • When a token expires mid-session, gh prints a human-readable message and exits 0; the suggested recovery (gh auth login) requires a browser the agent cannot open
  • gh issue create with all flags provided creates a real resource immediately with no dry-run and no idempotency key — every retry on a failed call produces a duplicate

Files

File What it is
report-index.md Full scorecard — all failure modes, readiness breakdown, links to all reports
report-issues.md Concrete bugs and gaps agents will hit when using this CLI as-is
report-runtime.md Compact operational brief — what to set, what to avoid, what to watch for
report-agent-dev.md Integration guide — invocation invariants and per-gap workarounds for agent developers
report-dev.md Fix list for CLI authors — what to implement, mapped to spec requirements
findings.md Raw scorecard — one row per evaluated failure mode
issues.md Observed bugs recorded during live evaluation
trace.md Audit trail — exact check commands, exit codes, stdout/stderr per §N
environment.md CLI environment profile — binary path, version, flags, timeout method
readiness.md Proactive readiness scores across 5 dimensions

Generated by cli-agent-audit · CLI Agent Spec