shopify — CLI Agent Evaluation

Evaluated against the CLI Agent Spec, a specification defining 71 failure modes for CLI tools used under AI agent orchestration.

CLI version: @shopify/cli/4.1.0 darwin-arm64 node-v25.9.0
Evaluated: 2026-05-28
Scope: Critical (22 of 71 failure modes)

Scores

| Metric | Result |
| --- | --- |
| Failure mode score | 0.6/3 (1 passing · 9 partial · 11 failing · 1 indeterminate) |
| Readiness score | 6/15 [D] |
| Observed bugs | 5 confirmed during live evaluation |
| Worst gaps | §10 Interactivity, §37 REPL triggering, §43 output size, §45 headless auth, §74 credential scopes |
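
The four counts in the failure mode score add up to the 22 failure modes in scope. As a rough illustration of how a figure like 0.6/3 could arise, the sketch below assumes a weighting that this report does not state: 3 points for a pass, 1 for a partial, 0 for a fail, with the indeterminate mode left out of the mean. The name `ASSUMED_POINTS` and its weights are hypothetical; the actual CLI Agent Spec formula may differ.

```python
# Counts copied from the Scores table.
counts = {"passing": 1, "partial": 9, "failing": 11, "indeterminate": 1}

# ASSUMPTION: these weights are illustrative and not taken from the spec.
ASSUMED_POINTS = {"passing": 3, "partial": 1, "failing": 0}

# Every evaluated failure mode, including the indeterminate one.
in_scope = sum(counts.values())

# Only modes with a definite result contribute to the mean.
scored = sum(counts[name] for name in ASSUMED_POINTS)
total = sum(counts[name] * ASSUMED_POINTS[name] for name in ASSUMED_POINTS)
mean_score = total / scored

print(in_scope, round(mean_score, 1))  # prints "22 0.6"
```

Under these assumed weights the mean is 12/21, which rounds to the reported 0.6. That agreement is consistent with the table but does not confirm the weighting.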

Key Findings

  • Headless auth is unsafe for agents: shopify auth login printed a device-code URL and kept running until terminated.
  • Interactive command paths are not consistently guarded: shopify theme console did not exit before a 3s alarm killed it.
  • Output is not parse-stable: release notes, preference errors, and prose boxes can appear before command-specific output.
  • Mutating and destructive commands lack a universal dry-run, idempotency key, and effect field.
  • The CLI has useful Oclif JSON command metadata, but no full schema with exit codes, credential scopes, interactivity, or safe-default declarations.
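
The first three findings suggest that an agent should not invoke the CLI bare. A minimal sketch of a defensive wrapper follows, assuming the agent is written in Python. It closes stdin so a prompt cannot block, enforces a hard timeout, and pulls the first JSON value out of stdout that may be preceded by prose. The helper names `run_guarded` and `extract_json` are hypothetical and not part of the Shopify CLI or the CLI Agent Spec.

```python
import json
import subprocess
import sys


def run_guarded(argv, timeout_s=3.0):
    """Run a command with stdin closed and a hard timeout.

    Returns a dict; `timed_out` is True when the process had to be killed.
    """
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,  # the child can never wait on a prompt
            capture_output=True,
            text=True,
            timeout=timeout_s,         # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        # Partial output is discarded here to keep the sketch simple.
        return {"timed_out": True, "exit_code": None, "stdout": "", "stderr": ""}
    return {
        "timed_out": False,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def extract_json(text):
    """Return the first JSON object or array found in `text`, or None.

    Skips any leading prose (release notes, notices) before the payload.
    """
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
            except json.JSONDecodeError:
                continue  # not valid JSON at this position; keep scanning
            return value
    return None


if __name__ == "__main__":
    # Stand-in for a command that never exits on its own.
    hung = run_guarded([sys.executable, "-c", "import time; time.sleep(10)"], 0.5)
    print(hung["timed_out"])
    print(extract_json('Release notes: see changelog\n{"ok": true}\n'))
```

A call such as `run_guarded(["shopify", "theme", "console"])` would then report `timed_out` instead of hanging the agent. The JSON scan is a heuristic: bracketed text in a prose preamble that happens to be valid JSON would be returned first, so a stable machine-readable output mode in the CLI itself remains the better fix.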

Files

| File | What it is |
| --- | --- |
| report-index.md | Full scorecard, readiness breakdown, links to all reports |
| report-issues.md | Concrete bugs and gaps agents will hit when using this CLI as-is |
| report-runtime.md | Compact operational brief: what to set, what to avoid, what to watch for |
| report-agent-dev.md | Integration guide: invocation invariants and per-gap workarounds for agent developers |
| report-dev.md | Fix list for CLI authors: what to implement |
| findings.md | Raw scorecard: one row per evaluated failure mode |
| issues.md | Observed bugs recorded during live evaluation |
| trace.md | Audit trail: exact check commands, exit codes, stdout/stderr per §N |
| environment.md | CLI environment profile: binary path, version, flags, timeout method |
| readiness.md | Proactive readiness scores across 5 dimensions |

Generated by cli-agent-audit · CLI Agent Spec