jpoehnelt / agent-dx-cli-scale
A scoring rubric and design pattern for evaluating CLI agent-readiness
GitHub: https://github.com/jpoehnelt/skills/blob/main/agent-dx-cli-scale/SKILL.md
Type: Design framework / evaluation rubric (not a code library)
Author: Justin Poehnelt (Google Developer Relations)
Overview
agent-dx-cli-scale is not a CLI framework — it is an evaluation rubric and design philosophy for measuring how well an existing CLI is designed for AI agent consumption. It defines a 7-axis scoring system (0–3 per axis, 0–21 total) that grades CLIs on agent-readiness, plus a "Multi-Surface Readiness" bonus checklist.
The rubric is published as a Claude Code skill (SKILL.md with YAML frontmatter), meaning it is designed to be loaded into an AI agent's context at conversation start so the agent can apply it to evaluate any CLI the user mentions.
The companion blog post (referenced in the SKILL.md) is "Rewrite Your CLI for AI Agents" and articulates the underlying philosophy: "Human DX optimizes for discoverability and forgiveness. Agent DX optimizes for predictability and defense-in-depth."
Architecture & Design
Format: SKILL.md (YAML frontmatter + Markdown scoring table)
Deployment: Loaded into Claude Code (or similar agent) as a context skill
Scope: Evaluation tool, not implementation framework
The 7 Scoring Axes:
Axis 1: Machine-Readable Output (0–3)
- 0: Human-only output (tables, color codes, prose)
- 1: `--output json` exists but incomplete/inconsistent
- 2: Consistent JSON output + errors also return structured JSON
- 3: NDJSON streaming for pagination; structured output default in non-TTY
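As a rough illustration of levels 2 and 3 on this axis, the sketch below defaults to NDJSON when the output stream is not a TTY and reports errors as structured JSON. The function names (`wants_json`, `emit_records`, `emit_error`) and the error envelope shape are assumptions made for this example; the rubric does not prescribe them.

```python
import json
from typing import Iterable, Optional, TextIO


def wants_json(explicit_format: Optional[str], stream: TextIO) -> bool:
    """Use JSON when asked for, or by default when the stream is not a TTY."""
    if explicit_format is not None:
        return explicit_format == "json"
    return not stream.isatty()


def emit_records(records: Iterable[dict], stream: TextIO,
                 explicit_format: Optional[str] = None) -> None:
    """Write NDJSON (one object per line) or a simple human-readable listing."""
    as_json = wants_json(explicit_format, stream)
    for record in records:
        if as_json:
            stream.write(json.dumps(record, sort_keys=True) + "\n")
        else:
            stream.write(", ".join(f"{k}={v}" for k, v in record.items()) + "\n")


def emit_error(code: str, message: str, stream: TextIO) -> None:
    """Report errors as structured JSON too (hypothetical envelope shape)."""
    stream.write(json.dumps({"error": {"code": code, "message": message}}) + "\n")
```

A caller would typically pass `sys.stdout` for results and `sys.stderr` for errors, so that piping the command into another process yields parseable lines without any extra flag.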
Axis 2: Raw Payload Input (0–3)
- 0: Only bespoke flags
- 1: `--json` or stdin JSON for some commands
- 2: All mutating commands accept raw JSON payload mapping to API schema
- 3: Raw payload is first-class; agent uses API schema as documentation with zero translation loss
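One way a CLI might accept a raw payload is sketched below: the body arrives either inline or on stdin and is passed through unchanged, so the agent can write it straight from the API schema. The helper name `load_payload` and the convention that `-` means "read stdin" are assumptions for illustration only.

```python
import json
from typing import Optional, TextIO


def load_payload(json_arg: Optional[str], stdin: TextIO) -> dict:
    """Return the request body as supplied, reading stdin when no inline JSON is given."""
    raw = stdin.read() if json_arg in (None, "-") else json_arg
    payload = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad input
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    return payload
```

Because the payload is never remapped onto bespoke flags, nested fields survive intact, which is the "zero translation loss" property level 3 describes.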
Axis 3: Schema Introspection (0–3)
- 0: Only `--help` text
- 1: `--help --json` or `describe` for some commands
- 2: Full schema introspection for all commands as JSON
- 3: Live runtime-resolved schemas from discovery document; includes scopes, enums, nested types
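The following sketch shows roughly what level 2 introspection could return: a command definition serialized to JSON, including parameter types and enums. The `Param` and `Command` classes and the `describe` function are hypothetical, and the schema here is static; level 3 would resolve it at runtime from a discovery document instead.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Param:
    name: str
    type: str
    required: bool = False
    enum: Optional[List[str]] = None  # allowed values, when the API restricts them


@dataclass
class Command:
    name: str
    summary: str
    params: List[Param] = field(default_factory=list)


def describe(command: Command) -> str:
    """Serialize a command definition, nested parameters included, as JSON."""
    return json.dumps(asdict(command), indent=2, sort_keys=True)
```

An agent that can call something like `describe` for every command no longer has to parse free-form help text to learn which values a parameter accepts.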
Axis 4: Context Window Discipline (0–3)
- 0: Full API responses, no field limiting
- 1: `--fields` on some commands
- 2: Field masks on all read commands; pagination with `--page-all`
- 3: Streaming pagination (NDJSON per page); explicit skill guidance on field mask usage
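A field mask and per-page NDJSON streaming can be sketched in a few lines. The dotted-path mask syntax, the `{"items": [...]}` page envelope, and the function names below are assumptions for this example rather than anything the rubric specifies.

```python
import json
from typing import Iterable, List, TextIO

_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def _lookup(record: dict, keys: List[str]):
    """Follow a list of keys into nested dicts, returning _MISSING if any is absent."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def apply_field_mask(record: dict, fields: List[str]) -> dict:
    """Keep only the requested fields; dotted paths select nested values."""
    result: dict = {}
    for path in fields:
        keys = path.split(".")
        value = _lookup(record, keys)
        if value is _MISSING:
            continue  # silently skip fields the record does not have
        target = result
        for key in keys[:-1]:
            target = target.setdefault(key, {})
        target[keys[-1]] = value
    return result


def stream_pages(pages: Iterable[List[dict]], fields: List[str], out: TextIO) -> None:
    """Write one JSON line per page so a consumer can stop reading early."""
    for page in pages:
        items = [apply_field_mask(item, fields) for item in page]
        out.write(json.dumps({"items": items}) + "\n")
```

Masking before serialization means large fields never reach the agent's context, and emitting one line per page lets the agent stop consuming as soon as it has what it needs.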
Axis 5: Input Hardening (0–3)
- 0: No validation beyond basic type checks
- 1: Some validation; does not cover agent hallucination patterns
- 2: Rejects path traversals (`../`), percent-encoded segments, embedded query params
- 3: All of above + output path sandboxing, HTTP-layer encoding, explicit security posture ("agent is not trusted operator")
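The checks named at levels 2 and 3 can be approximated as below. This is a minimal sketch, not a complete defense: the exception type, function names, and the exact rejection rules are assumptions, and a real CLI would also need the HTTP-layer encoding the rubric mentions.

```python
import re
from pathlib import Path


class UnsafeInput(ValueError):
    """Raised when an argument looks hallucinated or hostile."""


_PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def validate_resource_id(value: str) -> str:
    """Reject resource IDs showing the patterns the rubric calls out."""
    if not value or any(ord(ch) < 32 for ch in value):
        raise UnsafeInput("empty value or control character")
    if "?" in value or "#" in value:
        raise UnsafeInput("embedded query string or fragment")
    if _PERCENT_ENCODED.search(value):
        raise UnsafeInput("percent-encoded segment")
    if ".." in value.replace("\\", "/").split("/"):
        raise UnsafeInput("path traversal segment")
    return value


def resolve_in_sandbox(root: Path, user_path: str) -> Path:
    """Resolve an output path and refuse anything that lands outside root."""
    base = root.resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise UnsafeInput("output path escapes the sandbox")
    return candidate
```

Rejecting rather than normalizing is deliberate in this sketch: a hallucinated `../` or `?fields=*` inside an ID signals that the caller has misunderstood the interface, so failing loudly gives the agent something to correct.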
Axis 6: Safety Rails (0–3)
- 0: No dry-run, no response sanitization
- 1: `--dry-run` for some mutating commands
- 2: `--dry-run` for all mutating commands
- 3: Dry-run + response sanitization (e.g., Model Armor) against prompt injection in API data
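A dry-run mode usually comes down to separating planning from execution. The sketch below assumes a hypothetical `delete_items` command whose side effect is injected as `delete_fn`; response sanitization is not shown because it depends on an external service.

```python
from typing import Callable, Dict, Iterable, List


def delete_items(ids: Iterable[str], dry_run: bool,
                 delete_fn: Callable[[str], None]) -> Dict[str, object]:
    """Build the full plan first; only execute it when dry_run is False."""
    plan: List[Dict[str, str]] = [{"action": "delete", "id": item_id} for item_id in ids]
    applied: List[Dict[str, str]] = []
    if not dry_run:
        for step in plan:
            delete_fn(step["id"])
            applied.append(step)
    return {"dry_run": dry_run, "planned": plan, "applied": applied}
```

Returning the same report shape in both modes lets an agent inspect `planned` before re-running the identical command without `--dry-run`.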
Axis 7: Agent Knowledge Packaging (0–3)
- 0: Only `--help` and docs site
- 1: `CONTEXT.md` or `AGENTS.md` with basic guidance
- 2: Structured skill files (YAML + Markdown) per command/surface
- 3: Comprehensive skill library with agent-specific guardrails; skills versioned, discoverable, follow a standard like OpenClaw
Rating bands:
- 0–5: Human-only
- 6–10: Agent-tolerant
- 11–15: Agent-ready
- 16–21: Agent-first
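The scoring arithmetic is simple enough to express directly: seven axis scores of 0 to 3 are summed and the total is mapped to a band. The axis identifiers and the `rate` function below are names chosen for this sketch; the band boundaries are the ones listed above.

```python
from typing import Dict, Tuple

AXES = (
    "machine_readable_output",
    "raw_payload_input",
    "schema_introspection",
    "context_window_discipline",
    "input_hardening",
    "safety_rails",
    "agent_knowledge_packaging",
)

# (inclusive upper bound, band label), in ascending order
BANDS = ((5, "Human-only"), (10, "Agent-tolerant"), (15, "Agent-ready"), (21, "Agent-first"))


def rate(scores: Dict[str, int]) -> Tuple[int, str]:
    """Sum the seven axis scores and return (total, rating band)."""
    if set(scores) != set(AXES):
        raise ValueError("expected exactly one score per axis")
    if any(not isinstance(s, int) or not 0 <= s <= 3 for s in scores.values()):
        raise ValueError("each axis score must be an integer from 0 to 3")
    total = sum(scores.values())
    label = next(name for upper, name in BANDS if total <= upper)
    return total, label
```

For example, a CLI scoring 2, 1, 2, 1, 2, 2, 1 across the seven axes totals 11 and falls in the Agent-ready band.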
Bonus: Multi-Surface Readiness
- MCP (stdio JSON-RPC): typed tool invocation, no shell escaping
- Extension/plugin install: agent treats CLI as native capability
- Headless auth: env vars for tokens, no browser redirect
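The headless auth item in this checklist might look like the sketch below, which reads a token from the environment and fails fast instead of launching a browser flow. The variable name `EXAMPLE_CLI_TOKEN` and the `AuthError` type are placeholders invented for this example.

```python
from typing import Mapping


class AuthError(RuntimeError):
    """Raised instead of falling back to a browser-based login."""


def load_token(env: Mapping[str, str], var_name: str = "EXAMPLE_CLI_TOKEN") -> str:
    """Return the token from the environment, or fail with an actionable message."""
    token = env.get(var_name, "").strip()
    if not token:
        raise AuthError(f"{var_name} is not set; headless mode will not open a browser")
    return token
```

In practice the caller would pass `os.environ`; taking the mapping as a parameter keeps the function easy to test.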
Agent Compatibility Assessment
What it handles natively
As an evaluation tool rather than a framework, it "handles" challenges by defining what good looks like:
- Output Format & Parseability: Axis 1 directly addresses this; defines levels from human-only to NDJSON streaming
- Schema & Help Discoverability: Axis 3 defines the spectrum from `--help` text to live runtime schema
- Verbosity & Token Cost: Axis 4 "Context Window Discipline" is explicitly token-aware
- Argument Validation / Input Hardening: Axis 5 goes beyond standard validation to agent-specific hallucination patterns (path traversal, percent-encoding, embedded query params)
- Side Effects & Destructive Operations: Axis 6 Safety Rails directly addresses dry-run
- Prompt Injection via Output: Axis 6 level 3 explicitly mentions response sanitization (Model Armor)
- Command Composition / Raw Payload: Axis 2 addresses agents passing structured JSON directly
What it handles partially
- Interactivity — not explicitly scored, though Axis 2 (raw payload input) partially addresses it
- Error Message Quality — implied by Axis 1 (errors return structured JSON at level 2+) but not a dedicated axis
- Authentication — mentioned in bonus checklist (headless auth) but not scored
What it does not handle
As an evaluation rubric it does not implement anything. It also leaves the following implementation concerns unmeasured:
- Exit codes (no axis for exit code taxonomy)
- Timeouts and hanging processes
- Idempotency and safe retries
- Partial failure and atomicity
- Signal handling
- Child process leakage
- Encoding safety
- Observability / audit trail
- Config shadowing
- Race conditions
- Network proxy awareness
- Self-update behavior
Challenge Coverage Table
| # | Challenge | Rating | Reason |
|---|---|---|---|
| 1 | Exit Codes & Status Signaling | ✗ | No scoring axis for exit codes |
| 2 | Output Format & Parseability | ✓ | Axis 1 directly covers this with 4 levels |
| 3 | Stderr vs Stdout Discipline | ✗ | Not measured |
| 4 | Verbosity & Token Cost | ✓ | Axis 4 "Context Window Discipline" explicitly |
| 5 | Pagination & Large Output | ✓ | Axis 4 level 3 covers NDJSON streaming pagination |
| 6 | Command Composition & Piping | ~ | Axis 2 (raw payload) partially covers composition |
| 7 | Output Non-Determinism | ✗ | Not measured |
| 8 | ANSI & Color Code Leakage | ~ | Implied by Axis 1 level 0 ("color codes" = bad) |
| 9 | Binary & Encoding Safety | ✗ | Not measured |
| 10 | Interactivity & TTY Requirements | ✗ | Not a scored axis |
| 11 | Timeouts & Hanging Processes | ✗ | Not measured |
| 12 | Idempotency & Safe Retries | ✗ | Not measured |
| 13 | Partial Failure & Atomicity | ✗ | Not measured |
| 14 | Argument Validation Before Side Effects | ~ | Axis 5 covers validation but not phase ordering |
| 15 | Race Conditions & Concurrency | ✗ | Not measured |
| 16 | Signal Handling & Graceful Cancellation | ✗ | Not measured |
| 17 | Child Process Leakage | ✗ | Not measured |
| 18 | Error Message Quality | ~ | Axis 1 level 2 requires structured JSON errors |
| 19 | Retry Hints in Error Responses | ✗ | Not measured |
| 20 | Environment & Dependency Discovery | ✗ | Not measured |
| 21 | Schema & Help Discoverability | ✓ | Axis 3 directly covers this with 4 levels |
| 22 | Schema Versioning & Output Stability | ~ | Axis 3 level 3 mentions "current API version" |
| 23 | Side Effects & Destructive Operations | ✓ | Axis 6 Safety Rails directly covers dry-run |
| 24 | Authentication & Secret Handling | ~ | Bonus checklist mentions headless auth |
| 25 | Prompt Injection via Output | ✓ | Axis 6 level 3 explicitly covers response sanitization |
| 26 | Stateful Commands & Session Management | ✗ | Not measured |
| 27 | Platform & Shell Portability | ✗ | Not measured |
| 28 | Config File Shadowing & Precedence | ✗ | Not measured |
| 29 | Working Directory Sensitivity | ✗ | Not measured |
| 30 | Undeclared Filesystem Side Effects | ✗ | Not measured |
| 31 | Network Proxy Unawareness | ✗ | Not measured |
| 32 | Self-Update & Auto-Upgrade Behavior | ✗ | Not measured |
| 33 | Observability & Audit Trail | ✗ | Not measured |
Summary: ✓ 6 / ~ 6 / ✗ 21
Unique Contributions Not in Other Frameworks
1. Agent-specific input hardening (Axis 5)
The rubric explicitly names agent hallucination patterns as distinct from human typos:
- Path traversals (../)
- Percent-encoded segments (%2e)
- Embedded query params (?, # in resource IDs)
- Security posture: "The agent is not a trusted operator"
This is a unique and important insight absent from all other frameworks reviewed.
2. Knowledge packaging as a scored axis (Axis 7)
Explicitly scores whether a CLI ships agent-consumable skill files, CONTEXT.md, or a structured skill library. No other framework treats this as a first-class concern.
3. Multi-surface readiness
Frames MCP, plugin install, and headless auth as complementary surfaces for the same CLI, not alternatives. A CLI should support all three simultaneously.
4. The "translation loss" framing (Axis 2)
Level 3 of raw payload input frames the goal as "zero translation loss" between the API schema and what the agent passes: the agent should be able to use the API schema as documentation directly. This is a precise and actionable design target.
Strengths for Agent Use
- Conceptual clarity — the clearest articulation of agent-vs-human DX tradeoffs of any resource reviewed
- Input hardening taxonomy — unique focus on agent-specific attack/failure vectors (hallucinations, not typos)
- Knowledge packaging — the only framework that treats shipping agent skill files as a first-class design requirement
- Practical scoring — immediately applicable to audit any existing CLI for agent readiness
- Prompt injection addressed — one of the few references to call out response sanitization explicitly
Weaknesses for Agent Use
- Not a code framework — scores CLIs but provides no implementation
- 7 axes leave 21 of 33 challenges unmeasured (exit codes, timeouts, signals, idempotency, observability, and more)
- No acceptance criteria — scoring bands are qualitative; two evaluators may score the same CLI differently
- Static skill file — the rubric itself is not versioned or discoverable by agents at runtime
Verdict
agent-dx-cli-scale is the most conceptually sophisticated artifact in this review — not as a framework but as a design philosophy. Its framing of "agent is not a trusted operator," its explicit treatment of hallucination-specific input hardening, and its unique Axis 7 (knowledge packaging) contribute ideas that no other framework has articulated. As a scoring rubric it covers 12 of 33 challenges at least partially. Its primary limitation is that it evaluates rather than implements — it tells you what score your CLI has, not how to fix it. Used together with agentyper or as the design spec for a new framework, it provides invaluable conceptual grounding.