Alternative Solutions Landscape
A comprehensive comparison of existing approaches to the agent-CLI integration problem, evaluated against the 73 failure modes in this specification.
Researched March 2026.
Overview
No single existing solution addresses the full scope of the agent-CLI integration problem. The landscape fragments into six distinct layers:
| Layer | What it addresses | Representative solutions |
|---|---|---|
| Protocol | How agents and tool servers communicate | MCP, HTTP function calling |
| Framework | How CLI argument parsing and output are structured | Click, Cobra, Clap, Typer |
| Wrapper | How existing CLI tools are made machine-readable post-hoc | jc, jq, Nushell, PowerShell |
| Convention | Informal checklists for CLI authors | better-cli, DEV Community guides |
| Audit | Post-hoc reliability classification of production agent traces | EvidenceRun |
| Optimization | How agents adaptively learn to use existing CLIs | SkillOpt |
CLI Agent Spec occupies a seventh layer — behavioral contract specification — that none of these approaches formally addresses.
1. Model Context Protocol (MCP)
What it is
MCP is a JSON-RPC 2.0 protocol for connecting AI agents to external tools and data. It was created by Anthropic, adopted by OpenAI in March 2025, and donated to the Linux Foundation's Agentic AI Foundation in December 2025. An MCP server exposes tools (executable functions), resources (data), and prompts (templates) over STDIO or HTTP (Streamable HTTP in current revisions, HTTP+SSE in the original transport). Agents discover tools via tools/list and invoke them via tools/call. Every response is a typed JSON object.
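To make the message shapes concrete, here is a minimal sketch of the two requests an agent sends and the kind of result it receives. The tool name disk_usage, its path argument, and the helper function names are hypothetical; only the jsonrpc, method, params, content, and isError fields reflect the protocol.

```python
import json
from typing import Optional

def jsonrpc_request(request_id: int, method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message

# Discovery: ask the server which tools it exposes.
list_request = jsonrpc_request(1, "tools/list")

# Invocation: call one tool by name with JSON arguments.
# "disk_usage" and its "path" argument are hypothetical.
call_request = jsonrpc_request(
    2, "tools/call", {"name": "disk_usage", "arguments": {"path": "/var/log"}}
)

# Shape of a successful tool result as the agent would receive it.
example_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "content": [{"type": "text", "text": "{\"bytes\": 1048576}"}],
        "isError": False,
    },
}

def tool_failed(response: dict) -> bool:
    """True when the response is a JSON-RPC error or the tool result sets isError."""
    if "error" in response:
        return True
    return bool(response["result"].get("isError", False))

print(json.dumps(call_request))
```

Note that failure has two channels here: a protocol-level error object and a tool-level isError flag. Neither carries an exit code, which is the gap discussed under Coverage.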
Coverage
57.7% across 65 failure modes (25 native ✓, 25 partial ~, 15 missing ✗) — highest score of any evaluated solution.
Challenges MCP resolves natively:
| Challenge | How MCP addresses it |
|---|---|
| §2 Output format | Every tool response is a typed JSON-RPC object — no text to parse |
| §8 ANSI/color leakage | Structurally impossible — responses are JSON, not terminal output |
| §9 Binary encoding | Binary blobs are base64 in typed content objects |
| §21 Schema discoverability | tools/list returns full JSON Schema for every tool |
| §26 Session management | Explicit session lifecycle defined in the protocol |
| §24 Authentication | Isolated to the transport layer, separate from tool logic |
| §37 REPL triggering | Impossible by protocol design |
| §57 Locale-dependent errors | JSON-RPC error objects are structured, not locale-formatted strings |
Challenges MCP misses entirely:
| Challenge | Why MCP cannot address it |
|---|---|
| §1 Exit code taxonomy | MCP replaces exit codes with isError: true — the 14-code taxonomy with retryable and retry_after_ms has no equivalent |
| §11 Timeout enforcement | Spec recommends timeouts; enforcement is left to client implementations |
| §12 Idempotency | idempotentHint is advisory only — not enforced or machine-checkable |
| §13 Partial failure / step manifests | No standard for multi-step operations, rollback, or completed/failed/skipped reporting |
| §19 Retry hints | No first-class retryable/retry_after_ms fields |
| §22 Schema versioning per response | Protocol versioning covers the whole protocol, not individual tool schema versions |
| §47 MCP wrapper schema staleness | When a wrapped CLI evolves, the hand-written MCP wrapper silently falls out of sync — no mechanism exists for this in any solution |
Token cost
A typical CLI interaction costs ~200 tokens. A popular GitHub MCP server with 93 tools consumes ~55,000 tokens before a single call — a 275× overhead. Well-designed hierarchical MCP servers that expose a short index at init and return full schemas on demand close this gap significantly. Benchmarks from early 2026 show 33% worse task completion rates for naive MCP vs direct CLI approaches in inner-loop agent tasks; this reflects poor server design more than inherent protocol limitations.
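The hierarchical pattern can be sketched as a registry that returns only names and one-line summaries at init and serves a full schema per tool on request. The class, the 93 synthetic tools, and the four-characters-per-token estimate are assumptions for illustration; none of them is part of MCP or of any particular server.

```python
import json

class LazyToolRegistry:
    """Hypothetical registry: short index up front, full schema on demand."""

    def __init__(self, tools: dict):
        # tools maps name -> {"summary": str, "schema": dict}
        self._tools = tools

    def index(self) -> list:
        """Cheap listing loaded at session start."""
        return [{"name": n, "summary": self._tools[n]["summary"]} for n in sorted(self._tools)]

    def schema(self, name: str) -> dict:
        """Full JSON Schema, fetched only when the agent is about to call the tool."""
        return self._tools[name]["schema"]

def rough_tokens(obj) -> int:
    """Crude size estimate: about four characters per token (an assumption)."""
    return len(json.dumps(obj)) // 4

# Synthetic tool set; real schemas would be far more varied.
tools = {
    f"tool_{i}": {
        "summary": f"Hypothetical tool number {i}",
        "schema": {
            "type": "object",
            "properties": {
                f"param_{j}": {"type": "string", "description": "x" * 80}
                for j in range(10)
            },
        },
    }
    for i in range(93)
}

registry = LazyToolRegistry(tools)
upfront_full = rough_tokens([t["schema"] for t in tools.values()])
upfront_index = rough_tokens(registry.index())
print(upfront_index, upfront_full)
```

With this layout the agent pays for the index once and for a single schema per call, instead of paying for every schema before the first call.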
Beyond token overhead, every MCP wrapper introduces an abstraction tax — a structural fidelity loss from the layer between agent and underlying tool. Constrained tool definitions sacrifice expressiveness; full-surface definitions consume prohibitive context. See the Poehnelt "MCP Abstraction Tax" analysis in §6 for the fidelity spectrum and its implications.
What it requires of tool authors
Authors must implement a full JSON-RPC server: define JSON Schema for every tool, handle the MCP lifecycle (initialization, tools/list, tools/call, and shutdown, which is signaled by closing the transport rather than by a dedicated message), and ship either a STDIO binary or HTTP service. SDKs exist for Python, TypeScript, Go, Java, and Kotlin, among others. For an existing CLI tool, this means building and maintaining a separate server layer; the CLI itself does not become MCP-native without a wrapper.
Relationship to this spec
Complementary: different integration layers. MCP defines the agent↔server protocol; this spec defines the subprocess behavioral contract. A CLI built to this spec is trivially wrappable in MCP (the manifest provides the JSON Schema, the response envelope maps directly to tool results, the exit code taxonomy maps to isError). A raw CLI requires bespoke wrapper code for each of the 73 failure modes. The two approaches are not in competition; they address sequential layers of the same stack.
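As a rough illustration of that mapping, the sketch below converts a subprocess result into an MCP-style tool result. The function name, the exit_code envelope key, the fallback error shape, and the convention that exit code 0 means success are assumptions; only retryable and retry_after_ms are field names taken from this document, and the spec's actual envelope may differ.

```python
import json

def to_mcp_tool_result(exit_code: int, stdout: str) -> dict:
    """Map a CLI exit code plus JSON envelope onto an MCP-style tool result.

    Assumes a compliant CLI prints exactly one JSON object on stdout.
    """
    try:
        envelope = json.loads(stdout)
        if not isinstance(envelope, dict):
            raise ValueError("envelope must be a JSON object")
    except ValueError:
        # A raw CLI lands here: the wrapper has to invent its own error shape.
        envelope = {"error": "unparseable output", "raw": stdout[:200]}
        exit_code = exit_code or 1
    # Preserve the fine-grained code inside the payload (hypothetical key).
    envelope.setdefault("exit_code", exit_code)
    return {
        "content": [{"type": "text", "text": json.dumps(envelope)}],
        # MCP offers a single boolean, so the taxonomy collapses to this flag.
        "isError": exit_code != 0,
    }

rate_limited = to_mcp_tool_result(
    75, '{"error": "rate limited", "retryable": true, "retry_after_ms": 500}'
)
print(rate_limited["isError"])
```

The compliant path is a few lines; the except branch is where the bespoke per-tool work accumulates for a raw CLI.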
2. OpenAPI for CLIs
What it is
OpenAPI is a specification for HTTP APIs. Its application to CLIs takes two forms:
- CLI → OpenAPI: tools like the AWS CLI and Azure CLI expose --output json / -o json flags and generate OpenAPI-style schema documentation from their command trees
- OpenAPI → CLI: tools like openapi-generator produce CLI clients from an OpenAPI spec
Coverage
41.5% across 65 failure modes (16 native ✓, 22 partial ~, 27 missing ✗) — tied with Cobra.
Documented limitations of real implementations
| Tool | Limitation |
|---|---|
| Azure CLI | Some subcommands (e.g. az aks command) do not honour --output json |
| Azure CLI | az --version cannot produce JSON output (open issue) |
| AWS CLI | JSON skeleton format is not stable between CLI versions |
| AWS CLI | aws s3 ls returns text regardless of --output setting |
| Both | Exit code 0 returned for many error conditions even in JSON mode |
Gaps
OpenAPI defines HTTP status codes (200, 400, 404, 429), which overlap partially with the exit code taxonomy but are separate — CLI exit codes have no standard OpenAPI representation. OpenAPI says nothing about interactive prompts, child process management, unbounded output, or any of the process-level behavioral contracts. Schema versioning covers the whole API, not individual response schemas per invocation.
Relationship to this spec
Complementary for HTTP-API-backed CLIs; limited for native subprocess CLIs. OpenAPI is the right tool for CLIs that are generated from or backed by HTTP APIs. For CLIs that are native subprocesses, OpenAPI does not address the behavioral layer this spec targets.
3. CLI Frameworks
Click (Python) — 23.8%
Click provides TTY-aware output (click.echo() strips ANSI styling when the stream is not a terminal) and confirm prompts. It does not natively enforce structured output, exit code taxonomy, retry hints, idempotency, pagination, or tool manifests.
Key agent hazard: click.echo() does not distinguish data from diagnostics — both go to stdout by default. JSON output mixed with progress messages is a common agent parsing failure.
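A minimal sketch of the separation an agent needs, using only the standard library: the JSON payload is the only thing written to stdout, and progress messages go to stderr. In Click, the equivalent is passing err=True to click.echo() for diagnostics. The function names and messages below are illustrative.

```python
import json
import sys

def log(message: str) -> None:
    """Diagnostics go to stderr so they never mix with the data stream."""
    print(message, file=sys.stderr)

def emit(payload: dict) -> None:
    """The machine-readable result is the only thing written to stdout."""
    json.dump(payload, sys.stdout)
    sys.stdout.write("\n")

def list_items() -> None:
    log("fetching items...")
    emit({"items": ["a", "b"], "count": 2})
    log("done")

if __name__ == "__main__":
    list_items()
```

An agent that reads stdout can then pass the whole stream to a JSON parser without filtering out progress lines first.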
Typer (Python) — 19.2%
Built on Click; inherits its limitations and ranks below Click. typer.prompt() blocks indefinitely on non-TTY stdin — exactly the scenario agents operate in.
Agentyper (0.1.4, alpha): wraps Typer with --yes/--answers flags, isatty() detection, and structured output. Scores 29.2% — 10 points higher than Typer, demonstrating that the agent-friendly layer is implementable but requires deliberate work.
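The non-TTY hazard and the --yes remedy can be sketched with the standard library alone. This illustrates the general pattern and is not Agentyper's implementation; the exception and function names are hypothetical.

```python
import sys

class ConfirmationRequired(Exception):
    """Raised instead of blocking when no human can answer a prompt."""

def confirm(question: str, assume_yes: bool = False, stdin=None) -> bool:
    """Ask for confirmation only when a terminal is attached."""
    stdin = stdin if stdin is not None else sys.stdin
    if assume_yes:
        # Equivalent of a --yes flag: never touch stdin.
        return True
    if not stdin.isatty():
        # An agent's pipe would otherwise sit here waiting for input.
        raise ConfirmationRequired(f"{question} (re-run with --yes to proceed)")
    print(f"{question} [y/N] ", end="", file=sys.stderr, flush=True)
    return stdin.readline().strip().lower() in ("y", "yes")
```

A caller would typically catch ConfirmationRequired at the top level and turn it into a structured error with a non-zero exit code, so the agent learns it must pass the flag rather than waiting on a hung process.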
Cobra (Go) — 41.5%
Used by Kubernetes, Docker, gh, Hugo. Go's type system provides UTF-8 safety and buffer/pipe deadlock immunity. context.WithTimeout integration is native. However, Cobra provides no JSON output primitive — every --output json flag in every Cobra-based tool is individually authored by the tool's maintainers. No framework-level primitives for exit code taxonomy, retry hints, idempotency, pagination, or tool manifests.
Notable example: The GitHub CLI (gh) built JSON output on top of Cobra with field selection (gh pr list --json number,title,state). This is strong practice, but it is the GitHub team's design — Cobra enforces nothing.
Clap (Rust) — 43.1%
Highest score among parser frameworks. Rust's type system provides structural solutions for encoding safety (UTF-8 invariant), buffer deadlocks (async I/O safety), and locale issues (no locale-dependent formatting). The Rust CLI book explicitly recommends line-delimited JSON and IsTerminal detection for machine communication. OutputFormat enums with JSON/YAML/TOML variants compose naturally with serde_json.
Gaps: Same as Cobra — no framework-level primitives for exit code taxonomy, retry hints, idempotency, pagination, or tool manifests.
Summary
No major CLI framework has adopted structured JSON output, a defined exit code taxonomy, or agent-specific behavioral contracts as framework-level primitives. All require the application author to implement these manually per command. CLI Agent Spec specifies what that manual implementation must produce.
4. Function Calling Standards
OpenAI function calling / Anthropic tool use / Google Vertex AI
All three converge on the same pattern: the model receives JSON Schema definitions for available tools, outputs a structured call request, and the host executes it and returns the result. The standards define:
- Input: JSON Schema for parameters (name, type, description, required)
- Output: Structured JSON returned to the model
- Error: A boolean flag or error object alongside the result
None of these standards define how the underlying tool should behave. They define the interface between the model and the host application. How the host calls a CLI subprocess, handles exit codes, parses output, or manages timeouts is entirely outside their scope. A CLI wrapped as a function call inherits all 73 failure modes; the wrapper code must handle them individually, which is what this spec eliminates.
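The host-side work that these standards leave undefined looks roughly like the sketch below: run the subprocess with a timeout, close stdin so a prompt cannot block, and normalize the outcome. The function name and the result keys (ok, reason, exit_code, data) are hypothetical, and a Python one-liner stands in for a real CLI so the example runs anywhere.

```python
import json
import subprocess
import sys

def run_cli(argv: list, timeout_s: float = 10.0) -> dict:
    """Run a subprocess and normalize the outcome for a function-calling host."""
    try:
        proc = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            stdin=subprocess.DEVNULL,  # never let the child wait on a prompt
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "reason": "timeout", "exit_code": None, "data": None}
    try:
        data = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"ok": False, "reason": "unparseable_output",
                "exit_code": proc.returncode, "data": None}
    succeeded = proc.returncode == 0
    return {"ok": succeeded, "reason": None if succeeded else "nonzero_exit",
            "exit_code": proc.returncode, "data": data}

# Stand-in child process in place of a real CLI.
result = run_cli([sys.executable, "-c", "import json; print(json.dumps({'n': 1}))"])
print(result)
```

Every branch here is a decision the wrapper author makes alone; a CLI with a defined exit code taxonomy and output envelope removes most of them.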
MCP tool annotations (2025-11-25 spec)
The 2025-11-25 MCP spec defines readOnlyHint, destructiveHint, idempotentHint, and openWorldHint. These tool annotations are the only function-calling-adjacent mechanism addressing behavioral contracts. They are advisory only: the protocol does not enforce them, and they do not cover retry semantics, timeouts, partial failure, or the full exit code taxonomy.
Relationship to this spec
Parallel — different interface boundaries. Function calling standards address the model↔host boundary. This spec addresses the host↔subprocess boundary. Both are necessary; neither substitutes for the other.
5. Shell and Terminal Integration
jc (JSON Convert)
Wraps ~100 standard Unix tools (ls, ps, df, ifconfig, etc.) with hardcoded text-to-JSON parsers. Practical workaround for specific tools; breaks when tools change their output format. Does not address exit codes, interactive prompts, retryability, or any behavioral contracts.
jq
A JSON stream processor. Useful for consuming structured output from CLIs that already emit JSON; a workaround for CLIs that embed structured data in human-formatted text. Does nothing about exit codes, prompts, or unbounded output.
Nushell
A shell that treats all data as structured objects rather than text streams (analogous to PowerShell's object pipeline). Commands pass typed tables and records through pipes. Nushell 0.108.0 (October 2025) added an optional MCP server.
Key limitation: Nushell's structure exists within the shell's pipeline — the CLI tools themselves do not change. When an agent calls ls through Nushell, Nushell parses ls's text output into structured data using built-in parsers. The underlying CLI still has ambiguous exit codes, can prompt interactively, and can emit unbounded output. Agents that operate outside a Nushell environment (which is most agents) receive no benefit.
PowerShell
Mature object pipeline with typed .NET objects. Excellent for PowerShell-native automation. Most CLI tools are not PowerShell cmdlets; most CI/CD and server environments run Linux; most agents trained on Unix idioms generate Bash patterns that fail in PowerShell. Does not address the broader ecosystem of Python/Go/Rust/Node subprocess CLIs.
6. Competing Specifications and Proposals
"Rewrite Your CLI for AI Agents" — Justin Poehnelt
Source: justin.poehnelt.com/posts/rewrite-your-cli-for-ai-agents — the primary source for the jpoehnelt-scale rubric in the comparison matrix. Reference implementation: Google Workspace CLI (gws).
The post frames the design tension as: "Human DX optimizes for discoverability and forgiveness. Agent DX optimizes for predictability and defense-in-depth." It proposes seven principles, each mapping directly to challenges in this spec:
| Principle | Spec challenge(s) | What the post adds |
|---|---|---|
| Raw JSON payloads (--json passthrough) | §46 API Schema Translation Loss | Eliminates flag translation entirely for API-backed CLIs by accepting full API payloads directly |
| Runtime schema introspection (gws schema <method>) | §21 Schema Discoverability, §52 Recursive Discovery Cost | Machine-readable dump of method signatures, parameters, response types, and OAuth scopes |
| Context window discipline (field masks, NDJSON) | §4 Verbosity & Token Cost, §5 Pagination, §43 Unbounded Output | Mandatory field masks for API responses; "ALWAYS use field masks to avoid overwhelming your context window" |
| Input hardening against hallucinations | §34 Shell Injection, §35 Agent Hallucination Input Patterns, §59 Token Poisoning | Rejects path traversal (../), control characters below 0x20, embedded ?/# in resource IDs, percent-encoding — framed as: "The agent is not a trusted operator" |
| Shipping agent skills (YAML frontmatter + Markdown) | §44 Agent Knowledge Packaging Absence | Documents invariants agents cannot infer from help text: "Always use --dry-run for mutating operations" |
| Multi-surface architecture from a single source | §47 MCP Wrapper Schema Staleness | Both CLI and MCP server derived from the same Discovery Document — the only known concrete solution to §47 |
| Safety rails (--dry-run, response sanitization) | §23 Destructive Operations, §25 Prompt Injection | Pipes API responses through Google Cloud Model Armor to strip embedded prompt injection before returning to the agent |
Scope: The post is scoped to API-backed CLIs (Google Workspace APIs). It does not address exit code taxonomy, retry hints, timeouts, signal handling, partial failure, or any of the §38–68 ecosystem/runtime challenges. The security framing ("agent is not a trusted operator") and the multi-surface / single-source-of-truth architecture are the two ideas with the widest applicability beyond API-backed tools.
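The input-hardening rules listed in the table can be sketched as a single validator. This illustrates the checks as described in the post and is not the gws implementation; the function name and error messages are hypothetical.

```python
import re

_PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")

def validate_resource_id(value: str) -> str:
    """Return the value unchanged if it passes every check, else raise ValueError."""
    if any(ord(ch) < 0x20 for ch in value):
        raise ValueError("control character in resource ID")
    if ".." in value.split("/"):
        raise ValueError("path traversal segment in resource ID")
    if "?" in value or "#" in value:
        raise ValueError("query or fragment delimiter in resource ID")
    if _PERCENT_ENCODED.search(value):
        raise ValueError("percent-encoding in resource ID")
    return value

print(validate_resource_id("files/abc123"))
```

Rejecting outright, rather than sanitizing, keeps the behavior predictable: the agent gets a clear error and can correct the hallucinated value instead of silently operating on a rewritten one.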
On §47 specifically: The comparison matrix marks §47 (MCP Wrapper Schema Staleness) as universally unsolved. Poehnelt's approach — generating both the CLI command tree and the MCP tool definitions from a single upstream API Discovery Document — is the only known architectural pattern that eliminates drift by construction. This pattern is applicable wherever a CLI wraps a structured API with a machine-readable schema.
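The single-source pattern can be sketched as two pure functions over one schema document. The miniature schema format below is invented for illustration and is far simpler than a real Discovery Document; the point is only that both surfaces are derived, so neither can drift from the other.

```python
import argparse

# Hypothetical, heavily simplified stand-in for an API schema document.
SOURCE = {
    "files_list": {
        "description": "List files",
        "parameters": {
            "page_size": {"type": "integer", "required": False},
            "query": {"type": "string", "required": True},
        },
    },
}

_PY_TYPES = {"integer": int, "string": str}

def build_cli(source: dict) -> argparse.ArgumentParser:
    """Derive the CLI command tree from the schema."""
    parser = argparse.ArgumentParser(prog="demo")
    commands = parser.add_subparsers(dest="method", required=True)
    for method, spec in source.items():
        sub = commands.add_parser(method, help=spec["description"])
        for name, param in spec["parameters"].items():
            sub.add_argument(
                "--" + name.replace("_", "-"),
                dest=name,
                type=_PY_TYPES[param["type"]],
                required=param["required"],
            )
    return parser

def build_mcp_tools(source: dict) -> list:
    """Derive MCP-style tool definitions from the same schema."""
    tools = []
    for method, spec in source.items():
        params = spec["parameters"]
        tools.append({
            "name": method,
            "description": spec["description"],
            "inputSchema": {
                "type": "object",
                "properties": {n: {"type": p["type"]} for n, p in params.items()},
                "required": [n for n, p in params.items() if p["required"]],
            },
        })
    return tools

args = build_cli(SOURCE).parse_args(["files_list", "--query", "report", "--page-size", "5"])
print(args.page_size, build_mcp_tools(SOURCE)[0]["inputSchema"]["required"])
```

Adding a parameter to SOURCE changes the CLI flag set and the MCP input schema in the same build, which is what "eliminates drift by construction" means in practice.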
"The MCP Abstraction Tax" — Justin Poehnelt
Source: justin.poehnelt.com/posts/mcp-abstraction-tax — a direct follow-up to the "Rewrite Your CLI" post above, examining what MCP wrapping costs even when done correctly.
Core thesis: Every protocol layer between an agent and an API loses fidelity — an "abstraction tax." For MCP servers wrapping complex enterprise APIs, the costs compound: "the REST API itself is already an imperfect projection of the underlying data model," and MCP adds another abstraction layer on top.
The two-path problem
Developers wrapping an enterprise API (e.g. a CRM) in MCP face a structural dilemma:
| Path | Approach | Cost |
|---|---|---|
| Constrained tools | Expose create_account, update_opportunity | Lossy: cannot express complex operations like bulk updates with custom field recalculation |
| Full surface | Expose every API method with complete schemas | Theoretically complete, but "would consume a meaningful fraction of an agent's reasoning capacity" through token overhead |
Neither path escapes the abstraction tax. Constrained tools sacrifice fidelity; full-surface tools sacrifice context budget.
CLI + Skills as a middle path
The post positions CLI + Skills (on-demand discovery) as a third option: the agent pays "token cost only when relevant" rather than loading all tool schemas upfront. This maps directly to the spec's tool manifest design — the manifest is the structured form of what Poehnelt calls "incremental context cost" vs "upfront fidelity loss."
Fidelity spectrum
| Approach | Accessibility | Fidelity | Context cost |
|---|---|---|---|
| MCP (constrained) | High | Lower | Low upfront |
| MCP (full surface) | High | High | Prohibitive |
| CLI + Skills | Moderate | High | On-demand |
| Raw API + client libraries | Low | Maximum | Minimal |
These represent different optimization priorities, not competing solutions.
Spec challenge mappings
| Post concept | Spec challenge |
|---|---|
| Upfront tool schema token overhead | §4 Context window exhaustion |
| Constrained MCP tools losing expressiveness | §47 MCP wrapper schema staleness |
| On-demand schema / --help discovery | §52 Recursive discovery cost, §21 Schema discoverability |
| API opaque identifiers, polymorphic fields | §35 Agent hallucination input patterns |
| MCP iterates faster than the underlying API | §47 (schema drift as a symptom of the abstraction tax) |
Relationship to the "Rewrite Your CLI" post: The first post advocates for Discovery Documents to minimize §47 drift. This post acknowledges that even a perfectly synced wrapper carries a structural fidelity cost. The two posts form a coherent view: Discovery Documents minimize drift but do not eliminate the abstraction tax — the fidelity cost is inherent to the wrapping layer, not to tooling quality.
Relationship to this spec: The spec's tool manifest command (returning the full command tree as machine-readable JSON on demand) is the architectural answer to both concerns: it provides complete fidelity (no constrained-tool expressiveness loss), zero upfront context cost (manifest is loaded only when the agent needs to construct a call), and no wrapper layer (the CLI itself is the tool).
Other community convergence (2025–2026)
Several independent sources converged on a ~10-rule checklist during 2025–2026:
| Source | Rules covered |
|---|---|
| "Keep the Terminal Relevant" (InfoQ, 2026) | --json flag, stdout/stderr separation, exit codes, idempotency, --yes/--force, TTY detection, schema introspection, NDJSON pagination, plus semantic versioning for output contracts and --syntax-check for early validation |
| better-cli / SKILL.md (GitHub: yogin16/better-cli) | 17 rules as an agent-installable skill targeting 40+ agent platforms |
These represent informal community knowledge, not normative specifications. No acceptance criteria, no machine-readable schemas, no tiered contracts, no enforcement mechanism.
"CLI is the new MCP" narrative (early 2026)
A cluster of blog posts argued that direct CLI invocation is superior to MCP for inner-loop agent tasks:
- 35× better token efficiency in some benchmarks
- 33% better task completion rates in controlled comparisons
- Leverages existing maintained tool investment
- Unix composability preserved
Lambda AI's 450M-token tool-calling distillation dataset (May 2026) provides independent scale evidence: training samples average 20 turns per conversation with 10–15 tool calls per turn, collected via the Hermes Agent harness across 184 H100 GPUs over several days. This volume of agent–tool interaction confirms that tool-calling is not an edge case — it is the dominant inner-loop operation. It also demonstrates that model capability alone does not solve the behavioral contract problem: even a perfectly fine-tuned tool-calling model fails when the CLI it invokes has ambiguous exit codes, interactive prompts, or unstructured output.
The counterpoint (also well-represented): MCP is better for stateful, authenticated, multi-system coordination and cloud-hosted agent deployments. This debate does not produce a competing specification — it produces advocacy for fixing existing CLIs rather than wrapping them in MCP servers.
AGENTS.md convention
A Markdown file placed in a repository that tells coding agents how to work with that codebase (build steps, test commands, conventions). Used by 60,000+ open-source projects; supported by Codex, Cursor, Gemini CLI, Copilot, and others. Addresses per-project instructions, not CLI behavioral contracts. Does not address exit codes, structured output, prompts, or any process-level guarantees.
AI Manifest (ai-manifest.org)
A community standard for publishing AI service metadata at /.well-known/ai.json, combining OpenAPI schema discovery with JWKS-based cryptographic verification. Addresses service discovery — how agents find what tools exist — not the behavioral contracts of those tools after discovery. Complementary.
AWS CLI Agent Orchestrator (awslabs/cli-agent-orchestrator)
An open-source multi-agent orchestration framework from AWS Labs that wraps Amazon Q CLI and Claude Code as worker agents in a supervisor/worker hierarchy. Orchestrates calls to existing CLIs rather than specifying how CLIs should behave. Does not define exit code standards, structured output envelopes, or tool manifests.
EgisAI
Source: egisai.co / EgisLabs/egisai-sdk — runtime governance SDK for Python AI applications.
EgisAI intercepts LLM provider calls and tool invocations to enforce PII masking, policy rules, model allowlists, and audit logging before and after each call. One-line integration (egisai.init()) with 15+ AI frameworks. Targets engineering and security teams shipping production AI features, not CLI authors.
The SDK operates at the agent → LLM/tool boundary, not the host → subprocess boundary. It addresses adjacent challenges from the agent side:
| EgisAI concern | Spec challenge | Difference |
|---|---|---|
| PII leakage to third-party APIs | §25 Prompt injection, §59 Token poisoning | EgisAI sanitizes agent-side output before dispatch; spec requires CLIs to sanitize their own responses before returning them to the agent |
| Unauthorized tool access | §34 Shell injection | EgisAI restricts which tools agents may call; spec requires tools to reject dangerous inputs at registration |
| Audit trail | §33 Observability & audit trail | EgisAI audits at the agent layer; spec requires request_id and JSONL audit logs at the CLI layer |
Relationship to this spec: Complementary — agent-side governance vs. CLI-side behavioral contracts. EgisAI governs what the agent is permitted to invoke; this spec governs how CLIs must behave when invoked. Both are necessary in a production agent deployment; neither substitutes for the other.
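On the CLI side, the audit-trail requirement reduces to appending one JSON object per invocation, keyed by a request_id that the agent layer can correlate with its own logs. In the sketch below only request_id and the JSONL format come from this document; the other field names, the helper names, and the in-memory stream standing in for a log file are assumptions.

```python
import io
import json
import time
import uuid

def audit_record(command: list, exit_code: int, request_id: str = "") -> dict:
    """Build one audit entry; a fresh request_id is generated when none is passed in."""
    return {
        "request_id": request_id or str(uuid.uuid4()),
        "timestamp": time.time(),
        "command": command,
        "exit_code": exit_code,
    }

def append_jsonl(stream, record: dict) -> None:
    """JSONL: exactly one JSON object per line, appended in order."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")

log = io.StringIO()  # stand-in for an append-only log file
append_jsonl(log, audit_record(["demo", "deploy", "--yes"], 0, request_id="req-1"))
append_jsonl(log, audit_record(["demo", "status"], 3))
print(log.getvalue(), end="")
```

Accepting a caller-supplied request_id lets an agent-side governance layer and the CLI-side log refer to the same call.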
EvidenceRun — Agent Reliability Audit Taxonomy
Source: evidencerun.com / Substack: "12 Ways AI Agents Fail in Production" (May 2026) — a structured failure taxonomy for production agent behavior, used as the basis for a commercial reliability audit service targeting enterprise-facing agent startups.
What it is
A 12-mode taxonomy derived from production incidents, red-team write-ups, and instrumented agent traces across refund bots, coding agents, research agents, sales-prospecting agents, and support copilots. The taxonomy is organized into three severity tiers and delivered as a buyer-ready "Agent Reliability Report" founders can use in investor, security, and enterprise sales conversations.
| Tier | Mode | Description |
|---|---|---|
| Critical | #3 PII exposure | Customer data, secrets, or prompt content leaks to a third-party tool, log, or downstream model |
| Critical | #5 Missing approval | High-impact actions execute without the human-in-the-loop check the workflow promises |
| Critical | #11 Unverifiable decisions | A decision was made; nobody can reconstruct what the agent saw, asked, or weighed |
| Critical | #12 No replay trail | Inputs, prompts, model versions, and tool outputs are not stored long enough for an after-the-fact audit |
| Operational | #1 Tool misuse | Agent calls a tool with the wrong args, wrong scope, or no need to call it at all |
| Operational | #2 Hidden retries | Silent retry loops on non-idempotent calls cause duplicate side effects nobody can see in the trace |
| Operational | #6 Runaway cost | Recursive calls, retry storms, or context bloat send a single run past the alarm threshold |
| Operational | #8 Silent failure | Tool returned an error; the agent returned success — the user is told something happened that didn't |
| Subtle | #4 Prompt injection | Untrusted content in inputs, attachments, or web pages overrides system instructions |
| Subtle | #7 Stale context | Agent acts on cached customer state, expired session data, or out-of-date documents |
| Subtle | #9 Wrong system access | Agent inherits service-account permissions far beyond what the workflow requires |
| Subtle | #10 Output drift | Customer-facing wording, format, or recommendations drift across runs in ways nobody noticed |
Mapping to this spec
EvidenceRun's taxonomy operates at the agent behavior layer; this spec operates at the CLI tool design layer. Most of EvidenceRun's 12 modes have root causes that live in CLI behavioral contracts:
| EvidenceRun mode | Spec challenges | Relationship |
|---|---|---|
| #1 Tool misuse | §2 Output format, §21 Schema discoverability, §35 Hallucination input patterns | Exit code taxonomy and arg validation requirements make misuse structurally detectable |
| #2 Hidden retries | §12 Idempotency / safe retries, §19 Retry hints | Near-perfect match: the spec's retryable/side_effects invariant is the only known design-time mechanism that addresses this mode per exit code |
| #3 PII exposure | §34 Shell injection, §59 Token poisoning | Spec controls structured output boundaries; EgisAI is the closer agent-side enforcement |
| #4 Prompt injection | §25 Prompt injection, §59 Token poisoning | Spec addresses the CLI trust boundary; CLI-side response sanitization (per Poehnelt's Model Armor approach) is the structural fix |
| #5 Missing approval | §23 Destructive operations | side_effects field enables agent-side approval gating; spec cannot enforce gating by design — enforcement is the agent's responsibility |
| #6 Runaway cost | §11 Timeout enforcement, §43 Unbounded output | Timeout signals and pagination requirements reduce runaway cost surface |
| #7 Stale context | None | Agent architecture concern; no CLI design addresses read/write time-delta in agent state |
| #8 Silent failure | §1 Exit code taxonomy, §6 Errors (entire part) | Core value of this spec — exit codes, structured error envelopes, and non-zero exits on failure are the direct structural fix |
| #9 Wrong system access | §30 Undeclared filesystem side effects, §53 Credential expiry | Manifest declarations enable least-privilege reasoning; IAM enforcement is external |
| #10 Output drift | §22 Schema versioning per response | Versioned response-envelope schema and regression scaffolding directly address drift |
| #11 Unverifiable decisions | §33 Observability & audit trail | request_id, JSONL audit logs, and structured response envelopes are the spec's contributions |
| #12 No replay trail | §33 Observability & audit trail, §22 Schema versioning | Structured, versioned, deterministic output makes replay possible; the spec provides the format |
10 of EvidenceRun's 12 modes trace directly to CLI behavioral gaps this spec addresses. The two without a CLI fix — #7 (Stale context) and partly #9 (Wrong system access / IAM enforcement) — are genuinely agent-architecture concerns.
Compounding and the spec's contribution
EvidenceRun emphasizes that failure modes compound in production: prompt injection (#4) + wrong system access (#9) = an attacker with write access to billing; hidden retries (#2) + silent failure (#8) = three invoices sent, system reports one. The spec's side_effects declarations and structured exit codes allow agents to implement defense-in-depth before a call lands — earlier in the stack than any audit can intervene.
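A sketch of how an agent might consume those fields before retrying. The field names retryable, retry_after_ms, and side_effects come from this document; the value "none" for side_effects, the attempt budget, and the policy itself are assumptions made for illustration.

```python
def retry_decision(envelope: dict, attempt: int, max_attempts: int = 3) -> dict:
    """Decide whether a failed call may be repeated without risking duplicate effects."""
    if attempt >= max_attempts:
        return {"retry": False, "reason": "attempt budget exhausted"}
    if not envelope.get("retryable", False):
        return {"retry": False, "reason": "tool marked the failure as not retryable"}
    if envelope.get("side_effects", "unknown") != "none":
        # Repeating a call that may have partly succeeded is how duplicates happen.
        return {"retry": False, "reason": "side effects may already have occurred"}
    return {"retry": True, "wait_ms": int(envelope.get("retry_after_ms", 0))}

print(retry_decision({"retryable": True, "side_effects": "none", "retry_after_ms": 250}, attempt=1))
```

The default for a missing side_effects field is deliberately pessimistic: an undeclared effect is treated as a possible effect, so the hidden-retry scenario requires an explicit declaration to unlock.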
Relationship to this spec
EvidenceRun is post-hoc audit; this spec is design-time prevention. EvidenceRun instruments one workflow, maps failures to the 12-mode taxonomy, and packages the evidence for enterprise buyers. This spec provides the behavioral contracts that make most of those failures structurally impossible or detectable at call time. A CLI built to this spec would score better on modes #1, #2, #8, #10, #11, and #12 by construction — not because the audit rubric changed, but because the failure mode was eliminated at the source.
The spec as the substrate that makes EvidenceRun findings fixable: EvidenceRun's 12 modes are high-level; the spec's 73 failure modes are granular sub-cases. "#1 Tool misuse" in EvidenceRun decomposes into at least 8 §N failure modes in the spec. An EvidenceRun audit tells a team which high-level mode fired. The spec tells CLI authors which specific behavioral contract to implement to prevent it from firing again.
7. SkillOpt — Text-Space Skill Optimization
Source: arXiv:2605.23904 — "SkillOpt: Executive Strategy for Self-Evolving Agent Skills" (Yang et al., Microsoft Research + Shanghai Jiao Tong / Tongji / Fudan universities, May 2026)
What it is
SkillOpt is the first systematic text-space optimizer for agent skills. Rather than modifying model weights or wrapping CLIs in new protocols, it treats a compact natural-language skill document as the trainable state of a frozen agent. A separate optimizer model converts scored execution trajectories into bounded add/delete/replace edits on the skill document; a held-out validation gate accepts an edit only if it strictly improves performance on a disjoint selection split. The deployed output is a single best_skill.md file — 300–2,000 tokens, assembled from a median of 2.5 accepted edits across the entire optimization run.
The training loop deliberately mirrors deep-learning optimizer design:
| DL concept | SkillOpt equivalent |
|---|---|
| Mini-batch | Rollout batch of scored trajectories |
| Learning rate | Edit budget Lₜ — maximum edits applied per step |
| Validation / early stopping | Held-out selection gate (strict improvement required; ties rejected) |
| Momentum | Epoch-wise slow update in a protected <!-- SLOW_UPDATE --> markup region |
| Negative replay buffer | Rejected-edit buffer — failed edits recycled as negative feedback for later steps |
Each component is ablated. Removing bounded edits costs up to 22.5 points on SpreadsheetBench; removing the slow/meta update costs the same. The gains are robust to rollout batch size, reflection minibatch size, and learning-rate schedule — but sensitive to the presence of bounded text-space learning and validation gating.
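The edit budget, the strict-improvement gate, and the rejected-edit buffer can be sketched as a small loop. This is a toy reconstruction from the description above, not the paper's code: the skill is a list of lines, propose and score are caller-supplied stand-ins for the optimizer model and the held-out evaluation, and all names are hypothetical. The slow update is omitted.

```python
from typing import Callable, List, Tuple

Skill = List[str]
Edit = Tuple[str, int, str]  # (op, line index, text); op is "add", "delete", or "replace"

def apply_edits(skill: Skill, edits: List[Edit], budget: int) -> Skill:
    """Apply at most `budget` edits, mirroring the bounded edit budget."""
    result = list(skill)
    for op, index, text in edits[:budget]:
        if op == "add":
            result.insert(index, text)
        elif op == "delete" and 0 <= index < len(result):
            del result[index]
        elif op == "replace" and 0 <= index < len(result):
            result[index] = text
    return result

def optimize(skill: Skill,
             propose: Callable[[Skill, List[List[Edit]]], List[Edit]],
             score: Callable[[Skill], float],
             steps: int, budget: int) -> Tuple[Skill, List[List[Edit]]]:
    """Keep a candidate only if it strictly beats the current held-out score."""
    best, best_score = list(skill), score(skill)
    rejected: List[List[Edit]] = []
    for _ in range(steps):
        edits = propose(best, rejected)
        candidate = apply_edits(best, edits, budget)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # ties are rejected
            best, best_score = candidate, candidate_score
        else:
            rejected.append(edits)  # handed back to propose as negative feedback
    return best, rejected
```

Because the comparison is strict, an edit that merely preserves the score never enters the skill, which is consistent with the small number of accepted edits reported.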
What it requires of CLIs
SkillOpt's training loop requires three prerequisites from any CLI it optimizes against:
1. Automatic verifiers. The held-out validation gate requires a reliable scalar success signal. Exit codes and output must be machine-interpretable without external oracles. A CLI where "success" requires reading human-formatted text and making a judgment call cannot support automatic validation gating — the optimizer learns from ambiguous signal and produces unstable skills. This maps directly to §1 (Exit Code Taxonomy) and §2 (Output Format).
2. Deterministic, bounded output. Non-deterministic output (§7) produces contradictory evidence batches — the optimizer may learn opposite rules from identical underlying behavior. Unbounded output (§43) inflates training cost: SearchQA already costs 213M tokens at 37.9M tokens per test-point gain.
3. Harness-agnostic behavior. The cross-harness transfer results show that skills trained in Codex can outperform skills trained natively in Claude Code. A SpreadsheetBench skill trained entirely inside the Codex execution harness transferred to Claude Code with a +59.7 point gain, and the transferred result slightly exceeded the in-domain Claude Code SkillOpt score of 80.4. The transferred rules are workbook-level procedures ("inspect structure before writing", "write evaluated static values, not formula references"), not harness-specific command sequences. CLIs whose correct usage depends on execution-environment conventions (env vars the harness happens to set, TTY state, workspace layout) produce non-transferable knowledge.
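The first prerequisite amounts to being able to write a verifier like the one below without any text heuristics. The envelope key status, the value "ok", and the rule that success requires both exit code 0 and that status are assumptions for illustration.

```python
import json

def verify(exit_code: int, stdout: str) -> float:
    """Scalar success signal for a validation gate: 1.0 on success, 0.0 otherwise."""
    if exit_code != 0:
        return 0.0
    try:
        envelope = json.loads(stdout)
    except json.JSONDecodeError:
        return 0.0  # human-formatted text gives the gate nothing to check
    if not isinstance(envelope, dict):
        return 0.0
    return 1.0 if envelope.get("status") == "ok" else 0.0

print(verify(0, '{"status": "ok"}'), verify(0, "All good!"))
```

A CLI that prints "All good!" and exits 0 scores 0.0 here even though it succeeded, which is exactly the ambiguous signal the paragraph above warns about.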
What it can and cannot address
SkillOpt can mitigate behavioral failure modes, meaning those rooted in procedures, output format expectations, error interpretation, and search discipline. It cannot address structural failure modes, which abort rollouts before any trajectory is logged:
| SkillOpt can learn around | SkillOpt cannot fix |
|---|---|
| §2 Unstructured output — learns format rules from failures | §10 TTY deadlock — rollout never completes |
| §4 Verbose output — optimizer selects compact procedures | §11 Hanging process — rollout exceeds training budget |
| §18 Poor error quality — learns to interpret error patterns | §25 Prompt injection — CLI-side output trust boundary |
| §21 Schema discoverability — can learn to probe --help | §34 Shell injection — CLI input validation issue |
| §44 Missing knowledge packaging — the optimized skill IS the knowledge | §45 Headless auth deadlock — structural TTY block |
This partition yields a design principle: structural failure modes form the prerequisite layer that spec compliance must eliminate before skill optimization can be effective.
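A minimal sketch of why structural failures sit below the optimizer: a rollout that never finishes yields no trajectory, so there is nothing to learn from. The helper names (`run_rollout`, `collect_trajectories`) and the time budget are hypothetical, and the code uses only Python's standard `subprocess` module rather than any SkillOpt API.

```python
import subprocess


def run_rollout(cmd, budget_s=5.0):
    """Run one command; report whether it finished inside the budget."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            stdin=subprocess.DEVNULL,  # no interactive input is available
            timeout=budget_s,          # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "structural_abort", None
    return "completed", proc


def collect_trajectories(commands, budget_s=5.0):
    """Split commands into logged trajectories and aborted rollouts."""
    trajectories, aborted = [], []
    for cmd in commands:
        status, proc = run_rollout(cmd, budget_s)
        if status == "structural_abort":
            aborted.append(cmd)  # nothing to log: no evidence produced
            continue
        trajectories.append(
            {
                "cmd": cmd,
                "exit_code": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
            }
        )
    return trajectories, aborted
```

A command that hangs lands in `aborted` and contributes no evidence, while a command that fails with a confusing message still produces a trajectory the optimizer can learn to interpret.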
Key empirical results
Across 6 benchmarks, 7 target models, and 3 execution harnesses (direct chat, Codex CLI, Claude Code), SkillOpt is best or tied-best on all 52 evaluated (model × benchmark × harness) cells. The gains are largest on procedural benchmarks where reusable format and tool-use rules matter most:
| Benchmark | GPT-5.5 no skill | GPT-5.5 SkillOpt | Gain | Accepted edits |
|---|---|---|---|---|
| SpreadsheetBench | 41.8 | 80.7 | +38.9 | 4 |
| OfficeQA | 33.1 | 72.1 | +39.0 | 1 |
| LiveMathBench | 37.6 | 66.9 | +29.3 | 1 |
| DocVQA | 78.8 | 91.2 | +12.4 | 3 |
| SearchQA | 77.7 | 87.3 | +9.6 | 4 |
| ALFWorld | 83.6 | 95.5 | +11.9 | 2 |
The edit economy finding is striking: OfficeQA gains +39.0 points from a single accepted edit. The validation gate rejects the majority of what the optimizer proposes; the deployed skill is the tip of a discarded iceberg.
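The gating behavior described above can be sketched as a greedy accept/reject loop. This is an illustrative simplification, not SkillOpt's actual algorithm: it assumes edits are plain text appended to the skill, that `score` is a held-out evaluation function supplied by the caller, and that an edit is accepted only on strict improvement.

```python
from typing import Callable, List, Tuple


def gate_edits(
    skill: str,
    candidate_edits: List[str],
    score: Callable[[str], float],
) -> Tuple[str, List[str], List[str]]:
    """Keep only the edits that strictly improve the held-out score."""
    best = score(skill)
    accepted: List[str] = []
    rejected: List[str] = []
    for edit in candidate_edits:
        trial = skill + "\n" + edit  # assumption: an edit is appended text
        trial_score = score(trial)
        if trial_score > best:
            skill, best = trial, trial_score
            accepted.append(edit)
        else:
            rejected.append(edit)  # could feed a rejected-edit buffer
    return skill, accepted, rejected
```

With a scorer that rewards only one useful rule, a batch of many proposals leaves a single accepted edit, and everything else ends up in the rejected list.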
These results provide the first quantitative measure of the performance gap attributable to missing procedural knowledge (§44): zero-shot frontier models reach 33–42% accuracy on procedural benchmarks; SkillOpt-trained skills reach 67–81%.
Relationship to this spec
Complementary — optimization layer above spec. This spec defines behavioral contracts CLIs must satisfy; SkillOpt is a training method for producing knowledge artifacts that help agents use conformant CLIs correctly. The two address sequential parts of the same problem:
- A CLI that violates the spec (interactive prompts, ambiguous exit codes, unstructured output) blocks SkillOpt's rollout loop — structural failure modes abort training before any trajectory is logged
- A CLI that satisfies the spec provides the stable, verifiable behavior that SkillOpt's validation gate requires
- SkillOpt then discovers the domain-semantic procedures the spec's requirements alone cannot encode: search heuristics, formula-evaluation discipline, answer-format constraints, tool-use sequencing
The spec is the prerequisite; SkillOpt is a consumer of conformant CLIs that produces portable knowledge artifacts from execution evidence. Together they form a complete adaptation stack: spec compliance removes structural blockers; skill optimization discovers reusable procedures; the exported best_skill.md deploys across models and harnesses without further training.
On §44 specifically: SkillOpt produces exactly the kind of artifact §44 identifies as missing. The spec defines the problem (agents cannot infer domain heuristics from --help); SkillOpt provides a training-loop answer. The two are not in competition — the spec defines what the artifact must encode; SkillOpt provides a systematic method for producing it.
8. Universal Gaps
The following 23 challenges have zero native implementations across all 12 evaluated solutions, including MCP. They represent the genuinely novel territory this spec addresses:
| Challenge | Why no solution addresses it |
|---|---|
| §7 Output non-determinism | No framework enforces deterministic field ordering in responses |
| §11 Timeout enforcement | All solutions treat timeouts as advisory; none enforce them at the framework layer |
| §12 Idempotency / safe retries | Advisory hints exist (MCP idempotentHint, HTTP PUT convention) but none are enforceable |
| §13 Partial failure / step manifests | No standard for multi-step operation state reporting, rollback, or completed/failed/skipped breakdown |
| §15 Race conditions / concurrency | No framework-level protection against concurrent invocations of non-reentrant commands |
| §16 Signal handling & graceful cancellation | Click/Typer map SIGINT to exit 1 + "Aborted!" but leave SIGTERM unhandled; no framework auto-installs a SIGTERM handler that emits a partial JSON result and exits 143 |
| §17 Child process leakage | No standard requires commands to clean up child processes on timeout or signal |
| §19 Retry hints in error responses | retryable and retry_after_ms fields are absent from all framework primitives |
| §20 Environment / dependency discovery | No auto-generated doctor command convention exists in any framework |
| §22 Schema versioning per response | All versioning covers the whole API/protocol; no solution injects per-response schema version |
| §29 Working directory sensitivity | No framework flags or documents commands that produce different results based on CWD |
| §30 Undeclared filesystem side effects | MCP's readOnlyHint is advisory only; no framework provides declarative per-command tracking of files read or written |
| §31 Network proxy unawareness | Go's stdlib HTTP client and Python's requests read proxy env vars by default (partial); Node.js https does not auto-read HTTP_PROXY/HTTPS_PROXY/NO_PROXY |
| §32 Self-update / auto-upgrade behavior | No standard requires commands to suppress self-update prompts or side effects in automation |
| §33 Observability & audit trail | No framework auto-generates a UUID request_id per invocation, injects it into every response, or writes an append-only JSONL audit log |
| §47 MCP wrapper schema staleness | By definition, no solution — including MCP itself — provides a mechanism to detect when a wrapped CLI has evolved away from its wrapper schema |
| §49 Async job / polling protocol absence | No framework provides a standard job_id / status_command / cancel_command contract for long-running operations |
| §53 Credential expiry mid-session | No framework distinguishes "never authenticated" (exit 8), "credentials expired" (exit 10), and "insufficient permissions" (exit 8) with structured expires_at and refresh_command fields |
| §55 Silent data truncation | No framework emits a structured warning when output exceeds a size threshold |
| §58 Multi-agent concurrent invocation conflict | No framework provides per-instance state namespacing or advisory file locking for config writes to allow parallel agent invocations without conflict |
| §59 High-entropy string token poisoning | No framework sanitizes or flags outputs that could corrupt an agent's context (e.g. injected prompt strings) |
| §66 Symlink loop and recursive traversal exhaustion | No framework tracks visited inodes or enforces traversal depth limits automatically; Go's filepath.WalkDir does not follow symlinks (partial) |
| §67 Agent-generated input syntax rejection | No framework accepts JSON5 (trailing commas, comments, unquoted keys) for structured input flags; all require strict JSON that agents frequently violate |
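To make a few of these gaps concrete, the sketch below shows what a response envelope covering §19, §22, and §33 might look like, with sorted keys as one way to get the stable field ordering §7 asks for. The field names `retryable`, `retry_after_ms`, and `request_id` come from the table above; `schema_version`, `ok`, `data`, the nesting under `error`, and both function names are assumptions for illustration only.

```python
import json
import uuid
from typing import Any, Dict, Optional


def build_envelope(
    ok: bool,
    data: Any = None,
    error: Optional[str] = None,
    retryable: bool = False,
    retry_after_ms: Optional[int] = None,
    schema_version: str = "1.0",
) -> Dict[str, Any]:
    """Wrap a command result with per-response metadata."""
    envelope: Dict[str, Any] = {
        "schema_version": schema_version,  # per-response version (§22)
        "request_id": str(uuid.uuid4()),   # one UUID per invocation (§33)
        "ok": ok,
        "data": data,
    }
    if not ok:
        envelope["error"] = {
            "message": error,
            "retryable": retryable,            # retry hint (§19)
            "retry_after_ms": retry_after_ms,  # None when no delay is known
        }
    return envelope


def append_audit(log_path: str, envelope: Dict[str, Any]) -> None:
    """Append one envelope as a single JSONL line; the file is never rewritten."""
    with open(log_path, "a", encoding="utf-8") as fh:
        # sort_keys gives byte-stable field ordering for identical content.
        fh.write(json.dumps(envelope, sort_keys=True) + "\n")
```

An agent reading such an envelope could decide whether to retry from `retryable` and `retry_after_ms` alone, and correlate the response with the audit log through `request_id`, without parsing any human-readable message.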
Comparison Summary
| Solution | Challenge coverage | Requires of tool authors | Key gap | vs. this spec |
|---|---|---|---|---|
| MCP | 57.7% | Full JSON-RPC server per tool | Exit code taxonomy, retry hints, step manifests, schema staleness | Complementary — different layer |
| OpenAPI (CLI) | 41.5% | Map every command/flag to schema | Exit codes, prompts, unbounded output | Complementary for HTTP-backed CLIs |
| Clap (Rust) | 43.1% | Author implements all contracts manually | No framework primitives for any agent contract | Complementary — spec defines what to implement |
| Cobra (Go) | 41.5% | Author implements all contracts manually | Same as Clap | Complementary |
| Click (Python) | 23.8% | Author implements all contracts manually | stdout/stderr mixing, no exit code taxonomy | Complementary |
| Typer (Python) | 19.2% | Author implements all contracts manually | prompt() blocks on non-TTY | Complementary |
| Function calling (OpenAI/Anthropic/Google) | 0% (different layer) | Write JSON Schema wrapper | Entire subprocess behavioral layer | Parallel — different boundary |
| jc / jq | Parsing workaround only | Nothing | All behavioral contracts | Workaround, not specification |
| Nushell / PowerShell | Parsing workaround only | Nothing for external CLIs | All behavioral contracts; environment dependency | Workaround |
| AGENTS.md | Per-repo instructions only | Write a Markdown file | All process-level contracts | Different scope |
| AI Manifest | Discovery only | Host /.well-known/ai.json | All behavioral contracts after discovery | Complementary |
| better-cli | Informal checklist | Write CLI following rules | No enforcement, no schemas, no tiered contracts | Informal predecessor of same problem space |
| EgisAI | 0% (different layer) | Add egisai.init() to agent code | CLI behavioral contracts entirely | Complementary — agent-side governance |
| EvidenceRun | N/A (audit taxonomy) | Instrument one agent workflow for tracing | Post-hoc only — no prevention mechanism; no structural enforcement at CLI or protocol layer | Complementary — post-hoc audit taxonomy vs. design-time behavioral contracts |
| SkillOpt | Behavioral failure modes only (structural modes abort rollouts) | CLI with automatic verifiers + deterministic output + harness-agnostic behavior | Cannot fix §10/§11/§25/§34/§45; requires training budget | Complementary — optimization layer above spec |
References
Primary sources
| Source | URL | Relevance |
|---|---|---|
| Justin Poehnelt — "You Need to Rewrite Your CLI for AI Agents" | https://justin.poehnelt.com/posts/rewrite-your-cli-for-ai-agents/ | Origin of the jpoehnelt-scale rubric; 7-principle framework; single-source-of-truth §47 solution |
| Justin Poehnelt — "The MCP Abstraction Tax" | https://justin.poehnelt.com/posts/mcp-abstraction-tax/ | Fidelity spectrum; two-path problem; CLI+Skills as middle path |
| Google Workspace CLI (gws) | https://github.com/googleworkspace/cli | Reference implementation of Poehnelt's principles |
| Google API Discovery Service | https://developers.google.com/discovery/v1/reference | Discovery Document format used as single source for CLI + MCP generation |
| Google Cloud Model Armor | https://cloud.google.com/model-armor | Response sanitization implementation for §25 prompt injection |
| Jeremiah Lowin — FastMCP 3.1 "Code Mode" | https://www.jlowin.dev/blog/fastmcp-3-1-code-mode | MCP server design using Skills-style on-demand discovery |
Specifications and standards
| Source | URL | Relevance |
|---|---|---|
| Model Context Protocol (MCP) specification | https://spec.modelcontextprotocol.io/ | Protocol layer comparison; tool annotations (2025-11-25) |
| MCP GitHub repository (modelcontextprotocol) | https://github.com/modelcontextprotocol | SDK implementations for Python, TypeScript, Go, Java, Kotlin |
| OpenAPI Specification | https://spec.openapis.org/oas/latest.html | HTTP API schema layer; CLI→OpenAPI and OpenAPI→CLI patterns |
| Agent Skills standard (agentskills.io) | https://agentskills.io/ | Cross-agent skill format used by this project's distributable skills |
| AI Manifest standard | https://ai-manifest.org/ | /.well-known/ai.json service discovery |
Frameworks and tools referenced
| Source | URL | Relevance |
|---|---|---|
| Click (Python) | https://click.palletsprojects.com/ | 23.8% coverage; TTY detection, prompt blocking |
| Typer (Python) | https://typer.tiangolo.com/ | 19.2% coverage; Agentyper extension |
| Cobra (Go) | https://cobra.dev/ | 41.5% coverage; used by gh, Kubernetes, Docker |
| Clap (Rust) | https://docs.rs/clap/ | 43.1% coverage; highest among parser frameworks |
| jc (JSON Convert) | https://github.com/kellyjonbrazil/jc | Text-to-JSON wrapper for ~100 Unix tools |
| Nushell | https://www.nushell.sh/ | Structured shell pipeline; 0.108.0 added MCP server |
| better-cli | https://github.com/yogin16/better-cli | 17-rule checklist as agent-installable skill |
| AWS CLI agent orchestrator | https://github.com/awslabs/cli-agent-orchestrator | Multi-agent CLI orchestration framework |
| EgisAI SDK | https://github.com/EgisLabs/egisai-sdk | Runtime governance interceptor; agent-side §25/§34/§33 adjacency |
| EvidenceRun | https://www.evidencerun.com/ | 12-mode agent reliability audit taxonomy; post-hoc diagnosis of production agent failures |
| EvidenceRun — "12 Ways AI Agents Fail in Production" | https://getevidencerun.substack.com/p/12-ways-ai-agents-fail-in-production | Primary taxonomy source; mapping between 12 modes and spec §N failure modes |
Benchmark data
| Source | URL | Relevance |
|---|---|---|
| "CLI is the new MCP" benchmark data (2026) | (multiple blog posts; no single canonical source) | 35× token efficiency, 33% task completion rate comparisons |
| Lambda AI — Tool-Calling Token Distillation (May 2026) | https://lambda.ai/blog/creating-highly-efficient-agents-450m-tool-calling-tokens-distilled-for-post-training-from-top-open-source-models | 450M-token Hermes Agent harness dataset; 20 turns/sample, 10–15 tools/turn — scale evidence for agent-tool interaction volume |
| MCP GitHub server token analysis | (derived from tools/list inspection of github/github-mcp-server) | 93 tools = ~55,000 tokens at init |
Research papers
| Source | URL | Relevance |
|---|---|---|
| SkillOpt: Executive Strategy for Self-Evolving Agent Skills | https://arxiv.org/abs/2605.23904 | First quantitative measurement of performance gap from missing procedural knowledge (§44); empirical evidence for §2/§4/§18 prerequisites; cross-harness transfer results; 52/52 best-or-tied across model × benchmark × harness |