Challenge Sources & Epistemic Status

This file documents where each challenge came from, how much confidence its source warrants, and what kind of evidence supports it. The source matters for prioritization: a challenge derived from first principles is structurally guaranteed to be real, whereas one absorbed from training data is real but anecdotal.

Source Categories

Code	Type	Description	Confidence
FP	First-principles inference	Logically deduced from the agent interaction model — no experience required	High (structural)
TD	Training data pattern	Absorbed from GitHub issues, blog posts, Stack Overflow, CLI library docs, forum threads during training	Medium (anecdotal, unverifiable)
RA	Research artifact	Read from specific real source code, docs, or spec during this project's research phase	High (verifiable)
TD+FP	Both	Attested in training data AND independently derivable from first principles	Very High
EP	Empirical publication	Quantitative evidence from a peer-reviewed or pre-print research paper; claims are reproducible and verifiable by reference	High (verifiable, quantitative)

Challenge Source Map

Part I: Output & Parsing

#	Challenge	Source	Notes
§1	Exit Codes & Status Signaling	TD+FP	Exit code 0/1 overloading is described in dozens of CLI design guides and POSIX docs; structurally guaranteed: agents branch on exit code, so ambiguity causes misrouting
§2	Output Format & Parseability	TD+FP	Ubiquitous in agent/LLM tool-use literature; structurally guaranteed: agent reads stdout as data
§3	Stderr vs Stdout Discipline	TD+FP	Very common complaint in agent tooling discussions; structurally guaranteed: mixing streams breaks downstream parsing
§4	Verbosity & Token Cost	TD+FP	Token cost framing is specific to LLM agent context; verbose output filling context window is structurally predictable
§5	Pagination & Large Output	TD+FP	Pagination as a list API problem is well-documented; context-window overflow framing is agent-specific pattern from training
§6	Command Composition & Piping	TD	Classic Unix piping design discussion; agent-specific ID chaining pattern absorbed from agent tool-use guides
§7	Output Non-Determinism	TD+FP	Non-deterministic output breaking diffing is a known CI/CD problem; agent retry-loop consequence is first-principles inference
§8	ANSI & Color Code Leakage	TD+FP	Extremely common complaint in both CI and agent contexts; structurally guaranteed if agent reads stdout
§9	Binary & Encoding Safety	TD	Binary-in-JSON encoding issues documented in API design guides and CLI output handling discussions
§76	Streaming-Default JSONL Incompatibility	FP	Structurally guaranteed: `json.loads(stdout)` is the dominant agent parsing pattern; JSONL-default without declaration breaks it by construction
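The §76 failure mode can be reproduced in a few lines. The sketch below assumes an agent that calls `json.loads` on the whole of stdout, as the note for §76 describes; `parse_stdout` and the sample payloads are hypothetical names introduced for illustration, not part of any real tool.

```python
import json


def parse_stdout(stdout: str):
    """Parse stdout as one JSON document, falling back to JSONL.

    Hypothetical helper: a tolerant agent-side parser, not a fix for the CLI.
    """
    try:
        return json.loads(stdout)
    except json.JSONDecodeError:
        # Fallback: treat every non-empty line as its own JSON document.
        return [json.loads(line) for line in stdout.splitlines() if line.strip()]


single_document = '{"id": 1, "status": "ok"}\n'
jsonl_stream = '{"id": 1}\n{"id": 2}\n'

# The dominant pattern works for a single document...
print(json.loads(single_document))

# ...and fails by construction on undeclared JSONL output.
try:
    json.loads(jsonl_stream)
except json.JSONDecodeError as exc:
    print("json.loads failed:", exc.msg)

print(parse_stdout(jsonl_stream))
```

The fallback is only a workaround on the agent side. A JSONL stream that happens to contain one record still parses as a single object, so the output shape stays ambiguous unless the tool declares its format.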

Part II: Execution & Reliability

#	Challenge	Source	Notes
§10	Interactivity & TTY Requirements	TD+FP	Most-cited agent CLI problem in training data; structurally guaranteed: non-TTY + prompt = deadlock
§11	Timeouts & Hanging Processes	TD+FP	Hanging processes in automation documented extensively; timeout budget exhaustion is first-principles inference
§12	Idempotency & Safe Retries	TD+FP	Idempotency keys documented in Stripe API design, distributed systems literature; agent retry-safety framing from agent SDK discussions
§13	Partial Failure & Atomicity	TD+FP	Multi-step failure handling is a classic distributed systems problem; agent-specific resume/rollback framing from training
§14	Argument Validation Before Side Effects	TD+FP	Validate-before-execute principle is well-documented; exit-2-guarantees-no-side-effects framing is agent-specific
§15	Race Conditions & Concurrency	TD	Concurrent access to shared CLI state documented in CLI design guides and lock-file discussions
§16	Signal Handling & Graceful Cancellation	TD+FP	SIGTERM handling is documented POSIX behavior; cleanup-on-cancel for agents is first-principles inference
§17	Child Process Leakage	TD	Zombie process and orphaned child documentation from Unix process management literature
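The "non-TTY + prompt = deadlock" inference behind §10 and the timeout budget concern in §11 suggest a defensive invocation pattern on the agent side. The following is a minimal sketch under stated assumptions: `run_noninteractive` and its result dictionary are hypothetical, and only standard-library `subprocess` calls are used.

```python
import subprocess
import sys


def run_noninteractive(argv, timeout_s=10.0):
    """Run a command with stdin closed and a hard wall-clock timeout.

    Hypothetical helper: the returned dict shape is an assumption.
    """
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,  # a prompt reads EOF instead of blocking
            capture_output=True,
            text=True,
            timeout=timeout_s,  # run() kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return {"timed_out": True, "exit_code": None, "stdout": "", "stderr": ""}
    return {
        "timed_out": False,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    # A child that tries to read input sees EOF immediately.
    reader = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    print(run_noninteractive(reader))

    # A child that hangs is cut off by the timeout budget.
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_noninteractive(sleeper, timeout_s=0.5))
```

This sketch only terminates the direct child. Processes that the child itself spawned may survive, which is the leakage scenario listed under §17.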

Part III: Errors & Discoverability

#	Challenge	Source	Notes
§18	Error Message Quality	TD+FP	Good error message design is well-documented (e.g., Rust compiler errors blog posts); machine-parseable error framing is agent-specific
§19	Retry Hints in Error Responses	TD	Retry-After header pattern from HTTP RFCs; CLI-level retry hints absorbed from API design guides and agent tooling discussions
§20	Environment & Dependency Discovery	TD	Dependency preflight checking documented in CLI tool design; agent-specific "doctor" pattern from Homebrew, Flutter doctor
§21	Schema & Help Discoverability	TD+FP	Help discoverability for agents is documented in MCP and agent SDK guides; structurally guaranteed: agent needs machine-readable schema to construct valid calls
§22	Schema Versioning & Output Stability	TD	API versioning literature; schema-version-in-output pattern from REST API design guides and GraphQL introspection discussions

Part IV: Security

#	Challenge	Source	Notes
§23	Side Effects & Destructive Operations	TD+FP	Dry-run and confirmation patterns are well-documented; agent-specific "no human to catch mistakes" risk framing from agent safety discussions
§24	Authentication & Secret Handling	TD+FP	Secret-in-env-var pattern is documented in 12-factor app, CI/CD guides; agent-specific leakage vectors from agent security training data
§25	Prompt Injection via Output	TD	Prompt injection via tool output is documented in LLM security research (Greshake et al., similar papers absorbed in training)
§74	Credential Scope Declaration Absence	FP	Observed during gh evaluation: personal PAT grants full account access to agent; OAuth scope minimization is standard security practice but absent from CLI design guides
§75	Safe-Default Execution Mode Absent	FP	Observed in trading bot scenario: --dry-run is available (§23) but not the default; agents that omit the flag cause real trades; Terraform plan/apply split is the established model for safe-default execution
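The safe-default idea in §75 can be sketched with `argparse`: the command plans by default and only performs side effects when an explicit flag is passed. Everything here is hypothetical (the `trade` program name, the `--apply` flag, and the result dictionary); it illustrates the plan/apply split mentioned above rather than any real tool's interface.

```python
import argparse


def build_parser():
    # Hypothetical CLI: "trade" and "--apply" are illustrative names only.
    parser = argparse.ArgumentParser(prog="trade")
    parser.add_argument("symbol")
    parser.add_argument("quantity", type=int)
    parser.add_argument(
        "--apply",
        action="store_true",
        help="perform the operation; without this flag only a plan is printed",
    )
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    plan = {"action": "buy", "symbol": args.symbol, "quantity": args.quantity}
    if not args.apply:
        # Safe default: an agent that omits the flag causes no side effects.
        return {"mode": "dry-run", "executed": False, "plan": plan}
    # A real tool would perform the side effect here.
    return {"mode": "apply", "executed": True, "plan": plan}


if __name__ == "__main__":
    print(main(["ACME", "10"]))
    print(main(["ACME", "10", "--apply"]))
```

The design choice is that forgetting a flag fails safe: the omission costs an extra call instead of a real trade.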

Part V: Environment & State

#	Challenge	Source	Notes
§26	Stateful Commands & Session Management	TD	Session state in CLIs documented in tool design guides; agent session isolation framing from agent SDK discussions
§27	Platform & Shell Portability	TD	Cross-platform CLI portability is extensively documented; `#!/usr/bin/env` and POSIX shell compatibility are classic topics
§28	Config File Shadowing & Precedence	TD	Config precedence (env > file > default) is documented in 12-factor app and CLI design guides; agent-specific confusion absorbed from troubleshooting discussions
§29	Working Directory Sensitivity	TD+FP	CWD sensitivity is a well-known scripting hazard; agent-specific absolute-path requirement is first-principles inference
§30	Undeclared Filesystem Side Effects	TD	Side effect declaration is documented in functional programming and CLI design; agent-specific cleanup challenges absorbed from automation tooling discussions
§31	Network Proxy Unawareness	TD	Proxy env var support (`HTTP_PROXY`, `HTTPS_PROXY`) is documented in many HTTP library guides; agent-specific inference from enterprise environment discussions
§32	Self-Update & Auto-Upgrade Behavior	TD	Auto-update in non-interactive mode is a known CI problem; agent-specific output pollution framing absorbed from automation discussions
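The precedence order cited for §28 (env > file > default) is easier to debug when the tool reports which layer supplied each value. Below is a minimal sketch, assuming a hypothetical `MYTOOL_` environment prefix and a hypothetical `resolve_setting` helper; the mappings are passed in explicitly so the example does not depend on the real process environment.

```python
def resolve_setting(name, env, file_config, defaults):
    """Return (value, source) using env > file > default precedence.

    Hypothetical helper: the MYTOOL_ prefix is an assumption for illustration.
    """
    env_key = "MYTOOL_" + name.upper()
    if env_key in env:
        return env[env_key], "env"
    if name in file_config:
        return file_config[name], "file"
    return defaults[name], "default"


defaults = {"region": "us-east", "timeout": "30"}
file_config = {"region": "eu-west"}
env = {"MYTOOL_TIMEOUT": "5"}

for key in ("region", "timeout"):
    value, source = resolve_setting(key, env, file_config, defaults)
    print(f"{key}={value} (from {source})")
```

Returning the source alongside the value lets an agent see when a config file or environment variable is shadowing the value it expected.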

Part VI: Observability

#	Challenge	Source	Notes
§33	Observability & Audit Trail	TD	Structured logging, request IDs, and audit trails are documented in production engineering guides; agent-specific trace propagation from OpenTelemetry and agent SDK discussions

Part VII: Ecosystem, Runtime & Agent-Specific (§34–47, §49–70)

Discovered by reading specific real artifacts during the research phase of this project.

§34–47: Research phase (jpoehnelt SKILL.md, agentyper, Commander.js, MCP spec)

#	Challenge	Source	Primary Artifact
§34	Shell Injection via Agent-Constructed Commands	RA+FP	jpoehnelt SKILL.md — Input Hardening axis; structurally guaranteed when agents construct shell strings
§35	Agent Hallucination Input Patterns	RA+FP	jpoehnelt SKILL.md — Input Hardening axis; path traversal and percent-encoding patterns from OWASP
~~§36~~	~~Pager Invocation Blocking Agent Pipelines~~	RA	MERGED into §10 — pager blocking is a specific case of interactivity/TTY deadlock
§37	REPL / Interactive Mode Accidental Triggering	RA+FP	Python argparse subparser fallback behavior; structurally guaranteed in non-TTY context
§38	Runtime Dependency Version Mismatch	RA	Cobra and Clap docs on runtime dependency checking; Node.js engine field pattern
~~§39~~	~~Help Text Routed to Stdout~~	RA	MERGED into §3 — routing help to stdout is a specific case of stderr/stdout stream discipline
§40	`parse()` vs `parseAsync()` Silent Race Condition	RA	Commander.js docs — explicit warning about async/sync mismatch
§41	Update Notifier Side-Channel Output Pollution	RA	`update-notifier` npm package behavior; Python pip version check behavior
§42	Debug / Trace Mode Secret Leakage	RA	Python Fire `--trace` flag — documented in python-fire README and issue tracker
§43	Tool Output Result Size Unboundedness	RA	jpoehnelt SKILL.md — Context Window Discipline axis; MCP spec `maxTokens` parameter
§44	Agent Knowledge Packaging Absence	RA	jpoehnelt SKILL.md — entire premise; agentyper `--schema` flag; OpenClaw SKILL format
§45	Headless Authentication / OAuth Browser Flow Blocking	RA+FP	agentyper docs — headless auth; structurally guaranteed: browser redirect in non-TTY = deadlock
§46	API Schema to CLI Flag Translation Loss	RA	Comparison matrix research — every parser framework loses nested/union types in flag translation
§47	MCP Wrapper Schema Staleness	RA	MCP spec — tools are statically declared; no sync mechanism with CLI source of truth
~~§48~~	~~Structured Output Envelope Absence~~	RA	MERGED into §2 — the envelope spec is the solution to output format & parseability

§49–58: Extended research (CI/CD guides, POSIX docs, agent SDK discussions)

#	Challenge	Source	Notes
§49	Async Job / Polling Protocol Absence	TD+FP	Async job patterns documented in CI/CD and deployment tool design; exit-code contract for "still running" is first-principles inference
§50	Stdin Consumption Deadlock	TD+FP	stdin-as-default-input is a known Unix pattern; non-TTY deadlock is structurally guaranteed
§51	Shell Word Splitting and Glob Expansion	TD+FP	Word splitting and globbing are documented POSIX shell behaviors; agent-constructed string vulnerability is first-principles inference
§52	Recursive Command Tree Discovery Cost	TD+FP	N+1 help calls documented in agent tool-use discussions; context window cost is first-principles inference
§53	Credential Expiry Mid-Session	TD+FP	Token expiry in long-running automation documented in AWS and OAuth guides; expiry-vs-denied ambiguity absorbed from authentication troubleshooting discussions
§54	Conditional / Dependent Argument Requirements	TD+FP	Conditional required args is a known argparse/Click design challenge; one-at-a-time discovery cost is first-principles inference
§55	Silent Data Truncation	TD	API field length limits silently truncating on write documented in database ORM discussions and API client library issue trackers
§56	Exit Code Masking in Shell Pipelines	TD+FP	`pipefail` is documented POSIX/bash behavior; agent consequence is first-principles inference
§57	Locale-Dependent Error Messages	TD	`LC_MESSAGES=C` for English error normalization is documented in server administration guides; agent impact absorbed from internationalization discussions
§58	Multi-Agent Concurrent Invocation Conflict	TD+FP	Multi-agent concurrency is a recent (2024–2025) concern; file locking for config writes is a documented Unix pattern; agent-specific framing is first-principles inference
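The word-splitting hazard in §51 can be shown without running a shell. The sketch below uses `shlex.split` as an approximation of POSIX word splitting (it does not model glob expansion) and `shlex.quote` to protect an agent-supplied argument; `naive_command`, `quoted_command`, and the file name are hypothetical.

```python
import shlex


def naive_command(path: str) -> str:
    """Build a shell string by plain concatenation (the vulnerable pattern)."""
    return "rm " + path


def quoted_command(path: str) -> str:
    """Build a shell string with the argument quoted for a POSIX shell."""
    return "rm " + shlex.quote(path)


path = "my report.txt"

# The unquoted string splits into two separate arguments.
print(shlex.split(naive_command(path)))

# The quoted string keeps the path as one argument.
print(shlex.split(quoted_command(path)))

# Glob characters are also neutralized by quoting.
print(quoted_command("*.log"))
```

Where the calling API allows it, passing an argument list instead of a shell string avoids the problem entirely, because no shell parsing takes place.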

§59–68: Gemini AMI framework & Antigravity-cli manifesto

§59–68 were discovered by reviewing two external agent-native CLI projects. §69 and §70, listed at the end of the table, are first-principles derivations.

#	Challenge	Source	Primary Artifact
§59	High-Entropy String Token Poisoning	RA	Gemini AMI: Output & Context — High-Entropy Masking section
§60	OS Output Buffer Deadlock	RA	Antigravity: I/O & Formatting — Output Buffering; `PYTHONUNBUFFERED` pattern documented explicitly
§61	Bidirectional Pipe Payload Deadlock	RA	Antigravity: I/O & Formatting — Pipe Deadlocks; 64 KB UNIX pipe buffer limit with exact mechanics
§62	$EDITOR and $VISUAL Trap	RA	Gemini AMI: Execution Flow — REPL/Editor Blocks; Antigravity: Interactivity & Prompts — $EDITOR Trap
§63	Terminal Column Width Output Corruption	RA	Antigravity: I/O & Formatting — Terminal Wrapping; `--width=0` solution described explicitly
§64	Headless Display and GUI Launch Blocking	RA	Gemini AMI: System Physics — Headless Display; Antigravity: Environment & Execution — Implicit Browser Fallbacks
§65	Global Configuration State Contamination	RA	Antigravity: State & Concurrency — Global Configuration State Mutation; default-to-local pattern
§66	Symlink Loop and Recursive Traversal Exhaustion	RA	Antigravity: Environment & Execution — Symlink Death Spirals; inode tracking solution
§67	Agent-Generated Input Syntax Rejection	RA	Antigravity: Schema & Discoverability — Input Syntax Rigidity; JSON5 forgiving parser solution; REQ-48
§68	Third-Party Library Stdout Pollution	RA	Gemini AMI: Output & Context; Antigravity: I/O & Formatting — fd-level interception solution
§69	Argument Order Ambiguity	FP	Derived from parser mode differences across argparse/Click/Cobra/Commander.js
§70	Single-Argument Arity Forcing Agent Loop Overhead	FP	Derived from observed agent error: `ws delete` passed multiple paths, argparse rejected extras; UNIX convention (rm/cp/mv accept nargs=+) creates universal agent expectation
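The arity mismatch in §70 can be reproduced with `argparse` directly. This is a sketch under assumptions: the `ws delete` program name comes from the note above, but both parser definitions are hypothetical reconstructions, not the tool's actual code.

```python
import argparse


def single_path_parser():
    """Parser that accepts exactly one path (the behavior that surprises agents)."""
    parser = argparse.ArgumentParser(prog="ws delete")
    parser.add_argument("path")
    return parser


def multi_path_parser():
    """Parser that accepts one or more paths, matching the rm/cp/mv convention."""
    parser = argparse.ArgumentParser(prog="ws delete")
    parser.add_argument("paths", nargs="+")
    return parser


if __name__ == "__main__":
    argv = ["a.txt", "b.txt", "c.txt"]

    # nargs="+" collects every path in one invocation.
    print(multi_path_parser().parse_args(argv).paths)

    # The single-argument parser rejects the extras and exits with status 2.
    try:
        single_path_parser().parse_args(argv)
    except SystemExit as exc:
        print("single-path parser exited with status", exc.code)
```

With the single-path form an agent has to loop, one invocation per path, which is the overhead the challenge title refers to.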

Confidence Summary

Confidence	Count	Challenges
Very High (TD+FP or RA+FP)	30	§1–5, §7–8, §10–14, §16, §18, §21, §23–24, §29, §34–35, §37, §45, §49–54, §56, §58
High (RA only)	18	§38, §40–44, §46–47, §59–68
Medium (TD only)	17	§6, §9, §15, §17, §19–20, §22, §25–28, §30–33, §55, §57

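Active total: 65 (3 merged: §36→§10, §39→§3, §48→§2). RA+FP challenges (§34–35, §37, §45) counted as Very High. These counts cover §1–68 only; the FP-only challenges added later (§69 onward) are not included in this table.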
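The counts in the summary table can be checked mechanically by expanding each range list. The sketch below copies the three challenge lists from the table; `expand`, `ROWS`, and `MERGED` are hypothetical names, and the check is limited to §1–68, the range those lists cover.

```python
import re


def expand(spec: str) -> list[int]:
    """Expand a list such as '§1–5, §7' into individual challenge numbers."""
    numbers = []
    for part in spec.split(","):
        match = re.fullmatch(r"§(\d+)(?:[–-](\d+))?", part.strip())
        if match is None:
            raise ValueError(f"unrecognized challenge reference: {part!r}")
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        numbers.extend(range(start, end + 1))
    return numbers


ROWS = {
    "Very High": "§1–5, §7–8, §10–14, §16, §18, §21, §23–24, §29, §34–35, "
                 "§37, §45, §49–54, §56, §58",
    "High": "§38, §40–44, §46–47, §59–68",
    "Medium": "§6, §9, §15, §17, §19–20, §22, §25–28, §30–33, §55, §57",
}
MERGED = {36, 39, 48}

counts = {label: len(expand(spec)) for label, spec in ROWS.items()}
covered = set().union(*(expand(spec) for spec in ROWS.values()))

print(counts)
print("active total:", sum(counts.values()))
print("covers §1–68 minus merged:", covered | MERGED == set(range(1, 69)))
```

Running the check after each revision would catch a count that drifts from its challenge list.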

What This Means for Prioritization

Highest confidence → implement first: TD+FP challenges are both attested in training data and structurally necessary. They will occur in any agent that uses CLI tools.

Research-backed (RA) challenges are specific and concrete: they were confirmed by reading real code and docs. They are real but may not affect every tool, because they depend on specific library choices (Commander.js, Python Fire, etc.).

TD-only challenges need validation: they are plausible based on patterns seen in training data but should be validated against your actual tooling before investing heavily in framework mitigations.

What Is NOT a Source

Direct runtime experience: these challenges were not discovered by actually running agents against CLI tools. There is no personal debugging history behind them.
User studies or empirical measurement: no user studies, no telemetry, no measured frequency data. The "Very Common / Common / Situational" frequency ratings are estimates based on how often similar problems appear in training data — not measured rates.

Empirical Publication Sources

The following peer-reviewed or pre-print papers provide quantitative evidence for challenges in this spec. EP sources strengthen existing challenge entries without replacing their primary source category — they add measured effect sizes where the spec previously had only structural or anecdotal evidence.

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Citation: Yang et al., Microsoft Research + Shanghai Jiao Tong / Tongji / Fudan universities. arXiv:2605.23904, May 2026.

What it measures: Performance gains from optimized skill documents across 6 benchmarks, 7 target models, and 3 execution harnesses (direct chat, Codex CLI, Claude Code). Reported result: best or tied in 52 of 52 evaluated cells.

Challenges it provides EP-level evidence for:

Challenge	What SkillOpt measures	Effect
§44 Agent Knowledge Packaging Absence	Zero-shot frontier models vs. optimized-skill frontier models on procedural benchmarks — the gap is the cost of missing knowledge	GPT-5.5: 33–42% no-skill → 67–81% with optimized skill on SpreadsheetBench, OfficeQA, LiveMathBench
§2 Output Format & Parseability	CLIs with automatic verifiers (parseable exit codes + structured output) support a held-out gate that produces stable gains; CLIs without them cannot train	The validation gate is the single most impactful SkillOpt component; removing it collapses training
§4 Verbosity & Token Cost	Training cost per test-point gain varies by an order of magnitude with trajectory length	SearchQA (longer trajectories): 37.9M tokens/point; SpreadsheetBench (shorter): 0.6M tokens/point
§18 Error Message Quality	Failure minibatches are the primary source of useful edits; the optimizer identifies recurring error patterns and encodes procedures to avoid them	Removing the rejected-edit buffer (which captures recurring failure patterns) costs 4.6 points on SpreadsheetBench

The partition SkillOpt reveals: challenges split into behavioral (can be partially remediated by a well-trained skill) and structural (abort rollouts before any trajectory is logged). §10, §11, §25, §34, §45, §60 are structural — no amount of skill optimization can work around them. This partition is now documented in guides/skill-optimizable-design.md.

Written 2026-03-13.
Revised 2026-03-13: §36, §39, §48 marked merged; confidence counts corrected to 30/18/17; personal paths removed; active links added.
Revised 2026-03-19: §69 added.
Revised 2026-03-26: §70 added.
Revised 2026-05-07: §71 (FP), §72 (FP), §73 (FP) added; active total updated to 70.
Revised 2026-05-07: §74 (FP) added; active total updated to 71.
Revised 2026-05-09: §75 (FP) added; active total updated to 72.
Revised 2026-05-26: EP source category added; SkillOpt (arXiv:2605.23904) added as EP source for §44, §2, §4, §18.
Covers CLI Agent Spec v1.7: 72 active challenges (75 original, 3 merged).