Skills — Technical Reference
Ten installable agent skills for evaluating CLIs, implementing the spec, and diagnosing failures at the tool-execution layer. Compatible with any Agent Skills-enabled agent: Claude Code, Cursor, Gemini CLI, Copilot, and others.
Skill Inventory
| Skill | Category | Purpose |
|---|---|---|
cli-agent-onboard |
Evaluation pipeline | Profile a CLI tool — detects runtime, binary, non-interactive flags, timeout method |
cli-agent-evaluate |
Evaluation pipeline | Score a CLI against one failure mode (0–3), return applicable agent workaround |
cli-agent-evaluate-batch |
Evaluation pipeline | Evaluate a CLI across multiple failure modes in one resumable run |
cli-agent-readiness |
Evaluation pipeline | Score proactive agent-readiness across 5 dimensions (0–15) |
cli-agent-report |
Evaluation pipeline | Transform findings into perspective-specific reports for 4 audiences |
cli-agent-audit |
Evaluation pipeline | Autonomous end-to-end pipeline: install → onboard → readiness → evaluate → report |
cli-agent-implement |
Implementation | Guide implementing the spec in a CLI framework, tier by tier |
cli-agent-diagnose |
Diagnosis | Classify a failed agent CLI call against the §N taxonomy; return workaround + memory string |
fixlayer-report |
Diagnosis | Render a self-contained HTML audit report from a tool-execution trace file |
validate-links |
Spec maintenance | Validate cross-links and schema↔requirement symmetry in the spec |
Skill Relations
Call graph
Skills invoke other skills automatically or on demand. The arrows show delegation, not data flow (data flow is covered in Artifact Flow below).
cli-agent-audit
├─→ cli-agent-onboard (Phase 2 — always)
├─→ cli-agent-readiness (Phase 3 — skippable with --skip-readiness)
├─→ cli-agent-evaluate-batch (Phase 4)
└─→ cli-agent-report (Phase 5, mode=all)
cli-agent-evaluate
└─→ cli-agent-onboard (Step 0 — auto-invoked if environment.md missing)
cli-agent-evaluate-batch
└─→ cli-agent-onboard (Step 0 — auto-invoked if environment.md missing)
cli-agent-readiness
└─→ cli-agent-onboard (Step 0 — auto-invoked if environment.md missing)
fixlayer-report
└─→ cli-agent-diagnose (scripts/diagnose.py --explain)
cli-agent-implement, cli-agent-diagnose, and validate-links have no skill-level dependencies.
Artifact flow
All evaluation artifacts live under evaluations/<cli-name>/. Skills read and write specific files; no skill reads another skill's files at runtime (they share the directory, not live state).
cli-agent-onboard
writes → evaluations/<cli>/environment.md
cli-agent-evaluate / cli-agent-evaluate-batch
reads → evaluations/<cli>/environment.md
writes → evaluations/<cli>/findings.md
writes → evaluations/<cli>/issues.md
writes → evaluations/<cli>/trace.md
cli-agent-readiness
reads → evaluations/<cli>/environment.md
writes → evaluations/<cli>/readiness.md
cli-agent-report (mode=all)
reads → evaluations/<cli>/findings.md
reads → evaluations/<cli>/issues.md
reads → evaluations/<cli>/trace.md
reads → evaluations/<cli>/readiness.md (optional)
reads → evaluations/<cli>/environment.md
writes → evaluations/<cli>/report-dev.md
writes → evaluations/<cli>/report-agent-dev.md
writes → evaluations/<cli>/report-runtime.md
writes → evaluations/<cli>/report-issues.md
writes → evaluations/<cli>/report-index.md
writes → evaluations/<cli>/README.md
writes → evaluations/<cli>/linkedin.md (gitignored)
writes → evaluations/<cli>/x.md (gitignored)
writes → docs/evaluations/<cli>/.pages
Single-mode cli-agent-report runs (dev, agent-dev, runtime, issues) emit output to the conversation only — they never write files.
Artifact reference table
| File | Produced by | Consumed by |
|---|---|---|
environment.md |
cli-agent-onboard |
cli-agent-evaluate, cli-agent-evaluate-batch, cli-agent-readiness, cli-agent-report |
findings.md |
cli-agent-evaluate, cli-agent-evaluate-batch |
cli-agent-report |
issues.md |
cli-agent-evaluate, cli-agent-evaluate-batch |
cli-agent-report (modes: issues, all) |
trace.md |
cli-agent-evaluate, cli-agent-evaluate-batch |
cli-agent-report (modes: issues, all) |
readiness.md |
cli-agent-readiness |
cli-agent-report (modes: all) |
report-dev.md |
cli-agent-report |
humans — CLI authors |
report-agent-dev.md |
cli-agent-report |
humans — agent builders |
report-runtime.md |
cli-agent-report |
AI agents at runtime |
report-issues.md |
cli-agent-report |
agent users — concrete bug list |
report-index.md |
cli-agent-report |
humans — navigation |
README.md |
cli-agent-report |
humans — entry point |
linkedin.md |
cli-agent-report |
humans — social post draft |
x.md |
cli-agent-report |
humans — social post draft |
Skill Descriptions
cli-agent-onboard
Role: foundation. Every evaluation skill reads environment.md before running any check. Run once per CLI per machine; re-run with --force to refresh.
Steps: reads agent docs (AGENTS.md, CODING_AGENTS.md, README.md) → detects runtime and toolchain from manifest files → locates and verifies the binary → detects OS timeout method → discovers non-interactive and output-format flags → saves evaluations/<cli>/environment.md.
Invoked automatically by: cli-agent-evaluate, cli-agent-evaluate-batch, cli-agent-readiness (Step 0) and cli-agent-audit (Phase 2).
cli-agent-evaluate
Role: targeted single-failure-mode evaluation. Use when investigating one specific §N, or when building a custom evaluation sequence.
Steps: loads environment profile → locates failure mode file → reads ### Evaluation section → runs **Check:** command → assigns score 0–3 (or ?/3 on timeout) → reads ### Agent Workaround if score < 3 → emits structured result block → saves to findings.md and trace.md.
Score semantics: 3 = fully passing, 0 = failing, ?/3 = check timed out (indeterminate — not a fail).
Resumable: skips §N rows already present in findings.md unless --refresh is passed.
cli-agent-evaluate-batch
Role: efficient multi-failure-mode evaluation in a single run. Accepts a severity filter (critical, high, medium, all), a part number (part 1–part 7), or an explicit §N list.
Difference from cli-agent-evaluate: builds a queue from the failure mode index, processes all entries in order, saves findings after each §N (resumable), emits a final scorecard table sorted by severity then §N.
Resumable: skips §N rows that have a complete row in findings.md AND a complete block in trace.md. Re-evaluate with --refresh.
cli-agent-readiness
Role: positive readiness score — measures what the CLI provides, not just what it avoids breaking. Complements cli-agent-evaluate-batch.
Five dimensions (3 points each, 15 total):
| # | Dimension | What it measures |
|---|---|---|
| 1 | Documentation Quality | Can an agent learn correct invocation from docs alone |
| 2 | Self-Description | Does --schema/manifest return a machine-readable ManifestResponse |
| 3 | Pre-built Integrations | MCP server, OpenAPI spec, Claude skill, workflow recipes — with co-versioning check |
| 4 | Setup Reproducibility | Non-interactive idempotent install; dependency manifest; health-check command |
| 5 | Workflow Coverage | Copy-pasteable examples covering CRUD operations, verified by execution |
Depth: quick evaluates dimensions 1–2 only (score out of 6); full evaluates all five.
Grade scale: A 13–15 · B 10–12 · C 7–9 · D 4–6 · F 0–3.
cli-agent-report
Role: translate raw findings into actionable output for a specific audience. Never re-runs CLI checks — reads only pre-existing artifacts.
Five modes:
| Mode | Audience | Output |
|---|---|---|
dev |
CLI authors | Fix list: failing §N → ### Solutions section from challenge file |
agent-dev |
Agent builders | Integration guide: binary invocation, env vars, per-§N workarounds, invocation invariants |
runtime |
AI agents | Compact brief: always-include flags, never-do actions, output patterns to watch |
issues |
Agent users | Concrete bugs from issues.md + gap list from failing §N |
all |
All above | Runs all four modes, writes 8 files, generates index + README + LinkedIn/X drafts |
all mode validation: runs scripts/validate_report_bundle.py before declaring the bundle complete.
cli-agent-audit
Role: autonomous end-to-end pipeline. Single command, zero human steps in the happy path.
Six phases:
| Phase | Action | Halt on failure |
|---|---|---|
| 0 | Pre-flight — check existing artifacts, determine what to skip | n/a |
| 1 | Install — non-interactive package install, binary verification | yes |
| 2 | Onboard — delegate to cli-agent-onboard |
yes |
| 3 | Readiness — delegate to cli-agent-readiness |
no (warn and continue) |
| 4 | Evaluate — delegate to cli-agent-evaluate-batch |
no (note skipped §N) |
| 5 | Report — delegate to cli-agent-report mode=all |
no (fix reported errors) |
Scope: critical (default), critical+high, or all.
Flags: --skip-install, --skip-readiness, --refresh.
Artifacts produced: 13 files under evaluations/<cli>/ — see the cli-agent-audit SKILL.md for the full table.
cli-agent-implement
Role: guides a CLI framework author through implementing the spec, tier by tier.
Steps: naming audit (corpus alignment, verb gaps, flag gaps) → identify target language and framework → generate language-specific types from JSON schemas → implement REQ-F (Framework-Automatic) → REQ-C (Command Contract) → REQ-O (Opt-In) → verify acceptance criteria.
Key invariant to add post-codegen: retryable: true implies side_effects: "none" in ExitCodeEntry — code generators do not enforce this; add a validation snippet manually.
Standalone: does not read evaluation artifacts or invoke other skills.
cli-agent-diagnose
Role: post-hoc failure classifier. Given a failed CLI call (command, stdout, stderr, exit code), identifies the matching §N failure mode and returns an actionable workaround, a memory string, and a skill patch.
Input forms: failed call visible in the current conversation, a JSON object, or a path to a message history file (OpenAI format, LangSmith, or Langfuse).
Scripts:
- scripts/diagnose.py — classifier; output is always JSON regardless of exit code
- scripts/runner.py — drop-in subprocess.run wrapper that applies §N fixes transparently
- scripts/preflight_hook.py — Claude Code PreToolUse hook that intercepts Bash calls before they run
Exit codes of diagnose.py: 0 match found · 2 trace too sparse · 3 no match.
Scope: tool execution layer only. Reasoning failures, wrong answers, and incorrect domain logic are out of scope — diagnose.py returns no_match for those.
fixlayer-report
Role: visual counterpart to cli-agent-diagnose. Renders a self-contained HTML audit report in the FixLayer design system from a tool-execution trace file.
Input formats: Claude Code PostToolUse hook log (JSONL), OpenAI message history (JSON array), or single-event JSON.
Script: scripts/generate_report.py <trace-file> — calls diagnose.py --explain, collects session stats (tool counts, durations, retries), renders HTML, prints output path.
Report sections: verdict strip (pass/fail + §N codes) → match cards (evidence, triggering tool calls, workaround, limitation) → memory strings and skill patches → action items.
Depends on: cli-agent-diagnose scripts installed alongside it.
validate-links
Role: spec consistency validator. Detects broken file references and symmetry violations before they accumulate.
Five checks:
| # | Check | What it validates |
|---|---|---|
| 1 | Broken file links | Every relative link in every .md file resolves to an existing file |
| 2 | Schema↔requirement symmetry | Used by in schema .md ↔ ## Schema link in requirement; bidirectional |
| 3 | Index completeness | Every file linked from requirements/index.md and schemas/index.md exists; every file in those directories is listed |
| 4 | Content completeness | Challenge and requirement files have all required sections (informational — not a hard error) |
| 5 | Counter consistency | Numbers declared in README.md, AGENTS.md, challenges/index.md, requirements/index.md match actual file counts on disk |
Standalone: reads spec files only; does not invoke other skills or write any artifacts.
When to run: after adding or editing any failure mode, requirement, schema, or guide file.
Quick-Reference: Which Skill to Use
| Goal | Skill |
|---|---|
| Full autonomous audit of a CLI | cli-agent-audit |
| Evaluate one specific §N | cli-agent-evaluate |
| Evaluate a set of failure modes | cli-agent-evaluate-batch |
| Profile a CLI before evaluation | cli-agent-onboard |
| Score what a CLI provides to agents | cli-agent-readiness |
| Generate a fix list for the CLI author | cli-agent-report mode=dev |
| Generate an integration guide for an agent builder | cli-agent-report mode=agent-dev |
| Generate a runtime brief for an AI agent | cli-agent-report mode=runtime |
| Generate all reports + index + social drafts | cli-agent-report mode=all |
| Understand why a CLI call failed | cli-agent-diagnose |
| Visual HTML report from a trace file | fixlayer-report |
| Implement the spec in a CLI framework | cli-agent-implement |
| Check spec cross-links after editing | validate-links |