01 critical exit codes

Part I: Output & Parsing | Challenge §1

1. Exit Codes & Status Signaling

The Problem

The most fundamental contract between a CLI tool and its caller is the exit code. Agents treat exit code 0 as success and anything else as failure — but many tools break this contract in ways that silently mislead the agent.

Violations agents encounter constantly:

# Tool exits 0 but operation failed
$ my-deploy --env prod
Warning: config file not found, using defaults
Deploying... timeout after 30s
$ echo $?
0   # ← agent thinks this succeeded

# Tool exits non-zero on non-error conditions
$ grep "pattern" file.txt
$ echo $?
1   # grep exits 1 when no match found — not an error, just "not found"
    # agent may treat this as a failure and retry or abort

# Inconsistent exit codes across versions
$ tool --version
v1.x: exit 0 on warning, exit 1 on error
v2.x: exit 2 on warning, exit 1 on error
# agent cannot reliably interpret without knowing version

Multi-step commands that mask failures:

cmd1 && cmd2 && cmd3
# if cmd2 fails, the shell exits with cmd2's code
# but the agent only sees the final code and doesn't know which step failed

Impact

Agent retries a succeeded operation (token waste, possible duplication)
Agent proceeds after a failed operation (data corruption, cascading errors)
Agent enters a retry loop on a "failure" that is actually expected behavior

Solutions

For CLI tool authors:

Exit code conventions to follow:
  0  = success, operation completed as intended
  1  = general error (use sparingly — be specific)
  2  = misuse / bad arguments (before operation starts)
  3  = operation started but failed mid-way
  4  = precondition not met (dependency missing, not initialized)
  5  = not found (the thing you asked about doesn't exist)
  6  = conflict / already exists
  7  = timeout
  8  = permission denied
  9  = rate limited / quota exceeded

Separate "not found" from "error":

# Bad: exits 1 for both "error" and "not found"
tool get-user --id 123
# exit 1

# Good: exits 5 for "not found", 1 for actual errors
tool get-user --id 123
# exit 5  ← agent knows to stop, not retry

For CLI framework design: - Define a standard exit code table in your framework - Provide typed exit code constants (not magic numbers) - Make every command document its possible exit codes in --help - Support --exit-on-warning flag to make strict mode opt-in

Evaluation

Score	Condition
0	Binary `0`/`1` only — all failures exit `1`, no semantic distinction
1	Some semantic codes used but undocumented or inconsistent across commands
2	Named codes covering success / validation / not-found / timeout / permission; documented in `--help`; consistent across all commands
3	Full declared exit code table; exit code embedded in JSON error body (`"exit_code": N`); each command declares its possible codes in the tool manifest

Check: Run the tool with missing required args, a nonexistent resource ID, and a simulated network failure. Confirm exit codes are distinct, documented, and match the JSON error body.

Agent Workaround

When exit codes are not semantic, branch on the JSON envelope instead:

import subprocess, json

result = subprocess.run(cmd, capture_output=True)

# 1. Never assume exit 0 means the operation succeeded
if result.returncode == 0:
    data = json.loads(result.stdout)
    if not data.get("ok"):
        handle_logical_failure(data["error"])  # tool exited 0 but reported failure

# 2. Map known semantic codes when available
elif result.returncode == 2:
    raise ValidationError()       # fix input, do not retry as-is

elif result.returncode == 5:
    raise NotFoundError()         # stop, do not retry

elif result.returncode == 9:
    retry_after = extract_retry_after(result.stdout)
    time.sleep(retry_after or 60)  # rate-limited — back off

# 3. Fallback: parse stdout/stderr for error details
else:
    try:
        err = json.loads(result.stdout or result.stderr)
    except Exception:
        err = {"message": result.stderr.decode(errors="replace")}
    raise NonRetryableError(err)  # unknown code — default to no-retry

Limitation: Without semantic exit codes the agent must parse error text to decide retry safety — unreliable across versions and locales