07 medium output nondeterminism

Part I: Output & Parsing | Challenge §7

7. Output Non-Determinism

The Problem

Agents compare outputs, cache results, detect changes, and build logic on top of command results. If the same command with the same arguments produces different output on successive runs, all of these break silently.

Random map/set ordering:

$ tool list-permissions --role admin
{"permissions": ["write", "read", "delete", "admin"]}

$ tool list-permissions --role admin
{"permissions": ["admin", "delete", "read", "write"]}

# Agent compares: permissions changed? No — just reordered.
# Diff-based change detection: false positive every time

Timestamps embedded in data fields:

$ tool get-status --output json
{"status": "ok", "checked_at": "2024-03-11T14:30:01Z", "uptime": 3600}

$ tool get-status --output json
{"status": "ok", "checked_at": "2024-03-11T14:30:04Z", "uptime": 3603}

# Agent caches result, checks if output changed: always "changed"
# Retry detection: can't tell if operation ran twice or output just differs

Random IDs in dry-run output:

$ tool deploy --dry-run
{"effect": "would_create", "preview_id": "prev-a3f2c1"}

$ tool deploy --dry-run
{"effect": "would_create", "preview_id": "prev-9b4d2e"}

# Agent uses preview_id for follow-up call: ID is already stale

Unordered batch results:

$ tool list-users
{"users": [{"id": 3}, {"id": 1}, {"id": 2}]}  # run 1

$ tool list-users
{"users": [{"id": 1}, {"id": 3}, {"id": 2}]}  # run 2 — different order

Impact

Change detection produces constant false positives
Result caching is impossible
Agent cannot tell "operation ran twice" from "output varies naturally"
Dry-run IDs are unusable for follow-up calls

Solutions

Sort all collections in output:

// Always sort arrays of objects by a stable key
{"users": [{"id": 1}, {"id": 2}, {"id": 3}]}

// Always sort string arrays lexicographically
{"permissions": ["admin", "delete", "read", "write"]}

Separate volatile metadata from stable data:

{
  "ok": true,
  "data": {                          // stable — safe to cache/compare
    "status": "ok",
    "version": "1.2.3"
  },
  "meta": {                          // volatile — do not compare
    "checked_at": "2024-03-11T14:30:01Z",
    "duration_ms": 45,
    "request_id": "req-abc"
  }
}

Deterministic dry-run IDs:

# Dry-run preview ID derived from inputs, not random
preview_id = sha256(command + args + timestamp_truncated_to_minute)
# Same args within the same minute → same preview ID

--stable-output flag:

tool list-users --stable-output
# Sorts all collections, omits volatile fields (timestamps, durations)
# Output is deterministic for identical inputs

For framework design: - All array fields sorted by default in --output json mode - data and meta are top-level siblings; agents compare data only - Dry-run IDs are content-addressed, not random - Document which fields are volatile in the output schema ("volatile": true)

Evaluation

Score	Condition
0	Array ordering varies between identical runs; timestamps embedded in top-level data fields
1	Some collections sorted but not all; timestamps in `data` alongside stable fields
2	All collections sorted stably; timestamps isolated in `meta`; output is reproducible for identical inputs
3	`--stable-output` flag omits all volatile fields; dry-run IDs are content-addressed; volatile fields documented in schema

Check: Run the same read command twice in a row and diff the data fields — any ordering difference or timestamp change in data (not meta) is a failure.

Agent Workaround

Compare only data, never meta; extract specific fields rather than diffing full output:

def get_stable(cmd: list[str]) -> dict:
    result = subprocess.run([*cmd, "--output", "json"], capture_output=True, text=True)
    parsed = json.loads(result.stdout)
    # Only compare data — meta contains timestamps and request IDs
    return parsed.get("data", parsed)

# Detect changes correctly
before = get_stable(["tool", "get-status"])
after  = get_stable(["tool", "get-status"])
changed = before != after  # safe — meta excluded

Sort collections before comparing if the tool doesn't:

import json

def normalize(obj):
    if isinstance(obj, list):
        return sorted([normalize(i) for i in obj], key=lambda x: json.dumps(x, sort_keys=True))
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in sorted(obj.items())}
    return obj

before_norm = normalize(before)
after_norm  = normalize(after)

Limitation: If the tool embeds random IDs or timestamps directly in data fields (not meta) with no way to suppress them, deterministic comparison is impossible — extract and compare only the specific fields that represent meaningful state