Skip to content

33 medium observability

Part VI: Observability | Challenge §33

33. Observability & Audit Trail

Severity: Medium | Frequency: Very Common | Detectability: Easy | Token Spend: Medium | Time: High | Context: Medium

The Problem

When an agent-driven operation fails or produces unexpected results, there's often no way to trace what commands ran, in what order, with what parameters, and what they returned.

No correlation between tool calls:

# Three tool calls: deploy, verify, rollback
# Each has separate logs with no shared identifier
# Impossible to reconstruct the sequence after the fact

No request ID in output:

$ tool deploy
{"ok": true, "effect": "deployed"}
# No request ID to correlate with server-side logs

No timing information:

# Tool ran for 45 seconds
# Output has no timestamps
# Impossible to identify which step was slow

Impact

  • No audit trail means post-incident reconstruction is impossible
  • Missing request_id prevents correlation between agent-side logs and server-side logs
  • No timing data makes it impossible to identify which step in a sequence was slow
  • Agents that re-run operations cannot tell if the outcome was from the current run or a cached/prior result

Solutions

Request/trace ID in every response:

{
  "ok": true,
  "meta": {
    "request_id": "req-abc123",
    "trace_id": "trace-xyz789",
    "duration_ms": 4521,
    "timestamp": "2024-03-11T14:30:00Z",
    "command": "deploy",
    "version": "1.2.3"
  }
}

Correlation ID propagation:

TOOL_TRACE_ID=agent-session-42-step-7 tool deploy
# All log entries for this call include the trace ID

Structured audit log:

$ tool audit-log --since 1h --output jsonl
{"timestamp": "...", "command": "deploy", "params": {...},
 "exit_code": 0, "duration_ms": 4521, "operator": "agent-session-42"}

For framework design: - Every response includes meta.request_id (server-assigned) and meta.trace_id (caller-supplied) - TOOL_TRACE_ID env var propagated automatically - Framework writes append-only audit log to ~/.local/share/tool/audit.jsonl

Evaluation

Score Condition
0 No request_id, no timing, no audit log; tool calls cannot be correlated after the fact
1 meta.request_id present on some responses; no duration_ms; no audit log
2 meta.request_id and meta.duration_ms on all responses; TOOL_TRACE_ID env var propagated
3 meta.trace_id accepts caller-supplied value; tool audit-log --output jsonl available; append-only audit log written automatically

Check: Supply TOOL_TRACE_ID=test-123 and run any command — verify meta.trace_id == "test-123" in the JSON response.


Agent Workaround

Supply a unique trace ID per agent session and per operation; log request_id from every response:

import subprocess, json, uuid, os, time

# Generate a session-scoped trace ID
SESSION_TRACE_ID = f"agent-session-{uuid.uuid4().hex[:8]}"

def traced_run(cmd: list[str], operation: str) -> dict:
    # Per-operation trace ID for fine-grained correlation
    op_trace_id = f"{SESSION_TRACE_ID}-{operation}-{uuid.uuid4().hex[:4]}"

    env = {**os.environ, "TOOL_TRACE_ID": op_trace_id}
    start = time.monotonic()

    result = subprocess.run(cmd, capture_output=True, text=True, env=env)
    elapsed_ms = int((time.monotonic() - start) * 1000)

    try:
        parsed = json.loads(result.stdout)
    except json.JSONDecodeError:
        raise RuntimeError(f"No JSON from {operation}")

    meta = parsed.get("meta", {})
    request_id = meta.get("request_id", "unknown")
    tool_duration = meta.get("duration_ms", "unknown")

    # Log for post-incident reconstruction
    print(
        f"[TRACE] op={operation} trace={op_trace_id} "
        f"request_id={request_id} "
        f"agent_ms={elapsed_ms} tool_ms={tool_duration}"
    )

    return parsed

result = traced_run(
    ["tool", "deploy", "--env", "staging", "--output", "json"],
    operation="deploy",
)

Query the audit log when reconstructing what happened:

def get_audit_log(tool: str, since: str = "1h") -> list[dict]:
    result = subprocess.run(
        [tool, "audit-log", "--since", since, "--output", "jsonl"],
        capture_output=True, text=True,
    )
    lines = [l for l in result.stdout.splitlines() if l.strip()]
    return [json.loads(l) for l in lines]

Limitation: If the tool provides no request_id and no audit log, the only correlation mechanism is timestamps — log the wall-clock time of every tool call in the agent and compare against server-side logs manually to reconstruct sequences