Skip to content

11 critical timeouts

Part II: Execution & Reliability | Challenge §11

11. Timeouts & Hanging Processes

Severity: Critical | Frequency: Common | Detectability: Hard | Token Spend: High | Time: Critical | Context: Low

The Problem

Agents have finite time budgets per tool call. A command that runs forever (network hang, deadlock, waiting for input) burns the budget and returns nothing.

Sources of indefinite hangs:

$ curl http://unreachable-host/api   # DNS timeout: 30-120s default
$ tool sync --remote                  # waits for remote that never responds
$ flock /var/lock/myapp.lock cmd     # waits if lock is held
$ tool process-queue                  # long-running daemon started as CLI
$ docker pull large-image            # download with no progress/timeout

Partial output before hang:

$ tool import large-file.csv
Importing row 1...
Importing row 2...
Importing row 3...
[hangs at row 4]

Agent sees partial output, doesn't know if it succeeded or hung.

Timeout that produces no output:

$ tool --timeout 30 slow-operation
# exits after 30s with exit code 124 (timeout's convention)
# but produces no output — agent doesn't know what happened

Impact

  • Agent turn fails with timeout, no error information extracted
  • Partial side effects may have occurred (unknown state)
  • Agent may retry, causing duplicate side effects

Solutions

Built-in timeout flags:

tool operation --timeout 30s        # fail after 30 seconds
tool operation --connect-timeout 5s # specifically for connection phase

Progress heartbeats to stderr:

$ tool long-operation --output json
# stderr:
[  2s] Starting...
[  5s] Phase 1/3: downloading (23%)
[ 10s] Phase 1/3: downloading (67%)
[ 15s] Phase 2/3: processing
# stdout (only on completion):
{"ok": true, "data": {...}}

Emit partial results before timeout:

{
  "ok": false,
  "partial": true,
  "data": {"processed": 42, "total": 100},
  "error": {"code": "TIMEOUT", "message": "Operation timed out after 30s"},
  "resume_token": "abc123"   // allows resuming if supported
}

For framework design: - Every command has a default timeout; --timeout 0 means no timeout (must be explicit) - Timeout exits with a specific code (e.g., 7) and always emits JSON error - Provide --heartbeat-interval to control stderr progress frequency - Track and report wall time in every JSON response's meta.duration_ms

Evaluation

Score Condition
0 No timeout mechanism; commands hang indefinitely on network or lock contention
1 --timeout flag exists but emits only text output; exit code is non-specific (e.g., 1)
2 Timeout emits structured JSON error with code: "TIMEOUT"; exit code is defined (exit 10)
3 Timeout emits partial result with completed_steps, resume_token, and meta.duration_ms; --heartbeat-interval available

Check: Run a command pointing at an unreachable host with a 2s timeout — verify it exits within 3s with a JSON error and a defined exit code.


Agent Workaround

Enforce a timeout at the subprocess level and parse whatever partial output exists:

import subprocess, json, sys

try:
    result = subprocess.run(
        cmd,
        capture_output=True,
        timeout=30,          # enforce externally even if --timeout not available
        text=True,
    )
    output = result.stdout
except subprocess.TimeoutExpired as e:
    output = (e.stdout or b"").decode(errors="replace")
    # Try to parse partial JSON if any was flushed before timeout
    try:
        parsed = json.loads(output.strip().split("\n")[-1])
    except Exception:
        parsed = {"ok": False, "error": {"code": "TIMEOUT", "partial_output": output}}

# Check meta.duration_ms if present to detect near-timeout situations

Limitation: If the tool buffers all output and flushes nothing before timeout, the agent receives no partial result — there is no workaround for fully-buffered tools; use a shorter timeout to fail fast and avoid wasting turn budget