15 high race conditions
Part II: Execution & Reliability | Challenge §15
15. Race Conditions & Concurrency
Severity: High | Frequency: Situational | Detectability: Hard | Token Spend: Medium | Time: Medium | Context: Low
Impact
- Parallel agent tool calls silently corrupt shared state — config writes lost, temp files overwritten
- Lock conflicts produce unstructured errors with no retry guidance
- Non-deterministic failures are hard to reproduce and hard to diagnose from logs
The Problem
Agents may invoke multiple tool calls in parallel. CLI tools designed for single-user sequential use can corrupt shared state when called concurrently.
Lock file conflicts:
# Two parallel agent tool calls:
$ tool build # writes to /tmp/tool.lock
$ tool test # also writes to /tmp/tool.lock → conflict
# One gets: Error: lock file exists
# Other: runs fine
# Agent doesn't know which one to retry
Shared temp files:
$ tool process --input data.csv --output /tmp/result.json
# Two parallel calls both write to /tmp/result.json
# One overwrites the other's result silently
Race on config mutation:
# Parallel calls:
$ tool config set key1 val1 # reads config, writes key1
$ tool config set key2 val2 # reads config (before key1 written), writes key2
# Result: key1 is lost
Solutions
Session-isolated temp paths:
tool process --input data.csv --session-id $AGENT_SESSION_ID
# Uses /tmp/tool/$AGENT_SESSION_ID/result.json automatically
Advisory locking with timeout:
$ tool build
Error: {
"code": "LOCK_HELD",
"message": "Another build is running (pid 1234, started 30s ago)",
"suggestion": "Wait for it to complete or use --force-unlock if process is dead",
"retry_after_ms": 5000
}
For framework design:
- All temp files scoped to $TOOL_SESSION_ID or a random run ID
- Lock acquisition has a timeout and emits retry_after_ms
- Config mutations use atomic write (write to temp, rename)
Evaluation
| Score | Condition |
|---|---|
| 0 | Shared temp files and no locking; parallel invocations silently corrupt each other |
| 1 | Some locking exists but lock conflict produces unstructured error with no retry hint |
| 2 | Structured LOCK_HELD error with retry_after_ms; lock always released on exit |
| 3 | --session-id namespaces all temp paths; atomic config writes; lock timeout emits structured error |
Check: Invoke two parallel instances of a mutating command — neither should silently succeed while corrupting the other's output; one must fail with a structured LOCK_HELD error containing retry_after_ms.
Agent Workaround
Serialize parallel calls when a tool does not support concurrent invocation:
import threading, time, json
_tool_lock = threading.Lock() # serialize within the same agent process
def run_serialized(cmd):
with _tool_lock:
return run(cmd)
# If a LOCK_HELD error is returned, back off and retry
def run_with_backoff(cmd, max_retries=3):
for attempt in range(max_retries):
result = run(cmd)
parsed = json.loads(result.stdout) if result.stdout else {}
error_code = parsed.get("error", {}).get("code", "")
if error_code == "LOCK_HELD":
wait_ms = parsed.get("error", {}).get("retry_after_ms", 2000)
time.sleep(wait_ms / 1000)
continue
return result
raise RuntimeError("Lock not released after retries")
Pass a unique session ID per parallel invocation if the flag exists:
tool process --session-id $(uuidgen) --input data.csv
Limitation: If the tool uses global shared state with no locking at all, concurrent invocations will silently corrupt each other with no error — the only safe approach is to enforce sequential execution at the agent level, which eliminates any parallelism benefit