19 high retry hints
Part III: Errors & Discoverability | Challenge §19
19. Retry Hints in Error Responses
Severity: High | Frequency: Very Common | Detectability: Medium | Token Spend: High | Time: High | Context: Medium
The Problem
When a command fails, the agent decides: retry immediately, retry after delay, retry with different args, or give up. Without explicit guidance, agents either retry everything (wasting resources, amplifying rate-limit violations) or give up on recoverable failures.
Agent retrying a non-retryable error:
$ tool create-user --email "not-an-email"
{"ok": false, "error": {"code": "VALIDATION_ERROR", "message": "Invalid email"}}
exit 1
# Agent retries 3 times with identical args
# Each retry fails identically — wasted calls, wasted tokens
Agent giving up on a retryable error:
$ tool call-api
{"ok": false, "error": {"code": "SERVICE_UNAVAILABLE", "message": "Try again later"}}
exit 1
# Agent marks task as failed and escalates to user
# But the service recovered 2 seconds later
Rate limit with no backoff hint:
$ tool sync-data
{"ok": false, "error": {"code": "RATE_LIMITED", "message": "Too many requests"}}
exit 9
# Agent retries immediately → hits rate limit again → retry loop
Impact
- Retry amplifies the original problem (rate limits, load)
- No-retry on recoverable failures wastes the entire task
- Agent cannot distinguish "try again" from "fix your args first"
Solutions
retryable and retry_after_ms in every error:
{
"ok": false,
"error": {
"code": "RATE_LIMITED",
"message": "API rate limit exceeded",
"retryable": true,
"retry_after_ms": 5000,
"retry_strategy": "exponential_backoff",
"max_retries": 3
}
}
{
"ok": false,
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid email address",
"retryable": false,
"fix_required": "Correct the --email argument before retrying"
}
}
Retry classification taxonomy:
retryable: false → VALIDATION_ERROR, NOT_FOUND, PERMISSION_DENIED, CONFLICT
retryable: true → TIMEOUT, SERVICE_UNAVAILABLE, RATE_LIMITED, NETWORK_ERROR
retryable: "maybe" → INTERNAL_ERROR (sometimes transient, sometimes not)
Exit code alignment:
Exit 9 (RATE_LIMITED) → always retryable, check retry_after_ms
Exit 7 (TIMEOUT) → retryable, immediate retry ok
Exit 8 (PERMISSION_DENIED) → never retryable without auth change
Exit 2 (BAD_ARGS) → never retryable without arg change
For framework design:
- Every error class has a default retryable value in the error registry
- retry_after_ms sourced from response header (Retry-After) when available
- Framework-level retry logic: honor retryable and retry_after_ms automatically
- Emit attempt and max_attempts in meta so agents know retry history
Evaluation
| Score | Condition |
|---|---|
| 0 | No retryable field; agent cannot distinguish transient from permanent failures; no delay hint for rate limits |
| 1 | Some errors include retryable; retry_after_ms absent; agent must guess delay |
| 2 | All errors include retryable: true/false; rate-limited responses include retry_after_ms; exit code encodes retryability |
| 3 | retry_strategy field present; max_retries hint provided; meta.attempt and meta.max_attempts track retry history |
Check: Trigger a rate-limit error (or a validation error) and verify the response includes retryable: true (or false) and, for rate limits, a retry_after_ms value.
Agent Workaround
Implement retry logic driven by retryable and retry_after_ms fields:
import subprocess, json, time
def run_with_retry(cmd: list[str], max_attempts: int = 3) -> dict:
for attempt in range(1, max_attempts + 1):
result = subprocess.run(cmd, capture_output=True, text=True)
try:
parsed = json.loads(result.stdout)
except json.JSONDecodeError:
if attempt == max_attempts:
raise
time.sleep(2 ** attempt)
continue
if parsed.get("ok"):
return parsed
error = parsed.get("error", {})
retryable = error.get("retryable")
if retryable is False:
# Permanent failure — do not retry
raise RuntimeError(
f"[{error.get('code')}] {error.get('message')} "
f"(fix: {error.get('fix_required', 'see error')})"
)
if retryable is True and attempt < max_attempts:
delay_ms = error.get("retry_after_ms", 1000 * (2 ** attempt))
time.sleep(delay_ms / 1000)
continue
raise RuntimeError(f"Command failed after {attempt} attempts: {parsed}")
raise RuntimeError("Max attempts reached")
Map exit codes to retry decisions when retryable field is absent:
# Exit codes that are always retryable
RETRYABLE_EXIT_CODES = {7, 9} # TIMEOUT, RATE_LIMITED per spec
# Exit codes that are never retryable
PERMANENT_EXIT_CODES = {2, 3, 4, 8} # BAD_ARGS, USAGE, NOT_FOUND, PERMISSION_DENIED
if result.returncode in RETRYABLE_EXIT_CODES:
time.sleep(5)
# retry
elif result.returncode in PERMANENT_EXIT_CODES:
raise RuntimeError("Permanent failure — do not retry")
Limitation: If the tool provides no retryable field and uses exit code 1 for all failures (both permanent and transient), the agent cannot safely distinguish them — limit retries to a low count (≤2) with exponential backoff and treat unknown errors as non-retryable after the final attempt