Part III: Errors & Discoverability | Challenge §19

19. Retry Hints in Error Responses

Severity: High | Frequency: Very Common | Detectability: Medium | Token Spend: High | Time: High | Context: Medium

The Problem

When a command fails, the agent decides: retry immediately, retry after delay, retry with different args, or give up. Without explicit guidance, agents either retry everything (wasting resources, amplifying rate-limit violations) or give up on recoverable failures.

Agent retrying a non-retryable error:

$ tool create-user --email "not-an-email"
{"ok": false, "error": {"code": "VALIDATION_ERROR", "message": "Invalid email"}}
exit 1

# Agent retries 3 times with identical args
# Each retry fails identically — wasted calls, wasted tokens

Agent giving up on a retryable error:

$ tool call-api
{"ok": false, "error": {"code": "SERVICE_UNAVAILABLE", "message": "Try again later"}}
exit 1

# Agent marks task as failed and escalates to user
# But the service recovered 2 seconds later

Rate limit with no backoff hint:

$ tool sync-data
{"ok": false, "error": {"code": "RATE_LIMITED", "message": "Too many requests"}}
exit 9

# Agent retries immediately → hits rate limit again → retry loop

Impact

  • Retrying blindly amplifies the original problem (rate limits, load)
  • Giving up on recoverable failures wastes the entire task
  • Agent cannot distinguish "try again" from "fix your args first"

Solutions

Include retryable in every error, and retry_after_ms wherever a delay applies:

{
  "ok": false,
  "error": {
    "code": "RATE_LIMITED",
    "message": "API rate limit exceeded",
    "retryable": true,
    "retry_after_ms": 5000,
    "retry_strategy": "exponential_backoff",
    "max_retries": 3
  }
}

{
  "ok": false,
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid email address",
    "retryable": false,
    "fix_required": "Correct the --email argument before retrying"
  }
}
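
To illustrate how an agent might consume these hints, here is a minimal sketch that turns an error object into a list of wait times. The function name backoff_schedule_ms and its default values are hypothetical. The source does not define how retry_strategy combines with retry_after_ms, so treating "exponential_backoff" as "double the hinted delay on each further retry" is an assumption.

```python
def backoff_schedule_ms(error: dict,
                        default_base_ms: int = 1000,
                        default_max_retries: int = 3) -> list[int]:
    """Return the delay (in ms) to wait before each retry, or [] if not retryable."""
    if error.get("retryable") is not True:
        return []
    base = error.get("retry_after_ms", default_base_ms)
    retries = error.get("max_retries", default_max_retries)
    if error.get("retry_strategy") == "exponential_backoff":
        # Assumed interpretation: start at the hinted delay and double each time.
        return [base * (2 ** i) for i in range(retries)]
    # Missing or unrecognized strategy: wait the hinted delay before every retry.
    return [base] * retries


rate_limited = {
    "code": "RATE_LIMITED",
    "retryable": True,
    "retry_after_ms": 5000,
    "retry_strategy": "exponential_backoff",
    "max_retries": 3,
}
print(backoff_schedule_ms(rate_limited))  # [5000, 10000, 20000]
```

An empty schedule doubles as the "do not retry" signal, so the caller needs only one code path for both the validation and the rate-limit case.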

Retry classification taxonomy:

retryable: false   → VALIDATION_ERROR, NOT_FOUND, PERMISSION_DENIED, CONFLICT
retryable: true    → TIMEOUT, SERVICE_UNAVAILABLE, RATE_LIMITED, NETWORK_ERROR
retryable: "maybe" → INTERNAL_ERROR (sometimes transient, sometimes not)

Exit code alignment:

Exit 9 (RATE_LIMITED)       → always retryable, check retry_after_ms
Exit 7 (TIMEOUT)            → retryable, immediate retry ok
Exit 8 (PERMISSION_DENIED)  → never retryable without auth change
Exit 2 (BAD_ARGS)           → never retryable without arg change

For framework design:

  • Every error class has a default retryable value in the error registry
  • retry_after_ms sourced from response header (Retry-After) when available
  • Framework-level retry logic: honor retryable and retry_after_ms automatically
  • Emit attempt and max_attempts in meta so agents know retry history
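
The framework points above could be sketched roughly as follows. ERROR_REGISTRY, retry_after_ms_from_header, and build_error are hypothetical names, not part of any existing framework. The sketch assumes the Retry-After header carries delay-seconds (the HTTP-date form is ignored) and that unknown error codes default to non-retryable.

```python
from typing import Optional

# Hypothetical registry: default retryability per error code,
# following the retry classification taxonomy.
ERROR_REGISTRY = {
    "VALIDATION_ERROR": False,
    "NOT_FOUND": False,
    "PERMISSION_DENIED": False,
    "CONFLICT": False,
    "TIMEOUT": True,
    "SERVICE_UNAVAILABLE": True,
    "RATE_LIMITED": True,
    "NETWORK_ERROR": True,
    "INTERNAL_ERROR": "maybe",
}


def retry_after_ms_from_header(value: Optional[str]) -> Optional[int]:
    """Convert a Retry-After value in delay-seconds to milliseconds.

    Returns None for a missing header or for the HTTP-date form,
    which this sketch does not parse.
    """
    if value is None:
        return None
    value = value.strip()
    if not value.isdigit():
        return None
    return int(value) * 1000


def build_error(code: str, message: str, attempt: int, max_attempts: int,
                retry_after_header: Optional[str] = None) -> dict:
    """Assemble an error response with retry hints and retry history."""
    error = {
        "code": code,
        "message": message,
        # Assumption: codes missing from the registry are non-retryable.
        "retryable": ERROR_REGISTRY.get(code, False),
    }
    delay_ms = retry_after_ms_from_header(retry_after_header)
    if delay_ms is not None:
        error["retry_after_ms"] = delay_ms
    return {
        "ok": False,
        "error": error,
        "meta": {"attempt": attempt, "max_attempts": max_attempts},
    }
```

For example, build_error("RATE_LIMITED", "API rate limit exceeded", 1, 3, retry_after_header="5") yields an error with retryable set to true and retry_after_ms set to 5000, plus meta.attempt and meta.max_attempts.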

Evaluation

Score | Condition
0 | No retryable field; agent cannot distinguish transient from permanent failures; no delay hint for rate limits
1 | Some errors include retryable; retry_after_ms absent; agent must guess delay
2 | All errors include retryable: true/false; rate-limited responses include retry_after_ms; exit code encodes retryability
3 | retry_strategy field present; max_retries hint provided; meta.attempt and meta.max_attempts track retry history

Check: Trigger a rate-limit error (or a validation error) and verify the response includes retryable: true (or false) and, for rate limits, a retry_after_ms value.
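
The check above can be automated against a captured response. The sketch below uses a hypothetical helper, check_retry_hints, and assumes the response shape shown under Solutions. It accepts only a boolean retryable, matching the score-2 condition, so a "maybe" value would be reported as a problem.

```python
def check_retry_hints(response: dict) -> list[str]:
    """Return a list of problems; an empty list means the check passes."""
    error = response.get("error")
    if response.get("ok") is not False or not isinstance(error, dict):
        return ["not an error response"]

    problems = []
    if not isinstance(error.get("retryable"), bool):
        problems.append("retryable missing or not true/false")

    if error.get("code") == "RATE_LIMITED":
        delay = error.get("retry_after_ms")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(delay, bool) or not isinstance(delay, int) or delay < 0:
            problems.append("retry_after_ms missing or invalid for RATE_LIMITED")
    return problems
```

Feeding it the bare RATE_LIMITED response from The Problem section reports both issues, while the annotated responses from Solutions pass cleanly.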


Agent Workaround

Implement retry logic driven by the retryable and retry_after_ms fields:

import json
import subprocess
import time

def run_with_retry(cmd: list[str], max_attempts: int = 3) -> dict:
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        try:
            parsed = json.loads(result.stdout)
        except json.JSONDecodeError:
            # Unparseable output: back off and retry, re-raising on the last attempt
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)
            continue

        if parsed.get("ok"):
            return parsed

        error = parsed.get("error", {})
        retryable = error.get("retryable")

        if retryable is False:
            # Permanent failure: do not retry
            raise RuntimeError(
                f"[{error.get('code')}] {error.get('message')} "
                f"(fix: {error.get('fix_required', 'see error')})"
            )

        # "maybe" (e.g. INTERNAL_ERROR) is retried within the same attempt budget
        if retryable in (True, "maybe") and attempt < max_attempts:
            # Prefer the tool's hint; fall back to exponential backoff
            delay_ms = error.get("retry_after_ms", 1000 * (2 ** attempt))
            time.sleep(delay_ms / 1000)
            continue

        # Attempts exhausted, or the retryable field is absent or unrecognized
        raise RuntimeError(f"Command failed after {attempt} attempts: {parsed}")

    raise RuntimeError("Max attempts reached")

Map exit codes to retry decisions when the retryable field is absent:

import time

# Exit codes that are always retryable
RETRYABLE_EXIT_CODES = {7, 9}   # TIMEOUT, RATE_LIMITED per spec
# Exit codes that are never retryable
PERMANENT_EXIT_CODES = {2, 3, 4, 8}  # BAD_ARGS, USAGE, NOT_FOUND, PERMISSION_DENIED

def should_retry(returncode: int) -> bool:
    """Decide from the exit code alone whether the caller should retry."""
    if returncode in RETRYABLE_EXIT_CODES:
        # No retry_after_ms hint is available here, so wait a fixed 5 s
        time.sleep(5)
        return True
    if returncode in PERMANENT_EXIT_CODES:
        raise RuntimeError("Permanent failure: do not retry")
    # Unknown exit code: do not retry
    return False

Limitation: If the tool provides no retryable field and uses exit code 1 for all failures (both permanent and transient), the agent cannot safely distinguish them. In that case, limit retries to a low count (≤2) with exponential backoff, and treat the error as non-retryable once the final attempt fails.
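
A minimal sketch of that fallback, assuming the command is safe to repeat (idempotent). The function name run_opaque and its parameters are hypothetical; run and sleep are injectable only so the logic can be exercised without a real subprocess or real delays.

```python
import subprocess
import time


def run_opaque(cmd: list[str], max_retries: int = 2, base_delay_s: float = 1.0,
               run=subprocess.run, sleep=time.sleep):
    """Run cmd; on any non-zero exit, retry up to max_retries times with backoff."""
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    # Out of retries: the failure cannot be classified, so stop and escalate.
    raise RuntimeError(
        f"Giving up after {attempts} attempts (exit {result.returncode}); "
        "treating as non-retryable"
    )
```

Keeping max_retries small bounds the cost of repeating a failure that was permanent all along, which is the risk when the tool gives no way to tell the two cases apart.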