REQ-O-049: LLM Token Budget Flags
Tier: Opt-In | Priority: P2
Source: §4 Verbosity & Token Cost · §43 Tool Output Result Size Unboundedness
Addresses: Severity: Medium–Critical / Token Spend: High / Time: Low / Context: High
Description
The framework MUST provide three flags that let an LLM caller manage how much output lands in its context window:
- --token-limit <n>: truncate output to the first n tokens and emit meta.truncated: true with meta.token_limit: n in the response envelope. The truncation MUST be clean (never mid-token or mid-field) and MUST be indicated by a truncation sentinel appended to the data.
- --token-count: instead of emitting the command output, emit only the token count that the output would consume (meta.token_count: n, data: null). The command MUST execute fully to produce the count, but MUST NOT write the payload to stdout.
- --token-offset <n>: skip the first n tokens of output before beginning emission. When combined with --token-limit, this enables sliding-window access over output that exceeds the caller's context budget.
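The clean-truncation rule for --token-limit could be sketched as follows. This is a minimal illustration, not the framework's implementation: it assumes the payload is a list of already-serialized records, uses a whitespace split as a stand-in for the declared tokenizer, and assumes the sentinel does not count toward the limit. The names `truncate_records`, `count_tokens`, and `SENTINEL` are hypothetical.

```python
SENTINEL = "[TRUNCATED]"  # sentinel text taken from the Wire Format example


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def truncate_records(records: list[str], limit: int) -> tuple[list[str], bool]:
    """Keep whole records while the running token total stays within `limit`.

    Cutting only at record boundaries means the output is never split
    mid-token or mid-field. Returns (data, truncated).
    """
    kept: list[str] = []
    used = 0
    for record in records:
        cost = count_tokens(record)
        if used + cost > limit:
            # Budget exhausted: stop before this record and flag the cut.
            return kept + [SENTINEL], True
        kept.append(record)
        used += cost
    return kept, False


data, truncated = truncate_records(["a b c", "d e", "f g h i"], limit=5)
print(data, truncated)  # ['a b c', 'd e', '[TRUNCATED]'] True
```

Truncating at record boundaries can return fewer than n tokens, which still satisfies a "≤ n tokens" reading of the limit.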
Token counting MUST use the same tokenizer as the primary consumer (typically cl100k_base or the framework's declared default). If the tokenizer is configurable, --tokenizer <name> selects it.
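One way to picture the --tokenizer selection is a small name-to-callable registry with a declared default. The sketch below is an assumption about shape only: the registry, the two toy tokenizers, and `resolve_tokenizer` are hypothetical, and a real deployment would register an actual encoding such as cl100k_base instead.

```python
from typing import Callable, Optional

Tokenizer = Callable[[str], list[str]]

# Hypothetical registry; the toy entries stand in for real encodings.
TOKENIZERS: dict[str, Tokenizer] = {
    "whitespace": str.split,  # one token per whitespace-separated word
    "chars": list,            # one token per character
}
DEFAULT_TOKENIZER = "whitespace"  # the framework's declared default (assumed)


def resolve_tokenizer(name: Optional[str] = None) -> Tokenizer:
    """Return the tokenizer selected by --tokenizer, or the declared default."""
    key = name or DEFAULT_TOKENIZER
    try:
        return TOKENIZERS[key]
    except KeyError:
        known = ", ".join(sorted(TOKENIZERS))
        raise ValueError(f"unknown tokenizer {key!r}; expected one of: {known}") from None


tokenize = resolve_tokenizer("chars")
print(len(tokenize("abc")))  # 3
```

Rejecting unknown names outright, rather than falling back silently, keeps reported counts from drifting away from what the consumer actually spends.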
These flags are distinct from --fields (which filters by key) and the framework's byte-level hard cap (REQ-F-052). They operate on the token dimension, which is what LLM callers actually spend.
Acceptance Criteria
- --token-limit 500 produces output that, when tokenized, contains ≤500 tokens; meta.truncated: true and meta.token_limit: 500 are present in the envelope
- --token-count returns data: null and meta.token_count: N without writing the payload
- --token-offset 200 --token-limit 200 returns the second window of 200 tokens
- All three flags are available on every command without per-command implementation
- meta.token_count is present in every response when --token-count is passed, regardless of --output format
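The "available on every command without per-command implementation" criterion can be illustrated with a shared parent parser. This sketch uses Python's standard argparse purely as an analogy; the framework's own mechanism is not specified here, and the `show-event` command is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Declare the budget flags once, on a parent parser with no help of its own.
    budget = argparse.ArgumentParser(add_help=False)
    budget.add_argument("--token-limit", type=int, default=None)
    budget.add_argument("--token-offset", type=int, default=0)
    budget.add_argument("--token-count", action="store_true")
    budget.add_argument("--tokenizer", default="cl100k_base")

    root = argparse.ArgumentParser(prog="tool")
    commands = root.add_subparsers(dest="command", required=True)
    # Every command inherits the flags; no command defines them itself.
    for name in ("list-events", "show-event"):  # "show-event" is hypothetical
        commands.add_parser(name, parents=[budget])
    return root


args = build_parser().parse_args(
    ["list-events", "--token-offset", "200", "--token-limit", "200"]
)
print(args.command, args.token_offset, args.token_limit)  # list-events 200 200
```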
Schema
Type: response-envelope.md
Requirement-specific meta fields:
| Field | Type | When present |
|---|---|---|
| meta.truncated | boolean | When --token-limit truncated the payload |
| meta.token_limit | integer | When --token-limit was passed |
| meta.token_count | integer | When --token-count was passed |
| meta.token_offset | integer | When --token-offset was nonzero |
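The "When present" column can be read as a set of conditions for populating the envelope's meta object. The helper below is a hypothetical sketch of that reading, not part of the framework. It follows the table literally for meta.token_offset (omitted when zero), whereas the Wire Format example echoes a zero offset, so treat that condition as one interpretation.

```python
from typing import Optional


def budget_meta(
    *,
    token_limit: Optional[int] = None,
    token_offset: int = 0,
    token_count: Optional[int] = None,
    truncated: bool = False,
) -> dict:
    """Build only the requirement-specific meta fields, per the table above."""
    meta: dict = {}
    if truncated:
        meta["truncated"] = True  # --token-limit cut the payload
    if token_limit is not None:
        meta["token_limit"] = token_limit  # --token-limit was passed
    if token_count is not None:
        meta["token_count"] = token_count  # --token-count was passed
    if token_offset:
        meta["token_offset"] = token_offset  # nonzero --token-offset
    return meta


print(budget_meta(token_limit=500, truncated=True))
# {'truncated': True, 'token_limit': 500}
```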
Wire Format
$ tool list-events --token-limit 500 --output json
{
  "ok": true,
  "data": ["...first 500 tokens of events...", "[TRUNCATED]"],
  "error": null,
  "warnings": [],
  "meta": {
    "truncated": true,
    "token_limit": 500,
    "token_offset": 0,
    "duration_ms": 84
  }
}
$ tool list-events --token-count --output json
{
  "ok": true,
  "data": null,
  "error": null,
  "warnings": [],
  "meta": {
    "token_count": 4217,
    "duration_ms": 61
  }
}
Sliding window over large output:
$ tool list-events --token-offset 500 --token-limit 500 --output json
# Returns tokens 500–999 of the full output
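A caller-side loop over successive windows might look like the sketch below. It assumes the caller advances --token-offset by the window size until a response arrives without meta.truncated. `iter_windows` and `make_fake_fetch` are hypothetical helpers, and the fake fetch simulates the tool in memory (without the sentinel) instead of invoking the CLI.

```python
from typing import Callable, Iterator


def make_fake_fetch(tokens: list[str]) -> Callable[[int, int], dict]:
    """Simulate `tool ... --token-offset <o> --token-limit <n>` over a token list."""
    def fetch(offset: int, limit: int) -> dict:
        end = offset + limit
        meta = {"token_limit": limit, "token_offset": offset}
        if end < len(tokens):
            meta["truncated"] = True  # more output remains past this window
        return {"ok": True, "data": tokens[offset:end], "meta": meta}
    return fetch


def iter_windows(fetch: Callable[[int, int], dict], window: int) -> Iterator[list[str]]:
    """Yield each window's data, advancing the offset until nothing is truncated."""
    offset = 0
    while True:
        envelope = fetch(offset, window)
        yield envelope["data"]
        if not envelope["meta"].get("truncated"):
            break
        offset += window


fetch = make_fake_fetch([f"t{i}" for i in range(1200)])
print([len(chunk) for chunk in iter_windows(fetch, 500)])  # [500, 500, 200]
```

Because it is a generator, the caller can stop consuming windows as soon as it has what it needs, without paying for the rest.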
Example
Opt-in at the framework level; token counting uses the declared tokenizer.
app = Framework("tool")
app.enable_token_budget(tokenizer="cl100k_base")
# Agent decides whether to fetch before committing context budget:
count_result = run("tool list-events --token-count --output json")
if count_result["meta"]["token_count"] > context_budget:
    # Fetch in windows instead
    run(f"tool list-events --token-offset 0 --token-limit {context_budget} --output json")
else:
    run("tool list-events --output json")
Related
| Requirement | Tier | Relationship |
|---|---|---|
| REQ-F-052 | F | Provides: byte-level hard cap; token budget flags operate on the token dimension instead |
| REQ-O-002 | O | Composes: --fields reduces the key space before token limiting reduces the token count |
| REQ-O-001 | O | Composes: token budget flags apply after format selection, not before |
| schemas/manifest-response.md | | Exposes: output_formats field in CommandEntry declares when LLM-optimized formats are available alongside token budget flags |