REQ-F-016: UTF-8 Sanitization Before Serialization
Tier: Framework-Automatic | Priority: P1
Source: §9 Binary & Encoding Safety
Addresses: Severity: High / Token Spend: Low / Time: Medium / Context: Low
Description
The framework's JSON serializer MUST pass every string value through a UTF-8 sanitizer before serialization. The sanitizer MUST handle non-UTF-8 byte sequences (replace with U+FFFD), null bytes (replace with U+FFFD), and lone surrogate code points. The sanitizer MUST NOT raise an exception or crash; it MUST always produce valid UTF-8 output.
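A minimal sketch of such a sanitizer, assuming a Python implementation. The name `sanitize_text` is hypothetical, and replacing lone surrogates with U+FFFD is an assumption, since the requirement only says they must be handled.

```python
REPLACEMENT = "\ufffd"


def _needs_replacement(ch):
    # Null bytes and lone surrogates (U+D800..U+DFFF) cannot appear in clean output.
    return ch == "\x00" or 0xD800 <= ord(ch) <= 0xDFFF


def sanitize_text(value):
    """Return a str that encodes to valid UTF-8 and contains no null bytes.

    Hypothetical helper: accepts str or bytes and is written so that it
    does not raise on malformed input.
    """
    if isinstance(value, (bytes, bytearray)):
        # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
        value = bytes(value).decode("utf-8", errors="replace")
    return "".join(REPLACEMENT if _needs_replacement(ch) else ch for ch in value)
```

Decoding with `errors="replace"` is what keeps Latin-1 input from raising `UnicodeDecodeError`; the character filter then covers the cases that are already `str` values.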
Acceptance Criteria
- A command that returns a Latin-1 encoded string does not crash; the output is valid UTF-8 JSON
- A command that returns a string containing `\x00` produces valid JSON (no embedded null bytes in output)
- The framework never emits a `UnicodeDecodeError` or equivalent to stderr
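The criteria above could be checked mechanically against captured process output. The following is only a sketch: `output_problems` and its message strings are hypothetical names, and scanning stderr for the literal text `UnicodeDecodeError` is one simple interpretation of "or equivalent".

```python
import json


def output_problems(stdout_bytes, stderr_bytes=b""):
    """Return a list of acceptance-criteria violations (empty list means pass)."""
    problems = []
    try:
        text = stdout_bytes.decode("utf-8")  # strict decoding: raises on invalid UTF-8
        json.loads(text)
    except UnicodeDecodeError:
        problems.append("stdout is not valid UTF-8")
    except json.JSONDecodeError:
        problems.append("stdout is not valid JSON")
    if b"\x00" in stdout_bytes:
        problems.append("stdout contains a raw null byte")
    if b"UnicodeDecodeError" in stderr_bytes:
        problems.append("stderr mentions UnicodeDecodeError")
    return problems
```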
Schema
Types: response-envelope.md
Behavioral requirement — the sanitizer runs inside the framework's JSON serializer. No new fields are added to the envelope; the guarantee is that all string values in the output are valid UTF-8.
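One way to picture "runs inside the serializer" is a recursive pass over the envelope just before encoding. This is a sketch under assumptions: `sanitize_tree` and `dumps_envelope` are hypothetical names, Python's standard `json` module stands in for the framework's serializer, and sanitizing dictionary keys as well as values is a design choice the requirement does not state.

```python
import json


def _clean(text):
    # Compact stand-in for the sanitizer: NUL and lone surrogates become U+FFFD.
    return "".join(
        "\ufffd" if ch == "\x00" or 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


def sanitize_tree(node):
    """Return a copy of node with every string (and bytes) value sanitized."""
    if isinstance(node, (bytes, bytearray)):
        return _clean(bytes(node).decode("utf-8", errors="replace"))
    if isinstance(node, str):
        return _clean(node)
    if isinstance(node, dict):
        return {
            (sanitize_tree(k) if isinstance(k, str) else k): sanitize_tree(v)
            for k, v in node.items()
        }
    if isinstance(node, (list, tuple)):
        return [sanitize_tree(item) for item in node]
    return node  # numbers, booleans, None pass through unchanged


def dumps_envelope(envelope):
    # The envelope shape is untouched; only string contents change.
    return json.dumps(sanitize_tree(envelope))
```

With the default `ensure_ascii=True`, `json.dumps` writes the replacement character as the escape `\ufffd`, which matches the escaped form shown in the wire format.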
Wire Format
Clean JSON output after sanitization (invalid byte sequences and null bytes replaced with U+FFFD, escaped in JSON as `\ufffd`):
{
  "ok": true,
  "data": {
    "filename": "caf\ufffd.txt",
    "content": "line one\ufffdline two"
  },
  "error": null,
  "warnings": [
    "1 invalid UTF-8 sequence replaced with U+FFFD in field 'filename'",
    "1 null byte replaced with U+FFFD in field 'content'"
  ],
  "meta": {
    "request_id": "req_02AB",
    "command": "read-file",
    "timestamp": "2024-06-01T12:00:00Z"
  }
}
Example
Framework-Automatic: no command author action needed. The framework's serializer intercepts every string value before encoding it as JSON, passes it through the sanitizer, and optionally appends a warning entry for each replacement.
# Command returns Latin-1 bytes b"caf\xe9.txt"
→ serializer sanitizes → "caf\ufffd.txt"
→ valid UTF-8 JSON emitted to stdout
# Command returns string with null byte "hello\x00world"
→ serializer sanitizes → "hello\ufffdworld"
→ valid UTF-8 JSON emitted to stdout
# No UnicodeDecodeError on stderr in either case
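The optional warning entries could be produced by counting replacements per field. The sketch below is illustrative only: `sanitize_field` and `_message` are hypothetical names, the message wording is copied from the wire-format example, and the "lone surrogate" message is an assumption because the source shows no such warning.

```python
def _message(count, what, field):
    plural = "" if count == 1 else "s"
    return f"{count} {what}{plural} replaced with U+FFFD in field '{field}'"


def sanitize_field(name, value):
    """Return (clean_text, warnings) for a single field value (str or bytes)."""
    warnings = []
    if isinstance(value, (bytes, bytearray)):
        raw = bytes(value)
        text = raw.decode("utf-8", errors="replace")
        # Replacements added by the decoder: U+FFFD characters after decoding,
        # minus those that were already validly encoded in the input.
        invalid = text.count("\ufffd") - raw.count("\ufffd".encode("utf-8"))
        if invalid:
            warnings.append(_message(invalid, "invalid UTF-8 sequence", name))
    else:
        text = value
    nulls = text.count("\x00")
    if nulls:
        warnings.append(_message(nulls, "null byte", name))
    surrogates = sum(1 for ch in text if 0xD800 <= ord(ch) <= 0xDFFF)
    if surrogates:
        # Assumed wording; not shown in the wire-format example.
        warnings.append(_message(surrogates, "lone surrogate", name))
    clean = "".join(
        "\ufffd" if ch == "\x00" or 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )
    return clean, warnings
```

Applied to the two cases traced above, `sanitize_field("filename", b"caf\xe9.txt")` and `sanitize_field("content", "line one\x00line two")` yield the sanitized strings together with one warning each, in the same wording as the wire-format example.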
Related
| Requirement | Tier | Relationship |
|---|---|---|
| REQ-F-017 | F | Composes: binary fields are base64-encoded before sanitization runs |
| REQ-F-004 | F | Provides: response envelope that carries sanitized string values |
| REQ-F-005 | F | Composes: locale-invariant serialization works in tandem with UTF-8 sanitization |