Emerging AI-Native CLI Solutions & Patterns

The landscape of agent-tool interaction is fragmenting into at least three distinct layers: protocols (how agents and tools communicate), frameworks (how tool logic is authored and composed), and conventions (how knowledge about tools is packaged for agents). None of these layers has converged on a single standard. Instead, several competing and complementary patterns have emerged — some originating in LLM tooling research, some from the developer tools community, and some from the structured-data ecosystem predating AI entirely. Understanding each pattern on its own terms, and then comparing them across a common set of challenges, reveals which problems the field has solved, which it is actively working on, and which remain almost entirely unaddressed.

This document covers seven emerging patterns and evaluates each against 33 concrete agent-compatibility challenges.

1. LangChain / LangGraph Tool Pattern

What it is and how it works

LangChain is a Python and JavaScript framework for building LLM-powered applications. It provides a Tool abstraction that wraps any callable — a Python function, an API client, a shell command — with a name, a description, and optionally a JSON Schema for its arguments. LangGraph extends this with a graph-based execution model where nodes are either LLM calls or tool invocations, edges define control flow, and state is passed between nodes as a typed dictionary. Together they form an agent-first programming model: the application is a graph of actions, not a linear script.

In LangChain, tools are defined in one of two ways. The simpler form uses the @tool decorator on a Python function, extracting the schema from type annotations and the docstring as the description. The richer form uses the StructuredTool or BaseTool classes directly, providing explicit args_schema (a Pydantic model) and implementing _run / _arun for synchronous and async execution. Agents built with LangChain receive the list of available tools, their names, descriptions, and JSON schemas, and select among them at each reasoning step.

LangGraph's tool invocation model is more explicit: tool nodes are registered graph edges, tool results are stored in graph state, and the graph decides whether to continue (another LLM call, another tool invocation) or end based on the LLM's output. State is typed using Python's TypedDict or Pydantic BaseModel, making the data flowing through the graph schema-validated at each step.

How tools/commands are defined

from langchain.tools import tool
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(description="The search query string")
    max_results: int = Field(default=10, description="Maximum number of results to return")

@tool(args_schema=SearchInput)
def web_search(query: str, max_results: int = 10) -> str:
    """Search the web for information. Returns a summary of results."""
    ...

The tool name is derived from the function name (converted to snake_case). The description is the docstring. The schema is derived from the Pydantic model. All three are passed to the LLM at the start of each agent turn as part of the system or tool message. LangChain serializes this to whatever tool-call format the underlying LLM API expects (OpenAI function calling, Anthropic tool use, etc.).

In LangGraph, tools are additionally registered as nodes in the graph:

from langgraph.prebuilt import ToolNode

tools = [web_search, file_read, shell_exec]
tool_node = ToolNode(tools)
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", tool_node)
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", "end": END})
graph.add_edge("tools", "agent")

Key design decisions relevant to agent ergonomics

Pydantic-based schemas: LangChain's use of Pydantic means runtime validation is automatic. If an LLM constructs tool arguments that fail Pydantic validation, the error is caught before the underlying function runs — preventing side effects from malformed inputs. This directly addresses argument validation before side effects.

Docstring as description: The tool description (what the LLM reads to decide whether to use the tool) is the docstring. This conflates developer documentation with agent instructions, which can lead to descriptions optimized for human readers rather than LLM comprehension. The best LangChain tools rewrite their docstrings specifically for agent guidance: explaining when to use the tool, what inputs mean, and what the output format will be.

Async-first design: _arun is the preferred execution path, supporting non-blocking concurrent tool calls. LangGraph supports parallel tool execution natively in its ToolNode when the LLM returns multiple tool calls in a single response.

Tool error handling: LangChain's ToolException and handle_tool_error parameter let you configure whether a tool error is raised (crashing the agent) or returned as a string (allowing the agent to observe and recover from the error). The default behavior changed across versions, making this a common source of confusion.

State machine clarity (LangGraph): By making agent control flow a graph with explicit nodes and edges, LangGraph makes the agent's behavior auditable and testable. You can inspect the state at any node, replay from a checkpoint, and trace exactly which tools were called and in what order.

What it solves vs what it doesn't

Solves: Schema-based input validation, structured tool definitions, async execution, parallel tool calls, typed state propagation between steps, checkpoint/replay for observability.

Does not solve: The tools themselves may still behave as opaque subprocesses. A LangChain tool that shells out to grep has the same ANSI leakage, exit code ambiguity, and timeout problems as raw subprocess calls — LangChain only wraps the call site, it does not transform the underlying tool's behavior. Output format is whatever the Python function returns as a string; there is no output schema enforcement unless the developer adds it manually. Retry hints, idempotency signals, and partial failure semantics are entirely up to the tool author. LangGraph adds structure around the agent loop but cannot fix a tool that hangs indefinitely or writes to unexpected filesystem paths.

2. smolagents (HuggingFace)

What it is and how it works

smolagents (formerly transformers.agents) is HuggingFace's minimalist agent framework, designed around a philosophy of code-first tool use rather than JSON-function-call tool use. The core insight is that generating Python code is a more expressive and more verifiable form of tool invocation than generating JSON argument payloads. Rather than calling {"tool": "file_read", "args": {"path": "/tmp/foo.txt"}}, a smolagents-style agent writes:

content = file_read(path="/tmp/foo.txt")
print(content[:500])

This "code as action" approach allows the agent to express loops, conditionals, variable reuse, and multi-step computations in a single generation step — rather than requiring one LLM turn per tool call.

smolagents ships with two agent types: CodeAgent (generates and executes Python code in a sandboxed interpreter) and ToolCallingAgent (uses conventional JSON function-call tool use, for compatibility with OpenAI-style APIs). Both share the same tool definition interface.

How tools/commands are defined

Tools in smolagents are Python classes that inherit from Tool:

from smolagents import Tool

class WebSearchTool(Tool):
    name = "web_search"
    description = """
    Searches the web for information. Use this when you need current information
    or facts not in your training data. Returns a list of result snippets.
    Input: query (str) - the search query
    Output: str - formatted search results
    """
    inputs = {
        "query": {
            "type": "string",
            "description": "The search query"
        }
    }
    output_type = "string"

    def forward(self, query: str) -> str:
        results = search_api(query)
        return "\n".join(r.snippet for r in results[:5])

The inputs dict serves as the schema (a simplified JSON Schema-like structure, not full JSON Schema). output_type declares the return type. The description is agent-facing documentation. The forward method is the implementation.

The @tool decorator provides a shorter form, extracting the schema from type annotations and the docstring:

from smolagents import tool

@tool
def web_search(query: str) -> str:
    """
    Searches the web for information.
    Args:
        query: The search query string
    """
    ...

Key design decisions relevant to agent ergonomics

Code generation vs JSON generation: The fundamental claim of smolagents is that LLMs are better at generating Python than at generating correct nested JSON argument structures. Code generation allows the agent to express computed arguments (path = base_dir + "/" + filename), loop over results, and compose tools without needing the framework to provide a composition layer. This directly sidesteps the "no composition primitives" problem present in all JSON-call-based approaches.

Sandboxed execution: smolagents' CodeAgent runs generated code in a sandboxed Python environment. The sandbox (using RestrictedPython or an E2B cloud sandbox) limits which modules and builtins are available, preventing arbitrary filesystem writes or network calls that the agent's code might attempt. This is a meaningful security boundary absent from most other frameworks.

Variable state in code: In a code-first agent, the outputs of one tool call are Python variables accessible in subsequent calls within the same code block. This eliminates the need for the LLM to re-specify values it already computed, reducing both token consumption and hallucination risk.

Minimal schema: smolagents deliberately uses a simplified schema format rather than full JSON Schema, reducing the schema text the LLM must reason about. This is a token-cost tradeoff: less validation expressiveness in exchange for a smaller context footprint.

HuggingFace Hub integration: Tools can be published to and loaded from the HuggingFace Hub (load_tool("username/tool-name")), providing a discoverable, versioned catalog. This is the closest thing in the ecosystem to a tool registry with versioning semantics.

What it solves vs what it doesn't

Solves: Tool composition and piping (expressed as Python code), multi-step computation in a single LLM turn, sandboxed execution to limit filesystem side effects, variable state across tool calls, Hub-based versioned tool distribution.

Does not solve: The underlying tools still have all their original problems. A tool that calls a subprocess can still hang, leak child processes, or return ANSI codes — the sandbox limits what the generated code can do, not what each registered Tool.forward() can do. Output format is still a string unless the developer adds structure. Retry hints, idempotency, exit codes, and signal handling are not addressed by the framework. The code sandbox is effective at preventing the agent from writing arbitrary Python, but does not constrain the behavior of tool implementations themselves.

3. OpenAI / Anthropic Tool Use Schema

What it is and how it works

Both OpenAI and Anthropic converged on a similar design for structured tool calling in their APIs. The pattern is: at inference time, the caller supplies a list of tool definitions (name, description, input schema) alongside the conversation. The model may respond with a special message type — a tool call — specifying the tool name and a JSON object of arguments. The caller executes the tool, returns the result as a tool result message, and the model continues the conversation. This is a request/response protocol embedded in the chat message sequence.

The two APIs differ in terminology (OpenAI calls them "functions" or "tools"; Anthropic calls them "tools") and in minor schema details, but are structurally equivalent. Most agent frameworks — LangChain, smolagents' ToolCallingAgent, LlamaIndex, AutoGen, and others — target this common pattern as their lowest-level abstraction.

How tools/commands are defined

OpenAI format:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather for a location. Returns temperature, conditions, and humidity.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City and state, e.g. 'San Francisco, CA'"
        },
        "units": {
          "type": "string",
          "enum": ["celsius", "fahrenheit"],
          "description": "Temperature units"
        }
      },
      "required": ["location"]
    }
  }
}

Anthropic format:

{
  "name": "get_weather",
  "description": "Get the current weather for a location.",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": { "type": "string", "description": "City and state" },
      "units": { "type": "string", "enum": ["celsius", "fahrenheit"] }
    },
    "required": ["location"]
  }
}

Both use JSON Schema for the parameters / input_schema field. Neither defines an output schema at the protocol level; output is always a free-form string or JSON that the model treats as text.

Anthropic also defines extended tool types with fixed schemas for specific interaction categories — computer_use for screen interaction actions (screenshot, click, type, scroll) and text_editor for file operations — demonstrating that some tools are better served by predefined schemas than free-form JSON Schema.

Key design decisions relevant to agent ergonomics

JSON Schema for inputs: Using JSON Schema as the input contract is the most consequential design decision. It means any tool's argument structure is formally describable, validatable, and introspectable without running the tool. The model can generate arguments that are syntactically valid before any execution occurs.

No output schema (originally): Neither API originally defined an output schema. Tool results are opaque strings from the protocol's perspective. The model reads them as text. This means a tool that returns ANSI codes, binary garbage, or a 50,000-line log file has no protocol-level safeguard. OpenAI later added strict: true mode for structured outputs, but this applies to the final model response, not tool results. Anthropic added structured tool outputs in the API but this is not universally used.

parallel_tool_calls: OpenAI's parallel_tool_calls: true (default) and Anthropic's equivalent allow the model to emit multiple tool calls in a single response, enabling concurrent execution. This is a significant throughput improvement for tasks requiring multiple independent lookups.

Tool choice control: Both APIs provide a tool_choice parameter allowing the caller to force a specific tool invocation, prevent any tool use, or require at least one tool call. This gives the orchestrating application explicit control over agent tool use behavior.

No side-effect declarations in the protocol: Neither API provides a mechanism for declaring whether a tool is read-only, destructive, idempotent, or has external side effects. This information lives only in the description text, making it invisible to the framework. MCP's tool annotations (readOnlyHint, destructiveHint, idempotentHint) represent a step forward that the raw API protocols have not yet incorporated.

What it solves vs what it doesn't

Solves: Standardized schema format for tool inputs, model-level understanding of tool argument structure, parallel tool invocation, tool selection control, cross-platform compatibility (any framework can target these APIs).

Does not solve: Output format and parseability (results are opaque text), ANSI and encoding issues (no sanitization at the protocol level), error signaling (results are strings; the model must parse error text), idempotency and safety (no annotations), retry hints, composition, all process-level concerns (timeouts, signals, child processes), prompt injection via tool results, and schema versioning (the tool list is supplied fresh each call with no version negotiation).

4. AGENTS.md / CONTEXT.md Convention

What it is and how it works

AGENTS.md and CONTEXT.md are informal but converging conventions for shipping agent-specific documentation alongside a codebase or CLI tool. The idea is simple: just as a repository's README.md is for human readers and CONTRIBUTING.md is for contributors, AGENTS.md is documentation written specifically for AI agents that will operate in or on this repository.

AGENTS.md originated from the OpenAI Codex and ChatGPT "memory" ecosystem but has been more formally articulated in the context of Claude Code and similar agentic coding tools. Claude Code automatically loads CLAUDE.md (its variant of this convention) files from the repository root and parent directories when starting a session. The jpoehnelt agent-dx-cli-scale rubric explicitly scores this as Axis 7 (Agent Knowledge Packaging), treating it as a scored dimension of agent readiness.

CONTEXT.md is a similar convention, sometimes preferred for general-purpose agent context (project structure, key conventions, environment setup), while AGENTS.md is preferred for agent-specific behavioral instructions (what to do, what not to do, which tools to use for which tasks).

How tools/commands are defined

These files are not formal schema documents — they are markdown prose structured for LLM comprehension. A typical AGENTS.md might contain:

# Agent Instructions

## Commands
- Build: `npm run build`
- Test: `npm test`
- Lint: `npm run lint` (run before committing)

## Constraints
- Never commit directly to main; always create a branch
- Do not modify files in /vendor
- Use `--dry-run` before destructive operations

## Architecture
The project uses a monorepo structure. The CLI entrypoint is src/cli/index.ts.
Authentication tokens are stored in ~/.config/app/credentials.json.

## Known Gotchas
- The test suite requires a running local database (see docker-compose.yml)
- The `deploy` command has real side effects; there is no staging environment

The agent reads this at session start and uses it to calibrate its behavior for the specific repository. There is no programmatic schema — it is natural language, relying on the LLM's reading comprehension and instruction-following capability to extract structured guidance.

Key design decisions relevant to agent ergonomics

Zero infrastructure requirement: AGENTS.md requires nothing beyond a text file. Any repository can adopt it without changing build systems, adding dependencies, or modifying CLI source code. The barrier to entry is the lowest of any pattern in this review.

Agent-specific framing: Unlike README.md, AGENTS.md is explicitly written for a non-human reader. This means authors can use imperatives directly ("always run lint before committing"), assume the reader has no context, and include machine-process-relevant information (env var names, config file paths, exact command syntax) without needing to balance readability for humans.

Hierarchical loading: When AGENTS.md files exist at multiple directory levels (root, subdirectory, nested package), agents that support hierarchical loading (Claude Code's CLAUDE.md, for example) merge them, with more specific files overriding more general ones. This allows per-package customization in monorepos.

No versioning or schema validation: The content is freeform. An agent cannot programmatically validate that an AGENTS.md is well-formed, check for required sections, or detect when it is stale. Its effectiveness degrades as the project evolves and the file is not updated.

Complementary to, not replacing, machine-readable schemas: The most effective pattern combines AGENTS.md (human-readable behavioral guidance for the agent) with machine-readable schemas (JSON Schema, MCP tool definitions) for the actual tool interfaces. AGENTS.md fills the gaps that schemas cannot: explaining which tool to use when, what the project's conventions are, and what pitfalls exist in the specific environment.

What it solves vs what it doesn't

Solves: Project-specific behavioral guidance that cannot be encoded in a schema, environment and dependency documentation, known gotchas and constraints, working directory orientation, config file precedence documentation, security and side-effect warnings in natural language.

Does not solve: Any technical problem at the process level. An AGENTS.md that says "commands may timeout after 60 seconds" does not implement timeout enforcement. An AGENTS.md that says "use --no-color" does not strip ANSI codes from output. It is documentation, not implementation. Its effectiveness is bounded by the LLM's ability to follow instructions and remember them across a long session context.

5. OpenClaw Skill Standard

What it is and how it works

OpenClaw is a nascent standard for packaging agent knowledge about tools, CLIs, and APIs into structured skill files. It is referenced in the jpoehnelt agent-dx-cli-scale rubric as the top tier of Axis 7 (Agent Knowledge Packaging): "Comprehensive skill library with agent-specific guardrails; skills versioned, discoverable, follow a standard like OpenClaw."

The concept builds on the observation that AI agents need more than tool schemas — they need agent-specific behavioral guidance: when to use a tool, common failure modes, guardrails against misuse, and worked examples. JSON Schema tells you what arguments a tool accepts; a skill file tells you how an agent should use the tool safely and effectively.

In practice, the skill file format used by Claude Code (which OpenClaw formalizes or closely resembles) is a Markdown document with YAML frontmatter:

---
name: cli-evaluator
description: Evaluates a CLI for AI agent readiness using the agent-dx-cli-scale rubric
version: "1.0.0"
tools:
  - bash
  - read_file
triggers:
  - "evaluate this CLI"
  - "how agent-ready is"
  - "score this tool"
---

Followed by the skill body: natural language instructions, scoring rubrics, worked examples, and guardrails specific to agent consumption. The YAML frontmatter enables programmatic discovery — a tool registry can index skills by name, version, and trigger phrases. The Markdown body provides the content the agent loads into its context.

How tools/commands are defined

OpenClaw skill files do not define tools directly — they define knowledge about how to use tools. The skill file for a CLI tool might contain:

A description of what the CLI does and when to use it
The specific flags and subcommands relevant to agent use
Common error patterns and how to recover from them
Constraints ("never run db delete without --dry-run first")
Examples of correct and incorrect invocations
Environment requirements (what must be set up before the CLI will work)

The YAML frontmatter's tools field lists which underlying agent tools (bash, file_read, etc.) the skill requires — providing a lightweight dependency declaration.

Key design decisions relevant to agent ergonomics

Versioned, machine-discoverable knowledge: By adding YAML frontmatter to what would otherwise be a plain Markdown file, OpenClaw makes skill files indexable. An agent can ask "what skills do I have available for this task?" and get a structured answer. The version field allows skills to evolve without breaking agents that loaded an older version.

Trigger-based activation: The triggers field (phrases that suggest this skill should be loaded) enables lazy loading — the agent does not need to load all skills at session start, only those relevant to the current task. This is a direct response to the token cost of context loading.

Guardrails as first-class content: Unlike AGENTS.md (which can include guardrails as prose), OpenClaw skills are designed to carry agent-specific safety constraints as primary content. The distinction: AGENTS.md is project context; a skill file is behavioral programming for the agent when using a specific tool.

HuggingFace Hub parallel: The vision for OpenClaw is analogous to HuggingFace Hub for models — a public registry of skills that any agent can discover and load. This would create a community-maintained layer of agent knowledge on top of existing tools, without requiring those tools to be rewritten.

Formal standard vs. informal convention: AGENTS.md is a convention adopted by practice. OpenClaw aspires to be a formal, versioned standard with a published specification. As of early 2026, OpenClaw is more aspiration than implementation — the reference is used in the jpoehnelt rubric as a north star, but a complete published specification is not yet widely available.

What it solves vs what it doesn't

Solves: Versioned, discoverable agent knowledge packaging, trigger-based skill loading to manage token cost, explicit guardrails for agent safety, dependency declarations for skills, community-distributable agent expertise about tools.

Does not solve: Any technical problem in tool execution. Like AGENTS.md, OpenClaw is documentation — it can document solutions to problems (ANSI codes, timeouts, idempotency) but cannot implement them. Its effectiveness depends entirely on the agent following the instructions in the skill file, which is a probabilistic rather than deterministic guarantee.

6. Structured Output CLIs (jq, gron, miller, dasel)

What it is and how it works

Before AI agents became a first-class concern, a set of tools emerged to make the output of Unix commands more programmatically accessible. These tools operate on the output of other commands — transforming, filtering, and reshaping structured data formats (JSON, CSV, YAML, TOML, XML) — and together constitute an informal ecosystem of agent-friendly composition primitives.

jq: A domain-specific language and streaming processor for JSON. Accepts JSON on stdin, applies a filter expression, produces transformed JSON or text on stdout. The de facto standard for CLI JSON processing. jq '.items[] | select(.active) | .name' extracts names of active items from a JSON array.

gron: Transforms JSON into flat, grep-friendly text (json.items[0].name = "Alice") and back. Bridges the gap between line-oriented Unix tools and nested JSON structures. Designed for pipelines where you want to use grep, sed, or awk on JSON.

miller (mlr): A multi-format data processor supporting JSON, CSV, TSV, NDJSON, and others. Where jq is optimized for JSON transformation, miller treats data as records and supports SQL-like operations (filter, sort, join, stats) across all formats. Particularly useful for NDJSON (newline-delimited JSON) streams.

dasel: A single-binary data selector supporting JSON, YAML, TOML, NDJSON, and CSV with a unified selector syntax. Designed for scripting environments where you need to read or modify config files in multiple formats with a consistent interface.

How tools/commands are defined

These are conventional CLI tools with conventional CLI interfaces. They do not define a tool schema or agent interface layer — they are composable Unix programs. Their agent relevance comes from what they produce rather than how they are defined.

The agent ergonomic pattern is: pipe the output of a verbose, complex, or unstable CLI through one of these tools to extract only the signal needed for the next step.

# Instead of: gh pr list (returns a human-readable table)
# Use: gh pr list --json number,title,state | jq '.[] | select(.state=="OPEN") | {number, title}'

This pattern: 1. Requests JSON output from the original CLI (--json flag) 2. Pipes to jq to extract only relevant fields 3. Returns a minimal, structured payload to the agent

The agent sees only {"number": 42, "title": "Fix bug"} rather than a table with borders, truncated strings, and ANSI formatting.

Key design decisions relevant to agent ergonomics

Composability through pipes: These tools are designed for Unix pipe composition. An agent that understands this pattern can chain arbitrary sequences of transformations without needing the original CLI to support every output variation.

Field selection reduces token cost: jq's field selection ({name, status}) eliminates irrelevant fields before the output reaches the agent's context. This is one of the most effective techniques for managing verbosity: transform at the pipe boundary, not in the original tool.

NDJSON is the streaming primitive: miller and dasel's support for NDJSON (one JSON object per line) enables stream processing of large datasets. An agent can process large paginated results through a pipeline without loading everything into context at once if the pipeline is set up to stream.

Stable, predictable interfaces: jq, gron, miller, and dasel have stable, well-documented filter languages. An agent can reliably generate correct jq expressions for known data shapes. This is a form of implicit schema: if you know the input shape and the jq filter, you can predict the output shape exactly.

No execution semantics: These tools have no side effects (they are purely transformative), no authentication requirements, no state, and no interactivity. They are maximally safe for agent use.

What it solves vs what it doesn't

Solves: Output format normalization, field selection to control verbosity, ANSI stripping (jq output is clean JSON), format conversion (YAML to JSON, CSV to NDJSON), command composition through pipes, extraction of stable subsets from non-deterministic output.

Does not solve: Any problem upstream of the pipe. If the source CLI hangs, these tools cannot help. If the source CLI exits with a non-zero code and produces no output, there is nothing to pipe. They cannot add exit code semantics, idempotency, authentication, or schema discoverability to the CLIs they process. They are a downstream remedy, not an upstream fix.

7. MCP-Wrapped CLIs (Pattern of Wrapping Existing CLIs as MCP Servers)

What it is and how it works

As MCP (Model Context Protocol) has become the dominant standard for agent-tool integration, a pragmatic pattern has emerged: rather than rewriting existing CLIs to be agent-native, wrap them in thin MCP servers that translate between the MCP protocol and the underlying CLI invocation. The wrapped CLI gains all of MCP's agent ergonomics (structured JSON responses, schema discoverability, authentication isolation, session management) without any changes to the original tool.

In practice, this means building a small MCP server in Python or TypeScript that: 1. Declares MCP tools corresponding to the CLI's subcommands 2. Implements each tool's handler by invoking the CLI as a subprocess 3. Parses the CLI's output (stdout, stderr, exit code) and translates it to MCP's response format (isError, content array) 4. Provides JSON Schema for the tool's inputs by analyzing the CLI's --help or known argument structure 5. Strips ANSI codes, handles encoding, and manages timeouts in the wrapper layer

Examples of this pattern in production: mcp-server-git (wraps git), mcp-server-filesystem (wraps common file operations), mcp-server-github (wraps the GitHub CLI or API), mcp-server-docker (wraps docker commands). Each converts a domain of CLI commands into a set of typed MCP tools.

How tools/commands are defined

The MCP wrapper author defines tools manually, drawing on their knowledge of the underlying CLI:

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "git_commit",
      description: "Create a git commit with a message. Stages all tracked modified files.",
      inputSchema: {
        type: "object",
        properties: {
          message: { type: "string", description: "The commit message" },
          allow_empty: { type: "boolean", description: "Allow empty commit", default: false }
        },
        required: ["message"]
      },
      annotations: {
        readOnlyHint: false,
        destructiveHint: false,
        idempotentHint: false
      }
    }
  ]
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "git_commit") {
    const { message, allow_empty } = request.params.arguments;
    const args = ["commit", "-m", message];
    if (allow_empty) args.push("--allow-empty");

    const result = await execAsync(`git ${args.map(shellEscape).join(" ")}`);

    return {
      content: [{ type: "text", text: result.stdout }],
      isError: false
    };
  }
});

The wrapper author decides which CLI arguments become tool parameters, which are hardcoded, and how to handle errors. This is both the power and the risk of the pattern: the wrapper author has full control over what the agent sees.

Key design decisions relevant to agent ergonomics

Shell injection prevention: The wrapper layer is the natural place to implement proper argument escaping. Rather than the agent constructing a shell command string (injection risk), the wrapper receives typed arguments from JSON and constructs the command safely, using shell-quoting libraries (shlex in Python, shell-escape in Node).

Exit code to isError translation: The wrapper converts POSIX exit codes (which the MCP protocol does not use) to isError: true with a descriptive message. This translation layer is exactly the impedance matching that agents need — they get MCP's clean error model without requiring the original CLI to change.

ANSI stripping at the boundary: The wrapper can strip ANSI escape sequences from stdout/stderr before including them in the response content. This is the right architectural location for this transformation: the wrapper is the last layer before the agent's context.

Timeout enforcement: The wrapper can implement per-invocation timeouts using Promise.race (TypeScript) or asyncio.wait_for (Python), converting a hanging process into a structured error response with a timeout message.

Schema authoring burden: The primary cost of this pattern is that someone must write and maintain the MCP tool definitions by hand. The CLI's --help output is not machine-parseable to a JSON Schema in general. If the CLI adds new flags, the wrapper must be updated. If the CLI's behavior changes, the wrapper's error handling may need updating. This creates a maintenance burden proportional to the size and velocity of the wrapped CLI.

Coverage decisions: The wrapper author selects which subset of CLI functionality to expose. A git MCP wrapper might expose commit, push, pull, diff, and log but omit rebase, bisect, and reflog. This curatorial decision makes the wrapper simpler but potentially incomplete — the agent cannot access full CLI functionality through the MCP interface.

What it solves vs what it doesn't

Solves: Output format (translated to JSON), ANSI stripping, exit code semantics, schema discoverability, authentication isolation, shell injection prevention, timeout enforcement (if implemented), encoding safety (JSON output), session management (MCP protocol), platform portability (via MCP SDK).

Does not solve: The underlying CLI's behavior is not changed. Process-level concerns (child process leakage, signal handling, working directory sensitivity, config shadowing) remain the wrapper author's responsibility to handle. Schema versioning is still manual — if the wrapper's tool schema changes, clients have no automatic notification. Idempotency and atomicity of the underlying CLI operations are unchanged by wrapping. The wrapper pattern also inherits MCP's gaps: no retry hints, no composition primitives, no output non-determinism declarations.

The critical unsolved problem is completeness: a hand-written wrapper will always be a curated, incomplete view of the underlying CLI. For CLIs with 50+ subcommands and hundreds of flags, comprehensive wrapping is impractical without auto-generation tooling.

Combined Challenge Coverage Table

The table below rates each of the seven patterns across all 33 agent-compatibility challenges.

✓ = The pattern substantially addresses this challenge by design
~ = The pattern partially addresses it, or addresses it only under certain conditions / with extra effort
✗ = The pattern does not address this challenge

#	Challenge	LangChain / LangGraph	smolagents	OpenAI / Anthropic Schema	AGENTS.md / CONTEXT.md	OpenClaw Skills	Structured Output CLIs	MCP-Wrapped CLIs
1	Exit Codes & Status Signaling	~	~	✗	~	~	✗	✓
2	Output Format & Parseability	~	~	~	~	~	✓	✓
3	Stderr vs Stdout Discipline	✗	✗	✗	~	~	✗	✓
4	Verbosity & Token Cost	~	✓	✗	~	✓	✓	~
5	Pagination & Large Output	~	~	✗	~	~	✓	~
6	Command Composition & Piping	~	✓	✗	~	~	✓	✗
7	Output Non-Determinism	✗	✗	✗	~	~	✗	✗
8	ANSI & Color Code Leakage	✗	✗	✗	~	~	✓	✓
9	Binary & Encoding Safety	✗	✗	✗	~	~	~	✓
10	Interactivity & TTY Requirements	✗	✗	✗	~	~	✗	~
11	Timeouts & Hanging Processes	~	~	✗	~	~	✗	~
12	Idempotency & Safe Retries	✗	✗	✗	~	~	✗	~
13	Partial Failure & Atomicity	✗	✗	✗	~	~	✗	✗
14	Argument Validation Before Side Effects	✓	✓	✓	~	~	✗	✓
15	Race Conditions & Concurrency	~	~	~	✗	✗	✗	✗
16	Signal Handling & Graceful Cancellation	✗	✗	✗	~	~	✗	~
17	Child Process Leakage	✗	~	✗	✗	✗	✗	✗
18	Error Message Quality	✓	✓	~	~	~	✗	✓
19	Retry Hints in Error Responses	✗	✗	✗	~	~	✗	✗
20	Environment & Dependency Discovery	✗	✗	✗	✓	✓	✗	✗
21	Schema & Help Discoverability	✓	✓	✓	~	✓	✗	✓
22	Schema Versioning & Output Stability	✗	~	✗	✗	~	~	✗
23	Side Effects & Destructive Operations	~	~	✗	✓	✓	✗	~
24	Authentication & Secret Handling	~	~	~	✓	✓	✗	✓
25	Prompt Injection via Output	✗	~	✗	~	~	✗	~
26	Stateful Commands & Session Management	✓	~	~	~	~	✗	✓
27	Platform & Shell Portability	✓	✓	✓	~	~	~	✓
28	Config File Shadowing & Precedence	✗	✗	✗	✓	✓	✗	✗
29	Working Directory Sensitivity	✗	✗	✗	✓	✓	✗	✗
30	Undeclared Filesystem Side Effects	✗	~	✗	~	~	✗	✗
31	Network Proxy Unawareness	✗	✗	✗	✓	✓	✗	✗
32	Self-Update & Auto-Upgrade Behavior	✗	✗	✗	✓	✓	✗	✗
33	Observability & Audit Trail	~	~	✗	~	~	✗	~
✓ count		9	8	5	10	10	5	12
~ count		12	12	8	18	17	6	11
✗ count		12	13	20	5	6	22	10

Notes on the scores:

MCP-Wrapped CLIs score the most native ✓ (12) because the wrapper layer is the implementation site for most translation problems — ANSI stripping, exit-code conversion, argument validation, and session management can all be handled there.
AGENTS.md / CONTEXT.md and OpenClaw Skills both score 10 ✓ but accumulate the most ~ (18 and 17 respectively) because documentation can describe any problem's solution without implementing it. The scores reflect coverage of the challenge conceptually, not technically.
Structured Output CLIs score the fewest ✗ on the specific output challenges they address (2, 4, 5, 6, 8) but score ✗ on almost everything else — they are narrowly excellent and broadly irrelevant.
OpenAI / Anthropic Schema scores the fewest ✓ (5) because it is a thin protocol layer; all implementation is delegated to tool authors.
LangChain / LangGraph and smolagents occupy a middle tier: strong on schema, validation, and composition, weak on process-level and security concerns.

Key Insights

1. The layer problem is fundamental

No single pattern covers the full stack. Each pattern addresses one layer: - Protocol layer: MCP-Wrapped CLIs, OpenAI/Anthropic Schema - Framework layer: LangChain/LangGraph, smolagents - Documentation layer: AGENTS.md, OpenClaw Skills - Transformation layer: Structured Output CLIs

A production agent-compatible CLI framework must make deliberate decisions at all four layers simultaneously. Most current approaches are excellent at one layer and silent on the others.

2. The wrapper pattern is the pragmatic bridge

MCP-Wrapped CLIs solve the highest number of challenges natively because the wrapper layer is the natural place to implement translation: exit codes become isError, ANSI is stripped, arguments are validated before shell invocation, and timeouts are enforced. The insight for a new framework: build the wrapper layer in as a first-class concern, not an afterthought. If your CLI framework generates a thin MCP server alongside the CLI binary — derived from the same schema definition — almost all output and error handling problems are solvable without changing the underlying CLI logic.

3. Code generation (smolagents) solves composition; JSON schemas do not

The smolagents insight — that code generation is more expressive than JSON argument generation for multi-step tasks — directly addresses the composition problem that plagues all JSON-call-based approaches. A framework that supports both a JSON Schema interface (for simple, discrete tool calls) and a code-generation interface (for multi-step, conditional workflows) would cover both patterns. The key constraint is sandbox safety: code generation requires a sandboxed execution environment to be viable in production.

4. Documentation is necessary but not sufficient

AGENTS.md and OpenClaw Skills score 10 ✓ each — but almost entirely through "~" ratings because documentation can describe any problem's solution without actually implementing it. The right pattern: use AGENTS.md and skill files to document the non-automatable constraints (project conventions, environment requirements, known gotchas), while implementing technical constraints in code (validation, ANSI stripping, timeout enforcement). The failure mode to avoid is documenting a problem in AGENTS.md as a substitute for fixing it technically.

5. Schema-first design enables downstream tooling

The convergence of LangChain, smolagents, MCP, and the OpenAI/Anthropic APIs on JSON Schema as the input contract is not accidental. JSON Schema enables: client-side pre-validation (catching malformed arguments before execution), auto-generated documentation, auto-generated wrapper code, and formal verification. A new framework should treat JSON Schema as the primary artifact from which CLIs, MCP servers, documentation, and AGENTS.md files are all derived — not as an annotation added to an existing CLI.

6. Retry hints and idempotency are the biggest unaddressed gap

Across all seven patterns, retry hints (challenge 19) score ✗ universally, and idempotency (challenge 12) scores ✓ only in MCP via advisory annotations that are explicitly untrusted. This is the single highest-impact gap: when an agent encounters an error, it needs to know whether retrying is safe, what delay to apply, and whether to retry with the same arguments or different ones. None of the current patterns address this at the protocol or framework level. A new framework that adds a structured RetryPolicy to tool definitions — specifying which error conditions are retryable, with what backoff, up to what attempt limit — would fill a gap the entire ecosystem has left open.

7. Prompt injection via tool output is underaddressed and growing in severity

As agents operate in adversarial environments (reading files, fetching URLs, processing user-provided content), the risk that tool output contains embedded instructions designed to hijack agent behavior is increasing. Only the jpoehnelt rubric's Axis 6 level 3 explicitly addresses this (response sanitization), and even there it is advisory. A new framework should treat prompt injection as a first-class threat model: tool output that will be returned to the LLM context should be sanitized, sandboxed, or annotated with a trust level before reaching the model.

8. The 33 challenges cluster into five categories with different solution strategies

Analyzing the coverage table, the 33 challenges naturally cluster:

Protocol/format challenges (1–9): Best addressed by the MCP wrapper pattern and structured output CLIs — technical solutions at the I/O boundary.
Process-level challenges (10–17): Almost entirely unaddressed by all patterns; require OS-level enforcement (namespaces, cgroups, seccomp, process supervision).
Error quality challenges (18–19): Well addressed by LangChain/LangGraph and MCP wrappers; weakest in raw API schemas.
Knowledge/discoverability challenges (20–22, 27): Best addressed by AGENTS.md, OpenClaw, and MCP; poorly addressed by transformation-layer tools.
Safety/security challenges (23–26, 28–32): Addressed by a mix of documentation (AGENTS.md, OpenClaw), framework validation (LangChain's Pydantic), and wrapping (MCP annotations).

A new framework should have a distinct strategy for each cluster rather than a single unified solution that is mediocre across all five.

9. Observability is the missing infrastructure concern

Challenge 33 (Observability & Audit Trail) scores "~" at best across all patterns. For autonomous agents running in production environments, the ability to replay, audit, and debug a sequence of tool invocations is as important as the invocations themselves. LangGraph's checkpoint/replay mechanism is the closest to a real solution, but it is framework-internal and not portable across tool implementations. A new framework should treat structured, durable, replayable audit logs as a first-class output alongside the tool's primary output — each tool invocation emitting a trace event with inputs, outputs, duration, exit status, and a correlation ID.

10. The convergence direction is clear: schema-first, wrapper-mediated, code-native

The emerging best practice synthesizes: - Schema-first: define tools as JSON Schema + annotations from the start (not as an afterthought) - Wrapper-mediated: deploy tools as MCP servers (or equivalent) that translate between the protocol and the implementation, handling ANSI, exit codes, timeouts, and injection at the boundary - Code-native composition: support code-generation-based invocation (smolagents style) as a composition layer above atomic tool calls - Documentation layered on top: AGENTS.md and skill files for the non-automatable knowledge that schemas cannot encode - Transformation as escape valve: jq/miller/dasel in pipelines for output normalization when the source cannot be changed

A new agent-compatible CLI framework that implements all five of these layers — with JSON Schema as the source of truth, MCP as the transport, code-generation as the composition model, skill files for agent guidance, and structured output tooling for legacy interop — would cover approximately 28 of the 33 challenges natively. The remaining 5 (process leakage, race conditions, partial atomicity, output non-determinism declarations, and self-update containment) require OS-level enforcement outside the scope of any pure CLI framework.