AI Engineering

Why Agents Need CLI

A practitioner's take on the tool-interface problem, and why I think the industry picked the wrong default.

Typed tool catalogs over MCP are the default way to give agents capabilities, and I think that default is wrong. Not wrong in the sense of "suboptimal in some cases." Wrong in the sense of structurally expensive, architecturally brittle, and solving a problem the model already knows how to solve itself if you stop getting in its way.

The thesis

The fix is to stop inventing new tool interfaces and hand the model the one it's already seen a billion times in training: Unix-style CLI conventions. A single run(command="...") tool with command dispatch behind it outperforms a catalog of typed function calls on every axis I care about — context cost, composition, discoverability, error recovery, auditability.
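To pin down the shape I mean, here's a minimal sketch of the single-tool pattern, assuming a generic function-calling API. The schema fields follow common function-calling conventions but vary by provider, and every name here is illustrative:

```python
import subprocess

# One stable tool definition replaces the entire catalog.
RUN_TOOL = {
    "name": "run",
    "description": (
        "Execute a shell command in the sandbox. Pipes, &&, ||, and ; work. "
        "Run any command with --help to see its usage."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

def run(command: str, timeout: int = 60) -> str:
    """Execute the command and return its output as the tool result."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    output = result.stdout
    if result.stderr:
        output += "\n[stderr] " + result.stderr.strip()
    return output + f"\n[exit:{result.returncode}]"
```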

I'm not the first person to arrive here, and the list of people arriving independently keeps getting longer. A former Manus backend lead posted a long argument for this in November 2025 and it blew up on r/LocalLLaMA. Anthropic's engineering team published Code Execution with MCP within the same window, reaching the same diagnosis from the research side. Simon Willison wrote he'd already abandoned MCP for coding agents. At Perplexity's Ask 2026 conference in March, co-founder and CTO Denis Yarats announced Perplexity is moving away from MCP internally, citing a 72% context window consumption problem and replacing it with direct REST APIs and CLI. Cloudflare's Code Mode reported a 244x token reduction on the same pattern. Y Combinator's Garry Tan built his agent integration as a CLI rather than MCP for reliability and speed. Google deleted MCP from their Workspace CLI (gws) the day after shipping it, because 200-400 Workspace API tools burned 40-100K tokens of context before the agent read a user message. And in March 2026, HKU's Data Intelligence Lab dropped CLI-Anything, a pipeline that auto-generates agent-navigable CLIs for GUI-only software — GIMP, Blender, Audacity, LibreOffice — with 1,508 passing tests across 11 applications in its first release.

Eight independent actors, same conclusion, reached without coordinating. That's the kind of convergence I pay attention to.

This is my synthesis of what they got right, what they're missing, and what I'd tell anyone building an agent system in 2026.

Why the typed-tool-catalog pattern is broken

Two failure modes, both measurable, both structural.

Context tax before the conversation starts. Every MCP tool definition sits in the system prompt. Anthropic put a number on it: a five-server setup with 58 tools burns roughly 55K tokens before the user types anything (Advanced Tool Use). Perplexity put a bigger number on it: in production, tool schemas and protocol overhead consumed up to 72% of available context before any user intent reached the model. Google hit the wall hardest and loudest: their Workspace CLI (gws) exposed 200-400 Workspace API tools over MCP, costing an estimated 40-100K tokens in definitions alone, and they deleted MCP the day after shipping it. A "compact mode" that collapsed the surface to ~26 tools wasn't enough; MCP came back in v0.9.x as an optional layer only where it earns its cost. Cloudflare put the biggest number on it: exposing 2,500 API endpoints as individual MCP tools takes roughly 244,000 tokens to describe, versus ~1,000 tokens to surface the same functionality via code generation against a pre-authorized client. A 244x reduction.

You paid for that context. You can't use it for anything else.

Round-trip tax on intermediate results. Every tool response passes through the model whether the model needs to reason about it or not. Pull a transcript from Drive, push it to Salesforce — the full transcript traverses the context twice. Workflows have gone from 150K tokens to ~2K by moving to code-execution patterns. A 98% reduction at that scale measures the size of the hole you were pouring money into.

Neither of these is a bug in any particular MCP server. They're properties of the pattern. You get them for free the moment you commit to "upfront tool definitions + model-mediated result passing."

The reflex fix — "just load tools dynamically" — invalidates the KV-cache prefix. Every time you add or remove a tool mid-conversation, the system-prompt region changes and everything after it has to be recomputed on the next turn. A single stable run() tool sidesteps this entirely, which is a subtle cost advantage I don't see people talk about enough.

Yarats' third point is the one that took me longest to appreciate: most MCP features are simply never used in production. Discoverability is a development-time feature, not a runtime requirement. Once you know which tools your production agent needs, the dynamic-enumeration machinery is pure overhead you're paying for something you stopped doing.

Why CLI beats MCP specifically

Three reasons, in order of importance.

1. The model already speaks it. GitHub is saturated with shell — install scripts, CI pipelines, Stack Overflow answers, README snippets, Dockerfiles. No agent framework has to teach a frontier model what grep -c ERROR app.log does, because the model has seen that exact shape thousands of times. Compare to a bespoke JSON schema the model is seeing for the first time at inference. One of these is free, the other isn't.

2. Composition is exponential, not linear. A catalog exposing 15 typed tools gives you 15 actions. The same 15 commands with |, &&, ||, and ; give you a composition space that grows combinatorially. The agent writes curl -sL $URL | jq '.items[]' | head -n 5 in a single tool call where a typed framework would need three sequential round-trips through the model. Every round-trip you eliminate is a full inference pass you don't pay for. (The sketch after this list makes the round-trip arithmetic concrete.)

3. --help is a zero-cost discovery protocol. Tools document themselves. The agent pulls the documentation into context only when it needs it, and only for the command it's actually using. Anthropic reinvented this for MCP via filesystem-based tool modules the agent navigates on demand — which is a fine solution, but it's solving a problem CLI conventions already solved fifty years ago. Simon Willison put it bluntly: "I don't use MCP at all any more when working with coding agents" (simonwillison.net).
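To make the round-trip arithmetic in point 2 concrete, here's a hypothetical trace of the same task under both patterns — the typed tool names are invented for illustration:

```python
# Typed catalog: three sequential tool calls, each intermediate result
# passing through the model's context (and billed as input tokens).
typed_round_trips = [
    {"tool": "http_get",   "args": {"url": "$URL"}},       # full response body -> context
    {"tool": "json_query", "args": {"expr": ".items[]"}},  # full item list -> context
    {"tool": "take_first", "args": {"n": 5}},              # final answer
]

# CLI composition: one call. Intermediate data flows through the pipe,
# never through the model; only the final five items enter the context.
single_call = {
    "tool": "run",
    "args": {"command": "curl -sL $URL | jq '.items[]' | head -n 5"},
}
```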

Zoom out and the pattern is clear: MCP is paying token and engineering costs to invent discoverability the model already understands. CLI stops paying those costs.

The security reality check

Here's where I want to push back on the dominant framing. The MCP-vs-CLI debate has been positioned in part as a security argument, and I want to shut that framing down before it spreads further.

Switching from MCP to CLI does not reduce your attack surface. It changes the protocol. It does not change the threat model.

Repello's AI research team made this point in their coverage of Perplexity's move and it's the clearest statement of the issue I've seen: "The question is not which protocol the tool call travels over. The question is whether the content that comes back from the tool gets inspected before it reaches the model. That question has the same answer regardless of whether you're running MCP, CLI, or direct REST."

Break this down concretely: a prompt-injection payload in a tool response reaches the model as the same tokens whether it arrived over MCP, a CLI pipe, or a direct REST call. The transport changed; the content the model reads did not. If nothing inspects tool output before it enters the context, every path is equally exposed.

The one genuine security improvement in this space isn't the switch to CLI at all — it's Cloudflare's pre-authorized client pattern in Code Mode. When the model writes code that runs against an already-authenticated client, the model never sees the raw credentials, which means prompt injection into the model's context can't extract them. That's a real reduction in attack surface, and it comes from keeping credentials out of the model's reachable state, not from the choice of transport.

Practical implication: your threat model has to cover every tool-integration path, not just the protocol layer. Content inspection happens at the model-context boundary, not at the transport boundary. If you're running both MCP and CLI agents, your red-team test plan has to cover tool-response injection across all of them equally.

Do not let anyone sell you "we migrated from MCP to CLI, so we fixed our injection problem." They didn't. They moved it.

Best practices for building CLI-for-agents

This is where most of the engineering lives. The surface pattern — "expose tools as commands" — is easy. Doing it well is not. Here's what I'd tell someone starting today, ordered roughly by how much pain each one saves you.

1. Separate execution from presentation. Always.

This is the single most important architectural decision and the one people get wrong first. Inside a pipe chain, outputs have to stay raw, lossless, metadata-free. If you truncate cat's output mid-pipeline, grep searches the wrong data. If you inject [exit:0] into stdout, the next stage treats it as a search target. Pipes break, silently, and you spend a day wondering why.

Only after the entire chain completes does the presentation layer wrap the final result for the model: binary guards, truncation-with-overflow, metadata footer, stderr attachment. Two layers. Do not conflate them. This isn't a style preference, it's a correctness requirement.
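A sketch of the two-layer split, assuming a subprocess-based executor. The binary guard and truncation budget are placeholder choices, not recommendations:

```python
import subprocess

MAX_CHARS = 8_000  # presentation budget; tune for your model (assumption)

def execute(pipeline: str) -> subprocess.CompletedProcess:
    """Execution layer: run the whole pipe chain on raw, lossless streams.
    No truncation, no injected metadata — nothing a downstream pipe could
    mistake for data."""
    return subprocess.run(pipeline, shell=True, capture_output=True, text=True)

def present(result: subprocess.CompletedProcess, elapsed_ms: int) -> str:
    """Presentation layer: wraps only the *final* result for the model."""
    out = result.stdout
    if "\x00" in out:  # crude binary guard, purely illustrative
        out = "[error] binary output suppressed. Use a viewer command instead."
    elif len(out) > MAX_CHARS:
        omitted = len(out) - MAX_CHARS
        out = out[:MAX_CHARS] + f"\n[... truncated, {omitted} chars omitted]"
    if result.stderr.strip():
        out += "\n[stderr] " + result.stderr.strip()
    return out + f"\n[exit:{result.returncode} | {elapsed_ms}ms]"
```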

2. Every error message is a navigation hint. Treat it that way.

Errors are the highest-leverage surface in your whole system. An agent that hits [error] binary file retries with garbage. An agent that hits [error] binary image (182KB). Use: see photo.png corrects in one step. Across a 50-turn conversation the token savings dominate everything else you optimize.

Call this tips thinking and bake it into every command: what went wrong, plus what to do instead. Both halves. Never just the first half.

And while we're here: never drop stderr. The worst bug I've read about in this space is MorroHsu's silent-stderr incident — his code dropped stderr whenever stdout was non-empty, the agent hit pip: command not found, couldn't see why, and ground through ten package-manager variations over fifty seconds of inference before stumbling on uv run. One visible stderr line would have collapsed that to one call. Audit your framework for this pattern before you ship.
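Two small helpers make both rules mechanical. The function names are mine; the message format just mirrors the examples above:

```python
def error_with_tip(what_went_wrong: str, what_to_do: str) -> str:
    """Tips thinking as a function: diagnosis plus next step, always both."""
    return f"[error] {what_went_wrong}. Use: {what_to_do}"

# error_with_tip("binary image (182KB)", "see photo.png")
# -> "[error] binary image (182KB). Use: see photo.png"

def wrap_result(stdout: str, stderr: str, exit_code: int) -> str:
    """Never gate stderr on stdout being non-empty — that conditional is
    exactly the silent-stderr bug described above."""
    parts = [stdout] if stdout else []
    if stderr.strip():
        parts.append(f"[stderr] {stderr.strip()}")
    parts.append(f"[exit:{exit_code}]")
    return "\n".join(parts)
```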

3. Pass content via stdin, not through the command string.

The biggest engineering tax of CLI-for-agents is double-escaping. The LLM emits JSON, the JSON field contains a shell command, the shell command contains quoted content, and now you're escaping quotes at two layers. Nightmare.

Fix it at the tool interface. Give run() a separate stdin parameter and document it. Content escapes once (the JSON layer) instead of twice. This one change eliminated ~90% of escaping issues in MorroHsu's production system and I believe that number because it matches what I've seen elsewhere.
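Extending the earlier run() sketch with a stdin parameter — subprocess.run's input= argument does the delivery, so content never touches the shell's quoting rules:

```python
import subprocess

def run(command: str, stdin: str | None = None, timeout: int = 60) -> str:
    """stdin arrives as its own JSON field, so content is escaped exactly
    once (by JSON) and never re-quoted for the shell."""
    result = subprocess.run(
        command, shell=True, input=stdin,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout

# Without stdin, quotes escape twice (JSON layer, then shell layer):
#   {"command": "echo \"He said \\\"hi\\\"\" > notes.md"}
# With stdin, once:
#   {"command": "cat > notes.md", "stdin": "He said \"hi\""}
```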

4. Dangerous operations self-gate. Don't wrap the agent in an approval layer.

Put the risk gate inside each command, not outside the agent. dns update returns a dry-run preview by default and requires --confirm to actually execute. pay --amount 500 blocks on a push notification to the human's device, with full context in the notification, and returns the outcome as part of the same tool call. transfer --amount 10000 requires an OTP the agent has to ask the human for, and the error message tells the agent why so it can relay the reason.

Per-command risk gates are more fine-grained than any global policy framework, and they compose naturally with the CLI model. The command knows its own risk level. The command enforces it. The agent doesn't need to be trusted — it needs to be bounded.
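A sketch of a self-gating command in this style — argparse here, though Click works the same way; the command name, flags, and backend stub are all illustrative:

```python
import argparse

def apply_dns_change(record: str, value: str) -> None:
    ...  # stand-in for your real backend

def dns_update(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(prog="dns update")
    parser.add_argument("record")
    parser.add_argument("value")
    parser.add_argument("--confirm", action="store_true",
                        help="apply the change (default is a dry-run preview)")
    args = parser.parse_args(argv)

    if not args.confirm:
        # The gate lives inside the command, and the refusal is a navigation
        # hint: it tells the agent exactly how to proceed.
        print(f"[dry-run] would set {args.record} -> {args.value}")
        print("Re-run with --confirm to apply.")
        return 0

    apply_dns_change(args.record, args.value)
    print(f"updated {args.record} -> {args.value}")
    return 0
```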

5. Large outputs go to files, summaries come back.

When a command produces more than ~200 lines or ~50KB, truncate to the first 200 lines, write the full result to /tmp/cmd-output/cmd-N.txt, and return the head plus a pointer. Include suggested follow-up commands (grep, tail). The agent already knows how to navigate a file. You're not building a custom pagination API; you're giving it a map.
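A sketch of the spill-to-file wrapper, using the thresholds and path scheme from the paragraph above:

```python
import itertools
import os

OUTPUT_DIR = "/tmp/cmd-output"
HEAD_LINES = 200
MAX_BYTES = 50_000
_counter = itertools.count(1)

def spill_large_output(text: str) -> str:
    """Return output unchanged when small; otherwise write the full result
    to a file and return the head plus a pointer and follow-up suggestions."""
    lines = text.splitlines()
    if len(lines) <= HEAD_LINES and len(text.encode()) <= MAX_BYTES:
        return text
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    path = os.path.join(OUTPUT_DIR, f"cmd-{next(_counter)}.txt")
    with open(path, "w") as f:
        f.write(text)
    head = "\n".join(lines[:HEAD_LINES])
    return (f"{head}\n"
            f"[truncated: {len(lines)} lines total, full output at {path}]\n"
            f"[tip] try: grep PATTERN {path}  or  tail -n +{HEAD_LINES + 1} {path}")
```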

This is the CLI-native version of what Anthropic does with code execution: large datasets remain inside the execution environment, the heavy lifting happens in code, and the model only sees the summary (Code Execution with MCP). Same philosophy, different mechanism.

6. Append exit code and duration to every result.

[exit:0 | 12ms] at the end of every tool result. After the agent sees this pattern a few dozen times it internalizes both axes: exit code tells it whether to check for errors, duration tells it what's cheap to call freely versus what to use sparingly. Consistent metadata makes the agent smarter across turns. Inconsistent metadata makes every call feel like the first.
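The formatter itself is trivial; the value is in applying it without exception. The example durations below are invented, just to show the two axes:

```python
def footer(exit_code: int, duration_ms: int) -> str:
    """Appended to every tool result, no exceptions."""
    return f"[exit:{exit_code} | {duration_ms}ms]"

# After a few dozen turns the agent has internalized both axes:
#   ls /tmp        ... [exit:0 | 4ms]      -> cheap, call freely
#   pytest -q      ... [exit:1 | 41250ms]  -> expensive, and it failed: read the output
```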

7. Design by desire path, not by spec.

Build a minimal CLI, watch the agent use it, and pave the paths it naturally tries to walk. I stole this framing from landscape architecture and it works better than any amount of upfront design. The agent will try to invoke commands you didn't build; those attempts are free telemetry on what your API surface should look like. Iterate on the real traces, not on your assumptions.
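One way to harvest that telemetry, assuming a dispatch-table design — the log path and record format are arbitrary choices of mine:

```python
import json
import time

TELEMETRY_LOG = "/tmp/agent-unknown-commands.jsonl"  # illustrative path

def dispatch(name: str, argv: list[str], registry: dict) -> str:
    """Unknown commands are feature requests. Log them, answer with a hint,
    and review the log when deciding what to build next."""
    if name not in registry:
        with open(TELEMETRY_LOG, "a") as f:
            f.write(json.dumps({"ts": time.time(), "cmd": name,
                                "argv": argv}) + "\n")
        return (f"[error] unknown command '{name}'. "
                f"Available: {', '.join(sorted(registry))}")
    return registry[name](argv)
```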

8. Your CLI surface is your agent's capability boundary.

The commands you expose define the action space. What's not a command is not a capability. Combine this with sandbox isolation for anything that touches the OS, allowlists for network egress, per-command self-gating from practice #4, and content inspection at the model context layer, and you get a layered model where the worst case at each layer is something you've explicitly accepted.

You're not hoping the agent won't make mistakes. You're designing a boundary, confirming the worst case inside the boundary is acceptable, and letting the agent act freely within it. Injection defense lives at the content-inspection layer; privilege containment lives at the capability-surface layer. Different problems, different layers, both required.
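A sketch of the capability-surface check under an allowlist approach. The separator handling here is deliberately crude; a real implementation would parse the command line properly or lean on the sandbox instead:

```python
import shlex

ALLOWED = {"ls", "cat", "grep", "head", "tail", "jq", "curl"}  # illustrative

def check_capability_boundary(pipeline: str) -> None:
    """Reject any pipeline stage whose executable is off the allowlist.
    What's not a command is not a capability."""
    for stage in pipeline.replace("&&", "|").replace("||", "|").split("|"):
        tokens = shlex.split(stage.strip())
        if tokens and tokens[0] not in ALLOWED:
            raise PermissionError(
                f"[error] '{tokens[0]}' is outside the capability boundary. "
                f"Available: {', '.join(sorted(ALLOWED))}"
            )
```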

Where I wouldn't use CLI

I'm not claiming this is universal. Three cases where I'd reach for something else:

Dynamic tool discovery in local environments. This is what MCP was actually designed for and where it's genuinely the right answer: Claude Desktop, Cursor, VS Code integrations, IDE tooling, anywhere the agent legitimately needs to enumerate capabilities at runtime that weren't known at design time. Yarats' own framing makes this explicit — MCP's discoverability is a feature you want at development time and stop needing at production scale. If you're building the development-time use case, MCP is fine.

Data-heavy transformation workflows. This is where Anthropic's Programmatic Tool Calling wins — Claude for Excel reads thousands of spreadsheet rows without drowning the context, because the work happens in a sandboxed TypeScript runtime and only the result comes back (Advanced Tool Use). Cloudflare's Code Mode is the same idea: write code against a pre-authorized client, get 244x token efficiency and credential isolation in one move. CLI pipes can express the same thing but code-execution is more ergonomic when the transformation logic is nontrivial.

Persistent stateful sessions. Browser automation, long-lived REPLs, anything where state has to survive across calls. Text-stream pipes are a bottleneck; a protocol with persistent handles is a better fit.

Read this way, CLI, code-execution, and MCP are the right answers to different questions. Shell for orchestration in production, code for data-heavy transformation, MCP for local dynamic discovery. The enemy isn't any single protocol. The enemy is interfaces the model is seeing for the first time, paying token cost for features you don't use, and security theater that mistakes transport choice for threat-model work.

The infrastructure layer is catching up

One objection I used to hear: "CLI-for-agents is great if your software already has a CLI. What about GIMP, Blender, LibreOffice, every GUI-only tool my team depends on?" That objection is evaporating. HKU's Data Intelligence Lab shipped CLI-Anything in March 2026 — a 7-phase pipeline that reads a codebase, maps functionality, architects a command structure, generates a Click-based CLI with REPL support, writes the tests, and publishes to PATH. First release covers GIMP, Blender, Inkscape, Audacity, LibreOffice, OBS, Kdenlive, Shotcut, Zoom, Draw.io, and AnyGen, with all 1,508 tests passing.

Read this as a signal, not a product endorsement. The gold for anyone building internal tooling is the category of work this unlocks: design software agents couldn't touch, video pipelines that broke every time someone clicked a setting, internal tools where humans were the bottleneck between the agent and the actual work. Point a generator at the repo, get a structured capability surface. If your product ships in 2026 and an agent can't operate it from the command line, you're building for a workflow that's already thinning out. CLI-Anything is the bridge for open-source software; the same pipeline applied to internal codebases is a weekend project for most teams.

What I'm still figuring out

Tool discovery at scale. --help solves "the agent knows a tool exists and wants to learn how to use it." It doesn't solve "the agent needs a capability it doesn't know is installed." Cloudflare's Code Mode and Anthropic's Tool Search Tool both attack this with search-backed discovery. MorroHsu floats a cli-search command as the CLI-native equivalent. I don't think anyone has shipped a fully satisfying answer. This is probably the next real problem.

Multimodal I/O. Text-stream pipes handle text. Images, audio, binary artifacts need out-of-band handling — MorroHsu's see photo.png pattern auto-attaches to vision, which works, but the boundaries between "text command" and "binary payload" are awkward and I haven't seen a clean general solution.

Training on CLI traces. Every production agent session naturally generates (task, command, output) triples. Targeted fine-tuning on this data could compound the CLI advantage for smaller models — behavior improves specifically in the shape of its actual environment. I'm suspicious of claims without numbers, but the setup is obvious enough that someone will publish soon.

The one-line version

Stop inventing new tool interfaces. Hand the model the one it already knows. Secure the content it sees, not the protocol it travels over.

Everything else in this document is implementation detail in service of those three claims.

If you read this far, you'd probably like the newsletter. One analysis per week. Free.