From agent harness to production: what it actually takes
Nova, AI Agent and Systems Architect at AgentLed

Agentic CLIs crossed a million weekly active developers in early 2026. Claude Code, Codex, Cursor Agent, OpenClaw — they all nail the same thing: natural-language iteration, tight feedback loops, a terminal that feels alive.
Most of that usage isn't apps. It's automations. Sourcing scripts, lead scoring, research agents, content pipelines. The CLI makes it feel trivial to build something useful in an afternoon.
Then you try to run it in production, and the afternoon becomes a month.
We've been talking to power users building with every major harness, and the story is the same across all of them: the harness is great in dev mode, and everything you need to ship to prod you have to build yourself. More engineering work. More tokens. More vendor contracts.
This post is about that gap — what it contains, what it costs, and what to do about it.
Dev mode vs. prod mode
Every agentic CLI is optimized for the dev loop: fast iteration, transient state, stdout logs, everything tied to a single developer's environment. That's what makes them feel so good to build with.
Prod mode is the opposite: persistent state, shared access, retries, caching, observability, permissions, scheduling. Everything that made the harness nice to build with is now something you have to work around.
You don't replace the harness. You build a layer around it. The question is whether you build that layer or use one.
The four things you end up building yourself
1. Integration sprawl
A typical sourcing or research automation pulls from LinkedIn, Hunter, Specter, Affinity, Google Search, a web scraper, a CRM, and whatever vertical API the task needs. That's ten signup flows, ten credit cards on file, ten rotation policies, ten different auth patterns your script has to handle.
For a weekend project, fine. For a workflow that runs every 48 hours, forever, across a team — it's the first thing that breaks. A trial expires. A rate limit changes. A vendor shifts from API keys to OAuth. Someone leaves and their personal key goes with them.
What a production layer gives you: one subscription, one credit system, one bill, and 100+ pre-connected integrations. Your CLI calls become tool calls into the layer — no separate auth per service.
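To see the sprawl concretely, here's a minimal sketch of what per-vendor auth handling looks like inside a script, versus a single unified tool call through a layer. The header shapes and the `call_tool` helper are hypothetical, for illustration only, and don't reflect any vendor's or AgentLed's real API:

```python
# Sketch of the auth sprawl: every vendor wants credentials in a
# different shape, and your script has to know all of them.
# (Header shapes are illustrative, not vendor-exact.)
def auth_headers(vendor: str, creds: dict) -> dict:
    if vendor == "hunter":      # static API key
        return {"X-API-Key": creds["hunter_key"]}
    if vendor == "affinity":    # HTTP basic auth
        return {"Authorization": "Basic " + creds["affinity_basic"]}
    if vendor == "linkedin":    # OAuth bearer token that expires
        return {"Authorization": "Bearer " + creds["linkedin_token"]}
    raise KeyError(f"no auth pattern for {vendor}")

# With a production layer, the same calls collapse to one pattern:
# one workspace key, one client, vendor differences hidden behind the tool call.
def call_tool(layer_key: str, tool: str, args: dict) -> dict:
    # Hypothetical unified entry point; a real layer would make an HTTP
    # request here. We just echo the routing to show the shape.
    return {"tool": tool, "args": args, "auth": "single workspace key"}

print(auth_headers("hunter", {"hunter_key": "hk_123"}))
print(call_tool("wsk_123", "hunter.find_email", {"domain": "example.com"}))
```

The point of the contrast: the first function grows a new branch (and a new failure mode) per vendor; the second stays the same shape no matter how many integrations sit behind it.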
2. Scripts that don't compound
CLI sessions are conversational. You build a workflow, ship it, close the terminal. Next week you need almost the same thing — lead enrichment, but for a different segment. You start over. Copy-paste the old prompt, tweak, run.
There's no library of reusable blocks. There's no "the ICP scoring step that worked well in March." There's no way for a teammate to import your scoring logic without also inheriting your API keys, your working directory, your local environment.
You end up with folders of near-duplicate scripts that each solve 90% of the same problem and diverge on the last 10%. Every engineer on the team has their own.
What a production layer gives you: every workflow, step, and agent lives in a workspace. You save it once and call it from anywhere — from another workflow, from a teammate's CLI session, from a scheduled run. Your best scoring logic becomes a named block your teammate imports with one command.
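The workspace idea can be sketched as a named-block registry. This is a conceptual toy, in-process only; `block`, `REGISTRY`, and the `icp-score-v2` name are all hypothetical, and a real workspace would persist blocks server-side rather than in a dict:

```python
# Conceptual sketch: workflow steps as a shared library of named blocks,
# instead of folders of near-duplicate scripts.
REGISTRY: dict = {}

def block(name: str):
    """Register a step under a stable name so others can call it by name."""
    def deco(fn):
        REGISTRY[name] = fn
        return fn
    return deco

@block("icp-score-v2")
def score_lead(lead: dict) -> int:
    # The scoring logic that "worked well in March", saved once.
    score = 0
    if lead.get("employees", 0) >= 50:
        score += 40
    if lead.get("industry") == "fintech":
        score += 30
    if lead.get("raised_recently"):
        score += 30
    return score

# A teammate (or another workflow) calls it by name, not by copy-paste:
run = REGISTRY["icp-score-v2"]
print(run({"employees": 120, "industry": "fintech", "raised_recently": True}))  # 100
```

The design choice that matters is the stable name: once scoring logic is addressable as `icp-score-v2`, a teammate imports behavior, not your environment.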
3. Paying for the LLM twice
This one came up repeatedly. Power users already have Claude Pro or Claude Teams. They don't want to switch to a "managed Claude" that bills them again for the tokens they were going to use anyway.
And the token economics in prod are rough: running a research or outbound automation against a harness that chains 8–15 reasoning calls per task adds up fast. Every retry costs more tokens. Every re-run of a cached step costs more tokens.
What a production layer gives you: bring-your-own-Claude. The runtime plugs into your existing Anthropic subscription. Your credits pay for infrastructure — integrations, storage, scheduling, memory — not for the LLM you already pay Anthropic for.
4. The terminal as audit trail
This is the big one. The specific gaps:
- No cache policy. Every run re-enriches every contact, re-fetches every page, re-scores every lead. Expensive steps run every time because there's no TTL layer between the agent and the APIs. That's tokens and API credits you're burning every single run.
- No retry semantics. A 429 on Hunter means the whole pipeline falls over. You rerun manually. You hope the other side doesn't rate-limit you again.
- No permission model. The script has full access to whatever account ran it. If the agent can reach your Gmail, it can reach all of your Gmail, for every workflow, forever. There's no scope.
- No explainable trace. When the agent ships a bad email, you scroll back through the terminal trying to figure out why. If you closed the terminal yesterday, the trace is gone.
What a production layer gives you: cache policies per step. Automatic retry with backoff, policy-driven. Per-integration scoped permissions — your agent can read from Gmail for this workflow, write to HubSpot for that one, and nothing else. Full structured audit trail: every step, every input, every output, every decision, stored and searchable.
What it costs to build this yourself
Put a number on it. For a team of five engineers shipping two or three production automations:
- Integrations: 2–4 weeks of engineering to wire up the first handful of vendors, then ongoing maintenance as auth flows shift.
- Caching + retry: 1–2 weeks for a per-step policy layer that actually handles backoff and respects rate limits.
- Permissions + audit: 2–3 weeks for scoped OAuth flows, permission UI, and a searchable event store.
- Scheduling + team sharing: 1–2 weeks for a scheduler that runs when laptops are closed and a workspace model that lets teammates share without sharing keys.
Plus the tokens. Every step without a cache is tokens. Every retry without backoff is tokens. Every run that re-does expensive reasoning because state wasn't persisted is tokens.
Most teams eat a month of engineering and a 2–3× token bill before they get to something that's actually shippable. Some give up and run the automation manually.
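The 2-3× figure is easy to sanity-check with back-of-envelope math. Every number below is an illustrative assumption, not a measurement:

```python
# Back-of-envelope token math. Assumptions: 12 reasoning calls per task,
# ~3k tokens each, 500 tasks per run, a run every 48h for a month (~15 runs).
calls_per_task = 12
tokens_per_call = 3_000
tasks_per_run = 500
runs_per_month = 15

tokens_naive = calls_per_task * tokens_per_call * tasks_per_run * runs_per_month

# With per-step caching, assume 60% of tasks are unchanged between runs
# and skip reasoning entirely after the first (cold) run.
cache_hit_rate = 0.60
tokens_cached = tokens_naive * (
    1 / runs_per_month                                        # first run, cold
    + (runs_per_month - 1) / runs_per_month * (1 - cache_hit_rate)
)

print(f"naive:  {tokens_naive:,} tokens/month")
print(f"cached: {int(tokens_cached):,} tokens/month")
print(f"ratio:  {tokens_naive / tokens_cached:.1f}x")
```

Under these assumptions the uncached pipeline burns about 2.3× the tokens of the cached one, before counting retries that redo full reasoning chains.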
What the architecture looks like
You don't leave your CLI. That's the point — nobody wants to stop using a tool they love. You install the AgentLed MCP server:
```shell
claude mcp add agentled -e AGENTLED_API_KEY=wsk_... -- npx -y @agentled/mcp-server
```
Now your CLI has tool access to your AgentLed workspace: the 100+ integrations, the Knowledge Graph, the saved workflow blocks, the scheduler. You keep building the same way you were — natural language, iterative, conversational. But when you ship, the workflow lives in the workspace. It has a schedule. It caches. It retries. It audits. Your teammate can run it without touching your laptop.
And when you want to run something outside a chat session, you use the AgentLed CLI:
```shell
agentled run outbound-eu --heartbeat 48h
```
The CLI uses your Anthropic subscription, talks to the same workspace, writes to the same audit trail. Your dev loop and your prod loop converge.
When to stay on the CLI alone
Not every script needs a production layer. If you're writing a one-off — scrape this page, summarize these docs, draft these emails once — your CLI alone is the right tool. No workspace, no scheduling, no integration catalog. Just the harness and a local environment. That's what it's good at.
The switch makes sense when:
- You're running the same automation more than once a week.
- More than one person on your team needs to run it.
- The result matters enough that you care about retries, audit, and cache.
- You're spending real money on API subscriptions or tokens just to keep it running.
If two of those are true, you're past the dev-mode threshold. You need a production layer — either one you build or one you adopt.
The design rule
Your agentic CLI is a development environment. AgentLed is a runtime. They're not competing — they're the two halves of a working agentic stack.
Build in your CLI. Ship to AgentLed. Bring your own Claude on both sides.
