How Agents Burn Through Runway, and How to Stop Them

TL;DR. A modern agentic loop (planner, three parallel research agents, critic, executor, verifier) consumes 4 to 8 million input tokens and 250 to 500 thousand output tokens per active hour of dev work. At Sonnet 4.5 list rates, that is roughly $15 to $25 per developer per hour, or $120 to $200 per active eight-hour day. Multiply by your team. The bottleneck on agentic AI is not capability. It is the willingness to let the loop retry.

The 3am story

Every founder I have worked with this year has a version of it. The numbers vary, the story does not:

I left the agent running overnight on a refactor. By morning it had retried itself 220 times on the same failing test, accumulating 14M output tokens before its budget cap kicked in. The bill was eight hundred dollars. The bug is still there.

The behavioral failure mode is the same: the agent did not know how to stop, and the cost compounded silently while the human slept. The most-published advice on this is wrong. You cannot solve it with a bigger context window or a smarter model. You solve it with hard caps, prompt caching, fan-out budgets, and (if you are honest about the math) moving a meaningful percentage of your inference to hardware you own.

Current pricing snapshot

May 2026 list rates:

Model	Input $/M	Output $/M
Claude Sonnet 4.5	$3.00	$15.00
Claude Haiku 4.5	$1.00	$5.00
Claude Opus 4.x	$15.00	$75.00
GPT-5	$2.50	$10.00
GPT-5 mini	$0.25	$1.00
Gemini 2.5 Flash	$0.30	$2.50
DeepSeek V4 Flash	$0.14	$0.28
DeepSeek V4 Pro	$1.74	$3.48

DeepSeek V4 Flash at $0.14/$0.28 resets the spreadsheet. Fan-out architectures that were a budget meeting on Sonnet are now the cost of a coffee. Public weights, MIT license, shipped April 24.

The loop math

A research-heavy agentic dev session from last week:

Loop: planner, 3 parallel research agents, critic, executor, verifier
Tokens per loop: 180,000 input, 12,000 output
Loops per hour during active dev: 8 to 12
Sustained session: 4 hours

Total: 6.5M input plus 432K output per session.

Model	Per session
Sonnet 4.5	$26.00
GPT-5	$20.65
Gemini 2.5 Flash	$3.04
DeepSeek V4 Flash	$1.03
Local (~65W × 4hr)	$0.034

At $26 per developer per session you behave differently. You trim retries. You shorten the critic's argument with the executor. You stop running parallel agents because the fan-out fee is real. The cost is the budget meeting you are having with yourself in the moment, and the budget meeting kills the loop's quality.

At three cents you do none of those things. You let the agent retry. You let the critic argue four times. You leave loops running through the night.

Where the money accumulates

Four cost vectors compound, ranked by how often they kill budgets:

1. Prompt re-emission. Every turn re-bills the entire context: system prompt, tool catalog, history, retrieved docs. A 50-tool MCP catalog is roughly 50K tokens of context overhead per turn. At 80 turns, that is 4M tokens of just tool descriptions. Pruning the tool catalog to what the current task needs is the single highest-leverage cut available.

2. Retry storms. A failing assertion sends the loop back to the planner. Each retry burns the full prompt again. Set max_iter to 5. Hard ceiling 10. Past that, escalate to a human.

3. Long-context drift. Production-tuned evals show that even frontier models drift past about 40K tokens (Jeff Huber, Chroma). The bigger context window is not a budget solution. It is a quality liability. Compact aggressively, summarize history, evict.

4. Untyped tool output. A tool returns 50KB of JSON. The model needs three fields. You just paid for 50KB of input on the next turn. Postprocess tool output in deterministic code before it hits the model. This pattern saves more than prompt caching and nobody talks about it.

Controls that work

In rough order of leverage:

Prompt caching. Anthropic's prompt caching (5-min and 1-hour TTLs) gives 90% cost reduction on cached prefixes. Cached input on Sonnet is $0.30/M instead of $3/M. Architecture trick: keep system prompt + tool catalog + static instructions at the front. Vary dynamic content at the end.

Tiered models. Route by complexity. Planner runs on Sonnet 4.5. The three parallel researchers run on Haiku 4.5 or DeepSeek V4 Flash. Save the expensive model for the turn where intelligence actually matters.

Per-loop budget caps. Set max-token-spend per session, not per call. Wrap the API call. Abort when running total crosses the cap. Pair with a Slack alert.

Iteration counts. Default 5. Hard ceiling 10. Hand off to a human and log it.

Tool catalog scoping. GitHub's MCP exposes 80+ tools. Most loops need 5. Use --toolsets (or your client's equivalent). Single change that has saved me the most money in production.

Batch API. Anthropic charges $1.50/$7.50 instead of $3/$15 for non-real-time work, a 50% discount. Most "real time" tasks do not actually need real time. They need eventual consistency. The deadline is usually self-imposed.

The local-first move

Most of the agentic loop does not need a frontier model. It needs a tool-calling model, a router, a critic, a summarizer, a structured-output extractor. All of those fit on a local Qwen 3.6 27B or Gemma 4 26B MoE.

Economics from my own setup:

Hardware: Framework 16, 96GB unified memory, RX 7700S, Strix Point. ~$3,200 once.
Power: ~65W under sustained load. $0.034 per active hour.
Local mix: Qwen 3.6 27B (default agent), Gemma 4 26B MoE (parallel fan-out), Gemma 4 E4B (router/classifier).
API mix: Claude and GPT-5 only, reserved for the ~5% of turns that genuinely need frontier reasoning.

If you spend $200/month on agentic API calls, the laptop pays back in 16 months. At $500/month, 6 months. At $3,000/month, you should have bought one for every developer last quarter.

The interesting part is not cost recovery. It is the regime change in how you design loops when fan-out is free. Five $0.01 cloud agents and one $0.50 monolith are roughly the same. Five local agents and one local monolith are also the same, except the local fan-out finishes in parallel and the cloud one finishes when the rate limiter says so.

The honest tradeoff

Local does not replace frontier. The 5% of turns that need frontier reasoning will keep needing it. The architecture that wins is hybrid: 95% local for routing, classifying, drafting, summarizing, tool-calling, critiquing. 5% frontier API for the genuinely hard turns. A well-designed loop calls the frontier once per session, by name, when the question earns the bill.

The hybrid pattern has a real engineering tax. Two stacks to maintain. A routing decision (usually a small classifier model, also local). Drift to monitor on both sides.

What you get for the tax is the freedom to design loops without flinching. The teams that ship in 2026 are the teams who can run an experiment without scheduling a budget meeting first.

The takeaway

Token budgeting is engineering, not penny-pinching. A well-architected loop fits inside a known per-session budget. A poorly architected loop is a runway-burning bot waiting for the wrong night.

Six moves pay for themselves on the first session that would have spiraled: prompt caching, tiered models, hard iteration caps, scoped tool catalogs, batch API for non-realtime, and moving the routing-and-tool-calling 80% to local hardware. The teams that flinch and the teams that ship differ by exactly that decision.