Autonomy Has a Half-Life: What 247,592 Tool Calls Say About Claude Code Checkpoints
The fantasy version of agentic coding is simple:
Give the agent a goal. Come back later. The PR is done.
Sometimes that works. But the transcript data shows a more useful operating model: autonomy has a half-life.
In the refreshed Claude Code mart:
- 4,234 parsed session IDs
- 44,483 human messages
- 247,592 tool events
- 40,546 human re-entry spans
- Timestamp range: November 25, 2025 to May 6, 2026
I looked at how many tool calls happen between one human message and the next.
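The span extraction can be sketched in a few lines. This is an illustrative reconstruction, not the audit's actual code: it assumes a minimal event schema with a `type` field of either `"human"` or `"tool"`, where the real transcript format is richer.

```python
from typing import Iterable


def reentry_spans(events: Iterable[dict]) -> list[int]:
    """Count tool events between consecutive human messages.

    Each human message after the first closes a "re-entry span":
    the number of tool calls the agent made since the previous
    human message. Zero-length spans are kept, since half the
    finding is that they dominate.
    """
    spans = []
    tools_since_human = 0
    seen_human = False
    for event in events:
        if event["type"] == "human":
            if seen_human:
                spans.append(tools_since_human)
            seen_human = True
            tools_since_human = 0
        elif event["type"] == "tool":
            tools_since_human += 1
    return spans


# Two tool calls, then a re-entry; then an immediate re-entry.
events = [
    {"type": "human"}, {"type": "tool"}, {"type": "tool"},
    {"type": "human"}, {"type": "human"}, {"type": "tool"},
]
print(reentry_spans(events))  # → [2, 0]
```

A trailing agent run with no closing human message is deliberately dropped: it never became a re-entry.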
The Finding
Across all human re-entries:
| Span set | Spans | Avg tools | Median | p90 | p95 |
|---|---|---|---|---|---|
| All re-entries | 40,546 | 5.27 | 0 | 15 | 25 |
| Nonzero agent runs | 20,241 | 10.56 | 5 | 25 | 39 |
Two numbers matter:
- 50.1% of re-entries happened after zero tool calls.
- 30.5% of nonzero agent runs had an explicit tool error before the human re-entered.
That second number is the corrected explicit-error cut. An earlier draft used a broader keyword surface and overstated the rate, because keyword matches fire on ordinary source code, search output, and delegated summaries. The surviving finding is narrower and cleaner: autonomy often ends at an operator checkpoint, and about a third of nonzero spans carry an explicit tool failure before the human re-enters. That is the same failure tail that makes stuckness expensive at the session level.
At the session level, the tail is real:
| Metric | Value |
|---|---|
| Average tools/session | 58.48 |
| Median tools/session | 17 |
| p90 tools/session | 154.4 |
| p99 tools/session | 600.35 |
| Median max tools since a human message | 11 |
| p90 max tools since a human message | 56 |
| p95 max tools since a human message | 75 |
| p99 max tools since a human message | 116.67 |
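The fractional values in the table (a p90 of 154.4, a p99 of 600.35) come from linearly interpolated percentiles over per-session counts. A minimal sketch of that convention, assuming the counts are already collected into a list:

```python
def percentile(values: list[float], p: float) -> float:
    """Linearly interpolated percentile on the 0..n-1 rank scale.

    This is the convention that produces fractional results like
    the p90 of 154.4 above, rather than snapping to an observed
    value. Assumes a non-empty input list.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("empty input")
    rank = (p / 100) * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    # Interpolate between the two neighboring order statistics.
    return xs[lo] + (rank - lo) * (xs[hi] - xs[lo])


print(percentile([0, 10, 20, 30, 40], 50))  # → 20.0
print(percentile([0, 10], 90))              # → 9.0
```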
And yes, there are monster runs:
| Max tools since human | Sessions |
|---|---|
| >=50 | 535 |
| >=100 | 89 |
| >=250 | 5 |
| >=500 | 3 |
Why This Is Non-Obvious
Most agent discourse frames autonomy as a binary:
- supervised
- autonomous
That is not how real coding sessions behave.
The better unit is loop depth: how many read/edit/shell/search actions happen before the human should re-enter or the agent should produce receipts.
The practical answer is not "never run unattended." It is also not "let it run forever."
The answer is: give autonomy a budget.
The So What
Claude Code works best when you set checkpoint budgets by tool-call depth:
| Work type | Suggested checkpoint |
|---|---|
| Small edit | 5-10 tool calls |
| Normal feature/debug task | 25 tool calls |
| Hard debugging sprint | 50 tool calls |
| Unattended run | Explicit verification gate before closure |
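One way to make such a budget concrete is a harness-side wrapper. This is a hypothetical sketch, not part of Claude Code's API: `next_action` and `run_tool` stand in for whatever agent loop you control, and the stop conditions mirror the table above plus the repeated-failure tail from the re-entry data.

```python
from collections import Counter


def bounded_sprint(next_action, run_tool, budget=25, repeat_limit=3):
    """Run tool calls until the budget is spent, the agent finishes,
    or the same failure repeats, then stop for a checkpoint.

    `next_action(transcript)` returns the next tool action or None
    when the agent considers the task done; `run_tool(action)` returns
    a result dict with an optional "error" string. Both are
    hypothetical callables standing in for a real harness.
    """
    failures = Counter()
    transcript = []
    for _ in range(budget):
        action = next_action(transcript)
        if action is None:  # agent decided it is done
            break
        result = run_tool(action)
        transcript.append((action, result))
        if result.get("error"):
            failures[result["error"]] += 1
            if failures[result["error"]] >= repeat_limit:
                break  # stuck loop: stop early rather than wander
    # The compressed checkpoint report the operator reads next.
    return {
        "tool_calls": len(transcript),
        "errors": dict(failures),
        "stopped_early": len(transcript) < budget,
    }
```

The returned dict is the skeleton of the checkpoint report: how deep the loop went, what failed, and whether the sprint hit its ceiling or stopped on its own.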
The checkpoint is not bureaucracy. It is a compression point. The agent should report:
- what changed
- what evidence it collected
- what remains uncertain
- the next smallest useful action
That summary keeps the human in the control loop without forcing the human to micromanage every command.
The Operator Protocol
Use this prompt for serious Claude Code work:
Run a bounded sprint. Stop after about 25 tool calls, or sooner if the same
failure repeats. Report what changed, what evidence you collected, what remains
uncertain, and the next smallest useful action.
For risky repos:
Before any long autonomous run, confirm the branch is clean and pushed.
Anything untracked should be treated as disposable.
The More Interesting Takeaway
The best operators are not the people who give the longest prompts. They are the people who manage the agent's loop budget.
A 25-tool sprint with receipts beats a 200-tool wandering session that ends with a confident paragraph.
Autonomy is useful. Unbounded autonomy is where cost, drift, and false closure compound.
The public factory for reproducing this analysis on your own logs is here: MPIsaac-Per/claude-code-ops-audit.