MPIsaac Ventures

Autonomy Has a Half-Life: What 247,592 Tool Calls Say About Claude Code Checkpoints

Michael Isaac
Operator. 30 yrs in enterprise AI.

The fantasy version of agentic coding is simple:

Give the agent a goal. Come back later. The PR is done.

Sometimes that works. But the transcript data shows a more useful operating model: autonomy has a half-life.

The refreshed Claude Code data mart contains:

  • 4,234 parsed session IDs
  • 44,483 human messages
  • 247,592 tool events
  • 40,546 human re-entry spans
  • Timestamp range: November 25, 2025 to May 6, 2026

I counted how many tool calls happen between one human message and the next. Each of those intervals is a re-entry span.

The Finding

Across all human re-entries:

Span set             Spans    Avg tools   Median   p90   p95
All re-entries       40,546   5.27        0        15    25
Nonzero agent runs   20,241   10.56       5        25    39
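
As a minimal sketch of how a span table like this could be computed, assume each session is an ordered list of events tagged "human" or "tool". That schema, the function names, and the nearest-rank percentile method are all assumptions for illustration; the actual mart layout and percentile method are not shown in this post.

```python
import math
from statistics import mean, median

def reentry_spans(events):
    """Count the tool events that precede each human re-entry.

    `events` is an ordered list of "human" / "tool" strings for one session
    (hypothetical schema). The first human message opens the session and is
    not counted as a re-entry; trailing tool calls with no later human
    message are ignored.
    """
    spans = []
    count = None  # stays None until the first human message is seen
    for kind in events:
        if kind == "human":
            if count is not None:
                spans.append(count)
            count = 0
        elif kind == "tool" and count is not None:
            count += 1
    return spans

def percentile(values, p):
    """Nearest-rank percentile (an assumption; other methods differ slightly)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(spans):
    return {
        "spans": len(spans),
        "avg_tools": round(mean(spans), 2),
        "median": median(spans),
        "p90": percentile(spans, 90),
        "p95": percentile(spans, 95),
    }

# Toy session: two tool calls, a re-entry, an immediate re-entry, one more call.
session = ["human", "tool", "tool", "human", "human", "tool", "human"]
spans = reentry_spans(session)
nonzero = [s for s in spans if s > 0]
print(summarize(spans))    # the "All re-entries" row
print(summarize(nonzero))  # the "Nonzero agent runs" row
```

Splitting the zero-tool spans out matters because back-to-back human messages pull the overall median down to zero and hide how long the agent runs when it does run.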

Two numbers matter:

  • 50.1% of re-entries happened after zero tool calls.
  • 30.5% of nonzero agent runs had an explicit tool error before the human re-entered.

That second number comes from a corrected cut that counts only explicit tool errors. An earlier draft used broader keyword matching and overstated the rate, because keyword matches also fire on normal source code, search output, and delegated summaries. The surviving finding is narrower and cleaner: autonomy often ends at an operator checkpoint, and roughly 30% of nonzero spans carry an explicit tool failure before the human re-enters. That failure tail is also what makes stuck sessions expensive.
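
To make the difference between the two cuts concrete, here is a small sketch. It assumes each tool result is a dict with an `is_error` flag and a `text` field; that schema, the keyword list, and the function names are hypothetical and stand in for whatever the real pipeline uses.

```python
def has_explicit_error(span):
    """True if any tool result in the span is flagged as an error by the tool."""
    return any(result.get("is_error", False) for result in span)

def has_keyword_hit(span, keywords=("error", "failed", "exception")):
    """Broader cut: True if any result text merely mentions a failure keyword."""
    return any(
        word in result.get("text", "").lower()
        for result in span
        for word in keywords
    )

spans = [
    # A real failure, flagged by the tool itself.
    [{"is_error": True, "text": "Error: command exited with code 1"}],
    # Normal source code that happens to contain the word "error".
    [{"is_error": False, "text": "raise ValueError('bad input')  # error path"}],
    # A clean run.
    [{"is_error": False, "text": "3 files changed"}],
]

explicit_rate = sum(has_explicit_error(s) for s in spans) / len(spans)
keyword_rate = sum(has_keyword_hit(s) for s in spans) / len(spans)
print(f"explicit: {explicit_rate:.1%}, keyword: {keyword_rate:.1%}")
```

In this toy data the keyword cut counts the second span as a failure even though nothing went wrong, which is the kind of overstatement the corrected number avoids.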

At the session level, the tail is real:

Metric                                   Value
Average tools/session                    58.48
Median tools/session                     17
p90 tools/session                        154.4
p99 tools/session                        600.35
Median max tools since a human message   11
p90 max tools since a human message      56
p95 max tools since a human message      75
p99 max tools since a human message      116.67

And yes, there are monster runs:

Max tools since human   Sessions
>=50                    535
>=100                   89
>=250                   5
>=500                   3
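
A sketch of how the "max tools since human" figure and the threshold counts could be derived, again assuming a hypothetical per-session list of "human" / "tool" event tags. Unlike a re-entry span, this measure includes a trailing run that no human message ever closes, which is an assumption about the definition.

```python
def max_tools_since_human(events):
    """Longest run of tool events with no human message in between."""
    longest = current = 0
    for kind in events:
        if kind == "human":
            current = 0
        elif kind == "tool":
            current += 1
            longest = max(longest, current)
    return longest

def sessions_at_or_above(max_runs, thresholds=(50, 100, 250, 500)):
    """Count sessions whose longest unattended run meets each threshold."""
    return {t: sum(1 for m in max_runs if m >= t) for t in thresholds}

# Toy sessions keyed by a made-up session ID.
sessions = {
    "a": ["human"] + ["tool"] * 12 + ["human"] + ["tool"] * 3,
    "b": ["human"] + ["tool"] * 120,
    "c": ["human", "human", "tool"],
}
max_runs = [max_tools_since_human(ev) for ev in sessions.values()]
print(max_runs)
print(sessions_at_or_above(max_runs))
```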

Why This Is Non-Obvious

Most agent discourse frames autonomy as a binary:

  • supervised
  • autonomous

That is not how real coding sessions behave.

The better unit is loop depth: how many read/edit/shell/search actions happen before the human should re-enter or the agent should produce receipts.

The practical answer is not "never run unattended." It is also not "let it run forever."

The answer is: give autonomy a budget.

The So What

Claude Code works best when you set checkpoint budgets by tool-call depth:

Work type                   Suggested checkpoint
Small edit                  5-10 tool calls
Normal feature/debug task   25 tool calls
Hard debugging sprint       50 tool calls
Unattended run              Explicit verification gate before closure

The checkpoint is not bureaucracy. It is a compression point. The agent should report:

  1. what changed
  2. what evidence it collected
  3. what remains uncertain
  4. the next smallest useful action

That summary keeps the human in the control loop without forcing the human to micromanage every command.
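
One way to make the budgets and the four-part report concrete is a small lookup plus a report structure. This is a sketch, not anything Claude Code provides: the names `CHECKPOINT_BUDGETS` and `CheckpointReport` are invented for illustration, and the upper bound of 10 is an arbitrary pick from the 5-10 range. The unattended run is left out of the lookup because its checkpoint is a verification gate, not a number.

```python
from dataclasses import dataclass

# Tool-call budgets from the table above (hypothetical keys).
CHECKPOINT_BUDGETS = {
    "small_edit": 10,        # upper end of the 5-10 range
    "feature_or_debug": 25,
    "hard_debugging": 50,
}

@dataclass
class CheckpointReport:
    changed: list      # what changed
    evidence: list     # what evidence was collected
    uncertain: list    # what remains uncertain
    next_action: str   # the next smallest useful action

    def render(self):
        """Render the four sections as plain text for the human to read."""
        sections = [
            ("What changed", self.changed),
            ("Evidence collected", self.evidence),
            ("Still uncertain", self.uncertain),
            ("Next smallest useful action", [self.next_action]),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)

report = CheckpointReport(
    changed=["Patched retry logic in a hypothetical client module"],
    evidence=["Unit tests pass locally"],
    uncertain=[],
    next_action="Run the integration suite",
)
print(f"Budget: {CHECKPOINT_BUDGETS['feature_or_debug']} tool calls")
print(report.render())
```

Forcing an empty section to print "(none)" is deliberate: a report that silently omits uncertainty reads the same as one that never considered it.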

The Operator Protocol

Use this prompt for serious Claude Code work:

Run a bounded sprint. Stop after about 25 tool calls, or sooner if the same
failure repeats. Report what changed, what evidence you collected, what remains
uncertain, and the next smallest useful action.
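
The prompt relies on the agent to count its own calls. If you drive the agent from your own harness, the same stop rule can be enforced from outside. The sketch below is a hypothetical tracker (`SprintBudget` is not part of Claude Code), and it assumes the harness can see one error string per failed tool call. Treating two identical consecutive errors as "the same failure repeats" is also an assumption.

```python
class SprintBudget:
    """Tracks tool calls and signals when the agent should stop and report."""

    def __init__(self, max_tool_calls=25, max_repeats=2):
        self.max_tool_calls = max_tool_calls
        self.max_repeats = max_repeats  # identical consecutive failures allowed
        self.tool_calls = 0
        self.last_error = None
        self.repeat_count = 0

    def record(self, error=None):
        """Record one tool call; return a stop reason, or None to continue."""
        self.tool_calls += 1
        if error is None:
            # A successful call resets the repeat streak.
            self.last_error, self.repeat_count = None, 0
        elif error == self.last_error:
            self.repeat_count += 1
        else:
            self.last_error, self.repeat_count = error, 1
        if self.repeat_count >= self.max_repeats:
            return "repeated_failure"
        if self.tool_calls >= self.max_tool_calls:
            return "budget_reached"
        return None

budget = SprintBudget(max_tool_calls=25)
outcomes = [None, "test_login failed", "test_login failed"]
reasons = [budget.record(err) for err in outcomes]
print(reasons)
```

The repeated-failure check runs before the budget check so that a loop stuck on one error stops early instead of burning the remaining calls.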

For risky repos:

Before any long autonomous run, confirm the branch is clean and pushed.
Anything untracked should be treated as disposable.
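
The same check can be scripted before a long run starts. This sketch shells out to `git status --porcelain` and `git rev-list --count @{u}..HEAD`; the function names are made up, and it assumes the current branch has an upstream configured (without one, the second command fails and `preflight` raises).

```python
import subprocess

def split_status(porcelain_output):
    """Split `git status --porcelain` output into (dirty, untracked) paths."""
    dirty, untracked = [], []
    for line in porcelain_output.splitlines():
        if not line.strip():
            continue
        # Each line is two status characters, a space, then the path.
        status, path = line[:2], line[3:]
        (untracked if status == "??" else dirty).append(path)
    return dirty, untracked

def safe_to_run(porcelain_output, ahead_count):
    """Clean and pushed: no tracked changes and no unpushed commits.

    Untracked files do not block the run; they are returned so the
    operator can see what is being treated as disposable.
    """
    dirty, untracked = split_status(porcelain_output)
    return (not dirty and ahead_count == 0), untracked

def git(*args):
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout

def preflight():
    """Run the check against the repository in the current directory."""
    status = git("status", "--porcelain")
    ahead = int(git("rev-list", "--count", "@{u}..HEAD").strip())
    return safe_to_run(status, ahead)

# Demonstration on captured output, so it runs without a repository.
sample = " M src/app.py\n?? scratch.txt\n"
print(split_status(sample))
print(safe_to_run("?? scratch.txt\n", ahead_count=0))
```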

The More Interesting Takeaway

The best operators are not the people who give the longest prompts. They are the people who manage the agent's loop budget.

A 25-tool sprint with receipts beats a 200-tool wandering session that ends with a confident paragraph.

Autonomy is useful. Unbounded autonomy is where cost, drift, and false closure compound.

The public factory for reproducing this analysis on your own logs is here: MPIsaac-Per/claude-code-ops-audit.