Autonomy Has a Half-Life: What 247,592 Tool Calls Say About Claude Code Checkpoints
The fantasy version of agentic coding is simple:
Give the agent a goal. Come back later. The PR is done.
Sometimes that works. But the transcript data shows a more useful operating model: autonomy has a half-life.
In the refreshed Claude Code mart:
- 4,234 parsed session IDs
- 44,483 human messages
- 247,592 tool events
- 40,546 human re-entry spans
- Timestamp range: November 25, 2025 to May 6, 2026
I looked at how many tool calls happen between one human message and the next.
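The span extraction can be sketched in a few lines. This is an illustrative reconstruction, not the audit's actual code: it assumes a minimal event schema with a `type` field of either `"human"` or `"tool"`, where the real transcript format is richer.

```python
from typing import Iterable


def reentry_spans(events: Iterable[dict]) -> list[int]:
    """Count tool events between consecutive human messages.

    Each human message after the first closes a "re-entry span":
    the number of tool calls the agent made since the previous
    human message. Zero-length spans are kept, since half the
    finding is that they dominate.
    """
    spans = []
    tools_since_human = 0
    seen_human = False
    for event in events:
        if event["type"] == "human":
            if seen_human:
                spans.append(tools_since_human)
            seen_human = True
            tools_since_human = 0
        elif event["type"] == "tool":
            tools_since_human += 1
    return spans


# Two tool calls, then a re-entry; then an immediate re-entry.
events = [
    {"type": "human"}, {"type": "tool"}, {"type": "tool"},
    {"type": "human"}, {"type": "human"}, {"type": "tool"},
]
print(reentry_spans(events))  # → [2, 0]
```

A trailing agent run with no closing human message is deliberately dropped: it never became a re-entry.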
The Finding
Across all human re-entries:
| Span set | Spans | Avg tools | Median | p90 | p95 |
|---|---|---|---|---|---|
| All re-entries | 40,546 | 5.27 | 0 | 15 | 25 |
| Nonzero agent runs | 20,241 | 10.56 | 5 | 25 | 39 |
Two numbers matter:
- 50.1% of re-entries happened after zero tool calls.
- 30.5% of nonzero agent runs had an explicit tool error before the human re-entered.
That second number is the corrected explicit-error cut. An earlier draft used a broader keyword surface and overstated the rate, because keyword matches fire on ordinary source code, search output, and delegated summaries. The surviving finding is narrower and cleaner: autonomy often ends at an operator checkpoint, and about a third of nonzero spans carry an explicit tool failure before the human re-enters. That is the same failure tail that makes stuckness expensive at the session level.
At the session level, the tail is real:
| Metric | Value |
|---|---|
| Average tools/session | 58.48 |
| Median tools/session | 17 |
| p90 tools/session | 154.4 |
| p99 tools/session | 600.35 |
| Median max tools since a human message | 11 |
| p90 max tools since a human message | 56 |
| p95 max tools since a human message | 75 |
| p99 max tools since a human message | 116.67 |
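The fractional values in the table (a p90 of 154.4, a p99 of 600.35) come from linearly interpolated percentiles over per-session counts. A minimal sketch of that convention, assuming the counts are already collected into a list:

```python
def percentile(values: list[float], p: float) -> float:
    """Linearly interpolated percentile on the 0..n-1 rank scale.

    This is the convention that produces fractional results like
    the p90 of 154.4 above, rather than snapping to an observed
    value. Assumes a non-empty input list.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("empty input")
    rank = (p / 100) * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    # Interpolate between the two neighboring order statistics.
    return xs[lo] + (rank - lo) * (xs[hi] - xs[lo])


print(percentile([0, 10, 20, 30, 40], 50))  # → 20.0
print(percentile([0, 10], 90))              # → 9.0
```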
And yes, there are monster runs:
| Max tools since human | Sessions |
|---|---|
| >=50 | 535 |
| >=100 | 89 |
| >=250 | 5 |
| >=500 | 3 |
Why This Is Non-Obvious
Most agent discourse frames autonomy as a binary:
- supervised
- autonomous
That is not how real coding sessions behave.
The better unit is loop depth: how many read/edit/shell/search actions happen before the human should re-enter or the agent should produce receipts.
The practical answer is not "never run unattended." It is also not "let it run forever."
The answer is: give autonomy a budget.
The So What
Claude Code works best when you set checkpoint budgets by tool-call depth:
| Work type | Suggested checkpoint |
|---|---|
| Small edit | 5-10 tool calls |
| Normal feature/debug task | 25 tool calls |
| Hard debugging sprint | 50 tool calls |
| Unattended run | Explicit verification gate before closure |
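One way to make such a budget concrete is a harness-side wrapper. This is a hypothetical sketch, not part of Claude Code's API: `next_action` and `run_tool` stand in for whatever agent loop you control, and the stop conditions mirror the table above plus the repeated-failure tail from the re-entry data.

```python
from collections import Counter


def bounded_sprint(next_action, run_tool, budget=25, repeat_limit=3):
    """Run tool calls until the budget is spent, the agent finishes,
    or the same failure repeats, then stop for a checkpoint.

    `next_action(transcript)` returns the next tool action or None
    when the agent considers the task done; `run_tool(action)` returns
    a result dict with an optional "error" string. Both are
    hypothetical callables standing in for a real harness.
    """
    failures = Counter()
    transcript = []
    for _ in range(budget):
        action = next_action(transcript)
        if action is None:  # agent decided it is done
            break
        result = run_tool(action)
        transcript.append((action, result))
        if result.get("error"):
            failures[result["error"]] += 1
            if failures[result["error"]] >= repeat_limit:
                break  # stuck loop: stop early rather than wander
    # The compressed checkpoint report the operator reads next.
    return {
        "tool_calls": len(transcript),
        "errors": dict(failures),
        "stopped_early": len(transcript) < budget,
    }
```

The returned dict is the skeleton of the checkpoint report: how deep the loop went, what failed, and whether the sprint hit its ceiling or stopped on its own.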
The checkpoint is not bureaucracy. It is a compression point. The agent should report:
- what changed
- what evidence it collected
- what remains uncertain
- the next smallest useful action
That summary keeps the human in the control loop without forcing the human to micromanage every command.
The Operator Protocol
Use this prompt for serious Claude Code work:
Run a bounded sprint. Stop after about 25 tool calls, or sooner if the same
failure repeats. Report what changed, what evidence you collected, what remains
uncertain, and the next smallest useful action.
For risky repos:
Before any long autonomous run, confirm the branch is clean and pushed.
Anything untracked should be treated as disposable.
The More Interesting Takeaway
The best operators are not the people who give the longest prompts. They are the people who manage the agent's loop budget.
A 25-tool sprint with receipts beats a 200-tool wandering session that ends with a confident paragraph.
Autonomy is useful. Unbounded autonomy is where cost, drift, and false closure compound.
The public factory for reproducing this analysis on your own logs is here: MPIsaac-Per/claude-code-ops-audit.