Stuckness Is Where Agentic Coding Gets Expensive
Good coding agents hit errors.
That is not the problem.
The problem is stuckness: repeated action after repeated failure without a clean diagnosis.
In my refreshed Claude Code data mart, I bucketed sessions by explicit tool-error count and looked at token volume.
The corpus:
- 4,234 parsed session IDs
- 247,592 tool events
- 18,873 explicit tool-reported error events
- 450,878 assistant turns
This is a correction from the broader keyword-surface draft. Keyword flags are useful for finding error-shaped text, but they fire on normal source code, search results, web pages, and delegated task summaries. The stuckness buckets below count only events with an explicit `result_is_error` flag.
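As a minimal sketch of the distinction, assuming session logs are JSONL with one event per line and a `result_is_error` boolean on tool results (the event and field names here are my assumptions, not the mart's actual schema):

```python
import json

def explicit_error_count(path: str) -> int:
    """Count only tool results that self-report failure.

    A keyword scan over text bodies is deliberately not used:
    words like "error" appear in normal source code, search
    results, and delegated task summaries.
    """
    count = 0
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "tool_result" and event.get("result_is_error"):
                count += 1
    return count
```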
The Finding
Token-bearing sessions by error bucket:
| Error bucket | Sessions | Avg explicit errors | Avg tool calls | Avg human messages | Avg total tokens |
|---|---|---|---|---|---|
| 0 errors | 1,560 | 0.0 | 12.1 | 3.2 | 1,567,962 |
| 1-2 errors | 774 | 1.4 | 37.3 | 7.0 | 5,874,648 |
| 3-9 errors | 897 | 5.2 | 86.2 | 13.9 | 17,617,903 |
| 10+ errors | 444 | 29.6 | 275.8 | 46.0 | 85,787,221 |
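A sketch of how a table like this is built, assuming a per-session frame with hypothetical column names (`explicit_errors`, `tool_calls`, `human_msgs`, `total_tokens`):

```python
import pandas as pd

def bucket_sessions(sessions: pd.DataFrame) -> pd.DataFrame:
    # Bin edges produce the buckets 0, 1-2, 3-9, and 10+.
    bins = [-1, 0, 2, 9, float("inf")]
    labels = ["0 errors", "1-2 errors", "3-9 errors", "10+ errors"]
    sessions = sessions.assign(
        bucket=pd.cut(sessions["explicit_errors"], bins=bins, labels=labels)
    )
    return sessions.groupby("bucket", observed=True).agg(
        sessions=("explicit_errors", "size"),
        avg_explicit_errors=("explicit_errors", "mean"),
        avg_tool_calls=("tool_calls", "mean"),
        avg_human_messages=("human_msgs", "mean"),
        avg_total_tokens=("total_tokens", "mean"),
    )
```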
Then the concentration:
| Segment | Share |
|---|---|
| Sessions with 10+ explicit errors | 12.1% |
| Token volume in 10+ explicit-error sessions | 62.6% |
| Sessions with no explicit errors | 42.4% |
| Token volume in no-explicit-error sessions | 4.0% |
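The concentration rows fall out of the same hypothetical frame:

```python
import pandas as pd

def concentration(sessions: pd.DataFrame) -> dict:
    heavy = sessions["explicit_errors"] >= 10
    clean = sessions["explicit_errors"] == 0
    total = sessions["total_tokens"].sum()
    return {
        "10+ error session share": heavy.mean(),
        "10+ error token share": sessions.loc[heavy, "total_tokens"].sum() / total,
        "no-error session share": clean.mean(),
        "no-error token share": sessions.loc[clean, "total_tokens"].sum() / total,
    }
```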
This is the agentic coding cost curve.
The cheap sessions are not where the action is. The expensive sessions are where debugging, environment drift, test loops, and file churn concentrate.
Human Rescue Patterns
After a human re-entered the session following an explicit tool error, the agent most often went back to the shell:
| Next tool family | Interventions | Share | Next explicit failure | Next success-or-test signal |
|---|---|---|---|---|
| shell | 2,780 | 60.1% | 15.3% | 30.8% |
| file_edit | 621 | 13.4% | 5.2% | 63.3% |
| file_search | 412 | 8.9% | 3.4% | 31.6% |
| capability | 234 | 5.1% | 2.1% | 0.0% |
| delegation | 148 | 3.2% | 8.1% | 85.8% |
| planning | 132 | 2.9% | 0.0% | 78.8% |
One surprising line: delegation and planning were small shares of rescue actions, but had the strongest next success-or-test signals.
That does not mean "delegate more" or "plan more" globally. It means after repeated failure, diagnosis beats another blind command.
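Mechanically, the rescue measurement is a scan over each session's ordered event stream: find an explicit error, wait for the next human message, then record the family of the first tool call after it. A sketch, under the same assumed event schema as above:

```python
from collections import Counter

def rescue_actions(events: list[dict]) -> Counter:
    """Count the tool family of the first tool call after a human
    re-enters following an explicit tool error.

    Assumes events ordered by time, each with a "kind" of
    "tool_call", "tool_result", or "human_msg", and tool calls
    carrying a "tool_family" field (assumed names).
    """
    counts = Counter()
    awaiting_human = awaiting_tool = False
    for event in events:
        if event["kind"] == "tool_result" and event.get("result_is_error"):
            awaiting_human = True
        elif event["kind"] == "human_msg" and awaiting_human:
            awaiting_human, awaiting_tool = False, True
        elif event["kind"] == "tool_call" and awaiting_tool:
            counts[event["tool_family"]] += 1
            awaiting_tool = False
    return counts
```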
The So What
Do not optimize for zero errors.
Optimize for fast error classification.
After two consecutive failures, stop and name the failure class. This is also where the loop-depth checkpoint belongs: the agent should produce a diagnosis, not another command.
The failure classes:
- auth or permission
- missing file or path drift
- command misuse
- dependency or environment
- test failure
- product logic failure
- external service or network
Then run the cheapest disambiguating check.
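To make "name the failure class" concrete, here is a heuristic sketch that maps a failing command's stderr onto the classes above. The patterns are illustrative guesses, not the audit repo's actual triage rules:

```python
import re

# Ordered (pattern, label) pairs; first match wins.
FAILURE_PATTERNS = [
    (r"permission denied|not authorized|401|403", "auth or permission"),
    (r"no such file|enoent|does not exist", "missing file or path drift"),
    (r"unknown (option|flag|command)|usage:", "command misuse"),
    (r"modulenotfounderror|cannot find module|version conflict", "dependency or environment"),
    (r"\d+ (tests? )?failed|assertionerror", "test failure"),
    (r"timed out|connection refused|econnreset", "external service or network"),
]

def classify_failure(stderr: str) -> str:
    for pattern, label in FAILURE_PATTERNS:
        if re.search(pattern, stderr, re.IGNORECASE):
            return label
    # Commands that run cleanly but do the wrong thing land here;
    # this class needs a stated hypothesis, not a pattern match.
    return "product logic failure"
```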
The Operator Protocol
Paste this into hard Claude Code debugging sessions:
> If two consecutive attempts fail, stop execution and classify the failure.
> Name the current hypothesis, evidence for it, evidence against it, and the
> smallest next check. Do not retry the same action unchanged.
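The same rule can be enforced mechanically by a harness around the agent. A sketch of such a guard (the wrapper interface is my assumption, not Claude Code's actual hook API):

```python
class RetryGuard:
    """Track consecutive identical failed actions and force a
    diagnosis step instead of a third blind retry."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.last_action: str | None = None
        self.failures = 0

    def record(self, action: str, failed: bool) -> bool:
        """Return True if execution should pause for a diagnosis."""
        if not failed:
            self.last_action, self.failures = None, 0
            return False
        if action == self.last_action:
            self.failures += 1
        else:
            self.last_action, self.failures = action, 1
        return self.failures >= self.max_repeats
```

A harness would call `record()` after every tool result and, on `True`, inject the protocol text above in place of the next command.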
This is not about making the agent cautious. It is about stopping repeated failure from burning context, tokens, and operator attention.
The Non-Intuitive Takeaway
The best agentic coder is not the one who prevents errors.
The best agentic coder notices when the agent has switched from learning to thrashing, and stops it before it accidentally declares itself done.
That moment is where elite operators intervene.
The public repo for reproducing the stuckness queries is here: MPIsaac-Per/claude-code-ops-audit.