Stuckness Is Where Agentic Coding Gets Expensive
Good coding agents hit errors.
That is not the problem.
The problem is stuckness: repeated action after repeated failure without a clean diagnosis.
In my refreshed Claude Code data mart, I bucketed sessions by explicit tool-error count and looked at token volume.
The corpus:
- 4,234 parsed session IDs
- 247,592 tool events
- 18,873 explicit tool-reported error events
- 450,878 assistant turns
This is a correction from the broader keyword-surface draft. Keyword flags are useful for finding error-shaped text, but they fire on normal source code, search results, web pages, and delegated task summaries. The stuckness buckets below count only events with an explicit `result_is_error` flag.
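As a minimal sketch of the distinction, assuming session logs are JSONL with one event per line and a `result_is_error` boolean on tool results (the event and field names here are my assumptions, not the mart's actual schema):

```python
import json

def explicit_error_count(path: str) -> int:
    """Count only tool results that self-report failure.

    A keyword scan over text bodies is deliberately not used:
    words like "error" appear in normal source code, search
    results, and delegated task summaries.
    """
    count = 0
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "tool_result" and event.get("result_is_error"):
                count += 1
    return count
```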
The Finding
Token-bearing sessions by error bucket:
| Error bucket | Sessions | Avg explicit errors | Avg tool calls | Avg human messages | Avg total tokens |
|---|---|---|---|---|---|
| 0 errors | 1,560 | 0.0 | 12.1 | 3.2 | 1,567,962 |
| 1-2 errors | 774 | 1.4 | 37.3 | 7.0 | 5,874,648 |
| 3-9 errors | 897 | 5.2 | 86.2 | 13.9 | 17,617,903 |
| 10+ errors | 444 | 29.6 | 275.8 | 46.0 | 85,787,221 |
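A sketch of how a table like this is built, assuming a per-session frame with hypothetical column names (`explicit_errors`, `tool_calls`, `human_msgs`, `total_tokens`):

```python
import pandas as pd

def bucket_sessions(sessions: pd.DataFrame) -> pd.DataFrame:
    # Bin edges produce the buckets 0, 1-2, 3-9, and 10+.
    bins = [-1, 0, 2, 9, float("inf")]
    labels = ["0 errors", "1-2 errors", "3-9 errors", "10+ errors"]
    sessions = sessions.assign(
        bucket=pd.cut(sessions["explicit_errors"], bins=bins, labels=labels)
    )
    return sessions.groupby("bucket", observed=True).agg(
        sessions=("explicit_errors", "size"),
        avg_explicit_errors=("explicit_errors", "mean"),
        avg_tool_calls=("tool_calls", "mean"),
        avg_human_messages=("human_msgs", "mean"),
        avg_total_tokens=("total_tokens", "mean"),
    )
```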
Then the concentration:
| Segment | Share |
|---|---|
| Sessions with 10+ explicit errors | 12.1% |
| Token volume in 10+ explicit-error sessions | 62.6% |
| Sessions with no explicit errors | 42.4% |
| Token volume in no-explicit-error sessions | 4.0% |
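The concentration rows fall out of the same hypothetical frame:

```python
import pandas as pd

def concentration(sessions: pd.DataFrame) -> dict:
    heavy = sessions["explicit_errors"] >= 10
    clean = sessions["explicit_errors"] == 0
    total = sessions["total_tokens"].sum()
    return {
        "10+ error session share": heavy.mean(),
        "10+ error token share": sessions.loc[heavy, "total_tokens"].sum() / total,
        "no-error session share": clean.mean(),
        "no-error token share": sessions.loc[clean, "total_tokens"].sum() / total,
    }
```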
This is the agentic coding cost curve.
The cheap sessions are not where the action is. The expensive sessions are where debugging, environment drift, test loops, and file churn concentrate.
Human Rescue Patterns
After a human re-entered the session following an explicit tool error, the agent most often went back to the shell:
| Next tool family | Interventions | Share | Next explicit failure | Next success-or-test signal |
|---|---|---|---|---|
| shell | 2,780 | 60.1% | 15.3% | 30.8% |
| file_edit | 621 | 13.4% | 5.2% | 63.3% |
| file_search | 412 | 8.9% | 3.4% | 31.6% |
| capability | 234 | 5.1% | 2.1% | 0.0% |
| delegation | 148 | 3.2% | 8.1% | 85.8% |
| planning | 132 | 2.9% | 0.0% | 78.8% |
One surprising line: delegation and planning were small shares of rescue actions, but had the strongest next success-or-test signals.
That does not mean "delegate more" or "plan more" globally. It means after repeated failure, diagnosis beats another blind command.
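Mechanically, the rescue measurement is a scan over each session's ordered event stream: find an explicit error, wait for the next human message, then record the family of the first tool call after it. A sketch, under the same assumed event schema as above:

```python
from collections import Counter

def rescue_actions(events: list[dict]) -> Counter:
    """Count the tool family of the first tool call after a human
    re-enters following an explicit tool error.

    Assumes events ordered by time, each with a "kind" of
    "tool_call", "tool_result", or "human_msg", and tool calls
    carrying a "tool_family" field (assumed names).
    """
    counts = Counter()
    awaiting_human = awaiting_tool = False
    for event in events:
        if event["kind"] == "tool_result" and event.get("result_is_error"):
            awaiting_human = True
        elif event["kind"] == "human_msg" and awaiting_human:
            awaiting_human, awaiting_tool = False, True
        elif event["kind"] == "tool_call" and awaiting_tool:
            counts[event["tool_family"]] += 1
            awaiting_tool = False
    return counts
```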
The So What
Do not optimize for zero errors.
Optimize for fast error classification.
After two consecutive failures, stop and name the failure class. This is also where the loop-depth checkpoint belongs: the agent should produce a diagnosis, not another command.
The failure classes:
- auth or permission
- missing file or path drift
- command misuse
- dependency or environment
- test failure
- product logic failure
- external service or network
Then run the cheapest disambiguating check.
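To make "name the failure class" concrete, here is a heuristic sketch that maps a failing command's stderr onto the classes above. The patterns are illustrative guesses, not the audit repo's actual triage rules:

```python
import re

# Ordered (pattern, label) pairs; first match wins.
FAILURE_PATTERNS = [
    (r"permission denied|not authorized|401|403", "auth or permission"),
    (r"no such file|enoent|does not exist", "missing file or path drift"),
    (r"unknown (option|flag|command)|usage:", "command misuse"),
    (r"modulenotfounderror|cannot find module|version conflict", "dependency or environment"),
    (r"\d+ (tests? )?failed|assertionerror", "test failure"),
    (r"timed out|connection refused|econnreset", "external service or network"),
]

def classify_failure(stderr: str) -> str:
    for pattern, label in FAILURE_PATTERNS:
        if re.search(pattern, stderr, re.IGNORECASE):
            return label
    # Commands that run cleanly but do the wrong thing land here;
    # this class needs a stated hypothesis, not a pattern match.
    return "product logic failure"
```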
The Operator Protocol
Paste this into hard Claude Code debugging sessions:
> If two consecutive attempts fail, stop execution and classify the failure.
> Name the current hypothesis, evidence for it, evidence against it, and the
> smallest next check. Do not retry the same action unchanged.
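The same rule can be enforced mechanically by a harness around the agent. A sketch of such a guard (the wrapper interface is my assumption, not Claude Code's actual hook API):

```python
class RetryGuard:
    """Track consecutive identical failed actions and force a
    diagnosis step instead of a third blind retry."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.last_action: str | None = None
        self.failures = 0

    def record(self, action: str, failed: bool) -> bool:
        """Return True if execution should pause for a diagnosis."""
        if not failed:
            self.last_action, self.failures = None, 0
            return False
        if action == self.last_action:
            self.failures += 1
        else:
            self.last_action, self.failures = action, 1
        return self.failures >= self.max_repeats
```

A harness would call `record()` after every tool result and, on `True`, inject the protocol text above in place of the next command.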
This is not about making the agent cautious. It is about stopping repeated failure from burning context, tokens, and operator attention.
The Non-Intuitive Takeaway
The best agentic coder is not the one who prevents errors.
The best agentic coder notices when the agent has switched from learning to thrashing, and stops it before it accidentally declares itself done.
That moment is where elite operators intervene.
The public repo for reproducing the stuckness queries is here: MPIsaac-Per/claude-code-ops-audit.