
What I Learned From 245,306 Claude Code Tool Calls
I analyzed the last 113 days of my Claude Code usage.
Not vibes. Not "here is how I feel after using AI coding tools." The actual transcript corpus: every user prompt, assistant message, shell command, file read, edit, browser action, issue-tracker call, web lookup, and tool result.
One disclosure first: this is my own corpus. I work as a near-lone developer across many applications, services, and infrastructure components on a single technology platform, rather than inside a single app and a single repo. The setup leans heavily on SSH, Docker, and remote orchestration because most days the work spans multiple servers and codebases. A typical single-app developer's corpus will skew differently: less shell, less SSH, more local file work, smaller per-session tool counts. Read what follows as observations from one vantage, not a survey of Claude Code users.
The raw blob store held 13,782 transcript blobs and 23.8 GB of JSONL. For analysis, I deduped it to the latest blob per session so growing transcripts were not counted over and over.
That left:
- 6,120 latest session files
- 4.9 GB of transcript data
- 1,481,120 JSONL rows
- 44,212 human user rows
- 446,639 assistant rows
- 245,306 tool-use blocks
- 244,954 tool-result blocks
- 113 calendar days, 105 active days
Here is the thing I did not appreciate until the data was sitting in front of me:
Claude Code is not a chat product with tools attached. It is an operating loop.
The chat is the smallest part of the system. The real action is the loop:
- Think.
- Use a tool.
- Read the result.
- Change the plan.
- Use another tool.
- Repeat until the world is different.
That sounds obvious. It is not. Most people still evaluate AI coding tools by reading model answers. That is like judging an engineer by reading their Slack messages and never watching their terminal.
The terminal is the product.
1. The "user" is not always the user
The first weird lesson is structural.
In the corpus, there were 289,080 rows with type = user. Only 44,212 of those were actual human user rows.
Roughly 244,800 of the remainder are tool-result carrier messages: the environment handing stdout, diffs, browser snapshots, errors, API responses, and file contents back into the conversation. A small residual is made up of subagent prompts (each delegated agent gets its instruction as a user row in its own transcript) and injected system reminders, which also arrive wrapped as type = user messages.
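If you want to reproduce the split, the check is simple. A minimal sketch, assuming the row layout in my export (a top-level type field, plus a message.content that is either the prompt string or a list of content blocks); it does not separate out the small residual of subagent prompts and injected reminders, which also look like plain text:

```python
def is_human_prompt(row: dict) -> bool:
    """True for rows a person actually typed; False for type = user rows
    that only carry tool_result blocks back into the conversation."""
    if row.get("type") != "user":
        return False
    content = row.get("message", {}).get("content")
    if isinstance(content, str):
        return True  # plain prompt text
    return isinstance(content, list) and not any(
        isinstance(block, dict) and block.get("type") == "tool_result"
        for block in content
    )
```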
Once you see that, you stop thinking of the transcript as a chat log. It is closer to an execution trace.
The human gives direction. The assistant proposes action. The tools report reality. The next assistant turn is shaped by that reality.
That is the whole game.
If you want better outcomes, do not just write better prompts. Improve the quality of the reality the model gets back:
- clearer test output
- cleaner CLI commands
- deterministic scripts
- structured tool responses
- less noisy logs
- smaller, more focused diffs
- better error messages
The model is only as good as the loop it is trapped inside.
2. Claude Code is a shell that edits, not an editor that shells out
The most used tool was not Edit. It was not Read. It was not a planner.
It was Bash.
Top tool families:
| Tool family | Calls |
|---|---|
| Shell | 115,915 |
| File edit/read/write | 74,345 |
| File search | 22,705 |
| Planning | 13,490 |
| Delegation/subagents | 3,480 |
| Linear/issue tracker MCP | 3,197 |
| Web search/fetch | 2,954 |
| Browser MCP | 1,360 |
| Other (Skill, Monitor, smaller MCPs) | 7,860 |
Top individual tools:
| Tool | Calls |
|---|---|
| Bash | 115,743 |
| Read | 42,309 |
| Edit | 25,091 |
| Grep | 17,014 |
| Write | 6,945 |
| TodoWrite | 6,417 |
| Glob | 5,687 |
| TaskUpdate | 4,667 |
| TaskCreate | 2,406 |
| Task / Agent | 3,480 combined |
Bash alone was 47.2% of all tool calls.
That completely changes how I think about "AI coding."
Claude Code is not primarily a code editor. It is a Unix operator with an LLM loop. It edits files, yes, but the edits are downstream of reading, searching, testing, building, grepping, running CLIs, checking services, and asking the operating system what is true.
The top shell command verbs make that even clearer:
| Command verb | Calls |
|---|---|
| docker | 16,958 |
| ssh | 10,816 |
| git | 10,605 |
| pnpm | 7,364 |
| cd | 7,222 |
| ls | 6,883 |
| grep | 5,606 |
| cat | 3,655 |
| curl | 2,975 |
| find | 2,716 |
| gh | 1,994 |
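The verb counts come from taking the first whitespace-delimited token of each Bash call's command input, which is also why cd shows up as a verb at all. A simplified sketch, not the exact script I ran, with the input field names as they appear in my transcripts:

```python
import json
from collections import Counter
from pathlib import Path

verbs = Counter()
for path in Path("sessions").glob("*.jsonl"):      # one deduped file per session
    for line in path.open(encoding="utf-8"):
        row = json.loads(line)
        content = row.get("message", {}).get("content")
        if row.get("type") != "assistant" or not isinstance(content, list):
            continue
        for block in content:
            if (isinstance(block, dict) and block.get("type") == "tool_use"
                    and block.get("name") == "Bash"):
                command = (block.get("input", {}).get("command") or "").strip()
                if command:
                    verbs[command.split()[0]] += 1  # "cd app && pnpm test" counts as cd

print(verbs.most_common(15))
```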
This is not "autocomplete, but better."
This is an agent operating the same surfaces a senior engineer operates: shell, git, package manager, Docker, SSH, HTTP, issue tracker, browser, files.
If your shell environment is messy, your AI coding environment is messy. If your commands are slow, ambiguous, or inconsistent, the model inherits that friction. If your repo has no reliable test command, the model has no reliable way to know whether it is done.
The practical advice is boring and powerful:
Invest in your CLI surface area.
Write scripts. Standardize commands. Make tests cheap to run. Give the agent a clean Makefile, package.json, justfile, or task runner. Make the right thing easy to execute.
3. Most engineering is reading
File tools tell the same story.
Read happened 42,309 times. Edit happened 25,091 times. Write happened 6,945 times.
Read + Grep + Glob together accounted for 65,010 calls before you even count ls, cat, find, and shell-level search.
That ratio feels right. Good engineering is mostly building a local model of the system before touching it.
The most touched file types were exactly what you would expect from real application work:
| File type | Relative volume |
|---|---|
| TypeScript | dominant |
| Markdown | heavy |
| TSX | heavy |
| Python | substantial |
| JSON/YAML/Shell | frequent support work |
The model that reads more before editing is usually the model you want. "It immediately patched the file" feels fast. "It inspected the call chain, searched the tests, read the config, then patched the file" is usually cheaper.
This is one place where product friction is good. Claude Code's habit of reading before editing is not ceremony. It is the difference between code mutation and code work.
4. The distribution is insane
A "session" here means one session_id, collapsed across blob snapshots so a long-running session counts once rather than once per autosave. On that definition:
The median session had 1 tool call.
The 90th percentile session had 106 tool calls.
The 99th percentile session had 481 tool calls.
The largest session had 2,206 tool calls.
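The distribution falls out of grouping tool_use blocks by session and taking quantiles. A simplified sketch, assuming rows carry a sessionId field the way mine do:

```python
import json
from collections import Counter
from pathlib import Path
from statistics import median, quantiles

sessions: set[str] = set()
tool_calls = Counter()

for path in Path("sessions").glob("*.jsonl"):       # already deduped per session
    for line in path.open(encoding="utf-8"):
        row = json.loads(line)
        sid = row.get("sessionId", path.stem)
        sessions.add(sid)
        content = row.get("message", {}).get("content")
        if row.get("type") == "assistant" and isinstance(content, list):
            tool_calls[sid] += sum(
                1 for b in content
                if isinstance(b, dict) and b.get("type") == "tool_use"
            )

counts = sorted(tool_calls.get(s, 0) for s in sessions)   # zero-call sessions included
cuts = quantiles(counts, n=100)                           # cuts[89] ~ p90, cuts[98] ~ p99
print("median", median(counts), "p90", cuts[89], "p99", cuts[98], "max", counts[-1])
```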
That is the power-law shape of real agentic work.
Most sessions are small. You ask a question, inspect a file, run one command, get an answer. But the value frontier is in the long tail: the sessions where the agent stays with a problem across hundreds or thousands of environmental observations.
Those long sessions do not feel like "chatting with an AI."
They feel like an engineer working:
- inspect
- hypothesize
- patch
- test
- hit a failure
- inspect again
- patch again
- check logs
- chase config
- compare branches
- post the comment
- run the suite
- verify the deployment
The UX question is not "how do we make the model answer better?"
It is "how do we keep a long-running loop grounded, auditable, and recoverable?"
That means resumability. Checkpoints. Clear plans. Tool-result compression. Better failure display. Safer permissioning. Good stop conditions. Real verification.
The long tail is where the product becomes an operating system.
5. Errors are not edge cases. They are the work.
There were 244,954 tool-result blocks.
18,778 were explicitly marked is_error = true, about 7.7%. The explicit flag undercounts: a failing test run is just stdout from pnpm test, and a 401 is a field in a response body, not an is_error bit. So I also bucketed tool-result content by keyword presence. Buckets are not exclusive, so one result can match several:
| Keyword bucket | Matches | Reads as |
|---|---|---|
| error/exception/failed | 55,108 | mostly real failures, with some "no errors found" false positives |
| permission/auth/401/403 | 50,538 | mix of auth failures and successful auth handshakes mentioning the words |
| rate limit/timeout/429 | 17,322 | mostly real throttling and latency events |
| not found/404/no such file | 16,116 | mostly real misses |
| CI/test/build/lint/check | 76,185 | operational surface, not specifically error; includes successful runs |
| git/PR/commit/merge | 30,711 | operational surface, not specifically error; includes successful operations |
The first four buckets give a rough lower bound on error-shaped reality. The last two are operational signal: how much of the loop lives in CI, test, build, and version-control surfaces. Both stories matter for the thesis.
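The bucketing itself is nothing clever: case-insensitive keyword matches over tool-result text, with one result allowed to land in several buckets. The patterns below are a reconstruction from the bucket names, not the exact lists I used:

```python
import re

# Non-exclusive keyword buckets mirroring the table above.
BUCKETS = {
    "error/exception/failed":     re.compile(r"error|exception|failed", re.I),
    "permission/auth/401/403":    re.compile(r"permission|auth|\b401\b|\b403\b", re.I),
    "rate limit/timeout/429":     re.compile(r"rate.?limit|timeout|\b429\b", re.I),
    "not found/404/no such file": re.compile(r"not found|\b404\b|no such file", re.I),
    "ci/test/build/lint/check":   re.compile(r"\bci\b|test|build|lint|check", re.I),
    "git/pr/commit/merge":        re.compile(r"\bgit\b|\bpr\b|commit|merge", re.I),
}

def buckets_for(result_text: str) -> list[str]:
    """Every bucket a tool-result text matches; one result can land in several."""
    return [name for name, pattern in BUCKETS.items() if pattern.search(result_text)]

# e.g. "pnpm test failed with 3 errors" lands in both the error bucket
# and the CI/test bucket.
```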
This is why demo videos are misleading. The happy path is not the product.
The product is how the system behaves after the first failure.
Does it read the error or paper over it? Does it retry the same command blindly? Does it narrow the scope? Does it inspect the file that actually failed? Does it notice auth problems and stop wasting calls? Does it distinguish test failure from environment failure?
The best Claude Code sessions are not the ones with no errors. They are the ones where errors become high-quality steering signals.
You want tools that return structured failure. You want tests that fail loudly and locally. You want logs that include the file, line, command, and next diagnostic move.
The agent does not need a frictionless world. It needs a legible one.
6. Planning is visible, but action dominates
Planning tools showed up meaningfully:
- TodoWrite: 6,417
- TaskUpdate: 4,667
- TaskCreate: 2,406
That is 13,490 planning-family tool calls.
Useful, but still dwarfed by shell and file operations.
This matches my experience: planning is most valuable when it is tied to execution state. A plan written in prose is easy to ignore. A task list that moves from pending to in-progress to completed is harder to fake.
The lesson is not "plan more."
The lesson is "make plans operational."
Good agent plans should answer:
- What is the next command?
- What file is being inspected?
- What would count as verification?
- What is blocked?
- What is explicitly out of scope?
Plans that do not constrain tool use are just narrative.
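What "operational" means is easier to show than to define. Here is a hypothetical plan entry shaped like those questions; the field names are illustrative, not Claude Code's task schema:

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    """A hypothetical execution-shaped plan entry. Field names are mine."""
    goal: str
    next_command: str                 # the exact command to run next
    files_under_inspection: list[str] = field(default_factory=list)
    verification: str = ""            # what output would count as done
    blocked_on: str = ""              # empty means not blocked
    out_of_scope: list[str] = field(default_factory=list)
    status: str = "pending"           # pending -> in_progress -> completed

step = PlanStep(
    goal="Fix the failing auth test",
    next_command="pnpm test -- auth.spec.ts",
    files_under_inspection=["src/auth/session.ts"],
    verification="pnpm test exits 0 with no skipped auth specs",
    out_of_scope=["refactoring the session store"],
)
```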
7. The web is not the center. It is a specialist tool.
WebSearch and WebFetch appeared 2,954 times combined. Browser MCP tools appeared 1,360 times. Issue-tracker MCP tools appeared 3,197 times.
That is meaningful, but it is tiny compared with shell and filesystem work.
This surprised me a little. I expected more web.
But the pattern makes sense. For coding work, the highest-value context is usually local:
- the repo
- the tests
- the config
- the package manager
- the logs
- the branch diff
- the live service
The web matters when facts are unstable: docs, API versions, current vendor behavior, release notes, pricing, auth flows. It is not a replacement for understanding the system under your hands.
The practical rule I use now:
Search the web for reality that changes outside the repo. Search the repo for reality that already exists inside it.
8. Prompt caching is not an optimization. It is infrastructure.
The token counters are absurd:
- Input tokens (fresh per-turn delta): 98.7 million
- Output tokens: 123.3 million
- Cache creation input tokens: 2.0 billion
- Cache read input tokens: 57.8 billion
Output exceeding fresh input is not a paradox. The "input tokens" line is only the non-cached delta the model sees each turn. The real context the model conditions on is dominated by the cache: 57.8B cache reads against 98.7M fresh input is roughly a 585x ratio. Most of what the model is reading per turn is replayed cached state, not new typing.
Do not overread these as billing numbers. They come from transcript usage objects and should be treated as telemetry, not accounting.
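For reference, the totals above are just sums over the usage object attached to assistant rows. A sketch, assuming the Anthropic-style field names that appear in my transcripts:

```python
import json
from collections import Counter
from pathlib import Path

USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)

totals = Counter()
for path in Path("sessions").glob("*.jsonl"):
    for line in path.open(encoding="utf-8"):
        row = json.loads(line)
        usage = row.get("message", {}).get("usage") or {}
        for fld in USAGE_FIELDS:
            totals[fld] += usage.get(fld) or 0

ratio = totals["cache_read_input_tokens"] / max(totals["input_tokens"], 1)
print(totals, f"cache reads / fresh input ~ {ratio:.0f}x")
```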
But directionally, the point is obvious:
Long-running agentic coding depends on reusing context.
Without caching, the economics and latency of these workflows look very different. With caching, you can keep large system prompts, tool definitions, repo instructions, and accumulated context in play without paying the full cost every turn.
This is one reason "just paste the relevant files into chat" is the wrong mental model. Agentic work is not a one-shot prompt. It is a stateful loop with repeated context reuse.
9. Real agentic work crosses machine boundaries
Transcript rows (every human message, assistant message, and tool result) by host environment:
| Environment | Rows |
|---|---|
| Server | 964,422 |
| MacBook/local | 382,346 |
| Other | 50,609 |
This is a row count, not a tool-call count, and it reflects my own setup: workers running on remote machines emit a lot of transcript volume per task. Read the table as "agentic work happens off the laptop too," not as "most coding work is on servers."
Again: this is not a local editor story.
The corpus includes work across laptops, remote servers, production-like paths, issue trackers, browser sessions, package managers, SSH, Docker, and GitHub-style workflows.
The future of AI coding is not just "the IDE gets smarter."
The future is that the agent can operate wherever the work actually lives, with permissions, auditability, and enough environmental context to avoid becoming dangerous.
Local-first matters. Remote execution matters. Source-of-truth storage matters. Credentials matter. RBAC matters. Logs matter.
The agent is not replacing the software delivery system. It is becoming a new operator inside it.
10. The biggest risk is plausible completion
After 245,306 tool calls, my strongest opinion is this:
The dangerous failure mode is not that the model cannot do the work.
The dangerous failure mode is that it can do 80% of the work, narrate the remaining 20% convincingly, and stop.
That is why verification has to become a first-class workflow object.
For any non-trivial task, the final answer should be grounded in evidence:
- What changed?
- What command verified it?
- What failed?
- What remains untested?
- What assumption is still open?
This is where hooks, checklists, CI, and strict completion rules matter. Not because they make the model smarter. Because they make it harder for the loop to end on vibes.
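One way to make that concrete is a completion gate that refuses "done" until those questions have answers. A hypothetical sketch, not a Claude Code feature; the names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class CompletionEvidence:
    """Hypothetical evidence a task must carry before it can be closed."""
    changed: list[str] = field(default_factory=list)  # files or resources touched
    verify_command: str = ""                          # command that checked the change
    verify_output_ok: bool = False                    # did that command actually pass?
    untested: list[str] = field(default_factory=list)
    open_assumptions: list[str] = field(default_factory=list)

def can_close(ev: CompletionEvidence) -> tuple[bool, str]:
    """Completion requires evidence, not narration."""
    if not ev.changed:
        return False, "nothing recorded as changed"
    if not ev.verify_command:
        return False, "no verification command was run"
    if not ev.verify_output_ok:
        return False, f"verification not confirmed: {ev.verify_command}"
    if ev.untested or ev.open_assumptions:
        return False, "untested paths or open assumptions remain; force a human decision"
    return True, "ok"
```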
11. The new skill is not prompting. It is loop design.
Prompting still matters.
But after this analysis, I think "prompt engineering" is too small a frame for serious coding agents.
The higher-leverage work is loop design:
- What tools exist?
- What do they return?
- Which tools can run in parallel?
- What permissions are required?
- What context is loaded by default?
- What gets cached?
- What gets written to memory?
- What must be verified before completion?
- What should stop the agent?
- What should force a human decision?
The model is one component. The loop is the system.
Most bad agent behavior is not a model problem in isolation. It is a loop problem:
- bad tool output
- missing tests
- unclear permissions
- stale docs
- hidden state
- noisy logs
- no stop condition
- no verification gate
If you want better AI coding results, improve the loop.
12. The practical operating rules I took away
Here is the short version I would give any team rolling this out seriously:
- Treat Claude Code as a shell operator, not just a coding assistant.
- Make repo commands boring, fast, and documented.
- Prefer structured tool output over prose.
- Make the agent read before it edits.
- Keep plans tied to executable next steps.
- Use subagents for independent fan-out, not vague delegation.
- Search the web for unstable external facts; search the repo for local truth.
- Make failure outputs legible.
- Never let "done" mean "the model says it is done."
- Track the loop, not just the prompt.
The punchline
The real shift is not that AI can write code.
The real shift is that AI can operate the development environment.
That is much bigger and much messier.
Writing code is one act inside software work. Operating the loop includes reading, testing, failing, searching, editing, debugging, checking tickets, using the browser, moving across machines, and proving the result.
245,306 tool calls later, I trust the loop more than the answer.
The answer is cheap.
The loop is where the truth shows up.
Methodology note
I analyzed Claude Code JSONL transcripts stored in object storage. The raw blob store contained 13,782 blobs and 23.8 GB. Because many blobs are repeated snapshots of growing session transcripts, the analytic corpus deduped to the latest blob per session ID: 6,120 files and 4.9 GB.
I counted assistant content blocks with type = tool_use as tool calls. I counted type = tool_result blocks separately. Transcript rows with type = user are not all human prompts; most are tool-result carrier messages. That distinction is why I report both user rows and human user rows.
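In sketch form, the dedup and counting pass looks like this (storage layout simplified to one directory per session; my real blob naming differs):

```python
import json
from collections import Counter
from pathlib import Path

# 1) Dedup: keep only the latest blob per session, so a growing transcript is
#    counted once rather than once per snapshot. The blobs/<session>/<time>.jsonl
#    layout here is a simplification of my actual storage.
latest: dict[str, Path] = {}
for blob in Path("blobs").glob("*/*.jsonl"):
    session_id = blob.parent.name
    if session_id not in latest or blob.name > latest[session_id].name:
        latest[session_id] = blob

# 2) Count row types and content blocks across the deduped corpus.
counts = Counter()
for blob in latest.values():
    for line in blob.open(encoding="utf-8"):
        row = json.loads(line)
        counts[row.get("type", "unknown")] += 1
        content = row.get("message", {}).get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") in ("tool_use", "tool_result"):
                    counts[block["type"]] += 1

print(counts.most_common())
```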
The corpus spans blob dates from 2026-01-12 through 2026-05-04, 113 calendar days inclusive, with 105 active blob days. Event timestamps inside the transcripts range from 2025-11-25 through 2026-05-04 because some session files preserve older conversation history.