
What I Learned From 245,306 Claude Code Tool Calls
I analyzed the last 113 days of my Claude Code usage.
Not vibes. Not "here is how I feel after using AI coding tools." The actual transcript corpus: every user prompt, assistant message, shell command, file read, edit, browser action, issue-tracker call, web lookup, and tool result.
One disclosure first: this is my own corpus. I work as a near-lone developer across many applications, services, and infrastructure components on a single technology platform, rather than inside a single app and a single repo. The setup leans heavily on SSH, Docker, and remote orchestration because most days the work spans multiple servers and codebases. A typical single-app developer's corpus will skew differently: less shell, less SSH, more local file work, smaller per-session tool counts. Read what follows as observations from one vantage, not a survey of Claude Code users.
The raw blob store held 13,782 transcript blobs and 23.8 GB of JSONL. For analysis, I deduped it to the latest blob per session so growing transcripts were not counted over and over.
That left:
- 6,120 latest session files
- 4.9 GB of transcript data
- 1,481,120 JSONL rows
- 44,212 human user rows
- 446,639 assistant rows
- 245,306 tool-use blocks
- 244,954 tool-result blocks
- 113 calendar days, 105 active days
Here is the thing I did not appreciate until the data was sitting in front of me:
Claude Code is not a chat product with tools attached. It is an operating loop.
The chat is the smallest part of the system. The real action is the loop:
- Think.
- Use a tool.
- Read the result.
- Change the plan.
- Use another tool.
- Repeat until the world is different.
That sounds obvious. It is not. Most people still evaluate AI coding tools by reading model answers. That is like judging an engineer by reading their Slack messages and never watching their terminal.
The terminal is the product.
1. The "user" is not always the user
The first weird lesson is structural.
In the corpus, there were 289,080 rows with type = user. Only 44,212 of those were actual human user rows.
Roughly 244,800 of the remainder are tool-result carrier messages: the environment handing stdout, diffs, browser snapshots, errors, API responses, and file contents back into the conversation. A small residual is made up of subagent prompts (each delegated agent gets its instruction as a user row in its own transcript) and injected system reminders, which also arrive wrapped as type = user messages.
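If you want to reproduce the split, the check is simple. A minimal sketch, assuming the row layout in my export (a top-level type field, plus a message.content that is either the prompt string or a list of content blocks); it does not separate out the small residual of subagent prompts and injected reminders, which also look like plain text:

```python
def is_human_prompt(row: dict) -> bool:
    """True for rows a person actually typed; False for type = user rows
    that only carry tool_result blocks back into the conversation."""
    if row.get("type") != "user":
        return False
    content = row.get("message", {}).get("content")
    if isinstance(content, str):
        return True  # plain prompt text
    return isinstance(content, list) and not any(
        isinstance(block, dict) and block.get("type") == "tool_result"
        for block in content
    )
```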
Once you see that, you stop thinking of the transcript as a chat log. It is closer to an execution trace.
The human gives direction. The assistant proposes action. The tools report reality. The next assistant turn is shaped by that reality.
That is the whole game.
If you want better outcomes, do not just write better prompts. Improve the quality of the reality the model gets back:
- clearer test output
- cleaner CLI commands
- deterministic scripts
- structured tool responses
- less noisy logs
- smaller, more focused diffs
- better error messages
The model is only as good as the loop it is trapped inside.
2. Claude Code is a shell that edits, not an editor that shells out
The most used tool was not Edit. It was not Read. It was not a planner.
It was Bash.
Top tool families:
| Tool family | Calls |
|---|---|
| Shell | 115,915 |
| File edit/read/write | 74,345 |
| File search | 22,705 |
| Planning | 13,490 |
| Delegation/subagents | 3,480 |
| Linear/issue tracker MCP | 3,197 |
| Web search/fetch | 2,954 |
| Browser MCP | 1,360 |
| Other (Skill, Monitor, smaller MCPs) | 7,860 |
Top individual tools:
| Tool | Calls |
|---|---|
| Bash | 115,743 |
| Read | 42,309 |
| Edit | 25,091 |
| Grep | 17,014 |
| Write | 6,945 |
| TodoWrite | 6,417 |
| Glob | 5,687 |
| TaskUpdate | 4,667 |
| TaskCreate | 2,406 |
| Task / Agent | 3,480 combined |
Bash alone was 47.2% of all tool calls.
That completely changes how I think about "AI coding."
Claude Code is not primarily a code editor. It is a Unix operator with an LLM loop. It edits files, yes, but the edits are downstream of reading, searching, testing, building, grepping, running CLIs, checking services, and asking the operating system what is true.
The top shell command verbs make that even clearer:
| Command verb | Calls |
|---|---|
| docker | 16,958 |
| ssh | 10,816 |
| git | 10,605 |
| pnpm | 7,364 |
| cd | 7,222 |
| ls | 6,883 |
| grep | 5,606 |
| cat | 3,655 |
| curl | 2,975 |
| find | 2,716 |
| gh | 1,994 |
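The verb counts come from taking the first whitespace-delimited token of each Bash call's command input, which is also why cd shows up as a verb at all. A simplified sketch, not the exact script I ran, with the input field names as they appear in my transcripts:

```python
import json
from collections import Counter
from pathlib import Path

verbs = Counter()
for path in Path("sessions").glob("*.jsonl"):      # one deduped file per session
    for line in path.open(encoding="utf-8"):
        row = json.loads(line)
        content = row.get("message", {}).get("content")
        if row.get("type") != "assistant" or not isinstance(content, list):
            continue
        for block in content:
            if (isinstance(block, dict) and block.get("type") == "tool_use"
                    and block.get("name") == "Bash"):
                command = (block.get("input", {}).get("command") or "").strip()
                if command:
                    verbs[command.split()[0]] += 1  # "cd app && pnpm test" counts as cd

print(verbs.most_common(15))
```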
This is not "autocomplete, but better."
This is an agent operating the same surfaces a senior engineer operates: shell, git, package manager, Docker, SSH, HTTP, issue tracker, browser, files.
If your shell environment is messy, your AI coding environment is messy. If your commands are slow, ambiguous, or inconsistent, the model inherits that friction. If your repo has no reliable test command, the model has no reliable way to know whether it is done.
The practical advice is boring and powerful:
Invest in your CLI surface area.
Write scripts. Standardize commands. Make tests cheap to run. Give the agent a clean Makefile, package.json, justfile, or task runner. Make the right thing easy to execute.
3. Most engineering is reading
File tools tell the same story.
Read happened 42,309 times. Edit happened 25,091 times. Write happened 6,945 times.
Read + Grep + Glob together accounted for 65,010 calls before you even count ls, cat, find, and shell-level search.
That ratio feels right. Good engineering is mostly building a local model of the system before touching it.
The most touched file types were exactly what you would expect from real application work:
| File type | Relative volume |
|---|---|
| TypeScript | dominant |
| Markdown | heavy |
| TSX | heavy |
| Python | substantial |
| JSON/YAML/Shell | frequent support work |
The model that reads more before editing is usually the model you want. "It immediately patched the file" feels fast. "It inspected the call chain, searched the tests, read the config, then patched the file" is usually cheaper.
This is one place where product friction is good. Claude Code's habit of reading before editing is not ceremony. It is the difference between code mutation and code work.
4. The distribution is insane
A "session" here means one session_id, collapsed across blob snapshots so a long-running session counts once rather than once per autosave. On that definition:
The median session had 1 tool call.
The 90th percentile session had 106 tool calls.
The 99th percentile session had 481 tool calls.
The largest session had 2,206 tool calls.
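The distribution falls out of grouping tool_use blocks by session and taking quantiles. A simplified sketch, assuming rows carry a sessionId field the way mine do:

```python
import json
from collections import Counter
from pathlib import Path
from statistics import median, quantiles

sessions: set[str] = set()
tool_calls = Counter()

for path in Path("sessions").glob("*.jsonl"):       # already deduped per session
    for line in path.open(encoding="utf-8"):
        row = json.loads(line)
        sid = row.get("sessionId", path.stem)
        sessions.add(sid)
        content = row.get("message", {}).get("content")
        if row.get("type") == "assistant" and isinstance(content, list):
            tool_calls[sid] += sum(
                1 for b in content
                if isinstance(b, dict) and b.get("type") == "tool_use"
            )

counts = sorted(tool_calls.get(s, 0) for s in sessions)   # zero-call sessions included
cuts = quantiles(counts, n=100)                           # cuts[89] ~ p90, cuts[98] ~ p99
print("median", median(counts), "p90", cuts[89], "p99", cuts[98], "max", counts[-1])
```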
That is the power-law shape of real agentic work.
Most sessions are small. You ask a question, inspect a file, run one command, get an answer. But the value frontier is in the long tail: the sessions where the agent stays with a problem across hundreds or thousands of environmental observations.
Those long sessions do not feel like "chatting with an AI."
They feel like an engineer working:
- inspect
- hypothesize
- patch
- test
- hit a failure
- inspect again
- patch again
- check logs
- chase config
- compare branches
- post the comment
- run the suite
- verify the deployment
The UX question is not "how do we make the model answer better?"
It is "how do we keep a long-running loop grounded, auditable, and recoverable?"
That means resumability. Checkpoints. Clear plans. Tool-result compression. Better failure display. Safer permissioning. Good stop conditions. Real verification.
The long tail is where the product becomes an operating system.
5. Errors are not edge cases. They are the work.
There were 244,954 tool-result blocks.
18,778 were explicitly marked is_error = true, about 7.7%. The explicit flag undercounts: a failing test run is just stdout from pnpm test, and a 401 is a field in a response body, not an is_error bit. So I also bucketed tool-result content by keyword presence. Buckets are not exclusive, so one result can match several:
| Keyword bucket | Matches | Reads as |
|---|---|---|
| error/exception/failed | 55,108 | mostly real failures, with some "no errors found" false positives |
| permission/auth/401/403 | 50,538 | mix of auth failures and successful auth handshakes mentioning the words |
| rate limit/timeout/429 | 17,322 | mostly real throttling and latency events |
| not found/404/no such file | 16,116 | mostly real misses |
| CI/test/build/lint/check | 76,185 | operational surface, not specifically error; includes successful runs |
| git/PR/commit/merge | 30,711 | operational surface, not specifically error; includes successful operations |
The first four buckets give a rough lower bound on error-shaped reality. The last two are operational signal: how much of the loop lives in CI, test, build, and version-control surfaces. Both stories matter for the thesis.
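The bucketing itself is nothing clever: case-insensitive keyword matches over tool-result text, with one result allowed to land in several buckets. The patterns below are a reconstruction from the bucket names, not the exact lists I used:

```python
import re

# Non-exclusive keyword buckets mirroring the table above.
BUCKETS = {
    "error/exception/failed":     re.compile(r"error|exception|failed", re.I),
    "permission/auth/401/403":    re.compile(r"permission|auth|\b401\b|\b403\b", re.I),
    "rate limit/timeout/429":     re.compile(r"rate.?limit|timeout|\b429\b", re.I),
    "not found/404/no such file": re.compile(r"not found|\b404\b|no such file", re.I),
    "ci/test/build/lint/check":   re.compile(r"\bci\b|test|build|lint|check", re.I),
    "git/pr/commit/merge":        re.compile(r"\bgit\b|\bpr\b|commit|merge", re.I),
}

def buckets_for(result_text: str) -> list[str]:
    """Every bucket a tool-result text matches; one result can land in several."""
    return [name for name, pattern in BUCKETS.items() if pattern.search(result_text)]

# e.g. "pnpm test failed with 3 errors" lands in both the error bucket
# and the CI/test bucket.
```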
This is why demo videos are misleading. The happy path is not the product.
The product is how the system behaves after the first failure.
Does it read the error or paper over it? Does it retry the same command blindly? Does it narrow the scope? Does it inspect the file that actually failed? Does it notice auth problems and stop wasting calls? Does it distinguish test failure from environment failure?
The best Claude Code sessions are not the ones with no errors. They are the ones where errors become high-quality steering signals.
You want tools that return structured failure. You want tests that fail loudly and locally. You want logs that include the file, line, command, and next diagnostic move.
The agent does not need a frictionless world. It needs a legible one.
6. Planning is visible, but action dominates
Planning tools showed up meaningfully:
- TodoWrite: 6,417
- TaskUpdate: 4,667
- TaskCreate: 2,406
That is 13,490 planning-family tool calls.
Useful, but still dwarfed by shell and file operations.
This matches my experience: planning is most valuable when it is tied to execution state. A plan written in prose is easy to ignore. A task list that moves from pending to in-progress to completed is harder to fake.
The lesson is not "plan more."
The lesson is "make plans operational."
Good agent plans should answer:
- What is the next command?
- What file is being inspected?
- What would count as verification?
- What is blocked?
- What is explicitly out of scope?
Plans that do not constrain tool use are just narrative.
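What "operational" means is easier to show than to define. Here is a hypothetical plan entry shaped like those questions; the field names are illustrative, not Claude Code's task schema:

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    """A hypothetical execution-shaped plan entry. Field names are mine."""
    goal: str
    next_command: str                 # the exact command to run next
    files_under_inspection: list[str] = field(default_factory=list)
    verification: str = ""            # what output would count as done
    blocked_on: str = ""              # empty means not blocked
    out_of_scope: list[str] = field(default_factory=list)
    status: str = "pending"           # pending -> in_progress -> completed

step = PlanStep(
    goal="Fix the failing auth test",
    next_command="pnpm test -- auth.spec.ts",
    files_under_inspection=["src/auth/session.ts"],
    verification="pnpm test exits 0 with no skipped auth specs",
    out_of_scope=["refactoring the session store"],
)
```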
7. The web is not the center. It is a specialist tool.
WebSearch and WebFetch appeared 2,954 times combined. Browser MCP tools appeared 1,360 times. Issue-tracker MCP tools appeared 3,197 times.
That is meaningful, but it is tiny compared with shell and filesystem work.
This surprised me a little. I expected more web.
But the pattern makes sense. For coding work, the highest-value context is usually local:
- the repo
- the tests
- the config
- the package manager
- the logs
- the branch diff
- the live service
The web matters when facts are unstable: docs, API versions, current vendor behavior, release notes, pricing, auth flows. It is not a replacement for understanding the system under your hands.
The practical rule I use now:
Search the web for reality that changes outside the repo. Search the repo for reality that already exists inside it.
8. Prompt caching is not an optimization. It is infrastructure.
The token counters are absurd:
- Input tokens (fresh per-turn delta): 98.7 million
- Output tokens: 123.3 million
- Cache creation input tokens: 2.0 billion
- Cache read input tokens: 57.8 billion
Output exceeding fresh input is not a paradox. The "input tokens" line is only the non-cached delta the model sees each turn. The real context the model conditions on is dominated by the cache: 57.8B cache reads against 98.7M fresh input is roughly a 585x ratio. Most of what the model is reading per turn is replayed cached state, not new typing.
Do not overread these as billing numbers. They come from transcript usage objects and should be treated as telemetry, not accounting.
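For reference, the totals above are just sums over the usage object attached to assistant rows. A sketch, assuming the Anthropic-style field names that appear in my transcripts:

```python
import json
from collections import Counter
from pathlib import Path

USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)

totals = Counter()
for path in Path("sessions").glob("*.jsonl"):
    for line in path.open(encoding="utf-8"):
        row = json.loads(line)
        usage = row.get("message", {}).get("usage") or {}
        for fld in USAGE_FIELDS:
            totals[fld] += usage.get(fld) or 0

ratio = totals["cache_read_input_tokens"] / max(totals["input_tokens"], 1)
print(totals, f"cache reads / fresh input ~ {ratio:.0f}x")
```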
But directionally, the point is obvious:
Long-running agentic coding depends on reusing context.
Without caching, the economics and latency of these workflows look very different. With caching, you can keep large system prompts, tool definitions, repo instructions, and accumulated context in play without paying the full cost every turn.
This is one reason "just paste the relevant files into chat" is the wrong mental model. Agentic work is not a one-shot prompt. It is a stateful loop with repeated context reuse.
9. Real agentic work crosses machine boundaries
Transcript rows (every human message, assistant message, and tool result) by host environment:
| Environment | Rows |
|---|---|
| Server | 964,422 |
| MacBook/local | 382,346 |
| Other | 50,609 |
This is a row count, not a tool-call count, and it reflects my own setup: workers running on remote machines emit a lot of transcript volume per task. Read the table as "agentic work happens off the laptop too," not as "most coding work is on servers."
Again: this is not a local editor story.
The corpus includes work across laptops, remote servers, production-like paths, issue trackers, browser sessions, package managers, SSH, Docker, and GitHub-style workflows.
The future of AI coding is not just "the IDE gets smarter."
The future is that the agent can operate wherever the work actually lives, with permissions, auditability, and enough environmental context to avoid becoming dangerous.
Local-first matters. Remote execution matters. Source-of-truth storage matters. Credentials matter. RBAC matters. Logs matter.
The agent is not replacing the software delivery system. It is becoming a new operator inside it.
10. The biggest risk is plausible completion
After 245,306 tool calls, my strongest opinion is this:
The dangerous failure mode is not that the model cannot do the work.
The dangerous failure mode is that it can do 80% of the work, narrate the remaining 20% convincingly, and stop.
That is why verification has to become a first-class workflow object.
For any non-trivial task, the final answer should be grounded in evidence:
- What changed?
- What command verified it?
- What failed?
- What remains untested?
- What assumption is still open?
This is where hooks, checklists, CI, and strict completion rules matter. Not because they make the model smarter. Because they make it harder for the loop to end on vibes.
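One way to make that concrete is a completion gate that refuses "done" until those questions have answers. A hypothetical sketch, not a Claude Code feature; the names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class CompletionEvidence:
    """Hypothetical evidence a task must carry before it can be closed."""
    changed: list[str] = field(default_factory=list)  # files or resources touched
    verify_command: str = ""                          # command that checked the change
    verify_output_ok: bool = False                    # did that command actually pass?
    untested: list[str] = field(default_factory=list)
    open_assumptions: list[str] = field(default_factory=list)

def can_close(ev: CompletionEvidence) -> tuple[bool, str]:
    """Completion requires evidence, not narration."""
    if not ev.changed:
        return False, "nothing recorded as changed"
    if not ev.verify_command:
        return False, "no verification command was run"
    if not ev.verify_output_ok:
        return False, f"verification not confirmed: {ev.verify_command}"
    if ev.untested or ev.open_assumptions:
        return False, "untested paths or open assumptions remain; force a human decision"
    return True, "ok"
```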
11. The new skill is not prompting. It is loop design.
Prompting still matters.
But after this analysis, I think "prompt engineering" is too small a frame for serious coding agents.
The higher-leverage work is loop design:
- What tools exist?
- What do they return?
- Which tools can run in parallel?
- What permissions are required?
- What context is loaded by default?
- What gets cached?
- What gets written to memory?
- What must be verified before completion?
- What should stop the agent?
- What should force a human decision?
The model is one component. The loop is the system.
Most bad agent behavior is not a model problem in isolation. It is a loop problem:
- bad tool output
- missing tests
- unclear permissions
- stale docs
- hidden state
- noisy logs
- no stop condition
- no verification gate
If you want better AI coding results, improve the loop.
12. The practical operating rules I took away
Here is the short version I would give any team rolling this out seriously:
- Treat Claude Code as a shell operator, not just a coding assistant.
- Make repo commands boring, fast, and documented.
- Prefer structured tool output over prose.
- Make the agent read before it edits.
- Keep plans tied to executable next steps.
- Use subagents for independent fan-out, not vague delegation.
- Search the web for unstable external facts; search the repo for local truth.
- Make failure outputs legible.
- Never let "done" mean "the model says it is done."
- Track the loop, not just the prompt.
The punchline
The real shift is not that AI can write code.
The real shift is that AI can operate the development environment.
That is much bigger and much messier.
Writing code is one act inside software work. Operating the loop includes reading, testing, failing, searching, editing, debugging, checking tickets, using the browser, moving across machines, and proving the result.
245,306 tool calls later, I trust the loop more than the answer.
The answer is cheap.
The loop is where the truth shows up.
Methodology note
I analyzed Claude Code JSONL transcripts stored in object storage. The raw blob store contained 13,782 blobs and 23.8 GB. Because many blobs are repeated snapshots of growing session transcripts, the analytic corpus deduped to the latest blob per session ID: 6,120 files and 4.9 GB.
I counted assistant content blocks with type = tool_use as tool calls. I counted type = tool_result blocks separately. Transcript rows with type = user are not all human prompts; most are tool-result carrier messages. That distinction is why I report both user rows and human user rows.
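In sketch form, the dedup and counting pass looks like this (storage layout simplified to one directory per session; my real blob naming differs):

```python
import json
from collections import Counter
from pathlib import Path

# 1) Dedup: keep only the latest blob per session, so a growing transcript is
#    counted once rather than once per snapshot. The blobs/<session>/<time>.jsonl
#    layout here is a simplification of my actual storage.
latest: dict[str, Path] = {}
for blob in Path("blobs").glob("*/*.jsonl"):
    session_id = blob.parent.name
    if session_id not in latest or blob.name > latest[session_id].name:
        latest[session_id] = blob

# 2) Count row types and content blocks across the deduped corpus.
counts = Counter()
for blob in latest.values():
    for line in blob.open(encoding="utf-8"):
        row = json.loads(line)
        counts[row.get("type", "unknown")] += 1
        content = row.get("message", {}).get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") in ("tool_use", "tool_result"):
                    counts[block["type"]] += 1

print(counts.most_common())
```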
The corpus spans blob dates from 2026-01-12 through 2026-05-04, 113 calendar days inclusive, with 105 active blob days. Event timestamps inside the transcripts range from 2025-11-25 through 2026-05-04 because some session files preserve older conversation history.