Agentic Search Is Major, Not Half
Entire's May 6, 2026 post, "How We Improved Agentic Search," put a useful public number on something coding-agent operators feel every day: search is not a side quest.
Their public trace set:
| Surface | Count |
|---|---|
| Checkpoints | 1,983 |
| Tool calls | 202,142 |
| Search-related tool calls | 98,555 |
| Search share | 48.8% |
Their search split:
| Search bucket | Count | Share of search calls |
|---|---|---|
| Read / file retrieval | 48,322 | 49.0% |
| Bash search fallback | 23,180 | 23.5% |
| Grep / content search | 23,136 | 23.5% |
| Other | 3,917 | 4.0% |
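The shares in both tables can be sanity-checked directly from the raw counts. A minimal sketch, using only the numbers published above:

```python
# Counts taken from Entire's public trace tables; nothing else assumed.
total_tool_calls = 202_142
search_calls = 98_555

search_buckets = {
    "read/file retrieval": 48_322,
    "bash search fallback": 23_180,
    "grep/content search": 23_136,
    "other": 3_917,
}

# Overall search share of all tool calls.
share = search_calls / total_tool_calls
print(f"search share: {share:.1%}")  # 48.8%

# The buckets should sum back to the search-call total.
assert sum(search_buckets.values()) == search_calls

# Per-bucket share of search calls.
for name, count in search_buckets.items():
    print(f"{name}: {count / search_calls:.1%}")
```

Running this reproduces the 48.8% headline and the 49.0 / 23.5 / 23.5 / 4.0 bucket split, which confirms the two tables are internally consistent.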
I reran the same question against my refreshed Claude Code corpus. The answer is not identical, but it is still large enough to matter.
The Replication Cut
My denominator:
- 4,234 parsed session IDs
- 247,592 tool events
- timestamp range: November 25, 2025 to May 6, 2026
The tricky part is Bash. A dedicated Grep call is easy to classify. A shell command can be a real search, a setup step, a test, a deployment, or a pipeline that starts with cd and searches later. So I kept three cuts:
| Local definition | Search events | Share of all tool events |
|---|---|---|
| Strict first-token Bash search | 75,200 | 30.37% |
| Wide first-token Bash / file-discovery search | 87,236 | 35.23% |
| Strict command-text Bash search | 91,501 | 36.96% |
The strict first-token cut counts file retrieval, dedicated grep/content search, and Bash commands whose first command token is a search verb. The wide cut includes more file-discovery verbs. The command-text cut catches shell pipelines where the search verb appears after a setup command.
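The Bash piece of those cuts can be sketched in a few lines. This is a minimal illustration with hypothetical verb lists; the actual taxonomy lives in the query-factory repo and may differ, and the real strict cut also counts Read and Grep tool calls, which this sketch omits:

```python
import shlex

# Hypothetical verb lists -- illustrative, not the audit repo's real taxonomy.
STRICT_SEARCH_VERBS = {"grep", "rg", "ag", "find", "fd"}
WIDE_EXTRA_VERBS = {"ls", "tree", "locate", "which"}

def classify_bash(command: str) -> set[str]:
    """Return which of the three local cuts a Bash command falls into."""
    cuts: set[str] = set()
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes etc.
        return cuts
    if not tokens:
        return cuts
    # Strict first-token cut: the command starts with a search verb.
    if tokens[0] in STRICT_SEARCH_VERBS:
        cuts.add("strict_first_token")
    # Wide cut: also accept file-discovery verbs in first position.
    if tokens[0] in STRICT_SEARCH_VERBS | WIDE_EXTRA_VERBS:
        cuts.add("wide_first_token")
    # Command-text cut: a search verb anywhere in the pipeline,
    # e.g. "cd src && grep -rn pattern ."
    if any(tok in STRICT_SEARCH_VERBS for tok in tokens):
        cuts.add("command_text")
    return cuts

print(sorted(classify_bash("grep -rn TODO src")))
print(sorted(classify_bash("cd src && grep -rn TODO .")))
```

The second example is exactly why the command-text cut produces the largest count: the pipeline searches, but its first token is a setup step.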
That gives the cleaner headline:
In this corpus, search is major, not half.
Entire saw 48.8%. My corpus lands between 30.4% and 37.0%. The exact percentage is workload-shaped and taxonomy-shaped, but both datasets reject the same bad mental model: coding agents are not mostly final-answer generators. They spend a large fraction of their life trying to find the right thing to inspect.
Why The Numbers Differ
The corpora are different. Entire's public analysis comes from real-world development checkpoints from the open source entireio/cli repo. Mine is a single-operator Claude Code corpus across many repos, remote machines, issue-tracker work, browser checks, infrastructure commands, and long-running maintenance sessions.
The tool surfaces are also different. Entire's benchmark normalizes search through a search_code abstraction. Claude Code exposes Read, Grep, Glob, Bash, WebFetch, MCP tools, and delegated agents. Classifying "search" in that environment is inherently messier.
That messiness is the point. The agent does not care whether a file was found through Grep, rg, find, ls, cat, or a shell pipeline. It cares whether the returned slice helps it decide what to read next.
What Held Up
Entire's strongest result was not "search is exactly half." It was that faster search alone is not the main bottleneck. In their speed benchmark, faster indexed search cut median search_code latency from 14.7 ms to 1.7 ms, while wall clock moved only modestly because tool execution was a tiny share of total runtime.
My latency cuts point the same way:
| Local surface | p50 |
|---|---|
| Kept tool-result delta | 450 ms |
| User-to-assistant delta | 4,182 ms |
| Ratio (user-to-assistant / tool-result) | 9.3x |
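The table reduces to a median-of-deltas computation. A minimal sketch with toy latency values chosen only to illustrate the shape of the calculation (the real corpus feeds thousands of per-event deltas into the same two medians):

```python
from statistics import median

# Toy per-event deltas in milliseconds -- illustrative placeholders,
# not the corpus data.
tool_result_deltas = [310, 450, 520, 460, 440]
user_to_assistant_deltas = [3_900, 4_182, 4_500]

p50_tool = median(tool_result_deltas)
p50_model = median(user_to_assistant_deltas)

print(f"tool p50:  {p50_tool} ms")
print(f"model p50: {p50_model} ms")
print(f"ratio:     {p50_model / p50_tool:.1f}x")
```

The point of the ratio is that even a large improvement in the tool-side p50 barely moves the sum of the two, because the model-side delta dominates.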
The Codex telemetry overlay is even cleaner:
| Codex telemetry surface | p50 |
|---|---|
| Tool dispatch | 11.62 ms |
| Model stream request | 3,601.86 ms |
So I would not read the search-share result as "make grep faster and the agent is fixed." The better read is: search is frequent enough that the quality of search output shapes the loop, but raw search latency is usually not the first-order wall-clock bottleneck.
The Better Evaluation Target
Entire's pgr result is the useful direction: ranking and presentation improved first-result relevance more clearly than raw speed improved end-to-end runtime.
That matches the local first-read cut. In 1,831 edit-anchored sessions, early reading correlated with total work, not cleanly with explicit failures:
| Relationship | Correlation |
|---|---|
| Pre-edit reads vs total tools | 0.249 |
| Pre-edit reads vs total explicit errors | 0.021 |
| Pre-edit reads vs post-edit explicit errors | 0.081 |
That does not say "read more and errors disappear." It says hard tasks require more exploration, and the useful question is whether the agent reaches the right inspection path sooner.
The Operator Protocol
If you are evaluating agentic code search, do not stop at search latency.
Track:
- search share of all tool calls
- search calls before first useful file read
- first relevant result rank
- output characters returned per search
- repeated reformulations before inspection
- downstream tool count, cost, and wall clock
The first four are the search-layer metrics. The last two are downstream system metrics. Do not collapse them into one number too early.
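Several of the search-layer metrics fall out of a single pass over a trace. A minimal sketch with a hypothetical `ToolEvent` schema (real agent logs will differ), covering the search share, searches-before-first-useful-read, and output-characters-per-search metrics:

```python
from dataclasses import dataclass

# Hypothetical trace schema -- field names are assumptions for illustration.
@dataclass
class ToolEvent:
    name: str         # e.g. "Grep", "Read", "Bash"
    is_search: bool   # per whatever search taxonomy you settled on
    output_chars: int
    useful: bool      # did the agent act on this result?

def search_layer_metrics(trace: list[ToolEvent]) -> dict:
    searches = [e for e in trace if e.is_search]
    # Count search calls issued before the first useful file read.
    before_useful = 0
    for e in trace:
        if e.name == "Read" and e.useful:
            break
        if e.is_search:
            before_useful += 1
    return {
        "search_share": len(searches) / len(trace),
        "searches_before_first_useful_read": before_useful,
        "chars_per_search": sum(e.output_chars for e in searches)
        / max(len(searches), 1),
    }

trace = [
    ToolEvent("Grep", True, 1_200, False),
    ToolEvent("Grep", True, 300, False),
    ToolEvent("Read", False, 4_000, True),
    ToolEvent("Bash", True, 800, False),
]
print(search_layer_metrics(trace))
```

First-result rank and reformulation counts need result-level labels rather than event-level flags, which is why they belong in the search layer's own evaluation harness rather than in trace post-processing.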
The practical protocol:
- Optimize search for first useful inspection, not raw scan speed.
- Show the agent fewer, better-ranked, better-trimmed candidates.
- Then measure whether it reaches the right file sooner.
That is the piece of Entire's finding I would carry into Claude Code, Codex, Cursor, Aider, or any other coding-agent loop.
Search is not the whole system. It is one of the load-bearing surfaces inside the system.
The public query factory for the local replication is here: MPIsaac-Per/claude-code-ops-audit.