
Agentic Search Is Major, Not Half

Michael Isaac
Operator. 30 yrs in enterprise AI. 4 min read

Entire's May 6, 2026 post, How We Improved Agentic Search, put a useful public number on a thing coding-agent operators feel every day: search is not a side quest.

Their public trace set:

Surface                      Count
Checkpoints                  1,983
Tool calls                   202,142
Search-related tool calls    98,555
Search share                 48.8%

Their search split:

Search bucket              Count     Share of search calls
Read / file retrieval      48,322    49.0%
Bash search fallback       23,180    23.5%
Grep / content search      23,136    23.5%
Other                      3,917     4.0%

I reran the same question against my refreshed Claude Code corpus. The answer is not identical. It is still large enough to matter.

The Replication Cut

My denominator:

  • 4,234 parsed session IDs
  • 247,592 tool events
  • timestamp range: November 25, 2025 to May 6, 2026
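For reference, the denominator is just a walk over the transcript files. A minimal sketch, assuming the sessions sit on disk as JSONL transcripts with one event per line; the field names ("sessionId", "tool_use") are my working assumptions about the log shape, not a documented schema:

```python
import json
from pathlib import Path

def count_corpus(session_dir: str):
    """Count unique parsed sessions and tool events across JSONL transcripts.
    Field names are assumptions about the log shape, not a documented schema."""
    session_ids = set()
    tool_events = 0
    for path in Path(session_dir).rglob("*.jsonl"):
        for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or malformed lines
            if "sessionId" in event:
                session_ids.add(event["sessionId"])
            msg = event.get("message") or {}
            content = msg.get("content") if isinstance(msg, dict) else None
            if isinstance(content, list):
                # treat any tool_use content block as one tool event
                tool_events += sum(
                    1 for block in content
                    if isinstance(block, dict) and block.get("type") == "tool_use"
                )
    return len(session_ids), tool_events
```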

The tricky part is Bash. A dedicated Grep call is easy to classify. A shell command can be a real search, a setup step, a test, a deployment, or a pipeline that starts with cd and searches later. So I kept three cuts:

Local definition                                Search events    Share of all tool events
Strict first-token Bash search                  75,200           30.37%
Wide first-token Bash / file-discovery search   87,236           35.23%
Strict command-text Bash search                 91,501           36.96%

The strict first-token cut counts file retrieval, dedicated grep/content search, and Bash commands whose first command token is a search verb. The wide cut includes more file-discovery verbs. The command-text cut catches shell pipelines where the search verb appears after a setup command.
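Here is a condensed sketch of the Bash side of the three cuts (the Read and dedicated Grep tool calls are classified separately and trivially). The verb sets below are illustrative stand-ins, not the exact lists in the audit repo:

```python
import re

SEARCH_VERBS = {"grep", "rg", "ag", "ack"}                      # strict: content search
DISCOVERY_VERBS = SEARCH_VERBS | {"find", "fd", "ls", "tree"}   # wide: file discovery too

def first_token(cmd: str) -> str:
    # ignore leading env assignments like FOO=bar, and sudo
    tokens = [t for t in cmd.strip().split() if "=" not in t and t != "sudo"]
    return tokens[0] if tokens else ""

def classify_bash(cmd: str) -> dict:
    """Return which of the three cuts a Bash command falls into.
    Verb lists are illustrative, not the audit repo's exact sets."""
    head = first_token(cmd)
    return {
        "strict_first_token": head in SEARCH_VERBS,
        "wide_first_token": head in DISCOVERY_VERBS,
        # command-text cut: a search verb anywhere in the pipeline,
        # e.g. "cd src && grep -rn handler ."
        "strict_command_text": bool(
            re.search(r"(^|[|;&]\s*)(grep|rg|ag|ack)\b", cmd)
        ),
    }
```

Each cut's share is then simply its search-event count divided by the 247,592 total tool events.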

That gives the cleaner headline:

In this corpus, search is major, not half.

Entire saw 48.8%. My corpus lands between 30.4% and 37.0%. The exact percentage is workload-shaped and taxonomy-shaped, but both datasets reject the same bad mental model: coding agents are not mostly final-answer generators. They spend a large fraction of their life trying to find the right thing to inspect.

Why The Numbers Differ

The corpora are different. Entire's public analysis comes from real-world development checkpoints in the open source entireio/cli repo. Mine is a single-operator Claude Code corpus across many repos, remote machines, issue-tracker work, browser checks, infrastructure commands, and long-running maintenance sessions.

The tool surfaces are also different. Entire's benchmark normalizes search through a search_code abstraction. Claude Code exposes Read, Grep, Glob, Bash, WebFetch, MCP tools, and delegated agents. Classifying "search" in that environment is inherently messier.

That messiness is the point. The agent does not care whether a file was found through Grep, rg, find, ls, cat, or a shell pipeline. It cares whether the returned slice helps it decide what to read next.

What Held Up

Entire's strongest result was not "search is exactly half." It was that faster search alone is not the main bottleneck. In their speed benchmark, faster indexed search cut median search_code latency from 14.7 ms to 1.7 ms, while wall clock moved only modestly because tool execution was a tiny share of total runtime.

My latency cuts point the same way:

Local surface                p50
Kept tool-result delta       450 ms
User-to-assistant delta      4,182 ms
Ratio                        9.3x
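The deltas are simple timestamp subtractions between adjacent events. A sketch, assuming each event carries a type and a timestamp in seconds; the type values are my assumption about the transcript shape, not a spec:

```python
from statistics import median

def p50_deltas(events):
    """events: chronologically ordered dicts with 'type' and 'ts' (seconds).
    Returns p50 tool-result delta and p50 user-to-assistant delta, in ms."""
    tool_deltas, turn_deltas = [], []
    pending_tool_ts = pending_user_ts = None
    for ev in events:
        kind, ts = ev["type"], ev["ts"]
        if kind == "tool_use":
            pending_tool_ts = ts
        elif kind == "tool_result" and pending_tool_ts is not None:
            tool_deltas.append((ts - pending_tool_ts) * 1000)
            pending_tool_ts = None
        elif kind == "user":
            pending_user_ts = ts
        elif kind == "assistant" and pending_user_ts is not None:
            turn_deltas.append((ts - pending_user_ts) * 1000)
            pending_user_ts = None
    return median(tool_deltas), median(turn_deltas)
```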

The Codex telemetry overlay is even cleaner:

Codex telemetry surface      p50
Tool dispatch                11.62 ms
Model stream request         3,601.86 ms

So I would not read the search-share result as "make grep faster and the agent is fixed." The better read is: search is frequent enough that the quality of search output shapes the loop, but raw search latency is usually not the first-order wall-clock bottleneck.

The Better Evaluation Target

Entire's pgr result is the useful direction: ranking and presentation improved first-result relevance more clearly than raw speed improved end-to-end runtime.

That matches the local first-read cut. In 1,831 edit-anchored sessions, early reading correlated with total work, not cleanly with explicit failures:

Relationship                                  Correlation
Pre-edit reads vs total tools                 0.249
Pre-edit reads vs total explicit errors       0.021
Pre-edit reads vs post-edit explicit errors   0.081

That does not say "read more and errors disappear." It says hard tasks require more exploration, and the useful question is whether the agent reaches the right inspection path sooner.
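For anyone replicating the cut: the numbers above are plain Pearson coefficients over per-session counts. A minimal sketch with illustrative field names (statistics.correlation needs Python 3.10+):

```python
from statistics import correlation  # Pearson by default

def session_correlations(sessions):
    """sessions: one dict per edit-anchored session, e.g.
    {"pre_edit_reads": 12, "total_tools": 80, "explicit_errors": 3}.
    Field names are illustrative, not an existing schema."""
    reads = [s["pre_edit_reads"] for s in sessions]
    return {
        "reads_vs_total_tools": correlation(reads, [s["total_tools"] for s in sessions]),
        "reads_vs_explicit_errors": correlation(reads, [s["explicit_errors"] for s in sessions]),
    }
```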

The Operator Protocol

If you are evaluating agentic code search, do not stop at search latency.

Track:

  1. search share of all tool calls
  2. search calls before first useful file read
  3. first relevant result rank
  4. output characters returned per search
  5. repeated reformulations before inspection
  6. downstream tool count, cost, and wall clock

The first four are the search-layer metrics. The last two are downstream system metrics. Do not collapse them into one number too early.
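A sketch of how I would keep the two layers separate in code. The record shapes and field names are mine, not an existing schema:

```python
from dataclasses import dataclass

@dataclass
class SearchLayerMetrics:
    # items 1-4: properties of the search layer itself
    search_share: float              # search calls / all tool calls
    searches_before_first_read: int  # search calls before first useful file read
    first_relevant_rank: int | None  # rank of first relevant result, None if absent
    chars_per_search: float          # output characters returned per search

@dataclass
class DownstreamMetrics:
    # items 5-6: behavior of the system downstream of search
    reformulations: int              # repeated reformulations before inspection
    tool_calls: int
    cost_usd: float
    wall_clock_s: float
```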

The practical protocol:

Optimize search for first useful inspection, not raw scan speed.
Show the agent fewer, better-ranked, better-trimmed candidates.
Then measure whether it reaches the right file sooner.
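The "fewer, better-ranked, better-trimmed" step can be as simple as a top-k cut with a per-candidate character budget. A sketch under an assumed candidate shape (path, score, snippet):

```python
def trim_candidates(candidates, k=5, char_budget=400):
    """candidates: list of dicts with 'path', 'score', 'snippet'.
    Keep the k best-ranked hits and cap each snippet so the agent
    sees fewer, shorter, better-ordered slices. Shapes are illustrative."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    return [
        {"path": c["path"], "snippet": c["snippet"][:char_budget]}
        for c in ranked
    ]
```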

That is the piece of Entire's finding I would carry into Claude Code, Codex, Cursor, Aider, or any other coding-agent loop.

Search is not the whole system. It is one of the load-bearing surfaces inside the system.

The public query factory for the local replication is here: MPIsaac-Per/claude-code-ops-audit.