
fpk: F-Bombs Per Thousand. The Dev-Experience Metric You Didn't Know You Needed

Michael Isaac
Operator. 30 yrs in enterprise AI. 8 min read

I scanned 5 months of my own Claude Code conversation logs for f-bombs and correlated the rate with model and CLI version. The result was a surprisingly clean DX gradient - and a metric I'm only half-joking about.


The premise

I keep every Claude Code conversation. Five months of JSONL logs sit on disk: 6,120 sessions, 44,212 human-typed prompts, 245,306 tool calls, ~5 GB of raw transcript data.

I look at this corpus often, usually for tool-use patterns or agent-loop archaeology. Last night I went looking for something less serious: how often do I swear at it?

Specifically: do I swear at some Claude models more than others? Do I swear less on newer Claude Code CLI builds than older ones?

The answer turned out to be more interesting than I expected. So I formalized the metric and gave it a name.

fpk: f-bombs per 1,000 prompts.

This post is a writeup of how I measured it, what I found, and why I think the joke might actually be a real metric.


Methodology

The corpus is native Claude Code JSONL transcripts. Each row is a JSON object with a type (user, assistant, system, attachment, etc.), a message, and metadata including version, timestamp, and sessionId.

The first job was deciding what counts as "the human typing." A naive grep over user rows would catch tool results too, because tool outputs come back as user-role messages. The Claude Code parsing convention is:

A row is a human user row when its type=user AND its content is not entirely tool-result blocks.

I applied that filter exactly. Then I stripped out the wrapper tags Claude Code injects into user content: <system-reminder>..., <command-name>..., <user-prompt-submit-hook>.... Those are harness output, not me typing.
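In Python, that filter looks roughly like this. It's a sketch, not the exact script I ran: the helper name is mine, and it assumes the message content is either a plain string or a list of typed blocks, which is the shape described above.

```python
import re

# Wrapper tags Claude Code injects into user content: harness output, not me typing.
WRAPPER_RE = re.compile(
    r"<(system-reminder|command-name|user-prompt-submit-hook)>.*?</\1>",
    re.DOTALL,
)

def human_text(row: dict) -> str | None:
    """Return the human-typed text of a JSONL row, or None if the row
    isn't a human prompt (assistant/system rows, pure tool-result rows)."""
    if row.get("type") != "user":
        return None
    content = row.get("message", {}).get("content", "")
    if isinstance(content, str):
        text = content
    else:
        blocks = [b for b in content if isinstance(b, dict)]
        # Tool outputs come back as user-role rows made entirely of
        # tool_result blocks; those are not the human typing.
        if blocks and all(b.get("type") == "tool_result" for b in blocks):
            return None
        text = " ".join(b.get("text", "") for b in blocks if b.get("type") == "text")
    # Strip wrapper tags so harness text can't trigger a match.
    return WRAPPER_RE.sub("", text).strip()
```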

Then I ran regex for the f-word vocabulary:

  • \bf+u+c+k+\w*\b for "fuck" and inflections
  • \bmother\s*f+u+c+k+\w*\b for "motherfucker" variants
  • \bf[\*#@!]{1,3}ck\w*\b for censored forms
  • \b(?:fck|fckn|fckin|fcking|fkin|fking|fkn)\w*\b for abbreviations
  • \b(?:wtf|stfu|mfer|mfers|mofo|fubar|gtfo)\b for the obvious abbreviated forms
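Compiled into one Python pattern, the vocabulary looks roughly like this. This is a sketch of the final pattern set, after the false-positive pruning described next; an earlier iteration also matched the tokens that caused those false positives.

```python
import re

# F-word vocabulary as a single alternation. Case-insensitive, because
# a lot of the rage arrives in all caps.
FBOMB_RE = re.compile(
    r"""
      \b f+u+c+k+ \w* \b                                                # fuck + inflections
    | \b mother \s* f+u+c+k+ \w* \b                                     # motherfucker variants
    | \b f [\*\#@!]{1,3} ck \w* \b                                      # censored forms
    | \b (?: fck | fckn | fckin | fcking | fkin | fking | fkn ) \w* \b  # abbreviations
    | \b (?: wtf | stfu | mfer | mfers | mofo | fubar | gtfo ) \b       # obvious shorthand
    """,
    re.IGNORECASE | re.VERBOSE,
)

def count_fbombs(text: str) -> int:
    return len(FBOMB_RE.findall(text))
```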

The first pass turned up 1,565 hits, but inspection showed two big false-positive sources: FK (foreign key, in database schema discussions) and -af (a common CLI flag). I dropped those tokens from the patterns. Final count: 1,308 f-bombs across 44,212 human prompts.

That's an overall rate of 29.58 fpk, one f-bomb every 34 prompts. About once an hour at my prompt cadence.

The complete sample distribution:

  • Total f-bombs: 1,308
  • Sessions with at least one f-bomb: 329 (5.4% of all sessions)
  • Calendar span: Dec 12, 2025 to May 4, 2026
  • Active days: 105

By month:

Month     F-bombs
2025-12   111
2026-01   495  ← peak
2026-02   366
2026-03   233
2026-04   99
2026-05   4 (4 days)

January was peak rage. April was the calmest full month. May is on track for an even lower number. Something is improving.


Greatest hits

A sample of f-bomb-bearing prompts, verbatim:

"Fuck screenshots."

"WTF happened to my fucking chrome profile?!?!?"

"DUDE! WE ONLY USE DOCKER. WTF"

"wtf are you doing?!?! I REBUILT THE FUCKING..."

"I need to login but am in some fuckin google loop..."

A pattern emerges. Most of the swearing is not at the model's reasoning. It's at environmental friction: browser profiles, login loops, screenshot scaffolding, infrastructure that should "just work." The model is the one I'm typing to, but it's not always the thing I'm angry at. It's the closest available target.

(This is going to matter when we look at per-model rates.)


fpk by Claude model

For each f-bomb-bearing prompt, I attributed it to the model of the next assistant turn in that session (i.e. the orchestrator that was about to read the message). If no assistant turn followed in the session, I fell back to the previous one.
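In code, that rule is a forward scan with a backward fallback. A minimal sketch, assuming the model id sits at message.model on assistant rows and that `rows` is one session's JSONL rows in order; the helper name is mine:

```python
def attribute_model(rows: list[dict], i: int) -> str | None:
    """Attribute the prompt at rows[i] to the assistant model about to read it:
    the next assistant row in the session, else the previous one."""
    def model_of(row: dict) -> str | None:
        if row.get("type") == "assistant":
            return row.get("message", {}).get("model")
        return None

    # Forward first: the orchestrator that reads the message.
    for row in rows[i + 1:]:
        if (m := model_of(row)):
            return m
    # Nothing followed (the session ended on the prompt): fall back to the previous turn.
    for row in reversed(rows[:i]):
        if (m := model_of(row)):
            return m
    return None
```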

Model               F-bombs   Prompts   fpk
claude-opus-4-5     649       17,028    38.11
claude-opus-4-6     530       15,324    34.59
claude-sonnet-4-6   8         348       22.99
claude-sonnet-4-5   34        1,603     21.21
glm-4.7             36        1,998     18.02
claude-opus-4-7     43        3,872     11.11
<synthetic>         7         1,441     4.86
claude-haiku-4-5    0         701       0.00

Two things jump out.

First: Opus 4.7 has roughly one-third the fpk of Opus 4.5. A 3.4x reduction in observable rage across one model family, controlling for the same operator and the same workflow.

That's a striking gradient. It does not prove the model "got better" in any specific dimension. It does not even prove the model is the cause; the Opus 4.5 era was also the era when the work I was doing was hardest and the surrounding tooling was earliest. Hard work + earlier model + earlier tooling all compound.

But the gradient is consistent with the experience. Opus 4.7 trips me up less. I yell at it less. When I do, the f-bomb is more likely to be at infrastructure than at the model.

Second: Haiku has an fpk of 0.00 - zero f-bombs across 701 prompts. This is not because Haiku is a calming presence. It is because Haiku is never the orchestrator in my setup. Haiku runs as a background subagent, dispatched by Opus or Sonnet for narrow tasks. It never sits in the seat that reads my next message. By the time something has gone wrong with a Haiku-driven subtask, I am yelling at the Opus orchestrator that dispatched it, not at Haiku.

The seat matters more than the model. The model that catches the wrath is the one reading your message first.

This is a real architectural observation that the metric makes legible. If you redesigned my agent setup to put Haiku in the orchestrator seat for some workloads, Haiku's fpk would not stay at zero.


fpk by Claude Code CLI version

The Claude Code CLI shipped ~100 distinct versions across the corpus. Per-version rates are noisy at small sample sizes, so I bucketed by version era:

Era                  F-bombs   Prompts   fpk
2.0.x (legacy)       180       8,124     22.16
2.1.0-29 (early)     540       15,692    34.41
2.1.30-69 (mid)      292       7,295     40.03
2.1.70-99 (late)     229       7,520     30.45
2.1.100+ (current)   67        5,581     12.01

Same shape: 3.3x reduction from peak (mid-2.1) to current.
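The era bucketing itself is just a split on the version string. A minimal sketch, using the boundaries from the table above and assuming plain x.y.z version strings:

```python
def version_era(version: str) -> str:
    """Map a Claude Code CLI version like '2.1.42' to one of the era buckets."""
    major, minor, patch = (int(p) for p in version.split(".")[:3])
    if (major, minor) < (2, 1):
        return "2.0.x (legacy)"
    if patch < 30:
        return "2.1.0-29 (early)"
    if patch < 70:
        return "2.1.30-69 (mid)"
    if patch < 100:
        return "2.1.70-99 (late)"
    return "2.1.100+ (current)"
```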

The worst individual versions, restricted to those with ≥150 prompts so the rate isn't just small-sample noise:

Version   fpk
2.1.42    173.79
2.1.59    134.50
2.1.27    121.77
2.1.83    109.59

2.1.42 generated an f-bomb every 6 prompts. If you remember regressions in those releases, the fpk metric is empirical confirmation that you weren't imagining it.

The calmest individual versions, ≥300 prompts:

Version   fpk
2.1.110   0.00
2.1.32    0.00
2.1.112   2.49
2.0.70    2.69
2.1.96    2.91

Two builds with hundreds of prompts and zero f-bombs. Whatever those builds did, keep doing it.


So... is fpk a real metric?

I started this as a joke. I'm not sure it should stay one.

The mainstream AI dev-tool metrics are:

  • Speed: tokens per second, latency, time to first token
  • Cost: dollars per task, dollars per session
  • Accuracy: benchmark scores, win rates, eval suites
  • Engagement: DAU/WAU, sessions per user, retention
  • Satisfaction: NPS, CSAT, survey scores

None of these capture the lived experience of using an agent loop for hours a day. They are all either too narrow (speed, cost), too synthetic (eval benchmarks), or too high-level (NPS, retention) to detect the moment-to-moment friction that actually drives users away.

fpk is different in two ways:

  1. It is unfiltered. Nobody types "fuck" into their AI assistant for the survey. It is involuntary. The signal is closer to ground truth than any solicited rating.

  2. It is longitudinal and dense. Every prompt is a measurement. You don't need to design a study or wait for a survey window. Five months of logs gives you 44,000 measurements.

The disadvantages are real. fpk only captures visible frustration; silent rage and silent disengagement are invisible. Cultural norms vary, and some operators don't swear, period, so their fpk will always be zero regardless of how broken the system is. And it is N-of-1: my fpk is mine, not yours.

But within an N-of-1 longitudinal frame, fpk is the cleanest friction signal I have ever had access to. It is showing me exactly the thing I want to know: is the loop getting better or worse? Right now, mine is dropping. That is not nothing.


What I'm doing with this

Three things.

1. Tracking it as a personal dashboard metric. Daily fpk, weekly fpk, by-model fpk. If it starts rising again, I want to know early, and I want to know which model or CLI version is responsible.

2. Treating high-fpk sessions as bug reports to myself. The f-bomb is a marker for a moment of friction. The session around it is the post-mortem. Tag, replay, fix.

3. Recommending it to anyone building AI dev tools. If you ship a serious agent loop, your users' conversation logs are the richest dataset you will ever have access to. Index them. Compute fpk. Compare across versions. The signal is real.


Reproducing this

If you have your own Claude Code logs (they live in ~/.claude/projects/ on most setups), the analysis is straightforward:

  1. Walk the JSONL files.
  2. For each row where type=user and the content is not all tool_result blocks, extract the text.
  3. Strip wrapper tags: <system-reminder>, <command-name>, hook output, and so on.
  4. Regex for f-word variations. Drop bare fk and af from your pattern; they false-positive on foreign keys and CLI flags.
  5. For each match, attribute to the next assistant model in the same session. Bucket by model and CLI version.
  6. Normalize per 1,000 prompts.
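
A minimal driver for those six steps, reusing the human_text, FBOMB_RE, and attribute_model sketches from earlier. The names are mine, not the companion repo's; adjust the path and field names to your setup.

```python
import json
from collections import Counter
from pathlib import Path

def fpk_by_model(projects_dir: Path) -> dict[str, float]:
    """Walk the JSONL logs and return f-bombs per 1,000 prompts, keyed by model."""
    fbombs, prompts = Counter(), Counter()
    for path in projects_dir.rglob("*.jsonl"):
        rows = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
        for i, row in enumerate(rows):
            text = human_text(row)                           # steps 2-3: human rows, tags stripped
            if text is None:
                continue
            model = attribute_model(rows, i) or "<unknown>"  # step 5: attribution
            prompts[model] += 1
            fbombs[model] += len(FBOMB_RE.findall(text))     # step 4: count matches
    return {m: 1000 * fbombs[m] / prompts[m] for m in prompts}  # step 6: normalize

if __name__ == "__main__":
    rates = fpk_by_model(Path.home() / ".claude" / "projects")
    for model, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        print(f"{model:24s} {rate:6.2f} fpk")
```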

The two scripts I used are about 200 lines of Python total. The schema, ingest pipeline, and analyses live in the companion repo: github.com/MPIsaac-Per/claude-code-ops-audit.

If you compute yours, please send me the number. Anonymous if you prefer. I want to know what the distribution across operators looks like.

In the meantime: my fpk is 29.58 and falling. Yours?