FIELD NOTES · 02 · 22 MAY 2026 · 9 MIN READ

rtk works. The pond it's fishing is just 4% of your bill.

The proxy is honest about its layer: it shrinks the bash output it intercepts by ~53%. In Claude Code, that bash output is roughly 4% of the input you actually pay for. The other 96% lives in cache_read, which rtk cannot touch. A map, not an accusation.

BY Marcus Sánchez

TOPIC LLM tooling · benchmarking

STACK rtk · Claude Code · Langfuse

STATUS PUBLISHED

TL;DR

rtk works as advertised at its own layer: it shrinks the bash output it intercepts by ~53%. But in Claude Code, that bash output is roughly 4% of the input you actually pay for. The other 96% lives in cache_read (system prompt, CLAUDE.md, skill manifests, prior tool results), which rtk cannot touch. So the proxy is honest about its layer; the README's "session drops 150K → 45K" extrapolation just doesn't hold once you map the bucket sizes.

01.What rtk claims

rtk-ai/rtk is a Rust CLI proxy that compresses tool output before it reaches your LLM context. The repo's headline claims:

60–90% token reduction on common dev commands (git, cargo, pytest, docker, …)
"Typical 30-min Claude Code sessions drop from ~150K to ~45K tokens" (~70% session-level reduction)
Drop-in via a Claude Code PreToolUse hook that rewrites git status → rtk git status

The second claim — the session-level one — is the one most users will read first. The first is mechanically true and easy to verify; the second requires looking at what the LLM actually consumed.

02.Setup

Installed via brew install rtk on 2026-05-13 02:17 COT (confirmed from the Homebrew INSTALL_RECEIPT, brew info, and the binary mtime). Version 0.39.0, single binary, hook wired up (rtk hook claude in ~/.claude/settings.json).

Eight days of usage since install. Telemetry: Claude Code → OTEL → Langfuse Cloud (US). Two 7-day comparison windows:

Window	Dates (UTC)	Observations
PRE	2026-05-06 → 2026-05-13	1,500
POST	2026-05-15 → 2026-05-22	1,500

Sample: 500 claude_code.llm_request generations/day × 3 representative days per window. Install day is excluded.

03.The four buckets of a Claude Code request

Every Claude Code turn sends the model an input made of four buckets. Knowing which bucket a tool operates on tells you the ceiling on how much it can save.

Bucket	~tokens/req	$/M tokens (Sonnet)	What lives there	rtk?
cache_read	~60,000	$0.30	System prompt, `CLAUDE.md`, `MEMORY.md`, skill manifests, hook outputs, every prior tool result, full conversation history	No
cache_creation	~2,400	$3.75	What's new this turn — including the bash output rtk just compressed	Yes
output	~440	$15.00	What the model generates	No
input (uncached)	~3	$3.00	Truly fresh non-cached bytes	No

rtk operates on cache_creation, and only on the bash-output portion of it. Read, Edit, Write, WebFetch, and every MCP tool flow past rtk untouched. In agent-heavy flows (issue enrichment, code review) those non-bash tools do most of the work.
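The ~96% and ~4% figures follow directly from the table's per-request sizes. A minimal sketch of that arithmetic, assuming the table's rounded numbers and treating the three non-output buckets as the whole input; the dictionary and function names are illustrative and not part of rtk or Claude Code:

```python
# Per-request bucket sizes and Sonnet prices, copied from the table above.
BUCKETS = {
    # name: (tokens per request, USD per million tokens, counts as input?)
    "cache_read": (60_000, 0.30, True),
    "cache_creation": (2_400, 3.75, True),
    "input_uncached": (3, 3.00, True),
    "output": (440, 15.00, False),
}


def input_token_shares(buckets):
    """Share of input tokens held by each input bucket (output is excluded)."""
    total = sum(tokens for tokens, _, is_input in buckets.values() if is_input)
    return {
        name: tokens / total
        for name, (tokens, _, is_input) in buckets.items()
        if is_input
    }


def cost_per_request(buckets):
    """USD per request for every bucket: tokens * price / 1e6."""
    return {
        name: tokens * price / 1e6
        for name, (tokens, price, _) in buckets.items()
    }


shares = input_token_shares(BUCKETS)
costs = cost_per_request(BUCKETS)

for name, share in shares.items():
    print(f"{name:>15}: {share:6.2%} of input tokens")
for name, usd in costs.items():
    print(f"{name:>15}: ${usd:.6f} per request")
```

The first loop reproduces the token-volume split used in this post. The second weights each bucket by price; since cache_creation costs 12.5× more per token than cache_read in this table, its dollar share is larger than its token share.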

// the ceiling: before measuring anything, rtk's theoretical maximum reach is ~4% of input, and that assumes the whole cache_creation wedge is bash output. Bash is only a slice of that wedge, so even a heroic 80% cut on the bash slice nets roughly 1–2% off the whole input, which is down at the noise floor of what we measured.
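The ceiling is a product of three fractions. A small sketch, with one loud assumption: this post does not measure how much of cache_creation is bash output, so the 30% and 60% values below are hypothetical, picked only to bracket the 1–2% range. The function name is illustrative.

```python
def net_input_cut(wedge_share, bash_fraction, compression):
    """Fraction of total input removed when only the bash slice is compressed.

    wedge_share   -- cache_creation as a share of input (~0.04 in this post)
    bash_fraction -- share of cache_creation that is bash output (ASSUMED)
    compression   -- fraction rtk removes from the bash output it sees
    """
    for value in (wedge_share, bash_fraction, compression):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all arguments must be fractions in [0, 1]")
    return wedge_share * bash_fraction * compression


# Hard ceiling: pretend every new token is bash output and rtk deletes all of it.
ceiling = net_input_cut(0.04, 1.0, 1.0)

# The "heroic 80% cut", with hypothetical bash fractions of 30% and 60%.
low = net_input_cut(0.04, 0.30, 0.80)
high = net_input_cut(0.04, 0.60, 0.80)

print(f"ceiling: {ceiling:.1%}, heroic cut: {low:.1%} to {high:.1%}")
```

Swap in your own bash fraction if you can measure it; the shape of the result does not change, because the 4% wedge caps everything multiplied into it.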

04.What rtk's own counter says

Before challenging the downstream claim, confirm rtk is doing something at all. rtk gain prints a running summary of everything it has intercepted:

$ rtk gain
RTK Token Savings (Global Scope)
════════════════════════════════════════════════════════════
Total commands:    16,345
Input tokens:      34.0M
Output tokens:     15.8M
Tokens saved:      18.2M (53.5%)
Total exec time:   442m23s (avg 1.6s)

53.5% is below the advertised 60–90% but in the same ballpark. The interesting part is the per-command breakdown:

FIG · 01: Tokens saved by command, top 7 (out of 41 distinct entries). Source: rtk gain --json

Aggregate savings are dominated by a handful of heavy commands (vitest, playwright, the stash dump). The high-frequency one (rtk read, 2,770 calls) trims only 10.6%.

So rtk is doing the work it says it's doing. The question is whether that work translates into fewer tokens charged to my Anthropic account.
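The counter's headline figure can be cross-checked from the totals it prints. A sketch that parses the summary text exactly as pasted in this post; it assumes that layout and is not a documented rtk output format, and the helper name is illustrative:

```python
import re

# Summary text as printed in this post; the parser assumes this exact layout.
GAIN_OUTPUT = """\
Total commands:    16,345
Input tokens:      34.0M
Output tokens:     15.8M
Tokens saved:      18.2M (53.5%)
Total exec time:   442m23s (avg 1.6s)
"""


def parse_millions(text, label):
    """Return the 'NN.NM' figure that follows `label`, converted to tokens."""
    match = re.search(rf"{re.escape(label)}:\s+([\d.]+)M", text)
    if match is None:
        raise ValueError(f"label not found: {label}")
    return float(match.group(1)) * 1_000_000


tokens_in = parse_millions(GAIN_OUTPUT, "Input tokens")
tokens_out = parse_millions(GAIN_OUTPUT, "Output tokens")
tokens_saved = parse_millions(GAIN_OUTPUT, "Tokens saved")

commands_text = re.search(r"Total commands:\s+([\d,]+)", GAIN_OUTPUT).group(1)
commands = int(commands_text.replace(",", ""))
minutes, seconds = map(int, re.search(r"(\d+)m(\d+)s", GAIN_OUTPUT).groups())

savings_rate = tokens_saved / tokens_in
avg_exec_seconds = (minutes * 60 + seconds) / commands

print(f"saved {savings_rate:.1%} of input, {avg_exec_seconds:.2f}s per command")
```

Input minus output equals the saved figure, and the saved share of input matches the 53.5% the tool reports, so the counter is internally consistent.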

05.What Langfuse actually shows

Per-request token averages across both windows, pulled from metadata.attributes on every claude_code.llm_request generation:

FIG · 02: Effective input tokens per LLM request, PRE vs POST. Stacked: cache_read · cache_creation · uncached

The uncached bucket, the only one that shrank, is a sliver. cache_read dwarfs everything, and it grew 3% POST. The meter went up, not down.

Bucket	PRE (tokens/req)	POST (tokens/req)	Δ
Uncached input	5.5	2.4	−56%
cache_read	60,331	62,171	+3%
Effective input	62,572	64,571	+3%

The −56% looks dramatic but it's the smallest bucket by four orders of magnitude — a saving of three tokens per request. Side by side:

// the bucket that shrank: −56% · uncached input · −3 tokens / request

// the bucket above it: +3% · cache_read · +1,840 tokens / request

Both bars are drawn to the same per-token scale. That's the difference in scale between the one bucket that shrank and the bucket that dominates your input, and it's what the four-buckets table in section 03 predicts.
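The deltas in this section can be reproduced from the reported averages, and the same numbers imply a value for cache_creation, which is not listed on its own. A sketch, assuming effective input is the plain sum of the three stacked buckets in FIG · 02; the helper names are illustrative:

```python
# Per-request averages reported in this section (tokens).
PRE = {"cache_read": 60_331, "uncached": 5.5, "effective_input": 62_572}
POST = {"cache_read": 62_171, "uncached": 2.4, "effective_input": 64_571}


def implied_cache_creation(window):
    """Back out cache_creation, ASSUMING effective = read + creation + uncached."""
    return window["effective_input"] - window["cache_read"] - window["uncached"]


def pct_change(before, after):
    """Relative change from `before` to `after` as a fraction."""
    return (after - before) / before


pre_cc = implied_cache_creation(PRE)
post_cc = implied_cache_creation(POST)

print(f"uncached:       {pct_change(PRE['uncached'], POST['uncached']):+.0%}")
print(f"cache_read:     {pct_change(PRE['cache_read'], POST['cache_read']):+.0%}")
print(f"cache_creation: {pre_cc:,.1f} -> {post_cc:,.1f} "
      f"({pct_change(pre_cc, post_cc):+.1%})")
```

Under that assumption, cache_creation comes out at roughly 2,236 tokens PRE and 2,398 POST, in line with the ~2,400 in the four-buckets table, and it moved up rather than down across the install boundary.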

06.Why we didn't see more improvement

Three things stack up, and all three are features of how Claude Code works rather than failures of rtk.

1. rtk only fishes the cache_creation bucket.

It's the right pond — that is where new bash output enters the model — but it's a ~4% wedge of input. The 96% wedge (cache_read) carries the system prompt, your CLAUDE.md, MEMORY.md, ~80 installed skill manifests, hook outputs, and every prior tool result, all replayed each turn. rtk can't see any of it.

2. Most tool calls aren't bash.

In a typical agent flow, Read, Edit, Write, WebFetch, and MCP servers each contribute new content to cache_creation, none of it through rtk. If a session does 50 tool calls and 40 are non-bash, rtk's slice is diluted even within its own bucket.

3. The compounding effect into cache is real but quiet.

Smaller bash outputs today mean smaller cached entries replayed tomorrow — so rtk does slow the growth of cache_read slightly. But cache_read also grows from every non-bash tool call, every skill that gets loaded, every CLAUDE.md edit. rtk's deceleration is swamped by everything else accreting.

And activity often expands to fill the freed space. Per-request input stays flat across the install boundary; per-day token spend goes up, simply because there are more requests:

FIG · 03: Daily trace volume, 14 days across the install boundary (~70 → ~230 traces/day). Dashed marker = install day.
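Daily volume factors into request count times per-request size, which is why a flat per-request average and a rising daily total can both be true. A sketch using the numbers reported in this post, under the assumption that requests per trace stayed constant so that trace counts can stand in for request counts:

```python
def daily_input_tokens(requests_per_day, tokens_per_request):
    """Daily input volume is request count times per-request size."""
    return requests_per_day * tokens_per_request


# Trace counts from FIG 03; per-request averages from the PRE/POST comparison.
# ASSUMPTION: requests per trace is constant, so traces stand in for requests.
pre_daily = daily_input_tokens(70, 62_572)
post_daily = daily_input_tokens(230, 64_571)

volume_factor = 230 / 70
size_factor = 64_571 / 62_572
total_factor = post_daily / pre_daily

print(f"volume x{volume_factor:.2f} * size x{size_factor:.2f} "
      f"= daily total x{total_factor:.2f}")
```

Nearly all of the growth in the daily total comes from the volume factor; the per-request factor is close to 1.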

And one shape-of-traffic observation worth naming: the high-frequency intercepts are low-margin, and the high-margin ones are low-frequency. rtk read at 10.6% savings fires 2,770 times; the 90%+ savers (vitest, playwright, stash show) fire <100 times each. That's not an rtk failure. It's what dev traffic looks like.

// none of this is an rtk bug: these are shape-of-the-stack facts. rtk would have to operate inside Claude Code's prompt assembly to reach cache_read, and that's a different tool, probably an impossible one, since cache_read content is mostly user-supplied (your CLAUDE.md, your skills, your hooks).

07.Where each claim lands on the map

// rtk's claims, mapped to the buckets

Claim	Detail	Verdict
Shell-layer compression works as advertised	"60–90% per command"	In its layer
Per-command savings, on rtk's own counter	"rtk gain reports 53.5%"	Squarely in spirit
Session drops 150K → 45K (~70%)	per session, charged to your account	Doesn't survive cache_read

The proxy is doing its job. The README's headline extrapolates a per-command saving to session totals, but session totals are dominated by replayed cached context that rtk can't see. That's a map question, not a pass/fail.

08.The lever that would move your bill

If cache_read is 96% of input volume and rtk can't touch it, what can? The answer is unglamorous: shrink what you cram into the cached prelude. Every byte you keep in CLAUDE.md, MEMORY.md, the skill manifest, and hook outputs gets replayed on every single turn for the entire session.

Concrete moves in rough order of impact:

Audit your installed skills. Each skill in the manifest costs tokens every turn whether you use it or not. ~80 skills installed → tens of thousands of cached tokens per turn. Uninstall what you don't reach for.
Diet your CLAUDE.md / MEMORY.md / SOUL.md. These are read at every session start and replayed from cache on every turn. Trim ruthlessly.
Shrink hook outputs. A SessionStart hook that prints 5KB of context is 5KB replayed on every subsequent turn until the session ends.
Watch MCP server count. Each connected MCP server publishes tool schemas into the prompt. Disable the ones you're not using this session.
Compact aggressively. Claude Code's auto-compaction summarizes old turns; manually /compact before long sessions when you don't need the full history.
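To treat the prelude as a budget you first need a rough size for each file in it. A sketch of such an audit, with several assumptions: four characters per token is a crude heuristic rather than a real tokenizer, the file locations are hypothetical, the default of 200 turns is arbitrary, and the price is the cache_read rate from the buckets table.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4          # rough heuristic, NOT a real tokenizer
CACHE_READ_USD_PER_M = 0.30  # Sonnet cache_read price from the buckets table


def estimate_tokens(path):
    """Approximate token count of a text file; missing files count as zero."""
    file = Path(path).expanduser()
    if not file.is_file():
        return 0
    text = file.read_text(encoding="utf-8", errors="replace")
    return len(text) // CHARS_PER_TOKEN


def replay_cost_usd(tokens, turns):
    """Cost of re-reading `tokens` from cache on each of `turns` turns."""
    return tokens * turns * CACHE_READ_USD_PER_M / 1e6


def audit(paths, turns=200):
    """Return (path, estimated tokens, replay cost) rows, largest first."""
    rows = [(str(p), estimate_tokens(p)) for p in paths]
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(p, t, replay_cost_usd(t, turns)) for p, t in rows]


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever your prelude files live.
    candidates = ["~/.claude/CLAUDE.md", "./CLAUDE.md", "./MEMORY.md"]
    for path, tokens, usd in audit(candidates):
        print(f"{tokens:>8,} tok  ${usd:.4f}  {path}")
```

The estimate is only good for ranking files against each other. For real numbers, read cache_read_tokens from your telemetry before and after a trim.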

rtk stays installed — it's free, zero-dependency, and the savings it does book are real money on heavy-output commands (vitest, playwright, gh pr diff). But the headline win in a Claude Code workflow comes from treating your cached prelude as a budget, not from compressing bash.

09.Likely pushback

"You didn't reset cache between runs." → Correct, and that's exactly the point. Most users don't either, and the README claims session-level savings without that caveat.
"Your activity tripled, of course tokens went up." → Per-request average wasn't supposed to depend on activity. If rtk shrank per-call inputs, the average would drop regardless of volume. It didn't.
"You're using the wrong hook." → Verifiable: jq '.hooks' ~/.claude/settings.json shows rtk hook claude is wired up, and rtk gain captured 16,345 commands. The hook works — that's not the disconnect.
"You should have tested a controlled per-session run, not before/after windows." → Agreed — and that's the next experiment. Sum tokens per session.id across a controlled task (e.g. "enrich this issue") run before and after rtk, with cache state reset. Isolates session size from activity volume. My windows don't control for what I was doing.

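The per-session experiment proposed in the last pushback item amounts to a group-by over exported observations. A sketch, assuming each observation is a dict that carries its token counts and a session.id under metadata.attributes; that placement of session.id is my assumption, and the sample records and session names are made up:

```python
from collections import defaultdict

TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_read_tokens",
    "cache_creation_tokens",
)


def tokens_per_session(observations):
    """Sum each token field per session.id across a list of observation dicts."""
    totals = defaultdict(lambda: dict.fromkeys(TOKEN_FIELDS, 0))
    for obs in observations:
        attrs = obs.get("metadata", {}).get("attributes", {})
        session = attrs.get("session.id")
        if session is None:
            continue  # cannot attribute this request to a session
        for field in TOKEN_FIELDS:
            # Attribute values may arrive as strings, so coerce defensively.
            totals[session][field] += int(attrs.get(field, 0) or 0)
    return dict(totals)


# Hypothetical records; the session names and numbers are invented.
sample = [
    {"metadata": {"attributes": {"session.id": "before-rtk",
                                 "cache_read_tokens": "60000",
                                 "cache_creation_tokens": "2500"}}},
    {"metadata": {"attributes": {"session.id": "before-rtk",
                                 "cache_read_tokens": "62500",
                                 "cache_creation_tokens": "1800"}}},
    {"metadata": {"attributes": {"session.id": "after-rtk",
                                 "cache_read_tokens": "60000",
                                 "cache_creation_tokens": "1200"}}},
]

session_totals = tokens_per_session(sample)
for session, totals in session_totals.items():
    print(session, totals)
```

Comparing totals for the same task run with and without rtk, from a reset cache, would separate session size from activity volume in a way the 7-day windows cannot.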
10.Methodology notes

Timestamps in the prose are America/Bogota (COT, UTC−5). Langfuse stores UTC, and the comparison windows are given in UTC.
Tokens live in metadata.attributes.{input,output,cache_read,cache_creation}_tokens on Claude Code's OTEL spans. Langfuse's auto-extracted promptTokens/completionTokens fields are zero because the attribute naming doesn't match Langfuse's mapping — I had to read attributes directly.
Filter: type=GENERATION, level=DEFAULT, name=claude_code.llm_request. Excludes 429 rate-limit errors (zero tokens).
Sample is 1,500 observations per window across 3 days each — large enough to wash out single-session anomalies but not a controlled experiment.
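The filter and the attribute read described in these notes can be sketched over a list of exported observations. The dict shape below is an assumption about what an export looks like, not the Langfuse API. The ERROR level on the rate-limited record and the other.span name are hypothetical, and the sample numbers are invented.

```python
from statistics import mean

TOKEN_FIELDS = (
    "input_tokens",
    "cache_read_tokens",
    "cache_creation_tokens",
    "output_tokens",
)


def is_llm_request(obs):
    """Apply the filter from the methodology notes."""
    return (
        obs.get("type") == "GENERATION"
        and obs.get("level") == "DEFAULT"
        and obs.get("name") == "claude_code.llm_request"
    )


def per_request_averages(observations):
    """Mean of each token attribute over the observations that pass the filter."""
    kept = [
        obs.get("metadata", {}).get("attributes", {})
        for obs in observations
        if is_llm_request(obs)
    ]
    if not kept:
        raise ValueError("no observations passed the filter")
    return {
        field: mean(int(attrs.get(field, 0) or 0) for attrs in kept)
        for field in TOKEN_FIELDS
    }


def make_obs(level, name, **tokens):
    """Build one hypothetical observation with string-valued token attributes."""
    attrs = {f"{key}_tokens": str(value) for key, value in tokens.items()}
    return {"type": "GENERATION", "level": level, "name": name,
            "metadata": {"attributes": attrs}}


sample = [
    make_obs("DEFAULT", "claude_code.llm_request",
             input=4, cache_read=60000, cache_creation=2000, output=400),
    make_obs("DEFAULT", "claude_code.llm_request",
             input=2, cache_read=62000, cache_creation=2800, output=480),
    make_obs("ERROR", "claude_code.llm_request"),   # e.g. a rate-limited call
    make_obs("DEFAULT", "other.span", input=999),   # wrong name, filtered out
]

averages = per_request_averages(sample)
print(averages)
```

Reading the attributes directly, rather than the auto-extracted fields, is the step that matters here; the rest is an ordinary filter and mean.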