The surprising depths of prompt caching

Prompt caching looks like a token discount. Underneath, it is KV tensors, prefix trees, inference economics, and a privacy model hiding in plain sight.

TO

the opub team

/ 15 min read

Tokens sure are expensive, and they probably won't get cheaper any time soon. Especially when you're pulling from a pool of donated tokens, making efficient use of what's available to you is important.

The first real lever many teams discover for significant LLM spend reduction is prompt caching. Effective caching strategies can save your team massive sums of money.

What is it, how it works, best practices - we'll get into all that. We'll also explore radix trees as a caching method, and why frontier model providers probably won't be too upfront about it if they use them. And it seems like they are.

It'll be fun, no math! We promise.

Prompt bit on caching

To start, a clean definition:

Prompt caching skips repeated re-computation of the expensive part of LLM inference — the key-value attention states — for tokens that haven't changed between requests. Instead of re-reading your system prompt from scratch on every call, the server loads the precomputed math and picks up from there. You pay only for what's new.

If your system prompt says: "Please respond exclusively as a pirate on the cusp of scurvy", the attention states, "the path" through the model, is pre-primed by the server.

That means subsequent prompts already "know" the prompt, and will "skip" to what's new: yaaarrr me treasure for a melon, ye' salty dog!

Tokenomics

Prompt caching is a source of significant cost savings. It reduces the number of tokens that need to be crunched.

You'll see caching information referenced by your coding harness and API responses, something like this in the case of Codex:

Token usage: total=647,414 input=615,643 (+ 7,505,280 cached) output=31,771 (reasoning 4,583)

Using OpenAI's current pricing for GPT-5.5, we see the cost benefit of cache:

Bucket Tokens Rate Cost
Input (uncached) 615,643 $5.00/M $3.08
Input (cached) 7,505,280 $0.50/M $3.75
Output (incl. reasoning) 31,771 $30.00/M $0.95
Total $7.78

Not bad! In a vacuum anyways.

What would it cost without caching?

Bucket Tokens Rate Cost
All input at full price 8,120,923 $5.00/M $40.60
Output 31,771 $30.00/M $0.95
Total $41.55

Ooof! 💰 That's some cost savings:

Cost
With cache $7.78
Without cache $41.55
Saved $33.77 (81% off)

If your application, team, or agent sends the same long system prompt, tool schema, coding instructions, policy text, or repository context thousands of times per day, you are probably paying the model provider to re-read the same tokens over and over. You don't have to do that.

For now, there's an aligned incentive between us, the developers, and the frontier providers, to reduce straight inference hits. Model providers are trying to scale to match massive demand, and we're trying to save some major dollars. Efficiency is key to all of this truly working, on both sides.

But the discount is only a part of the story. An interesting question is why that repeated work is necessary in the first place.

Models don't read

A transformer, your frontier model, does not simply read text and remember the words. During the prompt-processing step, text is split into tokens, and each token is turned into vectors. No reading involved.


Meme explaining that a model does not simply read text, it works with vectors

Claude explaining this for the 24th time before (probably) trying to steal my rings


A token starts as a small integer ID, like:

"helpful" -> token id 15345

The model first turns that ID into a dense vector, maybe thousands of numbers wide:

15345 -> [0.0312, -0.8741, 0.2209, ...]

Then every transformer layer computes attention state for that token. For caching, the important outputs are the token's Key and Value vectors at each layer.

So one token does not become one thing. It becomes something like:

token
  layer 1: K vector + V vector
  layer 2: K vector + V vector
  layer 3: K vector + V vector
  ...
  layer N: K vector + V vector

A 2,000-token prompt is not just 2,000 wee little text fragments. It is 2,000 positions worth of mathematical state, repeated across dozens of layers. Each K/V vector can be hundreds or thousands of numbers depending on the architecture.

Multiply that by every token in the prompt, every request, and every concurrent user, and the "same system prompt" becomes a large block of GPU memory and prefill compute. That compute is what gets - and will keep getting - costly.

It's baked right into the expansion:

text tokens -> embeddings -> per-layer K/V tensors

The KV cache is the model server keeping those per-token, per-layer Key/Value tensors around.

So what do we cache?

The phrase "prompt caching" sounds like a text cache. It's something different.

By the time caching matters, the prompt has already expanded into per-token, per-layer KV state. During normal generation, the model server already keeps that state around so new output tokens can attend back to the prompt without recomputing the whole history.

For a single request, that cache is usually ephemeral. The model prefills the prompt, stores the key/value tensors needed for decoding, generates the answer, and then releases the memory. Like a shooting star, it flares out, then it's gone.


Money to burn

Gasoline for your money fire, sir?


Prompt caching changes one thing: for a reusable prompt prefix, the server keeps those key/value tensors and lets a later request resume from them. This leads directly to cost savings and faster inference returns.

That small change only works if the prefix stays stable.

Let's get practical.

How to prompt

Prompt caching turns prompt layout into infrastructure design.

The rule is simple: static content first, dynamic content last.

A readable prompt can still be cache-hostile:

system instructions
tool policy
today's date
retrieved context
user message

That layout makes sense to a human. We say what we want, add policy, note the date, include retrieved context, then ask the actual question.

But the cache only sees token order. If today's date, a session ID, a tool-call timestamp, or retrieved context appears before the reusable instructions are finished, the stable prefix gets shorter. Change token 47, and everything from token 47 onward has to be recomputed.

That is cache breaking. Better to visualize it.

Shared prefix

[SYS]Youareahelpfulassistantwithtools:

Request 1

Cold cache

ready

Token stream

[SYS]Youareahelpfulassistantwithtools: Whatis2+2?

KV rows, layers collapsed

[SYS] K V ...
You K V ...
are K V ...
... 5 shared prefix rows
What K V ...
is K V ...

12

tokens computed

0

tokens cached

Request 2

Cache hit

ready

Token stream

[SYS]Youareahelpfulassistantwithtools: NowexplainKVcache

KV rows, layers collapsed

[SYS] K V ... loaded
You K V ... loaded
are K V ... loaded
... 5 shared prefix rows
Now K V ...
explain K V ...

4

tokens computed

8

tokens cached

Napkin math $5/M full input, $0.50/M cached. Per 1M repeats: $60 cold $24 cached, saving $36.

Send request 1. The first request is cold. Blue means computed from scratch. The model has seen none of this prefix yet, so the whole prompt is full price.

Now send request 2. The shared system prompt turns green because that prefix can be loaded from cache. Only the new user suffix has to be computed.

Then click BREAK PREFIX. A changing token inside the system prompt cuts the reusable prefix short. The later rows have to be recomputed, and naturally, paid for.

The example is contrived, but the dark pattern is real. With large operational documents, tool schemas, or repository context, an innocuous per-call update can undermine the cache strategy. The napkin math at the bottom uses the same GPT-5.5 pricing example from above.

It adds up very, very fast.

The cache-conscious version is deliberate about its roots:

static instructions
static tool definitions
static examples
stable repository or product context
dynamic date
retrieved context
user message

For an individual, this saves money. For a team sharing style guides, tool definitions, codebase context, and policy documents, it compounds.

But the game also depends on your provider, code, and harness.

Comparing provider caching

The major hosted providers all expose some form of prompt caching, but the surface area tells you a lot about how they want developers to think.

Anthropic, Claude Code

Claude logo

Anthropic gives developers explicit control. In the Claude API, you can add cache_control and choose where cache breakpoints belong.

Anthropic's docs now also describe automatic caching, but the core model is still visible: you can inspect cache_creation_input_tokens and cache_read_input_tokens, see whether you paid for a write or a read, and place breakpoints deliberately.

The pricing model is explicit too: 5-minute cache writes cost 1.25x base input price, 1-hour writes cost 2x, and cache reads cost 0.1x base input price.

The default lifetime is 5 minutes, refreshed on use, with a 1-hour option at the higher write price. Their docs are also blunt about the common failure mode: if the breakpoint includes a timestamp or other changing block, you can pay for fresh writes repeatedly without getting useful reads.

Practically, if you set your organization up with consistent, shared context documents, and label them as such when called, you're well on your way.

See Anthropic's prompt caching documentation.

OpenAI, Codex

OpenAI logo

We'll unpack this one more later because it's... intriguing...

OpenAI makes caching automatic. There is no marker to place around a block.

The system looks for matching prompt prefixes once a request reaches the minimum length, reports cached tokens in usage, and applies cached-token pricing.

The current OpenAI docs say caching is enabled for prompts of 1,024 tokens or more, that cache hits depend on exact prefix matches, and that static content should come before variable content.

OpenAI also documents routing by a hash of the initial prefix, usually around the first 256 tokens, and a prompt_cache_key parameter that can improve routing when many requests share long common prefixes.

OpenAI's current docs also go beyond the older "50% discount" framing: they describe input-cost reductions of up to 90% on recent models, in-memory retention, and extended retention for some models.

See OpenAI's prompt caching guide.

Google, Gemini

Google Gemini logo

Google's Gemini API exposes the heaviest abstraction.

You create named cached-content objects, refer to them by ID, and can update TTL or expiry. That makes context caching feel less like a hidden optimization and more like a resource you manage.

It also means the bill can include both cached-token use and storage duration.

It might seem cumbersome, but it can be used to great effect.

And the devil's also in the details...

See Google's Gemini context caching documentation.


Under the hoodlums

In November 2025, Will McGinnis published independent tests of Anthropic and OpenAI caching.

He expected an OpenAI cache miss after appending content to a cached prefix, but observed 1,920 cached tokens anyway. Thus Will concluded that the system reused the longest matching portion rather than behaving like a naive exact-whole-prompt lookup.

Per Will:

"OpenAI's docs claim 'exact prefix match' is required. My tests showed otherwise. I intentionally modified the cached prefix by appending content, expecting a cache miss. Instead: 1,920 tokens cached. The system cached the longest matching substring, not just exact prefixes."

Hmm - why? Nothing better than a good mystery. Except maybe a mystery that helps us get to the depths of how this works and where this might be going.


Homer disappearing into the bushes

Well you see, it works like... [vanishes into bush]


OpenAI says exact prefix matches matter.

Will's result suggests that the lookup mechanism may be more specific than "did this whole prompt match?" If the server can reuse the longest matching beginning of a request, then caching is no longer just a yes/no lookup. It becomes a search for the deepest reusable prefix.

A regular cache answers a binary question:

  • have I seen this exact key before?

A prefix cache needs a better question:

  • how much of this request have I already seen?

That is where radix trees enter the story.

Radical radix trees

A radix tree stores shared prefixes as shared structure. If many requests begin with the same instruction block, the tree can represent that common beginning once, then branch when the prompts diverge:

You are a helpful assistant with tools. [User A]
You are a helpful assistant with tools. [User B]
You are a helpful assistant without tools. [User C]

The shared branch is You are a helpful assistant. Then the tree forks at with versus without. For an inference server, the values attached to those branches are not strings. They are locations for KV tensors. The shared prefix can exist once in GPU memory while many requests walk through it.

Radix tree diagram showing shared prompt prefixes stored once with one reusable KV state

This is the difference between "request two is cheaper than request one" and "a whole population of similar requests can share infrastructure state." It moves prompt caching from a per-session discount into population-scale serving economics.

Building a factory which makes cars, or a factory which makes factories that make cars.

SGLang made this idea concrete with RadixAttention.

In the original LMSYS SGLang announcement, the authors describe retaining KV cache after generation and storing it in a radix tree for efficient prefix search, insertion, and eviction. The frontend still sends full prompts. The runtime finds the longest reusable prefix, reuses its KV cache, and appends the new suffix.

That matters because GPU memory, not just raw FLOPS, is one of the hard constraints in model serving. If hundreds of concurrent users send the same system prompt and tool definitions, a naive implementation can produce hundreds of copies of the same KV state. Remember, that's heavy.

A radix tree implementation can coalesce the shared prefix. One branch. Many requests. One stored representation of the reusable work.

This can lead to tremendous efficiency and performance gains at scale. It is also why this topic stops being just a developer tip and starts getting politically weird.

SGLang is for people serving models. It can be up-front about radix trees because the deployment boundary is usually clear: your hardware, your users, your cache policy.

For public frontier providers, OpenAI, Anthropic, Google, and company, that boundary is the entire product.

Keeping the quiet part quiet

Hosted frontier providers are not upfront about their exact caching infrastructure. That's reasonable and not weird at all. Serving architecture at scale is expensive, proprietary, changes often, and depends on hardware, model family, routing, enterprise tenancy, and retention policy.


Nothing to see here, please disperse

It's also wild and often on fire.


But the lack of specific detail is still informative. Above, we looked at the user-side communication from these companies, which is only a modest proxy for what really happens.

The most efficient known implementation pattern for shared-prefix serving is tree-shaped KV reuse. SGLang documents it and research systems extend it.

KVFlow, for example, studies prefix caching for multi-agent workflows and compares against SGLang's hierarchical radix cache.

In self-hosted inference, this is normal systems engineering. You can say "we keep a radix tree of shared prompt prefixes" because the operator already controls who shares the machine, who shares the cache, and what gets evicted.

In hosted frontier APIs, the same sentence has a different weight. Now you are asking customers to accept that the most efficient serving path may involve many unrelated requests walking through the same cached prefix structure.

That does not mean the provider is leaking prompt text. It does mean the architecture has to be reconciled with the privacy story customers think they bought.

The privacy model most customers carry around is simple:

my request -> isolated inference -> my response

A prefix-sharing serving architecture asks them to accept a different mental model:

many requests -> shared cached prefix -> suffix & output diverges

The outputs can be functionally identical. The provider may enforce organization boundaries. The cached representation may be only KV tensors rather than prompt text. None of that changes the optics: the architecture implies structural coalescence across all requests.

Maybe that's OK, yet there's a feeling that all of our inputs get subsumed into a massive ocean of pre-computed abstractions of incoming and outbound thought. That's a weird feeling for any company with intellectual property.

And then suddenly the issue is no longer philosophical.

Security and cache methods

The 2025 paper Auditing Prompt Caching in Language Model APIs tested production APIs for timing differences caused by caching. The concern is not that a cache hit reveals the prompt text.

It is that latency can reveal whether a prompt, or part of a prompt, was recently cached. If a request comes back faster than it should, you have learned something about the recent shape of traffic.

That is a side channel. No content needs to be returned. The timing is the signal.

The authors report detecting global cache sharing across users in several providers and argue that prompt caching creates privacy leakage through timing.

From the article:

"Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. [...] If the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts."

That is why "do they use radix trees?" is an interesting question. The more sophisticated the shared-prefix machinery is, the more careful the provider has to be about tenancy boundaries, routing, retention, and what the public API reveals through latency.

Is this bad, or evil?


Two robots being confronted, suggesting cache method uncertainty

Good bot? Bad bot? Who can be sure!


No, nothing like that. It sits at the intersection of economics and privacy. And that gets muddy.

OpenAI's current docs address this in their own way. They say prompt caches are not shared between organizations, describe in-memory and extended retention, and note that extended caching may store key/value tensors derived from customer content for a bounded period. That is useful disclosure. It is also not volunteering anything.

Anthropic documents cache breakpoints, TTLs, minimum lengths, invalidation behavior, and token accounting. Google documents explicit cached-content objects and storage-duration billing. Cool. But remember, these are product surfaces, not full infrastructure disclosures, and they can change rapidly.

That silence is not automatically damning. It is also not empty. It is a signal that prompt caching lives in an awkward place: the economics work best when repeated structure is shared, while enterprise trust works best when every request feels cleanly isolated.

What to take with you

For most builders, the practical advice is short and sweet:

Put static content first and dynamic content last. Treat dates, session IDs, retrieved snippets, user messages, and per-request tool output as cache breakers unless they appear after the reusable prefix.

Design for teams, not only sessions. Shared system prompts, tool schemas, style guides, repository maps, and policy documents should stay byte-for-byte stable across users when they can.

Watch the provider surface. OpenAI is automatic, but layout still matters. Anthropic gives explicit breakpoints, but writes cost more than normal input. Gemini makes caches explicit resources, including storage duration.

If you self-host, test against your actual traffic. For multi-turn chats, coding agents, and workflows with long shared prefixes, SGLang's radix-tree approach deserves serious attention. vLLM remains strong and widely used, but cache architecture matters when concurrent traffic shares prompt structure.

Summary: caching in

Prompt caching starts as an innocent billing feature. Reuse the same prompt prefix, pay less. Nice.

Then you look down one layer and realize the cached thing is not text. It is the model's intermediate work: per-token, per-layer attention state sitting in expensive memory.

Look down another layer and the economics change again. The real win is not one user sending request two after request one. It is thousands of apparently separate requests doing the same first thousand steps, then branching.

That is the rabbit hole. Prompt caching is a discount, a data structure, a GPU memory strategy, and a privacy story all at once.

TO

Written by the opub team

Filed under agents, models, tokenomics

Subscribe to the newsletter

Hear from open public.

Rare and concise letters with our latest writing, sponsorships, and updates.

No spam, never sold. Unsubscribe any time.

Next up

Introducing Open Public

May 21, 2026 / 6 min / launch, donors

More from opub

All posts
/ 6 min read

Introducing Open Public

opub is the public ai compute commons for open source. Donors fund donated compute for open source projects, and maintainers spend it on over 30 top coding models through API keys with public token spend.

the opub team