Prompt Caching: Cutting LLM Cost and Latency in Half

If you send the same system prompt and context on every request, you're paying to re-process it every time. Prompt caching reuses that work — here's how to structure prompts to hit the cache.

May 24, 20268 min readAIScaling

Most LLM apps resend a large, identical preamble on every call — a system prompt, tool definitions, few-shot examples, retrieved documents. The model re-tokenises and re-processes all of it each time, and you pay full input price for tokens that never changed. Prompt caching lets the provider reuse the computed state for a repeated prefix, cutting both cost and time-to-first-token.

1. Why a prefix can be cached at all

A transformer processes tokens left to right; the internal state for position N depends only on tokens 0..N. So if two requests share the first 2,000 tokens, the work for those tokens is identical and can be reused. The catch follows directly: caching only works on an exact prefix match. One different character near the top busts the cache for everything after it.

2. Order your prompt static-first

This is the whole technique. Put everything stable at the top, everything variable at the bottom.

[ system instructions ]   ← static  ┐
[ tool definitions     ]   ← static  │ cacheable prefix
[ few-shot examples    ]   ← static  ┘
[ retrieved context    ]   ← semi-static
[ user question        ]   ← variable (cache miss starts here)

A common own-goal is injecting a timestamp or a request ID into the system prompt. That single dynamic token at the top means you never cache anything.

3. Explicit vs implicit caching

Implicit (some providers) — automatic for any repeated prefix over a minimum length. Free; you just structure prompts well.
Explicit (cache breakpoints / cache_control) — you mark where the cacheable prefix ends. More control, often a small write cost but large read discount.

Cached input tokens are typically billed at a fraction (often ~10%) of the normal input rate, and cache entries have a short TTL (minutes), so the win is biggest for bursty, repeated traffic.

4. What it does and doesn't speed up

Prompt caching shortens the prefill phase — processing the input — which lowers time-to-first-token and input cost. It does not change generation: output tokens cost the same and stream at the same rate. If your latency is dominated by long outputs, caching helps less than streaming would.

5. Measure cache-hit rate

Most APIs return cached_tokens (or similar) in the usage block. Log it and watch the ratio. A hit rate near zero on a stable-prompt app means something dynamic is leaking into your prefix — hunt it down before optimising anything else.

Rules of thumb

Sort prompts static-first. The cacheable part must be a byte-identical prefix.
Never put timestamps, UUIDs, or per-request data above the stable content.
Caching cuts prefill (input cost, TTFT), not generation. Combine with streaming for the full win.
Log cached_tokens and treat a low hit rate as a bug, not a given.

Prompt Caching: Cutting LLM Cost and Latency in Half

1. Why a prefix can be cached at all

2. Order your prompt static-first

3. Explicit vs implicit caching

4. What it does and doesn't speed up

5. Measure cache-hit rate

Rules of thumb

7 replies// weighed in

More from this topic

LLM Parameters Explained: Temperature, Top-P, and the Knobs That Actually Matter

Context Windows: What They Actually Cost You and How to Fit More In

RAG Without the Hallucinations: Retrieval Patterns That Hold Up