Prompt Caching: Cutting LLM Cost and Latency in Half
If you send the same system prompt and context on every request, you're paying to re-process it every time. Prompt caching reuses that work — here's how to structure prompts to hit the cache.
Most LLM apps resend a large, identical preamble on every call — a system prompt, tool definitions, few-shot examples, retrieved documents. The model re-tokenises and re-processes all of it each time, and you pay full input price for tokens that never changed. Prompt caching lets the provider reuse the computed state for a repeated prefix, cutting both cost and time-to-first-token.
1. Why a prefix can be cached at all
A transformer processes tokens left to right; the internal state for position N depends only on tokens 0..N. So if two requests share the first 2,000 tokens, the work for those tokens is identical and can be reused. The catch follows directly: caching only works on an exact prefix match. One different character near the top busts the cache for everything after it.
2. Order your prompt static-first
This is the whole technique. Put everything stable at the top, everything variable at the bottom.
[ system instructions ] ← static ┐
[ tool definitions ] ← static │ cacheable prefix
[ few-shot examples ] ← static ┘
[ retrieved context ] ← semi-static
[ user question ] ← variable (cache miss starts here)
A common own-goal is injecting a timestamp or a request ID into the system prompt. That single dynamic token at the top means you never cache anything.
3. Explicit vs implicit caching
- Implicit (some providers) — automatic for any repeated prefix over a minimum length. Free; you just structure prompts well.
- Explicit (cache breakpoints /
cache_control) — you mark where the cacheable prefix ends. More control, often a small write cost but large read discount.
Cached input tokens are typically billed at a fraction (often ~10%) of the normal input rate, and cache entries have a short TTL (minutes), so the win is biggest for bursty, repeated traffic.
4. What it does and doesn't speed up
Prompt caching shortens the prefill phase — processing the input — which lowers time-to-first-token and input cost. It does not change generation: output tokens cost the same and stream at the same rate. If your latency is dominated by long outputs, caching helps less than streaming would.
5. Measure cache-hit rate
Most APIs return cached_tokens (or similar) in the usage block. Log it and watch the ratio. A hit rate near zero on a stable-prompt app means something dynamic is leaking into your prefix — hunt it down before optimising anything else.
Rules of thumb
- Sort prompts static-first. The cacheable part must be a byte-identical prefix.
- Never put timestamps, UUIDs, or per-request data above the stable content.
- Caching cuts prefill (input cost, TTFT), not generation. Combine with streaming for the full win.
- Log
cached_tokensand treat a low hit rate as a bug, not a given.