ML
AI · LLM

Prompt Caching: Cutting LLM Cost and Latency in Half

If you send the same system prompt and context on every request, you're paying to re-process it every time. Prompt caching reuses that work — here's how to structure prompts to hit the cache.

May 24, 20268 min readAIScaling

Most LLM apps resend a large, identical preamble on every call — a system prompt, tool definitions, few-shot examples, retrieved documents. The model re-tokenises and re-processes all of it each time, and you pay full input price for tokens that never changed. Prompt caching lets the provider reuse the computed state for a repeated prefix, cutting both cost and time-to-first-token.

1. Why a prefix can be cached at all

A transformer processes tokens left to right; the internal state for position N depends only on tokens 0..N. So if two requests share the first 2,000 tokens, the work for those tokens is identical and can be reused. The catch follows directly: caching only works on an exact prefix match. One different character near the top busts the cache for everything after it.

2. Order your prompt static-first

This is the whole technique. Put everything stable at the top, everything variable at the bottom.

[ system instructions ]   ← static  ┐
[ tool definitions     ]   ← static  │ cacheable prefix
[ few-shot examples    ]   ← static  ┘
[ retrieved context    ]   ← semi-static
[ user question        ]   ← variable (cache miss starts here)

A common own-goal is injecting a timestamp or a request ID into the system prompt. That single dynamic token at the top means you never cache anything.

3. Explicit vs implicit caching

  • Implicit (some providers) — automatic for any repeated prefix over a minimum length. Free; you just structure prompts well.
  • Explicit (cache breakpoints / cache_control) — you mark where the cacheable prefix ends. More control, often a small write cost but large read discount.

Cached input tokens are typically billed at a fraction (often ~10%) of the normal input rate, and cache entries have a short TTL (minutes), so the win is biggest for bursty, repeated traffic.

4. What it does and doesn't speed up

Prompt caching shortens the prefill phase — processing the input — which lowers time-to-first-token and input cost. It does not change generation: output tokens cost the same and stream at the same rate. If your latency is dominated by long outputs, caching helps less than streaming would.

5. Measure cache-hit rate

Most APIs return cached_tokens (or similar) in the usage block. Log it and watch the ratio. A hit rate near zero on a stable-prompt app means something dynamic is leaking into your prefix — hunt it down before optimising anything else.

Rules of thumb

  • Sort prompts static-first. The cacheable part must be a byte-identical prefix.
  • Never put timestamps, UUIDs, or per-request data above the stable content.
  • Caching cuts prefill (input cost, TTFT), not generation. Combine with streaming for the full win.
  • Log cached_tokens and treat a low hit rate as a bug, not a given.
SharePostLinkedIn

Reader Discussion

7 replies// weighed in

TopNewestAuthor
Add to the thread
Disagree, agree harder, or share your own experience…
Email instead →markdown okbe kind
  1. Highlighted by author
    Sania Patel· ML EngineerAgrees

    "temperature is a flatness dial, not a creativity dial" — saving this. so many product folks ask me to "make the AI more creative" by cranking temperature and i can finally point them at a paragraph instead of a 20-min explainer.

    May 26, 2026·2 days later
  2. Jorge Ramírez· Senior SWE · AI infraFrom experience

    prompt caching reordering is THE highest-leverage change in any LLM app. moved our system prompt + tool defs to the front, kept dynamic stuff at the end — 73% cost reduction overnight. zero quality change. shipped on a Tuesday.

    May 27, 2026·3 days later
  3. Mai Đỗ🇻🇳 SG· AI EngineerAgrees

    đoạn "context là budget, không phải buffet" hợp với cảm giác mình giải thích cho dev mới hàng tuần. cứ stuff full doc vô context xong than "sao chậm thế." RAG + summary gần như luôn thắng.

    May 28, 2026·4 days later
  4. Tobias Eriksson· Research EngineerPushback

    tiny addition — needle-in-a-haystack benchmarks are increasingly gamed by post-training. real-world long-context perf on multi-fact retrieval is still mediocre. trust evals on YOUR data, not vendor blog posts. (otherwise spot on.)

    May 30, 2026·6 days later
  5. Leila Hamidi· Tech LeadFrom experience

    the "router to a smaller model" pattern paid for our entire LLM bill. classifier costs $0.0001/req, downstream model costs reduced 40%. should be the first optimisation any LLM-heavy team ships.

    May 29, 2026·5 days later
  6. Léa Dubois· SREAsks

    any chance you'd publish these as a PDF collection? would love to print and read offline on flights. screen-fatigue is real.

    May 30, 2026·6 days later
  7. Ahmed Rahman· Full StackKind words

    concise + opinionated = my favourite kind of engineering post. so many blogs hedge every claim into mush. give me the spicy take with the receipts. more please.

    May 25, 2026·1 day later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email