Prompt Caching: Cutting LLM Cost and Latency in Half
If you send the same system prompt and context on every request, you're paying to re-process it every time. Prompt caching reuses that work — here's how to structure prompts to hit the cache.
Notes on building with LLMs — context windows and their real costs, retrieval vs long-context trade-offs, prompt caching, tool use, agent loops, and the eval discipline that keeps a feature shippable.
4 articles · updated regularly
Retrieval-augmented generation is only as good as what you retrieve. The model isn't the hard part — chunking, hybrid search, and reranking are.
1M-token context isn't free, isn't free, and also isn't free. Here's how attention scales, why "lost in the middle" is real, and the four techniques I use to stop fighting the window.
Every LLM API has a dozen tunable knobs. Most engineers only know temperature, and they tune it wrong. Here's what each parameter actually does to the math, and which ones I touch in production.