AI · LLM — Engineering Journal

May 24, 2026 · 8 min

Prompt Caching: Cutting LLM Cost and Latency in Half

If you send the same system prompt and context on every request, you're paying to re-process it every time. Prompt caching reuses that work — here's how to structure prompts to hit the cache.

AI · Scaling


AI · LLM

May 22, 2026 · 11 min

RAG Without the Hallucinations: Retrieval Patterns That Hold Up

Retrieval-augmented generation is only as good as what you retrieve. The model isn't the hard part — chunking, hybrid search, and reranking are.

AI · System Design


AI · LLM

Apr 16, 2026 · 12 min

Context Windows: What They Actually Cost You and How to Fit More In

1M-token context isn't free, isn't free, and also isn't free. Here's how attention scales, why "lost in the middle" is real, and the four techniques I use to stop fighting the window.

AI · LLM


AI · LLM

Apr 02, 2026 · 11 min

LLM Parameters Explained: Temperature, Top-P, and the Knobs That Actually Matter

Every LLM API has a dozen tunable knobs. Most engineers only know temperature, and they tune it wrong. Here's what each parameter actually does to the math, and which ones I touch in production.

AI · LLM

