Context Windows: What They Actually Cost You and How to Fit More In
1M-token context isn't free, isn't free, and also isn't free. Here's how attention scales, why "lost in the middle" is real, and the four techniques I use to stop fighting the window.
Every six months a model ships with a bigger context window and the discourse repeats: RAG is dead, just dump everything in context. Six months later, someone runs the math, the latency, and the eval, and discovers that no, RAG is fine, and yes, the context window has costs nobody put on the marketing slide.
This is what I tell my team about contexts. Mostly so I don't have to repeat it.
1. What a context window actually is
The context window is the maximum number of tokens the model can attend to in a single forward pass — system prompt + chat history + retrieved documents + the question + the model's own response, all of it. It's a hard architectural limit baked into the model's positional encoding.
A token is roughly 3–4 characters of English, less in code, more in non-Latin scripts. "100k token context" is roughly 75,000 words, or ~150 dense pages of text. Or one moderately verbose Postgres dump.
2. The cost is quadratic, not linear
This is the part most people miss. Standard transformer attention is O(N²) in sequence length. Doubling the context doesn't double the compute — it quadruples it. The math:
attention_ops ≈ N · N · d_model
= O(N²) per layer
Modern models use various tricks to soften this — sliding-window attention, flash attention, sparse attention, MoE — but the fundamental ratio doesn't go away. A 1M-token context isn't 10× more expensive than a 100k context. It's closer to 100× if your model uses dense attention.
What this means in practice:
- API cost scales with input tokens — usually linearly per the price sheet, but the underlying compute is super-linear.
- API latency scales worse than linearly. Time-to-first-token at 500k context can be 8–15 seconds before the response starts.
- Beyond some point, providers throttle long-context requests or queue them differently. "Cheap" 1M-token calls take noticeably longer than the API page would suggest.
3. "Lost in the middle" is a real phenomenon
Stanford's 2023 paper showed that LLMs reliably attend to the start and end of a long context but lose track of facts placed in the middle. Subsequent work has confirmed this across model families. The needle-in-a-haystack benchmark some labs publish is misleading because the needle is a single, distinctive sentence — real applications need to retrieve multiple, similarly-formatted facts, where the model's edge degrades faster.
Practical implication: where in the context you put a piece of information matters. Important instructions go at the very start (system prompt) or very end (right before the question). Don't bury the lede 80% of the way through 200k tokens of chat history.
4. Tokens come from places you forget about
The output tokens count too. So does every assistant turn in a chat history. So does the JSON schema for tool calls. So does the system prompt. Real-world breakdown of a typical production prompt:
- System prompt: 800 tokens
- Tool definitions (5 tools): 1,200 tokens
- Chat history (last 10 turns): 4,000 tokens
- Retrieved context (RAG, 5 chunks): 3,500 tokens
- User message: 200 tokens
- Total input: ~9,700 tokens
- Expected output: 600 tokens
That's already 10k tokens before your app does anything interesting. Multiply by your QPS and your bill is real.
5. Four techniques that actually work
5.1 Hierarchical context
Don't pass the full document. Pass a summary, plus the specific chunks relevant to the question. The summary catches "general background" questions; the chunks catch specifics.
Context = 1-paragraph doc summary + top-3 retrieved chunks
Token cost drops by 80%. Quality on most queries: indistinguishable. Quality on "summarise the whole doc" queries: better, because the summary is already there.
5.2 Conversation compression
Don't keep all 50 prior chat turns in context. After turn N, summarise turns 1…N-10 into a single "so far" block, and only keep the last 10 verbatim.
This is a 2-line change and saves 40% of token cost on long conversations. The summary is generated by the model itself and stays useful for ~10 more turns before you compress again.
5.3 Cache-aware structure
Anthropic, OpenAI, and others now expose prompt caching: the API caches the prefix of identical prompts, so you pay (much) less for tokens that don't change between requests.
Structure your prompt so the cacheable part is at the front:
[system prompt] ← stable, cached
[tool definitions] ← stable, cached
[doc context] ← stable per session, cached
[chat history] ← changes each turn
[user message] ← changes each turn
Cost saving: 50–90% on the cacheable portion, depending on the provider. Latency saving: noticeable, especially at high QPS. Just by reordering. This is the single highest-leverage change you can make in a long-context app.
5.4 Don't use the long window for everything
If 90% of your queries can be answered with 4k tokens of context, route those to a smaller, cheaper model. Reserve the 200k-token Claude calls for the 10% that genuinely need them. A simple classifier on the incoming query — "does this need long context?" — pays for itself in a week.
6. When you actually need the long window
- Single-document analysis where the document genuinely doesn't decompose (a contract, a single book chapter).
- Code refactors that need to see the full module to be safe.
- Multi-step reasoning that needs all prior steps in scope.
- One-shot bulk processing where round-trip latency matters more than cost.
For everything else, retrieval still wins. The question is never "long context vs RAG." It's "which mix of long context, RAG, summarisation, and caching minimises my cost-per-correct-answer."
The mental model
Context is a budget, not a buffet. The model has a finite amount of "attention" to spend across whatever you put in. Stuffing more in doesn't always help — it dilutes attention to the bits that mattered. Most production wins come from putting less in the window, putting it in better order, and structuring it so it caches.
The teams I've seen succeed with LLMs treat context like a senior engineer treats memory in a hot path: every byte is a decision, and the cheap-looking ones (verbose system prompts, every prior chat turn, full document dumps) compound into the expensive ones (slow latency, runaway cost, lost-in-the-middle bugs).