LLM Parameters Explained: Temperature, Top-P, and the Knobs That Actually Matter
Every LLM API has a dozen tunable knobs. Most engineers only know temperature, and they tune it wrong. Here's what each parameter actually does to the math, and which ones I touch in production.
Open the docs of any modern LLM API and you'll see something like fifteen optional fields under "sampling parameters." Most engineers cargo-cult one or two and never touch the rest. That's fine for prototypes. For production, knowing what each parameter actually does to the probability distribution is the difference between an app that works and an app that drifts.
This is the tour I wish I'd had on day one.
1. The thing under the hood
Every "generate next token" step is the same operation: the model produces a vector of logits — one number per token in the vocabulary, typically 50k–200k entries. Higher logit = the model thinks this token is more likely to come next. To turn logits into probabilities, you apply softmax:
p_i = exp(logit_i / T) / Σ exp(logit_j / T)
That T is temperature. Then the API samples one token from that probability distribution. Most of the other sampling parameters are ways of shaping that distribution before sampling; a few (max tokens, stop sequences, seed) control the sampling loop around it.
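To make the formula concrete, here is a minimal sketch in plain Python. The function name `softmax_with_temperature` and the four toy logits are made up for illustration; real APIs do this on vocabulary-sized tensors inside the inference server.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by temperature first."""
    if temperature <= 0:
        # Dividing by zero is undefined; T = 0 is normally handled as argmax.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    # Subtracting the max keeps exp() from overflowing and does not
    # change the resulting probabilities.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 4-token vocabulary (hypothetical numbers).
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking and the last token's share growing as T goes up, which is all temperature does.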
2. Temperature
What it does: divides the logits by T before softmax.
- T = 0 (or near zero): the highest-logit token always wins. Deterministic. "Greedy decoding."
- T = 1: the distribution is unchanged from the model's raw output.
- T = 2: the distribution flattens, so unlikely tokens become more likely. More creative, more wild.
What it's NOT: a creativity dial. It's a flatness dial. Setting temperature to 2 doesn't make the model smarter; it makes it sample lower-probability tokens more often, which can be creative or can be nonsense.
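One way to see the "flatness dial" idea is to measure the entropy of the distribution at different temperatures. This is a rough sketch with made-up logits; `softmax_t` and `entropy_bits` are hypothetical helper names, not part of any API.

```python
import math

def softmax_t(logits, t):
    """Softmax over logits divided by temperature t (t must be > 0)."""
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: 0 for a one-hot distribution, log2(n) for uniform."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy logits for a 5-token vocabulary (hypothetical numbers).
logits = [3.0, 1.5, 1.0, 0.0, -1.0]
for t in (0.2, 0.7, 1.0, 2.0):
    print(f"T={t}: entropy = {entropy_bits(softmax_t(logits, t)):.2f} bits")
```

Entropy climbs toward the uniform limit of log2(5) as T increases. Nothing about the model's knowledge changes; only how evenly the sampler spreads its bets.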
In practice: 0.0–0.3 for code, JSON, classification, anything where you want predictable output. 0.6–0.9 for prose, summaries, drafting. I almost never go above 1.0.
3. Top-P (nucleus sampling)
What it does: before sampling, sort tokens by probability, take the smallest set whose cumulative probability ≥ P, and renormalize. Then sample from that set.
Example: top_p=0.9 with a long-tail distribution might keep only 8 tokens out of 50,000. The model can never sample anything outside the top 90% of its mass.
Why it exists: high temperature shifts probability toward unlikely tokens, including completely incoherent ones. Top-P cuts off the long tail of nonsense while still letting the model pick interestingly within the plausible set.
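The sort, accumulate, renormalize steps can be sketched in a few lines. This assumes you already have a list of probabilities; `top_p_filter` is a hypothetical name and the six-token distribution is invented for the example.

```python
def top_p_filter(probs, top_p):
    """Return {token_index: renormalized_prob} for the smallest set of
    tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}

# Toy distribution over 6 tokens (hypothetical numbers).
probs = [0.5, 0.25, 0.1, 0.08, 0.04, 0.03]
nucleus = top_p_filter(probs, 0.9)
print(nucleus)
```

Here the first three tokens only cover 0.85 of the mass, so a fourth is pulled in and the last two are dropped. With a more peaked distribution the same `top_p` would keep fewer tokens, which is the adaptive behavior described above.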
Rule of thumb: pick a temperature OR a top-P, not both at extreme values. temperature=0.7, top_p=0.9 is a sane default. temperature=2, top_p=0.99 is asking for word salad.
4. Top-K
What it does: same idea as top-P, but "keep the top K tokens" regardless of their cumulative probability.
Less popular than top-P because it's blunt — top-3 in a peaked distribution is fine, but top-3 in a flat distribution might cut off most of the probability mass. Top-P adapts; top-K doesn't.
I leave top-K alone unless I have a specific reason. If you're using both, most implementations apply top-K first, then top-P, but check your provider's docs.
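A small sketch of why top-K is blunt: the same K keeps very different amounts of probability mass depending on the shape of the distribution. The helper names and both toy distributions are made up for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {index: p / total for index, p in ranked}

def mass_kept(probs, k):
    """Fraction of the original probability mass that survives top-k."""
    return sum(sorted(probs, reverse=True)[:k])

peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # one dominant token
flat = [0.1] * 10                        # ten equally likely tokens

print("peaked, k=3 keeps", round(mass_kept(peaked, 3), 2))
print("flat,   k=3 keeps", round(mass_kept(flat, 3), 2))
print(top_k_filter(peaked, 3))
```

On the peaked distribution top-3 discards almost nothing; on the flat one it throws away most of the mass.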
5. Max tokens
The hard ceiling on response length. Two things to know:
- It's tokens, not characters. ~4 chars per token in English; ~2 chars per token in code.
- The model doesn't know it has a budget unless you tell it. Setting max_tokens=200 won't make the model wrap up gracefully; it just truncates when the count runs out, even mid-sentence.
If you want concise output, ask for it in the prompt. max_tokens is a safety belt against runaway costs, not a length control.
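For back-of-the-envelope budgeting, the chars-per-token ratios above can be turned into a rough estimator. This is only a heuristic sketch: `estimate_tokens` is a hypothetical helper, and the real count depends on the model's tokenizer, so use the provider's tokenizer when the number matters.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate from character count (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)

english = "x" * 800  # stand-in for 800 characters of English prose
code = "x" * 800     # stand-in for 800 characters of source code

print(estimate_tokens(english))                    # ~4 chars per token
print(estimate_tokens(code, chars_per_token=2.0))  # ~2 chars per token
```

By this estimate the same 800 characters cost about twice as many tokens when they are code, so a max_tokens value that is comfortable for prose can truncate a code answer.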
6. Frequency penalty & presence penalty
Both subtract from the logits of tokens that have already appeared in the text so far:
- Frequency penalty: the more often a token appeared, the bigger the penalty. Discourages repetition.
- Presence penalty: any token that's appeared at least once gets the same flat penalty, no matter how many times. Encourages topic diversity.
Useful for long-form generation that drifts into loops. Both default to 0; useful range is roughly 0.1 to 0.6. Above 1.0 you start to see the model avoiding common words like "the" — pathological.
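Here is a minimal sketch of the additive form of both penalties: subtract `count * frequency_penalty` plus a flat `presence_penalty` from each token that has already appeared. The exact formula varies by provider, and the function name, toy logits, and history below are assumptions for illustration.

```python
from collections import Counter

def apply_penalties(logits, generated_token_ids,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with repetition penalties subtracted."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once
    return adjusted

# Toy 4-token vocabulary; token 0 was generated three times, token 1 once.
logits = [2.0, 1.0, 0.5, 0.0]
history = [0, 0, 0, 1]
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.3))
```

Token 0 takes the biggest hit because it repeated, token 1 takes a smaller one, and tokens 2 and 3 are untouched. Push the penalties high enough and frequent function words get hit the same way, which is where the pathological behavior comes from.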
7. Seed
Most modern APIs accept a seed for reproducibility. With temperature=0, the same prompt almost always returns the same output anyway. With temperature>0, the same seed + same prompt returns the same output most of the time.
I say "most of the time" because the model is run on multiple GPUs in parallel; floating-point non-determinism across hardware can still produce small differences. Treat seed as best-effort, not a guarantee.
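What a seed does to the sampling step can be shown with a toy local sampler. This sketch uses Python's standard `random` module, not any provider's implementation; `sample_tokens` and the probabilities are made up. A local single-process sampler like this is fully reproducible, which a distributed inference service may not be.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n token indices from probs using a seeded random generator."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=n)

# Toy distribution over 4 tokens (hypothetical numbers).
probs = [0.5, 0.3, 0.15, 0.05]
first = sample_tokens(probs, 10, seed=42)
second = sample_tokens(probs, 10, seed=42)

print(first)
print(first == second)  # same seed, same draws
```

The seed fixes the random draws, not the probabilities. If hardware-level floating-point differences nudge the probabilities between runs, the same draw can land on a different token.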
8. Stop tokens
List of strings that, if generated, terminate the response. Useful for structured output:
{
"stop": ["", "\n\nUser:"]
}
The model doesn't know about your stop tokens until it generates one. They're a post-hoc cutoff, not a constraint on what the model produces. Useful, not magic.
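The post-hoc cutoff behaves roughly like the sketch below: scan the generated text for the earliest stop string and cut there. `truncate_at_stop` is a hypothetical helper, and the choice to drop the stop string itself from the output is an assumption; check how your provider handles it.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            # Skip empty strings, which would match at position 0.
            continue
        position = text.find(stop)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]

raw = "Paris is the capital of France.\n\nUser: and Germany?"
print(repr(truncate_at_stop(raw, ["\n\nUser:"])))
```

Everything before the stop string was generated freely, so a stop sequence can't keep the model from rambling on its way there.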
9. Reasoning / thinking budgets
Newer reasoning models (Claude with extended thinking, OpenAI's o-series, DeepSeek R1) expose a reasoning budget — a separate token cap for internal chain-of-thought before the visible answer. Bigger budgets = more reasoning = better answers on hard problems = more cost.
Rule of thumb: most tasks don't need reasoning. Hard math, multi-step logic, planning, code that has to be correct on first try — those benefit. Customer support chatbots don't. Don't pay for thinking you don't use.
10. The settings I actually use in production
| Use case | temp | top_p | notes |
|---|---|---|---|
| Classification / extraction | 0.0 | 1.0 | Deterministic. Validate output schema. |
| Code generation | 0.2 | 0.95 | Slight randomness avoids stuck-ruts. |
| Tool use / function calling | 0.0–0.2 | 1.0 | You want the same tool args every time. |
| Drafting / editing prose | 0.7 | 0.9 | Standard creative-but-grounded. |
| Brainstorming | 0.9 | 0.95 | Higher temp, accept some chaff. |
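One way to keep these settings in a single place is a small preset table in code. This is a sketch, not a library API: `PRESETS`, the use-case keys, and `sampling_params` are hypothetical names, and the tool-use entry picks 0.0 from the 0.0–0.2 range in the table.

```python
# Values mirror the table above; keys are made-up identifiers.
PRESETS = {
    "classification": {"temperature": 0.0, "top_p": 1.0},
    "code_generation": {"temperature": 0.2, "top_p": 0.95},
    "tool_use": {"temperature": 0.0, "top_p": 1.0},  # table allows 0.0-0.2
    "drafting": {"temperature": 0.7, "top_p": 0.9},
    "brainstorming": {"temperature": 0.9, "top_p": 0.95},
}

def sampling_params(use_case, **overrides):
    """Return a copy of the preset for use_case with any overrides applied."""
    params = dict(PRESETS[use_case])
    params.update(overrides)
    return params

print(sampling_params("drafting"))
print(sampling_params("tool_use", temperature=0.2))
```

Returning a copy keeps per-call overrides from leaking back into the shared presets.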
The thing nobody tells you
Sampling parameters are rarely the actual problem with your LLM app. The actual problem is almost always the prompt, the context, or the schema. I've watched teams spend a week tuning top_p when their bug was "the system prompt was 800 tokens and contradicted itself in the middle."
Tune temperature first (one knob). Pick a sane top_p (0.9) and leave it. Reach for the rest only when you have a specific symptom they're meant to address. Plenty of production LLM systems run on a low temperature like 0.2 and never touch anything else.