
LLM Parameters Explained: Temperature, Top-P, and the Knobs That Actually Matter

Every LLM API has a dozen tunable knobs. Most engineers only know temperature, and they tune it wrong. Here's what each parameter actually does to the math, and which ones I touch in production.

April 02, 2026 · 11 min read · AI · LLM · Fundamentals

Open the docs of any modern LLM API and you'll see something like fifteen optional fields under "sampling parameters." Most engineers cargo-cult one or two and never touch the rest. That's fine for prototypes. For production, knowing what each parameter actually does to the probability distribution is the difference between an app that works and an app that drifts.

This is the tour I wish I'd had on day one.

1. The thing under the hood

Every "generate next token" step is the same operation: the model produces a vector of logits — one number per token in the vocabulary, typically 50k–200k entries. Higher logit = the model thinks this token is more likely to come next. To turn logits into probabilities, you apply softmax:

p_i = exp(logit_i / T) / Σ exp(logit_j / T)

That T is temperature. Then the API samples one token from that probability distribution. Most of the other sampling parameters are ways of shaping that distribution before the sample is drawn.
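A minimal sketch of that step in plain Python, assuming a toy three-token vocabulary. The function names and the logit values are made up for illustration; this is not any particular API's internals.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing each one by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle T = 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Pick one token index; T = 0 falls back to greedy decoding (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical 3-token vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

Running it shows the top token's share shrinking as T grows, which is the "flatness" effect in numbers.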

2. Temperature

What it does: divides the logits by T before softmax.

  • T = 0 (or near zero): the highest-logit token always wins. Deterministic. "Greedy decoding." (The formula divides by T, so implementations special-case T = 0 as an argmax.)
  • T = 1: the distribution is unchanged from the model's raw output.
  • T = 2: the distribution flattens, so unlikely tokens become more likely. More varied, more wild.

What it's NOT: a creativity dial. It's a flatness dial. Setting temperature to 2 doesn't make the model smarter; it makes it sample lower-probability tokens more often, which can be creative or can be nonsense.

In practice: 0.0–0.3 for code, JSON, classification, anything where you want predictable output. 0.6–0.9 for prose, summaries, drafting. I almost never go above 1.0.

3. Top-P (nucleus sampling)

What it does: before sampling, sort tokens by probability, take the smallest set whose cumulative probability ≥ P, and renormalize. Then sample from that set.

Example: top_p=0.9 with a long-tail distribution might keep only 8 tokens out of 50,000. The model can never sample anything outside the top 90% of its mass.

Why it exists: high temperature makes unlikely tokens more likely, including completely incoherent ones. Top-P cuts off the long tail of nonsense while still letting the model pick interestingly within the plausible set.
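The sort, cut, renormalize steps can be sketched in a few lines. This is an illustrative version over a plain list of probabilities; `top_p_filter` and the toy distribution are assumptions for the example, not a library function.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= top_p, then renormalize."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)  # dropped tokens get probability zero
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]  # hypothetical distribution
print(top_p_filter(toy_probs, top_p=0.85))
```

With this toy distribution and top_p=0.85, the three least likely tokens are zeroed out and the remaining three are rescaled to sum to 1.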

Rule of thumb: tune temperature or top-P, and don't push both to extreme values at once. temperature=0.7, top_p=0.9 is a sane default. temperature=2, top_p=0.99 is asking for word salad.

4. Top-K

What it does: same idea as top-P, but "keep the top K tokens" regardless of their cumulative probability.

Less popular than top-P because it's blunt — top-3 in a peaked distribution is fine, but top-3 in a flat distribution might cut off most of the probability mass. Top-P adapts; top-K doesn't.

I leave top-K alone unless I have a specific reason. If you're using both, most implementations apply top-K first, then top-P.
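To see why a fixed K is blunt, here is a small sketch that measures how much probability mass survives top-3 on a peaked versus a flat distribution. Both distributions and the helper names are invented for the example.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; everything else becomes zero."""
    if k < 1:
        raise ValueError("k must be >= 1")
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / mass
    return filtered


def kept_mass(probs, k):
    """Fraction of the original probability mass that survives a top-k cut."""
    return sum(sorted(probs, reverse=True)[:k])


peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # hypothetical: one dominant token
flat = [0.1] * 10                        # hypothetical: ten equally likely tokens

print("peaked, top-3 keeps:", round(kept_mass(peaked, 3), 2))
print("flat,   top-3 keeps:", round(kept_mass(flat, 3), 2))
print(top_k_filter(peaked, 3))
```

The same K keeps nearly everything in the peaked case and discards most of the mass in the flat one, which is the adaptivity top-P has and top-K lacks.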

5. Max tokens

The hard ceiling on response length. Two things to know:

  • It's tokens, not characters. ~4 chars per token in English; ~2 chars per token in code. Both figures vary by tokenizer.
  • The model doesn't know it has a budget unless you tell it. Setting max_tokens=200 won't make the model wrap up gracefully — it just truncates whenever the count runs out, mid-sentence.

If you want concise output, ask for it in the prompt. max_tokens is a safety belt against runaway costs, not a length control.
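For budgeting, a back-of-the-envelope sketch built on the ~4 characters per token figure above. `estimate_tokens` and `looks_truncated` are hypothetical helpers; real counts come from the provider's tokenizer and usage fields, so treat these as rough guards only.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count; the ratio is an assumption, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def looks_truncated(completion_tokens, max_tokens):
    """Heuristic: a response that used its whole budget was probably cut off mid-thought."""
    return completion_tokens >= max_tokens


draft = "word " * 200  # 1,000 characters of stand-in text
print(estimate_tokens(draft))                       # English-like ratio
print(estimate_tokens(draft, chars_per_token=2.0))  # code-like ratio
print(looks_truncated(completion_tokens=200, max_tokens=200))
```

A check like `looks_truncated` is useful in logs: a spike in responses that hit the ceiling usually means the prompt needs a length instruction, not a bigger max_tokens.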

6. Frequency penalty & presence penalty

Both subtract from the logits of tokens that have already appeared in the text so far:

  • Frequency penalty: the more often a token has appeared, the bigger the penalty. Discourages repetition.
  • Presence penalty: any token that has appeared at least once gets the same flat penalty, however many times it appeared. Encourages topic diversity.

Useful for long-form generation that drifts into loops. Both default to 0; useful range is roughly 0.1 to 0.6. Above 1.0 you start to see the model avoiding common words like "the" — pathological.
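One common way to write the two penalties down, sketched over a plain list of logits. Providers may differ in the details, so read this as an illustration; `apply_penalties` and the toy values are assumptions for the example.

```python
from collections import Counter


def apply_penalties(logits, seen_token_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with repetition penalties subtracted for already-seen tokens."""
    counts = Counter(seen_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # grows with every repeat
        adjusted[token_id] -= presence_penalty           # flat, same for 1 or 100 uses
    return adjusted


toy_logits = [1.0, 1.0, 1.0]  # hypothetical 3-token vocabulary
seen = [0, 0, 0, 1]           # token 0 used three times, token 1 once
print(apply_penalties(toy_logits, seen, frequency_penalty=0.5, presence_penalty=0.2))
```

Token 0 is pushed down hardest, token 1 gets a smaller hit, and the unseen token 2 is untouched.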

7. Seed

Most modern APIs accept a seed for reproducibility. With temperature=0, sampling drops out and the same prompt returns the same output nearly every time, seed or no seed. With temperature>0, the same seed + same prompt returns the same output most of the time.

I say "most of the time" because the model is run on multiple GPUs in parallel; floating-point non-determinism across hardware can still produce small differences. Treat seed as best-effort, not a guarantee.
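What a seed buys you can be sketched with Python's standard `random` module standing in for the sampler. `sample_sequence` and the toy distribution are hypothetical; a real serving stack adds the floating-point variation described above, which this toy does not model.

```python
import random


def sample_sequence(probs, length, seed=None):
    """Draw `length` token indices from a fixed distribution using a private, seeded RNG."""
    rng = random.Random(seed)  # separate generator, so global random state is untouched
    tokens = range(len(probs))
    return [rng.choices(tokens, weights=probs, k=1)[0] for _ in range(length)]


toy_probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
first = sample_sequence(toy_probs, 10, seed=42)
second = sample_sequence(toy_probs, 10, seed=42)
print(first == second)  # same seed and same probabilities give the same draws
```

The guarantee only holds while the probabilities are bit-for-bit identical. If hardware differences nudge a logit, the same seed can land on a different token.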

8. Stop tokens

List of strings that, if generated, terminate the response. Useful for structured output:

{
  "stop": ["", "\n\nUser:"]
}

The model doesn't know about your stop tokens until it generates one. They're a post-hoc cutoff, not a constraint on what the model produces. Useful, not magic.
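The post-hoc nature is easy to see in code: the cutoff is just a string search over text that has already been generated. A minimal sketch, with `truncate_at_stop` and the sample text invented for illustration:

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:  # an empty string would match at position 0, so skip it
            continue
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = "The answer is 42.\n\nUser: thanks!"  # hypothetical raw model output
print(repr(truncate_at_stop(raw, ["\n\nUser:"])))
```

Nothing here influences what the model writes before the stop string appears, which is why stop sequences cannot enforce a format on their own.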

9. Reasoning / thinking budgets

Newer reasoning models (Claude with extended thinking, OpenAI's o-series, DeepSeek R1) expose a reasoning budget — a separate token cap for internal chain-of-thought before the visible answer. Bigger budgets = more reasoning = better answers on hard problems = more cost.

Rule of thumb: most tasks don't need reasoning. Hard math, multi-step logic, planning, code that has to be correct on first try — those benefit. Customer support chatbots don't. Don't pay for thinking you don't use.

10. The settings I actually use in production

Use case                    | temp    | top_p | notes
Classification / extraction | 0.0     | 1.0   | Deterministic. Validate output schema.
Code generation             | 0.2     | 0.95  | Slight randomness avoids stuck-ruts.
Tool use / function calling | 0.0–0.2 | 1.0   | You want the same tool args every time.
Drafting / editing prose    | 0.7     | 0.9   | Standard creative-but-grounded.
Brainstorming               | 0.9     | 0.95  | Higher temp, accept some chaff.
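One way to keep these settings in one place is a small preset lookup. The key names and the `sampling_params` helper are hypothetical, and the tool-use entry picks the lower bound of the 0.0–0.2 range as an assumption.

```python
# Presets mirroring the settings listed above; the key names are made up for this sketch.
SAMPLING_PRESETS = {
    "classification": {"temperature": 0.0, "top_p": 1.0},
    "code_generation": {"temperature": 0.2, "top_p": 0.95},
    "tool_use": {"temperature": 0.0, "top_p": 1.0},  # listed range is 0.0–0.2; lower bound assumed
    "drafting": {"temperature": 0.7, "top_p": 0.9},
    "brainstorming": {"temperature": 0.9, "top_p": 0.95},
}


def sampling_params(use_case, **overrides):
    """Return a fresh copy of the preset for a use case, with optional per-call overrides."""
    params = dict(SAMPLING_PRESETS[use_case])  # copy, so the shared preset is never mutated
    params.update(overrides)
    return params


print(sampling_params("drafting"))
print(sampling_params("tool_use", temperature=0.2))
```

Centralizing the values makes it obvious when a call site drifts from the defaults, since every exception shows up as an explicit override.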

The thing nobody tells you

Sampling parameters are rarely the actual problem with your LLM app. The actual problem is almost always the prompt, the context, or the schema. I've watched teams spend a week tuning top_p when their bug was "the system prompt was 800 tokens and contradicted itself in the middle."

Tune temperature first (one knob). Pick a sane top_p (0.9) and leave it. Reach for the rest only when you have a specific symptom they're meant to address. Plenty of production LLM systems run on temperature=0.2 and never touch anything else.


Reader Discussion

  1. Highlighted by author
    Sania Patel · ML Engineer · Agrees

    "temperature is a flatness dial, not a creativity dial" — saving this. so many product folks ask me to "make the AI more creative" by cranking temperature and i can finally point them at a paragraph instead of a 20-min explainer.

    Apr 04, 2026 · 2 days later
  2. Jorge Ramírez · Senior SWE · AI infra · From experience

    prompt caching reordering is THE highest-leverage change in any LLM app. moved our system prompt + tool defs to the front, kept dynamic stuff at the end — 73% cost reduction overnight. zero quality change. shipped on a Tuesday.

    Apr 05, 2026 · 3 days later
  3. Mai Đỗ 🇻🇳 SG · AI Engineer · Agrees

    the "context is a budget, not a buffet" bit matches what i end up explaining to new devs every week. they stuff the full doc into context and then complain "why is it so slow." RAG + summary almost always wins.

    Apr 06, 2026 · 4 days later
  4. Tobias Eriksson · Research Engineer · Pushback

    tiny addition — needle-in-a-haystack benchmarks are increasingly gamed by post-training. real-world long-context perf on multi-fact retrieval is still mediocre. trust evals on YOUR data, not vendor blog posts. (otherwise spot on.)

    Apr 08, 2026 · 6 days later
  5. Leila Hamidi · Tech Lead · From experience

    the "router to a smaller model" pattern paid for our entire LLM bill. classifier costs $0.0001/req, downstream model costs reduced 40%. should be the first optimisation any LLM-heavy team ships.

    Apr 07, 2026 · 5 days later
  6. Rachel Gold · Staff SRE · Agrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    Apr 05, 2026 · 3 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.
