Temperature, Top-p & Tokens — Controlling LLM Output

Every LLM API exposes a few knobs. Understanding them is the difference between flaky and reliable AI features.

Temperature — randomness (0 to ~1)

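0: greedy decoding, picks the most likely token at every step. Use for extraction, classification, code, anything needing consistency. Hosted APIs are usually close to deterministic at 0, not perfectly so.
0.7–1.0: creative, varied. Use for brainstorming, writing, ideation. Some APIs accept values above 1, where output gets increasingly erratic.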

temperature=0    → same input, (almost always) same output. Reliable.
temperature=0.9  → same input, different outputs. Creative.
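
To make the effect concrete, here is a minimal sketch of how temperature is commonly applied: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The function name and the three example scores are made up for illustration, and treating 0 as "always pick the top token" is a convention assumed here, not a description of any particular provider's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Assumed convention: 0 means greedy decoding, so all probability
        # goes to the single top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (0, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token; higher ones flatten the distribution, which is where the extra variety comes from.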

Top-p (nucleus sampling)

An alternative randomness control: sample only from the smallest set of most likely tokens whose combined probability reaches p (e.g. 0.9), and ignore the long tail. Usually leave it at the default and tune temperature instead; changing both at once makes it hard to tell which setting caused what.
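
A rough sketch of the filtering step, assuming the next-token probabilities are already available as a dict. The function name and the toy vocabulary are hypothetical; real implementations work on tensors over the full vocabulary, but the logic is the same.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: prob / cumulative for token, prob in kept}

# Made-up next-token probabilities.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "axolotl": 0.05}
print(top_p_filter(probs, 0.9))  # "axolotl" falls outside the nucleus
```

The model then samples from the filtered set only, so unlikely tokens can never be picked no matter how the dice fall.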

Max tokens — the length cap

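Caps the response length (and therefore output cost). The cap truncates: a response that hits it is cut off, not condensed to fit. Remember: input and output tokens both count toward the context window and your bill, so a long document in the prompt costs tokens too.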
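
The arithmetic can be sketched as below. The helper names, the 8,000-token window, and the per-1,000-token prices are all placeholder assumptions for illustration; check your provider's actual limits and pricing.

```python
def fits_context(prompt_tokens, max_tokens, context_window):
    """True if the prompt plus the largest allowed reply fits in the window."""
    return prompt_tokens + max_tokens <= context_window

def estimate_cost(prompt_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Estimated bill: input and output tokens are both charged."""
    return (prompt_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k

# Hypothetical numbers: a 6,000-token document in the prompt, replies capped at 500.
print(fits_context(6000, 500, context_window=8000))
print(estimate_cost(6000, 500, input_price_per_1k=0.001, output_price_per_1k=0.002))
```

Note that in this example the prompt, not the reply, accounts for most of the cost.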

System prompt — the model's standing instructions

messages = [
  { "role": "system", "content": "You are a concise tutor. Use simple English." },
  { "role": "user",   "content": "Explain recursion." },
]
# the system prompt shapes ALL responses — set persona, rules, format here.

Defaults that just work: temperature 0 for anything factual/structured, 0.7 for creative. Put rules and persona in the system prompt. Cap max_tokens to control cost.
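
Those defaults could be wrapped in a small helper like the one below. This is only a sketch: build_request and its task labels are hypothetical, the field names mirror the ones used in this article, and the resulting dict would still need to be passed to whichever client library you use.

```python
def build_request(task, system_prompt, user_prompt, max_tokens=300):
    """Assemble request settings using the defaults described above."""
    # 0.7 for creative work, 0 for anything factual or structured.
    temperature = 0.7 if task == "creative" else 0.0
    return {
        "temperature": temperature,
        "max_tokens": max_tokens,  # cap the reply length to control cost
        "messages": [
            {"role": "system", "content": system_prompt},  # persona, rules, format
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_request(
    task="structured",
    system_prompt="You are a concise tutor. Use simple English.",
    user_prompt="Explain recursion.",
)
print(request["temperature"], request["max_tokens"])
```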
