Most LLM APIs expose the same handful of knobs. Understanding them is the difference between flaky and reliable AI features.
Temperature — randomness (0 to ~1)
- 0: near-deterministic. The model picks the most likely token at every step (greedy decoding). Use for extraction, classification, code, and anything that needs consistency.
- 0.7–1.0: creative, varied. Use for brainstorming, writing, and ideation. (Some APIs, such as OpenAI's, accept values up to 2.)
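temperature=0 → same input, (almost always) the same output. Reliable. temperature=0.9 → same input, different outputs. Creative. Many hosted APIs don't guarantee bit-identical results even at 0, so don't build logic that depends on exact repeats.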
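To make the knob concrete, here is a minimal sketch of how temperature reshapes a next-token distribution. It is a toy model, not any provider's implementation: the logits and the helper names `token_probabilities` and `sample_next_token` are hypothetical, and real APIs do this sampling server-side.

```python
import math
import random


def token_probabilities(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    # Dividing by a temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Subtract the max before exponentiating for numerical stability.
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def sample_next_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token; temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = token_probabilities(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical next-token scores after the prompt "The capital of France is"
logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0}
for t in (0.5, 1.0, 2.0):
    print(t, round(token_probabilities(logits, t)["Paris"], 3))
```

With these made-up scores, the probability of "Paris" falls from roughly 0.997 at temperature 0.5 to about 0.94 at 1.0 and 0.74 at 2.0, which is why higher temperatures produce more varied output.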
Top-p (nucleus sampling)
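An alternative randomness control: the model samples only from the smallest set of most-likely tokens whose probabilities add up to at least p (e.g. 0.9), ignoring the unlikely tail. Usually leave it at its default and tune temperature instead; changing both at once makes it hard to tell which one caused a change in output.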
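A minimal sketch of the nucleus filter described above, applied to an already-computed probability distribution. The distribution and the helper names `top_p_filter` and `sample_top_p` are hypothetical, and the sketch does not model how a given provider combines top-p with temperature.

```python
import random


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of most-likely tokens whose total probability reaches p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so their probabilities sum to 1 again.
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}


def sample_top_p(probs: dict[str, float], p: float, rng: random.Random) -> str:
    """Sample one token from the nucleus only."""
    nucleus = top_p_filter(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "banana": 0.05}
print(top_p_filter(probs, 0.9))  # "banana" is cut; the other three are rescaled
```

Lowering p shrinks the candidate set: at p=0.5 only "the" survives, so sampling becomes greedy for this distribution.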
Max tokens — the length cap
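Caps how many tokens the model may generate, which bounds response length and output cost. A reply that hits the cap is simply cut off, not shortened gracefully. Remember: input and output tokens both count toward the context window and your bill, so a long document in the prompt costs tokens too.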
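A back-of-the-envelope sketch of that arithmetic. The per-million-token prices and the 8,192-token context window are made-up placeholders, not any provider's real numbers, and `estimate_cost_usd` and `fits_context` are hypothetical helpers. Because max tokens is only a ceiling, the result is a worst-case cost.

```python
def estimate_cost_usd(
    input_tokens: int,
    max_output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Worst-case cost of one request: all prompt tokens plus a reply that hits the cap."""
    return (
        input_tokens * input_price_per_million
        + max_output_tokens * output_price_per_million
    ) / 1_000_000


def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Prompt plus the reserved reply length must fit in the model's context window."""
    return input_tokens + max_output_tokens <= context_window


# A 20,000-token document summarized in at most 500 tokens, at placeholder prices
# of $3 per million input tokens and $15 per million output tokens.
print(estimate_cost_usd(20_000, 500, 3.0, 15.0))  # prompt: $0.06, reply: at most $0.0075
print(fits_context(20_000, 500, context_window=8_192))  # False: the prompt alone overflows
```

Even with a tight cap on the reply, the long prompt dominates the bill here, which is the point of the paragraph above.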
System prompt — the model's standing instructions
```python
messages = [
    {"role": "system", "content": "You are a concise tutor. Use simple English."},
    {"role": "user", "content": "Explain recursion."},
]
# The system prompt shapes ALL responses: set the persona, rules, and output format here.
```

Defaults that just work: temperature 0 for anything factual or structured, 0.7 for creative work. Put rules and persona in the system prompt. Cap max_tokens to control cost.
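One way to bake those defaults into code is a small request builder, sketched below under stated assumptions. `build_request` and the task labels in `TEMPERATURE_BY_TASK` are hypothetical, and the payload only mirrors the chat-style messages list shown earlier. Parameter names and placement vary by provider (Anthropic's Messages API, for example, takes the system prompt as a separate top-level field rather than as a message), so check your provider's reference before sending it.

```python
from typing import Any

# Hypothetical task labels mapped to the default temperatures suggested above.
TEMPERATURE_BY_TASK = {"factual": 0.0, "structured": 0.0, "creative": 0.7}


def build_request(
    task: str, system_prompt: str, user_text: str, max_tokens: int = 512
) -> dict[str, Any]:
    """Assemble a chat-style payload; an unknown task label raises KeyError."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},  # standing instructions
            {"role": "user", "content": user_text},
        ],
        "temperature": TEMPERATURE_BY_TASK[task],
        "max_tokens": max_tokens,  # hard cap on reply length and output cost
    }


req = build_request("factual", "You are a concise tutor. Use simple English.", "Explain recursion.")
print(req["temperature"], req["max_tokens"])
```

Centralizing the choice keeps every call site consistent: changing a default means editing one dictionary instead of hunting for scattered temperature values.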