How LLMs Actually Work — Next-Token Prediction Demystified

Strip away the magic: an LLM is a machine that predicts the next token, over and over. Everything it does emerges from that one skill done extraordinarily well.

Step 1: Tokenization

Text is split into tokens; in English, a token averages roughly ¾ of a word. "Learning" might be one token; "annauniversity" would likely be split into several. The model only ever sees token IDs (numbers), never the raw text.

"AI is amazing"  →  ["AI", " is", " amaz", "ing"]  →  [4521, 318, 6994, 278]
# the split and IDs are illustrative; real values depend on the tokenizer.
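
The split above can be reproduced with a toy tokenizer. This is a minimal sketch, assuming a tiny made-up vocabulary and a greedy longest-match rule; real tokenizers (byte-pair encoding, for example) learn their vocabulary from data, and the names `VOCAB`, `tokenize`, and `encode` are hypothetical.

```python
# Toy vocabulary: pieces and IDs are made up for illustration and mirror
# the example above, not any real tokenizer.
VOCAB = {"AI": 4521, " is": 318, " amaz": 6994, "ing": 278}


def tokenize(text, vocab):
    """Greedy longest-match split of text into vocabulary pieces."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


def encode(text, vocab):
    """Map text to the list of token IDs, which is all the model sees."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("AI is amazing", VOCAB))  # ['AI', ' is', ' amaz', 'ing']
print(encode("AI is amazing", VOCAB))    # [4521, 318, 6994, 278]
```

Note that the leading space belongs to the token (" is", not "is"), which is why the same word can map to different IDs depending on where it appears.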

Step 2: Next-token prediction

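Given the tokens so far, the model outputs a probability for every token in its vocabulary, then picks one (either the most likely, or a random draw weighted by those probabilities). It appends that token, feeds the whole sequence back in, and predicts again. That loop generates the whole response.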

"The capital of France is"  →  Paris (0.94)  London (0.01)  a (0.01) ...
# picks "Paris", appends, continues.
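
The predict, pick, append loop can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in that looks up hand-written scores for the last token, whereas a real LLM computes scores from the entire sequence with a neural network. The scores and the `<end>` marker are made up.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits.values())
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def toy_model(tokens):
    """Hypothetical stand-in for an LLM: scores for each candidate next token."""
    table = {
        " is": {" Paris": 5.0, " London": 0.5, " a": 0.5},
        " Paris": {".": 4.0, ",": 1.0},
    }
    # Unknown context: the only candidate is the made-up end marker.
    return table.get(tokens[-1], {"<end>": 0.0})


def generate(tokens, max_new_tokens=5, greedy=True, rng=None):
    """Repeatedly predict a next token, append it, and feed the result back in."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(tokens))
        if greedy:
            next_token = max(probs, key=probs.get)  # most likely token
        else:
            # Random draw weighted by the predicted probabilities.
            next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


prompt = ["The", " capital", " of", " France", " is"]
print("".join(generate(prompt)))  # The capital of France is Paris.
```

With `greedy=False` the same loop can occasionally pick " London", which is why sampled outputs vary from run to run.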

Step 3: The three training stages

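Pre-training: predict the next token across a huge corpus of text, largely drawn from the internet. The model picks up grammar, facts, and reasoning patterns. The result is a raw model, knowledgeable but unruly.
Instruction fine-tuning: train on instruction→response pairs so the model follows instructions instead of merely continuing text.
RLHF (reinforcement learning from human feedback): humans rank responses, a reward model is trained on those rankings, and the LLM is then tuned to produce responses the reward model scores highly, i.e. helpful, harmless ones. This is a large part of what made ChatGPT feel aligned.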
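
The text does not spell out the training objectives, so the following is a sketch of the standard ones, not a description of any particular model's pipeline. Pre-training and fine-tuning typically minimize cross-entropy (the negative log of the probability assigned to the true next token), and one common way to train a reward model from rankings is a pairwise loss on the score gap between the preferred and rejected response. The function names and numbers are illustrative.

```python
import math


def next_token_loss(probs, target):
    """Cross-entropy at one position: -log of the probability of the true token."""
    return -math.log(probs[target])


def preference_loss(score_chosen, score_rejected):
    """Pairwise ranking loss: -log(sigmoid(score_chosen - score_rejected))."""
    gap = score_chosen - score_rejected
    return math.log1p(math.exp(-gap))


# Only a few vocabulary entries are listed; the remaining probability mass
# is spread over tokens not shown here.
confident = {"Paris": 0.94, "London": 0.01, "a": 0.01}
unsure = {"Paris": 0.10, "London": 0.45, "a": 0.45}

# The loss is small when the model already favors the true token, large otherwise.
print(next_token_loss(confident, "Paris"))
print(next_token_loss(unsure, "Paris"))

# The loss shrinks as the reward model scores the human-preferred response higher.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

Training nudges the model's weights to lower these losses, which is how "predict the next token" and "prefer the response humans ranked higher" become something the model can be optimized for.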

Why it hallucinates

It predicts plausible tokens, not true ones. It has no database to look facts up in; it generates what "sounds right" from patterns. That is why it can confidently invent citations. The usual mitigation in real apps is RAG (retrieval-augmented generation): giving it real documents to ground its answers on. This reduces hallucination but does not eliminate it.
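
The grounding idea can be sketched as "retrieve, then prompt". This is a minimal illustration under loud assumptions: the document store `DOCS` is made up, retrieval here is plain word overlap (production systems usually rank by embedding similarity), and the prompt wording is just one plausible template. No model is called; the sketch only builds the prompt that would be sent.

```python
import re

# Hypothetical document store for illustration.
DOCS = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
    "Tokyo is the capital of Japan.",
]


def words(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = words(question)
    ranked = sorted(docs, key=lambda d: len(question_words & words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, docs, k=1):
    """Put the retrieved text in front of the question so the model can ground on it."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs, k))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("What is the capital of Japan?", DOCS))
```

The model still only predicts tokens, but now the most plausible continuation is one that repeats what the supplied documents say.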

Mental model: an LLM is an incredibly well-read autocomplete. Brilliant at language and patterns, not a truth oracle. Design around that and you build reliable products.
