Two very different phases with very different costs. Confusing them leads to bad architecture and budget decisions.
Training — teaching the model (expensive, one-time-ish)
- Runs the full learning loop over massive datasets, millions or billions of times.
- Needs many GPUs for days to months. Training a GPT-4-class model reportedly costs tens of millions of dollars or more.
- You (usually) don't do this — you use a pre-trained model. This is why APIs exist.
Inference — using the trained model (cheap-per-call, but adds up)
- One forward pass per prediction (an LLM runs one forward pass per generated token). Milliseconds for small models, seconds for big LLMs.
- This is what your app does on every request — and what you pay per token for on APIs.
TRAINING:  data + answers ==(days on GPUs)==> a trained model
INFERENCE: trained model + new input ==(ms)==> a prediction
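To make the split concrete, here is a minimal sketch in plain Python. It assumes a toy one-feature linear model fitted by gradient descent; the function names `train` and `predict`, the data, and the hyperparameters are all made up for illustration. Real models differ enormously in scale, but the shape is the same: training repeats the forward pass and a weight update many times, while inference is a single forward pass with frozen weights.

```python
# Toy model: y = w * x + b. Everything here is illustrative, not a real workload.

def predict(w, b, x):
    """Inference: one forward pass with fixed weights."""
    return w * x + b


def train(xs, ys, epochs=1000, lr=0.05):
    """Training: repeat (forward pass -> measure error -> adjust weights)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = predict(w, b, x) - y      # forward pass on one example
            grad_w += 2 * error * x / n       # gradient of mean squared error w.r.t. w
            grad_b += 2 * error / n           # gradient of mean squared error w.r.t. b
        w -= lr * grad_w                      # weight update
        b -= lr * grad_b
    return w, b


# "data + answers": points generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = train(xs, ys)               # expensive part: 1000 passes over the data
prediction = predict(w, b, 10.0)   # cheap part: one pass, result is close to 21
```

Here training calls `predict` 5,000 times (1,000 epochs over 5 examples) plus the weight updates, and inference calls it once. That ratio is the whole cost story in miniature.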
What this means for you as a builder
- Don't train from scratch. Use pre-trained models via API or Hugging Face. Fine-tune only when you must.
- Inference cost scales with usage. An LLM feature that's cheap in testing can be expensive at 1M users — cache, use smaller models where possible, and count tokens.
- Latency is a product decision. Big models are smarter but slower. Match model size to the task.
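The "cheap in testing, expensive at scale" point is simple arithmetic, sketched below. Every number is a made-up assumption (the price per 1,000 tokens, tokens per request, requests per user, and cache hit rate), and `monthly_inference_cost` is a hypothetical helper, not part of any provider's SDK. Substitute your provider's actual pricing before drawing conclusions.

```python
def monthly_inference_cost(users, requests_per_user, tokens_per_request,
                           price_per_1k_tokens, cache_hit_rate=0.0):
    """Rough monthly API bill in dollars.

    cache_hit_rate is the fraction of requests answered from a cache,
    which are assumed here to cost nothing.
    """
    total_requests = users * requests_per_user
    billed_requests = total_requests * (1.0 - cache_hit_rate)
    total_tokens = billed_requests * tokens_per_request
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical workload: 20 requests per user per month, 1,500 tokens each,
# at an assumed price of $0.002 per 1,000 tokens.
testing = monthly_inference_cost(
    users=10, requests_per_user=20,
    tokens_per_request=1500, price_per_1k_tokens=0.002)

production = monthly_inference_cost(
    users=1_000_000, requests_per_user=20,
    tokens_per_request=1500, price_per_1k_tokens=0.002)

# Same workload, but 30% of requests are served from a cache.
production_cached = monthly_inference_cost(
    users=1_000_000, requests_per_user=20,
    tokens_per_request=1500, price_per_1k_tokens=0.002,
    cache_hit_rate=0.3)

print(f"testing:           ${testing:,.2f}")
print(f"production:        ${production:,.2f}")
print(f"production cached: ${production_cached:,.2f}")
```

Under these assumed numbers the bill goes from well under a dollar with 10 test users to roughly $60,000 a month at 1M users, and a 30% cache hit rate brings that down to about $42,000. A smaller model with a lower per-token price reduces the bill the same way, through `price_per_1k_tokens`.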
Next track: the technology everyone actually wants to learn → LLMs.