Every ML project — a fraud detector or a Kaggle entry — follows the same 7 steps. Learn the skeleton once and every project fits it.
The 7 steps
- Frame the problem — classification or regression? What does "good" mean (accuracy? catching all fraud)?
- Get & clean data — the "60%", where most of the effort usually goes. Handle missing values, duplicates, wrong types.
- Explore (EDA) — plot distributions, correlations. Understand before modelling.
- Feature engineering — turn raw data into signals (date → day-of-week; text → counts).
- Split the data — train / test, so you can measure on data the model never saw.
- Train & tune — fit a model, adjust its knobs (hyperparameters).
- Evaluate & deploy — measure honestly on the test set, then ship and monitor.
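To make steps 2 and 4 concrete, here is a minimal sketch using pandas. The tiny table and its column names (`date`, `amount`, `note`) are made up for illustration, and filling gaps with the median is just one common choice, not the only way to handle missing values.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing amount,
# and dates/amounts stored as text (wrong types).
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-06", "2024-01-07"],
    "amount": ["10.5", "10.5", None, "99.0"],
    "note": ["coffee shop", "coffee shop",
             "wire transfer urgent", "gift card gift card"],
})

# Step 2: clean. Drop exact duplicates, fix types, fill missing values.
clean = raw.drop_duplicates().reset_index(drop=True)
clean["date"] = pd.to_datetime(clean["date"])
clean["amount"] = pd.to_numeric(clean["amount"])
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Step 4: feature engineering. date -> day-of-week, text -> counts.
clean["day_of_week"] = clean["date"].dt.dayofweek  # Monday=0 ... Sunday=6
clean["note_word_count"] = clean["note"].str.split().str.len()

print(clean)
```

In a real project, compute fill values such as the median from the training rows only, so nothing about the test set leaks into the model.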
The most important line of code in ML
from sklearn.model_selection import train_test_split

# X (features) and y (labels) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train ONLY on X_train. Judge ONLY on X_test.
# If you evaluate on data the model trained on, you are lying to yourself.
The golden rule
Never let the model see the test set during training. The test set simulates "the future / real users". A model that scores 99% on training data and 60% on test data has overfit — memorised, not learned.
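You can watch overfitting happen in a few lines. The sketch below is an illustration under assumptions: the data is synthetic (generated with `make_classification`, with some labels deliberately scrambled), and the model is an unconstrained decision tree, chosen only because it memorises easily. The exact scores will depend on your scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y randomly reassigns a share of the labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# With no depth limit, the tree can memorise every training row, noise included.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)  # the test set is never passed to fit

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training score looks perfect because the tree has memorised the rows it saw. The test score is the number that tells you how the model would do on new data.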
Next: build your first real model → Your First Model with scikit-learn.