The Machine Learning Workflow

Almost every ML project, whether a fraud detector or a Kaggle entry, follows the same 7 steps. Learn the skeleton once and each new project will fit it.

The 7 steps

1. Frame the problem: is it classification or regression? What does "good" mean (overall accuracy, or catching every fraud case)?
2. Get & clean data: this is where roughly 60% of the effort goes. Handle missing values, duplicates, and wrong types.
3. Explore (EDA): plot distributions and correlations. Understand the data before modelling.
4. Feature engineering: turn raw data into signals (date → day-of-week; text → word counts).
5. Split the data: train / test, so you can measure performance on data the model never saw.
6. Train & tune: fit a model, then adjust its knobs (hyperparameters).
7. Evaluate & deploy: measure honestly on the test set, then ship and monitor.
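
To make the cleaning and feature-engineering steps concrete, here is a minimal sketch using pandas. The table, its column names (amount, date, note) and its values are invented for illustration, and filling a missing amount with the median is just one reasonable choice, not the only one.

```python
import pandas as pd

# Hypothetical toy data: column names and values are made up for this sketch.
raw = pd.DataFrame({
    "amount": ["12.50", "12.50", None, "80.00"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-06"],
    "note": ["coffee shop", "coffee shop", "refund", "big big purchase"],
})

# Get & clean data: drop exact duplicates, fix types, fill missing values.
df = raw.drop_duplicates().copy()
df["amount"] = pd.to_numeric(df["amount"])                 # strings -> floats
df["amount"] = df["amount"].fillna(df["amount"].median())  # one possible fill strategy
df["date"] = pd.to_datetime(df["date"])

# Feature engineering: turn raw columns into numeric signals.
df["day_of_week"] = df["date"].dt.dayofweek                # Monday=0 ... Sunday=6
df["note_word_count"] = df["note"].str.split().str.len()   # text -> word count

print(df)
```

The duplicate row disappears, the missing amount is filled, and the date and note columns each gain a numeric feature a model can use.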

The most important line of code in ML

from sklearn.model_selection import train_test_split

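# X is the feature matrix and y the labels (both assumed to be loaded already).
# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)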

# Train ONLY on X_train. Judge ONLY on X_test.
# If you evaluate on data the model trained on, you are lying to yourself.
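
The split only pays off once a model is trained on one part and scored on the other. The sketch below shows that full loop under some assumptions: the data is synthetic (generated with make_classification as a stand-in for a real dataset), and LogisticRegression is an arbitrary choice of model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real features and labels (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train ONLY on the training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Judge ONLY on the held-out split.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {test_accuracy:.2f}")
```

With 500 rows and test_size=0.2, 400 rows go to training and 100 are held back for the final score.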

The golden rule

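Never let the model see the test set during training or tuning. The test set stands in for "the future / real users": data the model has not met yet. A model that scores 99% on training data and 60% on test data has overfit: it memorised the training examples instead of learning patterns that generalise.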
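
The gap between training and test scores is easy to reproduce. The following sketch is an illustration, not a benchmark: it assumes synthetic data with 20% of the labels flipped at random, and uses an unconstrained decision tree because that model can memorise its training rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (illustrative assumption).
X, y = make_classification(
    n_samples=400, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# With no depth limit, the tree keeps splitting until it fits every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # scored on data the model has seen
test_acc = tree.score(X_test, y_test)      # scored on data it has never seen
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training score is perfect because the tree has memorised its rows, noise included. Only the test score says anything about how the model would behave on new data.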

Next: build your first real model → Your First Model with scikit-learn.
