Feature Engineering — Where Models Are Really Won

Kaggle wisdom: better features often beat better algorithms. A simple model with well-engineered features will frequently outperform a fancy model fed raw data, which makes feature engineering one of the highest-leverage ML skills.

The essential transforms

1. Encode categories — models need numbers, not text.

import pandas as pd
# one-hot: "CSE"/"IT"/"ECE" -> three 0/1 columns
# dtype=int forces 0/1 output; recent pandas versions default to True/False
df = pd.get_dummies(df, columns=["dept"], dtype=int)
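
One thing get_dummies does not handle is a category that appears in the test data but never in training. A minimal sketch of one way to deal with that, assuming scikit-learn is available, is to fit a OneHotEncoder on the training rows only. The tiny train/test frames and the "MECH" value below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "MECH" never appears in the training rows.
train = pd.DataFrame({"dept": ["CSE", "IT", "ECE", "CSE"]})
test = pd.DataFrame({"dept": ["IT", "MECH"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
train_encoded = enc.fit_transform(train[["dept"]]).toarray()  # fit on train only
test_encoded = enc.transform(test[["dept"]]).toarray()        # reuse learned categories

learned_categories = list(enc.categories_[0])
print(learned_categories)
print(test_encoded)
```

The encoder learns its column layout from the training data, so the test matrix always has the same columns as the training matrix, whatever values turn up later.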

2. Scale numeric features — so "salary" (lakhs) doesn't dwarf "age" (tens). Essential for distance/gradient-based models.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on train
X_test  = scaler.transform(X_test)        # ONLY transform test (no leakage!)
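
To see what the scaler actually does, here is a small worked sketch on made-up salary and age values (the numbers are illustrative, not from any dataset). After scaling, both training columns have mean 0 and standard deviation 1, while the test row is expressed in units of the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: salary (rupees), age (years).
X_train = np.array([[300000.0, 25.0],
                    [500000.0, 35.0],
                    [700000.0, 45.0]])
X_test = np.array([[500000.0, 55.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # applies the train mean/std

print(scaler.mean_)      # per-column means learned from the training rows
print(X_train_scaled)    # both columns now on a comparable scale
print(X_test_scaled)     # test age sits about 2.45 train std-devs above the train mean
```

The test salary equals the training mean, so it maps to 0; the test age is 20 years above the training mean of 35, which is roughly 2.45 training standard deviations.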

3. Extract from dates & text — raw timestamps and strings are weak; derived signals are strong.

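df["date"] = pd.to_datetime(df["date"])         # .dt only works on datetime columns
df["day_of_week"] = df["date"].dt.dayofweek     # Monday=0 ... Sunday=6; captures weekly patterns
df["is_weekend"]  = df["day_of_week"] >= 5      # Saturday (5) or Sunday (6)
df["title_len"]   = df["title"].str.len()       # simple but often predictive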
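
A self-contained sketch of these extractions on a three-row toy frame follows. The dates and titles are invented for illustration, and filling missing titles with an empty string is an assumption about how you want to treat them, not a rule.

```python
import pandas as pd

# Hypothetical toy data; the third title is missing on purpose.
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-06", "2024-01-08"],
    "title": ["Sale", "Weekend mega sale", None],
})

df["date"] = pd.to_datetime(df["date"])       # strings -> datetime64
df["day_of_week"] = df["date"].dt.dayofweek   # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5
df["month"] = df["date"].dt.month

# Treat a missing title as length 0 rather than NaN (an assumption).
df["title_len"] = df["title"].fillna("").str.len()

print(df[["day_of_week", "is_weekend", "month", "title_len"]])
```

Here 5 January 2024 is a Friday, 6 January a Saturday, and 8 January a Monday, so only the middle row is flagged as a weekend.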

The one rule that prevents disaster

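Fit transforms on training data only, then apply them to test. Fitting the scaler on all data leaks test-set statistics (here, the mean and standard deviation) into training ("data leakage") and gives fake-good scores that collapse in production. The same rule applies to any fitted transform, such as encoders and imputers. This is one of the most common silent ML bugs.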
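
One way to make this rule hard to break is to bundle the transform and the model into a single scikit-learn Pipeline, so that calling fit only ever sees training rows. The sketch below uses synthetic data and a logistic regression purely as placeholders; the data, the label rule, and the model choice are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1000.0, 1.0, 0.01]
y = (X[:, 1] > 0).astype(int)  # label depends only on the second feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the scaler and the model together, on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# score() scales X_test with the training statistics, then predicts.
test_accuracy = model.score(X_test, y_test)

# The scaler inside the pipeline learned its mean from X_train, not from X.
fitted_scaler = model.named_steps["standardscaler"]
print(test_accuracy)
print(fitted_scaler.mean_)
```

The same pipeline object can be passed to cross_val_score, in which case the scaler is refit inside each fold and the validation rows never influence it.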

Next track: when features aren't enough and you need neural networks.
