Kaggle wisdom: better features beat better algorithms. A simple model with well-engineered features often outperforms a fancy model trained on raw data. That makes feature engineering one of the highest-leverage ML skills.
The essential transforms
1. Encode categories — models need numbers, not text.
import pandas as pd
# one-hot: "CSE"/"IT"/"ECE" -> three 0/1 columns
# dtype=int forces 0/1; recent pandas versions default to True/False
df = pd.get_dummies(df, columns=["dept"], dtype=int)
2. Scale numeric features — so "salary" (lakhs) doesn't dwarf "age" (tens). Essential for distance/gradient-based models.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on train
X_test = scaler.transform(X_test)        # ONLY transform test (no leakage!)
3. Extract from dates & text — raw timestamps and strings are weak; derived signals are strong.
df["day_of_week"] = df["date"].dt.dayofweek  # captures weekly patterns
df["is_weekend"] = df["day_of_week"] >= 5
df["title_len"] = df["title"].str.len()      # simple but often predictive
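To see the encoding and extraction steps end to end, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration; only the column names ("dept", "date", "title") come from the snippets above. It assumes the "date" column is already a datetime type, which the `.dt` accessor requires.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the lesson.
df = pd.DataFrame({
    "dept": ["CSE", "IT", "ECE", "CSE"],
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]),
    "title": ["Intro to ML", "Data", "Signals and Systems", "AI"],
})

# Step 1: one-hot encode the category column.
# dtype=int gives 0/1 columns instead of True/False.
df = pd.get_dummies(df, columns=["dept"], dtype=int)

# Step 3: derive features from the date and text columns.
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5    # Saturday or Sunday
df["title_len"] = df["title"].str.len()      # character count of the title

print(df.drop(columns=["date", "title"]))
```

The one-hot columns are named by prefixing the original column name, so "dept" becomes "dept_CSE", "dept_ECE" and "dept_IT".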
The one rule that prevents disaster
Fit transforms on training data only, then apply them to test. Fitting the scaler on all the data leaks test-set statistics (its mean and standard deviation) into training. This is "data leakage", and it gives fake-good scores that collapse in production. It is one of the most common silent ML bugs.
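One way to make this rule hard to break is to put the scaler and the model in a single scikit-learn pipeline, so the scaler is fitted only when the pipeline is fitted on training rows. The sketch below uses synthetic data and a logistic regression purely as stand-ins; the feature scales and the label rule are assumptions for illustration, not part of the lesson.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(500_000, 150_000, size=200),  # "salary"-like feature
    rng.normal(35, 8, size=200),             # "age"-like feature
])
y = (X[:, 1] > 35).astype(int)  # made-up label that depends on "age"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() fits the scaler on X_train only, then trains the model on the scaled rows.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# score() reuses the train-fitted scaler to transform X_test before predicting.
scaler = model.named_steps["standardscaler"]
print("means learned by the scaler:", scaler.mean_)
print("means of the training rows: ", X_train.mean(axis=0))
print("test accuracy:", model.score(X_test, y_test))
```

The same pipeline object can be passed to `cross_val_score`, which refits the scaler inside each fold, so the held-out fold never influences the scaling statistics.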
Next track: when features aren't enough and you need neural networks.