Accuracy Is Not Enough — Precision, Recall & F1

A model that is 99% accurate can be useless. This lesson is the difference between looking competent and being competent in ML interviews.

The trap: imbalanced data

Predict a rare disease present in 1% of people. A model that always says "healthy" is 99% accurate — and catches zero sick patients. Accuracy hid total failure. This is why we need better metrics.
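
To make the trap concrete, here is a minimal sketch of the "always healthy" model scored on a made-up population. The population size (1,000 people) and the variable names are assumptions chosen for illustration; only the 1% disease rate comes from the text above.

```python
# Hypothetical population: 1,000 people, 1% of them sick (1 = sick, 0 = healthy).
y_true = [1] * 10 + [0] * 990

# The "always healthy" model: it predicts 0 for everyone.
y_pred = [0] * len(y_true)

# Accuracy = share of predictions that match the true label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Of the people who are actually sick, how many did the model flag?
sick_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2%}")                              # accuracy: 99.00%
print(f"sick patients caught: {sick_caught} of {sum(y_true)}")  # 0 of 10
```

The accuracy figure looks excellent only because the healthy class dominates; the model never gets a single sick patient right.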

The confusion matrix — the source of truth

              Predicted +       Predicted -
Actual +      True Positive     False Negative  (missed!)
Actual -      False Positive    True Negative
              (false alarm)
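
The four cells can be counted directly from paired labels. The sketch below assumes binary labels where 1 is the positive class; the helper name `confusion_counts` and the sample labels are hypothetical, made up for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn), assuming labels are 0 or 1 and 1 is positive."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1   # real positive, flagged
        elif actual == 1 and predicted == 0:
            fn += 1   # real positive, missed
        elif actual == 0 and predicted == 1:
            fp += 1   # real negative, false alarm
        else:
            tn += 1   # real negative, correctly left alone
    return tp, fn, fp, tn

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp}  FN={fn}")   # TP=2  FN=1
print(f"FP={fp}  TN={tn}")   # FP=1  TN=4
```

One thing to watch: scikit-learn's `confusion_matrix` orders rows and columns by sorted label, so for 0/1 labels it returns `[[TN, FP], [FN, TP]]`, which is flipped relative to the layout above.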

The two metrics that matter

Precision = of everything flagged positive, how much was right? TP / (TP + FP). "When I raise an alarm, am I usually correct?"
Recall = of all real positives, how many did I catch? TP / (TP + FN). "Am I missing real cases?"
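
A small worked example of both formulas, as a sketch. The counts are invented for illustration, and returning 0.0 when the denominator is zero is an assumption: it is one common convention, not the only one.

```python
def precision(tp, fp):
    # TP / (TP + FP). Undefined if nothing was flagged; 0.0 is an assumed convention.
    flagged = tp + fp
    return tp / flagged if flagged else 0.0

def recall(tp, fn):
    # TP / (TP + FN). Undefined if there are no real positives; 0.0 assumed again.
    real_positives = tp + fn
    return tp / real_positives if real_positives else 0.0

# Hypothetical screening run: 100 sick patients, 80 caught, 20 missed, 40 false alarms.
tp, fn, fp = 80, 20, 40
print(f"precision = {precision(tp, fp):.2f}")   # precision = 0.67
print(f"recall    = {recall(tp, fn):.2f}")      # recall    = 0.80

# The always-"healthy" model on 10 sick patients: no alarms, every case missed.
print(f"recall    = {recall(0, 10):.2f}")       # recall    = 0.00
```

Recall exposes the failure that accuracy hid: a model that never raises an alarm scores zero.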

The trade-off (this is the interview question)

Disease screening / fraud → maximise recall. Missing a real case is catastrophic; a false alarm just means a follow-up check.
Spam filter / recommending content → favour precision. A false positive (real email → spam) is worse than letting one spam through.
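F1 score = harmonic mean of the two: 2 × Precision × Recall / (Precision + Recall). One number when you need balance; it stays low unless both precision and recall are high.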
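
The trade-off is easiest to see by moving a decision threshold over model scores. This is a sketch under assumptions: the scores and labels are made up, `metrics_at` is a hypothetical helper, and a score at or above the threshold counts as a positive prediction.

```python
def f1(p, r):
    # Harmonic mean of precision and recall; 0.0 when both are zero (assumed convention).
    return 2 * p * r / (p + r) if (p + r) else 0.0

def metrics_at(threshold, scores, labels):
    """Return (precision, recall, f1) when scores >= threshold are flagged positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r, f1(p, r)

# Made-up model scores and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t in (0.25, 0.50, 0.80):
    p, r, f = metrics_at(t, scores, labels)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
# threshold=0.25  precision=0.67  recall=1.00  f1=0.80
# threshold=0.50  precision=0.75  recall=0.75  f1=0.75
# threshold=0.80  precision=1.00  recall=0.50  f1=0.67
```

Lowering the threshold catches every positive at the price of more false alarms; raising it does the reverse. The harmonic mean also punishes lopsided results: `f1(0.9, 0.1)` is 0.18, while the plain average of 0.9 and 0.1 would be 0.5.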

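from sklearn.metrics import classification_report
# Assumes y_test (true labels) and preds (predicted labels from your fitted model) already exist.
print(classification_report(y_test, preds))   # precision, recall, F1 and support per class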

Say "it depends on the cost of a false negative vs false positive" in an interview and you instantly sound senior.
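
The "it depends on the cost" answer can be backed with simple arithmetic. In the sketch below, every number is hypothetical: the two models' error counts and the per-error costs are invented purely to show how the preferred model flips with the cost ratio.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    # Total cost of a model's mistakes: each false alarm and each miss has a price.
    return fp * cost_fp + fn * cost_fn

# Two hypothetical models evaluated on the same data.
cautious = {"fp": 50, "fn": 2}    # flags a lot: high recall, lower precision
strict = {"fp": 5, "fn": 20}      # flags rarely: high precision, lower recall

# Disease screening (assumed costs): a missed case is 100x worse than a false alarm.
print(total_cost(**cautious, cost_fp=1, cost_fn=100))   # 250
print(total_cost(**strict, cost_fp=1, cost_fn=100))     # 2005

# Spam filtering (assumed costs): losing a real email is 20x worse than one spam getting through.
print(total_cost(**cautious, cost_fp=20, cost_fn=1))    # 1002
print(total_cost(**strict, cost_fp=20, cost_fn=1))      # 120
```

Same two models, opposite winners: the high-recall model is cheaper for screening, the high-precision model is cheaper for spam.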
