NumPy & Pandas Essentials for AI

Data in AI lives in two structures: NumPy arrays (raw numbers/tensors) and Pandas DataFrames (labelled tables). Master these and half the job is done.

NumPy — fast math on arrays

import numpy as np

a = np.array([1, 2, 3, 4])
a * 2                 # [2 4 6 8]  — operates on the WHOLE array, no loop
a.mean(), a.std()     # stats built in

# matrices (2D): tabular model data is usually shaped (rows, features)
m = np.array([[1, 2], [3, 4]])
m.shape               # (2, 2)  — always check shapes!
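
Why checking shapes matters can be seen in a small sketch. The feature matrix X and weight vector w below are made-up values for illustration, not part of any real model; the point is only that a matrix product works when the inner dimensions agree and fails loudly when they do not.

```python
import numpy as np

# Hypothetical data: 2 samples with 3 features each.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
w = np.array([0.5, -1.0, 2.0])       # shape (3,), one weight per feature

# (2, 3) @ (3,) -> (2,): one score per sample.
scores = X @ w
print(X.shape, w.shape, scores.shape)

# A weight vector of the wrong length does not match the 3 features.
bad_w = np.array([1.0, 2.0])         # shape (2,)
try:
    X @ bad_w
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print("shape mismatch:", err)
```

Printing .shape before a matrix product is usually the quickest way to find this kind of bug.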

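Vectorisation is the key idea: a * 2 runs in optimised C over the whole array, which is often 10 to 100x faster than an equivalent Python loop on large arrays (the exact speed-up depends on the array size and the operation). Models do billions of these operations, so avoid looping whenever NumPy can vectorise.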
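
The speed difference is easy to measure for yourself. The following is a rough sketch using the standard timeit module; the array size and the helper names double_with_loop and double_vectorised are arbitrary choices for illustration, and the timings will vary by machine.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.float64)


def double_with_loop(arr):
    """Double every element one at a time in Python."""
    return [x * 2 for x in arr]


def double_vectorised(arr):
    """Double every element with a single NumPy operation."""
    return arr * 2


# Both approaches must produce the same numbers.
loop_result = np.array(double_with_loop(values))
vec_result = double_vectorised(values)

# Time three runs of each approach.
loop_time = timeit.timeit(lambda: double_with_loop(values), number=3)
vec_time = timeit.timeit(lambda: double_vectorised(values), number=3)

print(f"loop:       {loop_time:.4f} s")
print(f"vectorised: {vec_time:.4f} s")
```

On most machines the vectorised version should come out well ahead, and the gap tends to widen as the array grows.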

Pandas — spreadsheets in code

import pandas as pd

df = pd.read_csv("students.csv")
df.head()                       # first 5 rows
df.info()                       # columns, types, missing values
df["cgpa"].mean()               # column stats
df[df["cgpa"] >= 8]             # filter rows
df["passed"] = df["cgpa"] >= 5  # new column
df.groupby("dept")["cgpa"].mean()   # aggregate
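
To try these calls without a students.csv file, a small in-memory table works just as well. The names, departments and CGPA values below are invented for illustration; only the column names cgpa, dept and passed come from the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for students.csv.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "dept": ["CS", "CS", "EE", "EE"],
    "cgpa": [9.0, 4.5, 8.0, 7.0],
})

# Filter rows: keep students with a CGPA of 8 or more.
high = df[df["cgpa"] >= 8]

# New column: a boolean computed for every row at once.
df["passed"] = df["cgpa"] >= 5

# Aggregate: mean CGPA per department.
dept_mean = df.groupby("dept")["cgpa"].mean()

print(high)
print(df)
print(dept_mean)
```

Each step is a whole-column operation with no explicit loop over rows.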

Cleaning — the unglamorous 60% of AI

df.isnull().sum()                    # how many missing per column?
df["age"] = df["age"].fillna(df["age"].median())     # fill gaps (assign back; inplace on a column is unreliable)
df.drop_duplicates(inplace=True)     # remove dupes

Real datasets are messy. "Data cleaning" and "feature engineering" are where models are won or lost — more than fancy algorithms.
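
The cleaning steps can be sketched end to end on a tiny, made-up table. The rows below are hypothetical, and removing duplicates before filling is one reasonable ordering rather than a fixed rule: it keeps repeated rows from skewing the median used for the fill.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: one missing age and one repeated row.
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Chen", "Dia"],
    "age": [20.0, np.nan, 22.0, 22.0, 24.0],
})

# Count missing values per column.
missing = raw.isnull().sum()

# Drop exact duplicate rows, then renumber the index.
clean = raw.drop_duplicates().reset_index(drop=True)

# Fill the remaining gap with the median of the known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(missing)
print(clean)
```

Assigning the result back to the column, rather than passing inplace=True, makes it explicit which object is being changed.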
