Data in AI lives in two structures: NumPy arrays (raw numbers/tensors) and Pandas DataFrames (labelled tables). Master these and half the job is done.
NumPy: fast math on arrays
import numpy as np
a = np.array([1, 2, 3, 4])
a * 2 # [2 4 6 8]: operates on the WHOLE array, no loop
a.mean(), a.std() # stats built in
# matrices (2D): how ALL model data is shaped
m = np.array([[1, 2], [3, 4]])
m.shape # (2, 2): always check shapes!
Vectorisation is the key idea: a * 2 runs in optimised C over the whole array, often 10 to 100 times faster than an equivalent Python loop, depending on array size and operation. Models do billions of these operations, so never loop when NumPy can vectorise.
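A rough way to see the gap is to time both versions on the same data. The sketch below is illustrative, not a benchmark: the array size and the helper names `double_with_loop` and `double_vectorised` are assumptions made for this example, and the measured ratio will vary with hardware and array size.

```python
import time
import numpy as np

def double_with_loop(values):
    """Double each element one at a time in pure Python (hypothetical helper)."""
    result = []
    for v in values:
        result.append(v * 2)
    return result

def double_vectorised(arr):
    """Double every element in a single NumPy operation (hypothetical helper)."""
    return arr * 2

n = 1_000_000  # assumed size, large enough to make the difference visible
arr = np.arange(n)
values = arr.tolist()

start = time.perf_counter()
loop_result = double_with_loop(values)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = double_vectorised(arr)
vec_seconds = time.perf_counter() - start

# Both approaches must give the same answer; only the speed differs.
assert np.array_equal(np.array(loop_result), vec_result)
print(f"loop: {loop_seconds:.4f}s  vectorised: {vec_seconds:.4f}s")
```

On a typical machine the vectorised line should finish well ahead of the loop, but treat the printed numbers as a rough indication only.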
Pandas: spreadsheets in code
import pandas as pd
df = pd.read_csv("students.csv")
df.head() # first 5 rows
df.info() # columns, types, missing values
df["cgpa"].mean() # column stats
df[df["cgpa"] >= 8] # filter rows
df["passed"] = df["cgpa"] >= 5 # new column
df.groupby("dept")["cgpa"].mean() # aggregate
Cleaning: the unglamorous 60% of AI
df.isnull().sum() # how many missing per column?
df["age"] = df["age"].fillna(df["age"].median()) # fill gaps (assign back; inplace on a selected column is unreliable in recent pandas)
df.drop_duplicates(inplace=True) # remove dupes
Real datasets are messy. "Data cleaning" and "feature engineering" are where models are won or lost, more so than in the choice of fancy algorithms.
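To try the cleaning steps without a CSV file, here is a minimal sketch on a small in-memory table. The rows, the `name` column and every value are made up for illustration; only the `dept`, `cgpa` and `age` column names come from the snippets above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age and one exact duplicate row ("Ben").
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Ben", "Dina"],
    "dept": ["CS", "EE", "CS", "EE", "EE"],
    "cgpa": [8.5, 6.0, 9.0, 6.0, 7.0],
    "age":  [20.0, 22.0, np.nan, 22.0, 24.0],
})

# Count missing values per column before cleaning.
missing_before = df.isnull().sum()

# Fill the missing age with the median of the known ages.
df["age"] = df["age"].fillna(df["age"].median())

# Drop the repeated row and renumber the index from 0.
df = df.drop_duplicates().reset_index(drop=True)

# The usual analysis steps now run on the cleaned table.
high_scorers = df[df["cgpa"] >= 8]               # filter rows
dept_means = df.groupby("dept")["cgpa"].mean()   # aggregate per department

print(missing_before)
print(df)
print(dept_means)
```

The order here mirrors the snippet above (fill, then de-duplicate). On real data it can be worth dropping duplicates first, so repeated rows do not skew the median used for filling.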