How does a computer tell a cat from a dog? Convolutional Neural Networks (CNNs) — the architecture that kicked off the deep learning revolution in 2012.
The problem with plain networks on images
A 200×200 colour image = 120,000 numbers (200 × 200 pixels × 3 colour channels). A fully-connected layer with, say, 10,000 units would need 1.2 billion weights, and it would ignore the fact that nearby pixels are related. CNNs fix both with one idea: slide small filters across the image.
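To make the size gap concrete, here is a quick back-of-the-envelope count. The image size comes from the text; the 10,000-unit dense layer and the 16 filters of size 3×3 are assumptions picked only for illustration, and biases are ignored.

```python
# Input size from the text: 200 x 200 pixels, 3 colour channels.
height, width, channels = 200, 200, 3
inputs = height * width * channels  # 120,000 numbers

# Assumption for illustration: a fully-connected layer with 10,000 units.
hidden_units = 10_000
dense_weights = inputs * hidden_units  # one weight per (input, unit) pair

# Assumption for illustration: a conv layer with 16 filters of size 3x3.
num_filters, kernel = 16, 3
conv_weights = num_filters * channels * kernel * kernel  # reused at every position

print(f"inputs:        {inputs:,}")
print(f"dense weights: {dense_weights:,}")
print(f"conv weights:  {conv_weights:,}")
```

The convolutional layer stays tiny because its weight count depends on the filter size and channel counts, not on the image size.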
Convolution — detecting features anywhere
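A filter is a tiny grid (say 3×3) of weights that slides over the image looking for a pattern, such as an edge or a curve. It produces a feature map highlighting where that pattern appears. Because the same filter slides everywhere, a pattern found in the corner is found in the centre too: the feature map simply shifts along with it (translation equivariance). Pooling later turns this into approximate translation invariance, so a cat is still a cat wherever it sits in the frame.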
Layer 1 filters learn: edges, colours, gradients
Layer 2 combines into: corners, textures
Layer 3 combines into: eyes, wheels, fur
Final layer: "cat" vs "dog"
# The network LEARNS these filters; nobody programs "detect an eye".
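The sliding-filter idea can be sketched in a few lines of plain Python. This is a minimal illustration rather than how a library implements it: the `conv2d` helper, the hand-picked `edge_filter`, and the toy `image_with_edge` images are all hypothetical names made up for this example, and a real CNN would learn its filter values instead of having them written by hand.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As in deep learning libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hand-picked vertical-edge filter: responds where brightness rises left to right.
edge_filter = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def image_with_edge(col, size=6):
    """A size x size image that is dark (0) left of `col` and bright (1) from `col` on."""
    return [[1 if c >= col else 0 for c in range(size)] for _ in range(size)]


left = conv2d(image_with_edge(2), edge_filter)   # edge at column 2
right = conv2d(image_with_edge(3), edge_filter)  # same edge, one pixel to the right

print(left[0])   # [3, 3, 0, 0]
print(right[0])  # [0, 3, 3, 0]
```

Moving the edge one pixel to the right moves the response one pixel to the right in the feature map, with no change to the filter itself.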
Pooling — shrink and focus
Between convolutions, pooling (e.g. max-pooling) downsamples the feature maps — keeping the strongest signals, reducing size and computation, and adding robustness to small shifts.
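As a rough sketch of what max-pooling does, the plain-Python helper below keeps the largest value in each 2×2 window. The `max_pool2d` function and both feature maps are made up for this example; in the one-pixel-shift case shown, the strong activation stays inside the same window, which is the situation where pooling hides the shift.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max-pooling: keep the largest value in each size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Made-up 4x4 feature map with one strong activation (9).
fmap = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 2],
]
# The same map with the strong activation moved one pixel to the left.
shifted = [
    [0, 0, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 2],
]

pooled = max_pool2d(fmap)
pooled_shifted = max_pool2d(shifted)
print(pooled)          # [[9, 0], [0, 2]]
print(pooled_shifted)  # [[9, 0], [0, 2]]
```

The output has a quarter as many values as the input, and both inputs pool to the same result. A shift that crosses a window boundary would change the output, so the robustness is only approximate.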
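import torch.nn as nn

# The 32*6*6 in the last layer matches 3-channel 32×32 inputs; shapes below are channels×height×width.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(), nn.MaxPool2d(2),   # 3×32×32 -> 16×30×30 -> 16×15×15
    nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2),  # -> 32×13×13 -> 32×6×6
    nn.Flatten(),                                      # -> 1152 values per image
    nn.Linear(32*6*6, 10),                             # classify into 10 classes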
)

CNNs power face unlock, medical imaging, self-driving perception and photo search. For sequences (text, time series) we needed something else → next.