CNNs Explained — How Computers See Images

How does a computer tell a cat from a dog? With Convolutional Neural Networks (CNNs), the architecture that kicked off the deep learning revolution in 2012, when AlexNet won the ImageNet challenge.

The problem with plain networks on images

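A 200×200 colour image is 200 × 200 × 3 = 120,000 numbers. A fully-connected layer needs one weight per input per unit: 120 million weights for just 1,000 units, and billions for a layer as wide as the input. It also ignores that nearby pixels are related. CNNs fix both with one idea: slide small filters across the image.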
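The arithmetic behind that comparison can be checked in a few lines. This is only a sketch: the layer sizes (1,000 units, 16 filters of 3×3) are assumed for illustration and do not come from any particular model.

```python
# A 200x200 RGB image flattened into one input vector.
height, width, channels = 200, 200, 3
inputs = height * width * channels             # 120,000 numbers

# Fully-connected: one weight per input per unit.
fc_small = inputs * 1_000                      # assumed layer of 1,000 units
fc_wide = inputs * inputs                      # assumed layer as wide as the input

# Convolutional: 16 filters (assumed count), each 3x3 across all 3 channels,
# plus one bias per filter. The count does not depend on the image size.
filters, kernel = 16, 3
conv_params = filters * (kernel * kernel * channels) + filters

print(inputs)       # 120000
print(fc_small)     # 120000000
print(fc_wide)      # 14400000000
print(conv_params)  # 448
```

The convolutional layer stays at 448 parameters whether the image is 200×200 or 2000×2000, because the same small filters are reused at every position.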

Convolution — detecting features anywhere

A filter is a tiny grid (say 3×3) that slides over the image looking for a pattern such as an edge or a curve. It produces a feature map highlighting where that pattern appears. Because the same filter slides everywhere, a pattern found in the corner is found in the centre too, and the response in the feature map simply moves with it. Strictly speaking this is translation equivariance; pooling later adds a degree of translation invariance.
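The sliding filter can be sketched in plain Python. This is a minimal illustration, not how a library implements it: the function name `convolve2d`, the hand-written `vertical_edge` filter and the toy images are all made up for the example, and a real network learns its filter values instead of having them typed in. Like deep learning libraries, it does not flip the filter, so strictly it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch currently under the filter.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hand-written filter that responds where dark (0) turns bright (1), left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

image = [[0, 0, 1, 1, 1, 1] for _ in range(5)]    # edge near the left
shifted = [[0, 0, 0, 0, 1, 1] for _ in range(5)]  # same edge, two pixels to the right

print(convolve2d(image, vertical_edge)[0])    # [3, 3, 0, 0]
print(convolve2d(shifted, vertical_edge)[0])  # [0, 0, 3, 3]
```

Moving the edge two pixels to the right moves the strong responses two positions to the right in the feature map, with the same filter and no retraining.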

Layer 1 filters learn:  edges, colours, gradients
Layer 2 combines into:  corners, textures
Layer 3 combines into:  eyes, wheels, fur
Final layer:            "cat" vs "dog"
# The network LEARNS these filters — nobody programs "detect an eye".

Pooling — shrink and focus

Between convolutions, pooling (e.g. max-pooling) downsamples the feature maps — keeping the strongest signals, reducing size and computation, and adding robustness to small shifts.
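A small sketch of 2×2 max-pooling in plain Python shows both the shrinking and the robustness to small shifts. The helper `max_pool2d` and the toy feature maps are hypothetical, written only for this illustration.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max-pooling: keep the largest value in each size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# The same two activations, each moved by one pixel inside its 2x2 window.
nudged = [
    [0, 0, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 0],
]

print(max_pool2d(feature_map))  # [[9, 0], [0, 5]]
print(max_pool2d(nudged))       # [[9, 0], [0, 5]]
```

The 4×4 map becomes 2×2, and both inputs pool to the same result. The robustness is limited: it holds here because each activation stays inside its original window, and a shift that crosses a window boundary would change the output.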

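import torch.nn as nn

# Assumes 3-channel 32×32 inputs and 10 output classes;
# the 32*6*6 in the last layer only matches that input size.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3),  nn.ReLU(),  nn.MaxPool2d(2),   # 3×32×32  -> 16×30×30 -> 16×15×15
    nn.Conv2d(16, 32, 3), nn.ReLU(),  nn.MaxPool2d(2),   # 16×15×15 -> 32×13×13 -> 32×6×6
    nn.Flatten(),                                        # 32×6×6   -> 1152 values
    nn.Linear(32*6*6, 10),                               # classify into 10 classes
)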
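Where does the 6×6 in the final layer come from? The sketch below traces one spatial dimension through the stack using the standard output-size formula for unpadded convolution and pooling. The helper names `conv_out` and `trace` are made up for this example, and the 32-pixel input size is an inference from the 32*6*6 in the code, not something the model definition states.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (rounded down)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace(size):
    """Follow one spatial dimension through the conv/pool stack shown above."""
    size = conv_out(size, kernel=3)            # Conv2d(3, 16, 3)
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2)
    size = conv_out(size, kernel=3)            # Conv2d(16, 32, 3)
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2)
    return size

print(trace(32))            # 6
print(32 * trace(32) ** 2)  # 1152
print(trace(200))           # 48
```

A 32×32 input ends up as 32 channels of 6×6, which is the 32*6*6 = 1152 inputs the final layer expects. A 200×200 image would end up as 48×48 per channel, so the final layer would need 32*48*48 inputs instead.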

CNNs power face unlock, medical imaging, self-driving perception and photo search. For sequences (text, time series) we needed something else → next.
