Neural networks are not magic. They are multi-layered compositions of linear algebra and non-linear activation functions, optimised by gradient descent. Once you hold that mental model firmly, the entire field becomes navigable.
The Neuron: Linear + Non-linear
Each neuron computes a weighted sum of its inputs, adds a bias term, then passes the result through a non-linear activation function. Without that non-linearity, stacking layers would be pointless — a deep stack of linear transformations is still just a linear transformation.
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b        # linear step: weighted sum plus bias
    return np.maximum(0, z)     # ReLU: max(0, z)

# ReLU (Rectified Linear Unit) is one of the most common
# activations in modern networks. It is cheap to compute and,
# unlike sigmoid or tanh, does not saturate for positive inputs,
# which mitigates (but does not eliminate) the vanishing
# gradient problem.
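The claim that stacked linear layers collapse into one can be checked numerically. The sketch below is a minimal illustration, assuming arbitrary layer shapes and random weights chosen only for the demonstration: two linear layers with no activation between them give the same output as a single layer with merged weights and bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between (shapes are illustrative).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacked: apply layer 1, then layer 2.
stacked = W2 @ (W1 @ x + b1) + b2

# Collapsed: one equivalent layer with merged weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = W @ x + b

print(np.allclose(stacked, collapsed))  # the two outputs agree

# With a ReLU between the layers, no single merged (W, b) reproduces
# the output for all inputs; this is what the non-linearity buys.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
```

The merged layer has shape (2, 3), so the intermediate width of 4 contributes nothing once the activation is removed.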
Forward Pass: Data Flows In, Predictions Come Out
During the forward pass, input data propagates layer by layer through the network. Each layer transforms the representation — early layers detect low-level features (edges in images, character n-grams in text), while later layers detect abstract concepts built from those features.
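A forward pass can be sketched in a few lines of NumPy. This is an illustrative example only: the layer sizes, the `forward` helper, and the choice to leave the final layer without an activation (so it outputs raw scores) are assumptions made for the sketch, not details from the text above.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, layers):
    """Propagate x through a list of (W, b) pairs.

    ReLU is applied after every layer except the last,
    which returns raw scores (logits).
    """
    h = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b
        h = relu(z) if i < len(layers) - 1 else z
    return h

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]  # hypothetical: 8 inputs, two hidden layers, 3 outputs
layers = [
    (rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.normal(size=8)
logits = forward(x, layers)
print(logits.shape)  # (3,)
```

Each pass through the loop is one layer transforming the representation produced by the layer before it.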
Loss Functions: How Wrong Are We?
The loss function measures the gap between the network's prediction and the ground truth. Common choices include:
- Mean Squared Error (MSE) — regression tasks.
- Cross-Entropy Loss — classification tasks. Penalises confident wrong predictions heavily.
- Huber Loss — regression tasks where outliers should not dominate.
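The three losses above can be written out directly. The following is a minimal NumPy sketch under some assumptions: the function names are illustrative, `cross_entropy` takes raw scores (logits) and a single integer class label, and `delta=1.0` is an arbitrary default for the Huber threshold.

```python
import numpy as np

def mse(pred, target):
    # Mean of squared errors: large errors are penalised quadratically.
    return np.mean((pred - target) ** 2)

def cross_entropy(logits, label):
    # Negative log-probability of the true class, using a
    # numerically stable log-softmax (subtract the max first).
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def huber(pred, target, delta=1.0):
    # Quadratic for small errors, linear beyond delta.
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

# One outlier (error of 10) among otherwise perfect predictions.
pred = np.array([0.0, 0.0, 10.0])
target = np.zeros(3)
print(mse(pred, target), huber(pred, target))

# A confident prediction for class 0, scored against each possible label.
logits = np.array([5.0, 0.0, 0.0])
print(cross_entropy(logits, 0), cross_entropy(logits, 1))
```

In the first comparison the outlier dominates MSE far more than Huber; in the second, the loss is small when the confident prediction is right and large when it is wrong.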
Backpropagation: Computing the Gradient
Backpropagation is the chain rule of calculus applied systematically across the network, working backwards from the loss. It computes the partial derivative of the loss with respect to every weight and bias. Collected together, these derivatives form the gradient: a vector in parameter space pointing in the direction in which the loss increases fastest.
The optimiser (typically Adam or SGD with momentum) then takes a small step in the opposite direction, nudging each weight toward a local minimum of the loss.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch (illustrative only): 32 inputs of size 784, 10 classes
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))

# One training step
optimiser.zero_grad()           # clear old gradients
loss = criterion(model(x), y)   # forward pass + loss
loss.backward()                 # backprop
optimiser.step()                # update weights
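To see what backpropagation does under the hood, it helps to work the chain rule by hand on the smallest possible case. The sketch below assumes a single ReLU neuron with a squared-error loss; the input values, the learning rate, and the helper names are all illustrative. It computes the gradient analytically, checks it against a finite-difference estimate, and confirms that one step against the gradient lowers the loss.

```python
import numpy as np

def forward(w, b, x):
    z = np.dot(w, x) + b          # linear step
    a = np.maximum(0, z)          # ReLU
    return z, a

def loss_fn(w, b, x, y):
    _, a = forward(w, b, x)
    return 0.5 * (a - y) ** 2     # squared error

def backward(w, b, x, y):
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw
    z, a = forward(w, b, x)
    dL_da = a - y
    da_dz = 1.0 if z > 0 else 0.0  # ReLU derivative
    dL_dz = dL_da * da_dz
    grad_w = dL_dz * x             # dz/dw = x
    grad_b = dL_dz                 # dz/db = 1
    return grad_w, grad_b

x, y = np.array([1.0, 2.0]), 1.0
w, b = np.array([0.5, -0.25]), 0.1

grad_w, grad_b = backward(w, b, x, y)

# Finite-difference check of the analytic gradient.
eps = 1e-6
numeric_w = np.array([
    (loss_fn(w + eps * e, b, x, y) - loss_fn(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad_w, numeric_w, atol=1e-5))

# One plain gradient-descent step: move against the gradient.
lr = 0.1
old_loss = loss_fn(w, b, x, y)
new_loss = loss_fn(w - lr * grad_w, b - lr * grad_b, x, y)
print(old_loss, new_loss)
```

Automatic differentiation libraries carry out the same bookkeeping for every parameter in the network, and optimisers such as Adam replace the plain update with an adaptive one.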
Why Depth Matters
Each additional layer allows the network to build more abstract representations. In a convolutional image network, early layers tend to detect edges, middle layers textures, and deeper layers object parts such as eyes or wheels. This hierarchical composition is the key to deep learning's power. A single wide layer can approximate the same functions in principle, but for some functions it needs exponentially more units, so depth is by far the more economical choice.
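A small, well-known construction illustrates how composition pays off. The sketch below is an illustration under stated assumptions, not a claim about image networks: it composes a two-unit ReLU "tent" layer with itself and counts the linear pieces of the resulting function on [0, 1]. The `tent` and `count_linear_pieces` helpers are hypothetical names introduced for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def tent(x):
    # One tiny ReLU layer (two units): rises from 0 to 1 on [0, 0.5],
    # then falls from 1 back to 0 on [0.5, 1].
    return 2 * relu(x) - 4 * relu(x - 0.5)

def count_linear_pieces(y):
    # Counts runs of constant slope sign between grid points. For this
    # sawtooth every linear piece flips the sign, so runs equal pieces.
    slopes = np.sign(np.diff(y))
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1

x = np.linspace(0.0, 1.0, 2**12 + 1)  # grid containing every breakpoint

h = x
pieces = []
for depth in range(1, 6):
    h = tent(h)                        # add one more layer
    pieces.append(count_linear_pieces(h))

print(pieces)  # [2, 4, 8, 16, 32]
```

Each extra layer adds only two ReLU units yet doubles the number of linear pieces. A one-hidden-layer ReLU network on a scalar input with n units can produce at most n + 1 pieces, so matching depth 5 here would take at least 31 units, and the gap widens exponentially with depth.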
The unreasonable effectiveness of deep learning comes not from any single clever idea, but from composing many simple differentiable operations into a hierarchy that, given enough data and capacity, can approximate a very wide range of functions.