In the previous lessons you saw that neural networks have weights, and training adjusts those weights. But how does the network know which weights to change, and by how much? The answer is backpropagation - the single most important algorithm in modern deep learning.
Andrej Karpathy calls it "the most important thing to understand about neural networks." Let us see why.
During a forward pass, data flows left to right through the network: from the input, through each layer in turn, to the output, which is the network's prediction.
The prediction is then compared to the true answer using a loss function (covered in the next lesson). The loss is a single number that says: "Here is how wrong you are."
Imagine you bake a cake and it tastes awful. You used five ingredients. The question is: which ingredient contributed most to the bad taste, and by how much?
Backpropagation answers exactly this question for neural networks. It assigns blame to every single weight by asking: "If I nudge this weight slightly, how much does the loss change?"
That rate of change is called a gradient, and it comes from calculus - specifically, the derivative.
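The "nudge" idea can be tried directly in code. The sketch below is only an illustration: the one-weight loss function and the names loss and numerical_gradient are made up for this example, and real frameworks compute gradients analytically rather than by nudging.

```python
def loss(w):
    # Hypothetical one-weight model: prediction = 2 * w, target = 1.
    return (2.0 * w - 1.0) ** 2


def numerical_gradient(f, w, eps=1e-6):
    # Nudge w slightly in both directions and measure how the loss responds.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


w = 1.0
grad = numerical_gradient(loss, w)
print("loss at w:", loss(w))
print("approximate gradient:", grad)
```

For this loss the exact derivative is 2 × (2w - 1) × 2, which is 4 at w = 1, so the printed estimate should be very close to 4. Backpropagation produces the same kind of number for every weight at once, without re-running the network once per weight.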
Geoffrey Hinton, one of the "godfathers of AI," has said that backpropagation is the key idea that made deep learning practical. Without it, training networks with millions of parameters would be computationally impractical.
Neural networks are chains of simple operations composed together. The chain rule from calculus tells us how to differentiate composed functions:
If y = f(g(x)), then dy/dx = f'(g(x)) × g'(x).
Everyday analogy: You drive to a shop. Your speed depends on how hard you press the accelerator. The accelerator position depends on traffic. To know how traffic affects your speed, you multiply: (speed per accelerator press) × (accelerator press per traffic condition). That is the chain rule - multiplying local rates of change along a chain.
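The chain rule can be checked numerically. In the sketch below, the functions g(x) = 3x + 1 and f(u) = sin(u) are arbitrary choices for illustration, not part of the lesson; the point is only that multiplying the two local derivatives matches the slope measured by nudging x.

```python
import math


def g(x):
    return 3.0 * x + 1.0


def f(u):
    return math.sin(u)


def g_prime(x):
    return 3.0


def f_prime(u):
    return math.cos(u)


x = 0.5

# Chain rule: derivative of the outer function (evaluated at g(x))
# times the derivative of the inner function.
analytic = f_prime(g(x)) * g_prime(x)

# Independent check: nudge x and measure the change in f(g(x)).
eps = 1e-5
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

print("chain rule:", analytic)
print("numerical estimate:", numeric)
```

The two printed values should agree to several decimal places.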
Backpropagation was popularised in a landmark 1986 paper by Rumelhart, Hinton, and Williams, but the core idea of reverse-mode automatic differentiation dates back to the 1960s.
Modern frameworks like PyTorch build a computation graph during the forward pass. Every operation - add, multiply, ReLU - becomes a node. Backpropagation then walks this graph in reverse, applying the chain rule at each node to compute gradients.
Think of it like a river system. The loss is the ocean at the end. Backprop traces every tributary upstream to find how much each source (weight) contributed to the final flow.
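To make the computation graph concrete, here is a minimal pure-Python sketch of the idea. It is not PyTorch's implementation: the Value class and its methods are hypothetical names for this example, and it handles only scalar add, multiply, and ReLU.

```python
class Value:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to pass on

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Local derivative of a sum is 1 for each input.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # Local derivative with respect to one input is the other input.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))

        def _backward():
            # Local derivative is 1 when the input is positive, else 0.
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order the nodes so each one comes after everything it depends on.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the output with respect to itself
        # Walk the graph in reverse, applying the chain rule at each node.
        for node in reversed(order):
            node._backward()


w = Value(2.0)
x = Value(3.0)
b = Value(-1.0)
out = (w * x + b).relu()  # forward pass builds the graph
out.backward()            # backward pass fills in .grad on every node
print(out.data, w.grad, x.grad, b.grad)
```

Each operation records its inputs during the forward pass, and backward() visits the nodes in reverse order so that every node receives its full gradient before passing it upstream.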
Suppose L = (w × x - y)² with w = 2, x = 3, y = 10. Write diff for the intermediate value w × x - y.
Forward pass: w × x = 6, then diff = 6 - 10 = -4, then (-4)² = 16. Loss = 16.
Backward pass, first step: dL/d(diff) = 2 × (-4) = -8, then d(diff)/d(wx) = 1, so dL/d(wx) = -8.
Backward pass, second step: d(wx)/dw = x = 3, so dL/dw = -8 × 3 = -24.
The gradient of -24 tells us: increasing w will decrease the loss rapidly. That is exactly the signal we need to improve.
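The same worked example can be written out step by step in code. The variable names below are chosen for this sketch; each line of the backward pass corresponds to one local derivative from the calculation above.

```python
w, x, y = 2.0, 3.0, 10.0

# Forward pass: compute and keep every intermediate value.
wx = w * x        # 6.0
diff = wx - y     # -4.0
loss = diff ** 2  # 16.0

# Backward pass: one local derivative per forward step, in reverse order.
dloss_ddiff = 2.0 * diff  # derivative of diff**2 with respect to diff
ddiff_dwx = 1.0           # derivative of (wx - y) with respect to wx
dwx_dw = x                # derivative of (w * x) with respect to w

# Chain rule: multiply the local derivatives along the path from loss to w.
dloss_dw = dloss_ddiff * ddiff_dwx * dwx_dw

print("loss:", loss)
print("dL/dw:", dloss_dw)
```

Running this reproduces the hand calculation: a loss of 16 and a gradient of -24.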
In the chain rule, what do we do with the local derivatives at each node?
In a deep network, gradients must travel through many layers. Each layer multiplies the gradient by its local derivative. This creates two dangerous failure modes:
If local derivatives are small (e.g., the sigmoid function saturates near 0 or 1), repeated multiplication makes gradients shrink towards zero. Early layers barely learn - they receive almost no signal. This plagued early deep networks.
If local derivatives are large, gradients grow exponentially. Weights receive enormous updates and the network becomes unstable, producing NaN values.
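A rough way to see both failure modes is to multiply one local derivative by itself once per layer. This is a deliberate simplification: it assumes every layer contributes the same factor and ignores the weights, and the helper name gradient_scale is made up for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(local_derivative, num_layers):
    # Multiply the same local derivative once per layer,
    # as the chain rule does along a deep path.
    scale = 1.0
    for _ in range(num_layers):
        scale *= local_derivative
    return scale


# The sigmoid derivative peaks at z = 0, where it equals 0.25,
# so this is the best case for a sigmoid layer.
best_case_sigmoid = sigmoid_derivative(0.0)

vanishing = gradient_scale(best_case_sigmoid, 20)
exploding = gradient_scale(1.5, 20)  # 1.5 is an arbitrary factor above 1

print("sigmoid derivative at 0:", best_case_sigmoid)
print("after 20 layers (sigmoid best case):", vanishing)
print("after 20 layers (factor 1.5):", exploding)
```

Even in the sigmoid's best case, 20 layers shrink the signal to below one part in a hundred billion, while a factor only modestly above 1 grows it several thousandfold.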
ReLU's derivative is either 0 or 1 - it never shrinks the gradient when active. Why might this simple property have been revolutionary for training deep networks?
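Modern solutions include non-saturating activations such as ReLU, careful weight initialisation, normalisation layers, residual connections, and gradient clipping.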
Once backprop computes every gradient, the optimiser (next lesson) updates each weight:
w_new = w_old - learning_rate × gradient
The learning rate controls the step size. Too large and you overshoot; too small and training takes forever. The gradient tells you the direction; the learning rate tells you how far to step.
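The update rule and the effect of the learning rate can be sketched with the toy loss from earlier in this lesson, L = (w × x - y)² with x = 3 and y = 10. The function name train and the two learning rates are illustrative choices, not recommendations.

```python
def train(learning_rate, steps=50, w=2.0, x=3.0, y=10.0):
    for _ in range(steps):
        # Gradient of (w * x - y)**2 with respect to w, from the chain rule.
        gradient = 2.0 * (w * x - y) * x
        # Step against the gradient, scaled by the learning rate.
        w = w - learning_rate * gradient
    return w


best_w = 10.0 / 3.0  # the value of w that makes w * x equal y

w_small_step = train(learning_rate=0.01)
w_large_step = train(learning_rate=0.2)

print("target w:", best_w)
print("learning rate 0.01:", w_small_step)
print("learning rate 0.2:", w_large_step)
```

With the smaller learning rate, w settles close to 10/3. With the larger one, every step overshoots the minimum by more than the last, so w moves further away instead of converging.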
What causes vanishing gradients in deep networks?
Every time a model like ChatGPT is trained to improve its next-word prediction, every time a self-driving system is trained to refine its steering, backpropagation is running underneath. It is the algorithm that makes learning from mistakes mathematically precise.
Without backprop, we would have no efficient way to train networks with millions - or billions - of parameters.
What does a gradient tell us about a weight?
Karpathy emphasises that backprop is "just recursive application of the chain rule." If you understand the chain rule and computation graphs, you understand backprop. What other complex systems could be understood by breaking them into simple, composable pieces?