๐Ÿ“‰
AI Sprouts • Intermediate • ⏱️ 15 min read

Loss Functions and Optimisers


Backpropagation gives us gradients - but gradients of what? Before backprop can run, we need a single number that captures how wrong the model is. That number comes from a loss function. Once we have gradients, an optimiser decides how to update the weights. Together, they form the learning loop.

What Is a Loss Function?

A loss function (also called a cost function) takes the model's prediction and the true answer, and returns a number measuring "wrongness." The goal of training is to minimise this number.

Think of it like a score in golf - lower is better. A loss of 0 means a perfect prediction.

[Figure: a U-shaped curve with loss on the y-axis and weight value on the x-axis, and a ball rolling towards the minimum.]
Training is like rolling a ball downhill on the loss landscape, searching for the lowest point.

Loss Functions for Regression - MSE

When predicting continuous values (house prices, temperature), we use Mean Squared Error (MSE):

MSE = (1/n) ร— ฮฃ(predicted - actual)ยฒ

Squaring does two things: it makes all errors positive, and it punishes large errors disproportionately. Predict a house price off by ยฃ100k and the squared error is 100ร— worse than being off by ยฃ10k.
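The squaring behaviour is easy to check in a few lines of plain Python (a minimal sketch - `mse` here is our own helper, not a library function):

```python
def mse(predicted, actual):
    """Mean Squared Error: the average of the squared differences."""
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# A prediction off by 10 is punished 100x harder than one off by 1:
small_miss = mse([101.0], [100.0])  # error of 1  -> squared error 1.0
big_miss = mse([110.0], [100.0])    # error of 10 -> squared error 100.0
print(small_miss, big_miss)
```

Scale the errors up to £10k and £100k and the same 100× ratio holds, which is exactly the disproportionate punishment described above.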

๐Ÿคฏ

The least-squares idea behind MSE dates back to Carl Friedrich Gauss around 1795 - over two centuries before neural networks. He famously used it to recover the orbit of the asteroid Ceres.

Loss Functions for Classification - Cross-Entropy

When predicting categories (spam or not, cat vs dog), we use cross-entropy loss. It measures how far the model's predicted probabilities are from the true labels.

If the correct answer is "cat" and the model says 99% cat, the loss is tiny. If it says 10% cat, the loss is enormous. Cross-entropy has a useful property: it becomes infinitely unhappy when the model is confidently wrong, creating a strong gradient to correct the mistake.

Binary cross-entropy is for two-class problems. Categorical cross-entropy handles multiple classes by comparing probability distributions.
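The "infinitely unhappy when confidently wrong" behaviour can be sketched in plain Python (the epsilon guard against log(0) is our addition; real libraries handle this internally):

```python
import math

def binary_cross_entropy(p_pred, y_true, eps=1e-12):
    """Loss for one example. y_true is 0 or 1; p_pred is the
    predicted probability of class 1. eps guards against log(0)."""
    p_pred = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

# Correct answer is "cat" (label 1):
confident_right = binary_cross_entropy(0.99, 1)  # tiny loss
confident_wrong = binary_cross_entropy(0.10, 1)  # large loss
print(confident_right, confident_wrong)
```

As the predicted probability of the true class heads towards 0, the log term heads towards negative infinity, so the loss (and the corrective gradient) blows up.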

๐Ÿง Quick Check

Why is MSE a poor choice for classification tasks?


Gradient Descent - Rolling Downhill

With a loss function defined, we can visualise the loss landscape - a surface where each point represents a set of weights and the height is the loss. Training means finding the lowest valley.

Gradient descent is the algorithm that gets us there:

  1. Compute the gradient (slope) at the current position.
  2. Take a step in the opposite direction (downhill).
  3. Repeat.

The size of each step is controlled by the learning rate - arguably the most important hyperparameter in deep learning.
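The three-step loop above fits in a few lines. A minimal sketch on a one-dimensional loss (the function names are ours, for illustration):

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step opposite to the gradient (downhill)."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Minimise loss(w) = (w - 3)^2, whose gradient is 2(w - 3).
# The minimum sits at w = 3.
w = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, learning_rate=0.1, steps=100)
print(w)  # close to 3.0
```

Try replacing `learning_rate=0.1` with `1.1` and the iterate diverges instead of converging - the overshooting failure mode described next.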

The Learning Rate Dilemma

  • Too high: You overshoot the valley, bouncing back and forth or diverging entirely.
  • Too low: You creep along painfully slowly and may get stuck in a shallow local minimum.
  • Just right: You converge steadily to a good solution.
๐Ÿค”
Think about it:

Imagine hiking down a foggy mountain where you can only feel the slope directly under your feet. You step downhill, but you cannot see the whole landscape. How might you end up in a small dip that is not the deepest valley? This is the local minimum problem.

Flavours of Gradient Descent

Batch Gradient Descent

Computes the gradient using the entire dataset before each update. Accurate but painfully slow for large datasets - imagine re-reading every book in a library before correcting a single spelling mistake.

Stochastic Gradient Descent (SGD)

Updates weights after each single example. Fast but noisy - the path zigzags wildly. The noise can actually help escape local minima, which is a surprising benefit.

Mini-Batch Gradient Descent

The practical sweet spot. Computes gradients on a small batch (typically 32โ€“512 examples). Balances speed and stability, and is what virtually all modern training uses.
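The batching logic itself is simple. A sketch of how a dataset is shuffled and split into mini-batches each epoch (helper name is ours):

```python
import random

def minibatches(data, batch_size):
    """Shuffle once per epoch, then yield consecutive small batches."""
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

data = list(range(100))
batches = list(minibatches(data, batch_size=32))
print([len(b) for b in batches])  # [32, 32, 32, 4] - the last batch is smaller
```

With `batch_size=len(data)` this degenerates to batch gradient descent; with `batch_size=1` it becomes SGD - the two extremes described above.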

Modern Optimisers

Plain SGD has limitations. Researchers have developed smarter optimisers that adapt as they go.

SGD with Momentum

Like a heavy ball rolling downhill, momentum accumulates velocity in consistent directions and dampens oscillations. If the gradient keeps pointing the same way, momentum accelerates. If it keeps changing direction, momentum smooths it out.
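The "accumulating velocity" idea can be sketched as a single update rule (a common formulation; the exact form varies between libraries):

```python
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """Keep a running velocity: old velocity decays by beta,
    and the new gradient is folded in."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Feed in the same gradient three times and watch the steps grow:
w, v = 0.0, 0.0
step_sizes = []
for _ in range(3):
    w_new, v = momentum_step(w, v, grad=1.0)
    step_sizes.append(abs(w_new - w))
    w = w_new
print(step_sizes)  # each step is larger than the last
```

If the gradient flipped sign every step, the decayed velocity would partially cancel it instead - that is the oscillation-dampening effect.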

AdaGrad

Adapts the learning rate per parameter. Frequently updated weights get smaller steps; rarely updated weights get larger steps. Great for sparse data (like text), but the learning rate can shrink to zero over time.
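A sketch of the per-parameter idea for a single weight (the cache of squared gradients is what makes the effective rate shrink):

```python
import math

def adagrad_step(w, cache, grad, lr=0.1, eps=1e-8):
    """Divide the step by the root of all squared gradients seen so far."""
    cache += grad ** 2
    w -= lr * grad / (math.sqrt(cache) + eps)
    return w, cache

# Repeated updates to the same weight get smaller and smaller:
w, cache = 0.0, 0.0
step_sizes = []
for _ in range(3):
    w_new, cache = adagrad_step(w, cache, grad=1.0)
    step_sizes.append(abs(w_new - w))
    w = w_new
print(step_sizes)  # roughly 0.1, 0.071, 0.058 - monotonically shrinking
```

The cache only ever grows, which is both the feature (rarely-touched weights keep big steps) and the flaw (every step eventually shrinks towards zero).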

Adam (Adaptive Moment Estimation)

Combines momentum and per-parameter adaptive rates. It maintains running averages of both the gradient (first moment) and the squared gradient (second moment). Adam is the default choice for most practitioners today.
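A single-weight sketch of the Adam update, following the standard formulation with the usual default hyperparameters (real implementations vectorise this over all parameters):

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m tracks the gradient (first moment), v tracks the
    squared gradient (second moment); both are bias-corrected because
    they start at zero. t is the 1-based step count."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 0.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, grad=2.0, t=1)
print(w)  # the first step has magnitude close to lr, whatever the gradient scale
```

Notice the gradient appears in both numerator and denominator, so its raw scale largely cancels - that is the per-parameter adaptivity inherited from AdaGrad.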

๐Ÿง Quick Check

What advantage does Adam have over basic SGD?

Learning Rate Schedules

Rather than fixing the learning rate, modern training often schedules it:

  • Step decay: Halve the rate every N epochs.
  • Cosine annealing: Smoothly decrease following a cosine curve, sometimes with warm restarts.
  • Warmup: Start with a tiny rate, gradually increase, then decay. Used in Transformer training.

The intuition: take big steps early to explore broadly, then small steps later to fine-tune.
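Two of the schedules above, sketched as plain functions of the epoch (function names are ours; frameworks ship these as scheduler objects):

```python
import math

def step_decay(lr0, epoch, drop_every=10):
    """Halve the learning rate every drop_every epochs."""
    return lr0 * 0.5 ** (epoch // drop_every)

def cosine_annealing(lr0, epoch, total_epochs):
    """Smoothly decay from lr0 towards 0 along a cosine curve."""
    return lr0 * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

print(step_decay(0.1, epoch=25))                  # halved twice: 0.025
print(cosine_annealing(0.1, epoch=50, total_epochs=100))  # halfway: 0.05
```

Warmup is typically just a third piece: ramp the rate linearly from near zero for the first few epochs, then hand over to one of the decays above.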

Gradient Clipping - Safety Rails

Sometimes gradients explode (as we saw in the backpropagation lesson). Gradient clipping caps the gradient magnitude before the update step. If the gradient exceeds a threshold, it is scaled down proportionally. This is standard practice when training RNNs and Transformers.
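Clipping by global norm can be sketched in a few lines (a simplified version of what libraries provide as a utility):

```python
import math

def clip_by_norm(grads, max_norm):
    """If the gradient vector's norm exceeds max_norm,
    scale the whole vector down; otherwise leave it alone."""
    norm = math.sqrt(sum(g ** 2 for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # original norm was 5.0
print(clipped)  # roughly [0.6, 0.8] - direction preserved, magnitude capped at 1
```

Scaling the whole vector (rather than clipping each component) preserves the gradient's direction, which is why norm clipping is the usual choice.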

๐Ÿง Quick Check

What does gradient clipping prevent?

๐Ÿคฏ

The Adam optimiser paper (Kingma & Ba, 2014) has over 150,000 citations, making it one of the most cited papers in all of computer science.

Key Takeaways

  • Loss functions quantify how wrong a model is - MSE for regression, cross-entropy for classification.
  • Gradient descent minimises the loss by repeatedly stepping opposite to the gradient.
  • The learning rate controls step size and is critical to get right.
  • Adam is the go-to optimiser, combining momentum and adaptive rates.
  • Learning rate schedules and gradient clipping are essential training stabilisers.
๐Ÿค”
Think about it:

If you were training a model and the loss stopped decreasing after a few epochs, what would you investigate first - the learning rate, the loss function, or the data? Why?


๐Ÿ“š Further Reading

  • Andrej Karpathy - A Recipe for Training Neural Networks - Practical wisdom on loss debugging and optimiser selection
  • 3Blue1Brown - Gradient Descent - Stunning visual intuition for how gradient descent navigates loss landscapes
  • An Overview of Gradient Descent Optimisation Algorithms (Ruder, 2016) - Comprehensive comparison of SGD, Adam, and friends