๐Ÿ“‰
AI Sprouts • Intermediate • ⏱️ 15 min read

Loss Functions and Optimisers


Backpropagation gives us gradients - but gradients of what? Before backprop can run, we need a single number that captures how wrong the model is. That number comes from a loss function. Once we have gradients, an optimiser decides how to update the weights. Together, they form the learning loop.

What Is a Loss Function?

A loss function (also called a cost function) takes the model's prediction and the true answer, and returns a number measuring "wrongness." The goal of training is to minimise this number.

Think of it like a score in golf - lower is better. A loss of 0 means a perfect prediction.

[Figure: a U-shaped curve with loss on the y-axis and weight value on the x-axis, and a ball rolling towards the minimum.]
Training is like rolling a ball downhill on the loss landscape, searching for the lowest point.

Loss Functions for Regression - MSE

When predicting continuous values (house prices, temperature), we use Mean Squared Error (MSE):

MSE = (1/n) ร— ฮฃ(predicted - actual)ยฒ

Squaring does two things: it makes all errors positive, and it punishes large errors disproportionately. Predict a house price off by ยฃ100k and the squared error is 100ร— worse than being off by ยฃ10k.
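The squaring behaviour is easy to check in a few lines of plain Python (a minimal sketch - `mse` here is our own helper, not a library function):

```python
def mse(predicted, actual):
    """Mean Squared Error: the average of the squared differences."""
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# A prediction off by 10 is punished 100x harder than one off by 1:
small_miss = mse([101.0], [100.0])  # error of 1  -> squared error 1.0
big_miss = mse([110.0], [100.0])    # error of 10 -> squared error 100.0
print(small_miss, big_miss)
```

Scale the errors up to £10k and £100k and the same 100× ratio holds, which is exactly the disproportionate punishment described above.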

๐Ÿคฏ

The least-squares idea behind MSE dates back to Carl Friedrich Gauss around 1795 - over two centuries before neural networks. He famously used it to recover the orbit of the asteroid Ceres.

Loss Functions for Classification - Cross-Entropy

When predicting categories (spam or not, cat vs dog), we use cross-entropy loss. It measures how far the model's predicted probabilities are from the true labels.

If the correct answer is "cat" and the model says 99% cat, the loss is tiny. If it says 10% cat, the loss is enormous. Cross-entropy has a useful property: it becomes infinitely unhappy when the model is confidently wrong, creating a strong gradient to correct the mistake.

Binary cross-entropy is for two-class problems. Categorical cross-entropy handles multiple classes by comparing probability distributions.
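The "infinitely unhappy when confidently wrong" behaviour can be sketched in plain Python (the epsilon guard against log(0) is our addition; real libraries handle this internally):

```python
import math

def binary_cross_entropy(p_pred, y_true, eps=1e-12):
    """Loss for one example. y_true is 0 or 1; p_pred is the
    predicted probability of class 1. eps guards against log(0)."""
    p_pred = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

# Correct answer is "cat" (label 1):
confident_right = binary_cross_entropy(0.99, 1)  # tiny loss
confident_wrong = binary_cross_entropy(0.10, 1)  # large loss
print(confident_right, confident_wrong)
```

As the predicted probability of the true class heads towards 0, the log term heads towards negative infinity, so the loss (and the corrective gradient) blows up.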

๐Ÿง Quick Check

Why is MSE a poor choice for classification tasks?


Gradient Descent - Rolling Downhill

With a loss function defined, we can visualise the loss landscape - a surface where each point represents a set of weights and the height is the loss. Training means finding the lowest valley.

Gradient descent is the algorithm that gets us there:

  1. Compute the gradient (slope) at the current position.
  2. Take a step in the opposite direction (downhill).
  3. Repeat.

The size of each step is controlled by the learning rate - arguably the most important hyperparameter in deep learning.
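The three-step loop above fits in a few lines. A minimal sketch on a one-dimensional loss (the function names are ours, for illustration):

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step opposite to the gradient (downhill)."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Minimise loss(w) = (w - 3)^2, whose gradient is 2(w - 3).
# The minimum sits at w = 3.
w = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, learning_rate=0.1, steps=100)
print(w)  # close to 3.0
```

Try replacing `learning_rate=0.1` with `1.1` and the iterate diverges instead of converging - the overshooting failure mode described next.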

The Learning Rate Dilemma

  • Too high: You overshoot the valley, bouncing back and forth or diverging entirely.
  • Too low: You creep along painfully slowly and may get stuck in a shallow local minimum.
  • Just right: You converge steadily to a good solution.
๐Ÿค”
Think about it:

Imagine hiking down a foggy mountain where you can only feel the slope directly under your feet. You step downhill, but you cannot see the whole landscape. How might you end up in a small dip that is not the deepest valley? This is the local minimum problem.

Flavours of Gradient Descent

Batch Gradient Descent

Computes the gradient using the entire dataset before each update. Accurate but painfully slow for large datasets - imagine re-reading every book in a library before correcting a single spelling mistake.

Stochastic Gradient Descent (SGD)

Updates weights after each single example. Fast but noisy - the path zigzags wildly. The noise can actually help escape local minima, which is a surprising benefit.

Mini-Batch Gradient Descent

The practical sweet spot. Computes gradients on a small batch (typically 32โ€“512 examples). Balances speed and stability, and is what virtually all modern training uses.
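The batching logic itself is simple. A sketch of how a dataset is shuffled and split into mini-batches each epoch (helper name is ours):

```python
import random

def minibatches(data, batch_size):
    """Shuffle once per epoch, then yield consecutive small batches."""
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

data = list(range(100))
batches = list(minibatches(data, batch_size=32))
print([len(b) for b in batches])  # [32, 32, 32, 4] - the last batch is smaller
```

With `batch_size=len(data)` this degenerates to batch gradient descent; with `batch_size=1` it becomes SGD - the two extremes described above.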

Modern Optimisers

Plain SGD has limitations. Researchers have developed smarter optimisers that adapt as they go.

SGD with Momentum

Like a heavy ball rolling downhill, momentum accumulates velocity in consistent directions and dampens oscillations. If the gradient keeps pointing the same way, momentum accelerates. If it keeps changing direction, momentum smooths it out.
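The "accumulating velocity" idea can be sketched as a single update rule (a common formulation; the exact form varies between libraries):

```python
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """Keep a running velocity: old velocity decays by beta,
    and the new gradient is folded in."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Feed in the same gradient three times and watch the steps grow:
w, v = 0.0, 0.0
step_sizes = []
for _ in range(3):
    w_new, v = momentum_step(w, v, grad=1.0)
    step_sizes.append(abs(w_new - w))
    w = w_new
print(step_sizes)  # each step is larger than the last
```

If the gradient flipped sign every step, the decayed velocity would partially cancel it instead - that is the oscillation-dampening effect.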

AdaGrad

Adapts the learning rate per parameter. Frequently updated weights get smaller steps; rarely updated weights get larger steps. Great for sparse data (like text), but the learning rate can shrink to zero over time.
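A sketch of the per-parameter idea for a single weight (the cache of squared gradients is what makes the effective rate shrink):

```python
import math

def adagrad_step(w, cache, grad, lr=0.1, eps=1e-8):
    """Divide the step by the root of all squared gradients seen so far."""
    cache += grad ** 2
    w -= lr * grad / (math.sqrt(cache) + eps)
    return w, cache

# Repeated updates to the same weight get smaller and smaller:
w, cache = 0.0, 0.0
step_sizes = []
for _ in range(3):
    w_new, cache = adagrad_step(w, cache, grad=1.0)
    step_sizes.append(abs(w_new - w))
    w = w_new
print(step_sizes)  # roughly 0.1, 0.071, 0.058 - monotonically shrinking
```

The cache only ever grows, which is both the feature (rarely-touched weights keep big steps) and the flaw (every step eventually shrinks towards zero).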

Adam (Adaptive Moment Estimation)

Combines momentum and per-parameter adaptive rates. It maintains running averages of both the gradient (first moment) and the squared gradient (second moment). Adam is the default choice for most practitioners today.
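A single-weight sketch of the Adam update, following the standard formulation with the usual default hyperparameters (real implementations vectorise this over all parameters):

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m tracks the gradient (first moment), v tracks the
    squared gradient (second moment); both are bias-corrected because
    they start at zero. t is the 1-based step count."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 0.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, grad=2.0, t=1)
print(w)  # the first step has magnitude close to lr, whatever the gradient scale
```

Notice the gradient appears in both numerator and denominator, so its raw scale largely cancels - that is the per-parameter adaptivity inherited from AdaGrad.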

๐Ÿง Quick Check

What advantage does Adam have over basic SGD?

Learning Rate Schedules

Rather than fixing the learning rate, modern training often schedules it:

  • Step decay: Halve the rate every N epochs.
  • Cosine annealing: Smoothly decrease following a cosine curve, sometimes with warm restarts.
  • Warmup: Start with a tiny rate, gradually increase, then decay. Used in Transformer training.

The intuition: take big steps early to explore broadly, then small steps later to fine-tune.
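Two of the schedules above, sketched as plain functions of the epoch (function names are ours; frameworks ship these as scheduler objects):

```python
import math

def step_decay(lr0, epoch, drop_every=10):
    """Halve the learning rate every drop_every epochs."""
    return lr0 * 0.5 ** (epoch // drop_every)

def cosine_annealing(lr0, epoch, total_epochs):
    """Smoothly decay from lr0 towards 0 along a cosine curve."""
    return lr0 * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

print(step_decay(0.1, epoch=25))                  # halved twice: 0.025
print(cosine_annealing(0.1, epoch=50, total_epochs=100))  # halfway: 0.05
```

Warmup is typically just a third piece: ramp the rate linearly from near zero for the first few epochs, then hand over to one of the decays above.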

Gradient Clipping - Safety Rails

Sometimes gradients explode (as we saw in the backpropagation lesson). Gradient clipping caps the gradient magnitude before the update step. If the gradient exceeds a threshold, it is scaled down proportionally. This is standard practice when training RNNs and Transformers.
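Clipping by global norm can be sketched in a few lines (a simplified version of what libraries provide as a utility):

```python
import math

def clip_by_norm(grads, max_norm):
    """If the gradient vector's norm exceeds max_norm,
    scale the whole vector down; otherwise leave it alone."""
    norm = math.sqrt(sum(g ** 2 for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # original norm was 5.0
print(clipped)  # roughly [0.6, 0.8] - direction preserved, magnitude capped at 1
```

Scaling the whole vector (rather than clipping each component) preserves the gradient's direction, which is why norm clipping is the usual choice.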

๐Ÿง Quick Check

What does gradient clipping prevent?

๐Ÿคฏ

The Adam optimiser paper (Kingma & Ba, 2014) has over 150,000 citations, making it one of the most cited papers in all of computer science.

Key Takeaways

  • Loss functions quantify how wrong a model is - MSE for regression, cross-entropy for classification.
  • Gradient descent minimises the loss by repeatedly stepping opposite to the gradient.
  • The learning rate controls step size and is critical to get right.
  • Adam is the go-to optimiser, combining momentum and adaptive rates.
  • Learning rate schedules and gradient clipping are essential training stabilisers.
๐Ÿค”
Think about it:

If you were training a model and the loss stopped decreasing after a few epochs, what would you investigate first - the learning rate, the loss function, or the data? Why?


๐Ÿ“š Further Reading

  • Andrej Karpathy - A Recipe for Training Neural Networks - Practical wisdom on loss debugging and optimiser selection
  • 3Blue1Brown - Gradient Descent - Stunning visual intuition for how gradient descent navigates loss landscapes
  • An Overview of Gradient Descent Optimisation Algorithms (Ruder, 2016) - Comprehensive comparison of SGD, Adam, and friends