Neural Networks · Intermediate

Training & Backpropagation

How a network learns from mistakes

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

30 min · Explore at your own pace

Before We Begin

What we are learning today

Backpropagation is the bookkeeping system that makes neural-network training practical. After a prediction is made, we measure the error, calculate how much each weight contributed to that error, and then nudge the weights in the direction that should reduce future mistakes.

How this lesson fits

This module introduces the core architecture behind much of modern AI. Students follow information as it moves through layers, is transformed by weights and activations, and eventually becomes a prediction that can be improved through feedback.

The big question

How do large collections of simple numerical operations combine into a model that can recognize patterns humans struggle to hand-code?

  • Trace a forward pass through a network and explain what each layer contributes
  • Explain why nonlinear activations and gradients make learning possible
  • Relate abstract neural-network mechanics to practical perception tasks

Why You Should Care

Students often hear backpropagation described as if it were a mysterious black box. This lesson turns it into a comprehensible process: compute the loss, trace responsibility backward, and update parameters gradually until performance improves.

Where this is used today

  • Training the neural networks behind modern language, vision, and speech systems
  • Optimization problems where a differentiable model must improve through repeated feedback
  • Scientific and business models that tune parameters by following gradients

Think of it like this

It is like reviewing a team's performance after the game. You do not just say "we lost"; you identify which decisions mattered, how strongly they mattered, and what each player should change next time.

Easy mistake to make

Backpropagation is not magic learning dust. It is a structured accounting method for assigning credit and blame across many connected weights.

By the end, you should be able to say:

  • Explain why gradients indicate how a small parameter change should affect the loss
  • Describe how errors are propagated backward through successive layers using the chain rule
  • Connect learning rate, gradients, and convergence to stable or unstable training behavior
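To make the last point concrete, here is a minimal sketch of how the learning rate interacts with gradients and convergence. It uses a simple one-dimensional loss L(w) = w², not the lesson's network; the specific rates are illustrative assumptions.

```python
# Gradient descent on L(w) = w^2, where dL/dw = 2w.
# A small learning rate shrinks w toward the minimum at 0;
# a rate above 1.0 overshoots further each step and diverges.

def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w   # w <- w - lr * dL/dw
    return w

stable = descend(lr=0.1)    # each step multiplies w by 0.8
unstable = descend(lr=1.1)  # each step multiplies w by -1.2
print(stable, unstable)
```

The same intuition carries over to real networks: too small a rate means slow convergence, too large a rate means oscillation or divergence.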

Think about this first

If a model consistently predicts values that are too high, what kind of information would you need in order to decide which weights should decrease and by how much?

Words we will keep using

loss · gradient · learning rate · chain rule · epoch

Backpropagation: How the Network Learns from Mistakes

Backpropagation sounds intimidating, but it's really just a "blame game." When the network makes a mistake, we trace the error backward through the connections to find out which weights were responsible. Then we nudge them to do better next time.

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_1}$$

The chain rule is just a way of tracing influence. The output depends on the hidden units, the hidden units depend on the weights, so the error can be followed all the way back to each parameter.
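The chain rule can be checked numerically. This sketch assumes a single input-hidden-output path with sigmoid activations and squared-error loss L = ½(y − ŷ)²; the specific numbers are illustrative, and the finite-difference check confirms that the three chained factors really do equal the gradient.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One path through a tiny network (illustrative values).
x, w1, w_out, y = 0.5, 0.4, 0.8, 1.0

h1 = sigmoid(w1 * x)         # hidden unit
yhat = sigmoid(w_out * h1)   # output

# Chain rule: dL/dw1 = dL/dyhat * dyhat/dh1 * dh1/dw1
dL_dyhat = yhat - y                       # from L = 0.5 * (y - yhat)^2
dyhat_dh1 = yhat * (1 - yhat) * w_out     # sigmoid' at the output
dh1_dw1 = h1 * (1 - h1) * x               # sigmoid' at the hidden unit
grad = dL_dyhat * dyhat_dh1 * dh1_dw1

# Verify against a central finite-difference approximation
eps = 1e-6
def loss(w):
    return 0.5 * (y - sigmoid(w_out * sigmoid(w * x))) ** 2
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(grad - numeric) < 1e-8
```

Note that each factor is something the forward pass already computed (ŷ, h1), which is exactly why backpropagation is cheap: the backward pass reuses the forward pass's intermediate values.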

The 5 Moves to Watch

Step 1: Forward Pass
Step 2: Compute Loss
Step 3: Output Gradient
Step 4: Backprop to Hidden
Step 5: Weight Update
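The five moves above can be sketched as one training step for the 2-2-1 sigmoid network in the demo below. The initial weights mirror the demo's panel; the inputs (x1 = 0.5, x2 = 0.8), squared-error loss, and learning rate are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Assumed inputs, target, and learning rate (not specified in the demo).
x1, x2, y = 0.5, 0.8, 1.0
lr = 0.5
w = dict(w1=0.5, w2=-0.2, w3=0.3, w4=0.7, b1=0.1, b2=-0.1,
         wo1=0.8, wo2=0.6, bo=0.2)

# Step 1: forward pass
h1 = sigmoid(w["w1"] * x1 + w["w2"] * x2 + w["b1"])
h2 = sigmoid(w["w3"] * x1 + w["w4"] * x2 + w["b2"])
yhat = sigmoid(w["wo1"] * h1 + w["wo2"] * h2 + w["bo"])

# Step 2: compute loss
L = 0.5 * (y - yhat) ** 2

# Step 3: output gradient (dL/d pre-activation of yhat)
d_out = (yhat - y) * yhat * (1 - yhat)

# Step 4: backprop the error to each hidden unit
d_h1 = d_out * w["wo1"] * h1 * (1 - h1)
d_h2 = d_out * w["wo2"] * h2 * (1 - h2)

# Step 5: weight update (one gradient-descent step)
w["wo1"] -= lr * d_out * h1
w["wo2"] -= lr * d_out * h2
w["bo"]  -= lr * d_out
w["w1"]  -= lr * d_h1 * x1
w["w2"]  -= lr * d_h1 * x2
w["b1"]  -= lr * d_h1
w["w3"]  -= lr * d_h2 * x1
w["w4"]  -= lr * d_h2 * x2
w["b2"]  -= lr * d_h2
```

Running the forward pass again after the update gives a slightly smaller loss; repeating the whole cycle is exactly what the "100 Steps" button in the demo animates.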

Live 2-Layer Network — Watch Weights Update

Node colour: green = high, red = low. Blue edges = positive weight, red = negative. Dashed node = true label y.

Loss curve

Click 100 Steps to watch it learn!

Network state (epoch 0)
h₁ = 0.5474 h₂ = 0.6479
ŷ = 0.7363 y = 1.0000
L = 0.034778

Current weight values

w1 (x1→h1): 0.500
w2 (x2→h1): -0.200
w3 (x1→h2): 0.300
w4 (x2→h2): 0.700
b1 (bias h1): 0.100
b2 (bias h2): -0.100
wOut1 (h1→ŷ): 0.800
wOut2 (h2→ŷ): 0.600
bOut: 0.200
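The epoch-0 numbers in the panel can be reproduced with a short forward pass. The inputs are not displayed, so x1 = 0.5 and x2 = 0.8 are an inference; together with sigmoid activations and L = ½(y − ŷ)², they match every shown value to display precision.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Assumed inputs (not shown in the panel): x1 = 0.5, x2 = 0.8
x1, x2, y = 0.5, 0.8, 1.0

h1 = sigmoid(0.5 * x1 + -0.2 * x2 + 0.1)   # -> 0.5474
h2 = sigmoid(0.3 * x1 + 0.7 * x2 + -0.1)   # -> 0.6479
yhat = sigmoid(0.8 * h1 + 0.6 * h2 + 0.2)  # -> 0.7363
L = 0.5 * (y - yhat) ** 2                  # -> 0.034778

print(round(h1, 4), round(h2, 4), round(yhat, 4), round(L, 6))
```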

Vanishing & Exploding Gradients

In very deep networks, the gradient can shrink until learning becomes painfully slow, or grow until training becomes unstable. That is why modern architectures use tools like ReLU, normalization, and residual connections to keep learning healthy.
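A rough sketch of why this happens: during backprop, each sigmoid layer multiplies the backward signal by sigmoid′(z) · w. Since sigmoid′(z) ≤ 0.25, a deep stack of such factors shrinks the gradient exponentially unless the weights are large, in which case it explodes instead. The depths and weight values below are illustrative assumptions.

```python
import math

def dsigmoid(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)   # maximum value 0.25, at z = 0

z = 0.0

# Vanishing: with weight 1.0, each layer multiplies by 0.25
grad = 1.0
for _ in range(30):
    grad *= dsigmoid(z) * 1.0
print(grad)       # astronomically small after 30 layers

# Exploding: with weight 8.0, each layer multiplies by 2.0
grad_big = 1.0
for _ in range(30):
    grad_big *= dsigmoid(z) * 8.0
print(grad_big)   # astronomically large after 30 layers
```

ReLU (whose derivative is exactly 1 on the active side), normalization, and residual connections all work by keeping these per-layer factors close to 1.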