Neural Networks · Intermediate

Training & Backpropagation

How a network learns from mistakes

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

30 min · Explore at your own pace

Before We Begin

What we are learning today

Backpropagation is the bookkeeping system that makes neural-network training practical. After a prediction is made, we measure the error, calculate how much each weight contributed to that error, and then nudge the weights in the direction that should reduce future mistakes.

How this lesson fits

This module introduces the core architecture behind much of modern AI. Students follow information as it moves through layers, is transformed by weights and activations, and eventually becomes a prediction that can be improved through feedback.

The big question

How do large collections of simple numerical operations combine into a model that can recognize patterns humans struggle to hand-code?

  • Trace a forward pass through a network and explain what each layer contributes
  • Explain why nonlinear activations and gradients make learning possible
  • Relate abstract neural-network mechanics to practical perception tasks

Why You Should Care

Students often hear backpropagation described as if it were a mysterious black box. This lesson turns it into a comprehensible process: compute the loss, trace responsibility backward, and update parameters gradually until performance improves.

Where this is used today

  • Training the neural networks behind modern language, vision, and speech systems
  • Optimization problems where a differentiable model must improve through repeated feedback
  • Scientific and business models that tune parameters by following gradients

Think of it like this

It is like reviewing a team's performance after the game. You do not just say "we lost"; you identify which decisions mattered, how strongly they mattered, and what each player should change next time.

Easy mistake to make

Backpropagation is not magic learning dust. It is a structured accounting method for assigning credit and blame across many connected weights.

By the end, you should be able to say:

  • Explain why gradients indicate how a small parameter change should affect the loss
  • Describe how errors are propagated backward through successive layers using the chain rule
  • Connect learning rate, gradients, and convergence to stable or unstable training behavior
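To make the last point concrete, here is a minimal sketch of how the learning rate interacts with gradients and convergence. It uses a simple one-dimensional loss L(w) = w², not the lesson's network; the specific rates are illustrative assumptions.

```python
# Gradient descent on L(w) = w^2, where dL/dw = 2w.
# A small learning rate shrinks w toward the minimum at 0;
# a rate above 1.0 overshoots further each step and diverges.

def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w   # w <- w - lr * dL/dw
    return w

stable = descend(lr=0.1)    # each step multiplies w by 0.8
unstable = descend(lr=1.1)  # each step multiplies w by -1.2
print(stable, unstable)
```

The same intuition carries over to real networks: too small a rate means slow convergence, too large a rate means oscillation or divergence.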

Think about this first

If a model consistently predicts values that are too high, what kind of information would you need in order to decide which weights should decrease and by how much?

Words we will keep using

loss · gradient · learning rate · chain rule · epoch

Backpropagation: How the Network Learns from Mistakes

Backpropagation sounds intimidating, but it's really just a "blame game." When the network makes a mistake, we trace the error backward through the connections to find out which weights were responsible. Then we nudge them to do better next time.

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_1}$$

The chain rule is just a way of tracing influence. The output depends on the hidden units, the hidden units depend on the weights, so the error can be followed all the way back to each parameter.
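The chain rule can be checked numerically. This sketch assumes a single input-hidden-output path with sigmoid activations and squared-error loss L = ½(y − ŷ)²; the specific numbers are illustrative, and the finite-difference check confirms that the three chained factors really do equal the gradient.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One path through a tiny network (illustrative values).
x, w1, w_out, y = 0.5, 0.4, 0.8, 1.0

h1 = sigmoid(w1 * x)         # hidden unit
yhat = sigmoid(w_out * h1)   # output

# Chain rule: dL/dw1 = dL/dyhat * dyhat/dh1 * dh1/dw1
dL_dyhat = yhat - y                       # from L = 0.5 * (y - yhat)^2
dyhat_dh1 = yhat * (1 - yhat) * w_out     # sigmoid' at the output
dh1_dw1 = h1 * (1 - h1) * x               # sigmoid' at the hidden unit
grad = dL_dyhat * dyhat_dh1 * dh1_dw1

# Verify against a central finite-difference approximation
eps = 1e-6
def loss(w):
    return 0.5 * (y - sigmoid(w_out * sigmoid(w * x))) ** 2
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(grad - numeric) < 1e-8
```

Note that each factor is something the forward pass already computed (ŷ, h1), which is exactly why backpropagation is cheap: the backward pass reuses the forward pass's intermediate values.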

The 5 Moves to Watch

Step 1: Forward Pass
Step 2: Compute Loss
Step 3: Output Gradient
Step 4: Backprop to Hidden
Step 5: Weight Update
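The five moves above can be sketched as one training step for the 2-2-1 sigmoid network in the demo below. The initial weights mirror the demo's panel; the inputs (x1 = 0.5, x2 = 0.8), squared-error loss, and learning rate are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Assumed inputs, target, and learning rate (not specified in the demo).
x1, x2, y = 0.5, 0.8, 1.0
lr = 0.5
w = dict(w1=0.5, w2=-0.2, w3=0.3, w4=0.7, b1=0.1, b2=-0.1,
         wo1=0.8, wo2=0.6, bo=0.2)

# Step 1: forward pass
h1 = sigmoid(w["w1"] * x1 + w["w2"] * x2 + w["b1"])
h2 = sigmoid(w["w3"] * x1 + w["w4"] * x2 + w["b2"])
yhat = sigmoid(w["wo1"] * h1 + w["wo2"] * h2 + w["bo"])

# Step 2: compute loss
L = 0.5 * (y - yhat) ** 2

# Step 3: output gradient (dL/d pre-activation of yhat)
d_out = (yhat - y) * yhat * (1 - yhat)

# Step 4: backprop the error to each hidden unit
d_h1 = d_out * w["wo1"] * h1 * (1 - h1)
d_h2 = d_out * w["wo2"] * h2 * (1 - h2)

# Step 5: weight update (one gradient-descent step)
w["wo1"] -= lr * d_out * h1
w["wo2"] -= lr * d_out * h2
w["bo"]  -= lr * d_out
w["w1"]  -= lr * d_h1 * x1
w["w2"]  -= lr * d_h1 * x2
w["b1"]  -= lr * d_h1
w["w3"]  -= lr * d_h2 * x1
w["w4"]  -= lr * d_h2 * x2
w["b2"]  -= lr * d_h2
```

Running the forward pass again after the update gives a slightly smaller loss; repeating the whole cycle is exactly what the "100 Steps" button in the demo animates.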

Live 2-Layer Network — Watch Weights Update

Node colour: green = high, red = low. Blue edges = positive weight, red = negative. Dashed node = true label y.

Loss curve

Click 100 Steps to watch it learn!

Network state (epoch 0)
h₁ = 0.5474 h₂ = 0.6479
ŷ = 0.7363 y = 1.0000
L = 0.034778

Current weight values

w1 (x1→h1): 0.500
w2 (x2→h1): -0.200
w3 (x1→h2): 0.300
w4 (x2→h2): 0.700
b1 (bias h1): 0.100
b2 (bias h2): -0.100
wOut1 (h1→ŷ): 0.800
wOut2 (h2→ŷ): 0.600
bOut: 0.200
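The epoch-0 numbers in the panel can be reproduced with a short forward pass. The inputs are not displayed, so x1 = 0.5 and x2 = 0.8 are an inference; together with sigmoid activations and L = ½(y − ŷ)², they match every shown value to display precision.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Assumed inputs (not shown in the panel): x1 = 0.5, x2 = 0.8
x1, x2, y = 0.5, 0.8, 1.0

h1 = sigmoid(0.5 * x1 + -0.2 * x2 + 0.1)   # -> 0.5474
h2 = sigmoid(0.3 * x1 + 0.7 * x2 + -0.1)   # -> 0.6479
yhat = sigmoid(0.8 * h1 + 0.6 * h2 + 0.2)  # -> 0.7363
L = 0.5 * (y - yhat) ** 2                  # -> 0.034778

print(round(h1, 4), round(h2, 4), round(yhat, 4), round(L, 6))
```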

Vanishing & Exploding Gradients

In very deep networks, the gradient can shrink until learning becomes painfully slow, or grow until training becomes unstable. That is why modern architectures use tools like ReLU, normalization, and residual connections to keep learning healthy.
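A rough sketch of why this happens: during backprop, each sigmoid layer multiplies the backward signal by sigmoid′(z) · w. Since sigmoid′(z) ≤ 0.25, a deep stack of such factors shrinks the gradient exponentially unless the weights are large, in which case it explodes instead. The depths and weight values below are illustrative assumptions.

```python
import math

def dsigmoid(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)   # maximum value 0.25, at z = 0

z = 0.0

# Vanishing: with weight 1.0, each layer multiplies by 0.25
grad = 1.0
for _ in range(30):
    grad *= dsigmoid(z) * 1.0
print(grad)       # astronomically small after 30 layers

# Exploding: with weight 8.0, each layer multiplies by 2.0
grad_big = 1.0
for _ in range(30):
    grad_big *= dsigmoid(z) * 8.0
print(grad_big)   # astronomically large after 30 layers
```

ReLU (whose derivative is exactly 1 on the active side), normalization, and residual connections all work by keeping these per-layer factors close to 1.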