Backpropagation — short for "backward propagation of errors" — is the algorithm that made multilayer neural networks practical. Although the mathematical idea of computing gradients by the chain rule was known earlier (Werbos, 1974), it was Rumelhart, Hinton, and Williams (1986) who demonstrated its power for training networks with hidden layers, publishing their landmark paper in Nature. The algorithm computes how much each weight in the network contributed to the overall error, then adjusts every weight simultaneously in the direction that reduces that error.
The Algorithm
Output layer delta: δₖ = (tₖ − oₖ) · f′(netₖ)
Hidden layer delta: δⱼ = f′(netⱼ) · Σₖ wₖⱼ · δₖ
Weight update: Δwⱼᵢ = η · δⱼ · xᵢ
where η is the learning rate, tₖ and oₖ are the target and actual output of output unit k, netⱼ is the net input to unit j, f′ is the derivative of the activation function, and xᵢ is the activation of sending unit i.
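As a quick numeric check of the output-layer delta formula, assuming a sigmoid activation (whose derivative is conveniently f′(net) = o·(1 − o)); the target and net-input values here are illustrative, not from the text:

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Output-layer delta for one unit: delta_k = (t_k - o_k) * f'(net_k).
net_k = 0.5                   # illustrative net input
o_k = sigmoid(net_k)          # unit's actual output, ~0.622
t_k = 1.0                     # illustrative target
delta_k = (t_k - o_k) * o_k * (1.0 - o_k)   # sigmoid derivative = o * (1 - o)
print(round(delta_k, 4))      # → 0.0887
```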
The computation proceeds in two phases. In the forward pass, the input is propagated through the network layer by layer to produce an output. In the backward pass, the error at the output is propagated backward through the network, computing the "delta" (local error signal) for each unit. For an output unit, the delta is the error (target minus output) times the derivative of the activation function at that unit; for a hidden unit, it is that derivative times the weighted sum of the deltas from the layer above. The weight update for any connection is then simply the product of the learning rate, the sending unit's activation, and the receiving unit's delta.
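The two phases can be sketched for a single-hidden-layer network with sigmoid activations. This is a minimal illustration, not a production implementation; the layer sizes, function names, and the omission of bias terms are simplifying assumptions:

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_step(x, t, w_hidden, w_out, eta=0.5):
    """One forward/backward pass for a one-hidden-layer net (biases omitted for brevity)."""
    # Forward pass: propagate the input layer by layer.
    net_h = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    h = [sigmoid(n) for n in net_h]
    net_o = [sum(w * hj for w, hj in zip(row, h)) for row in w_out]
    o = [sigmoid(n) for n in net_o]

    # Backward pass: output deltas use the error directly.
    delta_o = [(tk - ok) * ok * (1 - ok) for tk, ok in zip(t, o)]
    # Hidden deltas: activation derivative times weighted sum of deltas from above.
    delta_h = [hj * (1 - hj) * sum(w_out[k][j] * delta_o[k]
                                   for k in range(len(delta_o)))
               for j, hj in enumerate(h)]

    # Weight updates: learning rate * receiving unit's delta * sending activation.
    for k in range(len(w_out)):
        for j in range(len(h)):
            w_out[k][j] += eta * delta_o[k] * h[j]
    for j in range(len(w_hidden)):
        for i in range(len(x)):
            w_hidden[j][i] += eta * delta_h[j] * x[i]
    return o
```

Calling `train_step` repeatedly on the same input-target pair drives the output toward the target, since each step moves every weight down the error gradient.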
Learning Dynamics and Challenges
Backpropagation performs gradient descent on the error surface — a high-dimensional landscape whose shape depends on the training data, the architecture, and the activation functions. The learning trajectory can be affected by local minima (though in practice these are rarely a serious problem for large networks), saddle points, and flat plateaus where learning stalls. Practical enhancements include momentum (adding a fraction of the previous weight change to the current update), adaptive learning rates, and weight decay (a regularization term that penalizes large weights).
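The momentum and weight-decay enhancements mentioned above can be written as a single modified update rule. A minimal sketch; the coefficient values and parameter names (alpha for the momentum fraction, decay for the penalty strength) are illustrative assumptions, not from the text:

```python
def update_with_momentum(w, grad, prev_dw, eta=0.1, alpha=0.9, decay=1e-4):
    """One weight update combining momentum and weight decay.

    grad is the basic gradient term (delta * sending activation).
    Momentum adds a fraction alpha of the previous weight change;
    weight decay shrinks the weight toward zero each step.
    """
    dw = eta * grad + alpha * prev_dw - eta * decay * w
    return w + dw, dw
```

When successive gradients point in a consistent direction, the momentum term accumulates and the effective step size grows, which helps the trajectory cross flat plateaus where plain gradient descent stalls.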
Whether the brain implements anything like backpropagation has been debated since the algorithm was introduced. Critics note that biological neurons have no access to the symmetric weight matrices required for the backward pass (the "weight transport" problem), that learning in the brain appears more local, and that error signals would need to be propagated backward across many synapses. However, recent proposals such as predictive coding, feedback alignment, and equilibrium propagation suggest that biologically plausible mechanisms can approximate the gradient computations of backpropagation, keeping the debate active.
In cognitive modeling, backpropagation's significance extends beyond its role as a training algorithm. The internal representations that emerge in hidden layers after learning have been used to explain phenomena in language acquisition (past-tense learning), reading (mapping orthography to phonology), and semantic cognition (learning the structure of conceptual knowledge). The representations are not hand-coded but emerge from the statistics of the training environment, providing a compelling account of how structured knowledge could arise from experience.