Temporal difference (TD) learning, developed by Richard Sutton in 1988, is a reinforcement learning algorithm that updates predictions about future rewards based on the discrepancy between consecutive predictions. TD learning bridges the Rescorla-Wagner model from animal learning theory and dynamic programming from optimal control theory.
The core TD(0) update adjusts the value estimate of the current state in proportion to the TD error:

V(sₜ) ← V(sₜ) + α · δₜ

δₜ = rₜ₊₁ + γ · V(sₜ₊₁) − V(sₜ)

where:
V(s) = estimated value of state s
δₜ = TD error (reward prediction error)
rₜ₊₁ = reward received at time t+1
γ = discount factor (0 to 1)
α = learning rate
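The update above can be sketched as a minimal tabular TD(0) learner. The environment here is a hypothetical five-state chain (not from the original text): the agent walks left to right and receives a reward of 1.0 on reaching the terminal state.

```python
# Minimal TD(0) sketch on a hypothetical 5-state chain task:
# states 0..4, deterministic left-to-right transitions, reward 1.0
# on entering the terminal state 4.
N_STATES = 5
ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

V = [0.0] * N_STATES  # value estimates, initialized to zero

for episode in range(500):
    s = 0
    while s < N_STATES - 1:
        s_next = s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Terminal state contributes no future value
        v_next = 0.0 if s_next == N_STATES - 1 else V[s_next]
        # TD error: delta = r + gamma * V(s') - V(s)
        delta = r + GAMMA * v_next - V[s]
        V[s] += ALPHA * delta  # move V(s) toward the bootstrapped target
        s = s_next
```

After training, the estimates approach the γ-discounted distances to reward: V[3] ≈ 1.0, V[2] ≈ 0.9, V[1] ≈ 0.81, V[0] ≈ 0.729.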
Connection to Dopamine
In a landmark discovery, Schultz, Dayan, and Montague (1997) showed that the firing patterns of midbrain dopamine neurons closely match the TD prediction error signal. Dopamine neurons fire when rewards are unexpected (positive δ), pause when expected rewards are omitted (negative δ), and show no response to fully predicted rewards (δ = 0). This correspondence has become one of the most successful examples of a computational model directly predicting neural activity.
Relationship to Rescorla-Wagner
The Rescorla-Wagner model can be seen as a special case of TD learning where there is only one time step between CS and US. TD learning generalizes this by allowing prediction errors to propagate backwards through multiple time steps, explaining phenomena like second-order conditioning and the timing of conditioned responses that the Rescorla-Wagner model cannot address.
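The backward propagation of value can be illustrated with a sketch of second-order conditioning (a hypothetical two-phase experiment, not from the original text): CS1 is first paired with the reward, then CS2 is paired only with CS1. Because the TD error bootstraps from V(CS1), CS2 acquires value despite never being followed by the reward itself, which a one-step Rescorla-Wagner update cannot produce.

```python
ALPHA = 0.2
GAMMA = 1.0

V = {"CS1": 0.0, "CS2": 0.0}

# Phase 1: first-order conditioning -- CS1 is followed by the US (r = 1.0).
for _ in range(100):
    delta = 1.0 + GAMMA * 0.0 - V["CS1"]  # next state is terminal (value 0)
    V["CS1"] += ALPHA * delta

# Phase 2: second-order conditioning -- CS2 is followed by CS1, never by
# the US directly.  The TD error for CS2 bootstraps from V(CS1).
# (Simplification: V(CS1) is held fixed here; in a full simulation it
# would also extinguish, since the US no longer follows it.)
for _ in range(100):
    delta = 0.0 + GAMMA * V["CS1"] - V["CS2"]
    V["CS2"] += ALPHA * delta
```

At the end of phase 2, V["CS2"] ≈ 1.0 even though no reward was ever delivered during a CS2 trial.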