Policy gradient methods, unlike value-based approaches like Q-learning, directly parameterize and optimize the policy — the mapping from states to action probabilities. The REINFORCE algorithm (Williams, 1992) provides a simple, elegant way to estimate the gradient of expected reward with respect to policy parameters using only sampled trajectories.
The Policy Gradient Theorem
Gradient: ∇_θ J(θ) = E_τ[Σₜ ∇_θ log π_θ(aₜ|sₜ) · Rₜ], where Rₜ is the return from time t onward
Update: θ ← θ + α · ∇_θ log π_θ(aₜ|sₜ) · (Rₜ − b)
b is a baseline (e.g., the average reward); subtracting it leaves the gradient estimate unbiased while reducing its variance
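The update rule above can be sketched in a few lines of numpy. This is a minimal illustration, not a reference implementation: the environment is a hypothetical two-armed bandit (the arm means, step sizes, and iteration count are all assumed for the example), the policy is a softmax over per-arm logits, and the baseline b is a running average of reward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # assumed toy bandit: arm 1 pays more

theta = np.zeros(2)   # policy parameters: one logit per arm
baseline = 0.0        # b, running average of reward
alpha = 0.1           # step size (assumed)

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                 # sample action from π_θ
    r = rng.normal(true_means[a], 0.1)         # sample reward R
    grad_log_pi = -probs                       # ∇_θ log π_θ(a) for a softmax:
    grad_log_pi[a] += 1.0                      # one-hot(a) − probs
    theta += alpha * grad_log_pi * (r - baseline)  # θ ← θ + α ∇log π · (R − b)
    baseline += 0.05 * (r - baseline)          # move baseline toward mean reward

print(softmax(theta))  # policy should now concentrate on the better arm
```

Note that the baseline only shifts the scale of the update; because E[∇log π] = 0 under the policy, it does not bias the gradient.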
Actor-Critic Methods
Actor-critic methods combine policy gradient (the actor, which selects actions) with value function estimation (the critic, which evaluates states). The critic provides a learned, state-dependent baseline that reduces the variance of the policy gradient estimate, often dramatically improving learning speed. The actor-critic architecture has also been proposed as a model of corticostriatal interactions, with the dorsal striatum implementing the policy (actor) and the ventral striatum computing value estimates (critic), both trained by dopaminergic prediction-error signals.
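A one-step (TD) actor-critic can be sketched as follows. Everything environment-specific here is assumed for illustration: a hypothetical 3-state chain where action 0 moves right (reward 1 on reaching the terminal state) and action 1 stays put. The critic is a table of state values V(s); its TD error both trains V and replaces (R − b) in the actor's update.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))  # actor: per-state action logits
V = np.zeros(n_states)                   # critic: tabular state values
alpha_actor, alpha_critic, gamma = 0.2, 0.2, 0.99  # assumed hyperparameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(500):
    s = 0
    for t in range(20):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        if a == 0:                                   # move right
            s_next, r = s + 1, (1.0 if s + 1 == 2 else 0.0)
        else:                                        # stay put
            s_next, r = s, 0.0
        done = s_next == 2
        # TD error δ = r + γ V(s') − V(s): the critic's prediction error,
        # used as the advantage estimate for the actor
        td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
        V[s] += alpha_critic * td_error              # critic update
        grad = -probs
        grad[a] += 1.0                               # ∇ log π for softmax
        theta[s] += alpha_actor * grad * td_error    # actor update
        if done:
            break
        s = s_next

print(softmax(theta[0]), softmax(theta[1]))  # both states should prefer "right"
```

The key design point is that the same TD error drives both updates, which is exactly the property that invites the analogy to a single dopaminergic prediction-error signal training both structures.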