Policy gradient methods, unlike value-based approaches like Q-learning, directly parameterize and optimize the policy — the mapping from states to action probabilities. The REINFORCE algorithm (Williams, 1992) provides a simple, elegant way to estimate the gradient of expected reward with respect to policy parameters using only sampled trajectories.
The Policy Gradient Theorem
Gradient: ∇_θ J(θ) = E_τ[Σₜ ∇_θ log π_θ(aₜ|sₜ) · Rₜ], where Rₜ is the return from time t onward
Update: θ ← θ + α · ∇_θ log π_θ(aₜ|sₜ) · (Rₜ − b)
b is a baseline (e.g., the average reward); subtracting it leaves the gradient estimate unbiased while reducing its variance
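The update rule above can be sketched in a few lines of numpy. This is a minimal illustration, not a reference implementation: the environment is a hypothetical two-armed bandit (the arm means, step sizes, and iteration count are all assumed for the example), the policy is a softmax over per-arm logits, and the baseline b is a running average of reward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # assumed toy bandit: arm 1 pays more

theta = np.zeros(2)   # policy parameters: one logit per arm
baseline = 0.0        # b, running average of reward
alpha = 0.1           # step size (assumed)

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                 # sample action from π_θ
    r = rng.normal(true_means[a], 0.1)         # sample reward R
    grad_log_pi = -probs                       # ∇_θ log π_θ(a) for a softmax:
    grad_log_pi[a] += 1.0                      # one-hot(a) − probs
    theta += alpha * grad_log_pi * (r - baseline)  # θ ← θ + α ∇log π · (R − b)
    baseline += 0.05 * (r - baseline)          # move baseline toward mean reward

print(softmax(theta))  # policy should now concentrate on the better arm
```

Note that the baseline only shifts the scale of the update; because E[∇log π] = 0 under the policy, it does not bias the gradient.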
Actor-Critic Methods
Actor-critic methods combine policy gradient (the actor, which selects actions) with value function estimation (the critic, which evaluates states). The critic provides a learned, state-dependent baseline that reduces the variance of the policy gradient estimate, often dramatically improving learning speed. The actor-critic architecture has also been proposed as a model of corticostriatal interactions, with the dorsal striatum implementing the policy (actor) and the ventral striatum computing value estimates (critic), both trained by dopaminergic prediction-error signals.
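A one-step (TD) actor-critic can be sketched as follows. Everything environment-specific here is assumed for illustration: a hypothetical 3-state chain where action 0 moves right (reward 1 on reaching the terminal state) and action 1 stays put. The critic is a table of state values V(s); its TD error both trains V and replaces (R − b) in the actor's update.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))  # actor: per-state action logits
V = np.zeros(n_states)                   # critic: tabular state values
alpha_actor, alpha_critic, gamma = 0.2, 0.2, 0.99  # assumed hyperparameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(500):
    s = 0
    for t in range(20):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        if a == 0:                                   # move right
            s_next, r = s + 1, (1.0 if s + 1 == 2 else 0.0)
        else:                                        # stay put
            s_next, r = s, 0.0
        done = s_next == 2
        # TD error δ = r + γ V(s') − V(s): the critic's prediction error,
        # used as the advantage estimate for the actor
        td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
        V[s] += alpha_critic * td_error              # critic update
        grad = -probs
        grad[a] += 1.0                               # ∇ log π for softmax
        theta[s] += alpha_actor * grad * td_error    # actor update
        if done:
            break
        s = s_next

print(softmax(theta[0]), softmax(theta[1]))  # both states should prefer "right"
```

The key design point is that the same TD error drives both updates, which is exactly the property that invites the analogy to a single dopaminergic prediction-error signal training both structures.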