Reinforcement Learning
Lesson 6 ⏱ 14 min

Actor-critic methods


Actor-Critic: Two Networks, One Goal

Builds the actor-critic architecture from the advantage function, explains A2C and A3C with diagrams, then shows how PPO's clipped objective prevents destabilizing updates.

⏱ ~10 min


Quick refresher

Neural networks and gradient descent

Neural networks are differentiable function approximators trained via gradient descent on a loss function. Backpropagation computes gradients of the loss w.r.t. parameters. Multiple networks can be trained simultaneously with shared or independent optimizers.

Example

Training two networks at once — one to predict value, one to select actions — is standard in multi-task learning.

Actor-critic is exactly this.

The Best of Both Worlds

REINFORCE uses Monte Carlo returns: unbiased but high-variance, slow to learn. Q-learning uses TD bootstrapping: low variance but biased (because the Q estimate is imperfect). Actor-critic methods combine both — the actor learns a policy, the critic provides a low-variance value estimate as a baseline.

Actor-critic methods are the dominant algorithm class in modern applied RL. PPO, SAC, and A3C — the algorithms behind robotic control, game-playing agents, and RLHF for language models — are all actor-critic variants. Understanding the architecture explains why they're so much more stable and data-efficient than pure policy gradient methods.

The Two-Network Architecture

An agent maintains:

  • Actor: a policy network π_θ(a | s) — chooses actions.
  • Critic: a value network V_φ(s) — evaluates how good the current state is.

The actor uses the critic's evaluation to reduce variance in policy gradient updates. The critic trains on TD error targets using the actor's behavior.
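As a concrete (if minimal) sketch, the two networks can be written as two small PyTorch modules. The layer sizes, CartPole-style dimensions, and learning rates below are illustrative assumptions, not values from the lesson:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # CartPole-style observation/action sizes (assumed)

# Actor: maps a state to a probability distribution over actions (the policy pi_theta)
actor = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, n_actions), nn.Softmax(dim=-1),
)

# Critic: maps a state to a scalar value estimate V_phi(s)
critic = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

# Independent optimizers for the two networks
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```

Sharing early layers between the two networks is also common in practice; keeping them separate simply makes the two roles explicit.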

Advantage Actor-Critic (A2C)

The advantage function A(s,a) = Q(s,a) − V(s) measures how much better action a is than average. A2C approximates the advantage with the one-step TD error:

\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)

The actor and critic updates become:

\theta \leftarrow \theta + \alpha_\theta \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

\phi \leftarrow \phi + \alpha_\phi \, \delta_t \, \nabla_\phi V_\phi(s_t)

  • \delta_t: TD error (advantage estimate) at time t
  • \alpha_\theta: actor learning rate
  • \alpha_\phi: critic learning rate
  • V_\phi(s_t): critic's value estimate

The critic minimizes the squared TD error (a regression problem). The actor performs gradient ascent on expected return, weighting each action's log-probability gradient by the advantage estimate.
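A rough sketch of one such update, assuming PyTorch and the actor/critic modules defined above (the function name and tensor handling are illustrative, not a reference implementation):

```python
import torch

def a2c_update(actor, critic, actor_opt, critic_opt,
               s, a, r, s_next, done, gamma=0.99):
    """One advantage actor-critic update from a single transition (s, a, r, s_next)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD error: delta = r + gamma * V(s') - V(s); the bootstrap target is held fixed
    v_s = critic(s).squeeze()
    with torch.no_grad():
        v_next = torch.tensor(0.0) if done else critic(s_next).squeeze()
        target = r + gamma * v_next
    delta = (target - v_s).detach()

    # Critic: regress V(s) toward the TD target (minimize the squared TD error)
    critic_loss = (target - v_s) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise log pi(a|s) in proportion to delta (ascent via the negated loss)
    log_prob = torch.log(actor(s)[a])
    actor_loss = -delta * log_prob
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return delta.item()
```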

Worked Example: CartPole Step

State: pole angle 0.02 rad, velocity 0.1 m/s. Critic outputs V(s) = 0.8. Actor outputs: P(push_left) = 0.4, P(push_right) = 0.6. Agent takes action "push_right." Next state: pole slightly more upright. r = +1. Critic outputs V(s') = 0.85.

δ = 1 + 0.99×0.85 − 0.8 = 1 + 0.8415 − 0.8 = 1.0415.

Since δ > 0, "push_right" was better than average. Actor update: increase log P(push_right | s) by α_θ × 1.0415. Critic update: nudge V(s) upward from 0.8 toward the TD target r + γV(s') = 1.8415.
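The same arithmetic, checked in a few lines of plain Python (a throwaway calculation, not part of a training loop):

```python
gamma = 0.99
r, v_s, v_s_next = 1.0, 0.8, 0.85

delta = r + gamma * v_s_next - v_s      # TD error
td_target = r + gamma * v_s_next        # value the critic should move V(s) toward

print(delta)      # ≈ 1.0415 -> "push_right" was better than average
print(td_target)  # ≈ 1.8415 -> critic nudges V(s) up from 0.8 toward this target
```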

A3C: Asynchronous Advantage Actor-Critic

A3C (Mnih et al., 2016) addresses the sample-efficiency problem with a simple trick: parallelism.

  • N workers run in parallel, each with their own environment copy.
  • Each worker has a local copy of θ and φ.
  • Workers collect short rollouts, compute gradients, and asynchronously push updates to a global parameter server.
  • Workers periodically pull the latest global parameters.

Benefits:

  1. No replay buffer needed — parallelism provides decorrelated experience naturally.
  2. Uses CPU cores efficiently (workers run on separate threads).
  3. More stable training than single-worker due to diverse experience.

A2C (the synchronous version) waits for all workers to finish a rollout before updating — simpler to implement, nearly as effective on most tasks.
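A compact sketch of the synchronous (A2C-style) collection pattern; DummyEnv, the worker count, and the rollout length are placeholder assumptions standing in for real environment copies:

```python
import random

class DummyEnv:
    """Stand-in environment with a gym-style interface (illustrative only)."""
    def reset(self):
        return [random.random() for _ in range(4)]
    def step(self, action):
        next_state = [random.random() for _ in range(4)]
        reward, done = 1.0, random.random() < 0.05
        return next_state, reward, done

N_WORKERS, ROLLOUT_LEN = 8, 5
envs = [DummyEnv() for _ in range(N_WORKERS)]
states = [env.reset() for env in envs]

# Synchronous A2C: every worker advances in lockstep, and the short rollouts from
# all workers are combined into one decorrelated batch before a single global
# update (A3C instead lets each worker push gradients asynchronously).
for update in range(3):
    batch = []  # (state, action, reward, next_state, done)
    for _ in range(ROLLOUT_LEN):
        for i, env in enumerate(envs):
            action = random.choice([0, 1])  # placeholder for sampling from the actor
            next_state, reward, done = env.step(action)
            batch.append((states[i], action, reward, next_state, done))
            states[i] = env.reset() if done else next_state
    # ... compute advantages on `batch` and apply one actor-critic update here ...
    print(f"update {update}: collected {len(batch)} transitions from {N_WORKERS} workers")
```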

Proximal Policy Optimization (PPO)

Plain actor-critic has a problem: a single bad gradient step can move the policy far from the policy that collected the training data, so updates based on that stale data (and the probability ratios that reweight it) become unreliable. PPO (Schulman et al., 2017) fixes this with a clipped surrogate objective.

Define the probability ratio:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

  • r_t(\theta): ratio of new to old policy probability for action a_t in state s_t
  • \pi_{\theta_{\text{old}}}: policy at the start of the update epoch

The clipped objective:

\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Bigl[\min\bigl(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\bigr)\Bigr]

  • \hat{A}_t: advantage estimate at time t
  • \varepsilon: PPO clip parameter (typical: 0.1–0.2)

When Â_t > 0 (good action): we want to increase r_t(θ), but it is clipped at 1+ε, so there is no reward for pushing the probability ratio above 1+ε. When Â_t < 0 (bad action): we want to decrease r_t(θ), but it is clipped at 1−ε, so there is no reward for pushing it below 1−ε.
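A short sketch of the clipped surrogate as it is typically implemented, assuming PyTorch and precomputed log-probabilities and advantages (the function name and example numbers are illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # r_t(theta) = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise min makes the objective pessimistic: the policy gets
    # no extra credit for pushing the ratio outside [1 - eps, 1 + eps].
    return -torch.min(unclipped, clipped).mean()

# Tiny usage example with made-up numbers: the positive-advantage action's ratio
# (1.5) is clipped to 1 + eps = 1.2, so there is no incentive to push it further.
new_lp = torch.log(torch.tensor([0.6, 0.2]))
old_lp = torch.log(torch.tensor([0.4, 0.4]))
adv = torch.tensor([1.0, -0.5])
print(ppo_clip_loss(new_lp, old_lp, adv, eps=0.2))
```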

PPO Full Objective

In practice, PPO adds critic loss and an entropy bonus:

\mathcal{L}(\theta) = \mathcal{L}^{\text{CLIP}}(\theta) - c_1\,\mathcal{L}^{\text{VF}}(\theta) + c_2\,S[\pi_\theta]

  • c_1, c_2: loss coefficients for the critic and entropy terms
  • S[\pi_\theta]: policy entropy, which encourages exploration

The entropy term prevents premature convergence to a deterministic policy.
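Putting the pieces together, a hedged sketch of the combined loss, again assuming PyTorch; the coefficients c1 = 0.5 and c2 = 0.01 are common defaults, not values specified in the lesson:

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages,
             values, returns, entropy, c1=0.5, c2=0.01, eps=0.2):
    """Full PPO loss: -L^CLIP + c1 * value loss - c2 * entropy bonus.

    Signs are flipped relative to the objective because optimizers minimize.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    clip_term = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    value_loss = ((returns - values) ** 2).mean()   # L^VF: critic regression term
    return -clip_term + c1 * value_loss - c2 * entropy.mean()
```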

Interactive example

Side-by-side visualization of actor policy and critic value surface updating together on a simple 2D navigation task

Coming soon

Summary

  • Actor-critic combines a policy network (actor) and value network (critic): lower variance than REINFORCE, more general than Q-learning.
  • A2C: synchronous parallel workers, TD error as advantage estimate.
  • A3C: asynchronous parallel workers — no replay buffer needed.
  • PPO: clips the probability ratio to prevent destructive policy updates; the dominant practical algorithm.
  • PPO's full objective includes critic loss and entropy bonus for exploration.

Quiz

1 / 3

In the advantage actor-critic (A2C), the TD error δ = r + γV(s') − V(s) is used as an estimate of the advantage A(s,a). Why is this valid?