The Best of Both Worlds
REINFORCE uses Monte Carlo returns: unbiased but high-variance, slow to learn. Q-learning uses TD bootstrapping: low variance but biased (because the Q estimate is imperfect). Actor-critic methods combine both — the actor learns a policy, the critic provides a low-variance value estimate as a baseline.
Actor-critic methods are the dominant algorithm class in modern applied RL. PPO, SAC, and A3C — the algorithms behind robotic control, game-playing agents, and RLHF for language models — are all actor-critic variants. Understanding the architecture explains why they're so much more stable and data-efficient than pure policy gradient methods.
The Two-Network Architecture
An agent maintains:
- Actor: π_θ(a|s) — chooses actions.
- Critic: V_φ(s) — evaluates how good the current state is.
The actor uses the critic's evaluation to reduce variance in policy gradient updates. The critic trains on TD error targets using the actor's behavior.
Advantage Actor-Critic (A2C)
The advantage function A(s,a) = Q(s,a) − V(s) measures how much better action a is than average. Using the TD error δ_t = r_t + γ·V_φ(s_{t+1}) − V_φ(s_t) as a one-sample advantage estimate, the updates become:

θ ← θ + α_θ · δ_t · ∇_θ log π_θ(a_t | s_t)
φ ← φ + α_φ · δ_t · ∇_φ V_φ(s_t)

- δ_t — TD error (advantage estimate) at time t
- α_θ — actor learning rate
- α_φ — critic learning rate
- V_φ(s) — critic's value estimate
The critic minimizes the squared TD error (a regression problem). The actor maximizes expected return weighted by the advantage estimate.
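The two updates can be sketched with linear function approximators and a softmax policy. This is a minimal illustration, not a library implementation; the function name and the learning-rate defaults are my own choices.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def a2c_update(theta, phi, s, a, r, s_next, done,
               gamma=0.99, alpha_actor=1e-2, alpha_critic=1e-1):
    """One advantage actor-critic update with linear function approximation.

    theta: (n_actions, n_features) actor weights (softmax policy)
    phi:   (n_features,) critic weights (linear value estimate)
    """
    # Critic's value estimates for current and next state
    v_s = phi @ s
    v_next = 0.0 if done else phi @ s_next

    # TD error doubles as the advantage estimate
    delta = r + gamma * v_next - v_s

    # Critic: semi-gradient step on the squared TD error
    phi = phi + alpha_critic * delta * s

    # Actor: policy-gradient step weighted by the advantage
    probs = softmax(theta @ s)
    grad_log_pi = -np.outer(probs, s)  # d log pi(a|s)/d theta, all actions
    grad_log_pi[a] += s                # plus the taken action's features
    theta = theta + alpha_actor * delta * grad_log_pi
    return theta, phi, delta
```

Note that the critic's regression on the squared TD error and the actor's advantage-weighted step both reuse the same scalar δ_t, which is what makes the architecture cheap per step.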
Worked Example: CartPole Step
State: pole angle 0.02 rad, velocity 0.1 m/s. Critic outputs V(s) = 0.8. Actor outputs: P(push_left) = 0.4, P(push_right) = 0.6. Agent takes action "push_right." Next state: pole slightly more upright. r = +1. Critic outputs V(s') = 0.85.
δ = 1 + 0.99×0.85 − 0.8 = 1 + 0.8415 − 0.8 = 1.0415.
Since δ > 0, "push_right" was better than average. Actor update: increase log P(push_right | s) in proportion to α_θ × 1.0415. Critic update: move V(s) = 0.8 toward the TD target r + γV(s') = 1.8415, stepping by α_φ × δ.
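The arithmetic above can be verified directly:

```python
gamma = 0.99
r, v_s, v_next = 1.0, 0.80, 0.85

delta = r + gamma * v_next - v_s   # TD error / advantage estimate
td_target = r + gamma * v_next     # what the critic regresses toward
```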
A3C: Asynchronous Advantage Actor-Critic
The A3C algorithm (Mnih et al., 2016) addresses the sample-efficiency problem with a simple trick: parallelism.
- N workers run in parallel, each with their own environment copy.
- Each worker has a local copy of θ and φ.
- Workers collect short rollouts, compute gradients, and asynchronously push updates to a global parameter server.
- Workers periodically pull the latest global parameters.
Benefits:
- No replay buffer needed — parallelism provides decorrelated experience naturally.
- Uses CPU cores efficiently (workers run on separate threads).
- More stable training than single-worker due to diverse experience.
A2C (the synchronous version) waits for all workers to finish a rollout before updating — simpler to implement, nearly as effective on most tasks.
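The synchronous control flow can be sketched as follows. `ToyEnv` is a hypothetical stand-in for real environment copies, and the gradient update itself is omitted; only the lockstep data collection is shown.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in environment so the loop runs end to end."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
    def reset(self):
        return self.rng.normal(size=4)
    def step(self, action):
        next_state = self.rng.normal(size=4)
        reward = 1.0
        done = self.rng.random() < 0.05
        return next_state, reward, done

def a2c_collect(envs, states, policy, rollout_len=5):
    """All workers step in lockstep; a single batched gradient update
    would follow on the collected transitions."""
    batch = []
    for _ in range(rollout_len):
        actions = [policy(s) for s in states]
        results = [env.step(a) for env, a in zip(envs, actions)]
        batch.extend(zip(states, actions, results))
        # Reset any worker whose episode ended; others continue
        states = [env.reset() if done else s_next
                  for env, (s_next, r, done) in zip(envs, results)]
    return batch, states
```

The decorrelation comes for free: each worker's environment is in a different state, so the batch mixes diverse transitions without a replay buffer.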
Proximal Policy Optimization (PPO)
Plain actor-critic has a problem: a single bad gradient step can move the policy far from where the training data was collected, making importance weights unreliable. PPO (Schulman et al., 2017) fixes this with a clipped surrogate objective.
Define the probability ratio:

r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)

- r_t(θ) — ratio of new to old policy probability for action a_t in state s_t
- π_{θ_old} — policy at start of update epoch
The clipped objective:

L^CLIP(θ) = Ê_t[ min( r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ]

- Â_t — advantage estimate at time t
- ε — PPO clip parameter (typical: 0.1–0.2)
When Â_t > 0 (good action): we want to increase r_t(θ), but the clip at 1+ε means there is no extra reward for moving the probability ratio above 1+ε. When Â_t < 0 (bad action): we want to decrease r_t(θ), but the clip at 1−ε means there is no extra reward for moving below 1−ε.
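Both cases can be checked numerically with a direct translation of the clipped term (the function name is mine):

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A) for one timestep."""
    return min(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

clipped_surrogate(1.5, 2.0)    # -> 2.4: good action, ratio clipped at 1.2
clipped_surrogate(0.5, -2.0)   # -> -1.6: bad action, ratio clipped at 0.8
```

In both examples the gradient with respect to the ratio is zero once the ratio leaves the [1−ε, 1+ε] band, which is exactly the intended brake on large policy updates.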
PPO Full Objective
In practice, PPO adds a critic loss and an entropy bonus:

L(θ, φ) = Ê_t[ L^CLIP_t(θ) − c_1·(V_φ(s_t) − V_t^target)² + c_2·H[π_θ](s_t) ]

- c_1, c_2 — loss coefficients for the critic and entropy terms
- H[π_θ] — policy entropy: encourages exploration
The entropy term prevents premature convergence to a deterministic policy.
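Putting the three terms together, the full objective looks roughly as below (a sketch under the notation above, not a framework implementation; it is negated so it can be minimized):

```python
import numpy as np

def ppo_loss(ratio, adv, value, value_target, action_probs,
             eps=0.2, c1=0.5, c2=0.01):
    """-(clipped surrogate - c1 * critic loss + c2 * entropy), averaged."""
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1 - eps, 1 + eps) * adv)
    critic_loss = (value - value_target) ** 2
    entropy = -np.sum(action_probs * np.log(action_probs), axis=-1)
    return -np.mean(surrogate - c1 * critic_loss + c2 * entropy)
```

Because entropy enters with a positive coefficient inside the maximized objective, a near-deterministic policy (low entropy) is penalized relative to a stochastic one, which is what delays premature convergence.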
Interactive example
Side-by-side visualization of actor policy and critic value surface updating together on a simple 2D navigation task
Coming soon
Summary
- Actor-critic combines a policy network (actor) and value network (critic): lower variance than REINFORCE, more general than Q-learning.
- A2C: synchronous parallel workers, TD error as advantage estimate.
- A3C: asynchronous parallel workers — no replay buffer needed.
- PPO: clips the probability ratio to prevent destructive policy updates; the dominant practical algorithm.
- PPO's full objective includes critic loss and entropy bonus for exploration.