RL【10-1】:Actor - Critic


Series Index

Fundamental Tools

RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation

Algorithm

RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent

Method

RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
RL【8】:Value Function Approximation
RL【9】:Policy Gradient
RL【10-1】:Actor - Critic
RL【10-2】:Actor - Critic



Preface

This series records my study notes for Professor Shiyu Zhao's course "Mathematical Foundations of Reinforcement Learning" on Bilibili (B站). For the course itself, please refer to:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning


Introduction

Actor-critic methods are still policy gradient methods.

  • They emphasize a structure that combines policy gradient methods with
    value-based methods.

What are “actor” and “critic”?

  • Here, "actor" refers to the policy update. It is called an actor because the policy is used to take actions.
  • Here, "critic" refers to policy evaluation or value estimation. It is called a critic because it criticizes the policy by evaluating it.

How to understand the roles of the Actor and the Critic

  1. Actor

    • Function

      • The actor is responsible for decision making: given the current state $s$, it outputs a probability distribution over actions $a$ (the policy).
    • Mathematical form

      It is usually a parameterized policy $\pi_\theta(a|s)$, where the parameters $\theta$ come from a neural network.

    • Intuition

      Like an actor on stage: it observes the current state of the environment and decides which action to perform next.

    • Output

      • Discrete action space → a probability distribution over the actions.
      • Continuous action space → the mean and variance of the action.
  2. Critic

    • Function

      The critic evaluates how good the actor's action choices are. It estimates a value function to measure the long-term return of a state or a state-action pair.

    • Mathematical form

      • The state-value function $V^\pi(s)$

      • or the action-value function $Q^\pi(s,a)$

        By comparing the actual return with the predicted value, the critic provides a gradient signal to the actor.

    • Intuition

      Like a critic: it does not perform, but comments on whether the last action was good or bad and indicates how to improve.

  3. The Actor-Critic interaction loop

    1. Actor decides: select action $a_t$ according to the state $s_t$.
    2. Environment feedback: the environment returns the reward $r_t$ and the next state $s_{t+1}$.
    3. Critic evaluates: the TD (temporal-difference) error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ measures how good the action was.
    4. Actor updates: the critic's signal ($\delta_t$) is used to update the policy parameters $\theta$ — see the code sketch after this list.
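
To make this division of labor concrete, below is a minimal Python/PyTorch sketch of the two components (my own illustration, not from the course; the class names `Actor`/`Critic`, the layer sizes, and the dummy state are all assumptions): the actor maps a state to a categorical distribution over discrete actions, and the critic maps a state to a scalar value estimate.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Policy network pi_theta(a|s): maps a state to action probabilities."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),   # logits, one per discrete action
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(state))

class Critic(nn.Module):
    """Value network V_w(s): maps a state to a scalar value estimate."""
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),            # scalar V(s)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

# Usage: the actor decides, the critic evaluates.
actor, critic = Actor(state_dim=4, action_dim=2), Critic(state_dim=4)
state = torch.randn(4)      # a dummy 4-dimensional state
dist = actor(state)         # pi_theta(.|s)
action = dist.sample()      # the "actor" takes an action
value = critic(state)       # the "critic" scores the state
print(action.item(), value.item())
```

For a continuous action space, the actor's last layer would instead output a mean and a (log-)standard deviation, matching the note above.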

The simplest actor-critic (QAC)

Revisit the idea of policy gradient

  1. A scalar metric $J(\theta)$, which can be $\bar v_\pi$ or $\bar r_\pi$.

  2. The gradient-ascent algorithm maximizing $J(\theta)$ is

     $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha \, \mathbb{E}_{S \sim \eta, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S,A) \Big]$$

  3. The stochastic gradient-ascent algorithm is

     $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, q_t(s_t,a_t)$$

  • We can see the "actor" and the "critic" in this algorithm:

    • The policy-update step corresponds to the actor!
    • The algorithm estimating $q_t(s,a)$ corresponds to the critic!

From Policy Gradient to Actor-Critic

In policy gradient (PG) methods, the goal is to maximize a metric such as $\bar v_\pi$ or $\bar r_\pi$:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha \, \mathbb{E}_{S \sim \eta, A \sim \pi}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big]$$

  • Actor part: $\nabla_\theta \ln \pi(A|S,\theta)$ determines how the policy parameters $\theta$ are updated.
  • Critic part: $q_\pi(S,A)$ determines the learning signal given to the actor (the critic evaluates the value of the current action and feeds that evaluation back to the actor).

However:

  • $q_\pi(s,a)$ is unknown in a real environment → it must be estimated.
  • If it is estimated with Monte Carlo methods, we obtain REINFORCE.
  • If it is estimated with function approximation + temporal difference (TD), we obtain Actor-Critic.

The simplest actor-critic algorithm (QAC)

  • Aim: Search for an optimal policy by maximizing $J(\theta)$.
  • At time step $t$ in each episode, do:
    • Generate $a_t$ following $\pi(a|s_t,\theta_t)$, observe $r_{t+1}, s_{t+1}$, and then generate $a_{t+1}$ following $\pi(a|s_{t+1}, \theta_t)$.

    • Critic (value update):

      $$w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t,a_t,w_t) \big] \nabla_w q(s_t,a_t,w_t)$$

    • Actor (policy update):

      $$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q(s_t,a_t,w_{t+1})$$
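
For concreteness, here is a minimal sketch of a single QAC step, assuming `policy_net(s)` returns action logits and `q_net(s)` returns a vector of Q-values for all discrete actions (illustrative names, not from the course). The critic step is the SARSA-style TD update above; the actor step then uses the freshly updated $q(s_t,a_t,w_{t+1})$ as its weight.

```python
import torch
from torch.distributions import Categorical

def qac_step(policy_net, q_net, policy_opt, q_opt,
             s, a, r, s_next, a_next, gamma=0.99):
    """One QAC update. s, s_next: state tensors; a, a_next: action index tensors."""
    # Critic (SARSA + function approximation): move q(s, a, w) toward the TD target.
    q_sa = q_net(s)[a]
    with torch.no_grad():
        td_target = r + gamma * q_net(s_next)[a_next]
    critic_loss = (td_target - q_sa) ** 2      # minimizing this follows the TD update direction
    q_opt.zero_grad(); critic_loss.backward(); q_opt.step()

    # Actor (policy gradient): ascend  grad_theta ln pi(a|s) * q(s, a, w_{t+1}).
    log_prob = Categorical(logits=policy_net(s)).log_prob(a)
    actor_loss = -log_prob * q_net(s)[a].detach()   # q re-evaluated after the critic step
    policy_opt.zero_grad(); actor_loss.backward(); policy_opt.step()
```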

The Actor-Critic framework (QAC)

In Q Actor-Critic (QAC):

  • Actor (policy updater)

    • It updates the policy parameters $\theta$ using the $q(s,a)$ provided by the critic:

      $$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q(s_t,a_t,w_{t+1})$$

    → This step is the policy-gradient update, which increases the probability of high-value actions.

  • Critic (value estimator)

    • It uses TD learning to update the parameters $w$ of $q(s,a)$:

      $$w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1},a_{t+1},w_t) - q(s_t,a_t,w_t) \big] \nabla_w q(s_t,a_t,w_t)$$

    → This step is value function approximation, which corrects the estimate of $q(s,a)$.

Remarks:

  • The critic corresponds to “SARSA + value function approximation”.
  • The actor corresponds to the policy update algorithm.
  • The algorithm is on-policy (why is PG on-policy?).
    • Since the policy is stochastic, no need to use techniques like $\varepsilon$-greedy.
  • This particular actor-critic algorithm is sometimes referred to as Q Actor-Critic (QAC).
  • Though simple, this algorithm reveals the core idea of actor-critic methods.

A few notes on the remarks

  1. Division of labor between Actor and Critic
    • Actor: learns the policy $\pi(a|s,\theta)$ (the policy-gradient update);
    • Critic: learns the value function $q(s,a,w)$ (SARSA + function approximation).
  2. On-policy nature
    • The data must be sampled under the current policy $\pi$,
    • because $\nabla_\theta \ln \pi(a|s,\theta)$ depends directly on the current policy.
    • There is no need for $\varepsilon$-greedy exploration as in Q-learning.
  3. Why is it called QAC?
    • Because the critic uses the action-value function $Q(s,a)$, it is called Q Actor-Critic.
  4. Significance
    • REINFORCE uses MC → high variance;
    • Actor-Critic uses TD → lower variance and more stable training.
    • QAC is the simplest actor-critic algorithm, but it already reveals the core idea: the actor adjusts the policy, while the critic provides the learning signal.

Advantage actor-critic (A2C)

Baseline invariance

Property: the policy gradient is invariant to an additional baseline

$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S,A) \Big] = \mathbb{E}_{S \sim \eta, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S, \theta_t) \big(q_\pi(S,A) - b(S)\big) \Big]$$

  • Here, the additional baseline $b(S)$ is a scalar function of $S$.
  • Next, we answer two questions:
    • Why is it valid?
    • Why is it useful?

The core idea of baseline invariance

  • In policy gradient, the update is based on

    $$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A) \Big]$$

  • That is, the update direction of the policy parameters is determined by the value of the action in the current state, $q_\pi(S,A)$.

  • However, we can introduce a baseline $b(S)$ into the formula:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{S,A} \Big[ \nabla_\theta \ln \pi(A|S,\theta_t) \big(q_\pi(S,A) - b(S)\big) \Big]$$

Key conclusion: whatever $b(S)$ we choose, the expression is unchanged (i.e., the baseline does not change the expected gradient).

First, why is it valid?

  • That is because

    $$\mathbb{E}_{S \sim \eta, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, b(S) \Big] = 0$$

  • The details:

$$\begin{aligned} \mathbb{E}_{S \sim \eta, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, b(S) \Big] &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t) \nabla_\theta \ln \pi(a|s,\theta_t)\, b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t)\, b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta 1 \\ &= 0 \end{aligned}$$

Why this is valid

  • We have shown that

    $$\mathbb{E}_{S,A} \Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\Big] = 0$$

  • Intuition:

    • The baseline is just a reference value that does not depend on the action;
    • in expectation, its contribution to the gradient cancels out exactly;
    • hence the baseline introduces no bias (the estimator stays unbiased).
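
As a quick numerical sanity check (my own illustration, not from the course), for a single state with a softmax policy over $K$ actions the expectation can be evaluated exactly by summing over actions; the baseline term vanishes no matter what value $b$ takes:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                    # number of actions in one state
theta = rng.normal(size=K)               # softmax policy: pi(a) = exp(theta_a) / sum_b exp(theta_b)
pi = np.exp(theta) / np.exp(theta).sum()

# For a softmax policy, grad_theta ln pi(a) = e_a - pi  (one K-dimensional vector per action).
grad_log_pi = np.eye(K) - pi             # row a = gradient of ln pi(a) w.r.t. theta

b = 3.7                                  # an arbitrary baseline value for this state
expectation = sum(pi[a] * grad_log_pi[a] * b for a in range(K))
print(expectation)                       # ~ [0, 0, 0, 0] up to floating-point error
```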

Second, why is the baseline useful?

  • The gradient is

    $$\nabla_\theta J(\theta) = \mathbb{E}[X]$$

    • where

      $$X(S,A) \doteq \nabla_\theta \ln \pi(A|S, \theta_t) \big[q_\pi(S,A) - b(S)\big]$$

    • We have

      • $\mathbb{E}[X]$ is invariant to $b(S)$.
      • $\mathrm{var}(X)$ is NOT invariant to $b(S)$.
  • Why? Because

    $$\mathrm{tr}[\mathrm{var}(X)] = \mathbb{E}[X^T X] - \bar{x}^T \bar{x}$$

  • and

    $$\mathbb{E}[X^T X] = \mathbb{E}\Big[ (\nabla_\theta \ln \pi)^T (\nabla_\theta \ln \pi)\,(q_\pi(S,A) - b(S))^2 \Big] = \mathbb{E}\Big[ \|\nabla_\theta \ln \pi\|^2 (q_\pi(S,A) - b(S))^2 \Big]$$

Why this is useful

Although the baseline does not change the expected gradient, it does affect the variance of the gradient estimate.

  • The actual update uses a sampled approximation:

    $$\nabla_\theta J \approx \nabla_\theta \ln \pi(a|s,\theta)\, (q_\pi(s,a) - b(s))$$

  • Without a baseline, the variance can be large (because $q_\pi(s,a)$ fluctuates a lot);

  • introducing a suitable baseline can significantly reduce the variance and improve stability.

This is the idea of variance reduction; a small numerical illustration follows.
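
Here is a small Monte Carlo illustration of the effect (my own sketch; the action values `q` are made up) for a single state with a softmax policy: the sample mean of the gradient is the same with and without the baseline, but the variance shrinks substantially when $b(s) = v_\pi(s)$.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
theta = rng.normal(size=K)
pi = np.exp(theta) / np.exp(theta).sum()
q = 10.0 + rng.normal(size=K)            # made-up action values q_pi(s, a), all around 10
v = pi @ q                               # state value v_pi(s) = E_A[q_pi(s, A)]

def sample_gradients(baseline, n=100_000):
    """Monte Carlo samples of X = grad_theta ln pi(A|s) * (q(s, A) - baseline)."""
    actions = rng.choice(K, size=n, p=pi)
    grad_log_pi = np.eye(K) - pi                         # row a = grad_theta ln pi(a|s)
    return grad_log_pi[actions] * (q[actions] - baseline)[:, None]

for name, b in [("b = 0      ", 0.0), ("b = v_pi(s)", v)]:
    X = sample_gradients(b)
    print(name, "mean:", X.mean(axis=0).round(3),
          " total variance:", X.var(axis=0).sum().round(3))
# The means agree (baseline invariance); the variance is far smaller with b = v_pi(s).
```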

Our goal

  • Select an optimal baseline $b$ to minimize $\mathrm{var}(X)$.
    • Benefit: when we use a random sample to approximate $\mathbb{E}[X]$, the estimation variance would also be small.
  • In the algorithms of REINFORCE and QAC,
    • there is no baseline;
    • or, we can say $b = 0$, which is not guaranteed to be a good baseline.

The optimal baseline

  • The optimal baseline that can minimize $\mathrm{var}(X)$ is, for any $s \in \mathcal{S}$,

    $$b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\, q_\pi(s,A) \big]}{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2 \big]}$$

  • Although this baseline is optimal, it is complex.

  • We can remove the weight $\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2$ and select the suboptimal baseline

    $$b(s) = \mathbb{E}_{A \sim \pi}[q_\pi(s,A)] = v_\pi(s),$$

    • which is the state value of $s$.
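
For the same kind of toy single-state softmax policy, both baselines can be computed exactly by summing over actions (my own illustration; the `q` values are made up). It shows how $b^*(s)$ is a gradient-norm-weighted average of the action values, while $v_\pi(s)$ simply drops that weight.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
theta = rng.normal(size=K)
pi = np.exp(theta) / np.exp(theta).sum()
q = rng.normal(size=K)                         # made-up q_pi(s, a) values

grad_log_pi = np.eye(K) - pi                   # row a = grad_theta ln pi(a|s) for a softmax policy
weights = (grad_log_pi ** 2).sum(axis=1)       # ||grad_theta ln pi(a|s)||^2 for each action

b_star = (pi * weights * q).sum() / (pi * weights).sum()   # optimal baseline b*(s)
v = (pi * q).sum()                                          # suboptimal baseline v_pi(s)
print("b*(s) =", round(b_star, 4), "  v_pi(s) =", round(v, 4))
```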

Optimal baseline vs. suboptimal baseline

  • Optimal baseline:

    $$b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta)\|^2\, q_\pi(s,A) \big]}{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta)\|^2 \big]}$$

    It is theoretically optimal, but expensive to compute.

  • Suboptimal baseline:

    $$b(s) = \mathbb{E}_{A \sim \pi}[q_\pi(s,A)] = v_\pi(s)$$

    That is, the state-value function.

    This is the key to A2C: use $A(s,a) = q_\pi(s,a) - v_\pi(s)$ as the advantage function.

The connection to A2C (Advantage Actor-Critic)

  • Actor:

    Update the policy with the advantage function $A(s,a) = q_\pi(s,a) - v_\pi(s)$:

    $$\theta \leftarrow \theta + \alpha \nabla_\theta \ln \pi(a|s,\theta)\, A(s,a)$$

  • Critic:

    Learn the value function $v_\pi(s)$ to serve as the baseline $b(s)$.

Intuition:

  • The critic estimates the baseline (the state value $v_\pi(s)$);
  • the actor updates with $q_\pi - v_\pi$, so actions better than average are reinforced and actions worse than average are suppressed;
  • the benefit is lower gradient variance and more stable updates.

The algorithm of advantage actor-critic

When $b(s) = v_\pi(s)$,

  • the gradient-ascent algorithm is

    $$\theta_{t+1} = \theta_t + \alpha \, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S, \theta_t)\,[q_\pi(S,A) - v_\pi(S)]\Big] \doteq \theta_t + \alpha \, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S, \theta_t)\,\delta_\pi(S,A)\Big]$$

    • where

      $$\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$$

    • is called the advantage function (why called advantage?).

  • the stochastic version of this algorithm is

    $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\,[q_t(s_t, a_t) - v_t(s_t)] = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\,\delta_t(s_t,a_t)$$

Moreover, the algorithm can be reexpressed as

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\,\delta_t(s_t,a_t) = \theta_t + \alpha \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)}\, \delta_t(s_t,a_t) = \theta_t + \alpha \Bigg(\frac{\delta_t(s_t,a_t)}{\pi(a_t|s_t, \theta_t)}\Bigg) \nabla_\theta \pi(a_t|s_t, \theta_t)$$

  • The step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$, which is more reasonable.
  • It can still balance exploration and exploitation well.

Furthermore, the advantage function is approximated by the TD error:

$$\delta_t = q_t(s_t,a_t) - v_t(s_t) \;\;\to\;\; r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)$$

  • This approximation is reasonable because

    $$\mathbb{E}[q_\pi(S,A) - v_\pi(S) \,|\, S=s_t, A=a_t] = \mathbb{E}[R + \gamma v_\pi(S') - v_\pi(S) \,|\, S=s_t, A=a_t]$$

  • Benefit: only one network is needed to approximate $v_\pi(s)$, rather than two networks for $q_\pi(s,a)$ and $v_\pi(s)$.

Advantage actor-critic (A2C) or TD actor-critic

  • Aim: Search for an optimal policy by maximizing $J(\theta)$.

  • At time step $t$ in each episode, do

    • Generate $a_t$ following $\pi(a|s_t, \theta_t)$ and then observe $r_{t+1}, s_{t+1}$.

    • TD error (advantage function):

      $$\delta_t = r_{t+1} + \gamma v(s_{t+1}, w_t) - v(s_t, w_t)$$

    • Critic (value update):

      $$w_{t+1} = w_t + \alpha_w \delta_t \nabla_w v(s_t, w_t)$$

    • Actor (policy update):

      $$\theta_{t+1} = \theta_t + \alpha_\theta \delta_t \nabla_\theta \ln \pi(a_t|s_t, \theta_t)$$

  • It is on-policy. Since the policy $\pi(\theta_t)$ is stochastic, no need to use techniques like $\varepsilon$-greedy.
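
Below is a minimal sketch of one A2C step in Python/PyTorch, reusing the `Actor`/`Critic` modules sketched earlier in this post (again my own illustration; the function name, the `done` flag, and the hyperparameters are assumptions, not from the course). A single TD error $\delta_t$ drives both the critic update and the actor update, as in the pseudocode above.

```python
import torch

def a2c_step(actor, critic, actor_opt, critic_opt,
             s, a, r, s_next, done, gamma=0.99):
    """One A2C update. s, s_next: state tensors; a: action index tensor;
    done: 1.0 if s_next is terminal (zeroes the bootstrap; an addition to the pseudocode)."""
    # TD error (advantage estimate): delta = r + gamma * v(s') - v(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - done)
    delta = td_target - critic(s)

    # Critic (value update): minimizing delta^2 moves v(s, w) toward the TD target,
    # i.e. w <- w + alpha_w * delta * grad_w v(s, w) up to the learning-rate scale.
    critic_loss = delta.pow(2)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (policy update): ascend  delta * grad_theta ln pi(a|s, theta).
    log_prob = actor(s).log_prob(a)
    actor_loss = -log_prob * delta.detach()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```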

Baseline → Advantage Function → the A2C algorithm

  1. From baseline to advantage function
    • In policy gradient, the basic update is

      $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$$

    • However, using $q_\pi(s,a)$ directly leads to high variance. We can therefore introduce a baseline $b(s)$ to reduce the variance:

      $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\big[q_\pi(s_t,a_t) - b(s_t)\big]$$

    • A common choice is the state-value function $b(s) = v_\pi(s)$. The update then becomes

      $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\,\delta_\pi(s_t,a_t)$$

      • where

        $$\delta_\pi(s,a) = q_\pi(s,a) - v_\pi(s)$$

      • is called the advantage function.

    • Intuition

      • $q_\pi(s,a)$ is the long-term value of taking action $a$ in state $s$.
      • $v_\pi(s)$ is the average value of state $s$ (weighted over all actions).
      • Therefore $\delta_\pi(s,a)$ measures how much better or worse this action is than average:
        • $\delta > 0$ → the action is better than average; increase its probability.
        • $\delta < 0$ → the action is worse than average; decrease its probability.
  2. Rewriting the update
    • Using the likelihood-ratio form, the update can be written as

      $$\theta_{t+1} = \theta_t + \alpha \Bigg(\frac{\delta_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}\Bigg) \nabla_\theta \pi(a_t|s_t,\theta_t)$$

    • This shows that the step size is tied to the relative magnitude of the advantage, which gives a more reasonable balance between exploration and exploitation.

  3. Approximating the advantage: the TD error

    Computing $q_\pi(s,a)$ directly is too expensive, so the temporal-difference (TD) error is used as an approximation:

    $$\delta_t = q_t(s_t,a_t) - v_t(s_t) \;\;\approx\;\; r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)$$

    • Benefits
      • Only one value network $v_\pi(s)$ needs to be learned, instead of both $q_\pi(s,a)$ and $v_\pi(s)$, which lowers the computational cost.
      • $\delta_t$ is simultaneously the TD error and an approximation of the advantage function.
  4. The Advantage Actor-Critic (A2C) procedure

    A2C combines the actor (policy update) and the critic (value-function update):

    • Critic (learns $v(s)$ and provides the learning signal)

      $$w_{t+1} = w_t + \alpha_w \delta_t \nabla_w v(s_t, w_t)$$

      • The critic uses the TD error $\delta_t$ to update $v(s)$.
    • Actor (updates the policy)

      $$\theta_{t+1} = \theta_t + \alpha_\theta \delta_t \nabla_\theta \ln \pi(a_t|s_t,\theta_t)$$

      • The actor adjusts the policy according to the signal $\delta_t$ provided by the critic.
    • Intuition

      • The critic judges how good the action actually was and computes $\delta_t$;
      • the actor then increases the probability of good actions and decreases the probability of bad ones.
  5. Why does A2C have an advantage?

    1. Lower variance: the baseline ($v(s)$) effectively reduces the randomness of the updates.
    2. A more intuitive signal: the advantage tells us how an action compares to the average, not its absolute value.
    3. More efficient: approximating $q(s,a) - v(s)$ with the TD error requires only one critic network.
    4. Still on-policy: sampling and updating are done under the current policy, with no extra exploration mechanism (e.g., $\epsilon$-greedy) needed.
  • To sum up, the key logic of A2C is:

    Policy Gradient + Baseline → Advantage Function → approximate the advantage with the TD error → Actor & Critic update together.


Summary

In actor-critic methods, the value estimate provided by the critic guides the actor's policy update. In deterministic policy gradient methods (DPG/DDPG), by contrast, the actor outputs an action directly and is updated with $\nabla_\theta \mu(s)\,\nabla_a q(s,a)$, which avoids sampling from a probability distribution and is efficient for continuous action spaces, but exploration then has to rely on extra noise.