This series contains my study notes on Prof. Shiyu Zhao's (赵世钰) book Mathematical Foundations of Reinforcement Learning (《强化学习的数学原理》).
2.5.1 The Sarsa Family
The temporal-difference algorithm introduced earlier estimates the state value $v_\pi(s)$ of a given policy $\pi$ at each state $s$. Can we estimate action values directly instead? That is what the Sarsa family of methods does.
2.5.1.1 Sarsa
Given a policy $\pi$, we can estimate its action values with the following algorithm:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\big],$$
This is the Sarsa algorithm; its name comes from the data the update consumes: $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
In fact, Sarsa is a stochastic approximation algorithm for solving the Bellman equation expressed in terms of action values:
$$q_\pi(s, a) = \mathbb{E}\big[R + \gamma q_\pi(S', A') \mid s, a\big], \quad \text{for all } (s, a).$$
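To make the update concrete, here is a minimal sketch of a single tabular Sarsa update step; the names `Q`, `alpha`, `gamma` and the toy numbers are my own illustrations, not from the book.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular Sarsa update using the sample (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})."""
    td_target = r + gamma * Q[s_next, a_next]   # r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})
    td_error = Q[s, a] - td_target              # q_t(s_t, a_t) - TD target
    Q[s, a] = Q[s, a] - alpha * td_error        # move the estimate toward the TD target
    return Q

# Toy usage: a table with 4 states and 2 actions
Q = np.zeros((4, 2))
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```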
2.5.1.2 Expected Sarsa
If the TD target is replaced by an expectation over actions, we obtain Expected Sarsa:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\, \mathbb{E}[q_t(s_{t+1}, A)]\big)\big],$$
where the expectation in the TD target is defined as:
$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1}).$$
Sarsa only uses the action value $q_t(s_{t+1}, a_{t+1})$ of the single sampled action $a_{t+1}$, whereas Expected Sarsa computes the full state value $v_t(s_{t+1})$ under the current policy.
Likewise, Expected Sarsa solves a Bellman equation:
$$q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma\, \mathbb{E}[q_\pi(S_{t+1}, A_{t+1}) \mid S_{t+1}] \,\big|\, S_t = s, A_t = a\big], \quad \text{for all } s, a,$$
where the inner expectation can be expanded as:
$$\mathbb{E}[q_\pi(S_{t+1}, A_{t+1}) \mid S_{t+1}] = \sum_{A'} q_\pi(S_{t+1}, A')\, \pi(A' \mid S_{t+1}) = v_\pi(S_{t+1}).$$
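A minimal sketch of the Expected Sarsa update, assuming a tabular `Q` and a policy table `pi` where `pi[s]` is the action distribution at state `s` (all names and numbers are illustrative):

```python
import numpy as np

def expected_sarsa_update(Q, pi, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Expected Sarsa update; a_{t+1} is not needed, only pi."""
    v_next = np.dot(pi[s_next], Q[s_next])   # sum_a pi(a|s') * q(s', a) = v(s')
    td_target = r + gamma * v_next
    Q[s, a] -= alpha * (Q[s, a] - td_target)
    return Q

# Toy usage: 4 states, 2 actions, uniform policy
Q = np.zeros((4, 2))
pi = np.full((4, 2), 0.5)
Q = expected_sarsa_update(Q, pi, s=0, a=1, r=1.0, s_next=2)
```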
2.5.1.3 n-step Sarsa
Recall the definition of the action value: given a state and an action, it is the expected return $G_t$ of the trajectory:
$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a],$$
where $G_t$ is the discounted return of the trajectory, defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$
If we additionally introduce a superscript $n$ and, at step $n$, replace the remaining terms with the action value $q_\pi(S_{t+n}, A_{t+n})$, we obtain:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}).$$
- When $n = 1$: Sarsa
The action value can then be written as:
$$q_\pi(s, a) = \mathbb{E}\big[G_t^{(1)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a\big].$$
Applying stochastic approximation to this equation yields the update:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\big],$$
which is clearly the Sarsa algorithm, so the equation above is exactly the Bellman equation that Sarsa solves.
- When $n = \infty$: MC learning
The action value can be written as:
$$q_\pi(s, a) = \mathbb{E}\big[G_t^{(\infty)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid s, a\big].$$
The corresponding update is:
$$q_{t+1}(s_t, a_t) = g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots,$$
where $g_t$ is a sample of $G_t$; this is exactly the Monte Carlo method.
- When $n = k$, a fixed finite value: n-step Sarsa
When we adopt such a fixed intermediate value:
$$q_\pi(s, a) = \mathbb{E}\big[G_t^{(n)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a\big].$$
Its update is:
$$\begin{aligned} q_{t+1}(s_t, a_t) = {} & q_t(s_t, a_t) \\ & - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\big]. \end{aligned}$$
This is the n-step Sarsa method (a code sketch is given after this list).
Note that no matter what $n$ is, these quantities all have the same value; the superscripts merely indicate different ways of decomposing $G_t$:
$$G_t = G_t^{(1)} = G_t^{(2)} = G_t^{(n)} = G_t^{(\infty)}.$$
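Below is a minimal sketch of one n-step Sarsa update on a stored trajectory, assuming the trajectory is a list of `(s, a, r)` tuples where `r` is the reward received after taking `a` in `s`; the names and toy numbers are my own, not from the book.

```python
import numpy as np

def n_step_sarsa_update(Q, traj, t, n, alpha=0.1, gamma=0.9):
    """One tabular n-step Sarsa update at time t; traj must reach index t + n."""
    s_t, a_t, _ = traj[t]
    # r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n}
    g = sum(gamma**k * traj[t + k][2] for k in range(n))
    # bootstrap the tail with gamma^n * q_t(s_{t+n}, a_{t+n})
    s_n, a_n, _ = traj[t + n]
    g += gamma**n * Q[s_n, a_n]
    Q[s_t, a_t] -= alpha * (Q[s_t, a_t] - g)
    return Q

# Toy usage: 4 states, 2 actions, a short trajectory, n = 3
Q = np.zeros((4, 2))
traj = [(0, 1, 1.0), (2, 0, 0.0), (3, 1, 1.0), (1, 0, 0.0)]
Q = n_step_sarsa_update(Q, traj, t=0, n=3)
```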
2.5.2 Q-learning
2.5.2.1 Definition
The update rule of Q-learning is:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1}, a)\big)\Big].$$
Only the TD target changes: it is no longer the action value of the sampled next action as in Sarsa, $q_t(s_{t+1}, a_{t+1})$, nor the expectation over actions as in Expected Sarsa, $\mathbb{E}[q_t(s_{t+1}, A)]$,
but the maximum action value: $\max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1}, a)$. Q-learning is likewise a stochastic approximation algorithm; the equation it solves is the Bellman optimality equation expressed in terms of action values:
$$q(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a\Big].$$
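A minimal sketch of one tabular Q-learning update; note that, unlike Sarsa, it does not need $a_{t+1}$ (names and numbers are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update."""
    td_target = r + gamma * np.max(Q[s_next])   # r_{t+1} + gamma * max_a q_t(s_{t+1}, a)
    Q[s, a] -= alpha * (Q[s, a] - td_target)
    return Q

# Toy usage: 4 states, 2 actions
Q = np.zeros((4, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```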
2.5.2.2 On-policy and Off-policy
First, we define two kinds of policies:
- Behavior policy: the policy used to generate experience samples.
- Target policy: the policy that is continually updated toward the optimal policy.
When the behavior policy and the target policy are the same, the learning process is called on-policy learning (e.g., Sarsa, MC learning);
when the behavior policy and the target policy are allowed to differ (they may also coincide), it is called off-policy learning (e.g., Q-learning).
In the on-policy version of the Q-learning algorithm, a single policy is used both to collect experience and as the policy being updated, so it is on-policy.
In the off-policy version of Q-learning, the behavior policy $\pi_b(a \mid s)$ is given in advance and experience is collected by following $\pi_b(a \mid s)$; when the target policy $\pi_{T, t+1}(a \mid s_t)$ is updated, the behavior policy is left unchanged, so the algorithm is off-policy.
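The sketch below illustrates this off-policy structure, assuming a hypothetical Gym-style environment `env` with `reset()` and `step()`; the behavior policy is uniform random and never changes, while the greedy target policy is only read off from `Q` at the end.

```python
import numpy as np

def run_q_learning_off_policy(env, n_states, n_actions,
                              num_episodes=500, alpha=0.1, gamma=0.9):
    """Off-policy tabular Q-learning with a uniform-random behavior policy.

    env is assumed to follow a Gym-style interface: reset() -> s,
    step(a) -> (s_next, r, done, info). The target policy is greedy
    w.r.t. Q and is never used to collect data.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = np.random.randint(n_actions)           # behavior policy: uniform random
            s_next, r, done, _ = env.step(a)
            td_target = r + gamma * np.max(Q[s_next])  # bootstrap with the greedy value
            Q[s, a] -= alpha * (Q[s, a] - td_target)
            s = s_next
    target_policy = np.argmax(Q, axis=1)               # greedy target policy
    return Q, target_policy
```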
2.5.2.3 Example
Consider the following example. Suppose we already know the optimal policy (panel (a)) and the optimal state values (panel (b)),
and we are given a uniform behavior policy (panel (c)) together with the episode it generates (panel (d)).
When we solve the problem with Q-learning, the result is very close to the true optimal solution, as the learning curves show.
However, when the behavior policy is not uniform, the quality of the solution degrades.
2.5.3 Summary of TD Algorithms
All of the TD algorithms above in fact share the same update form:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \bar{q}_t\big].$$
They differ only in the TD target $\bar{q}_t$ used in the update:
| Algorithm | Expression of the TD target $\bar{q}_t$ |
| --- | --- |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| n-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ |
as well as in the equation each algorithm solves:
| Algorithm | Equation to be solved |
| --- | --- |
| Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$ |
| n-step Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a]$ |
| Q-learning | BOE: $q(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a]$ |
| Monte Carlo | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a]$ |

(BE: Bellman equation; BOE: Bellman optimality equation.)
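To make the unified view concrete, here is a small sketch that computes the TD target $\bar{q}_t$ for each method (Expected Sarsa from Section 2.5.1.2 is included even though the table omits it) and plugs it into the common update; the function names and the `method` switch are my own, not from the book.

```python
import numpy as np

def td_target(method, Q, pi, rewards, states, actions, t, n=3, gamma=0.9):
    """Compute the TD target q̄_t; rewards[k] is r_{k+1}, received after (states[k], actions[k])."""
    if method == "sarsa":
        return rewards[t] + gamma * Q[states[t + 1], actions[t + 1]]
    if method == "expected_sarsa":
        return rewards[t] + gamma * np.dot(pi[states[t + 1]], Q[states[t + 1]])
    if method == "n_step_sarsa":
        g = sum(gamma**k * rewards[t + k] for k in range(n))
        return g + gamma**n * Q[states[t + n], actions[t + n]]
    if method == "q_learning":
        return rewards[t] + gamma * np.max(Q[states[t + 1]])
    if method == "monte_carlo":
        return sum(gamma**k * r for k, r in enumerate(rewards[t:]))
    raise ValueError(method)

def td_update(Q, s, a, q_bar, alpha=0.1):
    """The common update: q_{t+1}(s,a) = q_t(s,a) - alpha * [q_t(s,a) - q̄_t]."""
    Q[s, a] -= alpha * (Q[s, a] - q_bar)
    return Q
```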