Reinforcement Learning Task: Inverted Pendulum
Task Description
Specific Requirements
The coefficient parameters of the inverted pendulum are listed in the table below:
Variable | Value | Unit | Meaning |
---|---|---|---|
m | 0.055 | kg | mass |
g | 9.81 | m/s² | gravitational acceleration |
l | 0.042 | m | distance from the center of mass to the rotor |
J | 1.91 × 10⁻⁴ | kg·m² | moment of inertia |
b | 3 × 10⁻⁶ | N·m·s/rad | viscous damping |
K | 0.0536 | N·m/A | torque constant |
R | 9.5 | Ω | rotor resistance |
The sampling time $T_s$ is chosen as 0.005 s, and the discrete-time dynamics $f$ can be obtained with the Euler method:

$$\left\{ \begin{array}{l} \alpha_{k+1} = \alpha_k + T_s \dot{\alpha}_k \\ \dot{\alpha}_{k+1} = \dot{\alpha}_k + T_s \ddot{\alpha}(\alpha_k, \dot{\alpha}_k, a_k) \end{array} \right.$$

The discount factor is chosen as $\gamma = 0.98$. The purpose of a relatively high discount factor is to increase the weight of rewards near the goal (the upright position) in the value of the initial states, so that the optimal policy takes successfully swinging the pendulum up and stabilizing it as its final objective.
(Tip: the action space can be discretized into the three actions {−3, 0, 3}, and the optimal policy can be learned over this action set.)
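For concreteness, here is a minimal sketch of the discrete-time dynamics $f$ under the assumptions above (parameter values from the table, acceleration model as used later in the environment code; the function name pendulum_step is illustrative only):

import numpy as np

# physical parameters (see the table above)
m, g, l = 0.055, 9.81, 0.042
J, b, K, R = 1.91e-4, 3e-6, 0.0536, 9.5
Ts = 0.005                              # sampling time (s)
ACTIONS = np.array([-3.0, 0.0, 3.0])    # discretized voltage set (V)

def pendulum_step(alpha, alpha_dot, u):
    """One Euler step of the discrete-time dynamics f."""
    # continuous-time model: α̈ = (1/J)(m·g·l·sin(α) − b·α̇ − (K²/R)·α̇ + (K/R)·u)
    alpha_ddot = (m * g * l * np.sin(alpha) - b * alpha_dot
                  - (K ** 2 / R) * alpha_dot + (K / R) * u) / J
    return alpha + Ts * alpha_dot, alpha_dot + Ts * alpha_ddot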
Creating a Custom Environment
Classic Control Scenario
gymnasium implements a classic-control pendulum environment that can serve as a reference.
Notably, the reference environment limits episode length by truncating once the maximum number of time steps is exceeded. (My original plan was to follow the task description and signal completion by checking whether the current state is close to [0, 0].)
def step(self, u):
    ...
    # truncation=False as the time limit is handled by the `TimeLimit` wrapper added during `make`
    return self._get_obs(), -costs, False, False, {}
In addition, the state at reset is sampled from a uniform distribution rather than always starting from a fixed initial state; on reflection, this increases exploration and helps learning.
self.state = self.np_random.uniform(low=low, high=high)
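The snippet below (my own illustration, not taken from the gymnasium source) shows this truncation behaviour: Pendulum-v1 is registered with max_episode_steps=200, so make adds a TimeLimit wrapper and truncated becomes True after 200 steps:

import gymnasium as gym

env = gym.make("Pendulum-v1")   # `make` wraps the env in TimeLimit(max_episode_steps=200)
obs, info = env.reset(seed=0)
for t in range(300):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        print(t + 1, terminated, truncated)   # expected: 200 False True
        break
env.close()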
Design and Implementation
import numpy as np
import gymnasium as gym
from typing import Optional


def normalize(x, low, high):
    # map a value from [low, high] to [-1, 1]; used when `normalize_state=True`
    return 2.0 * (x - low) / (high - low) - 1.0


class InvertedPendulumEnv(gym.Env):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }

    def __init__(self,
                 max_episode_steps: int = 200,
                 normalize_state: bool = False,
                 discrete_action: bool = False,
                 render_mode: Optional[str] = 'human'):
        super(InvertedPendulumEnv, self).__init__()
        self.max_episode_steps = max_episode_steps
        self.steps = 0

        self.n_actions = 3        # number of discrete actions
        self.max_voltage = 3.0    # maximum voltage (V)
        self.l = 0.042            # distance from the center of mass to the rotor (m)
        self.m = 0.055            # mass (kg)
        self.J = 1.91 * 1e-4      # moment of inertia (kg·m²)
        self.g = 9.81             # gravitational acceleration (m/s²)
        self.b = 3 * 1e-6         # viscous damping (N·m·s/rad)
        self.K = 0.0536           # torque constant (N·m/A)
        self.R = 9.5              # rotor resistance (Ω)

        self.render_mode = render_mode
        self.discrete_action = discrete_action
        self.normalize_state = normalize_state
        self.last_u = 0

        self.screen_dim = 500
        self.screen = None
        self.clock = None
        self.isopen = True

        self.state_bounds = {
            'alpha': (-np.pi, np.pi),
            'alpha_dot': (-15*np.pi, 15*np.pi),
            'u': (-self.max_voltage, self.max_voltage)
        }

        high = np.array([np.pi, 15*np.pi], dtype=np.float32)
        # observation space: [angle α, angular velocity α_dot]
        if self.normalize_state:
            self.observation_space = gym.spaces.Box(
                low=-np.ones_like(high), high=np.ones_like(high), dtype=np.float32
            )
        else:
            self.observation_space = gym.spaces.Box(
                low=-high, high=high, dtype=np.float32
            )

        # action space: motor voltage u
        if self.discrete_action:
            self.discrete_actions = np.linspace(
                -self.max_voltage, self.max_voltage, self.n_actions
            )
            self.action_space = gym.spaces.Discrete(self.n_actions)
        else:
            self.action_space = gym.spaces.Box(
                low=-self.max_voltage, high=self.max_voltage,
                shape=(1,), dtype=np.float32
            )

    def step(self, action):
        self.steps += 1
        alpha, alpha_dot = self.state

        if self.discrete_action:
            u = self.discrete_actions[action]
        else:
            # clip the voltage to [-3, 3] V and convert it to a scalar
            u = np.clip(action, -self.max_voltage, self.max_voltage).item()
        self.last_u = u  # for rendering

        # system dynamics:
        # α̈ = (1/J)(mgl*sin(α) - bα̇ - (K²/R)α̇ + (K/R)u)
        alpha_ddot = (1/self.J) * (
            self.m * self.g * self.l * np.sin(alpha) -
            self.b * alpha_dot -
            (self.K**2/self.R) * alpha_dot +
            (self.K/self.R) * u
        )

        # Euler integration
        dt = 0.005  # sampling time T_s (s)
        alpha_dot_new = alpha_dot + alpha_ddot * dt
        alpha_new = alpha + alpha_dot * dt

        # wrap the angle back into [-π, π]
        if alpha_new > np.pi:
            alpha_new = alpha_new - 2 * np.pi
        elif alpha_new < -np.pi:
            alpha_new = alpha_new + 2 * np.pi

        # clip the angular velocity to [-15π, 15π]
        alpha_dot_new = np.clip(alpha_dot_new, -15*np.pi, 15*np.pi)

        # R(s,a) = -s^T diag(5,0.1)s - u² -> R(s, a) = - 5 * alpha^2 - 0.1 * alpha_dot^2 - u^2
        # a = normalize(alpha_new, -np.pi, np.pi)
        # a_dot = normalize(alpha_dot_new, -15*np.pi, 15*np.pi)
        # u = normalize(u, -3, 3)
        reward = -(5 * alpha_new**2 + 0.1 * alpha_dot_new**2 + u**2)

        self.state = np.array([alpha_new, alpha_dot_new], dtype=np.float32)

        terminated = False  # the task has no natural terminal state
        truncated = self.steps >= self.max_episode_steps  # maximum number of steps reached
        return self._get_obs(), reward, terminated, truncated, {}

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        self.steps = 0
        if options is None:
            alpha = self.np_random.uniform(*self.state_bounds['alpha'])
            alpha_dot = self.np_random.uniform(*self.state_bounds['alpha_dot'])
            self.state = np.array([alpha, alpha_dot], dtype=np.float32)
        else:
            alpha = options.get("alpha")
            alpha_dot = options.get("alpha_dot")
            self.state = np.array([alpha, alpha_dot], dtype=np.float32)
        if self.render_mode == "human":
            self.render()
        return self._get_obs(), {}

    def _normalize_state(self, state):
        alpha, alpha_dot = state
        norm_alpha = normalize(alpha, *self.state_bounds['alpha'])
        norm_alpha_dot = normalize(alpha_dot, *self.state_bounds['alpha_dot'])
        return np.array([norm_alpha, norm_alpha_dot], dtype=np.float32)

    def _get_obs(self):
        if self.normalize_state:
            return self._normalize_state(self.state)
        else:
            alpha, alpha_dot = self.state
            return np.array([alpha, alpha_dot], dtype=np.float32)

    # render() and close() follow the classic control environment code
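A quick smoke test of the environment, as a rough usage sketch (random policy, no training involved):

env = InvertedPendulumEnv(render_mode=None, discrete_action=True, max_episode_steps=200)
obs, info = env.reset(seed=42)
total_reward = 0.0
while True:
    action = env.action_space.sample()   # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
print(obs, total_reward)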
Boundaries of the State Update
Since the angle range is $[-\pi, \pi]$ and the angular-velocity range is $(-15\pi, 15\pi)$, the state is updated with the Euler method using sampling time $T_s = 0.005$:

$$\theta_{t+1} = \theta_t + \dot\theta_t \cdot T_s$$

Consider the pendulum pointing straight down, $\theta_t = \pi$. The largest possible update is $\Delta = \dot\theta_t \cdot T_s \le 15\pi \cdot 0.005 < \pi$, so after one step at most $\pi < \theta_{t+1} < 2\pi$, and the angle should be wrapped as $\theta_{t+1} = \theta_{t+1} - 2\pi$.
Similarly, when $-2\pi < \theta_{t+1} < -\pi$, it should be wrapped as $\theta_{t+1} = \theta_{t+1} + 2\pi$.
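The same wrapping can also be written with the modulo operator, which handles angles of any magnitude rather than a single overshoot:

alpha_new = (alpha_new + np.pi) % (2 * np.pi) - np.pi   # wrap any angle into [-π, π)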
wrappers.RecordEpisodeStatistics
This wrapper keeps track of cumulative rewards and episode lengths.
# example
info = {
    "episode": {
        "r": "<cumulative reward>",
        "l": "<episode length>",
        "t": "<elapsed time since beginning of episode>"
    },
}
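A minimal usage sketch (shown here on the reference Pendulum-v1 environment; the info["episode"] entry only appears on the step where the episode ends):

import gymnasium as gym

env = gym.wrappers.RecordEpisodeStatistics(gym.make("Pendulum-v1"))
obs, info = env.reset(seed=0)
while True:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        ep = info["episode"]
        print(ep["r"], ep["l"], ep["t"])   # cumulative reward, length, elapsed time
        break
env.close()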
DQN Implementation
Hyperparameter configuration at the current stage
config={
    "n_envs": 4,
    "total_timesteps": 500000,
    "learning_rate": 2e-4,
    "buffer_size": 100000,
    "batch_size": 256,
    "gamma": 0.98,
    "tau": 0.01,  # the target network update rate
    "target_network_frequency": 100,  # the timesteps it takes to update the target network
    "learning_starts": 5000,
    "train_frequency": 10,
    "start_epsilon": 1.0,
    "end_epsilon": 0.05,
    "exploration_fraction": 0.8,  # the fraction of `total_timesteps` it takes to go from start epsilon to end epsilon
    "eval_frequency": 10000,
},
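With start_epsilon, end_epsilon and exploration_fraction as above, the exploration rate is typically annealed linearly over the first exploration_fraction * total_timesteps steps. The helper below is my own sketch of that schedule in the CleanRL style, not code taken from this project:

def linear_schedule(start_e: float, end_e: float, duration: int, t: int) -> float:
    """Linearly interpolate from start_e to end_e over `duration` steps, then hold end_e."""
    slope = (end_e - start_e) / duration
    return max(end_e, start_e + slope * t)

# e.g. for the config above, at global step `t`:
# epsilon = linear_schedule(1.0, 0.05, int(0.8 * 500_000), t)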
Effect of the Activation Function on This Task
Replacing ReLU with LeakyReLU
Training results
episodic_return shows an upward trend.
The loss is quite noisy and hard to converge, which seems normal.
The q_values curves are almost all trending downward, which I leave as an open question.
Test results
It works to some extent, but not by much.
Using wandb
wandb.Video
The input data can be a numpy array; channels should be (time, channel, height, width) or (batch, time, channel, height, width). I was misled by an AI and kept passing (T, H, W, C), which is why it never worked no matter how many times I tried…
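Concretely, if the frames were collected as (T, H, W, C) uint8 arrays from env.render() in rgb_array mode, they need to be transposed to (T, C, H, W) before being passed to wandb.Video (a sketch; frame_list is a placeholder for the recorded frames):

import numpy as np
import wandb

frames = np.stack(frame_list)                  # (T, H, W, C), dtype=uint8
frames = np.transpose(frames, (0, 3, 1, 2))    # -> (T, C, H, W), as wandb.Video expects
wandb.log({"eval/video": wandb.Video(frames, fps=30, format="mp4")})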