Reinforcement Learning Task: Inverted Pendulum

Published 2025-03-17


Project repository

Task Description

(Figure: task description)

Detailed Requirements

The pendulum parameters are listed in the table below:

| Variable | Value | Unit | Meaning |
|---|---|---|---|
| m | 0.055 | kg | mass |
| g | 9.81 | m/s² | gravitational acceleration |
| l | 0.042 | m | distance from the center of mass to the rotor |
| J | 1.91 × 10⁻⁴ | kg·m² | moment of inertia |
| b | 3 × 10⁻⁶ | N·m·s/rad | viscous damping |
| K | 0.0536 | N·m/A | torque constant |
| R | 9.5 | Ω | rotor resistance |

The sampling time $T_s$ is chosen as 0.005 s, and the discrete-time dynamics $f$ can be obtained with the Euler method:
$$
\begin{cases}
\alpha_{k+1} = \alpha_k + T_s \dot{\alpha}_k \\
\dot{\alpha}_{k+1} = \dot{\alpha}_k + T_s \ddot{\alpha}(\alpha_k, \dot{\alpha}_k, a_k)
\end{cases}
$$
The discount factor is chosen as $\gamma = 0.98$. The purpose of a relatively high discount factor is to increase the weight, in the value of the initial state, of the rewards collected near the target point (the upright position), so that the optimal policy treats successfully swinging the pendulum up and stabilizing it as its final goal. (With $\gamma = 0.98$ and a 200-step horizon, $0.98^{200} \approx 0.018$, whereas e.g. $0.9^{200} \approx 7 \times 10^{-10}$, so end-of-episode rewards still contribute noticeably to the initial state value.)

(Tip: the action space can be discretized into the three actions {−3, 0, 3}, and the optimal policy can be learned over this three-action set.)
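As a quick sanity check, here is a minimal sketch of the discretized dynamics, using the parameter values from the table above and the acceleration model that appears later in the environment code; the function name euler_step is illustrative.

import numpy as np

m, g, l = 0.055, 9.81, 0.042               # mass, gravity, distance to rotor
J, b, K, R = 1.91e-4, 3e-6, 0.0536, 9.5    # inertia, damping, torque constant, resistance
Ts = 0.005                                 # sampling time (s)

def euler_step(alpha, alpha_dot, u):
    # acceleration: (1/J)(m*g*l*sin(alpha) - b*alpha_dot - (K^2/R)*alpha_dot + (K/R)*u)
    alpha_ddot = (m * g * l * np.sin(alpha) - b * alpha_dot
                  - (K ** 2 / R) * alpha_dot + (K / R) * u) / J
    # one Euler step of the discrete-time dynamics
    return alpha + Ts * alpha_dot, alpha_dot + Ts * alpha_ddot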

Creating a Custom Environment

Classic-Control Reference

gymnasium implements the classic-control pendulum environment, which can be used as a reference.

Note that the reference environment limits episode length by truncating once the maximum number of time steps is exceeded. (My original idea was to follow the task requirements and signal task completion by checking whether the current state is close to [0, 0].)

def step(self, u):
    ...
    # truncation=False as the time limit is handled by the `TimeLimit` wrapper added during `make`
    return self._get_obs(), -costs, False, False, {}
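The same mechanism applies to a custom environment once it is registered; in the sketch below the environment id and module path are placeholders, and only the max_episode_steps argument matters for the TimeLimit wrapper.

import gymnasium as gym

gym.register(
    id="InvertedPendulum-custom-v0",                          # placeholder id
    entry_point="inverted_pendulum_env:InvertedPendulumEnv",  # hypothetical module path
)
# gym.make wraps the env in TimeLimit, which returns truncated=True at the step limit
env = gym.make("InvertedPendulum-custom-v0", max_episode_steps=200)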

In addition, the state on reset is sampled from a uniform distribution rather than always starting from a fixed initial state; on reflection, this increases exploration and helps learning.

self.state = self.np_random.uniform(low=low, high=high)

Design and Implementation

import numpy as np
import gymnasium as gym
from typing import Optional


def normalize(x, low, high):
    # Assumed helper (not shown in the original post): map a value from
    # [low, high] to [-1, 1], matching the normalized observation space.
    return 2.0 * (x - low) / (high - low) - 1.0


class InvertedPendulumEnv(gym.Env):
    
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    
    def __init__(self, 
                 max_episode_steps: int = 200, 
                 normalize_state: bool = False, 
                 discrete_action: bool = False, 
                 render_mode: Optional[str] = 'human'):
        super(InvertedPendulumEnv, self).__init__()
        
        self.max_episode_steps = max_episode_steps
        self.steps = 0
        self.n_actions = 3        # number of discrete actions
        self.max_voltage = 3.0    # maximum voltage (V)
        self.l = 0.042            # distance from center of mass to rotor (m)
        self.m = 0.055            # mass (kg)
        self.J = 1.91e-4          # moment of inertia (kg·m²)
        self.g = 9.81             # gravitational acceleration (m/s²)
        self.b = 3e-6             # viscous damping (N·m·s/rad)
        self.K = 0.0536           # torque constant (N·m/A)
        self.R = 9.5              # rotor resistance (Ω)
        
        self.render_mode = render_mode
        self.discrete_action = discrete_action
        self.normalize_state = normalize_state

        self.last_u = 0
        self.screen_dim = 500
        self.screen = None
        self.clock = None
        self.isopen = True
        
        self.state_bounds = {
            'alpha': (-np.pi, np.pi),
            'alpha_dot': (-15*np.pi, 15*np.pi),
            'u': (-self.max_voltage, self.max_voltage)
        }
        
        # state space: [angle α, angular velocity α̇]
        high = np.array([np.pi, 15*np.pi], dtype=np.float32)
        if self.normalize_state:
            self.observation_space = gym.spaces.Box(
                low=-np.ones_like(high), high=np.ones_like(high), dtype=np.float32
            )
        else:
            self.observation_space = gym.spaces.Box(
                low=-high, high=high, dtype=np.float32
            )
        
        # action space (voltage u)
        if self.discrete_action:
            self.discrete_actions = np.linspace(
                -self.max_voltage, self.max_voltage, self.n_actions
            )
            self.action_space = gym.spaces.Discrete(self.n_actions)
            
        else:
            self.action_space = gym.spaces.Box(
                low=-self.max_voltage, high=self.max_voltage,
                shape=(1,), dtype=np.float32
            )
    
    def step(self, action):
        self.steps += 1
        
        alpha, alpha_dot = self.state
        
        if self.discrete_action:
            u = self.discrete_actions[action]
        else:
            # clip the voltage to the allowed range [-3, 3] V
            u = float(np.clip(action, -self.max_voltage, self.max_voltage).item())

        self.last_u = u  # for rendering
        
        # System dynamics:
        # α̈ = (1/J)(m·g·l·sin(α) - b·α̇ - (K²/R)·α̇ + (K/R)·u)
        alpha_ddot = (1/self.J) * (
            self.m * self.g * self.l * np.sin(alpha) - 
            self.b * alpha_dot - 
            (self.K**2/self.R) * alpha_dot + 
            (self.K/self.R) * u
        )
        
        # Numerical integration with the Euler method
        dt = 0.005  # sampling time T_s (s)
        alpha_dot_new = alpha_dot + alpha_ddot * dt
        alpha_new = alpha + alpha_dot * dt
        
        # wrap the angle back into [-π, π]
        if alpha_new > np.pi:
            alpha_new = alpha_new - 2 * np.pi
        elif alpha_new < -np.pi:
            alpha_new = alpha_new + 2 * np.pi
            
        # clip the angular velocity to [-15π, 15π]
        alpha_dot_new = np.clip(alpha_dot_new, -15*np.pi, 15*np.pi)
        
        # R(s,a) = -s^T diag(5,0.1)s - u² -> R(s, a) = - 5 * alpha^2 - 0.1 * alpha_dot^2 - u^2
        # a = normalize(alpha_new, -np.pi, np.pi)
        # a_dot = normalize(alpha_dot_new, -15*np.pi, 15*np.pi)
        # u = normalize(u, -3, 3)
        reward = -(5 * alpha_new**2 + 0.1 * alpha_dot_new**2 + u**2)
        
        self.state = np.array([alpha_new, alpha_dot_new], dtype=np.float32)
        
        # The swing-up task has no natural terminal state; reaching the
        # maximum number of steps is reported as truncation, not termination.
        terminated = False
        truncated = self.steps >= self.max_episode_steps
        
        return self._get_obs(), reward, terminated, truncated, {}
    
    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        self.steps = 0
        if options is None:
            alpha = self.np_random.uniform(*self.state_bounds['alpha'])
            alpha_dot = self.np_random.uniform(*self.state_bounds['alpha_dot'])
            self.state = np.array([alpha, alpha_dot], dtype=np.float32)
        else:
            alpha = options.get("alpha")
            alpha_dot = options.get("alpha_dot")
            self.state = np.array([alpha, alpha_dot], dtype=np.float32)
        if self.render_mode == "human":
            self.render()
        return self._get_obs(), {}
    
    def _normalize_state(self, state):
        alpha, alpha_dot = state
        norm_alpha = normalize(alpha, *self.state_bounds['alpha'])
        norm_alpha_dot = normalize(alpha_dot, *self.state_bounds['alpha_dot'])
        return np.array([norm_alpha, norm_alpha_dot], dtype=np.float32)
    
    def _get_obs(self):
        if self.normalize_state:
            return self._normalize_state(self.state)
        else:
            alpha, alpha_dot = self.state
            return np.array([alpha, alpha_dot], dtype=np.float32)
    
    # render() and close() follow the classic-control pendulum implementation
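A minimal smoke test of the environment defined above (a sketch, not part of the assignment code):

env = InvertedPendulumEnv(discrete_action=True, render_mode=None)
obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
print(episode_return)
env.close()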

State-Update Boundaries

Since the angle range is $[-\pi, \pi]$ and the angular-velocity range is $(-15\pi, 15\pi)$, the Euler update with sampling time $T_s = 0.005$ is
$$\theta_{t+1} = \theta_t + \dot\theta_t \cdot T_s$$
Consider the pendulum hanging straight down, $\theta_t = \pi$. The largest update is $\Delta = \dot\theta_t \cdot T_s$, so we can have $\pi < \theta_{t+1} < 2\pi$, in which case we take the modulus $\theta_{t+1} = \theta_{t+1} - 2\pi$.

Similarly, when $-2\pi < \theta_{t+1} < -\pi$, we take the modulus $\theta_{t+1} = \theta_{t+1} + 2\pi$.
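Equivalently, a single modulo expression wraps any angle into $[-\pi, \pi)$ without the two branches (a drop-in alternative to the if/elif in the code above):

alpha_new = (alpha_new + np.pi) % (2 * np.pi) - np.pi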

wrappers.RecordEpisodeStatistics

This wrapper tracks cumulative episode rewards and episode lengths.

# example
info = {
    "episode": {
        "r": "<cumulative reward>",
        "l": "<episode length>",
        "t": "<elapsed time since beginning of episode>"
    },
}
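A usage sketch for a single (non-vectorized) environment; the statistics appear in `info` only on the step where the episode ends:

env = gym.wrappers.RecordEpisodeStatistics(env)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
if terminated or truncated:
    print(info["episode"]["r"], info["episode"]["l"], info["episode"]["t"])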

DQN Implementation

Current hyperparameter configuration

config = {
    "n_envs": 4,
    "total_timesteps": 500000,
    "learning_rate": 2e-4,
    "buffer_size": 100000,
    "batch_size": 256,
    "gamma": 0.98,
    "tau": 0.01,                         # target network update rate
    "target_network_frequency": 100,     # timesteps between target network updates
    "learning_starts": 5000,
    "train_frequency": 10,
    "start_epsilon": 1.0,
    "end_epsilon": 0.05,
    "exploration_fraction": 0.8,         # fraction of total_timesteps over which epsilon decays from start to end
    "eval_frequency": 10000,
}
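The last three entries define a linear epsilon schedule. A sketch of the decay they imply, with an illustrative function name:

def linear_epsilon(step, total_timesteps=500_000,
                   start_e=1.0, end_e=0.05, fraction=0.8):
    # epsilon decays linearly from start_e to end_e over the first
    # `fraction * total_timesteps` steps, then stays at end_e
    duration = fraction * total_timesteps
    slope = (end_e - start_e) / duration
    return max(end_e, start_e + slope * step)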

Effect of the Activation Function on This Task

ReLU was replaced with LeakyReLU; a sketch of the resulting Q-network is shown below.
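A sketch of the kind of Q-network this refers to, assuming a PyTorch MLP; the hidden sizes are illustrative and not taken from the actual implementation.

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim=2, n_actions=3, hidden=128):
        super().__init__()
        # two hidden layers with LeakyReLU instead of ReLU
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)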

Training Results

episodic_return shows an upward trend.

(Figure: episodic_return curve)

The loss oscillates quite a bit and is hard to get to converge, which seems normal.

The q_values show an almost uniformly downward trend, which I leave as an open question.

(Figure: loss and q_values curves)

Evaluation Results

It works to some extent, but not by much.
(Figure: evaluation rollout)

Using wandb

wandb.Video

The input data can be a numpy array; channels should be (time, channel, height, width) or (batch, time, channel, height, width). I was misled by an AI assistant and kept passing (T, H, W, C), which is why it never worked no matter what I tried.
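A sketch of the fix, assuming frames were collected from env.render() as (H, W, C) arrays in a list called frame_list (illustrative name):

import numpy as np
import wandb

frames = np.stack(frame_list)                # (T, H, W, C)
frames = np.transpose(frames, (0, 3, 1, 2))  # -> (T, C, H, W), as wandb.Video expects
wandb.log({"eval/video": wandb.Video(frames, fps=30, format="gif")})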

This post is still being updated.