Section 3.2 of *Dive into Deep Learning* ("Linear Regression Implementation from Scratch") contains the following code:
```python
import random
import torch

# Generate a synthetic dataset
def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    # num_examples rows and len(w) columns of samples drawn from N(0, 1)
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)  # add Gaussian noise
    return X, y.reshape((-1, 1))

# Yield minibatches of size batch_size
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # read the samples in random order
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(indices[i:min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

batch_size = 10
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

# Initialize the model parameters
w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Define the model
def linreg(X, w, b):
    """The linear regression model."""
    return torch.matmul(X, w) + b

# Define the loss function
def squared_loss(y_hat, y):
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

# Define the optimization algorithm
def sgd(params, lr, batch_size):  # lr is the learning rate
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

# Training
lr = 0.03
num_epochs = 1  # number of training epochs
net = linreg         # the model
loss = squared_loss  # the loss function
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # minibatch loss
        l.sum().backward()
        # print('[w,b]:', [w, b])
        sgd([w, b], lr, batch_size)
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

# Evaluate training by comparing the true and learned parameters
print(f'estimation error of w: {true_w - w.reshape(true_w.shape)}')
print(f'estimation error of b: {true_b - b}')
```
Gradient Computation and Verification for Linear Regression in PyTorch

Below I derive, step by step, the gradient computation performed by this code:
1. Variables and model

Suppose one minibatch is

X = tensor([[ 0.3742, 1.0514],[-0.5108, -2.9390],[-0.6907, 2.3641],[-0.5569, 0.4298],[-0.4228, -1.0638],[-1.3704, -1.6127],[ 1.3422, 0.9927],[-1.6255, 0.5072],[ 1.3470, -0.5777],[ 1.6256, 0.8769]])
y = tensor([[ 1.3774],[13.1741],[-5.2062],[ 1.6101],[ 6.9709],[ 6.9469],[ 3.5065],[-0.7634],[ 8.8503],[ 4.4814]])

- Input matrix $X \in \mathbb{R}^{10 \times 2}$
- Weight vector $w \in \mathbb{R}^{2 \times 1}$
- Bias scalar $b \in \mathbb{R}$
- Targets $y \in \mathbb{R}^{10 \times 1}$
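The X and y shown above are one randomly sampled minibatch, so a rerun produces different numbers. A minimal sketch (assuming the script above has already been run) of how such a batch can be pulled out for inspection:

```python
# data_iter is a generator, so next() yields the first minibatch.
# The values differ from run to run because the dataset is random.
X, y = next(data_iter(batch_size, features, labels))
print(X.shape, y.shape)  # torch.Size([10, 2]) torch.Size([10, 1])
```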
2. The linear regression model

$$\hat{y} = Xw + b$$

Elementwise, $\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + b$ for $i = 1, \dots, 10$.
3. The squared loss function

$$L(w, b) = \frac{1}{2} \sum_{i=1}^{10} (\hat{y}_i - y_i)^2$$
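To make these two formulas concrete, here is a small sketch (reusing the X, y, w, b from the script above) checking that the matrix form matches the per-sample formula and that summing the per-sample losses gives $L$:

```python
# Forward pass: matrix form vs. the per-sample formula from Section 2
y_hat = torch.matmul(X, w) + b                         # shape (10, 1)
y_hat_manual = w[0] * X[:, 0:1] + w[1] * X[:, 1:2] + b
assert torch.allclose(y_hat, y_hat_manual)

# Total squared loss L(w, b) = 1/2 * sum_i (y_hat_i - y_i)^2
L = 0.5 * ((y_hat - y) ** 2).sum()
# Identical to summing the per-sample losses from squared_loss
assert torch.allclose(L, squared_loss(y_hat, y).sum())
```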
4. Computing the gradients

4.1 Computing $\frac{\partial L}{\partial w}$

The loss function is

$$L(w, b) = \frac{1}{2} \sum_{i=1}^{10} (\hat{y}_i - y_i)^2$$

with predictions

$$\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + b$$

By the chain rule:

$$\frac{\partial L}{\partial w_j} = \sum_{i=1}^{10} \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_j}$$
Step by step:

- Compute $\frac{\partial L}{\partial \hat{y}_i}$:

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{\partial}{\partial \hat{y}_i} \left[ \frac{1}{2} (\hat{y}_i - y_i)^2 \right] = \hat{y}_i - y_i$$

- Compute $\frac{\partial \hat{y}_i}{\partial w_j}$:

$$\frac{\partial \hat{y}_i}{\partial w_j} = \frac{\partial}{\partial w_j} \left[ w_1 x_{i1} + w_2 x_{i2} + b \right] = x_{ij}$$

- Combine:

$$\frac{\partial L}{\partial w_j} = \sum_{i=1}^{10} (\hat{y}_i - y_i) \cdot x_{ij}$$
Matrix form: define the error vector $e = \hat{y} - y$, with $e_i = \hat{y}_i - y_i$. Then

$$\frac{\partial L}{\partial w_j} = \sum_{i=1}^{10} e_i \cdot x_{ij}$$

and stacking the two weight gradients gives, in matrix notation,

$$\frac{\partial L}{\partial w} = \begin{bmatrix} \sum_{i=1}^{10} e_i \, x_{i1} \\ \sum_{i=1}^{10} e_i \, x_{i2} \end{bmatrix} = X^T e = X^T (\hat{y} - y)$$
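A numerical sketch of this matrix formula, using the batch and parameters from the script above:

```python
# Manual weight gradient from the derivation: dL/dw = X^T (y_hat - y)
with torch.no_grad():
    y_hat = torch.matmul(X, w) + b
    e = y_hat - y                         # error vector, shape (10, 1)
    grad_w_manual = torch.matmul(X.T, e)  # shape (2, 1), one entry per weight
print(grad_w_manual)
```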
4.2 Computing $\frac{\partial L}{\partial b}$

Again by the chain rule:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{10} \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b}$$

Step by step:

- Compute $\frac{\partial L}{\partial \hat{y}_i}$:

$$\frac{\partial L}{\partial \hat{y}_i} = \hat{y}_i - y_i$$

- Compute $\frac{\partial \hat{y}_i}{\partial b}$:

$$\frac{\partial \hat{y}_i}{\partial b} = \frac{\partial}{\partial b} \left[ w_1 x_{i1} + w_2 x_{i2} + b \right] = 1$$

- Combine:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{10} (\hat{y}_i - y_i) \cdot 1 = \sum_{i=1}^{10} (\hat{y}_i - y_i)$$
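And the corresponding sketch for the bias:

```python
# Manual bias gradient from the derivation: dL/db = sum_i (y_hat_i - y_i)
with torch.no_grad():
    e = torch.matmul(X, w) + b - y
    grad_b_manual = e.sum()  # a scalar: the total prediction error
print(grad_b_manual)
```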
Intuition behind the gradients

- Weight gradient $\frac{\partial L}{\partial w} = X^T (\hat{y} - y)$: each feature's gradient entry is the sum of that feature's values over all samples, with each sample's contribution scaled by its prediction error $\hat{y}_i - y_i$.
- Bias gradient $\frac{\partial L}{\partial b} = \sum_{i=1}^{10} (\hat{y}_i - y_i)$: the sum of the prediction errors over all samples, i.e. the overall direction of the error.
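This per-feature, error-scaled reading and the matrix form are the same computation, which a short check makes explicit (again reusing X, y, w, b from above):

```python
# Column j of X holds feature j across all samples; scaling each row by its
# error e_i and summing down the column gives that feature's gradient entry.
with torch.no_grad():
    e = torch.matmul(X, w) + b - y    # (10, 1) prediction errors
    per_feature = (X * e).sum(dim=0)  # (2,) elementwise view
    assert torch.allclose(per_feature, torch.matmul(X.T, e).squeeze())
```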
5. Plugging into the code's computation

Step 1: forward pass to compute the predictions

$$\hat{y} = Xw + b$$

Step 2: compute the per-sample loss

$$l_i = \frac{1}{2} (\hat{y}_i - y_i)^2$$

Step 3: backpropagate to compute the gradients

- Weight gradient:

$$\frac{\partial L}{\partial w} = X^T (\hat{y} - y)$$

- Bias gradient:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{10} (\hat{y}_i - y_i)$$
6. Checking the code against the derivation

The code calls l.sum().backward(), which computes the gradient of the total loss $L = \sum_{i=1}^{10} l_i$ with respect to w and b, which is exactly the quantity derived above.

Final gradients:

- The gradient of w is $X^T (\hat{y} - y)$
- The gradient of b is $\sum_{i=1}^{10} (\hat{y}_i - y_i)$

These agree with PyTorch's autograd results for the minibatch above:
- w.grad = tensor([[-9.0000],[65.4477]])
- b.grad = tensor([-40.9567])
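As an end-to-end check, the sketch below (assuming the full script above has run) pushes one fresh batch through l.sum().backward() and asserts that autograd's w.grad and b.grad match the closed-form expressions; the exact numbers differ per run because the data are random:

```python
# One batch, autograd vs. the closed-form gradients derived above
X, y = next(data_iter(batch_size, features, labels))
if w.grad is not None:  # clear gradients left over from the training loop
    w.grad.zero_()
if b.grad is not None:
    b.grad.zero_()

l = squared_loss(linreg(X, w, b), y)
l.sum().backward()  # gradient of the total loss L = sum_i l_i

with torch.no_grad():
    e = torch.matmul(X, w) + b - y  # y_hat - y
    # dL/dw = X^T (y_hat - y), dL/db = sum_i (y_hat_i - y_i)
    assert torch.allclose(w.grad, torch.matmul(X.T, e), rtol=1e-4, atol=1e-6)
    assert torch.allclose(b.grad, e.sum(), rtol=1e-4, atol=1e-6)
print('autograd matches the manual gradients')
```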