Dive into Deep Learning: Linear Regression + Basic Optimization Algorithms

Generating the Dataset

# matplotlib inline
import random
import torch
from d2l import torch as d2l

def synthetic_data(w, b, num_examples):
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

print('features:', features[0], '\nlabel:', labels[0])
d2l.set_figsize()
d2l.plt.scatter(features[:, (1)].detach().numpy(),
                labels.detach().numpy(), 1);

X = torch.normal(0, 1, (num_examples, len(w)))

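Creates a tensor X of shape (num_examples, len(w)) whose elements are drawn independently from a normal distribution with mean 0 and standard deviation 1.

y = torch.matmul(X, w) + b

Computes the noise-free target y, i.e. the output of the linear model with the true parameters.

In mathematical form: $y_i = \sum_{j=1}^d X_{ij} w_j + b$

y += torch.normal(0, 0.01, y.shape)

Adds a small amount of Gaussian noise (mean 0, standard deviation 0.01) to the noise-free values so that the data looks more like real measurements.

return X, y.reshape((-1, 1))

Returns the feature matrix X and the targets y, with y reshaped into a two-dimensional column vector of shape (num_examples, 1).

Explanation:

reshape((-1, 1)): reshapes y into a single column.
- -1 tells PyTorch to infer that dimension automatically; here it is inferred as num_examples.
- For example, if num_examples=5, the final shape of y is (5, 1).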
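The shapes described above can be checked directly. The following is a minimal sketch that repeats the same function on a tiny dataset; the names X_small, y_small, flat and col are introduced here only for illustration.

```python
import torch

def synthetic_data(w, b, num_examples):
    """Generate y = Xw + b + noise (same logic as the function above)."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b           # shape: (num_examples,)
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))         # shape: (num_examples, 1)

X_small, y_small = synthetic_data(torch.tensor([2, -3.4]), 4.2, 5)
print(X_small.shape)  # torch.Size([5, 2])
print(y_small.shape)  # torch.Size([5, 1])

# reshape((-1, 1)) on its own: 6 elements become 6 rows of 1 column
flat = torch.arange(6.0)
col = flat.reshape((-1, 1))
print(flat.shape, col.shape)
```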

true_w = torch.tensor([2, -3.4])
true_b = 4.2

features, labels = synthetic_data(true_w, true_b, 1000)

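true_w and true_b define the true weights and bias of the linear model.

The call to the custom function synthetic_data then generates 1000 synthetic training examples.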

d2l.set_figsize()

Sets the figure size.

d2l.plt.scatter(features[:, (1)].detach().numpy(),
labels.detach().numpy(), 1)

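Plots the second feature (x₂) against the label (y) as a scatter plot to show how the data is distributed.

Step-by-step breakdown

(1) features[:, (1)]

Selects the second feature column (index 1) for all samples. The parentheses around 1 have no effect, so this is the same as features[:, 1].
The result is a tensor of shape (1000,).

(2) .detach().numpy()

detach(): returns a tensor that is detached from the computation graph, so PyTorch does not track gradients through it.
.numpy(): converts the tensor to a NumPy array, the format Matplotlib works with.

(3) labels.detach().numpy()

The labels are converted to a NumPy array in the same way.
The original labels have shape (1000, 1); scatter flattens them automatically.

(4) scatter(x, y, 1)

scatter draws a scatter plot.
Arguments:
1. x: the horizontal coordinates, here the second feature.
2. y: the vertical coordinates, here the label values.
3. 1: the marker size (the s argument); smaller values give smaller points.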

Reading the Dataset

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read in random order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

def data_iter(batch_size, features, labels):

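Defines a generator function that yields the dataset in mini-batches for model training.

In deep learning, the whole dataset is usually not fed to the model at once. Instead it is split into mini-batches that are processed one after another, which keeps memory use bounded and makes good use of the GPU's parallelism.

num_examples = len(features)

len(features) returns the number of samples (rows) in the feature matrix.

indices = list(range(num_examples))

range(num_examples) produces the integers from 0 to num_examples-1.

list() then turns them into a list.

random.shuffle(indices)

Shuffles the index list so that the mini-batches come out in a different random order on every pass over the data.

random.shuffle() works in place and does not return a new list.

Why:
If the data were always fed in its original order, any structure in that order (for example, samples sorted by label) could bias the updates and hurt generalization.
Shuffling the samples every epoch is standard practice in **stochastic gradient descent (SGD)**.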

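for i in range(0, num_examples, batch_size):
    batch_indices = torch.tensor(
        indices[i: min(i + batch_size, num_examples)])
    yield features[batch_indices], labels[batch_indices]

Iterates over the whole dataset, starting from 0 in steps of batch_size.

i + batch_size
- The end position of the current batch.
min(i + batch_size, num_examples)
- Takes the smaller of i + batch_size and the total number of samples, which makes the upper bound of the last batch explicit (Python list slicing would also clamp it).

indices[i : min(...)]
- Slicing that picks the sample indices of the current batch.
torch.tensor()
- Converts the Python list into a PyTorch tensor so it can be used for tensor indexing.
yield
- Turns the function into a generator: each iteration produces one mini-batch instead of returning all the data at once.
features[batch_indices]
- Extracts the features of the current batch by index.
labels[batch_indices]
- Extracts the matching labels.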
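To see the generator behaviour in isolation, here is a minimal sketch that uses plain Python lists instead of tensors. The helper batch_index_iter is hypothetical and not part of the original code; it yields only the index batches, which is enough to show that batches are produced lazily and that every index appears exactly once per pass.

```python
import random

def batch_index_iter(batch_size, num_examples):
    """Hypothetical helper: yield shuffled index batches, as data_iter does."""
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffles in place, returns None
    for i in range(0, num_examples, batch_size):
        yield indices[i: min(i + batch_size, num_examples)]

gen = batch_index_iter(4, 10)
first = next(gen)   # the function body only starts running here
rest = list(gen)    # consume the remaining batches

# Every index from 0 to 9 shows up exactly once across all batches.
seen = sorted(first + [i for batch in rest for i in batch])
print(len(first), [len(batch) for batch in rest])  # 4 [4, 2]
print(seen == list(range(10)))                     # True
```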

PyTorch provides a ready-made data loading utility, DataLoader, which does much the same job as this function:

from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(features, labels)
data_iter = DataLoader(dataset, batch_size=3, shuffle=True)

dataset = TensorDataset(features, labels)
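Wraps several tensors that share the same first-dimension length (here features and labels) into one indexable dataset object.

data_iter = DataLoader(dataset, batch_size=3, shuffle=True)
Iterates over dataset in mini-batches, shuffling the samples if requested. Note that this assignment reuses the name data_iter, so it would shadow the function defined earlier.

dataset: the dataset object (e.g. a TensorDataset or a custom Dataset).
batch_size=3: 3 samples per batch. len(data_iter) = ceil(N/3) (or floor(N/3) if drop_last=True).
shuffle=True: reshuffles the data at the start of every epoch (a RandomSampler is used under the hood).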
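The batch counts above can be verified on a toy dataset. This is a small sketch; the dataset size N = 10 and the names toy, keep_last and drop_last are assumptions made for the example.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

N = 10  # assumed toy dataset size
values = torch.arange(N, dtype=torch.float32).reshape(-1, 1)
toy = TensorDataset(values, values)

keep_last = DataLoader(toy, batch_size=3, shuffle=True)
drop_last = DataLoader(toy, batch_size=3, shuffle=True, drop_last=True)

print(len(keep_last), math.ceil(N / 3))       # 4 4
print(len(drop_last), N // 3)                 # 3 3
print([len(xb) for xb, yb in keep_last])      # [3, 3, 3, 1]
```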

Initializing the Model Parameters

w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

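Initializes the weight w with values drawn from a normal distribution with mean 0 and a small standard deviation (0.01). requires_grad=True tells PyTorch to compute gradients for this tensor during backpropagation.

Initializes the bias b to 0; it also needs gradients.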

Defining the Model

def linreg(X, w, b):
    """The linear regression model."""
    return torch.matmul(X, w) + b

Computes $\hat{y}=Xw+b$; the bias b is broadcast across all rows.

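Defining the Loss Function

def squared_loss(y_hat, y):
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

Computes the squared loss for each sample: $L(y, \hat{y}) = \frac{1}{2} (\hat{y} - y)^2$. The function returns one loss value per sample and does not average them.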
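For reference, a short derivation of the gradients that backpropagation computes for this loss. The notation $x_i$ for the feature vector of sample $i$ is introduced here and does not appear in the original text.

```latex
\begin{aligned}
L_i &= \tfrac{1}{2}\,(\hat{y}_i - y_i)^2, \qquad \hat{y}_i = x_i^\top w + b \\
\frac{\partial L_i}{\partial w} &= (\hat{y}_i - y_i)\, x_i \\
\frac{\partial L_i}{\partial b} &= \hat{y}_i - y_i
\end{aligned}
```

The factor 1/2 in the loss cancels the 2 that comes from differentiating the square.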

Defining the Optimization Algorithm

def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

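1. Function definition

Name: sgd
Parameters:
- params: the collection of parameters to optimize (usually w, b and so on; each must have requires_grad=True).
- lr: the learning rate, which sets the step size of each update.
- batch_size: the number of samples in the mini-batch, i.e. how many training samples went into the current iteration.

2. `with torch.no_grad():`

In PyTorch, w and b are leaf tensors with requires_grad=True, and modifying such a tensor in place (e.g. param -= ...) while autograd is recording raises a RuntimeError.
torch.no_grad() turns off gradient tracking inside the block, so the update is applied directly to the parameter values and does not become part of the computation graph.
This is the standard way to write a manual parameter update.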
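A small sketch of the difference, using a stand-in parameter named w_demo (a hypothetical name chosen for this example): the in-place update fails while autograd is recording and succeeds inside torch.no_grad().

```python
import torch

w_demo = torch.ones(2, requires_grad=True)  # a leaf tensor, like w and b

raised = False
try:
    w_demo -= 0.1   # in-place update while autograd is recording
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

with torch.no_grad():
    w_demo -= 0.1   # allowed: the update is not tracked by autograd

print(raised, w_demo)
```

After the block, w_demo still has requires_grad=True, so it can keep serving as a trainable parameter.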

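3. The parameter update rule

param -= lr * param.grad / batch_size

This is the core update of minibatch stochastic gradient descent.
If $L_i$ is the loss on sample $i$, $\theta$ is a parameter and $\nabla_\theta L_i$ is the corresponding gradient, then: $\theta \; \leftarrow \; \theta - \eta \cdot \frac{1}{B}\sum_{i=1}^B \nabla_\theta L_i$

where:

$\eta$ = the learning rate lr
$B$ = batch_size
param.grad holds the gradient summed over the samples in the batch (the training loop calls l.sum().backward()), which is why it is divided by batch_size here
param -= ... performs the gradient descent step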
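A worked numeric example of one update step in plain Python. The setup is an assumption chosen to keep the arithmetic simple: a single weight, no bias, the model y_hat = w * x, and a batch of two samples.

```python
# Assumed toy setup: one weight, no bias, y_hat = w * x.
xs = [1.0, 2.0]
ys = [2.0, 4.0]
w = 0.0
lr = 0.1
batch_size = len(xs)

# The gradient of 1/2 * (w*x - y)^2 with respect to w is (w*x - y) * x.
per_sample_grads = [(w * x - y) * x for x, y in zip(xs, ys)]
grad_sum = sum(per_sample_grads)   # what param.grad holds after l.sum().backward()

w_new = w - lr * grad_sum / batch_size    # the sgd update
print(per_sample_grads, grad_sum, w_new)  # [-2.0, -8.0] -10.0 0.5
```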

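4. Clearing the gradients

param.grad.zero_()

During backpropagation, PyTorch adds newly computed gradients to whatever is already stored in .grad.
The gradients therefore have to be cleared after every parameter update; otherwise the next backward pass would add to the old values and the update would be wrong.
zero_() zeroes the tensor in place, reusing the existing memory instead of allocating a new tensor.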
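The accumulation can be observed on a single scalar. In this sketch, w_acc is a hypothetical stand-in for a model parameter.

```python
import torch

w_acc = torch.tensor(1.0, requires_grad=True)

(2 * w_acc).backward()
first = w_acc.grad.item()    # d(2w)/dw = 2

(2 * w_acc).backward()
second = w_acc.grad.item()   # 2 + 2: the new gradient was added to the old one

w_acc.grad.zero_()
cleared = w_acc.grad.item()

print(first, second, cleared)  # 2.0 4.0 0.0
```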

Training

lr = 0.03
num_epochs = 3
batch_size = 10  # same value as in the complete code
net = linreg
loss = squared_loss
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # Minibatch loss on X and y
        # l has shape (batch_size, 1) rather than being a scalar, so all of its
        # elements are summed and the gradients w.r.t. [w, b] come from that sum
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # Update the parameters using their gradients
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')

1. Hyperparameters

lr = 0.03 # learning rate
num_epochs = 3 # number of epochs (full passes over the training data)
net = linreg # the linear regression model defined earlier
loss = squared_loss # the squared loss function defined earlier

2. The training loop

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)
        l.sum().backward()
        sgd([w, b], lr, batch_size)

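(1) Iterating over the data

data_iter(batch_size, features, labels) yields one mini-batch (X, y) at a time.
X: shape (batch_size, num_features).
y: shape (batch_size, 1), because labels was reshaped into a column in synthetic_data.

(2) Forward pass

net(X, w, b) → calls linreg and returns the predictions y_hat.
loss(net(X, w, b), y) → calls squared_loss and returns the per-sample loss tensor of shape (batch_size, 1).

(3) Backpropagation

l.sum().backward():
- l has shape (batch_size, 1); it is not a scalar.
- .sum() adds the elements up into a scalar loss.
- .backward() computes the gradients of that loss with respect to the parameters [w, b] and stores them in w.grad and b.grad.

(4) Parameter update

sgd([w, b], lr, batch_size):
- Updates w and b with the minibatch stochastic gradient descent function written earlier.
- Also zeroes the gradients so that they do not accumulate.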

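3. Computing the loss after each epoch

with torch.no_grad():
    train_l = loss(net(features, w, b), labels)
    print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

with torch.no_grad() → no gradients are needed for evaluation, so tracking is switched off to save time and memory.
net(features, w, b) → computes predictions on the whole training set with the current parameters.
loss(..., labels) → gives the loss of every sample.
train_l.mean() → averages them into the mean training loss.
print(...) → prints the mean loss for the current epoch.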

Complete Code

# matplotlib inline
import random
import torch
from d2l import torch as d2l

def synthetic_data(w, b, num_examples):
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

print('features:', features[0], '\nlabel:', labels[0])
d2l.set_figsize()
d2l.plt.scatter(features[:, (1)].detach().numpy(),
                labels.detach().numpy(), 1);

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read in random order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
def linreg(X, w, b):  #@save
    """The linear regression model."""
    return torch.matmul(X, w) + b
def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # Minibatch loss on X and y
        # l has shape (batch_size, 1) rather than being a scalar, so all of its
        # elements are summed and the gradients w.r.t. [w, b] come from that sum
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # 使用参数的梯度更新参数
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')

Simplified Version with PyTorch's High-Level APIs

# For reproducible runs (e.g. in a notebook), set a seed with torch.manual_seed(0) after the import
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 1) Generate synthetic data (same logic as before)
def synthetic_data(w, b, num_examples):
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = X @ w + b
    y += torch.normal(0, 0.01, y.shape)  # add noise
    return X, y.reshape(-1, 1)

true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

# 2) Use DataLoader in place of the custom data_iter
batch_size = 10
loader = DataLoader(TensorDataset(features, labels), batch_size=batch_size, shuffle=True)

# 3) Use nn.Linear in place of the custom linreg, initialized by hand to match the original setup
model = nn.Linear(2, 1, bias=True)
with torch.no_grad():
    model.weight.normal_(0, 0.01)  # mean 0, standard deviation 0.01
    model.bias.zero_()

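# 4) Use MSELoss in place of squared_loss. reduction='mean' averages (y_hat - y)^2 over the batch,
#    which is twice the original sum / batch_size because squared_loss includes a factor of 1/2.
#    With the same lr, the effective step size is therefore doubled.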
criterion = nn.MSELoss(reduction='mean')

# 5) Use torch.optim.SGD in place of the custom sgd
lr = 0.03
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# 6) Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for X, y in loader:
        pred = model(X)
        loss = criterion(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # After each epoch, check the training loss on the full dataset
    model.eval()
    with torch.no_grad():
        train_loss = criterion(model(features), labels).item()
    print(f'epoch {epoch + 1}, loss {train_loss:.6f}')

# 7) Inspect the parameter errors (note that nn.Linear stores the weight with shape (1, 2))
with torch.no_grad():
    est_w = model.weight.view(-1)
    est_b = model.bias
    print(f"error in estimating w: {true_w - est_w}")
    print(f"error in estimating b: {true_b - est_b}")
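One detail worth checking when comparing the two versions: the custom squared_loss includes a factor of 1/2 that nn.MSELoss does not. The sketch below compares both on a few made-up values (y_hat and y here are illustrative numbers, not data from the article).

```python
import torch
from torch import nn

y_hat = torch.tensor([[1.0], [2.0], [4.0]])
y = torch.tensor([[0.0], [2.0], [1.0]])

# Custom squared_loss, summed and divided by the batch size
half_squared_mean = ((y_hat - y) ** 2 / 2).sum() / len(y)
# Built-in mean squared error
mse = nn.MSELoss(reduction='mean')(y_hat, y)

print(half_squared_mean.item(), mse.item())
print(torch.allclose(mse, 2 * half_squared_mean))  # True
```

The reported losses and the gradient magnitudes of the two versions therefore differ by a factor of 2 for the same learning rate.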

Exercises

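1. What happens if the weights are initialized to zero? Does the algorithm still work?

Linear regression: zero initialization is fine. There is a single output unit and the loss is convex, and the gradient of each weight depends on its own feature column, so the weights move away from zero on the first step and the algorithm still converges.
Deep neural networks: if all weights are zero, every neuron in a layer produces the same output and receives the same gradient, so the neurons stay identical after each update and the network cannot learn useful features.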
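The linear regression case can be illustrated with a minimal full-batch gradient descent written in plain Python. The toy data (generated by y = 2x + 1 without noise), the learning rate and the step count are assumptions made for this sketch.

```python
# Assumed toy data generated by y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0   # zero initialization
lr = 0.1
n = len(xs)

for step in range(1000):
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / n  # mean gradient w.r.t. w
    grad_b = sum(errors) / n                             # mean gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 4), round(b, 4))
```

Starting from zero, the parameters still approach the true values 2 and 1.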

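5. Why is reshape needed in the squared_loss function?

def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

y_hat has shape (batch_size, 1).
y may have shape (batch_size,). Subtracting it directly would trigger broadcasting and produce a (batch_size, batch_size) result instead of a per-sample loss, without raising any error.
reshape(y_hat.shape) guarantees that the two shapes match exactly, which avoids this silent mistake.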
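A short sketch of the broadcasting pitfall, using three made-up values in which the predictions equal the labels, so the correct loss is zero:

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1)
y = torch.tensor([1.0, 2.0, 3.0])             # shape (3,)

wrong = (y_hat - y) ** 2 / 2                       # broadcasts to (3, 3)
right = (y_hat - y.reshape(y_hat.shape)) ** 2 / 2  # stays (3, 1)

print(wrong.shape, right.shape)                # torch.Size([3, 3]) torch.Size([3, 1])
print(wrong.sum().item(), right.sum().item())  # 6.0 0.0
```

Without the reshape, every prediction is compared with every label, so the summed loss is nonzero even though the predictions are perfect.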

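7. How does the data_iter function behave if the number of examples is not divisible by the batch size?

for i in range(0, num_examples, batch_size):
    batch_indices = torch.tensor(indices[i: min(i + batch_size, num_examples)])
    yield features[batch_indices], labels[batch_indices]

When num_examples is not divisible by batch_size, the last iteration yields a mini-batch with fewer than batch_size samples. For example, with 1000 samples and batch_size=3 there are 333 full batches followed by one batch containing a single sample.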
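The batch sizes follow directly from the slicing bounds. The helper batch_sizes below is hypothetical and only mirrors the index arithmetic of data_iter:

```python
def batch_sizes(num_examples, batch_size):
    """Hypothetical helper: sizes of the batches that data_iter would yield."""
    return [min(i + batch_size, num_examples) - i
            for i in range(0, num_examples, batch_size)]

sizes = batch_sizes(1000, 3)
print(len(sizes), sizes[0], sizes[-1])  # 334 3 1
print(batch_sizes(10, 4))               # [4, 4, 2]
```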
