Runtime and Memory Optimization: Gradient Accumulation

Published: 2025-07-28

When training a model, blindly increasing the batch size will exceed the GPU's capacity, so small batches are usually used instead. Gradient accumulation was invented so that we can train with a logically larger batch. Yes, logically.

The book first leads us to think about a question: why does neural-network training blow up the memory? Anyone who has run a training job has seen the screen turn red with out-of-memory errors. The root cause is that the main memory burden sits in the backward pass: to compute gradients and update the weights, the activations computed during the forward pass must be kept around. The larger the model, the more activations there are and the more space they occupy; and when the batch grows as well, even more activations are produced.
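
A rough back-of-the-envelope sketch of the activation cost (assuming float32, i.e. 4 bytes per value, and a single fully connected layer with 512 outputs; the numbers are illustrative, not from the book):

batch_size = 64
hidden_units = 512
bytes_per_float32 = 4
activation_bytes = batch_size * hidden_units * bytes_per_float32
print(activation_bytes / 1024, "KiB")  # 128.0 KiB for this single layer

Doubling the batch size doubles this cost, and a real network stores such activations for every layer.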

This vicious cycle forces the batch size down. GA exists to help us raise the effective batch size while staying within the memory limit.

Gradient Accumulation

Traditional mini-batch training can be broken into the following steps:

  1. Fetch a mini-batch
  2. Forward pass, storing the activations
  3. Backward pass: compute the loss --> compute the gradients --> update the weights

GA changes only the final step of the backward pass: once the current mini-batch's loss and gradients have been computed, the weights are not updated right away; instead, the gradients from the next several mini-batches are accumulated, and only then is a single update applied.
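
In PyTorch this is easy because loss.backward() adds to the existing .grad buffers rather than overwriting them. A minimal sketch of the modified loop (loader, model, criterion, optimizer, and accumulation_steps are placeholder names, not from the original notebook):

for step, (inputs, labels) in enumerate(loader):
    loss = criterion(model(inputs), labels)
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # update using the summed gradients
        optimizer.zero_grad()                  # clear .grad for the next window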

Suppose the largest batch the GPU can hold is 16, but we want the batch to reach 64. What do we do?

Use a batch size of 16, accumulate the gradients from each iteration, and update the weights once every 4 mini-batches, for an effective batch of 16 * 4 = 64. (The official example actually updates every 8 mini-batches, giving 16 * 8 = 128; presumably the point is to show how flexible the effective batch size becomes under GA.)
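
The relationship is plain arithmetic; for instance, with the numbers from the example above:

gpu_batch = 16                                   # largest batch the GPU can hold
target_batch = 64                                # effective batch we want
accumulation_steps = target_batch // gpu_batch   # 4
print(gpu_batch * accumulation_steps)            # effective batch: 64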

Testing on the MNIST dataset

The official notebook tests GA on the MNIST dataset.

First, import the required packages:

import sys
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torch.nn.functional as F
import numpy as np
import pandas as pd

from time import time
from tqdm import tqdm
from torch.utils.data import DataLoader

Download the MNIST training and test sets:

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True)

testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = DataLoader(testset, batch_size=32, shuffle=False)
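
An optional sanity check on what one mini-batch looks like (this snippet is not in the original notebook):

images, labels = next(iter(trainloader))
print(images.shape, labels.shape)  # torch.Size([32, 1, 28, 28]) torch.Size([32])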

Define a simple network:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x1 = torch.relu(self.fc1(x))
        x2 = torch.relu(self.fc2(x1))
        x3 = torch.relu(self.fc3(x2))
        x4 = self.fc4(x3)
        return x4
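
If you are curious about the model's size, a quick parameter count (again, not part of the original notebook):

net = Net()
print(sum(p.numel() for p in net.parameters()))  # 567434 trainable parameters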

Define an evaluation function:

def evaluate(model):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            inputs, labels = data
            outputs = model(inputs)  # logits from the forward pass
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total

Define a variable accumulation_steps = 4, i.e. gradients are accumulated for 4 steps before each weight update. Note that with the DataLoader's batch_size of 32, this gives an effective batch of 32 * 4 = 128:

accumulation_steps = 4
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

for epoch in range(2):
    net.train()
    running_loss = 0.0
    
    for idx, data in enumerate(trainloader):
        inputs, labels = data
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()  # gradients accumulate into .grad; no weight update yet
        
        if ((idx + 1) % accumulation_steps == 0) or ((idx + 1) == len(trainloader)):  # also flush on the last (possibly partial) window
            optimizer.step()
            optimizer.zero_grad()
            
        running_loss += loss.item()
        
    accuracy = evaluate(net)
        
    print(f"Epoch {epoch + 1}, Loss: {round(running_loss / len(trainloader), 2)}, Accuracy: {accuracy * 100:.2f}%")

For comparison, define a training loop that does not use GA:

net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

for epoch in range(2):
    net.train()
    running_loss = 0.0
    
    for idx, data in enumerate(trainloader):
        optimizer.zero_grad()
        inputs, labels = data
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        
        optimizer.step()            
        running_loss += loss.item()
        
    accuracy = evaluate(net)
        
    print(f"Epoch {epoch + 1}, Loss: {round(running_loss / len(trainloader), 2)}, Accuracy: {accuracy * 100:.2f}%")

If you run the code above, you will find that GA does little for runtime; it may even be slower than the plain training loop. As the author points out, GA's goal is to reduce overall memory usage, not training time.
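
Since the point is memory rather than speed, a more telling measurement is peak GPU memory. A minimal sketch, assuming the model and data have been moved to a CUDA device (the notebook code above runs on the CPU as written):

torch.cuda.reset_peak_memory_stats()
# ... run one of the training loops above ...
print(torch.cuda.max_memory_allocated() / 1024**2, "MiB")  # peak memory since the reset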

