Related links:
Link to the GPT-2 paper
Link to the Microsoft DeepSpeed GPT-2 source code
The GPT-2 code integrated into Microsoft DeepSpeed feels much more readable than the huggingface version. Here it is used only to study the code structure; the model-sharding / parallelism parts are ignored for now.
(Even though that arguably skips over the best part, Orz.)
1. GPT-2 Model Overview
GPT-2 is a pretrained language model released in 2019, trained on over 40 GB of text drawn from roughly 8 million web pages.
GPT-2 can be understood as a stack of transformer decoder blocks, whose input is word embeddings + position embeddings.
A transformer block processes a token as follows: the token first goes through the self-attention layer and is then passed to the feed-forward network layer. Once the first transformer block has processed the token, the resulting vector is passed to the next block in the stack for further computation. Every transformer block performs the same kind of processing, but each block maintains its own weights for its self-attention and feed-forward layers.
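Conceptually, the whole forward pass is just the two embeddings added together, followed by a stack of identical blocks and a final layer norm. Below is a minimal sketch of that flow with hypothetical names (word_emb, pos_emb, blocks, final_ln); the actual DeepSpeed classes follow in Section 2.

# Minimal sketch of the decoder-only forward pass described above.
# word_emb, pos_emb, blocks and final_ln are hypothetical stand-ins for the real modules.
def gpt2_forward(input_ids, position_ids, word_emb, pos_emb, blocks, final_ln):
    x = word_emb(input_ids) + pos_emb(position_ids)  # word + position embeddings
    for block in blocks:                             # each block: self-attention + feed-forward
        x = block(x)
    return final_ln(x)                               # hidden states fed to the LM head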
2. Reading the GPT-2 Code Modules
The GPT-2 code modules are fairly readable; the overall structure is walked through module by module below.
2.1 GPT2Model: the main module
class GPT2Model(torch.nn.Module):
"""GPT-2 Language model.
The output of the forward method is the logits (parallel or
serial depending on the `parallel_output` flag).
"""
def __init__(self,
num_layers,
vocab_size,
hidden_size,
num_attention_heads,
embedding_dropout_prob,
attention_dropout_prob,
output_dropout_prob,
max_sequence_length,
checkpoint_activations,
checkpoint_num_layers=1,
parallel_output=True):
super(GPT2Model, self).__init__()
self.parallel_output = parallel_output
init_method = init_method_normal(std=0.02)
# Word embeddings (parallel).
# Build the word embedding table of shape [vocab_size, hidden_size], used for embedding lookup
self.word_embeddings = mpu.VocabParallelEmbedding(
vocab_size, hidden_size, init_method=init_method)
# Position embedding (serial).
# Position embedding table of shape [max_sequence_length, hidden_size]: one learned (absolute) embedding per position
self.position_embeddings = torch.nn.Embedding(max_sequence_length,
hidden_size)
# Initialize the position embeddings.
init_method(self.position_embeddings.weight)
# Embeddings dropout
self.embedding_dropout = torch.nn.Dropout(embedding_dropout_prob)
# Transformer
# Build the transformer stack (described in detail below)
self.transformer = mpu.GPT2ParallelTransformer(num_layers,  # number of transformer layers
hidden_size,
num_attention_heads,  # number of attention heads
attention_dropout_prob,
output_dropout_prob,
checkpoint_activations,
checkpoint_num_layers)
def forward(self, input_ids, position_ids, attention_mask):
# Embeddings.
# Look up word embeddings from the input token ids
words_embeddings = self.word_embeddings(input_ids)
# Look up position embeddings from the position ids
position_embeddings = self.position_embeddings(position_ids)
# The actual input to the transformer is word embeddings + position embeddings
embeddings = words_embeddings + position_embeddings
# Dropout.
embeddings = self.embedding_dropout(embeddings)
# Transformer.
# Feed the embeddings and the attention mask into the transformer
transformer_output = self.transformer(embeddings, attention_mask)
# Parallel logits.
# Logits computed in model-parallel fashion, reusing the word embedding weights as the output projection
transformer_output_parallel = mpu.copy_to_model_parallel_region(
transformer_output)
logits_parallel = F.linear(transformer_output_parallel,
self.word_embeddings.weight)
if self.parallel_output:
return logits_parallel
return mpu.gather_from_model_parallel_region(logits_parallel)
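Note the last step: instead of a separate output layer, the logits reuse the word-embedding matrix as the output projection (weight tying). A minimal non-parallel sketch of just that projection, with made-up toy shapes:

import torch
import torch.nn.functional as F

b, s, h, vocab_size = 2, 8, 16, 100
word_embeddings = torch.nn.Embedding(vocab_size, h)
transformer_output = torch.randn(b, s, h)

# Tied output projection: [b, s, h] x [vocab_size, h]^T -> [b, s, vocab_size]
logits = F.linear(transformer_output, word_embeddings.weight)
print(logits.shape)  # torch.Size([2, 8, 100])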
2.2 GPT2Transformer module
The GPT2ParallelTransformer module lives in mpu/transformer.py. mpu is the model-parallel framework, which wraps the parallel training code for both BERT and GPT-2.
Here we only look at the parts relevant to the underlying principles and ignore the parallelism for now.
This module is the main body of the model: it packs n transformer blocks together, i.e. n transformer layers followed by a final layer norm.
The code for a single transformer layer is covered in Section 2.3.
class GPT2ParallelTransformer(torch.nn.Module):
"""GPT-2 transformer.
This module takes input from the embedding layer and its output can
be used directly by a logit layer. It consists of L (num-layers)
blocks of:
layer norm
self attention
residual connection
layer norm
mlp
residual connection
followed by a final layer norm.
Arguments:
num_layers: Number of transformer layers.
hidden_size: The hidden size of the self attention.
num_attention_heads: number of attention heads in the self
attention.
attention_dropout_prob: dropout probability of the attention
score in self attention.
output_dropout_prob: dropout probability for the outputs
after self attention and final output.
checkpoint_activations: if True, checkpoint activations.
checkpoint_num_layers: number of layers to checkpoint. This
is basically the chunk size in checkpointing.
layernorm_epsilon: epsilon used in layernorm to avoid
division by zero.
init_method_std: standard deviation of the init method which has
the form N(0, std).
use_scaled_init_for_output_weights: If True, use 1/sqrt(2*num_layers)
scaling for the output weights (
output of self attention and mlp).
"""
def __init__(self,
num_layers,
hidden_size,
num_attention_heads,
attention_dropout_prob,
output_dropout_prob,
checkpoint_activations,
checkpoint_num_layers=1,
layernorm_epsilon=1.0e-5,
init_method_std=0.02,
use_scaled_init_for_output_weights=True,
sparse_attention_config=None,
max_seq_length=None):
super(GPT2ParallelTransformer, self).__init__()
# Store activation checkpointing flag.
self.checkpoint_activations = checkpoint_activations
self.checkpoint_num_layers = checkpoint_num_layers
output_layer_init_method = None
if use_scaled_init_for_output_weights:
output_layer_init_method = scaled_init_method(init_method_std,
num_layers)
# Returns a single transformer layer (described in detail below)
def get_layer():
return GPT2ParallelTransformerLayer(
hidden_size,
num_attention_heads,
attention_dropout_prob,
output_dropout_prob,
layernorm_epsilon,
unscaled_init_method(init_method_std),
output_layer_init_method=output_layer_init_method,
sparse_attention_config=sparse_attention_config,
max_seq_length=max_seq_length)
# Transformer layers.
# Build num_layers transformer layers
self.layers = torch.nn.ModuleList(
[get_layer() for _ in range(num_layers)])
# Final layer norm before output.
self.final_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)
if deepspeed.checkpointing.is_configured():
global get_cuda_rng_tracker, checkpoint
get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
checkpoint = deepspeed.checkpointing.checkpoint
def forward(self, hidden_states, attention_mask):
def custom(start, end):
# custom(start, end) builds a forward function over the slice of layers [start, end), so that activation checkpointing can recompute them chunk by chunk
def custom_forward(*inputs):
layers_ = self.layers[start:end]
x_ = inputs[0]
for layer in layers_:
x_ = layer(x_, inputs[1])
return x_
return custom_forward
if self.checkpoint_activations:
l = 0
num_layers = len(self.layers)
chunk_length = self.checkpoint_num_layers
while l < num_layers:
hidden_states = checkpoint(custom(l, l+chunk_length),
hidden_states, attention_mask)
l += chunk_length
else:
# Without activation checkpointing, simply iterate over all the layers: each layer's input is the previous layer's output, and the attention mask is applied inside each layer
for layer in self.layers:
hidden_states = layer(hidden_states, attention_mask)
# Final layer norm.
output = self.final_layernorm(hidden_states)
return output
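The checkpoint_activations branch trades compute for memory: the layers are grouped into chunks of checkpoint_num_layers, only the activations at chunk boundaries are kept, and everything inside a chunk is recomputed during the backward pass. A rough sketch of the same chunking idea using torch.utils.checkpoint instead of DeepSpeed's checkpoint (assuming layers is a ModuleList of blocks taking (x, mask)):

import torch
from torch.utils.checkpoint import checkpoint

def checkpointed_forward(layers, hidden_states, attention_mask, chunk_size=1):
    def run_chunk(start, end):
        # Forward over layers[start:end] as one checkpointed unit.
        def custom_forward(x, mask):
            for layer in layers[start:end]:
                x = layer(x, mask)
            return x
        return custom_forward

    for start in range(0, len(layers), chunk_size):
        hidden_states = checkpoint(run_chunk(start, start + chunk_size),
                                   hidden_states, attention_mask)
    return hidden_states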
2.3 GPT2TransformerLayer module
The GPT2ParallelTransformerLayer module is a single transformer block, containing LayerNorm, self-attention, residual additions, and an MLP.
The forward method shows the whole computation step by step.
The self-attention and MLP here are two separate modules; their code is covered in Sections 2.4 and 2.5.
class GPT2ParallelTransformerLayer(torch.nn.Module):
"""A single layer transformer for GPT2.
We use the following notation:
h: hidden size
n: number of attention heads
b: batch size
s: sequence length
The transformer layer takes input with size [b, s, h] and returns an
output of the same size.
Arguments:
hidden_size: The hidden size of the self attention.
num_attention_heads: number of attention heads in the self
attention.
attention_dropout_prob: dropout probability of the attention
score in self attention.
output_dropout_prob: dropout probability for the outputs
after self attention and final output.
layernorm_epsilon: epsilon used in layernorm to avoid
division by zero.
init_method: initialization method used for the weights. Note
that all biases are initialized to zero and
layernorm weights are initialized to one.
output_layer_init_method: output layers (attention output and
mlp output) initialization. If None,
use `init_method`.
"""
def __init__(self,
hidden_size,
num_attention_heads,
attention_dropout_prob,
output_dropout_prob,
layernorm_epsilon,
init_method,
output_layer_init_method=None,
sparse_attention_config=None,
max_seq_length=None):
super(GPT2ParallelTransformerLayer, self).__init__()
# Set output layer initialization if not provided.
if output_layer_init_method is None:
output_layer_init_method = init_method
# Layernorm on the input data.
self.input_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)
# Self attention.
self.attention = GPT2ParallelSelfAttention(
hidden_size,
num_attention_heads,
attention_dropout_prob,
output_dropout_prob,
init_method,
output_layer_init_method=output_layer_init_method,
sparse_attention_config=sparse_attention_config,
max_seq_length=max_seq_length)
# Layernorm after the self attention.
self.post_attention_layernorm = LayerNorm(hidden_size,
eps=layernorm_epsilon)
# MLP
self.mlp = GPT2ParallelMLP(
hidden_size,
output_dropout_prob,
init_method,
output_layer_init_method=output_layer_init_method)
def forward(self, hidden_states, ltor_mask):
# hidden_states: [b, s, h], the previous layer's output, used as this layer's input
# ltor_mask: [1, 1, s, s], the left-to-right attention mask
# Layer norm at the beginning of the transformer layer.
# First apply LayerNorm to the input
layernorm_output = self.input_layernorm(hidden_states)
# Self attention.
# Compute self-attention; layernorm_output: [b, s, h]
# ltor_mask: [1, 1, s, s] is 4-D because the tensors become 4-D once the heads are split inside the attention module
attention_output = self.attention(layernorm_output, ltor_mask)
# Residual connection.
# Residual connection: add the attention output back to the input
layernorm_input = hidden_states + attention_output
# Layer norm post the self attention.
# Apply LayerNorm to the result
layernorm_output = self.post_attention_layernorm(layernorm_input)
# MLP.
# Nonlinear transformation: h -> 4*h -> h
mlp_output = self.mlp(layernorm_output)
# Second residual connection.
output = layernorm_input + mlp_output
return output
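So this is a pre-LayerNorm block: normalization happens before each sub-layer, and the residual adds the un-normalized input. Stripped of the mpu classes, the forward reduces to the following sketch, where ln1, attn, ln2 and mlp stand for plain modules with the shapes described above:

def pre_ln_block(x, ltor_mask, ln1, attn, ln2, mlp):
    x = x + attn(ln1(x), ltor_mask)  # LayerNorm -> self-attention -> residual add
    x = x + mlp(ln2(x))              # LayerNorm -> MLP -> residual add
    return x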
2.4 GPT2SelfAttention module
This is GPT-2's self-attention module. We again ignore the model-parallel parts and read the code purely in terms of the underlying principles (though the bits involving parallel sharding still deserve a quick look).
This is the core of the model: the key points are the multi-head self-attention computation and the attention mask. The comments here are fairly detailed; reading them alongside the original authors' comments makes everything easier to follow.
class GPT2ParallelSelfAttention(torch.nn.Module):
"""Parallel self-attention layer for GPT2.
Self-attention layer takes input with size [b, s, h] where b is
the batch size, s is the sequence length, and h is the hidden size
and creates output of the same size.
Arguments:
hidden_size: total hidden size of the layer (h).
num_attention_heads: number of attention heads (n). Note that we
require n to be divisible by number of GPUs
used to parallelize the model. Also, we
require hidden size to be divisible by n.
dropout_prob: dropout probability for the attention scores.
init_method: weight initialization.
output_layer_init_method: output layer initialization. If None, use
`init_method`.
We use the following notation:
h: hidden_size
n: num_attention_heads
p: number of partitions (p=1 when the model is not sharded)
np: n/p, which equals n when p=1
hp: h/p, which equals h when p=1
hn: h/n, the hidden size of each attention head
b: batch size
s: sequence length
"""
def __init__(self, hidden_size, num_attention_heads,
attention_dropout_prob, output_dropout_prob,
init_method, output_layer_init_method=None):
super(GPT2ParallelSelfAttention, self).__init__()
# Set output layer initialization if not provided.
if output_layer_init_method is None:
output_layer_init_method = init_method
# Per attention head and per partition values.
world_size = get_model_parallel_world_size()
# Model sharding: when the model is not sharded, world_size=1 and nothing changes
self.hidden_size_per_partition = divide(hidden_size, world_size)
# Compute the hidden size of each attention head.
# For example, with hidden_size=256 and 8 attention heads,
# each head gets a hidden size of 256/8 = 32.
# (This is just the multi-head attention definition; it has nothing to do with model sharding.)
self.hidden_size_per_attention_head = divide(hidden_size,
num_attention_heads)
# Number of attention heads per model partition; with world_size=1 this is just num_attention_heads.
# If world_size=2, i.e. the model is split into 2 shards running on 2 GPUs,
# each shard handles num_attention_heads/2 heads.
self.num_attention_heads_per_partition = divide(num_attention_heads,
world_size)
# Strided linear layer.
# ColumnParallelLinear is a linear layer that supports model sharding; essentially it is just y = x * W + b.
# It projects the input from hidden_size to 3*hidden_size.
# When the model is not sharded, the stride and gather_output arguments have no effect.
# (Knowing what the operation does is enough; we won't dig into it here.)
self.query_key_value = ColumnParallelLinear(hidden_size, 3*hidden_size,
stride=3,
gather_output=False,
init_method=init_method)
# Dropout. Note that for a single iteration, this layer will generate
# different outputs on different number of parallel partitions but
# on average it should not be partition dependent.
# Dropout on the attention probabilities. As the authors note above, within a single iteration
# the dropout results differ across model partitions, but on average the behaviour should not depend on how the model is partitioned.
self.attention_dropout = torch.nn.Dropout(attention_dropout_prob)
# Output.
# Output projection: a linear transform with a weight of shape [h, h]
self.dense = RowParallelLinear(hidden_size,
hidden_size,
input_is_parallel=True,
init_method=output_layer_init_method)
self.output_dropout = torch.nn.Dropout(output_dropout_prob)
if deepspeed.checkpointing.is_configured():
global get_cuda_rng_tracker, checkpoint
get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
checkpoint = deepspeed.checkpointing.checkpoint
def _transpose_for_scores(self, tensor):
"""Transpose a 3D tensor [b, s, np*hn] into a 4D tensor with
size [b, np, s, hn].
When the model is not sharded, np=n and hn=h/n, so the 3D tensor is really [b, s, h]; it is split by the number of attention heads into [b, np, s, hn].
"""
# First compute the target shape: (b, s) + (np, hn) = (b, s, np, hn)
new_tensor_shape = tensor.size()[:-1] + \
(self.num_attention_heads_per_partition,
self.hidden_size_per_attention_head)
# Reshape the tensor into the target shape, then swap the s and np dimensions
tensor = tensor.view(*new_tensor_shape)
return tensor.permute(0, 2, 1, 3)
def forward(self, hidden_states, ltor_mask):
# hidden_states: [b, s, h]
# ltor_mask: [1, 1, s, s]
# Attention heads. [b, s, hp]
# With p=1 this is just the usual [b, s, h]
# query_key_value projection: [b, s, h] -> [b, s, 3*h]
# Since this is self-attention, the query, key and value projections all multiply the same hidden states: [b, s, h] * [h, 3h] -> [b, s, 3*h]
# This is equivalent to concatenating the q/k/v weight matrices and doing a single matmul; the last dimension is split afterwards
mixed_x_layer = self.query_key_value(hidden_states)
# split_tensor_along_last_dim splits the last dimension evenly into n chunks, here 3
# This separates q, k and v from the combined tensor computed above
# q, k and v each have shape [b, s, h]
(mixed_query_layer,
mixed_key_layer,
mixed_value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3)
# Reshape and transpose [b, np, s, hn]
# Split q, k and v by the number of attention heads; np * hn = h (when p=1)
query_layer = self._transpose_for_scores(mixed_query_layer)
key_layer = self._transpose_for_scores(mixed_key_layer)
value_layer = self._transpose_for_scores(mixed_value_layer)
# Raw attention scores. [b, np, s, s]
# q * k^T gives the raw attention scores
attention_scores = torch.matmul(query_layer,
key_layer.transpose(-1, -2))
# q * k / sqrt(hn)
attention_scores = attention_scores / math.sqrt(
self.hidden_size_per_attention_head)
# Apply the left to right attention mask.
# Apply the mask; note that attention_scores has shape [b, np, s, s]
# ltor_mask has shape [1, 1, s, s]: a matrix with 0s in the upper triangle and 1s on and below the diagonal
# The two are combined via a Hadamard (element-wise) product
# Only the attention scores for positions up to the current token are kept; later positions are set to -10000, i.e. a very small value
attention_scores = torch.mul(attention_scores, ltor_mask) - \
10000.0 * (1.0 - ltor_mask)
# Attention probabilities. [b, np, s, s]
# Softmax over the last dimension gives the attention probabilities for each position
attention_probs = torch.nn.Softmax(dim=-1)(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
# Authors' note above: attention dropout drops entire tokens from being attended to, which may look a bit unusual, but it follows the original Transformer paper. (Leaving this as an open question for now.)
with get_cuda_rng_tracker().fork():
attention_probs = self.attention_dropout(attention_probs)
# Context layer.
# [b, np, s, hn]
# Weighted sum over the values: [b, np, s, s] * [b, np, s, hn] -> [b, np, s, hn]
context_layer = torch.matmul(attention_probs, value_layer)
# [b, s, np, hn]
# Merge the heads back: first permute the dimensions back
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
# Then compute the merged shape: (b, s) + (h,) = (b, s, h), still assuming no model sharding (p=1)
new_context_layer_shape = context_layer.size()[:-2] + \
(self.hidden_size_per_partition,)
# [b, s, hp]: reshape into the target shape
context_layer = context_layer.view(*new_context_layer_shape)
# Output. [b, s, h]
# Pass the result through a dense layer and dropout
# The dense layer is the RowParallelLinear defined above; with no sharding this is just [b, s, h] * [h, h] -> [b, s, h]
output = self.dense(context_layer)
output = self.output_dropout(output)
return output
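The ltor_mask itself is built outside this module, in the pretraining data pipeline. A lower-triangular matrix such as the one below reproduces the masking behaviour described in the comments; this is only a small illustrative sketch, not the original get_batch code:

import torch

s = 4
# Causal (left-to-right) mask: 1 where attention is allowed, 0 elsewhere.
ltor_mask = torch.tril(torch.ones(1, 1, s, s))

scores = torch.randn(1, 1, s, s)  # toy attention scores
masked = scores * ltor_mask - 10000.0 * (1.0 - ltor_mask)
probs = torch.softmax(masked, dim=-1)
print(probs[0, 0])  # each row only puts weight on positions <= the current one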
2.5 GPT2MLP module
The GPT-2 MLP module simply applies a nonlinear transformation along the hidden dimension: h -> 4h -> h.
class GPT2ParallelMLP(torch.nn.Module):
"""MLP for GPT2.
MLP will take the input with h hidden state, project it to 4*h
hidden dimension, perform gelu transformation, and project the
state back into h hidden dimension. At the end, dropout is also
applied.
Arguments:
hidden_size: The hidden size of the self attention.
output_dropout_prob: dropout probability for the outputs
after self attention and final output.
init_method: initialization method used for the weights. Note
that all biases are initialized to zero and
layernorm weights are initialized to one.
output_layer_init_method: output layer initialization. If None,
use `init_method`.
"""
def __init__(self, hidden_size, output_dropout_prob, init_method,
output_layer_init_method=None):
super(GPT2ParallelMLP, self).__init__()
# Set output layer initialization if not provided.
if output_layer_init_method is None:
output_layer_init_method = init_method
# Project to 4h.
# As above, don't be put off by the name: this is just y = x*W^T + b
# [b, s, h] * [4h, h]^T -> [b, s, 4h]
# i.e. the weight has shape [output, input]; if the model is sharded, the output dimension is partitioned
self.dense_h_to_4h = ColumnParallelLinear(hidden_size, 4*hidden_size,
gather_output=False,
init_method=init_method)
# Project back to h.
# y = x*W^T + b: [b, s, 4h] * [h, 4h]^T -> [b, s, h]
# i.e. the weight has shape [output, input]; if the model is sharded, the input dimension is partitioned
self.dense_4h_to_h = RowParallelLinear(
4*hidden_size,
hidden_size,
input_is_parallel=True,
init_method=output_layer_init_method)
self.dropout = torch.nn.Dropout(output_dropout_prob)
def forward(self, hidden_states):
# [b, s, 4hp]
intermediate_parallel = self.dense_h_to_4h(hidden_states)
intermediate_parallel = gelu(intermediate_parallel)
# [b, s, h]
output = self.dense_4h_to_h(intermediate_parallel)
output = self.dropout(output)
return output
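Without model sharding, the two parallel linear layers collapse to ordinary torch.nn.Linear layers; a minimal non-parallel sketch of the same h -> 4h -> gelu -> h -> dropout pipeline looks like this:

import torch
import torch.nn.functional as F

class PlainGPT2MLP(torch.nn.Module):
    """Non-parallel sketch of GPT2ParallelMLP."""
    def __init__(self, hidden_size, output_dropout_prob):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)
        self.dropout = torch.nn.Dropout(output_dropout_prob)

    def forward(self, hidden_states):
        x = F.gelu(self.dense_h_to_4h(hidden_states))  # [b, s, h] -> [b, s, 4h]
        x = self.dense_4h_to_h(x)                      # [b, s, 4h] -> [b, s, h]
        return self.dropout(x)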
That covers the code for the main structure of the GPT-2 model.
3. GPT-2 Model Pretraining
Next, let's look at the forward step to see how the loss is constructed when pretraining GPT-2.
This code lives in pretrain_gpt2.py.
3.1 GPT-2 pretraining - building the model
def get_model(args):
"""Build the model."""
print_rank_0('building GPT2 model ...')
# This builds the GPT2Model described in detail in Section 2
model = GPT2Model(num_layers=args.num_layers,
vocab_size=args.vocab_size,
hidden_size=args.hidden_size,
num_attention_heads=args.num_attention_heads,
embedding_dropout_prob=args.hidden_dropout,
attention_dropout_prob=args.attention_dropout,
output_dropout_prob=args.hidden_dropout,
max_sequence_length=args.max_position_embeddings,
checkpoint_activations=args.checkpoint_activations,
checkpoint_num_layers=args.checkpoint_num_layers,
parallel_output=True)
if mpu.get_data_parallel_rank() == 0:
print(' > number of parameters on model parallel rank {}: {}'.format(
mpu.get_model_parallel_rank(),
sum([p.nelement() for p in model.parameters()])), flush=True)
# To prevent OOM for model sizes that cannot fit in GPU memory in full precision
# When using DeepSpeed with fp16,
# fp32 is kept only for the weight updates, while the expensive forward and backward passes run in fp16
# half() converts the model's float32 parameters to float16
if args.deepspeed and args.fp16:
model.half()
# GPU allocation.
# Explicitly move the model onto the GPU
model.cuda(torch.cuda.current_device())
# Fp16 conversion.
# Mixed-precision (fp16) training saves a significant amount of memory; that deserves its own walkthrough, so we won't expand on it here
if args.fp16:
model = FP16_Module(model)
# Wrap model for distributed training.
if USE_TORCH_DDP:
i = torch.cuda.current_device()
model = DDP(model, device_ids=[i], output_device=i,
process_group=mpu.get_data_parallel_group())
else:
model = DDP(model)
return model
3.2 GPT-2 pretraining - forward
This is the forward step used during pretraining.
The process is simple: run the model forward to get the GPT-2 output, then compute the loss.
The input is sentence[:-1] and the true labels are sentence[1:], i.e. for an input of length seq_len, tokens 1 to seq_len-1 are the input and tokens 2 to seq_len are the labels.
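A tiny sketch of that shift on a toy sequence (in the real code, get_batch takes care of this):

import torch

sentence = torch.tensor([101, 7, 42, 13, 9])  # toy token ids, seq_len = 5
tokens = sentence[:-1]   # model input:  [101, 7, 42, 13]
labels = sentence[1:]    # targets:      [  7, 42, 13,  9]
# At position i the model sees tokens[:i+1] and is trained to predict labels[i].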
def forward_step(data_iterator, model, args, timers):
"""Forward step."""
# Get the batch.
timers('batch generator').start()
tokens, labels, loss_mask, attention_mask, position_ids = get_batch(
data_iterator, args, timers)
timers('batch generator').stop()
# Forward model.
# output shape = [b, s, vocab_size]
# The hidden state at each position along seq_len can be read as the model's prediction for the next token given all preceding tokens
output = model(tokens, position_ids, attention_mask)
# Compute the (vocab-parallel) cross-entropy between the output logits and the labels
losses = mpu.vocab_parallel_cross_entropy(output.contiguous().float(),
labels)
# loss_mask masks out the end-of-text tokens so they don't contribute to the loss
loss_mask = loss_mask.view(-1)
loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
return loss
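When the vocabulary is not sharded, mpu.vocab_parallel_cross_entropy plus the mask reduces to an ordinary per-token cross-entropy averaged over the unmasked positions. A minimal sketch with plain PyTorch and toy shapes:

import torch
import torch.nn.functional as F

b, s, vocab_size = 2, 4, 10
output = torch.randn(b, s, vocab_size)        # model logits
labels = torch.randint(0, vocab_size, (b, s))
loss_mask = torch.ones(b, s)                  # 0 at positions that should be ignored

losses = F.cross_entropy(output.view(-1, vocab_size), labels.view(-1),
                         reduction='none')    # per-token losses
loss = (losses * loss_mask.view(-1)).sum() / loss_mask.sum()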
References
- 完全图解GPT-2:看完这篇就够了(一) (an illustrated walkthrough of GPT-2)
- 预训练模型专题_GPT2_模型代码学习笔记 (reading notes on the huggingface GPT-2 code, worth studying alongside this post)