Notes on reading the GPT-2 model source code in Microsoft DeepSpeed


Related links:
GPT-2 paper
Microsoft DeepSpeed GPT-2 source code

The GPT-2 code integrated in Microsoft DeepSpeed reads much more cleanly than the Hugging Face implementation. These notes only walk through the code structure and ignore the model-parallel (sharding) parts for now.

(Even though that means skipping what is arguably the best part, orz.)

1. GPT-2 model overview

GPT-2 is a pre-trained language model released in 2019; it was trained on over 40 GB of web text drawn from roughly 8 million web pages.

GPT-2 can be understood as a stack of transformer decoder blocks; its input is word embeddings + position embeddings.
A transformer block processes a token as follows: the token first goes through the self-attention layer and is then passed to the feed-forward neural network layer. Once the first transformer block has processed the token, its result vector is passed to the next transformer block in the stack, which continues the computation. Every transformer block performs the same kind of processing, but each block maintains its own weights in its self-attention layer and its feed-forward network.
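
To make that data flow concrete, here is a minimal, self-contained sketch in plain PyTorch. This is my own toy example with a placeholder block, not the DeepSpeed code; the real block internals are covered in sections 2.3-2.5.

import torch

class SketchBlock(torch.nn.Module):
    """Stand-in for one decoder block (self-attention + MLP); see sections 2.3-2.5."""
    def __init__(self, hidden_size):
        super().__init__()
        self.ffn = torch.nn.Linear(hidden_size, hidden_size)  # placeholder computation

    def forward(self, hidden_states, mask=None):
        return hidden_states + self.ffn(hidden_states)

class TinyGPT2Sketch(torch.nn.Module):
    def __init__(self, vocab_size, max_seq_length, hidden_size, num_layers):
        super().__init__()
        self.word_embeddings = torch.nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = torch.nn.Embedding(max_seq_length, hidden_size)
        # Every block has the same structure but its own, independent weights.
        self.blocks = torch.nn.ModuleList(
            [SketchBlock(hidden_size) for _ in range(num_layers)])

    def forward(self, input_ids, position_ids, mask=None):
        # Input = word embeddings + position embeddings
        x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
        for block in self.blocks:        # each block's output feeds the next block
            x = block(x, mask)
        return x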

2. Reading the GPT-2 code modules

The GPT-2 code is organized into fairly readable modules; the overall structure (GPT2Model → GPT2ParallelTransformer → GPT2ParallelTransformerLayer → GPT2ParallelSelfAttention / GPT2ParallelMLP) is walked through module by module below.


2.1 GPT2Model: the main module


class GPT2Model(torch.nn.Module):
    """GPT-2 Language model.
    The output of the forward method are the logits (parallel or
    serial depending on the `parallel_output` flag.
    """

    def __init__(self,
                 num_layers,
                 vocab_size,
                 hidden_size,
                 num_attention_heads,
                 embedding_dropout_prob,
                 attention_dropout_prob,
                 output_dropout_prob,
                 max_sequence_length,
                 checkpoint_activations,
                 checkpoint_num_layers=1,
                 parallel_output=True):

        super(GPT2Model, self).__init__()

        self.parallel_output = parallel_output

        init_method = init_method_normal(std=0.02)

        # Word embeddings (parallel).
        # Build the word embedding table; its shape is vocab_size * hidden_size, used for embedding lookup
        self.word_embeddings = mpu.VocabParallelEmbedding(
            vocab_size, hidden_size, init_method=init_method)

        # Position embedding (serial).
        # Position embeddings; the shape is max_sequence_length * hidden_size, one lookup embedding per position. This is an absolute (learned) position encoding.
        self.position_embeddings = torch.nn.Embedding(max_sequence_length,
                                                      hidden_size)
        # Initialize the position embeddings.
        init_method(self.position_embeddings.weight)

        # Embeddings dropout
        self.embedding_dropout = torch.nn.Dropout(embedding_dropout_prob)

        # Transformer
        # Build the transformer stack (described in detail below)
        self.transformer = mpu.GPT2ParallelTransformer(num_layers,  # number of transformer layers
                                                       hidden_size, 
                                                       num_attention_heads,  # number of attention heads
                                                       attention_dropout_prob, 
                                                       output_dropout_prob,
                                                       checkpoint_activations,
                                                       checkpoint_num_layers)

    def forward(self, input_ids, position_ids, attention_mask):

        # Embeddings.
        # Look up word embeddings from the input token ids
        words_embeddings = self.word_embeddings(input_ids)
        # Look up position embeddings from the position ids
        position_embeddings = self.position_embeddings(position_ids)
        # The actual model input is the sum of the token and position embeddings
        embeddings = words_embeddings + position_embeddings

        # Dropout.
        embeddings = self.embedding_dropout(embeddings)

        # Transformer.
        # The embeddings and the attention mask are the inputs to the transformer stack
        transformer_output = self.transformer(embeddings, attention_mask)

        # Parallel logits.
        # Logits computed in parallel (the output projection reuses the word embedding weights)
        transformer_output_parallel = mpu.copy_to_model_parallel_region(
            transformer_output)
        logits_parallel = F.linear(transformer_output_parallel,
                                   self.word_embeddings.weight)

        if self.parallel_output:
            return logits_parallel

        return mpu.gather_from_model_parallel_region(logits_parallel)
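
One detail worth calling out in forward: there is no separate output projection matrix. The logits are obtained by multiplying the transformer output with the word-embedding weights again (weight tying). A quick toy illustration of just that last step, with made-up sizes and no model parallelism (my own example, not code from the repository):

import torch
import torch.nn.functional as F

batch, seq_len, hidden, vocab = 2, 8, 32, 100        # toy sizes
word_embeddings = torch.nn.Embedding(vocab, hidden)

transformer_output = torch.randn(batch, seq_len, hidden)   # pretend transformer output

# Tied output projection: [b, s, h] x [vocab, h]^T -> [b, s, vocab]
logits = F.linear(transformer_output, word_embeddings.weight)
print(logits.shape)   # torch.Size([2, 8, 100])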

2.2 GPT2Transformer module

The GPT2ParallelTransformer module lives in mpu/transformer.py; mpu is the model-parallel framework, which contains the parallel-training code for both BERT and GPT-2.
Here we only look at the parts related to how the model works and keep ignoring the parallelism.

This module is the backbone of the model: it packs n transformer blocks together, i.e. it consists of n transformer layers followed by a final LayerNorm.

The code for a single transformer layer is covered in section 2.3.

class GPT2ParallelTransformer(torch.nn.Module):
    """GPT-2 transformer.

    This module takes input from embedding layer and it's output can
    be used directly by a logit layer. It consists of L (num-layers)
    blocks of:
        layer norm
        self attention
        residual connection
        layer norm
        mlp
        residual connection
    followed by a final layer norm.

    Arguments:
        num_layers: Number of transformer layers.
        hidden_size: The hidden size of the self attention.
        num_attention_heads: number of attention head in the self
                             attention.
        attention_dropout_prob: dropout probability of the attention
                                score in self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        checkpoint_activations: if True, checkpoint activations.
        checkpoint_num_layers: number of layers to checkpoint. This
                               is basically the chunk size in checkpoitning.
        layernorm_epsilon: epsilon used in layernorm to avoid
                           division by zero.
        init_method_std: standard deviation of the init method which has
                         the form N(0, std).
        use_scaled_init_for_output_weights: If Ture use 1/sqrt(2*num_layers)
                                            scaling for the output weights (
                                            output of self attention and mlp).
    """
    def __init__(self,
                 num_layers,
                 hidden_size,
                 num_attention_heads,
                 attention_dropout_prob,
                 output_dropout_prob,
                 checkpoint_activations,
                 checkpoint_num_layers=1,
                 layernorm_epsilon=1.0e-5,
                 init_method_std=0.02,
                 use_scaled_init_for_output_weights=True,
                 sparse_attention_config=None,
                 max_seq_length=None):
        super(GPT2ParallelTransformer, self).__init__()
        # Store activation checkpoiting flag.
        self.checkpoint_activations = checkpoint_activations
        self.checkpoint_num_layers = checkpoint_num_layers

        output_layer_init_method = None
        if use_scaled_init_for_output_weights:
            output_layer_init_method = scaled_init_method(init_method_std,
                                                          num_layers)
		
        # Build and return a single transformer layer (described in detail in section 2.3)
        def get_layer():
            return GPT2ParallelTransformerLayer(
                hidden_size,
                num_attention_heads,
                attention_dropout_prob,
                output_dropout_prob,
                layernorm_epsilon,
                unscaled_init_method(init_method_std),
                output_layer_init_method=output_layer_init_method,
                sparse_attention_config=sparse_attention_config,
                max_seq_length=max_seq_length)

        # Transformer layers.
        # Build num_layers transformer layers
        self.layers = torch.nn.ModuleList(
            [get_layer() for _ in range(num_layers)])

        # Final layer norm before output.
        self.final_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)

        if deepspeed.checkpointing.is_configured():
            global get_cuda_rng_tracker, checkpoint
            get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
            checkpoint = deepspeed.checkpointing.checkpoint


    def forward(self, hidden_states, attention_mask):

        def custom(start, end):
            # custom(start, end) builds a forward function over layers [start, end); it is used below with checkpoint() for activation (gradient) checkpointing, so these activations can be recomputed in the backward pass
            def custom_forward(*inputs):
                layers_ = self.layers[start:end]
                x_ = inputs[0]
                for layer in layers_:
                    x_ = layer(x_, inputs[1])
                return x_
            return custom_forward

        if self.checkpoint_activations:
            l = 0
            num_layers = len(self.layers)
            chunk_length = self.checkpoint_num_layers
            while l < num_layers:
                hidden_states = checkpoint(custom(l, l+chunk_length),
                                           hidden_states, attention_mask)
                l += chunk_length
        else:
            # Without activation checkpointing, simply iterate over all the layers: each layer's input is the previous layer's output, and the attention mask is applied inside each layer
            for layer in self.layers:
                hidden_states = layer(hidden_states, attention_mask)

        # Final layer norm.
        output = self.final_layernorm(hidden_states)

        return output
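
The checkpoint_activations branch trades compute for memory: the layers are processed in chunks of checkpoint_num_layers, only the chunk boundaries keep their activations, and everything inside a chunk is recomputed during the backward pass. A rough standalone equivalent using torch.utils.checkpoint, ignoring DeepSpeed's RNG tracking, so this is only a sketch of the idea and not the DeepSpeed code:

import torch
from torch.utils.checkpoint import checkpoint

def run_with_activation_checkpointing(layers, hidden_states, attention_mask, chunk_size=1):
    """Run the layer stack chunk by chunk, recomputing in-chunk activations on backward.

    Sketch only: the real code uses deepspeed.checkpointing.checkpoint so that
    dropout RNG state is handled correctly under model parallelism.
    """
    def make_chunk_forward(start, end):
        def chunk_forward(x, mask):
            for layer in layers[start:end]:
                x = layer(x, mask)
            return x
        return chunk_forward

    for start in range(0, len(layers), chunk_size):
        # Only the inputs/outputs of each chunk are stored; the rest is recomputed.
        hidden_states = checkpoint(make_chunk_forward(start, start + chunk_size),
                                   hidden_states, attention_mask)
    return hidden_states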

2.3 GPT2TransformerLayer module

The GPT2ParallelTransformerLayer module is a single transformer block, containing LayerNorm, self-attention, residual additions and an MLP.
The forward method lays out the whole computation clearly.
The self-attention and the MLP here are separate modules; their code is covered in sections 2.4 and 2.5.

class GPT2ParallelTransformerLayer(torch.nn.Module):
    """A single layer transformer for GPT2.

    We use the following notation:
        h: hidden size
        n: number of attention heads
        b: batch size
        s: sequence length
    Transformore layer takes input with size [b, s, h] and returns an
    output of the same size.

    Arguments:
        hidden_size: The hidden size of the self attention.
        num_attention_heads: number of attention head in the self
                             attention.
        attention_dropout_prob: dropout probability of the attention
                                score in self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        layernorm_epsilon: epsilon used in layernorm to avoid
                           division by zero.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weight are initialized to one.
        output_layer_init_method: output layers (attention output and
                                  mlp output) initialization. If None,
                                  use `init_method`.
    """
    def __init__(self,
                 hidden_size,
                 num_attention_heads,
                 attention_dropout_prob,
                 output_dropout_prob,
                 layernorm_epsilon,
                 init_method,
                 output_layer_init_method=None,
                 sparse_attention_config=None,
                 max_seq_length=None):
        super(GPT2ParallelTransformerLayer, self).__init__()
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method

        # Layernorm on the input data.
        self.input_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)

        # Self attention.
        self.attention = GPT2ParallelSelfAttention(
            hidden_size,
            num_attention_heads,
            attention_dropout_prob,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method,
            sparse_attention_config=sparse_attention_config,
            max_seq_length=max_seq_length)

        # Layernorm after the self attention.
        self.post_attention_layernorm = LayerNorm(hidden_size,
                                                  eps=layernorm_epsilon)

        # MLP
        self.mlp = GPT2ParallelMLP(
            hidden_size,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method)

    def forward(self, hidden_states, ltor_mask):
        # hidden_states: [b, s, h]  the previous layer's output is this layer's input
        # ltor_mask: [1, 1, s, s]  the (left-to-right) attention mask

        # Layer norm at the beginning of the transformer layer.
        # First apply a LayerNorm to the input
        layernorm_output = self.input_layernorm(hidden_states)
        # Self attention.
        # Compute self-attention; layernorm_output: [b, s, h]
        # ltor_mask: [1, 1, s, s] is 4-D because the attention module splits the tensor into multiple heads internally, which makes the scores 4-D
        attention_output = self.attention(layernorm_output, ltor_mask)
        # Residual connection.
        # Residual connection: add the attention output to the block input
        layernorm_input = hidden_states + attention_output
        # Layer norm post the self attention.
        # Apply LayerNorm to the attention output
        layernorm_output = self.post_attention_layernorm(layernorm_input)
        # MLP.
        # Non-linear transformation in the MLP: h -> 4*h -> h
        mlp_output = self.mlp(layernorm_output)
        # Second residual connection.
        output = layernorm_input + mlp_output

        return output
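
So the layout is pre-LayerNorm: LayerNorm → self-attention → residual add, then LayerNorm → MLP → residual add. For comparison, the same flow written with stock PyTorch modules instead of the mpu ones; this is my own simplified sketch, not the DeepSpeed implementation:

import torch

class PreLNBlockSketch(torch.nn.Module):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.input_layernorm = torch.nn.LayerNorm(hidden_size)
        self.attention = torch.nn.MultiheadAttention(hidden_size, num_heads,
                                                     batch_first=True)
        self.post_attention_layernorm = torch.nn.LayerNorm(hidden_size)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_size, 4 * hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden_states, causal_mask):
        # LayerNorm -> self-attention -> residual connection
        normed = self.input_layernorm(hidden_states)
        attn_out, _ = self.attention(normed, normed, normed,
                                     attn_mask=causal_mask, need_weights=False)
        hidden_states = hidden_states + attn_out
        # LayerNorm -> MLP -> second residual connection
        return hidden_states + self.mlp(self.post_attention_layernorm(hidden_states))

Here causal_mask would be an upper-triangular boolean mask (True marks blocked positions), which plays the same role as ltor_mask above, just in the convention that nn.MultiheadAttention expects.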

2.4 GPT2SelfAttention module

This is GPT-2's self-attention module. As before, we mostly ignore the model-parallel machinery and read the code against the underlying math (although the sharding-related pieces still deserve a quick look).

This is the core of the model: the key parts are the multi-head self-attention computation and the attention mask. The comments here are fairly detailed, and reading them together with the original authors' comments makes the code easier to follow.

class GPT2ParallelSelfAttention(torch.nn.Module):
    """Parallel self-attention layer for GPT2.
    Self-attention layer takes input with size [b, s, h] where b is
    the batch size, s is the sequence lenght, and h is the hidden size
    and creates output of the same size.
    Arguments:
        hidden_size: total hidden size of the layer (h).
        num_attention_heads: number of attention heads (n). Note that we
                             require n to be divisible by number of GPUs
                             used to parallelize the model. Also, we
                             require hidden size to be divisible by n.
        dropout_prob: dropout probability for the attention scores.
        init_method: weight initialization.
        output_layer_init_method: output layer initialization. If None, use
                                  `init_method`.
    We use the following notation:
        h: hidden_size
        n: num_attention_heads
        p: number of partitions (p = 1 when the model is not sharded)
        np: n/p, equal to n when p = 1
        hp: h/p, equal to h when p = 1
        hn: h/n, the hidden size of each attention head
        b: batch size
        s: sequence length
    """
    def __init__(self, hidden_size, num_attention_heads,
                 attention_dropout_prob, output_dropout_prob,
                 init_method, output_layer_init_method=None):
        super(GPT2ParallelSelfAttention, self).__init__()
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
        # Per attention head and per partition values.
        world_size = get_model_parallel_world_size()
        # Model-parallel sharding: with no sharding, world_size = 1 and this divide changes nothing
        self.hidden_size_per_partition = divide(hidden_size, world_size)
        # Hidden size of each attention head.
        # For example, with hidden_size = 256 and 8 attention heads,
        # each head gets a hidden size of 256 / 8 = 32.
        # (This is just the definition of multi-head attention and has nothing to do with model sharding.)
        self.hidden_size_per_attention_head = divide(hidden_size,
                                                     num_attention_heads)
        # Number of attention heads per model partition; with world_size = 1 this is just num_attention_heads.
        # If world_size = 2, the model is split into 2 shards running on 2 GPUs,
        # and each shard handles num_attention_heads / 2 heads.
        self.num_attention_heads_per_partition = divide(num_attention_heads,
                                                        world_size)
        # Strided linear layer.
        # ColumnParallelLinear is a linear layer that supports model sharding; essentially it is just y = x * W + b.
        # It maps the input from hidden_size to 3*hidden_size.
        # When the model is not sharded, the stride and gather_output arguments have no effect.
        # (It is enough to know what operation this performs; we will not look into it further here.)
        self.query_key_value = ColumnParallelLinear(hidden_size, 3*hidden_size,
                                                    stride=3,
                                                    gather_output=False,
                                                    init_method=init_method)
        # Dropout. Note that for a single iteration, this layer will generate
        # different outputs on different number of parallel partitions but
        # on average it should not be partition dependent.
        # Dropout on the attention scores. As the authors' note above says, within a single iteration the dropout
        # differs across model-parallel partitions, but on average the result should not depend on the partitioning.
        self.attention_dropout = torch.nn.Dropout(attention_dropout_prob)

        # Output.
        # Output projection: a linear transformation whose weight has shape [h, h]
        self.dense = RowParallelLinear(hidden_size,
                                       hidden_size,
                                       input_is_parallel=True,
                                       init_method=output_layer_init_method)
        self.output_dropout = torch.nn.Dropout(output_dropout_prob)

        if deepspeed.checkpointing.is_configured():
            global get_cuda_rng_tracker, checkpoint
            get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
            checkpoint = deepspeed.checkpointing.checkpoint


    def _transpose_for_scores(self, tensor):
        """Transpose a 3D tensor [b, s, np*hn] into a 4D tensor with
        size [b, np, s, hn].
        With no model sharding, np = n and hn = h/n, so the 3D tensor is really [b, s, h]; it is split per attention head into [b, np, s, hn].
        """
        # First compute the target shape: (b, s) + (np, hn) = (b, s, np, hn)
        new_tensor_shape = tensor.size()[:-1] + \
                           (self.num_attention_heads_per_partition,
                            self.hidden_size_per_attention_head)
        # Reshape the tensor into the target shape
        tensor = tensor.view(*new_tensor_shape)
        return tensor.permute(0, 2, 1, 3)

    def forward(self, hidden_states, ltor_mask):
        # hidden_states: [b, s, h]
        # ltor_mask: [1, 1, s, s]

        # Attention heads. [b, s, hp]
        # With p = 1 this is the usual [b, s, h].
        # query_key_value: [b, s, h] -> [b, s, 3*h]
        # Since this is self-attention, the Q, K and V projections all multiply the same hidden states,
        # so they are fused into a single matmul [b, s, h] x [h, 3h] -> [b, s, 3h]; the last dimension is split into Q, K, V afterwards.
        mixed_x_layer = self.query_key_value(hidden_states)
        # split_tensor_along_last_dim splits the last dimension of a tensor evenly into n chunks, here 3.
        # This separates q, k and v out of the fused tensor computed above.
        # q, k and v each have shape [b, s, h].
        (mixed_query_layer,
         mixed_key_layer,
         mixed_value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3)

        # Reshape and transpose [b, np, s, hn]
        # Split q, k, v per attention head; np * hn = h (when p = 1)
        query_layer = self._transpose_for_scores(mixed_query_layer)
        key_layer = self._transpose_for_scores(mixed_key_layer)
        value_layer = self._transpose_for_scores(mixed_value_layer)

        # Raw attention scores. [b, np, s, s]
        # q @ k^T gives the attention scores
        attention_scores = torch.matmul(query_layer,
                                        key_layer.transpose(-1, -2))
        # q * k / sqrt(hn)
        attention_scores = attention_scores / math.sqrt(
            self.hidden_size_per_attention_head)
        # Apply the left to right attention mask.
        # The masking step: at this point attention_scores has shape [b, np, s, s].
        # ltor_mask has shape [1, 1, s, s]; it is all ones in the lower triangle and all zeros in the upper triangle.
        # The two are multiplied element-wise (Hadamard product):
        # only scores for the current token and earlier positions are kept; later positions are pushed to -10000, i.e. a very small value.
        attention_scores = torch.mul(attention_scores, ltor_mask) - \
                           10000.0 * (1.0 - ltor_mask)

        # Attention probabilities. [b, np, s, s]
        # Softmax over the last dimension gives the attention probabilities for each position
        attention_probs = torch.nn.Softmax(dim=-1)(attention_scores)
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        # Blogger's note: attention dropout really does drop some attention scores entirely, which seems a bit questionable. (Indeed... leaving this as an open question.)
        with get_cuda_rng_tracker().fork():
            attention_probs = self.attention_dropout(attention_probs)

        # Context layer.
        # [b, np, s, hn]
        # Dot product: [b, np, s, s] @ [b, np, s, hn] -> [b, np, s, hn]
        context_layer = torch.matmul(attention_probs, value_layer)
        # [b, s, np, hn]
        # Merge the heads back: first permute the dimensions back
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # Then compute the merged shape: (b, s) + (h,) = (b, s, h), still assuming no model sharding (p = 1)
        new_context_layer_shape = context_layer.size()[:-2] + \
                                  (self.hidden_size_per_partition,)
        # [b, s, hp]: collapse into the target shape
        context_layer = context_layer.view(*new_context_layer_shape)

        # Output. [b, s, h]
        # The output goes through a dense projection plus dropout.
        # The dense layer is the RowParallelLinear defined above; with no model sharding this is [b, s, h] x [h, h] -> [b, s, h].
        output = self.dense(context_layer)
        output = self.output_dropout(output)

        return output
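
The ltor_mask itself is built outside this module (it arrives through the forward arguments), but its shape and effect are easy to reproduce: a lower-triangular matrix of ones, so that position i can only attend to positions up to i. A small toy illustration of the masking step used above (my own example):

import torch

s = 5
# Lower-triangular causal mask: 1 where attention is allowed, 0 where it is not.
ltor_mask = torch.tril(torch.ones(1, 1, s, s))

attention_scores = torch.randn(1, 1, s, s)      # pretend q @ k^T / sqrt(hn)
# Keep the allowed scores and push the rest to a very negative value,
# so softmax assigns them (almost) zero probability.
masked_scores = attention_scores * ltor_mask - 10000.0 * (1.0 - ltor_mask)
attention_probs = torch.softmax(masked_scores, dim=-1)

print(attention_probs[0, 0])   # row i has non-zero probability only for columns <= i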

2.5 GPT2MLP module

GPT-2's MLP module simply applies a non-linear transformation along the last (hidden) dimension: h -> 4h -> h.

class GPT2ParallelMLP(torch.nn.Module):
    """MLP for GPT2.

    MLP will take the input with h hidden state, project it to 4*h
    hidden dimension, perform gelu transformation, and project the
    state back into h hidden dimension. At the end, dropout is also
    applied.

    Arguments:
        hidden_size: The hidden size of the self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weight are initialized to one.
        output_layer_init_method: output layer initialization. If None,
                                  use `init_method`.
    """

    def __init__(self, hidden_size, output_dropout_prob, init_method,
                 output_layer_init_method=None):
        super(GPT2ParallelMLP, self).__init__()
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
        # Project to 4h.
        # As above, do not be put off by the name: this is just y = xW^T + b.
        # [b, s, h] x [4h, h]^T -> [b, s, 4h]
        # i.e. the weight shape is [output, input]; if the model is sharded, the output dimension is partitioned.
        self.dense_h_to_4h = ColumnParallelLinear(hidden_size, 4*hidden_size,
                                                  gather_output=False,
                                                  init_method=init_method)
        # Project back to h.
        # y = xW^T + b: [b, s, 4h] x [h, 4h]^T -> [b, s, h]
        # i.e. the weight shape is [output, input]; if the model is sharded, the input dimension is partitioned.
        self.dense_4h_to_h = RowParallelLinear(
            4*hidden_size,
            hidden_size,
            input_is_parallel=True,
            init_method=output_layer_init_method)
        self.dropout = torch.nn.Dropout(output_dropout_prob)

    def forward(self, hidden_states):
        # [b, s, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        intermediate_parallel = gelu(intermediate_parallel)

        # [b, s, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        output = self.dropout(output)
        return output
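
Ignoring the sharding, the two layers are just an h → 4h → h projection with a GELU in between; a quick shape check with plain nn.Linear and toy sizes (my own example):

import torch
import torch.nn.functional as F

b, s, h = 2, 8, 32
hidden_states = torch.randn(b, s, h)

dense_h_to_4h = torch.nn.Linear(h, 4 * h)   # weight shape [4h, h]
dense_4h_to_h = torch.nn.Linear(4 * h, h)   # weight shape [h, 4h]

intermediate = F.gelu(dense_h_to_4h(hidden_states))   # [2, 8, 128]
output = dense_4h_to_h(intermediate)                  # [2, 8, 32]
print(intermediate.shape, output.shape)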

That covers the code for the main components of the GPT-2 model.

3. GPT-2 pre-training

Next, a quick look at how the loss is constructed during GPT-2 pre-training, starting from the forward pass.

This code lives in pretrain_gpt2.py.

3.1 GPT-2 pre-training: building the model

def get_model(args):
    """Build the model."""

    print_rank_0('building GPT2 model ...')
    # This constructs the GPT2Model described in detail in section 2
    model = GPT2Model(num_layers=args.num_layers,
                      vocab_size=args.vocab_size,
                      hidden_size=args.hidden_size,
                      num_attention_heads=args.num_attention_heads,
                      embedding_dropout_prob=args.hidden_dropout,
                      attention_dropout_prob=args.attention_dropout,
                      output_dropout_prob=args.hidden_dropout,
                      max_sequence_length=args.max_position_embeddings,
                      checkpoint_activations=args.checkpoint_activations,
                      checkpoint_num_layers=args.checkpoint_num_layers,
                      parallel_output=True)

    if mpu.get_data_parallel_rank() == 0:
        print(' > number of parameters on model parallel rank {}: {}'.format(
            mpu.get_model_parallel_rank(),
            sum([p.nelement() for p in model.parameters()])), flush=True)

    # To prevent OOM for model sizes that cannot fit in GPU memory in full precision.
    # When using DeepSpeed together with fp16,
    # fp32 is used only for the weight update; the expensive forward and backward passes run in fp16.
    # half() converts the model's float32 parameters to float16.
    if args.deepspeed and args.fp16:
        model.half()

    # GPU allocation.
    # Explicitly move the model onto the GPU
    model.cuda(torch.cuda.current_device())

    # Fp16 conversion.
    # fp16 mixed precision saves a lot of memory; that topic deserves its own write-up, so it is not expanded here.
    if args.fp16:
        model = FP16_Module(model)

    # Wrap model for distributed training.
    if USE_TORCH_DDP:
        i = torch.cuda.current_device()
        model = DDP(model, device_ids=[i], output_device=i,
                    process_group=mpu.get_data_parallel_group())
    else:
        model = DDP(model)

    return model

3.2 GPT-2 pre-training: forward

This is the forward step during pre-training.

The procedure is simple: run the model forward to get GPT-2's output, then compute the loss.
The input is sentence[:-1] and the true label is sentence[1:]; that is, for a sequence of length seq_len, tokens 1 to seq_len-1 are the input and tokens 2 to seq_len are the labels.
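
get_batch is not shown in this post, but the shift it implies is easy to illustrate with a toy example (this is just my assumption about how the raw token stream is split, based on the description above):

import torch

sentence = torch.tensor([101, 7, 42, 13, 99, 102])   # toy token ids

tokens = sentence[:-1]   # model input:  [101,  7, 42, 13,  99]
labels = sentence[1:]    # targets:      [  7, 42, 13, 99, 102]
# At position i the model sees tokens[: i + 1] and is trained to predict labels[i].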

def forward_step(data_iterator, model, args, timers):
    """Forward step."""

    # Get the batch.
    timers('batch generator').start()
    tokens, labels, loss_mask, attention_mask, position_ids = get_batch(
        data_iterator, args, timers)
    timers('batch generator').stop()

    # Forward model.
    # output shape = [b,s,vocab_size]
    # The output at each position along seq_len can be read as the model's prediction of the next token, given all the tokens before it
    output = model(tokens, position_ids, attention_mask)
    # Compute the cross entropy between the output distribution over the vocabulary (last dimension) and the labels
    losses = mpu.vocab_parallel_cross_entropy(output.contiguous().float(),
                                              labels)
    # loss_mask masks out the end tokens
    loss_mask = loss_mask.view(-1)
    loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()

    return loss
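
mpu.vocab_parallel_cross_entropy is the model-parallel counterpart of a per-token cross entropy. Without model parallelism, the same masked loss can be written with F.cross_entropy(..., reduction='none') plus the loss mask; a sketch of the equivalent computation (not the mpu code itself):

import torch
import torch.nn.functional as F

b, s, vocab = 2, 5, 11
output = torch.randn(b, s, vocab)            # model output (logits)
labels = torch.randint(0, vocab, (b, s))     # shifted targets
loss_mask = torch.ones(b, s)                 # 0 at positions that should not contribute

# Per-token cross entropy, shape [b, s]
losses = F.cross_entropy(output.view(-1, vocab), labels.view(-1),
                         reduction='none').view(b, s)

# Average only over the unmasked positions
loss = torch.sum(losses.view(-1) * loss_mask.view(-1)) / loss_mask.sum()
print(loss)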
