Original Text 12
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N=6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Translation
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N=6 identical layers. Each layer contains two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We apply a residual connection around each of the two sub-layers, followed by layer normalization. Specifically, the output of each sub-layer is computed as LayerNorm(x + Sublayer(x)), where Sublayer(x) denotes the function implemented by the sub-layer itself. To support these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
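To make this concrete, here is a minimal sketch (PyTorch, not the authors' code) of one encoder layer with the post-norm residual wrapping LayerNorm(x + Sublayer(x)) and d_model = 512. The head count (8) and feed-forward width (2048) come from later sections of the paper, not from this passage, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and a position-wise feed-forward
    network, each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(                 # position-wise feed-forward network
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)     # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)              # residual connection + layer norm
        return self.norm2(x + self.ffn(x))        # sub-layer 2, wrapped the same way

# Stack N=6 identical layers to form the encoder.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(2, 10, 512))            # output keeps shape (2, 10, 512)
```

Because every sub-layer and the embedding layer produce outputs of dimension d_model = 512, each residual addition x + Sublayer(x) is well defined without any extra projection.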
Key Sentence Analysis
1. The encoder is composed of a stack of N=6 identical layers.
[Analysis]
This is a simple sentence. The predicate is "is composed of", meaning "to be made up of". "The encoder" and "a stack", on either side of the predicate, are the subject and object respectively. The prepositional phrase "of ... layers" follows the object "a stack" as a post-modifier; here "of" expresses belonging, and "N=6" and "identical" are both attributives modifying "layers".
[Reference Translation]
The encoder is composed of a stack of N=6 identical layers.
2. Each layer has two sub-layers.
[Analysis]
This is a simple sentence with a subject-verb-object structure. "Has" is the predicate verb, meaning "to have, to contain"; "Each layer" and "two sub-layers", on either side of the predicate, are the subject and object respectively.
[Reference Translation]
Each layer contains two sub-layers.
3. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
[Analysis]
These are two coordinate clauses joined by "and"; the skeleton is "The first is a mechanism, and the second is a network." In the original sentence, "multi-head" and "self-attention" jointly modify "mechanism"; "simple", "position-wise", "fully connected", and "feed-forward" are four coordinate adjectives or adjective phrases that jointly modify "network". (Note that in the neural-network literature, "position-wise" specifically means that the same operation is applied independently at each position of the input sequence; this is one of the key design features of the Transformer architecture.)
[Reference Translation]
The first sub-layer is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
4. We employ a residual connection around each of the two sub-layers, followed by layer normalization.
[Analysis]
The skeleton of the sentence is "We employ a residual connection" (note that "employ" here means "to use, to adopt"). The prepositional phrase "around each of the two sub-layers" is an adverbial of place modifying the verb "employ"; "each of the two sub-layers" literally means "each one of the two sub-layers", which can simply be rendered as "each sub-layer". The past-participle phrase "followed by layer normalization" after the comma is an adverbial; "followed by ..." introduces the step that comes next, equivalent to "and the next step is layer normalization" or "and then we apply layer normalization".
[Reference Translation]
We apply a residual connection [11] around each of the two sub-layers, followed by layer normalization.
5. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
[Analysis]
The structure of this sentence is: parenthetical + main clause + relative clause. "That is" is a parenthetical introducing a further explanation of the preceding text; it is usually rendered as "that is / in other words" and here can be rendered more freely as "specifically". The main clause is "the output of each sub-layer is LayerNorm(x + Sublayer(x))". Treating "the output of each sub-layer" as A and "LayerNorm(x + Sublayer(x))" as B, the main clause reduces to "A is B". The head of the subject A is "the output", with the prepositional phrase "of each sub-layer" as a post-modifier. The clause after the comma, "where ...", is a non-restrictive relative clause modifying "LayerNorm(x + Sublayer(x))"; "where" is equivalent to "in which", i.e., "in LayerNorm(x + Sublayer(x))". The skeleton of the relative clause is "Sublayer(x) is the function", and "implemented by the sub-layer itself" at the end is a past-participle phrase ("implemented" plus the prepositional phrase "by ...") acting as a post-modifier with passive meaning: the "function" is implemented by the sub-layer itself.
[Reference Translation]
Specifically, the output of each sub-layer is computed as LayerNorm(x + Sublayer(x)), where Sublayer(x) denotes the function implemented by the sub-layer itself.
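As a second, framework-free sketch of the same formula, the snippet below spells out LayerNorm(x + Sublayer(x)) in NumPy, assuming the standard layer normalization of Ba et al. (normalize each position's feature vector to zero mean and unit variance, then apply a learned gain and bias); the stand-in sub-layer is a random linear map used only to exercise the wrapper.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize over the feature (d_model) dimension, then rescale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def sublayer_connection(x, sublayer, gamma, beta):
    # The formula from the paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 512
x = np.random.randn(10, d_model)                   # 10 positions, d_model features each
gamma, beta = np.ones(d_model), np.zeros(d_model)  # identity-initialized LayerNorm params
W = np.random.randn(d_model, d_model) / np.sqrt(d_model)
out = sublayer_connection(x, lambda h: h @ W, gamma, beta)
print(out.shape)                                   # (10, 512)
```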
6. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
[Analysis]
The structure of this sentence is: infinitive + subject + parenthetical + predicate + object. The infinitive "to facilitate ..." at the beginning is an adverbial of purpose. The phrase between the two commas, "as well as the embedding layers", is a parenthetical that can be set aside for the moment, so the skeleton of the sentence is "all sub-layers produce outputs". In the original sentence, "in the model", following the subject "all sub-layers", is a prepositional phrase acting as a post-modifier; the prepositional phrase "of dimension d_model = 512", following the object "outputs", is likewise a post-modifier, with "of" expressing an attribute. Returning to the parenthetical "as well as the embedding layers" between the subject and the predicate: it supplements the subject "all sub-layers", and "as well as" means "and, in addition to".
[Reference Translation]
To support these residual connections, the outputs of all sub-layers in the model, as well as of the embedding layers, are set to dimension d_model = 512.
Original Text 13
Decoder: The decoder is also composed of a stack of N=6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
Translation
Decoder: The decoder is likewise composed of a stack of N=6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, we apply residual connections around each of the sub-layers, followed by layer normalization. We also give the self-attention sub-layer in the decoder stack special treatment: a masking mechanism prevents each position from attending to subsequent positions. This masking, combined with the fact that the output embeddings are shifted right by one position, ensures that the prediction for position i depends only on the known outputs at positions less than i.
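For contrast with the encoder sketch above, here is a matching PyTorch sketch of one decoder layer, again a hedged illustration rather than the authors' code. Here memory stands for the output of the encoder stack, and causal_mask is the mask that hides subsequent positions, discussed in sentences 4 and 5 below; head count and feed-forward width again follow later sections of the paper.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, multi-head attention over the
    encoder output, and a position-wise feed-forward network, each wrapped as
    LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, causal_mask):
        # Sub-layer 1: masked multi-head self-attention over the decoder input.
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # The additional sub-layer described above: multi-head attention over
        # the encoder stack's output (memory).
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + a)
        # Position-wise feed-forward network, wrapped the same way.
        return self.norm3(x + self.ffn(x))
```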
Key Sentence Analysis
1. The decoder is also composed of a stack of N=6 identical layers.
[Analysis]
This is a simple sentence. The predicate is "is composed of", meaning "to be made up of". "The decoder" and "a stack", on either side of the predicate, are the subject and object respectively. The adverb "also" modifies the verb "is" and acts as an adverbial. The prepositional phrase "of ... layers" follows the object "a stack" as a post-modifier; here "of" expresses belonging, and "N=6" and "identical" are both attributives modifying "layers".
[Reference Translation]
The decoder is likewise composed of a stack of N=6 identical layers.
2. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
[Analysis]
The structure of this sentence is: adverbial + main clause + relative clause. The prepositional phrase "in addition to ..." at the beginning is an adverbial meaning "besides"; "the two sub-layers" is the object of the preposition, and "in each encoder layer" is a post-modifier of "the two sub-layers". The main clause, "the decoder inserts a third sub-layer", has a subject-verb-object structure, with "inserts" as the predicate flanked by the subject and the object. "Which" introduces a non-restrictive relative clause whose skeleton is "which performs multi-head attention"; the prepositional phrase "over the output ..." is an adverbial modifying the verb "performs", with "over" introducing the target of the operation, and the prepositional phrase "of the encoder stack" at the end is a post-modifier of "the output".
[Reference Translation]
In addition to the two sub-layers of each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
3. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
[Analysis]
The structure of the sentence is: adjective phrase + main clause + past-participle phrase. The opening "Similar to the encoder" is an adjective ("similar") plus a prepositional phrase ("to the encoder") modifying the main clause; it can also be viewed as a parenthetical. The skeleton of the main clause is "we employ residual connections" (note that "employ" here means "to use, to adopt"). The prepositional phrase "around each of the sub-layers" is an adverbial of place modifying the verb "employ". The past-participle phrase "followed by layer normalization" after the comma is an adverbial; "followed by ..." introduces the step that comes next, equivalent to "and the next step is layer normalization" or "and then we apply layer normalization".
[Reference Translation]
As with the encoder, we apply residual connections around each of the sub-layers, followed by layer normalization.
4. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.
[Analysis]
The skeleton of this sentence is "We modify the sub-layer". "Also" is an adverb modifying the verb "modify" and acting as an adverbial. "Modify" literally means "to change"; here it can be rendered more freely as "give special treatment to". "Self-attention" is an attributive modifying "sub-layer", and the prepositional phrase "in the decoder stack" is a post-modifier, also of "sub-layer". The infinitive "to prevent positions from attending to subsequent positions" is an adverbial of purpose; "attending to subsequent positions" is a gerund phrase (the "doing sth" slot), and "prevent sb/sth from doing sth" means "to stop someone or something from doing something".
[Reference Translation]
We also give the self-attention sub-layer in the decoder stack special treatment: a masking mechanism prevents each position from attending to subsequent positions.
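One common way to realize this modification is an additive mask on the attention scores; the sketch below is an assumption about that mechanism, since the excerpt itself gives no code. Entry (i, j) is set to -inf whenever j > i, so after the softmax, position i assigns zero weight to every later position.

```python
import torch

def subsequent_mask(seq_len):
    # Entries strictly above the diagonal (future positions) are blocked.
    blocked = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    mask = torch.zeros(seq_len, seq_len)
    mask[blocked] = float("-inf")
    return mask

print(subsequent_mask(4).tolist())
# values: [[0.0, -inf, -inf, -inf],
#          [0.0,  0.0, -inf, -inf],
#          [0.0,  0.0,  0.0, -inf],
#          [0.0,  0.0,  0.0,  0.0]]
```

A mask of this form can be passed as the attn_mask argument of the decoder sketch above: nn.MultiheadAttention adds a float mask to the attention scores before the softmax.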
5. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
[Analysis]
The structure of the sentence is: subject + parenthetical + predicate + object clause. Setting aside for a moment the parenthetical between the two commas (which also separates the subject from the predicate), the main clause is "This masking ensures ...", followed by an object clause introduced by "that". The skeleton of the object clause is "the predictions can depend on the known outputs". In the original sentence, the prepositional phrase "for position i" is a post-modifier of "the predictions"; "only" is an adverb modifying "depend" and acting as an adverbial; the prepositional phrase "at positions less than i" modifies "outputs", and within it "less than i" is a comparative adjective phrase modifying "positions". Now consider the parenthetical, "combined with fact that the output embeddings are offset by one position" (the published paper omits the article; read it as "combined with the fact that ..."). Here "combined" is a past participle and "with (the) fact" a prepositional phrase, jointly modifying the subject "This masking"; "fact" can be rendered more freely as "property". "That the output embeddings are offset by one position" is an appositive clause introduced by "that", explaining "fact". "Offset" is a technical term, rendered as "shifted/offset".
[Reference Translation]
This masking, combined with the fact that the output embeddings are shifted right by one position, ensures that the prediction for position i depends only on the known outputs at positions less than i.
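The interplay between the mask and the one-position offset can be made concrete with a small sketch of how the decoder input is typically prepared during training; the start token and the token ids below are hypothetical, since the excerpt does not spell this step out.

```python
import torch

bos_id = 1                                  # hypothetical start-of-sequence token id
target = torch.tensor([57, 912, 4, 33])     # hypothetical target token ids y1..y4

# "Offset by one position": the decoder is fed the target shifted right, so the
# embedding at position i corresponds to the token at position i-1.
decoder_input = torch.cat([torch.tensor([bos_id]), target[:-1]])
print(decoder_input.tolist())               # [1, 57, 912, 4]

# With the subsequent-position mask from the previous sketch, position 3 attends
# only to <bos>, y1, y2, y3 and never to y4, the very token it must predict there.
```

Together, the shift and the mask guarantee that the prediction for position i can depend only on the known outputs at positions less than i.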