# 2 Embedding

## 2.1 Token Embedding

Each input token is first mapped to a learned d_model-dimensional embedding vector (in the original Transformer these embeddings are additionally scaled by √d_model). Since the model contains no recurrence and no convolution, the token embeddings alone carry no notion of order; for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens, which is the job of the positional embedding below.
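
As a minimal sketch (assuming a PyTorch setup; the class name and `vocab_size` parameter are illustrative, while the scaling by √d_model follows the original paper), the token embedding is just a learned lookup table:

```python
import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Learned lookup table mapping token ids to d_model-dimensional vectors."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.lut = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, seq_len, d_model),
        # scaled by sqrt(d_model) as in the original Transformer
        return self.lut(token_ids) * math.sqrt(self.d_model)
```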

## 2.2 Positional Embedding

$$
\begin{aligned}
PE(pos, 2i) &= \sin\left(pos / 10000^{2i/d_{\text{model}}}\right) \\
PE(pos, 2i+1) &= \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)
\end{aligned}
$$
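
Here pos is the token position and i indexes the embedding dimension, so each dimension corresponds to a sinusoid of a different wavelength. A sketch of precomputing this table (assuming an even d_model and a chosen maximum length max_len; PyTorch is used only for convenience):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of the fixed sinusoidal encodings above."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    # 10000^(-2i/d_model) for each even dimension index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cos
    return pe
```

The row `pe[pos]` is then added to the token embedding at position pos.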

# 3 Transformer Network Architecture

## 3.1 Encoder Layer

The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
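This sub-layer wrapping can be sketched as a small module (a post-norm formulation matching the sentence above; the dropout rate and names are illustrative):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Wraps a sub-layer as LayerNorm(x + Dropout(sublayer(x)))."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # dropout on the sub-layer output, then residual add, then layer norm
        return self.norm(x + self.dropout(sublayer(x)))
```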

$$
\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2
$$
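
Read literally, this is two linear transformations with a ReLU in between, applied to every position independently. A sketch (the sizes d_model = 512 and d_ff = 2048 are the base-model values from the original paper):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied at each position independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # x W1 + b1
        self.w_2 = nn.Linear(d_ff, d_model)   # (...) W2 + b2
        self.relu = nn.ReLU()                 # max(0, ·)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_2(self.relu(self.w_1(x)))
```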

## 3.2 Decoder Layer

In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models

## 3.4 Decoder Training and Decoding

Self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
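
A plain sketch of that masking inside scaled dot-product attention (single-head, batch-first tensors; not tied to any particular library module):

```python
import math
import torch

def masked_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a causal mask blocking leftward information flow.
    q, k, v: (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)       # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    # Connections to future positions (j > i) are illegal: set them to -inf before the softmax.
    illegal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(illegal, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```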