%matplotlib inline
Sequence-to-Sequence Modeling with nn.Transformer and TorchText
This is a tutorial on how to train a sequence-to-sequence model
that uses the nn.Transformer <https://pytorch.org/docs/master/nn.html?highlight=nn%20transformer#torch.nn.Transformer> module.
The PyTorch 1.2 release includes a standard transformer module based on the
paper Attention is All You
Need <https://arxiv.org/pdf/1706.03762.pdf>.
The transformer model
has been shown to be superior in quality for many sequence-to-sequence
problems while being more parallelizable. The nn.Transformer module
relies entirely on an attention mechanism (another module recently
implemented as nn.MultiheadAttention <https://pytorch.org/docs/master/nn.html?highlight=multiheadattention#torch.nn.MultiheadAttention>)
to draw global dependencies between input and output. The nn.Transformer module is highly modularized, so a single component (like nn.TransformerEncoder <https://pytorch.org/docs/master/nn.html?highlight=nn%20transformerencoder#torch.nn.TransformerEncoder> in this tutorial) can be easily adapted or composed.

Define the model
In this tutorial, we train an nn.TransformerEncoder model on a
language modeling task. The language modeling task is to assign a
probability for the likelihood of a given word (or a sequence of words)
to follow a sequence of words. A sequence of tokens is passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the words (see the next paragraph for more details). The nn.TransformerEncoder consists of multiple layers of nn.TransformerEncoderLayer <https://pytorch.org/docs/master/nn.html?highlight=transformerencoderlayer#torch.nn.TransformerEncoderLayer>. Along with the input sequence, a square
attention mask is required because the self-attention layers in nn.TransformerEncoder are only allowed to attend to the earlier positions in
the sequence. For the language modeling task, any tokens at future
positions should be masked. To obtain the actual words, the output
of the nn.TransformerEncoder model is sent to the final Linear
layer, which is followed by a log-Softmax function.
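The pipeline described above (embedding, encoder stack, final Linear layer, log-softmax) can be sketched as follows. This is a minimal illustration, not necessarily the tutorial's own model: the class name `TransformerLM` and the argument names `d_hid` and `nlayers` are hypothetical, and the positional encoding step is left out so the block stays short. The attention mask is supplied by the caller.

```python
import math

import torch
import torch.nn as nn


class TransformerLM(nn.Module):
    """Hypothetical encoder-only language model (names and sizes are illustrative)."""

    def __init__(self, ntoken, d_model, nhead, d_hid, nlayers, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        # Token ids -> d_model-dimensional vectors.
        self.embedding = nn.Embedding(ntoken, d_model)
        # A stack of identical self-attention + feed-forward layers.
        layer = nn.TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        # Projects each position back onto the vocabulary.
        self.decoder = nn.Linear(d_model, ntoken)

    def forward(self, src, src_mask):
        # src: (S, N) token ids; src_mask: (S, S) additive attention mask.
        x = self.embedding(src) * math.sqrt(self.d_model)
        # NOTE: a positional encoding would normally be added to x here.
        x = self.encoder(x, src_mask)
        # Log-probabilities over the vocabulary, shape (S, N, ntoken).
        return torch.log_softmax(self.decoder(x), dim=-1)
```

Calling `model(src, src_mask)` with `src` of shape (S, N) returns log-probabilities of shape (S, N, ntoken), one distribution over the vocabulary per position and batch column.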
import math
sz=10
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
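A mask with the pattern printed above can be built in one line. The sketch below is one possible way to do it; the helper name `square_subsequent_mask` is our own. Row i holds 0 for positions up to and including i (allowed) and -inf for later positions (masked), so the softmax inside attention gives future tokens zero weight.

```python
import torch


def square_subsequent_mask(sz: int) -> torch.Tensor:
    """Return an (sz, sz) additive mask: 0.0 where attention is allowed, -inf elsewhere."""
    # Start from a matrix of -inf and keep only the strict upper triangle
    # (future positions); torch.triu fills the diagonal and everything
    # below it with 0.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)


mask = square_subsequent_mask(10)
print(mask)
```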
The PositionalEncoding module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use sine and cosine functions of
different frequencies.
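As a sketch of the idea, the module below precomputes sinusoidal encodings in the style of the Attention is All You Need paper: even feature indices use sin(pos / 10000^(2i/d_model)) and odd indices use the matching cosine. The `max_len` default and the dropout on the output are assumptions made for illustration, and the code assumes `d_model` is even.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to (S, N, d_model) embeddings."""

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Column vector of positions 0 .. max_len - 1, shape (max_len, 1).
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        # One frequency per pair of feature dimensions, shape (d_model / 2,).
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 0, 1::2] = torch.cos(position * div_term)  # odd indices
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The (S, 1, d_model) slice broadcasts over the batch dimension N.
        x = x + self.pe[: x.size(0)]
        return self.dropout(x)
```

Because the encoding has the same last dimension as the embedding, the sum in `forward` is a plain element-wise addition.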
class PositionalEncoding(nn.Module):
Load and batch data
The training process uses the Wikitext-2 dataset from torchtext. The
vocab object is built based on the train dataset and is used to numericalize
tokens into tensors. Starting from sequential data, the batchify()
function arranges the dataset into columns, trimming off any tokens remaining
after the data has been divided into batches of size batch_size.
For instance, with the alphabet as the sequence (total length of 26)
and a batch size of 4, we would divide the alphabet into 4 sequences of
length 6 (A to F, G to L, M to R, and S to X), dropping the leftover Y and Z.
These columns are treated as independent by the model, which means that
the dependence of G on F cannot be learned, but this layout allows more
efficient batch processing.
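The alphabet example can be reproduced with a short sketch of a batchify-style function. This is an illustration under our own assumptions (a 1-D tensor of token ids as input), not necessarily the exact function used in this tutorial.

```python
import string

import torch


def batchify(data: torch.Tensor, bsz: int) -> torch.Tensor:
    """Arrange a 1-D tensor of token ids into bsz columns, dropping the remainder."""
    nbatch = data.size(0) // bsz        # length of each column
    data = data[: nbatch * bsz]         # trim tokens that do not fit
    # (bsz, nbatch) -> (nbatch, bsz): each column is one contiguous stretch.
    return data.view(bsz, -1).t().contiguous()


letters = string.ascii_uppercase        # 26 "tokens"
ids = torch.arange(len(letters))        # stand-in for numericalized text
columns = batchify(ids, 4)              # shape (6, 4)

# Read each column back as text to see which letters it holds.
as_text = ["".join(letters[i] for i in col) for col in columns.t().tolist()]
print(as_text)
```

Each column is a separate stretch of the original sequence, which is why the model never sees F and G next to each other.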
import torchtext
data = TEXT.numericalize([train_txt.examples[0].text])
Functions to generate input and target sequence
The get_batch() function generates the input and target sequences for
the transformer model. It subdivides the source data into chunks of
length bptt. For the language modeling task, the model needs the
following words as the target. For example, with a bptt value of 2 and i = 0,
the input would be the first two rows of the batched data and the target would be the second and third rows, that is, the same chunk shifted forward by one position.

It should be noted that the chunks are along dimension 0, consistent
with the S dimension in the Transformer model. The batch dimension N is along dimension 1.
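The chunking can be sketched as below. The function signature (passing `bptt` as an argument) and the flattening of the target into one dimension are assumptions made for this illustration; flattening is convenient when the target is later compared against flattened model outputs in a cross-entropy loss.

```python
import torch


def get_batch(source: torch.Tensor, i: int, bptt: int):
    """Return (data, target) for the chunk of rows starting at row i."""
    # The last chunk may be shorter; one row is reserved for the shifted target.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i : i + seq_len]                         # (seq_len, N)
    target = source[i + 1 : i + 1 + seq_len].reshape(-1)   # (seq_len * N,)
    return data, target


# Toy batched data: 4 columns of length 6, laid out as (S, N) = (6, 4).
source = torch.arange(24).view(4, 6).t().contiguous()
data, target = get_batch(source, 0, bptt=2)
print(data)
print(target)
```

Here `data` holds rows 0 and 1 of `source`, while `target` holds rows 1 and 2, so every input token is paired with the token that follows it in the same column.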
bptt = 35
get_batch(train_data, 0)[0][:10]
tensor([[ 3, 25, 1849, 570, 7, 5, 5, 9258, 4, 56,
0, 7, 6, 6634, 4, 6603, 6, 5, 65, 30],
[ 12, 66, 13, 4889, 458, 8, 1045, 21, 19094, 34,
147, 4, 0, 10, 2280, 2294, 58, 35, 2438, 4064],
[ 3852, 13667, 2962, 68, 6, 28374, 39, 417, 0, 2034,
29, 88, 27804, 350, 7, 17, 4811, 902, 33, 20],
[ 3872, 5, 9, 4, 155, 8, 1669, 32, 2634, 257,
4, 5, 5, 11, 4568, 8205, 78, 5258, 7723, 12009],
[ 884, 91, 963, 294, 4, 548, 29, 279, 37, 4,
391, 31, 4, 2614, 948, 13583, 405, 545, 15, 16],
[ 12, 25, 5, 5, 1688, 0, 39, 59, 8785, 0,
6, 13, 3026, 43, 11, 6, 0, 349, 3134, 4538],
[ 3, 6, 82, 1780, 21, 6, 2158, 4, 8, 8,
27, 1485, 0, 194, 96, 195, 3545, 101, 1150, 3486],
[ 3, 25, 13, 885, 4, 6360, 15, 670, 0, 13,
26, 17, 5, 417, 894, 10, 5, 5, 2998, 27],
[20003, 190, 33, 1516, 1085, 34, 680, 3597, 2475, 664,
47, 11, 127, 63, 6, 46, 24995, 72, 10190, 26],
[ 86, 9076, 10540, 6, 9, 74, 198, 7, 6, 17,
3134, 5312, 4, 4, 3, 25509, 5, 2034, 5, 86]])
Initiate an instance
The model is set up with the hyperparameters below. The vocab size is
equal to the length of the vocab object.
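To make the role of the vocabulary size concrete, here is a small sketch with a toy vocabulary. The dictionary `stoi` stands in for the real vocab object, and every hyperparameter value is an illustrative assumption rather than the tutorial's setting. The vocabulary size fixes both the number of embedding rows and the width of the output layer.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary standing in for the real vocab object.
stoi = {"<unk>": 0, "<pad>": 1, "the": 2, "cat": 3, "sat": 4}

ntokens = len(stoi)  # vocabulary size
emsize = 16          # embedding dimension (illustrative)
nhead = 2            # number of attention heads; must divide emsize
nhid = 32            # feed-forward width inside each encoder layer
nlayers = 2          # number of encoder layers
dropout = 0.2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One embedding row per token, and one output logit per token.
embedding = nn.Embedding(ntokens, emsize).to(device)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(emsize, nhead, nhid, dropout), nlayers
).to(device)
decoder = nn.Linear(emsize, ntokens).to(device)
```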
ntokens = len(TEXT.vocab.stoi) # the size of vocabulary
Run the model
CrossEntropyLoss <https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>
is applied to track the loss, and SGD <https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD>
implements the stochastic gradient descent method as the optimizer. The initial
learning rate is set to 5.0. StepLR <https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR> is
applied to adjust the learning rate through epochs. During
training, we use the nn.utils.clip_grad_norm_ <https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_>
function to scale all the gradients together to prevent them from exploding.
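The sketch below shows how these four pieces fit together on a tiny stand-in model. The decay factor `gamma=0.95`, the step size of one epoch, and the clipping threshold `max_norm=0.5` are assumptions for illustration; the learning rates in the training log further down (4.07, 3.87, 3.68) are at least consistent with a per-epoch factor of about 0.95.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)  # stand-in for the transformer model

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
# Multiply the learning rate by gamma after every epoch (assumed values).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

lrs = []
for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # Rescale all gradients together so their global L2 norm is at most 0.5.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()  # decay the learning rate once per epoch

print(lrs)
```

Clipping by the global norm keeps the direction of the update and only shrinks its length, which matters with a learning rate as large as 5.0.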
criterion = nn.CrossEntropyLoss()
Loop over epochs. Save the model if the validation loss is the best
we’ve seen so far. Adjust the learning rate after each epoch.
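The bookkeeping for this loop can be sketched on a toy problem as follows. The model, data, and learning rate here are placeholders, and keeping the best model with `copy.deepcopy` is our own choice for the illustration: a plain assignment would only store a reference to the model, which keeps changing as training continues.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)  # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

# Random placeholder data for training and validation.
train_x, train_y = torch.randn(64, 4), torch.randint(0, 3, (64,))
val_x, val_y = torch.randn(32, 4), torch.randint(0, 3, (32,))


def evaluate(m: nn.Module) -> float:
    """Validation loss without gradient tracking."""
    m.eval()
    with torch.no_grad():
        return criterion(m(val_x), val_y).item()


best_val_loss = float("inf")
best_model = None
history = []
for epoch in range(1, 4):
    model.train()
    optimizer.zero_grad()
    criterion(model(train_x), train_y).backward()
    optimizer.step()

    val_loss = evaluate(model)
    history.append(val_loss)
    if val_loss < best_val_loss:
        # Snapshot the weights; later epochs will not overwrite this copy.
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)

    scheduler.step()  # adjust the learning rate after each epoch
```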
best_val_loss = float("inf")
| epoch 1 | 200/ 2981 batches | lr 4.07 | ms/batch 9.71 | loss 5.39 | ppl 218.21
| epoch 1 | 400/ 2981 batches | lr 4.07 | ms/batch 9.50 | loss 5.39 | ppl 220.00
| epoch 1 | 600/ 2981 batches | lr 4.07 | ms/batch 9.55 | loss 5.20 | ppl 181.36
| epoch 1 | 800/ 2981 batches | lr 4.07 | ms/batch 9.60 | loss 5.26 | ppl 193.18
| epoch 1 | 1000/ 2981 batches | lr 4.07 | ms/batch 9.55 | loss 5.23 | ppl 186.05
| epoch 1 | 1200/ 2981 batches | lr 4.07 | ms/batch 9.55 | loss 5.26 | ppl 192.45
| epoch 1 | 1400/ 2981 batches | lr 4.07 | ms/batch 9.55 | loss 5.29 | ppl 197.86
| epoch 1 | 1600/ 2981 batches | lr 4.07 | ms/batch 9.60 | loss 5.33 | ppl 206.42
| epoch 1 | 1800/ 2981 batches | lr 4.07 | ms/batch 9.59 | loss 5.27 | ppl 193.88
| epoch 1 | 2000/ 2981 batches | lr 4.07 | ms/batch 9.70 | loss 5.30 | ppl 200.64
| epoch 1 | 2200/ 2981 batches | lr 4.07 | ms/batch 9.64 | loss 5.17 | ppl 176.64
| epoch 1 | 2400/ 2981 batches | lr 4.07 | ms/batch 9.62 | loss 5.26 | ppl 192.57
| epoch 1 | 2600/ 2981 batches | lr 4.07 | ms/batch 9.63 | loss 5.28 | ppl 195.69
| epoch 1 | 2800/ 2981 batches | lr 4.07 | ms/batch 9.64 | loss 5.21 | ppl 182.35
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 30.36s | valid loss 5.55 | valid ppl 256.32
-----------------------------------------------------------------------------------------
| epoch 2 | 200/ 2981 batches | lr 3.87 | ms/batch 9.73 | loss 5.26 | ppl 192.78
| epoch 2 | 400/ 2981 batches | lr 3.87 | ms/batch 9.65 | loss 5.27 | ppl 194.78
| epoch 2 | 600/ 2981 batches | lr 3.87 | ms/batch 9.68 | loss 5.08 | ppl 160.59
| epoch 2 | 800/ 2981 batches | lr 3.87 | ms/batch 9.67 | loss 5.14 | ppl 171.46
| epoch 2 | 1000/ 2981 batches | lr 3.87 | ms/batch 9.68 | loss 5.10 | ppl 164.78
| epoch 2 | 1200/ 2981 batches | lr 3.87 | ms/batch 9.70 | loss 5.14 | ppl 171.25
| epoch 2 | 1400/ 2981 batches | lr 3.87 | ms/batch 9.71 | loss 5.18 | ppl 177.40
| epoch 2 | 1600/ 2981 batches | lr 3.87 | ms/batch 9.70 | loss 5.23 | ppl 186.59
| epoch 2 | 1800/ 2981 batches | lr 3.87 | ms/batch 9.70 | loss 5.16 | ppl 173.63
| epoch 2 | 2000/ 2981 batches | lr 3.87 | ms/batch 9.61 | loss 5.19 | ppl 179.61
| epoch 2 | 2200/ 2981 batches | lr 3.87 | ms/batch 9.61 | loss 5.06 | ppl 158.22
| epoch 2 | 2400/ 2981 batches | lr 3.87 | ms/batch 9.66 | loss 5.14 | ppl 170.97
| epoch 2 | 2600/ 2981 batches | lr 3.87 | ms/batch 9.63 | loss 5.16 | ppl 173.44
| epoch 2 | 2800/ 2981 batches | lr 3.87 | ms/batch 9.62 | loss 5.10 | ppl 163.57
-----------------------------------------------------------------------------------------
| end of epoch 2 | time: 30.54s | valid loss 5.44 | valid ppl 231.52
-----------------------------------------------------------------------------------------
| epoch 3 | 200/ 2981 batches | lr 3.68 | ms/batch 9.74 | loss 5.15 | ppl 172.66
| epoch 3 | 400/ 2981 batches | lr 3.68 | ms/batch 9.81 | loss 5.16 | ppl 174.57
| epoch 3 | 600/ 2981 batches | lr 3.68 | ms/batch 9.76 | loss 4.98 | ppl 145.22
| epoch 3 | 800/ 2981 batches | lr 3.68 | ms/batch 9.69 | loss 5.04 | ppl 154.24
| epoch 3 | 1000/ 2981 batches | lr 3.68 | ms/batch 9.92 | loss 5.02 | ppl 150.68
| epoch 3 | 1200/ 2981 batches | lr 3.68 | ms/batch 9.75 | loss 5.05 | ppl 156.65
| epoch 3 | 1400/ 2981 batches | lr 3.68 | ms/batch 9.81 | loss 5.08 | ppl 161.32
| epoch 3 | 1600/ 2981 batches | lr 3.68 | ms/batch 9.87 | loss 5.13 | ppl 168.46
| epoch 3 | 1800/ 2981 batches | lr 3.68 | ms/batch 9.73 | loss 5.06 | ppl 158.11
| epoch 3 | 2000/ 2981 batches | lr 3.68 | ms/batch 9.78 | loss 5.09 | ppl 162.57
| epoch 3 | 2200/ 2981 batches | lr 3.68 | ms/batch 9.80 | loss 4.97 | ppl 143.40
| epoch 3 | 2400/ 2981 batches | lr 3.68 | ms/batch 9.84 | loss 5.05 | ppl 156.10
| epoch 3 | 2600/ 2981 batches | lr 3.68 | ms/batch 9.78 | loss 5.07 | ppl 158.92
| epoch 3 | 2800/ 2981 batches | lr 3.68 | ms/batch 9.80 | loss 5.01 | ppl 149.25
-----------------------------------------------------------------------------------------
| end of epoch 3 | time: 30.91s | valid loss 5.46 | valid ppl 234.33
-----------------------------------------------------------------------------------------
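The ppl column in the log is the perplexity, which is the exponential of the average cross-entropy loss. The short sketch below shows the relationship; the helper name `perplexity` is our own. Because the log prints the loss rounded to two decimals, recomputing the perplexity from the printed loss gives a slightly different value than the printed ppl.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is exp of the mean cross-entropy loss (natural log)."""
    return math.exp(loss)


# Going the other way: the loss implied by a reported perplexity.
implied_loss = math.log(234.33)      # valid ppl at the end of epoch 3
print(round(implied_loss, 2))        # matches the printed valid loss after rounding
print(perplexity(implied_loss))      # recovers the reported perplexity
```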
Evaluate the model with the test dataset
Apply the best model to check the result with the test dataset.
test_loss = evaluate(best_model, test_data)
=========================================================================================
| End of training | test loss 5.48 | test ppl 238.72
=========================================================================================
PATH = './transformer_net.pth'
model_dict=model.load_state_dict(torch.load(PATH))
type(model_dict)
torch.nn.modules.module._IncompatibleKeys
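The type shown above is the result object returned by `load_state_dict`; it carries two lists, `missing_keys` and `unexpected_keys`, which are both empty when the saved weights match the model exactly. The round trip can be sketched with a small stand-in model; the temporary directory and the `nn.Linear` model are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 3)      # stand-in for the trained model
restored = nn.Linear(4, 3)   # fresh model with the same architecture

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transformer_net.pth")
    # Save only the parameters, not the whole module object.
    torch.save(model.state_dict(), path)
    # Load them into the fresh model; the return value reports mismatches.
    result = restored.load_state_dict(torch.load(path))

print(result.missing_keys, result.unexpected_keys)
```

Note that the return value is not the model or its weights, so it is mainly useful for checking that every key was matched.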
REFERENCE: SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT.