"""
Language Translation with TorchText
====================================

This tutorial shows how to use several convenience classes of ``torchtext`` to preprocess
data from a well-known dataset containing sentences in both English and German and use it to
train a sequence-to-sequence model with attention that can translate German sentences
into English.

It is based off of
`this tutorial <https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb>`__
from PyTorch community member `Ben Trevett <https://github.com/bentrevett>`__
and was created by `Seth Weidman <https://github.com/SethHWeidman/>`__ with Ben's permission.

By the end of this tutorial, you will be able to:

- Preprocess sentences into a commonly-used format for NLP modeling using the following ``torchtext`` convenience classes:
    - `TranslationDataset <https://torchtext.readthedocs.io/en/latest/datasets.html#torchtext.datasets.TranslationDataset>`__
    - `Field <https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field>`__
    - `BucketIterator <https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.BucketIterator>`__
"""

######################################################################
# `Field` and `TranslationDataset`
# ----------------
# ``torchtext`` has utilities for creating datasets that can be easily
# iterated through for the purposes of creating a language translation
# model. One key class is a
# `Field <https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L64>`__,
# which specifies the way each sentence should be preprocessed, and another is the
# `TranslationDataset`; ``torchtext``
# has several such datasets; in this tutorial we'll use the
# `Multi30k dataset <https://github.com/multi30k/dataset>`__, which contains about
# 30,000 sentences (averaging about 13 words in length) in both English and German.
#
# Note: the tokenization in this tutorial requires `Spacy <https://spacy.io>`__.
# We use Spacy because it provides strong support for tokenization in languages
# other than English. ``torchtext`` provides a ``basic_english`` tokenizer
# and supports other tokenizers for English (e.g.
# `Moses <https://bitbucket.org/luismsgomes/mosestokenizer/src/default/>`__)
# but for language translation - where multiple languages are required -
# Spacy is your best bet.
#
# To run this tutorial, first install ``spacy`` using ``pip`` or ``conda``.
# Next, download the raw data for the English and German Spacy tokenizers:
#
# ::
#
#    python -m spacy download en
#    python -m spacy download de
#
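# Once those models are downloaded, the Spacy tokenizers can be tried directly.
# A small sketch, assuming the shortcut links created by the download commands
# above (the example sentence and the printed token list are illustrative only):
#
# ::
#
#    import spacy
#
#    spacy_de = spacy.load('de')
#    print([tok.text for tok in spacy_de.tokenizer('Zwei Männer stehen am Herd.')])
#    # e.g. ['Zwei', 'Männer', 'stehen', 'am', 'Herd', '.']
#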
# With Spacy installed, the following code will tokenize each of the sentences
# in the ``TranslationDataset`` based on the tokenizer defined in the ``Field``.

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
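
# The ``Field`` definitions for the source (German) and target (English)
# languages are elided from this excerpt; a minimal sketch of what they
# typically look like (the exact keyword arguments here are an assumption based
# on the standard ``Field`` API, not the tutorial's verbatim code):
#
# ::
#
#    SRC = Field(tokenize = "spacy",
#                tokenizer_language = "de",
#                init_token = '<sos>',
#                eos_token = '<eos>',
#                lower = True)
#
#    TRG = Field(tokenize = "spacy",
#                tokenizer_language = "en",
#                init_token = '<sos>',
#                eos_token = '<eos>',
#                lower = True)
#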

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))

######################################################################
# Now that we've defined ``train_data``, we can see an extremely useful
# feature of ``torchtext``'s ``Field``: the ``build_vocab`` method
# now allows us to create the vocabulary associated with each language

SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

######################################################################
# Once these lines of code have been run, ``SRC.vocab.stoi`` will be a
# dictionary with the tokens in the vocabulary as keys and their
# corresponding indices as values; ``SRC.vocab.itos`` provides the reverse
# mapping, from indices back to tokens. We won't make extensive
# use of this fact in this tutorial, but this will likely be useful in
# other NLP tasks you'll encounter.
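#
# For example, once the vocabularies above have been built, the two mappings
# can be used roughly as follows (a sketch; the token chosen here is only an
# illustration and must actually appear in the vocabulary):
#
# ::
#
#    idx = SRC.vocab.stoi['ein']      # token -> integer index
#    token = SRC.vocab.itos[idx]      # integer index -> token
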
######################################################################
# ``BucketIterator``
# ----------------
# The last ``torchtext`` specific feature we'll use is the ``BucketIterator``,
# which is easy to use since it takes a ``TranslationDataset`` as its
# first argument. Specifically, as the docs say:
# Defines an iterator that batches examples of similar lengths together.
# Minimizes amount of padding needed while producing freshly shuffled
# batches for each new epoch. See pool for the bucketing procedure used.

import torch
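# The device and batch-size definitions used by the iterators are elided from
# this excerpt; a minimal sketch of that setup (the batch size of 128 is only
# illustrative) would be:
#
# ::
#
#    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#    BATCH_SIZE = 128
#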

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size = BATCH_SIZE, device = device)

######################################################################
# These iterators can be called just like ``DataLoader``s; below, in
# the ``train`` and ``evaluate`` functions, they are called simply with:
#
# ::
#
#    for i, batch in enumerate(iterator):
#
# Each ``batch`` then has ``src`` and ``trg`` attributes:
#
# ::
#
#    src = batch.src
#    trg = batch.trg
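#
# In this configuration the batches are sequence-first, so ``batch.src`` has
# shape ``[src_len, batch_size]`` and ``batch.trg`` has shape
# ``[trg_len, batch_size]``. A quick check (a sketch) would be:
#
# ::
#
#    batch = next(iter(train_iterator))
#    print(batch.src.shape)   # torch.Size([src_len, batch_size])
#    print(batch.trg.shape)   # torch.Size([trg_len, batch_size])
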
######################################################################
# Defining our ``nn.Module`` and ``Optimizer``
# ----------------
# That's mostly it from a ``torchtext`` perspective: with the dataset built
# and the iterator defined, the rest of this tutorial simply defines our
# model as an ``nn.Module``, along with an ``Optimizer``, and then trains it.
#
# Our model specifically follows the architecture described
# `here <https://arxiv.org/abs/1409.0473>`__ (you can find a
# significantly more commented version
# `here <https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb>`__).
#
# Note: this model is just an example model that can be used for language
# translation; we choose it because it is a standard model for the task,
# not because it is the recommended model to use for translation. As you're
# likely aware, state-of-the-art models are currently based on Transformers;
# you can see PyTorch's capabilities for implementing Transformer layers
# `here <https://pytorch.org/docs/stable/nn.html#transformer-layers>`__; and
# in particular, the "attention" used in the model below is different from
# the multi-headed self-attention present in a transformer model.
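#
# As a rough illustration of the kind of attention used here (a sketch only,
# not the exact module defined in the full tutorial; the names and dimensions
# are assumptions), additive attention scores every encoder state against the
# current decoder hidden state and normalizes the scores over source positions:
#
# ::
#
#    import torch
#    import torch.nn.functional as F
#
#    def additive_attention(decoder_hidden, encoder_outputs, attn_linear, v):
#        # decoder_hidden: [batch, dec_dim]; encoder_outputs: [src_len, batch, enc_dim]
#        src_len = encoder_outputs.shape[0]
#        hidden = decoder_hidden.unsqueeze(0).repeat(src_len, 1, 1)
#        energy = torch.tanh(attn_linear(torch.cat((hidden, encoder_outputs), dim=2)))
#        scores = (v * energy).sum(dim=2)      # [src_len, batch]
#        return F.softmax(scores, dim=0)       # attention weights over the source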
import random

# (the following lines are from the model's ``forward`` method)
encoder_outputs, hidden = self.encoder(src)

# first input to the decoder is the <sos> token
output = trg[0,:]

for t in range(1, max_len):
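
# ``count_parameters`` (its signature appears in the full tutorial as
# ``def count_parameters(model: nn.Module)``) is defined in a portion elided
# from this excerpt; a standard implementation of such a helper (a sketch, not
# necessarily the tutorial's exact code) sums the elements of all trainable
# parameters:
#
# ::
#
#    def count_parameters(model: nn.Module):
#        return sum(p.numel() for p in model.parameters() if p.requires_grad)
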
print(f'The model has {count_parameters(model):,} trainable parameters')

######################################################################
# Note: when scoring the performance of a language translation model in
# particular, we have to tell the ``nn.CrossEntropyLoss`` function to
# ignore the indices where the target is simply padding.

PAD_IDX = TRG.vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)

######################################################################
# Finally, we can train and evaluate this model:

import math
import time
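
# The ``train`` and ``evaluate`` functions and the epoch loop are elided from
# this excerpt; a condensed sketch of a typical training step for this kind of
# model (the function and argument names here are assumptions, not the
# tutorial's verbatim code):
#
# ::
#
#    def train(model, iterator, optimizer, criterion, clip):
#        model.train()
#        epoch_loss = 0
#        for i, batch in enumerate(iterator):
#            src, trg = batch.src, batch.trg
#            optimizer.zero_grad()
#            output = model(src, trg)
#            # drop the <sos> position and flatten for ``nn.CrossEntropyLoss``
#            output = output[1:].view(-1, output.shape[-1])
#            trg = trg[1:].view(-1)
#            loss = criterion(output, trg)
#            loss.backward()
#            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
#            optimizer.step()
#            epoch_loss += loss.item()
#        return epoch_loss / len(iterator)
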
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

######################################################################
# Next steps
# --------------
#
# - Check out the rest of Ben Trevett's tutorials using ``torchtext``
#   `here <https://github.com/bentrevett/>`__
# - Stay tuned for a tutorial using other ``torchtext`` features along
#   with ``nn.Transformer`` for language modeling via next word prediction!
#