Before Transformers:
- RNN
- LSTM
: slow to train
Can we parallelize sequential data?
Transformers
The input sequence can be processed in parallel
No concept of time step
Pass all the words simultaneously and determine the word embeddings simultaneously
(An RNN passes input words one after another)
Input Embedding
In the embedding space, words with similar meanings are located close to each other
There are already pretrained embedding spaces.
But the same word can have a different meaning in a different sentence!
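Before position comes into play, the embedding step itself is just a lookup from token IDs to vectors. A minimal PyTorch sketch (vocabulary size and dimension are made-up values):

import torch
from torch import nn

vocab_size, d_model = 10000, 512              # made-up sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7, 99]])    # one sentence as 4 token IDs
word_vectors = embedding(token_ids)           # [1, 4, 512]: one 512-dim vector per word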
Positional Encoder
: vector that gives context information based on the position of the word in a sentence
Sin/cos functions can be used to generate the PE, but any reasonable function is OK
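A minimal sketch of the sin/cos positional encoding (sizes are made up); the resulting PE vector is simply added to each word embedding:

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=4, d_model=512)   # [4, 512]
# word_vectors + pe gives each word a position-dependent vector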
Structure of Encoder
1. Attention
: What part of the input should we focus on?
How relevant is the word 'The' to the other words (big, red, dog) in the same sentence?
Attention Vectors (of English) contain contextual relationships between the words in the sentence.
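A toy sketch of this idea: the query vector of 'The' is scored against the keys of all four words (numbers and dimensions are made up):

import math
import torch

d = 4                                      # made-up key/query dimension
q_the = torch.rand(1, d)                   # query vector for 'The'
keys = torch.rand(4, d)                    # key vectors for 'The', 'big', 'red', 'dog'

scores = q_the @ keys.T / math.sqrt(d)     # similarity of 'The' to every word
weights = torch.softmax(scores, dim=-1)    # how much 'The' attends to each word (sums to 1)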
2. Feed Forward
A simple feed-forward network is applied to every one of the attention vectors!
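A minimal sketch of such a position-wise feed-forward network (the hidden size is a made-up choice); the same two linear layers are applied to every word's attention vector:

from torch import nn

d_model, d_hidden = 512, 2048              # made-up sizes
ffn = nn.Sequential(
    nn.Linear(d_model, d_hidden),
    nn.ReLU(),
    nn.Linear(d_hidden, d_model),
)
# applied independently at each position: [batch, length, d_model] -> [batch, length, d_model]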
Problem with the attention vector
It focuses too much on itself..
We want to know the interactions and relationships between words!
➡ Use multiple attention vectors for the same word and average them: Multi-Head Attention Block
(Q. vectors from different sentences..?)
Attention vectors are fed to Feed Forward Network one vector at a time
Each attention vector for a different word is independent of the others
➡ so the Feed Forward Network can be parallelized!
➡ All words can be passed to the encoder block at the same time and the output is a set of encoded vectors for every word
Structure of Decoder
In the English -> French task, we feed the French output sentence to the decoder
1. Self-attention block
Generates attention vectors (of French) showing how much each word is related to the others
Attention vectors from both the Encoder (English) and the Decoder (French) are passed to another Encoder-Decoder Attention block.
➡ output of this block: attention vectors for all words (English + French)
➡ Each attention vector shows the relationships with other words across both languages
➡ English to French word mapping happens!
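A rough sketch of how this encoder-decoder attention can be wired up, reusing the MultiHeadAttention class from the Codes section below (all shapes here are made up):

import torch

enc_out = torch.rand(2, 10, 512)   # encoder output: 2 sentences, 10 English tokens, d_model = 512
dec_x = torch.rand(2, 7, 512)      # decoder self-attention output: 7 French tokens

cross_attn = MultiHeadAttention(d_model=512, n_head=8)
# Query comes from the decoder (French), Key/Value come from the encoder (English)
out = cross_attn(q=dec_x, k=enc_out, v=enc_out)   # [2, 7, 512]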
Multi-headed Attention
If we used all the words in the French sentence, there would be no learning, just spitting out the next word
➡ mask the input: each word can see only itself and the previous words
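A minimal sketch of the look-ahead mask, assuming a 5-word French sentence:

import torch

length = 5
mask = torch.tril(torch.ones(length, length))   # lower triangular: 1 = visible, 0 = masked
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# Positions where mask == 0 get a large negative score before the softmax,
# so each word attends only to itself and the words before it.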
Single-headed attention vs Multi-headed attention
1) Single-headed attention
V, K, Q: abstract vectors that extract different components of the input words
We have V,K,Q vectors for every single word
➡ create an attention vector for every word using V, K, Q
2) Multi-headed attention
Have multiple weight matrices (Wv, Wk, Wq)
➡ multiple attention vectors for every word
➡ another weight matrix (Wz) combines them into a single vector per word
➡ now the feed-forward NN can be fed only one attention vector per word
2. Feed-forward unit
Pass each attention vector to the feed-forward unit
3. Linear layer
: another Feed Forward Layer
Used to expand the dimension to the number of words in the French vocabulary
4. Softmax layer
Transforms it into Probability Distribution
Output: the word with the highest probability of coming next
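A minimal sketch of the final linear + softmax step, with made-up sizes (d_model = 512, French vocabulary of 30000 words):

import torch
from torch import nn

d_model, vocab_size = 512, 30000
decoder_out = torch.rand(1, 7, d_model)     # decoder output for 7 positions

linear = nn.Linear(d_model, vocab_size)     # expand to the vocabulary size
logits = linear(decoder_out)                # [1, 7, 30000]
probs = torch.softmax(logits, dim=-1)       # probability distribution over the vocabulary
next_word_id = probs[0, -1].argmax()        # most likely next word at the last position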
Codes
Reference to https://github.com/hyunwoongko/transformer
Scaled Dot-Product Attention
import math

from torch import nn


class ScaleDotProductAttention(nn.Module):
    """
    compute scaled dot-product attention

    Query : given sentence that we focus on (decoder)
    Key   : every sentence to check relationship with Query (encoder)
    Value : every sentence, same as Key (encoder)
    """

    def __init__(self):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, q, k, v, mask=None):
        # input is a 4-dimensional tensor
        # [batch_size, head, length, d_tensor]
        batch_size, head, length, d_tensor = k.size()

        # 1. dot product Query with Key^T to compute similarity
        k_t = k.transpose(2, 3)  # transpose
        score = (q @ k_t) / math.sqrt(d_tensor)  # scaled dot product

        # 2. apply masking (optional): masked positions get a large negative
        #    score so they receive ~0 weight after the softmax
        if mask is not None:
            score = score.masked_fill(mask == 0, -10000)

        # 3. pass through softmax to map scores to the [0, 1] range
        score = self.softmax(score)

        # 4. multiply with Value
        v = score @ v

        return v, score
Multi-head Attention
class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, n_head):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_concat = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        # 1. dot product with weight matrices
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)

        # 2. split tensor by number of heads
        q, k, v = self.split(q), self.split(k), self.split(v)

        # 3. do scaled dot-product attention to compute similarity
        out, attention = self.attention(q, k, v, mask=mask)

        # 4. concat heads and pass to the linear layer
        out = self.concat(out)
        out = self.w_concat(out)

        # 5. visualize attention map
        # TODO : we should implement visualization
        return out

    def split(self, tensor):
        """
        split tensor by number of heads

        :param tensor: [batch_size, length, d_model]
        :return: [batch_size, head, length, d_tensor]
        """
        batch_size, length, d_model = tensor.size()

        d_tensor = d_model // self.n_head
        tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
        # it is similar to group convolution (split by number of heads)

        return tensor

    def concat(self, tensor):
        """
        inverse function of self.split(tensor : torch.Tensor)

        :param tensor: [batch_size, head, length, d_tensor]
        :return: [batch_size, length, d_model]
        """
        batch_size, head, length, d_tensor = tensor.size()

        d_model = head * d_tensor
        tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
        return tensor
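A quick usage sketch with made-up shapes (batch of 2, sequence length 10, d_model = 512, 8 heads):

import torch

x = torch.rand(2, 10, 512)
mha = MultiHeadAttention(d_model=512, n_head=8)

out = mha(q=x, k=x, v=x)   # self-attention: Q, K, V all come from the same sequence
print(out.shape)           # torch.Size([2, 10, 512])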
REF
https://github.com/hyunwoongko/transformer
https://www.youtube.com/watch?v=TQQlZhbC5ps