
Confusion about the relative position embedding with attn_type='bi' but bsz=1 #12

Open
NotANumber124 opened this issue Sep 4, 2019 · 1 comment
Labels: help wanted (Extra attention is needed), question (Further information is requested)

Comments

@NotANumber124

The default setting uses bidirectional data, attn_type='bi', but with bsz=1.
However, in this function,

def relative_positional_encoding(self, qlen, klen, d_model, clamp_len, attn_type,

the bidirectional branch only works when bsz % 2 == 0, yet the default is bsz = 1.
I am confused: if bsz = 1, is the setting of beg and end in the following code correct?

xlnet-Pytorch/xlnet.py

Lines 380 to 387 in cb793a1

if attn_type == 'bi':
    # beg, end = klen - 1, -qlen
    beg, end = klen, -qlen
elif attn_type == 'uni':
    # beg, end = klen - 1, -1
    beg, end = klen, -1
else:
    raise ValueError('Unknown `attn_type` {}.'.format(attn_type))

Could anyone help me with this confusion?
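
For context, here is a rough sketch of why the bsz % 2 == 0 check exists (it mirrors the original XLNet code; the tile_emb helper and all shapes are illustrative, not taken from this repo): when bsz is passed, each direction is tiled to bsz // 2 copies, so concatenating the forward and backward embeddings along dim=1 restores a batch of exactly bsz, which only works for an even bsz.

import torch

# Illustrative only: mimics the bsz-aware branch of relative positional encoding.
def tile_emb(pos_emb, half_bsz=None):              # pos_emb: [seq_len, 1, d_model]
    return pos_emb.expand(-1, half_bsz, -1) if half_bsz else pos_emb

seq_len, d_model, bsz = 11, 8, 4                   # toy sizes; bsz must be even here
fwd_pos_emb = tile_emb(torch.randn(seq_len, 1, d_model), bsz // 2)  # [11, 2, 8]
bwd_pos_emb = tile_emb(torch.randn(seq_len, 1, d_model), bsz // 2)  # [11, 2, 8]
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)
print(pos_emb.shape)                               # torch.Size([11, 4, 8]) == [seq_len, bsz, d_model]
# With bsz=1 the tiling is skipped, so dim=1 ends up with size 2 (one per direction) instead of bsz.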

@graykode added the bug, help wanted, and question labels and removed the bug label on Sep 29, 2019
@Asichurter

@NotANumber124 Actually, there is no problem with the beg and end values of the positional encoding, regardless of whether the bidirectional path is used, because beg and end only define the range of relative distances.

To see this, imagine a sequence of hidden states going through self-attention with shape [mlen+qlen, hidden_dim] (the mlen memory states come first, the qlen input states follow; the batch dimension is ignored here), where mlen refers to the memory and qlen to the input sequence. Because relative position is modeled in place of absolute position, we have to determine the range of the relative distance (i - j) by finding its maximum and minimum values and embedding that whole range. With bidirectional attention, the maximum relative distance comes from the last element of the sequence (index = mlen+qlen-1) attending to the left-most element (index = 0), which gives mlen+qlen-1. Similarly, the minimum relative distance comes from the first element of the input sequence (index = mlen) attending to the right-most element of the sequence (index = mlen+qlen-1), which gives -(qlen-1). Both extremes fall inside the span between end = -qlen and beg = klen that the position sequence is built over, so the values in the code are sufficient. Note that the mlen memory positions can only be attended to as K; they never act as Q, because memories are not queries.
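
To make the range concrete, here is a small self-contained check (toy mlen/qlen values; it assumes the position sequence is built as torch.arange(beg, end, -1.0) with beg, end = klen, -qlen, as in the original XLNet code) that every relative distance i - j reachable under bidirectional attention falls inside that sequence:

import torch

mlen, qlen = 3, 4                        # toy memory and input lengths
klen = mlen + qlen                       # total key length

beg, end = klen, -qlen                   # same values as the 'bi' branch above
pos_seq = torch.arange(beg, end, -1.0)   # klen, klen-1, ..., -qlen+1

# queries: the qlen input positions (absolute indices mlen .. klen-1)
# keys:    all klen positions (memory + input)
distances = {i - j for i in range(mlen, klen) for j in range(klen)}

assert min(distances) == -(qlen - 1) and max(distances) == klen - 1
assert all(float(d) in pos_seq.tolist() for d in distances)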

However, I found another place confusing. When bidirectional data is used, the forward and backward position embeddings are concatenated into a single pos_emb tensor along dim=1, which means this dimension refers to the direction of the data:

xlnet-Pytorch/xlnet.py

Lines 401 to 404 in cb793a1

fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)
bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)
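
A quick shape check (the sinusoid_emb helper below is only a stand-in for self.positional_embedding, assuming it returns a [seq_len, 1, d_model] tensor when no bsz is given, as in the original XLNet code) shows that dim=1 of pos_emb ends up holding the two directions, not the batch:

import torch

d_model = 8
inv_freq = 1.0 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))

def sinusoid_emb(pos_seq):                       # stand-in for self.positional_embedding
    sinusoid_inp = torch.einsum('i,d->id', pos_seq, inv_freq)
    return torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)[:, None, :]

fwd_pos_seq = torch.arange(7.0, -4.0, -1.0)      # beg = klen = 7, end = -qlen = -4
bwd_pos_seq = torch.arange(-7.0, 4.0, 1.0)

fwd_pos_emb = sinusoid_emb(fwd_pos_seq)          # [11, 1, 8]
bwd_pos_emb = sinusoid_emb(bwd_pos_seq)          # [11, 1, 8]
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)
print(pos_emb.shape)                             # torch.Size([11, 2, 8]) -> dim=1 has size 2 (direction), not batch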

However, this dimension of the pos_emb tensor is misused as the batch_size dimension when the relative attention score is calculated:

bd = torch.einsum('ibnd,jbnd->ijbn', q_head + r_r_bias, k_head_r)

No errors occur just because batch_size is 1 by default, so the direction dimension incidentally passes for the batch dimension. When batch_size increases, this can cause dimension-mismatch errors. I hope an explanation and a fix can be provided.
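
As a rough illustration of that failure mode (toy shapes only, not the repo's real tensors; k_head_r here just stands in for the projection of pos_emb), the 'b' axis of the two einsum operands no longer matches once batch_size differs from the two directions:

import torch

qlen, klen, bsz, n_head, d_head = 4, 7, 4, 2, 8
pos_len = klen + qlen                                  # length of the relative position sequence

q_head   = torch.randn(qlen, bsz, n_head, d_head)      # [i, b, n, d], b = batch_size
r_r_bias = torch.randn(n_head, d_head)
k_head_r = torch.randn(pos_len, 2, n_head, d_head)     # [j, 2, n, d], dim=1 is direction, not batch

try:
    bd = torch.einsum('ibnd,jbnd->ijbn', q_head + r_r_bias, k_head_r)
except RuntimeError as err:
    print('b-axis mismatch (4 vs 2):', err)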
