
Confusion about the relative position embedding with attn_type='bi' but bsz=1 #12

Open
NotANumber124 opened this issue Sep 4, 2019 · 1 comment
Labels: help wanted (Extra attention is needed), question (Further information is requested)

Comments

@NotANumber124

The default setting uses bidirectional data, attn_type='bi', but with bsz=1.
However, in this function,

def relative_positional_encoding(self, qlen, klen, d_model, clamp_len, attn_type,

the bidirectional branch only works when bsz % 2 == 0, yet the default is bsz = 1.
I am confused: if bsz = 1, is the setting of beg and end in the following code correct?

xlnet-Pytorch/xlnet.py

Lines 380 to 387 in cb793a1

if attn_type == 'bi':
    # beg, end = klen - 1, -qlen
    beg, end = klen, -qlen
elif attn_type == 'uni':
    # beg, end = klen - 1, -1
    beg, end = klen, -1
else:
    raise ValueError('Unknown `attn_type` {}.'.format(attn_type))

Could anyone help me with this confusion?
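
For context, here is a rough sketch of why the bsz % 2 == 0 check exists (it mirrors the original XLNet code; the tile_emb helper and all shapes are illustrative, not taken from this repo): when bsz is passed, each direction is tiled to bsz // 2 copies, so concatenating the forward and backward embeddings along dim=1 restores a batch of exactly bsz, which only works for an even bsz.

import torch

# Illustrative only: mimics the bsz-aware branch of relative positional encoding.
def tile_emb(pos_emb, half_bsz=None):              # pos_emb: [seq_len, 1, d_model]
    return pos_emb.expand(-1, half_bsz, -1) if half_bsz else pos_emb

seq_len, d_model, bsz = 11, 8, 4                   # toy sizes; bsz must be even here
fwd_pos_emb = tile_emb(torch.randn(seq_len, 1, d_model), bsz // 2)  # [11, 2, 8]
bwd_pos_emb = tile_emb(torch.randn(seq_len, 1, d_model), bsz // 2)  # [11, 2, 8]
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)
print(pos_emb.shape)                               # torch.Size([11, 4, 8]) == [seq_len, bsz, d_model]
# With bsz=1 the tiling is skipped, so dim=1 ends up with size 2 (one per direction) instead of bsz.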

@graykode added the bug, help wanted, and question labels and removed the bug label on Sep 29, 2019
@Asichurter

@NotANumber124 Actually, there is no problem with the beg and end values of the positional encoding, regardless of whether the bidirectional path is used, because beg and end only define the range of relative distances.

To see this, imagine a sequence of hidden states going through self-attention with shape [mlen+qlen, hidden_dim] (the mlen memory states come first, the qlen input states follow; the batch dimension is ignored here), where mlen refers to the memory and qlen to the input sequence. Because relative position is modeled in place of absolute position, we have to determine the range of the relative distance (i - j) by finding its maximum and minimum values and embedding that whole range. With bidirectional attention, the maximum relative distance comes from the last element of the sequence (index = mlen+qlen-1) attending to the left-most element (index = 0), which gives mlen+qlen-1. Similarly, the minimum relative distance comes from the first element of the input sequence (index = mlen) attending to the right-most element of the sequence (index = mlen+qlen-1), which gives -(qlen-1). Both extremes fall inside the span between end = -qlen and beg = klen that the position sequence is built over, so the values in the code are sufficient. Note that the mlen memory positions can only be attended to as K; they never act as Q, because memories are not queries.
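
To make the range concrete, here is a small self-contained check (toy mlen/qlen values; it assumes the position sequence is built as torch.arange(beg, end, -1.0) with beg, end = klen, -qlen, as in the original XLNet code) that every relative distance i - j reachable under bidirectional attention falls inside that sequence:

import torch

mlen, qlen = 3, 4                        # toy memory and input lengths
klen = mlen + qlen                       # total key length

beg, end = klen, -qlen                   # same values as the 'bi' branch above
pos_seq = torch.arange(beg, end, -1.0)   # klen, klen-1, ..., -qlen+1

# queries: the qlen input positions (absolute indices mlen .. klen-1)
# keys:    all klen positions (memory + input)
distances = {i - j for i in range(mlen, klen) for j in range(klen)}

assert min(distances) == -(qlen - 1) and max(distances) == klen - 1
assert all(float(d) in pos_seq.tolist() for d in distances)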

However, I found another place confusing. When bidirectional data is used, the forward and backward position embeddings are concatenated into a single pos_emb tensor along dim=1, which means this dimension refers to the direction of the data:

xlnet-Pytorch/xlnet.py

Lines 401 to 404 in cb793a1

fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)
bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)
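
A quick shape check (the sinusoid_emb helper below is only a stand-in for self.positional_embedding, assuming it returns a [seq_len, 1, d_model] tensor when no bsz is given, as in the original XLNet code) shows that dim=1 of pos_emb ends up holding the two directions, not the batch:

import torch

d_model = 8
inv_freq = 1.0 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))

def sinusoid_emb(pos_seq):                       # stand-in for self.positional_embedding
    sinusoid_inp = torch.einsum('i,d->id', pos_seq, inv_freq)
    return torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)[:, None, :]

fwd_pos_seq = torch.arange(7.0, -4.0, -1.0)      # beg = klen = 7, end = -qlen = -4
bwd_pos_seq = torch.arange(-7.0, 4.0, 1.0)

fwd_pos_emb = sinusoid_emb(fwd_pos_seq)          # [11, 1, 8]
bwd_pos_emb = sinusoid_emb(bwd_pos_seq)          # [11, 1, 8]
pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)
print(pos_emb.shape)                             # torch.Size([11, 2, 8]) -> dim=1 has size 2 (direction), not batch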

However, this dimension of the pos_emb tensor is misused as the batch_size dimension when the relative attention score is calculated:

bd = torch.einsum('ibnd,jbnd->ijbn', q_head + r_r_bias, k_head_r)

No errors occur just because batch_size is 1 by default, so the direction dimension incidentally passes for the batch dimension. When batch_size increases, this can cause dimension-mismatch errors. I hope an explanation and a fix can be provided.
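
As a rough illustration of that failure mode (toy shapes only, not the repo's real tensors; k_head_r here just stands in for the projection of pos_emb), the 'b' axis of the two einsum operands no longer matches once batch_size differs from the two directions:

import torch

qlen, klen, bsz, n_head, d_head = 4, 7, 4, 2, 8
pos_len = klen + qlen                                  # length of the relative position sequence

q_head   = torch.randn(qlen, bsz, n_head, d_head)      # [i, b, n, d], b = batch_size
r_r_bias = torch.randn(n_head, d_head)
k_head_r = torch.randn(pos_len, 2, n_head, d_head)     # [j, 2, n, d], dim=1 is direction, not batch

try:
    bd = torch.einsum('ibnd,jbnd->ijbn', q_head + r_r_bias, k_head_r)
except RuntimeError as err:
    print('b-axis mismatch (4 vs 2):', err)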
