Feedback welcome on xformers #1
Oh thank you so much! Let me have a look and get back to you!! I truly appreciate your help!
@blefaudeux one issue I have is that if I set the encoder of my transformer to be an xFormers encoder block and then use the standard PyTorch decoder, the model does not seem to train. Thanks!!
Hmm, checking the code right now. It's often a sign that the graph is broken, in that autograd cannot walk back the chain from the final loss to the inputs. It can happen when you change variables in the middle, do some in-place operations (there's a guard for that and it normally asserts), or mess up a transform in the middle which makes things uniformly random. When you say that it does not train, the loss changes but does not improve, right? If that's the case I would go for the third point: the graph is not broken, but some operation in the middle randomizes it.
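A minimal sketch (not from the thread) of how a "broken" graph shows up in practice: once a tensor is detached, the loss has no `grad_fn` path back to the parameters, so `backward()` has nothing to walk through and training silently stalls. The variable names here are illustrative only.

```python
import torch

# Toy parameters and inputs, purely for illustration.
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

# Intact graph: the loss knows how it was computed from w.
good_loss = (w * x).sum()
assert good_loss.grad_fn is not None  # backward() would reach w

# Broken graph: detach() severs the chain back to w.
broken_loss = (w * x).detach().sum()
assert broken_loss.grad_fn is None    # autograd cannot walk back from here
```

Checking `loss.grad_fn` (and that `backward()` populates the expected `.grad` fields) is a quick way to rule out the first failure mode the comment describes.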
To give you an example with pictures: if you reshape a batched tensor in the middle of a model and by mistake mix the contents of all the pictures in doing so (it can happen relatively easily given some reshape assumptions), then there is nothing to learn at the end of the pipe; the data is effectively randomized.
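The reshape pitfall described above can be sketched in a few lines: transposing and then flattening a batch reads elements in the wrong order, so each "sample" ends up containing pieces of the others. The shapes here are tiny toy values.

```python
import torch

# 2 samples with 3 features each: sample 0 -> [0, 1, 2], sample 1 -> [3, 4, 5]
batch = torch.arange(2 * 3).reshape(2, 3)

# A reshape that respects the batch dimension round-trips cleanly.
ok = batch.reshape(2, 3)
assert torch.equal(ok[0], torch.tensor([0, 1, 2]))

# Transpose first and the same reshape silently blends samples together.
mixed = batch.t().reshape(2, 3)
assert torch.equal(mixed[0], torch.tensor([0, 3, 1]))  # no longer sample 0's data
```

Nothing errors out, which is exactly why the model "trains" but the loss never improves: the labels no longer correspond to the inputs.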
Right, it could be some shuffling going on... I will check the format of the memory output from the encoder in xFormers vs. the vanilla PyTorch encoders... thanks!
Ahh, that makes me think: there's an option when constructing the PyTorch transformers to say that you are "batch first", maybe that's the cause. xFormers follows [Batch x Context x Embedding] everywhere.
Right, that is taken care of: I make nn.Transformer batch_first, so internally it transposes the output from the xFormers encoder via transpose(0, 1) and then passes that to the MHA function etc. If I use the xFormers decoder the model trains, but if I use the MHA / PyTorch decoder it doesn't seem to train... The MHA, I believe, is implemented in C++ on the PyTorch side, so I might try a Python version of it to see...
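The layout conversion the two comments are discussing can be sketched as follows. The sizes and `num_heads` value are made-up; the point is that xFormers-style output is [Batch x Context x Embedding], while `torch.nn.MultiheadAttention` defaults to [Context x Batch x Embedding] unless it is built with `batch_first=True`.

```python
import torch

B, S, E = 4, 10, 16                      # batch, context, embedding (toy values)
memory_bse = torch.randn(B, S, E)        # xFormers-style encoder output

# Option 1: transpose to the sequence-first layout PyTorch defaults to.
memory_sbe = memory_bse.transpose(0, 1)
assert memory_sbe.shape == (S, B, E)

# Option 2: keep [B, S, E] everywhere by constructing the torch module
# with batch_first=True (available since PyTorch 1.9).
mha = torch.nn.MultiheadAttention(embed_dim=E, num_heads=4, batch_first=True)
out, _ = mha(memory_bse, memory_bse, memory_bse)
assert out.shape == (B, S, E)
```

If the transpose is applied on one path but not the other (or applied twice), attention mixes unrelated positions, which matches the "does not train" symptom above.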
@blefaudeux I believe I got it to train, and with Nystrom etc. it works... however with some other attention heads e.g.
Should I open an issue for them on the xFormers side?
@blefaudeux ok, so I figured out that the issue with the inference tensors (above) occurs because in my validation step I am using torch.inference_mode().
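A small sketch of the failure mode, under the assumption (supported by the reply below) that the validation step ran under `torch.inference_mode()`: tensors created there are "inference tensors" and cannot later take part in a computation that autograd needs to record, whereas `torch.no_grad()` does not carry that restriction.

```python
import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    safe = x * 2.0          # ordinary tensor, reusable anywhere later

with torch.inference_mode():
    fast = x * 2.0          # inference tensor: cheaper, but restricted

_ = (safe * x).sum()        # fine: autograd can record and save `safe`

raised = False
try:
    _ = (fast * x).sum()    # autograd would need to save `fast` for backward
except RuntimeError:
    raised = True
assert raised               # inference tensors error out of recorded graphs
```

So the fix is either to keep validation under `torch.no_grad()`, or to make sure nothing produced inside `inference_mode()` leaks back into training-time computation.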
Oh great! Sorry for the delay, I saw your message but was not sure about that one; I'm glad that you found out what was happening. I didn't know of this torch.inference_mode(), I'll look it up.
Hey there, trying to investigate some strange download numbers for xformers I stumbled here, and had a look at https://github.com/kashif/pytorch-transformer-ts/blob/main/xformers/xformers.ipynb
From what I can see an attention mask was passed, and we've tried to homogenize this to an additive mask (i.e. values which you want to nuke are "-inf", since this is applied pre-softmax). One benefit is that other encodings can be passed as additive masks, for instance if you want to emphasize local attention (à la ALiBi).
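The additive-mask convention mentioned above can be sketched like this: masked positions get `-inf` so they contribute zero weight after the pre-softmax addition, and any other bias (here an ALiBi-style distance penalty, with a made-up slope) composes by simple addition. Shapes and values are illustrative.

```python
import torch

scores = torch.randn(1, 4, 4)                        # raw attention logits

# Boolean padding mask -> additive mask: masked keys become -inf.
bool_mask = torch.tensor([True, True, True, False])  # last key is padding
additive = torch.zeros(4).masked_fill(~bool_mask, float("-inf"))

probs = torch.softmax(scores + additive, dim=-1)
assert torch.allclose(probs[..., -1], torch.zeros(1, 4))  # masked key: weight 0

# An ALiBi-style bias is just another additive term: penalize distant keys.
positions = torch.arange(4)
alibi_bias = -0.5 * (positions[None, :] - positions[:, None]).abs().float()
probs_biased = torch.softmax(scores + alibi_bias + additive, dim=-1)
assert torch.allclose(probs_biased.sum(-1), torch.ones(1, 4))  # still a softmax
```

This is why homogenizing on additive masks is convenient: padding masks, causal masks, and positional biases all become terms in one sum applied before the softmax.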
Feel free to reach out if something looks suspicious; xformers probably lacks some explanations here and there, any feedback welcome. If mail is easier,
benjamin lefaudeux $ pm me
would work. Cheers