Feedback welcome on xformers #1

Open
blefaudeux opened this issue Apr 28, 2022 · 10 comments

blefaudeux commented Apr 28, 2022

Hey there, while investigating some strange download numbers for xformers I stumbled here and had a look at https://github.com/kashif/pytorch-transformer-ts/blob/main/xformers/xformers.ipynb

From what I can see an attention mask is being passed, and we've tried to homogenize this into an additive mask (i.e. values you want to mask out are "-inf", since the mask is applied pre-softmax). One benefit is that other encodings can be passed as additive masks, for instance if you want to emphasize local attention (à la ALiBi).
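
For illustration, a minimal sketch of what such an additive mask looks like on toy tensors (not the notebook's actual data): masked positions are set to -inf so they get exactly zero weight after the softmax.

```python
import torch

# Toy shapes: batch of 2 sequences, context length 4.
keep = torch.tensor([[True, True, True, False],
                     [True, True, False, False]])   # True = attend, False = mask out

# Additive form: 0 where attention is allowed, -inf where it is not.
additive_mask = torch.zeros(keep.shape).masked_fill(~keep, float("-inf"))

# The mask is added to the attention scores before the softmax,
# so the -inf positions end up with zero attention weight.
scores = torch.randn(2, 4, 4)                        # [batch, queries, keys]
weights = (scores + additive_mask[:, None, :]).softmax(dim=-1)
```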

Feel free to reach out if something looks suspicious; xformers probably lacks some explanations here and there, and any feedback is welcome. If email is easier, benjamin lefaudeux $ pm me would work. Cheers


kashif commented Apr 28, 2022

oh thank you so much! let me have a look and get back to you!! I truly appreciate your help!


kashif commented May 24, 2022

@blefaudeux one issue I have is that if I set the encoder of my transformer to be an xFormers encoder block and then use the standard decoder from nn.Transformer, the resulting model doesn't seem to train or learn anything... See the notebook here for how I am doing that: https://github.com/kashif/pytorch-transformer-ts/blob/main/xformers/xformers.ipynb. Would you have any intuition on why this is the case?

Thanks!!

blefaudeux commented:

Hmm, checking the code right now, but it's often a sign that the graph is broken, in that autograd cannot walk the chain back from the final loss to the inputs. It can happen when you change variables in the middle, do some in-place operations (there's a guard for that and it normally asserts), or mess up a transform in the middle which makes things uniformly random.

When you say that it does not train, the loss changes but does not improve, right? If that's the case I would go for the third point: the graph is not broken, but some operation in the middle randomizes the data.
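
To make the first failure mode concrete, a minimal sketch with stand-in tensors rather than the actual model: once the chain is cut, everything upstream of the cut stops receiving gradients.

```python
import torch

x = torch.randn(8, 16, requires_grad=True)        # stand-in for the inputs
w_enc = torch.randn(16, 16, requires_grad=True)   # "encoder" weights
w_dec = torch.randn(16, 4, requires_grad=True)    # "decoder" weights

h = torch.relu(x @ w_enc)

# Accidentally cutting the graph (detach(), .data, rebuilding the tensor, ...):
# autograd can no longer walk back past this point.
h = h.detach()

loss = (h @ w_dec).sum()
loss.backward()

print(w_dec.grad is not None)   # True  -- the decoder still receives gradients
print(w_enc.grad, x.grad)       # None None -- everything upstream of the cut never learns
```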

blefaudeux commented:

To give you an example with pictures: if you do a reshape of a batched tensor in the middle of a model and by mistake mix the contents from all the pictures in doing so (it can happen relatively easily given some reshape assumptions), then there is nothing to learn from the end of the pipe; the data is effectively randomized.
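
A toy illustration of that pitfall (hypothetical shapes, not the notebook's): reshaping a sequence-first tensor as if it were batch-first silently interleaves samples, while a transpose does not.

```python
import torch

# Hypothetical encoder output in [context, batch, embedding] layout: S=3, B=2, E=4.
seq_first = torch.arange(3 * 2 * 4).reshape(3, 2, 4)

# Wrong: treating it as [batch, context, embedding] with a plain reshape.
# reshape keeps the flat memory order, so wrong[0] now interleaves rows
# from both samples -- the batch contents are mixed, i.e. randomized.
wrong = seq_first.reshape(2, 3, 4)

# Right: a transpose swaps the axes without mixing any data.
right = seq_first.transpose(0, 1)                 # [B=2, S=3, E=4]
```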


kashif commented May 25, 2022

Right, it could be some shuffling going on... I will check what the format of the memory output from the xformers encoder is vs. the vanilla PyTorch encoders... thanks!

blefaudeux commented:

> Right, it could be some shuffling going on... I will check what the format of the memory output from the xformers encoder is vs. the vanilla PyTorch encoders... thanks!

Ahh, that makes me think: there's an option when constructing the PyTorch transformers to say that you are "batch first", maybe it's because of that. xFormers follows [Batch x Context x Embedding] everywhere.
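
For reference, a small sketch of the two layouts, using a plain nn.TransformerDecoderLayer as a stand-in for the decoder side (toy shapes, not the notebook's model):

```python
import torch
import torch.nn as nn

B, S, E = 2, 10, 32                       # toy shapes
memory = torch.randn(B, S, E)             # xFormers-style output: [batch, context, embedding]
tgt = torch.randn(B, S, E)

# With batch_first=True the decoder consumes [batch, context, embedding] directly.
dec_bf = nn.TransformerDecoderLayer(d_model=E, nhead=4, batch_first=True)
out_bf = dec_bf(tgt, memory)                                   # [B, S, E]

# The default (sequence-first) decoder expects [context, batch, embedding],
# so both tensors have to be transposed on the way in and back on the way out.
dec_sf = nn.TransformerDecoderLayer(d_model=E, nhead=4)
out_sf = dec_sf(tgt.transpose(0, 1), memory.transpose(0, 1)).transpose(0, 1)
```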


kashif commented May 25, 2022

Right, that is taken care of: I make nn.Transformer batch_first, so internally it transposes the output from the xformers encoder via transpose(0, 1) and then passes that to the MHA function etc.

If I use the xformers decoder the model trains, but if I use the MHA-based PyTorch decoder it doesn't seem to train...

The MHA, I believe, is implemented in C++ on the PyTorch side, so I might try a Python version of it to see...


kashif commented May 30, 2022

@blefaudeux I believe I got it to train, and with Nyström etc. it works... However, with some other attention variants, e.g. random and others, I do get errors like:

RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.

Should I open an issue for them on the xformers side?


kashif commented May 31, 2022

@blefaudeux OK, so I figured out that the issue with the inference tensors (above) occurs because in my validation step I am using torch.inference_mode(); if I change it to torch.no_grad() I do not get the RuntimeError...
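
For anyone hitting the same thing, a minimal sketch of the difference with a stand-in nn.Linear rather than the actual model: a tensor created under torch.inference_mode() cannot later be saved for backward, while a torch.no_grad() output can.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                   # stand-in for the actual model
x = torch.randn(4, 8)

with torch.inference_mode():
    cached = model(x)                     # an "inference tensor"

# Feeding it back into an op that records gradients raises:
# RuntimeError: Inference tensors cannot be saved for backward. ...
try:
    model(cached).sum().backward()
except RuntimeError as err:
    print(err)

with torch.no_grad():
    cached = model(x)                     # a plain tensor, just with requires_grad=False

model(cached).sum().backward()            # fine: gradients flow into model's parameters
```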

blefaudeux commented:

Oh great! Sorry for the delay; I saw your message but was not sure about that one. I'm glad you found out what was happening. I didn't know about torch.inference_mode(), I'll look it up.
