Feature/ Better LoRA : Dropout, Conv2d #133
Conversation
Excellent!
Definitely interesting experiments. What parameter are we optimizing for: total binary size, training speed, multiple-concept accuracy, something else? (The reason I ask is my first thought was "how does this compare to just rank 16 on CrossAttention / Attention / GEGLU".) Is there some way we can auto-detect when the rank is insufficient? (Like maybe a flat gradient while training while still having a high error rate.)
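On auto-detecting an insufficient rank: one heuristic along those lines is to watch the gradient norms of the LoRA parameters and flag when they flatten out while the loss is still high. A minimal sketch for a standard PyTorch training loop; the window and thresholds below are arbitrary placeholders, not values from this PR:

```python
import torch

def lora_grad_norm(params):
    # Total L2 norm of gradients across the LoRA parameters.
    norms = [p.grad.norm() for p in params if p.grad is not None]
    return torch.stack(norms).norm().item() if norms else 0.0

history = []  # (loss, grad_norm) per optimization step

def rank_looks_saturated(loss, lora_params, window=100, grad_eps=1e-4, loss_floor=0.1):
    # Heuristic: a flat gradient while the loss is still high suggests the
    # low-rank update has converged without fitting the data -> raise the rank.
    history.append((loss, lora_grad_norm(lora_params)))
    if len(history) < window:
        return False
    losses, grads = zip(*history[-window:])
    return max(grads) < grad_eps and min(losses) > loss_floor
```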
Also, just from an end-user POV, the greatest strength of LoRA for me is the easy adjustability of the strength of the various adjustments. It'd be interesting to see how useful post-training adjustment of individual LoRA module weights is (sketched below).
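For context on why that knob exists: the effective weight of an injected layer is W + scale · (B A), so rescaling `scale` per module re-weights that module's adjustment after training, with no retraining. A minimal sketch, assuming the injected wrappers expose `lora_up` and `scale` attributes (illustrative names, not necessarily this repo's exact API):

```python
def set_lora_scales(model, scale_map, default=1.0):
    # The LoRA update is W + scale * (up @ down); changing `scale` per module
    # adjusts that module's contribution post-training.
    for name, module in model.named_modules():
        if hasattr(module, "lora_up") and hasattr(module, "scale"):
            module.scale = scale_map.get(name, default)

# e.g. halve the strength of every cross-attention LoRA (diffusers names
# cross-attention "attn2"; that filter is an assumption):
# set_lora_scales(unet, {n: 0.5 for n, _ in unet.named_modules() if "attn2" in n})
```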
So I think it is kind of a constrained optimization at this point: we don't want the output to be too large, but we want the Dreambooth quality to be as high as possible. The objective is also mixed: distortion, perceptual fidelity, and editability are all properties we want, but they trade off against each other. Perceptual fidelity is also somewhat ill-defined, as the CLIP score doesn't seem to represent it very well, unlike what the Custom Diffusion or Textual Inversion papers would suggest.
It seems like what many people are looking for can be described simply as: identity preservation + editability.
Now @brian6091 and I had an idea of putting the LoRA on the resnet part only in the upsample UNet layers, because the downsample parts are mostly used to compress the representation, not to generate with fidelity. So we'll see if it works better (sketched below). @brian6091, are you going to leave a PR, or is it just your own thing?
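A sketch of that upsample-only idea, assuming a diffusers-style UNet (which exposes `up_blocks`) and this repo's `inject_trainable_lora`; the target classes and rank are illustrative:

```python
import itertools

import torch
from diffusers import UNet2DConditionModel
from lora_diffusion import inject_trainable_lora

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)

# Inject LoRA only into the decoder half; the down blocks stay frozen since
# they mostly compress the representation rather than generate detail.
lora_params = []
for block in unet.up_blocks:
    params, _ = inject_trainable_lora(
        block, target_replace_module={"CrossAttention", "Attention", "GEGLU"}, r=4
    )
    lora_params.extend(params)

optimizer = torch.optim.AdamW(itertools.chain(*lora_params), lr=1e-4)
```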
I've added dropout as well, so this PR is no longer only about conv layers. I'll rename it (a minimal example of the dropout placement below).
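For reference, a minimal sketch of what a LoRA linear with dropout can look like; exactly where the dropout sits in this PR is an implementation detail, so the placement below (on the low-rank path) is just one reasonable choice:

```python
import torch.nn as nn

class LoraLinearWithDropout(nn.Module):
    # y = W x + scale * up(dropout(down(x)))
    def __init__(self, base: nn.Linear, r: int = 4, p: float = 0.1, scale: float = 1.0):
        super().__init__()
        self.base = base
        self.lora_down = nn.Linear(base.in_features, r, bias=False)
        self.lora_up = nn.Linear(r, base.out_features, bias=False)
        self.dropout = nn.Dropout(p)
        self.scale = scale
        nn.init.normal_(self.lora_down.weight, std=1.0 / r)
        nn.init.zeros_(self.lora_up.weight)  # the adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_up(self.dropout(self.lora_down(x)))
```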
Your work and contributions are very underrated. Great stuff @cloneofsimo! Edit: I'm doing some testing of my own with these changes, and Better LoRA is a very big understatement.
Yeah, I think the ability to do ablation experiments will be super interesting. There are subjective differences between the different components (CrossAttention, FFN for example) that may be hard to capture with objective metrics (but come out with more complex prompting). But from the end-user perspective I think being able to define your own objective and having the tools to achieve that is ideal.
I'll work the scale/nonlinearity code back in once you've stabilized this (PR #111). The effects are subtle, but worth it IMO for the trivial cost.
I can leave a PR.
I've done various experiments, but more needs to be done. I've got mixed results in my case, so I'll add these options as optional for now. Using dropout does make a difference, though.
Note: I've found that training the resnets requires a very low learning rate: something like 5e-6 for me (see the param-group sketch below).
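One way to act on that without slowing the attention LoRAs down: optimizer parameter groups with separate learning rates. A sketch; the name filters ("lora", "resnets") are assumptions about how the injected parameters end up named:

```python
import torch

# `unet` is assumed to be the LoRA-injected UNet from training.
# Resnet submodules live under a "resnets" path in diffusers UNets.
resnet_lora = [p for n, p in unet.named_parameters()
               if p.requires_grad and "lora" in n and "resnets" in n]
other_lora = [p for n, p in unet.named_parameters()
              if p.requires_grad and "lora" in n and "resnets" not in n]

optimizer = torch.optim.AdamW([
    {"params": other_lora, "lr": 1e-4},   # attention / GEGLU LoRAs
    {"params": resnet_lora, "lr": 5e-6},  # resnet LoRAs want a much lower LR
])
```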
@cloneofsimo First of all, thank you for creating this repo. I've been tinkering with LoRA but can't say it's faster than Dreambooth (let alone twice as fast). Could you share the args you used for the above examples, please?
Is this related to the conv LoRAs, or just lora_pti in general? @okaris
The comment was about lora_pti in general; I can open a new issue for that. As for the question: I'd really like to know the settings you used for the above trainings and how long they took. Thanks!
It's been a while since I've evaluated the time it takes to train these models, but they take < 6 min in general. I think they aren't as fast as the previous ones (currently in the training scripts folder), because lora_pti is not optimized for speed and memory: no 8-bit Adam + xformers have been tested. It is the textual inversion part that takes a long time, since it is currently done in full precision.
I am continuing to optimize for perceptual performance first, and the README is a bit misleading because those results were not based on the lora_pti scripts. Better to fix that.
Hey @cloneofsimo. Is Line 565 in 583b1e7
Ah, these and some other tools are currently unsupported for conv2d LoRA. The rest are coming as a feature soon.
Sweet, thanks!
So unlike classical LLMs, LDMs also have many other modules. Arguably, many of the "important" features come from the resnet features. This is clearly demonstrated by, for example, the plug-and-play prior.
A natural question to ask is: does Dreambooth yield fine-grained details, such as eyes or skin, because it is able to tune the resnets? Are Q, K, V, O simply not enough?
In this PR I will try to answer these questions with a bunch of experiments.
Here are some initial results:
This result is LoRA rank 4, with "CrossAttention", "Attention", "GEGLU" replaced.
Now, this is LoRA rank 1, with "ResnetBlock2D", "CrossAttention", "Attention", "GEGLU" replaced. The number of parameters is now about 2 times higher: as high as 9.7 MB. Of that, only 2 MB is the LoRA of "CrossAttention", "Attention", "GEGLU", so I suspect that if this helps, we might make the transformer (TR) LoRA rank 4 and the resnet LoRA rank 1 (sketch below).
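That mixed-rank setup could look like the sketch below: two injection passes with different ranks. Whether `ResnetBlock2D` can be targeted directly like this depends on the conv2d support this PR adds, so treat it as an assumption:

```python
from lora_diffusion import inject_trainable_lora

# `unet` is assumed to be a frozen diffusers UNet, as earlier in the thread.
# Rank 4 for the transformer-side modules (~the 2 MB part above)...
tr_params, _ = inject_trainable_lora(
    unet, target_replace_module={"CrossAttention", "Attention", "GEGLU"}, r=4
)
# ...and rank 1 for the resnet blocks (relies on this PR's conv2d support).
resnet_params, _ = inject_trainable_lora(
    unet, target_replace_module={"ResnetBlock2D"}, r=1
)
```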
All were trained with the same number of steps and sampled with the same parameters. In this case, it looks like it's a tie. I'll try other models as well. I think it is a good time to start implementing fidelity metrics too, instead of the CLIP alignment score (a rough sketch below).
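A rough sketch of one such metric: identity preservation scored as CLIP image-to-image similarity between generations and the reference photos, instead of text-image alignment. This uses the Hugging Face `transformers` CLIP wrappers and is only a proxy, not necessarily the metric the repo would adopt:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def identity_score(generated_images, reference_images):
    # Mean cosine similarity between generated and reference image embeddings;
    # higher = better identity preservation.
    gen = clip.get_image_features(**processor(images=generated_images, return_tensors="pt"))
    ref = clip.get_image_features(**processor(images=reference_images, return_tensors="pt"))
    gen = gen / gen.norm(dim=-1, keepdim=True)
    ref = ref / ref.norm(dim=-1, keepdim=True)
    return (gen @ ref.T).mean().item()
```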