[WIP] Add nonlinearity between Down and Up weights, expose alpha hyperparameter #111

Draft
wants to merge 15 commits into base: develop
Conversation

brian6091
Collaborator

Not strictly a LoRA change, but quite easy to fit into the current code. The idea is that squeezing an element-wise nonlinearity in between the Down and Up transformations can in some cases increase your performance at essentially no cost.

I also exposed the alpha parameter (keeping the convention from the original LoRA paper that scale=alpha/rank), which can also get you some extra performance for little effort.
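
To make the shape of the change concrete, here is a minimal sketch of the idea (a toy module for illustration only, not the code in this diff; the class name, argument names, and init choices are placeholders):

```python
import torch.nn as nn

class LoraBlockSketch(nn.Module):
    # Toy illustration: Down -> element-wise nonlinearity -> Up, scaled by alpha / rank.
    def __init__(self, in_features, out_features, r=4, alpha=4.0, nonlin=None):
        super().__init__()
        self.lora_down = nn.Linear(in_features, r, bias=False)
        self.lora_up = nn.Linear(r, out_features, bias=False)
        self.nonlin = nonlin if nonlin is not None else nn.Identity()
        self.scale = alpha / r  # alpha = r gives scale = 1.0, i.e. the previous behavior

        nn.init.normal_(self.lora_down.weight, std=1.0 / r)
        nn.init.zeros_(self.lora_up.weight)

    def forward(self, x):
        # the element-wise nonlinearity is squeezed between the Down and Up projections
        return self.lora_up(self.nonlin(self.lora_down(x))) * self.scale
```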

Ideas from this fantastic paper:

He, Junxian, et al. "Towards a unified view of parameter-efficient transfer learning." arXiv preprint arXiv:2110.04366 (2021).
https://arxiv.org/pdf/2110.04366.pdf

Any thoughts before I start testing more thoroughly?

brian6091 marked this pull request as draft January 5, 2023 14:38
@cloneofsimo
Owner

Nice work as always!!

Keep tune_lora_scale as before to minimize breakage.
@cloneofsimo
Owner

Ooooo so alpha = 4 is the default? This is going to make my head hurt 🫠
So in the paper they had high alpha values, but it seemed to me that's equivalent to scaling the initialization + increasing the learning rate, so I didn't bother...

@cloneofsimo
Owner

What if we set alpha = 1 as the default, and scale up the LoRA instead?

@brian6091
Collaborator Author

brian6091 commented Jan 5, 2023

So I set alpha=4 as the default so that everything produces scale=1.0 by default (which is what you had before). Hence my last addition of 'tune_lora_alpha' while leaving 'tune_lora_scale' alone. This should be transparent to most users, as they won't see alpha unless they want to...
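
To spell that out, with the default rank r = 4:

$$ \text{scale} = \alpha / r = 4 / 4 = 1.0 $$

so existing scripts see exactly the behavior they did before.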

I agree that alpha and scale are redundant, and I honestly have no idea why it was defined that way in the paper.
We could remove one, but both alpha and scale are used in different places in the code. If you think it's useful, I could go through and change everything to 'scale'?

@cloneofsimo
Owner

Hmm, I want to talk about this a bit more deeply. It's just that I'm not really understanding why we need to introduce a scale term here. I think the previous implementation + handling scale in the new implementation's initialization would be enough.

My reasoning: suppose we train a LoRA with scale = 16.
That would mean the end user of that LoRA has to know it was trained with 16, and that 16 is the base setting. So scale = 16 is the "full model", and scale < 16 would mean interpolation.
So the end user needs to know another parameter, besides the LoRA itself, to get things done quickly: the scale parameter.

But instead, since

$$ W + s A B^T = W + 1 \cdot (sA) B^T $$

If we instead use the scale parameter to make A larger at initialization (and use a higher lr), the end user would just know that 1 is the "default full weight", and would use something around 0~1 to get a decent result, which is less confusing.
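
A quick numeric check of that identity (toy shapes, not repo code):

```python
import torch

d, r, s = 8, 4, 16.0          # toy dimensions and scale
W = torch.randn(d, d)
A = torch.randn(d, r)
B = torch.randn(d, r)

merged_with_scale = W + s * (A @ B.T)   # scale applied at merge time
merged_folded     = W + (s * A) @ B.T   # scale folded into A, effective scale = 1
print(torch.allclose(merged_with_scale, merged_folded))  # True
```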

@cloneofsimo
Owner

But it's just my thought, and I think this will change a lot of code all at once (the README and some of your analysis as well), so I'm happy to discuss in more depth.

@brian6091
Collaborator Author

brian6091 commented Jan 6, 2023

You raise great points that I completely agree with. Using both alpha and scale is not a good idea, and even the LoRA authors ignore the distinction after defining it. So I think it's best to revert to just using scale.

Now the question is whether to expose scale to users. As you said, changing the scale can effectively be compensated for by changing the learning rate or the variance of the weight initialization. He et al. casually mention that changing scale can have an impact, and they actually hand-tune it. This may become more relevant if the pointwise nonlinearity is incorporated, since in that case scale is no longer equivalent to just changing the variance of the weight initialization (see the toy check after the list below).

So here's what I suggest:

  • Keep scale but not alpha. I'll also clean up any use of alpha and convert it to scale (this is just a parameter naming thing, e.g., in tune_lora_scale, weight_apply_lora).
  • Expose scale, but set the default to 1, which is what it was fixed to before. This leaves it open for advanced users to play with.
  • Modify the metadata storage to include the scale value and the name of the nonlinearity.
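
Re: the nonlinearity point above, a toy check (GELU is just an example choice): once a non-homogeneous element-wise function sits between Down and Up, scaling the output is no longer the same as scaling the Down weights; folding the scale into the Up weights, which sit after the nonlinearity, would still be exact.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 8)
A = torch.randn(8, 4)   # Down projection (toy shapes)
B = torch.randn(4, 8)   # Up projection
s = 16.0

scaled_output = s * (F.gelu(x @ A) @ B)    # scale applied to the adapter output
folded_down   = F.gelu(x @ (s * A)) @ B    # scale folded into the Down weights
folded_up     = F.gelu(x @ A) @ (s * B)    # scale folded into the Up weights

print(torch.allclose(scaled_output, folded_down))  # False: GELU is not homogeneous
print(torch.allclose(scaled_output, folded_up))    # True: Up sits after the nonlinearity
```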

cloneofsimo changed the base branch from master to develop January 6, 2023 13:42
@cloneofsimo
Owner

I think whatever works is fine as long as it's backward compatible. Overall, I just want to keep the notion of "full training == $\alpha = 1$" and "base model == $\alpha = 0$" intact.

@cloneofsimo
Owner

Basically because I've built many test scripts with that notion, and a lot of other implementations are built on the same idea, so I think it would surprise a lot of people if alpha were an unknown argument and you suddenly needed to use a large scale value to get things to work.
Also, I'm kinda OK with the way it is, because LoRA is already too good at fine-tuning; it's just that we need more regularization methods to work on.

@cloneofsimo
Owner

cloneofsimo commented Jan 6, 2023

So maybe keeping alpha intact, setting the default scale = 1, and exposing it to advanced users might work?

  • during saving, fold the scale back into A?

$$ W' = W + s AB^T = W + (sA) B^T $$

So save $sA , B$ after training?
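
A rough sketch of what that folding could look like at save time (the `lora_down` / `lora_up` / `scale` attribute names are assumptions about the injected modules, not taken from this branch):

```python
import torch

def save_lora_scale_folded(lora_layers, path):
    # lora_layers: dict of {name: injected module with .lora_down, .lora_up, .scale}
    state = {}
    for name, layer in lora_layers.items():
        # W' = W + s A B^T = W + (s A) B^T: bake the scale into the Down factor,
        # so the saved weights behave as if scale = 1 at load time
        state[f"{name}.lora_down.weight"] = layer.lora_down.weight.data * layer.scale
        state[f"{name}.lora_up.weight"] = layer.lora_up.weight.data.clone()
    torch.save(state, path)
```

(With the nonlinearity in between, the scale would have to be folded into the Up factor instead, since it sits after the nonlinearity.)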

@brian6091
Collaborator Author

> So maybe keeping alpha intact, setting the default scale = 1, and exposing it to advanced users might work?
>
> • during saving, fold the scale back into A?
>
> $$ W' = W + s A B^T = W + (sA) B^T $$
>
> So save $sA, B$ after training?

Yeah, that sounds good.
