
Make this pip installable #82

Open
wants to merge 14 commits into main

Conversation

winglian
Contributor

this is a pretty big refactor to:

  • allow anyone to use any of the submodules in the repo
  • remove a cyclical dependency
  • drop the hard cuda/gptq requirement; if you prefer the triton backend, you can pick one or the other (pip install .[cuda] or pip install .[triton]); a rough sketch of the extras wiring is below

There is some other cleanup that probably needs to be done, but I figured I should see if you want to go down this path. thanks!
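
A minimal sketch of how the [cuda]/[triton] extras could be wired up in setup.py (package layout and dependency names here are assumptions for illustration, not necessarily what this PR ships):

# setup.py (illustrative sketch only)
from setuptools import setup, find_packages

setup(
    name="alpaca_lora_4bit",
    packages=find_packages("src"),
    package_dir={"": "src"},
    install_requires=["torch", "transformers", "peft"],  # assumed core dependencies
    extras_require={
        "cuda": ["gptq_llama"],   # assumed name for the external GPTQ CUDA kernel package
        "triton": ["triton"],
    },
)

With something like this, pip install .[triton] pulls in triton and pip install .[cuda] pulls in the CUDA kernel dependency, matching the two optional backends described above.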

@johnsmith0031
Owner

Thanks for doing this! I think I would also merge the cuda kernel into this repo so that the external dependency on the GPTQ fork would no longer be needed. I think it would have better compatibility with main GPTQ.

@winglian
Contributor Author

problem is the main gptq doesn't even keep the cuda kernel around anymore; they've hitched their horse to triton.

delete kernel:
qwopqwop200/GPTQ-for-LLaMa@2d3256b
delete quant_cuda.cpp:
qwopqwop200/GPTQ-for-LLaMa@e43c506

@winglian
Contributor Author

alright, I've moved quant_cuda into this repo. Because of the way setuptools works, it's nearly impossible to make the CUDAExtension an extra without it being a separate external package, so it will get installed by default and triton is optional.
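
Roughly, that arrangement could look like the sketch below (the extension name and source paths are assumptions for illustration):

# setup.py sketch: the bundled CUDA kernel is always built, triton stays an extra
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="alpaca_lora_4bit",
    ext_modules=[
        CUDAExtension(
            name="quant_cuda",  # compiled extension later imported by matmul_utils_4bit
            sources=[
                "src/quant_cuda/quant_cuda.cpp",
                "src/quant_cuda/quant_cuda_kernel.cu",
            ],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
    extras_require={"triton": ["triton"]},  # only the triton backend remains optional
)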

@johnsmith0031
Owner

Thank you for putting everything together! I made a PR to text-generation-webui; once it is merged I'll merge this PR into main. And I think we should adjust the Dockerfile for the pip installable alpaca_lora_4bit as well, for compatibility.

@winglian
Contributor Author

I took a pass at updating the Dockerfile, but I don't have cuda on my local machine so I can't validate that it's totally correct. If someone else has a chance to look at the Dockerfile and build/run it 🙏

Dockerfile (outdated)
@@ -61,14 +61,14 @@ RUN cd text-generation-webui-tmp && python download-model.py --text-only decapod
# Get LoRA
RUN cd text-generation-webui-tmp && python download-model.py samwit/alpaca7b-lora && mv loras/samwit_alpaca7b-lora ../alpaca7b_lora

COPY *.py .
COPY src .

I don't think this is quite right. I tried to build the image and run it to test it for you, but the symlinks below were not pointing to anything.

If they were ln -s ../alpaca_lora_4bit/autograd_4bit.py ./autograd_4bit.py (remove 'src/') then they would have linked. So I recommend either changing the copy or changing the symlinking.

Contributor Author

whoops, COPY src . didn't do what I thought 🤦

Contributor Author

Dockerfile updated!


I won't be able to test that for a bit. I broke my machine pretty badly.

@johnsmith0031
Owner

Thanks for everything done here! I think I'll temporarily keep this in the winglian-setup_pip branch for those who want to use the pip installable version, and keep the old version as the main branch for compatibility with the monkeypatch code in webui. I may merge them if something changes in the future.

@myyk

myyk commented Apr 20, 2023

Still seeing an error when trying to run from Docker. I don't understand what's going on well enough to fix this, but it's not fixed by simply running pip install triton. It seems to me like the "quant_cuda not found." message is coming from matmul_utils_4bit.py not finding the quant_cuda folder.

Well anyway, I got my machine back up so that I can help test this.
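
For context, the "quant_cuda not found" and "Triton not found" lines in the log below are typically produced by an import guard along these lines (a sketch of the pattern, not necessarily the exact code in this PR):

# matmul_utils_4bit.py style backend detection (illustrative sketch)
import logging

try:
    import quant_cuda  # compiled CUDA extension; missing if the kernel wasn't built/installed
except ImportError:
    quant_cuda = None
    print('quant_cuda not found. Please run "pip install alpaca_lora_4bit[cuda]".')

try:
    import triton  # optional triton backend
except ImportError:
    triton = None
    print('Triton not found. Please run "pip install triton".')

if quant_cuda is None and triton is None:
    logging.warning("Neither gptq/cuda or triton backends are available.")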


==========
== CUDA ==
==========

CUDA Version 11.7.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

quant_cuda not found. Please run "pip install alpaca_lora_4bit[cuda]".
Triton not found. Please run "pip install triton".
WARNING:root:Neither gptq/cuda or triton backends are available.
Traceback (most recent call last):
  File "/alpaca_lora_4bit/text-generation-webui/server.py", line 1, in <module>
    import custom_monkey_patch # apply monkey patch
  File "/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 7, in <module>
    from models import Linear4bitLt
  File "/alpaca_lora_4bit/text-generation-webui/models.py", line 6, in <module>
    from peft.tuners.lora import is_bnb_available, Linear, Linear8bitLt, LoraLayer
ImportError: cannot import name 'Linear8bitLt' from 'peft.tuners.lora' (/root/.local/lib/python3.10/site-packages/peft/tuners/lora.py)

@myyk

myyk commented Apr 20, 2023

I think that last change improved it, but there's still something off. I upgraded CUDA to 11.8 because I don't think 11.7 works with my driver, and it's on its way out anyway.

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

quant_cuda not found. Please run "pip install alpaca_lora_4bit[cuda]".
Triton not found. Please run "pip install triton".
WARNING:root:Neither gptq/cuda or triton backends are available.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
  warn(msg)
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Traceback (most recent call last):
  File "/alpaca_lora_4bit/text-generation-webui/server.py", line 1, in <module>
    import custom_monkey_patch # apply monkey patch
  File "/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 8, in <module>
    replace_peft_model_with_int4_lora_model()
  File "/alpaca_lora_4bit/text-generation-webui/monkeypatch/peft_tuners_lora_monkey_patch.py", line 4, in replace_peft_model_with_int4_lora_model
    from ..models import GPTQLoraModel
ImportError: attempted relative import beyond top-level package
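
For what it's worth, that last traceback is the classic symptom of a relative import run from outside its package: monkeypatch/ is imported as a top-level package from the webui directory, so "from ..models import GPTQLoraModel" has no parent package to resolve ".." against. A sketch of one possible workaround, assuming models.py stays importable from the webui directory as the traceback suggests (not necessarily the fix this PR ends up with):

# monkeypatch/peft_tuners_lora_monkey_patch.py (sketch, not the PR's actual fix)
def replace_peft_model_with_int4_lora_model():
    # from ..models import GPTQLoraModel   # fails: no parent package when run from webui
    from models import GPTQLoraModel       # absolute import of the sibling models.py
    ...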

@nealchandra

nealchandra commented Apr 24, 2023

I believe this branch is missing commit 94851ce, which at least for me causes a breaking error during build.

I'm curious about the vision for this project: is the intent primarily to support folks who just want an easy way to run text-generation-webui with 4bit quantization? That seems like the case to me (for instance, the inference.py code does not actually apply a LoRA; the best example for inference is actually in the webui monkeypatch).

I think it is useful if that is the case, but for me this project would be even more valuable if it moved in the direction of this PR -- e.g. creating a core library which supports running inference and finetunes against multiple model types. This abstraction would then make it easy to plug this into the webui, or an API wrapper, or directly embed in some other python project. It seems hard to accomplish that goal without at least merging this PR back into the trunk.

@tensiondriven

This seems like the case to me

I am using it for a different purpose: to run local training at 4-bit via scripts in an automated and repeatable fashion. It's important to me that I be able to run it separately from text-generation-webui, so I'd hate to lose that functionality.

@tensiondriven

creating a core library which supports running inference and finetunes against multiple model types

I'm sure @johnsmith0031 would know better than me, but I expect that this project's functionality will eventually be exposed in HuggingFace Transformers or other large packages. This project is very cutting-edge and does things that haven't previously been possible. I like where your intention is, and I wouldn't want this project to get formalized to the point where it loses the agility needed to support features that are sometimes only a few days old.

@urbien

urbien commented Apr 27, 2023

@johnsmith0031 have you seen the LocalAI project?
It creates an OpenAI-compatible server / API wrapper and supports multiple models simultaneously. I want to use it with my own open source web/mobile app, so it fits, but it is designed for CPU-based execution around the GGML library, which, while awesome, is too slow and not even possible for 30b models.
So this project, with LoRAs + 4bit + flash-attention optimizations to serve 30b models from a single 3090-level GPU, would be just heaven! But I had trouble getting it running, let alone starting fine tuning on my own data (I have personal datasets I want to create my own loras on and experiment with multiple different loras on top of the base model). I am a newbie in deep learning, so I might be missing things. In any case, thank you so much for putting this together.

@johnsmith0031
Owner

Thanks! Currently the hosting mode is compatible with text generation webui, which has better inference performance. Feel free to give it a try!
