Vigogne on CPU #31
Comments
Same issue here: I'm trying to run Vigogne on a Mac M1 Max.
See the flash-attention repository on GitHub.
It makes sense…
There must be a way, but only @bofenghuang knows it.
You can use llama.cpp (https://github.com/ggerganov/llama.cpp) to run a GGML/GGUF model from TheBloke's Hugging Face account (see https://huggingface.co/TheBloke), for instance https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGML. This works quite well; use Q5_K_M quantization for the best balance between performance, memory consumption, and computation time. By the way, I just tested a Mistral-based model (https://huggingface.co/TheBloke/Kimiko-Mistral-7B-GGUF), which gives very good results in French.
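As a concrete illustration of the suggestion above, here is a minimal sketch of loading one of these quantized files with llama-cpp-python on CPU. The local file name, thread count, and prompt are assumptions; check the model card for the recommended prompt format.

```python
# Minimal sketch (not from the Vigogne repo): running a Q5_K_M quantized file
# with llama-cpp-python on CPU. The file name, thread count and prompt below
# are assumptions; see the model card for the exact prompt template.
from llama_cpp import Llama

llm = Llama(
    model_path="vigogne-2-7b-chat.Q5_K_M.gguf",  # assumed local file name
    n_ctx=2048,      # context window
    n_threads=8,     # tune to your CPU (e.g. performance cores on an M1 Max)
)

output = llm(
    "Donne trois conseils pour rester en bonne santé.",
    max_tokens=256,
)
print(output["choices"][0]["text"])
```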
Hi, sorry for my late response. Flash Attention is a powerful tool for accelerating training and inference and reducing memory usage when working with GPUs in PyTorch. Please note that it is compatible with Ampere or newer GPU architectures, and its support for older architectures is limited (see the doc and this issue). If you're planning to perform inference with llama.cpp, especially on a Mac machine, there's no need to install Flash Attention. As @tpaviot pointed out, you can find quantized versions of specific models on the Hugging Face Hub that are ready to use, thanks to TheBloke. Otherwise, you may need to quantize them by yourself (see the doc here).
PS: We've also released a Mistral-7B-based model here. Feel free to give it a try and share your feedback with us :)
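For the "quantize them by yourself" path, the rough flow with llama.cpp looks like the Colab-style cell below. The paths and file names are placeholders, and the exact name of the conversion script varies between llama.cpp versions, so treat this as a sketch rather than the documented procedure.

```python
# Rough sketch of quantizing a model yourself with llama.cpp (Colab-style cell).
# Paths and model names are placeholders; check the llama.cpp README for the
# conversion script shipped with your version.
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && make

# Convert the Hugging Face checkpoint to a GGUF file in f16 ...
!python llama.cpp/convert.py ./Vigogne-2-7B-Chat --outfile vigogne-2-7b-chat.f16.gguf

# ... then quantize it to Q5_K_M
!./llama.cpp/quantize vigogne-2-7b-chat.f16.gguf vigogne-2-7b-chat.Q5_K_M.gguf Q5_K_M
```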
@tpaviot
@LeMoussel Yep; although llama.cpp can run on CPU, it supports GPU as well, and it works very well on Colab at an affordable cost (if you use a T4). It can also be smoothly integrated with langchain or llama_index. What exactly do you mean by "an example"?
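For the llama_index side, a sketch of the integration looks like this. It is written against the llama_index API of late 2023 (the import path has since moved in newer releases), and the file name and n_gpu_layers value are assumptions:

```python
# Illustrative only: wiring a local GGUF model into llama_index via LlamaCPP.
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="mistral-7b-v0.1.Q4_K_M.gguf",  # local GGUF file (assumed name)
    temperature=0.1,
    max_new_tokens=256,
    context_window=2048,
    model_kwargs={"n_gpu_layers": 35},         # offload layers to the T4 GPU
)

print(llm.complete("Explique le théorème de Pythagore en une phrase.").text)
```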
I haven't been able to run llama_index LlamaCPP with Mistral-7B-v0.1-GGUF/mistral-7b-v0.1.Q4_K_M.gguf on Colab with a T4 GPU. I still have the same issue.
This is what I use to install the latest llama-cpp-python using pip on Colab:

```python
import os  # needed for os.environ (missing from the original snippet)

# HAVE_CUDA is assumed to be set earlier in the notebook,
# e.g. HAVE_CUDA = torch.cuda.is_available()
if HAVE_CUDA:
    print("cuda available, build llama-cpp-python on GPU using CUBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=ON"
else:
    print("Build llama-cpp-python on CPU using OPENBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_QKK_64=1 -DCMAKE_BUILD_TYPE=Release -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

!CMAKE_ARGS=${CMAKE_ARGS} FORCE_CMAKE=1 pip install -v llama-cpp-python
```
On Colab:

```python
import os
import torch

if torch.cuda.is_available():
    print("cuda available, build llama-cpp-python on GPU using CUBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=ON"
else:
    print("Build llama-cpp-python on CPU using OPENBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_QKK_64=1 -DCMAKE_BUILD_TYPE=Release -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

!CMAKE_ARGS=${CMAKE_ARGS} FORCE_CMAKE=1 pip install -v llama-cpp-python
```
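Once llama-cpp-python has been built with CUBLAS, the part that is easy to miss is offloading layers to the GPU with n_gpu_layers; without it, inference stays on the CPU. A minimal sketch, assuming the GGUF file has already been downloaded to /content:

```python
# Minimal sketch: load a local GGUF file with GPU offload after a CUBLAS build.
# The model path and the number of offloaded layers are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="/content/mistral-7b-v0.1.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=35,   # number of transformer layers to put on the T4
    verbose=True,      # the load log should mention the CUDA/CUBLAS backend
)

out = llm("Q: Quelle est la capitale de la France ?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```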
Hi @bofenghuang, excellent news, thanks for your great work! I have a question concerning this model: may I fine-tune it on my own French instruction dataset directly from Vigostral, or should I start from the pre-trained Mistral-7B? Thanks in advance for your reply.
Hello @mauryaland, I would recommend beginning with the pre-trained Mistral-7B model if you have sufficient instruction data. Otherwise, you can consider using Vigostral, which has already undergone fine-tuning on French instructions.
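For reference, here is a generic sketch of what fine-tuning from the pre-trained Mistral-7B can look like with LoRA adapters. This is not the Vigogne training recipe; dataset preparation, prompt formatting, and the training loop are omitted.

```python
# Generic LoRA fine-tuning setup (not the Vigogne recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"   # or an instruction-tuned model such as Vigostral

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```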
Hi, is it possible to use Vigogne with a CPU, for example with LocalAI?
When I try to install Vigogne I get this message: