Vigogne on CPU #31
Comments
Same issue here: I'm trying to run Vigogne on a Mac M1 Max.
See the flash-attention repository on GitHub.
It makes sense…
There must be a way, but only @bofenghuang knows it.
You can use llama.cpp (https://github.com/ggerganov/llama.cpp) to run a GGML/GGUF model from TheBloke's Hugging Face account (see https://huggingface.co/TheBloke), for instance https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGML. This works quite well; use Q5_K_M quantization for the best balance between performance, memory consumption, and computation time. By the way, I just tested a Mistral-based model (https://huggingface.co/TheBloke/Kimiko-Mistral-7B-GGUF), which gives very good results in French.
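As a concrete illustration of the suggestion above, here is a minimal sketch of loading one of these quantized files with llama-cpp-python on CPU. The local file name, thread count, and prompt are assumptions; check the model card for the recommended prompt format.

```python
# Minimal sketch (not from the Vigogne repo): running a Q5_K_M quantized file
# with llama-cpp-python on CPU. The file name, thread count and prompt below
# are assumptions; see the model card for the exact prompt template.
from llama_cpp import Llama

llm = Llama(
    model_path="vigogne-2-7b-chat.Q5_K_M.gguf",  # assumed local file name
    n_ctx=2048,      # context window
    n_threads=8,     # tune to your CPU (e.g. performance cores on an M1 Max)
)

output = llm(
    "Donne trois conseils pour rester en bonne santé.",
    max_tokens=256,
)
print(output["choices"][0]["text"])
```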
Hi, sorry for my late response. Flash Attention is a powerful tool for accelerating training and inference and reducing memory usage when working with GPUs in PyTorch. Please note that it is compatible with Ampere or newer GPU architectures, and its support for older architectures is limited (see the doc and this issue). If you're planning to perform inference with llama.cpp, especially on a Mac machine, there's no need to install Flash Attention. As @tpaviot pointed out, you can find quantized versions of specific models on the Hugging Face Hub that are ready to use, thanks to TheBloke. Otherwise, you may need to quantize them by yourself (see the doc here).
PS: We've also released a Mistral-7B-based model here. Feel free to give it a try and share your feedback with us :)
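For the "quantize them by yourself" path, the rough flow with llama.cpp looks like the Colab-style cell below. The paths and file names are placeholders, and the exact name of the conversion script varies between llama.cpp versions, so treat this as a sketch rather than the documented procedure.

```python
# Rough sketch of quantizing a model yourself with llama.cpp (Colab-style cell).
# Paths and model names are placeholders; check the llama.cpp README for the
# conversion script shipped with your version.
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && make

# Convert the Hugging Face checkpoint to a GGUF file in f16 ...
!python llama.cpp/convert.py ./Vigogne-2-7B-Chat --outfile vigogne-2-7b-chat.f16.gguf

# ... then quantize it to Q5_K_M
!./llama.cpp/quantize vigogne-2-7b-chat.f16.gguf vigogne-2-7b-chat.Q5_K_M.gguf Q5_K_M
```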
@tpaviot
@LeMoussel Yep; although llama.cpp can run on CPU, it supports GPU as well, and it works very well on Colab at an affordable cost (if you use a T4). It can also be smoothly integrated with langchain or llama_index. What exactly do you mean by "an example"?
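For the llama_index side, a sketch of the integration looks like this. It is written against the llama_index API of late 2023 (the import path has since moved in newer releases), and the file name and n_gpu_layers value are assumptions:

```python
# Illustrative only: wiring a local GGUF model into llama_index via LlamaCPP.
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="mistral-7b-v0.1.Q4_K_M.gguf",  # local GGUF file (assumed name)
    temperature=0.1,
    max_new_tokens=256,
    context_window=2048,
    model_kwargs={"n_gpu_layers": 35},         # offload layers to the T4 GPU
)

print(llm.complete("Explique le théorème de Pythagore en une phrase.").text)
```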
I haven't been able to run llama_index LlamaCPP with Mistral-7B-v0.1-GGUF/mistral-7b-v0.1.Q4_K_M.gguf on Colab with a T4 GPU. I still have the same issue.
This is what I use to install the latest llama-cpp-python using pip on Colab:

```python
import os  # needed for os.environ (missing from the original snippet)

# HAVE_CUDA is assumed to be set earlier in the notebook,
# e.g. HAVE_CUDA = torch.cuda.is_available()
if HAVE_CUDA:
    print("cuda available, build llama-cpp-python on GPU using CUBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=ON"
else:
    print("Build llama-cpp-python on CPU using OPENBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_QKK_64=1 -DCMAKE_BUILD_TYPE=Release -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

!CMAKE_ARGS=${CMAKE_ARGS} FORCE_CMAKE=1 pip install -v llama-cpp-python
```
On Colab:

```python
import os
import torch

if torch.cuda.is_available():
    print("cuda available, build llama-cpp-python on GPU using CUBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=ON"
else:
    print("Build llama-cpp-python on CPU using OPENBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_QKK_64=1 -DCMAKE_BUILD_TYPE=Release -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

!CMAKE_ARGS=${CMAKE_ARGS} FORCE_CMAKE=1 pip install -v llama-cpp-python
```
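Once llama-cpp-python has been built with CUBLAS, the part that is easy to miss is offloading layers to the GPU with n_gpu_layers; without it, inference stays on the CPU. A minimal sketch, assuming the GGUF file has already been downloaded to /content:

```python
# Minimal sketch: load a local GGUF file with GPU offload after a CUBLAS build.
# The model path and the number of offloaded layers are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="/content/mistral-7b-v0.1.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=35,   # number of transformer layers to put on the T4
    verbose=True,      # the load log should mention the CUDA/CUBLAS backend
)

out = llm("Q: Quelle est la capitale de la France ?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```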
Hi @bofenghuang, excellent news, thanks for your great work! I have a question concerning this model: may I fine-tune it on my own French instruction dataset directly from Vigostral, or should I start from the pre-trained Mistral-7B? Thanks in advance for your reply.
Hello @mauryaland, I would recommend beginning with the pre-trained Mistral-7B model if you have sufficient instruction data. Otherwise, you can consider using Vigostral, which has already undergone fine-tuning on French instructions.
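For reference, here is a generic sketch of what fine-tuning from the pre-trained Mistral-7B can look like with LoRA adapters. This is not the Vigogne training recipe; dataset preparation, prompt formatting, and the training loop are omitted.

```python
# Generic LoRA fine-tuning setup (not the Vigogne recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"   # or an instruction-tuned model such as Vigostral

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```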
Hi, is it possible to use Vigogne with a CPU, for example with LocalAI?
When I try to install Vigogne I get this message: