
VRAM usage keeps increasing during role play #9

Open

allencyhsu opened this issue Jun 16, 2024 · 2 comments

@allencyhsu

Loading the model takes about 5 GB of VRAM, but after a few rounds of back-and-forth conversation it jumps to 6 GB; each additional turn adds roughly 300 MB. Is there any way to work around this?

==============================
python realtime_chat.py --role_name 三三
-----PERFORM NORM HEAD
user: Hello
/home/allen/miniconda3/envs/index/lib/python3.10/site-packages/transformers/generation/utils.py:1417: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
warnings.warn(
三三: Hello, I'm 三三. Is there anything I can help you with?
user: Tell me about B站
三三: B站 is one of the largest online video platforms in China, offering a wealth of video content across animation, gaming, music, dance and more, along with livestreaming and interactive community features. It is also a diverse community that attracts a large number of young users.
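For context, this per-turn growth is what you would expect when the full dialogue history (and its KV cache) is carried along on every turn, so bounding the history is one mitigation independent of quantization. A minimal sketch, assuming a chat loop that keeps a list of (user, reply) pairs; the names history, trim_history, and MAX_TURNS are hypothetical and not from realtime_chat.py:

# Hypothetical mitigation: keep only the most recent turns so the
# prompt, and with it the KV cache, stops growing without bound.
MAX_TURNS = 8  # illustrative window size; tune for your VRAM budget

def trim_history(history, max_turns=MAX_TURNS):
    """Keep only the most recent (user, reply) pairs."""
    return history[-max_turns:]

history = []
# Inside the chat loop (sketch):
#   history = trim_history(history)
#   ...generate a reply from the trimmed history...
#   history.append((user_input, reply))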

@BitVoyage
Collaborator

You can quantize the model with the quantization script provided in our README, which significantly reduces memory usage.

@lingyun-gao
Collaborator

Multi-turn conversation does grow VRAM usage. You can use the quantization script; in our tests it cuts VRAM usage roughly in half.

Modify realtime_chat.py:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with nested (double) quantization;
# computation runs in fp16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
)

# Pass the quantization config when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    self.huggingface_local_path,
    trust_remote_code=True,
    config=config,
    quantization_config=quantization_config,
)
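To verify the savings yourself, you can watch GPU memory with PyTorch's built-in counters. A quick sketch using the standard torch.cuda API; the "roughly half" figure above is from the comment, not re-measured here:

import torch

def report_vram(tag: str) -> None:
    # Current and peak allocated GPU memory, in GiB.
    alloc = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated: {alloc:.2f} GiB, peak: {peak:.2f} GiB")

# e.g. call report_vram("after load") once the model is loaded,
# then report_vram(f"after turn {n}") after each generation.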
