[WIP] update llamafactory doc #21

Merged 2 commits on Jun 13, 2024
6 changes: 3 additions & 3 deletions index.rst
@@ -44,11 +44,11 @@
 </div>
 <div class="flex-grow"></div>
 <div class="flex space-x-4 text-blue-600">
-<a href="#">官方链接</a>
+<a href="https://github.com/hiyouga/LLaMA-Factory">官方链接</a>
 <span class="split">|</span>
-<a href="#">安装指南</a>
+<a href="sources/llamafactory/install.html">安装指南</a>
 <span class="split">|</span>
-<a href="#">快速上手</a>
+<a href="sources/llamafactory/quick_start.html">快速上手</a>
 </div>
 </div>
 <!-- Card 2 -->
119 changes: 119 additions & 0 deletions sources/llamafactory/faq.rst
@@ -0,0 +1,119 @@
FAQ
=======

Device Selection
----------------

**Q: Why are my NPU devices not being used?**

1. Specify the Ascend NPU devices via the ``ASCEND_RT_VISIBLE_DEVICES`` environment variable; for example, ``ASCEND_RT_VISIBLE_DEVICES=0,1,2,3`` selects the four NPU devices 0, 1, 2 and 3 for fine-tuning/inference (a sketch for verifying device visibility follows this list).

.. hint::

    Ascend NPU devices are numbered from 0, and the same numbering applies inside docker containers;
    if physical NPU devices 6 and 7 are mapped into a container, they appear inside it as devices 0 and 1.

2. Check that torch-npu is installed; installing LLaMA-Factory via ``pip install -e '.[torch-npu,metrics]'`` is recommended.
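A minimal sketch for checking which devices are visible, assuming ``torch`` and ``torch-npu`` are installed (``torch.npu`` mirrors the familiar ``torch.cuda`` API):

.. code-block:: python

    import os
    # must be set before the NPU devices are initialized
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1,2,3"

    import torch
    import torch_npu  # registers the "npu" device type with PyTorch

    print(torch.npu.is_available())  # True if the NPU stack is usable
    print(torch.npu.device_count())  # 4 with the setting above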

Inference Errors
----------------

**Q: Inference on Ascend NPU fails with RuntimeError: ACL stream synchronize failed, error code:507018**

A: Set ``do_sample: false`` to disable the random sampling strategy.
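If you drive generation through transformers directly rather than through a LLaMA-Factory config file, an equivalent setting looks like this (a sketch, not LLaMA-Factory's own code):

.. code-block:: python

    from transformers import GenerationConfig

    # greedy decoding: avoids the sampling path that triggers
    # the ACL stream synchronization failure
    gen_cfg = GenerationConfig(do_sample=False)
    # pass it to model.generate(..., generation_config=gen_cfg)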

Related issues:

- https://github.com/hiyouga/LLaMA-Factory/issues/3840

Fine-tuning/Training Errors
---------------------------

**Q: Fine-tuning/training a ChatGLM-series model fails with NotImplementedError: Unknown device for graph fuser**

A: In the repo downloaded from ModelScope or Hugging Face, edit ``modeling_chatglm.py`` and comment out the ``torch.jit`` decorators (see the sketch below).
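A sketch of the edit; in the ChatGLM repos the decorator typically sits on functions such as ``apply_rotary_pos_emb``, so adjust to whichever functions carry it in your copy:

.. code-block:: python

    import torch

    # Commenting out the decorator makes the function run eagerly instead of
    # being compiled by torch.jit, whose graph fuser does not know the NPU device.
    # @torch.jit.script
    def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
        ...  # function body stays unchanged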

Related issues:

- https://github.com/hiyouga/LLaMA-Factory/issues/3788
- https://github.com/hiyouga/LLaMA-Factory/issues/4228


**Q: After fine-tuning/training starts, HCCL fails with an error containing the following key messages:**

.. code-block:: shell

    RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:64
    [ERROR] 2024-05-21-11:57:54 (PID:927000, Device:3, RankID:3) ERR02200 DIST call hccl api failed.
    EJ0001: 2024-05-21-11:57:54.167.645 Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
    Solution: Wait for 10s after killing the last training process and try again.
    TraceBack (most recent call last):
    tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:290]
    Fail to get sq reg virtual addr, deviceId=3, sqId=40.[FUNC:Setup][FILE:stream.cc][LINE:1102]
    stream setup failed, retCode=0x7020010.[FUNC:SyncGetDevMsg][FILE:api_impl.cc][LINE:4643]
    Sync get device msg failed, retCode=0x7020010.[FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4704]
    rtGetDevMsg execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]

A: Kill all processes on the device side, wait 10 seconds, then restart training.
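For illustration, the cleanup could look like this (a sketch; it assumes the training entry point matches ``src/train.py``, so adjust the pattern to your launcher):

.. code-block:: python

    import subprocess
    import time

    # kill leftover training processes still holding the NPU devices
    subprocess.run(["pkill", "-9", "-f", "src/train.py"], check=False)
    time.sleep(10)  # give HCCL time to release resources before relaunching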

Related issues:

- https://github.com/hiyouga/LLaMA-Factory/issues/3839

.. **Q: Fine-tuning ChatGLM3 with fp16 fails with Gradient overflow. Skipping step Loss scaler reducing loss scale to ...; with bf16, 'loss': 0.0, 'grad_norm': nan**
.. https://github.com/hiyouga/LLaMA-Factory/issues/3308


**Q: Inference with the TeleChat model on Ascend NPU fails with AssertionError: Torch not compiled with CUDA enabled**

A: This problem is usually caused by hard-coded CUDA references in the code. Using the error traceback, locate the hard-coded CUDA calls and replace them with their NPU equivalents: replace ``.cuda()`` with ``.npu()`` and ``.to("cuda")`` with ``.to("npu")``.
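A minimal illustration of the substitution (a sketch, not TeleChat's actual code; assumes ``torch-npu`` is installed):

.. code-block:: python

    import torch
    import torch_npu  # registers the "npu" device type

    model = torch.nn.Linear(4, 4)
    # model = model.cuda()           # hard-coded CUDA: fails on Ascend NPU
    model = model.npu()              # NPU equivalent
    x = torch.randn(2, 4).to("npu")  # instead of .to("cuda")
    y = model(x)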

**Q: Fine-tuning fails with DeviceType must be NPU. Actual DeviceType is: cpu, as in the following error:**

.. code-block:: shell

    File "/usr/local/pyenv/versions/3.10.13/envs/x/lib/python3.10/site-packages/transformers-4.41.1-py3.10.egg/transformers/generation/utils.py", line 1842, in generate
        result = self._sample(
    File "/usr/local/pyenv/versions/3.10.13/envs/x/lib/python3.10/site-packages/transformers-4.41.1-py3.10.egg/transformers/generation/utils.py", line 2568, in _sample
        next_tokens = next_tokens * unfinished_sequences + \
    RuntimeError: t == c10::DeviceType::PrivateUse1 INTERNAL ASSERT FAILED at "third_party/op-plugin/op_plugin/ops/base_ops/opapi/MulKernelNpuOpApi.cpp":26, please report a bug to PyTorch. DeviceType must be NPU. Actual DeviceType is: cpu
    [ERROR] 2024-05-29-17:04:48 (PID:70209, Device:0, RankID:-1) ERR00001 PTA invalid parameter

A: Errors of this kind usually mean that some tensors have not been moved to the NPU. Make sure every operand of the operator named in the error is on the NPU. In the error above, MulKernelNpuOpApi is a multiplication operator, so both ``next_tokens`` and ``unfinished_sequences`` must already be on the NPU.
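For illustration, the mixed-device situation and its fix look like this (a sketch, not the transformers source):

.. code-block:: python

    import torch
    import torch_npu

    a = torch.ones(3)        # still on the CPU
    b = torch.ones(3).npu()  # on the NPU
    # a * b would fail here: the NPU mul kernel needs both operands on the NPU
    a = a.npu()              # move the CPU operand over
    c = a * b                # now both operands are on the NPU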


**Q: On a single NPU, training with DeepSpeed fails with AttributeError: 'GemmaForCausalLM' object has no attribute 'save_checkpoint' (GemmaForCausalLM may also be another model class)**


A: This problem usually occurs when the training script is launched with ``python src/train.py``, or with ``llamafactory-cli train`` while the environment variable ``FORCE_TORCHRUN`` is set to false or 0.
DeepSpeed wraps the model in a ``DeepSpeedEngine`` only when the program is started by a distributed launcher, and only the wrapped model provides methods such as ``save_checkpoint``.
Launching training with ``torchrun`` therefore resolves the problem:

.. code-block:: shell

    torchrun --nproc_per_node $NPROC_PER_NODE \
        --nnodes $NNODES \
        --node_rank $RANK \
        --master_addr $MASTER_ADDR \
        --master_port $MASTER_PORT \
        src/train.py

When ``llamafactory-cli train`` is used together with DeepSpeed, LLaMA-Factory automatically sets ``FORCE_TORCHRUN`` to 1 and launches distributed training. If your copy of the code does not do this, update LLaMA-Factory to the latest version.

Related issue and PR:

- https://github.com/hiyouga/LLaMA-Factory/issues/4077
- https://github.com/hiyouga/LLaMA-Factory/pull/4082



Feedback
----------

If you run into any problems, feel free to open an issue in the `official community <https://github.com/hiyouga/LLaMA-Factory/issues/>`_, and we will respond as soon as possible.

*Continuously updated ...*

5 changes: 3 additions & 2 deletions sources/llamafactory/index.rst
@@ -4,5 +4,6 @@ LLaMA-Factory
 .. toctree::
    :maxdepth: 2
 
-   install.md
-   quick_start.md
+   install.rst
+   quick_start.rst
+   faq.rst
85 changes: 27 additions & 58 deletions sources/llamafactory/install.rst
@@ -1,90 +1,59 @@
-LLAMA-Factory × Ascend Installation Guide
-===========================
+Installation Guide
+==================
 
 This tutorial is intended for developers working with LLaMA-Factory on Ascend hardware, and walks through installing LLaMA-Factory in an Ascend environment.
 
-.. - [LLAMA-Factory × 昇腾 安装指南](#llama-factory--昇腾-安装指南)
-.. - [昇腾环境安装](#昇腾环境安装)
-.. - [LLaMA-Factory 安装](#llama-factory-安装)
-.. - [最简安装](#最简安装)
-.. - [推荐安装](#推荐安装)
-.. - [安装校验](#安装校验)
 
 Ascend Environment Setup
 ------------------------
 
-Install the Ascend environment for your Ascend product model and CPU architecture following the `Ascend environment quick-install guide <https://ascend.github.io/docs/sources/ascend/quick_install.html>`_, or use a docker image with the Ascend environment and LLaMA-Factory preinstalled:
+Install the Ascend environment for your Ascend product model and CPU architecture following the :doc:`Ascend environment quick-install guide <../ascend/quick_install>`, or use a docker image with the Ascend environment and LLaMA-Factory preinstalled:
 
-- `[32GB]LLaMA-Factory-Cann8-Python3.10-Pytorch2.2.0 <http://mirrors.cn-central-221.ovaijisuan.com/detail/130.html>`_
-- `[64GB]LLaMA-Factory-Cann8-Python3.10-Pytorch2.2.0 <http://mirrors.cn-central-221.ovaijisuan.com/detail/131.html>`_
+- TODO
+
+.. warning::
+    The minimum CANN version supported by LLAMA-Factory is 8.0.rc1

-LLaMA-Factory Download and Installation
+Python Environment Setup
 ----------------------
 
 .. note::
-    If you have chosen one of the docker images above, you can skip this step and start your LLaMA-Factory journey right away.
+    If you have chosen one of the docker images above, you can skip this step and start using LLaMA-Factory directly.
 
-With the Ascend environment ready, you can now install LLaMA-Factory. We recommend using conda to create and manage the Python virtual environment; conda usage is beyond the scope of this tutorial, so only the commands used are given here. See the `conda user guide <https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html>`_ for details.
 
 .. code-block:: shell
     :linenos:
 
     # create a python 3.10 virtual environment
     conda create -n <your_env_name> python=3.10
     # activate the virtual environment
     conda activate <your_env_name>


-LLaMA-Factory Download
-~~~~~~~~~~~~~~~~~~~
+LLaMA-Factory Installation
+----------------------
 
-Download it manually from the `official LLaMA-Factory GitHub repository <https://github.com/hiyouga/LLaMA-Factory>`_, or pull the latest LLaMA-Factory with git:
+Install LLaMA-Factory together with torch-npu using the following commands:
 
 .. code-block:: shell
     :linenos:
 
     git clone git@github.com:hiyouga/LLaMA-Factory.git
-
-Minimal Installation
-~~~~~~~~~~~~~~~~~~~
-
-After activating the conda virtual environment, install LLaMA-Factory with torch-npu using the following command:
-
-.. code-block:: shell
-    :linenos:
-
     pip install -e .[torch_npu,metrics]

-Recommended Installation
-~~~~~~~~~~~~~~~~~~~
+Installation Verification
+----------------------
 
-The deepspeed and modelscope features are recommended; append the extra dependencies inside ``[]``, e.g.:
+Verify the LLaMA-Factory × Ascend installation with the ``llamafactory-cli env`` command; if the LLaMA-Factory, PyTorch NPU and CANN version numbers and the NPU model are displayed correctly, as shown below, the installation succeeded.
 
-.. code-block:: shell
-    :linenos:
-
-    pip install -e .[torch_npu,metrics,deepspeed,modelscope]
-
-According to the official LLaMA-Factory instructions, the currently supported optional extra dependencies include:
-
-> Optional extras: torch, torch_npu, metrics, deepspeed, bitsandbytes, vllm, galore, badam, gptq, awq, aqlm, qwen, modelscope, quality
-
-Install them as needed.
-
-Once installation finishes, the key message ``Successfully installed xxx xxx ...`` indicates that the dependencies were installed successfully; in case of dependency version conflicts, install with ``pip install --no-deps -e .``.
-
-### Installation Verification
-
-In the conda virtual environment set up in [LLaMA-Factory 安装](#LLaMA-Factory 安装), verify the LLaMA-Factory × Ascend installation with the ``llamafactory-cli version`` command; as shown below, a correctly displayed LLaMA-Factory version number means LLaMA-Factory is installed, and the message `Setting ds_accelerator to npu` means deepspeed and the npu environment are installed.

-.. figure:: ./images/install_check.png
-    :align: left
-
-.. note::
-    If you used the minimal installation and did not install deepspeed, the output looks like this:
-
-.. figure:: ./images/install_check_simple.png
-    :align: center
+- `llamafactory` version: 0.8.2.dev0
+- Platform: Linux-4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64-aarch64-with-glibc2.31
+- Python version: 3.10.14
+- PyTorch version: 2.1.0 (NPU)
+- Transformers version: 4.41.2
+- Datasets version: 2.19.2
+- Accelerate version: 0.31.0
+- PEFT version: 0.11.1
+- TRL version: 0.9.4
+- NPU type: xxx
+- CANN version: 8.0.RC2.alpha001
 
 Enjoy fine-tuning and running inference with large language models on LLaMA-Factory × Ascend!