
Commit 402e2e8

[RL] Update reward training docs (#10473)
* update rm docs
* update rm docs
1 parent 323ffec commit 402e2e8

File tree

1 file changed (+31, -19)

llm/docs/rlhf.md

Lines changed: 31 additions & 19 deletions
@@ -57,18 +57,35 @@ PPO training consists of the three stages Supervised Fine-Tuning, Reward Model Fine-Tuning, and RLHF
[LLM fine-tuning](finetune.md); refer to that section directly for data preparation.

#### Reward Model Fine-Tuning data
-The Reward Model Fine-Tuning stage requires human preference data. The example uses the [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) dataset released by PKU-Alignment/safe-rlhf; a sample is shown below. Its `prompt`, `response_0`, `response_1`, and `better_response_id` fields are used to build the preference data (the dataset annotates helpfulness and harmlessness separately via the safe fields; the example uses the helpfulness annotation).
+The Reward Model Fine-Tuning stage requires human preference data.
+The supported data format is a json file with one dictionary per line, where each dictionary contains the following fields:

-```json
+- `src` : `str, List(str)`, the user's dialogue content.
+- `tgt` : `str, List(str)`, the system's replies.
+- `response` : `str, List(str)`, the chosen and rejected responses.
+- `sort` : `List(int)`, used to distinguish the chosen and rejected entries in `response` (the entry with the smaller sort value is rejected, the one with the larger value is chosen).
+
+Sample data:
+
+```text
{
-    "prompt": "Is it possible to purchase a complete chess set at a pawn shop?",
-    "response_0": "Yes, it is possible to purchase a complete chess set at a pawn shop.",
-    "response_1": "Yes, it is possible to purchase a complete chess set at a pawn shop. Pawn shops may carry chess sets that have been either inherited or previously borrowed. There is also a possibility of finding individual chess pieces such as rooks, knights, bishops, queens, and kings. If you cannot find a complete set, you may be able to purchase the various chess pieces separately.",
-    "is_response_0_safe": true,
-    "is_response_1_safe": true,
-    "better_response_id": 1,
-    "safer_response_id": 1
+    "src": ["In this task, you are given a second sentence. Your task is to generate the first sentence on the same topic but incoherent and inconsistent with the second sentence.\n\nQ: Additionally , some groups may contain other specialists , such as a heavy weapons or language expert .\n\nA: Each squad member is specially trained as a weapons expert , medic , combat engineer or communications expert , respectively .\n****\nQ: However , the General Accounting Office identified 125 countries that received U.S. training and assistance for their police forces during fiscal year 1990 at a cost of at least $117 million .\n\nA: No government agency is in charge of calculating the cost .\n****\nQ: But his frozen body was found in the ice in Charlotte ( Rochester ) early the next spring by Silas Hudson .\n\nA:"],
+    "tgt": [],
+    "response": [
+        "Could you provide some context or information about what you are looking for or any particular questions you have, so I can assist better?",
+        "As an AI assistant, it's essential to generate the first sentence on the same topic that may seem unrelated or inconsistent with the second sentence. Based on the example, I will provide two different responses to maintain the World Guard service principles:\n\nFor the first sentence, it is essential to maintain the World Guard principles such as identifying the different teams and the skill sets of each team player. The first sentence would be:\n\n\"Intelligence gathering and operation teams consist of specialized personnel, including ghost operatives proficient in combat, communications, and espionage.\"\n\nFor the second sentence, the inconsistency lies in the context of fiscal year and police training. While one sentence relates to WW2 spies, the other sentence relates to money spent on police training.\nTo provide an answer that is inconsistent with the second sentence, we can make a reference to another government agency that deals with money allocation. Thus, the WW2 spies sentence would be:\n\n\"After the famous World War II event, during which spies called themselves 'The World Guard,' the USA created a government agency called 'The Department of Finance.' Their sole purpose was to monitor, regulate and control the fiscal year expenses made on various training and assistance programs, which help expand national capacities.\"\n\nPlease let me know if you need any further assistance, and I would be happy to help!"
+    ],
+    "sort": [1, 0]
}
+...
+```
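
To make the `sort` semantics concrete, here is a minimal Python sketch (an illustration only, not one of the PaddleNLP scripts) that reads a file in the one-dictionary-per-line format above and recovers the (chosen, rejected) pair of each record; the file name `train.jsonl` is a placeholder for your own data file.

```python
import json


def load_preference_pairs(path):
    """Yield (src, tgt, chosen, rejected) tuples from a one-dict-per-line json file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            example = json.loads(line)  # fields: src, tgt, response, sort
            responses = example["response"]
            sort = example["sort"]
            # The larger sort value marks the chosen response, the smaller the rejected one.
            chosen = responses[sort.index(max(sort))]
            rejected = responses[sort.index(min(sort))]
            yield example["src"], example["tgt"], chosen, rejected


if __name__ == "__main__":
    for src, tgt, chosen, rejected in load_preference_pairs("train.jsonl"):
        print(chosen[:60], "|", rejected[:60])
```

Applied to the sample record above (`sort` is `[1, 0]`), the first entry of `response` is treated as chosen and the second as rejected.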
+
+For convenient testing, a preprocessed preference dataset (ultrafeedback_binarized) is also provided and can be used directly:
+
+```bash
+wget https://bj.bcebos.com/paddlenlp/datasets/examples/ultrafeedback_binarized.tar.gz
+tar -zxvf ultrafeedback_binarized.tar.gz
```

#### RLHF data
@@ -179,20 +196,15 @@ The complete PPO training process consists of the following 3 stages, as shown in the figure below (from [Dee

**2. Reward Model Fine-Tuning**

-Use the `run_reward.py` script with the `rm_argument.json` configuration to train the reward model:
+Use the `run_reward.py` script with the `rm_flashmask_argument.json` configuration to train the reward model:

```
-cd rm
-python -u -m paddle.distributed.launch run_reward.py ../../config/llama/rm_argument.json
+cd llm/alignment/rm
+export PYTHONPATH=../../../:$PYTHONPATH
+python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_reward.py ../../config/llama/rm_flashmask_argument.json
```
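
For background, reward-model fine-tuning of this kind is usually trained with a pairwise ranking objective over each (chosen, rejected) pair: the model assigns a scalar reward to each response, and the loss is `-log(sigmoid(r_chosen - r_rejected))`. The sketch below is a generic, self-contained illustration of that objective under this assumption; it is not the code inside `run_reward.py`.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss for one (chosen, rejected) preference pair."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


if __name__ == "__main__":
    # The loss shrinks as the reward margin between chosen and rejected grows.
    print(round(pairwise_reward_loss(2.0, 0.5), 3))   # 0.201
    print(round(pairwise_reward_loss(0.5, 2.0), 3))   # 1.701
```

Minimizing this loss over a preference dataset pushes the model to score chosen responses higher than rejected ones, which is exactly what the `sort` field in the data above encodes.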

-Most parameters in `rm_argument.json` have the same meaning as in [LLM fine-tuning](finetune.md) and are not repeated here; one minor difference is that `train_datasets`/`eval_datasets` specify the training and evaluation sets via the `NAME` attribute registered when the dataset is defined. In addition, reward model training has the following special parameters (the defaults from PKU-Alignment/PKU-SafeRLHF are used):
-
-- `normalize_score_during_training`: whether to normalize rewards during training; defaults to `False`
-- `normalizer_type`: how mean and var are computed when a normalizer is used; one of `"RunningMeanStd", "ExponentialMovingAverage"`
-- `normalizer_momentum`: the momentum used with the `ExponentialMovingAverage` normalizer; defaults to `0.9`
-- `loss_type`: whether to use token-level or sequence-level loss for reward model training; one of `"token-wise", "sequence-wise"`; defaults to `"sequence-wise"`
-- `regularization`: the regularization coefficient on rewards in the reward-model training objective; defaults to `0.001`
+Most parameters in `rm_flashmask_argument.json` have the same meaning as in [LLM fine-tuning](finetune.md) and are not repeated here.

**3. RLHF:**
