
[RL] Fix typo and add wandb log #10641

Merged · 3 commits · May 23, 2025
1 change: 1 addition & 0 deletions docs/zh/llm/devices/intel_hpu/tests/README.md
1 change: 0 additions & 1 deletion docs/zh/llm/docs/pretrain.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/zh/llm/docs/pretrain.md
1 change: 1 addition & 0 deletions llm/alignment/rl/README.md
@@ -203,6 +203,7 @@ export FLAGS_cascade_attention_max_partition_size=2048

python -u -m paddle.distributed.launch --devices "0,1,2,3" run_rl.py ../../config/qwen/reinforce_plus_plus_argument.yaml
```
+ We provide a reproducible [wandb log](https://api.wandb.ai/links/ainlp66-netflix/injcw3ra) based on the script above.

### Online Monitoring
The output directory set in `grpo_argument.yaml` and `reinforce_plus_plus_argument.yaml` is `"logging_dir": "vdl_log"`; the training process can be viewed with the following command
2 changes: 1 addition & 1 deletion llm/config/qwen/grpo_32b_argument.yaml
@@ -74,7 +74,7 @@ disable_tqdm: true # Whether to disable tqdm progress bar

# RL args
kl_coeff: 0.001 # KL coefficient for PPO and Reinforce++
- kl_loss_coeff: 0.001 # KL loss coefficient
+ kl_loss_coeff: 0.000 # KL loss coefficient
pg_loss_coeff: 1.0 # Policy gradient loss coefficient
entropy_coeff: 0.0 # Entropy coefficient
clip_range_ratio: 0.2 # The clipping range for ratio between the old and new policy. (PPO algorithm)
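
For context on what this change does (here and in the identical edit to `grpo_argument.yaml` below): in common PPO/GRPO-style trainers these coefficients weight the terms of a composite objective roughly as sketched here. This is an illustrative, assumed form, not necessarily PaddleNLP's exact implementation.

```latex
% Illustrative composite RL objective (assumed form, not the verbatim PaddleNLP loss):
%   pg_loss_coeff  weights the clipped policy-gradient loss,
%   kl_loss_coeff  weights an in-loss KL penalty against the reference policy,
%   entropy_coeff  weights an entropy bonus.
L(\theta) = \texttt{pg\_loss\_coeff} \cdot L_{\mathrm{PG}}(\theta)
          + \texttt{kl\_loss\_coeff} \cdot D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
          - \texttt{entropy\_coeff} \cdot H(\pi_\theta)
```

Under this reading, setting `kl_loss_coeff` to `0.000` disables the in-loss KL term, while `kl_coeff` (typically applied as a reward-side penalty in such setups) remains at `0.001`.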
2 changes: 1 addition & 1 deletion llm/config/qwen/grpo_argument.yaml
@@ -74,7 +74,7 @@ disable_tqdm: true # Whether to disable tqdm progress bar

# RL args
kl_coeff: 0.001 # KL coefficient for PPO and Reinforce++
- kl_loss_coeff: 0.001 # KL loss coefficient
+ kl_loss_coeff: 0.000 # KL loss coefficient
pg_loss_coeff: 1.0 # Policy gradient loss coefficient
entropy_coeff: 0.0 # Entropy coefficient
clip_range_ratio: 0.2 # The clipping range for ratio between the old and new policy. (PPO algorithm)
7 changes: 3 additions & 4 deletions paddlenlp/rl/trainer/ppo_trainer.py
@@ -1526,7 +1526,7 @@
if self.args.rl_algorithm == "ppo":
batch["reward_values"] = self.critic_trainer.compute_value(**batch)

- # danamic sampling: filter generated samples by rewards, keep generating until valid samples are enough
+ # dynamic sampling: filter generated samples by rewards, keep generating until valid samples are enough
if self.args.dynamic_sampling:
local_valid_prompt = 0
# combined_batch = combine_micro_batches_into_batch(micro_batches, pad_token_id=self.tokenizer.pad_token_id)
@@ -1601,7 +1601,7 @@
total_batch = defaultdict(list)
total_valid_prompt = 0
num_gen_batches = 0
logger.info("Danymic sampling completed. \n")
logger.info("Dynamic sampling completed. \n")

Codecov / codecov/patch warning: added line paddlenlp/rl/trainer/ppo_trainer.py#L1604 was not covered by tests.

else:
if self.args.max_gen_batches > 0 and num_gen_batches > self.args.max_gen_batches:
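
To make the control flow above easier to follow, here is a minimal, self-contained sketch of the dynamic-sampling idea described in the comment earlier in this hunk: filter generated samples by reward and keep generating until enough valid prompts accumulate. The helper names `generate_batch` and `reward_is_valid` are hypothetical placeholders, not the trainer's real internals.

```python
from collections import defaultdict

def dynamic_sampling(generate_batch, reward_is_valid, needed_prompts, max_gen_batches=10):
    """Collect reward-filtered samples until enough valid prompts are gathered.

    A simplified sketch of dynamic sampling, not the PaddleNLP PPO trainer code:
    `generate_batch` is assumed to return a dict of parallel lists that includes
    a "reward" entry; `reward_is_valid` decides which samples to keep.
    """
    total_batch = defaultdict(list)  # accumulated valid samples, keyed by field name
    total_valid_prompt = 0           # number of valid prompts collected so far
    num_gen_batches = 0              # generation rounds used so far

    while total_valid_prompt < needed_prompts:
        num_gen_batches += 1
        if max_gen_batches > 0 and num_gen_batches > max_gen_batches:
            raise RuntimeError("Exceeded max_gen_batches before collecting enough valid samples.")

        batch = generate_batch()  # e.g. {"prompt": [...], "response": [...], "reward": [...]}
        for i, reward in enumerate(batch["reward"]):
            if not reward_is_valid(reward):
                continue  # drop samples whose reward fails the filter
            for key, values in batch.items():
                total_batch[key].append(values[i])
            total_valid_prompt += 1

    return total_batch
```

A common validity rule in GRPO-style dynamic sampling is to drop prompt groups whose rewards are all identical (all correct or all incorrect), since such groups contribute no advantage signal; the sketch folds that decision into `reward_is_valid`.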
@@ -1664,7 +1664,7 @@
paddle.device.cuda.empty_cache()

if self.args.rl_algorithm == "ppo":
rl_info["train_value_loss"] = self.critic_trainer.update_critc(micro_batch)
rl_info["train_value_loss"] = self.critic_trainer.update_critic(micro_batch)

Codecov / codecov/patch warning: added line paddlenlp/rl/trainer/ppo_trainer.py#L1667 was not covered by tests.
if self.is_step_end():
self.state.global_step += 1
self.state.epoch = epoch + (step + 1) / steps_in_epoch
@@ -1701,7 +1701,6 @@

if self.control.should_training_stop:
break
- # TODO(guosheng): add epilogue of training
logger.info("\nTraining completed. \n")
if args.load_best_model_at_end and self.state.best_model_checkpoint is not None:
if args.local_rank != -1: