[Feature] Support General Reward Model #2427

Open

zhaochenyang20 opened this issue Dec 10, 2024 · 0 comments

zhaochenyang20 (Collaborator) commented Dec 10, 2024

Motivation

As mentioned in our development roadmap, #1487:

Support generalized reward API (adding linear layers to any Causal LM to get the reward) as required by the OpenRLHF team.

https://github.com/OpenRLHF/OpenRLHF

Add linear layers to any Causal LM to get rewards.

We formalize this requirement in this issue and invite @M0gician to contribute with us.

Feature Requests

1. Add linear layers to any Causal LM to get rewards.

  • Add a linear layer at the end, designate a specific token (such as the final eos in the prompt), and use its logits as the reward (see the sketch below).
  • Add a linear layer after a specific value head (identified by name) at any layer, and use its logits as the reward.
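
A minimal sketch of the first variant, assuming a HuggingFace Causal LM; the wrapper class, the pooling rule (last non-padding token), and the checkpoint name are illustrative, not part of the proposal:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class CausalLMWithValueHead(nn.Module):
    """Any HF Causal LM plus a scalar value head (illustrative sketch)."""

    def __init__(self, model_name: str):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size
        # Linear head mapping one token's hidden state to a scalar reward.
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    @torch.no_grad()
    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = out.hidden_states[-1]            # [batch, seq, hidden]
        # Use the last non-padding token (e.g. the final eos), assuming right padding.
        last_idx = attention_mask.sum(dim=1) - 1  # [batch]
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)  # [batch] scalar rewards


# Illustrative checkpoint; any causal LM works the same way.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
rm = CausalLMWithValueHead(name)
batch = tok(["Question: 1+1=? Answer: 2"], return_tensors="pt")
print(rm(batch["input_ids"], batch["attention_mask"]))
```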

2. Add a --task parameter.

  • Get rewards/embeddings from any Causal LM by adding a parameter such as --task embedding (see the sketch below).
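
The proposed parameter could plug into the existing launch path. Note that neither the `task` keyword nor the `--task` server flag below exists in SGLang yet; they are the feature being requested here:

```python
# Proposed usage only: `task` is the new parameter requested in this issue,
# NOT an existing SGLang argument.
import sglang as sgl

# Launch the offline engine in "reward" (or "embedding") mode.
engine = sgl.Engine(model_path="path/to/causal-lm", task="reward")

# Corresponding server launch (also proposed):
#   python -m sglang.launch_server --model-path path/to/causal-lm --task reward
```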

3. Better Accuracy.

Many users may have noticed that the reward scores returned by SGLang's current API show a discrepancy (around 3/1000) compared with those obtained from training engines such as DeepSpeed or LLaMA-Factory. This discrepancy is not caused by a bug in our framework; in fact, the same problem exists in all current inference engines:

The kernel fusion in inference engines differs significantly from that in training engines. As the batch size varies, inference requests are dispatched to different kernels, and numerical errors accumulate layer by layer; by the time they reach the logits layer, the errors become noticeable. This issue has existed since the BERT era: precision differences between training and inference engines are unavoidable.

As a result, in RLHF, inference engines are primarily used to accelerate sampling, while rewards and embeddings still rely on training scripts. It may take our team several months to address this issue properly.

We will add logging about this issue to our Engine and document it. Even though the reward may be slightly inaccurate, we still provide a general reward interface, in the hope that community users can design more robust RL algorithms that work well in this scenario.
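
As a rough illustration of the magnitude involved, one can compare reward scores from a training-engine forward pass with those from the inference engine; the helper below is only an assumption about how a user might measure the gap:

```python
import numpy as np

def max_relative_gap(train_rewards, infer_rewards):
    """Largest relative difference between training- and inference-engine rewards."""
    train_rewards = np.asarray(train_rewards, dtype=np.float64)
    infer_rewards = np.asarray(infer_rewards, dtype=np.float64)
    denom = np.maximum(np.abs(train_rewards), 1e-8)
    return float(np.max(np.abs(train_rewards - infer_rewards) / denom))

# A value around 3e-3 corresponds to the ~3/1000 discrepancy described above.
```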

Related resources

  1. The reward forward function in OpenRLHF.
  2. HuggingFace LlamaForSequenceClassification and AutoLinearXXX.
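
For reference, the HuggingFace route in item 2 attaches a linear score head on top of the decoder and reads the last non-padding token. A minimal usage sketch, with a placeholder checkpoint path:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path; substitute any reward model with a 1-dimensional score head.
name = "path/to/reward-model"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

inputs = tok("Question: 1+1=? Answer: 2", return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits.squeeze(-1)  # one scalar reward per sequence
```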