[Feature] Support General Reward Model #2427

Open

zhaochenyang20 opened this issue Dec 10, 2024 · 0 comments

zhaochenyang20 (Collaborator) commented Dec 10, 2024

Motivation

As mentioned in our development roadmap, #1487:

Support generalized reward API (adding linear layers to any Causal LM to get the reward) as required by the OpenRLHF team.

https://github.com/OpenRLHF/OpenRLHF

Add linear layers to any Causal LM to get rewards.

We formalize this requirement in this issue and invite @M0gician to contribute with us.

Feature Requests

1. Add linear layers to any Causal LM to get rewards.

  • Add a linear layer at the end, designate a specific token (such as the final eos in the prompt), and use its logits as the reward (see the sketch below).
  • Add a linear layer after a specific value head (identified by name) at any layer, and use its logits as the reward.
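
A minimal sketch of the first variant, assuming a HuggingFace Causal LM; the wrapper class, the pooling rule (last non-padding token), and the checkpoint name are illustrative, not part of the proposal:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class CausalLMWithValueHead(nn.Module):
    """Any HF Causal LM plus a scalar value head (illustrative sketch)."""

    def __init__(self, model_name: str):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size
        # Linear head mapping one token's hidden state to a scalar reward.
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    @torch.no_grad()
    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = out.hidden_states[-1]            # [batch, seq, hidden]
        # Use the last non-padding token (e.g. the final eos), assuming right padding.
        last_idx = attention_mask.sum(dim=1) - 1  # [batch]
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)  # [batch] scalar rewards


# Illustrative checkpoint; any causal LM works the same way.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
rm = CausalLMWithValueHead(name)
batch = tok(["Question: 1+1=? Answer: 2"], return_tensors="pt")
print(rm(batch["input_ids"], batch["attention_mask"]))
```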

2. Add a --task parameter.

  • Get rewards/embeddings from any Causal LM by adding a parameter such as --task embedding (see the sketch below).
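
The proposed parameter could plug into the existing launch path. Note that neither the `task` keyword nor the `--task` server flag below exists in SGLang yet; they are the feature being requested here:

```python
# Proposed usage only: `task` is the new parameter requested in this issue,
# NOT an existing SGLang argument.
import sglang as sgl

# Launch the offline engine in "reward" (or "embedding") mode.
engine = sgl.Engine(model_path="path/to/causal-lm", task="reward")

# Corresponding server launch (also proposed):
#   python -m sglang.launch_server --model-path path/to/causal-lm --task reward
```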

3. Better Accuracy.

Many users may have noticed that the reward scores returned by SGLang's current API show a discrepancy (around 3/1000) compared with those obtained from training engines such as DeepSpeed or LLaMA-Factory. This discrepancy is not caused by a bug in our framework; in fact, the same problem exists in all current inference engines:

The kernel fusion in inference engines differs significantly from that in training engines. As the batch size varies, inference requests are dispatched to different kernels, and numerical errors accumulate layer by layer; by the time they reach the logits layer, the errors become noticeable. This issue has existed since the BERT era: precision differences between training and inference engines are unavoidable.

As a result, in RLHF, inference engines are primarily used to accelerate sampling, while rewards and embeddings still rely on training scripts. It may take our team several months to address this issue properly.

We will add logging about this issue to our Engine and document it. Even though the reward may be slightly inaccurate, we still provide a general reward interface, in the hope that community users can design more robust RL algorithms that work well in this scenario.
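
As a rough illustration of the magnitude involved, one can compare reward scores from a training-engine forward pass with those from the inference engine; the helper below is only an assumption about how a user might measure the gap:

```python
import numpy as np

def max_relative_gap(train_rewards, infer_rewards):
    """Largest relative difference between training- and inference-engine rewards."""
    train_rewards = np.asarray(train_rewards, dtype=np.float64)
    infer_rewards = np.asarray(infer_rewards, dtype=np.float64)
    denom = np.maximum(np.abs(train_rewards), 1e-8)
    return float(np.max(np.abs(train_rewards - infer_rewards) / denom))

# A value around 3e-3 corresponds to the ~3/1000 discrepancy described above.
```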

Related resources

  1. The reward forward function in OpenRLHF.
  2. HuggingFace LlamaForSequenceClassification and AutoLinearXXX.
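
For reference, the HuggingFace route in item 2 attaches a linear score head on top of the decoder and reads the last non-padding token. A minimal usage sketch, with a placeholder checkpoint path:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path; substitute any reward model with a 1-dimensional score head.
name = "path/to/reward-model"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

inputs = tok("Question: 1+1=? Answer: 2", return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits.squeeze(-1)  # one scalar reward per sequence
```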