Add Direct Preference Optimization (DPO) method #1279
+211 −3
Fixes #513
Implement the Direct Preference Optimization (DPO) method as a Reinforcement Learning from Human Feedback (RLHF) example.
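For reference, the loss being implemented is the standard DPO objective from Rafailov et al. (2023); the formula below is background context, not text from this PR:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$

where $y_w$ is the preferred (chosen) completion, $y_l$ the rejected one, and $\beta$ controls how far the policy may drift from the frozen reference model.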
- Add `get_batched_logps` and `dpo_loss` functions to `llms/mlx_lm/utils.py` for the DPO implementation (a hedged sketch follows after this list).
- Update `llms/mlx_lm/tuner/trainer.py` to include DPO-specific training logic, including a new `dpo_loss` function and a condition that selects the DPO loss in the training loop (also sketched below).
- Update `llms/mlx_lm/examples/lora_config.yaml`.
- Update `llms/mlx_lm/README.md` to include instructions for using DPO.
- Add `llms/tests/test_dpo.py` with unit tests for `get_batched_logps`, `dpo_loss`, and the DPO-specific training logic (a sample test is sketched below).

For more details, open the Copilot Workspace session.
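The PR summary above does not show the bodies of `get_batched_logps` and `dpo_loss`, so the following is a minimal sketch of what such helpers typically look like in MLX; the signatures, argument names, and shapes are assumptions, not the merged code.

```python
import mlx.core as mx


def get_batched_logps(model, inputs, targets, loss_mask):
    """Sum the log-probabilities of `targets` under `model` for each sequence.

    inputs/targets: (batch, seq_len) token ids; loss_mask is 1.0 on the
    completion tokens that should count and 0.0 on prompt/padding tokens.
    """
    logits = model(inputs)                                   # (batch, seq_len, vocab)
    logps = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    token_logps = mx.take_along_axis(
        logps, mx.expand_dims(targets, -1), axis=-1
    ).squeeze(-1)                                            # (batch, seq_len)
    return (token_logps * loss_mask).sum(axis=-1)            # (batch,)


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # log(sigmoid(x)) = x - log(1 + e^x), written with logaddexp for stability.
    return -(logits - mx.logaddexp(0.0, logits)).mean()
```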
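Likewise, the DPO branch in the training loop is only described, not shown, above. The function below is a hypothetical illustration of how such a condition could be wired up, reusing the helpers from the previous sketch; the batch keys and the `loss_type` switch are assumptions rather than the actual `trainer.py` changes.

```python
import mlx.core as mx


def compute_loss(model, ref_model, batch, loss_type="dpo", beta=0.1):
    """Compute the DPO loss when loss_type == "dpo" (hypothetical dispatch)."""
    if loss_type == "dpo":
        # Each batch pairs a preferred ("chosen") and a dispreferred
        # ("rejected") completion for the same prompt.
        chosen = get_batched_logps(
            model, batch["chosen_inputs"], batch["chosen_targets"], batch["chosen_mask"])
        rejected = get_batched_logps(
            model, batch["rejected_inputs"], batch["rejected_targets"], batch["rejected_mask"])
        # The frozen reference model contributes no gradients.
        ref_chosen = mx.stop_gradient(get_batched_logps(
            ref_model, batch["chosen_inputs"], batch["chosen_targets"], batch["chosen_mask"]))
        ref_rejected = mx.stop_gradient(get_batched_logps(
            ref_model, batch["rejected_inputs"], batch["rejected_targets"], batch["rejected_mask"]))
        return dpo_loss(chosen, rejected, ref_chosen, ref_rejected, beta=beta)
    # Other loss types (e.g. the default cross-entropy) would be handled here.
    raise ValueError(f"Unsupported loss type: {loss_type}")
```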
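Finally, a flavour of what a unit test in `llms/tests/test_dpo.py` might check. This is a sketch, assuming `dpo_loss` is importable from `mlx_lm.utils` (the location named in the summary) with the signature sketched above; it is not the PR's actual test code.

```python
import math
import unittest

import mlx.core as mx

from mlx_lm.utils import dpo_loss  # location per the PR summary above


class TestDPOLoss(unittest.TestCase):
    def test_preferring_chosen_lowers_loss(self):
        chosen = mx.array([-5.0, -6.0])
        rejected = mx.array([-9.0, -10.0])
        reference = mx.array([-7.0, -8.0])
        # With identical reference log-probs the reference margin is zero, so
        # ranking chosen above rejected should give a loss below log(2) and
        # below the loss of the flipped ranking.
        good = dpo_loss(chosen, rejected, reference, reference, beta=0.1)
        bad = dpo_loss(rejected, chosen, reference, reference, beta=0.1)
        self.assertLess(good.item(), math.log(2.0))
        self.assertLess(good.item(), bad.item())


if __name__ == "__main__":
    unittest.main()
```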