Add Direct Preference Optimization (DPO) method #1279
+211 −3
Fixes #513
Implement the Direct Preference Optimization (DPO) method as a Reinforcement Learning from Human Feedback (RLHF) example.
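For reference, the loss being implemented is the standard DPO objective from Rafailov et al. (2023); the formula below is background context, not text from this PR:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$

where $y_w$ is the preferred (chosen) completion, $y_l$ the rejected one, and $\beta$ controls how far the policy may drift from the frozen reference model.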
- Add `get_batched_logps` and `dpo_loss` functions to `llms/mlx_lm/utils.py` for the DPO implementation (a hedged sketch follows after this list).
- Update `llms/mlx_lm/tuner/trainer.py` to include DPO-specific training logic, including a new `dpo_loss` function and a condition that selects the DPO loss in the training loop (also sketched below).
- Update `llms/mlx_lm/examples/lora_config.yaml`.
- Update `llms/mlx_lm/README.md` to include instructions for using DPO.
- Add `llms/tests/test_dpo.py` with unit tests for `get_batched_logps`, `dpo_loss`, and the DPO-specific training logic (a sample test is sketched below).

For more details, open the Copilot Workspace session.
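The PR summary above does not show the bodies of `get_batched_logps` and `dpo_loss`, so the following is a minimal sketch of what such helpers typically look like in MLX; the signatures, argument names, and shapes are assumptions, not the merged code.

```python
import mlx.core as mx


def get_batched_logps(model, inputs, targets, loss_mask):
    """Sum the log-probabilities of `targets` under `model` for each sequence.

    inputs/targets: (batch, seq_len) token ids; loss_mask is 1.0 on the
    completion tokens that should count and 0.0 on prompt/padding tokens.
    """
    logits = model(inputs)                                   # (batch, seq_len, vocab)
    logps = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    token_logps = mx.take_along_axis(
        logps, mx.expand_dims(targets, -1), axis=-1
    ).squeeze(-1)                                            # (batch, seq_len)
    return (token_logps * loss_mask).sum(axis=-1)            # (batch,)


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # log(sigmoid(x)) = x - log(1 + e^x), written with logaddexp for stability.
    return -(logits - mx.logaddexp(0.0, logits)).mean()
```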
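Likewise, the DPO branch in the training loop is only described, not shown, above. The function below is a hypothetical illustration of how such a condition could be wired up, reusing the helpers from the previous sketch; the batch keys and the `loss_type` switch are assumptions rather than the actual `trainer.py` changes.

```python
import mlx.core as mx


def compute_loss(model, ref_model, batch, loss_type="dpo", beta=0.1):
    """Compute the DPO loss when loss_type == "dpo" (hypothetical dispatch)."""
    if loss_type == "dpo":
        # Each batch pairs a preferred ("chosen") and a dispreferred
        # ("rejected") completion for the same prompt.
        chosen = get_batched_logps(
            model, batch["chosen_inputs"], batch["chosen_targets"], batch["chosen_mask"])
        rejected = get_batched_logps(
            model, batch["rejected_inputs"], batch["rejected_targets"], batch["rejected_mask"])
        # The frozen reference model contributes no gradients.
        ref_chosen = mx.stop_gradient(get_batched_logps(
            ref_model, batch["chosen_inputs"], batch["chosen_targets"], batch["chosen_mask"]))
        ref_rejected = mx.stop_gradient(get_batched_logps(
            ref_model, batch["rejected_inputs"], batch["rejected_targets"], batch["rejected_mask"]))
        return dpo_loss(chosen, rejected, ref_chosen, ref_rejected, beta=beta)
    # Other loss types (e.g. the default cross-entropy) would be handled here.
    raise ValueError(f"Unsupported loss type: {loss_type}")
```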
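Finally, a flavour of what a unit test in `llms/tests/test_dpo.py` might check. This is a sketch, assuming `dpo_loss` is importable from `mlx_lm.utils` (the location named in the summary) with the signature sketched above; it is not the PR's actual test code.

```python
import math
import unittest

import mlx.core as mx

from mlx_lm.utils import dpo_loss  # location per the PR summary above


class TestDPOLoss(unittest.TestCase):
    def test_preferring_chosen_lowers_loss(self):
        chosen = mx.array([-5.0, -6.0])
        rejected = mx.array([-9.0, -10.0])
        reference = mx.array([-7.0, -8.0])
        # With identical reference log-probs the reference margin is zero, so
        # ranking chosen above rejected should give a loss below log(2) and
        # below the loss of the flipped ranking.
        good = dpo_loss(chosen, rejected, reference, reference, beta=0.1)
        bad = dpo_loss(rejected, chosen, reference, reference, beta=0.1)
        self.assertLess(good.item(), math.log(2.0))
        self.assertLess(good.item(), bad.item())


if __name__ == "__main__":
    unittest.main()
```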