New paper: The Hitchhiker's Guide to Human Alignment with *PO #20

Open
maykcaldas opened this issue Jul 25, 2024 · 0 comments
Comments

@maykcaldas
Collaborator

Paper: The Hitchhiker's Guide to Human Alignment with *PO

Authors: Kian Ahrabian, Xihui Lin, Barun Patra, Vishrav Chaudhary, Alon

Abstract: With the growing utilization of large language models (LLMs) across domains, alignment towards human preferences has become one of the most critical aspects of training models. At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). However, prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we aim to identify the algorithm that, while being performant, is simultaneously more robust to varying hyperparameters, thereby increasing the likelihood of achieving better results. We focus on a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment, offering practical insights into the strengths and weaknesses of these methods. Furthermore, to better understand the shortcomings of generations from the different methods, we analyze the model generations through the lens of KL divergence of the SFT model and the response length statistics. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality that are very close to the SFT responses. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality compared to the policy obtained by vanilla DPO.

Link: https://arxiv.org/abs/2407.15229
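
The abstract attributes DPO's verbosity to its objective and introduces LN-DPO as a simple extension that yields more concise responses. Below is a minimal sketch of that idea, assuming LN-DPO amounts to replacing the summed response log-probabilities in the DPO loss with length-averaged (per-token) ones; the function name, arguments, and this exact formulation are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


def preference_loss(pi_logps_chosen, pi_logps_rejected,
                    ref_logps_chosen, ref_logps_rejected,
                    chosen_lens=None, rejected_lens=None,
                    beta=0.1, length_normalize=False):
    """DPO-style loss on per-response log-probabilities.

    All log-prob arguments are summed token log-probs of shape (batch,);
    *_lens are response lengths in tokens, needed when length_normalize=True.
    """
    if length_normalize:
        # Hypothetical LN-DPO-style variant: use per-token averages instead of
        # sums, removing the component of the implicit reward that grows with
        # response length.
        pi_logps_chosen = pi_logps_chosen / chosen_lens
        pi_logps_rejected = pi_logps_rejected / rejected_lens
        ref_logps_chosen = ref_logps_chosen / chosen_lens
        ref_logps_rejected = ref_logps_rejected / rejected_lens

    # Implicit rewards of the policy relative to the frozen SFT/reference model.
    chosen_rewards = beta * (pi_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (pi_logps_rejected - ref_logps_rejected)

    # Standard DPO objective: -log sigmoid of the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Example with dummy values (batch of 2 preference pairs):
pi_c = torch.tensor([-12.0, -20.0]); pi_r = torch.tensor([-15.0, -22.0])
ref_c = torch.tensor([-13.0, -21.0]); ref_r = torch.tensor([-14.0, -21.5])
lens_c = torch.tensor([40.0, 80.0]); lens_r = torch.tensor([55.0, 120.0])
print(preference_loss(pi_c, pi_r, ref_c, ref_r, lens_c, lens_r,
                      length_normalize=True))
```

Averaging rather than summing removes the term of the implicit reward that scales with response length, which is one plausible way to discourage the long, low-quality generations the abstract describes.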

Reasoning: Let's think step by step in order to determine if the paper is about a language model. We start by examining the title and abstract. The title mentions "Human Alignment with *PO," which suggests a focus on aligning models with human preferences. The abstract further elaborates on the use of large language models (LLMs) and discusses methods for optimizing these models to align with human preferences. It also mentions specific algorithms and their performance in generating responses, which are typical concerns in the development and fine-tuning of language models. Taken together, these points indicate that the paper is about language models.
