In this project, a conversational chatbot named GPT2Chat, based on the GPT-2 language model, is developed. To enable the model to engage in meaningful dialogues, instruction fine-tuning is first performed using the OpenAssistant Conversations Dataset (OASST1) and the Alpaca dataset. The LLaMA3 instruction template is adopted and adapted by incorporating the special tokens `<|start_context|>` and `<|end_context|>` to encapsulate conversation history, thereby providing contextual awareness to the model. Subsequently, the ORPO method is employed for preference alignment, utilizing the `trl-lib/ultrafeedback_binarized` dataset to refine the model's responses based on human feedback. The resulting chatbot demonstrates decent conversational capabilities, leveraging both fine-tuning and preference learning techniques.
In this project, a conversational chatbot called GPT2Chat, based on the GPT-2 language model [1], is developed. The goal was to create a system that can hold natural and coherent conversations with users. To make this happen, a two-step process is used: instruction fine-tuning and preference alignment, tailoring the pre-trained GPT-2 model for dialogue generation.
GPT-2 [1], created by OpenAI, is a powerful transformer-based [2] language model that excels at generating human-like text. However, it’s not inherently built for conversations. To adapt it for chatting, it is fine-tuned on conversational and instruction-following datasets, and its responses are aligned with human preferences using the techniques described below.
- Instruction Fine-Tuning: This step trains the model on conversational or instruction-based datasets to improve its ability to respond to user inputs.
- Preference Alignment: This refines the model further by incorporating human feedback, ensuring responses are not just relevant but also user-friendly.
GPT-2 is fine-tuned using two datasets:
- OpenAssistant Conversations Dataset (OASST1): A dataset full of human-AI conversations, perfect for teaching dialogue skills [3].
- Alpaca Dataset: Built by Stanford, this focuses on instruction-following, helping the model respond accurately to prompts [4].
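As a quick sketch of how these datasets can be pulled in with the `datasets` library (the Hub IDs `OpenAssistant/oasst1` and `tatsu-lab/alpaca` are assumptions; the project may use local or preprocessed copies):

```python
# Minimal sketch: loading the two fine-tuning datasets from the Hugging Face Hub.
# The dataset IDs below are assumptions and may differ from the exact copies used here.
from datasets import load_dataset

oasst1 = load_dataset("OpenAssistant/oasst1", split="train")  # human-AI conversation trees
alpaca = load_dataset("tatsu-lab/alpaca", split="train")      # instruction/input/output records

print(oasst1[0].keys())  # message fields such as text, role, parent_id
print(alpaca[0].keys())  # instruction, input, output
```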
To keep the chatbot aware of the conversation flow, the LLaMA3 instruction template [5] is utilized and modified. The `<|start_context|>` and `<|end_context|>` tokens are added to wrap the past conversation history or any extra context. This way, the model knows what has been said before and can reply accordingly.
The fine-tuning process trained the model to predict the next response based on this context and the user’s latest input.
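A minimal sketch of how the context tokens might be registered with GPT-2 and a training prompt assembled is shown below; the role markers are illustrative assumptions standing in for the adapted LLaMA3-style template, whose exact markers may differ:

```python
# Sketch: register the context tokens with GPT-2 and build one training prompt.
# The "### User:" / "### Assistant:" markers are illustrative assumptions standing in
# for the adapted LLaMA3-style instruction template.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Add the special tokens that wrap conversation history, then resize the embeddings.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|start_context|>", "<|end_context|>"]}
)
model.resize_token_embeddings(len(tokenizer))

def build_prompt(history: str, user_message: str) -> str:
    """Wrap prior turns in context tokens and append the latest user input."""
    return (
        f"<|start_context|>{history}<|end_context|>\n"
        f"### User: {user_message}\n"
        f"### Assistant:"
    )

print(build_prompt("User: Hi\nAssistant: Hello! How can I help?", "Tell me a joke."))
```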
Next, the ORPO method [6] is used for preference alignment. ORPO, or Odds Ratio Preference Optimization, is a technique that improves the model’s responses by learning from human feedback. It trains the model to favor responses that people prefer while avoiding less desirable ones, using a dataset of “chosen” (good) and “rejected” (bad) response pairs. In this project, the `trl-lib/ultrafeedback_binarized` dataset, which provides these pairs, is used to make GPT2Chat’s replies more helpful and engaging. Unlike other methods, ORPO combines this preference learning with instruction tuning in one step, making it efficient for fine-tuning GPT-2.
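For reference, the ORPO objective from [6] pairs the usual supervised (SFT) loss with an odds-ratio term over each chosen response $y_w$ and rejected response $y_l$, weighted by a coefficient $\lambda$:

```math
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\big[\mathcal{L}_{\text{SFT}} + \lambda\,\mathcal{L}_{\text{OR}}\big],
\qquad
\mathcal{L}_{\text{OR}} = -\log\sigma\!\left(\log\frac{\operatorname{odds}_\theta(y_w\mid x)}{\operatorname{odds}_\theta(y_l\mid x)}\right),
\qquad
\operatorname{odds}_\theta(y\mid x) = \frac{P_\theta(y\mid x)}{1-P_\theta(y\mid x)}
```

Intuitively, the odds-ratio term pushes up the likelihood of the chosen response relative to the rejected one, while the SFT term keeps generation quality intact.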
Several factors make ORPO a good fit here:

- Dataset Synergy: The `trl-lib/ultrafeedback_binarized` dataset contains preference pairs (chosen response, rejected response), which ORPO leverages to refine GPT2Chat’s outputs. This complements the earlier SFT with OASST1 (conversational data) and Alpaca (instruction-following data), as ORPO builds on the model’s ability to generate coherent responses while aligning them with human preferences.
- Context Tokens: The use of `<|start_context|>` and `<|end_context|>` tokens to encode conversation history aligns well with ORPO, as the method can incorporate contextual inputs in the prompt. ORPO’s loss function evaluates responses based on the full input (prompt + context), ensuring GPT2Chat remains contextually aware during preference alignment.
- Efficiency: Since GPT-2 is a smaller model, ORPO’s single-step process and lack of a reference model make it computationally feasible for this project, likely reducing training time and resource needs compared to RLHF [7] or DPO [8].
- Outcome: ORPO enhances GPT2Chat’s ability to generate user-preferred responses, such as more helpful, accurate, or engaging replies. For example, if a user asks, “What’s the best way to learn Python?” ORPO ensures GPT2Chat favors a detailed, practical response over a vague or incorrect one, based on the preference data.
This step ensures the chatbot doesn’t just make sense—it also generates replies people actually like.
Here’s how it all comes together:
- Data Preparation: The OASST1 and Alpaca datasets are processed to fit the modified template, wrapping conversation history in context tokens.
- Fine-Tuning: Using PyTorch [9], HuggingFace [10], and PyTorch Lightning [11], GPT-2 is fine-tuned to optimize its dialogue performance.
- Preference Alignment: ORPO is applied with the `trl-lib/ultrafeedback_binarized` dataset, using feedback to reward better responses (a minimal training sketch follows this list).
- Testing: The chatbot’s ability to chat naturally is checked, both qualitatively and with the metrics reported below.
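A minimal sketch of the ORPO step with the `trl` library is shown below; the SFT checkpoint path and hyperparameter values are illustrative assumptions, not the project’s exact settings:

```python
# Sketch: preference alignment of the SFT checkpoint with trl's ORPOTrainer.
# "gpt2chat-sft" is a hypothetical path to the instruction-tuned checkpoint, and the
# hyperparameters are illustrative, not the values actually used in this project.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2chat-sft")
tokenizer = AutoTokenizer.from_pretrained("gpt2chat-sft")
tokenizer.pad_token = tokenizer.eos_token

train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = ORPOConfig(
    output_dir="gpt2chat-orpo",
    beta=0.1,                      # weight of the odds-ratio term (lambda in the paper)
    max_length=1024,
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # `tokenizer=` in older trl releases
)
trainer.train()
```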
The following metrics measure the performance of GPT2Chat:
| Model | Params | FLOPs/token | Tokens/sec | PPL ↓ |
|---|---|---|---|---|
| GPT2Chat (w/o ORPO) | 125M | 125M | 90.214 | 15.712 |
| GPT2Chat | 125M | 125M | 90.214 | 12.889 |
- Perplexity: Calculated on the OASST1 validation set.
- Inference Speed: Measured in token/sec.
- Compute Overhead: Measured in FLOPs/token.
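The sketch below shows one way such measurements can be taken; the table values come from the project’s own evaluation pipeline, and `gpt2chat` is a hypothetical checkpoint path used for illustration:

```python
# Sketch: measuring perplexity and generation speed for a GPT2Chat checkpoint.
# "gpt2chat" is a hypothetical model path; the project's actual evaluation code may differ.
import math
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2chat")
model = AutoModelForCausalLM.from_pretrained("gpt2chat").to(device).eval()

def perplexity(texts):
    """Average perplexity over a list of validation texts (e.g. OASST1 validation turns)."""
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

def tokens_per_second(prompt, max_new_tokens=128):
    """Rough wall-clock generation speed in tokens per second."""
    enc = tokenizer(prompt, return_tensors="pt").to(device)
    start = time.time()
    out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False)
    return (out.shape[-1] - enc["input_ids"].shape[-1]) / (time.time() - start)
```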
Examples below demonstrate the difference between the responses of GPT2Chat without any preference alignment and GPT2Chat with ORPO.
- Example 1:

| Model | Prompt | Response |
|---|---|---|
| GPT2Chat w/o ORPO | What’s the best way to learn Python? | The best way to learn Python is to use it as often as you can, read beginner-friendly books and articles, watch tutorials and examples, and participate in online communities. |
| GPT2Chat | What’s the best way to learn Python? | The best way to learn Python is to start with a basic understanding of the language, such as the basic syntax, rules, and syntax. You should also make sure to have a strong understanding of the language's history, including the core concepts, frameworks, and frameworks. Additionally, you should have a good understanding of the ecosystem and its ecosystem, as well as practicing your understanding through practice. |
- Example 2:

| Model | Prompt | Response |
|---|---|---|
| GPT2Chat w/o ORPO | How can I improve my English skills? | There are many ways to improve your English skills, such as practicing speaking with native speakers, reading English books and articles, watching English movies and TV shows, listening to English podcasts and speaking with native English speakers, and practicing the English vocabulary. Additionally, you should study for English exam and spelling, as well as writing and editing, to improve your English skills. |
| GPT2Chat | How can I improve my English skills? | To improve your English skills, try the following: |
- Example 3:

| Model | Prompt | Response |
|---|---|---|
| GPT2Chat w/o ORPO | How can we maintain our well-being? | We can maintain our well-being by taking care of ourselves and our families, eating nutritious meals, getting enough sleep, engaging in physical activity, engaging in creative activities, and taking care of our physical health. |
| GPT2Chat | How can we maintain our well-being? | Here are some ways to maintain your well-being: |
A sample conversation with GPT2Chat on Colab.
GPT2Chat’s conversational ability is also exposed via a Telegram bot; with LangChain [12], the conversation flow becomes easy to manage.
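A rough sketch of how the model can be wrapped in a LangChain conversation chain, so each incoming Telegram message is answered with the running history as context (module paths follow older langchain releases, `gpt2chat` is a hypothetical checkpoint path, and the Telegram wiring itself is omitted):

```python
# Sketch: wrapping GPT2Chat in a LangChain conversation chain with buffer memory.
# "gpt2chat" is a hypothetical checkpoint path; imports follow older langchain releases.
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

generator = pipeline("text-generation", model="gpt2chat", max_new_tokens=128)
llm = HuggingFacePipeline(pipeline=generator)

# The buffer memory keeps prior turns, which the chain prepends to each new prompt.
chat = ConversationChain(llm=llm, memory=ConversationBufferMemory())

# Inside the Telegram message handler, each user message is simply passed through:
reply = chat.predict(input="What's the best way to learn Python?")
print(reply)
```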
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, and others, "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017.
- A. Köpf et al., "OpenAssistant Conversations - Democratizing large language model alignment," in Advances in Neural Information Processing Systems, vol. 36, pp. 47669–47681, 2023.
- R. Taori et al., "Alpaca: A strong, replicable instruction-following model," Stanford Center for Research on Foundation Models, vol. 3, no. 6, p. 7, 2023. [Online]. Available: https://crfm.stanford.edu/2023/03/13/alpaca.html
- A. Grattafiori et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
- J. Hong, N. Lee, and J. Thorne, "ORPO: Monolithic preference optimization without reference model," arXiv preprint arXiv:2403.07691, 2024.
- L. Ouyang et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, "Direct preference optimization: Your language model is secretly a reward model," in Advances in Neural Information Processing Systems, vol. 36, pp. 53728–53741, 2023.
- A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, vol. 32, pp. 8026–8037, 2019.
- T. Wolf et al., "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
- W. Falcon and The PyTorch Lightning team, "PyTorch Lightning," 2019.
- H. Chase, "LangChain," GitHub, Oct. 2022. [Online]. Available: https://github.com/langchain-ai/langchain