
Questions about general_dataset.json #11

Open
Taekyo-Lee opened this issue Aug 20, 2024 · 1 comment

Comments

@Taekyo-Lee

Hello authors,
I have some questions about your general_dataset.json.

  1. Why didn't you include models other than GPT-4 and GPT-3.5?
  2. What are the specific versions of GPT-4 and GPT-3.5?
  3. Why do some questions appear repeatedly? For instance, the first 20 lines are all the same question, "Who was the first person to climb Mount Everest?", repeated 10 times each for GPT-4 and GPT-3.5.
@aidarmyrzakhan
Collaborator

Hi @Taekyo-Lee, thanks for your interest in our work.

  1. Why didn't you include models other than GPT-4 and GPT-3.5?

This JSON file is prepared specifically for instruction fine-tuning pretrained LLMs. Responses from other models are available on GitHub. We include only GPT-4 and GPT-3.5 responses to ensure a higher-quality dataset: responses from smaller models often lack the depth and coherence needed for effective fine-tuning, which could compromise the dataset's overall quality. By focusing on these more capable models, we aim to provide more reliable data for downstream fine-tuning.

  2. What are the specific versions of GPT-4 and GPT-3.5?

We collected responses using gpt-4-1106-preview and gpt-3.5-turbo-1106.
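
For reference, here is a minimal sketch of how responses at these versions could be collected with the standard OpenAI chat completions API. The record keys (`question`, `model`, `answer`) are illustrative assumptions, not the confirmed schema of general_dataset.json:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4-1106-preview", "gpt-3.5-turbo-1106"]
N_SAMPLES = 10  # 10 responses per question per model (see the next answer)

def collect_responses(question: str) -> list[dict]:
    """Sample each model N_SAMPLES times at the default temperature."""
    records = []
    for model in MODELS:
        for _ in range(N_SAMPLES):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": question}],
            )
            # Record layout here is an assumption for illustration only.
            records.append({
                "question": question,
                "model": model,
                "answer": resp.choices[0].message.content,
            })
    return records
```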

  3. Why do some questions appear repeatedly? For instance, the first 20 lines are all the same question, "Who was the first person to climb Mount Everest?", repeated 10 times each for GPT-4 and GPT-3.5.

As mentioned, this file is designed for instruction tuning. By generating 10 responses per question from both GPT-4 and GPT-3.5, we aim to increase the dataset's scale, richness, and variability, so that models fine-tuned on it can handle a wide range of possible inputs and scenarios.
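
As a quick sanity check on that layout, here is a hedged sketch that counts responses per (question, model) pair. It assumes the file is a JSON array and guesses the key names; adjust to the actual schema:

```python
import json
from collections import Counter

# NOTE: the key names "question" and "model" are assumptions about the
# schema, not confirmed by the repo -- adjust to the actual fields.
with open("general_dataset.json") as f:
    data = json.load(f)

counts = Counter((rec["question"], rec["model"]) for rec in data)

q = "Who was the first person to climb Mount Everest?"
for (question, model), n in counts.items():
    if question == q:
        print(model, n)  # expected: 10 per model for this question
```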
