About the way to generate 99,121 synthetic prompts from TGRT Self-Instruct #13
Comments
Hi Harryis, yes. We generated around 120k synthetic topics from TGRT Self-Instruct (counted after topic-level filtering), generated the corresponding 120k prompts, and then filtered the prompts to obtain the final 99k. As the code shows, we randomly sample from the existing topics when constructing new ones. Ideally every instruction type would end up with the same number of topics, but filtering makes the actual counts differ.
Thanks a lot for your reply! It's clear to me that you use 20 instruction types to generate 120k synthetic topics, which means each instruction type yields about 120k / 20 = 6k topics. However, how do you "randomly sample topics from existing topics to construct new topics"? As far as I know, the topic generation prompt below doesn't involve selecting existing topics; it only involves one of the 20 instruction types:
Or do you provide existing topics as in-context (ICL) examples under "List of 10 topics:", so that the topic examples are updated iteratively? Thanks a lot for your help!
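As a rough check on the numbers in this thread, the arithmetic can be written out as follows. This is only a back-of-the-envelope sketch: the 120k figure is the approximate count quoted above, the 20 instruction types come from the question in this issue, and the variable names are made up for illustration.

```python
# Approximate counts quoted in this thread (120k is "around 120k", not exact).
NUM_INSTRUCTION_TYPES = 20
TOPICS_AFTER_FILTERING = 120_000
FINAL_PROMPTS = 99_121

# Ideal number of topics per instruction type if filtering were uniform.
topics_per_type = TOPICS_AFTER_FILTERING // NUM_INSTRUCTION_TYPES

# Approximate share of prompts that survive prompt-level filtering,
# assuming one prompt was generated per topic.
retention = FINAL_PROMPTS / TOPICS_AFTER_FILTERING

print(f"topics per type (ideal): {topics_per_type}")
print(f"prompt retention (approx.): {retention:.1%}")
```

The per-type figure is only an ideal; as noted above, filtering makes the real counts differ between instruction types.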
Hi Harryis, We generate the topics in several rounds (called |
Oh, I get it! Thanks!
Hello, thanks for your awesome work and code!
However, I ran into some confusion while trying to understand how you generated TGRT Self-Instruct. You mention in the article that you first hand-wrote 20 instruction types and then generated topics from these types. Finally, instructions were generated from each "instruction type - topic" pair.
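For readers following the exchange above, here is a minimal sketch of what round-based topic generation with randomly sampled existing topics could look like. It is not the repository's actual code: `build_prompt`, `grow_topics`, `stub_generate`, the prompt wording, and all parameter values are hypothetical, and the stub stands in for a real LLM call.

```python
import itertools
import random
from typing import Callable, Dict, List


def build_prompt(instruction_type: str, examples: List[str], n_new: int) -> str:
    """Assemble a topic prompt with sampled in-context examples (hypothetical wording)."""
    lines = [f"Instruction type: {instruction_type}", "Example topics:"]
    lines += [f"- {topic}" for topic in examples]
    lines.append(f"List of {n_new} topics:")
    return "\n".join(lines)


def grow_topics(
    seed_topics: Dict[str, List[str]],
    generate: Callable[[str, int], List[str]],
    rounds: int = 3,
    n_examples: int = 3,
    n_new: int = 10,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Grow a per-type topic pool over several rounds, deduplicating new topics."""
    rng = random.Random(seed)
    pools = {itype: list(dict.fromkeys(topics)) for itype, topics in seed_topics.items()}
    for _ in range(rounds):
        for itype, pool in pools.items():
            # Randomly sample existing topics to serve as in-context examples.
            examples = rng.sample(pool, min(n_examples, len(pool)))
            prompt = build_prompt(itype, examples, n_new)
            seen = set(pool)
            for topic in generate(prompt, n_new):
                if topic not in seen:  # simple exact-match filter (assumption)
                    seen.add(topic)
                    pool.append(topic)
    return pools


# Stub generator standing in for an LLM call; it just emits unique placeholder names.
_counter = itertools.count()


def stub_generate(prompt: str, n_new: int) -> List[str]:
    return [f"topic-{next(_counter)}" for _ in range(n_new)]


pools = grow_topics(
    {"brainstorming": ["travel", "cooking"], "coding": ["sorting"]},
    stub_generate,
    rounds=2,
)
print({itype: len(pool) for itype, pool in pools.items()})
```

Because each round draws its examples from a pool that already contains earlier outputs, later rounds are conditioned on previously generated topics, which is one way to read "randomly sample topics from existing topics to construct new topics".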
Therefore, my first question is:
How many topics did you generate for each instruction type? I see in Appendix G that your prompt generates 10 topics for each instruction type.
My second question is:
How many instructions are generated for each "instruction type - topic" pair? Since you end up with 99,121 synthetic prompts from TGRT Self-Instruct, if every "instruction type - topic" pair generates only one instruction, does that mean you generated at least 99,121 topics?
Thanks a lot for your help!