sbmaruf/project instruct data using psrc #1

Open · sbmaruf wants to merge 26 commits into base: main

Conversation

@sbmaruf sbmaruf commented Mar 8, 2023

Features

  • Project/transform a structured dataset into an instruction dataset.
  • Multiprocessing for dataset generation.

A sample script to run the code:

DUMP_FOLDER='' # fill this with your desired address
SRC_DATA_FOLDER=$DUMP_FOLDER/projection_from_psrc
mkdir -p $SRC_DATA_FOLDER
mkdir -p $SRC_DATA_FOLDER/cache

python data/project_from_psrc.py \
--dataset-name-or-paths glue glue glue glue glue \
--dataset-configs cola sst2 mrpc qqp stsb \
--prompt-templates-configs None None None None None \
--cache-dir $SRC_DATA_FOLDER/cache \
--output-dir $SRC_DATA_FOLDER \
--highlight-variables \
--add-source-metadata \
--num-proc 16

Output folder structure

Output folder of the above run, shown via tree $SRC_DATA_FOLDER/glue/:

├── cola
│   ├── test
│   │   ├── glue_cola.editing.jsonl
│   │   ├── glue_cola.Following_sentence_acceptable.jsonl
│   │   ├── glue_cola.is_this_correct.jsonl
│   │   ├── glue_cola.Make_sense_yes_no.jsonl
│   │   └── glue_cola.Previous_sentence_acceptable.jsonl
│   ├── train
│   │   ├── glue_cola.editing.jsonl
│   │   ├── glue_cola.Following_sentence_acceptable.jsonl
│   │   ├── glue_cola.is_this_correct.jsonl
│   │   ├── glue_cola.Make_sense_yes_no.jsonl
│   │   └── glue_cola.Previous_sentence_acceptable.jsonl
│   └── validation
│       ├── glue_cola.editing.jsonl
│       ├── glue_cola.Following_sentence_acceptable.jsonl
│       ├── glue_cola.is_this_correct.jsonl
│       ├── glue_cola.Make_sense_yes_no.jsonl
│       └── glue_cola.Previous_sentence_acceptable.jsonl
├── mrpc
│   ├── test
│   │   ├── glue_mrpc.equivalent.jsonl
│   │   ├── glue_mrpc.generate_paraphrase.jsonl
│   │   ├── glue_mrpc.generate_sentence.jsonl
│   │   ├── glue_mrpc.paraphrase.jsonl
│   │   ├── glue_mrpc.replace.jsonl
│   │   ├── glue_mrpc.same_thing.jsonl
│   │   └── glue_mrpc.want_to_know.jsonl
│   ├── train
│   │   ├── glue_mrpc.equivalent.jsonl
│   │   ├── glue_mrpc.generate_paraphrase.jsonl
│   │   ├── glue_mrpc.generate_sentence.jsonl
│   │   ├── glue_mrpc.paraphrase.jsonl
│   │   ├── glue_mrpc.replace.jsonl
│   │   ├── glue_mrpc.same_thing.jsonl
│   │   └── glue_mrpc.want_to_know.jsonl
│   └── validation
│       ├── glue_mrpc.equivalent.jsonl
│       ├── glue_mrpc.generate_paraphrase.jsonl
│       ├── glue_mrpc.generate_sentence.jsonl
│       ├── glue_mrpc.paraphrase.jsonl
│       ├── glue_mrpc.replace.jsonl
│       ├── glue_mrpc.same_thing.jsonl
│       └── glue_mrpc.want_to_know.jsonl
├── qqp
│   ├── test
│   │   ├── glue_qqp.answer.jsonl
│   │   ├── glue_qqp.duplicate.jsonl
│   │   ├── glue_qqp.duplicate_or_not.jsonl
│   │   ├── glue_qqp.meaning.jsonl
│   │   ├── glue_qqp.quora.jsonl
│   │   └── glue_qqp.same_thing.jsonl
│   ├── train
│   │   ├── glue_qqp.answer.jsonl
│   │   ├── glue_qqp.duplicate.jsonl
│   │   ├── glue_qqp.duplicate_or_not.jsonl
│   │   ├── glue_qqp.meaning.jsonl
│   │   ├── glue_qqp.quora.jsonl
│   │   └── glue_qqp.same_thing.jsonl
│   └── validation
│       ├── glue_qqp.answer.jsonl
│       ├── glue_qqp.duplicate.jsonl
│       ├── glue_qqp.duplicate_or_not.jsonl
│       ├── glue_qqp.meaning.jsonl
│       ├── glue_qqp.quora.jsonl
│       └── glue_qqp.same_thing.jsonl
├── sst2
│   ├── test
│   │   ├── glue_sst2.following_positive_negative.jsonl
│   │   ├── glue_sst2.happy_or_mad.jsonl
│   │   ├── glue_sst2.positive_negative_after.jsonl
│   │   ├── glue_sst2.review.jsonl
│   │   └── glue_sst2.said.jsonl
│   ├── train
│   │   ├── glue_sst2.following_positive_negative.jsonl
│   │   ├── glue_sst2.happy_or_mad.jsonl
│   │   ├── glue_sst2.positive_negative_after.jsonl
│   │   ├── glue_sst2.review.jsonl
│   │   └── glue_sst2.said.jsonl
│   └── validation
│       ├── glue_sst2.following_positive_negative.jsonl
│       ├── glue_sst2.happy_or_mad.jsonl
│       ├── glue_sst2.positive_negative_after.jsonl
│       ├── glue_sst2.review.jsonl
│       └── glue_sst2.said.jsonl
└── stsb
    ├── test
    │   ├── glue_stsb.examples.jsonl
    │   ├── glue_stsb.rank.jsonl
    │   ├── glue_stsb.rate.jsonl
    │   ├── glue_stsb.score.jsonl
    │   └── glue_stsb.similarity.jsonl
    ├── train
    │   ├── glue_stsb.examples.jsonl
    │   ├── glue_stsb.rank.jsonl
    │   ├── glue_stsb.rate.jsonl
    │   ├── glue_stsb.score.jsonl
    │   └── glue_stsb.similarity.jsonl
    └── validation
        ├── glue_stsb.examples.jsonl
        ├── glue_stsb.rank.jsonl
        ├── glue_stsb.rate.jsonl
        ├── glue_stsb.score.jsonl
        └── glue_stsb.similarity.jsonl
  1. Each task has three folders: train, validation, and test (huggingface datasets split names).
  2. Each file in these folders is the projected dataset of one prompt template.
  3. A file named glue_cola.editing.jsonl means the dataset is "glue", the dataset config is "cola", and the prompt name is "editing".
  4. Each line of a file (i.e., glue_cola.editing.jsonl) is a prompted sample; a minimal reading sketch follows below.
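
A quick way to inspect one of these files (a minimal sketch; the path simply reuses the example run above, so adjust it to your own DUMP_FOLDER):

import json
from pathlib import Path

# Path from the example run above: $DUMP_FOLDER/projection_from_psrc/glue/cola/train/
projected_file = Path("projection_from_psrc/glue/cola/train/glue_cola.editing.jsonl")

with projected_file.open() as f:
    for line in f:
        sample = json.loads(line)   # one prompted sample per line
        print(sample["source"])     # projected instruction (see "Output Format" below)
        print(sample["target"])     # gold response
        break                       # peek at the first sample only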

Output Format

A sample JSON record from a jsonl file:

{"id": 0, 
"source": "I'm copy-editing a story for publication. It has the following sentence in it:\nOur friends won't buy this analysis, let alone the next one we propose.\nDoes this sentence make sense and is it grammatically correct? Please answer yes or no.", 
"target": "yes", 
"psrc_prompt_template_signature": "glue/cola", 
"prompt_name": "editing", 
"prompt_answer_choice_list": ["no", "yes"], 
"dataset_name": "glue", 
"dataset_config": "cola", 
"split": "train", 
"metrics": ["Accuracy"], 
"original_task": true, 
"choices_in_prompt": true, 
"languages": ["en"], 
"highlighted_source": "I'm copy-editing a story for publication. It has the following sentence in it:\n<span style='color: #F08080'>Our friends won't buy this analysis, let alone the next one we propose.</span>\nDoes this sentence make sense and is it grammatically correct? Please answer <span style='color: #F08080'>yes or no</span>.", 
"highlighted_target": "<span style='color: #F08080'>yes</span>", 
"src_meta_sentence": "Our friends won't buy this analysis, let alone the next one we propose.", 
"src_meta_label": 1, 
"src_meta_idx": 0
}

The definition of each key in the data:

  1. id: A unique id for the sample. Each line of the jsonl file contains a JSON record whose id is unique within that file. (datatype: string/int)
  2. source: Projected input for the language model. This is the instruction. (datatype: string)
  3. target: Projected output for the language model. This is the gold response. (datatype: string)
  4. psrc_prompt_template_signature: Prompt template signature from the promptsource repository. A set of prompt templates is usually written for a task (i.e., glue/cola, glue/mrpc); this field refers to that task. (datatype: string)
  5. prompt_name: Name of the individual prompt template. Under one psrc_prompt_template_signature there can be many prompt templates; prompt_name identifies each of them. (datatype: string)
  6. prompt_answer_choice_list: List of all potential outcomes. This field is often empty, especially for generative tasks; only categorical tasks have it (i.e., [yes, no], [True, False], [A, B, C, D]). (datatype: list of strings)
  7. dataset_name: Name of the huggingface dataset. (datatype: string)
  8. dataset_config: Subset name of the huggingface dataset. (datatype: string)
  9. split: Split name (i.e., train, dev, test). (datatype: string)
  10. metrics: Metrics to evaluate the response. (datatype: list of strings)
  11. original_task: Whether the prompted sample (source, target) corresponds to the original task of the dataset. (datatype: True/False)
  12. choices_in_prompt: Whether there is any randomness in the prompt generation. (datatype: True/False)
  13. languages: The language of the prompt template (not the dataset). (datatype: list of strings)
  14. highlighted_source: The source with highlighting that differentiates tokens coming from the prompt template from tokens coming from the original dataset. (datatype: string)
  15. highlighted_target: The target with the same highlighting, differentiating prompt tokens from original-dataset tokens in the response/output. (datatype: string)
  16. src_meta_sentence: The original huggingface dataset has a column named sentence; its value is saved here. (datatype: from the huggingface data source)
  17. src_meta_label: The original huggingface dataset has a column named label; its value is saved here. (datatype: from the huggingface data source)
  18. src_meta_idx: The index of the original huggingface data. (datatype: from the huggingface data source)

Note: Different datasets may have a different number of "src_meta_*" keys, depending on the columns of the original huggingface dataset; a sketch for splitting them off follows below.
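
Since the set of src_meta_* keys varies per dataset, a downstream consumer can separate them generically. A minimal sketch (sample here is one parsed jsonl record, as in the example above):

def split_source_metadata(sample: dict) -> tuple[dict, dict]:
    # Separate the projected fields from the src_meta_* source columns.
    meta = {k: v for k, v in sample.items() if k.startswith("src_meta_")}
    rest = {k: v for k, v in sample.items() if not k.startswith("src_meta_")}
    return rest, meta

# With the glue/cola record above, rest keeps id/source/target/... while
# meta keeps src_meta_sentence, src_meta_label, and src_meta_idx.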

@sbmaruf sbmaruf requested a review from AmrMKayid March 8, 2023 20:10
sbmaruf commented Mar 8, 2023

@AmrMKayid Here is the complete PR.

@sbmaruf sbmaruf closed this Mar 8, 2023
@sbmaruf sbmaruf reopened this Mar 8, 2023
sbmaruf commented Mar 8, 2023

Closed it by accident!

@@ -0,0 +1,227 @@
import os
Member

I'm curious why the name is project_from_psrc.py? 👀

@sbmaruf sbmaruf Mar 9, 2023

It projects from promptsource. If it's not clear, we can rename it to project_from_promptsource.py, but I like the psrc short form. :D

Member

I think project_from_promptsource.py is better, let's change it pls

Contributor Author

Done! :)

Comment on lines 331 to 367
executor.map(
export_dataset_func,
[
prompted_sample_gen_io[0]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # dataset_output_dir
[
prompted_sample_gen_io[1]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # dataset_name_or_path
[
prompted_sample_gen_io[2]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # dataset_config
[
prompted_sample_gen_io[3]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # psrc_prompt_template_signature
[
prompted_sample_gen_io[4]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # prompt_template
[
prompted_sample_gen_io[5]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # dataset
[
prompted_sample_gen_io[6]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # args.add_source_metadata
[
prompted_sample_gen_io[7]
for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
], # args.highlight_variables
),
total=len(args.dataset_name_or_paths),
):
Member

can we please change this? 🥺

Suggested change (replacing the block above with):
executor.map(export_dataset_func, zip(*prompted_sample_gen_io_tuple_list)),
total=len(args.dataset_name_or_paths),
):

@sbmaruf sbmaruf Apr 30, 2023

I didn't want to write map because it was difficult to debug.
I think it should be executor.map(export_dataset_func, *zip(*prompted_sample_gen_io_tuple_list)). Can you recheck?
I have already updated it in the code.
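
For context: executor.map takes one iterable per positional argument of the worker function, so the list of per-task tuples has to be transposed with zip(*...) and then unpacked with a leading *. A minimal sketch with a toy worker (the names below are illustrative, not the PR's actual code):

from concurrent.futures import ProcessPoolExecutor

def worker(a, b, c):
    return f"{a}-{b}-{c}"

# One tuple per task, analogous to prompted_sample_gen_io_tuple_list.
tuple_list = [(1, "x", True), (2, "y", False)]

if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        # zip(*tuple_list) transposes to (1, 2), ("x", "y"), (True, False);
        # the leading * passes each column as a separate iterable, one per worker argument.
        results = list(executor.map(worker, *zip(*tuple_list)))
    print(results)  # ['1-x-True', '2-y-False']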

Comment on lines 118 to 144
def xp3_export_dataset(
dataset_output_dir: str,
dataset_name: str,
dataset_config: str,
psrc_prompt_template_signature: str,
prompt_template: Type[Template],
dataset: Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset],
add_source_metadata: bool = False,
highlight_variables: bool = False,
lang: str = 'en'
) -> str:
"""
Given a `hf-dataset` (arg: dataset) and a prompt template (arg: prompt_template),
project/transform samples from all splits of the dataset (arg: dataset) into an instruction format and
write them to disk (arg: dataset_output_dir).

Args:
dataset_output_dir (str): Path to the output directory where data will be saved.
dataset_name (str): Name of the hf-dataset.
dataset_config (str): Name of the hf-dataset config.
psrc_prompt_template_signature (str): Name of the dataset & dataset-config for which prompts are written for.
prompt_template (Type[Template]): Transformation/projection module that will take a sample from arg:dataset and transform it to an instruction.
dataset (Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]): huggingface dataset that will be transformed into an instruction dataset.
add_source_metadata (bool = False): If True, all the data columns from arg:dataset will be saved as meta information with the instruction dataset.
highlight_variables (bool = False): If True, prompt tokens and dataset tokens will be highlighted differently. This metadata is saved as `highlighted_source` & `highlighted_target`.
lang (str = 'en'): language name of the dataset
"""
Member

@sbmaruf what is the difference between this method and export_dataset? I can see that both are very similar.

Contributor Author

At first I wrote export_dataset, which exports all possible metadata while projecting data with templates. xp3_export_dataset doesn't export all that metadata and strictly follows the xP3 format. Please note that the xP3 projection doesn't contain any metadata.
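
In other words, the xP3-style export keeps only the projected text and drops the rest. Roughly, as a sketch that reuses the field names from the full-metadata example above (the exact xP3 field names are not shown in this PR and may differ):

def to_xp3_style_record(sample: dict) -> dict:
    # Keep only the projected text; all metadata fields are dropped.
    return {"id": sample["id"], "source": sample["source"], "target": sample["target"]}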

@AmrMKayid AmrMKayid left a comment

Left a few comments for code quality improvements, otherwise LGTM. Thanks @sbmaruf!
