What's the specific meaning of DSIR? #99

BBetteroff · 2024-01-16T06:20:59Z

I am trying to reproduce this repo on macOS, and I don't have an AWS account. Could you help me? I'd appreciate it.

mauriceweber · 2024-01-16T08:31:54Z

Hi @BBetteroff, DSIR stands for "Data Selection with Importance Resampling" (see paper here) and is used to compute importance weights for each sample with respect to different target domains.

The screenshot you posted is from the prep_artifacts.py script. The flag --dsir_num_samples sets the number of samples you use from the target domain. The flag --dsir_feature_dim sets the dimension of the feature vector used to fit the bag-of-ngram model from which the DSIR importance weights are computed. If you look at the default config, you can see that the default values are 500k samples and a feature dimension of 10k:

RedPajama-Data/configs/rp_v2.0.conf

Lines 31 to 33 in bb594b0

    
           # DSIR 
        
           DSIR_NUM_SAMPLES=500000 
        
           DSIR_FEATURE_DIM=10000
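
To make the two settings more concrete, here is a minimal Python sketch of the general idea. It is not the code in prep_artifacts.py. It hashes n-grams into a fixed number of buckets (the role of the feature dimension), fits a bag-of-ngram distribution on target-domain samples and on raw samples, and scores a document by its log importance weight. The function names, the whitespace tokenization, the MD5 hashing, and the add-one smoothing are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


def hashed_ngram_counts(text, feature_dim, n=2):
    """Count unigrams up to n-grams, hashed into feature_dim buckets."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    counts = Counter()
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            ngram = " ".join(tokens[i:i + k])
            digest = hashlib.md5(ngram.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:8], "little") % feature_dim
            counts[bucket] += 1
    return counts


def fit_bucket_log_probs(docs, feature_dim, n=2, smoothing=1.0):
    """Fit a smoothed bag-of-ngram distribution over the hash buckets."""
    totals = [smoothing] * feature_dim
    for doc in docs:
        for bucket, count in hashed_ngram_counts(doc, feature_dim, n).items():
            totals[bucket] += count
    norm = sum(totals)
    return [math.log(t / norm) for t in totals]


def log_importance_weight(text, target_log_probs, source_log_probs, n=2):
    """Log of p_target(features) / p_source(features) for one document."""
    feature_dim = len(target_log_probs)
    counts = hashed_ngram_counts(text, feature_dim, n)
    return sum(
        count * (target_log_probs[b] - source_log_probs[b])
        for b, count in counts.items()
    )
```

In this sketch, the number of target-domain documents passed to `fit_bucket_log_probs` plays the role of DSIR_NUM_SAMPLES, and `feature_dim` plays the role of DSIR_FEATURE_DIM. A higher log weight means the document looks more like the target domain than like the raw data.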

BBetteroff · 2024-01-16T08:49:39Z

Thanks! I'll keep working on reproducing this repo and will follow up here if I have more questions.

BBetteroff · 2024-01-19T07:57:13Z

What is the content of a listing file? Can you show me an example? And what is it used for?

mauriceweber · 2024-01-22T07:50:34Z

The listing files contain the IDs of the inputs which, when concatenated with the base URI, point to the location of the data. For example:

2023-06/0000/de_head.json.gz
2023-06/0000/de_middle.json.gz
2023-06/0000/de_tail.json.gz
2023-06/0000/en_head.json.gz
2023-06/0000/en_middle.json.gz
2023-06/0000/en_tail.json.gz
2023-06/0000/es_head.json.gz

If your data is stored locally under, e.g., /data/documents/2023-06/0000/de_middle.json.gz, you would use file:///data/documents/ as the base URI.
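
As a rough illustration of that concatenation, the Python sketch below joins a base URI with the entries of a listing file. The helper name `resolve_listing` is hypothetical and is not part of the repo. It only shows how each listing ID maps to a full location.

```python
from urllib.parse import urlparse


def resolve_listing(base_uri, listing_lines):
    """Join each non-empty listing entry onto the base URI."""
    # Make sure exactly one slash separates the base URI and the entry.
    base = base_uri if base_uri.endswith("/") else base_uri + "/"
    return [base + line.strip() for line in listing_lines if line.strip()]


listing = [
    "2023-06/0000/de_head.json.gz",
    "2023-06/0000/de_middle.json.gz",
]
uris = resolve_listing("file:///data/documents/", listing)

# For file:// URIs, the local path is the path component of the URI.
local_paths = [urlparse(uri).path for uri in uris]
```

Here `uris[1]` is `file:///data/documents/2023-06/0000/de_middle.json.gz`, and the matching entry in `local_paths` is the plain filesystem path `/data/documents/2023-06/0000/de_middle.json.gz`.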
