what's the specific meaning of dsir? #99

Open
BBetteroff opened this issue Jan 16, 2024 · 4 comments

@BBetteroff

BBetteroff commented Jan 16, 2024

I am trying to reproduce this repo on macOS, and I don't have an AWS account. Can I get your help? I'd appreciate it.
[Screenshot 2024-01-16 14 20 30]

@mauriceweber
Collaborator

Hi @BBetteroff, DSIR stands for "Data Selection with Importance Resampling" (see paper here). It is used to compute importance weights for each sample with respect to different target domains.

The screenshot you posted is from the prep_artifacts.py script. The flag --dsir_num_samples sets the number of samples drawn from the target domain. The flag --dsir_feature_dim sets the dimension of the feature vector used to fit the bag-of-ngrams model from which the DSIR importance weights are computed. If you look at the default config, you can see that the default values are 500k samples and a feature dimension of 10k:

# DSIR
DSIR_NUM_SAMPLES=500000
DSIR_FEATURE_DIM=10000
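
To make the role of the feature dimension more concrete, here is a minimal sketch of the general idea behind DSIR: n-grams are hashed into a fixed number of buckets, a smoothed bag-of-ngrams model is fit on target-domain and raw data, and the log importance weight of a document is the difference of its log-likelihoods under the two models. This is not the code used in prep_artifacts.py. The function names, the whitespace tokenizer, the MD5 hash, and the add-one smoothing are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter

FEATURE_DIM = 10_000  # mirrors DSIR_FEATURE_DIM above


def hashed_ngram_counts(text, dim=FEATURE_DIM):
    """Count unigrams and bigrams after hashing each into one of `dim` buckets."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter()
    for gram in ngrams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        counts[int.from_bytes(digest[:8], "big") % dim] += 1
    return counts


def fit_bag_of_ngrams(docs, dim=FEATURE_DIM, smoothing=1.0):
    """Fit smoothed log-probabilities over buckets; returns (log_probs, default)."""
    total = Counter()
    for doc in docs:
        total.update(hashed_ngram_counts(doc, dim))
    denom = sum(total.values()) + smoothing * dim
    log_probs = {b: math.log((c + smoothing) / denom) for b, c in total.items()}
    default = math.log(smoothing / denom)  # log-probability of an unseen bucket
    return log_probs, default


def log_importance_weight(text, target_model, raw_model, dim=FEATURE_DIM):
    """log p_target(x) - log p_raw(x) under the two bag-of-ngrams models."""
    t_lp, t_default = target_model
    r_lp, r_default = raw_model
    return sum(
        count * (t_lp.get(bucket, t_default) - r_lp.get(bucket, r_default))
        for bucket, count in hashed_ngram_counts(text, dim).items()
    )
```

In this sketch, a larger feature dimension means fewer hash collisions between distinct n-grams, at the cost of a larger model. Documents that resemble the target samples get a higher log importance weight than documents that do not.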

@BBetteroff
Author

Thanks! I'll keep working on reproducing this repo and will follow up here if I have more questions.

@BBetteroff
Author

BBetteroff commented Jan 19, 2024

What is the content of a listing file? Can you show me an example, and what is it used for?

@mauriceweber
Collaborator

The listing files contain the ids of the inputs. Each id, when concatenated with the base URI, points to the location of the data. For example:

2023-06/0000/de_head.json.gz
2023-06/0000/de_middle.json.gz
2023-06/0000/de_tail.json.gz
2023-06/0000/en_head.json.gz
2023-06/0000/en_middle.json.gz
2023-06/0000/en_tail.json.gz
2023-06/0000/es_head.json.gz

If your data is stored locally, e.g. under /data/documents/2023-06/0000/de_middle.json.gz, you would use file:///data/documents/ as the base URI.
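
The relationship between the base URI and the listing entries can be sketched in a few lines of Python. The helper name resolve_listing is hypothetical and only illustrates the concatenation described above; how the repo itself joins and opens these paths may differ.

```python
def resolve_listing(base_uri, listing_lines):
    """Join each non-empty listing entry onto the base URI (illustrative only)."""
    base = base_uri.rstrip("/") + "/"  # make sure exactly one slash separates the parts
    return [base + line.strip() for line in listing_lines if line.strip()]


listing = [
    "2023-06/0000/de_head.json.gz",
    "2023-06/0000/de_middle.json.gz",
]
for uri in resolve_listing("file:///data/documents/", listing):
    print(uri)
```

With a local base URI, no AWS account is involved: each printed URI refers to a file on the local disk.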
