
regarding deduplication #79

Open
kimcando opened this issue Nov 1, 2023 · 6 comments

Comments

kimcando commented Nov 1, 2023

Hey,

thank you for your great work and for sharing the data :)
I read the README and the Hugging Face details, but it was unclear to me whether fuzzy deduplication has actually been applied to this dataset.
I understand that

  • the Bloomfilter, which is EXACT MATCH deduplication, seems to be clearly applied (Hugging Face data creation section: "Finally, the documents were deduplicated based on the text, using a Bloomfilter.")
  • the metadata provides several threshold-based hash fingerprints, and the article says anyone can run fuzzy deduplication themselves. The unclear part is that you seem to have applied fuzzy deduplication when training your models, while this shared dataset appears to be the version from before fuzzy deduplication was applied.

Therefore, my question is: is the provided dataset the one with fuzzy deduplication also applied?
If so, could you please share how many cores (and, in a distributed environment, how many instances and of which type) you used, and how long it took?

Cheeeers!!

@ManuelFay

+1 - given the fuzzy deduplication hashes, is there a simple/suggested way to cluster and sample them?

Thanks for the great work!

@mauriceweber
Collaborator

Hi @kimcando and @ManuelFay and thanks for your questions!

> the Bloomfilter, which is EXACT MATCH deduplication, seems to be clearly applied (Hugging Face data creation section: "Finally, the documents were deduplicated based on the text, using a Bloomfilter.")

Yes, we ran the entire dataset through a Bloomfilter for exact deduplication and published the duplicate ids as separate files (mirroring the dataset structure). It is important to note that the duplicates were deliberately kept in the dataset so that everyone can experiment with and study duplication in the training data.
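
For illustration, here is a minimal sketch of how those duplicate-id files could be used to drop exact duplicates from a single shard. The file paths and the "doc_id" column/field name are assumptions based on the dataset layout, so inspect the actual parquet schema of the duplicates files before relying on this:

# Sketch: drop exact duplicates from one documents shard using the published
# duplicate-id files. Paths and the "doc_id" column name are assumptions --
# check the real parquet schema before using this.
import gzip
import json
import pyarrow.parquet as pq

dup_table = pq.read_table("duplicates/2023-06/0000/en_head.duplicates.parquet")
dup_ids = set(dup_table.column("doc_id").to_pylist())

kept = []
with gzip.open("documents/2023-06/0000/en_head.json.gz", "rt") as f:
    for line in f:
        doc = json.loads(line)
        if doc["doc_id"] not in dup_ids:  # keep only documents not flagged as duplicates
            kept.append(doc)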

> the metadata provides several threshold-based hash fingerprints, and the article says anyone can run fuzzy deduplication themselves. The unclear part is that you seem to have applied fuzzy deduplication when training your models, while this shared dataset appears to be the version from before fuzzy deduplication was applied.

This is correct: we compute the minhash signatures in the same pass as the other quality signals. Note that these are just the signatures; to do fuzzy deduplication, you need to run LSH on them (see below for how to do this).

> is the provided dataset the one with fuzzy deduplication also applied?

The dataset we provide comes with the minhash signatures, but not with the deduplication clusters. These need to be computed using the script in app/src/run_lsh.py.

Here is a minimal example you can run from the root of RedPajama-Data:

1) Download listings

DATA_ROOT="${HOME}/path/to/data" # make sure this is an absolute path
mkdir -p "${DATA_ROOT}/listings"
listings_file="listings/en-2023-06-head_middle.txt"
wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/${listings_file}" -O "${DATA_ROOT}/${listings_file}"

2) Download MinHash signatures

# read the first 5 lines here to run the example
head -n5 "${DATA_ROOT}/${listings_file}" | while read -r line;
do
    url="https://data.together.xyz/redpajama-data-v2/v1.0.0/minhash/${line}.minhash.parquet"
    dest="${DATA_ROOT}/minhash/${line}.minhash.parquet"
    mkdir -p "$(dirname "$dest")"
    wget "$url" -O "$dest"
    echo "minhash/${line}.minhash.parquet" >> "${DATA_ROOT}/minhash_listings.txt"
done

3) Run LSH at similarity level 0.7

cd app/
python3 src/run_lsh.py \
    --input_base_uri "file://${DATA_ROOT}/" \
    --output_dir "${DATA_ROOT}/minhash_clusters/" \
    --similarity 0.7 \
    --num_perm 128 \
    --listings "${DATA_ROOT}/minhash_listings.txt"

This will result in one parquet file for each input file, containing the MinHash cluster id for every (fuzzy duplicate) document in the corresponding documents file.
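
To actually remove the fuzzy duplicates, you can then keep one document per cluster. Here is a minimal sketch of that step; the column names "doc_id" and "cluster_id" are assumptions, so inspect the files written by run_lsh.py for the real schema:

# Sketch: keep one document per MinHash cluster, dropping the rest.
# Column names "doc_id" / "cluster_id" are assumptions about the run_lsh.py output.
import glob
import pyarrow.parquet as pq

DATA_ROOT = "/path/to/data"  # same root as in the bash example above
seen_clusters = set()  # clusters can span shards, so track them globally
drop_ids = set()       # documents to remove (all but one member per cluster)
for path in sorted(glob.glob(f"{DATA_ROOT}/minhash_clusters/**/*.parquet", recursive=True)):
    table = pq.read_table(path)
    for doc_id, cluster_id in zip(
        table.column("doc_id").to_pylist(), table.column("cluster_id").to_pylist()
    ):
        if cluster_id in seen_clusters:
            drop_ids.add(doc_id)
        else:
            seen_clusters.add(cluster_id)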


kimcando commented Nov 9, 2023

Thanks for replying.
However, the provided deduplication was tested on only 200M documents, which is a very small number compared to the roughly 100B documents in total, and 200M documents can easily be deduplicated with well-known libraries. (For instance, say you used 80 snapshots: each single index then holds approximately 1.25B docs, so 200M documents is less than 20% of a single index.) Tackling the full volume, however, is a different problem.

Therefore, my question is: when RedPajama-V2 is used for training models, a considerable amount of the data has to be deduplicated. In that situation (e.g., handling 20 trillion tokens), could you give me some hints on how many cores you used?

@mauriceweber
Collaborator

Absolutely, the current LSH implementation does not scale to the entire dataset. I think that to do full fuzzy deduplication, you will need to use multiple nodes (the MinHashLSH implementation provided by BigCode is probably a good starting point).

With that said, a way forward with the single-node LSH implementation in src/run_lsh.py would be to first reduce the number of documents using exact dedupe and quality filtering to get a smaller dataset, and only then run LSH.

To run LSH on 200M documents, we used a machine with 500GB RAM and 64 cores, and it took ~40 minutes. The exact (.wet document hash based) dedupe with the Bloomfilter ran on the same machine in ~3.5 days for the 25B English documents.

@edwardzjl

Is it possible to replace minhash with simhash? IIRC, dedup on exact match of simhash signatures is sufficient to remove near-duplicate documents.

@mauriceweber
Collaborator

Hi @edwardzjl, you can use simhash for near deduplication, but you would need to explicitly compute new hashes for that.
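
For illustration, a minimal from-scratch sketch of what that could look like (this is not part of the RedPajama pipeline): a 64-bit simhash over whitespace tokens, with deduplication on exact signature match as suggested above:

# Minimal simhash sketch (not part of the RedPajama pipeline): 64-bit signature
# over whitespace tokens; documents with identical signatures are treated as
# near-duplicates, following the exact-match idea above.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    counts = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def dedup_on_simhash(docs: list[str]) -> list[str]:
    seen, kept = set(), []
    for doc in docs:
        sig = simhash(doc)
        if sig not in seen:
            seen.add(sig)
            kept.append(doc)
    return kept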
