
Possible incorrect indexing in snip_download.py? #9

Open
trojblue opened this issue May 21, 2023 · 0 comments

Hi, I looked through the code, and the relevant section of snip_download.py currently looks like this:

    is_dup_all = np.load(dedup_set_path).ravel()
    abs_ind = 0
    for n in range(start, end):
        print(f"downloading metadata file {n}/{end}")
        url = f"https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_{n:04d}.parquet"
        response = requests.get(url)
        parquet_path = os.path.join(metadata_dir, f"metadata_{n:04d}.parquet")
        open(parquet_path, "wb").write(response.content)

        # perform the deduplication
        md = pd.read_parquet(parquet_path)
        non_dup_chunk = is_dup_all[abs_ind : abs_ind + len(md.index)]

        # take only non-dupped (uniques)
        non_dup_chunk = np.logical_not(non_dup_chunk)

        # make sure there is at least one unique
        non_dup_chunk[0] = True
        md = md[non_dup_chunk]

        # overwrite metadata
        md.to_parquet(parquet_path)
        abs_ind += len(md.index)

I believe there might be an oversight here:

By the time abs_ind += len(md.index) runs, md has already been filtered down to the unique rows, so abs_ind advances by the deduplicated count rather than the original one. Instead of incrementing abs_ind by the length of the deduplicated md.index, wouldn't it be more accurate to increment it by the total number of entries in the original Parquet file, before deduplication? Otherwise the slice taken from is_dup_all drifts out of alignment for every subsequent file: for example, if the first file had 1,000,000 rows of which 200,000 were duplicates, the mask for the second file would start 200,000 positions too early.
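
For comparison, here is a minimal sketch of the change I have in mind. It keeps the original script's names (dedup_set_path, start, end, and metadata_dir are assumed to be defined exactly as in snip_download.py); the only change is capturing the row count before filtering and using it to advance abs_ind:

    import os

    import numpy as np
    import pandas as pd
    import requests

    is_dup_all = np.load(dedup_set_path).ravel()
    abs_ind = 0
    for n in range(start, end):
        print(f"downloading metadata file {n}/{end}")
        url = f"https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_{n:04d}.parquet"
        response = requests.get(url)
        parquet_path = os.path.join(metadata_dir, f"metadata_{n:04d}.parquet")
        open(parquet_path, "wb").write(response.content)

        # perform the deduplication
        md = pd.read_parquet(parquet_path)
        orig_len = len(md.index)  # row count BEFORE any filtering
        non_dup_chunk = is_dup_all[abs_ind : abs_ind + orig_len]

        # take only non-dupped (uniques)
        non_dup_chunk = np.logical_not(non_dup_chunk)

        # make sure there is at least one unique
        non_dup_chunk[0] = True
        md = md[non_dup_chunk]

        # overwrite metadata
        md.to_parquet(parquet_path)
        abs_ind += orig_len  # advance by the original chunk size, not the deduplicated one

With this change, abs_ind always equals the absolute offset of the current file's first row in is_dup_all, regardless of how many rows were dropped from earlier files.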
