
Possible incorrect indexing in snip_download.py? #9

Open
trojblue opened this issue May 21, 2023 · 0 comments

Hi, I looked through the code, and the relevant section of snip_download.py currently looks like this:

    is_dup_all = np.load(dedup_set_path).ravel()
    abs_ind = 0
    for n in range(start, end):
        print(f"downloading metadata file {n}/{end}")
        url = f"https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_{n:04d}.parquet"
        response = requests.get(url)
        parquet_path = os.path.join(metadata_dir, f"metadata_{n:04d}.parquet")
        open(parquet_path, "wb").write(response.content)

        # perform the deduplication
        md = pd.read_parquet(parquet_path)
        non_dup_chunk = is_dup_all[abs_ind : abs_ind + len(md.index)]

        # take only non-dupped (uniques)
        non_dup_chunk = np.logical_not(non_dup_chunk)

        # make sure there is at least one unique
        non_dup_chunk[0] = True
        md = md[non_dup_chunk]

        # overwrite metadata
        md.to_parquet(parquet_path)
        abs_ind += len(md.index)

I believe there might be an oversight here:

By the time abs_ind += len(md.index) runs, md has already been filtered down to the unique rows, so abs_ind advances by the deduplicated count rather than the original one. Instead of incrementing abs_ind by the length of the deduplicated md.index, wouldn't it be more accurate to increment it by the total number of entries in the original Parquet file, before deduplication? Otherwise the slice taken from is_dup_all drifts out of alignment for every subsequent file: for example, if the first file had 1,000,000 rows of which 200,000 were duplicates, the mask for the second file would start 200,000 positions too early.
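
For comparison, here is a minimal sketch of the change I have in mind. It keeps the original script's names (dedup_set_path, start, end, and metadata_dir are assumed to be defined exactly as in snip_download.py); the only change is capturing the row count before filtering and using it to advance abs_ind:

    import os

    import numpy as np
    import pandas as pd
    import requests

    is_dup_all = np.load(dedup_set_path).ravel()
    abs_ind = 0
    for n in range(start, end):
        print(f"downloading metadata file {n}/{end}")
        url = f"https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_{n:04d}.parquet"
        response = requests.get(url)
        parquet_path = os.path.join(metadata_dir, f"metadata_{n:04d}.parquet")
        open(parquet_path, "wb").write(response.content)

        # perform the deduplication
        md = pd.read_parquet(parquet_path)
        orig_len = len(md.index)  # row count BEFORE any filtering
        non_dup_chunk = is_dup_all[abs_ind : abs_ind + orig_len]

        # take only non-dupped (uniques)
        non_dup_chunk = np.logical_not(non_dup_chunk)

        # make sure there is at least one unique
        non_dup_chunk[0] = True
        md = md[non_dup_chunk]

        # overwrite metadata
        md.to_parquet(parquet_path)
        abs_ind += orig_len  # advance by the original chunk size, not the deduplicated one

With this change, abs_ind always equals the absolute offset of the current file's first row in is_dup_all, regardless of how many rows were dropped from earlier files.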
