Hi, I looked through the code and the original code looks like this:
```python
is_dup_all = np.load(dedup_set_path).ravel()

abs_ind = 0
for n in range(start, end):
    print(f"downloading metadata file {n}/{end}")
    url = f"https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_{n:04d}.parquet"
    response = requests.get(url)
    parquet_path = os.path.join(metadata_dir, f"metadata_{n:04d}.parquet")
    open(parquet_path, "wb").write(response.content)

    # perform the deduplication
    md = pd.read_parquet(parquet_path)
    non_dup_chunk = is_dup_all[abs_ind : abs_ind + len(md.index)]
    # take only non-dupped (uniques)
    non_dup_chunk = np.logical_not(non_dup_chunk)
    # make sure there is at least one unique
    non_dup_chunk[0] = True
    md = md[non_dup_chunk]
    # overwrite metadata
    md.to_parquet(parquet_path)

    abs_ind += len(md.index)
```
I believe there might be an oversight here: `abs_ind` is incremented by `len(md.index)` *after* `md` has already been filtered down to the unique rows, so it advances by the deduplicated count rather than the original row count. Wouldn't it be more accurate to increment it by the total number of entries in the original Parquet file before deduplication? As written, if a file has 100 rows of which 40 are duplicates, `abs_ind` advances by only 60, and the slice of `is_dup_all` taken for the next file starts at the wrong offset, so every subsequent chunk is matched against a misaligned section of the mask.
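A minimal sketch of the adjustment I have in mind, assuming `is_dup_all` is laid out contiguously in the original (pre-dedup) row order of the metadata files; `n_rows` is just an illustrative name I introduced, and the other names come from the snippet above:

```python
import numpy as np
import pandas as pd

# ... inside the download loop, after parquet_path has been written to disk ...
md = pd.read_parquet(parquet_path)
n_rows = len(md.index)  # row count BEFORE deduplication

# slice the global mask for this file, keep only non-duplicates (uniques)
non_dup_chunk = np.logical_not(is_dup_all[abs_ind : abs_ind + n_rows])
non_dup_chunk[0] = True  # make sure there is at least one unique
md = md[non_dup_chunk]
md.to_parquet(parquet_path)  # overwrite metadata

abs_ind += n_rows  # advance by the original chunk size, not the filtered one
```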