Not able to push data to Google Cloud Storage #56

Open
krmayankb opened this issue Sep 11, 2023 · 1 comment

Comments

krmayankb commented Sep 11, 2023

While trying to push data to Google Cloud Storage, I am getting a file-not-found error. Any help would be highly appreciated.

python download_upstream.py --scale medium --data_dir "gs://dataset/datacomp/" --thread_count 2
Downloading metadata to gs://dataset/datacomp/metadata...

Downloading (…)c76a589ef5d0.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 395MB/s]
Downloading (…)30fd0d497176.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 360MB/s]
.
.
.
Downloading (…)0edfcd0a6bc7.parquet: 100%|██████████████████████████████████████████████████████████████████| 121M/121M [00:00<00:00, 260MB/s]
Fetching 253 files: 100%|███████████████████████████████████████████████████████████████████████████████████| 253/253 [00:54<00:00,  4.67it/s]
Done downloading metadata.
Downloading images to gs://dataset/datacomp/shards
Starting the downloading of this file 
Sharding file number 1 of 1 called dataset/datacomp/metadata
0it [00:08, ?it/s]
Traceback (most recent call last):
  File "/home/mayank/datacomp/datacomp/download_upstream.py", line 218, in <module>
    img2dataset.download(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/main.py", line 232, in download
    distributor_fn(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 36, in multiprocessing_distributor
    failed_shards = run(reader)
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 31, in run
    for (status, row) in tqdm(process_pool.imap_unordered(downloader, gen)):
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
FileNotFoundError: b/dataset/o/datacomp%2Fmetadata
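
For reference, the path in the error appears to be a GCS JSON API object name (bucket dataset, object datacomp/metadata, with %2F being an encoded slash), i.e. the metadata prefix is being looked up as if it were a single object. A quick way to inspect what the filesystem layer actually sees (a sketch, assuming gcsfs is installed and authenticated; the paths are the ones from the command above):

# Sanity check (sketch): inspect the gs:// metadata path with gcsfs.
# Assumes gcsfs is installed and credentials are configured.
import gcsfs

fs = gcsfs.GCSFileSystem()
print(fs.exists("dataset/datacomp/metadata"))  # the path img2dataset is asked to shard
print(fs.ls("dataset/datacomp/metadata"))      # should list the downloaded .parquet files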
@0x2b3bfa0 (Contributor)

I was able to save the small and medium datasets to S3 (the same should work for GCS) by setting --metadata_dir to a local (temporary) path and --data_dir to the remote path:

python download_upstream.py --scale medium --data_dir gs://dataset/datacomp --metadata_dir /tmp/metadata
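
If you also need the metadata in the bucket afterwards, one option (an untested sketch that reuses the paths above and assumes gcsfs is installed and authenticated) is to copy the local directory up once the download finishes:

# Sketch: upload the locally saved metadata to the bucket after the run.
import gcsfs

fs = gcsfs.GCSFileSystem()
fs.put("/tmp/metadata", "dataset/datacomp/metadata", recursive=True)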
