Not able to push data to Google Cloud Storage #56

Open
krmayankb opened this issue Sep 11, 2023 · 1 comment

Comments

krmayankb commented Sep 11, 2023

While trying to push data to Google Cloud Storage, I am getting a file-not-found error. Any help would be highly appreciated.

python download_upstream.py --scale medium --data_dir "gs://dataset/datacomp/" --thread_count 2
Downloading metadata to gs://dataset/datacomp/metadata...

Downloading (…)c76a589ef5d0.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 395MB/s]
Downloading (…)30fd0d497176.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 360MB/s]
.
.
.
Downloading (…)0edfcd0a6bc7.parquet: 100%|██████████████████████████████████████████████████████████████████| 121M/121M [00:00<00:00, 260MB/s]
Fetching 253 files: 100%|███████████████████████████████████████████████████████████████████████████████████| 253/253 [00:54<00:00,  4.67it/s]
Done downloading metadata.
Downloading images to gs://dataset/datacomp/shards
Starting the downloading of this file 
Sharding file number 1 of 1 called dataset/datacomp/metadata
0it [00:08, ?it/s]
Traceback (most recent call last):
  File "/home/mayank/datacomp/datacomp/download_upstream.py", line 218, in <module>
    img2dataset.download(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/main.py", line 232, in download
    distributor_fn(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 36, in multiprocessing_distributor
    failed_shards = run(reader)
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 31, in run
    for (status, row) in tqdm(process_pool.imap_unordered(downloader, gen)):
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
FileNotFoundError: b/dataset/o/datacomp%2Fmetadata
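
For reference, the path in the error appears to be a GCS JSON API object name (bucket dataset, object datacomp/metadata, with %2F being an encoded slash), i.e. the metadata prefix is being looked up as if it were a single object. A quick way to inspect what the filesystem layer actually sees (a sketch, assuming gcsfs is installed and authenticated; the paths are the ones from the command above):

# Sanity check (sketch): inspect the gs:// metadata path with gcsfs.
# Assumes gcsfs is installed and credentials are configured.
import gcsfs

fs = gcsfs.GCSFileSystem()
print(fs.exists("dataset/datacomp/metadata"))  # the path img2dataset is asked to shard
print(fs.ls("dataset/datacomp/metadata"))      # should list the downloaded .parquet files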
@0x2b3bfa0 (Contributor)

I was able to save the small and medium datasets to S3 (the same should work for GCS) by setting --metadata_dir to a local (temporary) path and --data_dir to the remote path:

python download_upstream.py --scale medium --data_dir gs://dataset/datacomp --metadata_dir /tmp/metadata
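
If you also need the metadata in the bucket afterwards, one option (an untested sketch that reuses the paths above and assumes gcsfs is installed and authenticated) is to copy the local directory up once the download finishes:

# Sketch: upload the locally saved metadata to the bucket after the run.
import gcsfs

fs = gcsfs.GCSFileSystem()
fs.put("/tmp/metadata", "dataset/datacomp/metadata", recursive=True)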
