clip-retrieval inference: Failed to read Parquet files from Huggingface laion/laion-pop dataset #393
Comments
Because of the character limit, I copied the full output message after running
When I ran
Hi, you need to run img2dataset first, see readme
On Tue, Nov 26, 2024, 01:18 jenniferlin0815 wrote:
When I ran clip-retrieval end2end
'/project/jl0815/cvqa_sota/laion_pop_snapdownloaded/part-{00000..00127}-a5835434-5909-4f72-a89e-2fc1d17efc62-c000.snappy.parquet'
$tmp_folder, I got the following messages:
/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/albumentations/__init__.py:24: UserWarning: A new version of Albumentations is available: 1.4.21 (you have 1.4.20+computecanada). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
check_for_updates()
Starting the downloading of this file
0it [00:00, ?it/s]er 1 of 1 called /project/jl0815/laion_pop_snapdownloaded/part-{00000..00127}-a5835434-5909-4f72-a89e-2fc1d17efc62-c000.snappy.parquet
0it [01:00, ?it/s]
Traceback (most recent call last):
File "/project/jl0815/venv_llama32_laion/bin/clip-retrieval", line 8, in <module>
sys.exit(main())
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/clip_retrieval/cli.py", line 18, in main
fire.Fire(
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/clip_retrieval/clip_end2end.py", line 24, in clip_end2end
download(
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/img2dataset/main.py", line 262, in download
distributor_fn(
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/img2dataset/distributor.py", line 36, in multiprocessing_distributor
failed_shards = run(reader)
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/img2dataset/distributor.py", line 31, in run
for status, row in tqdm(process_pool.imap_unordered(downloader, gen)):
File "/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
FileNotFoundError: [Errno 2] No such file or directory: '/project/jl0815/laion_pop_snapdownloaded/part-{00000..00127}-a5835434-5909-4f72-a89e-2fc1d17efc62-c000.snappy.parquet'
/project/jl0815/venv_llama32_laion/lib/python3.10/site-packages/albumentations/__init__.py:24: UserWarning: A new version of Albumentations is available: 1.4.21 (you have 1.4.20+computecanada). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
check_for_updates()
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/python/3.10.13/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
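The first FileNotFoundError above names the path with `{00000..00127}` still in it, which suggests the brace range reached the downloader as a literal file name rather than being expanded. Below is a minimal sketch, assuming shell-style numeric brace ranges, that expands such a pattern into explicit paths and reports which ones are missing on disk. The helper names `expand_brace_range` and `missing_files` are hypothetical and not part of img2dataset or clip-retrieval.

```python
import os
import re

# Matches one shell-style numeric range such as {00000..00127}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_brace_range(pattern: str) -> list[str]:
    """Expand the first {start..end} range in `pattern` (hypothetical helper)."""
    match = _RANGE.search(pattern)
    if match is None:
        # No range present: the pattern is already a single path.
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve the zero padding used in the pattern
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        prefix + str(i).zfill(width) + suffix
        for i in range(int(start), int(end) + 1)
    ]


def missing_files(pattern: str) -> list[str]:
    """Return the expanded paths that do not exist on disk."""
    return [p for p in expand_brace_range(pattern) if not os.path.isfile(p)]
```

Calling `missing_files(...)` on the pattern passed to the command would show whether the individual shards exist at that location, independent of how the tool itself treats the braces.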
Hi! Thank you for the great work!
I am trying to build a system that, given an image, returns the k most similar images. I planned to use laion/relaion2B-en-research from Hugging Face but decided to start with the laion/laion-pop dataset to familiarize myself with the tool. I downloaded the laion/laion-pop dataset using Hugging Face's
snapshot_download()
function, so the downloaded Parquet files store the real data, unlike the cached ones, which store links.
When I ran
clip-retrieval inference --input_dataset '<folder>/part-{00000..00127}-a5835434-5909-4f72-a89e-2fc1d17efc62-c000.snappy.parquet' --output_folder $output_folder --input_format webdataset --enable_text False
, the output was as below:
with 125
UserWarning: ReadError('invalid header'...
and 3
UserWarning: ReadError('bad checksum'...
and the
$output_folder
has the following structure:
where 0.json, 1.json, and 2.json are files containing a
{}
and nothing else.
I tried re-downloading the Hugging Face dataset, but it did not help. Compared to the official example, which uses
test_1000.parquet
with just three keys:
the Parquet files of laion/laion-pop have many more keys:
and I wonder if this is the cause of the error.
My virtual environment has the following packages:
I would appreciate any input. Please let me know if there is a tutorial on using Hugging Face datasets with clip-retrieval.
Thank you!
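One check that does not depend on the column names: the `ReadError('invalid header')` and `ReadError('bad checksum')` messages look like the ones Python's `tarfile` module raises when it is handed something that is not a tar archive, and the webdataset format is built on tar shards. The sketch below, using only the standard library, reports whether a file looks like Parquet (which begins with the magic bytes `PAR1`) or like a tar archive. The function name `sniff_format` is hypothetical and not part of clip-retrieval.

```python
import tarfile

# Parquet files begin (and end) with these four magic bytes.
PARQUET_MAGIC = b"PAR1"


def sniff_format(path: str) -> str:
    """Return "parquet", "tar", or "unknown" for the file at `path`."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head == PARQUET_MAGIC:
        return "parquet"
    # tarfile.is_tarfile tries to open the file as a (possibly compressed) tar.
    if tarfile.is_tarfile(path):
        return "tar"
    return "unknown"
```

If every shard is reported as "parquet", a reader expecting tar archives would be expected to fail on them regardless of how many keys each file holds.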