
MemoryError on computing checksums for large files #79

Open
oroszgy opened this issue Feb 21, 2024 · 2 comments
Labels
enhancement New feature or request

Comments


oroszgy commented Feb 21, 2024

When creating a command that depends on a large file (one that cannot fit into memory), weasel still tries to load the whole file, which results in a MemoryError.

The traceback for such a run:


  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/bin/weasel", line 8, in <module>
    sys.exit(app())
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 42, in project_run_cli
    project_run(
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 88, in project_run
    project_run(
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 113, in project_run
    update_lockfile(current_dir, cmd)
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 270, in update_lockfile
    data[command["name"]] = get_lock_entry(project_dir, command)
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 286, in get_lock_entry
    deps = get_fileinfo(project_dir, command.get("deps", []))
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 308, in get_fileinfo
    md5 = get_checksum(file_path) if file_path.exists() else None
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/util/hashing.py", line 33, in get_checksum
    return hashlib.md5(Path(path).read_bytes()).hexdigest()
  File "/home/gorosz/Applications/miniconda3/lib/python3.10/pathlib.py", line 1127, in read_bytes
    return f.read()

MemoryError
@svlandeg svlandeg added the enhancement New feature or request label Feb 29, 2024

svlandeg commented Feb 29, 2024

This happens because Weasel checks a command's dependencies to determine whether they've changed. To prevent it, simply don't list the large file as an input or output of the command; then it won't be processed or validated.
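To illustrate the workaround (the command, script, and file names below are hypothetical, not taken from the reporter's project), the large file is simply left out of the command's deps/outputs in project.yml, so Weasel never tries to checksum it:

```yaml
commands:
  - name: train
    script:
      - "python scripts/train.py"
    deps:
      - "configs/train.cfg"        # small file: hashed and tracked in the lockfile
    # data/large_corpus.bin is deliberately NOT listed under deps or outputs,
    # so Weasel skips checksumming it entirely
```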


oroszgy commented Feb 29, 2024

Thanks. I just discovered this workaround myself. Do you think it would be feasible to use the last modification date instead of hashes? Alternatively, would computing the hash over file chunks solve this issue (see https://stackoverflow.com/questions/1131220/get-the-md5-hash-of-big-files-in-python)?
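The chunked approach could look roughly like the sketch below. This is not Weasel's actual implementation; the function name `get_checksum_chunked` and the 8 KiB chunk size are illustrative. It produces the same digest as hashing the whole file at once, but with constant memory use:

```python
import hashlib
from pathlib import Path


def get_checksum_chunked(path, chunk_size=8192):
    """Compute an MD5 hex digest by reading the file in fixed-size chunks,
    so the whole file never has to fit into memory at once."""
    md5 = hashlib.md5()
    with open(Path(path), "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Since MD5 is updated incrementally, the result is identical to `hashlib.md5(Path(path).read_bytes()).hexdigest()`, just without materializing the file in memory.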
