
MemoryError on computing checksums for large files #79

Open
oroszgy opened this issue Feb 21, 2024 · 2 comments
Labels
enhancement New feature or request

Comments


oroszgy commented Feb 21, 2024

When creating a command that depends on a large file (one that cannot fit into memory), weasel still tries to load the whole file, which results in a MemoryError.

The traceback for such a run:


  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/bin/weasel", line 8, in <module>
    sys.exit(app())
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 42, in project_run_cli
    project_run(
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 88, in project_run
    project_run(
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 113, in project_run
    update_lockfile(current_dir, cmd)
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 270, in update_lockfile
    data[command["name"]] = get_lock_entry(project_dir, command)
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 286, in get_lock_entry
    deps = get_fileinfo(project_dir, command.get("deps", []))
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 308, in get_fileinfo
    md5 = get_checksum(file_path) if file_path.exists() else None
  File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/util/hashing.py", line 33, in get_checksum
    return hashlib.md5(Path(path).read_bytes()).hexdigest()
  File "/home/gorosz/Applications/miniconda3/lib/python3.10/pathlib.py", line 1127, in read_bytes
    return f.read()

MemoryError
@svlandeg svlandeg added the enhancement New feature or request label Feb 29, 2024

svlandeg commented Feb 29, 2024

This happens because Weasel checks a command's dependencies to determine whether they've changed. To prevent it, simply don't list the large file as an input or output of the command; then it won't be processed or validated.
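To illustrate the workaround (the command, script, and file names below are hypothetical, not taken from the reporter's project), the large file is simply left out of the command's deps/outputs in project.yml, so Weasel never tries to checksum it:

```yaml
commands:
  - name: train
    script:
      - "python scripts/train.py"
    deps:
      - "configs/train.cfg"        # small file: hashed and tracked in the lockfile
    # data/large_corpus.bin is deliberately NOT listed under deps or outputs,
    # so Weasel skips checksumming it entirely
```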


oroszgy commented Feb 29, 2024

Thanks. I just discovered this workaround myself. Do you think it would be feasible to use the last modification date instead of hashes? Alternatively, would computing the hash over file chunks solve this issue (see https://stackoverflow.com/questions/1131220/get-the-md5-hash-of-big-files-in-python)?
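The chunked approach could look roughly like the sketch below. This is not Weasel's actual implementation; the function name `get_checksum_chunked` and the 8 KiB chunk size are illustrative. It produces the same digest as hashing the whole file at once, but with constant memory use:

```python
import hashlib
from pathlib import Path


def get_checksum_chunked(path, chunk_size=8192):
    """Compute an MD5 hex digest by reading the file in fixed-size chunks,
    so the whole file never has to fit into memory at once."""
    md5 = hashlib.md5()
    with open(Path(path), "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Since MD5 is updated incrementally, the result is identical to `hashlib.md5(Path(path).read_bytes()).hexdigest()`, just without materializing the file in memory.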
