Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent util.execute from reading output files #296

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

anabiman
Copy link
Collaborator

@anabiman anabiman commented Apr 12, 2021

Description

Adds an extra keyword (outfiles_load) to util.execute that instructs disk_files how to handle output files. If outfiles_load=True (default), the output files are read and their contents returned in outfiles. When outfiles_load=False, only the file paths are returned instead. This is useful and can be necessary when output files are large so keeping track of the output files while retaining the scratch dir becomes a viable alternative:

inp = {
    "command": ...,
    "outfiles": [huge_data_file, ...],
    "outfiles_load": False, # do not load outfiles in memory
    "scratch_messy": True, # do not delete scratch dir
}

_, proc = execute(**inp)

proc["outfiles"][huge_data_file] -> Path

The default behavior is preserved (outfiles_load=True in util.execute), so by default outfiles stores the file contents i.e.

inp = {
    "command": ...,
    "outfiles": [small_data_file, ...],
}

_, proc = execute(**inp)

proc["outfiles"][small_data_file] -> Union[str, bytes]

Changelog description

util.execute supports tracking output files without loading them in memory

Status

  • Code base linted
  • Ready to go

@codecov
Copy link

codecov bot commented Apr 12, 2021

Codecov Report

Merging #296 (a3d0ace) into master (43639f2) will increase coverage by 0.19%.
The diff coverage is 100.00%.

@loriab
Copy link
Collaborator

loriab commented May 3, 2021

What about making it like binary where only the named files are not read into memory?

Perhaps add a note that scratch_messy=True required to avoid surprise.

@anabiman
Copy link
Collaborator Author

anabiman commented May 4, 2021

I like your suggestion. I'll refine this and improve the docstring.

qcengine/util.py Outdated
@@ -377,6 +377,7 @@ def execute(
infiles: Optional[Dict[str, str]] = None,
outfiles: Optional[List[str]] = None,
*,
outfiles_track: Optional[List[str]] = [],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point but shouldn't matter here since outfiles_track should never be modified, but if you're concerned someone might I could change its type to an immutable object like a Tuple OR if you prefer a list for consistency that is by default None, I could add a single line to disk_files:

outfiles_track = outfiles_track or []

that would make the current code work with no additional changes.

qcengine/util.py Outdated
except (OSError, FileNotFoundError):
if "*" in fl:
filename = lwd / fl
if fl not in outfiles_track:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you handling the case when outfiles=["subdir/*"], outfiles_track=["subdir/bigfile"]?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this will not work (bigfile will be loaded in memory). For outfiles_track to work, it has to be a subset of outfiles. So for the following input:

{
    ...
    "outfiles": ["*"],
    "outfiles_track": ["bigfile"],
    ...
}

or the opposite:

{
    ...
    "outfiles": ["bigfile", ...],
    "outfiles_track": ["*"],
    ...
}

outfiles_track will be ignored. There are also no sanity checks being done on outfiles_track so if one specifies a filename not part of outfiles, it will be ignored without any warnings.

I guess the main advantage in setting outfiles_load: bool is we don't have to worry about individual cases like these.

@loriab loriab added the LocalCompute Using qcng as front-end for large jobs, rather than distributed compute label Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LocalCompute Using qcng as front-end for large jobs, rather than distributed compute
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants