Evaluate the cost of relocatable files in parallel filesystems #53810
Comments
Thank you very much for taking this up. One possible solution I see is to not cache source files at the time of loading, but at the time of modification for all non-
I think your proposal is close to a custom CAS system. Regarding the restriction: Regarding keeping hashes in a separate file: These thoughts raise the questions:
I recently started looking at the cost of depot operations for CliMA on the distributed filesystem of our cluster. Given the large number of (small) files involved, such operations are already relatively expensive. Is there a small test I can run to benchmark the impact of this change?
Not yet. We are still trying to schedule a meeting, but some are on holiday atm. Otherwise, I think we will provide some benchmark code + reference results here for posterity.
Thanks! Looking forward to it!
With #49866 coming in v1.11, we changed the staleness check for cache files from using the `mtime` of the source files to computing their `_crc32` hash. Because `_crc32` needs to read the whole source file, it is likely more expensive to compute than querying the `mtime`.
So far no one seems to have noted regressions in loading times related to this change. If you do notice any, please link to this issue!
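To make the cost difference concrete, here is a minimal Python sketch of the two kinds of staleness check. It is an illustration only, not Julia's loading code: the function names are hypothetical, and `zlib.crc32` stands in for whatever checksum Julia computes internally.

```python
import os
import zlib


def crc32_of_file(path, chunk_size=1 << 20):
    """Checksum the file contents; every byte has to be read."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def is_stale_mtime(path, recorded_mtime):
    """Metadata-only check (hypothetical helper): one stat call, no data read."""
    return os.stat(path).st_mtime != recorded_mtime


def is_stale_crc32(path, recorded_crc):
    """Content check (hypothetical helper): ignores timestamps, reads the whole file."""
    return crc32_of_file(path) != recorded_crc
```

The content check keeps reporting "fresh" when a file is copied or its timestamp changes but its bytes do not, which is presumably what makes it attractive for relocatable cache files. The price is one open and one full read per source file instead of a single metadata query.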
However, some noted that this change might have a notable impact on loading times on parallel filesystems.
This issue should track the status of this problem. I will update the list below regularly.
Related to this is a report from @sloede in the initial relocation issue, Make precompile files relocatable/servable #47943 (comment). The Trixi.jl folks already have a workflow set up that we can utilize for a benchmark. We are currently preparing a meeting to see how we can evaluate this.
Assuming the impact is not negligible, there is at least one way forward: @vchuravy suggested counteracting this by implementing a content-addressable storage (CAS) system.
In https://hackmd.io/@Je8OcLYBQr2ociLAtslIug/BJvv7G9pa I prepared a draft for a blog post that we want to publish that explains how to utilize this relocation feature. Eventually, this report should be shared on Discourse, ideally also including an answer to whether this has an impact on parallel filesystems. Any feedback on that post is very much appreciated!
Tangentially related: Computing the CRC hash of the sysimage costs a non-negligible startup time #50166. Sysimages are on the order of 200 MB, at least for v1.8, v1.9, v1.10. Note that the problem with parallel filesystems seems not to be the size of a single file, but the vast number of files.
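Until reference benchmark code is available, a rough way to probe the many-small-files case is a sketch like the following. It is written in Python rather than Julia and is not the benchmark planned in this issue: all names, file counts, and sizes are assumptions, and `zlib.crc32` is only a stand-in for the checksum Julia uses.

```python
import os
import tempfile
import time
import zlib


def make_tree(root, n_files, size_bytes):
    """Create n_files small files with random contents under root."""
    paths = []
    for i in range(n_files):
        path = os.path.join(root, f"src_{i}.jl")
        with open(path, "wb") as f:
            f.write(os.urandom(size_bytes))
        paths.append(path)
    return paths


def time_mtime(paths):
    """Seconds spent querying the modification time of every file."""
    start = time.perf_counter()
    for p in paths:
        os.stat(p).st_mtime
    return time.perf_counter() - start


def time_crc32(paths):
    """Seconds spent opening, reading, and checksumming every file."""
    start = time.perf_counter()
    for p in paths:
        with open(p, "rb") as f:
            zlib.crc32(f.read())
    return time.perf_counter() - start


def compare(n_files=2000, size_bytes=4096, root=None):
    """Time both checks on a temporary tree created under root."""
    with tempfile.TemporaryDirectory(dir=root) as tmp:
        paths = make_tree(tmp, n_files, size_bytes)
        return {"mtime_s": time_mtime(paths), "crc32_s": time_crc32(paths)}
```

Calling `compare(root=...)` with a directory on the parallel filesystem exercises that filesystem instead of the local temp directory. Because the files were just written, they are likely still in the client cache, so the timings probably understate the cold-cache cost that a fresh session on a compute node would see.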