Add support for memory-mapped files #604
Conversation
Thanks @JanSellner, this is a great start! I like the general approach, although I have some doubts about the new API function that you introduced. Also, I'd like to have some style issues corrected (see the updated guidelines in https://github.com/Blosc/c-blosc2/blob/main/DEVELOPING-GUIDE.rst).
I'd also love to have a benchmark comparing the new mmap IO and the existing stdio one, so it would be great if you could add one to the bench/ directory for us to test. In particular, if the benchmarks work well on different boxes (including MacOS, which I can check), I am curious about the possibility of making mmap the default for IO. This could open a can of worms for OSes other than Linux/MacOS/Win though.
```c
 * @return The new super-chunk. NULL if not found or not in frame format.
 */
BLOSC_EXPORT blosc2_schunk* blosc2_schunk_open_offset_mmap(const char* urlpath, int64_t offset, const char* mmap_mode);
```
This is the only mmap-specific API introduced. May I suggest adding blosc2_schunk_open_offset_udio instead? This would work more generally for other udio. OTOH, I agree that blosc2_schunk_open_offset_mmap is handy, but having it and not blosc2_schunk_open_mmap is not consistent (i.e. the latter should be added too). A bit undecided here... Thoughts?
I agree. The blosc2_schunk_open_udio function can already be used for mmap, and it only makes sense to also have a blosc2_schunk_open_offset_udio counterpart (especially since this is needed for Python). I removed the blosc2_schunk_open_offset_mmap function.
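As an illustration of the udio route discussed here, below is a minimal sketch of opening a schunk with an mmap-backed blosc2_io. The BLOSC2_IO_FILESYSTEM_MMAP identifier, the filename, and the way parameters are passed are assumptions made for this example, not necessarily the final API of this PR (check blosc2.h and test_mmap.c for the actual names):

```c
// Hypothetical sketch: opening an existing schunk through the generic udio path.
#include <blosc2.h>
#include <stdio.h>

int main(void) {
  blosc2_init();

  blosc2_io io = BLOSC2_IO_DEFAULTS;   // start from the default I/O description
  io.id = BLOSC2_IO_FILESYSTEM_MMAP;   // assumed id of the mmap-based backend

  blosc2_schunk *schunk = blosc2_schunk_open_udio("example.b2frame", &io);
  if (schunk == NULL) {
    fprintf(stderr, "could not open schunk\n");
    blosc2_destroy();
    return 1;
  }

  printf("nchunks: %d\n", (int)schunk->nchunks);
  blosc2_schunk_free(schunk);
  blosc2_destroy();
  return 0;
}
```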
Right, a lazychunk only contains metadata (header of the chunk + index to where blocks start), so I don't think having an additional memcpy is too bad.
Yep, sounds good to me. If you could provide a benchmark for writing, I'll be happy to contribute figures on whether it is accelerated (and how much) or not.
Frankly, I was not expecting too much gain here, but at any rate, I am in favor of changing the API for reading now (this could come in handy for other IO plugins in the future).
Ok, but we need to be consistent with the C API as well. I have added a comment about this in the review.
Thank you for your review @FrancescAlted! I added an io argument to
Ok good. Then this can stay as is. I also made a few code style changes. Btw, as a side note (not directly connected to this PR): have you thought about adding a pre-commit hook with clang-format to automate and unify the code style in this repo?
Cool! I have started doing some comparisons, and mmap is better sometimes on my MacBook Air:
So, the initial appends (compr) seem to work faster with the regular file IO, whereas reads (decomp) seem significantly faster with mmap, which is good news indeed. However, inserts are definitely slower with mmap (0.300s vs 0.889s). Same thing with updates, where mmap is significantly slower (0.242s vs 0.661s):
Following that, I assume that mmap is going to accelerate reads significantly, whereas for writes it can add a significant overhead (at least on MacOS). If that trend is confirmed, I wonder whether choosing the mmap IO when opening files in 'r'ead-only mode would be a good default.
I see. Well, it is not that great for Windows to reserve 1 GB when you only have to write e.g. 100 KB, but that's life.
Well done!
Maybe avoiding seek and tell is better, yes. Do you think that the new
Yes, there are situations where this is desirable. We blogged about this: https://www.blosc.org/posts/introducing-sparse-frames/
It would be cool, but no rush.
Yes, that would be great. Would you like to provide the hook in a different PR? The only condition is that it would enforce the format for new code only, and refuse to change the existing one (I value the fact that I can easily trace/blame lines of code).
I removed tell/seek and introduced the size callback. There might be a slight performance benefit in the case of mmap (fewer function calls needed), but since it seems to be used rarely anyway, I don't think it matters.
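To illustrate why a dedicated size callback can save work for mmap, here is a rough sketch; the function names and the backend state struct are hypothetical, not the PR's actual callback signatures. The stdio variant needs a seek/tell round-trip, while the mmap variant already tracks the file size in memory:

```c
// Hypothetical sketch of a "size" callback for both backends (not the actual
// c-blosc2 prototypes).
#include <stdio.h>
#include <stdint.h>

typedef struct {
  void *addr;         // start of the mapping (illustrative backend state)
  int64_t file_size;  // current logical size of the mapped file
} mmap_state_t;

static int64_t stdio_size_cb(void *stream) {
  FILE *fp = (FILE *)stream;
  if (fseek(fp, 0, SEEK_END) != 0)
    return -1;
  return (int64_t)ftell(fp);  // seek/tell round-trip on every query
}

static int64_t mmap_size_cb(void *stream) {
  mmap_state_t *m = (mmap_state_t *)stream;
  return m->file_size;  // size is already tracked in memory, no extra calls
}
```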
Oh sorry, no, it is not possible to apply formatting changes only to new changes, since this would be against the original idea of such a hook, which is to enforce a consistent style throughout the repository without human intervention so that users don't have to worry about style consistency anymore. However, it actually is possible to solve your use case of excluding such formatting commits from git blame. There even seems to be automatic support for this on GitHub. So, if this would work for you, we could proceed with it in a separate PR (but after this one^^).

Benchmarking

Linux
Windows
On Linux and Windows, there is quite some performance boost for decompression and inserts. Compression is slightly faster but less drastically so (more pronounced on Windows). I am actually a bit surprised that MacOS performs so badly here. Maybe it is just because of fluctuations? Maybe you could also repeat it on the latest version.

Given these numbers, I think that mmap could indeed be a good default, especially since all this is in addition to the original problem of deteriorated NFS performance. However, maybe it is best to first provide the option to the user and only make it the default after some transition period, just in case we are missing a problem which only occurs in certain situations or platforms.

Regarding sframe support:

```c
if (frame->sframe) {
  frame->fp = sframe_open_chunk(frame->urlpath, nchunk, "rb", schunk->storage->io);
  frame->fp_nchunk = nchunk;
  if (frame->fp == NULL) {
    BLOSC_TRACE_ERROR("Error opening file in: %s", frame->urlpath);
    return BLOSC2_ERROR_FILE_OPEN;
  }
}
chunksize = frame_decompress_chunk(schunk->dctx, frame, nchunk, dest, nbytes);
if (frame->sframe) {
  if (sframe_close_chunk(frame->fp, schunk->storage->io) < 0) {
    return BLOSC2_ERROR_FAILURE;
  }
}
```

The main idea was to use this function as clear entry and exit points for the I/O. However, the chunk which actually gets read can be determined only at runtime, e.g. via

```c
nchunk = *(int32_t*)(src + trailer_offset);
```

The main problem for mmap sframe support is that this makes it completely unpredictable which chunk should be loaded and, in turn, when to close it (to ideally only open and close every chunk once). If this is not a bug, then I would propose to not use mmap for sframes.
Probably not, but it has been a good exercise to check that tell/seek is not needed. Thanks.
Ok, thanks for the detective work. However, I think I still prefer to keep things manually formatted for now. But, for the future, this can be good advice to have in mind.
Cool. Here are my numbers. First for my Linux box (i13900K):

Linux
Then for my MacBook Air (ARM M1):

MacOS
So, it looks like MacOS still has some quirk regarding write (compression) performance.
+1. Let's provide the mmap IO as an option first and, as time goes by, let's decide on promoting mmap to the default (which would be quite a no-brainer if/when MacOSX performs well for writes). Another possibility is to use mmap for reads and regular stdio for writes. But yeah, it is a bit too soon for this.
It is intended indeed. The reason is that the chunk index is located right after the chunks, so the procedure to access it is to find the end of the file to retrieve the index (and decompress it), prior to accessing the desired chunk. The rationale is that the index is variable-length, and putting it at the end of the file (near the trailer) is the only way to add more chunks without having to re-write the existing ones.
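To make this layout easier to picture, here is a simplified sketch of a cframe on disk; it only illustrates the explanation above and is not a byte-accurate description of the frame format:

```c
/* Simplified cframe layout (illustrative, not byte-accurate):
 *
 *   [header][chunk 0][chunk 1]...[chunk N-1][chunk index][trailer]
 *
 * The chunk index is variable-length and sits right before the trailer, so
 * new chunks can be appended without rewriting the existing ones. A reader
 * first jumps to the end of the frame, retrieves and decompresses the index,
 * and only then accesses the desired chunk.
 */
```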
+1 on not using mmap for sframes. The only issue is that mmap will then only be possible for cframes, possibly giving the user a false hint about when mmap is actually acting. But if properly documented, this should not be a big problem IMO.
Hmm, there is definitely something weird going on with writes on MacOS. If you find some time, maybe you can do some line profiling (I think they have something in Xcode: Instruments). Maybe there is a certain command which causes this. On Linux, I used callgrind@valgrind successfully and observed quite a drop in costs for
I added an error message when a different file path is about to be opened by the user. This happens for sframes but is also an error in any case.
Some comments on docstrings
Ok. I don't have experience with Xcode. What I have is a profile made with CLion; it is not a line profile, but rather a sampling profile of function calls, but it might help. Here are a couple of screenshots for mmap, and a similar thing for io_file. Hope this helps.
Interesting! I expected that ... In any case, a ...
Yep, you nailed it. On my MacBook Air, before applying the latest commit:
and after applying it:
So, definitely a good performance bump, and better write speed than the default stdio on all major OSes. Congrats! Before merging, I have a question: why did you choose 1 GB for the initial file size for mmap on Windows? Wouldn't something like 10 MB, or even 100 MB, be enough?
Nice! Cool that it now also works nicely on Mac OS :-) 1 GB is a bit arbitrary indeed. The general idea is to make it reasonably large so that (potentially costly) remaps are avoided. On Linux and Mac OS, this only absorbs virtual address space, which is no problem because there is plenty. It is of course a bit sad that on Windows it also costs (temporary) disk space. I just thought that it is reasonable to assume that there is at least 1 GB of free space available. In principle, this parameter can also be adjusted by the user, and we can also finetune the default value in the future. For the Python interface, it might make sense to expose this parameter explicitly so that users are aware of it and can tune it accordingly. If possible, it should always be set to cover the expected output size (not always possible, but a rough guess would be fine).
Ok, makes sense to me. Merging.
Thanks @JanSellner !
This PR fixes the issue of slow file reading performance on NFS shares.
Interface Changes
Compared to the previous benchmarking, I changed the interface of the blosc2_stdio_read function to avoid unnecessary copies of the data. This required changes in all io_cb->read usages, since previous allocations could have become obsolete. In principle, this worked everywhere except for frame_get_lazychunk, where the allocation of the chunk has a different size (lazychunk_cbytes) compared to what is actually read from the file (streams_offset). Further, the chunk is read from the file and then modified afterwards, i.e. it is not compatible with read-only memory pages. This means that in this case a memcpy is still necessary in the current implementation, even though it would theoretically be possible to avoid it.

I am actually not sure about the difference between a lazychunk and a normal chunk, and whether it only contains metadata (header, trailer) or also actual chunk data. If it is only metadata, then it probably won't matter, since metadata should be small compared to the user data. If this function, however, copies the user data as well, then it might have an effect.
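To illustrate the kind of interface change described here, a contrast of the two read styles is sketched below. The names, signatures and backend state are made up for illustration; they are not the actual blosc2_stdio_read prototypes. With a copying read the caller always pays a memcpy, while a pointer-returning read lets an mmap backend hand out a view straight into the mapping:

```c
// Hypothetical contrast of a copying vs. a zero-copy read interface.
#include <stdint.h>
#include <string.h>

typedef struct {
  uint8_t *map;   // start of the memory mapping (illustrative backend state)
  int64_t pos;    // current read position
} mmap_reader_t;

// Copying style: the caller owns the buffer, so every read costs a memcpy.
static int64_t read_copy(mmap_reader_t *r, void *dst, int64_t nbytes) {
  memcpy(dst, r->map + r->pos, (size_t)nbytes);
  r->pos += nbytes;
  return nbytes;
}

// Zero-copy style: the caller receives a pointer into the (read-only) mapping.
// Callers that modify the returned data afterwards (as in frame_get_lazychunk)
// still need their own writable copy.
static int64_t read_view(mmap_reader_t *r, void **dst, int64_t nbytes) {
  *dst = r->map + r->pos;
  r->pos += nbytes;
  return nbytes;
}
```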
Write Operations
Our main use case (and the original issue) is about reading files, not about writing (we usually write offline and copy the compressed data to the target node). Writing to memory-mapped files is more complex than reading because, for writing, we ideally need to know the expected file size in advance. For Blosc, this is problematic because we cannot know in advance how much data will be compressed and how large the file will be after compression.
A common solution is to make the memory mapping larger than the actual file, which is also what I did. The default is to start with 1 GB. If the file gets larger, it has to be re-mapped, and I double the size in each remapping step. I hope this covers most use cases efficiently.
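A small sketch of that growth policy (the function name is hypothetical, and the real remapping code in the PR does more, e.g. the actual munmap/mmap or Windows remapping calls):

```c
// Illustrative growth policy: start with a large initial mapping and double
// it whenever the written data no longer fits, keeping the number of
// (potentially costly) remaps small.
#include <stdint.h>

#define INITIAL_MAPPING_SIZE ((int64_t)1 << 30)  /* 1 GB default */

static int64_t next_mapping_size(int64_t current_size, int64_t required_size) {
  int64_t new_size = current_size > 0 ? current_size : INITIAL_MAPPING_SIZE;
  while (new_size < required_size) {
    new_size *= 2;  // double the mapping in each remapping step
  }
  return new_size;
}
```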
Latest Benchmark
Here is the latest benchmark of the current mmap implementation using NFS files with caching:
A bit sad: the difference between blosc_mmap_new with less memcpy and the previous implementation of the benchmark (blosc_mmap mmap) is rather low. However, both are much faster than the current default.

Interface
To use memory-mapped files, users can either use the new blosc2_schunk_open_offset_mmap function or create a blosc2_io object with the corresponding parameters directly (cf. test_mmap.c). I used blosc2_schunk_open_offset_mmap because blosc2_schunk_open_offset seems to be the only function used from Python which passes a filename.

The new mmap_mode is basically the same as for Numpy's memmap, except that Numpy doesn't allow the file to grow, whereas here this is possible.

OS-specific Implementations
You will see that the implementation of memory-mapped files is highly OS-specific. The Windows implementation is completely different, and the support of the c mode is ... a bit weird, since the size of the memory mapping cannot change (FILE_MAP_COPY cannot be combined with FILE_MAP_RESERVE).

I also did not use mremap in order to stay POSIX-compliant (mremap is available on Linux but not on Mac OS).

I tested everything on Linux and Windows. Mac OS is not tested.
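For orientation, here is a minimal POSIX-only sketch of how the mmap_mode values could roughly translate to mmap(2) flags on Linux/Mac OS; the Windows path uses CreateFileMapping/MapViewOfFile instead, and the exact flag choices in the PR may differ:

```c
// POSIX-only illustration of the mode handling (not the PR's exact code).
#include <sys/mman.h>
#include <string.h>

static void* map_for_mode(int fd, size_t length, const char* mode) {
  if (strcmp(mode, "r") == 0) {
    // "r": read-only view of the file
    return mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
  }
  if (strcmp(mode, "c") == 0) {
    // "c": copy-on-write; modifications stay in memory, never written back
    return mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
  }
  // "r+"/"w+": writable view whose changes are flushed back to the file
  return mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```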
Post-PR Steps
After this PR is merged, I hope it will be as simple as exposing the mmap_mode argument to the blosc2.open function and using blosc2_schunk_open_offset_mmap in this case :-)