
Add support for memory-mapped files #604

Merged: 18 commits, May 16, 2024

Conversation

JanSellner
Contributor

This PR fixes the issue of slow file reading performance on NFS shares.

Interface Changes
Compared to the previous benchmarking, I changed the interface of the blosc2_stdio_read function to avoid unnecessary copies of the data. This required changes in all io_cb->read usages since the previous allocations could have become obsolete. In principle, this worked everywhere except for frame_get_lazychunk, where the allocation of the chunk has a different size (lazychunk_cbytes) compared to what is actually read from the file (streams_offset). In addition, the chunk is read from the file and then modified afterwards, i.e. it is not compatible with read-only memory pages. This means that in this case a memcpy is still necessary in the current implementation, even though it would theoretically be possible to avoid it.
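
For illustration, here is a minimal sketch of the idea behind this interface change. The names and signatures below are made up for illustration and are not the actual blosc2 callbacks:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical I/O backend state; not the blosc2 structs. */
    typedef struct { const uint8_t *map; int64_t size; } mmap_stream;

    /* Copying read: the caller allocates dst and always pays a memcpy. */
    static int64_t read_copy(void *dst, int64_t pos, int64_t size, mmap_stream *s) {
      if (pos + size > s->size) return -1;
      memcpy(dst, s->map + pos, (size_t)size);
      return size;
    }

    /* Zero-copy read: the callback hands back a pointer into the mapping
       instead. The caller must treat the memory as read-only, which is why
       frame_get_lazychunk still needs its own copy: it modifies the data. */
    static int64_t read_nocopy(const void **ptr, int64_t pos, int64_t size, mmap_stream *s) {
      if (pos + size > s->size) return -1;
      *ptr = s->map + pos;
      return size;
    }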

I am actually not sure about the difference between a lazychunk and a normal chunk and whether it only contains metadata (header, trailer) or also actual chunk data. If it is only metadata, then it probably won't matter since metadata should be small compared to the user data. If this function, however, copies the user data as well, then it might have an effect.

Write Operations
Our main use case (and the original issue) is about reading files, not about writing (we usually write offline and copy the compressed data to the target node). Writing to memory-mapped files is more complex than reading because for writing we ideally need to know the expected file size in advance. For Blosc, this is not feasible because we cannot know in advance how much data will be compressed and how large the file will be after compression.

A common solution is to make the memory mapping larger than the actual file size, which is also what I did. The default is to start with 1 GB. If the file grows beyond the mapping, the file has to be re-mapped, and I double the mapping size in each remapping step. I hope this covers most use cases efficiently.
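
For illustration, a minimal POSIX sketch of this doubling strategy (not the PR's actual code): it assumes the file is extended with ftruncate so that the newly mapped range is backed, and it re-maps with munmap/mmap instead of mremap:

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Re-map the file so that at least `needed` bytes fit, doubling the
       mapping size in each step (starting at 1 GiB here). */
    static char *grow_mapping(char *addr, int fd, int64_t *mapping_size, int64_t needed) {
      int64_t new_size = *mapping_size > 0 ? *mapping_size : (int64_t)1 << 30;
      while (new_size < needed)
        new_size *= 2;
      if (addr != NULL && munmap(addr, (size_t)*mapping_size) != 0)
        return NULL;
      if (ftruncate(fd, (off_t)new_size) != 0)  /* ensure the mapped range is backed */
        return NULL;
      void *p = mmap(NULL, (size_t)new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED)
        return NULL;
      *mapping_size = new_size;
      return (char *)p;
    }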

Latest Benchmark
Here is the latest benchmark of the current mmap implementation using NFS files with caching:

blosc default:
Benchmark 1: build/blosc_mmap default test_files_nfs/test_real0.b2nd
  Time (mean ± σ):      34.0 ms ±   2.8 ms    [User: 6.8 ms, System: 22.4 ms]
  Range (min … max):    30.7 ms …  40.8 ms    88 runs

mmap old:
Benchmark 1: build/blosc_mmap mmap test_files_nfs/test_real0.b2nd
  Time (mean ± σ):       8.4 ms ±   0.6 ms    [User: 6.1 ms, System: 2.3 ms]
  Range (min … max):     7.6 ms …  10.8 ms    321 runs

mmap new:
Benchmark 1: build/blosc_mmap_new mmap test_files_nfs/test_real0.b2nd
  Time (mean ± σ):       8.0 ms ±   0.6 ms    [User: 5.7 ms, System: 2.3 ms]
  Range (min … max):     7.2 ms …  10.0 ms    341 runs

A bit sad: the difference between blosc_mmap_new (with fewer memcpy calls) and the previous implementation of the benchmark (blosc_mmap mmap) is rather small. However, both are much faster than the current default.

Interface
To use memory-mapped files, users can either use the new blosc2_schunk_open_offset_mmap function or create a blosc2_io object with the corresponding parameters directly (cf. test_mmap.c). I used blosc2_schunk_open_offset_mmap because blosc2_schunk_open_offset seems to be the only function used from Python which passes a filename.

The new mmap_mode is basically the same as NumPy's memmap modes, except that NumPy doesn't allow the file to grow whereas here it can.
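
For reference, a rough sketch of how such NumPy-style modes typically translate to POSIX mmap protection and flags; the PR's actual handling (in particular on Windows) differs in the details:

    #include <string.h>
    #include <sys/mman.h>

    /* "r": read-only; "r+"/"w+": read-write, persisted to the file;
       "c": copy-on-write, changes stay in memory only. */
    static int mode_to_mmap_flags(const char *mode, int *prot, int *flags) {
      if (strcmp(mode, "r") == 0) {
        *prot = PROT_READ;              *flags = MAP_SHARED;
      } else if (strcmp(mode, "r+") == 0 || strcmp(mode, "w+") == 0) {
        *prot = PROT_READ | PROT_WRITE; *flags = MAP_SHARED;
      } else if (strcmp(mode, "c") == 0) {
        *prot = PROT_READ | PROT_WRITE; *flags = MAP_PRIVATE;  /* copy-on-write */
      } else {
        return -1;  /* unknown mode */
      }
      return 0;
    }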

OS-specific Implementations
You will see that the implementation of memory-mapped files is highly OS-specific. The Windows implementation is completely different and the support of the c mode is ... a bit weird since the size of the memory mapping cannot change (FILE_MAP_COPY cannot be combined with FILE_MAP_RESERVE).
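
For readers less familiar with the Win32 side, here is a minimal read-only sketch of the API family involved (a made-up helper with error handling trimmed, not the PR's code); the c mode mentioned above would use PAGE_WRITECOPY/FILE_MAP_COPY instead:

    #ifdef _WIN32
    #include <windows.h>

    /* Map an existing file read-only and return a pointer to its contents. */
    static void *map_file_readonly(const char *path, HANDLE *file, HANDLE *mapping) {
      *file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                          OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
      if (*file == INVALID_HANDLE_VALUE)
        return NULL;
      /* A maximum size of 0 means: map the whole file. */
      *mapping = CreateFileMappingA(*file, NULL, PAGE_READONLY, 0, 0, NULL);
      if (*mapping == NULL)
        return NULL;
      return MapViewOfFile(*mapping, FILE_MAP_READ, 0, 0, 0);
      /* Teardown: UnmapViewOfFile(view); CloseHandle(*mapping); CloseHandle(*file); */
    }
    #endif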

I also did not use mremap to stay POSIX-compliant (mremap is available for Linux but not for Mac OS).

I tested everything on Linux and Windows. Mac OS is not tested.

Post-PR Steps
After this PR is merged, I hope it will be as simple as exposing the mmap_mode argument to the blosc2.open function and using blosc2_schunk_open_offset_mmap in that case :-)

@FrancescAlted (Member) left a comment

Thanks @JanSellner, this is a great start! I like the general approach, although I have some doubts about the new API function that you introduced. Also, I'd like some style issues corrected (see the updated guidelines in https://github.com/Blosc/c-blosc2/blob/main/DEVELOPING-GUIDE.rst).

I'd also love to have a benchmark comparing the new mmap IO and the existing stdio one, so it would be great if you could add one to the bench/ directory for us to test. In particular, if the benchmarks work well on different boxes (including MacOS, which I can check), I am curious about the possibility of making mmap the default for IO. This could open a can of worms for OSes other than Linux/MacOS/Win though.

include/blosc2.h
* @return The new super-chunk. NULL if not found or not in frame format.
*/
BLOSC_EXPORT blosc2_schunk* blosc2_schunk_open_offset_mmap(const char* urlpath, int64_t offset, const char* mmap_mode);

FrancescAlted (Member):

This is the only mmap-specific API introduced. May I suggest adding blosc2_schunk_open_offset_udio instead? This would work more generally for other udios. OTOH, I agree that blosc2_schunk_open_offset_mmap is handy, but having it and not blosc2_schunk_open_mmap is not consistent (i.e. the latter should be added too). A bit undecided here... Thoughts?

JanSellner (Contributor, Author):

I agree. The blosc2_schunk_open_udio function can already be used for mmap and it only makes sense to also have a blosc2_schunk_open_offset_udio counterpart (especially since this is needed for Python). I removed the blosc2_schunk_open_offset_mmap function.

@FrancescAlted (Member)

I am actually not sure about the difference between a lazychunk and a normal chunk and whether it only contains metadata (header, trailer) or also actual chunk data. If it is only metadata, then it probably won't matter since metadata should be small compared to the user data. If this function, however, copies the user data as well, then it might have an effect.

Right, a lazychunk only contains metadata (header of the chunk + index to where blocks start), so I don't think having an additional memcpy is too bad.

Write Operations Our main use case (and the original issue) is about reading files and not about writing (we usually write offline and copy the compressed data to the target node). Writing on memory-mapped files is more complex than reading because for writing we ideally need to know the expected file size in advance. In terms of blosc, this is not suitable because we cannot know in advance how much data should be compressed and how large the file will be after compression.

A common solution is to make the memory-mapping larger than the actual file mapping which is also what I did. The default is to start with 1 GB. If the file gets larger, the file has to be re-mapped where I double the size in each remapping step. I hope this covers most use cases efficiently.

Yep, sounds good to me. If you could provide a benchmark for writing, I'll be happy to contribute figures on whether it is accelerated (and how much) or not.

Latest Benchmark Here is the latest benchmark of the current mmap implementation using NFS files with caching:

blosc default:
Benchmark 1: build/blosc_mmap default test_files_nfs/test_real0.b2nd
  Time (mean ± σ):      34.0 ms ±   2.8 ms    [User: 6.8 ms, System: 22.4 ms]
  Range (min … max):    30.7 ms …  40.8 ms    88 runs

mmap old:
Benchmark 1: build/blosc_mmap mmap test_files_nfs/test_real0.b2nd
  Time (mean ± σ):       8.4 ms ±   0.6 ms    [User: 6.1 ms, System: 2.3 ms]
  Range (min … max):     7.6 ms …  10.8 ms    321 runs

mmap new:
Benchmark 1: build/blosc_mmap_new mmap test_files_nfs/test_real0.b2nd
  Time (mean ± σ):       8.0 ms ±   0.6 ms    [User: 5.7 ms, System: 2.3 ms]
  Range (min … max):     7.2 ms …  10.0 ms    341 runs

A bit sad: the difference between blosc_mmap_new with less memcpy compared to the previous implementation of the benchmark (blosc_mmap mmap) is rather low. However, both are much faster than the current default.

Frankly, I was not expecting too much gain here, but at any rate, I am in favor of changing the API for reading now (this could come in handy for other IO plugins in the future).

Interface To use memory-mapped files, users can either use the new blosc2_schunk_open_offset_mmap function or create a blosc2_io object with the corresponding parameters directly (cf. test_mmap.c). I used blosc2_schunk_open_offset_mmap because blosc2_schunk_open_offset seems to be the only function which is used in Python which passes a filename.

Ok, but we need to be consistent with the C API as well. I have added a comment about this in the review.

@JanSellner (Contributor, Author)

JanSellner commented May 8, 2024

Thank you for your review @FrancescAlted! I added an io argument to sframe_bench for an initial benchmarking. This was a good idea since I also discovered a few more problems along the way:

  1. The FILE_MAP_RESERVE flag on Windows does not work for larger files (larger being > 1 MiB 🙈). After this, it just crashes with access violations. I don't know why but since this flag isn't even documented, I guess it is not supposed to be used. This means that, on Windows, the file size is coupled with the mapping size and for writing with a start size of 1 GiB the size on disk is also directly 1 GiB. I truncate the file in the end after the mapping is closed to bring the file size back to the actual size.

  2. There was a nasty bug when it comes to multithreaded decompression of the data. With the udio interface, the following could happen:

    # thread 1
    io_cb->seek(pos1)
    io_cb->read()
    
    # thread 2
    io_cb->seek(pos2)
    io_cb->read()
    

    Since there is now a single state for the entire file (seek only stores the position), this raced and a thread ended up reading data from the position of another thread. To resolve this issue and to make the udio interface thread-safe, I added a position argument to the read/write functions and dropped the seek usage for those cases (see the sketch after this list).

    There are now only two io_cb->seek (and io_cb->tell) usages left and the only purpose of those is to get the size of the file. If you like, we could also remove seek and tell completely from the udio interface and introduce a new size function instead. Would be a bit cleaner IMHO and also makes sure that no one accidentally uses seek in situations where it shouldn't be used (because it is not thread-safe).

  3. In sframe_bench, I only use mmap for the cframe and not the sframe. Actually, I wasn't aware that the urlpath may also be a folder and that blosc automatically creates multiple chunk files in this folder. Since the current mmap implementation only works with a single open file at a time, which is directly connected to a schunk, this does not work. There is no solution for this at the moment, but I'll have a look next week at how this could be done.
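
To illustrate item 2 in plain POSIX terms (the blosc2 callbacks look different; this only shows why passing the position into the read call makes it thread-safe):

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Racy pattern: the file position is shared state, so two threads doing
       seek-then-read can interleave and read from each other's offset. */
    static size_t read_with_seek(FILE *fp, void *dst, size_t size, long pos) {
      if (fseek(fp, pos, SEEK_SET) != 0)
        return 0;
      return fread(dst, 1, size, fp);
    }

    /* Thread-safe pattern: the position is a parameter of the read itself,
       so no shared cursor is involved (for mmap it is simply map + pos). */
    static ssize_t read_at(int fd, void *dst, size_t size, int64_t pos) {
      return pread(fd, dst, size, (off_t)pos);
    }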

I am actually not sure about the difference between a lazychunk and a normal chunk and whether it only contains metadata (header, trailer) or also actual chunk data. If it is only metadata, then it probably won't matter since metadata should be small compared to the user data. If this function, however, copies the user data as well, then it might have an effect.

Right, a lazychunk only contains metadata (header of the chunk + index to where blocks start), so I don't think having an additional memcpy is too bad.

Ok good. Then this can stay as is.

I also made a few code style changes. btw: as a side note (not directly connected to this PR): have you thought about adding a pre-commit hook with clang-format to this repository to automate and unify the code style in this repo?

@FrancescAlted (Member)

FrancescAlted commented May 10, 2024

Thank you for your review @FrancescAlted! I added an io argument to sframe_bench for an initial benchmarking.

Cool! I have started doing some comparisons, and mmap is better sometimes on my MacBook Air:

francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap)> bench/sframe_bench 1000 insert 1000 io_file
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.860 s.  Processed data: 7.451 GB (8.661 GB/s)
[Cframe Compr] Elapsed time:	  0.789 s.  Processed data: 7.451 GB (9.443 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.8x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.8x)
[Sframe Decompr] Elapsed time:	  0.342 s.  Processed data: 7.451 GB (21.789 GB/s)
[Cframe Decompr] Elapsed time:	  0.335 s.  Processed data: 7.451 GB (22.272 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.362 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.300 s.  Total cframe size: 95103993 bytes
francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap)> bench/sframe_bench 1000 insert 1000 io_mmap
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.861 s.  Processed data: 7.451 GB (8.654 GB/s)
[Cframe Compr] Elapsed time:	  1.071 s.  Processed data: 7.451 GB (6.954 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.8x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.8x)
[Sframe Decompr] Elapsed time:	  0.346 s.  Processed data: 7.451 GB (21.521 GB/s)
[Cframe Decompr] Elapsed time:	  0.205 s.  Processed data: 7.451 GB (36.316 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.370 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.889 s.  Total cframe size: 94133109 bytes

So, the initial appends (compr) seem to work faster with a regular file, whereas reads (decomp) seem significantly faster with mmap, which is good news indeed. However, inserts are definitely slower with mmap (0.300s vs 0.889s).

Same thing with updates, where mmap is significantly slower (0.242s vs 0.661s):

francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap)> bench/sframe_bench 1000 update 1000 io_file
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.852 s.  Processed data: 7.451 GB (8.749 GB/s)
[Cframe Compr] Elapsed time:	  0.776 s.  Processed data: 7.451 GB (9.606 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.8x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.8x)
[Sframe Decompr] Elapsed time:	  0.342 s.  Processed data: 7.451 GB (21.789 GB/s)
[Cframe Decompr] Elapsed time:	  0.336 s.  Processed data: 7.451 GB (22.158 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
******************* Updating 1000 chunks ******************
*******************************************************
[Sframe Update] Elapsed time:	  0.302 s. Total sframe size: 25109385 bytes
[Cframe Update] Elapsed time:	  0.242 s. Total cframe size: 64758702 bytes
francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap)> bench/sframe_bench 1000 update 1000 io_mmap
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.883 s.  Processed data: 7.451 GB (8.437 GB/s)
[Cframe Compr] Elapsed time:	  1.111 s.  Processed data: 7.451 GB (6.706 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.8x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.8x)
[Sframe Decompr] Elapsed time:	  0.356 s.  Processed data: 7.451 GB (20.920 GB/s)
[Cframe Decompr] Elapsed time:	  0.207 s.  Processed data: 7.451 GB (35.964 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
******************* Updating 1000 chunks ******************
*******************************************************
[Sframe Update] Elapsed time:	  0.321 s. Total sframe size: 25046988 bytes
[Cframe Update] Elapsed time:	  0.661 s. Total cframe size: 64750809 bytes

Following that, I assume that mmap is going to accelerate reads significantly, whereas for writes, it can add a significant overhead (at least for MacOS). If that trend is confirmed, I wonder whether choosing the mmap IO when opening files in 'r'ead-only would be a good default.

This was a good idea since I also discovered a few more problems along the way:

  1. The FILE_MAP_RESERVE flag on Windows does not work for larger files (larger being > 1 MiB 🙈). After this, it just crashes with access violations. I don't know why but since this flag isn't even documented, I guess it is not supposed to be used. This means that, on Windows, the file size is coupled with the mapping size and for writing with a start size of 1 GiB the size on disk is also directly 1 GiB. I truncate the file in the end after the mapping is closed to bring the file size back to the actual size.

I see. Well, it is not that great for Windows to reserve 1 GB when you have to write e.g. 100 KB, but that's life.

  1. There was a nasty bug when it comes to multithreaded decompression of the data. With the udio interface, the following could happen:
    # thread 1
    io_cb->seek(pos1)
    io_cb->read()
    
    # thread 2
    io_cb->seek(pos2)
    io_cb->read()
    
    Since there is now a single state for the entire file (seek only stores the position), this raced and a thread ended up reading data from the position of another thread. To resolve this issue and to make the udio interface thread-safe, I added a position argument to the read/write functions and dropped the seek usage for those cases.

Well done!

There are now only two io_cb->seek (and io_cb->tell) usages left and the only purpose of those is to get the size of the file. If you like, we could also remove seek and tell completely from the udio interface and introduce a new size function instead. Would be a bit cleaner IMHO and also makes sure that no one accidentally uses seek in situations where it shouldn't be used (because it is not thread-safe).

Maybe avoiding seek and tell is better, yes. Do you think that the new size function would perform efficiently too? If so, I'm +1 on getting rid of seek/tell entirely.

  1. In sframe_bench, I only use mmap for the cframe and not the sframe. Actually, I wasn't aware that the urlpath may also be a folder and that blosc automatically creates multiple chunk files in this folder.

Yes, there are situations where this is desirable. We blogged about this: https://www.blosc.org/posts/introducing-sparse-frames/

Since the current mmap implementation only works with a single open file at a time, which is directly connected to a schunk, this does not work. There is no solution for this at the moment, but I'll have a look next week at how this could be done.

It would be cool, but no rush.

I also made a few code style changes. btw: as a side note (not directly connected to this PR): have you thought about adding a pre-commit hook with clang-format to this repository to automate and unify the code style in this repo?

Yes, that would be great. Would you like to provide the hook in a different PR? The only condition is that it would enforce the format for new code only, and refuse to change existing code (I value the fact that I can easily trace/blame lines of code).

@JanSellner (Contributor, Author)

Maybe avoiding seek and tell is better, yes. Do you think that the new size function would perform efficiently too? If so, I'm +1 on getting rid of seek/tell entirely.

I removed tell/seek and introduced the size callback. There might be a slight performance benefit in the case of mmap (fewer function calls needed), but since it seems to be used rarely anyway, I don't think it matters.
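
For illustration, a size lookup for plain files can boil down to something like the following (a made-up helper, not the actual blosc2 callback signature); for the mmap backend the size is presumably just a stored field, so even less work is needed:

    #include <stdint.h>
    #include <sys/stat.h>

    /* Return the current file size in bytes, or -1 on error. */
    static int64_t file_size(const char *path) {
      struct stat st;
      if (stat(path, &st) != 0)
        return -1;
      return (int64_t)st.st_size;
    }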

I also made a few code style changes. btw: as a side note (not directly connected to this PR): have you thought about adding a pre-commit hook with clang-format to this repository to automate and unify the code style in this repo?

Yes, that would be great. Would you like to provide the hook in a different PR? The only condition is that it would enforce the format for new code only, and refuse to change existing code (I value the fact that I can easily trace/blame lines of code).

Oh sorry, no, it is not possible to apply formatting changes only to new code, since this would go against the original idea of such a hook, which is to enforce a consistent style throughout the repository without human intervention so that users don't have to worry about style consistency anymore.

However, it actually is possible to address your use case by excluding such formatting commits from git blame. There even seems to be automatic support for this on GitHub. So, if this works for you, we could proceed with it in a separate PR (but after this one^^).

Benchmarking
I did some performance measurements on Linux and Windows (on different machines, so the numbers are not comparable across systems). I noticed that the command bench/sframe_bench 1000 insert 1000 io_file fluctuates a lot when repeated, with quite drastic changes in the resulting numbers. To get more stable results, I added a short helper benchmarking script in bench/aggregate_multiple.py which basically repeats the benchmark multiple times and yields an aggregated output (min times).

Linux
# python bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_file"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      0.513 s.  Processed data: 7.451 GB (9.774 GB/s)
[Cframe Compr] Elapsed time:      0.543 s.  Processed data: 7.451 GB (10.412 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    0.246 s.  Processed data: 7.451 GB (8.814 GB/s)
[Cframe Decompr] Elapsed time:    0.238 s.  Processed data: 7.451 GB (9.072 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:     0.211 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:     0.226 s.  Total cframe size: 93810829 bytes

# python bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      0.536 s.  Processed data: 7.451 GB (10.048 GB/s)
[Cframe Compr] Elapsed time:      0.510 s.  Processed data: 7.451 GB (10.663 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    0.246 s.  Processed data: 7.451 GB (8.639 GB/s)
[Cframe Decompr] Elapsed time:    0.131 s.  Processed data: 7.451 GB (10.022 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:     0.214 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:     0.135 s.  Total cframe size: 93742862 bytes

# python bench/aggregate_multiple.py --command "sframe_bench 1000 update 1000 io_file"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      0.519 s.  Processed data: 7.451 GB (10.371 GB/s)
[Cframe Compr] Elapsed time:      0.522 s.  Processed data: 7.451 GB (10.438 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    0.251 s.  Processed data: 7.451 GB (8.407 GB/s)
[Cframe Decompr] Elapsed time:    0.242 s.  Processed data: 7.451 GB (8.620 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
******************* Updating 1000 chunks ******************
*******************************************************
[Sframe Update] Elapsed time:     0.195 s. Total sframe size: 22890267 bytes
[Cframe Update] Elapsed time:     0.189 s. Total cframe size: 64733735 bytes

# python bench/aggregate_multiple.py --command "sframe_bench 1000 update 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      0.527 s.  Processed data: 7.451 GB (10.156 GB/s)
[Cframe Compr] Elapsed time:      0.502 s.  Processed data: 7.451 GB (9.340 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    0.247 s.  Processed data: 7.451 GB (8.968 GB/s)
[Cframe Decompr] Elapsed time:    0.131 s.  Processed data: 7.451 GB (10.124 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
******************* Updating 1000 chunks ******************
*******************************************************
[Sframe Update] Elapsed time:     0.196 s. Total sframe size: 23005640 bytes
[Cframe Update] Elapsed time:     0.073 s. Total cframe size: 64723280 bytes
Windows
> python bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_file"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      2.878 s.  Processed data: 7.451 GB (2.307 GB/s)
[Cframe Compr] Elapsed time:      2.845 s.  Processed data: 7.451 GB (2.361 GB/s)
Compression super-chunk-sframe: -589934592 -> 64622105 (123.800x)
Compression super-chunk-cframe: -589934592 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    1.544 s.  Processed data: 7.451 GB (4.515 GB/s)
[Cframe Decompr] Elapsed time:    1.538 s.  Processed data: 7.451 GB (4.500 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:     1.905 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:     1.837 s.  Total cframe size: 94327337 bytes

> python bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      3.160 s.  Processed data: 7.451 GB (2.204 GB/s)
[Cframe Compr] Elapsed time:      2.495 s.  Processed data: 7.451 GB (2.646 GB/s)
Compression super-chunk-sframe: -589934592 -> 64622105 (123.800x)
Compression super-chunk-cframe: -589934592 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    1.498 s.  Processed data: 7.451 GB (4.571 GB/s)
[Cframe Decompr] Elapsed time:    0.637 s.  Processed data: 7.451 GB (9.904 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:     2.116 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:     1.200 s.  Total cframe size: 93874138 bytes

> python bench/aggregate_multiple.py --command "sframe_bench 1000 update 1000 io_file"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      2.916 s.  Processed data: 7.451 GB (2.422 GB/s)
[Cframe Compr] Elapsed time:      2.877 s.  Processed data: 7.451 GB (2.474 GB/s)
Compression super-chunk-sframe: -589934592 -> 64622105 (123.800x)
Compression super-chunk-cframe: -589934592 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    1.526 s.  Processed data: 7.451 GB (4.704 GB/s)
[Cframe Decompr] Elapsed time:    1.510 s.  Processed data: 7.451 GB (4.637 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
******************* Updating 1000 chunks ******************
*******************************************************
[Sframe Update] Elapsed time:     1.796 s. Total sframe size: 22818233 bytes
[Cframe Update] Elapsed time:     1.696 s. Total cframe size: 64731171 bytes

> python bench/aggregate_multiple.py --command "sframe_bench 1000 update 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:      3.119 s.  Processed data: 7.451 GB (2.282 GB/s)
[Cframe Compr] Elapsed time:      2.483 s.  Processed data: 7.451 GB (2.768 GB/s)
Compression super-chunk-sframe: -589934592 -> 64622105 (123.800x)
Compression super-chunk-cframe: -589934592 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:    1.480 s.  Processed data: 7.451 GB (4.726 GB/s)
[Cframe Decompr] Elapsed time:    0.620 s.  Processed data: 7.451 GB (10.758 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
******************* Updating 1000 chunks ******************
*******************************************************
[Sframe Update] Elapsed time:     1.817 s. Total sframe size: 23720564 bytes
[Cframe Update] Elapsed time:     0.857 s. Total cframe size: 64729790 bytes

On Linux and Windows, there is quite some performance boost for decompression and inserts. Compression is slightly faster too, but less drastically so (more pronounced on Windows). I am actually a bit surprised that MacOS performs so badly here. Maybe it is just because of the fluctuations? Maybe you could also repeat it on the latest version via python bench/aggregate_multiple.py --command "sframe_bench...?

Given these numbers, I think that mmap could indeed be a good default. Especially since all this is in addition to the original problem of deteriorated NFS performance. However, maybe it is best to first provide the option to the user and then after some transition period this could become the default. Just in case we are missing a problem which only occurs in certain situations or platforms.

Regarding sframe support:
This is getting quite complicated. I first thought about opening and closing the chunk of the sframe once in blosc2_schunk_decompress_chunk, something along these lines:

    if (frame->sframe) {
      frame->fp = sframe_open_chunk(frame->urlpath, nchunk, "rb", schunk->storage->io);
      frame->fp_nchunk = nchunk;
      if (frame->fp == NULL) {
        BLOSC_TRACE_ERROR("Error opening file in: %s", frame->urlpath);
        return BLOSC2_ERROR_FILE_OPEN;
      }
    }
    chunksize = frame_decompress_chunk(schunk->dctx, frame, nchunk, dest, nbytes);
    if (frame->sframe) {
      if (sframe_close_chunk(frame->fp, schunk->storage->io) < 0) {
        return BLOSC2_ERROR_FAILURE;
      }
    }

The main idea was to use this function as clear entry and exit points for the I/O since the purpose of the function blosc2_schunk_decompress_chunk seems to be to decompress the chunk with the number nchunk. However, I noticed that while decompressing one chunk, it also needs access to other chunks due to this line:

nchunk = *(int32_t*)(src + trailer_offset);

Using test_b2nd_insert as an example, while decompressing chunk 3, it also needs access to chunk 5 (from the trailer info). Is this actually intended? It sounds a bit weird to me that for decompressing a certain chunk you also need access to another chunk.

The main problem for mmap sframe support is that this makes it completely unpredictable which chunk should be loaded and, in turn, when to close it (ideally, every chunk should be opened and closed only once). If this is not a bug, then I would propose not using mmap for sframes.

@FrancescAlted (Member)

FrancescAlted commented May 14, 2024

Maybe avoiding seek and tell is better, yes. Do you think that the new size function would perform efficiently too? If so, I'm +1 on getting rid of seek/tell entirely.

I removed tell/seek and introduced the size callback. There might be a slight performance benefit in the case of mmap (fewer function calls needed), but since it seems to be used rarely anyway, I don't think it matters.

Probably not, but it has been a good exercise to check that tell/seek is not needed. Thanks.

Oh sorry, no, it is not possible to apply formatting changes only to new code, since this would go against the original idea of such a hook, which is to enforce a consistent style throughout the repository without human intervention so that users don't have to worry about style consistency anymore.

However, it actually is possible to address your use case by excluding such formatting commits from git blame. There even seems to be automatic support for this on GitHub. So, if this works for you, we could proceed with it in a separate PR (but after this one^^).

Ok, thanks for the detective work. However, I think I still prefer to keep things manually formatted for now. But this can be good advice to keep in mind for the future.

Benchmarking I did some performance measurements on Linux and Windows (on different machines, so the numbers are not comparable across systems). I noticed that the command bench/sframe_bench 1000 insert 1000 io_file fluctuates a lot when repeated, with quite drastic changes in the resulting numbers. To get more stable results, I added a short helper benchmarking script in bench/aggregate_multiple.py which basically repeats the benchmark multiple times and yields an aggregated output (min times).

Cool. Here are my numbers. First, for my Linux box (i13900K):

Linux
(python-blosc2) faltet@beast:~/blosc/c-blosc2/build$ python ../bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_file"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.412 s.  Processed data: 7.451 GB (14.252 GB/s)
[Cframe Compr] Elapsed time:	  0.397 s.  Processed data: 7.451 GB (14.868 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:	  0.212 s.  Processed data: 7.451 GB (22.417 GB/s)
[Cframe Decompr] Elapsed time:	  0.209 s.  Processed data: 7.451 GB (30.835 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.085 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.082 s.  Total cframe size: 93744057 bytes

(python-blosc2) faltet@beast:~/blosc/c-blosc2/build$ python ../bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.406 s.  Processed data: 7.451 GB (13.934 GB/s)
[Cframe Compr] Elapsed time:	  0.399 s.  Processed data: 7.451 GB (14.314 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:	  0.214 s.  Processed data: 7.451 GB (30.680 GB/s)
[Cframe Decompr] Elapsed time:	  0.179 s.  Processed data: 7.451 GB (38.305 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.085 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.064 s.  Total cframe size: 93552663 bytes

Then for my MacBook Air (ARM M1):

MacOS
francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap) [2]> python ../bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_file"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.823 s.  Processed data: 7.451 GB (6.923 GB/s)
[Cframe Compr] Elapsed time:	  0.754 s.  Processed data: 7.451 GB (7.782 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:	  0.341 s.  Processed data: 7.451 GB (12.302 GB/s)
[Cframe Decompr] Elapsed time:	  0.336 s.  Processed data: 7.451 GB (12.888 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.350 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.266 s.  Total cframe size: 93874491 bytes

francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap)> python ../bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.823 s.  Processed data: 7.451 GB (7.709 GB/s)
[Cframe Compr] Elapsed time:	  1.065 s.  Processed data: 7.451 GB (6.287 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:	  0.345 s.  Processed data: 7.451 GB (20.624 GB/s)
[Cframe Decompr] Elapsed time:	  0.205 s.  Processed data: 7.451 GB (34.856 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.356 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.852 s.  Total cframe size: 93939951 bytes

So, it looks like MacOS still has some quirk regarding write (compression) performance.

On Linux and Windows, there is quite some performance boost for decompression and inserts. Compression is slightly faster too, but less drastically so (more pronounced on Windows). I am actually a bit surprised that MacOS performs so badly here. Maybe it is just because of the fluctuations? Maybe you could also repeat it on the latest version via python bench/aggregate_multiple.py --command "sframe_bench...?

Given these numbers, I think that mmap could indeed be a good default. Especially since all this is in addition to the original problem of deteriorated NFS performance. However, maybe it is best to first provide the option to the user and then after some transition period this could become the default. Just in case we are missing a problem which only occurs in certain situations or platforms.

+1. Let's provide the mmap IO as an option first and, as time goes by, decide on promoting mmap to the default (which would be quite a no-brainer if/when MacOSX performs well for writes). Another possibility is to use mmap for reads and regular stdio for writes. But yeah, it is a bit too soon for this.

Regarding sframe support: This is getting quite complicated. I first thought about opening and closing the chunk of the sframe once in blosc2_schunk_decompress_chunk, something along these lines:
...

nchunk = *(int32_t*)(src + trailer_offset);

Using test_b2nd_insert as an example, while decompressing chunk 3, it also needs access to chunk 5 (from the trailer info). Is this actually intended? It sounds a bit weird to me that for decompressing a certain chunk you also need access to another chunk.

It is intended indeed. The reason is that the chunk index is located right after the chunks, so the procedure to access it is to go to the end of the file to retrieve (and decompress) it, prior to accessing the desired chunk. The rationale is that the index is variable-length, and putting it at the end of the file (near the trailer) is the only way to add more chunks without having to re-write the existing ones.

The main problem for mmap sframe support is that this makes it completely unpredictable which chunk should be loaded and, in turn, when to close it (ideally, every chunk should be opened and closed only once). If this is not a bug, then I would propose not using mmap for sframes.

+1 on not using mmap for sframes. The only issue is that mmap will then only be possible for cframes, possibly giving the user a false impression of when mmap is actually in effect. But if properly documented, this should not be a big problem IMO.

@JanSellner (Contributor, Author)

So, it looks like MacOS still has some quirk regarding write (compression) performance.

Hmm, there is definitely something weird going on with writes on MacOS. If you find some time, maybe you can do some line profiling (I think Xcode has something for this: Instruments). Maybe there is a certain call which causes this. On Linux, I used valgrind's callgrind successfully and observed quite a drop in the cost of the io_cb->write lines when moving from stdio to mmap. In the end, there was nothing in the mmap functions with any significant cost. But I think this is not the case on MacOS.

+1 on not using mmap for sframes. The only issue is that using mmap will only be possible for cframes, and possibly providing the user a false hint on when mmap is actually acting. But if properly documented, this should not be a big problem IMO.

I added an error message when a different file path is about to be opened by the user. This happens for sframes but is also an error in any case.

@FrancescAlted (Member) left a comment

Some comments on docstrings

@FrancescAlted (Member)

So, it looks like MacOS still has some quirk regarding write (compression) performance.

Hmm, there is definitely something weird going on with writes on MacOS. If you find some time, maybe you can do some line profiling (I think Xcode has something for this: Instruments). Maybe there is a certain call which causes this. On Linux, I used valgrind's callgrind successfully and observed quite a drop in the cost of the io_cb->write lines when moving from stdio to mmap. In the end, there was nothing in the mmap functions with any significant cost. But I think this is not the case on MacOS.

Ok. I don't have experience with Xcode. What I have is a profile made with CLion; it is not a line profile, but rather a sampling profile of function calls, but it might help. Here are a couple of screenshots for mmap (bench/sframe_bench 1000 insert 10000 io_mmap):

[screenshots: CLion sampling profile for io_mmap]

And a similar thing, but for io_file (bench/sframe_bench 1000 insert 10000 io_file):

[screenshots: CLion sampling profile for io_file]

Hope this helps.

@JanSellner (Contributor, Author)

Interesting! I suspected that msync might be the problem here. I already noticed on Linux that msync together with MS_SYNC is slow, which is why I switched to MS_ASYNC. With MS_ASYNC, the msync call should return immediately, and this is indeed the case on Linux but not on Mac OS, even though the documentation states otherwise.

In any case, an msync at that place might not be a good idea anyway, since it is the job of the OS to write modified pages to disk when it sees fit. Hence, I moved the msync call to the end, right before unmapping, as a final flush. This also seems to be the recommended way, especially to avoid problems with NFS. Hopefully, this solves the performance issues on Mac OS.
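
In POSIX terms, the change amounts to something like the following (an assumed sketch, not the PR's exact code): no msync per write, only one flush right before the mapping is torn down:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Writes just dirty the mapped pages; the OS writes them back on its own
       schedule. Only one final flush is issued before unmapping (also the
       safer pattern for NFS). */
    static int close_mapping(void *addr, size_t size) {
      if (msync(addr, size, MS_SYNC) != 0)
        return -1;
      return munmap(addr, size);
    }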

@FrancescAlted (Member)

Yep, you nailed it. On my MacBook Air, before applying the latest commit:

francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap)> python ../bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.823 s.  Processed data: 7.451 GB (7.709 GB/s)
[Cframe Compr] Elapsed time:	  1.065 s.  Processed data: 7.451 GB (6.287 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:	  0.345 s.  Processed data: 7.451 GB (20.624 GB/s)
[Cframe Decompr] Elapsed time:	  0.205 s.  Processed data: 7.451 GB (34.856 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.356 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.852 s.  Total cframe size: 93939951 bytes

and after applying it:

francesc@Francescs-MacBook-Air ~/b/c/build (JanSellner-mmap)> python ../bench/aggregate_multiple.py --command "sframe_bench 1000 insert 1000 io_mmap"
Warmup 3/3
Run 30/30
*******************************************************
***** Creating the frame and sframe with 1000 chunks ****
*******************************************************
Test comparison frame vs sframe with 1000 chunks.
[Sframe Compr] Elapsed time:	  0.830 s.  Processed data: 7.451 GB (6.289 GB/s)
[Cframe Compr] Elapsed time:	  0.720 s.  Processed data: 7.451 GB (7.356 GB/s)
Compression super-chunk-sframe: 8000000000 -> 64622105 (123.800x)
Compression super-chunk-cframe: 8000000000 -> 64622105 (123.800x)
[Sframe Decompr] Elapsed time:	  0.347 s.  Processed data: 7.451 GB (8.699 GB/s)
[Cframe Decompr] Elapsed time:	  0.209 s.  Processed data: 7.451 GB (26.105 GB/s)
Decompression successful!
Successful roundtrip!
*******************************************************
****************** Inserting 1000 chunks *****************
*******************************************************
[Sframe Insert] Elapsed time:	  0.344 s.  Total sframe size: 129244210 bytes
[Cframe Insert] Elapsed time:	  0.150 s.  Total cframe size: 94327912 bytes

So, definitely a good performance bump, and better write speed than the default stdio on all major OSes. Congrats!

Before merging, I have a question: why did you choose 1 GB as the initial file size for mmap on Windows? Wouldn't something like 10 MB, or even 100 MB, be enough?

@JanSellner (Contributor, Author)

Nice! Cool that it now also works nicely on Mac OS :-)

1 GB is a bit arbitrary indeed. The general idea is to make it reasonably large so that (potentially costly) remaps are avoided. On Linux and Mac OS, this only consumes virtual address space, which is no problem because there is plenty of it. It is of course a bit sad that on Windows it also costs (temporary) disk space. I just thought it is reasonable to assume that at least 1 GB of free space is available. In principle, this parameter can also be adjusted by the user, and we can fine-tune the default value in the future.

For the Python interface, it might make sense to expose this parameter explicitly so that users are aware of it and can tune it accordingly. If possible, it should be set to cover the expected output size (not always possible, but a rough guess would be fine).

@FrancescAlted (Member)

Ok, makes sense to me. Merging.

@FrancescAlted FrancescAlted merged commit 660cbc5 into Blosc:main May 16, 2024
17 checks passed
@FrancescAlted (Member)

Thanks @JanSellner !
