retrieve data by byte offset and line position #36
How exactly are you calling gztool? And wouldn't ratarmount solve your feature request? It has rapidgzip as a backend and exposes a view to the decompressed file via FUSE. You can then use the usual POSIX APIs or …
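For illustration, roughly like this (a sketch with made-up file and mount-point names; the decompressed file should appear under the mount point without the .gz extension, but check the ratarmount README for details):
ratarmount reads.fastq.gz mounted/                               # mount; ratarmount builds/reuses a rapidgzip index
sed -n '2p' mounted/reads.fastq                                  # second line via a normal POSIX read
dd if=mounted/reads.fastq bs=1 skip=1000 count=500 status=none   # 500 bytes starting at uncompressed byte offset 1000
ratarmount -u mounted/                                           # unmount when done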
To get, e.g., just the second line of a file, I do something like …
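For instance (a hypothetical invocation rather than the exact command; it assumes gztool's -L option, which extracts starting from a given uncompressed line, together with a line-aware index):
gztool -L 2 reads.fastq.gz | head -n 1          # print only the second line
gztool -b 1000 reads.fastq.gz | head -c 500     # byte-offset variant: 500 bytes from uncompressed byte 1000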
Ah, OK, I was only thinking about byte offsets. Lines are another matter altogether, and currently the index does not store any line indexing information. However, I am working on a compressed and sparse index format, and this feature request comes just in time to consider adding line information to that format. Alternatively, supporting gztool indexes would also be an option I would consider. I like it when stuff like that (index compatibility) just works for the user. I can imagine adding options like …
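Independent of the exact option names, index compatibility would mean that something like this just works for the user (a sketch with placeholder file names; --import-index as it appears later in this thread):
# given reads.fastq.gz.gzi previously created by gztool:
rapidgzip --count --import-index reads.fastq.gz.gzi reads.fastq.gz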
Understandable. It also takes some runtime to set up a FUSE mount point, and there are other difficulties. You could also do:
python3 -c '
import sys
import ratarmountcore as rmc
a = rmc.open(sys.argv[1])
name = next(iter(a.listDir("/").keys()))
f = a.open(a.getFileInfo("/" + name))
f.seek(10)
sys.stdout.buffer.write(f.read(5))
' test.py.gz
The rapidgzip Python bindings can also be used directly, but then index saving and loading would have to be managed manually. I am aware that this is too cumbersome as a solution. It would be nicer to have an option directly in the rapidgzip CLI.
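For completeness, the direct-bindings variant would look roughly like this (a sketch; rapidgzip.open and the parallelization argument are taken from the rapidgzip README, and index export/import happens via additional methods on the returned object that would have to be called manually):
python3 -c '
import sys
import rapidgzip
# Parallel, seekable decompression; the object behaves like a regular binary file.
with rapidgzip.open(sys.argv[1], parallelization=8) as f:
    f.seek(10)
    sys.stdout.buffer.write(f.read(5))
' test.py.gz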
No. That was only about the deflate-block searching heuristic, which wasn't stable and fast enough to work with anything other than "printable" characters and even then might fail because of false positives.
Sounds great. In the meantime I made some interesting observations; I wonder whether you could comment on them (or even include them in your benchmarks). Even though I did not turn off NUMA, control for cache effects, or keep CPU frequencies constant, I can say that the throughput/bandwidth when piping decompressed FASTQ data back into compression tools heavily depends on the tool used. Whereas I was not able to achieve more than 110 MB/s with pigz (76 threads), the maximum throughput of 400 MB/s was reached by using rapidgzip with 8 threads streamed into bgzip with 32 threads.
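For reference, the pipelines I mean look roughly like this (placeholder file names; the parallelism flags are assumptions, so check rapidgzip --help, bgzip --help, and pigz --help):
rapidgzip -d -c -P 8 reads.fastq.gz | bgzip -@ 32 > reads.bgzip.fastq.gz    # ~400 MB/s in my tests
rapidgzip -d -c reads.fastq.gz | pigz -p 76 > reads.pigz.fastq.gz           # capped at ~110 MB/s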
I don't think that I entirely understand your benchmark. Are you doing …
Exactly. I am on an AMD EPYC 7513 with an NVMe drive that should be capable of 7 GB/s reads. I tried a 5 GB and a 50 GB compressed FASTQ file. CPU load was in line with my parameterization.
Regarding the issue title, I see the following steps:
Roughly, in this order, but I am starting with the API. This was also the reason for asking about the gztool usage. Personally, I dislike the single-letter options. They are hard to read. My current idea for the API would be a single option
This specification has some advantages:
And here is where it becomes complicated:
This took a bit longer and more lines of code than I thought, especially the newline logic, and it could still benefit from many more unit tests, but at least there are some. Issue #10 is still not implemented; therefore, working with large gztool indexes to extract only very small amounts of data is not recommended because the index is always read fully into memory. However, the option to specify multiple ranges should help to alleviate this. Alternatively, the other way around, creating indexes with rapidgzip and accessing lines or byte offsets via gztool, could be a viable option. I have merged it. It would be helpful if you could give it a try; there might still be bugs in the newline features. You can install it with:
python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egg=rapidgzip&subdirectory=python/rapidgzip'
This introduces the new command line arguments …
Wow, you made it, this is super cool: fast decompression with an offset! As far as I can tell, I have not observed any issues so far, but I would like to add some minor recommendations.
Follow-up: index creation from stdin does not work anymore. rapidgzip now complains …
However, I am getting:
when trying:
I'm not sure why you are seeing a different error message. The commit that introduced this bug is mxmlnkn/indexed_bzip2@582d6f7. As you may be able to see in the commit, it was a really dumb copy-paste error...
I have pushed a fix for the stdin problem and will release 0.14.1 after the CI has finished. Thank you for noticing the bug! I have also added a test, so it shouldn't happen again so easily. The other two things, as they are small features and performance optimizations rather than bugfixes, will probably have to wait until the next minor release. Ah, did I mention that the sparseness is also used for the gztool indexes? This is why the new sparseness feature meshes surprisingly well with the gztool support, both newly added in rapidgzip 0.14.0. E.g., try this:
base64 /dev/urandom | head -c $(( 4 * 1024 * 1024 * 1024 )) | gzip > 4GiB-base64.gz
rapidgzip --no-sparse-windows --index-format gztool --export-index 4GiB-base64.gz{.index,} && stat -c %s 4GiB-base64.gz.index
# -> 39 MB (38811851 B)
# 3 MiB spacing instead of the default 10 MiB, for a roughly fair comparison with rapidgzip
gztool -s 3 -I 4GiB-base64.gz{.gzi,} && stat -c %s 4GiB-base64.gz.gzi
# -> 34 MB (33983860 B)
rapidgzip --index-format gztool --export-index 4GiB-base64.gz{.index,} && stat -c %s 4GiB-base64.gz.index
# -> 702 kB (701964)
# The index is compatible with gztool of course, even though it is ~50x smaller!
gztool -b 0 -I 4GiB-base64.gz{.index,} | md5sum
# ACTION: Extract from byte = 0
#
# Index file '4GiB-base64.gz.index' already exists and will be used.
# Processing '4GiB-base64.gz' ...
# Processing index to '4GiB-base64.gz.index'...
# 4.00 GiB (4294967296 Bytes) of data extracted.
#
# 1 files processed
#
# 229af148b81ef63faf7e723d50173d1f -
gzip -d -c 4GiB-base64.gz | md5sum
# 229af148b81ef63faf7e723d50173d1f -
Of course, random base64 data is almost the best case for sparseness ;). For wikidata.json, I observed the mentioned 3x size reduction of the gztool index from ~10 GB.
No, you didn't :) But I was already wondering how you achieved the gztool index size reduction.
I meant: when using the default rapidgzip index and the same number of threads as without an index, the decompression throughput is much higher due to ISA-L. When using the gztool index, I guess the algorithm falls back to your "own custom-written gzip decompression engine". To overcome this, I asked whether it would be possible to make use of both indexes, i.e., use the gztool index to infer the line/byte offset and the rapidgzip index for the actual decompression.
I am sorry to bother you again with stdin-related issues. Using v0.14.1, index export works with the "indexed_gzip" and "gztool" keywords but fails on "gztool-with-lines", giving me the error …
You are right about ISA-L and the slower custom decoder. But ISA-L is also used when importing a gztool index, i.e., gztool indexes should already give all the benefits; there is no reason to combine them with the old index format. There is one edge case that gztool indexes do not work with:
> rapidgzip --index-format gztool --export-index SRR22401085_2.fastq.gz{.index,}
> cat SRR22401085_2.fastq.gz | /home/maximiliank/.local/bin/rapidgzip --count --import-index SRR22401085_2.fastq.gz.index
Traceback (most recent call last):
File "/home/maximiliank/.local/bin/rapidgzip", line 8, in <module>
sys.exit(cli())
^^^^^
File "rapidgzip.pyx", line 685, in rapidgzip.cli
ValueError: Cannot import gztool index without knowing the archive size!
This is a known limitation of the gztool index format: it does not store the total archive size, but rapidgzip kind of needs it and therefore tries to query it from the file, which will not work for stdin input.
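The workaround, presumably, is to pass the archive as a file argument instead of piping it, so that the size can be queried:
rapidgzip --count --import-index SRR22401085_2.fastq.gz.index SRR22401085_2.fastq.gz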
I see. So except for index file size and the use case with …
Yes, that's correct, because I would like to create the index asynchronously while compressing the data and not afterwards by doing …
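In other words, the goal is a single pass along the lines of the following sketch (produce_reads is a placeholder for the data source, and whether stdin works here depends on the chosen index format, as discussed above):
produce_reads | gzip | tee reads.fastq.gz | rapidgzip --count --export-index reads.fastq.gz.index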
Yes.
I see, thanks for the feedback. I'll see what I can do for the next minor release. It might be more difficult than it sounds because of the abstraction layers, but maybe I'll "just" have to move the line counting one level deeper (into ParallelGzipReader or even into the ChunkData, instead of doing it outside in the CLI logic).
I have fixed some of the performance ideas for …
If you are compressing files, I would really recommend bgzip, which should also be a drop-in replacement for gzip and is available via the Debian/Ubuntu package manager. It can also create an index while compressing files, although those indexes do not contain line information. Rapidgzip should also be a lot faster when the input is a bgzip file because it can delegate to ISA-L.
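For example (a sketch; -@ sets the thread count and -i writes a BGZF .gzi index next to the output while compressing, per bgzip's man page):
bgzip -@ 16 -i reads.fastq                           # produces reads.fastq.gz and reads.fastq.gz.gzi in one pass
rapidgzip -d -c -P 16 reads.fastq.gz > /dev/null     # bgzip-compressed input lets rapidgzip delegate to ISA-L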
I have pushed a commit to develop that should make your use case of creating an index from stdin input work. You can try it out with: …
It needs a bit more testing before it will be merged.
I know :) To add to this, bgzip is also much faster than pigz. But be aware of the version you are using, because from v1.16 on, the developers switched to hfile with a default block size of 4 kB when data comes from stdin, which heavily reduces throughput. See the issue I opened here: bgzip performance drop from v1.16. Unfortunately, the pull request that should partially fix this issue by increasing the block size has not been merged yet.
Works like a charm. You also implemented the infinity option for …
Hi,
I guess you are aware of the https://github.com/circulosmeos/gztool project, which allows indexing gz files, including line information, for random-access data retrieval. Since I regularly use rapidgzip for full decompression and gztool for parallel random-access decompression when working with FASTQ files, I would be very happy to have this combined in one tool. Could you imagine implementing it?