feat: support zstd compression in miniostorage #405
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
- Coverage   90.69%   90.06%    -0.63%
==========================================
  Files         394      324       -70
  Lines       12002     9251     -2751
  Branches     2018     1649      -369
==========================================
- Hits        10885     8332     -2553
+ Misses       1028      858      -170
+ Partials       89       61       -28
```

Flags with carried forward coverage won't be shown.
As we have this abstraction in place for all the different storage backends, we should add support for this to the other variants as well. The reason is that the API uses the minio backend to access GCS. I have no idea how that even works, but that's the code that runs in production, and we have to make sure things work out correctly using that.

Other than that, I left some individual comments, mostly in the tests, about:
- The `is_already_gzipped` flag should assume `gzip`
- We should verify that the `application/x-gzip` handling works properly
- This applies zstd decompression only in the "return as `bytes`" codepath, but not in "save to file"

We have gone through a few iterations on the archive service, and there is a bunch of PRs open. Now the circumstances have changed a little, in the sense that it looks like we might want to consolidate the storage layer as only using … IMO, we could:
We want to use zstd compression when compressing files for storage in object storage, because it performs better than gzip, which is what we were using before.

These changes are only being made to the minio storage service because we want to consolidate the storage service functionality into this one, so both worker and API will be using this backend in the future (API was already using this one).

We have to manually decompress the zstd-compressed files in read_file, but HTTPResponse takes care of it for us if the content encoding of the file is gzip.

The is_already_gzipped argument is being deprecated in favour of compression_type and is_compressed, and the ability to pass a str to write_file is also being deprecated. We're keeping track of uses of these via Sentry's capture_message.
Force-pushed from 28c85d4 to 1ad7fa7
shared/storage/minio.py (outdated)

```python
reader: Readable | None = None
if content_encoding == "gzip":
    # HTTPResponse automatically decodes gzipped data for us
```
I wonder if an up-to-date urllib3 would also handle zstd automatically for us. We have quite an outdated urllib3 dependency, primarily for reasons of compatibility with vcr snapshots.

I'm pretty sure that `Content-Encoding: zstd` is properly standardized and transparently handled by browsers by now, for example.
Another thing I remembered is that we should add a test that …

I was looking up …
- using fget_object was unnecessary, since we were streaming the response data regardless
- no need for all the warning logs and sentry stuff; we'll just do a 3-step migration in both API and worker (update shared supporting old behaviour, update {api,worker}, remove old-behaviour support from shared)
- zstandard version pinning can be more flexible
- add a test for content type = application/x-gzip, since there was some specific handling for that in the GCP storage service
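The first point, streaming the response instead of using fget_object's temporary-file download, can be sketched as below. This is a simplified illustration, assuming the response object behaves like the urllib3 `HTTPResponse` returned by minio's `get_object()`; `stream_to_file` and the chunk size value are stand-ins, not the real names.

```python
CHUNK_SIZE = 1024 * 1024  # stand-in for the constant defined in storage/base


def stream_to_file(response, file_obj) -> None:
    """Copy a get_object()-style response into file_obj in fixed-size
    chunks, rather than letting fget_object write a temp file first."""
    try:
        while chunk := response.read(CHUNK_SIZE):
            file_obj.write(chunk)
    finally:
        # minio responses should be released back to the connection pool
        response.release_conn()
```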
- in write_file:
  - the data arg is not BinaryIO; it's actually bytes | str | IO[bytes]. bytes and str are self-explanatory (it's just how it's being used currently, so we must support it); IO[bytes] is there to support file handles opened with "rb" and BytesIO objects
  - start accepting a None value for compression_type, which means no automatic compression even if is_compressed is false
  - do automatic compression using gzip if is_compressed=False and compression_type="gzip"
  - in put_object, set size = -1 and use a part_size of 20MiB. The specific part size is arbitrary; different sources online suggest different numbers. It probably depends on the size of the underlying data we're trying to send, but 20MiB seems like a good flat number to pick for now.
- in read_file:
  - generally reorganize the function to spend less time under the try/except blocks
  - use the CHUNK_SIZE const defined in storage/base for the amount to read from the streams
  - accept IO[bytes] for the file_obj, since we don't use any of the BinaryIO-specific methods
- create GZipStreamReader, which takes an IO[bytes] and implements a read() method that reads a certain amount of bytes from the IO[bytes], compresses whatever it reads using gzip, and returns the result
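The GZipStreamReader described in the last bullet can be sketched as below, assuming each `read()` call emits a standalone gzip member; concatenated members still form a valid multi-member gzip stream. This is an illustration of the idea, not the PR's actual implementation.

```python
import gzip
from typing import IO


class GZipStreamReader:
    """Sketch: wrap an uncompressed IO[bytes] and return gzip-compressed
    bytes from read(), so data can be compressed while streaming."""

    def __init__(self, fileobj: IO[bytes]):
        self._source = fileobj

    def read(self, size: int = -1) -> bytes:
        # read up to `size` uncompressed bytes and compress them; each
        # call produces one gzip member, and the concatenation of all
        # members is a valid multi-member gzip stream
        chunk = self._source.read(size)
        if not chunk:
            return b""
        return gzip.compress(chunk)
```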
Nice, this looks great!

Feel free to add a test for gzip chunking, and/or an explicit check for the urllib3 version. Otherwise it's also fine to do those as followups; let's ship this (carefully, to staging first :-)
This is because if urllib3 is >= 2.0.0 and the zstd extra is installed, then it is capable of (and will) decode zstd-encoded data when it's used in get_object. So when we create the MinioStorageService, we check the urllib3 version and whether it's been installed with the zstd extra.

This commit also adds a test to ensure that the gzip compression and decompression used in the GZipStreamReader actually works.
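A capability check along these lines could look like the sketch below. The function name is hypothetical; it checks the installed urllib3 version and whether the `zstandard` package (pulled in by `urllib3[zstd]`) is present, using `importlib.metadata` so nothing needs to be imported at check time.

```python
from importlib import metadata


def urllib3_can_decode_zstd() -> bool:
    """Sketch: urllib3 >= 2.0 transparently decodes Content-Encoding: zstd
    responses, but only when its optional `zstd` extra is installed."""
    try:
        urllib3_major = int(metadata.version("urllib3").split(".")[0])
        metadata.version("zstandard")  # raises if the extra is missing
    except metadata.PackageNotFoundError:
        return False
    return urllib3_major >= 2
```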
Instead of doing a 0-100 launch of the new minio storage service, I'd like to ship it incrementally using a feature flag. So if a repoid is passed to the get_appropriate_storage_service function and the chosen storage is minio, it will check the use_new_minio feature to decide whether to use the new or the old minio storage service. As mentioned, this is decided via the repoid (to reduce the impact IF it is broken). Changes had to be made to avoid circular imports in the model_utils and rollout_utils files.
Fixes: codecov/engineering-team#2257