feat: support zstd compression in miniostorage (#405)
* feat: support zstd compression in miniostorage

  We want to use zstd when compressing files for object storage because it performs better than the gzip compression we were using before. These changes are only being made to the minio storage service because we want to consolidate the storage service functionality there, so that both worker and API will use this backend in the future (API was already using it).

  We have to manually decompress zstd-compressed files in read_file, but HTTPResponse takes care of it for us if the content encoding of the file is gzip (see the decompression sketch after this list).

  The is_already_gzipped argument is being deprecated in favour of compression_type and is_compressed, and the ability to pass a str to write_file is also being deprecated. We're keeping track of uses of these with Sentry capture_message.

* fix: address feedback

  - Using fget_object was unnecessary, since we were streaming the response data regardless.
  - No need for all the warning logs and Sentry tracking; we'll just do a 3-step migration in both API and worker (update shared while supporting the old behaviour, update {api,worker}, then remove old-behaviour support from shared).
  - The zstandard version pinning can be more flexible.
  - Add a test for content type = application/x-gzip, since the GCP storage service had some specific handling for it.

* fix: update MinioStorageService

  In write_file:

  - The data arg is not BinaryIO; it's actually bytes | str | IO[bytes]. bytes and str are self-explanatory (that's just how it's being used currently, so we must support it), and IO[bytes] covers file handles opened with "rb" and BytesIO objects.
  - Accept a None value for compression_type, which means no automatic compression even if is_compressed is False.
  - Compress automatically using gzip if is_compressed=False and compression_type="gzip".
  - In put_object, set size = -1 and use a part_size of 20 MiB (see the upload sketch after this list). The specific part size is arbitrary; different sources online suggest different numbers. It probably depends on the size of the underlying data being sent, but 20 MiB seems like a good flat number to pick for now.

  In read_file:

  - Generally reorganize the function to spend less time inside the try/except blocks.
  - Use the CHUNK_SIZE const defined in storage/base for the amount to read from the streams.
  - Accept IO[bytes] for the file_obj, since we don't use any of the BinaryIO-specific methods.

  Also create GZipStreamReader, which takes an IO[bytes] and implements a read() method that reads a given number of bytes from the underlying stream, compresses whatever it read using gzip, and returns the result (see the reader sketch after this list).

* fix(minio): check urllib3 version in read_file

  If urllib3 is >= 2.0.0 and was installed with the zstd extra, it is capable of decoding (and will decode) zstd-encoded data when used in get_object. So when we create the MinioStorageService, we check the urllib3 version and whether it was installed with the zstd extra (see the version-check sketch after this list).

  This commit also adds a test to ensure that the gzip compression and decompression used in the GZipStreamReader actually work.

* feat: add feature flag for new minio storage

  Instead of doing a 0-to-100 launch of the new minio storage service, I'd like to ship it incrementally behind a feature flag. If a repoid is passed to the get_appropriate_storage_service function and the chosen storage is minio, it will check the use_new_minio feature to decide whether to use the new or the old minio storage service (see the routing sketch after this list). As mentioned, this is decided via the repoid (to reduce the impact IF it is broken). Changes had to be made to avoid circular imports in the model_utils and rollout_utils files.

* fix: revert changes to old minio
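For the manual zstd path in read_file, a decompression loop along these lines would do the job. This is a minimal sketch using the zstandard package's streaming API; the CHUNK_SIZE value here is a stand-in for the const defined in storage/base.

```python
from typing import IO

import zstandard

CHUNK_SIZE = 1024 * 1024  # stand-in for the CHUNK_SIZE const in storage/base


def decompress_zstd(raw: IO[bytes], out: IO[bytes]) -> None:
    """Stream-decompress a zstd-encoded object body into `out`."""
    dctx = zstandard.ZstdDecompressor()
    with dctx.stream_reader(raw) as reader:
        # Read fixed-size chunks so large objects never sit fully in memory.
        while chunk := reader.read(CHUNK_SIZE):
            out.write(chunk)
```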
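The reader sketch: GZipStreamReader could look roughly like this; the exact buffering behaviour in shared may differ. Each read() call compresses one chunk into a complete gzip member, and concatenated members form a valid multi-member gzip stream, which standard decompressors handle.

```python
import gzip
import io
from typing import IO


class GZipStreamReader:
    """Expose a gzip-compressed view of an underlying byte stream."""

    def __init__(self, fileobj: IO[bytes]):
        self.fileobj = fileobj

    def read(self, size: int = -1) -> bytes:
        # Pull up to `size` raw bytes and return them gzip-compressed.
        # The compressed chunk's length will differ from `size`.
        chunk = self.fileobj.read(size)
        if not chunk:
            return b""
        return gzip.compress(chunk)


# Round-trip check in the spirit of the test mentioned above:
payload = b"hello " * 1000
reader = GZipStreamReader(io.BytesIO(payload))
compressed = b"".join(iter(lambda: reader.read(4096), b""))
assert gzip.decompress(compressed) == payload  # handles multi-member streams
```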
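The upload sketch: streaming that reader into MinIO hinges on the SDK rule that put_object accepts length=-1 only when part_size is given; the 20 MiB figure is the flat number picked above. The bucket, object name, and credentials here are illustrative.

```python
import io

from minio import Minio

PART_SIZE = 20 * 1024 * 1024  # 20 MiB

client = Minio("minio:9000", access_key="...", secret_key="...", secure=False)

data = GZipStreamReader(io.BytesIO(b"raw report contents"))  # reader from the sketch above
client.put_object(
    "archive",            # bucket name (illustrative)
    "path/to/object",     # object name (illustrative)
    data,
    length=-1,            # total size unknown up front: stream the upload
    part_size=PART_SIZE,  # required by the SDK whenever length == -1
)
```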
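The version-check sketch: the capability test at service-creation time could be as simple as the following. Treating "zstd extra installed" as "the zstandard package is importable" is an assumption, though it matches how the urllib3[zstd] extra is packaged.

```python
import urllib3


def urllib3_decodes_zstd() -> bool:
    """True if urllib3 will transparently decode zstd response bodies."""
    # zstd decoding arrived in urllib3 2.0.0 ...
    if int(urllib3.__version__.split(".")[0]) < 2:
        return False
    # ... and only kicks in when the zstd extra's dependency is present.
    try:
        import zstandard  # noqa: F401
    except ImportError:
        return False
    return True
```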
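The routing sketch: the rollout then reduces to a repoid-keyed check inside get_appropriate_storage_service. Everything below except that function name is a stand-in: the flag lookup and the two service classes model the real pieces in shared, whose interfaces may differ.

```python
class MinioStorageService:  # stand-in for the existing implementation
    pass


class NewMinioStorageService:  # stand-in for the new implementation
    pass


def use_new_minio(repoid: int) -> bool:
    # Stand-in for evaluating the `use_new_minio` feature flag by repoid.
    return False


def get_appropriate_storage_service(repoid: int | None = None):
    chosen_storage = "minio"  # would come from site config
    if chosen_storage == "minio" and repoid is not None and use_new_minio(repoid):
        return NewMinioStorageService()
    return MinioStorageService()  # the old service remains the default
```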