
feat: support zstd compression in miniostorage #405

Merged (7 commits) on Dec 5, 2024

Conversation

@joseph-sentry (Contributor) commented Oct 21, 2024

We want to use zstd compression when compressing files for storage in
object storage, because it performs better than the gzip compression we
were using before.

These changes are only being made to the minio storage service because
we want to consolidate the storage service functionality into it, so
both worker and API will use this backend in the future (API was
already using it).

We have to manually decompress zstd-compressed files in read_file,
but HTTPResponse takes care of decompression for us when the content
encoding of the file is gzip.
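For illustration, a minimal sketch of what that manual zstd decompression could look like (the names response and iter_decompressed are placeholders, not the PR's actual code; it assumes the zstandard package):

import zstandard

def iter_decompressed(response, chunk_size=1024 * 1024):
    # `response` stands in for the urllib3 HTTPResponse returned by
    # Minio's get_object; it only needs to expose a read() method.
    decompressor = zstandard.ZstdDecompressor()
    reader = decompressor.stream_reader(response)
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        yield chunk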

The is_already_gzipped argument is being deprecated in favour of
compression_type and is_compressed, and the ability to pass a str to
write_file is also being deprecated. We keep track of remaining uses of
both via Sentry's capture_message.

Fixes: codecov/engineering-team#2257

codecov bot commented Oct 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.06%. Comparing base (00e90bb) to head (54bfa03).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
- Coverage   90.69%   90.06%   -0.63%     
==========================================
  Files         394      324      -70     
  Lines       12002     9251    -2751     
  Branches     2018     1649     -369     
==========================================
- Hits        10885     8332    -2553     
+ Misses       1028      858     -170     
+ Partials       89       61      -28     
Flag: shared-docker-uploader, Coverage Δ: ?

Flags with carried forward coverage won't be shown.

@Swatinem (Contributor) left a comment

As we have this abstraction in place for all the different storage backends, we should add support for this to the other variants as well.
The reason is that API uses the minio backend to access GCS. I have no idea how that even works, but that's the code that runs in production, and we have to make sure things work out correctly using that.

Other than that, I left some individual comments, mostly in the tests, about:

  • The is_already_gzipped flag should assume gzip
  • We should verify that the application/x-gzip handling works properly
  • This applies zstd decompression only in the "return as bytes" codepath, but not the "save to file" one.

@Swatinem (Contributor) commented:

We have gone a few iterations on the archive service, and there are a bunch of PRs open:

Now the circumstances have changed a little, in the sense that it looks like we might want to consolidate the storage layer to use only minio as the storage adapter, as that already runs against the GCS server in API and staging worker.
So we can just remove all the other storage adapters.

IMO, we could:

@joseph-sentry changed the title from "feat: use zstd instead of gzip for compression" to "feat: support zstd compression in miniostorage" on Nov 26, 2024

reader: Readable | None = None
if content_encoding == "gzip":
    # HTTPResponse automatically decodes gzipped data for us
A contributor commented on this hunk:

I wonder if an up-to-date urllib3 would also handle zstd automatically for us.
We have quite an outdated urllib3 dependency, primarily for reasons of compatibility with vcr snapshots.

I'm pretty sure that Content-Encoding: zstd is properly standardized and transparently handled by browsers by now, for example.
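For reference, a minimal sketch of the transparent gzip handling with urllib3 (the URL is a placeholder; this is not code from the PR): with preload_content=False the body is not read eagerly, and read(decode_content=True) undoes the Content-Encoding.

import urllib3

http = urllib3.PoolManager()
resp = http.request("GET", "https://example.com/some-object", preload_content=False)
# read() decodes the body according to Content-Encoding (e.g. gzip)
data = resp.read(decode_content=True)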

@Swatinem (Contributor) commented:

Another thing I remembered: we should add a test that Content-Encoding: gzip together with Content-Type: application/x-gzip works correctly.
The gcp backend had a workaround specifically for that. I suspect that whatever HTTP abstraction it was using was not doing the transparent decompression correctly, and it also triggered a bogus checksum mismatch error in that case.

@Swatinem (Contributor) commented:

I was looking up Content-Encoding: zstd support:

- using fget_object was unnecessary since we were streaming the response
  data regardless
- no need for all the warning logs and sentry stuff, we'll just do a
  3-step migration in both API and worker (update shared to support the
  old behaviour, update {api,worker}, remove old-behaviour support from
  shared)
- zstandard version pinning can be more flexible
- add a test for content type = application/x-gzip since there was some
  specific handling for that in the GCP storage service
- in write_file:
    - the data arg is not BinaryIO, it's actually bytes | str | IO[bytes];
      bytes and str are self-explanatory, it's just how the function is
      currently being used, so we must support them. IO[bytes] is there to
      support file handles opened with "rb" and BytesIO objects
    - start accepting a None value for compression_type, which means no
      automatic compression even if is_compressed is False
    - do automatic compression using gzip if is_compressed=False and
      compression_type="gzip"
    - in put_object, set size = -1 and use a part_size of 20MiB. The
      specific part size is arbitrary; different sources online suggest
      different numbers. It probably depends on the size of the underlying
      data we're trying to send, but 20MiB seems like a good flat number
      to pick for now.
- in read_file:
    - generally reorganize the function to spend less time inside the
      try/except blocks
    - use the CHUNK_SIZE const defined in storage/base for the amount to
      read from the streams
    - accept IO[bytes] for the file_obj since we don't use any of the
      BinaryIO-specific methods
- create GZipStreamReader, which takes an IO[bytes] and implements a read()
  method that reads a certain amount of bytes from the IO[bytes], compresses
  whatever it reads using gzip, and returns the result (see the sketch after
  this list)
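A rough sketch of what a GZipStreamReader along those lines could look like (illustrative only, built on the standard-library zlib; the class in this PR may differ in detail):

import zlib
from typing import IO

class GZipStreamReader:
    # Wraps a raw byte stream and returns gzip-compressed bytes from read().
    def __init__(self, fileobj: IO[bytes]):
        self._fileobj = fileobj
        # wbits=31 selects the gzip container format for the deflate stream
        self._compressor = zlib.compressobj(wbits=31)
        self._buffer = b""
        self._finished = False

    def read(self, size: int = -1) -> bytes:
        # Compress chunks from the source until `size` compressed bytes are
        # buffered, or the source is exhausted.
        while not self._finished and (size < 0 or len(self._buffer) < size):
            chunk = self._fileobj.read(65536)
            if chunk:
                self._buffer += self._compressor.compress(chunk)
            else:
                self._buffer += self._compressor.flush()
                self._finished = True
        if size < 0:
            out, self._buffer = self._buffer, b""
        else:
            out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out

An object like this can then be handed to Minio's put_object with length=-1 and a fixed part_size, so the client streams the compressed data as multipart chunks without needing to know the final size up front.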
@Swatinem (Contributor) left a comment

Nice, this looks great!

Feel free to add a test for gzip chunking, and/or add an explicit check for the urllib3 version.
Otherwise it's also fine to do those as follow-ups, and let's ship this (carefully, to staging first :-)

This is because if urllib3 is >= 2.0.0 and the zstd extra is installed,
then it is capable of decoding zstd-encoded data (and will do so) when
it's used in get_object.

So when we create the MinioStorageService, we check the urllib3 version
and whether it's been installed with the zstd extra.

This commit also adds a test to ensure that the gzip compression and
decompression used in the GZipStreamReader actually works.
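One plausible shape for that version/extra check (a sketch under the assumption that urllib3 only decodes zstd when it is version 2+ and the zstandard package is importable; the helper name is made up):

import urllib3

def urllib3_can_decode_zstd() -> bool:
    # urllib3 1.x never decodes zstd-encoded responses
    if int(urllib3.__version__.split(".")[0]) < 2:
        return False
    try:
        # the zstd extra (urllib3[zstd]) pulls in the zstandard package
        import zstandard  # noqa: F401
    except ImportError:
        return False
    return True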
Instead of doing a 0-to-100 launch of the new minio storage service, I'd
like to ship it incrementally using a feature flag.

So if a repoid is passed to the get_appropriate_storage_service function
and the chosen storage is minio, it will check the use_new_minio
feature to decide whether to use the new or old minio storage service.

As mentioned, this will be decided via the repoid (to reduce the impact
IF it is broken).

Changes had to be made to avoid circular imports in the model_utils and
rollout_utils files.
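Roughly, the dispatch described above could look like this (a hypothetical sketch: get_storage_config, USE_NEW_MINIO, and the two service class names are placeholders and may not match the real code in shared):

def get_appropriate_storage_service(repoid=None):
    config = get_storage_config()  # placeholder for however config is loaded
    if config.get("chosen_storage") == "minio" and repoid is not None:
        # the use_new_minio flag is evaluated per repoid, so a breakage
        # only affects the repos the flag has been rolled out to
        if USE_NEW_MINIO.check_value(repoid, default=False):
            return NewMinioStorageService(config)
    return MinioStorageService(config)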
@joseph-sentry added this pull request to the merge queue on Dec 5, 2024.
Merged via the queue into main with commit 3b22b03 on Dec 5, 2024; 6 checks passed.
@joseph-sentry deleted the joseph/zstandard branch on December 5, 2024 at 14:51.
Linked issue: Switch Archive compression from zlib to zstd