
feat: support zstd compression in miniostorage #405

Merged (7 commits) on Dec 5, 2024

Conversation

@joseph-sentry (Contributor) commented Oct 21, 2024

We want to use zstd compression when compressing files for storage in
object storage, because it performs better than the gzip compression we
were using before.

These changes are only being made to the minio storage service because
we want to consolidate the storage service functionality into it, so
both worker and API will use this backend in the future (API was
already using it).

We have to manually decompress zstd-compressed files in read_file,
but HTTPResponse takes care of decompression for us when the content
encoding of the file is gzip.
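For illustration, a minimal sketch of what that manual zstd decompression could look like (the names response and iter_decompressed are placeholders, not the PR's actual code; it assumes the zstandard package):

import zstandard

def iter_decompressed(response, chunk_size=1024 * 1024):
    # `response` stands in for the urllib3 HTTPResponse returned by
    # Minio's get_object; it only needs to expose a read() method.
    decompressor = zstandard.ZstdDecompressor()
    reader = decompressor.stream_reader(response)
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        yield chunk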

The is_already_gzipped argument is being deprecated in favour of
compression_type and is_compressed, and the ability to pass a str to
write_file is also being deprecated. We keep track of remaining uses of
both via Sentry's capture_message.

Fixes: codecov/engineering-team#2257

codecov bot commented Oct 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.06%. Comparing base (00e90bb) to head (54bfa03).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #405      +/-   ##
==========================================
- Coverage   90.69%   90.06%   -0.63%     
==========================================
  Files         394      324      -70     
  Lines       12002     9251    -2751     
  Branches     2018     1649     -369     
==========================================
- Hits        10885     8332    -2553     
+ Misses       1028      858     -170     
+ Partials       89       61      -28     
Flag: shared-docker-uploader, Coverage Δ: ?

Flags with carried forward coverage won't be shown.

@Swatinem (Contributor) left a comment

As we have this abstraction in place for all the different storage backends, we should add support for this to the other variants as well.
The reason is that API uses the minio backend to access GCS. I have no idea how that even works, but that's the code that runs in production, and we have to make sure things work out correctly using that.

Other than that, I left some individual comments, mostly in the tests, about:

  • The is_already_gzipped flag should assume gzip
  • We should verify that the application/x-gzip handling works properly
  • This applies zstd decompression only in the "return as bytes" codepath, but not the "save to file" one.

@Swatinem (Contributor) commented:

We have gone a few iterations on the archive service, and there are a bunch of PRs open:

Now the circumstances have changed a little, in the sense that it looks like we might want to consolidate the storage layer to use only minio as the storage adapter, as that already runs against the GCS server in API and staging worker.
So we can just remove all the other storage adapters.

IMO, we could:

@joseph-sentry changed the title from "feat: use zstd instead of gzip for compression" to "feat: support zstd compression in miniostorage" on Nov 26, 2024

reader: Readable | None = None
if content_encoding == "gzip":
    # HTTPResponse automatically decodes gzipped data for us
A contributor commented on this hunk:

I wonder if an up-to-date urllib3 would also handle zstd automatically for us.
We have quite an outdated urllib3 dependency, primarily for reasons of compatibility with vcr snapshots.

I'm pretty sure that Content-Encoding: zstd is properly standardized and transparently handled by browsers by now, for example.
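For reference, a minimal sketch of the transparent gzip handling with urllib3 (the URL is a placeholder; this is not code from the PR): with preload_content=False the body is not read eagerly, and read(decode_content=True) undoes the Content-Encoding.

import urllib3

http = urllib3.PoolManager()
resp = http.request("GET", "https://example.com/some-object", preload_content=False)
# read() decodes the body according to Content-Encoding (e.g. gzip)
data = resp.read(decode_content=True)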

@Swatinem (Contributor) commented:

Another thing I remembered: we should add a test that Content-Encoding: gzip together with Content-Type: application/x-gzip works correctly.
The gcp backend had a workaround specifically for that. I suspect that whatever HTTP abstraction it was using was not doing the transparent decompression correctly, and it also triggered a bogus checksum mismatch error in that case.

@Swatinem (Contributor) commented:

I was looking up Content-Encoding: zstd support:

- using fget_object was unnecessary since we were streaming the response
  data regardless
- no need for all the warning logs and sentry stuff, we'll just do a
  3-step migration in both API and worker (update shared to support the
  old behaviour, update {api,worker}, remove old-behaviour support from
  shared)
- zstandard version pinning can be more flexible
- add a test for content type = application/x-gzip since there was some
  specific handling for that in the GCP storage service
- in write_file:
    - the data arg is not BinaryIO, it's actually bytes | str | IO[bytes];
      bytes and str are self-explanatory, it's just how the function is
      currently being used, so we must support them. IO[bytes] is there to
      support file handles opened with "rb" and BytesIO objects
    - start accepting a None value for compression_type, which means no
      automatic compression even if is_compressed is False
    - do automatic compression using gzip if is_compressed=False and
      compression_type="gzip"
    - in put_object, set size = -1 and use a part_size of 20MiB. The
      specific part size is arbitrary; different sources online suggest
      different numbers. It probably depends on the size of the underlying
      data we're trying to send, but 20MiB seems like a good flat number
      to pick for now.
- in read_file:
    - generally reorganize the function to spend less time inside the
      try/except blocks
    - use the CHUNK_SIZE const defined in storage/base for the amount to
      read from the streams
    - accept IO[bytes] for the file_obj since we don't use any of the
      BinaryIO-specific methods
- create GZipStreamReader, which takes an IO[bytes] and implements a read()
  method that reads a certain amount of bytes from the IO[bytes], compresses
  whatever it reads using gzip, and returns the result (see the sketch after
  this list)
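A rough sketch of what a GZipStreamReader along those lines could look like (illustrative only, built on the standard-library zlib; the class in this PR may differ in detail):

import zlib
from typing import IO

class GZipStreamReader:
    # Wraps a raw byte stream and returns gzip-compressed bytes from read().
    def __init__(self, fileobj: IO[bytes]):
        self._fileobj = fileobj
        # wbits=31 selects the gzip container format for the deflate stream
        self._compressor = zlib.compressobj(wbits=31)
        self._buffer = b""
        self._finished = False

    def read(self, size: int = -1) -> bytes:
        # Compress chunks from the source until `size` compressed bytes are
        # buffered, or the source is exhausted.
        while not self._finished and (size < 0 or len(self._buffer) < size):
            chunk = self._fileobj.read(65536)
            if chunk:
                self._buffer += self._compressor.compress(chunk)
            else:
                self._buffer += self._compressor.flush()
                self._finished = True
        if size < 0:
            out, self._buffer = self._buffer, b""
        else:
            out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out

An object like this can then be handed to Minio's put_object with length=-1 and a fixed part_size, so the client streams the compressed data as multipart chunks without needing to know the final size up front.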
@Swatinem (Contributor) left a comment

Nice, this looks great!

Feel free to add a test for gzip chunking, and/or add an explicit check for the urllib3 version.
Otherwise it's also fine to do those as follow-ups, and let's ship this (carefully, to staging first :-)

This is because if urllib3 is >= 2.0.0 and the zstd extra is installed,
then it is capable of decoding zstd-encoded data (and will do so) when
it's used in get_object.

So when we create the MinioStorageService, we check the urllib3 version
and whether it's been installed with the zstd extra.

This commit also adds a test to ensure that the gzip compression and
decompression used in the GZipStreamReader actually works.
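One plausible shape for that version/extra check (a sketch under the assumption that urllib3 only decodes zstd when it is version 2+ and the zstandard package is importable; the helper name is made up):

import urllib3

def urllib3_can_decode_zstd() -> bool:
    # urllib3 1.x never decodes zstd-encoded responses
    if int(urllib3.__version__.split(".")[0]) < 2:
        return False
    try:
        # the zstd extra (urllib3[zstd]) pulls in the zstandard package
        import zstandard  # noqa: F401
    except ImportError:
        return False
    return True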
Instead of doing a 0-to-100 launch of the new minio storage service, I'd
like to ship it incrementally using a feature flag.

So if a repoid is passed to the get_appropriate_storage_service function
and the chosen storage is minio, it will check the use_new_minio
feature to decide whether to use the new or old minio storage service.

As mentioned, this will be decided via the repoid (to reduce the impact
IF it is broken).

Changes had to be made to avoid circular imports in the model_utils and
rollout_utils files.
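Roughly, the dispatch described above could look like this (a hypothetical sketch: get_storage_config, USE_NEW_MINIO, and the two service class names are placeholders and may not match the real code in shared):

def get_appropriate_storage_service(repoid=None):
    config = get_storage_config()  # placeholder for however config is loaded
    if config.get("chosen_storage") == "minio" and repoid is not None:
        # the use_new_minio flag is evaluated per repoid, so a breakage
        # only affects the repos the flag has been rolled out to
        if USE_NEW_MINIO.check_value(repoid, default=False):
            return NewMinioStorageService(config)
    return MinioStorageService(config)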
@joseph-sentry added this pull request to the merge queue on Dec 5, 2024.
Merged via the queue into main with commit 3b22b03 on Dec 5, 2024; 6 checks passed.
@joseph-sentry deleted the joseph/zstandard branch on December 5, 2024 at 14:51.
Linked issue: Switch Archive compression from zlib to zstd