Change to xz compression for downloads #236

Open
scottcain opened this issue Apr 12, 2022 · 3 comments
Labels
enhancement New feature or request minor WP_1

Comments

@scottcain
Contributor

https://en.wikipedia.org/wiki/XZ_Utils

xz achieves much better compression. This change can take place along with #126 (splitting metadata and fasta).

Thanks to Art for the suggestion.

@scottcain scottcain added the enhancement New feature or request label Apr 12, 2022
@ghost ghost added the WP_1 label Jun 3, 2022
@b-f-chan
Contributor

b-f-chan commented Jun 3, 2022

Good suggestion, thanks @scottcain. We can consider this, but we need you to indicate its priority, as there's a lot of other stuff to fix.

@sifavahora --> It would be good to get @joneubank's opinion on this as well.

@b-f-chan b-f-chan assigned ghost Jun 3, 2022
@scottcain
Contributor Author

While it would save storage space when we create downloads, and save users time when downloading, I'm not sure it is necessarily something we should spend a lot of time on. If it could be done "easily" along with fixing the current download issues, that would be a nice-to-have. Otherwise it can wait for later (as will splitting the metadata from the fasta, a la #126).

@ghost ghost added the minor label Jun 9, 2022
@justincorrigible
Contributor

justincorrigible commented Jun 29, 2022

After some brief research on the subject: while xz is indeed a more effective compression method (in terms of resulting file size), all my findings point to it being substantially slower and much more resource-intensive.

In practice, this would mean three things:

  • users would have to wait painfully long for a large download to even start (a wait that is already quite long as it is). Compression speed directly affects the user experience, whereas download speed is tied to a user's own network connection. A better balance of both is more easily achieved with other compression methods.
  • the algorithm change would demand significantly more resources (e.g., CPU, memory) from the cluster, which may lead to more frequent download failures, which in turn may increase operating costs to mitigate them.
  • assuming the reason for the suggestion is something other than file size (e.g., concerns about connection stability during lengthy downloads: the longer a download takes, the more chances the connection has to drop), there are UX-based alternatives, such as offering multipart downloads with smaller file sizes, which may be more cost-effective in the long term.
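
To make the multipart alternative from the last bullet more concrete, here is a minimal Python sketch. It is not how our download service works today; `split_into_parts`, `compress_parts`, and the `max_part_bytes` limit are hypothetical names chosen for illustration, and gzip is used only because it is the faster method in the benchmarks referenced in this comment.

```python
import gzip


def split_into_parts(records, max_part_bytes):
    """Group records into parts whose uncompressed size does not exceed a limit.

    A single record larger than the limit ends up in a part of its own.
    """
    parts, current, current_size = [], [], 0
    for record in records:
        # Close the current part before it would grow past the limit.
        if current and current_size + len(record) > max_part_bytes:
            parts.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += len(record)
    if current:
        parts.append(current)
    return parts


def compress_parts(records, max_part_bytes, level=6):
    """Compress each part separately so it could be served as its own file."""
    return [
        gzip.compress(b"".join(part), compresslevel=level)
        for part in split_into_parts(records, max_part_bytes)
    ]
```

Each part is an independent archive, so a dropped connection would only cost the user the part in flight rather than the whole download.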

For reference, a few comparison charts:
https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
From that site, of note, a compression speed chart where 1-9 are compression levels:

[image: compression speed chart by compression level]

At the lowest level (1), gzip is about 4x faster than xz. At the highest level (9), gzip is almost 10x faster.
Granted, a balance must be achieved between compression and speed, but the impact of that extra compression benefit would be far costlier in almost every other way than what's currently implemented.

For easier visualisation:

[image: compression benchmark chart]
(from https://www.r-bloggers.com/2015/11/compression-benchmarks-brotli-gzip-xz-bz2/)

That said, I will continue researching in case these stats are somehow inaccurate or badly outdated.
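
One way to check these numbers against our own data would be a quick local measurement. The sketch below uses only Python's standard `gzip` and `lzma` modules; the synthetic FASTA-like input and the helper names (`make_fasta_like`, `benchmark`) are made up for illustration, so real download bundles may behave differently.

```python
import gzip
import lzma
import random
import time


def make_fasta_like(n_records=200, seq_len=1000, seed=0):
    """Build synthetic FASTA-like bytes (made-up data, not a real download)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        lines.append(f">record_{i}")
        lines.append("".join(rng.choice("ACGT") for _ in range(seq_len)))
    return "\n".join(lines).encode("ascii")


def benchmark(data, level):
    """Return compressed size and wall-clock seconds for gzip and xz at one level."""
    results = {}
    for name, compress in (
        ("gzip", lambda d: gzip.compress(d, compresslevel=level)),
        ("xz", lambda d: lzma.compress(d, preset=level)),
    ):
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {"bytes": len(out), "seconds": elapsed}
    return results


if __name__ == "__main__":
    data = make_fasta_like()
    for level in (1, 6, 9):
        for name, stats in benchmark(data, level).items():
            print(f"level={level} {name}: {stats['bytes']} bytes in {stats['seconds']:.4f}s")
```

Timings from a single run on a small input are noisy, so this is only a sanity check, not a replacement for the linked benchmarks.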


Edit: On further thought, if the product owners determine switching to xz is the choice to make and the additional computational resources are allocated for this to be possible, I would advise we revisit how we currently handle downloads.

Currently, we use a modal window that needs to stay open until the download has started. If this initial wait increases without a progress indicator of some sort, it may be better to instead notify users that they'll receive an email with a link to download the file once its bundling is completed, and to warn them that the compression process may take several minutes or hours.
Switching to a much slower compression method may entail extensive additional work and changes, if we want to keep or even improve our users' experience.
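
As a rough sketch of that "bundle in the background, notify when ready" flow, assuming nothing about our actual backend: `bundle`, `request_download`, and the `notify` callback are hypothetical stand-ins, and gzip stands in for whichever compression method is chosen.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def bundle(records):
    """Stand-in for the slow step: compress the requested records."""
    return gzip.compress(b"".join(records))


def request_download(executor, records, notify):
    """Queue the bundling job and return at once, so no modal has to stay open.

    `notify` is a hypothetical callback (for example, something that emails a
    download link); it is called with the archive once bundling has finished.
    Error handling is left out of this sketch.
    """
    future = executor.submit(bundle, records)
    future.add_done_callback(lambda done: notify(done.result()))
    return future


if __name__ == "__main__":
    sent = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        request_download(pool, [b">a\nACGT\n"], lambda archive: sent.append(len(archive)))
    print(f"notifications sent: {len(sent)}")
```

A real implementation would need persistent job state and failure notifications, which is part of the additional work mentioned above.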
