Thoughts about the mutual information threshold #259

Open
thodson-usgs opened this issue Feb 7, 2024 · 3 comments
Comments


thodson-usgs commented Feb 7, 2024

I've been running some tests with the BitInfo codec, and it seemed like a good time to revisit whether it's better to trim mutual_information at an "arbitrary" threshold or to use a free-entropy threshold. The former appears to give decent results and better compression, but it might discard some real information. I wanted to open this issue before submitting a PR because I assume others have dug more deeply into this.

Here's the code:

import xarray as xr
from numcodecs import Blosc, BitInfo

ds = xr.tutorial.open_dataset("air_temperature")

# Lossless compressor plus the BitInfo filter, which keeps only the mantissa
# bits needed to preserve 99% of the real information content.
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr("codec.zarr", mode="w", encoding=encoding)

By default, ds.to_zarr will chunk this dataset into 730x7x27 blocks for compression.
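The chunk shape that to_zarr picked can be verified after writing by opening the store directly with zarr (a quick check, independent of the compression comparison below):

import zarr

store = zarr.open_group("codec.zarr", mode="r")
print(store["air"].chunks)  # chunk shape used for compression, e.g. (730, 7, 27)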

Here are the results:

Compression                                     Size
None                                            17 MB
Zstd                                            5.3 MB
Zstd + BitInfo (default tol with factor = 1.1)  1.2 MB
Zstd + BitInfo (free-entropy tol)               2.8 MB
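For context on the "free-entropy tol" row: BitInformation.jl filters out "false information" by keeping only bits whose mutual information exceeds the free entropy of a binomial distribution at a given confidence level, which depends on the number of samples in the chunk. A minimal Python sketch of that threshold (the function names here are illustrative, not the numcodecs API):

import numpy as np
from scipy.stats import norm

def binom_confidence(n, c=0.99):
    # Probability at which a bit stream of length n is distinguishable
    # from a fair coin at confidence level c (normal approximation).
    return min(1.0, 0.5 + norm.ppf(1 - (1 - c) / 2) / (2 * np.sqrt(n)))

def binom_free_entropy(n, c=0.99):
    # Information (in bits) "freed" once a bit is confidently non-random:
    # 1 - H(p, 1 - p); mutual information below this is treated as false.
    p = binom_confidence(n, c)
    if p >= 1.0:
        return 1.0
    return 1 - (-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

n_samples = 730 * 7 * 27  # elements in one chunk of the example above
print(binom_free_entropy(n_samples, 0.99))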

(An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?)

thodson-usgs (Author) commented Feb 7, 2024

Ah, maybe aspects of this are resolved by #234

milankl (Collaborator) commented Feb 27, 2024

An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?

You can technically do that, but it just gets really expensive. I agree it's a nice academic exercise to see how smooth a continuous field of keepbits would become, but in any practical sense I imagine you'd need to read in and throw away so much memory that I doubt you'd be anywhere near a reasonable compression speed.

In the original paper I experimented with calculating the bitinformation in various directions (lon first, lat first, time first, ensemble dimension first), but I generally found the information in longitude to be effectively an upper bound on the information in other dimensions. Meaning that if you used the information in the vertical to cut down on the false information, then you'd be ignoring additional information that you have in the longitude dimension. This obviously depends heavily on the resolution you have in the various directions, e.g. a high temporal resolution may carry more information than a coarsely resolved longitude.
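One rough way to see this dependence for a given dataset is to compute the bitinformation along each candidate dimension and compare the resulting keepbits. A sketch using the interface the xbitinfo README documents (the exact arguments are an assumption worth checking against the current API):

import xarray as xr
import xbitinfo as xb

ds = xr.tutorial.open_dataset("air_temperature")
for dim in ["lon", "lat", "time"]:
    # Information content of each bit, using neighbours along `dim` as predictor
    info = xb.get_bitinformation(ds, dim=dim)
    keepbits = xb.get_keepbits(info, inflevel=0.99)
    print(dim, keepbits["air"].values)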

But overall I found that in practice you want to compute the bitinformation contiguously in memory; that way you can do it in a single pass, and (at least in BitInformation.jl) it reaches about 100 MB/s, which is a reasonable speed people can work with. If it drops to 1-10 MB/s I see limits for any practical big-data application.

Technically you are statistically predicting the state of a bit given some predictor. For the longitude dimension, that's the same bit position in the previous grid point in longitude. However, you could use any predictor you like, including any bit anywhere in the dataset. But for practical purposes I found that you want the resulting joint probability matrix to be of size 2x2, because for anything else you'd need to count so many bitpair combinations that it easily gets out of hand, and I've seen hardly any evidence that it improves anything.
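A minimal sketch of that 2x2 bookkeeping (illustrative only, not the BitInformation.jl implementation): for one bit position, tabulate the joint probabilities of each bit and the same bit in the neighbouring grid point, then compute the mutual information of that 2x2 matrix.

import numpy as np

def bit_pair_mutual_information(x, bitpos):
    # View float32 data as raw bits and pull out one bit position
    # (0 = least significant mantissa bit, 31 = sign bit).
    bits = (np.ascontiguousarray(x, dtype=np.float32).view(np.uint32) >> bitpos) & 1
    a, b = bits[:-1], bits[1:]  # each bit paired with its neighbour along the axis
    # 2x2 joint probability matrix over bit pairs (00, 01, 10, 11)
    joint = np.array([[np.mean((a == i) & (b == j)) for j in (0, 1)] for i in (0, 1)])
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)  # marginal probabilities
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (pa[i] * pb[j]))
    return mi

x = np.cumsum(np.random.randn(10_000)).astype(np.float32)  # smooth-ish 1-D field
print([round(bit_pair_mutual_information(x, b), 3) for b in (31, 30, 23, 10, 0)])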

thodson-usgs (Author) commented Feb 28, 2024

Those are good insights. I'll look for lower fruit.

Regarding my initial test: I altered the threshold to 1.1 based on performance with my dataset. Later I realized that the problem wasn't the threshold but rather that the data were quantized.
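For reference, a quick way to check for that kind of quantization (a sketch, not part of any codec): data packed as scaled integers decodes to far fewer distinct values than a genuinely continuous field would have, which skews the bitwise information content.

import numpy as np
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
print(ds["air"].encoding)  # look for an integer dtype, scale_factor, add_offset
values = ds["air"].values.ravel()
print(np.unique(values).size, "unique values out of", values.size)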
