Thoughts about the mutual information threshold #259

Open
thodson-usgs opened this issue Feb 7, 2024 · 3 comments
Comments


thodson-usgs commented Feb 7, 2024

I've been running some tests with the BitInfo codec, and it seemed like a good time to revisit whether it's better to trim mutual_information at an "arbitrary" threshold or to use a free-entropy threshold. The former appears to give decent results and better compression, but it might discard some real information. I wanted to open this issue before submitting a PR because I assume others have dug more deeply into this.

Here's the code:

import xarray as xr
from numcodecs import Blosc, BitInfo

ds = xr.tutorial.open_dataset("air_temperature")

# Lossless compressor plus the BitInfo filter, which keeps only the mantissa
# bits needed to preserve 99% of the real information content.
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr("codec.zarr", mode="w", encoding=encoding)

By default, ds.to_zarr will chunk this dataset into 730x7x27 blocks for compression.
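The chunk shape that to_zarr picked can be verified after writing by opening the store directly with zarr (a quick check, independent of the compression comparison below):

import zarr

store = zarr.open_group("codec.zarr", mode="r")
print(store["air"].chunks)  # chunk shape used for compression, e.g. (730, 7, 27)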

Here are the results:

Compression                                     Size
None                                            17 MB
Zstd                                            5.3 MB
Zstd + BitInfo (default tol with factor = 1.1)  1.2 MB
Zstd + BitInfo (free-entropy tol)               2.8 MB
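For context on the "free-entropy tol" row: BitInformation.jl filters out "false information" by keeping only bits whose mutual information exceeds the free entropy of a binomial distribution at a given confidence level, which depends on the number of samples in the chunk. A minimal Python sketch of that threshold (the function names here are illustrative, not the numcodecs API):

import numpy as np
from scipy.stats import norm

def binom_confidence(n, c=0.99):
    # Probability at which a bit stream of length n is distinguishable
    # from a fair coin at confidence level c (normal approximation).
    return min(1.0, 0.5 + norm.ppf(1 - (1 - c) / 2) / (2 * np.sqrt(n)))

def binom_free_entropy(n, c=0.99):
    # Information (in bits) "freed" once a bit is confidently non-random:
    # 1 - H(p, 1 - p); mutual information below this is treated as false.
    p = binom_confidence(n, c)
    if p >= 1.0:
        return 1.0
    return 1 - (-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

n_samples = 730 * 7 * 27  # elements in one chunk of the example above
print(binom_free_entropy(n_samples, 0.99))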

(An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?)

thodson-usgs (Author) commented Feb 7, 2024

Ah, maybe aspects of this are resolved by #234

milankl (Collaborator) commented Feb 27, 2024

An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?

You can technically do that, but it just gets really expensive. I agree it's a nice academic exercise to see how smooth a continuous field of keepbits would become, but in any practical sense I imagine you'd need to read in and throw away so much memory that I doubt you'd be anywhere near a reasonable compression speed.

In the original paper I experimented with calculating the bitinformation in various directions (lon first, lat first, time first, ensemble dimension first), but I generally found the information in longitude to be effectively an upper bound on the information in other dimensions. Meaning that if you used the information in the vertical to cut down on the false information, then you'd be ignoring additional information that you have in the longitude dimension. This obviously depends heavily on the resolution you have in the various directions, e.g. a high temporal resolution may carry more information than a coarsely resolved longitude.
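One rough way to see this dependence for a given dataset is to compute the bitinformation along each candidate dimension and compare the resulting keepbits. A sketch using the interface the xbitinfo README documents (the exact arguments are an assumption worth checking against the current API):

import xarray as xr
import xbitinfo as xb

ds = xr.tutorial.open_dataset("air_temperature")
for dim in ["lon", "lat", "time"]:
    # Information content of each bit, using neighbours along `dim` as predictor
    info = xb.get_bitinformation(ds, dim=dim)
    keepbits = xb.get_keepbits(info, inflevel=0.99)
    print(dim, keepbits["air"].values)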

But overall I found that in practice you want to compute the bitinformation contiguously in memory; that way you can do it in a single pass, and (at least in BitInformation.jl) it reaches about 100 MB/s, which is a reasonable speed people can work with. If it drops to 1-10 MB/s I see limits for any practical big-data application.

Technically you are statistically predicting the state of a bit given some predictor. For the longitude dimension, that's the same bit position in the previous grid point in longitude. However, you could use any predictor you like, including any bit anywhere in the dataset. But for practical purposes I found that you want the resulting joint probability matrix to be of size 2x2, because for anything else you'd need to count so many bitpair combinations that it easily gets out of hand, and I've seen hardly any evidence that it improves anything.
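A minimal sketch of that 2x2 bookkeeping (illustrative only, not the BitInformation.jl implementation): for one bit position, tabulate the joint probabilities of each bit and the same bit in the neighbouring grid point, then compute the mutual information of that 2x2 matrix.

import numpy as np

def bit_pair_mutual_information(x, bitpos):
    # View float32 data as raw bits and pull out one bit position
    # (0 = least significant mantissa bit, 31 = sign bit).
    bits = (np.ascontiguousarray(x, dtype=np.float32).view(np.uint32) >> bitpos) & 1
    a, b = bits[:-1], bits[1:]  # each bit paired with its neighbour along the axis
    # 2x2 joint probability matrix over bit pairs (00, 01, 10, 11)
    joint = np.array([[np.mean((a == i) & (b == j)) for j in (0, 1)] for i in (0, 1)])
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)  # marginal probabilities
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (pa[i] * pb[j]))
    return mi

x = np.cumsum(np.random.randn(10_000)).astype(np.float32)  # smooth-ish 1-D field
print([round(bit_pair_mutual_information(x, b), 3) for b in (31, 30, 23, 10, 0)])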

thodson-usgs (Author) commented Feb 28, 2024

Those are good insights. I'll look for lower fruit.

Regarding my initial test: I altered the threshold to 1.1 based on performance with my dataset. Later I realized that the problem wasn't the threshold but rather that the data were quantized.
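For reference, a quick way to check for that kind of quantization (a sketch, not part of any codec): data packed as scaled integers decodes to far fewer distinct values than a genuinely continuous field would have, which skews the bitwise information content.

import numpy as np
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
print(ds["air"].encoding)  # look for an integer dtype, scale_factor, add_offset
values = ds["air"].values.ravel()
print(np.unique(values).size, "unique values out of", values.size)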
