Thoughts about the mutual information threshold #259
Ah, maybe aspects of this are resolved by #234.
You can technically do that, but it just gets really expensive. I agree it's a nice academic exercise to see how smooth a continuous field of keepbits would become, but in any practical sense I can imagine, you'd need to read in and throw out so much memory that I doubt you'd get anywhere near a reasonable compression speed.

In the original paper I experimented with calculating the bitinformation in various directions (lon first, lat first, time first, ensemble dimension first), but I generally found the information in longitude to be effectively an upper bound on the information in the other dimensions. Meaning that if you used the information in the vertical to cut down on the false information, you'd be ignoring additional information that you have in the longitude dimension. This obviously depends strongly on the resolution you have in the various directions; e.g. a high temporal resolution may carry more information than a coarsely resolved longitude. But overall I found that in practice you want to compute the bitinformation contiguously in memory, so you can do it in a single pass; that way (at least with BitInformation.jl) you reach about 100 MB/s, which is a speed people can reasonably work with. If it drops to 1-10 MB/s I see limits for any practical big-data application.

Technically you are statistically predicting the state of a bit given some predictor. For the longitude dimension, that predictor is the same bit position in the previous grid point along longitude. However, you could use any predictor you like, including any bit anywhere in the dataset. But for practical purposes I found that you want the resulting joint probability matrix to be of size 2x2, because for anything else you'd need to count so many bitpair combinations that it easily gets out of hand, and I've seen hardly any evidence that this improves anything.
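To make the 2x2 joint-probability idea concrete, here is a minimal numpy sketch (not the BitInformation.jl implementation): for one bit position it counts the bitpair combinations between neighbouring grid points along longitude and turns them into a mutual-information estimate. The bit position, dimension order, and random test field are illustrative assumptions.

```python
# Minimal sketch of the 2x2 joint-probability approach described above:
# predict a bit from the same bit position in the previous grid point
# along longitude, and measure the mutual information of that bit pair.
import numpy as np

def bit_mutual_information(data: np.ndarray, bit: int, axis: int = -1) -> float:
    """Mutual information (in bits) between bit `bit` of each value and the
    same bit of its predecessor along `axis` (e.g. longitude)."""
    bits = (data.view(np.uint32) >> bit) & 1             # extract one bit plane
    a = np.moveaxis(bits, axis, -1)[..., :-1].ravel()    # predictor bit
    b = np.moveaxis(bits, axis, -1)[..., 1:].ravel()     # predicted bit
    # 2x2 joint probability matrix of (predecessor, successor) bit pairs
    joint = np.zeros((2, 2))
    for i in (0, 1):
        for j in (0, 1):
            joint[i, j] = np.mean((a == i) & (b == j))
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)        # marginal probabilities
    nz = joint > 0                                       # guard the log
    return float(np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz])))

# Example: bit 22 (most significant mantissa bit of a float32), neighbours in lon
field = np.random.rand(10, 180, 360).astype(np.float32)  # (time, lat, lon)
print(bit_mutual_information(field, bit=22, axis=-1))     # ~0 for random data
```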
Those are good insights. I'll look for lower-hanging fruit. Regarding my initial test: I altered the threshold to 1.1 based on performance with my dataset. Later I realized that the problem wasn't the threshold; rather, the data were quantized.
I've been running some tests with the `bitinfo` codec, and it seemed like a good time to revisit whether it's better to trim `mutual_information` by an "arbitrary" threshold or to use a free-entropy threshold. The former appears to give decent results and better compression, but might be losing some real information. I wanted to open the issue before submitting a PR because I assume others have dug more deeply into this. Here's the code:
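A minimal sketch of the kind of workflow being compared (not the exact snippet from this test): it assumes the `xbitinfo` calls `get_bitinformation`, `get_keepbits`, and `xr_bitround`, an xarray tutorial dataset, `dim="lon"`, and an illustrative cutoff of 1e-3 bits.

```python
# Sketch of a threshold-based workflow; the dataset, the dim choice, and the
# 1e-3 cutoff are placeholders, not values taken from this issue.
import xarray as xr
import xbitinfo as xb

ds = xr.tutorial.load_dataset("air_temperature")       # any floating-point dataset

# Per-bit information (mutual information with the neighbouring grid point)
info_per_bit = xb.get_bitinformation(ds, dim="lon")

# Option A: trim mutual information below an "arbitrary" threshold first
trimmed = info_per_bit.where(info_per_bit > 1e-3, 0.0)
keepbits_trimmed = xb.get_keepbits(trimmed, inflevel=0.99)

# Option B: keepbits straight from the unmodified information,
# relying on whatever insignificance filtering the library applies by default
keepbits_default = xb.get_keepbits(info_per_bit, inflevel=0.99)

# Round and write; fewer keepbits generally means better compression ratios
ds_rounded = xb.xr_bitround(ds, keepbits_trimmed)
ds_rounded.to_zarr("rounded.zarr", mode="w")
```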
By default, `ds.to_zarr` will chunk this dataset into 730x7x27 blocks for compression. Here are the results:
(An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?)
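One hedged way to sketch that convolution idea: for a single bit position, compute agreement with the neighbouring bit along longitude and smooth it with a box filter, giving a per-pixel (rather than per-chunk) predictability map. The window size, bit position, and use of binary entropy are all illustrative assumptions, not part of the existing codec.

```python
# Sketch of a per-pixel information estimate via a uniform (box) filter:
# how predictable is one bit from its longitude neighbour, locally?
import numpy as np
from scipy.ndimage import uniform_filter

def local_bit_predictability(field: np.ndarray, bit: int, window: int = 9) -> np.ndarray:
    """Per-pixel estimate (in bits) of how much the neighbouring bit along the
    last axis (longitude) tells you about this bit."""
    bits = (field.view(np.uint32) >> bit) & 1
    agree = (bits[..., 1:] == bits[..., :-1]).astype(np.float64)  # neighbour agreement
    p = uniform_filter(agree, size=window, mode="nearest")        # local agreement rate
    p = np.clip(p, 1e-12, 1 - 1e-12)
    # 1 minus the binary entropy of the agreement rate: 0 bits at p = 0.5
    # (no predictive skill), approaching 1 bit when agreement is near-certain
    return 1.0 + p * np.log2(p) + (1 - p) * np.log2(1 - p)

field = np.random.rand(180, 360).astype(np.float32)   # (lat, lon), made-up data
info_map = local_bit_predictability(field, bit=22)
print(info_map.shape)  # one value per pixel (minus the trimmed lon edge)
```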