-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Low number of keepbits
using get_keepbits
in quick-start.ipynb
#162
Comments
you are referring to the visible contour lines in the first plot I guess we should add such a plot and a difference plot to the quick start are you suggesting to rather just set the last 4 bits to 0 instead of all bits lower that the max 4 last bits? @milankl on discussing the criterion |
The general idea I had is to identify in some way the bit(s) with minimal information, and use this as a reference to say if the other bits have an important amount of information or not. The method could for instance be:
This is working with the assumption that bit info in the mantissa bits should decrease more or less in a monotone way (I will read more on this). So if the bit info start increasing again, we could just discard this. But this might lead to other problems, there are cases where the bit information remains at an appreciable level for all mantissa bits. In general, I see many behaviours for how the bit info evolves along the bit chain. Setting a hard cutoff on mantissa bits beyond the minimum might be risky. Maybe the approach I describe in the next comment (that I found in a notebook by @milankl) is better suited in general. |
In the notebook keepbits_var = argmax(cumsum(bitinf_var)/sum(bitinf_var) .>= 0.99) - 9 # subtract 9 to count mantissa bits (for "var" in {T,q,CO2}). Perhaps you also have something like Using this method in the case I described above, we don't reach the 99% info threshold until the last mantissa bit because of the large information content in the last bits. This method is safer, it will only cut bits in situations where it is clear the bit information content is negligible compared to the rest. I have seen many different behaviours for bitinformation plots depending on the variables, so I'm not sure the method I proposed in my previous comment can be applied systematically. Here is an example for precipitations. I used the |
Unfortunately, your bitwise information has some artificial information included, which becomes evident in the last mantissa bits:
I believe this data was somehow lossily compressed before. For example, any kind of quantisation (include the offset+scale options that some netcdfs use) can introduce an artifical mutual informaiton in the last bits which isn't actually physical. I therefore always advise to use untreated Float32/Float64 data to do the bitwise information content analysis. The algorithm itself doesn't know about binary encodings and just treats every bit position independently. Assume you don't have access to a high precision version of the data: what can you do? 1) Given that your data spans both plus and minus and is also smaller and bigger than 1 (because you have non-zero information in the sign and in the first exponent bit) I would recommend trying to use a signed exponent instead ( So yes, you can just set the 4 last mantissa bits to zero and repeat the 99% threshold calculation -- the 1.5*max idea is more applicable when you have information that never really drops to zero, which we'll come to now:
Precipitation is interesting as it is usually a variable with many zeros but then some hotspots. Many exact zeros means that there is some information even in the last mantissa bits because there is a mutual information going from one grid point to the next. (If it doesn't rain in one grid point it's likely that it doesn't rain in the next either). However, it's not really important to have a comparably high number of keepbits (your analysis says 19 or 21) because rounding doesn't affect the zeros. This is the case where I'd try the c*max(inf in last mantissa bits) because while there is some (presumably real) information in m5 to say m22 it's not really information you are intersted in, and I agree with you m4 looks like a good threshold here. As m23 increases slightly again, I'm wondering whether you are dealing again with data that was previously lossily compressed, please check that. In general: You clearly found two examples of variables/data where the interpretation of the bitwise information content is tricky and not always straight forward. That's definitely one of the bigger caveats with this method. I believe the algorithm is so naive that it's reliable, but how you count the 99% (or any other <100% threshold) especially when the data was previously compressed isn't always straight forwardad and I haven't found a reliable translation between the bitinformation distribution and # of keepbits that works in less ideal cases too. So thanks for joining the discussion here and I hope my comments are helpful to better understand what's going on. |
Exactly. set_zero_insignificant (bool): defaults to True (julia implementation) https://xbitinfo.readthedocs.io/en/latest/api/xbitinfo.xbitinfo.get_bitinformation.html#xbitinfo.xbitinfo.get_bitinformation |
My rule of thumb that nearly always works: Take 99% and then two additional bits to be on the safe side |
Haha, sure, but for @coxipi's precipitation example +2 seems like an overkill, it should be more like -14 😉 |
Haha, that's true, but better be safe than sorry, thanks for the tip! It's a very nice topic theoretically with exciting benefits for data management, so I'm glad I can join the discussion! I'll be working on this in the next few months for , we will be testing this on our datasets. I'll keep you posted. Thank you for your detailed explanations @milankl, I understand better limitations and what to look for. By cutting the last 4 bits in the first example I gave, I get 'm4' or 'm5' (depending on how I exactly I formulate the new condition), so that matches your visual intuition of needing at least one more bit. As for the questions regarding compression:
I applied the first two steps on data that I directly download from The green line is the bit information obtained from the dataset where I applied the first two steps above on the ERA5 dataset, and the red line is the analog dataset's bit information from Ouranos. The bit information are similar, but you can see the red line (Ouranos) is more noisy. I will have to look more into the compression we use when storing the data. Maybe there are other manipulations I have overlooked. There are noticeable differences in some spots (some spots where one dataset is exactly 0 and not the other, and vice-versa). The workaround you discussed for poorer quality data could come in handy, I'll keep this in mind. But ideally, we should try to avoid getting there, so I'll make sure I can use the data earlier in the pipeline! |
@coxipi please keep in mind that ERA5 data is originally written in the GRIB format and even if you select to download the netCDF4 files from e.g. the Copernicus Climate Data Store, the files are just converted from the GRIB files, which are already lossy. As a consequence, the calculation of the mutual information does not provide useful results. |
Ah, got it, thanks! The example dataset # Compressing dTdx and dTdy
import xbitinfo as xb
import xarray as xr
import matplotlib.pyplot as plt
ds = xr.tutorial.load_dataset("air_temperature_gradient")
bi = xb.get_bitinformation(ds, dim="lon", set_zero_insignificant=True)
kb = xb.get_keepbits(bi, 0.99)
ds_br = xb.xr_bitround(ds, kb+2)
fig = plt.figure(figsize=(14,3))
varss = ["dTdx", "dTdy"]
for var in varss :
bi[var].plot(marker="o")
plt.legend(varss)
plt.ylabel("bit info")
fig = plt.figure(figsize=(14,3))
for var in varss :
bi[var].isel(bit32=slice(9,None)).plot(marker="o")
plt.legend(varss)
plt.ylabel("bit info (only Mantissa)")
ti = 0
for var in varss:
fig, (ax1,ax2) = plt.subplots(1,2,dpi=100)
ax1.set_title(f"Original {var}")
ax1.imshow(ds[var].isel(time=ti))
ax2.set_title(f"99% info (keepbits = {kb[var][0]+2})")
ax2.imshow(ds_br[var].isel(time=ti)) |
Just discovered this. I love the discussion, but @coxipi, did you make any more progress on this topic? |
@rsignell-usgs I didn't focus on this aspect lately (still looking at data with lossy compression somewhere in the pipeline as discussed above). The approach "inspecting all bits and keep 99% info" is surely safer. If the profile of bit information doesn't have close to zero info in the trailing bits, then this method will give more strict rounding (higher An alternative approach could be to inspect the bit information profile to find where information starts stabilizing in the trailing bits. This would cover the case of variables where bit information converges to a finite value (zero or non-zero). Some tests could be added to raise a flag if this constant value is suspiciously too high. Even then, it would still leave some probably arbitrary values to determine, like: what is a considered a "small variation" of info that allows to define the plateau. It could be a certain % of the total bit information. Also, once you define a point where the plateau starts, you might find yourself in the same position as before, wanting to add one or two bits to |
Just a note on artificial information (what we call it in the paper):
It's an open problem how to distinguish between artificial and real information without prior knowledge. So if you come up with any ideas, feel free to communicate them. |
Description
I am trying to verify if the library identifies a reasonable number of bits to keep. I am working with the
quick_start.ipynb
notebook.I plotted
ds
andds_rounded
side-by-side for variableu
(I fixedmonth
andlevel
to their first value). I also checked variablev
and get similar issues. I expected that by looking at the plots, I would not be able to tell which dataset had been rounded. Instead, I can clearly see which dataset has been rounded.What I Did
Plotting both original and bitrounded
u
in the notebookquick_start.ipynb
(as taken on current main branch)More info
I'm pretty sure this is related to function
get_keepbits
and_cdf_from_info_per_bit
. Prior to keeping only 99% of bit info, there is preliminary cleaning step where only bits that respect:bit info > 1.5*max(last 4 bit info)
are kept. This process is shown in the following figure:The comment in the
_cdf_from_info_per_bit
function explaining the cleaning step is# set below rounding error from last digit to zero
. Does this mean that the large values of bit info for the last bits are due to this rounding error? The idea behind the procedure is that one should not trust bit info which are of comparable size to these last bit info?I could agree with that, but then, how would we interpret the first comparison figure I show? The original plot seems more physical than the rounded one. If we were to trust
get_keepbits
, should we conclude we should not trust the data too much (beyond the level of precision ofkeepbits
)?Less strict condition
Say that we can still trust mantissa bits with small information content even though the last few bits have suspiciously large information. Could the criteria to trust or not bit information be given with respect to bit information variation rather than its amplitude? We expect that bit information should generally decrease as we look further down the chain (higher mantissa bit number), correct? With this in mind, we could simply identify the bit with minimum information (or the bit with 4th minimal information, to mimic your condition) and drop higher mantissa bits (because the information on the right of this bit will increase, which is the behaviour we consider as problematic). Something more refined, e.g. cutting mantissa bits once a sequence of 3 increasing bit info is observed, could also be considered.
The text was updated successfully, but these errors were encountered: