-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: missing hist flow bins handling #580
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the arrays are the wrong shapes (having flow bins when they're not supposed to or vice-versa), then that would be a bug and it's important to fix them. That is, as long as what happens here is still compatible with UHI.
But—see below—we can't be returning NumPy MaskedArrays. The outputs need to be plain NumPy arrays for compatibility. (MaskedArrays are not as popular as perhaps the NumPy team wants them to be.)
The top priority is getting the shapes right, but once that's done, "imputed" bins should either be zeros or nans, not np.ma.masked
. Zeros are natural for histograms, since histogram bins start in this state. A histogram of zeros is an identity in the commutative monoid of histogram-addition. However, nan would be less ambiguous, since a zero could either mean that we filled it and no data landed in the flow bins or it could mean that the bins didn't exist and we had to impute them. However however, nan is contagious: once a nan gets into a formula, everything it touches becomes nan. I think I'd prefer the zeros for their algebraic property.
src/uproot/behaviors/TH1.py
Outdated
for i in range(array.ndim) | ||
] | ||
reshape_dim = [len(axis) - numpy.sum(axis) for axis in mask_axes] | ||
return numpy.ma.masked_invalid(array).compressed().reshape(reshape_dim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that everywhere Uproot used to return a NumPy array (values
, variances
, etc.), it now returns a NumPy MaskedArray?
MaskedArrays are not as compatible as plain NumPy arrays: more libraries are capable of working with np.ndarray
than np.ma.MaskedArray
. It would be better to fill the empty flow bins with either zeros or nans, but not np.ma.masked
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, numpy.ma.masked_invalid(array).compressed()
is just another way of writing array[~numpy.isnan(array)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, good. But then, doesn't this make the shape uncertain? What if there are legitimate nans among the bins?
For instance, this happened:
hist1d->Fill(12.3, std::sqrt(-3.14)); // 12.3 determines which bin, the nan is in the weight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was a bit shaky, but I think it's reasonably robust now. Checks if all values along the "axis" are nan and then only removes these axes if at edges of the dimension.
There's ambiguity if there's a real flow bin that is nan across all other axes, but I think that's the best we can if we don't have a way to store the actual info about which flow bins exists or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shape ambiguity remains as I mentioned in the opening comment, in case only overflow or underflow is set but not the other.
We have 3 options
- live with that ambiguity and have a matching return to
hist
(current PR) - return Nan/0 padded dimensions and break that convention
- advantage is that it's clear* which flow bins existed in the original hist
- disadvantage is breaking the
hist
convention and possibly some used code (probably not likely, because if anyone did use this with turned off flow bins, they'd raise this issue already)
- return a masked array
- advantage is making clear* which flow bins existed and being still somewhat consistent with the hist return shape-wise
- disadvantage is returning a new type (probably breaking a bit more code than option 2), but I am not sure too many rely on
flow=True
in the first place
*clear - in the sense, I mentioned below, ambiguity if all flow bins are supposed to be nans, but I don't think there's a way around it
1 (current PR) is purely a fix without changing the current behaviour. The ambiguity already exists, but this option fixes the situation where TH1.value()
doesn't return the real bin values at edges and TH1.value(flow=True)
returns edge bins instead of flow bins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you were pointing out @jpivarski now. Fixed to remove dimensions by index instead, a test has been adapted to reflect that.
The current PR removes the "padded" flow bins at read time such that Instead of doing that we can return flow bins even if not present in the original I think masked array would be a way to indicate whether over/underflow was present in the original object, but indeed I assumed that would be causing backcomp issues, so it's not done here. |
We want this to be consistent with hist (that's what the UHI is supposed to do). I'm going to ask @henryiii to look at it, because he has a better perspective of histogram consistency. |
af77fea
to
8691e43
Compare
0493eeb
to
e9db555
Compare
for more information, see https://pre-commit.ci
I don't want to let this go unaddressed. I checked it out today and tried some tests with it; one of the things that I found is that it breaks Here's a comparison between >>> import uproot, hist, numpy as np, mplhep as hep
>>>
>>> h = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
>>>
>>> with uproot.recreate('test.root') as fout:
... fout['test'] = h
...
>>> with uproot.open('test.root') as fin:
... h_read = fin['test']
...
>>> h.values(flow=False).shape
(20,)
>>> h_read.values(flow=False).shape
(20,)
>>>
>>> h.values(flow=True).shape
(20,)
>>> h_read.values(flow=True).shape
(20,)
>>>
>>> h.axes[0].edges.shape
(21,)
>>> h_read.axes[0].edges(flow=False).shape
(21,)
>>>
>>> h.axes[0].edges.shape
(21,)
>>> h_read.axes[0].edges(flow=True).shape
(23,) Yet another difference is that >>> h_read.axes[0].edges(flow=False)
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.,
13., 14., 15., 16., 17., 18., 19., 20.])
>>> h_read.axes[0].edges(flow=True)
array([-inf, 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.,
10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.,
inf]) Clearly, boost-histogram/hist and Uproot are interpreting these differently. I might have thought that Uproot was not adhering to the UHI protocol, but the protocol doesn't specify the Digging yet further into this, I noticed that a hist histogram with flow bins does interpret the >>> h2 = hist.new.Reg(20, 0, 20, name='msd', flow=True).Weight().fill(np.random.normal(10, 6, 1000))
>>> h2.values(flow=False).shape
(20,)
>>> h2.values(flow=True).shape
(22,)
>>> h2.axes[0].edges.shape
(21,) The difference is just that ROOT histograms always have flow bins—you can't turn them off. Given the difference in data model between n-dimensional, mixed-axis, optional-flow boost-histogram/hist histograms and 1‒3-dimensional, floating-point-axis, required-flow ROOT histograms, it's understandable that serializing through ROOT won't preserve boost-histogram/hist in a lossless way. Lossless serialization is being discussed elsewhere. So as I currently understand it, this PR is not correct (it's fixing something that's not broken, and breaking it in the process) because in @henryiii, is this right? Even though |
I sadly have to say I don't really see how the behaviour qualifies as non-broken. The default read, without specifying flow is
where I would expect both to match
I don't think user should be required to externally bookkeep or "duck type" whether flow bins were used or not to read the histogram back correctly. The current behaviour simply treats edge bins as flow bins regardless if they in fact are flow bins. In particular, I find the following combo to be super confusing
~ number of values same for
~ number of values different for
Ah, guess a round trip test is missing.
This could probably be made to work with the proposed change. Indeed At a minimum, the flow bins should be filled with nans/zeros if missing at write time to prevent this strangeness. Whether these are then stripped back at read time under the hood of |
Something seemed amiss at the beginning, though it took me some time to remember this stuff well enough to say what it is. The problem is that my interpretation of the The ultimate problem here is that The shape of the return value of >>> without_flow = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
>>> with_flow = hist.new.Reg(20, 0, 20, name='msd', flow=True).Weight().fill(np.random.normal(10, 6, 1000))
>>> without_flow.values(flow=False).shape
(20,)
>>> without_flow.values(flow=True).shape
(20,)
>>> with_flow.values(flow=False).shape
(20,)
>>> with_flow.values(flow=True).shape
(22,) When a hist histogram does have flow bins, The surprise, for me, is that on a hist histogram without flow bins, Uproot's writing code assumes that The question is, where's the bug: in Uproot or in hist? One school of thought is that Uproot is right and hist histograms without flow bins should create fake flow bins when I'm with you on the user not being required to know whether a histogram has flow bins: that would be the most useful. I would have thought that the whole reason one wants a The discrepancy with In summary, though, you're absolutely right that users shouldn't have to keep track of this, but this PR is the wrong fix. Uproot and hist are in agreement (at least) that |
Hist cannot "invent" empty flow bins with
|
I actually agree with this. I haven't really spent much time with flow bins prior to this issue happening, so never gave it any thought but it seems more intuitive to me that passing I am still not sure what you mean that is wrong about the PR. The |
I had to write that whole issue twice because GitHub lost it with a connectivity flicker the first time. Sorry I missed your comments.
That is correct: what I'm proposing would make It's certainly an option to keep hist's behavior the way it is and change Uproot to not assume that hist will create those bins. In fact, it amounts to removing the However, this really isn't what I thought the
In this PR (quoting from above), >>> import uproot, hist, numpy as np, mplhep as hep
>>>
>>> h = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
>>>
>>> with uproot.recreate('test.root') as fout:
... fout['test'] = h
...
>>> with uproot.open('test.root') as fin:
... h_read = fin['test']
...
>>> h.values(flow=False).shape
(20,)
>>> h_read.values(flow=False).shape
(20,)
>>>
>>> h.values(flow=True).shape
(20,)
>>> h_read.values(flow=True).shape
(20,)
|
To me, If you really want this behavior, I'd recommend adding A user can always check the axis.traits if they need to handle flow bins differently. It is rather annoying that there's no |
This is also undefined for axes that can't have an underflow bin. Do you add an underflow bin for Categorical axes? (This is the same problem as Boolean, but even weirder since you do have one flow bin, but not the other). "I want exactly what is defined" is a very important thing to be able to request, while "I want a array with flow bins regardless of whether they exist or not on the axis" is wasteful even without the copy, and mostly only useful when you don't want to bother to check to see if flow bins exist or not. Given they are meaningless and you are not checking to see if they are defined, are you really even using them in that case? |
So right now we have (main)
while this PR proposes
which "fixes" the discrepancy to I would say that
is a reasonable expectation only if the flow bins actually exist. This PR maintains that, but in the case of flow bins not being initiated it fixes the current behaviour of treating real edge bins as flow bins, which really needs to go. Now since ROOT doesn't let us not write out flow bins, we have to add the padding with nans/zeros to fix the underlying problem, but the choice is if we a) strip the padding on reading in uproot to match hist signature (and lose the universality of |
@andrzejnovak, see PR #582, which should replace this one. I'm accepting hist's interpretation of the That's the Uproot bug that you originally encountered. However, padding and then unpadding is complicating the situation even more—it will likely have unforeseen consequences in other use-cases. To fix just the bug at its source (in histogram-writing and nowhere else), we need to verify that the histogram object actually has flow bins, so that we interpret the diff --git a/src/uproot/writing/identify.py b/src/uproot/writing/identify.py
index f9c7f02f..c56282fa 100644
--- a/src/uproot/writing/identify.py
+++ b/src/uproot/writing/identify.py
@@ -263,6 +263,13 @@ def to_writable(obj):
data = obj.values(flow=True)
fSumw2 = obj.variances(flow=True)
+ # and flow=True is different from flow=False (obj actually has flow bins)
+ if (
+ data.shape == obj.values(flow=False).shape
+ and fSumw2.shape == obj.variances(flow=True).shape
+ ):
+ raise TypeError
+
except TypeError:
# flow=True is not supported, fallback to allocate-and-fill (Raising PR #582 includes your original test. A hist histogram with no flow bins returns the same thing as its round-tripped histogram with The fourth is to take a hist histogram without flow bins, pass it through a ROOT file, and then ask for For lossless serialization, we need scikit-hep/boost-histogram#726. |
👍 IIUC this corresponds to the above-mentioned option b), the except block takes the no-flow |
Yes, this corresponds to the "hist is right" interpretation of the Thanks for understanding. I didn't want this fix to become a surprise in someone else's workflow, so it has to be as minimal as possible. |
The reason for that was that |
Having overflow and underflow for a category axis makes no sense, since categories are not ordered. There simply some axes which do not need underflow and/or overflow bins and should not have any. |
Currently, when
uproot
writeshist
histograms without flow bins, the property isn't checked and regular edge bins get written as flow bins causing weird issues with reading.This PR materializes the empty flow bins as Nans when writing and filters them out again when reading to match the current expected behaviour. The caveat is that when only one flow bin exists when storing, it won't be obvious which one it is when reading the TH1 with
values(flow=True)
Possibly this should return a masked array instead to be clearer, but I didn't want to change the return type here.