Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: missing hist flow bins handling #580

Closed
wants to merge 4 commits into from

Conversation

andrzejnovak
Copy link
Member

Currently, when uproot writes hist histograms without flow bins, the property isn't checked and regular edge bins get written as flow bins causing weird issues with reading.

import uproot
import hist
import numpy as np
import mplhep as hep

h = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))

fout = uproot.recreate('test.root')
fout['test'] = h
fout.close()

fin = uproot.open('test.root')
h_read = fin['test']
fin.close()

len(h.values()), len(h_read.values()), len(h.axes[0].edges)

>>> (20, 18, 21)

This PR materializes the empty flow bins as Nans when writing and filters them out again when reading to match the current expected behaviour. The caveat is that when only one flow bin exists when storing, it won't be obvious which one it is when reading the TH1 with values(flow=True) Possibly this should return a masked array instead to be clearer, but I didn't want to change the return type here.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the arrays are the wrong shapes (having flow bins when they're not supposed to or vice-versa), then that would be a bug and it's important to fix them. That is, as long as what happens here is still compatible with UHI.

But—see below—we can't be returning NumPy MaskedArrays. The outputs need to be plain NumPy arrays for compatibility. (MaskedArrays are not as popular as perhaps the NumPy team wants them to be.)

The top priority is getting the shapes right, but once that's done, "imputed" bins should either be zeros or nans, not np.ma.masked. Zeros are natural for histograms, since histogram bins start in this state. A histogram of zeros is an identity in the commutative monoid of histogram-addition. However, nan would be less ambiguous, since a zero could either mean that we filled it and no data landed in the flow bins or it could mean that the bins didn't exist and we had to impute them. However however, nan is contagious: once a nan gets into a formula, everything it touches becomes nan. I think I'd prefer the zeros for their algebraic property.

for i in range(array.ndim)
]
reshape_dim = [len(axis) - numpy.sum(axis) for axis in mask_axes]
return numpy.ma.masked_invalid(array).compressed().reshape(reshape_dim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that everywhere Uproot used to return a NumPy array (values, variances, etc.), it now returns a NumPy MaskedArray?

MaskedArrays are not as compatible as plain NumPy arrays: more libraries are capable of working with np.ndarray than np.ma.MaskedArray. It would be better to fill the empty flow bins with either zeros or nans, but not np.ma.masked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, numpy.ma.masked_invalid(array).compressed() is just another way of writing array[~numpy.isnan(array)]

Copy link
Member

@jpivarski jpivarski Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, good. But then, doesn't this make the shape uncertain? What if there are legitimate nans among the bins?

For instance, this happened:

hist1d->Fill(12.3, std::sqrt(-3.14));   // 12.3 determines which bin, the nan is in the weight

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was a bit shaky, but I think it's reasonably robust now. Checks if all values along the "axis" are nan and then only removes these axes if at edges of the dimension.

There's ambiguity if there's a real flow bin that is nan across all other axes, but I think that's the best we can if we don't have a way to store the actual info about which flow bins exists or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shape ambiguity remains as I mentioned in the opening comment, in case only overflow or underflow is set but not the other.

We have 3 options

  1. live with that ambiguity and have a matching return to hist (current PR)
  2. return Nan/0 padded dimensions and break that convention
  • advantage is that it's clear* which flow bins existed in the original hist
  • disadvantage is breaking the hist convention and possibly some used code (probably not likely, because if anyone did use this with turned off flow bins, they'd raise this issue already)
  1. return a masked array
  • advantage is making clear* which flow bins existed and being still somewhat consistent with the hist return shape-wise
  • disadvantage is returning a new type (probably breaking a bit more code than option 2), but I am not sure too many rely on flow=True in the first place

*clear - in the sense, I mentioned below, ambiguity if all flow bins are supposed to be nans, but I don't think there's a way around it

1 (current PR) is purely a fix without changing the current behaviour. The ambiguity already exists, but this option fixes the situation where TH1.value() doesn't return the real bin values at edges and TH1.value(flow=True) returns edge bins instead of flow bins.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you were pointing out @jpivarski now. Fixed to remove dimensions by index instead, a test has been adapted to reflect that.

@andrzejnovak
Copy link
Member Author

The current PR removes the "padded" flow bins at read time such that hist.Hist.values(flow=True) and TH1.values(flow=True) return the same, which is what the current tests rely on.

Instead of doing that we can return flow bins even if not present in the original hist.Hist padded with 0/Nans. I don't have a particular preference, but it will no longer be consistent with the hist behaviour.

I think masked array would be a way to indicate whether over/underflow was present in the original object, but indeed I assumed that would be causing backcomp issues, so it's not done here.

@jpivarski
Copy link
Member

We want this to be consistent with hist (that's what the UHI is supposed to do). I'm going to ask @henryiii to look at it, because he has a better perspective of histogram consistency.

@jpivarski
Copy link
Member

I don't want to let this go unaddressed. I checked it out today and tried some tests with it; one of the things that I found is that it breaks h_read.to_hist() because the implementation of TH1.to_boost depends on TH1.values(flow=False) having a different shape from TH1.values(flow=True), which this PR eliminates.

Here's a comparison between h_read (an Uproot TH1 object, which always has flow bins because it's from ROOT) and h (a hist histogram with no flow bins):

>>> import uproot, hist, numpy as np, mplhep as hep
>>> 
>>> h = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
>>> 
>>> with uproot.recreate('test.root') as fout:
...      fout['test'] = h
... 
>>> with uproot.open('test.root') as fin:
...      h_read = fin['test']
... 
>>> h.values(flow=False).shape
(20,)
>>> h_read.values(flow=False).shape
(20,)
>>> 
>>> h.values(flow=True).shape
(20,)
>>> h_read.values(flow=True).shape
(20,)
>>> 
>>> h.axes[0].edges.shape
(21,)
>>> h_read.axes[0].edges(flow=False).shape
(21,)
>>> 
>>> h.axes[0].edges.shape
(21,)
>>> h_read.axes[0].edges(flow=True).shape
(23,)

Yet another difference is that edges is a property in hist and a method in Uproot. It has to be a method to take a flow argument, and it needs a flow argument to include or exclude the overflow bins.

>>> h_read.axes[0].edges(flow=False)
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20.])
>>> h_read.axes[0].edges(flow=True)
array([-inf,   0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,
        10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,
        inf])

Clearly, boost-histogram/hist and Uproot are interpreting these differently. I might have thought that Uproot was not adhering to the UHI protocol, but the protocol doesn't specify the flows argument or edges property/method. So they disagree in their interpretation of unspecified extensions.

Digging yet further into this, I noticed that a hist histogram with flow bins does interpret the flow argument the same way Uproot (in main) does:

>>> h2 = hist.new.Reg(20, 0, 20, name='msd', flow=True).Weight().fill(np.random.normal(10, 6, 1000))
>>> h2.values(flow=False).shape
(20,)
>>> h2.values(flow=True).shape
(22,)
>>> h2.axes[0].edges.shape
(21,)

The difference is just that ROOT histograms always have flow bins—you can't turn them off.

Given the difference in data model between n-dimensional, mixed-axis, optional-flow boost-histogram/hist histograms and 1‒3-dimensional, floating-point-axis, required-flow ROOT histograms, it's understandable that serializing through ROOT won't preserve boost-histogram/hist in a lossless way. Lossless serialization is being discussed elsewhere.

So as I currently understand it, this PR is not correct (it's fixing something that's not broken, and breaking it in the process) because in main, Uproot histograms and hist histograms with flow bins agree on the meaning of the flow argument. Uproot histograms and hist histograms do not agree on edges. If hist is to be taken as the standard, then Uproot needs to turn edges into a property with flow always False. These edges are used in TH1.to_numpy, in which the interpretation of the flow bins being bins with infinite area is convenient, but could be reimplemented directly in to_numpy.

@henryiii, is this right?

Even though flow and edges weren't accepted into the specification, could they form some "deuterocanonical" or "optional" part of the spec, perhaps with terminology like "may" instead of "must"? It seems we need something to break the symmetry of "if you're going to implement edges, please do it like this, not that."

@andrzejnovak
Copy link
Member Author

I sadly have to say I don't really see how the behaviour qualifies as non-broken. The default read, without specifying flow is

h.values().shape, h_read.values().shape
>>> ((20,), (18,))

where I would expect both to match

h.axes[0].edges.shape, h_read.axes[0].edges().shape
>>> ((21,), (21,))

I don't think user should be required to externally bookkeep or "duck type" whether flow bins were used or not to read the histogram back correctly. The current behaviour simply treats edge bins as flow bins regardless if they in fact are flow bins. In particular, I find the following combo to be super confusing

h.values(flow=False).shape, h_read.values(flow=True).shape
>>> ((20,), (20,))

~ number of values same for flow=False/True

h.values(flow=False).shape, h_read.values(flow=False).shape
>>> ((20,), (18,))

~ number of values different for flow=False, which is the default. I don't really see a way that on inspecting a random TH1 anyone won't be confused with a histogram of 18 values and 21 edges.

it breaks h_read.to_hist()

Ah, guess a round trip test is missing.

implementation of TH1.to_boost depends on TH1.values(flow=False) having a different shape from TH1.values(flow=True)

This could probably be made to work with the proposed change. Indeed h.values(flow=False) and h.values(flow=True) already have the same shape if flow=False, so there already is an inconsistency to hist.

At a minimum, the flow bins should be filled with nans/zeros if missing at write time to prevent this strangeness. Whether these are then stripped back at read time under the hood of .values() (as proposed in the PR) or just returned plain, I don't have a strong opinion on. The round trip wouldn't be perfect but would hit a steady state after one iteration.

@jpivarski
Copy link
Member

Something seemed amiss at the beginning, though it took me some time to remember this stuff well enough to say what it is. The problem is that my interpretation of the flow argument in values and variances is that it should change the shape of the return value. This PR makes it not change the shape of the return value, breaking my understanding of what this flow argument means.

The ultimate problem here is that flow is not specified. There wasn't agreement among all parties to require flow in UHI, but @henryiii and I both wanted to include it in Uproot and hist, and we thought that we were understanding it in the same way. That was wrong, and the subtleties are just starting to come out now.

The shape of the return value of values and variances depends on flow in hist, not just Uproot, so that part of my understanding is right. However, hist has the added complication that some histograms don't have flow bins.

>>> without_flow = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
>>> with_flow = hist.new.Reg(20, 0, 20, name='msd', flow=True).Weight().fill(np.random.normal(10, 6, 1000))

>>> without_flow.values(flow=False).shape
(20,)
>>> without_flow.values(flow=True).shape
(20,)

>>> with_flow.values(flow=False).shape
(20,)
>>> with_flow.values(flow=True).shape
(22,)

When a hist histogram does have flow bins, flow changes the shape of values. It does that in Uproot's main branch, too, so that's a point of agreement. However, it does not do that in this PR, and that's why I'm saying that it's "breaking" this feature.

The surprise, for me, is that on a hist histogram without flow bins, flow does not change the shape of values. Some fake values have to be added when converting a hist histogram into a TH1 subclass in a ROOT file, and that's what's not happening here (hist satisfies the try block; the except shows what would happen without the flow argument—we'd always assume flow=False):

https://github.com/scikit-hep/uproot4/blob/c968c5fa26047f0e21614854b98e07de7c4789c8/src/uproot/writing/identify.py#L261-L280

Uproot's writing code assumes that obj.values(flow=True) will include the flow bins if they exist and create them if they do not. That assumption is wrong, and hence the above is a bug.

The question is, where's the bug: in Uproot or in hist? One school of thought is that Uproot is right and hist histograms without flow bins should create fake flow bins when flow=True. Another school of thought is that hist is right and Uproot should be the one to create those fake flow bins. We'll need some guidance from the spec: even though UHI isn't going to require adherents to implement a flow argument, it should provide some agreement on how it ought to behave if it is implemented.

I'm with you on the user not being required to know whether a histogram has flow bins: that would be the most useful. I would have thought that the whole reason one wants a flow argument is because they can't remember, don't know, or don't want to know whether a particular histogram has flow bins, so they say flow=True and always get flow bins or flow=False and always not get flow bins. That's the "Uproot is right" school of thought. Talking through this is convincing me—I think I should open an issue on hist (or UHI) to say that it should behave the way I think it should.

The discrepancy with edges being a method or a property is separate, though it's also a case where we'd like guidance from the spec on non-required parts of the spec.

In summary, though, you're absolutely right that users shouldn't have to keep track of this, but this PR is the wrong fix. Uproot and hist are in agreement (at least) that flow=True is supposed to change the shape of values for a histogram with flow bins.

@henryiii
Copy link
Member

henryiii commented Apr 5, 2022

Hist cannot "invent" empty flow bins with flow=True, since both flow=True and flow=False are no-copy operations. You are, in both cases, getting a view on the underlying data, regardless of it it has flow bins (using strides and offsets). If we invented flow bins and filled them with NaNs or 0s (and there would be disagreement on which one), then we'd have to copy the entire histogram's data to do so.

flow=True means "give me the entire histogram, I understand flow bins and how to deal with them", while flow=False is the default because it's simple, requires a less-complex mental model, and is enough for most uses. If you don't have flow bins to begin with, they are equivalent.

@andrzejnovak
Copy link
Member Author

I actually agree with this. I haven't really spent much time with flow bins prior to this issue happening, so never gave it any thought but it seems more intuitive to me that passing flow=True would return the same shape every time whether the flow bins existed or not (still either nans or zeros). Aka "uproot is right"

I am still not sure what you mean that is wrong about the PR. The flow=True/False should still be changing the return shape if the flow bins existed in the first place. As implemented only the nan padded dimensions are stripped at read time and they are padded only if the flow bins were missing.

@jpivarski
Copy link
Member

I had to write that whole issue twice because GitHub lost it with a connectivity flicker the first time. Sorry I missed your comments.

If we invented flow bins and filled them with NaNs or 0s (and there would be disagreement on which one), then we'd have to copy the entire histogram's data to do so.

That is correct: what I'm proposing would make obj.values(flow=True) a copy operation when obj does not have built-in flow bins. Did we say somewhere that values has to be a view?

It's certainly an option to keep hist's behavior the way it is and change Uproot to not assume that hist will create those bins. In fact, it amounts to removing the try block (above) and keeping only the except. That would do the copy that we need to impute flow bins, to fit TH1's data model. (Still, there's a choice of zero versus nan: the code above and I both vote for zero.)

However, this really isn't what I thought the flow argument meant: I thought it meant "ensure there are flow bins when flow=True and ensure there are not flow bins when flow=False, so that I don't need to know if the object has built-in flow bins or not." I thought it was a way of regularizing inputs into a single, explicitly specified format. Since TH1s always have flow bins, I can go on thinking that about Uproot's histograms, but hist's interpretation seems a bit subtle.

I am still not sure what you mean that is wrong about the PR. The flow=True/False should still be changing the return shape if the flow bins existed in the first place.

In this PR (quoting from above),

>>> import uproot, hist, numpy as np, mplhep as hep
>>> 
>>> h = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
>>> 
>>> with uproot.recreate('test.root') as fout:
...      fout['test'] = h
... 
>>> with uproot.open('test.root') as fin:
...      h_read = fin['test']
... 
>>> h.values(flow=False).shape
(20,)
>>> h_read.values(flow=False).shape
(20,)
>>> 
>>> h.values(flow=True).shape
(20,)
>>> h_read.values(flow=True).shape
(20,)

h_read has built-in flow because it is a TH1, but h_read.values(flow=False) and h_read.values(flow=True) both return arrays with the same shape: (20,). Uproot and hist already agree on this point: when a histogram has built-in flow (like h_read), the flow argument should change the shape of the return value. That's what I mean about this PR breaking that feature.

@henryiii
Copy link
Member

henryiii commented Apr 5, 2022

To me, flow=True means "give me all the flow bins you have", not "inject flow bins even if they don't apply". Setting underflow=False or overflow=False on an axis means that flow doesn't apply. If I have a radius, with only overflow, I don't want to suddenly get both underflow and overflow, I just want to get that one-sided overflow bin that is normally hidden. If I have a Boolean axes, I don't want flow bins inject to it - they are meaningless. It is also very useful that it's no copy, as then .view(flow=...) matches .values(flow=...), and .view(...) is guaranteed to be no-copy.

If you really want this behavior, I'd recommend adding .values(flow=True, fill=0) and .values(flow=True, fill=math.nan). This will remove the ambiguity with what to fill it with (which there's already disagreement even with just two opinions), and will not break the (to me) the most important reason to have this argument, the "please give me exactly what you have" case.

A user can always check the axis.traits if they need to handle flow bins differently. It is rather annoying that there's no .edges(flow=True) (I originally had those as methods, but @HDembinski changed them all into properties in scikit-hep/boost-histogram#121 three years ago), but .to_numpy(flow=True) actually gives you the flow=True edges if you don't want to check axes.traits.

@henryiii
Copy link
Member

henryiii commented Apr 5, 2022

This is also undefined for axes that can't have an underflow bin. Do you add an underflow bin for Categorical axes? (This is the same problem as Boolean, but even weirder since you do have one flow bin, but not the other). "I want exactly what is defined" is a very important thing to be able to request, while "I want a array with flow bins regardless of whether they exist or not on the axis" is wasteful even without the copy, and mostly only useful when you don't want to bother to check to see if flow bins exist or not. Given they are meaningless and you are not checking to see if they are defined, are you really even using them in that case?

@andrzejnovak
Copy link
Member Author

So right now we have (main)

# No flow bins
h = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
print(h.values(flow=False).shape, h_read.values(flow=False).shape)
print(h.values(flow=True).shape, h_read.values(flow=True).shape)
>>> (20,) (18,)
(20,) (20,)

# Flow bins 
h2 = hist.new.Reg(20, 0, 20, name='msd', flow=True).Weight().fill(np.random.normal(10, 6, 1000))
print(h2.values(flow=False).shape, h2_read.values(flow=False).shape)
print(h2.values(flow=True).shape, h2_read.values(flow=True).shape)
>>> (20,) (20,)
(22,) (22,)

while this PR proposes

# No flow bins
h = hist.new.Reg(20, 0, 20, name='msd', flow=False).Weight().fill(np.random.normal(10, 6, 1000))
print(h.values(flow=False).shape, h_read.values(flow=False).shape)
print(h.values(flow=True).shape, h_read.values(flow=True).shape)
>>> (20,) (20,)
(20,) (20,)

# Flow bins 
h2 = hist.new.Reg(20, 0, 20, name='msd', flow=True).Weight().fill(np.random.normal(10, 6, 1000))
print(h2.values(flow=False).shape, h2_read.values(flow=False).shape)
print(h2.values(flow=True).shape, h2_read.values(flow=True).shape)
>>> (20,) (20,)
(22,) (22,)

which "fixes" the discrepancy to hist behaviour.

I would say that

the flow argument should change the shape of the return value

is a reasonable expectation only if the flow bins actually exist. This PR maintains that, but in the case of flow bins not being initiated it fixes the current behaviour of treating real edge bins as flow bins, which really needs to go.

Now since ROOT doesn't let us not write out flow bins, we have to add the padding with nans/zeros to fix the underlying problem, but the choice is if we a) strip the padding on reading in uproot to match hist signature (and lose the universality of flow=True changing shape) as proposed in the PR or b) leave the padding in on read (which is internally self-consistent because we had to create it to write the file), preserve the flow=True always changing shape property, but return a different shape than hist does and have an imperfect round-trip hist(flow=False) -> uproot.TH1(flow padded) -> hist(flow=True).

@jpivarski
Copy link
Member

@andrzejnovak, see PR #582, which should replace this one. I'm accepting hist's interpretation of the flow argument as the correct one (see scikit-hep/uhi#58 (comment)), so Uproot has to consume obj.values(flow=True) as an expression that might return an array without flow bins.

That's the Uproot bug that you originally encountered. However, padding and then unpadding is complicating the situation even more—it will likely have unforeseen consequences in other use-cases. To fix just the bug at its source (in histogram-writing and nowhere else), we need to verify that the histogram object actually has flow bins, so that we interpret the values array correctly. Using only the UHI + flow argument extension, this can be done with

diff --git a/src/uproot/writing/identify.py b/src/uproot/writing/identify.py
index f9c7f02f..c56282fa 100644
--- a/src/uproot/writing/identify.py
+++ b/src/uproot/writing/identify.py
@@ -263,6 +263,13 @@ def to_writable(obj):
             data = obj.values(flow=True)
             fSumw2 = obj.variances(flow=True)
 
+            # and flow=True is different from flow=False (obj actually has flow bins)
+            if (
+                data.shape == obj.values(flow=False).shape
+                and fSumw2.shape == obj.variances(flow=True).shape
+            ):
+                raise TypeError
+
         except TypeError:
             # flow=True is not supported, fallback to allocate-and-fill

(Raising TypeError puts it into the code block that handles histograms that don't support the flow argument, which is also what we should do if there are no flow bins in obj.)

PR #582 includes your original test. A hist histogram with no flow bins returns the same thing as its round-tripped histogram with flow=False (this is the default value of flow). A hist histogram with flow bins returns the same thing as its round-tripped histogram with flow=True or flow=False. That's 3 of the 4 logically possible comparisons.

The fourth is to take a hist histogram without flow bins, pass it through a ROOT file, and then ask for values(flow=True). Under hist's interpretation of the flow argument, which we are accepting, this should not return the same result as values(flow=True) on a TH1. values(flow=True) on a histogram with no flow bins does not include flow bins in the output, and values(flow=True) on a histogram with flow bins, as every TH1 would, does include flow bins in the output. This is not the only way in which a hist histogram differs from the TH1 that represents it in a ROOT file: the ROOT conversion is lossy.

For lossless serialization, we need scikit-hep/boost-histogram#726.

@andrzejnovak
Copy link
Member Author

👍 IIUC this corresponds to the above-mentioned option b), the except block takes the no-flow values() and pads zeros at the edges to make it ROOT compliant. Should fix the original issue, thanks.

@jpivarski
Copy link
Member

Yes, this corresponds to the "hist is right" interpretation of the flow argument. Uproot has been choosing to impute missing flow bins with zeros for a long time now, so that's a minimal change. Nothing has been done, so far, about edges being a method, rather than a property.

Thanks for understanding. I didn't want this fix to become a surprise in someone else's workflow, so it has to be as minimal as possible.

@HDembinski
Copy link
Member

A user can always check the axis.traits if they need to handle flow bins differently. It is rather annoying that there's no .edges(flow=True) (I originally had those as methods, but @HDembinski changed them all into properties in scikit-hep/boost-histogram#121 three years ago), but .to_numpy(flow=True) actually gives you the flow=True edges if you don't want to check axes.traits.

The reason for that was that .to_numpy automatically gives you consistent edges and values. When you make edges and values methods with a flow flow argument, you have to make sure they are consistent. Since users rarely need to mess with the flow bins, the current API design is better.

@HDembinski
Copy link
Member

HDembinski commented Apr 9, 2022

This is also undefined for axes that can't have an underflow bin. Do you add an underflow bin for Categorical axes? (This is the same problem as Boolean, but even weirder since you do have one flow bin, but not the other). "I want exactly what is defined" is a very important thing to be able to request, while "I want a array with flow bins regardless of whether they exist or not on the axis" is wasteful even without the copy, and mostly only useful when you don't want to bother to check to see if flow bins exist or not. Given they are meaningless and you are not checking to see if they are defined, are you really even using them in that case?

Having overflow and underflow for a category axis makes no sense, since categories are not ordered. There simply some axes which do not need underflow and/or overflow bins and should not have any.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants