Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: missing hist flow bins handling #580

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 11 additions & 2 deletions src/uproot/behaviors/TH1.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,15 @@
boost_axis_metadata = {"name": "fName", "label": "fTitle"}


def _remove_nan_dims(array):
mask_axes = [
numpy.isnan(array).all(axis=tuple(numpy.delete(numpy.arange(array.ndim), i)))
for i in range(array.ndim)
]
reshape_dim = [len(axis) - numpy.sum(axis) for axis in mask_axes]
return numpy.ma.masked_invalid(array).compressed().reshape(reshape_dim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that everywhere Uproot used to return a NumPy array (values, variances, etc.), it now returns a NumPy MaskedArray?

MaskedArrays are not as compatible as plain NumPy arrays: more libraries are capable of working with np.ndarray than np.ma.MaskedArray. It would be better to fill the empty flow bins with either zeros or nans, but not np.ma.masked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, numpy.ma.masked_invalid(array).compressed() is just another way of writing array[~numpy.isnan(array)]

Copy link
Member

@jpivarski jpivarski Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, good. But then, doesn't this make the shape uncertain? What if there are legitimate nans among the bins?

For instance, this happened:

hist1d->Fill(12.3, std::sqrt(-3.14));   // 12.3 determines which bin, the nan is in the weight

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was a bit shaky, but I think it's reasonably robust now. Checks if all values along the "axis" are nan and then only removes these axes if at edges of the dimension.

There's ambiguity if there's a real flow bin that is nan across all other axes, but I think that's the best we can if we don't have a way to store the actual info about which flow bins exists or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shape ambiguity remains as I mentioned in the opening comment, in case only overflow or underflow is set but not the other.

We have 3 options

  1. live with that ambiguity and have a matching return to hist (current PR)
  2. return Nan/0 padded dimensions and break that convention
  • advantage is that it's clear* which flow bins existed in the original hist
  • disadvantage is breaking the hist convention and possibly some used code (probably not likely, because if anyone did use this with turned off flow bins, they'd raise this issue already)
  1. return a masked array
  • advantage is making clear* which flow bins existed and being still somewhat consistent with the hist return shape-wise
  • disadvantage is returning a new type (probably breaking a bit more code than option 2), but I am not sure too many rely on flow=True in the first place

*clear - in the sense, I mentioned below, ambiguity if all flow bins are supposed to be nans, but I don't think there's a way around it

1 (current PR) is purely a fix without changing the current behaviour. The ambiguity already exists, but this option fixes the situation where TH1.value() doesn't return the real bin values at edges and TH1.value(flow=True) returns edge bins instead of flow bins.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you were pointing out @jpivarski now. Fixed to remove dimensions by index instead, a test has been adapted to reflect that.



def _boost_axis(axis, metadata):
boost_histogram = uproot.extras.boost_histogram()

Expand Down Expand Up @@ -245,7 +254,7 @@ def values(self, flow=False):
self._values = values

if flow:
return values
return _remove_nan_dims(values)
else:
return values[1:-1]

Expand All @@ -268,7 +277,7 @@ def _values_variances(self, flow):
self._variances = variances

if flow:
return values, variances
return _remove_nan_dims(values), _remove_nan_dims(variances)
else:
return values[1:-1], variances[1:-1]

Expand Down
6 changes: 3 additions & 3 deletions src/uproot/behaviors/TH2.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
import numpy

import uproot
from uproot.behaviors.TH1 import boost_axis_metadata, boost_metadata
from uproot.behaviors.TH1 import _remove_nan_dims, boost_axis_metadata, boost_metadata


class TH2(uproot.behaviors.TH1.Histogram):
Expand Down Expand Up @@ -53,7 +53,7 @@ def values(self, flow=False):
self._values = values

if flow:
return values
return _remove_nan_dims(values)
else:
return values[1:-1, 1:-1]

Expand All @@ -76,7 +76,7 @@ def _values_variances(self, flow):
self._variances = variances

if flow:
return values, variances
return _remove_nan_dims(values), _remove_nan_dims(variances)
else:
return values[1:-1, 1:-1], variances[1:-1, 1:-1]

Expand Down
7 changes: 4 additions & 3 deletions src/uproot/behaviors/TH3.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
import numpy

import uproot
from uproot.behaviors.TH1 import boost_axis_metadata, boost_metadata
from uproot.behaviors.TH1 import _remove_nan_dims, boost_axis_metadata, boost_metadata


class TH3(uproot.behaviors.TH1.Histogram):
Expand Down Expand Up @@ -61,7 +61,8 @@ def values(self, flow=False):
self._values = values

if flow:
return values
# return values
return _remove_nan_dims(values)
else:
return values[1:-1, 1:-1, 1:-1]

Expand All @@ -84,7 +85,7 @@ def _values_variances(self, flow):
self._variances = variances

if flow:
return values, variances
return _remove_nan_dims(values), _remove_nan_dims(variances)
else:
return values[1:-1, 1:-1, 1:-1], variances[1:-1, 1:-1, 1:-1]

Expand Down
20 changes: 20 additions & 0 deletions src/uproot/writing/identify.py
Original file line number Diff line number Diff line change
Expand Up @@ -262,6 +262,20 @@ def to_writable(obj):
# using flow=True if supported
data = obj.values(flow=True)
fSumw2 = obj.variances(flow=True)
# pad flow bins
pad_dim = numpy.array(
[
[1 - int(axis.traits.underflow) for axis in obj.axes],
[1 - int(axis.traits.overflow) for axis in obj.axes],
]
).T
if numpy.sum(pad_dim) != 0:
data = numpy.pad(
data, pad_dim, mode="constant", constant_values=numpy.nan
)
fSumw2 = numpy.pad(
fSumw2, pad_dim, mode="constant", constant_values=numpy.nan
)

except TypeError:
# flow=True is not supported, fallback to allocate-and-fill
Expand Down Expand Up @@ -303,6 +317,12 @@ def to_writable(obj):
# we're assuming the PlottableHistogram ensures data.shape == weights.shape
assert data.shape == fSumw2.shape

# check values/bins dimensions
if ndim == 1:
assert len(data) == len(obj.axes[0].edges) + 1
else:
assert data.shape == tuple(len(axis.edges) + 1 for axis in obj.axes)

# data are stored in transposed order for 2D and 3D
data = data.T.reshape(-1)
fSumw2 = fSumw2.T.reshape(-1)
Expand Down
50 changes: 50 additions & 0 deletions tests/test_0422-hist-integration.py
Original file line number Diff line number Diff line change
Expand Up @@ -157,3 +157,53 @@ def test_regular_3d(tmp_path):
assert h2.GetBinContent(8, 9, 2) == 0
assert h2.GetBinContent(9, 8, 1) == 0
f.Close()


@pytest.mark.parametrize("overflow", [True, False])
@pytest.mark.parametrize("underflow", [True, False])
def test_flow_bin_writing(tmp_path, underflow, overflow):
newfile = os.path.join(tmp_path, "newfile.root")
tmp = (
hist.new.Reg(3, 1, 4, name="x", underflow=underflow, overflow=overflow)
.Weight()
.fill([0, 1, 2, 3, 4])
)

with uproot.recreate(newfile) as fout:
fout["h1"] = tmp

with uproot.open(newfile) as fin:
h1 = fin["h1"]

assert np.allclose(tmp.values(), h1.values())
assert np.allclose(tmp.values(flow=True), h1.values(flow=True))


@pytest.mark.parametrize("under1", [True, False])
@pytest.mark.parametrize("under2", [True, False])
@pytest.mark.parametrize("under3", [True, False])
@pytest.mark.parametrize("over1", [True, False])
@pytest.mark.parametrize("over2", [True, False])
@pytest.mark.parametrize("over3", [True, False])
def test_flow_bin_writing_3d(tmp_path, under1, under2, under3, over1, over2, over3):
newfile = os.path.join(tmp_path, "newfile.root")
tmp = (
hist.Hist.new.Reg(3, 1, 4, name="x", underflow=under1, overflow=over1)
.Reg(3, 1, 4, name="y", underflow=under2, overflow=over2)
.Reg(3, 1, 4, name="z", underflow=under3, overflow=over3)
.Weight()
.fill(
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
)
)

with uproot.recreate(newfile) as fout:
fout["h1"] = tmp

with uproot.open(newfile) as fin:
h1 = fin["h1"]

assert np.allclose(tmp.values(), h1.values())
assert np.allclose(tmp.values(flow=True), h1.values(flow=True))