Proposal to add a value_range #482

HDembinski · 2020-12-10T09:47:18Z

It may be useful to have a way to do an inclusive selection on a value range which follows the rule "keep all bins that overlap with the value range".

I think the clean way of introducing that is via a special slice-like object. I suggest value_range as a placeholder for a proper name. You would use it instead of slice inside __getitem__.

h[value_range(None, upper)] # selects all bins up to and including the bin containing upper
h[value_range(lower, None)] # selects all bins starting with the bin containing lower
h[value_range(lower, upper)] # selects all bins starting with the bin containing lower up to and including the bin container upper

bh.sum and bh.rebin and friends are supported as an optional third argument. value_range can be transformed internally into an ordinary slice.

The text was updated successfully, but these errors were encountered:

HDembinski · 2020-12-10T09:51:41Z

I am not super committed to this proposal, I mainly added this here as an anchor for discussion. I am personally fine with not adding too many ways of doing the same thing, bh.loc probably does the job just fine.

henryiii · 2020-12-19T23:02:29Z

I think we should make sure UHI is powerful enough that users could create a smart locator - see #485. This would greatly complicate UHI by having to track multiple types of slices and/or having a slice transform step, and would not be composable like the current UHI objects are (mixing bins and values is very useful in one special case - start to value or value to end; these are not easy to write any other way - None includes flow). It also does not have Python syntax support like slices do; that's why we use getitem in the first place over a random Python method; and for UHI takes advantage of this unless you want to use dict-based access (if the keyword getitem PEP gets accepted, you won't need verbose syntax for that anymore either!).

HDembinski · 2020-12-22T13:53:13Z

The problem with a "smart locator" is that it breaks some fundamental design principles. In well designed interfaces, everything is plain and obvious and consistent. We hate it when stuff happens behind the scenes, like global state is modified, objects that depend on each other without being explicitly coupled, etc.

A "smart locator" knows whether it is passed to __getitem__ or to a slice and changes the returned index accordingly. This is not at all what you expect. The locator purpose is to distinguish an index from a value, to signal to the histogram that it has to convert the value to an index before it can do anything with it. We made lots of compromises with respect to how Boost.Histogram works to get the behavior of indices and slices exactly how a Python user expects them to be.

Now you want to break all that with a "smart locator" that is context-aware and behaves differently in a slice. It is an abomination and it goes against all the simpleness and consistency that we worked towards.

Replacing the slice object like I propose here is the right design. If we do that, then we are not bound by any preconceptions on how slices and indexing works in Python. The value_range or you could also call it value_slice is not context-aware, because it does not have to be.

HDembinski · 2020-12-22T14:00:31Z

@jpivarski Henry wants smart locators, I think they are a bad idea. @henryiii seems to believe that you were also in favor of "context-aware" locators, but I think it goes against consistency and what is idiomatic Python, which you have also favored in the past. If Henry insists on his context-aware locators, we need another call to arbitrate, because I am strongly against them.

henryiii · 2020-12-22T15:36:01Z

Tiny clarification: I want the ability for users to write smart locators; I'm proposing the default locators like bh.loc stay "dumb". With one exception: I have been long bothered by this:

h[bh.overflow] # -> overflow bin
h[bh.underflow:bh.overflow] # returns the whole histogram (normal slicing always bring along flow bins anyway)
h[bh.underflow:bh.overflow:sum] # Does not include the overflow! Slices do not include the endpoint, so no overflow bin

h[::sum] # correct syntax

If a locator can detect where it is in the slice, we can fix bh.overflow to always mean the overflow bin, regardless of where it is. If we add smart locators, then this finally can be fixed.

Users were not supposed to use these this way, but I fear that some users, wanting to be explicit, will find bh.underflow/bh.overflow and think it's a great way to be clear to readers that you are including flow bins. Which actually works everywhere except for a reduction over a slice (where it's really hard to debug, since you can't count the number of bins returned - they are summed together, the value you get is just smaller than expected).

At the end of our meeting, we said we wanted bh.loc to not include the flow bins inside a slice, and to include them outside a slice. That requires smart locators. I've relented on bh.loc, I think it's slightly better, but since you have to know about handling the upper bin anyway, I think it's best to keep them simple.

jpivarski · 2020-12-22T17:06:19Z

If we're talking about "the ability for users to write" smart locators, then absolutely yes. We can apply maintainability and design constraints to ourselves and our own libraries, but restricting a user's freedom is known as an "opinionated library," usually in a negative light. Most applications are short-lived: maintainability should not be the primary concern for code that I write in my /tmp directory to test something.

The ability to write locators is a public interface that should allow for gradual complexity: simple things should be easy and complex things should be possible. So, for instance, we shouldn't require locator functions to take this "placement context" as an argument because most locators aren't going to use it. Attaching it to the object that gets passed in sounds like a great solution because users who write locators only have to think about it if they need it.

On a different topic, the standard set of locators provided by boost-histogram, such as bh.loc: I think they should be the easiest thing to describe and understand by a user. (Note that this is a different "user," one who is only slicing histograms, not one who is writing locators.) But "easiest to describe and understand" doesn't necessarily mean context free. Humans don't think in the simplest possible terms. To illustrate, consider the fact that LISP has simpler syntax than just about any other programming language, but humans prefer the complexity of infix notation, statements, and such: the whole field of programming language syntax is one of making computers complex enough for humans to understand. Mathematical simplicity is often easier for humans to think about—programs are not written in natural human languages—to a degree, in most situations.

Coming back to the locators, this means they should be highly constrained by mathematical simplicity, but mathematical simplicity should not win over user expectations. As a user, if

h[bh.underflow:bh.overflow]

returns the whole histogram, I would expect that

h[bh.underflow:bh.overflow:sum]

sums over the whole histogram. If bh.overflow has to be an exception to the "exclude endpoint" rule, then it should be a well-documented exception, even if it requires some minimal context-awareness to make that happen. As a human, I can accept bh.overflow having a special rule. That's why it has a special spelling: it was to make this bin different because in the course of doing data analysis, I think about this bin (and the underflow) in a special way.

If this sounds like the start of a slippery slope, we can just insist that it won't be and apply some self-constraint. Having the ability to use context doesn't mean that we will be tempted to use context just because it's there. Each rule can be approached from the point of view of "what would a user expect?" and the answer to that question will usually veer toward mathematical simplicity. This, I would argue, was the guiding principle in designing the Python language, not the insistence on consistency. I can think of much more mathematical languages than Python (Haskell leaps to mind). The choices Python made to optimize user expectation usually also optimize mathematical simplicity (such as "only one way to do it... mostly"), but as an instrumental good that serves the intrinsic good of making it easier for programmers to write programs. Python doesn't look like Perl because mathematical simplicity is highly correlated with ease of programming, but it doesn't look like Haskell either (or LISP, or Forth, ...) because these two goals aren't perfectly correlated.

HDembinski mentioned this issue Dec 10, 2020

fix: slicing projection with one-sided slices #479

Merged

gohil-jay added good first issue Good for newcomers and removed good first issue Good for newcomers labels Aug 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to add a value_range #482

Proposal to add a value_range #482

HDembinski commented Dec 10, 2020

HDembinski commented Dec 10, 2020

henryiii commented Dec 19, 2020

HDembinski commented Dec 22, 2020 •

edited

Loading

HDembinski commented Dec 22, 2020 •

edited

Loading

henryiii commented Dec 22, 2020

jpivarski commented Dec 22, 2020

Proposal to add a value_range #482

Proposal to add a value_range #482

Comments

HDembinski commented Dec 10, 2020

HDembinski commented Dec 10, 2020

henryiii commented Dec 19, 2020

HDembinski commented Dec 22, 2020 • edited Loading

HDembinski commented Dec 22, 2020 • edited Loading

henryiii commented Dec 22, 2020

jpivarski commented Dec 22, 2020

HDembinski commented Dec 22, 2020 •

edited

Loading

HDembinski commented Dec 22, 2020 •

edited

Loading