Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document that reduce-like functions use init one or more times #53945

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented Apr 3, 2024

OK, here's take two, building on the work that was done in #53871. We preserve the neutral element requirement but guarantee that init is used to start all evaluation chains.

The news summarizes as succinctly as I can:

init is now guaranteed to be used for non-empty collections one or more times. This ensures that all calls
to the reducing function op have one argument that is either init or the result from a
previous op evaluation. Previously init was explicitly allowed to be omitted from the reduction entirely.

And the docs for mapreduce say a bit more:

If provided, init is the return value for empty collections and is used one or more times as
an argument to op for non-empty collections. The init value is not transformed by the function f.
Using init ensures that all calls to op have one argument that is either init
or the result from a previous op evaluation, and the ordering of these arguments is
unspecified. As it may appear in the reduction one or more times, it must be a neutral element for op that
does not change the result by being used more than once. It is generally an error to call mapreduce
with empty collections without specifying an init value, but in unambiguous cases an
identity value for op may be returned; see Base.reduce_empty for more details.

Closes #49042.

base/reduce.jl Outdated Show resolved Hide resolved
@mbauman mbauman added docs This change adds or pertains to documentation fold sum, maximum, reduce, foldl, etc. labels Apr 3, 2024
base/reduce.jl Outdated Show resolved Hide resolved
@jariji
Copy link
Contributor

jariji commented Apr 3, 2024

Do we want to specify whether it's op(init, _) or op(_, init)?

base/reduce.jl Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
@mbauman
Copy link
Member Author

mbauman commented Apr 3, 2024

Do we want to specify whether it's op(init, _) or op(_, init)?

I don't think there's a meaningful reason to do so, because subsequent steps might be op(op(...), _) or op(_, op(...)). That's the interesting thing in my head — that you always have one argument from the collection and one result_or_init. Since the location of the result isn't defined, I don't think the location of the init should be, either.

Edit: Oh duh, that's not true. You sometimes have two results. Duh. But my point is that you never have two elements, and that which one may be a value from the collection is unspecified.

But now I need to clean up that language some more. Shucks. 🤦

NEWS.md Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
@jonas-schulze
Copy link
Contributor

jonas-schulze commented Apr 4, 2024

Do we want to specify whether it's op(init, _) or op(_, init)?

I think this is a fair point, especially since fold[lr] immediately refer to reduce. Either option is fine by me. Using the same order for all three (six including map*) makes switching seamless.

It most cases, both arguments will likely have the same type, i.e. op(::T, ::T) where T. But surely, there will be some cases where this is not the case, such that it is necessary to specify whether it is op(acc, val) or the other way around.

I think I had a non-associative example in mind, for which I should have used foldl/foldr instead. We do want to allow op(acc1, acc2).

@Seelengrab
Copy link
Contributor

This docstring feels awfully long, especially when viewed in help> or as a function hint in an editor - maybe it's better to have this behavior recorded in the manual, and refer to that section in the docstring?

Copy link
Contributor

@MasonProtter MasonProtter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you for taking this on @mbauman. I just have one remaining point of disagreement about assuming purity of f and then I'm fully on board.

base/reduce.jl Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Member

StefanKarpinski commented May 1, 2024

I'm a bit late to the party, but after consideration and discussion with @mbauman, I'm on board with this change. I think the guarantee can be expressed more clearly this way:

When init is not provided, one or more op calls will take collection values as both arguments. When init is provided, collection values will never occur as left arguments to op, only as right arguments; the left argument to op will be init for one or more op calls, or the result of a previous call to op otherwise.

I agree that the wording of the new docs is a bit verbose, but we can always worry about that after the functionality is nailed down.

@StefanKarpinski
Copy link
Member

The argument ordering is very hard to express without explicitly invoking an op call tree. Perhaps that's the best way to write it anyway. Something like

The collection values are guaranteed to appear as leaves in the tree of op calls in order, but that does not guarantee that those op calls are evaluated in that order. For example, the left and right halves of an array may be reduced concurrently. When the results of those two reductions are combined the left and right reduced values will respectively be the left and right arguments to the final op call.

@mbauman
Copy link
Member Author

mbauman commented Jul 24, 2024

I don't like requiring that init needs to be a left-argument to op because that means that foldr would not be a valid implementation for reduce. But otherwise I agree. I'm seeing if I can get the description a little more clear with fresh eyes here.

@jariji
Copy link
Contributor

jariji commented Jul 24, 2024

Is "init is the right argument of op in foldr" documented?

@mbauman
Copy link
Member Author

mbauman commented Jul 24, 2024

As an example, yes.

Try to make each paragraph here "punchier" and more pointed:
* First paragraph more specifically describes the general behavior
* Second paragraph is about `init`, clarify what it means for `op` args
* Third paragraph is about `op`, but point to extended help
* Fourth paragraph is about empty collections

The extended help now reads clearer with a bulleted list and explicitly mentions argument ordering... and is also a bit more careful about floating point associativity (or the lack thereof) in mapreduce
@mbauman mbauman changed the title RFC: require reduce-like functions to use init one or more times Document that reduce-like functions use init one or more times Jul 24, 2024
@mbauman
Copy link
Member Author

mbauman commented Jul 24, 2024

OK, I've pushed a new re-wording. I find this more clear, personally.

We seem have a very solid consensus that "reduce-like functions use init one or more times" so I've removed the RFC from the title. I think I've faithfully represented that idea in the docs with this latest push.

Of course comments on the wordings are still very welcome, but I am much happier with the state of this now.

@mikmoore
Copy link
Contributor

It makes me a tad anxious that we don't guarantee whether the init is a first or second argument to op. Could there be a situation where this is an issue, where we might need distinct left and right neutral elements? Does a neutral element always commute for an associative operation? This might be a trivial result from group theory, but I don't know the answer.

@nsajko
Copy link
Contributor

nsajko commented Jul 25, 2024

Does a neutral element always commute for an associative operation?

No: https://en.wikipedia.org/wiki/Semigroup#Identity_and_zero

@nsajko
Copy link
Contributor

nsajko commented Jul 25, 2024

A two-sided identity (or just identity) is an element that is both a left and right identity. Semigroups with a two-sided identity are called monoids. A semigroup may have at most one two-sided identity. If a semigroup has a two-sided identity, then the two-sided identity is the only one-sided identity in the semigroup. If a semigroup has both a left identity and a right identity, then it has a two-sided identity (which is therefore the unique one-sided identity).

So a monoid is required, not just associativity.

@mbauman
Copy link
Member Author

mbauman commented Jul 25, 2024

I have two reasons why that's not a major concern in my view. One is that there are three "types" of arguments that get passed to op:

  • init itself
  • values from itr — which necessarily could be either left- or right-args (but not both when init is provided)
  • return values from previous op calls — which are also necessarily left- or right-args (or both)

Because the init value is also the return value for empty collections, it should be of a similar "type" (yes, I mean that two ways: there's the literal Type, but it also applies more generally as a softer "interpretation" or "category of thing") as op's return values. And op necessarily needs to be able to accommodate previous return values in either slot.

The second is more simplistic, but it really seems like foldr should be a valid implementation for reduce, and foldr(=>, [1,2,3], init=0) is (and should be) 1 => (2 => (3 => 0)).

@mbauman
Copy link
Member Author

mbauman commented Jul 25, 2024

Perhaps more succinctly, just look at foldl. With init, it will pass all its itr values as right-args. All the left args (except one) are results from previous operations. And init is that final left-arg. Flip the associativity with -r and the argument orderings flip. So init's placement flips. It makes sense that an arbitrary associativity would grant this leeway.

If you care about argument orderings between init and an itr value, you probably care about the argument orderings between an op result and an itr value — and so then you care about the associativity. And you should use foldl or foldr.

@stevengj
Copy link
Member

stevengj commented Jul 25, 2024

When the operation is +, should we check iszero(init) for backward compatibility and print a deprecation warning if it is false (calling mapfoldl as a fallback)?

@mbauman
Copy link
Member Author

mbauman commented Jul 25, 2024

Well the previous documentation explicitly says that it may or may not be used... but it'd probably be a good idea to get an idea of how many folks were relying on the old implementation. I say we first try the "full pairwise" in #52397 so we can run PkgEval there to gather the data. A deprecation path would be kind but would probably have to be fairly finely scoped (the generic iszero can throw method errors).

@mikmoore
Copy link
Contributor

mikmoore commented Aug 30, 2024

One behavior I'd like to preserve is

julia> sum(fill(-0.0, 0))
0.0

julia> sum(fill(-0.0, 3))
-0.0

It seems that the way to get this is to make sum on IEEEFloat use init=-0.0 (the IEEEFloat additive identity) but allow (the option of) a distinct return value of +0.0 for the isempty case. I hate that this makes the implementation more complicated but I don't see a great way around it. Thanks to mbauman for clarifying the missing-init case below. It was probably made clear a while ago but I came back to this after seeing the discussion on 55318 today. If I'd re-read more carefully I might have noticed that probably nobody was proposing we provide default values for init.

I would be somewhat unhappy if a sum of -0.0s yielded +0.0 (sum(x) would differ from +(x...) in a not-just-reassociation way), and I expect the change would break things (both existing and yet-unwritten) in at least a few places, but I know most people would be unhappy if an empty sum yielded -0.0 (and this would almost certainly break things in many places).

(+0.0 vs -0.0 might seem pretty inconsequential until you write inv(sum(x)))

@mbauman
Copy link
Member Author

mbauman commented Aug 30, 2024

I've become completely radicalized against initial value guesswork when it's not specified, and the proposed behavior change here is only about what to do when init is specified.

Guessing at init is simply so bug-prone. I think the only sensible answer here is ensuring we always do what the change in #55318 does for dimensional reductions. That is:

  • 0 values:
    • Have init? Return it. Otherwise:
    • Return mapreduce_empty (usually an error, sometimes an unambiguous value like 0.0)
  • 1 value (a):
    • Have init? Return op(init, f(a)). Otherwise:
    • Return mapreduce_first (typically f(a))
  • 2+ values (a1, a2):
    • Have init? Start every chain with op(op(init, f(a1)), f(a2)) and continue. Otherwise:
    • Start every chain with op(f(a1), f(a2)) and continue

In other words, neither 0.0 nor -0.0 in your above cases come from a divined-out init element that wasn't passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This change adds or pertains to documentation fold sum, maximum, reduce, foldl, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unspecified behavior of mapreduce is bad
9 participants