Change `init`'s role in reduce-like functions: remove "neutral element" restriction and guarantee its use #53871

mbauman · 2024-03-26T22:34:03Z

Here's a shot at revamped documentation in light of #49042. The goal here is ~~threefold~~ fivefold:

Document that init provides the return value if the collections are empty (status quo)
Document how reductions over empty collections may be errors (status quo)
Remove the restriction that init must be a "neutral element" (change)
Guarantee that init is used exactly once (change from being completely unspecified)
Guarantee that init is used as the left-most argument in the reduction (change from being unspecified)

And in the process, this cleans up the language around associativity a bit.

This is a draft as this language needs to be propagated through to all the "special" reduction functions like sum et al. But it gives us something concrete to consider.

base/reduce.jl

vtjnash · 2024-03-27T00:04:17Z

Also may be related to #44819 and #53461 (for the meaning and role of init in cumsum and how type promotion is defined for it)

Seelengrab · 2024-03-27T08:40:16Z

Document that init is used exactly once, and it means that every value in the collection is used as a argument to the reducer op, even for single-argument collections

How does this play together with vectorization & associativity? I vaguely remember that this was a contentious point, since the result can be very different depending on how often that "initial value" (seems to be a misnomer) is used in reductions. I believe this is the reason the actual number of uses of init is currently unspecified.

mbauman · 2024-03-27T13:35:59Z

How does this play together with vectorization & associativity?

That's exactly my motivation here — I want to push forward @stevengj's #52397. If init is specified to only be used exactly once, then only one parallel "chain" can start with init. And that's fine, as long as you don't expect to depend upon init causing a cascading promotion.

Looks like #44819 was trying to add similar guarantees to maximum. My intention here is to nail down the language and then copy it to the op-specific reducers like maximum and friends.

accumulate is slightly more troublesome since (AFAIK) we're missing the analogous associativity-defined verb.

base/reduce.jl

mikmoore · 2024-03-27T15:24:11Z

I'm torn on leaving the reduction position of init unspecified. One one hand, specifying it ties our hands on the implementation. On the other, the name init is a bit misleading if it's not at the beginning (or at least a beginning, assuming we usually opt for pairwise reductions). Although in some implementations it might be easier or more performant to include it at the very end instead (maybe makes initialization easier or cache alignment more likely). In total, I suppose it's easiest to leave explicitly unspecified for now as it can always be made stricter later, so I'm happy with the current language.

While it can be deduced from the documentation, we could consider making an explicit note warning against using init for promotion. Technically, any previous behavior that relied on it was based on an invalid use because a promoting argument is not a neutral element. EDIT: honestly, this might be more confusing than helpful. Although we could consider converting all elements to the promote_type with init before applying the reduction function if we wanted promotion behavior to be possible. Effectively, mapreduce(f,op,x;init) would replace f with g(z) = first(promote(f(z), init)). But maybe that's a future dicussion.

Co-authored-by: mikmoore <[email protected]>

base/reduce.jl

stevengj · 2024-03-27T16:12:20Z

We should definitely specify the ordering of init, i.e. that init comes at the "left" (i.e. before the first element) for ordered collections and iterators, and hence is effectively used only once. (mapreduce does not assume op is commutative for ordered iteration.) We also definitely shouldn't assume that init is an identity element of op. (Changing either of these would be hugely breaking.)

In principle, we still don't need to specify the associativity, though in most cases a mapreduce with init will be implemented as mapfoldl internally. For specific operations like op === + where we know the identity element, we might change the associativity (and be free to add zero(init) elsewhere, e.g. at the leaves of pairwise recursion, to ensure the precision of init is used for the summation). But this illustrates the difficulty with not specifying the associativity, since init can be used not just to specify the value of empty reductions but also to change the precision. e.g. what does:

a = rand(Float16, 1000)
mapreduce(identity, +, a, init=0.0)

do? Is it allowed to compute 0.0 + sum(a), i.e. compute the sum in Float16 precision and then add 0.0 (converting to Float64) at the very end? Or should it be equivalent to mapfoldl(identity, +, a, init=0.0), which performs the sum left-to-right in Float64 precision?

mbauman · 2024-03-27T17:16:51Z

So this is trying to capture the current thought-processes in #49042 — in particular Jeff's comment. The case brought up in #44819 (comment) is helpful in thinking about an associative-but-not-commutative operator: reduce(*, ["aa","bb","cc"], init="zz").

But this illustrates the difficulty with not specifying the associativity, since init can be used not just to specify the value of empty reductions but also to change the precision.

I'm explicitly saying that's a-ok with this PR — and in fact it's exactly what you sign up for with reduce. If you want a cascading promotion like that, well, then you use a function with guaranteed cascading associativity. Concretely, I don't think your example of a = rand(Float16, 1000); reduce(+, a, init=0.0) needs to be computed any differently than reduce(+, Real[0.0; a]) would be.

stevengj · 2024-03-27T18:36:30Z

Okay, but I think we should still document that init is used at the left, even for non-empty collections, e.g. reduce(*, ["aa","bb","cc"], init="zz") must return "zzaabbcc". As far as I know, mapreduce has for a long time assumed op is associative, but has never assumed op is commutative — it's documented that it can change the associativity but not the ordering. So an init argument doesn't make sense unless you say where it goes.

(But I'm still worried that re-associating init may break code that depended on the current promotion behavior.)

mikmoore · 2024-03-27T19:07:42Z

(But I'm still worried that re-associating init may break code that depended on the current promotion behavior.)

Yes, breaking this would be annoying for those that used it successfully. However, it was also explicitly not a valid use:

If provided, the initial value init must be a neutral element for op that will be returned for empty collections. It is unspecified whether init is used for non-empty collections.

By the old documentation, it was explicitly undefined whether init had any effect on non-empty collections. And in any case, it was supposed to be a neutral element. I read "neutral" to mean that including it should cause "no change" in the result (the only way one could justify not using it sometimes). Using it to force promotion changes wraparound or precision, so would seem to violate that neutrality.

mbauman · 2024-03-27T21:50:55Z

OK, I've updated the documentation changes here with the feedback and incorporated the requirement that init be incorporated as the left-most argument to op. I like how this reads BUT: This is now becoming a bigger change in the specification of the reduction and whereas my initial intention here was to more match the documentation with the status quo implementations, this now definitely imposes additional requirements that may not be satisfied by the status quo implementations.

At a minimum CUDA does not follow this specification: #49042 (comment) (it didn't match my initial shot, either, though). It's unclear to me where to go from here.

base/reduce.jl

stevengj · 2024-03-27T22:02:57Z

(I don't see why parallel reduction ala CUDA requires init to be the "neutral" identity element for op — if you are subdividing into parallel chunks, just divide into non-empty chunks where init isn't needed.)

mbauman · 2024-03-27T22:07:22Z

(I don't see why parallel reduction ala CUDA requires init to be the "neutral" identity element for op — if you are subdividing into parallel chunks, just divide into non-empty chunks where init isn't needed.)

It surely doesn't require it, but it is a current implementation that deviates from this PR's proposed documentation. I don't know how many other implementations are at variance.

stevengj · 2024-03-27T22:27:45Z

Fair enough — it seems like the documentation that init (formerly v0) must be a "neutral" element dates back to Julia 0.4 (#9027), so it's probably not reasonable to change it now, just to clarify what that means.

mbauman · 2024-03-28T18:09:46Z

I still think this is a useful exercise here because this is describing what I personally would want init to mean and makes concrete what the changes would look like. There are three points:

Remove the restriction that init must be a "neutral element"

Guarantee that init is used exactly once (change from being completely unspecified)

Guarantee that init is used as the left-most argument in the reduction (change from being unspecified)

Point (1) is tricky, because some common use-cases (included some documented ones) already disregard it. Points (2) and (3) would be perfectly compatible with the prior documentation if we remove the "guarantee".

So where to go? We could either decide that this is a "minor change" worth doing given existing usages and implementations (not many packages seem to use a non-neutral init, but I could imagine that this kind of "abuse" would be more common in scripts and non-package code), or we could find a new kwarg name with these tighter semantics, or we could just give up on #49042 for now and clarify the status quo as best we can.

Co-authored-by: Steven G. Johnson <[email protected]>

stevengj · 2024-03-29T13:24:57Z

I would also like to remove the counter-intuitive restriction that init be a "neutral" element (I've been using Julia for more than a decade and never even realized that this restriction existed). But I'm not certain that this is a Julia 1.x change.

mikmoore · 2024-03-29T14:49:59Z

I think I missed something along the way here. Can someone explain in what sense mbauman's proposed changes are breaking. It seems to me that any use of reduce that was valid under the previous restrictions should be valid with these new rules.

(Disregard the rest of my comments if there is a sense is which this is breaking to the use of reduce.)

Any existing implementations of reduce might not initially support the new rules. Is that the sense? I suppose it could at least be considered a minor change under that consideration. Public packages are searchable, so CUDA.jl et al may need PRs but that seems doable. Private code that implements custom reduce functions is likely to continue to implement the old restrictions (if not a different set entirely, if supporting init at all) until updated (probably never). But such private code is unlikely to be used with the new init rules, so that seems like a victimless crime.

I don't see this as profoundly different than any other feature expansion (e.g., most anything with a compat notice). Pre-existing implementations may not support the expanded interface.

init is nigh-useless under its current specification. It seems that the only thing currently guaranteed is that it is returned for isempty collections. A ternary can trivially accomplish this without the keyword. It would be unfortunate to indefinitely block making init useful because a few packages would need PRs.

mbauman · 2024-03-29T15:08:48Z

No, you're exactly right — the "breaking-ness" here is simply that we're changing the rules of the game out from underneath existing implementations. From the calling side, all of these changes "fit" within the previous documentation. But it means that previously technically-correct implementations are now immediately buggy as soon as the new docs lands.

That said, I do think it's warranted to call this a minor change and still do it, but that doesn't mean it's not without impact. It'd be good to get some voices from the JuliaFolds folks here (@MasonProtter, @jw3126, @DilumAluthge) — what do y'all think?

NEWS.md

jw3126 · 2024-03-30T08:27:00Z

I like it. I think some paralell implementations in the ecosystem will still want the neutral element assumption. But IMO it is then reasonable to make it their job to document that.

NEWS.md

MasonProtter · 2024-03-31T09:49:13Z

I'm against this direction.

This change would make it so that the init keyword argument does not actually "initialize" a reduction, but instead is just the leftmost element of the collection. IMO that's not a particularly useful property compared to an initializer.

I'd much rather that we make the opposite guarantee: that the init kwarg must be used for every recursive sub-reduction. e.g. if we had

reduce(op, [1,2, 3, 4, 5])

which was split to

op(foldl(op, [1,2,3]), foldl(op, [4, 5]))

then the init option should be propagated to both foldl calls. This can be really useful for things like recovering type stability, or avoiding issues from doing parallel reductions over collections like Channel that don't have a predictable size.

mikmoore · 2024-03-31T14:18:02Z

I suppose the less breaking thing would be to retain init's current role and neutrality (and extend its guarantee to the beginning of all reduction chains, as suggested). I still think using init for promotion is an abuse (and is unreliable under the existing semantics). If one needs promotion, (e.g.) sum(Float64, x; init=truly_neutral) is easy to write and works predictably under any conceivable semantics.

If we really wanted, we could introduce a new keyword element to apply at the left (and right?) with the proposed semantics. However, the proposed changes don't do anything that couldn't be implemented simply and more clearly as isempty(x) ? leftmost : op(leftmost, reduce(op, x)), so a keyword is probably overkill.

stevengj · 2024-03-31T14:18:13Z

I did a search for all uses of reduce or mapreduce with init in case it is helpful. Most uses of init seem to be an identity element, but there are a few exceptions that effectively assume foldl. Regardless of the outcome of this discussion, it might be good to submit a few PRs to convert them to use foldl.

adienes · 2024-03-31T17:21:44Z

I'm with @MasonProtter . there are easy enough ways to simply add something to the left of a collection. reducing with an init at each subproblem is more difficult (especially if trying to implement a parallel reduction)

maleadt · 2024-04-01T07:31:27Z

I haven't really read into the whole problem here, so just responding to the GPU bits.

(I don't see why parallel reduction ala CUDA requires init to be the "neutral" identity element for op — if you are subdividing into parallel chunks, just divide into non-empty chunks where init isn't needed.)

Just dividing into non-empty chunks is not trivial, because many operations (e.g. tree-style reduction in shared memory, or subgroup intrinsics) are executed cooperatively, i.e., by a number of threads that may not cleanly divide the number of elements. Adding branches to prevent that is generally not a good idea on GPUs.

So right now in GPUArrays.jl we use init as a neutral element, even for non-empty inputs. For compatibility with existing code that doesn't specify init because the input isn't empty (e.g. sum(f, ::CuArray)), we have a GPUArrays.neutral_element(f, ET) interface. And as a final kludge, we have our own version of mapreducedim! that also takes an init argument, because the API-dictated behavior of initializing the output (and then each thread reading back from it to get the neutral element) is too costly.

mbauman · 2024-04-01T19:04:47Z

Thank you all! I'm still not totally thrilled about the "neutral element" meaning, but I can appreciate the arguments given here, especially since it is the status quo.

jonas-schulze · 2024-04-02T16:54:27Z

I support the position of @MasonProtter. If a user requires a certain associativity, or that init be used at a specific end of the reduction chain, or any other assumption to the order of operations,¹ that user should rather use the more imperative mapfoldl or mapfoldr. IMO, mapreduce must retain its current and declarative semantics, in order to support e.g. tree-like reductions² as used by Dagger.³

Maybe the non-associative example could be extended to explicitly allow all of the following to be returned by reduce(-, [0, 1, 2, 3]):

julia> ((0 - 1) - 2) - 3
-6

julia> 0 - (1 - (2 - 3))
-2

julia> (0 - 1) - (2 - 3)
0

like a side-effect of the "map" part to be triggered before any reduction is evaluated ↩
as mentioned by https://github.com/JuliaLang/julia/pull/53871#discussion_r1541389149 ↩
see https://github.com/JuliaParallel/Dagger.jl/issues/475 and the implementation ↩

jariji · 2024-04-03T20:51:50Z

Fwiw, Folds.reduce documents its assumption that

binary function op is (at least approximately) associative and init behaves as the identity of op.

mbauman · 2024-04-03T22:12:35Z

OK, I would still like to improve our semantics here. I've opened my shot at take-2 in #53945.

Add more clarity to init's role in reduce and mapreduce

f36ffbb

mbauman commented Mar 26, 2024

View reviewed changes

base/reduce.jl Show resolved Hide resolved

mbauman added the docs This change adds or pertains to documentation label Mar 26, 2024

mikmoore reviewed Mar 27, 2024

View reviewed changes