WIP: The great pairwise reduction refactor #58418
Conversation
…]Int`. (cherry picked from commit c923b1d)
(cherry picked from commit 4bee191)
for accumulate inference (cherry picked from commit 4fadd8f)
Previously, Julia tried to guess at initial values for empty dimensional reductions in a slightly different way than whole-array reductions. This change unifies those behaviors, such that `mapreduce_empty` is called for empty dimensional reductions, just like it is called for empty whole-array reductions. Beyond the unification (which is valuable in its own right), this change enables some downstream simplifications to dimensional reductions and is likely to be the "most breaking" public behavior in a refactoring targeting more correctness improvements. (cherry picked from commit 94f24b8)
(cherry picked from commit 0e924cf)
They already supported axes through the default fallback that uses size to construct OneTos, and this was causing ambiguities. (cherry picked from commit 32fc16b)
Fantastic stuff, this is really exciting to see! 🥳
The one-liner under review:

length(t::ImmutableDict) = count(Returns(true), t)

is replaced by a manual loop:

# length is defined in terms of iteration; using a higher-order function to do the
# iteration is likely to call `length` again, so this is a manual for loop:
function length(t::ImmutableDict)
    r = 0
    for _ in t
        r += 1
    end
    return r
end
won't this be horribly slow?
I mean, wasn't it already doing this? This shouldn't be any worse. It's definitely faster than infinite recursion :)
The trouble is that `ImmutableDict` has an `IteratorSize` of `HasLength`. And so the new reduction machinery happily calls `length` to determine the pairwise splits. So if `length` itself uses reductions (like `count`) to calculate length, we're in trouble.
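To illustrate the trait interplay, here's a minimal sketch (a hypothetical `Cons` list type, not from this PR) of an iterator whose `length` is only knowable by walking it; declaring `SizeUnknown` keeps length-based pairwise machinery from calling back into `length`:

```julia
# Hypothetical linked list whose length must be computed by iterating.
struct Cons{T}
    head::T
    tail::Union{Cons{T}, Nothing}
end
Base.iterate(c::Cons) = (c.head, c.tail)
Base.iterate(::Cons, s::Cons) = (s.head, s.tail)
Base.iterate(::Cons, ::Nothing) = nothing
Base.eltype(::Type{Cons{T}}) where {T} = T
# SizeUnknown: reductions won't consult `length` to pick pairwise split points,
# so `length` below can safely be a manual loop over the iterator.
Base.IteratorSize(::Type{<:Cons}) = Base.SizeUnknown()
function Base.length(c::Cons)
    n = 0
    for _ in c
        n += 1
    end
    return n
end

list = Cons(1, Cons(2, Cons(3, nothing)))
@assert length(list) == 3
@assert sum(list) == 6
```

(The PR instead keeps `HasLength` for `ImmutableDict` and makes `length` a manual loop; this sketch just shows the trait that governs whether `length` gets called.)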
On nightly:
julia> using BenchmarkTools
julia> d = Base.ImmutableDict((i => i+1 for i in 1:200)...);
julia> @benchmark length($d)
BenchmarkTools.Trial: 10000 samples with 641 evaluations per sample.
Range (min … max): 193.512 ns … 392.225 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 193.772 ns ┊ GC (median): 0.00%
Time (mean ± σ): 196.869 ns ± 11.655 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁ ▁▃▂▄▂ ▁
███████████▇██▇▇▆▇▆▆▆▅▅▇▆▆▅▆▅▅▆▅▅▅▅▅▅▅▄▄▅▅▅▅▂▄▃▃▄▄▄▄▃▂▂▂▃▄▂▅▄ █
194 ns Histogram: log(frequency) by time 241 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @eval Base function length(t::ImmutableDict)
r = 0
for _ in t
r+=1
end
return r
end
length (generic function with 95 methods)
julia> @benchmark length($d)
BenchmarkTools.Trial: 10000 samples with 631 evaluations per sample.
Range (min … max): 193.542 ns … 366.482 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 193.740 ns ┊ GC (median): 0.00%
Time (mean ± σ): 196.829 ns ± 11.533 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁ ▄ ▂ ▁
███████▇███▇██▇▇▇▇▆▇▆▆▅▆▆▆▇▆▆▆▅▅▅▅▅▅▄▅▆▅▄▅▅▅▅▄▄▅▄▄▄▅▅▄▄▃▄▄▃▃▄ █
194 ns Histogram: log(frequency) by time 238 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
ah that's unfortunate...
Worth seeing if it also closes #43501.
The package evaluation job you requested has completed - possible new issues were detected.

Report summary
❗ Packages that crashed: 1 package crashed only on the current version; 6 packages crashed on the previous version too.
✖ Packages that failed: 1030 packages failed only on the current version; 1402 packages failed on the previous version too.
✔ Packages that passed tests: 23 packages passed tests only on the current version; 4619 packages passed tests on the previous version too.
~ Packages that at least loaded: 3 packages successfully loaded only on the current version; 2574 packages successfully loaded on the previous version too.
➖ Packages that were skipped altogether: 2 packages were skipped only on the current version; 921 packages were skipped on the previous version too.
😭 OK, it's not as bad as it might seem. Most of this is from the use of

(@v1.13) pkg> add DataFrames
   Resolving package versions...
    Updating `~/.julia/environments/v1.13/Project.toml`
  [a93c6f00] + DataFrames v1.7.0
    Updating `~/.julia/environments/v1.13/Manifest.toml`
  [a93c6f00] + DataFrames v1.7.0
  [41ab1584] + InvertedIndices v1.3.1
  [08abe8d2] + PrettyTables v2.4.0
  [892a3eda] + StringManipulation v0.4.1

(@v1.13) pkg> test DataFrames
# ... only showing tests with failures
Test Summary:          | Pass Fail Total Time
DataFrame constructors |    6    1     7 2.1s
Test Summary:                                                        | Pass Fail Total  Time
fast reductions: AsTable(cols)=>sum, mean, minimum, maximum variants |   85    1    86 12.8s
Test Summary:           |    Pass Fail   Total  Time
OnCol correctness tests | 1201549 3000 1204549 10.5s

All but that reductions testset are due to #57509 (looks like they're testing exact hash values?). The reductions one is interesting and relevant:
What's happening here is, concretely:

julia> using Test, DataFrames, Random, LinearAlgebra
julia> Random.seed!(1234);
julia> m = rand(10, 10000);
julia> df = DataFrame(m, :auto);
julia> (combine(df, AsTable(:) => (ByRow(sum) => :sum))).sum == map(x->foldl(+, x), eachslice(m, dims=1))
true
julia> (combine(df, AsTable(:) => (ByRow(sum ∘ skipmissing) => :sum))).sum == map(x->foldl(+, x), eachslice(m, dims=1))
true
julia> sum.(eachrow(df)) == map(x->foldl(+, x), eachslice(m, dims=1))
false
julia> sum.(eachrow(df)) == vec(sum(m, dims=2))
true
julia> norm(vec(sum(m, dims=2)) .- Float64.(vec(sum(big.(m), dims=2))))
2.8760747774580336e-12
julia> norm(map(x->foldl(+, x), eachslice(m, dims=1)) .- Float64.(vec(sum(big.(m), dims=2))))
3.69774194955537e-11

So I think this is worth trying nanosoldier again on. @nanosoldier
Just as a note (and maybe you already know), but the nanosoldier report has a copy-pastable command to rerun only the packages that failed. If the only thing you're doing is adding a small fix for a specific package, it might be worth rerunning only the failing packages, since that is usually much, much faster.
Yeah, I know about it but didn't think to do it here. Would've been a good idea, I suppose, even with that thousand-long list. I did go check and confirmed that DataFrames is indeed coding up their own
_mapreduce_start(f, op, A, ::_InitialValue) = mapreduce_empty(f, op, _empty_eltype(A))
_mapreduce_start(f, op, A, ::_InitialValue, a1) = mapreduce_first(f, op, a1)
_mapreduce_start(f, op, A, init) = init
_mapreduce_start(f, op, A, init, a1) = op(init, mapreduce_first(f, op, a1))
Suggested change:

- _mapreduce_start(f, op, A, init, a1) = op(init, mapreduce_first(f, op, a1))
+ _mapreduce_start(f, op, A, init, a1) = op(init, f(a1))
In discussion on triage, we came to the conclusion that `mapreduce_first` exists for the case of "fixing up" the first element for use as either the return value (if it's the only one) or as a first "left" argument (although we don't guarantee argument orderings). In the case that we have `init`, we don't need to do this operation; we trust the provided `init` to effectively do it for us.
The case of adding `Bool`s gives a nice concrete example of why this is the right thing to do (and is a test-case to add). With this suggested change, we get:

reduce(+, [true, false]) === 1 # `reduce_first(+, true)` promotes to Int, which is passed to `1 + false`
reduce(+, [true]) === 1 # that's `reduce_first(+, true)`
reduce(+, Bool[]) === 0 # that's `reduce_empty(+, Bool)`
reduce(+, [true]; init=0x00) === 0x01 # *** `reduce_first` is *not* called and `+(::UInt8, ::Bool)` does *not* promote ***
reduce(+, Bool[]; init=0x00) === 0x00 # that's the `init`
This is exactly the behavior you get with `foldl` and `foldr`; they use `reduce_empty` and `reduce_first` in exactly the same way (but of course won't additionally call `reduce_first` a second time in a reassociated pairwise or parallel chain). Before this suggested change, the line marked `***` was out of line.
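The `foldl` side of that claim can be checked on current Julia (a quick sanity sketch; `foldl` with `init` applies `op(init, x)` directly, without `reduce_first`):

```julia
# foldl already behaves this way: reduce_first promotes a lone Bool to Int,
# while a provided init is trusted as-is (no extra promotion happens).
@assert foldl(+, [true, false]) === 1        # reduce_first(+, true) gives Int 1
@assert foldl(+, [true]) === 1               # just reduce_first(+, true)
@assert foldl(+, Bool[]; init=0x00) === 0x00 # empty: the init itself
@assert foldl(+, [true]; init=0x00) === 0x01 # 0x00 + true stays UInt8; no promotion
```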
This doesn't have an impact on `sum` because `Base.add_sum(::UInt8, ::Bool)` promotes to `Int` (with #58374). But that leads to an interesting question: when is it useful (or even better?) to specify promotion behaviors in a reduction operator's behavior (like `add_sum`) vs. with a cascade stemming from a promotion in `reduce_first`? Are `add_sum`'s extra promotion behaviors even necessary if we ensure `reduce_first`'s use?
Or here's another example: Statistics' iterable `mean` implementation isn't currently written as a reduction, but it sure could be (or, really, should be). It's effectively:

add_mean(x::T, y::S) where {T,S} = x + convert(promote_type(T, S), y)
Base.reduce_first(::typeof(add_mean), x) = Base.reduce_first(+, x/1)
Base.reduce_empty(::typeof(add_mean), T) = Base.reduce_empty(+, T)/1
mean(itr) = foldl(add_mean, itr)/length(itr)
That initial division by one ensures a cascading promotion to float, targeting integer overflow (JuliaStats/Statistics.jl#25). But we also want to avoid overflow for `Float16` (JuliaStats/Statistics.jl#140). So what's the way to fix that? Is it with:

reduce_first(::typeof(add_mean), x::Float16) = Float32(x)

Or is it with methods of `add_mean`? What about allowing, e.g., `FixedPoint` promotion (JuliaStats/Statistics.jl#165)? Should they implement methods of `add_mean` or `reduce_first`?
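As a runnable sketch of that `add_mean` idea (hypothetical names, using `foldl` with an explicit `init` rather than the `reduce_first`/`reduce_empty` hooks, and assuming the iterable has a cheap `length` and can be iterated from the start):

```julia
# Hypothetical mean-as-fold: the initial x/1 promotes integers to float, and
# that promotion then cascades through the whole accumulation via promote_type.
add_mean(x::T, y::S) where {T,S} = x + convert(promote_type(T, S), y)
first_mean(x) = x / 1   # stands in for reduce_first(add_mean, x)
function my_mean(itr)
    x1, rest = Iterators.peel(itr)          # split off the first element
    foldl(add_mean, rest; init=first_mean(x1)) / length(itr)
end

@assert my_mean(1:4) == 2.5
# No Int overflow: accumulation happens in Float64 thanks to the x/1 promotion.
@assert my_mean(fill(typemax(Int), 4)) ≈ 9.223372036854776e18
```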
Off the cuff, I'd go with `add_mean`. `reduce_first` feels like it should (mostly) be handling mechanical/generic base cases for recursion, and any app-specific logic (such as promotion to control when overflow and underflow happen) would be encoded in the reduction function.
The package evaluation job you requested has completed - possible new issues were detected.

Report summary
❗ Packages that crashed: 1 package crashed only on the current version; 8 packages crashed on the previous version too.
✖ Packages that failed: 205 packages failed only on the current version; 1392 packages failed on the previous version too.
✔ Packages that passed tests: 23 packages passed tests only on the current version; 5119 packages passed tests on the previous version too.
~ Packages that at least loaded: 10 packages successfully loaded only on the current version; 2907 packages successfully loaded on the previous version too.
➖ Packages that were skipped altogether: 921 packages were skipped on the previous version too.
OK, getting better. And there's still a good amount of low-hanging fruit that should resolve more. I've fixed a few items, but there are some outstanding ones in the top ten listed (across the failure modes) that I haven't fully investigated, fixed, or flagged upstream yet:
That was based on the definition of
OK, the dot product of StaticArrays is a fascinating one. This is precisely the case I conjectured about above — the

julia> x = [0.6666666666666665, -0.33333333333333326];

julia> y = [3.6275987284684357, 7.255197456936871];

julia> x[1]*y[1] + x[2]*y[2]
0.0

julia> fma(x[2], y[2], x[1]*y[1])
-5.5126462140537006e-17

julia> using LinearAlgebra: BLAS

julia> BLAS.dot(x,y)
-5.5126462140537006e-17

This is a classic example — most commonly seen in complex arithmetic. You definitely don't want complex

julia> sum(x->x*1.0, [-.1,.1])
0.0

julia> sum(x->x*0.1, [-.1,.1])
-8.326672684688674e-19

That's... kinda weird, right? The above is on this branch, but you can also observe this on master in slightly more contrived examples:

julia> sum(x->x*0.1, [-.1,.1])
0.0

julia> sum(x->x*0.1, [-.1; .1; zeros(13)])
0.0

julia> sum(x->x*0.1, [0.0; -.1; .1; zeros(13)])
-8.326672684688674e-19

I suppose the point is that — if you're fusing a
StaticArrays explicitly requests the fastmath version of fma here: https://github.com/JuliaArrays/StaticArrays.jl/blob/563adebc264a23d38da97bf5d9e5d3c6333bdf26/src/matrix_multiply_add.jl#L428
I have benchmarks! They're... ok. A bit of a mixed bag.

The code:

using Chairmarks: Chairmarks, @be
using CSV: CSV
using ProgressMeter: Progress, next!
struct LengthUnknown{T}
itr::T
end
@inline Base.iterate(x::LengthUnknown) = iterate(x.itr)
@inline Base.iterate(x::LengthUnknown, s) = iterate(x.itr, s)
Base.IteratorSize(::LengthUnknown) = Base.SizeUnknown()
Base.IteratorEltype(::LengthUnknown) = Base.HasEltype()
Base.eltype(u::LengthUnknown) = Base.eltype(u.itr)
using Random: Random
const rng = Random.MersenneTwister(1155)
rand(sz...) = Random.rand(rng, sz...)
rand(::Type{T}, sz...) where {T} = Random.rand(rng, T, sz...)
haslength(x) = Iterators.take(x,length(x))
function cartesianvec(x)
n = length(x)
n1 = nextpow(2, sqrt(n))
return @view reshape(x, n1, :)[begin:end, begin:end]
end
function bench_wholearray(Ns, Ts)
results = (; style=String[], eltype=String[], init=Bool[], dims=Union{Missing,String}[], n=Int[], time=Float64[])
P = Progress(4*4*2*length(Ts))
for (xf,type) in [(identity, "IndexLinear"), (cartesianvec, "IndexCartesian"), (haslength, "HasLength"), (LengthUnknown, "SizeUnknown")],
(xf,type) in [(xf, type),
(x->skipmissing(xf(Array{Union{Missing, eltype(x)}}(x))), "SkipMissing $type 0%"),
(x->skipmissing(xf(Array{Union{Missing, eltype(x)}}([rand() < .5 ? missing : x for x in x]))), "SkipMissing $type 50%"),
(x->skipmissing(xf(Array{Union{Missing, eltype(x)}}(fill(missing, size(x))))), "SkipMissing $type 100%"),],
T in Ts,
init in (nothing, zero(T))
ts = [begin
A = xf(rand(T, N÷sizeof(T)))
if isempty(A) && isnothing(init) && try mapreduce(identity, +, A); false; catch; true; end
# Empty SkipMissings don't define reduce_empty; skip it.
[(;time = NaN,)]
elseif isnothing(init)
@be(mapreduce(identity, +, $A))
else
@be(mapreduce(identity, +, $A; init=$init))
end
end for N in Ns]
append!(results.time, getproperty.(minimum.(ts), :time))
append!(results.n, Ns)
append!(results.dims, fill(missing, length(ts)))
append!(results.eltype, fill(string(T), length(ts)))
append!(results.style, fill(type, length(ts)))
append!(results.init, fill(!isnothing(init), length(ts)))
next!(P)
end
results
end
function bench_dims(Ns, Ts)
results = (; style=String[], eltype=String[], init=Bool[], dims=Union{Missing,String}[], n=Int[], time=Float64[])
P = Progress(2*8*2*length(Ts))
for (xf,type) in [(identity, "IndexLinear"), (x->@view(x[1:end,1:end,1:end,1:end]), "IndexCartesian")],
(szf, dims) in [((n,_,_)->(n÷4,4), 1), ((n,_,_)->(4,n÷4), 2),
((n,n2,_)->(n2,n÷n2÷4,4), (1,2)),
((n,n2,_)->(n2,4,n÷n2÷4), (1,3)),
((n,n2,_)->(4,n2,n÷n2÷4), (2,3)),
((n,_,n3)->(n3,n3,n÷n3÷n3÷4,4), (1,2,3)),
((n,_,n3)->(n3,n3,4,n÷n3÷n3÷4), (1,2,4)),
((n,_,n3)->(n3,4,n3,n÷n3÷n3÷4), (1,3,4)),
((n,_,n3)->(4,n3,n3,n÷n3÷n3÷4), (2,3,4))],
T in Ts,
init in (nothing, zero(T))
ts = [begin
n = N÷sizeof(T)
A = xf(rand(T, szf(n, nextpow(2, sqrt(n)), nextpow(2, cbrt(n)))))
if isnothing(init)
@be(mapreduce(identity, +, $A; dims=$dims))
else
@be(mapreduce(identity, +, $A; dims=$dims, init=$init))
end
end for N in Ns]
append!(results.time, getproperty.(minimum.(ts), :time))
append!(results.n, Ns)
append!(results.eltype, fill(string(T), length(ts)))
append!(results.style, fill(type, length(ts)))
append!(results.init, fill(!isnothing(init), length(ts)))
append!(results.dims, fill(string(dims), length(ts)))
next!(P)
end
return results
end
whole = bench_wholearray(2 .^ (3:22), (Int32,Int64,Float32,Float64))
dims = bench_dims(2 .^ (5:22), (Int32,Int64,Float32,Float64))
foreach(Base.splat(append!), zip(whole, dims))
CSV.write(Base.GIT_VERSION_INFO.commit_short * ".csv", whole)

This gives us 744e3fd976.csv vs c5df01807e3.csv (nightly from a few days ago). So we can break this down over a wide variety of cases.

[comparison plots omitted]

Most of the slow cases here are the "small n":

[small-n plots omitted]

The variance is pretty high on SkipMissing — Julia's inference really doesn't like iteration's tuple of
OK, I've implemented specialized kernels for

Here are the latest benchmarks: 144c253358.csv. The remaining big regressions are in mid-size dimensional reductions, and in particular
This is the grand culmination of (and core motivation for) all the reductions work I've slowly been chipping away at over the past year. It implements the following major changes:

- All `reduce`-based reductions use a pairwise re-association by default. This includes seven different implementations (three of them `foldl`s without SIMD).
- These implementations dispatch to a `mapreduce_kernel` function (with specific signatures for arrays and iterators) after performing the pairwise splits and/or dimensional selection. This could eventually become a public API that should help alleviate some of the disruption to those who had been extending the internals before. See, for example, how LinearAlgebra replaced its internal and tricky `_sum` overloads with this method: JuliaLang/LinearAlgebra.jl@e70953e.
- Reductions consistently use `init` to start every reduction chain, as we decided the documentation should direct implementations to do in "Document that reduce-like functions use `init` one or more times" (#53945). They also consistently use `mapreduce_first`.

This currently stands atop the yet-to-be-merged #58374 and #55628.
As the successor to #55318, this carries all its tests and addresses all those issues in the dimensional reductions. Closes: #45566, closes #47231, closes #55213, closes #31427, closes #41054, closes #54920, closes #42259, closes #38660, closes #21097, closes #32366, closes #54875, closes #43731, closes #45562.
In addition, this further brings pairwise reassociations (and SIMD!) to the other class of reduction. Closes #29883, closes #52365, closes #30421, closes #52457, closes #39385.
There are other pull requests that this supersedes that were inspirational for my work here. In particular, this supersedes and would close #45651 (pairwise whole-array IndexCartesian) and closes #52397 (pairwise whole-array Iterators.SizeUnknown). It also supersedes a few other PRs; it would close #45822 (findmin/max for offset arrays; I've added those tests here) and close #58241 (my PR for the whole-array reductions `init` refactor that snowballed into this one).

Yet to do:
- … `sum` function for custom arrays with `IndexCartesian` #33928, "Reduction on array view is slow" #28848, "`sum(f, itr; init)` slower than `sum(f, itr)`" #47216, "Reducing over bool array fails to optimize somehow" #43501. Unfortunately it looks like this makes "performance regression in 1.12.0-beta2 when summing over iterator" #58370 wayyyy worse.
- … with `--depwarn=error` — all it needed was a `mapreduce_similar` definition to satisfy some type checks! But because the internals changed, the sparse optimizations are no longer taking effect; they're just adding methods to vestigial deprecated methods that aren't called.
- … `mapreduce_kernel` on the entire dimension/array. This would be helpful for integers and other fully associative operations.
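On the "fully associative" point: integer `+` is exact under any re-association, so a whole-dimension kernel can't change answers there, unlike the float examples earlier in the thread. A quick sanity sketch:

```julia
using Random

# With an Int accumulator, a strict left fold, a strict right fold, and Base's
# pairwise/widening sum all agree exactly for integer +: re-association is
# harmless, which is what makes a fused whole-dimension kernel safe here.
Random.seed!(1)
v = rand(Int8, 1_000)
l = foldl(+, v; init=0)
r = foldr(+, v; init=0)
@assert l == r == sum(v)
```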