
WIP: The great pairwise reduction refactor #58418


Draft: wants to merge 46 commits into base: master

Conversation

@mbauman (Member) commented May 14, 2025

This is the grand culmination of (and core motivation for) all the reductions work I've slowly been chipping away at over the past year. It implements the following major changes:

  • All reduce-based reductions use a pairwise re-association by default. This includes seven different implementations:
    • Whole array reductions
      • IndexLinear (these were often pairwise already)
      • IndexCartesian (these were previously foldls without SIMD)
      • Iterators with Shape or Length (previously foldls without SIMD)
      • Iterators with unknown length (previously foldls without SIMD)
    • Dimensional array reductions
      • Contiguous column major dimensions (only IndexLinear were pairwise before)
      • Discontiguous column major dimensions
      • Row major dimensions
  • All of these implementations do their inner work through a singular mapreduce_kernel function (with specific signatures for arrays and iterators) after performing the pairwise splits and/or dimensional selection. This could eventually become a public API that should help alleviate some of the disruption to those who had been extending the internals before. See, for example, how LinearAlgebra replaced its internal and tricky _sum overloads with this method: JuliaLang/LinearAlgebra.jl@e70953e.
  • The dimensional array reductions no longer guess at initialization. This fully replaces my previous attempt in WIP: a more principled take on dimensional reduction inits #55318, and is at the core of what I want to address here — the guesswork here is what was responsible for many bugs.
  • All of those implementations consistently use init to start every reduction chain as we decided the documentation should direct implementations to do in Document that reduce-like functions use init one or more times #53945. They also consistently use mapreduce_first.
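A minimal sketch of that scheme, using hypothetical names (the real implementation routes the leaves through mapreduce_kernel; this standalone version just shows the shape of the recursion and how init starts every chain):

function pairwise_mapreduce(f, op, init, A,
                            lo::Int=firstindex(A), hi::Int=lastindex(A);
                            blksize::Int=1024)
    if hi - lo < blksize
        # Leaf kernel: a plain, SIMD-able loop, started from `init`
        v = init
        @simd for i in lo:hi
            v = op(v, f(A[i]))
        end
        return v
    else
        # Pairwise re-association: split in half, reduce each, combine
        mid = (lo + hi) >>> 1
        return op(pairwise_mapreduce(f, op, init, A, lo, mid; blksize),
                  pairwise_mapreduce(f, op, init, A, mid + 1, hi; blksize))
    end
end

A = rand(10^6);
pairwise_mapreduce(identity, +, 0.0, A) ≈ sum(A)  # true

Note that init is applied once per leaf chain, which is exactly why #53945 requires it to be neutral for op.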

This currently stands atop the yet-to-be-merged #58374 and #55628.

As the successor to #55318, this carries all its tests and addresses all those issues in the dimensional reductions. Closes: #45566, closes #47231, closes #55213, closes #31427, closes #41054, closes #54920, closes #42259, closes #38660, closes #21097, closes #32366, closes #54875, closes #43731, closes #45562.

In addition, this further brings pairwise reassociations (and SIMD!) to the other class of reduction. Closes #29883, closes #52365, closes #30421, closes #52457, closes #39385.

There are other pull requests that this supersedes and that inspired my work here. In particular, this supersedes and would close #45651 (pairwise whole-array IndexCartesian) and closes #52397 (pairwise whole-array Iterators.SizeUnknown). It also supersedes a few other PRs; it would close #45822 (findmin/max for offset arrays; I've added those tests here) and close #58241 (my PR for the whole-array reductions init refactor that snowballed into this one).

Yet to do:

mbauman and others added 20 commits May 12, 2025 11:07
(cherry picked from commit 4bee191)
for accumulate inference

(cherry picked from commit 4fadd8f)
Previously, Julia tried to guess at initial values for empty dimensional reductions in a slightly different way than whole-array reductions. This change unifies those behaviors, such that `mapreduce_empty` is called for empty dimensional reductions, just like it is called for empty whole-array reductions.

Beyond the unification (which is valuable in its own right), this change enables some downstream simplifications to dimensional reductions and is likely to be the "most breaking" public behavior in a refactoring targeting more correctness improvements.

(cherry picked from commit 94f24b8)
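Concretely, the unification means empty dimensional reductions answer through the same mapreduce_empty/reduce_empty path as whole-array ones. A hedged illustration of the rule this commit describes (not verified output):

sum(Float64[])            # 0.0, from reduce_empty(+, Float64)
sum(zeros(0, 3); dims=1)  # a 1×3 matrix of 0.0, now also via mapreduce_empty
maximum(Float64[])        # throws: max has no neutral element to fall back on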
they already supported axes through the default fallback that uses size to construct OneTos
and this was causing ambiguities
(cherry picked from commit 32fc16b)
@mbauman mbauman marked this pull request as draft May 14, 2025 23:03
@giordano added labels needs nanosoldier run (This PR should have benchmarks run on it), fold (sum, maximum, reduce, foldl, etc.), and needs tests (Unit tests are required for this change) on May 14, 2025
@mbauman added labels needs news (A NEWS entry is required for this change) and needs docs (Documentation for this change is required) on May 14, 2025
@tecosaur (Member)

Fantastic stuff, this is really exciting to see! 🥳

length(t::ImmutableDict) = count(Returns(true), t)
# length is defined in terms of iteration; using a higher-order function to do the iteration
# is likely to call `length` again, so this is a manual for loop:
function length(t::ImmutableDict)
    r = 0
    for _ in t
        r += 1
    end
    return r
end
(Member)

won't this be horribly slow?

@mbauman (Member, Author)

I mean, wasn't it already doing this? This shouldn't be any worse. It's definitely faster than infinite recursion :)

The trouble is that ImmutableDict has an IteratorSize of HasLength. And so the new reduction machinery happily calls length to determine the pairwise splits. So if length itself uses reductions (like count) to calculate length, we're in trouble.
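The shape of the trap, as a hypothetical reproduction (not code from this PR):

# An iterator that keeps the default IteratorSize (HasLength) but defines
# length via a reduction: count routes into mapreduce, which under the new
# machinery asks for length up front to pick its pairwise splits, which
# calls count again...
struct Wrapped{T}
    itr::T
end
Base.iterate(w::Wrapped, state...) = iterate(w.itr, state...)
Base.length(w::Wrapped) = count(Returns(true), w)  # would recurse forever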

On nightly:

julia> using BenchmarkTools

julia> d = Base.ImmutableDict((i => i+1 for i in 1:200)...);

julia> @benchmark length($d)
BenchmarkTools.Trial: 10000 samples with 641 evaluations per sample.
 Range (min … max):  193.512 ns … 392.225 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     193.772 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.869 ns ±  11.655 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁  ▁▃▂▄▂                                                     ▁
  ███████████▇██▇▇▆▇▆▆▆▅▅▇▆▆▅▆▅▅▆▅▅▅▅▅▅▅▄▄▅▅▅▅▂▄▃▃▄▄▄▄▃▂▂▂▃▄▂▅▄ █
  194 ns        Histogram: log(frequency) by time        241 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @eval Base function length(t::ImmutableDict)
           r = 0
           for _ in t
               r+=1
           end
           return r
       end
length (generic function with 95 methods)

julia> @benchmark length($d)
BenchmarkTools.Trial: 10000 samples with 631 evaluations per sample.
 Range (min … max):  193.542 ns … 366.482 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     193.740 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.829 ns ±  11.533 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▁ ▄ ▂                                                     ▁
  ███████▇███▇██▇▇▇▇▆▇▆▆▅▆▆▆▇▆▆▆▅▅▅▅▅▅▄▅▆▅▄▅▅▅▅▄▄▅▄▄▄▅▅▄▄▃▄▄▃▃▄ █
  194 ns        Histogram: log(frequency) by time        238 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

(Member)

ah that's unfortunate...

@jakobnissen (Member)

Worth seeing if it also closes #43501

@nanosoldier (Collaborator)

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

1 packages crashed only on the current version.

  • The process was aborted: 1 packages

6 packages crashed on the previous version too.

✖ Packages that failed

1030 packages failed only on the current version.

  • Package fails to precompile: 755 packages
  • Illegal method overwrites during precompilation: 10 packages
  • Package has test failures: 29 packages
  • Package tests unexpectedly errored: 92 packages
  • Networking-related issues were detected: 92 packages
  • There were unidentified errors: 1 packages
  • Tests became inactive: 2 packages
  • Test duration exceeded the time limit: 46 packages
  • Test log exceeded the size limit: 3 packages

1402 packages failed on the previous version too.

✔ Packages that passed tests

23 packages passed tests only on the current version.

  • Other: 23 packages

4619 packages passed tests on the previous version too.

~ Packages that at least loaded

3 packages successfully loaded only on the current version.

  • Other: 3 packages

2574 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

2 packages were skipped only on the current version.

  • Package could not be installed: 2 packages

921 packages were skipped on the previous version too.

@mbauman (Member, Author) commented May 22, 2025

1030 packages failed only on the current version

😭 OK, it's not as bad as it might seem. Most of this is from the use of promote_union in DataFrames (and perhaps elsewhere, but that's the big one). That's easy to fix. Before unleashing nanosoldier again, I thought I'd look at DataFrames locally.

(@v1.13) pkg> add DataFrames
   Resolving package versions...
    Updating `~/.julia/environments/v1.13/Project.toml`
  [a93c6f00] + DataFrames v1.7.0
    Updating `~/.julia/environments/v1.13/Manifest.toml`
  [a93c6f00] + DataFrames v1.7.0
  [41ab1584] + InvertedIndices v1.3.1
  [08abe8d2] + PrettyTables v2.4.0
  [892a3eda] + StringManipulation v0.4.1

(@v1.13) pkg> test DataFrames
# ... only showing tests with failures
Test Summary:          | Pass  Fail  Total  Time
DataFrame constructors |    6     1      7  2.1s
Test Summary:                                                        | Pass  Fail  Total   Time
fast reductions: AsTable(cols)=>sum, mean, minimum, maximum variants |   85     1     86  12.8s
Test Summary:           |    Pass  Fail    Total   Time
OnCol correctness tests | 1201549  3000  1204549  10.5s

All but that reductions testset are due to #57509 (looks like they're testing exact hash values?). The reductions one is interesting and relevant:

fast reductions: AsTable(cols)=>sum, mean, minimum, maximum variants: Test Failed at /Users/mbauman/.julia/packages/DataFrames/kcA9R/test/select.jl:1825
  Expression: (combine(df, AsTable(:) => (ByRow(sum) => :sum))).sum ≃ (combine(df, AsTable(:) => (ByRow(sum ∘ skipmissing) => :sum))).sum ≃ sum.(eachrow(df))

What's happening here is:

  • That special operator checks both isequal and that (column) types match. So we're looking at exact equality between these reductions.
  • (combine(df, AsTable(:) => (ByRow(sum) => :sum))).sum is a foldl
  • (combine(df, AsTable(:) => (ByRow(sum ∘ skipmissing) => :sum))).sum is a foldl
  • sum.(eachrow(df)) hits this pairwise implementation and gets a more accurate result!

Concretely:

julia> using Test, DataFrames, Random, LinearAlgebra

julia> Random.seed!(1234);

julia> m = rand(10, 10000);

julia> df = DataFrame(m, :auto);

julia> (combine(df, AsTable(:) => (ByRow(sum) => :sum))).sum == map(x->foldl(+, x), eachslice(m, dims=1))
true

julia> (combine(df, AsTable(:) => (ByRow(sum ∘ skipmissing) => :sum))).sum == map(x->foldl(+, x), eachslice(m, dims=1))
true

julia> sum.(eachrow(df)) == map(x->foldl(+, x), eachslice(m, dims=1))
false

julia> sum.(eachrow(df)) == vec(sum(m, dims=2))
true

julia> norm(vec(sum(m, dims=2)) .- Float64.(vec(sum(big.(m), dims=2))))
2.8760747774580336e-12

julia> norm(map(x->foldl(+, x), eachslice(m, dims=1)) .- Float64.(vec(sum(big.(m), dims=2))))
3.69774194955537e-11

So I think this is worth trying nanosoldier again on.

@nanosoldier runtests()

@KristofferC (Member)

Just as a note (and maybe you already know), but the nanosoldier report has a copy-pastable command to rerun only the packages that failed. If the only thing you're doing is adding a small fix for a specific package, it might be worth rerunning just the failing packages, since that is usually much, much faster.

@mbauman (Member, Author) commented May 22, 2025

Yeah, I know about it but didn't think to do it here. Would've been a good idea, I suppose, even with that thousand-long list.


I did go check and confirmed that DataFrames is indeed coding up their own ByRow(sum) reduction implementation that is wholly independent of Base... and it is indeed a foldl. So that failure is expected; Base is now doing better.

_mapreduce_start(f, op, A, ::_InitialValue) = mapreduce_empty(f, op, _empty_eltype(A))
_mapreduce_start(f, op, A, ::_InitialValue, a1) = mapreduce_first(f, op, a1)
_mapreduce_start(f, op, A, init) = init
_mapreduce_start(f, op, A, init, a1) = op(init, mapreduce_first(f, op, a1))
@mbauman (Member, Author)

Suggested change:
- _mapreduce_start(f, op, A, init, a1) = op(init, mapreduce_first(f, op, a1))
+ _mapreduce_start(f, op, A, init, a1) = op(init, f(a1))

In discussion on triage, we came to the conclusion that mapreduce_first exists for the case of "fixing up" the first element for use as either the return value (if it's the only one) or as a first "left" argument (although we don't guarantee argument orderings). In the case that we have init, we don't need to do this operation — we trust the provided init to effectively do it for us.

@mbauman (Member, Author) commented May 22, 2025

The case of adding Bools gives a nice concrete example of why this is the right thing to do (and is a test-case to add). With this suggested change, we get:

reduce(+, [true, false])     === 1    # `reduce_first(+, true)` promotes to Int, which is passed to `1 + false`
reduce(+, [true])            === 1    # that's `reduce_first(+, true)`
reduce(+, Bool[])            === 0    # that's `reduce_empty(+, Bool)`
reduce(+, [true]; init=0x00) === 0x01 # *** `reduce_first` is *not* called  and `+(::UInt8, ::Bool)` does *not* promote ***
reduce(+, Bool[]; init=0x00) === 0x00 # that's the `init`

This is exactly the behavior you get with foldl and foldr; they use reduce_empty and reduce_first in exactly the same way (but of course won't additionally call reduce_first a second time in a reassociated pairwise or parallel chain). Before this suggested change the line marked *** was out of line.

This doesn't have an impact on sum because Base.add_sum(::UInt8, ::Bool) promotes to Int (with #58374). But that leads to an interesting question — when is it useful (or even better?) to specify promotion behaviors in a reduction operator's behavior (like add_sum) vs. with a cascade stemming from a promotion in reduce_first? Are add_sum's extra promotion behaviors even necessary if we ensure reduce_first's use?

Or here's another example — Statistics' iterable mean implementation isn't currently written as a reduction, but it sure could be (or, really, should be). It's effectively:

add_mean(x::T, y::S) where {T,S} = x + convert(promote_type(T, S), y)
Base.reduce_first(::typeof(add_mean), x) = Base.reduce_first(+, x/1)
Base.reduce_empty(::typeof(add_mean), T) = Base.reduce_empty(+, T)/1

mean(itr) = foldl(add_mean, itr)/length(itr)

That initial division by one ensures a cascading promotion to float, targeting integer overflow (JuliaStats/Statistics.jl#25). But we also want to avoid overflow for Float16 (JuliaStats/Statistics.jl#140). So what's the way to fix that? Is it with:

reduce_first(::typeof(add_mean), x::Float16) = Float32(x)

Or is it with methods of add_mean? What about allowing, e.g., FixedPoint promotion (JuliaStats/Statistics.jl#165)? Should they implement methods of add_mean or reduce_first?

(Member)

off the cuff I'd go with add_mean

reduce_first feels like it should (mostly) be handling mechanical / generic base cases for recursion, and any app-specific logic (such as promotion to control when overflow & underflow happens) would be encoded in the reduction function
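A hedged sketch of that operator-side spelling (hypothetical methods, not Statistics.jl's actual API):

# Fallback: accumulate with cascading promotion, as in the sketch above
add_mean(x::T, y::S) where {T,S} = x + convert(promote_type(T, S), y)
# The Float16 fix lives on the operator: widen narrow inputs so every op
# application (including the very first one) accumulates in Float32
add_mean(x, y::Float16) = add_mean(x, Float32(y))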

@nanosoldier (Collaborator)

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

1 packages crashed only on the current version.

  • The process was aborted: 1 packages

8 packages crashed on the previous version too.

✖ Packages that failed

205 packages failed only on the current version.

  • Package fails to precompile: 11 packages
  • Illegal method overwrites during precompilation: 1 packages
  • Package has test failures: 31 packages
  • Package tests unexpectedly errored: 122 packages
  • Networking-related issues were detected: 3 packages
  • There were unidentified errors: 2 packages
  • Tests became inactive: 1 packages
  • Test duration exceeded the time limit: 32 packages
  • Test log exceeded the size limit: 2 packages

1392 packages failed on the previous version too.

✔ Packages that passed tests

23 packages passed tests only on the current version.

  • Other: 23 packages

5119 packages passed tests on the previous version too.

~ Packages that at least loaded

10 packages successfully loaded only on the current version.

  • Other: 10 packages

2907 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

921 packages were skipped on the previous version too.

@mbauman (Member, Author) commented May 27, 2025

OK, getting better. And there's still a good amount of low-hanging fruit that should resolve more. I've fixed a few items, but some outstanding ones in the top ten (across the failure modes) that I haven't fully investigated, fixed, or flagged upstream yet:

  • There are the failures from bad (non-neutral) inits from Whole-array reductions always use init to start each reduction chain #58241 (OnlineStats & AcceleratedKernels)
  • Calling norm(itr::ComponentArrays.ComponentVector{Float64, Vector{Float64}, Tuple{ComponentArrays.Axis{(p1 = 1, p2 = 2, p3 = 3, p4 = 4, p5 = 5)}}}) ends up indexing into the axes of the component vector in a way that seems to be unexpected and is leading to an ambiguity error: getindex(::ComponentArrays.CombinedAxis{ComponentArrays.Axis{(p1 = 1, p2 = 2, p3 = 3, p4 = 4, p5 = 5)}, Base.OneTo{Int64}}, ::UnitRange{Int64}) is ambiguous in the mapreduce_kernel where we do the inds[begin+1:end] loop.
  • ColorVectorSpace and YaoArrayRegister check exact floating-point equality in a reduction
  • MatrixNetworks has some perf tests and we're 3x slower than it expects at https://github.com/JuliaGraphs/MatrixNetworks.jl/blob/ff4ea518385509b83de24e34cde5eadb430bf5f4/test/diffusions_test.jl#L227-L230. Not sure why yet
  • DiskArrays and DiskArrayTools explicitly test how many times the array is indexed into during a dimensional sum (dims=1). They expect 10 indexing operations but now observe 60. This is surely because DiskArrays.jl specialized on Base._mapreduce to implement some sort of chunked optimization — that's now no longer taking effect.
  • Brillouin.jl gets a -0.0 out of cartesianize but expects a positive zero. The MWE there is in StaticArrays:
    julia> vert = [  0.6666666666666665, -0.33333333333333326,  0.49999999999999983];
    
    julia> basis = [ [6.283185307179586, 3.6275987284684357, 0.0], [0.0, 7.255197456936871, 0.0], [0.0, 0.0, 5.026548245743669]];
    
    julia> vert' * basis
    3-element Vector{Float64}:
     4.18879020478639
     0.0
     2.5132741228718336
    
    julia> SVector{3}(vert)' * SVector{3}(SVector{3}.(basis))
    3-element SVector{3, Float64} with indices SOneTo(3):
      4.18879020478639
     -5.5126462140537006e-17
      2.5132741228718336
    
    julia> sum(uu*vv for (uu, vv) in zip(SVector{3}(vert)', SVector{3}(SVector{3}.(basis))))
    3-element SVector{3, Float64} with indices SOneTo(3):
      4.18879020478639
     -5.5126462140537006e-17
      2.5132741228718336
  • ChainRulesCore has a NoTangent iterator that claims to have a length (well, it's the default) but does not. Similarly for IterativeSolvers.IDRSIterable. The standard opt-out is sketched after this list.
  • StatsBase is picking up the wrong Statistics?
  • FillArrays and BandedMatrices are an independent breakage on master (which looks to have since been fixed)
  • Clustering is a SparseArrays problem, calling widelength on a subarray of a sparse in the new mapreduce kernel. Should be an easy fix there.
  • JuliaInterpreter expects to hit length(LinearIndices(a)) in the sum implementation, which we no longer call
  • Measurements.Derivatives uses count to determine its length
  • Strided.jl uses reducedim_init in ways that aren't supported by mapreduce_empty.
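On the ChainRulesCore/IterativeSolvers items: the standard opt-out for iterators without a real length is to declare SizeUnknown. A sketch with a hypothetical iterator type:

# Declaring SizeUnknown keeps the reduction machinery from calling
# length() to pick pairwise split points.
struct UntilDone{T}
    itr::T
end
Base.iterate(u::UntilDone, state...) = iterate(u.itr, state...)
Base.IteratorSize(::Type{<:UntilDone}) = Base.SizeUnknown()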

@giordano (Member)

That was based on the definition of length(::ImmutableDict), which you changed here. Could be changed there as well to mirror the new definition.

@mbauman (Member, Author) commented May 27, 2025

OK, the dot product of StaticArrays is a fascinating one. This is precisely the case I conjectured about above — the * here is getting marked contract and fusing with the reduction operator in the @simd loop. It's the difference between:

julia> x = [0.6666666666666665, -0.33333333333333326];

julia> y = [3.6275987284684357, 7.255197456936871];

julia> x[1]*y[1] + x[2]*y[2]
0.0

julia> fma(x[2], y[2], x[1]*y[1])
-5.5126462140537006e-17

julia> using LinearAlgebra: BLAS

julia> BLAS.dot(x,y)
-5.5126462140537006e-17

This is a classic example — most commonly seen in complex arithmetic. You definitely don't want complex * to use fma. Is this similar?

julia> sum(x->x*1.0, [-.1,.1])
0.0

julia> sum(x->x*0.1, [-.1,.1])
-8.326672684688674e-19

That's... kinda weird, right? The above is on this branch, but you can also observe this on master in slightly more contrived examples:

julia> sum(x->x*0.1, [-.1,.1])
0.0

julia> sum(x->x*0.1, [-.1; .1; zeros(13)])
0.0

julia> sum(x->x*0.1, [0.0; -.1; .1; zeros(13)])
-8.326672684688674e-19

I suppose the point is that — if you're fusing a * into a sum, then you definitionally have the above asymmetry in precision. But... BLASes have long done this in their dot-products, so perhaps I'm just being numerically naive.
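For reference, the complex-multiply version of this hazard is easy to demonstrate (a standard numerics illustration, not code from this PR): the imaginary part of z * conj(z) is b*a - a*b, which is exactly zero only when both products round identically.

a, b = 0.1, 0.3
b*a - a*b          # 0.0: both products round to the same Float64
fma(b, a, -(a*b))  # ≈ 1.7e-18 (nonzero): the fused b*a is never rounded before the add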

@vtjnash (Member) commented May 27, 2025

StaticArrays explicitly requests the fastmath version of fma here: https://github.com/JuliaArrays/StaticArrays.jl/blob/563adebc264a23d38da97bf5d9e5d3c6333bdf26/src/matrix_multiply_add.jl#L428

@mbauman (Member, Author) commented May 27, 2025

I have benchmarks! They're... ok. A bit of a mixed bag.

The code
using Chairmarks: Chairmarks, @be
using CSV: CSV
using ProgressMeter: Progress, next!

struct LengthUnknown{T}
    itr::T
end
@inline Base.iterate(x::LengthUnknown) = iterate(x.itr)
@inline Base.iterate(x::LengthUnknown, s) = iterate(x.itr, s)
Base.IteratorSize(::LengthUnknown) = Base.SizeUnknown()
Base.IteratorEltype(::LengthUnknown) = Base.HasEltype()
Base.eltype(u::LengthUnknown) = Base.eltype(u.itr)

using Random: Random
const rng = Random.MersenneTwister(1155)
rand(sz...) = Random.rand(rng, sz...)
rand(::Type{T}, sz...) where {T} = Random.rand(rng, T, sz...)

haslength(x) = Iterators.take(x,length(x))
function cartesianvec(x)
    n = length(x)
    n1 = nextpow(2, sqrt(n))
    return @view reshape(x, n1, :)[begin:end, begin:end]
end

function bench_wholearray(Ns, Ts)
    results = (; style=String[], eltype=String[], init=Bool[], dims=Union{Missing,String}[], n=Int[], time=Float64[])
    P = Progress(4*4*2*length(Ts))
    for (xf,type) in [(identity, "IndexLinear"), (cartesianvec, "IndexCartesian"), (haslength, "HasLength"), (LengthUnknown, "SizeUnknown")],
        (xf,type) in [(xf, type),
                      (x->skipmissing(xf(Array{Union{Missing, eltype(x)}}(x))), "SkipMissing $type 0%"),
                      (x->skipmissing(xf(Array{Union{Missing, eltype(x)}}([rand() < .5 ? missing : x for x in x]))), "SkipMissing $type 50%"),
                      (x->skipmissing(xf(Array{Union{Missing, eltype(x)}}(fill(missing, size(x))))), "SkipMissing $type 100%"),],
        T in Ts,
        init in (nothing, zero(T))
            ts = [begin
                A = xf(rand(T, N÷sizeof(T)))
                if isempty(A) && isnothing(init) && try mapreduce(identity, +, A); false; catch; true; end
                    # Empty SkipMissings don't define reduce_empty; skip it.
                    [(;time = NaN,)]
                elseif isnothing(init)
                    @be(mapreduce(identity, +, $A))
                else
                    @be(mapreduce(identity, +, $A; init=$init))
                end
            end for N in Ns]
            append!(results.time, getproperty.(minimum.(ts), :time))
            append!(results.n, Ns)
            append!(results.dims, fill(missing, length(ts)))
            append!(results.eltype, fill(string(T), length(ts)))
            append!(results.style, fill(type, length(ts)))
            append!(results.init, fill(!isnothing(init), length(ts)))
            next!(P)
    end
    results
end

function bench_dims(Ns, Ts)
    results = (; style=String[], eltype=String[], init=Bool[], dims=Union{Missing,String}[], n=Int[], time=Float64[])
    P = Progress(2*8*2*length(Ts))
    for (xf,type) in [(identity, "IndexLinear"), (x->@view(x[1:end,1:end,1:end,1:end]), "IndexCartesian")],
        (szf, dims) in [((n,_,_)->(n÷4,4), 1), ((n,_,_)->(4,n÷4), 2),
                        ((n,n2,_)->(n2,n÷n2÷4,4), (1,2)),
                        ((n,n2,_)->(n2,4,n÷n2÷4), (1,3)),
                        ((n,n2,_)->(4,n2,n÷n2÷4), (2,3)),
                        ((n,_,n3)->(n3,n3,n÷n3÷n3÷4,4), (1,2,3)),
                        ((n,_,n3)->(n3,n3,4,n÷n3÷n3÷4), (1,2,4)),
                        ((n,_,n3)->(n3,4,n3,n÷n3÷n3÷4), (1,3,4)),
                        ((n,_,n3)->(4,n3,n3,n÷n3÷n3÷4), (2,3,4))],
        T in Ts,
        init in (nothing, zero(T))
                ts = [begin
                    n = N÷sizeof(T)
                    A = xf(rand(T, szf(n, nextpow(2, sqrt(n)), nextpow(2, cbrt(n)))))
                    if isnothing(init)
                        @be(mapreduce(identity, +, $A; dims=$dims))
                    else
                        @be(mapreduce(identity, +, $A; dims=$dims, init=$init))
                    end
                end for N in Ns]
                append!(results.time, getproperty.(minimum.(ts), :time))
                append!(results.n, Ns)
                append!(results.eltype, fill(string(T), length(ts)))
                append!(results.style, fill(type, length(ts)))
                append!(results.init, fill(!isnothing(init), length(ts)))
                append!(results.dims, fill(string(dims), length(ts)))
                next!(P)
    end
    return results
end

whole = bench_wholearray(2 .^ (3:22), (Int32,Int64,Float32,Float64))
dims = bench_dims(2 .^ (5:22), (Int32,Int64,Float32,Float64))
foreach(Base.splat(append!), zip(whole, dims))
CSV.write(Base.GIT_VERSION_INFO.commit_short * ".csv", whole)

This gives us 744e3fd976.csv vs c5df01807e3.csv (nightly from a few days ago).

So we can break this down over a wide variety of cases.

[benchmark comparison plots: this PR (744e3fd976) vs. nightly (c5df01807e3), all cases]

Most of the slow cases here are the "small n":

[benchmark comparison plots: the small-n cases]

The variance is pretty high on SkipMissing — Julia's inference really doesn't like iteration's tuple of Union{Missing, ...}. The few cases with 10x+ slowdown are all instabilities of that sort.

@mbauman (Member, Author) commented May 28, 2025

OK, I've implemented specialized kernels for SkipMissing{<:AbstractArray} that mirror the old implementation. That's resolved the most egregious perf issues; none of the tested implementations are worse than a 5x regression (and quite a few are 5x+ faster). Still haven't tested Sparse perf, though; there's definitely more work to be done there.

Here are the latest benchmarks: 144c253358.csv. The remaining big regressions are in mid-size dimensional reductions, in particular dims=(1,2,4).

@mbauman (Member, Author) commented May 28, 2025

OK, awesome, I see what's happening here. When we're doing a dimensional reduction over an array with a singleton dimension, we have a choice whether that dimension should be included in the reduction — it doesn't matter whether it was included in the dims or not! And it's advantageous to include it on leading dimensions (which I had already optimized) and advantageous to remove it on trailing dimensions. The biggest gaps to master were all in cases where we have a singleton trailing dimension that was flagged as being part of the reduction. E.g., sum(rand(3,2,1), dims=(1,3)). That's the same as the simpler and more efficient dims=1! Easy enough fix.
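The equivalence in question, as a quick check (standard Base behavior):

A = rand(3, 2, 1)
# The trailing dimension has length 1, so reducing over it is a no-op;
# dims=(1,3) can be rewritten as the cheaper dims=1 with identical results.
sum(A; dims=(1, 3)) == sum(A; dims=1)  # true; both are 1×2×1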

That's why it was the "mid-size" reductions like (1,2,4) that were appearing as big regressions — those are the ones that have that trailing singleton dimension! Now all the tested benchmarks are within 3x, getting better...

30c8afbc5e.csv

[benchmark plot: remaining regressions relative to c5df01807e3]

There are 260 benchmarks with a 2x or worse regression compared to c5df01807e3.csv. Most (249/260) are dimensional reductions. The whole array ones are either small (n<=64) SizeUnknown iterables or very large (n>million) entirely missing (that is, empty) SkipMissing{<:Array}. The dimensional reductions are mostly IndexCartesian (190/249), and mostly discontiguous column major like (1,3) or (1,2,4) (150/249).

Labels: fold (sum, maximum, reduce, foldl, etc.), needs docs (Documentation for this change is required), needs nanosoldier run (This PR should have benchmarks run on it), needs news (A NEWS entry is required for this change), needs tests (Unit tests are required for this change)