Configured rule for maximum(f, xs)
#490
Conversation
This has been much simplified. For the case of a complete reduction only:

```julia
julia> @btime gradient(x -> sum(maximum(sqrt, x)), $(rand(30,30))); # this PR + Zygote + Julia 1.8
  min 8.625 μs, mean 10.906 μs (52 allocations, 8.92 KiB. GC mean 13.94%)

julia> @btime gradient(x -> sum(maximum(sqrt.(x))), $(rand(30,30)));
  min 10.041 μs, mean 16.087 μs (49 allocations, 36.88 KiB. GC mean 20.75%)
```

With a more expensive function:

```julia
julia> @btime gradient(x -> sum(maximum(log∘exp, x)), $(rand(30,30)));
  min 20.208 μs, mean 22.335 μs (116 allocations, 10.88 KiB. GC mean 5.22%)

julia> @btime gradient(x -> sum(maximum((log∘exp).(x))), $(rand(30,30)));
  min 19.291 μs, mean 25.757 μs (49 allocations, 36.88 KiB. GC mean 13.03%)

julia> @btime maximum(log∘exp, $(rand(30,30)));
  min 8.958 μs, mean 9.128 μs (0 allocations)
```

The broadcasted one uses dual numbers, which is much quicker. Note BTW that there is no chunk mode in play here -- it always evaluates […]. I'm not so sure why the complete reduction is slower than broadcasting here, but it's much closer, and uses 3x less memory. Diffractor, BTW, does not see this rule. It does see #480, but broadcast times are variable.
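The dual-number mechanism behind the broadcast path can be pictured with a toy sketch. `MyDual` here is my own illustrative type, not ForwardDiff's `Dual`: each element carries its derivative alongside its value, which is why the broadcast path stores a full extra array of derivatives.

```julia
# Toy forward-mode dual number, for illustration only.
struct MyDual
    val::Float64   # primal value
    der::Float64   # derivative carried alongside
end
Base.sqrt(d::MyDual) = MyDual(sqrt(d.val), d.der / (2 * sqrt(d.val)))

xs = [1.0, 4.0, 9.0]
duals = [sqrt(MyDual(x, 1.0)) for x in xs]   # one dual per element, N derivatives stored
vals = [d.val for d in duals]                # [1.0, 2.0, 3.0]
ders = [d.der for d in duals]                # [0.5, 0.25, 1/6]
```

The complete-reduction rule avoids storing those N per-element derivatives, which is where the roughly 3x memory saving above comes from.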
Status here is as in (edited) first message above. Perhaps the broadcast path can be easily tested using JuliaDiff/ChainRulesTestUtils.jl#243 once that's available.
A few questions, generally looks good. Do you plan to extend the tests?
test/rulesets/Base/mapreduce.jl
Outdated
```julia
@test_skip test_rrule(maximum, sqrt, Float64[1 2; 3 4], fkwargs=(; dims = 1), check_inferred=false)
@test_skip test_rrule(minimum, abs, randn(3,3), fkwargs=(; dims = 2), check_inferred=false)
```
these will need JuliaDiff/FiniteDifferences.jl#203
I thought these needed JuliaDiff/ChainRulesTestUtils.jl#243: with `dims` it always calls broadcast.
Yep, they do need JuliaDiff/ChainRulesTestUtils.jl#243 (now merged), but also JuliaDiff/FiniteDifferences.jl#203 to get around `to_vec`ing `InplaceableThunk`s correctly (tested locally)
But where do InplaceableThunks come from? This path of this rule doesn't make them.
I do still get an error with only the CRTU update:
```julia
julia> test_rrule(maximum, sqrt, Float64[1 2; 3 4], fkwargs=(; dims = 1), check_inferred=false)
test_rrule: maximum on typeof(sqrt),Matrix{Float64}: Error During Test at /Users/me/.julia/packages/ChainRulesTestUtils/fCvaU/src/testers.jl:193
Got exception outside of a @test
DimensionMismatch("second dimension of A, 4, does not match length of x, 7")
Stacktrace:
[1] gemv!(y::Vector{Float64}, tA::Char, A::Matrix{Float64}, x::Vector{Float64}, α::Bool, β::Bool)
@ LinearAlgebra ~/.julia/dev/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:493
[2] mul!
@ ~/.julia/dev/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:93 [inlined]
[3] mul!
@ ~/.julia/dev/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:276 [inlined]
[4] *(tA::Transpose{Float64, Matrix{Float64}}, x::Vector{Float64})
@ LinearAlgebra ~/.julia/dev/julia/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:86
[5] _j′vp(fdm::FiniteDifferences.AdaptedFiniteDifferenceMethod{5, 1, FiniteDifferences.UnadaptedFiniteDifferenceMethod{7, 5}}, f::Function, ȳ::Vector{Float64}, x::Vector{Float64})
@ FiniteDifferences ~/.julia/packages/FiniteDifferences/R6uao/src/grad.jl:80
[6] j′vp(fdm::FiniteDifferences.AdaptedFiniteDifferenceMethod{5, 1, FiniteDifferences.UnadaptedFiniteDifferenceMethod{7, 5}}, f::ChainRulesTestUtils.var"#fnew#45"{ChainRulesTestUtils.var"#call#41"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}}, Tuple{typeof(broadcast), typeof(sqrt), Matrix{Float64}}, Tuple{Bool, Bool, Bool}}, ȳ::InplaceableThunk{Thunk{ChainRules.var"#1316#1319"{Matrix{Float64}, Int64, Matrix{Float64}, ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ProjectTo{Float64, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}}, Matrix{CartesianIndex{2}}}}, ChainRules.var"#1317#1320"{Matrix{Float64}, Int64, Matrix{CartesianIndex{2}}}}, x::Matrix{Float64})
@ FiniteDifferences ~/.julia/packages/FiniteDifferences/R6uao/src/grad.jl:73
[7] _make_j′vp_call(fdm::Any, f::Any, ȳ::Any, xs::Any, ignores::Any)
@ ChainRulesTestUtils ~/.julia/packages/ChainRulesTestUtils/fCvaU/src/finite_difference_calls.jl:51
[8] f_pb
@ ~/.julia/packages/ChainRulesTestUtils/fCvaU/src/rule_config.jl:40 [inlined]
[9] (::ChainRules.var"#minormax_f_back2#2098"{ChainRules.var"#maximum_pullback#1326"{ChainRules.var"#findmax_pullback#1318"{Int64
```
Solved by the to_vec PR, as you said.
Can this thing give less cryptic errors than this "DimensionMismatch" when it goes wrong?
Yeah, I agree with you in general: JuliaDiff/ChainRulesTestUtils.jl#244
Here though this is coming from `rrule_via_ad` using `_make_j′vp_call` rather than the usual place 😂
Solving JuliaDiff/ChainRulesTestUtils.jl#213 would be a big QoL improvement indeed. It's on my list
JuliaDiff/FiniteDifferences.jl#203 is now merged, so I think we can update the tests
Great!
This one is weird locally, but on 1.6 it seems to work (or will, once changed to `≈ [10 0 0; 0 -20 0]`):
```julia
julia> y2, bk2 = rrule(CFG, minimum, abs, [1 2 3; -5 -4 -4], dims = 2);

julia> @test y2 == hcat([1, 4])
Test Passed
  Expression: y2 == hcat([1, 4])
   Evaluated: [1; 4;;] == [1; 4;;]

julia> bk2(hcat([10, 20]))
(NoTangent(), NoTangent(), NoTangent())
```
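For reference, the gradient that example should produce can be worked out by hand: scatter each row's cotangent onto its arg-min position, scaled by the derivative of `abs` there. This is just my illustration of the expected behaviour, not the rule's implementation:

```julia
x = [1 2 3; -5 -4 -4]
dy = [10, 20]
_, inds = findmin(abs.(x); dims = 2)   # CartesianIndex of each row's minimum
dx = zeros(size(x))
for (k, i) in enumerate(inds)
    dx[i] = dy[k] * sign(x[i])         # d/dx abs(x) == sign(x) away from zero
end
dx                                     # == [10 0 0; 0 -20 0]
```

This matches the `≈ [10 0 0; 0 -20 0]` expected above: the minimum of row 2 is `abs(-4)` at column 2, so its cotangent 20 is flipped in sign.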
- save less stuff in sum(f, xs) rule
- probably destroyed in the rebase
- re-organise
- change to use BitArray
- add a few tests
- Revert "save less stuff in sum(f, xs) rule" (This reverts commit c8034da.)
- tidy, add cumsum trick
- tests for multiple maxima
- tweaks
- fixup
- update, tidy
- Apply 3 suggestions (Co-authored-by: Miha Zgubic <[email protected]>)
- add an error
- remove error, as closing over `y` breaks inference
- simplify, update
- solve Core.Box
- tests approx
This uses the `RuleConfig{>:HasReverseMode}` story to call back into AD to write a rule for `maximum(f, xs)`.

It's much simplified from the first attempt: it calls `i = findmax(f, xs)`, and then uses `rrule_via_ad(f, xs[i])`. However, it only needs one such call, rather than one for every element. That means it ends up calling `f` say `N^2 + 1` times for a matrix (or `N^2 + N` with `dims`). This is much more efficient than calling it via AD all `N^2` times, saving the pullbacks somewhere, and calling just one. Not always faster than Zygote's current broadcasting (which uses ForwardDiff), but much less memory.

Fast case, before & after: before this PR, `gradient(x -> sum(maximum(sqrt, x, dims=1)), (rand(30,30)))` gives an error with Zygote. After, it is the same speed as broadcasting. What doesn't seem easy now is testing the broadcast path.

If this is OK, then perhaps the `sum(f, x)` rule from #441 should also consider calling `f` more times. There's a commit here doing that, which cuts the memory use by quite a bit. Perhaps there are functions `f` for which calling twice would be slower? Perhaps writing `sum(f, x)` vs. `sum(f.(x))` is how you emphasise that you care more about memory? (It may make sense to remove this & discuss `sum` in another thread.) [Now removed here.]

All WIP, needs more careful testing, etc.
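The strategy can be sketched without any AD machinery. `maximum_with_pullback` and `df` are my own names here: the real rule obtains the derivative at the arg-max element via `rrule_via_ad` rather than from a hand-supplied `df`.

```julia
# Toy version of the complete-reduction strategy: N calls to f to locate
# the maximum, then a single derivative evaluation at that element.
function maximum_with_pullback(f, df, xs)
    y, i = findmax(map(f, xs))       # like findmax(f, xs) on Julia ≥ 1.7
    function pullback(dy)
        dxs = zeros(size(xs))
        dxs[i] = df(xs[i]) * dy      # only the arg-max element gets a gradient
        return dxs
    end
    return y, pullback
end

y, back = maximum_with_pullback(sqrt, x -> 1 / (2 * sqrt(x)), [1.0 4.0; 9.0 16.0])
# y == 4.0; back(1.0) is zero except 1/(2*sqrt(16)) == 0.125 at the 16.0 entry
```

This is why the memory cost stays flat: nothing per-element is saved for the pullback besides the single index `i`.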