Get the performance of ArrayPartition up to that of Vector #122


Closed
ChrisRackauckas opened this issue Aug 7, 2017 · 2 comments

ChrisRackauckas commented Aug 7, 2017

Some quick tests show that, when using broadcast, ArrayPartitions do quite well:

using RecursiveArrayTools
u0 = zeros(100)
v0 = ones(100)
t1 = ArrayPartition((u0,v0))
t2 = [u0;v0]
t12 = ArrayPartition((similar(u0),similar(v0)))
t22 = similar(t2)

using BenchmarkTools
function f(t12,t1)
    for i in eachindex(t12.x)
        t12.x[i] .= t1.x[i] .+ t1.x[i]
    end
end
@benchmark f(t12,t1)

#=
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     70.867 ns (0.00% GC)
  median time:      74.457 ns (0.00% GC)
  mean time:        81.489 ns (0.00% GC)
  maximum time:     249.982 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     979
 =#

function g(t22,t2)
    t22 .= t2 .+ t2
end
@benchmark g(t22,t2)

#=
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     247.830 ns (0.00% GC)
  median time:      264.466 ns (0.00% GC)
  mean time:        294.141 ns (1.59% GC)
  maximum time:     6.066 μs (91.20% GC)
  --------------
  samples:          10000
  evals/sample:     352
=#

# Julia 0.6-style generated broadcast! that recurses into each partition,
# rewriting only ArrayPartition arguments to index into their `.x[i]` field:
add_idxs(x,expr) = expr
add_idxs{T<:ArrayPartition}(::Type{T},expr) = :($(expr).x[i])
@generated function Base.broadcast!(f,A::ArrayPartition,B...)
  exs = ((add_idxs(B[i],:(B[$i])) for i in eachindex(B))...)
  :(for i in eachindex(A.x)
    broadcast!(f,A.x[i],$(exs...))
  end)
end
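For readers on modern Julia, where the 0.6 `add_idxs{T<:ArrayPartition}` syntax above no longer parses, the same idea can be sketched without a generated function. This is a minimal, self-contained stand-in (the hypothetical `Partitioned` type and `bcast2!` helper are illustrations, not the RecursiveArrayTools implementation):

```julia
# Hypothetical minimal stand-in for ArrayPartition: a wrapper around a tuple of arrays.
struct Partitioned{T<:Tuple}
    x::T
end

# In-place binary broadcast applied partition-by-partition, so each inner
# broadcast! specializes on that partition's concrete array type.
function bcast2!(f, dest::Partitioned, a::Partitioned, b::Partitioned)
    for i in eachindex(dest.x)
        broadcast!(f, dest.x[i], a.x[i], b.x[i])
    end
    return dest
end

u = Partitioned((zeros(3), ones(3)))
d = Partitioned((similar(u.x[1]), similar(u.x[2])))
bcast2!(+, d, u, u)  # d now holds (zeros(3), fill(2.0, 3))
```

The per-partition loop is what keeps the inner broadcasts allocation-free even when the partitions have heterogeneous element types.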

function h(t12,t1)
    t12 .= t1 .+ t1
end
@benchmark h(t12,t1)

#=
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     60.261 ns (0.00% GC)
  median time:      62.329 ns (0.00% GC)
  mean time:        68.939 ns (0.00% GC)
  maximum time:     225.685 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     991
=#

u0 = zeros(2)
v0 = ones(2)
t1 = ArrayPartition((u0,v0))
t2 = [u0;v0]
t12 = ArrayPartition((similar(u0),similar(v0)))
t22 = similar(t2)

@benchmark f(t12,t1)

#=
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.886 ns (0.00% GC)
  median time:      39.520 ns (0.00% GC)
  mean time:        43.894 ns (0.00% GC)
  maximum time:     173.009 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
=#

@benchmark g(t22,t2)

#=
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     211.612 ns (0.00% GC)
  median time:      226.932 ns (0.00% GC)
  mean time:        253.315 ns (1.97% GC)
  maximum time:     4.982 μs (92.19% GC)
  --------------
  samples:          10000
  evals/sample:     516
=#

@benchmark h(t12,t1)

#=
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.224 ns (0.00% GC)
  median time:      28.689 ns (0.00% GC)
  mean time:        32.268 ns (0.00% GC)
  maximum time:     168.033 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
=#

This is great because it allows for heterogeneously typed arrays and forms the basis of the partitioned methods. But how does it do inside the full ODE solver?

using DiffEqBase, OrdinaryDiffEq, Base.Test, RecursiveArrayTools, DiffEqDevTools
using BenchmarkTools

u0 = zeros(100)
v0 = ones(100)
f1 = function (t,u,v,du)
  du .= v
end
f2 = function (t,u,v,dv)
  dv .= -u
end
function (::typeof(f2))(::Type{Val{:analytic}}, x, y0)
  u0, v0 = y0
  ArrayPartition(u0*cos(x) + v0*sin(x), -u0*sin(x) + v0*cos(x))
end

prob = DynamicalODEProblem(f1,f2,u0,v0,(0.0,5.0))

@benchmark solve(prob,RK4(),dt=1/2,save_everystep=false)

#=
BenchmarkTools.Trial: 
  memory estimate:  82.59 KiB
  allocs estimate:  565
  --------------
  minimum time:     126.171 μs (0.00% GC)
  median time:      137.295 μs (0.00% GC)
  mean time:        159.308 μs (4.95% GC)
  maximum time:     3.158 ms (88.39% GC)
  --------------
  samples:          10000
  evals/sample:     1
=#
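As an aside (not part of the issue), the analytic pair defined above can be sanity-checked with a central finite difference: it should satisfy u' = v and v' = -u. The concrete initial values below are assumed for illustration:

```julia
# Analytic solution u(x) = u0*cos(x) + v0*sin(x), v(x) = -u0*sin(x) + v0*cos(x),
# evaluated here with assumed scalars u0 = 0.0, v0 = 1.0:
u(x) = 0.0*cos(x) + 1.0*sin(x)
v(x) = -0.0*sin(x) + 1.0*cos(x)

h = 1e-6
du = (u(1.0 + h) - u(1.0 - h)) / (2h)  # central difference ≈ u'(1)
dv = (v(1.0 + h) - v(1.0 - h)) / (2h)  # central difference ≈ v'(1)
```

Both differences match `v(1.0)` and `-u(1.0)` respectively to well within the O(h²) truncation error.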

function f(t,u,du)
    uu = @view u[1:100]
    v = @view u[101:end]
    du[1:100] .= uu
    du[101:200] .= v
end
u02 = [zeros(100);ones(100)]
tspan=(0.0,5.0)
prob = ODEProblem(f,u02,tspan)
@benchmark sol = solve(prob,RK4(),dt=1/2,save_everystep=false)

#=
BenchmarkTools.Trial: 
  memory estimate:  43.98 KiB
  allocs estimate:  408
  --------------
  minimum time:     65.574 μs (0.00% GC)
  median time:      71.722 μs (0.00% GC)
  mean time:        83.817 μs (6.20% GC)
  maximum time:     3.776 ms (93.39% GC)
  --------------
  samples:          10000
  evals/sample:     1
=#

This matters now because the dynamical ODEs generate a problem like this, which can now be used with the general ODE solvers as a fallback. It would be nice to reduce the overhead of that fallback, especially since there doesn't seem to be any fundamental reason for the type's elementwise operations to be slower.

That timing is with a modification to RK4 to make it use broadcast.

ChrisRackauckas (Member, Author) commented:

These timings make use of SciML/DiffEqBase.jl@9276249, which helped a lot.

ChrisRackauckas (Member, Author) commented:

New timings:

using DiffEqBase, OrdinaryDiffEq, Base.Test, RecursiveArrayTools, DiffEqDevTools
using BenchmarkTools

u0 = zeros(100)
v0 = ones(100)
f1 = function (t,u,v,du)
  du .= v
end
f2 = function (t,u,v,dv)
  dv .= -u
end
function (::typeof(f2))(::Type{Val{:analytic}}, x, y0)
  u0, v0 = y0
  ArrayPartition(u0*cos(x) + v0*sin(x), -u0*sin(x) + v0*cos(x))
end

prob = DynamicalODEProblem(f1,f2,u0,v0,(0.0,5.0))

@benchmark solve(prob,RK4(),dt=1/2,adaptive=false,save_everystep=false)

gives

BenchmarkTools.Trial: 
  memory estimate:  90.83 KiB
  allocs estimate:  637
  --------------
  minimum time:     137.588 μs (0.00% GC)
  median time:      153.982 μs (0.00% GC)
  mean time:        181.241 μs (5.89% GC)
  maximum time:     3.434 ms (92.12% GC)
  --------------
  samples:          10000
  evals/sample:     1

while

function f(t,u,du)
    uu = @view u[1:100]
    v = @view u[101:end]
    du[1:100] .= uu
    du[101:200] .= v
end
u02 = [zeros(100);ones(100)]
tspan=(0.0,5.0)
prob = ODEProblem(f,u02,tspan)
@benchmark sol = solve(prob,RK4(),dt=1/2,save_everystep=false)

gives

BenchmarkTools.Trial: 
  memory estimate:  161.34 KiB
  allocs estimate:  611
  --------------
  minimum time:     152.811 μs (0.00% GC)
  median time:      167.156 μs (0.00% GC)
  mean time:        197.936 μs (8.24% GC)
  maximum time:     3.146 ms (92.47% GC)
  --------------
  samples:          10000
  evals/sample:     1

So now it is faster to use ArrayPartitions than it is to use views. This is probably in large part due to @devmotion's #122.

Additionally, this checks out when just testing the indexing.

using RecursiveArrayTools
u0 = zeros(100)
v0 = ones(100)
t1 = ArrayPartition((u0,v0))
t2 = [u0;v0]
t12 = ArrayPartition((similar(u0),similar(v0)))
t22 = similar(t2)

using BenchmarkTools
function f(t12,t1)
    for i in eachindex(t12.x)
        t12.x[i] .= t1.x[i] .+ t1.x[i]
    end
end
@benchmark f(t12,t1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     68.444 ns (0.00% GC)
  median time:      72.592 ns (0.00% GC)
  mean time:        78.955 ns (0.00% GC)
  maximum time:     296.889 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988
function g(t22,t2)
    t22 .= t2 .+ t2
end
@benchmark g(t22,t2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     68.444 ns (0.00% GC)
  median time:      72.592 ns (0.00% GC)
  mean time:        78.955 ns (0.00% GC)
  maximum time:     296.889 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

and

function h(t12,t1)
    t12 .= t1 .+ t1
end
@benchmark h(t12,t1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     58.370 ns (0.00% GC)
  median time:      64.297 ns (0.00% GC)
  mean time:        70.905 ns (0.00% GC)
  maximum time:     266.074 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

So once again the ArrayPartition broadcast is doing better than the straight indexing of the larger array.

How about direct indexing?

using BenchmarkTools
function j(t12,t1)
    for i in eachindex(t12)
        t12[i] .= t1[i] .+ t1[i]
    end
end
@benchmark j(t12,t1)
BenchmarkTools.Trial: 
  memory estimate:  31.27 KiB
  allocs estimate:  1601
  --------------
  minimum time:     67.623 μs (0.00% GC)
  median time:      71.136 μs (0.00% GC)
  mean time:        80.140 μs (2.67% GC)
  maximum time:     1.950 ms (92.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

That's much worse, so for now you should avoid directly indexing an ArrayPartition (#106). Of course, that cannot really be fixed. But other than that, development-wise this is case closed.
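The slowdown from direct indexing is unsurprising: a linear `getindex` on a partitioned container must first search for the partition containing index `i`, and with heterogeneously typed partitions the returned element type is not inferable. A minimal sketch of that lookup (the `getpart` helper is hypothetical, not the package's implementation):

```julia
# Hypothetical per-element lookup across a tuple of partitions: every access
# walks the partitions to find the one containing linear index i.
function getpart(xs::Tuple, i::Int)
    for a in xs
        i <= length(a) && return a[i]
        i -= length(a)
    end
    throw(BoundsError(xs, i))
end

xs = (zeros(100), ones(100))
getpart(xs, 150)  # index 150 falls into the second partition
```

Doing this search for every element, rather than broadcasting partition-by-partition, is what drives the allocations in the benchmark above.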
