Simulation of large systems utilizing GPUs (i.e. CuArrays)? #110
I am very excited to see this package applied on the GPU. People work with large-dimensional HMC all the time, so theoretically it should be the ideal method if your problem is somewhat "regular". I appreciate the very nice MWE, which should help a lot in figuring this out.

Unfortunately, my current machine doesn't have a GPU and I am not well-versed in how to use one. I would recommend that you try to solve the problem yourself first. If that fails, then if someone can provide access to a machine (e.g. over ssh) that has the GPU-based part set up, I am happy to log in and help figure it out.
Great, thanks! My problem definitely falls into the "regular" category. I'll have a closer look then... And for the record (or in case you or someone else can spot anything), here's the stacktrace for the errors when I run the code above (I should have included that in my first post, sorry about that):

```julia
julia> results = mcmc_with_warmup(rng, ℓ, 10)
[ Info: finding initial optimum
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/0lvhc/src/indexing.jl:16
ERROR: GPU compilation of #25(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}) failed
KernelError: passing and using non-bitstype argument
Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}.
That type is not isbits, and such arguments are only allowed when they are unused by the kernel. .args is of type Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}} which is not isbits.
.1 is of type Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}} which is not isbits.
.x is of type Array{Float64,1} which is not isbits.
Stacktrace:
[1] check_invocation(::CUDAnative.CompilerJob, ::LLVM.Function) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/validation.jl:70
[2] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:187 [inlined]
[3] macro expansion at /home/troels/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:214 [inlined]
[4] #codegen#152(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:186
[5] #codegen at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:0 [inlined]
[6] #compile#151(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:47
[7] #compile#150 at ./none:0 [inlined]
[8] #compile at ./none:0 [inlined] (repeats 2 times)
[9] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:403 [inlined]
[10] #cufunction#194(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::getfield(GPUArrays, Symbol("##25#26")), ::Type{Tuple{CuArrays.CuKernelState,CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
[11] cufunction(::Function, ::Type) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
[12] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:176 [inlined]
[13] macro expansion at ./gcutils.jl:87 [inlined]
[14] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:173 [inlined]
[15] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{Float64,1,Nothing}, ::Tuple{CuArray{Float64,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuArray{Float64,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at /home/troels/.julia/packages/CuArrays/7z7MV/src/gpuarray_interface.jl:62
[16] gpu_call(::Function, ::CuArray{Float64,1,Nothing}, ::Tuple{CuArray{Float64,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuArray{Float64,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Int64) at /home/troels/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:151
[17] gpu_call at /home/troels/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:128 [inlined]
[18] copyto! at /home/troels/.julia/packages/GPUArrays/0lvhc/src/broadcast.jl:48 [inlined]
[19] copyto! at ./broadcast.jl:842 [inlined]
[20] copy at ./broadcast.jl:818 [inlined]
[21] materialize at ./broadcast.jl:798 [inlined]
[22] *(::LinearAlgebra.Diagonal{Float64,Array{Float64,1}}, ::CuArray{Float64,1,Nothing}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/LinearAlgebra/src/diagonal.jl:163
[23] rand_p(::CuArrays.CURAND.RNG, ::GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}}, ::Nothing) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/hamiltonian.jl:122 (repeats 2 times)
[24] warmup(::DynamicHMC.SamplingLogDensity{CuArrays.CURAND.RNG,LogNormalTest,DynamicHMC.NUTS{Val{:generalized}},LogProgressReport{Nothing}}, ::InitialStepsizeSearch, ::DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:159
[25] #27 at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:443 [inlined]
[26] mapafoldl(::typeof(identity), ::getfield(DynamicHMC, Symbol("##27#28")){DynamicHMC.SamplingLogDensity{CuArrays.CURAND.RNG,LogNormalTest,DynamicHMC.NUTS{Val{:generalized}},LogProgressReport{Nothing}}}, ::Tuple{Tuple{NamedTuple{(:stage, :results, :warmup_state),Tuple{FindLocalOptimum{Float64},Nothing,DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}}},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}, ::InitialStepsizeSearch, ::TuningNUTS{Nothing,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{Nothing,DualAveraging{Float64}}) at ./tuple.jl:192 (repeats 2 times)
[27] mapfoldl_impl(::Function, ::Function, ::NamedTuple{(:init,),Tuple{Tuple{Tuple{},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}}}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}) at ./tuple.jl:198
[28] #mapfoldl#188(::Base.Iterators.Pairs{Symbol,Tuple{Tuple{},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}},Tuple{Symbol},NamedTuple{(:init,),Tuple{Tuple{Tuple{},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}}}}, ::typeof(mapfoldl), ::Function, ::Function, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}) at ./reduce.jl:72
[29] #mapfoldl at ./none:0 [inlined]
[30] #foldl#189 at ./reduce.jl:90 [inlined]
[31] #foldl at ./none:0 [inlined]
[32] _warmup(::DynamicHMC.SamplingLogDensity{CuArrays.CURAND.RNG,LogNormalTest,DynamicHMC.NUTS{Val{:generalized}},LogProgressReport{Nothing}}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}, ::DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:441
[33] #mcmc_keep_warmup#29(::Tuple{}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}, ::DynamicHMC.NUTS{Val{:generalized}}, ::LogProgressReport{Nothing}, ::typeof(DynamicHMC.mcmc_keep_warmup), ::CuArrays.CURAND.RNG, ::LogNormalTest, ::Int64) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:517
[34] #mcmc_keep_warmup at ./none:0 [inlined]
[35] macro expansion at /home/troels/.julia/packages/Parameters/l76EM/src/Parameters.jl:750 [inlined]
[36] #mcmc_with_warmup#30(::Tuple{}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}, ::DynamicHMC.NUTS{Val{:generalized}}, ::LogProgressReport{Nothing}, ::typeof(mcmc_with_warmup), ::CuArrays.CURAND.RNG, ::LogNormalTest, ::Int64) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:568
[37] mcmc_with_warmup(::CuArrays.CURAND.RNG, ::LogNormalTest, ::Int64) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:568
[38] top-level scope at REPL[19]:1
```
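For what it's worth, the scalar-indexing warning near the top of that trace can be turned into a hard error while debugging, using the `allowscalar` switch the warning itself mentions; that makes it easier to pin down exactly which operation falls back to scalar indexing:

```julia
using CuArrays
CuArrays.allowscalar(false)  # any scalar getindex/setindex! on a CuArray now throws instead of warning
```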
It fails when making up a random momentum:

```julia
using DynamicHMC, Random
rng = Random.GLOBAL_RNG # replace with the Cu equivalent
# probably need the explicit constructor below with a diagonal matrix
κ = DynamicHMC.GaussianKineticEnergy(5)
DynamicHMC.rand_p(rng, κ, nothing)
```

The leapfrog step performs simple linear operations on the momentum. But at the same time, I don't think you would gain a lot from doing the logic of NUTS/HMC on the GPU, because it branches a lot. Instead, perhaps you could focus on just making the posterior log density work with GPUs, in a way that takes plain vanilla vectors:

```julia
function LogDensityProblems.logdensity_and_gradient(ℓ::YourModel, x::SomeVector)
z = CuArrays.convert_vector_somehow(x)
logdens, grad = .... # calculate with z on the GPU
... # get them back as a Float64 and a Vector{Float64}
end
```
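One way to flesh that sketch out (for illustration only; `YourModel` and the `gpu_logdensity` helper are placeholders, and the gradient here comes from Zygote's `pullback` rather than anything this package prescribes):

```julia
using CuArrays, Zygote
import LogDensityProblems: logdensity_and_gradient

function logdensity_and_gradient(ℓ::YourModel, x::AbstractVector)
    z = cu(x)                                       # move the input to the GPU (Float32 by default)
    logdens, back = Zygote.pullback(w -> gpu_logdensity(ℓ, w), z)
    grad = first(back(one(logdens)))                # gradient w.r.t. z, still a CuArray
    Float64(logdens), Vector{Float64}(Array(grad))  # hand plain CPU values back to the sampler
end
```

Whether the host-device transfers are worth it then depends entirely on how expensive the log density itself is.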
Now I've looked quite a bit into the problem (before realizing that you had replied - you're "too quick"! 😄 ), and here's my overall take:
Continuing the test case of my first post, here's today's story:

```julia
rng = CURAND.RNG() #Unless otherwise specified, rng refers to the CuArrays random number generator
#### Random number hacks needed to get things to work####
Random.rand(rng::CURAND.RNG, dims::Integer...; kwargs...) = Random.rand(rng, Float32, dims; kwargs...) #Defaults to Float64, not Float32. Bug reported over at CuArrays.jl.
Random.randn(rng::CURAND.RNG, dims::Integer...; kwargs...) = Random.randn(rng, Float32, dims; kwargs...)
Random.rand(rng::CURAND.RNG, ::Type{DynamicHMC.Directions}) = DynamicHMC.Directions(rand(Random.GLOBAL_RNG, UInt32)) #CURAND.RNG fails when we're dealing with scalars.
Random.randexp(rng::CURAND.RNG, T::Type{Float64}) = randexp(Random.GLOBAL_RNG, T) #... same here. (Why explicitly Float64 in the code?)
#### The kinetic energy: ####
using LinearAlgebra: Diagonal
# κ = GaussianKineticEnergy(n) #Automatically makes M⁻¹::Array{Float64}, while I want it to be a CuArray
κ = GaussianKineticEnergy(Diagonal(CuArrays.ones(n))) #Works
#### Initial adaption state ####
# Hack needed, since initial_adaptation_state in stepsize.jl introduces Float64
# (due to log(10)), even when we pass ϵ::Float32 to it...
using ArgCheck: @argcheck
function DynamicHMC.initial_adaptation_state(::DualAveraging, ϵ::Float32)
@argcheck ϵ > 0
logϵ = log(ϵ)
DynamicHMC.DualAveragingState(log(10f0) + logϵ, 0, zero(logϵ), logϵ, zero(logϵ))
end
#### TrajectoryNUTS ####
# Similar to above. min_Δ somehow ends up as Float64 when the rest are Float32
DynamicHMC.TrajectoryNUTS(H, π₀::Float32, ϵ::Float32, min_Δ::Float64, turn_statistic_configuration) =
DynamicHMC.TrajectoryNUTS(H, π₀, ϵ, Float32(min_Δ), turn_statistic_configuration)
#Later, I get a case with π₀::Float64...
DynamicHMC.TrajectoryNUTS(H, π₀::Float64, ϵ::Float32, min_Δ::Float64, turn_statistic_configuration) =
DynamicHMC.TrajectoryNUTS(H, Float32(π₀), ϵ, Float32(min_Δ), turn_statistic_configuration)
#### DualAveragingState ####
# More type issues. Originating from default values in DualAveraging, it seems...
DynamicHMC.DualAveragingState(μ::Float32, m::Int, H̄::Float64, logϵ::Float64, logϵ̄::Float64) =
DynamicHMC.DualAveragingState(μ, m, Float32(H̄), Float32(logϵ), Float32(logϵ̄))
#### HMC ####
# Here's a "minimal" example I get to work. stepsize_search ≠ nothing still leads to some type issues
mcmc_with_warmup(rng, ℓ, 10;
initialization = (κ = κ, ϵ = 1.0f0), #Even more type issues with ϵ = 1.0
warmup_stages = default_warmup_stages(;
stepsize_search = nothing
)
)
```
And, by the way, I realized that this is due to the attempted multiplication of a `Diagonal{Float64,Array{Float64,1}}` with a `CuArray` (cf. frame [22] in the stacktrace above).
I am open to extending this package in any reasonable direction needed by its users, but at the moment I am not convinced that this is necessary.

Specifically, suppose that your problem is super-well-behaved, so a tree depth of 5-6 gives you convergence. This means 32 to 64 evaluations of the log density and its gradient per transition, which is where the bulk of the time goes.

So I would propose that you first write your actual model (not a toy one, like above) and profile it to see where the bottlenecks are.
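For example, the dominant cost can be profiled directly at the level of the log-density interface (a sketch; `ℓ` stands for whatever your actual model object ends up being):

```julia
using BenchmarkTools, LogDensityProblems

x = randn(LogDensityProblems.dimension(ℓ))
# roughly one call like this per leapfrog step, i.e. 32-64 of them per NUTS transition
@btime LogDensityProblems.logdensity_and_gradient($ℓ, $x)
```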
Hmm... Yeah, that's a good point. I'll look into it. It would still be nice not to have everything converted to `Float64`, though.
To reiterate: I am happy to accept that some changes to DynamicHMC may be necessary to accommodate GPU calculations better. I am very open to making those changes, just asking that we wait until we have some real-life code to profile to determine where the bottlenecks are.
Sure, sure. I'm on it. 😉 To clarify: my last comment was not just made with GPU acceleration in mind, but more generally from my observation that, at the moment, DynamicHMC seems to only work with/assume `Float64` (see e.g. line 239 at commit 9c8d2f6).
I pointed to some other, similar cases above.
It should be possible to make the code more generic, but I don't really see the gain from that.
That is a fair point (and of course completely up to you). I just noticed that your code is "almost generically" written, with many (most?) places declaring types of the form shown in lines 223 to 229 at commit 9c8d2f6.
First, one could imagine that some critical parts of the code are kept in `Float64`. Yes, I know this is a bit hypothetical at the moment. (And much of the work burden to get this up and running would probably, and unfairly, be yours.) So I'll get back when I've got something more concrete to share 😅
You are perfectly free to fork and experiment with these things. If you find that it pans out, make a PR and I am happy to work on merging it. I am not trying to be difficult here; just please keep in mind that everything that ends up in this package, especially if there is a commitment at the API level, is one more thing to support. So I am conditioning these things on some preliminary evidence, or at least plausibility.

Regarding GPU acceleration of the NUTS logic: as I said above, I am skeptical that it would work, because the logic of NUTS involves a lot of branching. At the same time, it is not the costly part. So it is unclear to me what the GPU would buy here.
So I've finally gotten some time to look at this project again. This time, I'm benchmarking with respect to a basic but fairly representative model of the kind I had in mind. (Note that although my ultimate goal is to simulate somewhat more complicated models than what I show here, all of them have a rather simple structure that suits parallelism quite well.) Here's the first part of the code, loading packages and defining the model:

```julia
using LogDensityProblems, DynamicHMC
using DynamicHMC: Hamiltonian, evaluate_ℓ, PhasePoint, leapfrog
import LogDensityProblems: logdensity_and_gradient, capabilities, dimension, logdensity
using LinearAlgebra: Diagonal
using Parameters, Statistics, Random
using Flux, Zygote, CuArrays
using BenchmarkTools
#### φ⁴ 3D lattice model ####
struct PhiFourModel{T, F}
L::Int
λ::T
nn::F #Nearest neighbor field using Conv
end
Flux.@functor PhiFourModel #Just to get easy gpu mapping...
function PhiFourModel(L, λ = 1)
weight = zeros(3, 3, 3, 1, 1)
weight[1, 2, 2, 1, 1] = weight[3, 2, 2, 1, 1] =
weight[2, 1, 2, 1, 1] = weight[2, 3, 2, 1, 1] =
weight[2, 2, 1, 1, 1] = weight[2, 2, 3, 1, 1] = 0.5
nn = Conv(weight, zeros(1), pad = 1)
PhiFourModel(L, λ, nn)
end
function (ℓ::PhiFourModel)(q::AbstractArray{T,5}) where T
@unpack λ, nn = ℓ
kin = - sum( q .* nn(q) ) #kinetic energy: interaction between nearest neighbor lattice sites
pot = λ * sum(x -> (x^2 - 1)^2, q) #potential energy: only local contributions
return - kin - pot
end
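# (For reference: with the 0.5-weighted six-neighbour stencil, nn(q) holds half the sum of each
# site's nearest neighbours, so the method above returns
#   log p(q) = Σ_⟨ij⟩ qᵢqⱼ - λ Σᵢ (qᵢ² - 1)²
# with ⟨ij⟩ running over nearest-neighbour pairs of the L×L×L lattice; pad = 1 zero-pads,
# i.e. free boundary conditions.)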
(ℓ::PhiFourModel)(q::AbstractVector) = ℓ(reshape(q, ℓ.L, ℓ.L, ℓ.L, 1, 1))
capabilities(::PhiFourModel) = LogDensityProblems.LogDensityOrder{1}()
dimension(ℓ::PhiFourModel) = ℓ.L^3
logdensity(ℓ::PhiFourModel, x) = ℓ(x)
```

It turned out to be a bit unwieldy (as also reported earlier) to benchmark the entire DynamicHMC.jl stack in one go, so I opted to focus on the leapfrog step, which I assume is the computationally expensive part of the simulation - as you also reminded me. I explored the following scenarios, set up below: scenario 1 with everything on the CPU, and scenario 2 with the model and phase-space points on the GPU via CuArrays.
(And just to clarify, as it seems that I previously wasn't sufficiently clear: I didn't envision performing the NUTS logic on the GPU. I don't think that is worth it either. What I had in mind was to accelerate the parts that deal directly with the high-dimensional phase-space points, i.e. gradient calculation, leapfrog integration, etc.)

Due to technicalities/minor issues, I had to make separate `logdensity_and_gradient` methods for the CPU and GPU cases:

```julia
function logdensity_and_gradient(ℓ::PhiFourModel, q::CuArray{T}) where T
y, g = value_and_gradient(ℓ, q)
y, one(T).*g #Need the multiplication by one to avoid some promotion issues with CuArrays...
end
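
# Note: `value_and_gradient` is not defined above; a minimal stand-in (an assumption,
# built on Zygote.pullback) that makes this snippet self-contained:
function value_and_gradient(f, x)
    y, back = Zygote.pullback(f, x)   # y = f(x), back = pullback closure
    y, first(back(one(y)))            # gradient of f with respect to x
end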
function logdensity_and_gradient(ℓ::PhiFourModel, q::Array)
if ℓ.nn.weight isa CuArray #The model is evaluated on the GPU
return map(cpu, logdensity_and_gradient(ℓ, gpu(q)))
else #everything happens on the CPU
return value_and_gradient(ℓ, q)
end
end
```

Then I tested scenario 1 as follows:

```julia
L = 100
n = L^3
ℓ = PhiFourModel(L)
κ = GaussianKineticEnergy(n)
H = Hamiltonian(κ, ℓ)
q = randn(n)
p = randn(n)
Q = evaluate_ℓ(ℓ, q)
z = PhasePoint(Q, p)
ϵ = 1.0
@btime leapfrog(H, z, ϵ)
```

while scenario 2 was done by simply writing:

```julia
ℓ = PhiFourModel(L) |> gpu
κ = GaussianKineticEnergy(Diagonal(CuArrays.ones(n))) #CuArrays version
H = Hamiltonian(κ, ℓ)
q = randn(n) |> gpu
p = randn(n) |> gpu
Q = evaluate_ℓ(ℓ, q)
z = PhasePoint(Q, p)
ϵ = 1.0f0
@btime leapfrog(H, z, ϵ)
```

I got the following results on my laptop (Intel Core i7-8750H CPU, Nvidia GeForce GTX 1050 Ti Max-Q graphics card):
Thoughts?
TuringLang/AdvancedHMC.jl#261 discusses a new GPU paper by Hoffman et al. with very impressive benchmarks.
I'm looking into the possibility of using DynamicHMC.jl for lattice model simulations. For such problems, the number of degrees of freedom ("parameters") should typically be as large as possible - O(10^6) is not uncommon.
Now, because of this large number of degrees of freedom, I was thinking of taking advantage of GPU processing to accelerate the log-density and gradient calculations to be fed to the HMC sampler. However, before I do any deep dive into the code base of DynamicHMC, I'd like to know if you think this is feasible...
As a starting point for any discussion, here's some exploratory testing where I got stuck:
(I'm on DynamicHMC v2.1.0, CuArrays v1.4.7, Zygote v0.4.1.)
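Judging from the stacktrace earlier in the thread (a `LogNormalTest` problem sampled with `CuArrays.CURAND.RNG`), the setup was roughly along the lines below; this is a hypothetical stand-in for illustration only, with `ToyGPUProblem` and its log density being assumptions rather than the original model:

```julia
using DynamicHMC, LogDensityProblems, CuArrays

# Hypothetical toy problem: iid standard normals, with the value and gradient
# written so that they also work on CuArrays.
struct ToyGPUProblem
    n::Int
end

LogDensityProblems.dimension(ℓ::ToyGPUProblem) = ℓ.n
LogDensityProblems.capabilities(::ToyGPUProblem) = LogDensityProblems.LogDensityOrder{1}()
LogDensityProblems.logdensity_and_gradient(::ToyGPUProblem, x) = (-sum(abs2, x) / 2, -x)

rng = CuArrays.CURAND.RNG()             # GPU RNG, as seen in the stacktrace
ℓ = ToyGPUProblem(10_000)
results = mcmc_with_warmup(rng, ℓ, 10)  # the kind of call that failed above
```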