Simulation of large systems utilizing GPUs (i.e. CuArrays)? #110
I am very excited to see this package applied on the GPU. People work with large-dimensional HMC all the time, so theoretically it should be the ideal method if your problem is somewhat "regular". I appreciate the very nice MWE, which should help a lot in figuring this out.

Unfortunately, my current machine doesn't have a GPU and I am not well-versed in how to use one. I would recommend that you try to solve the problem yourself first. If that fails, then if someone can provide access to a machine (e.g. over ssh) that has the GPU-based part set up, I am happy to log in and help figure it out.
Great, thanks! My problem definitely falls into the "regular" category. I'll have a closer look then... And for the record (or in case you or someone else can spot anything), here's the stacktrace for the errors when I run the code above (I should have included that in my first post, sorry about that):

```julia
julia> results = mcmc_with_warmup(rng, ℓ, 10)
[ Info: finding initial optimum
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/0lvhc/src/indexing.jl:16
ERROR: GPU compilation of #25(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}) failed
KernelError: passing and using non-bitstype argument
Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}.
That type is not isbits, and such arguments are only allowed when they are unused by the kernel. .args is of type Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}} which is not isbits.
.1 is of type Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}} which is not isbits.
.x is of type Array{Float64,1} which is not isbits.
Stacktrace:
[1] check_invocation(::CUDAnative.CompilerJob, ::LLVM.Function) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/validation.jl:70
[2] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:187 [inlined]
[3] macro expansion at /home/troels/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:214 [inlined]
[4] #codegen#152(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:186
[5] #codegen at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:0 [inlined]
[6] #compile#151(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:47
[7] #compile#150 at ./none:0 [inlined]
[8] #compile at ./none:0 [inlined] (repeats 2 times)
[9] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:403 [inlined]
[10] #cufunction#194(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::getfield(GPUArrays, Symbol("##25#26")), ::Type{Tuple{CuArrays.CuKernelState,CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
[11] cufunction(::Function, ::Type) at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
[12] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:176 [inlined]
[13] macro expansion at ./gcutils.jl:87 [inlined]
[14] macro expansion at /home/troels/.julia/packages/CUDAnative/2WQzk/src/execution.jl:173 [inlined]
[15] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{Float64,1,Nothing}, ::Tuple{CuArray{Float64,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuArray{Float64,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at /home/troels/.julia/packages/CuArrays/7z7MV/src/gpuarray_interface.jl:62
[16] gpu_call(::Function, ::CuArray{Float64,1,Nothing}, ::Tuple{CuArray{Float64,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{Array{Float64,1},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuArray{Float64,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Int64) at /home/troels/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:151
[17] gpu_call at /home/troels/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:128 [inlined]
[18] copyto! at /home/troels/.julia/packages/GPUArrays/0lvhc/src/broadcast.jl:48 [inlined]
[19] copyto! at ./broadcast.jl:842 [inlined]
[20] copy at ./broadcast.jl:818 [inlined]
[21] materialize at ./broadcast.jl:798 [inlined]
[22] *(::LinearAlgebra.Diagonal{Float64,Array{Float64,1}}, ::CuArray{Float64,1,Nothing}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/LinearAlgebra/src/diagonal.jl:163
[23] rand_p(::CuArrays.CURAND.RNG, ::GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}}, ::Nothing) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/hamiltonian.jl:122 (repeats 2 times)
[24] warmup(::DynamicHMC.SamplingLogDensity{CuArrays.CURAND.RNG,LogNormalTest,DynamicHMC.NUTS{Val{:generalized}},LogProgressReport{Nothing}}, ::InitialStepsizeSearch, ::DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:159
[25] #27 at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:443 [inlined]
[26] mapafoldl(::typeof(identity), ::getfield(DynamicHMC, Symbol("##27#28")){DynamicHMC.SamplingLogDensity{CuArrays.CURAND.RNG,LogNormalTest,DynamicHMC.NUTS{Val{:generalized}},LogProgressReport{Nothing}}}, ::Tuple{Tuple{NamedTuple{(:stage, :results, :warmup_state),Tuple{FindLocalOptimum{Float64},Nothing,DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}}},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}, ::InitialStepsizeSearch, ::TuningNUTS{Nothing,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}}, ::TuningNUTS{Nothing,DualAveraging{Float64}}) at ./tuple.jl:192 (repeats 2 times)
[27] mapfoldl_impl(::Function, ::Function, ::NamedTuple{(:init,),Tuple{Tuple{Tuple{},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}}}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}) at ./tuple.jl:198
[28] #mapfoldl#188(::Base.Iterators.Pairs{Symbol,Tuple{Tuple{},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}},Tuple{Symbol},NamedTuple{(:init,),Tuple{Tuple{Tuple{},DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}}}}}, ::typeof(mapfoldl), ::Function, ::Function, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}) at ./reduce.jl:72
[29] #mapfoldl at ./none:0 [inlined]
[30] #foldl#189 at ./reduce.jl:90 [inlined]
[31] #foldl at ./none:0 [inlined]
[32] _warmup(::DynamicHMC.SamplingLogDensity{CuArrays.CURAND.RNG,LogNormalTest,DynamicHMC.NUTS{Val{:generalized}},LogProgressReport{Nothing}}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}, ::DynamicHMC.WarmupState{DynamicHMC.EvaluatedLogDensity{CuArray{Float64,1,Nothing},Float64},GaussianKineticEnergy{LinearAlgebra.Diagonal{Float64,Array{Float64,1}},LinearAlgebra.Diagonal{Float64,Array{Float64,1}}},Nothing}) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:441
[33] #mcmc_keep_warmup#29(::Tuple{}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}, ::DynamicHMC.NUTS{Val{:generalized}}, ::LogProgressReport{Nothing}, ::typeof(DynamicHMC.mcmc_keep_warmup), ::CuArrays.CURAND.RNG, ::LogNormalTest, ::Int64) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:517
[34] #mcmc_keep_warmup at ./none:0 [inlined]
[35] macro expansion at /home/troels/.julia/packages/Parameters/l76EM/src/Parameters.jl:750 [inlined]
[36] #mcmc_with_warmup#30(::Tuple{}, ::Tuple{FindLocalOptimum{Float64},InitialStepsizeSearch,TuningNUTS{Nothing,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{LinearAlgebra.Diagonal,DualAveraging{Float64}},TuningNUTS{Nothing,DualAveraging{Float64}}}, ::DynamicHMC.NUTS{Val{:generalized}}, ::LogProgressReport{Nothing}, ::typeof(mcmc_with_warmup), ::CuArrays.CURAND.RNG, ::LogNormalTest, ::Int64) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:568
[37] mcmc_with_warmup(::CuArrays.CURAND.RNG, ::LogNormalTest, ::Int64) at /home/troels/.julia/packages/DynamicHMC/pVyrS/src/mcmc.jl:568
[38] top-level scope at REPL[19]:1
```
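For what it's worth, the scalar-indexing warning near the top of that trace can be turned into a hard error while debugging, using the `allowscalar` switch the warning itself mentions; that makes it easier to pin down exactly which operation falls back to scalar indexing:

```julia
using CuArrays
CuArrays.allowscalar(false)  # any scalar getindex/setindex! on a CuArray now throws instead of warning
```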
It fails when making up a random momentum:

```julia
using DynamicHMC, Random
rng = Random.GLOBAL_RNG # replace with the Cu equivalent
# probably need the explicit constructor below with a diagonal matrix
κ = DynamicHMC.GaussianKineticEnergy(5)
DynamicHMC.rand_p(rng, κ, nothing)
```

The leapfrog step performs simple linear operations on the momentum. But at the same time, I don't think you would gain a lot from doing the logic of NUTS/HMC on the GPU, because it branches a lot. Instead, perhaps you could focus on just making the posterior log density work with GPUs, in a way that takes plain vanilla vectors:

```julia
function LogDensityProblems.logdensity_and_gradient(ℓ::YourModel, x::SomeVector)
z = CuArrays.convert_vector_somehow(x)
logdens, grad = .... # calculate with z on the GPU
... # get them back as a Float64 and a Vector{Float64}
end
```
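One way to flesh that sketch out (for illustration only; `YourModel` and the `gpu_logdensity` helper are placeholders, and the gradient here comes from Zygote's `pullback` rather than anything this package prescribes):

```julia
using CuArrays, Zygote
import LogDensityProblems: logdensity_and_gradient

function logdensity_and_gradient(ℓ::YourModel, x::AbstractVector)
    z = cu(x)                                       # move the input to the GPU (Float32 by default)
    logdens, back = Zygote.pullback(w -> gpu_logdensity(ℓ, w), z)
    grad = first(back(one(logdens)))                # gradient w.r.t. z, still a CuArray
    Float64(logdens), Vector{Float64}(Array(grad))  # hand plain CPU values back to the sampler
end
```

Whether the host-device transfers are worth it then depends entirely on how expensive the log density itself is.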
Now I've looked quite a bit into the problem (before realizing that you had replied - you're "too quick"! 😄 ), and here's my overall take:
Continuing the test case of my first post, here's today's story:

```julia
rng = CURAND.RNG() #Unless otherwise specified, rng refers to the CuArrays random number generator
#### Random number hacks needed to get things to work####
Random.rand(rng::CURAND.RNG, dims::Integer...; kwargs...) = Random.rand(rng, Float32, dims; kwargs...) #Defaults to Float64, not Float32. Bug reported over at CuArrays.jl.
Random.randn(rng::CURAND.RNG, dims::Integer...; kwargs...) = Random.randn(rng, Float32, dims; kwargs...)
Random.rand(rng::CURAND.RNG, ::Type{DynamicHMC.Directions}) = DynamicHMC.Directions(rand(Random.GLOBAL_RNG, UInt32)) #CURAND.RNG fails when we're dealing with scalars.
Random.randexp(rng::CURAND.RNG, T::Type{Float64}) = randexp(Random.GLOBAL_RNG, T) #... same here. (Why explicitly Float64 in the code?)
#### The kinetic energy: ####
using LinearAlgebra: Diagonal
# κ = GaussianKineticEnergy(n) #Automatically makes M⁻¹::Array{Float64}, while I want it to be a CuArray
κ = GaussianKineticEnergy(Diagonal(CuArrays.ones(n))) #Works
#### Initial adaption state ####
# Hack needed, since initial_adaptation_state in stepsize.jl introduces Float64
# (due to log(10)), even when we pass ϵ::Float32 to it...
using ArgCheck: @argcheck
function DynamicHMC.initial_adaptation_state(::DualAveraging, ϵ::Float32)
@argcheck ϵ > 0
logϵ = log(ϵ)
DynamicHMC.DualAveragingState(log(10f0) + logϵ, 0, zero(logϵ), logϵ, zero(logϵ))
end
#### TrajectoryNUTS ####
# Similar to above. min_Δ somehow ends up as Float64 when the rest are Float32
DynamicHMC.TrajectoryNUTS(H, π₀::Float32, ϵ::Float32, min_Δ::Float64, turn_statistic_configuration) =
DynamicHMC.TrajectoryNUTS(H, π₀, ϵ, Float32(min_Δ), turn_statistic_configuration)
#Later, I get a case with π₀::Float64...
DynamicHMC.TrajectoryNUTS(H, π₀::Float64, ϵ::Float32, min_Δ::Float64, turn_statistic_configuration) =
DynamicHMC.TrajectoryNUTS(H, Float32(π₀), ϵ, Float32(min_Δ), turn_statistic_configuration)
#### DualAveragingState ####
# More type issues. Originating from default values in DualAveraging, it seems...
DynamicHMC.DualAveragingState(μ::Float32, m::Int, H̄::Float64, logϵ::Float64, logϵ̄::Float64) =
DynamicHMC.DualAveragingState(μ, m, Float32(H̄), Float32(logϵ), Float32(logϵ̄))
#### HMC ####
# Here's a "minimal" example I get to work. stepsize_search ≠ nothing still leads to some type issues
mcmc_with_warmup(rng, ℓ, 10;
initialization = (κ = κ, ϵ = 1.0f0), #Even more type issues with ϵ = 1.0
warmup_stages = default_warmup_stages(;
stepsize_search = nothing
)
)
```
And, by the way, I realized that this is due to the attempted multiplication of a `Diagonal{Float64,Array{Float64,1}}` with a `CuArray` (cf. frame [22] in the stacktrace above).
I am open to extending this package in any reasonable direction needed by its users, but at the moment I am not convinced that this is necessary.

Specifically, suppose that your problem is super-well-behaved, so a tree depth of 5-6 gives you convergence. This means 32 to 64 evaluations of the log density and its gradient per transition, which is where the bulk of the time goes.

So I would propose that you first write your actual model (not a toy one, like above) and profile it to see where the bottlenecks are.
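For example, the dominant cost can be profiled directly at the level of the log-density interface (a sketch; `ℓ` stands for whatever your actual model object ends up being):

```julia
using BenchmarkTools, LogDensityProblems

x = randn(LogDensityProblems.dimension(ℓ))
# roughly one call like this per leapfrog step, i.e. 32-64 of them per NUTS transition
@btime LogDensityProblems.logdensity_and_gradient($ℓ, $x)
```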
Hmm... Yeah, that's a good point. I'll look into it. It would still be nice not to have everything converted to `Float64`, though.
To reiterate: I am happy to accept that some changes to DynamicHMC may be necessary to accommodate GPU calculations better. I am very open to making those changes, just asking that we wait until we have some real-life code to profile to determine where the bottlenecks are.
Sure, sure. I'm on it. 😉 To clarify: my last comment was not just made with GPU acceleration in mind, but more generally from my observation that, at the moment, DynamicHMC seems to only work with/assume `Float64` (see e.g. line 239 at commit 9c8d2f6).
I pointed to some other, similar cases above.
It should be possible to make the code more generic, but I don't really see the gain from that.
That is a fair point (and of course completely up to you). I just noticed that your code is "almost generically" written, with many (most?) places declaring types of the form shown in lines 223 to 229 at commit 9c8d2f6.
First, one could imagine that some critical parts of the code are kept in `Float64`. Yes, I know this is a bit hypothetical at the moment. (And much of the work burden to get this up and running would probably, and unfairly, be yours.) So I'll get back when I've got something more concrete to share 😅
You are perfectly free to fork and experiment with these things. If you find that it pans out, make a PR and I am happy to work on merging it. I am not trying to be difficult here; just please keep in mind that everything that ends up in this package, especially if there is a commitment at the API level, is one more thing to support. So I am conditioning these things on some preliminary evidence, or at least plausibility.

Regarding GPU acceleration of the NUTS logic: as I said above, I am skeptical that it would work, because the logic of NUTS involves a lot of branching. At the same time, it is not the costly part. So it is unclear to me what the GPU would buy here.
So I've finally gotten some time to look at this project again. This time, I'm benchmarking with respect to a basic but fairly representative model of the kind I had in mind. (Note that although my ultimate goal is to simulate somewhat more complicated models than what I show here, all of them have a rather simple structure that suits parallelism quite well.) Here's the first part of the code, loading packages and defining the model:

```julia
using LogDensityProblems, DynamicHMC
using DynamicHMC: Hamiltonian, evaluate_ℓ, PhasePoint, leapfrog
import LogDensityProblems: logdensity_and_gradient, capabilities, dimension, logdensity
using LinearAlgebra: Diagonal
using Parameters, Statistics, Random
using Flux, Zygote, CuArrays
using BenchmarkTools
#### φ⁴ 3D lattice model ####
struct PhiFourModel{T, F}
L::Int
λ::T
nn::F #Nearest neighbor field using Conv
end
Flux.@functor PhiFourModel #Just to get easy gpu mapping...
function PhiFourModel(L, λ = 1)
weight = zeros(3, 3, 3, 1, 1)
weight[1, 2, 2, 1, 1] = weight[3, 2, 2, 1, 1] =
weight[2, 1, 2, 1, 1] = weight[2, 3, 2, 1, 1] =
weight[2, 2, 1, 1, 1] = weight[2, 2, 3, 1, 1] = 0.5
nn = Conv(weight, zeros(1), pad = 1)
PhiFourModel(L, λ, nn)
end
function (ℓ::PhiFourModel)(q::AbstractArray{T,5}) where T
@unpack λ, nn = ℓ
kin = - sum( q .* nn(q) ) #kinetic energy: interaction between nearest neighbor lattice sites
pot = λ * sum(x -> (x^2 - 1)^2, q) #potential energy: only local contributions
return - kin - pot
end
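# (For reference: with the 0.5-weighted six-neighbour stencil, nn(q) holds half the sum of each
# site's nearest neighbours, so the method above returns
#   log p(q) = Σ_⟨ij⟩ qᵢqⱼ - λ Σᵢ (qᵢ² - 1)²
# with ⟨ij⟩ running over nearest-neighbour pairs of the L×L×L lattice; pad = 1 zero-pads,
# i.e. free boundary conditions.)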
(ℓ::PhiFourModel)(q::AbstractVector) = ℓ(reshape(q, ℓ.L, ℓ.L, ℓ.L, 1, 1))
capabilities(::PhiFourModel) = LogDensityProblems.LogDensityOrder{1}()
dimension(ℓ::PhiFourModel) = ℓ.L^3
logdensity(ℓ::PhiFourModel, x) = ℓ(x)
```

It turned out to be a bit unwieldy (as also reported earlier) to benchmark the entire DynamicHMC.jl stack in one go, so I opted to focus on the leapfrog step, which I assume is the computationally expensive part of the simulation - as you also reminded me. I explored the following scenarios, set up below: scenario 1 with everything on the CPU, and scenario 2 with the model and phase-space points on the GPU via CuArrays.
(And just to clarify, as it seems that I previously wasn't sufficiently clear: I didn't envision performing the NUTS logic on the GPU. I don't think that is worth it either. What I had in mind was to accelerate the parts that deal directly with the high-dimensional phase-space points, i.e. gradient calculation, leapfrog integration, etc.)

Due to technicalities/minor issues, I had to make separate `logdensity_and_gradient` methods for the CPU and GPU cases:

```julia
function logdensity_and_gradient(ℓ::PhiFourModel, q::CuArray{T}) where T
y, g = value_and_gradient(ℓ, q)
y, one(T).*g #Need the multiplication by one to avoid some promotion issues with CuArrays...
end
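
# Note: `value_and_gradient` is not defined above; a minimal stand-in (an assumption,
# built on Zygote.pullback) that makes this snippet self-contained:
function value_and_gradient(f, x)
    y, back = Zygote.pullback(f, x)   # y = f(x), back = pullback closure
    y, first(back(one(y)))            # gradient of f with respect to x
end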
function logdensity_and_gradient(ℓ::PhiFourModel, q::Array)
if ℓ.nn.weight isa CuArray #The model is evaluated on the GPU
return map(cpu, logdensity_and_gradient(ℓ, gpu(q)))
else #everything happens on the CPU
return value_and_gradient(ℓ, q)
end
end
```

Then I tested scenario 1 as follows:

```julia
L = 100
n = L^3
ℓ = PhiFourModel(L)
κ = GaussianKineticEnergy(n)
H = Hamiltonian(κ, ℓ)
q = randn(n)
p = randn(n)
Q = evaluate_ℓ(ℓ, q)
z = PhasePoint(Q, p)
ϵ = 1.0
@btime leapfrog(H, z, ϵ)
```

while scenario 2 was done by simply writing:

```julia
ℓ = PhiFourModel(L) |> gpu
κ = GaussianKineticEnergy(Diagonal(CuArrays.ones(n))) #CuArrays version
H = Hamiltonian(κ, ℓ)
q = randn(n) |> gpu
p = randn(n) |> gpu
Q = evaluate_ℓ(ℓ, q)
z = PhasePoint(Q, p)
ϵ = 1.0f0
@btime leapfrog(H, z, ϵ)
```

I got the following results on my laptop (Intel Core i7-8750H CPU, Nvidia GeForce GTX 1050 Ti Max-Q graphics card):
Thoughts?
TuringLang/AdvancedHMC.jl#261 discusses a new GPU paper by Hoffman et al. with very impressive benchmarks.
I'm looking into the possibility of using DynamicHMC.jl for lattice model simulations. For such problems, the number of degrees of freedom ("parameters") should typically be as large as possible - O(10^6) is not uncommon.
Now, because of this large number of degrees of freedom, I was thinking of taking advantage of GPU processing to accelerate the log-density and gradient calculations to be fed to the HMC sampler. However, before I do any deep dive into the code base of DynamicHMC, I'd like to know if you think this is feasible...
As a starting point for any discussion, here's some exploratory testing where I got stuck:
(I'm on DynamicHMC v2.1.0, CuArrays v1.4.7, Zygote v0.4.1.)
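Judging from the stacktrace earlier in the thread (a `LogNormalTest` problem sampled with `CuArrays.CURAND.RNG`), the setup was roughly along the lines below; this is a hypothetical stand-in for illustration only, with `ToyGPUProblem` and its log density being assumptions rather than the original model:

```julia
using DynamicHMC, LogDensityProblems, CuArrays

# Hypothetical toy problem: iid standard normals, with the value and gradient
# written so that they also work on CuArrays.
struct ToyGPUProblem
    n::Int
end

LogDensityProblems.dimension(ℓ::ToyGPUProblem) = ℓ.n
LogDensityProblems.capabilities(::ToyGPUProblem) = LogDensityProblems.LogDensityOrder{1}()
LogDensityProblems.logdensity_and_gradient(::ToyGPUProblem, x) = (-sum(abs2, x) / 2, -x)

rng = CuArrays.CURAND.RNG()             # GPU RNG, as seen in the stacktrace
ℓ = ToyGPUProblem(10_000)
results = mcmc_with_warmup(rng, ℓ, 10)  # the kind of call that failed above
```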