
Minor optimization of the LOBPCG solver #1037

Merged · 9 commits · Feb 5, 2025

Conversation

@abussy (Collaborator) commented Dec 18, 2024

This PR was also motivated by monitoring memory allocations in the built-in DFTK.timer, in particular the memory allocated by the ortho! X vs Y step.

This orthogonalization function has a loop in which potentially large matrices are allocated at each iteration. Since these matrices keep the same dimensions throughout, it makes more sense to allocate them once and use in-place multiplications.
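As an illustration of the pattern (a generic sketch with made-up names, not the actual DFTK ortho! code), hoisting the buffers out of the loop looks like:

```julia
using LinearAlgebra

# Naive version: two fresh matrix allocations on every loop iteration.
function ortho_loop_alloc!(X, Y, niter)
    for _ in 1:niter
        X .-= Y * (Y' * X)
    end
    X
end

# Hoisted version: buffers allocated once, then reused via in-place mul!.
function ortho_loop_inplace!(X, Y, niter)
    YX = similar(X, size(Y, 2), size(X, 2))  # allocated once, reused below
    T  = similar(X)
    for _ in 1:niter
        mul!(YX, Y', X)   # YX = Y' * X, in place
        mul!(T, Y, YX)    # T  = Y * YX, in place
        X .-= T
    end
    X
end
```

Both functions compute the same projection; only the allocation behaviour differs.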

Additionally, it was noted that quite some time is spent in rayleigh_ritz, in a matrix-matrix multiplication known to yield a Hermitian result. Since the matrices are stored in the non-standard LazyHcat format, no automatic optimization can take place. This PR adds a new function mul_hermi for LazyHcat matrices, which only computes the upper block triangle and populates the rest of the matrix with its adjoint. This results in savings of the order of 30%.
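The block structure exploited here can be sketched on plain dense blocks (a simplified stand-in for the LazyHcat implementation in the PR; names and the hardcoded element type are illustrative):

```julia
using LinearAlgebra

# Sketch: for C = A' * B with C known to be Hermitian, compute only the
# upper blocks C[i, j] with i <= j and fill the lower ones from adjoints.
function mul_hermi_blocks(Ablocks, Bblocks)
    n = length(Ablocks)
    C = [zeros(ComplexF64, size(Ablocks[i], 2), size(Bblocks[j], 2))
         for i in 1:n, j in 1:n]
    for j in 1:n, i in 1:j
        C[i, j] = Ablocks[i]' * Bblocks[j]   # actively computed
    end
    for j in 1:n, i in j+1:n
        C[i, j] = C[j, i]'                   # filled, not recomputed
    end
    Hermitian(reduce(vcat, [reduce(hcat, C[i, :]) for i in 1:n]))
end
```

Roughly half of the block products are skipped, which is where the savings come from.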

@mfherbst (Member) commented Dec 19, 2024

> This results in savings of the order of 30%.

Of time to do the Rayleigh-Ritz step?

@mfherbst (Member) left a comment

Generally very nice, thanks.

In particular, mul_hermi is actually quite a bit of code and I wonder if that's really worth it in terms of performance. After all, these operations should be way cheaper than a Hamiltonian-vector product, no?

@abussy (Collaborator, Author) commented Dec 19, 2024

> In particular mul_hermi is actually quite a bit of code and I wonder if that's really worth it in terms of performance. After all these operations should be way cheaper than a Hamiltonian-vector product, no?

Overall, for multiple test systems, I observed a ~2.5% speedup in the overall timings. While this is not huge by itself, multiple such small optimizations across the code may add up to something significant.

@mfherbst (Member) commented

@abussy I think for mul_hermi an example showing a clear difference in timings would be good if you happen to have one handy. Just for @antoine-levitt to get convinced whether the extra code is worth it 😄.

@abussy (Collaborator, Author) commented Jan 10, 2025

I cleaned up this PR to make it as little disruptive as possible. I only kept what I think is important, namely the mul_hermi() function for the product of LazyHcat matrices known to be Hermitian, and a redefinition of mul!(A, B, C, alpha, beta). In the latter case, there was an explicit matrix-matrix multiplication with the identity (clearly a waste).

For the mul_hermi() function, I observe up to 4% efficiency gains on the SCF. For a single additional function of 20 lines, I think it's a bargain; making small steps like this can eventually add up to significant gains. That's my opinion at least.

As @mfherbst suggested, I am also providing an example. Generally, the speed-up is most visible for systems with a large number of electrons (and/or a high Ecut). For the aluminium supercell below, I gained about 30 s of runtime (on average, out of ~700 s, on my laptop).

```julia
using DFTK
using PseudoPotentialData
setup_threading()

Ecut = 32.0
kgrid = [1, 1, 1]
maxiter = 10
tol = 1.0e-8

factor = 4
a = 3.8267
lattice = factor * a * [[0.0 1.0 1.0];
                        [1.0 0.0 1.0];
                        [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for _ in 1:factor^3]
positions = Vector{Vector{Float64}}()
for i = 1:factor, j = 1:factor, k = 1:factor
    push!(positions, [i/factor, j/factor, k/factor])
end

model = model_DFT(lattice, atoms, positions; temperature=1e-4,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())

# Actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid)
scfres = self_consistent_field(basis; maxiter, tol);
@show DFTK.timer
```

@mfherbst (Member) left a comment

> up to 4% efficiency gains on the SCF.

If this is generally the case, I'd say it's worth it, but on my quick tests I get slightly less optimistic results. Let's discuss this next week.

@abussy (Collaborator, Author) commented Jan 28, 2025

Implemented the suggested changes. This requires removing the @assert !any(isnan, XAX) in rayleigh_ritz() before the diagonalization (because of disallowed scalar access on GPU when XAX is already wrapped as Hermitian). I believe this is safe, as the eigen() function fails loudly with LoadError: ArgumentError: matrix contains Infs or NaNs when NaNs are present.

@mfherbst (Member) commented

This version is fine with me.

> Requires the removal of the @assert !any(isnan, XAX)

This is probably ok, but I recall we added it because of the GPU version (where apparently cuBLAS did not check this back then). Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns.

@antoine-levitt Ok with this PR as it stands ?

@antoine-levitt (Member) left a comment

Nice PR, some nits but good to go. mul_hermi is a generally useful function we should ideally wrap more generally, but the problem is that this is a blind spot for BLAS (there's no BLAS routine for A*B assuming the result is Hermitian; there is one for A^T A, however, which Julia calls automatically), so it's fine to hardcode something for this case.

@@ -100,6 +100,28 @@ end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)

# Special case of Hermitian result: can only actively compute the block upper diagonal
Member:

Describe what the function does instead of how it's implemented: something like "Computes A*B, assuming the result is hermitian"

@views function mul_hermi(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Member:

Aadj -> A

Member:

(mul does A*B, not Aadj*B)

Collaborator (Author):

True. To be fair, I am copying the convention from the original general function:

@views function Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}

I guess I could change it there too.

Member:

Yes that'd be good!

Hermitian(ret)
end

mul_hermi(Aadj::AbstractArray{T}, B::AbstractArray{T}) where {T} = Hermitian(Aadj * B)
Member:

put the generic before the specialization

Member:

(also no need to type them as AbstractArray or to get T; just do mul_hermi(A, B) = Hermitian(A * B))

@antoine-levitt (Member) commented

Huh, I didn't see that this was a copy-paste. You can have a _mul(A, B, hermitian=Val(false)) with an if hermitian ... inside, and then just dispatch the * and mul_hermi methods to that.

@mfherbst (Member) commented Feb 3, 2025

Great idea @antoine-levitt. I would make it a kwarg, though, i.e. _mul(A, B; hermitian=Val(false)), and then the check is if hermitian isa Val{true}.
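The proposed shape can be sketched as follows (a hypothetical _mymul, not the PR's actual implementation):

```julia
using LinearAlgebra

# One multiplication kernel; the Hermitian wrapping is selected by a Val flag
# so the branch is resolved at compile time rather than at run time.
function _mymul(A, B; hermitian=Val(false))
    C = A * B
    if hermitian isa Val{true}
        Hermitian(C)   # caller guarantees the result is Hermitian
    else
        C
    end
end

mymul_hermi(A, B) = _mymul(A, B; hermitian=Val(true))
```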

@abussy (Collaborator, Author) commented Feb 3, 2025

I added a generic _mul() function that dispatches the general and Hermitian cases. From my understanding, it is more efficient to branch with if hermitian isa Val{true} a single time rather than inside the loop. Correct me if I'm wrong, and I can reduce the code duplication.

> Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns.

Unfortunately, cuBLAS lets this slide and goes ahead with NaNs. It turns out that using @assert !any(isnan, UpperTriangular(parent(XAX))) does not trigger a GPU scalar-access error, while fulfilling the safety requirements.
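On the CPU the replacement assertion behaves as described; a minimal sketch (the GPU behaviour itself is not reproduced here):

```julia
using LinearAlgebra

# Check only the stored (upper) triangle of the Hermitian wrapper; per the
# discussion above, this form avoids scalar indexing into GPU arrays.
XAX = Hermitian(rand(4, 4))
@assert !any(isnan, UpperTriangular(parent(XAX)))

# A NaN hiding in the stored triangle is still caught.
bad = Hermitian([1.0 NaN; NaN 2.0])
@assert any(isnan, UpperTriangular(parent(bad)))
```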

for (ib, blB) in enumerate(B.blocks)
orow = 0 # row offset
for (ia, blA) in enumerate(Ap.blocks)
ib < ia && continue
Member:

Isn't this (and the Hermitian wrapper) the only line that is not duplicated between the two versions of the _mul function?

Collaborator (Author):

Yes, that's the only difference

end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)
@views function Base.:*(A::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Member:

The @views here does nothing and can be removed


mul_hermi(A, B) = Hermitian(A * B)

@views function mul_hermi(A::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Member:

Same, @views can be removed

@antoine-levitt (Member) commented

> From my understanding, it is more efficient to branch out with if hermitian isa Val{true} a single time rather than in the loop. Correct me if I'm wrong, and I can reduce code duplication.

Nope, it doesn't matter. Julia compiles a specialized version of a function for the types of its arguments, so it will know at compile time that the argument isa Val{true} and simply delete the dead branch. You can check this for yourself with the @code_* macros (e.g. @code_typed).
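A quick toy check of this claim (illustrative function name):

```julia
# Julia specializes on argument types, so the Val flag is a compile-time
# constant and the branch below is resolved during compilation.
flagged(x; flag=Val(false)) = flag isa Val{true} ? x + 1 : x - 1

# Inspecting with e.g. `@code_typed flagged(1; flag=Val(true))` shows only
# the `x + 1` path surviving in the compiled code.
```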

@mfherbst (Member) left a comment

LGTM

@mfherbst mfherbst enabled auto-merge (squash) February 5, 2025 14:07
@mfherbst mfherbst merged commit 24eb717 into JuliaMolSim:master Feb 5, 2025
6 of 8 checks passed