
Minor optimization of the LOBPCG solver #1037

Merged · 9 commits · Feb 5, 2025

Conversation

@abussy (Collaborator) commented Dec 18, 2024

This PR was also motivated by monitoring memory allocations in the built-in DFTK.timer, in particular the memory allocated by the ortho! X vs Y step.

This orthogonalization function has a loop in which potentially large matrices are allocated at each iteration. Since these matrices keep the same dimensions throughout, it makes more sense to allocate them once and use in-place multiplications.
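As an illustration of the pattern (a generic sketch with made-up names, not the actual DFTK ortho! code), hoisting the buffers out of the loop looks like:

```julia
using LinearAlgebra

# Naive version: two fresh matrix allocations on every loop iteration.
function ortho_loop_alloc!(X, Y, niter)
    for _ in 1:niter
        X .-= Y * (Y' * X)
    end
    X
end

# Hoisted version: buffers allocated once, then reused via in-place mul!.
function ortho_loop_inplace!(X, Y, niter)
    YX = similar(X, size(Y, 2), size(X, 2))  # allocated once, reused below
    T  = similar(X)
    for _ in 1:niter
        mul!(YX, Y', X)   # YX = Y' * X, in place
        mul!(T, Y, YX)    # T  = Y * YX, in place
        X .-= T
    end
    X
end
```

Both functions compute the same projection; only the allocation behaviour differs.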

Additionally, it was noted that quite some time is spent in rayleigh_ritz, in a matrix-matrix multiplication known to yield a Hermitian result. Since the matrices are stored in the non-standard LazyHcat format, no automatic optimization can take place. This PR adds a new function mul_hermi for LazyHcat matrices, which only computes the upper block triangle and populates the rest of the matrix with its adjoint. This results in savings of the order of 30%.
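The block structure exploited here can be sketched on plain dense blocks (a simplified stand-in for the LazyHcat implementation in the PR; names and the hardcoded element type are illustrative):

```julia
using LinearAlgebra

# Sketch: for C = A' * B with C known to be Hermitian, compute only the
# upper blocks C[i, j] with i <= j and fill the lower ones from adjoints.
function mul_hermi_blocks(Ablocks, Bblocks)
    n = length(Ablocks)
    C = [zeros(ComplexF64, size(Ablocks[i], 2), size(Bblocks[j], 2))
         for i in 1:n, j in 1:n]
    for j in 1:n, i in 1:j
        C[i, j] = Ablocks[i]' * Bblocks[j]   # actively computed
    end
    for j in 1:n, i in j+1:n
        C[i, j] = C[j, i]'                   # filled, not recomputed
    end
    Hermitian(reduce(vcat, [reduce(hcat, C[i, :]) for i in 1:n]))
end
```

Roughly half of the block products are skipped, which is where the savings come from.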

@mfherbst (Member) commented Dec 19, 2024

> This results in savings of the order of 30%.

Of time to do the Rayleigh-Ritz step?

@mfherbst (Member) left a comment

Generally very nice, thanks.

In particular, mul_hermi is actually quite a bit of code and I wonder if that's really worth it in terms of performance. After all, these operations should be way cheaper than a Hamiltonian-vector product, no?

@abussy (Collaborator, Author) commented Dec 19, 2024

> In particular mul_hermi is actually quite a bit of code and I wonder if that's really worth it in terms of performance. After all these operations should be way cheaper than a Hamiltonian-vector product, no?

Overall, for multiple test systems, I observed a ~2.5% speedup in the overall timings. While this is not huge by itself, multiple such small optimizations across the code may add up to something significant.

@mfherbst (Member) commented

@abussy I think for mul_hermi an example showing a clear difference in timings would be good if you happen to have one handy. Just for @antoine-levitt to get convinced whether the extra code is worth it 😄.

@abussy (Collaborator, Author) commented Jan 10, 2025

I cleaned up this PR to make it as little disruptive as possible. I only kept what I think is important, namely the mul_hermi() function for the product of LazyHcat matrices known to be Hermitian, and a redefinition of mul!(A, B, C, alpha, beta). In the latter case, there was an explicit matrix-matrix multiplication with the identity (clearly a waste).

For the mul_hermi() function, I observe up to 4% efficiency gains on the SCF. For a single additional function of 20 lines, I think it's a bargain; making small steps like this can eventually add up to significant gains. That's my opinion at least.

As @mfherbst suggested, I am also providing an example. Generally, the speed-up is most visible for systems with a large number of electrons (and/or a high Ecut). For the aluminium supercell below, I gained about 30 s of runtime (on average, out of ~700 s, on my laptop).

```julia
using DFTK
using PseudoPotentialData
setup_threading()

Ecut = 32.0
kgrid = [1, 1, 1]
maxiter = 10
tol = 1.0e-8

factor = 4
a = 3.8267
lattice = factor * a * [[0.0 1.0 1.0];
                        [1.0 0.0 1.0];
                        [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for _ in 1:factor^3]
positions = Vector{Vector{Float64}}()
for i = 1:factor, j = 1:factor, k = 1:factor
    push!(positions, [i/factor, j/factor, k/factor])
end

model = model_DFT(lattice, atoms, positions; temperature=1e-4,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())

# Actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid)
scfres = self_consistent_field(basis; maxiter, tol);
@show DFTK.timer
```

@mfherbst (Member) left a comment

> up to 4% efficiency gains on the SCF.

If this is generally the case, I'd say it's worth it, but on my quick tests I get slightly less optimistic results. Let's discuss this next week.

@abussy (Collaborator, Author) commented Jan 28, 2025

Implemented the suggested changes. This requires removing the @assert !any(isnan, XAX) in rayleigh_ritz() before the diagonalization (because of disallowed scalar access on GPU when XAX is already wrapped as Hermitian). I believe this is safe, as the eigen() function fails loudly with LoadError: ArgumentError: matrix contains Infs or NaNs when NaNs are present.

@mfherbst (Member) commented

This version is fine with me.

> Requires the removal of the @assert !any(isnan, XAX)

This is probably ok, but I recall we added it because of the GPU version (where apparently cuBLAS did not check this back then). Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns.

@antoine-levitt Ok with this PR as it stands ?

@antoine-levitt (Member) left a comment

Nice PR, some nits but good to go. mul_hermi is a generally useful function we should ideally wrap more generally, but the problem is that this is a blind spot for BLAS (there's no BLAS routine for A*B assuming the result is Hermitian; there is one for A^T A, however, which Julia calls automatically), so it's fine to hardcode something for this case.

@@ -100,6 +100,28 @@ end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)

# Special case of Hermitian result: can only actively compute the block upper diagonal
Member:

Describe what the function does instead of how it's implemented: something like "Computes A*B, assuming the result is hermitian"

@views function mul_hermi(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Member:

Aadj -> A

Member:

(mul does A*B, not Aadj*B)

Collaborator (Author):

True. To be fair, I am copying the convention from the original general function:

@views function Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}

I guess I could change it there too.

Member:

Yes that'd be good!

Hermitian(ret)
end

mul_hermi(Aadj::AbstractArray{T}, B::AbstractArray{T}) where {T} = Hermitian(Aadj * B)
Member:

put the generic before the specialization

Member:

(also no need to type them as AbstractArray or to get T; just do mul_hermi(A, B) = Hermitian(A * B))

@antoine-levitt (Member) commented

Huh, I didn't see that this was a copy-paste. You can have a _mul(A, B, hermitian=Val(false)) with an if hermitian ... inside, and then just dispatch the * and mul_hermi methods to that.

@mfherbst (Member) commented Feb 3, 2025

Great idea @antoine-levitt. I would make it a kwarg, though, i.e. _mul(A, B; hermitian=Val(false)), and then the check is if hermitian isa Val{true}.
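The proposed shape can be sketched as follows (a hypothetical _mymul, not the PR's actual implementation):

```julia
using LinearAlgebra

# One multiplication kernel; the Hermitian wrapping is selected by a Val flag
# so the branch is resolved at compile time rather than at run time.
function _mymul(A, B; hermitian=Val(false))
    C = A * B
    if hermitian isa Val{true}
        Hermitian(C)   # caller guarantees the result is Hermitian
    else
        C
    end
end

mymul_hermi(A, B) = _mymul(A, B; hermitian=Val(true))
```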

@abussy (Collaborator, Author) commented Feb 3, 2025

I added a generic _mul() function that dispatches the general and Hermitian cases. From my understanding, it is more efficient to branch with if hermitian isa Val{true} a single time rather than inside the loop. Correct me if I'm wrong, and I can reduce the code duplication.

> Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns.

Unfortunately, cuBLAS lets this slide and goes ahead with NaNs. It turns out that using @assert !any(isnan, UpperTriangular(parent(XAX))) does not trigger a GPU scalar-access error, while fulfilling the safety requirements.
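On the CPU the replacement assertion behaves as described; a minimal sketch (the GPU behaviour itself is not reproduced here):

```julia
using LinearAlgebra

# Check only the stored (upper) triangle of the Hermitian wrapper; per the
# discussion above, this form avoids scalar indexing into GPU arrays.
XAX = Hermitian(rand(4, 4))
@assert !any(isnan, UpperTriangular(parent(XAX)))

# A NaN hiding in the stored triangle is still caught.
bad = Hermitian([1.0 NaN; NaN 2.0])
@assert any(isnan, UpperTriangular(parent(bad)))
```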

for (ib, blB) in enumerate(B.blocks)
orow = 0 # row offset
for (ia, blA) in enumerate(Ap.blocks)
ib < ia && continue
Member:

Isn't this (and the Hermitian wrapper) the only line that is not duplicated between the two versions of the _mul function?

Collaborator (Author):

Yes, that's the only difference

end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)
@views function Base.:*(A::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Member:

The @views here does nothing and can be removed


mul_hermi(A, B) = Hermitian(A * B)

@views function mul_hermi(A::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Member:

Same, @views can be removed

@antoine-levitt (Member) commented

> From my understanding, it is more efficient to branch out with if hermitian isa Val{true} a single time rather than in the loop. Correct me if I'm wrong, and I can reduce code duplication.

Nope, it doesn't matter. Julia compiles a specialized version of a function for the types of its arguments, so it will know at compile time that the argument isa Val{true} and simply delete the dead branch. You can check this for yourself with the @code_* macros (e.g. @code_typed).
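A quick toy check of this claim (illustrative function name):

```julia
# Julia specializes on argument types, so the Val flag is a compile-time
# constant and the branch below is resolved during compilation.
flagged(x; flag=Val(false)) = flag isa Val{true} ? x + 1 : x - 1

# Inspecting with e.g. `@code_typed flagged(1; flag=Val(true))` shows only
# the `x + 1` path surviving in the compiled code.
```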

@mfherbst (Member) left a comment

LGTM

@mfherbst mfherbst enabled auto-merge (squash) February 5, 2025 14:07
@mfherbst mfherbst merged commit 24eb717 into JuliaMolSim:master Feb 5, 2025
6 of 8 checks passed