Minor optimization of the LOBPCG solver #1037
Conversation
…of time to do the Rayleigh-Ritz step?
Generally very nice, thanks. In particular `mul_hermi` is actually quite a bit of code and I wonder if that's really worth it in terms of performance. After all, these operations should be way cheaper than a Hamiltonian-vector product, no?
Overall, for multiple test systems, I observed a ~2.5% speedup on the overall timings. While this is not huge by itself, multiple such small optimizations over the code may add up to something significant.
@abussy I think for …
I cleaned up this PR to make it the least disruptive possible, and only kept what I think is important. As @mfherbst suggested, I am also providing an example. Generally, the speed-up will be most visible for systems with a large number of electrons (and/or a high `Ecut`). For the aluminium supercell below, I gained about 30 s of runtime (on average, out of ~700 s, on my laptop).
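A sketch of the kind of benchmark setup meant here, assuming DFTK's documented API (`ElementPsp`, `load_psp`, `model_LDA`, `PlaneWaveBasis`, `self_consistent_field`); the cell, `Ecut`, k-grid, and temperature are illustrative assumptions, not the exact settings behind the quoted timings:

```julia
using DFTK
using LinearAlgebra

# Illustrative aluminium cell (all values are assumptions, not the PR's setup).
a = 7.65339                       # fcc Al lattice constant in bohr
lattice = a * Matrix(1.0I, 3, 3)  # conventional cubic cell with 4 atoms
Al = ElementPsp(:Al; psp=load_psp("hgh/lda/al-q3"))
atoms = [Al, Al, Al, Al]
positions = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.5],
             [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]

# Aluminium is a metal, so a finite smearing temperature is needed.
model = model_LDA(lattice, atoms, positions;
                  temperature=1e-3, smearing=Smearing.Gaussian())
basis = PlaneWaveBasis(model; Ecut=30, kgrid=[2, 2, 2])
scfres = self_consistent_field(basis)  # LOBPCG runs inside every SCF step
```

Larger supercells (more electrons) shift more of the runtime into the eigensolver, which is where optimizations like this one become visible.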
> up to 4% efficiency gains on the SCF.
If this is generally the case, I'd say it's worth it, but on my quick tests I get slightly less optimistic results. Let's discuss this next week.
Implemented the suggested changes. Requires the removal of the …
This version is fine with me.
This is probably OK, but I recall we added it because of the GPU version (where apparently cuBLAS did not check this back then). Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns. @antoine-levitt OK with this PR as it stands?
Nice PR, some nits but good to go. `mul_hermi` is a generally useful function we should ideally wrap more generally, but the problem is that this is a blind spot for BLAS (there's no BLAS routine for `A*B` assuming the result is hermitian; there is one for `A^T A` however, which is called automatically by Julia), so it's fine to hardcode something in this case.
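The `A^T A` special case can be seen directly in Julia: LinearAlgebra recognises the `A' * A` pattern and dispatches it to a BLAS rank-k update (`herk`/`syrk`), while a general product that merely happens to be hermitian goes through plain `gemm`. A small sketch:

```julia
using LinearAlgebra

A = randn(ComplexF64, 200, 8)
G = A' * A   # recognised pattern: BLAS herk fills one triangle and mirrors it
@assert ishermitian(G)

B = randn(ComplexF64, 200, 8)
H = A' * B   # generic gemm: hermitian structure of the result goes unused
```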
src/eigen/lobpcg_hyper_impl.jl (outdated), `@@ -100,6 +100,28 @@`:

```julia
end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)

# Special case of Hermitian result: can only actively compute the block upper diagonal
```
Describe what the function does instead of how it's implemented: something like "Computes A*B, assuming the result is hermitian"
src/eigen/lobpcg_hyper_impl.jl (outdated), `@@ -100,6 +100,28 @@`:

```julia
end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)

# Special case of Hermitian result: can only actively compute the block upper diagonal
@views function mul_hermi(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
```
`Aadj` -> `A`
(`mul` does `A*B`, not `Aadj*B`)
True. To be fair, I am copying the convention from the original general function:

DFTK.jl/src/eigen/lobpcg_hyper_impl.jl, line 83 at dc62a84:

```julia
@views function Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
```
Yes that'd be good!
src/eigen/lobpcg_hyper_impl.jl (outdated):

```julia
    Hermitian(ret)
end

mul_hermi(Aadj::AbstractArray{T}, B::AbstractArray{T}) where {T} = Hermitian(Aadj * B)
```
put the generic before the specialization
(Also no need to type them as `AbstractArray` or to get `T`; just do `mul_hermi(A, B) = Hermitian(A * B)`.)
Huh, I didn't see that this was a copy-paste. You can have a …
Great idea @antoine-levitt. I would make it a kwarg though, i.e. …
I added a generic …
Unfortunately, cuBLAS lets this slide and goes ahead with NaNs. It turns out that using …
src/eigen/lobpcg_hyper_impl.jl (outdated):

```julia
for (ib, blB) in enumerate(B.blocks)
    orow = 0  # row offset
    for (ia, blA) in enumerate(Ap.blocks)
        ib < ia && continue
```
Isn't this (and the `Hermitian` wrapper) the only line that is not duplicated between the two versions of the `_mul` function?
Yes, that's the only difference
src/eigen/lobpcg_hyper_impl.jl (outdated):

```julia
end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)
@views function Base.:*(A::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
```
The `@views` here does nothing and can be removed.
src/eigen/lobpcg_hyper_impl.jl (outdated):

```julia
mul_hermi(A, B) = Hermitian(A * B)

@views function mul_hermi(A::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
```
Same, the `@views` can be removed.
Nope, it doesn't matter. Julia compiles a function for the types of its arguments, so it will know at compile time that the argument `isa Val{true}` and just delete all the checks. You can check this for yourself with the …
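A minimal sketch of this point, with hypothetical names: a `Val{true}`/`Val{false}` argument is part of the method's type signature, so each specialization is compiled separately and the untaken branch disappears entirely:

```julia
# Hypothetical example: dispatch on Val moves the decision to compile time.
checked_sum(x, ::Val{true})  = (any(isnan, x) && error("NaN input"); sum(x))
checked_sum(x, ::Val{false}) = sum(x)

checked_sum([1.0, 2.0], Val(false))  # compiled body contains no NaN check
# Inspect with: @code_typed checked_sum([1.0, 2.0], Val(false))
```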
LGTM
This PR was also motivated by monitoring memory allocations in the built-in `DFTK.timer`, in particular the memory allocated by `ortho! X vs Y`. This orthogonalization function has a loop where potentially large matrix allocations take place at each step. Since these matrices keep the same dimensions, it makes more sense to have a single allocation followed by in-place multiplications.
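The single-allocation pattern described here can be sketched as follows (illustrative names, not DFTK's actual `ortho!` code): allocate the buffers once outside the loop and update them with in-place `mul!`:

```julia
using LinearAlgebra

# Sketch: repeatedly project the columns of Y out of X without
# per-iteration allocations. YtX and YYtX keep the same dimensions
# across iterations, so they are allocated exactly once.
function project_out!(X, Y; niter=3)
    YtX  = similar(X, size(Y, 2), size(X, 2))  # allocated once
    YYtX = similar(X)                          # allocated once
    for _ in 1:niter
        mul!(YtX, Y', X)    # in place: YtX = Y' * X
        mul!(YYtX, Y, YtX)  # in place: YYtX = Y * YtX
        X .-= YYtX
    end
    X
end
```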
Additionally, it was noted that quite some time is spent in `rayleigh_ritz`, in a matrix-matrix multiplication known to result in a Hermitian product. Since the matrices are in the non-standard `LazyHcat` format, no automatic optimization can take place. This PR adds a new function `mul_hermi` for `LazyHcat` matrices, which only computes the upper block diagonal and populates the rest of the matrix with its adjoint. This results in savings of the order of 30%.
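An illustrative sketch of the idea (plain dense blocks instead of DFTK's `LazyHcat`, names hypothetical): compute only the block upper triangle of `A' * B` and let the `Hermitian` wrapper supply the lower triangle as the adjoint:

```julia
using LinearAlgebra

# Sketch: A and B are given as lists of column blocks; A' * B is assumed
# hermitian, so only blocks with row-block index <= column-block index
# are actually computed.
function mul_hermi_blocks(Ablocks, Bblocks)
    n = sum(size(bl, 2) for bl in Ablocks)
    ret = zeros(eltype(first(Ablocks)), n, n)
    ocol = 0  # column offset
    for (ib, blB) in enumerate(Bblocks)
        orow = 0  # row offset
        for (ia, blA) in enumerate(Ablocks)
            if ia <= ib  # skip the block lower triangle
                ret[orow+1:orow+size(blA, 2), ocol+1:ocol+size(blB, 2)] .= blA' * blB
            end
            orow += size(blA, 2)
        end
        ocol += size(blB, 2)
    end
    Hermitian(ret)  # lower triangle is read as the adjoint of the upper
end
```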