
Richardson CG #29

Open · wants to merge 16 commits into base: devel
Conversation

martin-ueding
Contributor

Exposing the QPhiX Richardson Solver with CG in Chroma

Rationale

Chroma already has the QPhiX Richardson solver exposed, but only with BiCGStab
as the solver. For our Wilson clover simulation, BiCGStab seems to be really
unstable. Bartek has extended QPhiX such that one can use CG in the Richardson
solver. We want to have this exposed in Chroma in order to use it for our HMC
simulation.

Current Implementation

HMC and Propagator Solver

A git grep for QPHIX_CLOVER_ITER_REFINE_BICGSTAB_INVERTER shows where the
Richardson solver is defined within Chroma:

  • lib/actions/ferm/invert/qphix/syssolver_linop_clover_qphix_iter_refine_w.cc
  • lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.cc

In the linop file, there is a comment stating:

Solve a MdagM*psi=chi linear system by CG2

Whereas in the mdagm file, there is a comment stating:

Solve a MdagM*psi=chi linear system by CG2

The corresponding header files enlighten us with the following comment:

Solve a M*psi=chi linear system by BiCGStab

At this point, I am completely lost. I guess that one inverts M†M whereas
the other only inverts M. My bet is that mdagm solves M†M and linop
solves, well, just the linear operator M.

Indeed, there is one call to mixed_solver in the linop version and two
invocations in the mdagm variant:

(*mixed_solver)(psi_s[1], chi_s[1], toDouble(invParam.RsdTarget),
                res.n_count, rsd_final, site_flops, mv_apps, my_isign,
                invParam.VerboseP);

Versus:

(*mixed_solver)(tmp_qphix, chi_qphix, toDouble(invParam.RsdTarget),
                n_count1, rsd_final, site_flops1, mv_apps1, -1,
                invParam.VerboseP);
// ...
(*mixed_solver)(psi_qphix, tmp_qphix, toDouble(invParam.RsdTarget),
                n_count2, rsd_final, site_flops2, mv_apps2, +1,
                invParam.VerboseP);

So these two are indeed the solvers for the propagators (M) and the HMC (M†M).
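
These two invocations are presumably the standard trick for solving M†M ψ = χ with a solver that can only invert M or M† at a time. A minimal sketch of the idea, with hypothetical names and assuming that isign = -1 means "solve with M†":

// Hypothetical sketch, not the Chroma code: build an MdagM solve from a
// solver that inverts only M or M^dag per call.
template <typename Spinor, typename Solver>
void solve_mdagm_sketch(Solver &solve, Spinor &psi, const Spinor &chi)
{
    Spinor tmp;
    solve(tmp, chi, -1); // first pass:  M^dag tmp = chi  (the isign = -1 call)
    solve(psi, tmp, +1); // second pass: M    psi = tmp   (the isign = +1 call)
    // Then M^dag M psi = M^dag tmp = chi, which is the system the HMC needs.
}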

Despite the comments in the code, both cases use the BiCGStab solver as the
inner solver. The outer solver has always been BiCGStab as well; after Bartek's
changes it is BiCGStab only by default.

“Regular” QPhiX Solver

For the regular QPhiX solver, there is an additional XML tag that can be used
to select the CG or BiCGStab solver. It would be nice if something similar
could be implemented for the Richardson solver such that the user can specify
whether CG or BiCGStab is to be used.

Analogously to the Richardson solver, the regular solver is defined in an HMC
and a propagator variant:

  • lib/actions/ferm/invert/qphix/syssolver_linop_clover_qphix_w.cc
  • lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_w.cc

In the constructor of the parameter struct, the solver type is read into an
enum variable:

read(paramtop, "SolverType", SolverType);
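
In the input XML this corresponds to a fragment roughly like the following. Only the SolverType, MaxIter, RsdTarget and VerboseP tags are taken from the parameter struct; the values and everything omitted around them are made up for illustration:

<!-- Illustrative fragment only; values are made up. -->
<SolverType>CG</SolverType>   <!-- or BICGSTAB -->
<MaxIter>5000</MaxIter>
<RsdTarget>1.0e-9</RsdTarget>
<VerboseP>false</VerboseP>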

Later on there is a switch statement that creates the corresponding solver
object:

switch (invParam.SolverType) {
    case CG: {
        QDPIO::cout << "Creating the CG Solver" << std::endl;
        cg_solver = new QPhiX::InvCG<REALT, VecTraits<REALT>::Vec,
                                     VecTraits<REALT>::Soa,
                                     VecTraits<REALT>::compress12>(
            (*M), invParam.MaxIter);
    } break;

    case BICGSTAB: {
        QDPIO::cout << "Creating the BiCGStab Solver" << std::endl;
        bicgstab_solver =
            new QPhiX::InvBiCGStab<REALT, VecTraits<REALT>::Vec,
                                   VecTraits<REALT>::Soa,
                                   VecTraits<REALT>::compress12>(
                (*M), invParam.MaxIter);
    } break;
    default:
        QDPIO::cerr << "UNKNOWN Solver Type" << std::endl;
        QDP_abort(1);
}

For the inversion of M, the actual inversions happen in these two functions:

SystemSolverResults_t cgnrSolve(T &psi, const T &chi) const;
SystemSolverResults_t biCGStabSolve(T &psi, const T &chi) const;
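
Before going on, here is a minimal sketch of the normal-equations idea behind cgnrSolve (hypothetical names and interface, not the Chroma code):

// CG needs a Hermitian positive-definite matrix, so M psi = chi is attacked
// via the normal equations  M^dag M psi = M^dag chi  instead.
template <typename Spinor, typename Op, typename CGSolver>
void cgnr_sketch(const Op &M, CGSolver &cg_on_MdagM, Spinor &psi, const Spinor &chi)
{
    Spinor rhs;
    M(rhs, chi, -1);       // rhs = M^dag chi (assuming isign = -1 applies M^dag)
    cg_on_MdagM(psi, rhs); // CG solves M^dag M psi = rhs
}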

So cgnrSolve uses CG on the normal equations to invert M, whereas for BiCGStab
a single solve suffices. For the M†M case in the other file, there are these
methods:

SystemSolverResults_t biCGStabSolve(
    T &psi, const T &chi, AbsChronologicalPredictor4D<T> &predictor) const;
SystemSolverResults_t cgSolve(
    T &psi, const T &chi, AbsChronologicalPredictor4D<T> &predictor) const;

There is a switch statement in the operator() which chooses the right function
to call based on the runtime parameter:

switch (invParam.SolverType) {
    case CG: {
        res = cgSolve(psi, chi, predictor);
    } break;
    case BICGSTAB: {
        res = biCGStabSolve(psi, chi, predictor);
    } break;
    default:
        QDPIO::cout << "Unknown Solver " << std::endl;
        break;
}

Planning

The regular QPhiX solvers (meaning the inverters for M and M†M) have a switch
to allow for both CG and BiCGStab. There is no dynamic polymorphism involved,
because CG needs the normal equations in the M case and BiCGStab needs two
applications in the M†M case. The same approach should therefore work for the
Richardson solver as well.

For now we only need the Richardson solver in the HMC as we do our propagators
with the sLapH method using the peram_gen code based on tmLQCD and QUDA.
Therefore I will start with the M†M implementation and do the M implementation
later on.

Implementation

The new feature needs to be developed against Bartek's QPhiX branch. This has
to be compiled and installed locally, and then Chroma has to be built against
it. This worked right away, which is a nice surprise.

The regular and Richardson solvers share the SysSolverQPhiXCloverParams
struct, which means the Richardson solver also has the SolverType field.
Therefore the XML parsing does not need to be changed; the option just needs to
be honored. In syssolver_mdagm_clover_qphix_iter_refine_w.h, I add the same
switch statement that already exists in syssolver_mdagm_clover_qphix_w.h, in
the following places:

  • The constructor now constructs either a CG or a BiCGStab solver to be used
    as the outer solver.
  • The operator() which takes the chronological predictor has the switch.

The regular QPhiX solver has the member functions cgSolve and biCGStabSolve.
Similar functions are going to be implemented in the Richardson case as well.
There the existing operator() is renamed into biCGStabSolve, and cgSolve is
added analogously to the regular CG case. This will be easier to implement
because I just need to call CG once and M†M is solved.
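
Roughly, the new cgSolve then reduces to a single call of the CG-based mixed solver. A simplified sketch, not the actual patch; predictor handling, timing and flop counting are omitted, and the isign argument shown is an assumption:

// Sketch only: with a CG-based mixed_solver that inverts M^dag M directly,
// one invocation replaces the two BiCGStab passes.
(*mixed_solver)(psi_qphix, chi_qphix, toDouble(invParam.RsdTarget),
                res.n_count, rsd_final, site_flops, mv_apps, +1,
                invParam.VerboseP);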

Testing

Short HMC on my Laptop

I have compiled Chroma using Bartek's QPhiX branch on my AVX laptop and tested
the HMC with Nf = 2 on an 8³×8 lattice from a random starting configuration. I
let it run for two trajectories with just a single time step on the outermost
time scale. I used two MPI ranks with four threads each (effectively
oversubscribing my laptop).

Comparing CG (-) with Richardson-CG (+), one can see that there are
differences, but they are not very large. I would conclude that Richardson-CG
did not break anything.

-              <PE_old>-16985.682690378</PE_old>
+              <PE_old>-16985.6827849023</PE_old>

-              <KE_new>33974.0571754877</KE_new>
+              <KE_new>33974.0638706824</KE_new>

-              <PE_new>-50450.2118875695</PE_new>
+              <PE_new>-50450.2159562068</PE_new>

-            <deltaKE>34020.4653290306</deltaKE>
+            <deltaKE>34020.4720242253</deltaKE>

-            <deltaPE>-33464.5291971914</deltaPE>
+            <deltaPE>-33464.5331713045</deltaPE>

-            <deltaH>555.93613183916</deltaH>
+            <deltaH>555.938852920808</deltaH>

-            <AccProb>3.63082779652542e-242</AccProb>
+            <AccProb>3.62096144730027e-242</AccProb>


You can have a look at the input file hmc-in.xml.txt, the output from plain CG
dat-cg.xml.txt, and the output from Richardson-CG dat-richardson-cg.xml.txt.

The overall time to solution seems to be about the same; any difference can
probably not be seen on my laptop with a non-thermalized random configuration.

@kostrzewa

Excellent work, Martin. I will open a pull request on QPhiX soon so that someone else can test this. As for the time to solution: did you actually do mixed precision, or did you have a double-double solver? Also, you oversubscribed, so any benefit might be masked by that...

@martin-ueding
Contributor Author

Thank you for the kind words!

It seems I did a single-single solve. My QDP++ is configured with the default precision (which seems to default to single) and Chroma is configured with the default inner precision (which defaults to single). Therefore I have an inner and outer veclen of 8 on my 256-bit machine (256 bit / 32 bit per single-precision float = 8), so no conversion was done.

I will recompile QDP++ and Chroma using double-precision on my machine, then I can test it again and also I will use 1 MPI rank and 2×2 threads.

@martin-ueding
Contributor Author

In the case of differing inner and outer precisions, it does not compile any more:

In file included from /home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.cc:8:0:
/home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.h: In instantiation of 'Chroma::MdagMSysSolverQPhiXCloverIterRefine<T, U>::MdagMSysSolverQPhiXCloverIterRefine(Chroma::Handle<Chroma::LinearOperator<T> >, Chroma::Handle<Chroma::FermState<T, QDP::multi1d<U>, QDP::multi1d<U> > >, const Chroma::SysSolverQPhiXCloverParams&) [with T = QDP::OLattice<QDP::PSpinVector<QDP::PColorVector<QDP::RComplex<double>, 3>, 4> >; U = QDP::OLattice<QDP::PScalar<QDP::PColorMatrix<QDP::RComplex<double>, 3> > >]':
/home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.cc:39:139:   required from here
/home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.h:294:13: error: no matching function for call to 'QPhiX::InvCG<double, 4, 4, true, QPhiX::EvenOddLinearOperator<double, 4, 4, true> >::InvCG(QPhiX::EvenOddCloverOperator<float, 8, 4, true>&, const int&)'
             new QPhiX::InvCG<REALT,
             ^~~~~~~~~~~~~~~~~~~~~~~
                              VecTraits<REALT>::Vec,
                              ~~~~~~~~~~~~~~~~~~~~~~
                              VecTraits<REALT>::Soa,
                              ~~~~~~~~~~~~~~~~~~~~~~
                              VecTraits<REALT>::compress12>(*M_inner, invParam.MaxIter);
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.h:34:0,
                 from /home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.cc:8:
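
Judging from the error, the inner CG is instantiated with the outer-precision template parameters (REALT and VecTraits<REALT>) but handed the inner-precision operator *M_inner. The fix is presumably along these lines (schematic only; InnerReal is a placeholder for whatever the inner floating-point type is called in that header, not the actual identifier):

// Schematic: instantiate the inner solver with the inner-precision traits so
// that it matches *M_inner.
cg_solver_inner =
    new QPhiX::InvCG<InnerReal,
                     VecTraits<InnerReal>::Vec,
                     VecTraits<InnerReal>::Soa,
                     VecTraits<InnerReal>::compress12>(*M_inner, invParam.MaxIter);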

So this needs a bit more work to be useful; I'll work on it.

@martin-ueding
Contributor Author

Seemed to be an easy fix. Now it works with double-single on my laptop, though the speed-up is not significant. I'll set it up on Hazel Hen after getting ice cream.

@kostrzewa

@bjoo you mentioned that you had observed some bicgstab convergence issues. If this was "physics", these changes may help. Using Richardson-CG for the most chiral determinant ratio monomial may solve your stability problems.

@kostrzewa

@martin-ueding How come there is a conflict? Does that just mean that a merge must take place?

@martin-ueding
Contributor Author

I have no idea, but it seemed strange to me as well. I probably did not do a git pull on devel before I started working on this feature. I'll look into it, right after I compile Chroma again on Hazel Hen, this time with the right branches. Their no-outgoing-connections policy meant I compiled it with Chroma devel and QPhiX devel yesterday, so I'll have to recompile it again. What a 💢 waste of time.

@martin-ueding
Contributor Author

martin-ueding commented Aug 24, 2017

The conflict is gone now; it was a simple issue: Bálint had added a std:: in a line which I touched as well.

I just noticed that I have added a using namespace std; because I did not want to deal with all those issues. This has to be removed before the merge.

@martin-ueding
Contributor Author

Now it is working on Hazel Hen; this is the performance on the thermalized L = 24 lattice. One can see that Richardson-CG has an advantage over regular CG, though Richardson-BiCGStab is still faster. But in our situation, where BiCGStab is not stable, this already is a nice win:

[Figure: time to solution for the three solvers]

The details follow; they show the first solve that is done. The other monomials probably have a different speed-up, but I did not look into it further because it seems to work and it seems faster.

Regular CG (double):

QPHIX_CLOVER_CG_SOLVER: 436 iters,  rsd_sq_final=9.97351928069349e-10
QPHIX_CLOVER_CG_SOLVER: || r || / || b || = 9.97351928069349e-10
QPHIX_CLOVER_CG_SOLVER: Solver Time=3.25231695175171 (sec)  Performance=1616.54930373261 GFLOPS
QPHIX_MDAGM_SOLVER: total time: 3.306395 (sec)

Richardson-BiCGStab (double-float):

Starting Y solve
RICHARDSON: Initial || r ||^2 =   3.12362151e+08   Target || r ||^2 =   3.12362151e-10
RICHARDSON: Solver converged at iter 3 || r || =   7.65889119e-10
RICHARDSON: Inner MV Apps=247 Outer MV Apps=5 Inner Site Flops=117336 Outer Site Flops=477
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: 3 iters,  rsd_sq_final=5.86586142643085e-19 ||r||/||b|| (acc) = 7.65889119026432e-10
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: Solver Time=0.526260137557983 (sec)  Performance=3135.819446211 GFLOPS
Starting X solve
RICHARDSON: Initial || r ||^2 =   1.83896867e+07   Target || r ||^2 =   1.83896867e-11
RICHARDSON: Solver converged at iter 3 || r || =   9.78313482e-10
RICHARDSON: Inner MV Apps=317 Outer MV Apps=5 Inner Site Flops=150936 Outer Site Flops=477
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: 3 iters,  rsd_sq_final=9.57097268641972e-19 ||r||/||b|| (acc) = 9.78313481784838e-10
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: Solver Time=0.657563924789429 (sec)  Performance=3208.86882332496 GFLOPS
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: total_iters=6 || r || / || b || = 8.20486128341607e-10
QPHIX_MDAGM_SOLVER: total time: 1.241386 (sec)

Richardson-CG (double-float):

RICHARDSON: Solver converged at iter 3 || r || =   9.53805296e-10
RICHARDSON: Inner MV Apps=882 Outer MV Apps=10 Inner Site Flops=105336 Outer Site Flops=477
QPHIX_CLOVER_CG_SOLVER: 3 iters,  rsd_sq_final=9.53805295227943e-10
QPHIX_CLOVER_CG_SOLVER: || r || / || b || = 9.53805295227943e-10
QPHIX_CLOVER_CG_SOLVER: Solver Time=1.95444297790527 (sec)  Performance=2738.32385277162 GFLOPS
QPHIX_MDAGM_ITER_REFINE_SOLVER: total time: 2.017046 (sec)

@kostrzewa

kostrzewa commented Aug 24, 2017

Excellent, looks good. We'll try with Richardson-BiCGstab for all monomials to begin with and if this proves unstable for the lightest determinant ratio, we'll go with Richardson-CG there. The more iterations, the closer the advantage of mixed-precision gets to the theoretical factor of 2.

@martin-ueding
Contributor Author

I have now removed the using namespace std; again, so this can be merged, I'd think.

@kostrzewa

Could you adjust the output to read:

QPHIX_CLOVER_ITER_REFINE_CG_SOLVER

for completeness?

@martin-ueding
Contributor Author

Yes, that can be done. While we are at it, the XML name should also be changed in order to remove the BICGSTAB. However, that will break existing restart files. People running them will get an error message with a list of available XML entities, so they could in principle update it.

An alternative way would be to register the solver in the factory twice with the new and the old name. Existing configuration files would only get the correct behavior if there is no SolverType field in the XML file. Speaking of which, I am not sure whether the selected Richardson solver (CG or BiCGStab) will be properly written out into the restart file. So this has to be looked into.
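
Registering the same creator under both names would only be a couple of lines in the registration code, schematically like this (TheFactory, createSolver and the new name QPHIX_CLOVER_ITER_REFINE_INVERTER are placeholders, not the actual identifiers):

// Schematic only: register one creation function under the old and a new name
// so that existing input files keep working.
success &= TheFactory::Instance().registerObject(
    std::string("QPHIX_CLOVER_ITER_REFINE_BICGSTAB_INVERTER"), createSolver);
success &= TheFactory::Instance().registerObject(
    std::string("QPHIX_CLOVER_ITER_REFINE_INVERTER"), createSolver);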

Regarding breaking backwards compatibility versus having two entries, what would be the preferred option? Or should we just leave the name as-is and hope that users find out from the C++ code that CG is supported now?

@martin-ueding
Contributor Author

I have changed the name in standard output from QPHIX_CLOVER_CG_SOLVER to QPHIX_CLOVER_ITER_REFINE_CG_SOLVER. That was what you wanted, right?

@kostrzewa

Yes, that looks good. Currently testing this (without the new name) on Hazel Hen. CG double-single iterative refinement in all light determinant and determinant ratio terms gives a speed-up of roughly 1.8 over CG in double (speed-up measured on total trajectory time). Not bad, I'd say!

I will prepare a pull-request for my QPhiX branch (need to check some things for completeness before proceeding) and then this can be tested by someone else.

A point which might be interesting for backwards-compatibility reasons is to use BICGSTAB by default, such that input files which don't specify the solver type continue to function unmodified.

@martin-ueding
Contributor Author

I just had a quick look at the code and I don't see where the SolverType XML tag would be optional. So it seems one has to supply a solver type with all QPhiX solvers. The option was just ignored previously, so people specifying CG got BiCGStab.

This is probably not hard to fix, I just did not see it right away. I'll be out of the loop for some weeks now, so please just go ahead and change it.
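
For reference, making the tag optional with a BICGSTAB default would presumably be a small change in the parameter constructor, along these lines (a sketch only; I have not checked the actual reading code in that file):

// Sketch: fall back to BICGSTAB when the input file has no SolverType tag.
if (paramtop.count("SolverType") > 0) {
    read(paramtop, "SolverType", SolverType);
} else {
    SolverType = BICGSTAB;
}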

@kostrzewa

Okay, thanks. I'll take over from here then.
