MonitorForces not working with QUDA #517

Closed
sbacchio opened this issue Jan 11, 2022 · 19 comments

@sbacchio
Contributor

When MonitorForces = yes, an error occurs in QUDA. More investigation is required.

The error is the following:

MG level 0 (GPU): ERROR: Spinor volume 351232 doesn't match gauge volume 0 (rank 0, host jwb0134.juwels, dirac.cpp:125 in checkParitySpinor())
MG level 0 (GPU):        last kernel called was (name=N4quda4blas7axpbyz_IfEE,volume=28x56x56x4,aux=GPU-offline,vol=351232,precision=4,order=4,Ns=4,Nc=3,TwistFlavour=1)

For more details see $SCRATCH_fssh/bacchio1/C56/logs/log_trial_4811972.out on Juwels Booster

@kostrzewa
Member

kostrzewa commented Jan 26, 2022

We've reproduced this, albeit with a different error: we get a precision mismatch. This indicates that one of the parameter structs is not properly initialized (or overwritten somehow).

@pittlerf
Contributor

Hi, yes: if, at the beginning of monitor_forces, I call

update_tm_gauge_id(&g_gauge_state, 0.1);

and afterwards

update_tm_gauge_id(&g_gauge_state, -0.1);

the problem actually disappears.

@pittlerf
Contributor

We've reproduced this, albeit with a different error: we get a precision mismatch. This indicates that one of the parameter structs is not properly initialized (or overwritten somehow).

I saw a similar kind of issue when using a hot start:
MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

Hi,
I'm seeing a similar error; I attach the relevant part of the valgrind inspection. My understanding is that the check which controls whether the sloppy gauge must be allocated,

https://github.com/sunpho84/quda/blob/5431b168b09343503d0d676425069dc895879c92/lib/interface_quda.cpp#L670-L674

is not working, though I have to say I don't understand the logic.

Below is also the relevant part of the input file; maybe somebody can spot a parameter that is not properly set?

==93804== Invalid read of size 8
==93804==    at 0x14932B84: quda::Dirac::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac.cpp:122)
==93804==    by 0x149681B7: quda::DiracTwistedClover::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:34)
==93804==    by 0x1496861B: quda::DiracTwistedCloverPC::Dslash(quda::ColorSpinorField&, quda::ColorSpinorField const&, QudaParity_s) const (dirac_twisted_clover.cpp:219)
==93804==    by 0x14968E63: quda::DiracTwistedCloverPC::M(quda::ColorSpinorField&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:289)
==93804==    by 0x1485FC3F: quda::DiracM::operator()(quda::ColorSpinorField&, quda::ColorSpinorField const&, quda::ColorSpinorField&) const (dirac_quda.h:2117)
==93804==    by 0x148A576B: quda::CAGCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_ca_gcr.cpp:223)
==93804==    by 0x1485215B: quda::MG::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (multigrid.cpp:1277)
==93804==    by 0x148BEC43: quda::GCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_gcr_quda.cpp:411)
==93804==    by 0x1490916F: invertQuda (interface_quda.cpp:3011)
==93804==    by 0x1004C08F: invert_eo_degenerate_quda (quda_interface.c:2099)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804==  Address 0x15b7deaa8 is 8 bytes inside a block of size 3,112 free'd
==93804==    at 0x4086234: free (vg_replace_malloc.c:540)
==93804==    by 0x149CF2BB: quda::host_free_(char const*, char const*, int, void*) (malloc.cpp:475)
==93804==    by 0x14949E8F: operator delete (object.h:24)
==93804==    by 0x14949E8F: quda::cudaGaugeField::~cudaGaugeField() (cuda_gauge_field.cpp:111)
==93804==    by 0x148E88BB: freeSloppyGaugeQuda() (interface_quda.cpp:1046)
==93804==    by 0x148E8C57: freeGaugeQuda (interface_quda.cpp:1104)
==93804==    by 0x100483BB: _loadGaugeQuda (quda_interface.c:587)
==93804==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804==    by 0x1007F59B: monitor_forces (monitor_forces.c:58)
==93804==    by 0x1003701B: update_tm (update_tm.c:134)
==93804==    by 0x1000758F: main (hmc_tm.c:402)
==93804==  Block was alloc'd at
==93804==    at 0x408484C: malloc (vg_replace_malloc.c:309)
==93804==    by 0x149CFA6F: quda::safe_malloc_(char const*, char const*, int, unsigned long) (malloc.cpp:282)
==93804==    by 0x1490D45B: operator new (object.h:22)
==93804==    by 0x1490D45B: loadGaugeQuda (interface_quda.cpp:673)
==93804==    by 0x100482EB: _loadGaugeQuda (quda_interface.c:595)
==93804==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x100775A7: cloverdetratio_heatbath (cloverdetratio_monomial.c:287)
==93804==    by 0x100366A3: update_tm (update_tm.c:130)
==93804==    by 0x1000758F: main (hmc_tm.c:402)
==93804== 
BeginExternalInverter QUDA
  Pipeline = 24
  gcrNkrylov = 24
  MGCoarseMuFactor = 1.0, 1.0, 50.0
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32
  MGSetupSolver = cg
  MGSetup2KappaMu = 0.000336154560
  MGVerbosity = silent, silent, silent
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGCoarseSolverType = gcr, gcr, cagcr
  MgCoarseSolverTolerance = 0.1, 0.1, 0.1
  MGCoarseMaxSolverIterations = 15, 15, 15
  MGSmootherType = cagcr, cagcr, cagcr
  MGSmootherTolerance = 0.2, 0.2, 0.2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherPostIterations = 4, 4, 4
  MGBlockSizesX = 2,2
  MGBlockSizesY = 2,2
  MGBlockSizesZ = 2,2
  MGBlockSizesT = 2,2
  MGOverUnderRelaxationFactor = 0.90, 0.90, 0.90
  MGResetSetupMDUThreshold = 1.0
  # tau = 1.0 / 17 = 0.05882353 -> Threshold = 0.058
  MGRefreshSetupMDUThreshold = 0.058
  MGRefreshSetupMaxSolverIterations = 20, 20
EndExternalInverter

BeginOperator CLOVER
  CSW = 1.76
  kappa = 0.15
  2kappamu = 0.0015846837
  SolverPrecision = 1e-14
  MaxSolverIterations = 1000
#  solver = cg
  solver = mg
  UseEvenOdd = yes
  useexternalinverter = quda
  usesloppyprecision = single  
EndOperator

BeginMonomial CLOVERDET
  Timescale = 1
  kappa = 0.15
  2KappaMu = 0.0015846837
  CSW = 1.76
  rho = 0.09353509
  MaxSolverIterations = 1000
  AcceptancePrecision =  1.e-19
  ForcePrecision = 1.e-15
  Name = cloverdetlight
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial


BeginMonomial CLOVERDETRATIO
  Timescale = 1
  kappa = 0.15
  2KappaMu = 0.0015846837
  rho = 0.01039279
  rho2 = 0.09353509
  CSW = 1.76
  MaxSolverIterations = 1000
  AcceptancePrecision =  1.e-19
  ForcePrecision = 1.e-16
  Name = cloverdetratio1light
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial


@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

I'm tagging @Marcogarofalo since he is observing the same issue.

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

I notice that

  UseEvenOdd = yes

is not added to the CLOVERDET monomial, while it is set in the CLOVER operator, which (I believe) is working smoothly. In your opinion, could this be related? I'll do a test...

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

OK, I understand now that

UseEvenOdd = yes

is not needed there and is in fact not recognized at all.

@kostrzewa
Member

The clover monomials should always be EO (it's an unholy mess for historical reasons)...

  • There's a global UseEvenOdd which, in principle, sets all monomials to be EO-preconditioned (unless a monomial is encountered which does not support this).
  • Some monomials have their own UseEvenOdd parameter to control certain historically relevant cases.
  • The operators are a different story: these can always be yes or no, and with MG for the online measurement, no makes sense (and in general for measurements).

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

OK, the global flag is set to yes, so this is ruled out.

For some reason the heatbath part of the monomial is working, but the force calculation is not...

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

In other words: when the monomial is created, the sloppy gauge field is initialized; then, when the force is computed, the sloppy field is freed and not recreated, but it is later accessed by the solver.

@sunpho84
Contributor

sunpho84 commented Feb 9, 2022

Hi, yes: if, at the beginning of monitor_forces, I call update_tm_gauge_id(&g_gauge_state, 0.1); and afterwards update_tm_gauge_id(&g_gauge_state, -0.1); the problem actually disappears.

I only see this comment by Ferenz now. It looks to me like this might be related to PR #522, where we observed another problem related to gauge_state. Possibly PR #523 might fix the issue?

@pittlerf
Contributor

pittlerf commented Feb 9, 2022

Hi @sunpho84, I tried PR #523; however, I still get the issue when MonitorForces is turned on:
MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())

@simone-romiti
Contributor

Hi, yes: if, at the beginning of monitor_forces, I call update_tm_gauge_id(&g_gauge_state, 0.1); and afterwards update_tm_gauge_id(&g_gauge_state, -0.1); the problem actually disappears.

Just for reference, I report here another workaround that makes the problem disappear. One should add:

updateMultigridQuda(quda_mg_preconditioner, &quda_mg_param);

after this line:

quda_mg_param.invert_param->gamma_basis = QUDA_DEGRAND_ROSSI_GAMMA_BASIS;

kostrzewa added a commit that referenced this issue Feb 17, 2022
…is 'resolves' #517 but makes using the MG a bit more expensive in the HMC because MG_Preconditioner_Setup_Update is basically called at every call of the MG
@kostrzewa
Member

@sbacchio can you give #525 a try for the problem that you've encountered?

@kostrzewa
Member

@sbacchio did the changes solve the issue with MonitorForces?

@Marcogarofalo
Contributor

I still see this issue in
/m100_work/INF22_lqcd123_0/hmc/cA211.12.48/start_from_0186/new_3nodes/logs/log_cA211.12.48_5918403.out

@kostrzewa
Member

@Marcogarofalo in your input file, can you specify

BeginOperator CLOVER
  CSW = 1.74
  kappa = 0.140065
  2KappaMu = 0.0003361560
  solver = mg
  SolverPrecision = 1e-18
  MaxSolverIterations = 70000
  useevenodd = yes                                                                                                                         
  useexternalinverter = quda
  usesloppyprecision = single ## <-- add this
EndOperator

to see if this resolves the problem? I think there might be an issue with trying to do full double-precision MG. Doing so is not recommended anyway, but I suspect that this is the reason for what you're seeing in the online measurement.

Strictly speaking we should of course support full double-precision MG, but it's not a high priority as it will be slow.

@kostrzewa
Member

Note that you can also reduce the maximum number of iterations there to at most 500 or so.

@Marcogarofalo
Contributor

Yes, sorry: basically the error I am seeing is #530. I thought that I had fixed the input. Thank you.
