MonitorForces not working with QUDA #517

Closed
sbacchio opened this issue Jan 11, 2022 · 19 comments

@sbacchio
Contributor

When MonitorForces = yes, an error occurs in QUDA. More investigation is required.

The error is the following:

MG level 0 (GPU): ERROR: Spinor volume 351232 doesn't match gauge volume 0 (rank 0, host jwb0134.juwels, dirac.cpp:125 in checkParitySpinor())
MG level 0 (GPU):        last kernel called was (name=N4quda4blas7axpbyz_IfEE,volume=28x56x56x4,aux=GPU-offline,vol=351232,precision=4,order=4,Ns=4,Nc=3,TwistFlavour=1)

For more details see $SCRATCH_fssh/bacchio1/C56/logs/log_trial_4811972.out on Juwels Booster

@kostrzewa
Member

kostrzewa commented Jan 26, 2022

We've reproduced this, albeit with a different error: we get a precision mismatch. This indicates that one of the parameter structs is not properly initialized (or overwritten somehow).

@pittlerf
Contributor

Hi, yes: if, at the beginning of monitor_forces, I call

update_tm_gauge_id(&g_gauge_state, 0.1);

and afterwards

update_tm_gauge_id(&g_gauge_state, -0.1);

the problem actually disappears.

@pittlerf
Contributor

We've reproduced this, albeit with a different error: we get a precision mismatch. This indicates that one of the parameter structs is not properly initialized (or overwritten somehow).

I saw a similar kind of issue when using a hot start:
MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

Hi,
I'm seeing a similar error; I attach the relevant part of the valgrind inspection. My understanding is that the check which controls whether the sloppy gauge must be allocated,

https://github.com/sunpho84/quda/blob/5431b168b09343503d0d676425069dc895879c92/lib/interface_quda.cpp#L670-L674

is not working, though I have to say I don't understand the logic.

Below is also the relevant part of the input file; maybe somebody can spot a parameter that is not properly set?

==93804== Invalid read of size 8
==93804==    at 0x14932B84: quda::Dirac::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac.cpp:122)
==93804==    by 0x149681B7: quda::DiracTwistedClover::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:34)
==93804==    by 0x1496861B: quda::DiracTwistedCloverPC::Dslash(quda::ColorSpinorField&, quda::ColorSpinorField const&, QudaParity_s) const (dirac_twisted_clover.cpp:219)
==93804==    by 0x14968E63: quda::DiracTwistedCloverPC::M(quda::ColorSpinorField&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:289)
==93804==    by 0x1485FC3F: quda::DiracM::operator()(quda::ColorSpinorField&, quda::ColorSpinorField const&, quda::ColorSpinorField&) const (dirac_quda.h:2117)
==93804==    by 0x148A576B: quda::CAGCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_ca_gcr.cpp:223)
==93804==    by 0x1485215B: quda::MG::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (multigrid.cpp:1277)
==93804==    by 0x148BEC43: quda::GCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_gcr_quda.cpp:411)
==93804==    by 0x1490916F: invertQuda (interface_quda.cpp:3011)
==93804==    by 0x1004C08F: invert_eo_degenerate_quda (quda_interface.c:2099)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804==  Address 0x15b7deaa8 is 8 bytes inside a block of size 3,112 free'd
==93804==    at 0x4086234: free (vg_replace_malloc.c:540)
==93804==    by 0x149CF2BB: quda::host_free_(char const*, char const*, int, void*) (malloc.cpp:475)
==93804==    by 0x14949E8F: operator delete (object.h:24)
==93804==    by 0x14949E8F: quda::cudaGaugeField::~cudaGaugeField() (cuda_gauge_field.cpp:111)
==93804==    by 0x148E88BB: freeSloppyGaugeQuda() (interface_quda.cpp:1046)
==93804==    by 0x148E8C57: freeGaugeQuda (interface_quda.cpp:1104)
==93804==    by 0x100483BB: _loadGaugeQuda (quda_interface.c:587)
==93804==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804==    by 0x1007F59B: monitor_forces (monitor_forces.c:58)
==93804==    by 0x1003701B: update_tm (update_tm.c:134)
==93804==    by 0x1000758F: main (hmc_tm.c:402)
==93804==  Block was alloc'd at
==93804==    at 0x408484C: malloc (vg_replace_malloc.c:309)
==93804==    by 0x149CFA6F: quda::safe_malloc_(char const*, char const*, int, unsigned long) (malloc.cpp:282)
==93804==    by 0x1490D45B: operator new (object.h:22)
==93804==    by 0x1490D45B: loadGaugeQuda (interface_quda.cpp:673)
==93804==    by 0x100482EB: _loadGaugeQuda (quda_interface.c:595)
==93804==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x100775A7: cloverdetratio_heatbath (cloverdetratio_monomial.c:287)
==93804==    by 0x100366A3: update_tm (update_tm.c:130)
==93804==    by 0x1000758F: main (hmc_tm.c:402)
==93804== 
BeginExternalInverter QUDA
  Pipeline = 24
  gcrNkrylov = 24
  MGCoarseMuFactor = 1.0, 1.0, 50.0
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32
  MGSetupSolver = cg
  MGSetup2KappaMu = 0.000336154560
  MGVerbosity = silent, silent, silent
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGCoarseSolverType = gcr, gcr, cagcr
  MgCoarseSolverTolerance = 0.1, 0.1, 0.1
  MGCoarseMaxSolverIterations = 15, 15, 15
  MGSmootherType = cagcr, cagcr, cagcr
  MGSmootherTolerance = 0.2, 0.2, 0.2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherPostIterations = 4, 4, 4
  MGBlockSizesX = 2,2
  MGBlockSizesY = 2,2
  MGBlockSizesZ = 2,2
  MGBlockSizesT = 2,2
  MGOverUnderRelaxationFactor = 0.90, 0.90, 0.90
  MGResetSetupMDUThreshold = 1.0
  # tau = 1.0 / 17 = 0.05882353 -> Threshold = 0.058
  MGRefreshSetupMDUThreshold = 0.058
  MGRefreshSetupMaxSolverIterations = 20, 20
EndExternalInverter

BeginOperator CLOVER
  CSW = 1.76
  kappa = 0.15
  2kappamu = 0.0015846837
  SolverPrecision = 1e-14
  MaxSolverIterations = 1000
#  solver = cg
  solver = mg
  UseEvenOdd = yes
  useexternalinverter = quda
  usesloppyprecision = single  
EndOperator

BeginMonomial CLOVERDET
  Timescale = 1
  kappa = 0.15
  2KappaMu = 0.0015846837
  CSW = 1.76
  rho = 0.09353509
  MaxSolverIterations = 1000
  AcceptancePrecision =  1.e-19
  ForcePrecision = 1.e-15
  Name = cloverdetlight
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial


BeginMonomial CLOVERDETRATIO
  Timescale = 1
  kappa = 0.15
  2KappaMu = 0.0015846837
  rho = 0.01039279
  rho2 = 0.09353509
  CSW = 1.76
  MaxSolverIterations = 1000
  AcceptancePrecision =  1.e-19
  ForcePrecision = 1.e-16
  Name = cloverdetratio1light
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial


@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

I'm tagging @Marcogarofalo since he is observing the same issue.

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

I notice that

  UseEvenOdd = yes

is not added to the CLOVERDET monomial, while it is set in the CLOVER operator, which (I believe) is working smoothly. In your opinion, could this be related? I'll do a test...

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

OK, I understand now that

UseEvenOdd = yes

is not needed there and is in fact not recognized at all.

@kostrzewa
Member

The clover monomials should always be EO (it's an unholy mess for historical reasons)...

  • There's a global UseEvenOdd which, in principle, sets all monomials to be EO-preconditioned (unless a monomial is encountered which does not support this).
  • Some monomials have their own UseEvenOdd parameter to control certain historically relevant cases.
  • The operators are a different story: these can always be yes or no, and with MG for the online measurement, no makes sense (and in general for measurements).

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

OK, the global flag is set to yes, so this is ruled out.

For some reason the heatbath part of the monomial is working, but the force calculation is not...

@sunpho84
Contributor

sunpho84 commented Feb 8, 2022

In other words: when the monomial is created, the sloppy gauge field is initialized; then, when the force is computed, the sloppy field is freed and not recreated, but it is later accessed by the solver.

@sunpho84
Contributor

sunpho84 commented Feb 9, 2022

Hi, yes: if, at the beginning of monitor_forces, I call update_tm_gauge_id(&g_gauge_state, 0.1); and afterwards update_tm_gauge_id(&g_gauge_state, -0.1); the problem actually disappears.

I only see this comment by Ferenz now. It looks to me like this might be related to PR #522, where we observed another problem related to gauge_state. Possibly PR #523 might fix the issue?

@pittlerf
Contributor

pittlerf commented Feb 9, 2022

Hi @sunpho84, I tried PR #523; however, I still get the issue when MonitorForces is turned on:
MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())

@simone-romiti
Contributor

Hi, yes: if, at the beginning of monitor_forces, I call update_tm_gauge_id(&g_gauge_state, 0.1); and afterwards update_tm_gauge_id(&g_gauge_state, -0.1); the problem actually disappears.

Just for reference, I report here another workaround that makes the problem disappear. One should add:

updateMultigridQuda(quda_mg_preconditioner, &quda_mg_param);

after this line:

quda_mg_param.invert_param->gamma_basis = QUDA_DEGRAND_ROSSI_GAMMA_BASIS;

kostrzewa added a commit that referenced this issue Feb 17, 2022
…is 'resolves' #517 but makes using the MG a bit more expensive in the HMC because MG_Preconditioner_Setup_Update is basically called at every call of the MG
@kostrzewa
Member

@sbacchio can you give #525 a try for the problem that you've encountered?

@kostrzewa
Member

@sbacchio did the changes solve the issue with MonitorForces?

@Marcogarofalo
Contributor

I still see this issue in
/m100_work/INF22_lqcd123_0/hmc/cA211.12.48/start_from_0186/new_3nodes/logs/log_cA211.12.48_5918403.out

@kostrzewa
Member

@Marcogarofalo in your input file, can you specify

BeginOperator CLOVER
  CSW = 1.74
  kappa = 0.140065
  2KappaMu = 0.0003361560
  solver = mg
  SolverPrecision = 1e-18
  MaxSolverIterations = 70000
  useevenodd = yes                                                                                                                         
  useexternalinverter = quda
  usesloppyprecision = single ## <-- add this
EndOperator

to see if this resolves the problem? I think there might be an issue with trying to do full double-precision MG. Doing so is not recommended anyway, but I suspect that this is the reason for what you're seeing in the online measurement.

Strictly speaking we should of course support full double-precision MG, but it's not a high priority as it will be slow.

@kostrzewa
Member

Note that you can also reduce the maximum number of iterations there to at most 500 or so.

@Marcogarofalo
Contributor

Yes, sorry: basically the error I am seeing is #530. I thought that I had fixed the input. Thank you.
