FATES doesn't work in branch runs #1271

Open
samsrabin opened this issue Oct 28, 2024 · 22 comments · May be fixed by ESCOMP/CTSM#2955

Comments

@samsrabin
Contributor

samsrabin commented Oct 28, 2024

When trying to use FATES as part of a branch run, I get the following error:

forrtl: severe (151): allocatable array is already allocated
Image              PC                Routine            Line        Source
cesm.exe           0000000000983A08  edinitmod_mp_init         144  EDInitMod.F90
cesm.exe           00000000006403F4  clmfatesinterface        2089  clmfates_interfaceMod.F90
cesm.exe           00000000006063A6  clm_initializemod         757  clm_initializeMod.F90
cesm.exe           00000000005A8EA3  lnd_comp_nuopc_mp         658  lnd_comp_nuopc.F90

This is with CTSM tag ctsm5.3.009, FATES tag sci.1.78.3_api.36.1.0. It can be reproduced using the test ERI_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-Fates, which is the ERI version of an ERS test we already run as part of the fates suite.

Note parallel issue at ESCOMP/CTSM#2903.
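
For readers unfamiliar with the test name above, here is a minimal sketch of how it breaks down, assuming the usual CIME layout of `TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]`. The helper `parse_test_name` is hypothetical and written only for illustration; it is not part of CIME or CTSM.

```python
def parse_test_name(name):
    """Split a CIME-style test name into its dot-separated fields.

    Hypothetical helper for illustration; assumes the five-field layout
    TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS].
    """
    parts = name.split(".")
    # The first field holds the test type plus optional modifiers (e.g. Ld60).
    test_type, _, modifiers = parts[0].partition("_")
    # Assumes the machine name itself contains no underscore.
    machine, _, compiler = parts[3].partition("_")
    return {
        "test_type": test_type,
        "modifiers": modifiers.split("_") if modifiers else [],
        "grid": parts[1],
        "compset": parts[2],
        "machine": machine,
        "compiler": compiler,
        "testmods": parts[4] if len(parts) > 4 else None,
    }


fields = parse_test_name(
    "ERI_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-Fates"
)
print(fields["test_type"], fields["modifiers"], fields["testmods"])
```

Under that assumption, the only difference from the existing ERS test is the first field (`ERI` instead of `ERS`); the grid, compset, machine, compiler, and testmods are unchanged.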

@jennykowalcz

Have also encountered this. The less strict hybrid mode does work though.

@XiulinGao
Contributor

I encountered a similar issue when running with FATES API 36 (CTSM 5.2.013). When I try to do a hybrid run using a restart file, the model build fails with a warning that a restart file should not be defined when cold start is turned on, even though I had turned cold start off.

@samsrabin
Contributor Author

@XiulinGao That's actually a separate issue that is now fixed in CTSM as of ctsm5.3.011.

@rosiealice
Contributor

Hi all. Do we have a plan for addressing this issue? It has suddenly cropped up as a problem in the NorESM workflow

NorESMhub/CTSM#115 (comment)

@samsrabin
Contributor Author

> Hi all. Do we have a plan for addressing this issue? It has suddenly cropped up as a problem in the NorESM workflow
>
> NorESMhub/CTSM#115 (comment)

Not as far as I know.

@ckoven
Contributor

ckoven commented Feb 5, 2025

Do we know why this is the case? Could it just be that is_restart() returns .false. for hybrid cases here?

@jennykowalcz

Hybrid cases do work. I'll try to dig up the errors I got trying a branch case.

@ckoven
Contributor

ckoven commented Feb 5, 2025

Right, sorry, I meant to say branch runs above, not hybrid.

@samsrabin
Contributor Author

I'm not sure of the ultimate issue(s), but I included an error message in the CTSM issue; reproducing here for convenience:

forrtl: severe (151): allocatable array is already allocated
Image              PC                Routine            Line        Source
cesm.exe           0000000000983A08  edinitmod_mp_init         144  EDInitMod.F90
cesm.exe           00000000006403F4  clmfatesinterface        2089  clmfates_interfaceMod.F90
cesm.exe           00000000006063A6  clm_initializemod         757  clm_initializeMod.F90
cesm.exe           00000000005A8EA3  lnd_comp_nuopc_mp         658  lnd_comp_nuopc.F90

If you have Derecho access, that test is at /glade/derecho/scratch/samrabin/tests_1028-110314de/ERI_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-Fates.1028-110314de/. That test was with CTSM tag ctsm5.3.009, FATES tag sci.1.78.3_api.36.1.0.

@jennykowalcz

Wait... I feel like I'm losing it here. 🤪 I couldn't find the logs from long ago when I tried to run a branch case. So I took one of my more recent hybrid cases, set RUN_TYPE to branch, refreshed the rpointer files in the run directory, and it seems to be working! As in, history files are being produced.
I am using E3SM branch: https://github.com/rgknox/E3SM/tree/rknox/lnd/fates-api34-schwartz-cbal
and this FATES branch: https://github.com/jennykowalcz/fates/tree/jkowalcz-merge-test

@samsrabin
Contributor Author

Huh! I'll start a CTSM test now.

@jennykowalcz

I am still confused, but the land log file has

 define run:
    source                = E3SM Land Model
    model_version         = 2d74f509c8
    run type              = branch 
    case title            = UNSET
    username              = jkowalcz
    hostname              = pm-cpu

so it really is running as a branch run.

@samsrabin
Contributor Author

Confirming that it still happens for me as of CTSM ctsm5.3.021 and FATES sci.1.80.4_api.37.0.0. Slight difference in line numbers here relative to before, but same error:

dec1816.hsn.de.hpc.ucar.edu 611: forrtl: severe (151): allocatable array is already allocated
dec1816.hsn.de.hpc.ucar.edu 611: Image              PC                Routine            Line        Source
dec1816.hsn.de.hpc.ucar.edu 611: cesm.exe           0000000000981CF5  edinitmod_mp_init         144  EDInitMod.F90
dec1816.hsn.de.hpc.ucar.edu 611: cesm.exe           000000000063CB84  clmfatesinterface        2089  clmfates_interfaceMod.F90
dec1816.hsn.de.hpc.ucar.edu 611: cesm.exe           00000000006025AD  clm_initializemod         761  clm_initializeMod.F90
dec1816.hsn.de.hpc.ucar.edu 611: cesm.exe           00000000005A50AA  lnd_comp_nuopc_mp         661  lnd_comp_nuopc.F90
dec1816.hsn.de.hpc.ucar.edu 611: libesmf.so         00001520624E9884  _ZN5ESMCI6FTable1     Unknown  Unknown

Test is on Derecho at /glade/derecho/scratch/samrabin/tests_0205-111402de/ERI_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-Fates.G.0205-111402de/.

@jennykowalcz

jennykowalcz commented Feb 5, 2025

There are a lot of differences between that FATES tag and the api-34-based branch I was using, and I have no idea why any of them would affect branch vs hybrid...

Or is it a problem on the HLM side?

@samsrabin
Contributor Author

samsrabin commented Feb 5, 2025

I suspect it is indeed an HLM-side issue, with CTSM calling an initialization subroutine that it shouldn't during branch runs.

@ckoven
Contributor

ckoven commented Feb 5, 2025

Is it the same issue as #653? And has anyone tried adding an .or. nsrest .eq. nsrBranch to https://github.com/ESCOMP/CTSM/blob/master/src/utils/clmfates_interfaceMod.F90#L472 to see if that resolves things?
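
To make the suggestion concrete, here is a toy model of the logic being discussed. It is only a sketch under stated assumptions: the real code is Fortran in clmfates_interfaceMod.F90, the integer values of the run-type constants are assumed, and `Site`, `fates_is_restart`, and `initialize` are hypothetical names invented for this illustration. It also assumes one possible reading of the crash, namely that a branch run reads restart data and then also goes through the cold-start initialization because the restart flag is false.

```python
# Illustrative model only; the constant values below are assumptions.
NSR_STARTUP, NSR_CONTINUE, NSR_BRANCH = 0, 1, 2


def fates_is_restart(nsrest, include_branch):
    """Decide whether FATES is told that it is starting from a restart file."""
    if include_branch:
        # Proposed condition: continue runs OR branch runs count as restarts.
        return nsrest in (NSR_CONTINUE, NSR_BRANCH)
    # Assumed current condition: only continue runs count as restarts.
    return nsrest == NSR_CONTINUE


class Site:
    """Stand-in for a structure whose arrays may be allocated only once."""

    def __init__(self):
        self.allocated = False

    def allocate(self):
        if self.allocated:
            raise RuntimeError("allocatable array is already allocated")
        self.allocated = True


def initialize(nsrest, include_branch):
    site = Site()
    if nsrest in (NSR_CONTINUE, NSR_BRANCH):
        site.allocate()  # restart data is read for continue and branch runs
    if not fates_is_restart(nsrest, include_branch):
        site.allocate()  # cold-start initialization path
    return site


initialize(NSR_BRANCH, include_branch=True)       # completes
try:
    initialize(NSR_BRANCH, include_branch=False)  # second allocation fails
except RuntimeError as err:
    print(err)
```

In this model, startup and continue runs behave the same with either condition; only the branch case changes, which matches the observation that the crash is specific to branch runs.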

@samsrabin
Contributor Author

Looks like yes, that is the same issue. I haven't tried anything myself.

@ckoven
Contributor

ckoven commented Feb 5, 2025

ok, well it looks like E3SM does have that logic for deciding when to do restarts, as of this PR, so @jennykowalcz if you are still running into the problem then it must not be a complete solution.

@ekluzek
Collaborator

ekluzek commented Feb 5, 2025

Hi everyone. So do we have any ERI tests for FATES? That test type covers restart, hybrid, branch, and startup runs, so it should show this problem. I think we might have a few, but maybe not enough.

Might be good to run the whole FATES test suite where we change everything to ERI tests?

@samsrabin can you check if there are any ERI tests for FATES?

@samsrabin
Contributor Author

@ekluzek No, there are no ERI tests for FATES in the CLM test list. We definitely should add some of those, but only after the problem is solved, since we already know it happens reliably with ERI_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-Fates.

@jennykowalcz

> ok, well it looks like E3SM does have that logic for deciding when to do restarts, as of this PR, so @jennykowalcz if you are still running into the problem then it must not be a complete solution.

@ckoven I think my problem must have been user error. I have now tested ELM-FATES branch runs with API 34 and API 36, and they work :)

@samsrabin
Contributor Author

> Is it the same issue as #653? And has anyone tried adding an .or. nsrest .eq. nsrBranch to https://github.com/ESCOMP/CTSM/blob/master/src/utils/clmfates_interfaceMod.F90#L472 to see if that resolves things?

This (plus doing it in one other place) solves the crash, and the simulations complete successfully, although the test still fails in the COMPARE_base_hybrid step. Thanks, Charlie!

@samsrabin samsrabin linked a pull request Feb 6, 2025 that will close this issue