Restructure the bufr sounding job #2853

Merged
merged 11 commits into from
Sep 7, 2024

Conversation

@BoCui-NOAA BoCui-NOAA commented Aug 21, 2024

Description

The current operational BUFR job starts concurrently with the GFS model run. This PR updates the job script and ush script so that the job processes all forecast hours simultaneously and then combines the temporary outputs into BUFR sounding products for each station. The updated job starts only after the GFS model completes its 180-hour run, and it handles all forecast files from 000hr to 180hr in a single pass. The new job needs 7 nodes instead of the 4 nodes currently used in operations.
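
To make "all forecast files from 000hr to 180hr" concrete, here is a minimal sketch of how the list of forecast hours could be enumerated before the per-hour tasks are launched. The helper name `list_forecast_hours` and the hourly-then-3-hourly spacing are assumptions for illustration only; this PR states just the 000hr to 180hr range.

```shell
# Sketch only: list_forecast_hours is a hypothetical helper, and the
# hourly-then-3-hourly spacing used below is an assumed example.
list_forecast_hours() {
  local hf_max=$1 hf_step=$2 fh_max=$3 fh_step=$4
  local fh=0
  # high-frequency segment: 0, hf_step, ..., hf_max
  while [ "${fh}" -le "${hf_max}" ]; do
    printf '%03d\n' "${fh}"
    fh=$(( fh + hf_step ))
  done
  # remaining segment: hf_max + fh_step, ..., fh_max
  fh=$(( hf_max + fh_step ))
  while [ "${fh}" -le "${fh_max}" ]; do
    printf '%03d\n' "${fh}"
    fh=$(( fh + fh_step ))
  done
}

# Example: hourly to 120 h, then every 3 h to 180 h
num_hours=$(list_forecast_hours 120 1 180 3 | wc -l)
echo "forecast hours to process: $(( num_hours ))"
```

Under that assumed spacing the list has 141 entries, so one task per forecast hour would fit on 7 nodes only if each node runs about 21 tasks.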

This PR depends on the GFS bufr code update NOAA-EMC/gfs-utils#75

With these updates to the bufr code and scripts, there is no need to add restart capability to the GFS post-processing job JGFS_ATMOS_POSTSND.

This PR also includes the following changes:

Rename the following table files:

parm/product/bufr_ij13km.txt to parm/product/bufr_ij_gfs_C768.txt
parm/product/bufr_ij9km.txt to parm/product/bufr_ij_gfs_C1152.txt

Add a new table file: parm/product/bufr_ij_gfs_C96.txt for GFSv17 C96 testing.

Added a new capability to the BUFR package: the job first tries to read bufr_ij_gfs_${CASE}.txt. If that table file is not available, the code automatically finds the nearest-neighbor grid point (i, j).
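
The table-first, search-second behaviour can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the function name `select_ij_table` and the `nearest-neighbor` token it prints are hypothetical, and the actual nearest-neighbor search happens inside the bufr code rather than in shell.

```shell
# Sketch only: select_ij_table and its "nearest-neighbor" token are
# hypothetical; the real job script may organize this differently.
select_ij_table() {
  local parm_dir=$1 case_res=$2
  local table="${parm_dir}/bufr_ij_gfs_${case_res}.txt"
  if [ -s "${table}" ]; then
    # a non-empty table exists for this resolution: use it
    printf '%s\n' "${table}"
  else
    # no table: signal that (i, j) must come from a nearest-neighbor search
    printf '%s\n' "nearest-neighbor"
  fi
}

# Example call for the C768 table under parm/product
select_ij_table parm/product C768
```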

Refs #1257
Refs NOAA-EMC/gfs-utils#75

Type of change

  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO

  • Does this change require a documentation update? NO

  • Does this change require an update to any of the following submodules? YES (If YES, please add a link to any PRs that are pending.)

    • GFS-utils

How has this been tested?

  • Cycled test on WCOSS2

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

WalterKolczynski-NOAA pushed a commit to NOAA-EMC/gfs-utils that referenced this pull request Aug 24, 2024
This PR includes the following changes to the bufr sounding code:

1. added a function to identify and process 3D soil variables, which are new outputs from GFSv17
2. modified the code to process each forecast hour individually
3. code clean-up and removal of nemsio input files that are no longer used
4. added a new module, modpr_module.f90, which is a simplified version of sigio_module.f
5. removed linking with the 'nemsio' and 'sigio' libraries in CMakeLists.txt

With the updates of bufr codes and scripts, there is no need to add
restart capability to GFS post-process job JGFS_ATMOS_POSTSND.
  
The related bufr job script update is another PR
NOAA-EMC/global-workflow#2853

Refs NOAA-EMC/global-workflow#1257
Refs NOAA-EMC/global-workflow#2853

@WalterKolczynski-NOAA WalterKolczynski-NOAA left a comment

I don't see an updated gfs_utils hash in this PR. I can help if you don't know how to commit that.

Comment on lines 86 to 93
# allocate 21 processes per node;
# allocating more may cause memory issues
num_ppn=21
export APRUN="mpiexec -np ${num_hours} -ppn ${num_ppn} --cpu-bind core cfp "

if [ -s "${DATA}/poescript_bufr" ]; then
  rm "${DATA}/poescript_bufr"
fi
Contributor

We have a utility script, ush/run_mpmd.sh, that now handles setting up an MPMD job. It is the preferred method because it correctly handles both Slurm and PBS/Torque. You just need to pass it the file containing your list of commands as an argument. See the atmos products ex-script for an example.
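
A minimal sketch of that pattern, assuming only what the comment says (the launcher takes the command-list file as its argument): write one command per line, then hand the file to the launcher. The function `run_command_list` and the `MPMD_RUNNER` variable are hypothetical names introduced for illustration; the default path simply mirrors the ush/run_mpmd.sh script named in this thread.

```shell
# Sketch only: run_command_list and MPMD_RUNNER are hypothetical names.
# The launcher is assumed to take the command file as its only argument.
run_command_list() {
  local cmdfile=$1
  shift
  # start from an empty command file, then add one command per line
  : > "${cmdfile}"
  local cmd
  for cmd in "$@"; do
    printf '%s\n' "${cmd}" >> "${cmdfile}"
  done
  # hand the finished list to the MPMD launcher
  "${MPMD_RUNNER:-ush/run_mpmd.sh}" "${cmdfile}"
}
```

Keeping the launcher path in a variable makes it easy to try the command-file logic with a stand-in script before running it under a real scheduler.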

Contributor Author

The mpiexec command in ush/run_mpmd.sh does not set the number of processes per node, which the bufr job exgfs_atmos_postsnd.sh needs. Will run_mpmd.sh be updated to support this in the future?

@WalterKolczynski-NOAA WalterKolczynski-NOAA Aug 29, 2024

I would try without the ppn setting first to confirm it is actually an issue (ideally the MPMD tasks should be equally distributed across all nodes anyway). If it is still required, an entry should be added to the env script on any machine where it is necessary to update the mpmd_opt setting to include -ppn for the sounding job.

Contributor Author

I tested the bufr job using run_mpmd.sh without setting the ppn parameter, and the job failed. After adding the ppn setting, the bufr job completed successfully. The PBS setting in my jobcard is:
#PBS -l place=vscatter,select=7:ncpus=128:mpiprocs=128

Please let me know if I’m wrong.
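
For reference, the arithmetic behind these settings can be sketched as below. The helper names `mpmd_slots` and `fits_allocation` are hypothetical; the numbers 7 (from `select=7`) and 21 (the ppn value under discussion) come from this thread.

```shell
# Sketch only: mpmd_slots and fits_allocation are hypothetical helpers.
# usable task slots when each of ${nodes} nodes runs at most ${ppn} tasks
mpmd_slots() {
  local nodes=$1 ppn=$2
  echo $(( nodes * ppn ))
}

# succeeds when ${tasks} tasks fit within the available slots
fits_allocation() {
  local tasks=$1 nodes=$2 ppn=$3
  [ "${tasks}" -le "$(mpmd_slots "${nodes}" "${ppn}")" ]
}

# 7 nodes (select=7) with -ppn 21
echo "usable slots: $(mpmd_slots 7 21)"   # prints "usable slots: 147"
```

The job card requests 128 MPI ranks per node, so capping placement at 21 per node leaves most cores idle in exchange for more memory per task.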

@WalterKolczynski-NOAA WalterKolczynski-NOAA Sep 4, 2024

See new review. It should maintain the ppn setting while switching to run_mpmd.sh.

@BoCui-NOAA
Contributor Author

I don't see an updated gfs_utils hash in this PR. I can help if you don't know how to commit that.

Yes, please tell me where to update gfs_utils hash and how to commit this. Thanks!

@WalterKolczynski-NOAA
Contributor

In your global-workflow clone, go to the sorc/gfs_utils.fd directory and checkout the appropriate hash. Then go back up and do a git add sorc/gfs_utils.fd and commit.
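
Those steps can be wrapped in a small helper, sketched here under stated assumptions. The function name `pin_submodule` is hypothetical, and the sketch assumes the wanted commit is already present in the submodule clone. It stops after staging so the result can be reviewed before committing.

```shell
# Sketch only: pin_submodule is a hypothetical wrapper around the steps above.
# It assumes the wanted commit already exists in the submodule clone
# (run "git fetch" inside the submodule first if it does not).
pin_submodule() {
  local top=$1 sub=$2 hash=$3
  # move the submodule working tree to the requested commit
  git -C "${top}/${sub}" checkout --quiet "${hash}" || return 1
  # stage the new submodule pointer in the parent repository
  git -C "${top}" add "${sub}" || return 1
  echo "staged ${sub} at ${hash}; review with 'git status', then commit"
}
```

A call such as `pin_submodule . sorc/gfs_utils.fd <hash>` from the top of the clone, followed by `git commit`, corresponds to the manual sequence described in the comment.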

@BoCui-NOAA
Contributor Author

Thanks, just committed using the updated gfs-utils hash.

BoCui-NOAA and others added 2 commits September 4, 2024 13:06
Co-authored-by: Walter Kolczynski - NOAA <[email protected]>
Co-authored-by: Walter Kolczynski - NOAA <[email protected]>
@WalterKolczynski-NOAA WalterKolczynski-NOAA added the CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS label Sep 6, 2024
@emcbot emcbot added CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS and removed CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS labels Sep 6, 2024
emcbot commented Sep 6, 2024

CI Update on Wcoss2 at 09/06/24 04:28:09 PM
============================================
Cloning and Building global-workflow PR: 2853
with PID: 182099 on host: dlogin03

@BoCui-NOAA
Contributor Author

@WalterKolczynski-NOAA I committed some changes 5 minutes ago; I didn't realize you had already approved the PR. Should I submit a new PR? My changes include:

Rename the following table files:

parm/product/bufr_ij13km.txt to bufr_ij_gfs_C768.txt
parm/product/bufr_ij9km.txt to bufr_ij_gfs_C1152.txt

Add a new table file: parm/product/bufr_ij_gfs_C96.txt for GFSv17 C96 testing.

Added a new capability to the BUFR package: the job first tries to read bufr_ij_gfs_${CASE}.txt. If that table file is not available, the code automatically finds the nearest-neighbor grid point (i, j).

@WalterKolczynski-NOAA
Contributor

No, this PR is fine. I'm going to restart the CI, since it will likely fail anyway.

@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS label Sep 6, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA added the CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS label Sep 6, 2024
emcbot commented Sep 6, 2024

CI Update on Wcoss2 at 09/06/24 05:04:08 PM
=================================================
PR:2853 Reset to Wcoss2-Ready by user and is now restarting CI tests
Driver PID: Requested termination of 182099 and children on dlogin03
Driver PID: has restarted as 109597 on dlogin03
No current experiments to cancel in PR: 2853 on Wcoss2

@emcbot emcbot added CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS and removed CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS labels Sep 6, 2024
emcbot commented Sep 6, 2024

CI Update on Wcoss2 at 09/06/24 05:04:48 PM
============================================
Cloning and Building global-workflow PR: 2853
with PID: 109597 on host: dlogin03

@BoCui-NOAA
Contributor Author

@WalterKolczynski-NOAA I have a question about the ppn setting, which is currently 21. That value is based on runs at C768 resolution and may need adjusting for C1152. Can we make it depend on the resolution? The total number of nodes would also need to be adjusted.

@WalterKolczynski-NOAA
Contributor

Yes, but let's explore that further and make a follow-up PR for that.
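
One possible shape for such a follow-up, purely as a sketch: look up the ppn by resolution and derive the node count from it. The names `ppn_for_case`, `nodes_needed`, and `PPN_POSTSND` are hypothetical. Only the C768 value of 21 comes from this PR, so other resolutions deliberately require an explicit override instead of a guessed number.

```shell
# Sketch only: ppn_for_case, nodes_needed and PPN_POSTSND are hypothetical.
# Only the C768 value (21) comes from this PR; other resolutions need an
# explicit override rather than a guessed number.
ppn_for_case() {
  local case_res=$1
  case "${case_res}" in
    C768) echo 21 ;;
    *)
      if [ -n "${PPN_POSTSND:-}" ]; then
        echo "${PPN_POSTSND}"
      else
        echo "no ppn known for ${case_res}; set PPN_POSTSND" >&2
        return 1
      fi
      ;;
  esac
}

# nodes needed for a given task count: ceil(tasks / ppn)
nodes_needed() {
  local tasks=$1 ppn=$2
  echo $(( (tasks + ppn - 1) / ppn ))
}
```

Deriving the node count from the task count and ppn keeps the two settings consistent when either one changes.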

@BoCui-NOAA
Contributor Author

Sure. Thanks for the comments.

@emcbot emcbot added CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS labels Sep 6, 2024
emcbot commented Sep 6, 2024

Automated global-workflow Testing Results:

Machine: Wcoss2
Start: Fri Sep  6 17:08:47 UTC 2024 on dlogin03
---------------------------------------------------
Build: Completed at 09/06/24 05:49:35 PM
Case setup: Completed for experiment C48_ATM_5f5e542f
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_5f5e542f
Case setup: Skipped for experiment C48_S2SWA_gefs_5f5e542f
Case setup: Completed for experiment C48_S2SW_5f5e542f
Case setup: Completed for experiment C96_atm3DVar_extended_5f5e542f
Case setup: Skipped for experiment C96_atm3DVar_5f5e542f
Case setup: Completed for experiment C96_atmaerosnowDA_5f5e542f
Case setup: Completed for experiment C96C48_hybatmDA_5f5e542f
Case setup: Completed for experiment C96C48_ufs_hybatmDA_5f5e542f

@emcbot emcbot added CI-Wcoss2-Passed **Bot use only** CI testing on WCOSS for this PR has completed successfully and removed CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Sep 7, 2024
emcbot commented Sep 7, 2024

All CI Test Cases Passed on Wcoss2:

Experiment C48_ATM_5f5e542f *** SUCCESS *** at 09/06/24 07:21:24 PM
Experiment C48_S2SW_5f5e542f *** SUCCESS *** at 09/06/24 07:33:18 PM
Experiment C96C48_hybatmDA_5f5e542f *** SUCCESS *** at 09/06/24 08:36:30 PM
Experiment C96_atmaerosnowDA_5f5e542f *** SUCCESS *** at 09/06/24 09:33:29 PM
Experiment C96C48_ufs_hybatmDA_5f5e542f *** SUCCESS *** at 09/06/24 09:57:19 PM
Experiment C96_atm3DVar_extended_5f5e542f *** SUCCESS *** at 09/07/24 07:42:35 AM

@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit b8080cd into NOAA-EMC:develop Sep 7, 2024
5 checks passed
@BoCui-NOAA
Contributor Author

Great, thanks @WalterKolczynski-NOAA

@@ -223,7 +223,7 @@ elif [[ "${step}" = "postsnd" ]]; then
export OMP_NUM_THREADS=1

export NTHREADS_POSTSND=${NTHREADS1}
export APRUN_POSTSND="${APRUN} --depth=${NTHREADS_POSTSND} --cpu-bind depth"
export mpmd_opt="-ppn 21 ${mpmd_opt}"
Contributor

@BoCui-NOAA @WalterKolczynski-NOAA Was it intentional to get rid of APRUN_POSTSND? This is referenced by ush/gfs_bufr_netcdf.sh.

Contributor Author

Yes, the 'APRUN_POSTSND' setting is no longer needed, and the ush/gfs_bufr_netcdf.sh script is no longer in use.

DavidHuber-NOAA added a commit to DavidHuber-NOAA/global-workflow that referenced this pull request Sep 9, 2024
* origin/develop:
  Create JEDI class (NOAA-EMC#2805)
  Restructure the bufr sounding job    (NOAA-EMC#2853)
  Add an archive task to GEFS system to archive files locally (NOAA-EMC#2816)
  Reenable Orion Cycling Support (NOAA-EMC#2877)
  Eliminate race conditions and remove DATAROOT last in cleanup (NOAA-EMC#2893)
  Update aerosol climatology to 2013-2024 mean (NOAA-EMC#2888)
  Add ability to run CI test C96_atm3DVar.yaml to Gaea-C5 (NOAA-EMC#2885)
  Support global-workflow GEFS C48 on Google Cloud (NOAA-EMC#2861)
  Add 3 and 9 hr increment files to IC staging (NOAA-EMC#2876)
  Add diffusion/diag B for aerosol DA and some other needed changes (NOAA-EMC#2738)
  Correct ocean `MOM.res_#` stage copy (NOAA-EMC#2868)
  Support coupling on AWS (NOAA-EMC#2859)
  Add JEDI ATM lgetkf observer and solver jobs (NOAA-EMC#2833)
  Fix gdas build on Gaea and add Gaea to available CI list (NOAA-EMC#2857)
  Support ATM forecast only on Google (NOAA-EMC#2832)
  Add GEFS C48 support on AWS (NOAA-EMC#2818)
  Update omega calculation (NOAA-EMC#2751)
  Add snow DA update and recentering for the EnKF forecasts (NOAA-EMC#2690)
  support ATM forecast only on Azure (NOAA-EMC#2827)
  Convert staging job to python and yaml (NOAA-EMC#2651)
  Fixed test on UNAVAILBLE in python Rocoto check (NOAA-EMC#2842)