Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AWS functionality to workflow setup and forecast job #1667

Merged
merged 39 commits into from
Jun 18, 2023
Merged

Add AWS functionality to workflow setup and forecast job #1667

merged 39 commits into from
Jun 18, 2023

Conversation

HenryRWinterbottom
Copy link
Contributor

Description

This PR provides support for running UFS weather model forecasts, using the global-workflow, on the AWS Parallel Works (PW) platform. This PR accomplishes the following:

  • Building the UFS weather model within the global-workflow build system;
  • Running a UFS weather model atmosphere-only deterministic forecast using the global-workflow infrastructure.

Follow-up PRs will include the following:

  • Adding documentation for building and using the AWS PW UFSWM spack-stack environment; this is heavily borrowed from Sadegh Sadeghi Tabas's presentation 03 May 2023;
  • Addressing remaining TODO issues within this PR; several notes are made throughout as to how to better streamline the connections between the UFSWM and the global-workflow for the respective cloud platforms;
  • Adding data-transfer capabilities within the global-workflow for cloud deployments;
  • Testing additional UFSWM capabilities (e.g., coupling, etc.,) and resolutions (this PR has been tested at C48 and has not been configured for other cubed-sphere resolutions).

No dependencies other than the loading of the appropriate AWS PW modules are required. These are included in this PR.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

This PR has been tested as follows:

  • The global-workflow fork/branch has been deployed to the AWS PW platform;
  • The UFSWM is built using the global-workflow build system;
  • A C48 atmosphere-only forecast has been run to completion.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shellcheck found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

jobs/rocoto/fcst.sh Fixed Show fixed Hide fixed
sorc/build_ufs.sh Fixed Show fixed Hide fixed
sorc/build_ufs.sh Fixed Show fixed Hide fixed
sorc/build_ufs.sh Fixed Show fixed Hide fixed
sorc/build_ufs.sh Fixed Show fixed Hide fixed
sorc/build_ufs.sh Fixed Show fixed Hide fixed
sorc/build_ufs.sh Fixed Show fixed Hide fixed
Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for getting late on this PR.
I am noticing redefinitions between the global-workflow and the ufs-weather-model e.g. PW_CSP and some of the module loading.
Please take a look at the comments.

@@ -38,6 +38,16 @@ case $(hostname -f) in
*) MACHINE_ID=UNKNOWN ;; # Unknown platform
esac

if [[ ${MACHINE_ID} == "UNKNOWN" ]]; then
case $(dnsdomainname -f) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case $(dnsdomainname -f) in
case ${PW_CSP:-} in

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit/update.

case $(dnsdomainname -f) in

# AWS Parallel Works
*pw-noaa*pw.local) MACHINE_ID=aws ;;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*pw-noaa*pw.local) MACHINE_ID=aws ;;
"aws" | "google" | "azure" ) MACHINE_ID=noaacloud ;;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit/update.

ush/module-setup.sh Outdated Show resolved Hide resolved
ush/module-setup.sh Show resolved Hide resolved
@@ -15,14 +15,15 @@ class Host:
"""

SUPPORTED_HOSTS = ['HERA', 'ORION', 'JET',
'WCOSS2', 'S4', 'CONTAINER']
'WCOSS2', 'S4', 'CONTAINER', 'AWSPW']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'WCOSS2', 'S4', 'CONTAINER', 'AWSPW']
'WCOSS2', 'S4', 'CONTAINER', 'NOAACLOUD']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be there is a need to distinguish between Azure, Google and AWS later.
But for now, lets keep with the ufs-weather-model

Comment on lines 43 to 48
if [[ "${MACHINE_ID}" == "noaacloud" ]]; then
case $(dnsdomainname -f) in
# TODO: Add Google and Azure platforms.
*pw-noaa*pw.local) CLOUD_MACHINE_ID="aws.${RT_COMPILER}" ;;
*) ;;
esac
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't be doing detection again. MACHINE ID has been detected once. UFSWM is using a PW_CSP variable. Lets use that and if necessary use PW_CSP to distinguish.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aerorahul Can you point me to a use case within the UFSWM of the PW_CSP instance?

For clarification, the above detection was needed to determine which CSP the user is on. The DNS strings should be unique. This allows the GW (and build system) to determine which host they are on and thus make the downstream decisions accordingly.

For example, AWS PW use mpiexec as it's parallel executable launcher. However, Azure or Google could use another. This may ultimately wind up being redundant but since I haven't see the PW_CSP use case I felt it unsafe to make such an assumption.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/detect_machine.sh
I would recommend a PW_CSP as a separate configurable if you wish to identify a CSP available on NOAACloud.

Comment on lines 50 to 59
if [[ "${CLOUD_MACHINE_ID}" == "aws.intel" ]]; then
module use /contrib/spack-stack/envs/ufswm/install/modulefiles/Core
module load stack-intel
module load stack-intel-oneapi-mpi
module load ufs-weather-model-env/1.0.0
# TODO: It is still uncertain why this is the only module that is
# missing; check the spack build as this needed to be added manually.
module load w3emc/2.9.2 # TODO: This has similar issues for the EPIC stack.
module list
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary as ufs_noaacloud.intel.lua contains the information. Can we not use that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path that the module file points to, an EPIC stack, does not exist. As a result, the build environment cannot be defined using the module that you noted above.

[Henry.Winterbottom@hwufscpldcld-2 modulefiles]$ ls /contrib/EPIC/spack-stack/spack-stack-1.3.0/
ls: cannot access /contrib/EPIC/spack-stack/spack-stack-1.3.0/: No such file or directory

Note that this comes from the develop branch.

I suggest that we include this bit for now. Once EPIC provides a working stack we can create a new PR and make the update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I did not know that the EPIC stack is not available for use by us. This certainly makes porting an issue.
I will discuss with @arunchawla-NOAA on how to enable that.

ush/detect_machine.sh Fixed Show fixed Hide fixed
sorc/build_ufs.sh Fixed Show fixed Hide fixed
@aerorahul
Copy link
Contributor

@SadeghTabas-NOAA Would you please review the changes in the build and run script requirements of the ufs-weather-model?

@SadeghTabas-NOAA
Copy link

@aerorahul Sure, I'll do it today. Thanks

@WalterKolczynski-NOAA WalterKolczynski-NOAA changed the title Feature/cloud aws fcstonly Add AWS functionality to build workflow and run forecast job Jun 13, 2023
@WalterKolczynski-NOAA WalterKolczynski-NOAA changed the title Add AWS functionality to build workflow and run forecast job Add AWS functionality to workflow setup and forecast job Jun 13, 2023
Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only went through the shell scripts here.
Take a look if you agree with these suggestions.

jobs/rocoto/fcst.sh Outdated Show resolved Hide resolved
jobs/rocoto/fcst.sh Outdated Show resolved Hide resolved
sorc/build_ufs.sh Outdated Show resolved Hide resolved
sorc/build_ufs.sh Outdated Show resolved Hide resolved
sorc/build_ufs.sh Outdated Show resolved Hide resolved
sorc/build_ufs.sh Outdated Show resolved Hide resolved
sorc/build_ufs.sh Outdated Show resolved Hide resolved
sorc/build_ufs.sh Outdated Show resolved Hide resolved
ush/detect_machine.sh Outdated Show resolved Hide resolved
HenryRWinterbottom and others added 2 commits June 16, 2023 09:01
Co-authored-by: Rahul Mahajan <[email protected]>
Co-authored-by: Rahul Mahajan <[email protected]>
Copy link
Contributor

@WalterKolczynski-NOAA WalterKolczynski-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No additional requests beyond what Rahul has pointed out. I would have rather kept $MACHINE as the specific cloud platform, but understand the decision to follow what UFS has done.

HenryRWinterbottom and others added 6 commits June 16, 2023 12:12
Co-authored-by: Rahul Mahajan <[email protected]>
Co-authored-by: Rahul Mahajan <[email protected]>
Co-authored-by: Rahul Mahajan <[email protected]>
Co-authored-by: Rahul Mahajan <[email protected]>
Co-authored-by: Rahul Mahajan <[email protected]>
Co-authored-by: Rahul Mahajan <[email protected]>
jobs/rocoto/fcst.sh Outdated Show resolved Hide resolved
workflow/hosts.py Outdated Show resolved Hide resolved
Co-authored-by: Rahul Mahajan <[email protected]>
Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good.
You will have to show us how to run this beast on the cloud.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants