Set rrtmgp default backend back to kokkos #3030

Merged: 13 commits merged on Nov 15, 2024

Conversation

jgfouca
Member

jgfouca commented Oct 4, 2024

Non-determinism and poor performance have been resolved.

@E3SM-Bot
Collaborator

E3SM-Bot commented Oct 4, 2024

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6114
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA 88d8ecc
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5880
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA 88d8ecc
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: jgfouca/rrtmgp_interface_back_to_kokkos
  • SHA: 88d8ecc
  • Mode: TEST_REPO

Pull Request Author: jgfouca

@E3SM-Bot
Collaborator

E3SM-Bot commented Oct 4, 2024

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6114
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA 88d8ecc
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5880
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA 88d8ecc
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM

SCREAM_PullRequest_Autotester_Weaver # 6114 FAILED (last 100 lines of console output)

/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/field_at_level.cpp:119:471: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  119 |         Kokkos::parallel_for(m_diagnostic_output.name(),policy,KOKKOS_LAMBDA(const int idx) {
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/virtual_temperature.cpp: In member function 'virtual void scream::VirtualTemperatureDiagnostic::set_grids(std::shared_ptr)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/virtual_temperature.cpp:23:88: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
   23 |   FieldLayout scalar3d_layout_mid { {COL,LEV}, {m_num_cols,m_num_levs} };
      |                                                                                        ^         
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp: In constructor 'Kokkos::Impl::ParallelFor, Kokkos::Cuda>::ParallelFor(const FunctorType&, const Policy&) [with FunctorType = __nv_hdl_wrapper_t, void(const Kokkos::Impl::CudaTeamMember&), const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const int, const double, const Kokkos::View, Kokkos::MemoryTraits<0> > >; Properties = {Kokkos::Cuda}]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:533:1: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  533 |   ParallelFor(const FunctorType& arg_functor, const Policy& arg_policy)
      | ^ ~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/sea_level_pressure.cpp: In static member function 'static void* __nv_hdl_wrapper_t::manager::do_copy(void*) [with Lambda = scream::SeaLevelPressureDiagnostic::compute_diagnostic_impl()::; Tag = __nv_dl_tag; OpFuncR = void; OpFuncArgs = {const Kokkos::Impl::CudaTeamMember&}; F1 = const Kokkos::View, Kokkos::MemoryTraits<0> >; F2 = const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F3 = const int; F4 = const int; F5 = const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F6 = const Kokkos::View, Kokkos::MemoryTraits<0> >]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/sea_level_pressure.cpp:57:569: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
   57 |   Kokkos::parallel_for("SeaLevelPressureDiagnostic",
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ^
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/vertical_layer.cpp: In static member function 'static void* __nv_hdl_wrapper_t::manager::do_copy(void*) [with Lambda = scream::VerticalLayerDiagnostic::do_compute_diagnostic_impl<1>()::; Tag = __nv_dl_tag; OpFuncR = void; OpFuncArgs = {const Kokkos::Impl::CudaTeamMember&}; F1 = Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F2 = const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F3 = const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F4 = const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F5 = const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F6 = const bool; F7 = Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F8 = const bool; F9 = const double; F10 = const int; F11 = const Kokkos::View, Kokkos::MemoryTraits<0> >; F12 = const bool; F13 = const bool]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/vertical_layer.cpp:190:898: note: the layout of aggregates containing vectors with 2-byte alignment has changed in GCC 5
  190 |   auto lambda = KOKKOS_LAMBDA(const MemberType& team) {
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/precip_surf_mass_flux.cpp: In member function 'virtual void scream::PrecipSurfMassFlux::compute_diagnostic_impl()':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/precip_surf_mass_flux.cpp:99:52: warning: 'dt' may be used uninitialized in this function [-Wmaybe-uninitialized]
   99 |   auto rhodt = PC::RHO_H2O*dt;
      |              ~~~~~~~~~~~~~~~~~                     ^   
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/exner.cpp: In member function 'virtual void scream::ExnerDiagnostic::set_grids(std::shared_ptr)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/exner.cpp:28:88: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
   28 |   FieldLayout scalar3d_layout_mid { {COL,LEV}, {m_num_cols,m_num_levs} };
      |                                                                                        ^         
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp: In constructor 'Kokkos::Impl::ParallelFor, Kokkos::Cuda>::ParallelFor(const FunctorType&, const Policy&) [with FunctorType = __nv_hdl_wrapper_t, void(const Kokkos::Impl::CudaTeamMember&), const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const int, const double, const Kokkos::View, Kokkos::MemoryTraits<0> > >; Properties = {Kokkos::Cuda}]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:533:1: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  533 |   ParallelFor(const FunctorType& arg_functor, const Policy& arg_policy)
      | ^ ~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/atm_density.cpp: In member function 'virtual void scream::AtmDensityDiagnostic::set_grids(std::shared_ptr)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/atm_density.cpp:26:88: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
   26 |   FieldLayout scalar3d_layout_mid { {COL,LEV}, {m_num_cols,m_num_levs} };
      |                                                                                        ^         
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/relative_humidity.cpp: In static member function 'static void* __nv_hdl_wrapper_t::manager::do_copy(void*) [with Lambda = scream::RelativeHumidityDiagnostic::compute_diagnostic_impl()::; Tag = __nv_dl_tag; OpFuncR = void; OpFuncArgs = {const int&}; F1 = const int; F2 = int; F3 = Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F4 = Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F5 = Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F6 = Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F7 = const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >; F8 = Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/relative_humidity.cpp:60:703: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
   60 |   Kokkos::parallel_for("RelativeHumidityDiagnostic",
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ^
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp: In constructor 'Kokkos::Impl::ParallelFor, Kokkos::Cuda>::ParallelFor(const FunctorType&, const Policy&) [with FunctorType = __nv_hdl_wrapper_t, void(const Kokkos::Impl::CudaTeamMember&), const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, Kokkos::View, Kokkos::MemoryTraits<0> >, double, double, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, Kokkos::View, Kokkos::MemoryTraits<1> >, bool, const int, Kokkos::View, Kokkos::MemoryTraits<1> > >; Properties = {Kokkos::Cuda}]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:533:1: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  533 |   ParallelFor(const FunctorType& arg_functor, const Policy& arg_policy)
      | ^ ~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp: In constructor 'Kokkos::Impl::ParallelFor, Kokkos::Cuda>::ParallelFor(const FunctorType&, const Policy&) [with FunctorType = __nv_hdl_wrapper_t, void(const Kokkos::Impl::CudaTeamMember&), const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> > >; Properties = {Kokkos::Cuda}]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:533:1: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  533 |   ParallelFor(const FunctorType& arg_functor, const Policy& arg_policy)
      | ^ ~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp: In constructor 'Kokkos::Impl::ParallelFor, Kokkos::Cuda>::ParallelFor(const FunctorType&, const Policy&) [with FunctorType = __nv_hdl_wrapper_t, void(const Kokkos::Impl::CudaTeamMember&), Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >, Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >, const int, const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >, const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >, const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >, const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >, const int, const double, const Kokkos::View**, Kokkos::LayoutRight, Kokkos::Device, Kokkos::MemoryTraits<0> >, const Kokkos::View, Kokkos::MemoryTraits<0> > >; Properties = {Kokkos::Cuda}]':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:533:1: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  533 |   ParallelFor(const FunctorType& arg_functor, const Policy& arg_policy)
      | ^ ~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/potential_temperature.cpp: In member function 'virtual void scream::PotentialTemperatureDiagnostic::set_grids(std::shared_ptr)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/diagnostics/potential_temperature.cpp:43:88: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
   43 |   FieldLayout scalar3d_layout_mid { {COL,LEV}, {m_num_cols,m_num_levs} };
      |                                                                                        ^         
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/share/field/field_alloc_prop.hpp: In copy constructor 'scream::FieldAllocProp::FieldAllocProp(const scream::FieldAllocProp&)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/share/field/field_alloc_prop.hpp:106:1: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  106 |   FieldAllocProp (const FieldAllocProp&) = default;
      | ^ ~~~~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/share/field/field_alloc_prop.hpp: In copy constructor 'scream::FieldAllocProp::FieldAllocProp(const scream::FieldAllocProp&)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/share/field/field_alloc_prop.hpp:106:1: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  106 |   FieldAllocProp (const FieldAllocProp&) = default;
      | ^ ~~~~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/share/io/scorpio_output.cpp: In member function 'std::vector scream::AtmosphereOutput::get_var_dof_offsets(const scream::FieldLayout&)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/share/io/scorpio_output.cpp:1131:20: warning: 'min_gid' may be used uninitialized in this function [-Wmaybe-uninitialized]
 1131 |       auto offset = (gid-min_gid)*col_size;
      |               ~~~~~^~~~~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx/src/share/io/scorpio_output.cpp:1088:24: note: 'min_gid' was declared here
 1088 |   AbstractGrid::gid_type min_gid;
      |                        ^~~~~~~
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/KokkosExp_MDRangePolicy.hpp: In copy constructor 'Kokkos::MDRangePolicy, Kokkos::IndexType >::MDRangePolicy(const Kokkos::MDRangePolicy, Kokkos::IndexType >&)':
/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/externals/ekat/extern/kokkos/core/src/KokkosExp_MDRangePolicy.hpp:153:8: note: the layout of aggregates containing vectors with 8-byte alignment has changed in GCC 5
  153 | struct MDRangePolicy : public Kokkos::Impl::PolicyTraits {
      |        ^~~~~~~~~~~~~
[ 56%] Linking CXX static library libdiagnostics.a
[ 56%] Built target diagnostics
[ 56%] Linking CXX static library libscream_io.a
[ 56%] Built target scream_io
gmake: *** [Makefile:166: all] Error 2

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6114/scream/components/eamxx
weaver failed
######################################################
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins8961515551201238493.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: [email protected]
Finished: FAILURE
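
Most of the console excerpt above is GCC `note:` output about aggregate layout, which makes the few real diagnostics hard to spot. A minimal Python sketch for pulling only warnings and errors out of a pasted log follows. The function name `actionable_diagnostics` and the regular expression are illustrative assumptions, not part of the SCREAM test scripts.

```python
import re

# GCC-style diagnostic: "<path>:<line>:<col>: <kind>: <message>".
DIAG_RE = re.compile(
    r"^(?P<path>[^\s:]+):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<kind>warning|error|note): (?P<msg>.*)$"
)


def actionable_diagnostics(log_text):
    """Return (kind, path, line, message) tuples for warnings and errors only."""
    found = []
    for raw in log_text.splitlines():
        m = DIAG_RE.match(raw.strip())
        if m is None:
            continue  # source echo, caret lines, make output, etc.
        if m["kind"] == "note":
            continue  # skips the ABI layout notes and "declared here" follow-ups
        found.append((m["kind"], m["path"], int(m["line"]), m["msg"]))
    return found
```

Applied to the Weaver excerpt, this should leave only the two `-Wmaybe-uninitialized` warnings and no `error:` line, which suggests the compile step that actually failed appears earlier than the last 100 lines shown here.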

SCREAM_PullRequest_Autotester_Mappy # 5880 FAILED (last 100 lines of console output)

      |                                                1
Warning: Unused variable 'j' declared at (1) [-Wunused-variable]
/home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/homme/src/share/compose_test_mod.F90:184:8:

184 | use dimensions_mod, only: ne, np, nlev, qsize, qsize_d, nelemd
| 1
Warning: Unused module variable 'ne' which has been explicitly imported at (1) [-Wunused-variable]
/home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/homme/src/share/compose_test_mod.F90:182:8:

182 | use thread_mod, only: hthreads, vthreads, omp_set_num_threads, omp_get_thread_num
| 1
Warning: Unused module variable 'vthreads' which has been explicitly imported at (1) [-Wunused-variable]
/home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/homme/src/share/compose_test_mod.F90:86:8:

86 | use thread_mod, only: hthreads, vthreads, omp_set_num_threads, omp_get_thread_num
| 1
Warning: Unused module variable 'vthreads' which has been explicitly imported at (1) [-Wunused-variable]
[ 53%] Linking Fortran static library libtheta-l_kokkos_4_72_41.a
[ 53%] Built target theta-l_kokkos_4_72_41
/home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/eamxx/src/diagnostics/precip_surf_mass_flux.cpp: In member function 'virtual void scream::PrecipSurfMassFlux::compute_diagnostic_impl()':
/home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/eamxx/src/diagnostics/precip_surf_mass_flux.cpp:99:27: warning: 'dt' may be used uninitialized in this function [-Wmaybe-uninitialized]
99 | auto rhodt = PC::RHO_H2O*dt;
| ~~~~~~~~~~~^~~
/home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/eamxx/src/share/io/scorpio_output.cpp: In member function 'std::vector scream::AtmosphereOutput::get_var_dof_offsets(const scream::FieldLayout&)':
/home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/eamxx/src/share/io/scorpio_output.cpp:1131:25: warning: 'min_gid' may be used uninitialized in this function [-Wmaybe-uninitialized]
1131 | auto offset = (gid-min_gid)*col_size;
| ~~~~^~~~~~~~~
[ 53%] Linking CXX static library libscream_io.a
[ 53%] Built target scream_io
[ 53%] Linking CXX static library libdiagnostics.a
[ 53%] Built target diagnostics
gmake: *** [Makefile:166: all] Error 2

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on mappy with cmd: cd /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/eamxx && source /projects/sems/modulefiles/utils/sems-modules-init.sh && module purge && module load sems-cmake/3.27.9 sems-git/2.42.0 sems-gcc/11.4.0 sems-openmpi-no-cuda/4.1.6 sems-netcdf-c/4.9.2 sems-netcdf-cxx/4.2 sems-netcdf-fortran/4.6.1 sems-parallel-netcdf/1.12.3 sems-openblas && export GATOR_INITIAL_MB=4000MB && export OMP_PROC_BIND=spread && true && ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m mappy
RUN: cd /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/eamxx && source /projects/sems/modulefiles/utils/sems-modules-init.sh && module purge && module load sems-cmake/3.27.9 sems-git/2.42.0 sems-gcc/11.4.0 sems-openmpi-no-cuda/4.1.6 sems-netcdf-c/4.9.2 sems-netcdf-cxx/4.2 sems-netcdf-fortran/4.6.1 sems-parallel-netcdf/1.12.3 sems-openblas && export GATOR_INITIAL_MB=4000MB && export OMP_PROC_BIND=spread && true && ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m mappy
FROM: /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5880/scream/components/eamxx
mappy failed
SCREAM V1 TESTING FAILED!
Waiting for tests to finish
FAIL ERP_D_Lh4.ne4_ne4.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERP_D_Lh4.ne4_ne4.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1.C.20241004_142817_o8ngnx
FAIL ERP_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-4 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERP_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-4.C.20241004_142817_o8ngnx
FAIL ERS_D_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-rad_frequency_2--scream-output-preset-5 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_D_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-rad_frequency_2--scream-output-preset-5.C.20241004_142817_o8ngnx
FAIL ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels--scream-output-preset-5 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels--scream-output-preset-5.C.20241004_142817_o8ngnx
FAIL ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_p3--scream-output-preset-5 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_p3--scream-output-preset-5.C.20241004_142817_o8ngnx
FAIL ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_shoc--scream-output-preset-5 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_shoc--scream-output-preset-5.C.20241004_142817_o8ngnx
FAIL ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.mappy_gnu.scream-output-preset-2 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.mappy_gnu.scream-output-preset-2.C.20241004_142817_o8ngnx
FAIL ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-arm97 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-arm97.C.20241004_142817_o8ngnx
FAIL ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-comble (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-comble.C.20241004_142817_o8ngnx
FAIL ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-dycomsrf01 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-dycomsrf01.C.20241004_142817_o8ngnx
FAIL ERS_P16_Ln22.ne30pg2_ne30pg2.FRCE-SCREAMv1-DP.mappy_gnu (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FRCE-SCREAMv1-DP.mappy_gnu.C.20241004_142817_o8ngnx
FAIL PET_Ln9_P32x2.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/PET_Ln9_P32x2.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1.C.20241004_142817_o8ngnx
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-aci (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-aci.C.20241004_142817_o8ngnx
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-drydep (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-drydep.C.20241004_142817_o8ngnx
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-optics (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-optics.C.20241004_142817_o8ngnx
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-wetscav (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-wetscav.C.20241004_142817_o8ngnx
FAIL SMS_D_Ln9.ne4_ne4.F2010-SCREAMv1-noAero.mappy_gnu.scream-output-preset-3 (phase MODEL_BUILD)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln9.ne4_ne4.F2010-SCREAMv1-noAero.mappy_gnu.scream-output-preset-3.C.20241004_142817_o8ngnx
test-scheduler took 666.6103744506836 seconds
######################################################
Build step 'Execute shell' marked build as failure
$ ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 2761283 killed;
[ssh-agent] Stopped.
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh

We're having issues with some test-launcher jobs hanging forever, so let's make sure we clean up all pending test-launcher jobs

squeue -o"%.7i %u %40j" | grep e3sm-jenkins | grep test-launcher | awk '{ print $1 }' | xargs -r scancel

[SCREAM_PullRequest_Autotester_Mappy] $ /bin/bash -le /tmp/jenkins16234286672086420951.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: [email protected]
Finished: FAILURE
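
All 17 `FAIL` lines in the Mappy output above report the same phase, `MODEL_BUILD`, which points at a shared build problem rather than individual test regressions. A rough Python sketch for grouping such lines by phase is below. The helper names `failures_by_phase` and `leading_test_type` are hypothetical, and the pattern assumes the `FAIL <name> (phase <PHASE>)` layout seen in this log.

```python
import re
from collections import defaultdict

# "FAIL <test name> (phase <PHASE>)" as printed in the log above.
FAIL_RE = re.compile(r"^FAIL\s+(?P<name>\S+)\s+\(phase\s+(?P<phase>\w+)\)\s*$")


def failures_by_phase(log_text):
    """Map each failing phase to the list of test names that failed in it."""
    grouped = defaultdict(list)
    for raw in log_text.splitlines():
        m = FAIL_RE.match(raw.strip())
        if m:
            grouped[m["phase"]].append(m["name"])
    return dict(grouped)


def leading_test_type(test_name):
    """Return the first token of a test name, e.g. 'ERS' from 'ERS_Ln22.ne4pg2...'."""
    return re.split(r"[_.]", test_name, maxsplit=1)[0]
```

If the resulting dictionary has a single key, as it would for this run, checking one case directory's build log is probably enough to find the common cause.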

@ndkeen
Contributor

ndkeen commented Oct 4, 2024

I'm unable to build with this branch.

/mscratch/sd/n/ndk/repos/jgfouca_rrtmgp_interface_back_to_kokkos/components/eamxx/../eam/src/physics/rrtmgp/external/cpp/rrtmgp/mo_gas_optics_rrtmgp.h(1264): error: namespace "Kokkos::Experimental" has no member "OffsetView"
    using oview_t = Kokkos::Experimental::OffsetView<T, LayoutT, DeviceT>;
                                          ^

/mscratch/sd/n/ndk/repos/jgfouca_rrtmgp_interface_back_to_kokkos/components/eamxx/../eam/src/physics/rrtmgp/external/cpp/rrtmgp/mo_gas_optics_rrtmgp.h(1264): error: expected a ";"
    using oview_t = Kokkos::Experimental::OffsetView<T, LayoutT, DeviceT>;

bartgol previously approved these changes Oct 7, 2024
@E3SM-Bot
Collaborator

E3SM-Bot commented Oct 7, 2024

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6119
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA d3d549a
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5885
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA d3d549a
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: jgfouca/rrtmgp_interface_back_to_kokkos
  • SHA: d3d549a
  • Mode: TEST_REPO

Pull Request Author: jgfouca

@E3SM-Bot
Collaborator

E3SM-Bot commented Oct 7, 2024

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6119
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA d3d549a
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5885
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA d3d549a
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 9b1b4c7
TEST_REPO_ALIAS SCREAM
SCREAM_PullRequest_Autotester_Weaver # 6119 FAILED (click to see last 100 lines of console output)

The following tests FAILED:
40 - rrtmgp_unit_tests (Failed)
149 - homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp (Failed)
CMake Error at /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/cmake/ctest_script.cmake:76 (message):
Test had fails

===============================================================================
Testing '''d3d549a14cea98addd0fb9079f356b7a38aaff32''' for test '''full_sp_debug'''

RUN: taskset -c 52-103 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/full_sp_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/full_sp_debug -DBUILD_NAME_MOD=full_sp_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DSCREAM_DOUBLE_PRECISION=False -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_sp_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/full_sp_debug

Testing '''d3d549a14cea98addd0fb9079f356b7a38aaff32''' for test '''release'''

RUN: taskset -c 104-155 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/release/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/release -DBUILD_NAME_MOD=release -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Release -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/release" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/release

Testing '''d3d549a14cea98addd0fb9079f356b7a38aaff32''' for test '''full_debug'''

RUN: taskset -c 0-51 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/full_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/full_debug -DBUILD_NAME_MOD=full_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=True -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx/ctest-build/full_debug
Build type full_debug failed at testing time. Here'''s a list of failed tests:
40:rrtmgp_unit_tests
149:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type release failed at testing time. Here'''s a list of failed tests:
40:rrtmgp_unit_tests
148:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx
weaver failed'

  • [[ 1 == 0 ]]
  • [[ weaver == \m\a\p\p\y ]]
  • set +x
    ######################################################
    FAILS DETECTED:
    SCREAM STANDALONE TESTING FAILED!
    Build type full_debug failed at testing time. Here's a list of failed tests:
    40:rrtmgp_unit_tests
    149:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type release failed at testing time. Here's a list of failed tests:
40:rrtmgp_unit_tests
148:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6119/scream/components/eamxx
weaver failed
######################################################
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins15007129404760870269.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: [email protected]
Finished: FAILURE

SCREAM_PullRequest_Autotester_Mappy # 5885 FAILED (click to see last 100 lines of console output)

518:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type debug_nopack_fpe failed at testing time. Here's a list of failed tests:
173:rrtmgp_unit_tests
376:shoc_cld_p3_rrtmgp_np2_vs_np1
377:shoc_cld_p3_rrtmgp_np3_vs_np1
378:shoc_cld_p3_rrtmgp_np4_vs_np1
418:shoc_cldfrac_mam4_aci_p3_rrtmgp_np2_vs_np1
419:shoc_cldfrac_mam4_aci_p3_rrtmgp_np3_vs_np1
420:shoc_cldfrac_mam4_aci_p3_rrtmgp_np4_vs_np1
425:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np2_vs_np1
426:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np3_vs_np1
427:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np4_vs_np1
487:homme_shoc_cld_spa_p3_rrtmgp_128levels_np2_vs_np1
488:homme_shoc_cld_spa_p3_rrtmgp_128levels_np3_vs_np1
489:homme_shoc_cld_spa_p3_rrtmgp_128levels_np4_vs_np1

Build type release failed at testing time. Here's a list of failed tests:
178:rrtmgp_unit_tests
396:shoc_cld_p3_rrtmgp_np2_vs_np1
397:shoc_cld_p3_rrtmgp_np3_vs_np1
398:shoc_cld_p3_rrtmgp_np4_vs_np1
438:shoc_cldfrac_mam4_aci_p3_rrtmgp_np2_vs_np1
439:shoc_cldfrac_mam4_aci_p3_rrtmgp_np3_vs_np1
440:shoc_cldfrac_mam4_aci_p3_rrtmgp_np4_vs_np1
445:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np2_vs_np1
446:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np3_vs_np1
447:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np4_vs_np1
510:homme_shoc_cld_spa_p3_rrtmgp_128levels_np2_vs_np1
511:homme_shoc_cld_spa_p3_rrtmgp_128levels_np3_vs_np1
512:homme_shoc_cld_spa_p3_rrtmgp_128levels_np4_vs_np1
517:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on mappy with cmd: cd /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5885/scream/components/eamxx && source /projects/sems/modulefiles/utils/sems-modules-init.sh && module purge && module load sems-cmake/3.27.9 sems-git/2.42.0 sems-gcc/11.4.0 sems-openmpi-no-cuda/4.1.6 sems-netcdf-c/4.9.2 sems-netcdf-cxx/4.2 sems-netcdf-fortran/4.6.1 sems-parallel-netcdf/1.12.3 sems-openblas && export GATOR_INITIAL_MB=4000MB && export OMP_PROC_BIND=spread && true && ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m mappy
RUN: cd /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5885/scream/components/eamxx && source /projects/sems/modulefiles/utils/sems-modules-init.sh && module purge && module load sems-cmake/3.27.9 sems-git/2.42.0 sems-gcc/11.4.0 sems-openmpi-no-cuda/4.1.6 sems-netcdf-c/4.9.2 sems-netcdf-cxx/4.2 sems-netcdf-fortran/4.6.1 sems-parallel-netcdf/1.12.3 sems-openblas && export GATOR_INITIAL_MB=4000MB && export OMP_PROC_BIND=spread && true && ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m mappy
FROM: /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5885/scream/components/eamxx
mappy failed
SCREAM V1 TESTING FAILED!
Waiting for tests to finish
FAIL ERP_D_Lh4.ne4_ne4.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERP_D_Lh4.ne4_ne4.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1.C.20241007_130508_zd0t86
FAIL ERP_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-4 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERP_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-4.C.20241007_130508_zd0t86
FAIL ERS_D_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-rad_frequency_2--scream-output-preset-5 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_D_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-rad_frequency_2--scream-output-preset-5.C.20241007_130508_zd0t86
FAIL ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels--scream-output-preset-5 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels--scream-output-preset-5.C.20241007_130508_zd0t86
FAIL ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_p3--scream-output-preset-5 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_p3--scream-output-preset-5.C.20241007_130508_zd0t86
FAIL ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_shoc--scream-output-preset-5 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_shoc--scream-output-preset-5.C.20241007_130508_zd0t86
FAIL ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.mappy_gnu.scream-output-preset-2 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.mappy_gnu.scream-output-preset-2.C.20241007_130508_zd0t86
DIFF ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-arm97 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-arm97.C.20241007_130508_zd0t86
FAIL ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-comble (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-comble.C.20241007_130508_zd0t86
DIFF ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-dycomsrf01 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-dycomsrf01.C.20241007_130508_zd0t86
DIFF ERS_P16_Ln22.ne30pg2_ne30pg2.FRCE-SCREAMv1-DP.mappy_gnu (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FRCE-SCREAMv1-DP.mappy_gnu.C.20241007_130508_zd0t86
FAIL PET_Ln9_P32x2.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/PET_Ln9_P32x2.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1.C.20241007_130508_zd0t86
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-aci (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-aci.C.20241007_130508_zd0t86
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-drydep (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-drydep.C.20241007_130508_zd0t86
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-optics (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-optics.C.20241007_130508_zd0t86
FAIL SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-wetscav (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-wetscav.C.20241007_130508_zd0t86
FAIL SMS_D_Ln9.ne4_ne4.F2010-SCREAMv1-noAero.mappy_gnu.scream-output-preset-3 (phase RUN)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln9.ne4_ne4.F2010-SCREAMv1-noAero.mappy_gnu.scream-output-preset-3.C.20241007_130508_zd0t86
test-scheduler took 3681.3000757694244 seconds
######################################################
Build step 'Execute shell' marked build as failure
$ ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 3364020 killed;
[ssh-agent] Stopped.
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh

We're having issues with some test-launcher job hanging forever. So let's make sure we clean all pending test-launcher jobs

squeue -o"%.7i %u %40j" | grep e3sm-jenkins | grep test-launcher | awk '{ print $1 }' | xargs -r scancel

[SCREAM_PullRequest_Autotester_Mappy] $ /bin/bash -le /tmp/jenkins8083232712308687000.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: [email protected]
Finished: FAILURE

@AaronDonahue
Contributor

@jgfouca , it looks like all the AT fails are unit tests w/ rrtmgp in them. Are these expected fails? Can we merge?
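
One way to check that observation against the console output is to group the failed ctest names by build type and list any that do not mention rrtmgp. The following is a minimal sketch, not part of the SCREAM scripts: the function names `failed_tests_by_build` and `non_rrtmgp` and the `unknown` bucket are hypothetical, and the sample text is a shortened copy of the Mappy summary. It only reads the "Build type ... failed at testing time" summaries, so the CIME case results (the `FAIL ... (phase RUN)` lines) are outside its scope.

```python
import re
from collections import defaultdict

# Header of a per-build summary, e.g.
# "Build type release failed at testing time. Here's a list of failed tests:"
BUILD_RE = re.compile(r"Build type (\S+) failed at testing time")
# One failed test entry, e.g. "178:rrtmgp_unit_tests"
TEST_RE = re.compile(r"^\s*(\d+):(\S+)\s*$")


def failed_tests_by_build(log_text):
    """Return {build_type: set of failed test names} from a summary log.

    Entries that appear before any header (the log is truncated to its
    last 100 lines) are collected under the hypothetical key "unknown".
    """
    failures = defaultdict(set)
    current = "unknown"
    for line in log_text.splitlines():
        header = BUILD_RE.search(line)
        if header:
            current = header.group(1)
            continue
        entry = TEST_RE.match(line)
        if entry:
            failures[current].add(entry.group(2))
        else:
            # Any other line (including a blank one) ends the current list.
            current = "unknown"
    return dict(failures)


def non_rrtmgp(failures):
    """Return the sorted failed test names that do not contain 'rrtmgp'."""
    names = set().union(*failures.values()) if failures else set()
    return sorted(name for name in names if "rrtmgp" not in name)


sample = """\
518:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type debug_nopack_fpe failed at testing time. Here's a list of failed tests:
173:rrtmgp_unit_tests
376:shoc_cld_p3_rrtmgp_np2_vs_np1

Build type release failed at testing time. Here's a list of failed tests:
178:rrtmgp_unit_tests
517:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
"""

if __name__ == "__main__":
    grouped = failed_tests_by_build(sample)
    for build, names in sorted(grouped.items()):
        print(build, sorted(names))
    print("failures without rrtmgp in the name:", non_rrtmgp(grouped))
```

An empty list from `non_rrtmgp` would support the reading that every standalone failure involves rrtmgp; it says nothing about whether those failures are expected.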

@jgfouca
Member Author

jgfouca commented Oct 7, 2024

@AaronDonahue , not yet. I goofed pretty badly and did a lot of testing with Kokkos off (so yakl rrtmgp). I'm seeing loads of problems now that I corrected that.

@AaronDonahue
Contributor

okay cool, just checking as I do my audit of the open PRs

@bartgol bartgol added the AT: WIP label Oct 7, 2024
@jgfouca jgfouca force-pushed the jgfouca/rrtmgp_interface_back_to_kokkos branch from d3d549a to 1a701dd Compare October 23, 2024 17:39
…rface_back_to_kokkos

* origin/master: (75 commits)
  add EAMxx initial condition for ne4 with L128
  change from scream-docs to eamxx-scripts
  EAMxx: flush output file whenever we write a rhist file
  EAMxx: add support for daily storage type in IO
  EAMxx: always update num snaps in file, even for Yearly/Monthly storage
  clarify P3Runtime struct defaults
  This commit removes a docker folder in eamxx no longer in use.
  restore BFB in order to merge PR
  Remind devs to load scream env
  Add some Kokkos dev doc
  EAMxx: work on atm process developer documentation
  EAMxx: removed undefined method in AbstractGrid
  Update EAMxx docs and improve help formatters
  Rename set_params -> setup_internals
  Improve OutputManager function descriptions
  Separate restart output manager from output manager list
  Only call set_logger() once
  Fix run_t0 and case_t0 checks
  fix incorrect inputs
  Move set_provenence_data call to in front of create_output_managers
  ...
@jgfouca
Member Author

jgfouca commented Oct 23, 2024

OK, I think this is finally ready.

bartgol previously approved these changes Oct 23, 2024
@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.

@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6202
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA afbefb4
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5953
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA afbefb4
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: jgfouca/rrtmgp_interface_back_to_kokkos
  • SHA: afbefb4
  • Mode: TEST_REPO

Pull Request Author: jgfouca

@E3SM-Bot
Collaborator

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6202
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA afbefb4
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM

Build Information

Test Name: SCREAM_PullRequest_Autotester_Mappy

  • Build Num: 5953
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS AT: RETEST;AT: AUTOMERGE
PULLREQUESTNUM 3030
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA afbefb4
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 69c16bb
TEST_REPO_ALIAS SCREAM
SCREAM_PullRequest_Autotester_Weaver # 6202 FAILED (click to see last 100 lines of console output)

The following tests FAILED:
40 - rrtmgp_unit_tests (Failed)
149 - homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp (Failed)
CMake Error at /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/cmake/ctest_script.cmake:76 (message):
Test had fails

===============================================================================
Testing '''afbefb4625b72ef102505cf812c8bf4c110547ad''' for test '''full_sp_debug'''

RUN: taskset -c 52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/full_sp_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/full_sp_debug -DBUILD_NAME_MOD=full_sp_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DSCREAM_DOUBLE_PRECISION=False -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_sp_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/full_sp_debug

Testing '''afbefb4625b72ef102505cf812c8bf4c110547ad''' for test '''release'''

RUN: taskset -c 104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/release/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/release -DBUILD_NAME_MOD=release -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Release -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/release" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/release

Testing '''afbefb4625b72ef102505cf812c8bf4c110547ad''' for test '''full_debug'''

RUN: taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51 sh -c '''SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/full_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/full_debug -DBUILD_NAME_MOD=full_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=True -DEKAT_DISABLE_TPL_WARNINGS='''''''''ON''''''''' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_debug" '''
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx/ctest-build/full_debug
Build type full_debug failed at testing time. Here'''s a list of failed tests:
40:rrtmgp_unit_tests
149:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type release failed at testing time. Here'''s a list of failed tests:
40:rrtmgp_unit_tests
148:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx
weaver failed'

  • errors='Build type full_debug failed at testing time. Here'''s a list of failed tests:
    40:rrtmgp_unit_tests
    149:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type release failed at testing time. Here'''s a list of failed tests:
40:rrtmgp_unit_tests
148:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx
weaver failed'

  • SA_FAILURES_DETAILS+='Build type full_debug failed at testing time. Here'''s a list of failed tests:
    40:rrtmgp_unit_tests
    149:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type release failed at testing time. Here'''s a list of failed tests:
40:rrtmgp_unit_tests
148:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx
weaver failed'

  • [[ 1 == 0 ]]
  • [[ weaver == \m\a\p\p\y ]]
  • set +x
    ######################################################
    FAILS DETECTED:
    SCREAM STANDALONE TESTING FAILED!
    Build type full_debug failed at testing time. Here's a list of failed tests:
    40:rrtmgp_unit_tests
    149:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type release failed at testing time. Here's a list of failed tests:
40:rrtmgp_unit_tests
148:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6202/scream/components/eamxx
weaver failed
######################################################
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins7372766892143081798.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: [email protected]
Finished: FAILURE

SCREAM_PullRequest_Autotester_Mappy # 5953 FAILED (click to see last 100 lines of console output)

518:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Build type debug_nopack_fpe failed at testing time. Here's a list of failed tests:
173:rrtmgp_unit_tests
376:shoc_cld_p3_rrtmgp_np2_vs_np1
377:shoc_cld_p3_rrtmgp_np3_vs_np1
378:shoc_cld_p3_rrtmgp_np4_vs_np1
418:shoc_cldfrac_mam4_aci_p3_rrtmgp_np2_vs_np1
419:shoc_cldfrac_mam4_aci_p3_rrtmgp_np3_vs_np1
420:shoc_cldfrac_mam4_aci_p3_rrtmgp_np4_vs_np1
425:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np2_vs_np1
426:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np3_vs_np1
427:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np4_vs_np1
487:homme_shoc_cld_spa_p3_rrtmgp_128levels_np2_vs_np1
488:homme_shoc_cld_spa_p3_rrtmgp_128levels_np3_vs_np1
489:homme_shoc_cld_spa_p3_rrtmgp_128levels_np4_vs_np1

Build type release failed at testing time. Here's a list of failed tests:
178:rrtmgp_unit_tests
396:shoc_cld_p3_rrtmgp_np2_vs_np1
397:shoc_cld_p3_rrtmgp_np3_vs_np1
398:shoc_cld_p3_rrtmgp_np4_vs_np1
438:shoc_cldfrac_mam4_aci_p3_rrtmgp_np2_vs_np1
439:shoc_cldfrac_mam4_aci_p3_rrtmgp_np3_vs_np1
440:shoc_cldfrac_mam4_aci_p3_rrtmgp_np4_vs_np1
445:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np2_vs_np1
446:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np3_vs_np1
447:shoc_cldfrac_mam4_aci_p3_mam4_optics_rrtmgp_np4_vs_np1
510:homme_shoc_cld_spa_p3_rrtmgp_128levels_np2_vs_np1
511:homme_shoc_cld_spa_p3_rrtmgp_128levels_np3_vs_np1
512:homme_shoc_cld_spa_p3_rrtmgp_128levels_np4_vs_np1
517:homme_shoc_cld_spa_p3_rrtmgp_128levels_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on mappy with cmd: cd /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5953/scream/components/eamxx && source /projects/sems/modulefiles/utils/sems-modules-init.sh && module purge && module load sems-cmake/3.27.9 sems-git/2.42.0 sems-gcc/11.4.0 sems-openmpi-no-cuda/4.1.6 sems-netcdf-c/4.9.2 sems-netcdf-cxx/4.2 sems-netcdf-fortran/4.6.1 sems-parallel-netcdf/1.12.3 sems-openblas && export GATOR_INITIAL_MB=4000MB && export OMP_PROC_BIND=spread && true && ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m mappy
RUN: cd /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5953/scream/components/eamxx && source /projects/sems/modulefiles/utils/sems-modules-init.sh && module purge && module load sems-cmake/3.27.9 sems-git/2.42.0 sems-gcc/11.4.0 sems-openmpi-no-cuda/4.1.6 sems-netcdf-c/4.9.2 sems-netcdf-cxx/4.2 sems-netcdf-fortran/4.6.1 sems-parallel-netcdf/1.12.3 sems-openblas && export GATOR_INITIAL_MB=4000MB && export OMP_PROC_BIND=spread && true && ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m mappy
FROM: /home/e3sm-jenkins/jenkins-ws/workspace/SCREAM_PullRequest_Autotester_Mappy/5953/scream/components/eamxx
mappy failed
SCREAM V1 TESTING FAILED!
Waiting for tests to finish
FAIL ERP_D_Lh4.ne4_ne4.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1 (phase COMPARE_base_rest)
Case dir: /home/e3sm-jenkins/acme/scratch/ERP_D_Lh4.ne4_ne4.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1.C.20241023_155628_6cfk5s
FAIL ERP_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-4 (phase COMPARE_base_rest)
Case dir: /home/e3sm-jenkins/acme/scratch/ERP_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-4.C.20241023_155628_6cfk5s
DIFF ERS_D_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-rad_frequency_2--scream-output-preset-5 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_D_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-rad_frequency_2--scream-output-preset-5.C.20241023_155628_6cfk5s
DIFF ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels--scream-output-preset-5 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels--scream-output-preset-5.C.20241023_155628_6cfk5s
DIFF ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_p3--scream-output-preset-5 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_p3--scream-output-preset-5.C.20241023_155628_6cfk5s
DIFF ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_shoc--scream-output-preset-5 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-small_kernels_shoc--scream-output-preset-5.C.20241023_155628_6cfk5s
DIFF ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.mappy_gnu.scream-output-preset-2 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.mappy_gnu.scream-output-preset-2.C.20241023_155628_6cfk5s
DIFF ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-arm97 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-arm97.C.20241023_155628_6cfk5s
PASS ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-comble RUN
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-comble.C.20241023_155628_6cfk5s
DIFF ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-dycomsrf01 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.mappy_gnu.scream-dpxx-dycomsrf01.C.20241023_155628_6cfk5s
DIFF ERS_P16_Ln22.ne30pg2_ne30pg2.FRCE-SCREAMv1-DP.mappy_gnu (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/ERS_P16_Ln22.ne30pg2_ne30pg2.FRCE-SCREAMv1-DP.mappy_gnu.C.20241023_155628_6cfk5s
DIFF PET_Ln9_P32x2.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/PET_Ln9_P32x2.ne4pg2_ne4pg2.F2010-SCREAMv1.mappy_gnu.scream-output-preset-1.C.20241023_155628_6cfk5s
DIFF SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-aci (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-aci.C.20241023_155628_6cfk5s
DIFF SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-drydep (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-drydep.C.20241023_155628_6cfk5s
DIFF SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-optics (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-optics.C.20241023_155628_6cfk5s
DIFF SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-wetscav (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu.scream-mam4xx-wetscav.C.20241023_155628_6cfk5s
DIFF SMS_D_Ln9.ne4_ne4.F2010-SCREAMv1-noAero.mappy_gnu.scream-output-preset-3 (phase BASELINE)
Case dir: /home/e3sm-jenkins/acme/scratch/SMS_D_Ln9.ne4_ne4.F2010-SCREAMv1-noAero.mappy_gnu.scream-output-preset-3.C.20241023_155628_6cfk5s
test-scheduler took 1922.4293880462646 seconds
######################################################
Build step 'Execute shell' marked build as failure
$ ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 1444972 killed;
[ssh-agent] Stopped.
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh

We're having issues with some test-launcher jobs hanging forever. So let's make sure we clean all pending test-launcher jobs

squeue -o"%.7i %u %40j" | grep e3sm-jenkins | grep test-launcher | awk '{ print $1 }' | xargs -r scancel

[SCREAM_PullRequest_Autotester_Mappy] $ /bin/bash -le /tmp/jenkins4684708144307962649.sh
POST BUILD TASK : FAILURE
END OF POST BUILD TASK : 0
Sending e-mails to: [email protected]
Finished: FAILURE

@bartgol
Contributor

bartgol commented Oct 25, 2024

@jgfouca I set WIP so the AT stops testing (it retests if master changes) until you decide on merging.

Regarding the memory: how much more memory are we talking about? If it's not too much more, we may just merge.

@AaronDonahue
Contributor

@jgfouca, I'm happy to leave the merge decision up to you. I'm happy to wait while you investigate with Noel, or to merge it.

@jgfouca
Member Author

jgfouca commented Oct 25, 2024

@bartgol, I'm not sure exactly how much more memory, but it's enough that @ndkeen says ne120 no longer fits on 8 nodes on pm-gpu. He also reported some huge (3x) slowdowns in rrtmgp.

@@ -240,7 +240,7 @@ static void rrtmgp_initialize(
   load_cld_lutcoeff(cloud_optics_lw_k, cloud_optics_file_lw);

   // initialize kokkos rrtmgp pool allocator
-  const size_t base_ref = 18000;
+  const size_t base_ref = 4000;
Contributor

What's base_ref used for?

Member Author

This is a bit hacky. When I was doing the standalone rrtmgp work, I tried to come up with a simple function for the amount of pool memory needed. I found that comparing (ncol * nlay * nlev) to base_ref got me a reasonable estimate, but for some reason, way less is needed for full cases.

Contributor

This is just the "initial" guess for the pool, and then it can be expanded if needed, right?

Member Author

Not sure what you mean. We can change this number in the source code (which requires a rebuild to take effect). The pool cannot expand at runtime, so if we overflow it, the case will crash.
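
To make the "cannot expand at runtime" behavior concrete, here is a minimal sketch of a fixed-capacity pool in plain C++. It is not the rrtmgp pool allocator: the class name `FixedPool` and its methods are hypothetical, and it uses host memory rather than device memory. The point it illustrates is only that the capacity is chosen once, and a request that does not fit fails instead of growing the pool.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical fixed-capacity pool: sized once at construction, never grows.
class FixedPool {
public:
  explicit FixedPool(std::size_t capacity) : storage_(capacity), used_(0) {}

  // Hand out n doubles from the pool; fail loudly if the request does not fit.
  double* alloc(std::size_t n) {
    if (n > storage_.size() - used_) {
      throw std::runtime_error("FixedPool overflow: capacity too small");
    }
    double* p = storage_.data() + used_;
    used_ += n;
    return p;
  }

  // Release everything at once, e.g. at the end of a radiation call.
  void reset() { used_ = 0; }

  std::size_t used() const { return used_; }
  std::size_t capacity() const { return storage_.size(); }

private:
  std::vector<double> storage_;  // backing memory, allocated exactly once
  std::size_t used_;             // number of doubles currently handed out
};
```

With a pool like this, undersizing the capacity shows up as a hard failure on the first request that does not fit, which is why the sizing constant has to be validated against the largest case that will be run.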

Contributor

Ah, ok. I thought it was like YAKL, which whenever the pool fills, opens another pool.

Is there any back-of-the-envelope calculation to ensure this size is enough, then?

Member Author

@bartgol, I'm not sure YAKL works that way. There have been many times we've had to increase GATOR_INITIAL_MB.

I arrived at this number through experimentation on rrtmgp standalone.

bartgol previously approved these changes Oct 25, 2024
@ndkeen
Contributor

ndkeen commented Nov 7, 2024

In a quick profile with ne30 on pm-gpu using the new kokkos branch, I'm seeing a few CudaMalloc/CudaFree calls that most likely account for a great deal of the slowdown.
[Screenshot: 2024-11-07, 1:24:41 PM]

@jgfouca
Member Author

jgfouca commented Nov 7, 2024

@ndkeen , thanks! Is there a way to get a line number or kernel name to see where these mallocs are happening?

@ndkeen
Contributor

ndkeen commented Nov 8, 2024

No, the profiler uses sampling. In the past, CudaMallocs like these have come from Kokkos views.
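
As a rough illustration of why temporaries created inside a routine show up as repeated malloc/free pairs, here is a host-only C++ analogy. It does not use Kokkos or CUDA: `CountingAllocator`, `alloc_calls`, and the two `scaled_sum_*` functions are hypothetical names, and the counter merely stands in for what a profiler would report as allocation calls.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Count of allocation calls, standing in for allocation events in a profile.
inline std::size_t& alloc_calls() {
  static std::size_t n = 0;
  return n;
}

// Minimal allocator that counts every allocate() call.
template <class T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template <class U>
  CountingAllocator(const CountingAllocator<U>&) {}
  T* allocate(std::size_t n) {
    ++alloc_calls();
    return std::allocator<T>().allocate(n);
  }
  void deallocate(T* p, std::size_t n) { std::allocator<T>().deallocate(p, n); }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

using Buffer = std::vector<double, CountingAllocator<double>>;

// Pattern A: a fresh temporary per call, so every call allocates and frees.
double scaled_sum_fresh(const Buffer& in, double scale) {
  Buffer tmp(in.size());
  double total = 0.0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    tmp[i] = scale * in[i];
    total += tmp[i];
  }
  return total;
}

// Pattern B: the caller supplies preallocated scratch, so the routine never allocates.
double scaled_sum_reuse(const Buffer& in, double scale, Buffer& scratch) {
  if (scratch.size() < in.size()) {
    throw std::invalid_argument("scratch buffer too small");
  }
  double total = 0.0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    scratch[i] = scale * in[i];
    total += scratch[i];
  }
  return total;
}
```

Calling the first function in a loop adds at least one allocation per iteration to the counter, while the second adds none. Drawing temporaries from a preallocated pool follows the second pattern.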

@ndkeen
Contributor

ndkeen commented Nov 8, 2024

I added more start/stops and kokkos labels to get a little more detail. But the cuda malloc/frees are spread throughout.
[Screenshot: 2024-11-08, 8:57:36 AM]

@jgfouca
Member Author

jgfouca commented Nov 8, 2024

@ndkeen , that is so weird. Can you confirm that you are on the correct rrtmgp sha? If you look at the rrtmgp file mo_gas_optics_rrtmgp.h and the kokkos impl of function gas_optics (around line 2000), near the top of the function you should see the pool allocator being used. I also hope you are on the most recent sha of this branch.

Within the main rrtmgp impl function, RRTMGPRadiation::run_impl (in eamxx_rrtmgp_process_interface.cpp), there should not be any cuda mallocs or frees happening.

@ndkeen
Contributor

ndkeen commented Nov 8, 2024

I did a fresh checkout of jgfouca/rrtmgp_interface_back_to_kokkos branch.

@bartgol bartgol force-pushed the master branch 4 times, most recently from 6318f41 to db5b35d Compare November 13, 2024 21:48
…rface_back_to_kokkos

* origin/master: (194 commits)
  Fix threads arg
  Fixes for p3 tests CMake settings
  Mergify: fix merge proteciton rule
  Fixup small kernel situation
  Quick fix for baseline generation
  Mergify: remove redundant merge protection condition in automerge rule
  Mergify: set commit message for automerge
  Mergify: enable pull request automerge
  Update .mergify.yml
  This commit fixes a bug in defining an int layout for hyai and hybi
  EAMxx - Replace 'verti' with 'elevated' in tag names; add missing lines.
  EAMxx - Rename tag names: 'elevated' replaces 'verti'
  EAMxx - Rename variables: 'elevated_' prefix replaces 'vert_'.
  EAMxx - Rename enum VERT_EMISSION to ELEVATED_EMISSIONS.
  Update based on better ekat flag handling
  EMAxx - Update comment in enum TracerFileType to avoid confusion.
  Workflows: fix logic to execute/skip eamxx testing workflows
  EAMxx: Fixes a comment in the newly added test
  EAMxx: Fixes a comment in the microphysics testmod file
  EAMxx: Fix a comment in the shell script
  ...
Contributor

mergify bot commented Nov 14, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟠 Enforce checks passing

Waiting checks: gcc-cuda / ${{ matrix.build_type }}, gcc-cuda / dbg, gcc-cuda / opt, gcc-cuda / sp.

Make sure that checks are not failing on the PR, and reviewers approved

  • any of:
    • check-skipped={% raw %}gcc-cuda / ${{ matrix.build_type }}{% endraw %}
    • all of:
      • check-success="gcc-cuda / dbg"
      • check-success="gcc-cuda / opt"
      • check-success="gcc-cuda / sp"
  • #approved-reviews-by >= 1
  • #changes-requested-reviews-by == 0
  • any of:
    • all of:
      • check-success="gcc-openmp / dbg"
      • check-success="gcc-openmp / fpe"
      • check-success="gcc-openmp / opt"
      • check-success="gcc-openmp / sp"
    • check-skipped={% raw %}gcc-openmp / ${{ matrix.build_type }}{% endraw %}
  • any of:
    • all of:
      • check-success="cpu-gcc / ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.scream-small_kernels--scream-output-preset-5"
      • check-success="cpu-gcc / ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.scream-output-preset-2"
      • check-success="cpu-gcc / ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.scream-dpxx-arm97"
      • check-success="cpu-gcc / SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.scream-mam4xx-all_mam4xx_procs"
    • check-skipped={% raw %}cpu-gcc / ${{ matrix.test.short_name }}{% endraw %}
  • any of:
    • check-skipped=cpu-gcc
    • check-success=cpu-gcc

@jgfouca jgfouca force-pushed the jgfouca/rrtmgp_interface_back_to_kokkos branch from 7ceb74c to 6ce5b69 Compare November 14, 2024 21:07
@jgfouca jgfouca merged commit bc5e39b into master Nov 15, 2024
15 of 19 checks passed
@jgfouca jgfouca deleted the jgfouca/rrtmgp_interface_back_to_kokkos branch November 15, 2024 16:41
Labels
Non-B4B Not bit for bit

5 participants