
Build with rocm/5.6 on Frontier (~ginkgo) #89

Closed
nkoukpaizan wants to merge 6 commits from the nicholson/frontier-rocm5.6 branch

Conversation

nkoukpaizan (Collaborator)

Merge request type

  • New feature
  • Resolves bug
  • Documentation
  • Other

Relates to

  • OPFLOW
  • SOPFLOW
  • SCOPFLOW
  • TCOPFLOW
  • CMake build system
  • Spack configuration
  • Manual
  • Web docs
  • Other

This MR updates

  • Header files
  • Source code
  • CMake build system
  • Spack configuration
  • Web docs
  • Manual
  • Other

Summary

This MR updates the Spack configuration and the corresponding modules on Frontier to build with rocm/5.6.
A few notes:

nkoukpaizan (Collaborator, Author) commented on Nov 27, 2023:

Currently seeing some test failures I hadn't seen before. I wonder if it is due to the rocm/5.6 upgrade or something else. I will investigate.

	  2 - UNIT_TESTS_OPFLOW_case118.m (Failed)
	  3 - UNIT_TESTS_OPFLOW_case_ACTIVSg200.m (Failed)
	 18 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_TOML_TESTSUITE (Failed)
	 20 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE (Failed)
	 21 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)
	 34 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_MPI_TESTSUITE (Failed)
	 35 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_SERIAL_TESTSUITE (Failed)
	 36 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_RAJA_TESTSUITE (Failed)
	 48 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML (Failed)
	 49 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_MPI_RAJA_GPU_TOML (Failed)
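
For reference, failures like these can be re-run in isolation from the build directory with CTest's standard flags (a sketch; it assumes the last `ctest` invocation is the one that recorded these failures):

```shell
# Re-run only the tests that failed in the previous ctest invocation,
# printing each test's output as it runs.
ctest --rerun-failed --output-on-failure

# Or pick an individual test by name pattern, e.g. test 2 above:
ctest -R 'UNIT_TESTS_OPFLOW_case118' --output-on-failure
```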

nkoukpaizan (Collaborator, Author) commented:
On the failing tests:

  1. I am observing different behavior between Debug and RelWithDebInfo builds. The list of failed tests above was from a Debug build; in RelWithDebInfo, only tests 20 and 21 fail (the others pass).

  2. The failure on test 20 seems related to Incline Test Failures #92. Note that this is still [email protected].

Test 20 log on Frontier
20/57 Testing: FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE
20/57 Test: FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE
Command: "/usr/bin/srun" "--gpus-per-task=1" "/lustre/orion/scratch/nkouk/csc359/exago-frontier-amd-gfortran-github/build/tests/functionality/opflow/test_opflow_functionality" "/lustre/orion/scratch/nkouk/csc359/exago-frontier-amd-gfortran-github/tests/functionality/opflow/hiop_pbpolrajahiopsparse_gpu.toml"
Directory: /lustre/orion/scratch/nkouk/csc359/exago-frontier-amd-gfortran-github/build/tests/functionality/opflow
"FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE" start time: Dec 04 15:37 EST
Output:
----------------------------------------------------------
[ExaGO] Creating OPFlow Functionality Test
Test Description: datafiles/case9/case9mod.m base case
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
[Warning] Hiop does not understand option 'dualsInitialization' and will ignore its value 'zero'.
[Warning] Detected 1 fixed variables out of a total of 24.
===============
Hiop SOLVER
===============
Using 1 MPI ranks.
---------------
Problem Summary
---------------
Total number of variables: 24
   lower/upper/lower_and_upper bounds: 16 / 16 / 16
Total number of equality constraints: 18
Total number of inequality constraints: 18
   lower/upper/lower_and_upper bounds: 18 / 18 / 18
iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
 0  1.0318125e+04 1.800e+00  4.460e+03  -1.00  0.000e+00  0.000e+00  -(-)
MPICH ERROR [Rank 0] [job id 1520382.78] [Mon Dec  4 15:37:30 2023] [frontier01527] - Abort(59) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0

srun: error: frontier01527: task 0: Exited with exit code 15
srun: Terminating StepId=1520382.78
<end of output>
Test time =   0.49 sec
----------------------------------------------------------
Test Failed.
"FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE" end time: Dec 04 15:37 EST
"FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE" time elapsed: 00:00:00
  3. Using rocgdb with Debug builds, the failure occurs at the same place for tests 2, 3, 18 and 48 (exago/src/opflow/model/power_bal_hiop/pbpolrajahiopkernels.cpp:1385); a sketch of the rocgdb invocation follows this list. For example, on test 2 (UNIT_TESTS_OPFLOW_case118.m):
Test 2 backtrace
Testing custom power balance model in polar coordinates for HIOP using RAJA(PBPOLHIOPRAJA) ...
[New Thread 0x7fff32ea3700 (LWP 101232)]
[New Thread 0x7ff7179ff700 (LWP 101233)]
[Thread 0x7ff7179ff700 (LWP 101233) exited]
nx = 344
nconeq = 236
nconineq = 0
--- PASS: Test computeVariableBounds
[New Thread 0x7ff6e7dff700 (LWP 101235)]
[Thread 0x7ff6e7dff700 (LWP 101235) exited]
[New Thread 0x7fff2077f700 (LWP 101236)]
--- PASS: Test computeObjective
--- PASS: Test computeGradient
--- PASS: Test computeConstraints
--- PASS: Test computeConstraintBounds
--- PASS: Test computeConstraintJacobian

Thread 6 "test_acopf" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:3:1:1/0 (0,0,0)[0,0,0])]
0x00007ff6e90f60f0 in RAJA::internal::detail::ViewReturnHelper<0l, camp::list<int, int>, double, double*, long, -1l>::make_return<RAJA::detail::LayoutBase_impl<camp::int_seq<long, 0l, 1l>, long, -1l> > (layout=..., data=<error reading variable: Cannot access memory at address 0x190800020070>, args=<error reading variable: Cannot access memory at address 0x2000000001924>,
  args=<error reading variable: Cannot access memory at address 0x2000000001924>)
  at /lustre/orion/csc359/proj-shared/nkouk/spack-install/linux-sles15-x86_64/clang-16.0.0-rocm5.6.0-mixed/raja-0.14.0-g3pbskqchls2f3lbqunuawgwfinyy44o/include/RAJA/util/TypedViewBase.hpp:116
116	        return data[stripIndexType(layout(args...))];
(gdb) backtrace
#0  0x00007ff6e90f60f0 in RAJA::internal::detail::ViewReturnHelper<0l, camp::list<int, int>, double, double*, long, -1l>::make_return<RAJA::detail::LayoutBase_impl<camp::int_seq<long, 0l, 1l>, long, -1l> > (layout=..., data=<error reading variable: Cannot access memory at address 0x190800020070>,
  args=<error reading variable: Cannot access memory at address 0x2000000001924>, args=<error reading variable: Cannot access memory at address 0x2000000001924>)
  at /lustre/orion/csc359/proj-shared/nkouk/spack-install/linux-sles15-x86_64/clang-16.0.0-rocm5.6.0-mixed/raja-0.14.0-g3pbskqchls2f3lbqunuawgwfinyy44o/include/RAJA/util/TypedViewBase.hpp:116
#1  RAJA::internal::view_make_return_value<double, long, RAJA::detail::LayoutBase_impl<camp::int_seq<long, 0l, 1l>, long, -1l>, double*, int, int> (layout=...,
  data=<error reading variable: Cannot access memory at address 0x190800020070>, args=<error reading variable: Cannot access memory at address 0x2000000001924>,
  args=<error reading variable: Cannot access memory at address 0x2000000001924>)
  at /lustre/orion/csc359/proj-shared/nkouk/spack-install/linux-sles15-x86_64/clang-16.0.0-rocm5.6.0-mixed/raja-0.14.0-g3pbskqchls2f3lbqunuawgwfinyy44o/include/RAJA/util/TypedViewBase.hpp:157
#2  RAJA::internal::ViewBase<double, double*, RAJA::detail::LayoutBase_impl<camp::int_seq<long, 0l, 1l>, long, -1l> >::operator()<int, int> (this=0x190800020070, args=1, args=1)
  at /lustre/orion/csc359/proj-shared/nkouk/spack-install/linux-sles15-x86_64/clang-16.0.0-rocm5.6.0-mixed/raja-0.14.0-g3pbskqchls2f3lbqunuawgwfinyy44o/include/RAJA/util/TypedViewBase.hpp:341
#3  OPFLOWComputeDenseEqualityConstraintHessian_PBPOLRAJAHIOP(_p_OPFLOW*, double const*, double const*, double*)::{lambda(long)#2}::operator()(long) const (
  this=<error reading variable: Cannot access memory at address private_lane#0x4bf8>, i=<error reading variable: Cannot access memory at address private_lane#0x4c00>)
  at /lustre/orion/scratch/nkouk/csc359/exago-frontier-amd-gfortran-github/src/opflow/model/power_bal_hiop/pbpolrajahiopkernels.cpp:1385
failed to find previous frame when computing inline frame id
  4. I didn't try to get a backtrace on tests 34, 35, 36 and 49, as they use MPI.
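
As referenced in item 3, a sketch of how one of these backtraces can be reproduced (assumptions: a fresh Debug build in `build/`, and that the failing unit-test binary is the `test_acopf` seen in the gdb output; paths and arguments are placeholders for the local checkout, not the exact test harness invocation):

```shell
# Debug build so device frames keep symbols (paths are illustrative;
# add the usual Frontier toolchain/TPL options to the configure line).
cmake -S exago -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j

# ROCm's gdb accepts the usual gdb flags; 'run' until the SIGSEGV,
# then 'backtrace' produces output like the one shown above.
rocgdb --args ./build/tests/unit/opflow/test_acopf datafiles/case118.m
```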

I can upgrade to hiop@develop and see what happens.
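
In Spack terms that would be a one-token change to the spec (illustrative only; it assumes the hiop package exposes a `develop` branch version and that the rest of the rocm/5.6 environment from this MR stays as-is):

```shell
# Rebuild ExaGO against hiop's develop branch, keeping all other
# concretization choices from the existing Frontier environment.
spack install exago ^hiop@develop
```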

nkoukpaizan deleted the nicholson/frontier-rocm5.6 branch on September 24, 2024 at 16:36.