Some tests hang on ppc64le and aarch64 #1196

Open
junghans opened this issue Dec 6, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@junghans
Contributor

junghans commented Dec 6, 2024

E.g. see the build here: https://koji.fedoraproject.org/koji/taskinfo?taskID=126510517

@aprokop
Contributor

aprokop commented Dec 6, 2024

@junghans Thank you for the report. I looked through the logs, but can't see it running any tests. Could you please point me to where to look?

@JBludau pointed out that it seems to still show progress in compilation, albeit very slowly.

@junghans
Contributor Author

junghans commented Dec 6, 2024

The end of the aarch64 build (https://kojipkgs.fedoraproject.org//work/tasks/562/126510562/build.log) looks like:

+ /usr/bin/ctest --test-dir aarch64-redhat-linux-gnu-serial --output-on-failure --force-new-ctest-process -j1
Internal ctest changing into directory: /builddir/build/BUILD/ArborX-1.7-build/ArborX-1.7/aarch64-redhat-linux-gnu-serial
Test project /builddir/build/BUILD/ArborX-1.7-build/ArborX-1.7/aarch64-redhat-linux-gnu-serial
      Start  1: ArborX_Test_DetailsUtils
 1/11 Test  #1: ArborX_Test_DetailsUtils .................   Passed    0.05 sec
      Start  2: ArborX_Test_Geometry
 2/11 Test  #2: ArborX_Test_Geometry .....................   Passed    0.01 sec
      Start  3: ArborX_Test_QueryTree
 3/11 Test  #3: ArborX_Test_QueryTree ....................   Passed    0.99 sec
      Start  4: ArborX_Test_DetailsTreeConstruction
 4/11 Test  #4: ArborX_Test_DetailsTreeConstruction ......   Passed    0.02 sec
      Start  5: ArborX_Test_DetailsContainers
 5/11 Test  #5: ArborX_Test_DetailsContainers ............   Passed    0.06 sec
      Start  6: ArborX_Test_DetailsCrsGraphWrapperImpl
 6/11 Test  #6: ArborX_Test_DetailsCrsGraphWrapperImpl ...   Passed    0.01 sec
      Start  7: ArborX_Test_Clustering
 7/11 Test  #7: ArborX_Test_Clustering ...................   Passed    0.07 sec
      Start  8: ArborX_Test_DetailsClusteringHelpers
 8/11 Test  #8: ArborX_Test_DetailsClusteringHelpers .....   Passed    0.04 sec
      Start  9: ArborX_Test_SpecializedTraversals

and it hangs there.

@junghans
Contributor Author

junghans commented Dec 6, 2024

For ppc64le, some tests are actually failing (see https://kojipkgs.fedoraproject.org//work/tasks/563/126510563/build.log):

 7/11 Test  #7: ArborX_Test_Clustering ...................***Failed    9.60 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 10 test cases...
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Cluster IDs are not unique
Core point is marked as noise: 12 [-1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Noise point does not have index -1: 0 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
....
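
For readers unfamiliar with the messages in the log above, here is a minimal standalone sketch of the kind of invariants they appear to describe. It is not ArborX's `verifyDBSCAN`: the names `Point`, `within`, and `checkLabels` are hypothetical, and the core-point definition (at least `minPts` points within `eps`, the point itself included) is an assumption. It covers only three of the messages and ignores the "Cluster IDs are not unique" check.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Point
{
  double x, y;
};

// True when a and b lie within eps of each other (squared distances avoid a sqrt).
bool within(Point a, Point b, double eps)
{
  double dx = a.x - b.x, dy = a.y - b.y;
  return dx * dx + dy * dy <= eps * eps;
}

// Hypothetical brute-force checker, O(n^2). Returns one message per violated
// invariant; an empty result means the labels are consistent.
std::vector<std::string> checkLabels(std::vector<Point> const &pts,
                                     std::vector<int> const &labels,
                                     double eps, int minPts)
{
  assert(pts.size() == labels.size());
  std::size_t const n = pts.size();

  // Assumption: a point is a core point when at least minPts points,
  // itself included, lie within eps of it.
  std::vector<bool> core(n, false);
  for (std::size_t i = 0; i < n; ++i)
  {
    int count = 0;
    for (std::size_t j = 0; j < n; ++j)
      if (within(pts[i], pts[j], eps))
        ++count;
    core[i] = (count >= minPts);
  }

  std::vector<std::string> errors;
  for (std::size_t i = 0; i < n; ++i)
  {
    bool nearCore = false; // is some other core point within eps of i?
    for (std::size_t j = 0; j < n; ++j)
    {
      if (j == i || !core[j] || !within(pts[i], pts[j], eps))
        continue;
      nearCore = true;
      // Two core points within eps of each other must share a label.
      if (core[i] && labels[i] != labels[j])
        errors.push_back("Connected cores do not belong to the same cluster: " +
                         std::to_string(i) + " -> " + std::to_string(j));
    }
    // A core point always belongs to some cluster.
    if (core[i] && labels[i] < 0)
      errors.push_back("Core point is marked as noise: " + std::to_string(i));
    // A point that is neither core nor near a core is noise and must be -1.
    if (!core[i] && !nearCore && labels[i] != -1)
      errors.push_back("Noise point does not have index -1: " + std::to_string(i));
  }
  return errors;
}
```

With the points (0, 0), (0.5, 0), (10, 10), `eps = 1` and `minPts = 2`, the labels `{0, 0, -1}` pass, while `{0, 1, 2}` produces three messages: the two connected cores disagree in both directions, and the isolated point is not labelled -1.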

@aprokop aprokop added the bug Something isn't working label Dec 6, 2024
@aprokop
Contributor

aprokop commented Dec 6, 2024

I tested on Power 10 system (Mammatus on Franken). Both current master and 1.7 are fine there. I wonder what's going on. I tested both Debug and RelWithDebugInfo builds.
The failure seems to be similar to #1112.

Surprisingly, I can produce a different failure for the clustering helpers test. Some dendrogram tests fail:

$ OMP_NUM_THREADS=30 ./ArborX_Test_Clustering.exe
/home/users/aprokop/code/arborx/test/tstDendrogram.cpp(187): error: in "Dendrogram/dendrogram_boruvka<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check parents_boruvka == parents_union_find has failed
  - mismatch at position 831: [1510 == 1511] is false
  - mismatch at position 1306: [1511 == 1510] is false
  - mismatch at position 1510: [1557 == 2205] is false
  - mismatch at position 1511: [2205 == 1557] is false
  - mismatch at position 5871: [1510 == 1511] is false
  - mismatch at position 5903: [1511 == 1510] is false

*** 1 failure is detected in the test module "Master Test Suite"

This is not very surprising: that test is hard to design because dendrograms may shift slightly.

@aprokop
Contributor

aprokop commented Dec 6, 2024

> I tested on Power 10 system (Mammatus on Franken). Both current master and 1.7 are fine there. I wonder what's going on. I tested both Debug and RelWithDebugInfo builds.

Never mind. Only Debug passes. RelWithDebInfo (NOT RelWithDebugInfo :|) reproduces the failure. This is very similar to #1113. So maybe it's not just Intel; something about optimization produces a wrong result. I'm really not sure how to start figuring this out.

@junghans
Contributor Author

junghans commented Dec 6, 2024

> @JBludau pointed out that it seems to still show progress in compilation, albeit very slowly.

Sometimes they run for weeks: https://koji.fedoraproject.org/koji/tasks?owner=junghans&state=active&view=tree&method=all&order=-id

@aprokop
Contributor

aprokop commented Dec 10, 2024

@junghans Is there a way to test current master? I looked further into DBSCAN failures, and while I can reproduce them with 1.7, I can't reproduce them with the current master. I think something in refactoring may have fixed it.

@junghans
Contributor Author

I tested master here: https://koji.fedoraproject.org/koji/taskinfo?taskID=126684443
(Yes, it says v1.7, but it is 37adf1a)

aarch64 passed, but ppc64le still has one failing test:

 8/14 Test  #8: ArborX_Test_Clustering ...................***Failed    0.37 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 10 test cases...
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Cluster IDs are not unique
Core point is marked as noise: 12 [-1]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 1 [1]
Noise point does not have index -1: 0 [1]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Cluster IDs are not unique
Core point is marked as noise: 12 [-1]
Noise point does not have index -1: 0 [0]
/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstDBSCAN.cpp(202): error: in "DBSCAN/dbscan<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check verifyDBSCAN(space, points, 2 * r, 2, dbscan(space, points, 2 * r, 2, params)) has failed
*** 1 failure is detected in the test module "Master Test Suite"

@aprokop
Contributor

aprokop commented Jan 5, 2025

@junghans Can you please try the latest master (with #1198 merged)?

@junghans
Contributor Author

junghans commented Jan 5, 2025

@junghans
Contributor Author

junghans commented Jan 6, 2025

x86_64 fails to build for a different reason (some ROCm issue), but aarch64 is also still failing in testing:

10/12 Test #10: ArborX_Test_SpecializedTraversals ........***Failed    0.01 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 10 test cases...
/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstNeighborList.cpp(177): error: in "find_neighbor_list_compare_filtered_tree_traversal<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check Test::buildHalfNeighborListAndExpandToFull(exec_space, points, radius) == Test::compute_reference<MemorySpace>(exec_space, points, radius) has failed
  - mismatch at position 31: [( 34 51 54 73 76 ) == ( 51 76 )] is false
  - mismatch at position 33: [( 34 54 73 76 ) == ( 76 )] is false
  - mismatch at position 34: [( 31 33 54 73 ) == ( )] is false
  - mismatch at position 40: [( 25 54 65 76 ) == ( 25 76 )] is false
  - mismatch at position 54: [( 31 33 34 40 73 76 ) == ( 76 )] is false
  - mismatch at position 65: [( 25 40 ) == ( 25 )] is false
  - mismatch at position 73: [( 31 33 34 54 ) == ( )] is false
*** 1 failure is detected in the test module "Master Test Suite"

ppc64le is fine.
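
As context for the failing check above, here is a minimal standalone sketch of what comparing an expanded half neighbor list against a full reference could look like. It does not use ArborX or Kokkos: `buildHalf`, `expandToFull`, and `reference` are hypothetical names, the points are 1D for brevity, and the conventions (a point is not its own neighbor, a neighbor is any point within distance `r` inclusive) are assumptions rather than the library's documented behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using NeighborList = std::vector<std::vector<int>>;

// Half list: every pair (i, j) with i < j and |x_i - x_j| <= r is stored once, under i.
NeighborList buildHalf(std::vector<double> const &x, double r)
{
  assert(r >= 0.0);
  NeighborList half(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = i + 1; j < x.size(); ++j)
      if (std::abs(x[i] - x[j]) <= r)
        half[i].push_back(static_cast<int>(j));
  return half;
}

// Symmetrize: for every stored pair (i, j), also record i as a neighbor of j.
NeighborList expandToFull(NeighborList const &half)
{
  NeighborList full = half;
  for (std::size_t i = 0; i < half.size(); ++i)
    for (int j : half[i])
      full[static_cast<std::size_t>(j)].push_back(static_cast<int>(i));
  // Sort each row so that two lists can be compared element by element.
  for (auto &row : full)
    std::sort(row.begin(), row.end());
  return full;
}

// Brute-force full list; rows come out sorted because j is visited in ascending order.
NeighborList reference(std::vector<double> const &x, double r)
{
  NeighborList full(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j)
      if (i != j && std::abs(x[i] - x[j]) <= r)
        full[i].push_back(static_cast<int>(j));
  return full;
}
```

For `x = {0.0, 0.25, 0.5, 5.0}` and `r = 0.3`, both `expandToFull(buildHalf(x, r))` and `reference(x, r)` give point 1 the neighbors 0 and 2 and leave point 3 with none. In the log above, the left-hand rows contain extra entries relative to the right-hand rows, so the two sides disagree about which pairs lie within the radius.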

@junghans
Contributor Author

junghans commented Jan 6, 2025

Actually, this is the reverse of before: now aarch64 is failing and ppc64le is fine.

@aprokop
Contributor

aprokop commented Jan 6, 2025

Thanks for testing!

Interesting. So the failing test is actually different from the one before.

@junghans
Contributor Author

junghans commented Jan 6, 2025

Ah you are right, it is indeed a different test!
