zoltan2: broken unit tests cuda 12.4 w/ h100 gpus #13400

Open
vasylivy opened this issue Aug 27, 2024 · 1 comment
Labels
pkg: Tpetra pkg: Zoltan2 type: bug The primary issue is a bug in Trilinos code or tests

@vasylivy

Hi,

Testing with CUDA 12.4 on H100 GPUs, the following unit tests fail:

850:Zoltan2_mjTest_geomgen_MPI_3
851:Zoltan2_mjTest_geomgen_noholes_MPI_3
855:Zoltan2_mjTest_geomgen_uvmOff_MPI_3
856:Zoltan2_mjTest_geomgen_noholes_uvmOff_MPI_3
890:Zoltan2_Bug9500_MPI_4
893:Zoltan2_TpetraCrsColorer_simple_MPI_4
894:Zoltan2_TpetraCrsColorer_simple_nodistrib_MPI_4
895:Zoltan2_TpetraCrsColorer_west0067_MPI_4
896:Zoltan2_TpetraCrsColorer_west0067_nodistrib_MPI_4
897:Zoltan2_TpetraCrsColorer_galeri1_MPI_4

The errors, grouped by test, are:

850:Zoltan2_mjTest_geomgen_MPI_3
851:Zoltan2_mjTest_geomgen_noholes_MPI_3 
855:Zoltan2_mjTest_geomgen_uvmOff_MPI_3
856:Zoltan2_mjTest_geomgen_noholes_uvmOff_MPI_3

It is not clear what this means, but the output looks like the following, depending on the test:
1 12 52 52: XMV 0; BMV 6  :  FAIL
1 23 63 63: XMV 6; BMV 4  :  FAIL
2 32 112 112: XMV 0; BMV 12  :  FAILs
FAIL
890:Zoltan2_Bug9500_MPI_4
893:Zoltan2_TpetraCrsColorer_simple_MPI_4
895:Zoltan2_TpetraCrsColorer_west0067_MPI_4

Throw test that evaluated to true: ret != ZOLTAN_OK

Zoltan::Color returned -1
894:Zoltan2_TpetraCrsColorer_simple_nodistrib_MPI_4
Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)

Tpetra::CrsMatrix<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete: (Process 1) When converting column indices from global to local, we encountered 1 index that do not live in the column Map on this process.
896:Zoltan2_TpetraCrsColorer_west0067_nodistrib_MPI_4
Throw test that evaluated to true: minAllGID_ < indexBase_

Tpetra::Map constructor (noncontiguous): Minimum global ID = -1 over all process(es) is less than the given indexBase = 0.
897:Zoltan2_TpetraCrsColorer_galeri1_MPI_4
cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered

The build is configuration 1 as reported in #13397.

Can someone try to reproduce these errors?
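
For anyone attempting this, here is a minimal sketch of rerunning only the tests listed above. It assumes `ctest` is on the PATH and that the registered test names match the ones in this report (the numeric prefixes are ctest indices and are dropped). The helper names `ctest_command` and `run_failing_tests` and the build directory path are illustrative, not part of Trilinos.

```python
import re
import subprocess

# Failing test names copied from the list in this report (ctest indices dropped).
FAILING_TESTS = [
    "Zoltan2_mjTest_geomgen_MPI_3",
    "Zoltan2_mjTest_geomgen_noholes_MPI_3",
    "Zoltan2_mjTest_geomgen_uvmOff_MPI_3",
    "Zoltan2_mjTest_geomgen_noholes_uvmOff_MPI_3",
    "Zoltan2_Bug9500_MPI_4",
    "Zoltan2_TpetraCrsColorer_simple_MPI_4",
    "Zoltan2_TpetraCrsColorer_simple_nodistrib_MPI_4",
    "Zoltan2_TpetraCrsColorer_west0067_MPI_4",
    "Zoltan2_TpetraCrsColorer_west0067_nodistrib_MPI_4",
    "Zoltan2_TpetraCrsColorer_galeri1_MPI_4",
]


def ctest_command(names):
    """Build a ctest invocation that selects exactly the given test names."""
    # Anchor the pattern so one name cannot match as a substring of another.
    pattern = "^(" + "|".join(re.escape(n) for n in names) + ")$"
    return ["ctest", "-R", pattern, "--output-on-failure"]


def run_failing_tests(build_dir):
    """Run the selected tests inside build_dir and return ctest's exit code."""
    return subprocess.run(ctest_command(FAILING_TESTS), cwd=build_dir).returncode
```

Calling `run_failing_tests("build")` from the directory that contains a (hypothetical) `build` tree would then run just these ten tests and print the output of any that fail.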

Thanks,

Yaro

@vasylivy vasylivy added the type: bug The primary issue is a bug in Trilinos code or tests label Aug 27, 2024
@vasylivy

Tested config 1 with the following options turned off:

-DKokkos_ENABLE_CUDA_UVM=OFF
-DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=OFF
-DTpetra_ALLOCATE_IN_SHARED_SPACE=OFF

With these settings the unit tests pass, so the failures appear to be UVM-related.
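
To repeat this comparison, the three options can be appended to an existing configure line. The sketch below only builds the argument list. The function name `cmake_args`, the source path, and the base arguments are placeholders, and the remaining config 1 options from #13397 are not reproduced here.

```python
# UVM-related options from this comment, all switched off.
UVM_OFF = {
    "Kokkos_ENABLE_CUDA_UVM": "OFF",
    "KokkosKernels_INST_MEMSPACE_CUDAUVMSPACE": "OFF",
    "Tpetra_ALLOCATE_IN_SHARED_SPACE": "OFF",
}


def cmake_args(source_dir, base_args, overrides):
    """Return a cmake command with the -D overrides placed after base_args."""
    defines = ["-D{}={}".format(name, value) for name, value in overrides.items()]
    return ["cmake", *base_args, *defines, source_dir]
```

For example, `cmake_args("/path/to/Trilinos", base_args, UVM_OFF)` yields a list that can be passed to `subprocess.run` from a fresh build directory, where `base_args` holds whatever the original configuration used.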

Yaro
