Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: Map<>::getLocalElement cannot be called in a host-side parallel region #13445

Open
MalachiTimothyPhillips opened this issue Sep 12, 2024 · 3 comments
Labels
client: Sierra All issues that primarily impacts SNL Sierra codes pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@MalachiTimothyPhillips
Copy link

Bug Report

@trilinos/tpetra
@japlews
@vbrunini

Tpetra::Map<>::getLocalElement (which according to the doxygen should be thread safe) may call Tpetra::Map<>::lazyPushToHost.

Tpetra::Map<>::lazyPushToHost performs several operations (Kokkos::deep_copy, Kokkos::fence) that are "illegal" from within a parallel region.

With the recent Kokkos 4.4 promotion, this may now cause host-side parallel regions to hang.

See:

/// \note This function should be thread safe and thread scalable,

and

Steps to Reproduce

Call Tpetra::Map<>::lazyPushToHost from within a host-side Kokkos::parallel_for.

@MalachiTimothyPhillips MalachiTimothyPhillips added type: bug The primary issue is a bug in Trilinos code or tests client: Sierra All issues that primarily impacts SNL Sierra codes labels Sep 12, 2024
@MalachiTimothyPhillips MalachiTimothyPhillips changed the title Tpetra: Map<>::getLocalElement cannot be called from a host-side parallel region Tpetra: Map<>::getLocalElement cannot be called in a host-side parallel region Sep 12, 2024
@japlews
Copy link
Contributor

japlews commented Sep 12, 2024

@MalachiTimothyPhillips thanks a lot for reporting this. FYI, we hit this with Kokkos 4.4 integrations in Sierra. I would also add to the comments that there are several functions in Tpetra::Map that call lazyPushToHost; not limited to just getLocalElement. It's easy enough to work around, so this is not super urgent, but would be nice to fix.

@jhux2 jhux2 added this to Tpetra Sep 13, 2024
@jhux2 jhux2 moved this to Needs Triage in Tpetra Sep 13, 2024
@csiefer2
Copy link
Member

csiefer2 commented Sep 17, 2024

@MalachiTimothyPhillips @japlews We generally recommend avoiding calling Tpetra functions (excepting those with KOKKOS_FUNCTION decorators) in parallel regions, because that code will never work on a GPU.

As you note, the documentation about thread safety of that function is no longer correct. The change we neglected to document is an intentional one --- the lazyPushToHost stuff was designed to avoid adding a bunch of D2H calls in the middle of an otherwise 100% GPU calculations. Previously every Map construction involved those D2H calls.

If you're not building a GPU backend, we might be able to clean up the fences. But if you are building a GPU backend with a host-parallel front end, then those fences really do need to be there.

@japlews
Copy link
Contributor

japlews commented Sep 17, 2024

@csiefer2 This sounds good. I only advocate for improving the comments or otherwise enforcing somehow that we don't do this in threaded loops where a hang could occur!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
client: Sierra All issues that primarily impacts SNL Sierra codes pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests
Projects
Status: Needs Triage
Development

No branches or pull requests

3 participants