Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tribits: MPI_EXEC if unaccessible doesn't produce an error related to MPI but fails obscurely #594

Open
jjellio opened this issue Sep 13, 2023 · 9 comments

Comments

@jjellio
Copy link

jjellio commented Sep 13, 2023

I believe this falls under Tribits, since I think Trilinos doesn't use the official Kitware MPI module.

What happened is a user on the El Capitan systems was using my build setup, which set
`-DMPI_EXEC=/some/path/they/cant/read'

The user's configure only gave this as an error:

CMake Error at cmake/tribits/core/package_arch/TribitsAddTestHelpers.cmake:688 (add_test):
  add_test given test NAME "TeuchosCore_GlobalMPISessionUnitTests_MPI_1"
  which already exists in this directory.
Call Stack (most recent call first):
  cmake/tribits/core/package_arch/TribitsAddTestHelpers.cmake:851 (tribits_add_test_add_test)
  cmake/tribits/core/package_arch/TribitsAddTest.cmake:1009 (tribits_add_test_add_test_all)
  packages/tconfigure_trilinos_performance.logeuchos/core/test/UnitTest/CMakeLists.txt:175 (TRIBITS_ADD_TEST)

Yet, if I took the same configure+environment, it worked.... so I diffed the CMakeCache... and we see:

//Where can one of the /p/lustre1/jjellio/EMPIRE-work/flux-wrap,    
  // mpiexec or mpirun libraries be found
 // mpiexec or mpirun libraries be found                             
  MPI_EXEC:FILEPATH=MPI_EXEC-NOTFOUND  

Looking in their configure log (saved cmake output), this is reported, but it is not an error:

Probing the environment ...

-- USE_XSDK_DEFAULTS='FALSE'
-- BUILD_SHARED_LIBS='OFF'
-- CMAKE_BUILD_TYPE='Release'
-- MPI_USE_COMPILER_WRAPPERS='OFF'
-- MPI_EXEC='/p/lustre1/jjellio/EMPIRE-work/flux-wrap'
-- MPI_EXEC='MPI_EXEC-NOTFOUND'
-- The C compiler identification is Clang 16.0.0

Instead, the user got the Teuchos unit test error which would send most users looking in the wrong direction.

We could probably produce this failure anywhere. Just set MPI_EXEC to a location you can't read.

@bartlettroscoe
Copy link
Member

@jjellio, so we just want an explicit check that if TPL_ENABLE_MPI=ON then MPI_EXEC must evaluate to true? (Note, any value ending in -NOTFOUND evaluates to false in CMake, see here).

@jjellio
Copy link
Author

jjellio commented Sep 13, 2023

I think so - or a clearer error/warning that MPI didn't setup correctly inside CMake. The user came to me with this obscure unit test configure error. I'm guessing Tribits requires that MPI_EXEC be defined by that point, but I guess no where is that assumption enforced?

@jjellio
Copy link
Author

jjellio commented Sep 13, 2023

It seems in a perfect world,


-- USE_XSDK_DEFAULTS='FALSE'
-- BUILD_SHARED_LIBS='OFF'
-- CMAKE_BUILD_TYPE='Release'
-- MPI_USE_COMPILER_WRAPPERS='OFF'
-- MPI_EXEC='/p/lustre1/jjellio/EMPIRE-work/flux-wrap'
-- MPI_EXEC='MPI_EXEC-NOTFOUND'
Error: /p/lustre1/jjellio/EMPIRE-work/flux-wrap was not found but was set.

@jjellio
Copy link
Author

jjellio commented Sep 13, 2023

Just throw an error there, so that latter stuff can assume MPI_EXEC must be legal

@bartlettroscoe
Copy link
Member

I think so - or a clearer error/warning that MPI didn't setup correctly inside CMake. The user came to me with this obscure unit test configure error. I'm guessing Tribits requires that MPI_EXEC be defined by that point, but I guess no where is that assumption enforced?

@jjellio, it is interesting that it has taken 15 years for this issue to come up. (I don't ever remember hearing about a use case like this.)

@bartlettroscoe
Copy link
Member

This is trivial to address, except for perhaps an automated test to ensure that a "false" MPI_EXEC throws an error correctly. (Most of these types of issues take about 10x the amount of test code to library code to test.)

@bartlettroscoe
Copy link
Member

@jjellio, how urgent is this? Sounds like other users might hit this on that system?

@jjellio
Copy link
Author

jjellio commented Sep 14, 2023

Not urgent. I did want to report it though.

I'm guessing this will come up again though. If you remember we wrote a trilinos_jsrun for the ATS-2 testing.

Well on El Cap, we don't need that type of tool - but to get faster testing throughput, I wrote a tool I shared with the FLUX team (the job manager on El Cap), that allows for much higher job throughput. But that script was in folder that my colleague couldn't see. Hopefully we won't need middle-man scripts like that on ATS-4 (El Cap), but in the stand-up process they are nice.

What stinks is how ambiguous the error was. I guess I have enough intuition now to know how to diagnose things better. But anyone else that would have seen this likely would have done alot of wrong things trying to resolve it.

@bartlettroscoe
Copy link
Member

What stinks is how ambiguous the error was.

@jjellio, many errors often are, depending who sees them. That is why it is good to be paranoid and check everything you can (unless there is a significant performance hit, which is often not the case for issues like this).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants