Error MPI_ERR_PROC_FAILED does not detail what caused the failure
Implication when debugging an FT app
In an FT application, programming errors are absorbed and “recovered from” semi-silently.
When debugging the FT path of the app, we need to tolerate only purposefully injected failures.
Possible solution: non-standard mpirun parameters to launch in debug mode (nothing to standardize)
Implication in production
If the program fails due to a hardware failure, we always want to continue.
If the program fails due to a software error of some sort (application or system software), the situation is not as clear-cut.
Some users may want to continue, depending on the type and severity of the failure and the current progress, possibly after triggering a dataset verification (to catch soft errors that may have propagated from the failed rank).
Under other conditions, or for other users, the right choice is to stop (the code is bad and needs to be debugged).
Possible solution: Standardize (or recommend?) specific error codes for the MPI_ERR_PROC_FAILED*/REVOKE classes
Possible error codes for all types of signals, different types of hardware failures, etc.
Reasons to oppose:
So far all error codes that are not also a class are implementation specific
Some of these new codes would be specific to some architectures, not necessarily generic (my machine does not generate SIGSEGV because it doesn't have virtual memory)
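To make the discussion concrete, here is a minimal sketch of the kind of policy an application could apply if a failure cause were available alongside MPI_ERR_PROC_FAILED. Everything in it is an assumption for illustration: the `failure_cause` values, the `recovery_action` values, and `choose_action` are hypothetical names, not part of MPI, ULFM, or Open MPI, and the mapping from cause to action is only one possible user policy. It deliberately avoids any MPI call, since no such error codes exist today.

```c
#include <assert.h>
#include <stdbool.h>

/* HYPOTHETICAL failure causes. No MPI implementation defines these;
 * they stand in for the "specific error codes" proposed above. */
typedef enum {
    CAUSE_UNKNOWN = 0,   /* what MPI_ERR_PROC_FAILED tells us today: nothing */
    CAUSE_HW_FAILURE,    /* e.g. node or link went down */
    CAUSE_SIGSEGV,       /* software error: architecture-dependent signal */
    CAUSE_SIGABRT,       /* software error: explicit abort */
    CAUSE_INJECTED       /* failure injected on purpose for FT testing */
} failure_cause;

/* HYPOTHETICAL recovery decisions an application might take. */
typedef enum {
    ACTION_CONTINUE,
    ACTION_VERIFY_THEN_CONTINUE, /* check the dataset for propagated soft errors */
    ACTION_STOP
} recovery_action;

/* One possible policy (an assumption, not a recommendation):
 *  - debug mode: tolerate only purposefully injected failures;
 *  - production: always continue after hardware failures, and let the
 *    user decide whether software errors stop the run. */
recovery_action choose_action(failure_cause cause,
                              bool debug_mode,
                              bool stop_on_software_error)
{
    /* Reject values outside the hypothetical enum. */
    assert(cause >= CAUSE_UNKNOWN && cause <= CAUSE_INJECTED);

    if (debug_mode) {
        return cause == CAUSE_INJECTED ? ACTION_CONTINUE : ACTION_STOP;
    }

    switch (cause) {
    case CAUSE_HW_FAILURE:
    case CAUSE_INJECTED:
        return ACTION_CONTINUE;
    case CAUSE_SIGSEGV:
    case CAUSE_SIGABRT:
        return stop_on_software_error ? ACTION_STOP
                                      : ACTION_VERIFY_THEN_CONTINUE;
    case CAUSE_UNKNOWN:
    default:
        /* Arbitrary choice for this sketch: without a cause, be cautious. */
        return ACTION_VERIFY_THEN_CONTINUE;
    }
}
```

In a real FT application this decision would sit in the error-handling path, after the failure has been detected. The split between `debug_mode` and `stop_on_software_error` mirrors the debugging and production cases above, and the architecture-specific causes (such as `CAUSE_SIGSEGV`) illustrate the objection that such codes may not be generic.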
At this point the general feeling is to not standardize this. We will proceed independently in providing this feature to users in the Open MPI ULFM implementation, and see if this becomes a widely used feature or if nobody cares.
why
Feedback received from SC'15 ULFM BoF