Revoke for sessions #16

Open
abouteiller opened this issue Jan 9, 2024 · 3 comments

Comments

@abouteiller

abouteiller commented Jan 9, 2024

Idea:
revoke entire sessions

Benefit:
Gives libraries a handy way to expose a single handle that library users' code can use to revoke library-internal state (in particular, MPI communicators used internally), signaling to the library that it needs to react to an 'unexpected' condition.

Proposed text:
MPI_SESSION_REVOKE revokes all communicators that were created using objects derived from the session.

Branch: https://github.com/mpiwg-ft/mpi-standard/tree/ulfm/session-revoke
Diff: https://github.com/mpiwg-ft/mpi-standard/compare/ulfm/master...mpiwg-ft:mpi-standard:ulfm/session-revoke

Implementation prototype:
https://github.com/abouteiller/ompi-aurelien/tree/ulfm/session_revoke

Possible complications:
There have been talks about relaxing the hard restriction on mixing groups obtained from different sessions, which would make the logic of this operation harder to define and implement. However, such a relaxation may also cause problems for the error-handling 'fallback' in sessions as already defined in MPI 4.0/4.1.

@bosilca
Member

bosilca commented Jan 9, 2024

How is this reflected back to the user on the other processes? As far as I understand your code:

  1. there is no difference between session_revoke and communicator_revoke for the other ranks
  2. the implementation is hardly scalable because it generates a revocation storm (the communicators are kept in different orders on each process)
  3. assuming the sessions support dynamic processes, is this function expected to be transitive across all connected sessions?

@abouteiller
Author

abouteiller commented Mar 18, 2024

R: The advice promotes an undesirable use of the session handle. The handle should be encapsulated in the library, and a library accessor should then call session-revoke on the internal session object.
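The encapsulation suggested here can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions: `Comm`, `Session`, `SolverLibrary`, `revoke_comms` and `abort_pending_work` are hypothetical names invented for illustration, not MPI or ULFM APIs, and "revoked" is reduced to a boolean flag.

```python
class Comm:
    """Toy stand-in for a communicator; only tracks a revoked flag."""

    def __init__(self, name):
        self.name = name
        self.revoked = False


class Session:
    """Toy stand-in for a session that remembers communicators derived from it."""

    def __init__(self):
        self.comms = []

    def create_comm(self, name):
        comm = Comm(name)
        self.comms.append(comm)
        return comm

    def revoke_comms(self):
        # Hypothetical model of the proposed session-revoke operation:
        # every communicator derived from this session is marked revoked.
        for comm in self.comms:
            comm.revoked = True


class SolverLibrary:
    """Hypothetical library that keeps its session handle private."""

    def __init__(self):
        self._session = Session()
        self._halo = self._session.create_comm("halo")
        self._reduce = self._session.create_comm("reduce")

    def abort_pending_work(self):
        # Accessor exposed to users; the session handle itself never leaks.
        self._session.revoke_comms()

    def is_usable(self):
        return not any(comm.revoked for comm in self._session.comms)


lib = SolverLibrary()
user_comm = Comm("user")  # owned by the application, not by the library session
lib.abort_pending_work()
```

In this model the user triggers the revocation through the library's accessor, and only the library's internal communicators are affected; the application's own communicator is untouched.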

  • Q: Is the session revoked? Is there an 'is_revoked'?

    • The session object itself is not 'revoked' and remains fully operational as a session object.
    • The name is therefore perhaps too symmetric with the other revoke procedures.
    • We discussed having session_revoke actually revoke the session. What that would mean is unclear: all session procedures are local, so it would have no effect anyway?
      • There is no concept of a revoked session, so it should not be called 'revoked'. The function needs another name: mpi_session_revoke_objects? mpi_session_revoke_all?
      • Have a look at section 2.2 (naming convention).
      • Explain why there is no is_revoked (for the reasons just given).
  • Q: comm_create_from_group after session revoke?

    • It will still work: a group cannot be revoked, so that procedure cannot be 'revoked' either.
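The two answers above (the session itself is not revoked, and creating a communicator from a group still works afterwards) can be sketched with a toy model. The Python functions below only borrow the names of the procedures discussed here; they are hypothetical stand-ins, and the dictionaries and tuples used for communicators and groups are assumptions of the sketch.

```python
class Session:
    """Toy session: carries no revoked state, only a list of derived communicators."""

    def __init__(self, name):
        self.name = name
        self.derived = []

    def group_from_pset(self, pset):
        # Groups are plain local objects; the model gives them no revoked flag.
        return (self, pset)


def comm_create_from_group(group, tag):
    """Create a toy communicator from a group and record it in the owning session."""
    session, pset = group
    comm = {"tag": tag, "pset": pset, "revoked": False}
    session.derived.append(comm)
    return comm


def session_revoke(session):
    """Revoke the communicators that exist at the time of the call."""
    for comm in session.derived:
        comm["revoked"] = True


s = Session("lib-session")
g = s.group_from_pset("mpi://WORLD")
c1 = comm_create_from_group(g, "before")

session_revoke(s)

# The session and the group are still usable, so a fresh communicator
# can be created from the same group after the revocation.
c2 = comm_create_from_group(g, "after")
```

Here `c1` ends up revoked while `c2`, created after the call, does not, which matches the statement that there is no concept of a revoked session.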

Q: Relation with error scope (with new scope session)

  • Yes, this is the explicit version of setting the info error_scope=session (text yet to be written).

Q: There have been talks about relaxing the hard restriction on mixing groups obtained from different sessions, which would make the logic of this operation harder to define and implement. However, such a relaxation may also cause problems for the error-handling 'fallback' in sessions as already defined in MPI 4.0/4.1.

  • The breach of separation under consideration is much more limited (e.g., allowing translate_ranks only). Doing group_union between groups from different sessions would be a major break of the session model and is not being considered at the moment.

Q: G1: Should we be able to tell whether a comm was revoked by comm_revoke or by session_revoke?
- No, it's the same as if ...
- An implementation may still return a different error code (not class) to distinguish them, but we don't standardize codes.
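The distinction between error code and error class can be sketched as follows. The numeric codes are hypothetical and implementation-specific by construction; the class name follows the ULFM proposal, and `error_class` is a toy analogue of MPI_Error_class rather than a binding to it.

```python
ERR_REVOKED = "MPI_ERR_REVOKED"  # error class name as used in the ULFM proposal

# Hypothetical implementation-specific codes; nothing standardizes these values.
CODE_REVOKED_VIA_COMM = 9001
CODE_REVOKED_VIA_SESSION = 9002

_CODE_TO_CLASS = {
    CODE_REVOKED_VIA_COMM: ERR_REVOKED,
    CODE_REVOKED_VIA_SESSION: ERR_REVOKED,
}


def error_class(code):
    """Toy analogue of MPI_Error_class: map an error code to its class."""
    return _CODE_TO_CLASS[code]


def is_revoked_error(code):
    # Portable code tests the class, never the raw code.
    return error_class(code) == ERR_REVOKED
```

Portable user code that only checks the class sees both origins as the same condition, while an implementation remains free to expose the difference through the code for diagnostics.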

Q: G3: dynamic processes
- The definition is in terms of communicators, so the connected set is the set of processes that appear in these communicators.

Q: propagation at relay nodes?
n1: s1, c1, c2
n2: s2, c1, c3
n3: s3, c2, c3
n1 calls session_revoke(s1) -> c1 and c2 are revoked; c3 is not revoked.
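The relay example can be replayed with a few lines of Python. This is a toy model of the stated outcome, assuming that a session revoke affects only the communicators held through the caller's own session and is not relayed through the sessions of other nodes; the `nodes` table and both helper functions are hypothetical.

```python
# Which session and communicators each node holds, copied from the example above.
nodes = {
    "n1": {"session": "s1", "comms": {"c1", "c2"}},
    "n2": {"session": "s2", "comms": {"c1", "c3"}},
    "n3": {"session": "s3", "comms": {"c2", "c3"}},
}


def session_revoke(caller):
    """Return the communicators revoked when `caller` revokes its session.

    Assumption of this sketch: only communicators derived from the caller's
    own session are revoked; nothing is relayed through other nodes' sessions.
    """
    return set(nodes[caller]["comms"])


def usable_comms(node, revoked):
    """Communicators that `node` can still use after the revocation."""
    return nodes[node]["comms"] - revoked


revoked = session_revoke("n1")
print(sorted(revoked))                      # ['c1', 'c2']
print(sorted(usable_comms("n2", revoked)))  # ['c3']
print(sorted(usable_comms("n3", revoked)))  # ['c3']
```

n2 and n3 each lose one communicator shared with n1, but c3, which n1 does not participate in, stays usable on both.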

@abouteiller
Author

Discussion of an implicit session revoke has started in the thread linked below, which includes some proposed text:
https://github.com/mpi-forum/mpi-standard/pull/715#discussion_r1529316235
A complication is that the proposed texts are not symmetrical: session_revoke revokes a set of comms, whereas range=session revokes a single comm.

Maybe we need to consider variants for the implicit form that control which process fault causes the revoke and, if more than one object is involved, which objects get revoked.
