Clarify process binding in MPI #4

Open
cniethammer opened this issue Dec 1, 2022 · 0 comments

Problem

Many functionalities in the MPI standard require that the location of the underlying physical processes associated with MPI processes is well defined and that processes do not migrate between hardware resources.
The standard, however, neglects this fact in many places, leaving users and developers with unclear ways to use MPI interfaces or with unmet requirements for optimizations, and leaving MPI implementations to come up with their own solutions.

  • Virtual topologies:
    An optimized mapping of MPI processes to physical processes may only remain valid as long as physical processes do not migrate.
    When a communicator is created and reordering was allowed for optimization, implementations may take communication paths into account.
    Process migration can prevent such optimizations or render them useless, e.g., if an implementation takes NUMA properties into account.

  • Communicators:
    The creation of communicators with MPI_Comm_split_type() relies on the locality properties of the processes not changing during the call.
    Further, the new communicator is not guaranteed to keep these properties, so users cannot rely on them without additional checks/code.

    The standard mentions here:

    Advice to users. Since the location of some of the MPI processes may change
    during the application execution, the communicators created with the value
    MPI_COMM_TYPE_SHARED before this change may not reflect an actual ability
    to share memory between MPI processes after this change. (End of advice to
    users.)
    (MPI 4.0, p.339)

    or

    Advice to users. The set of hardware resources that an MPI process is able to
    utilize may change during the application execution (e.g., because of the
    relocation of an MPI process), in which case the communicators created with the
    value MPI_COMM_TYPE_HW_GUIDED before this change may not reflect the
    utilization of hardware resources of such process at any time after the
    communicator creation. (End of advice to users.)
    (MPI 4.0, p.340)

    or

    The set of hardware
    resources an MPI process utilizes may change during the application execution
    (e.g., because of process relocation), in which case the communicators created
    with the value MPI_COMM_TYPE_HW_UNGUIDED before this change may not
    reflect the utilization of hardware resources for such process at any time after
    the communicator creation. (End of advice to users.)
    (MPI 4.0, p.341)

  • MPI shared memory:
    MPI shared memory allows local and remote processes to access a window via load/store by providing a baseptr through, e.g., MPI_Win_allocate_shared or MPI_Win_shared_query (a minimal code sketch follows after this list).
    Especially in the case of load/store access from remote processes, this requires that the target does not migrate and thereby invalidate the baseptr, as this pointer is used outside of MPI.

    The standard here puts the burden onto the user, without help from MPI, cf. the advice to users for MPI_Comm_split_type with MPI_COMM_TYPE_SHARED on p.339:

    It is the user’s responsibility to ensure that the communicator comm represents a group of processes that
    can create a shared memory segment that can be accessed by all processes in the group.
    (MPI 4.0, p.559)

  • MPI_Get_processor_name:
    Is allowed to return different values with each call, but this has no negative effect, as the result is purely informative for the user.

  • MPI_Sessions:
    TBD
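
As an illustration of the MPI shared memory item above, the following is a minimal C sketch, not taken from the issue text, of the common MPI_Comm_split_type / MPI_Win_allocate_shared / MPI_Win_shared_query pattern. The base pointer obtained for a neighbouring rank is dereferenced with plain loads and stores, entirely outside of MPI, and this is exactly the access that process migration would invalidate.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Group the processes that can share memory at this moment. */
        MPI_Comm shmcomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);

        int rank, size;
        MPI_Comm_rank(shmcomm, &rank);
        MPI_Comm_size(shmcomm, &size);

        /* Every process contributes one int to the shared window. */
        int *mybase;
        MPI_Win win;
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                shmcomm, &mybase, &win);

        MPI_Win_lock_all(0, win);
        *mybase = rank;                 /* plain store, no MPI call involved */
        MPI_Win_sync(win);
        MPI_Barrier(shmcomm);
        MPI_Win_sync(win);

        /* Query the neighbour's base pointer and read it with a plain load.
         * This pointer is only meaningful while both processes stay within
         * the same shared-memory domain. */
        int *peerbase;
        MPI_Aint peersize;
        int peerdisp;
        MPI_Win_shared_query(win, (rank + 1) % size, &peersize, &peerdisp,
                             &peerbase);
        printf("rank %d sees neighbour value %d\n", rank, *peerbase);

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Comm_free(&shmcomm);
        MPI_Finalize();
        return 0;
    }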

Proposal

We should address this in some way. Possible ways:

  • Add new procedure to bind MPI processes to physical resources

    int MPI_Process_bind(resource_specifier, info)
    

    TODO: This seems to have been proposed before, add reference to the old proposal

  • Handle at communicator creation via info argument

    • Provide a binding specification in an info object to the procedures that create communicators (a sketch of this option follows after this list), e.g.,
      info { 'resources_bound': 'true' }

      PRO: MPI_Win_allocate_shared can check if provided communicator was created with the appropriate info/property
      PRO: user has control
      PRO: uses existing info interface

      CON: binding remains optional for the user, so relevant code paths still require checks even for communicators that are intended to be bound
      CON: requires boilerplate code for users to create info object
      CON: not all communicator creation functions provide an info argument as input

  • Handle at the communicator level as a hidden communicator property

    • communicators provide a hidden binding feature/property
      -> "High quality implementations will ensure that the created communicator ...
      by binding the MPI processes in the group of the communicator to appropriate physical processes/resources."

    PRO: MPI_Win_allocate_shared can check if provided communicator was created with the appropriate info/property

    CON: Not clear if MPI_Win_allocate_shared should fail or show the old behaviour
    CON: Users cannot
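
To make the info-based option above more concrete, here is a hedged C sketch of how such a hint could be passed and later queried. The key "resources_bound" is the one suggested in this issue, not a key defined by the MPI standard, so current implementations would simply ignore it; note also that MPI_Comm_get_info only returns hints the implementation understood, which ties into the CONs listed above.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Ask (hypothetically) for a communicator whose processes stay bound
         * to their hardware resources.  "resources_bound" is the key proposed
         * in this issue, not part of MPI 4.0. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "resources_bound", "true");

        MPI_Comm shmcomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            info, &shmcomm);
        MPI_Info_free(&info);

        /* A later consumer such as MPI_Win_allocate_shared could check the
         * communicator for the hint.  Hints that the implementation did not
         * understand are not returned by MPI_Comm_get_info. */
        MPI_Info used;
        char value[16];
        int flag = 0;
        MPI_Comm_get_info(shmcomm, &used);
        MPI_Info_get(used, "resources_bound", sizeof(value) - 1, value, &flag);
        if (flag && strcmp(value, "true") == 0)
            printf("communicator claims bound resources\n");
        MPI_Info_free(&used);

        MPI_Comm_free(&shmcomm);
        MPI_Finalize();
        return 0;
    }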

Changes to the Text

TBD
Processes with the same name in a pset?

Impact on Implementations

  • Will already have some kind of related hardware query functionality
  • Can allow optimizations in several functionalities if they ensure binding
  • If necessary add new interfaces

Impact on Users

  • Clarify and ensure intended usage of MPI functions

References and Pull Requests

TBD

Notes from discussions in the WG so far:

  • Solving the binding question at the global level seems not possible, as it would lead to a too wide scope for sessions, etc.
  • Solving it at the communicator level may be too narrow and lead to conflicts, e.g., MPI processes in two communicators whose ranks can be translated between both, while both communicators have competing binding requirements.
  • MPI_Win_shared is broken in general - so should we care? Its issues are also neglected in the fault tolerance approaches.
  • The binding problem seems to be one part of a larger conceptual problem: MPI follows the CSP model while facing requirements that ask for an agent-based model.