Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.14.x: Cherry pick commits from master #806

Merged
merged 13 commits into from
Mar 12, 2025
Merged

Conversation

yexiang-aws
Copy link
Contributor

Description of changes:

Cherry-picked below commits into 1.14.x branch

9d90c0e  rdma: Support early completion of recv() requests
4117397 ofi: add data progress model detection
a20854e Remove support for not creating local MRs
167e6dc rdma: Overwrite buffers for 0 byte messages
161cb12 rdma: Fix flush buffer cleanup
4195fa8 rdma: Move flush buffer to domain object
21ac264 rdma: Simplify MR paths
f898cb9 rdma: Simplify MR argument passing
d22b961 rdma: Make MRs be domain oriented (instead of EP)
8ee095e rdma: Remove ofi domain pointer from endpoint
4bfb9e0 rdma: Remove control rail MRs from MR
1c3b82f .ci/aws: Configure PortaFiducia to run in lean setup
ba5850a rdma: Disable eager by default

skipped following from master

71a1e68 neuron: add inf2e.32xlarge
26ed86b rdma: Fix request lookup on error

only conflict that needed manual resolution was in rdma: Support early completion of recv() requests, when calling fi_writedata, there was a context2 change in master that needed to remove.

Testing:
nccl-test pass against both efa-rdm and efa-direct branch.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

bwbarrett and others added 13 commits March 12, 2025 01:38
We don't think we need eager with the performance improvements on P5 and
P5en, so set the default to -1 (disabled).

Signed-off-by: Brian Barrett <[email protected]>
This patch eliminates PortaFiducia's setups that
are unncessary when running tests on containers.

Signed-off-by: Vishwas Dsouza <[email protected]>
The rdma transport does not support FI_MR_ENDPOINT, so there is no
reason to have every memory region pinned twice per domain (one
on the control endpoint and one on the data endpoint).  This also
happily simplifies the code.

Signed-off-by: Brian Barrett <[email protected]>
The fid_domain pointer in the endpoint is a remnant of the world before
we had a domain pointer.  We can easily get the domain object anywhere
we need the fid_domain, so remove the endpoint fid_domain.

Signed-off-by: Brian Barrett <[email protected]>
The rdma transport doesn't support FI_MR_ENDPOINT, so swap all the MR
oriented interfaces to be domain-oriented instead of endpoint-oriented.
This cleans up the code a little bit, but also makes it considerably
easier to add future features, like reducing the number of flush
buffers.

Signed-off-by: Brian Barrett <[email protected]>
Rather than passing fields in the domain into various MR helper
functions, just pass the domain itself.  Simplifies the interface,
especially as we clean up needed error paths.
Static analysis is really complaining about our error cleanup paths for
MRs, which were pretty well spread out and convoluted anyway.  Flatten
the call paths for registration and deregistration, move all the
cleanup to just calling dereg_mr(), and hopefully keep the static
analysis tools from getting angry again.

Signed-off-by: Brian Barrett <[email protected]>
Since MRs are entirely domain-oriented and we don't care about the
data in the flush buffer, move the flush buffer to the domain.  It
reduces the number of pinned pages and simplifies some later work to
fix 0-byte send and receives.

Signed-off-by: Brian Barrett <[email protected]>
Static analysis found a path where we could try to unmap a buffer
that didn't exist because map had failed.  I think we'd have aborted
anyway, but handle that case.

Signed-off-by: Brian Barrett <[email protected]>
NCCL versions prior to 2.24 take a shortcut when using user buffer
registration and pass the user's buffer but the mhandle for the
channel buffer.  This saves a lookup in the MR cache and is legal
enough on IB because the SGE data isn't looked at for a 0 byte transfer
in IB.  But EFA does enforce all the SGE fields on a 0 byte transfer,
so a recent change to not always use eager messages for 0 byte
transfers broke 0 byte user registered buffer cases.

This patch fixes that case by overwriting both the base buffer and the
mhandle to use the domain's flush buffer.  Since no data will actually
be transfered, this won't cause data corruption issues, but does provide
a buffer and mhandle that match.

Signed-off-by: Brian Barrett <[email protected]>
Commits 27548d4 and a1ac0f5 fixed providers that required memory
registrations when FI_MR_LOCAL was set, but also broke providers that
clear FI_MR_LOCAL (such as HPE's provider), because I did not account
for the mr_local handling in the send/recv transport.

We don't have a great way to test that case from AWS, the vast majority
of transfers will either be HMEM (which creates an MR anyway) or control
messages (which have a freelist which creates an MR), and the Libfabric
specification allows passing an MR descriptor to a local operation even
if the provider clears the FI_MR_LOCAL bit.  Therefore, the best path
forward seems to be removing the code to skip registration if
FI_MR_LOCAL is cleared, and always creating an MR.

Signed-off-by: Brian Barrett <[email protected]>
Plugin Optimization Details:
This change set early completion by default to be enabled when
data progress model is FI_PROGRESS_AUTO.

Receiver Side:
- Marks request completion immediately after CTRL message send completion
- Does not wait for RDMA write operation completion
Sender Side:
- Uses fi_write instead of fi_writedata, to eliminate unnecessary CQ
entries on RX side

Requirements:
- Eager msg mode is disabled: OFI_NCCL_EAGER_MAX_SIZE == -1. (With the
plugin version at the time of this PR, by default, eager mode is disabled)
- Provider must use FI_PROGRESS_AUTO data progress model
@yexiang-aws yexiang-aws requested review from a-szegel, bwbarrett and a team as code owners March 12, 2025 02:10
@rajachan rajachan merged commit 800ff60 into aws:v1.14.x Mar 12, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants