Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 0 byte sends/receives with user buffer registration #804

Merged
merged 8 commits into from
Mar 10, 2025

Conversation

bwbarrett
Copy link
Contributor

The primary goal of this patch series is to fix another bug introduced with #614 that #794 didn't fix, specifically an issue with how NCCL 2.24 and earlier behaved with 0 byte send/recv from user registered buffers. In those configurations, NCCL would pass an mhandle that was valid but not one that covered the base pointer of the buffer. This patch fixes that by using the flush buffer for the source / dest of 0 byte send / recv, so that we always have a valid MR for the buffer.

That was simplest by moving the flush buffer into the domain, so that we didn't have to add a ton of flush buffers, one for each new communicator. However, that simplified the code just enough that static analysis suddenly found lots of existing leaks, potential leaks, and use after frees, which required some significant code cleanup in the reg_mr and dereg_mr code paths. So this patch series grew significantly longer than originally anticipated.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The rdma transport does not support FI_MR_ENDPOINT, so there is no
reason to have every memory region pinned twice per domain (one
on the control endpoint and one on the data endpoint).  This also
happily simplifies the code.

Signed-off-by: Brian Barrett <[email protected]>
The fid_domain pointer in the endpoint is a remnant of the world before
we had a domain pointer.  We can easily get the domain object anywhere
we need the fid_domain, so remove the endpoint fid_domain.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett bwbarrett requested a review from a team as a code owner March 8, 2025 20:31
The rdma transport doesn't support FI_MR_ENDPOINT, so swap all the MR
oriented interfaces to be domain-oriented instead of endpoint-oriented.
This cleans up the code a little bit, but also makes it considerably
easier to add future features, like reducing the number of flush
buffers.

Signed-off-by: Brian Barrett <[email protected]>
Rather than passing fields in the domain into various MR helper
functions, just pass the domain itself.  Simplifies the interface,
especially as we clean up needed error paths.
Static analysis is really complaining about our error cleanup paths for
MRs, which were pretty well spread out and convoluted anyway.  Flatten
the call paths for registration and deregistration, move all the
cleanup to just calling dereg_mr(), and hopefully keep the static
analysis tools from getting angry again.

Signed-off-by: Brian Barrett <[email protected]>
Since MRs are entirely domain-oriented and we don't care about the
data in the flush buffer, move the flush buffer to the domain.  It
reduces the number of pinned pages and simplifies some later work to
fix 0-byte send and receives.

Signed-off-by: Brian Barrett <[email protected]>
Static analysis found a path where we could try to unmap a buffer
that didn't exist because map had failed.  I think we'd have aborted
anyway, but handle that case.

Signed-off-by: Brian Barrett <[email protected]>
NCCL versions prior to 2.24 take a shortcut when using user buffer
registration and pass the user's buffer but the mhandle for the
channel buffer.  This saves a lookup in the MR cache and is legal
enough on IB because the SGE data isn't looked at for a 0 byte transfer
in IB.  But EFA does enforce all the SGE fields on a 0 byte transfer,
so a recent change to not always use eager messages for 0 byte
transfers broke 0 byte user registered buffer cases.

This patch fixes that case by overwriting both the base buffer and the
mhandle to use the domain's flush buffer.  Since no data will actually
be transfered, this won't cause data corruption issues, but does provide
a buffer and mhandle that match.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett bwbarrett force-pushed the bugfix/eager-throwing-invalid-mr branch from 54ef352 to 61b16e7 Compare March 8, 2025 23:51
Copy link
Member

@rajachan rajachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, we can just drop the FI_MR_ENDPOINT piece of https://github.com/aws/aws-ofi-nccl/blob/26b3d5929f15d1f6b4c2a0b8feb98d47e21d4b5e/src/nccl_ofi_net.c#L467-#L482 then since we fail init for providers that require it anyway.

@bwbarrett
Copy link
Contributor Author

With this change, we can just drop the FI_MR_ENDPOINT piece of https://github.com/aws/aws-ofi-nccl/blob/26b3d5929f15d1f6b4c2a0b8feb98d47e21d4b5e/src/nccl_ofi_net.c#L467-#L482 then since we fail init for providers that require it anyway.

I only dropped FI_MR_ENDPOINT in the RDMA transport. The sendrecv transport still supports FI_MR_ENDPOINT. Given that CXI only supports FI_MR_ENDPOINT, we probably shouldn't drop that for send/recv.

@bwbarrett bwbarrett merged commit 274d84e into aws:master Mar 10, 2025
23 checks passed
@bwbarrett bwbarrett deleted the bugfix/eager-throwing-invalid-mr branch March 10, 2025 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants