Fix 0 byte sends/receives with user buffer registration #804

bwbarrett · 2025-03-08T20:31:08Z

The primary goal of this patch series is to fix another bug introduced with #614 that #794 didn't fix, specifically an issue with how NCCL 2.24 and earlier behaved with 0 byte send/recv from user registered buffers. In those configurations, NCCL would pass an mhandle that was valid but not one that covered the base pointer of the buffer. This patch fixes that by using the flush buffer for the source / dest of 0 byte send / recv, so that we always have a valid MR for the buffer.

That was simplest by moving the flush buffer into the domain, so that we didn't have to add a ton of flush buffers, one for each new communicator. However, that simplified the code just enough that static analysis suddenly found lots of existing leaks, potential leaks, and use after frees, which required some significant code cleanup in the reg_mr and dereg_mr code paths. So this patch series grew significantly longer than originally anticipated.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The rdma transport does not support FI_MR_ENDPOINT, so there is no reason to have every memory region pinned twice per domain (one on the control endpoint and one on the data endpoint). This also happily simplifies the code. Signed-off-by: Brian Barrett <[email protected]>

The fid_domain pointer in the endpoint is a remnant of the world before we had a domain pointer. We can easily get the domain object anywhere we need the fid_domain, so remove the endpoint fid_domain. Signed-off-by: Brian Barrett <[email protected]>

The rdma transport doesn't support FI_MR_ENDPOINT, so swap all the MR oriented interfaces to be domain-oriented instead of endpoint-oriented. This cleans up the code a little bit, but also makes it considerably easier to add future features, like reducing the number of flush buffers. Signed-off-by: Brian Barrett <[email protected]>

Rather than passing fields in the domain into various MR helper functions, just pass the domain itself. Simplifies the interface, especially as we clean up needed error paths.

Static analysis is really complaining about our error cleanup paths for MRs, which were pretty well spread out and convoluted anyway. Flatten the call paths for registration and deregistration, move all the cleanup to just calling dereg_mr(), and hopefully keep the static analysis tools from getting angry again. Signed-off-by: Brian Barrett <[email protected]>

Since MRs are entirely domain-oriented and we don't care about the data in the flush buffer, move the flush buffer to the domain. It reduces the number of pinned pages and simplifies some later work to fix 0-byte send and receives. Signed-off-by: Brian Barrett <[email protected]>

Static analysis found a path where we could try to unmap a buffer that didn't exist because map had failed. I think we'd have aborted anyway, but handle that case. Signed-off-by: Brian Barrett <[email protected]>

NCCL versions prior to 2.24 take a shortcut when using user buffer registration and pass the user's buffer but the mhandle for the channel buffer. This saves a lookup in the MR cache and is legal enough on IB because the SGE data isn't looked at for a 0 byte transfer in IB. But EFA does enforce all the SGE fields on a 0 byte transfer, so a recent change to not always use eager messages for 0 byte transfers broke 0 byte user registered buffer cases. This patch fixes that case by overwriting both the base buffer and the mhandle to use the domain's flush buffer. Since no data will actually be transfered, this won't cause data corruption issues, but does provide a buffer and mhandle that match. Signed-off-by: Brian Barrett <[email protected]>

rajachan

With this change, we can just drop the FI_MR_ENDPOINT piece of https://github.com/aws/aws-ofi-nccl/blob/26b3d5929f15d1f6b4c2a0b8feb98d47e21d4b5e/src/nccl_ofi_net.c#L467-#L482 then since we fail init for providers that require it anyway.

bwbarrett · 2025-03-10T18:17:17Z

With this change, we can just drop the FI_MR_ENDPOINT piece of https://github.com/aws/aws-ofi-nccl/blob/26b3d5929f15d1f6b4c2a0b8feb98d47e21d4b5e/src/nccl_ofi_net.c#L467-#L482 then since we fail init for providers that require it anyway.

I only dropped FI_MR_ENDPOINT in the RDMA transport. The sendrecv transport still supports FI_MR_ENDPOINT. Given that CXI only supports FI_MR_ENDPOINT, we probably shouldn't drop that for send/recv.

bwbarrett added 2 commits March 7, 2025 18:11

bwbarrett requested a review from a team as a code owner March 8, 2025 20:31

bwbarrett added 6 commits March 8, 2025 23:42

rdma: Simplify MR argument passing

083638f

Rather than passing fields in the domain into various MR helper functions, just pass the domain itself. Simplifies the interface, especially as we clean up needed error paths.

rdma: Fix flush buffer cleanup

801bf33

Static analysis found a path where we could try to unmap a buffer that didn't exist because map had failed. I think we'd have aborted anyway, but handle that case. Signed-off-by: Brian Barrett <[email protected]>

bwbarrett force-pushed the bugfix/eager-throwing-invalid-mr branch from 54ef352 to 61b16e7 Compare March 8, 2025 23:51

rauteric approved these changes Mar 10, 2025

View reviewed changes

rajachan approved these changes Mar 10, 2025

View reviewed changes

bwbarrett merged commit 274d84e into aws:master Mar 10, 2025
23 checks passed

bwbarrett deleted the bugfix/eager-throwing-invalid-mr branch March 10, 2025 18:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix 0 byte sends/receives with user buffer registration #804

Fix 0 byte sends/receives with user buffer registration #804

bwbarrett commented Mar 8, 2025

rajachan left a comment

bwbarrett commented Mar 10, 2025

Fix 0 byte sends/receives with user buffer registration #804

Fix 0 byte sends/receives with user buffer registration #804

Conversation

bwbarrett commented Mar 8, 2025

rajachan left a comment

Choose a reason for hiding this comment

bwbarrett commented Mar 10, 2025