rdma: Support early completion of recv() requests #797
There's something missing in this patch. The receiver needs to specify whether to use early completion or not for each request. There's no change to request header data in this patch, so we're clearly not doing that.
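One way to carry that per-request choice is sketched below, under assumptions: the receiver sets a flag bit in the control message it sends, and the sender reads it back before choosing how to write. The struct `example_ctrl_msg`, its fields, the flag value, and both helper names are hypothetical and do not reflect the plugin's actual header layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; NOT part of the plugin's real wire format. */
#define EXAMPLE_CTRL_FLAG_EARLY_COMPLETION 0x1u

/* Hypothetical control message sent from receiver to sender. */
struct example_ctrl_msg {
    uint16_t msg_seq_num; /* sequence number of the matching recv */
    uint16_t flags;       /* per-request option bits */
    uint64_t buff_addr;   /* destination buffer address */
    uint64_t buff_len;    /* destination buffer length in bytes */
};

/* Receiver side: record the per-request decision in the header. */
void example_ctrl_msg_init(struct example_ctrl_msg *msg, uint16_t seq,
                           uint64_t addr, uint64_t len, bool early_completion)
{
    assert(msg != NULL);
    msg->msg_seq_num = seq;
    msg->flags = early_completion ? EXAMPLE_CTRL_FLAG_EARLY_COMPLETION : 0u;
    msg->buff_addr = addr;
    msg->buff_len = len;
}

/* Sender side: read the decision back for this specific request. */
bool example_ctrl_msg_wants_early_completion(const struct example_ctrl_msg *msg)
{
    assert(msg != NULL);
    return (msg->flags & EXAMPLE_CTRL_FLAG_EARLY_COMPLETION) != 0;
}
```

Putting the bit in the control message keeps the decision per request, so both modes can coexist on the same communicator.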
Looks like a rebase is needed.
Plugin Optimization Details:
This change enables early completion by default when the data progress model is FI_PROGRESS_AUTO.

Receiver Side:
- Marks the request complete immediately after the CTRL message send completes
- Does not wait for the RDMA write operation to complete

Sender Side:
- Uses fi_write instead of fi_writedata to avoid generating unnecessary CQ entries on the RX side

Requirements:
- Eager message mode is disabled: OFI_NCCL_EAGER_MAX_SIZE == -1 (with the plugin version at the time of this PR, eager mode is disabled by default)
- The provider must use the FI_PROGRESS_AUTO data progress model
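The rules in this commit message can be sketched as a few small predicates. This is a minimal illustration, not the plugin's code: the enums stand in for libfabric's progress model values and for the fi_write / fi_writedata choice, and `example_recv_req` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; real code would compare against FI_PROGRESS_AUTO. */
enum example_progress { EXAMPLE_PROGRESS_AUTO, EXAMPLE_PROGRESS_MANUAL };
/* Hypothetical tags for the two libfabric write calls named above. */
enum example_write_op { EXAMPLE_OP_WRITE, EXAMPLE_OP_WRITEDATA };

/* Hypothetical receive request state. */
struct example_recv_req {
    bool early_completion; /* decided when the request is posted */
    bool ctrl_send_done;   /* CTRL message send completion observed */
    bool write_done;       /* sender's RDMA write observed on the RX CQ */
};

/* Requirements: automatic data progress and eager mode disabled (-1). */
bool example_early_completion_allowed(enum example_progress data_progress,
                                      long eager_max_size)
{
    return data_progress == EXAMPLE_PROGRESS_AUTO && eager_max_size == -1;
}

/* Sender: a plain write leaves no CQ entry on the receiver. */
enum example_write_op example_select_write_op(bool early_completion)
{
    return early_completion ? EXAMPLE_OP_WRITE : EXAMPLE_OP_WRITEDATA;
}

/* Receiver: with early completion, only the CTRL send must finish. */
bool example_recv_is_complete(const struct example_recv_req *req)
{
    assert(req != NULL);
    if (!req->ctrl_send_done)
        return false;
    return req->early_completion || req->write_done;
}
```

Keeping the eligibility check separate from the per-request state makes it easy to fall back to the regular path when either requirement is not met.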
bot:aws:retest
AWS CI failed due to an infrastructure issue on the 4_g4dn_ubuntu2204 Jenkins job. Other platforms are passing.
Description of changes:
Background:
When using the LL (Low Latency) or LL128 protocols, NCCL sets the request pointer to NCCL_NET_OPTIONAL_RECV_COMPLETION in irecv() calls. This indicates that the plugin may complete a receive request early, without explicitly polling the CQ to confirm data arrival. This is safe because NCCL itself, following LL protocol semantics, validates data arrival by checking the flag bytes.
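A minimal sketch of how a plugin might detect this hint follows. It assumes the hint arrives as a sentinel value stored in the request pointer slot that irecv() receives; `EXAMPLE_OPTIONAL_RECV_COMPLETION` is a stand-in with an assumed value, not the real NCCL_NET_OPTIONAL_RECV_COMPLETION definition, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in sentinel; the value is assumed for illustration only. */
#define EXAMPLE_OPTIONAL_RECV_COMPLETION ((void *)0x1)

/*
 * `request` is the in/out slot passed to irecv(). The hint must be read
 * before the plugin overwrites the slot with its own request handle.
 */
bool example_recv_completion_is_optional(void **request)
{
    assert(request != NULL);
    return *request == EXAMPLE_OPTIONAL_RECV_COMPLETION;
}
```

A caller would evaluate this once at the top of its irecv() implementation and store the result on the request it allocates.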
Plugin Optimization Details:
This change enables early completion by default when the data progress model is FI_PROGRESS_AUTO.
Requirements:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.