-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(rpc): io_uring integration & redesign #477
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
4f8ccba
to
6b56293
Compare
9f42c0c
to
22c4f6d
Compare
72c0b7f
to
945988d
Compare
922a7bb
to
8fdab03
Compare
6195797
to
ecfe590
Compare
And re-organize some methods based on that change
Do not exit for *any* errors that are specific to the related connection, simply free them and continue to the next CQE. Specifically in the case of `error.SubmissionQueueFull`, instead of immediately failing, we instead first try to flush the submission queue and then try again to submit; if it fails a second time, that means despite flushing the submission queue, it somehow still failed, so we panic, since this indicates something is *very* wrong. This also eliminates the `pending_cqes_buf`, since there is actually no situation in which `consumeOurCqe` returns an error, and we resume work afterwards - either we process all the received CQEs, or we hard exit - this was already essentially the case before, now it's more obvious. For the main submit, we now wait for at least 1 connection, but we also add a timeout SQE to make it terminate if we don't receive a connection or completion of another task for 1 second; this alleviates the busy loop that was running before.
Also slightly refactor error sets. Now instead of checking to see if we need to set a flag to re-queue the multishot accept, we just pass in the server context on init and queue it, which now makes sense since the context and workpool are separate.
Also add fix for rebase
* Move more specific functions to the only files they're used. * Move the `serve*` functions outside of `Context`, making them free functions which just accept the context and work pool. * Remove `acceptAndServeConnection`; originally this was required to be able to nicely structure the unit test, and used to be more integrated, however it no longer makes sense as a concept. * Inline `handleRequest` into the basic backend. * Make the `acceptHandled` function, moved into the basic backend, guarantee the specified `sync` behavior, and inline `have_accept4`. * Appropriately re-export the relevant parts of the server API. * Added top level doc comments.
And disable the rpc server test when it is enabled
c70555e
to
ef620c8
Compare
0xNineteen
previously approved these changes
Feb 10, 2025
ef620c8
to
fb5963e
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
|
0xNineteen
approved these changes
Feb 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing work
Rexicon226
pushed a commit
that referenced
this pull request
Feb 11, 2025
* RPC server: io_uring upgrade Separates the server into two parts: the context, and the work pool; the context contains everything generally needed to run the server, the work pool contains a statically polymorphic implementation for a pool to dispatch the actual work to. In doing this, we also separate certain things out into a few different files. The RPC server context API has been modified slightly to reflect this, and the work pool directly exposed, for now. * Don't use file-as-struct * Run style script, respect line length limit * Improve accept failure handling & update TODOs * Handle potentially failing/cancelling of `accept_multishot` by re-queueing it, based on the `IORING_CQE_F_MORE` flag. * Revise/simplify the queueing logic for the `accept_multishot` SQE. * Resolve the EINTR TODO panics, returning a catch-all error value indicating it as a bad but non-critical error. * Update the `a: ?noreturn` `if (a) |*b|` TODO, adding that it's solved in 0.14; it should be resolved after we update to 0.14. * Unify EAGAIN panic message. * Add TODO to remove hacky-ish workaround * Use `self: Type` convention * A few minor fixups and improvements * Simplify test, make server socket nonblocking On MacOS, on basic WorkPool, this means we now need to manually set the accepted socket's flags to the right things, ie, blocking, as opposed to the server socket's nonblocking mode. Means we also have to handle EAGAIN a bit differently in the io_uring backend, but that's a fine tradeoff. * server.zig -> server/lib.zig * Segregate out the basic backend And re-organize some methods based on that change * Re-organize server module * Update LOGGER_SCOPE * Simplify & improve io_uring backend error handling * De-scope `accept_flags` * Simplify `can_use` for linux cross-compilation * (io_uring) Rework error handling, add timeout Do not exit for *any* errors that are specific to the related connection, simply free them and continue to the next CQE. Specifically in the case of `error.SubmissionQueueFull`, instead of immediately failing, we instead first try to flush the submission queue and then try again to submit; if it fails a second time, that means despite flushing the submission queue, it somehow still failed, so we panic, since this indicates something is *very* wrong. This also eliminates the `pending_cqes_buf`, since there is actually no situation in which `consumeOurCqe` returns an error, and we resume work afterwards - either we process all the received CQEs, or we hard exit - this was already essentially the case before, now it's more obvious. For the main submit, we now wait for at least 1 connection, but we also add a timeout SQE to make it terminate if we don't receive a connection or completion of another task for 1 second; this alleviates the busy loop that was running before. * (io_uring) Remove multishot_accept_submitted Also slightly refactor error sets. Now instead of checking to see if we need to set a flag to re-queue the multishot accept, we just pass in the server context on init and queue it, which now makes sense since the context and workpool are separate. * (io_uring) Simplify new entry creation Also add fix for rebase * Misc fixups * Re-organize alias/import * General restructure * Move more specific functions to the only files they're used. * Move the `serve*` functions outside of `Context`, making them free functions which just accept the context and work pool. * Remove `acceptAndServeConnection`; originally this was required to be able to nicely structure the unit test, and used to be more integrated, however it no longer makes sense as a concept. * Inline `handleRequest` into the basic backend. * Make the `acceptHandled` function, moved into the basic backend, guarantee the specified `sync` behavior, and inline `have_accept4`. * Appropriately re-export the relevant parts of the server API. * Added top level doc comments. * Re-oorganize loggers & scopes * Refactor `build_options` imports * Add `no_network_tests` build option And disable the rpc server test when it is enabled * Update circleci with `-Dno-network-tests`
dadepo
pushed a commit
that referenced
this pull request
Feb 13, 2025
* RPC server: io_uring upgrade Separates the server into two parts: the context, and the work pool; the context contains everything generally needed to run the server, the work pool contains a statically polymorphic implementation for a pool to dispatch the actual work to. In doing this, we also separate certain things out into a few different files. The RPC server context API has been modified slightly to reflect this, and the work pool directly exposed, for now. * Don't use file-as-struct * Run style script, respect line length limit * Improve accept failure handling & update TODOs * Handle potentially failing/cancelling of `accept_multishot` by re-queueing it, based on the `IORING_CQE_F_MORE` flag. * Revise/simplify the queueing logic for the `accept_multishot` SQE. * Resolve the EINTR TODO panics, returning a catch-all error value indicating it as a bad but non-critical error. * Update the `a: ?noreturn` `if (a) |*b|` TODO, adding that it's solved in 0.14; it should be resolved after we update to 0.14. * Unify EAGAIN panic message. * Add TODO to remove hacky-ish workaround * Use `self: Type` convention * A few minor fixups and improvements * Simplify test, make server socket nonblocking On MacOS, on basic WorkPool, this means we now need to manually set the accepted socket's flags to the right things, ie, blocking, as opposed to the server socket's nonblocking mode. Means we also have to handle EAGAIN a bit differently in the io_uring backend, but that's a fine tradeoff. * server.zig -> server/lib.zig * Segregate out the basic backend And re-organize some methods based on that change * Re-organize server module * Update LOGGER_SCOPE * Simplify & improve io_uring backend error handling * De-scope `accept_flags` * Simplify `can_use` for linux cross-compilation * (io_uring) Rework error handling, add timeout Do not exit for *any* errors that are specific to the related connection, simply free them and continue to the next CQE. Specifically in the case of `error.SubmissionQueueFull`, instead of immediately failing, we instead first try to flush the submission queue and then try again to submit; if it fails a second time, that means despite flushing the submission queue, it somehow still failed, so we panic, since this indicates something is *very* wrong. This also eliminates the `pending_cqes_buf`, since there is actually no situation in which `consumeOurCqe` returns an error, and we resume work afterwards - either we process all the received CQEs, or we hard exit - this was already essentially the case before, now it's more obvious. For the main submit, we now wait for at least 1 connection, but we also add a timeout SQE to make it terminate if we don't receive a connection or completion of another task for 1 second; this alleviates the busy loop that was running before. * (io_uring) Remove multishot_accept_submitted Also slightly refactor error sets. Now instead of checking to see if we need to set a flag to re-queue the multishot accept, we just pass in the server context on init and queue it, which now makes sense since the context and workpool are separate. * (io_uring) Simplify new entry creation Also add fix for rebase * Misc fixups * Re-organize alias/import * General restructure * Move more specific functions to the only files they're used. * Move the `serve*` functions outside of `Context`, making them free functions which just accept the context and work pool. * Remove `acceptAndServeConnection`; originally this was required to be able to nicely structure the unit test, and used to be more integrated, however it no longer makes sense as a concept. * Inline `handleRequest` into the basic backend. * Make the `acceptHandled` function, moved into the basic backend, guarantee the specified `sync` behavior, and inline `have_accept4`. * Appropriately re-export the relevant parts of the server API. * Added top level doc comments. * Re-oorganize loggers & scopes * Refactor `build_options` imports * Add `no_network_tests` build option And disable the rpc server test when it is enabled * Update circleci with `-Dno-network-tests`
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
first_error
error handling if practicalfollow up work (not part of this pr): #517