Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(rpc): io_uring integration & redesign #477

Merged
merged 25 commits into from
Feb 10, 2025
Merged

Conversation

InKryption
Copy link
Contributor

@InKryption InKryption commented Jan 3, 2025

  • get receive working
  • get send working
  • error handling:
    • eliminate explicit TODO panics
    • improve first_error error handling if practical

follow up work (not part of this pr): #517

@InKryption InKryption self-assigned this Jan 3, 2025
@0xNineteen 0xNineteen changed the title RPC server: io_uring upgrade WIP perf(rpc): io_uring WIP Jan 3, 2025
@InKryption InKryption force-pushed the ink/rpc-server-optimize branch 6 times, most recently from 4f8ccba to 6b56293 Compare January 14, 2025 20:36
@InKryption InKryption force-pushed the ink/rpc-server-optimize branch 6 times, most recently from 9f42c0c to 22c4f6d Compare January 20, 2025 19:09
@InKryption InKryption changed the title perf(rpc): io_uring WIP perf(rpc): io_uring integration & redesign Jan 20, 2025
@InKryption InKryption force-pushed the ink/rpc-server-optimize branch from 72c0b7f to 945988d Compare January 20, 2025 19:25
@InKryption InKryption marked this pull request as ready for review January 21, 2025 10:33
@InKryption InKryption force-pushed the ink/rpc-server-optimize branch 2 times, most recently from 922a7bb to 8fdab03 Compare January 23, 2025 05:24
@0xNineteen 0xNineteen requested a review from kprotty January 23, 2025 16:38
@InKryption InKryption force-pushed the ink/rpc-server-optimize branch 7 times, most recently from 6195797 to ecfe590 Compare January 24, 2025 21:37
And re-organize some methods based on that change
Do not exit for *any* errors that are specific to the
related connection, simply free them and continue to the next CQE.

Specifically in the case of `error.SubmissionQueueFull`, instead of
immediately failing, we instead first try to flush the submission queue
and then try again to submit; if it fails a second time, that means
despite flushing the submission queue, it somehow still failed, so
we panic, since this indicates something is *very* wrong.

This also eliminates the `pending_cqes_buf`, since there is actually
no situation in which `consumeOurCqe` returns an error, and we resume
work afterwards - either we process all the received CQEs, or we hard
exit - this was already essentially the case before, now it's more
obvious.

For the main submit, we now wait for at least 1 connection, but we
also add a timeout SQE to make it terminate if we don't receive a
connection or completion of another task for 1 second; this alleviates
the busy loop that was running before.
Also slightly refactor error sets.
Now instead of checking to see if we need to set a flag to re-queue
the multishot accept, we just pass in the server context on init
and queue it, which now makes sense since the context and workpool
are separate.
* Move more specific functions to the only files they're used.
* Move the `serve*` functions outside of `Context`, making them
  free functions which just accept the context and work pool.
* Remove `acceptAndServeConnection`; originally this was required to
  be able to nicely structure the unit test, and used to be more
  integrated, however it no longer makes sense as a concept.
* Inline `handleRequest` into the basic backend.
* Make the `acceptHandled` function, moved into the basic backend,
  guarantee the specified `sync` behavior, and inline `have_accept4`.
* Appropriately re-export the relevant parts of the server API.
* Added top level doc comments.
And disable the rpc server test when it is enabled
@InKryption InKryption force-pushed the ink/rpc-server-optimize branch from c70555e to ef620c8 Compare February 10, 2025 18:47
0xNineteen
0xNineteen previously approved these changes Feb 10, 2025
Copy link

codecov bot commented Feb 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
src/accountsdb/db.zig 88.98% <ø> (ø)
src/accountsdb/snapshots.zig 88.84% <100.00%> (ø)
src/ledger/blockstore.zig 100.00% <ø> (ø)
src/ledger/cleanup_service.zig 84.04% <ø> (ø)
src/ledger/database/hashmap.zig 91.66% <ø> (ø)
src/ledger/database/rocksdb.zig 95.58% <ø> (ø)
src/ledger/fuzz.zig 0.00% <ø> (ø)
src/rpc/server/server.zig 100.00% <100.00%> (ø)
src/utils/fmt.zig 86.66% <ø> (ø)
src/utils/io.zig 96.66% <100.00%> (+4.66%) ⬆️

Copy link
Contributor

@0xNineteen 0xNineteen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing work :shipit:

@0xNineteen 0xNineteen added this pull request to the merge queue Feb 10, 2025
Merged via the queue into main with commit 694aa99 Feb 10, 2025
16 checks passed
@0xNineteen 0xNineteen deleted the ink/rpc-server-optimize branch February 10, 2025 19:54
Rexicon226 pushed a commit that referenced this pull request Feb 11, 2025
* RPC server: io_uring upgrade
Separates the server into two parts: the context, and the work pool;
the context contains everything generally needed to run the server,
the work pool contains a statically polymorphic implementation for
a pool to dispatch the actual work to.
In doing this, we also separate certain things out into a few different
files.
The RPC server context API has been modified slightly to reflect this,
and the work pool directly exposed, for now.

* Don't use file-as-struct

* Run style script, respect line length limit

* Improve accept failure handling & update TODOs
* Handle potentially failing/cancelling of `accept_multishot` by
  re-queueing it, based on the `IORING_CQE_F_MORE` flag.
* Revise/simplify the queueing logic for the `accept_multishot` SQE.
* Resolve the EINTR TODO panics, returning a catch-all error value
  indicating it as a bad but non-critical error.
* Update the `a: ?noreturn` `if (a) |*b|` TODO, adding that it's solved
  in 0.14; it should be resolved after we update to 0.14.
* Unify EAGAIN panic message.

* Add TODO to remove hacky-ish workaround

* Use `self: Type` convention

* A few minor fixups and improvements

* Simplify test, make server socket nonblocking
On MacOS, on basic WorkPool, this means we now need to manually set the
accepted socket's flags to the right things, ie, blocking, as opposed
to the server socket's nonblocking mode.
Means we also have to handle EAGAIN a bit differently in the io_uring
backend, but that's a fine tradeoff.

* server.zig -> server/lib.zig

* Segregate out the basic backend
And re-organize some methods based on that change

* Re-organize server module

* Update LOGGER_SCOPE

* Simplify & improve io_uring backend error handling

* De-scope `accept_flags`

* Simplify `can_use` for linux cross-compilation

* (io_uring) Rework error handling, add timeout
Do not exit for *any* errors that are specific to the
related connection, simply free them and continue to the next CQE.

Specifically in the case of `error.SubmissionQueueFull`, instead of
immediately failing, we instead first try to flush the submission queue
and then try again to submit; if it fails a second time, that means
despite flushing the submission queue, it somehow still failed, so
we panic, since this indicates something is *very* wrong.

This also eliminates the `pending_cqes_buf`, since there is actually
no situation in which `consumeOurCqe` returns an error, and we resume
work afterwards - either we process all the received CQEs, or we hard
exit - this was already essentially the case before, now it's more
obvious.

For the main submit, we now wait for at least 1 connection, but we
also add a timeout SQE to make it terminate if we don't receive a
connection or completion of another task for 1 second; this alleviates
the busy loop that was running before.

* (io_uring) Remove multishot_accept_submitted
Also slightly refactor error sets.
Now instead of checking to see if we need to set a flag to re-queue
the multishot accept, we just pass in the server context on init
and queue it, which now makes sense since the context and workpool
are separate.

* (io_uring) Simplify new entry creation
Also add fix for rebase

* Misc fixups

* Re-organize alias/import

* General restructure
* Move more specific functions to the only files they're used.
* Move the `serve*` functions outside of `Context`, making them
  free functions which just accept the context and work pool.
* Remove `acceptAndServeConnection`; originally this was required to
  be able to nicely structure the unit test, and used to be more
  integrated, however it no longer makes sense as a concept.
* Inline `handleRequest` into the basic backend.
* Make the `acceptHandled` function, moved into the basic backend,
  guarantee the specified `sync` behavior, and inline `have_accept4`.
* Appropriately re-export the relevant parts of the server API.
* Added top level doc comments.

* Re-oorganize loggers & scopes

* Refactor `build_options` imports

* Add `no_network_tests` build option
And disable the rpc server test when it is enabled

* Update circleci with `-Dno-network-tests`
dadepo pushed a commit that referenced this pull request Feb 13, 2025
* RPC server: io_uring upgrade
Separates the server into two parts: the context, and the work pool;
the context contains everything generally needed to run the server,
the work pool contains a statically polymorphic implementation for
a pool to dispatch the actual work to.
In doing this, we also separate certain things out into a few different
files.
The RPC server context API has been modified slightly to reflect this,
and the work pool directly exposed, for now.

* Don't use file-as-struct

* Run style script, respect line length limit

* Improve accept failure handling & update TODOs
* Handle potentially failing/cancelling of `accept_multishot` by
  re-queueing it, based on the `IORING_CQE_F_MORE` flag.
* Revise/simplify the queueing logic for the `accept_multishot` SQE.
* Resolve the EINTR TODO panics, returning a catch-all error value
  indicating it as a bad but non-critical error.
* Update the `a: ?noreturn` `if (a) |*b|` TODO, adding that it's solved
  in 0.14; it should be resolved after we update to 0.14.
* Unify EAGAIN panic message.

* Add TODO to remove hacky-ish workaround

* Use `self: Type` convention

* A few minor fixups and improvements

* Simplify test, make server socket nonblocking
On MacOS, on basic WorkPool, this means we now need to manually set the
accepted socket's flags to the right things, ie, blocking, as opposed
to the server socket's nonblocking mode.
Means we also have to handle EAGAIN a bit differently in the io_uring
backend, but that's a fine tradeoff.

* server.zig -> server/lib.zig

* Segregate out the basic backend
And re-organize some methods based on that change

* Re-organize server module

* Update LOGGER_SCOPE

* Simplify & improve io_uring backend error handling

* De-scope `accept_flags`

* Simplify `can_use` for linux cross-compilation

* (io_uring) Rework error handling, add timeout
Do not exit for *any* errors that are specific to the
related connection, simply free them and continue to the next CQE.

Specifically in the case of `error.SubmissionQueueFull`, instead of
immediately failing, we instead first try to flush the submission queue
and then try again to submit; if it fails a second time, that means
despite flushing the submission queue, it somehow still failed, so
we panic, since this indicates something is *very* wrong.

This also eliminates the `pending_cqes_buf`, since there is actually
no situation in which `consumeOurCqe` returns an error, and we resume
work afterwards - either we process all the received CQEs, or we hard
exit - this was already essentially the case before, now it's more
obvious.

For the main submit, we now wait for at least 1 connection, but we
also add a timeout SQE to make it terminate if we don't receive a
connection or completion of another task for 1 second; this alleviates
the busy loop that was running before.

* (io_uring) Remove multishot_accept_submitted
Also slightly refactor error sets.
Now instead of checking to see if we need to set a flag to re-queue
the multishot accept, we just pass in the server context on init
and queue it, which now makes sense since the context and workpool
are separate.

* (io_uring) Simplify new entry creation
Also add fix for rebase

* Misc fixups

* Re-organize alias/import

* General restructure
* Move more specific functions to the only files they're used.
* Move the `serve*` functions outside of `Context`, making them
  free functions which just accept the context and work pool.
* Remove `acceptAndServeConnection`; originally this was required to
  be able to nicely structure the unit test, and used to be more
  integrated, however it no longer makes sense as a concept.
* Inline `handleRequest` into the basic backend.
* Make the `acceptHandled` function, moved into the basic backend,
  guarantee the specified `sync` behavior, and inline `have_accept4`.
* Appropriately re-export the relevant parts of the server API.
* Added top level doc comments.

* Re-oorganize loggers & scopes

* Refactor `build_options` imports

* Add `no_network_tests` build option
And disable the rpc server test when it is enabled

* Update circleci with `-Dno-network-tests`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

5 participants