Example for using a separate threadpool for CPU bound work (try 2) #14286
base: main
Conversation
I think this is great Andrew. For what it's worth if this were packaged up in some installable way (even if it had to be from git, etc.) I'm sure we'd be super happy to can our custom stuff and use this :)
Thank you 🙏 If this is the case I would be happy to make a PR to put it in. @adriangb Is there any way you can test / verify that the structures in this crate would work for you (like can you temporarily copy/paste dedicated_executor.rs into your tree?)
/// Demonstrates running queries so that
/// 1. IO operations happen on the current thread pool
/// 2. CPU bound tasks happen on a different thread pool
async fn different_runtime_advanced() -> Result<()> {
This is the example that shows the best practice for separating CPU and IO
//
// ctx.register_object_store(&base_url, http_store);
// A Tokio 1.x context was found, but timers are disabled. Call `enable_time` on the runtime builder to enable timers.
let http_store = dedicated_executor.wrap_object_store_for_io(http_store);
This pattern is @tustvold's core concern as I understand it -- that the overhead of transferring data back/forth between runtimes by wrapping streams could be avoided with lower level hooks for IO.
Yes, I'll do just that and report back. I don't think it should block merging this as an example though. I understand the hesitation to put it in core, but unless there's something better that's plug and play, I think it's better to have this than nothing. Could an alternative be to put it in a contrib crate or a new workspace package (so that CI also runs on it and it can have tests), add whatever APIs core needs to make it easier to plug in, and document the setup? In my mind, unless it's set up by default, there isn't much more value in having it in core vs any other installable package.
// specific language governing permissions and limitations
// under the License.

//! [DedicatedExecutor] for running CPU-bound tasks on a separate tokio runtime.
This is a monster amount of code to put in an example -- it is a large amount of tests and documentation, and I wrote it with an eye towards eventually putting it into DataFusion itself.
///
/// [`DedicatedExecutor::spawn_io`] for more details
#[derive(Debug)]
pub struct IoObjectStore {
Here is the IoObjectStore that implements ObjectStore and wraps another ObjectStore that does IO on a different thread.
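A minimal sketch of the wrapper's shape, following the snippets elsewhere in this PR (the field names here are assumptions; DedicatedExecutor is the PR's type):

use std::sync::Arc;
use object_store::ObjectStore;

#[derive(Debug)]
pub struct IoObjectStore {
    /// The wrapped store whose requests should run on the IO runtime
    inner: Arc<dyn ObjectStore>,
    /// Executor that knows how to hand work to the IO runtime
    dedicated_executor: DedicatedExecutor,
}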
}

// -----------------------------------
// ----- Tests for IoObjectStore ------
Here are tests that prove the IO is actually happening on the correct thread (both the original requests as well as the streams that are returned)
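As a rough illustration of the style of assertion such a test can use (this is a sketch, not the PR's actual test code): name the IO runtime's threads, run a task on it, and assert on the thread name the task observed.

#[test]
fn task_runs_on_io_runtime() {
    // Build a runtime whose worker threads carry a recognizable name
    let io_runtime = tokio::runtime::Builder::new_multi_thread()
        .thread_name("io-runtime")
        .worker_threads(1)
        .enable_all()
        .build()
        .unwrap();

    // Run a task on the IO runtime and capture which thread it ran on
    let seen = io_runtime.block_on(async {
        tokio::spawn(async { std::thread::current().name().map(str::to_owned) })
            .await
            .unwrap()
    });
    assert_eq!(seen.as_deref(), Some("io-runtime"));
}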
Is there anyone willing to test out this approach in their application so we can get some confirmation that it avoids some of the timeouts reported while doing heavy processing of data from object store? Some people who I think have reported seeing this issue:
I am willing to help make a draft PR to whatever repo to hook this in. Also FYI @ozankabak as we discussed this topic recently.
In the interests of avoiding confusion, as my objections appear to have gotten a little misinterpreted, I'd like to clarify that the fact this approach comes with non-trivial overheads is not what concerns me. Rather, we know from experience at InfluxData that this pattern is fragile, easy to mess up, and leads to emergent behaviour that is highly non-trivial to reproduce and debug. That being said, as Andrew says, nobody has emerged who is able/willing to resolve this with a more holistic approach, e.g. something closer to what polars/DuckDB/Hyper are doing to separate IO/compute, and so proceeding with something is better than nothing. I just hoped someone might step up and run with something along the lines of #13692 (comment)
Based on the previous discussions and draft PRs, I ended up with this object store wrapper to spawn the IO tasks on a different handle: https://github.com/delta-io/delta-rs/blob/main/crates%2Fcore%2Fsrc%2Fstorage%2Fmod.rs#L116-L124 I do like the approach proposed in this PR, and it would be worthwhile to have it be part of core. Regarding testing it: since the wrapper above was added, I haven't heard anyone mentioning these issues in delta-rs anymore, outside of @Tom-Newton. Did you end up resolving it on your end? I forgot to follow up on it. Slightly offtopic, but I wonder if there was anything done with surfacing the true error better? Because
We are still having quite significant problems, but we can't reproduce it reliably and I haven't been personally working on it recently. Anecdotally we think it's more frequent when:
This is neat -- I actually like that it is in terms of just a tokio runtime rather than something like the DedicatedExecutor in this PR. One thing I noticed is that it doesn't actually shim the get result stream (so while the initial GET request is on the other runtime, reading data from the stream will still happen on the same runtime). I handled this by wrapping the stream like this:

/// Wrap the result stream if necessary
fn wrap_if_necessary(&self, payload: GetResultPayload) -> GetResultPayload {
    match payload {
        GetResultPayload::File(_, _) => payload,
        GetResultPayload::Stream(stream) => {
            let new_stream = self.dedicated_executor.run_io_stream(stream).boxed();
            GetResultPayload::Stream(new_stream)
        }
    }
}

It might be an interesting thing to try in delta.rs 🤔
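For context, a sketch of where such a wrapper could be applied: inside the wrapping store's get_opts, so both the initial request and the returned body stream run on the IO runtime. This is hypothetical; spawn_io is assumed to run the future on the IO runtime and return its output, and wrap_if_necessary is the helper above.

// Inside `impl ObjectStore for IoObjectStore` (sketch)
async fn get_opts(&self, location: &Path, options: GetOptions) -> Result<GetResult> {
    let inner = Arc::clone(&self.inner);
    let location = location.clone();
    // Issue the initial GET on the IO runtime
    let mut result =
        DedicatedExecutor::spawn_io(async move { inner.get_opts(&location, options).await })
            .await?;
    // Also shim the returned body stream so reads happen on the IO runtime
    result.payload = self.wrap_if_necessary(result.payload);
    Ok(result)
}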
@alamb ah good point, I missed that! Definitely good to add, will have a better look at where these payload streams are collected
.on_thread_start(move || {
    DedicatedExecutor::register_io_runtime(io_handle.clone())
})
I'll note that this clobbers any on_thread_start set on runtime_builder. Unfortunately it is not stored publicly on RuntimeBuilder so it is not possible to pull off the existing setting and wrap it. I suggest we add an on_thread_start method to DedicatedExecutorBuilder?
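A sketch of what that could look like: the builder stores the user's hook and chains it after the executor's own registration instead of clobbering it. The stub register_io_runtime stands in for the PR's DedicatedExecutor::register_io_runtime so the sketch is self-contained.

use std::sync::Arc;
use tokio::runtime::{Builder, Handle, Runtime};

// Stub standing in for DedicatedExecutor::register_io_runtime
fn register_io_runtime(_io_handle: Handle) {}

pub struct DedicatedExecutorBuilder {
    user_hook: Option<Arc<dyn Fn() + Send + Sync>>,
}

impl DedicatedExecutorBuilder {
    pub fn new() -> Self {
        Self { user_hook: None }
    }

    /// Record the user's callback instead of setting it on the tokio
    /// builder directly, so it cannot be clobbered later
    pub fn on_thread_start<F: Fn() + Send + Sync + 'static>(mut self, f: F) -> Self {
        self.user_hook = Some(Arc::new(f));
        self
    }

    /// Build the CPU runtime: every new worker thread first registers
    /// the IO runtime handle, then runs the user's hook (if any)
    pub fn build(self, io_handle: Handle) -> std::io::Result<Runtime> {
        let user_hook = self.user_hook;
        Builder::new_multi_thread()
            .on_thread_start(move || {
                register_io_runtime(io_handle.clone());
                if let Some(hook) = &user_hook {
                    hook();
                }
            })
            .build()
    }
}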
It'd be nice if there was also a way to warn users that we are clobbering their configuration...
I've successfully made a PR to integrate this internally. It was pretty straightforward. We'll have to scrutinize a bit to see if we can tell if anything is missing (this is very error prone code that can go bad in subtle ways), and if all goes well, will report back after this is in production for a couple of days.
I think that makes sense to me too. When I see the current
Just to put the brakes on rushing this into core too quickly, I have to support @tustvold in raising concern; we have had plenty of issues at Pydantic with this pattern, where we have missed cases where we should have spawned IO (or CPU work) onto the other runtime. It seems to me like most folks agree research into schedulers for the CPU work that can sit within the single tokio runtime would be much easier to integrate for downstream use cases, probably at the cost of complexity (overheads?) within datafusion itself. We have already vendored this pattern and have it working for us; you don't need to rush this through for us.
This looks great. I'll try to reproduce my issue and try it out. And I have to dig a bit deeper into the topic to understand the concerns about this approach.
|
#[async_trait]
impl MultipartUpload for IoMultipartUpload {
    fn put_part(&mut self, data: PutPayload) -> UploadPart {
In the S3 implementation of ObjectStore, put_part is 'tokio spawned' on the runtime of its caller (it uses a JoinSet and currently uses spawn and not spawn_on):
https://github.com/apache/arrow-rs/blob/6aaff7e38a573f797b31f89f869c3706cbe26e37/object_store/src/upload.rs#L211
If I understand the implementation and integration pattern correctly, this means that the put_part future for S3 will be spawned on the dedicated (CPU only) executor and not the IO runtime. We may have to patch the object store crate to get true IO/CPU isolation in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this just needs to be wrapped in spawn_io, the overhead of the additional tokio task will be irrelevant when compared to the cross RT and IO overheads.
The fact this is so hard to spot is a good example of why this pattern is really fragile
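A sketch of what that wrapping could look like. The key subtlety (surfaced in the discussion below) is that the inner put_part must be called synchronously so part indexes are assigned in order; only the returned future is shipped to the IO runtime. Type names follow the diff above, and spawn_io behavior is assumed.

use futures::FutureExt;
use object_store::{MultipartUpload, PutPayload, PutResult, UploadPart};

#[derive(Debug)]
struct IoMultipartUpload {
    inner: Box<dyn MultipartUpload>,
}

#[async_trait::async_trait]
impl MultipartUpload for IoMultipartUpload {
    fn put_part(&mut self, data: PutPayload) -> UploadPart {
        // Call the inner put_part synchronously so the part index is
        // assigned in submission order; only the upload future itself
        // runs on the IO runtime
        let part = self.inner.put_part(data);
        async move { DedicatedExecutor::spawn_io(part).await }.boxed()
    }

    async fn complete(&mut self) -> object_store::Result<PutResult> {
        // complete/abort could be wrapped in spawn_io the same way
        self.inner.complete().await
    }

    async fn abort(&mut self) -> object_store::Result<()> {
        self.inner.abort().await
    }
}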
I'm trying to wrap put_part with spawn_io, but I'm encountering some issues. For multipart uploads, I believe the part_idx needs to be incremented in the same order that PutPayload instances are generated from the stream being written. The indirection introduced by sending the put_part future to the I/O runtime via spawn_io seems to cause the payloads to be written out of order.
As a result, I'm seeing failures in complete() that indicate parts with a size smaller than 5MB are being sent before the last part.
I actually don't see anything in the standard S3MultipartUpload implementation that prevents this behavior, but I haven't encountered any failures related to this in the past, so I may be mistaken.
It shouldn't matter that they're written out of order as they're assembled based on the index captured when the future is created. Could you perhaps share your code?
Edit: I created an explicit test in object_store to show this - apache/arrow-rs#7047
Thank you, the test was very helpful and clearly illustrates that even though the async part uploads run concurrently, the synchronous part idx assignment happens serially on the same thread. I accidentally changed this behavior by wrapping inner in an arc mutex (to fix some borrow checker issues) and having all of the inner put_part() calls contend for the mutex, which broke the ordering guarantees. Fixed it.
For your information, I was able to reproduce the IO stall error with a very simple example, in case anybody is interested and wants to try it out. I will test the dedicated executor tomorrow.
@davidhewitt I agree in principle. I think any time one has to cross from sync code to async code the API is going to get challenging. The upside of the current DataFusion approach with a single tokio runtime is it just "works" (for some definition of work). The downside is that squeezing out the maximum performance is tricky. Though I would argue postponing the pain until you actually need really good network performance might be better than forcing users to sort it out upfront before they can even run simple things. Of course, ideally it would "just work" for users and they wouldn't have to worry about it at all.
I tried the dedicated executor in my test example. I'm not entirely sure if I'm using it wrong or maybe my test is too contrived, but I'm still getting a single timeout error. This is already much better compared to using a single runtime, which produces many timeout errors. I invite everybody to have a look; maybe you can spot a bug. I will try the dedicated executor with a real example. The problem is that it is difficult to reproduce because the issue only occurred at a certain size.
I took a look at your example @JanKaul, thank you for this, as I think it very nicely demonstrates the challenge of shimming IO at the object store interface. Unfortunately, I think this may be a different issue from the runtime stall issue. At a high level the code is doing this:
The problem is this introduces 2-second pauses between polls of the streaming request from object storage. Regardless of runtime setup, holding a request open across a long-running task will result in timeouts, as backpressure will eventually cause the sender to time out. You would have the same issue if it were an "async" sleep. There are two broad strategies to avoid this:
Option 1 will potentially buffer the entire input stream if the consumer is running behind, but should ensure the request runs to completion. Option 2 is, I think, the better approach, and is what code should probably be doing (FWIW this is what the parquet reader does). As an aside, I also took the time to port over the custom executor approach to show what that looks like - JanKaul/cpu-io-executor#1. I still think this is a cleaner approach although, much like the DedicatedExecutor in this PR, it doesn't help with the issue this example is running into.
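A sketch of what Option 1 can look like: spawn a task that drives the request's body stream to completion immediately, forwarding chunks through an unbounded channel so that slow downstream processing cannot stall the request (names are illustrative; the unbounded channel is what gives the "may buffer the entire stream" behavior).

use bytes::Bytes;
use futures::stream::BoxStream;
use futures::StreamExt;
use tokio::sync::mpsc;
use tokio_stream::wrappers::UnboundedReceiverStream;

/// Decouple consumption from the request by draining the body stream
/// in a spawned task and re-exposing it as a channel-backed stream
fn decouple_stream(
    mut body: BoxStream<'static, object_store::Result<Bytes>>,
) -> UnboundedReceiverStream<object_store::Result<Bytes>> {
    let (tx, rx) = mpsc::unbounded_channel();
    tokio::spawn(async move {
        while let Some(chunk) = body.next().await {
            if tx.send(chunk).is_err() {
                break; // receiver dropped; stop polling the request
            }
        }
    });
    UnboundedReceiverStream::new(rx)
}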
Thanks a lot for having a look! I'll try to adopt strategy 2 to simulate the behavior of the parquet reader.
@tustvold I've adapted the example to somewhat resemble the parquet reader. It tries to mimic its behavior by reading the stream from the object store and creating a stream of batched data. It would be great if you could have a look. The goal is to mimic the parquet reader's behavior from an async perspective, to be able to test different approaches.
I think you misunderstand: the parquet reader does not do streaming fetches at all, it uses get_range and get_ranges. Edit: FWIW I think focusing on object store is a distraction; shimming it properly is very hard, I've written in the past about why, and we've seen some further evidence of this here. However, the broader challenge is that this problem impacts all IO. We're struggling to safely wrap one IO interface; taking the approach in this PR forces this for every catalog, store, async TableProvider, etc... Something is better than nothing, but I at least don't consider this anything more than a hack.
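For illustration, the ranged-fetch pattern looks roughly like this: each call is a bounded request that runs to completion promptly, so no response body is held open across slow CPU work (the function and the range are hypothetical).

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

/// Fetch a fixed slice of an object eagerly rather than holding a
/// streaming GET open while downstream work proceeds
async fn read_prefix(store: &dyn ObjectStore, location: &Path) -> object_store::Result<Bytes> {
    store.get_range(location, 0..1024).await
}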
}

#[async_trait]
impl ObjectStore for IoObjectStore {
As this is a wrapper it needs to override every method and call through to the underlying impl, otherwise it will de-optimise stores with specialized impls
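For example, head has a default implementation on the trait (in terms of get_opts); if the wrapper doesn't forward it explicitly, calls would route through the wrapper's get_opts instead of the inner store's own head. A sketch of the explicit delegation (spawn_io behavior assumed, as above):

// Inside `impl ObjectStore for IoObjectStore` (sketch)
async fn head(&self, location: &Path) -> object_store::Result<ObjectMeta> {
    let inner = Arc::clone(&self.inner);
    let location = location.clone();
    // Forward to the inner store's specialized head, on the IO runtime
    DedicatedExecutor::spawn_io(async move { inner.head(&location).await }).await
}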
Thanks for having a look. I have to dig a bit deeper to really understand where the issue is. I think this PR is a very pragmatic approach that solves the most important use case, as this problem seems to only occur when executing the physical plan.
Assuming this was a mistake we made at InfluxData: we had issues with plan-time concurrency that were caused by this also; they were just much harder to track down as they didn't manifest as obviously (it took ~3 months).
Note: This PR contains a (substantial) example and supporting code. It has no changes to the core.
Which issue does this PR close?
Closes Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393
Note this is a new version of Add example for using a separate threadpool for CPU bound work #13424
Rationale for this change
I have heard from multiple people multiple times that the specifics of using multiple threadpools for separate CPU and IO work in DataFusion is confusing.
However, this is a key detail for building high performance engines which process data directly from remote storage, which I think is a very important capability for DataFusion
My past attempts to make this example have been bogged down trying to get consensus on details of how to transfer results across streams, the wisdom of wrapping streams, and other details.
However, there have been no other actual alternatives proposed (aka no actual code that I could write an example with). So while the approach of wrapping streams may be a bit ugly, it is working for us in production at InfluxDB and I believe @adriangb said they are using the same strategy at Pydantic.
In my opinion, the community needs something that works, even if it is not optimal, which is what this example provides.
I would personally like to merge it in so it is easier to find (and not locked in a PR) and iterate afterwards
What changes are included in this PR?
1. thread_pools.rs example
2. dedicated_executor.rs module

Note that the DedicatedExecutor code is originally from InfluxDB. The IoObjectStore wrapper is based on work from @matthewmturne in Add DedicatedExecutor to FlightSQL Server datafusion-contrib/datafusion-dft#247.

Note that I have purposely avoided any changes to the DataFusion crates (such as adding DedicatedExecutor to physical-plan), but I have written the code with tests with an eye towards doing exactly such a thing once we have some experience / feedback on the approach. This is tracked by DedicatedExecutor into the DataFusion crates to make using multiple threadpools easier #14285.

Right now, I think the most important thing is to get this example in for people to try and confirm if it helps them.
Are these changes tested?
Yes, the example is run as part of CI and there are tests
TODO: Verify that the tests in the examples run in CI
Are there any user-facing changes?
Not really