Make ClickBench Q23 Go Faster #15177
OOOO -- here is the duckdb plan and it shows what they are doing! The key is this line:
What I think this is referring to is what @adriangb is describing in : Specifically, the Top_N operator passes down a filter into the scan. The filter is "dynamic" in the sense that it is updated while the scan runs: as the TopK heap fills, the current k-th best value becomes a threshold, and rows that can't beat it are skipped.
The topk dynamic filtering is described here:
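The core of the topk dynamic filtering idea can be sketched in a few lines of Python. This is an illustration only, not DataFusion's actual implementation: the current k-th best value acts as a filter the scan can apply before doing any further work on a row.

```python
import heapq

def topk_with_dynamic_filter(rows, k):
    """Return (the k smallest values, how many rows passed the dynamic filter)."""
    heap = []   # max-heap of the k smallest values seen so far (stored negated)
    kept = 0
    for v in rows:
        # Dynamic filter: once the heap is full, any row >= the current
        # k-th smallest value can be skipped without further processing
        # (in a real engine: without decoding the other columns).
        if len(heap) == k and v >= -heap[0]:
            continue
        kept += 1
        heapq.heappush(heap, -v)
        if len(heap) > k:
            heapq.heappop(heap)
    return sorted(-x for x in heap), kept

# Best case: an already-ascending stream, where only the first k rows
# ever pass the filter.
top, kept = topk_with_dynamic_filter(list(range(1, 1001)), 10)
```

Here `top` is `[1..10]` and `kept` is 10, i.e. 990 of the 1000 rows were rejected by the threshold alone; on adversarial (descending) input the filter rejects nothing, which is why this is an opportunistic optimization.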
BTW apparently DuckDB uses the "late materialization" technique with its own native format. Here is an explain courtesy of Joe Issacs and Robert Kruszewski
This looks cool! Very interested in this.
There are two optimizations here that go together; if you check the ClickBench results, DuckDB on its own format is significantly faster than on Parquet. The two optimizer rules that do this are 1) TopN https://github.com/duckdb/duckdb/blob/main/src/optimizer/topn_optimizer.cpp#L105 and 2) late materialization https://github.com/duckdb/duckdb/blob/main/src/optimizer/late_materialization.cpp#L180 (join the filter result back to obtain the rest of the columns)
Note that late materialization (the join / semi-join rewrite) needs join operator support that DataFusion doesn't yet have (we could add it, but it will take non-trivial effort). My suggested order of implementation is:
I actually think that will likely get us quite fast. I am not sure how much more improvement late-materialized joins will bring without a specialized file format. I don't have time to help plan out late-materializing joins at the moment, but I am quite interested in pushing along the predicate pushdown.
There is a similar thought. Even though it aims to filter, the idea is similar. For example, given a table:
Back to topk, we can split the idea into the query:
WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
SELECT t.* FROM t WHERE t.row_id IN (SELECT row_id FROM ids)
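The equivalence of this rewrite is easy to check with a quick SQLite sketch (table `t` and columns `row_id`, `a`, `payload` are made up for the example; a tie-breaking `row_id` is added to the ORDER BY so both forms are deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (row_id INTEGER PRIMARY KEY, a INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(i, (i * 37) % 101, f"wide-row-{i}") for i in range(1000)],
)

# Direct form: conceptually decodes every column for every row, then
# keeps only the top 10.
direct = conn.execute(
    "SELECT row_id, a, payload FROM t ORDER BY a, row_id LIMIT 10"
).fetchall()

# Late-materialized rewrite: find the winning row_ids first, then fetch
# the remaining (wide) columns only for those rows.
rewritten = conn.execute("""
    WITH ids AS (SELECT row_id, a FROM t ORDER BY a, row_id LIMIT 10)
    SELECT t.row_id, t.a, t.payload
    FROM t
    WHERE t.row_id IN (SELECT row_id FROM ids)
    ORDER BY t.a, t.row_id
""").fetchall()

assert direct == rewritten
```

SQLite doesn't itself do late materialization here, of course; the point is only that the two queries produce identical results, so a planner is free to pick either form.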
I agree -- this is what I meant by "late materialization". Your example / explanation is much better than mine @xudong963 🙏
Is your feature request related to a problem or challenge?
Comparing ClickBench on DataFusion 45 and DuckDB (link)
You can see that for Q23 DataFusion is almost 2x slower (around 10s, where DuckDB takes around 5s)

You can run this query like this:
Here is the explain plan
Something that immediately jumps out at me in the explain plan is this line
"Projection" I think means that all of those columns are being read / decoded from parquet, which makes sense as the query has a
SELECT *
in it. However, in this case only the top 10 rows are returned (out of 100M rows in the file)
So most of the decoded data is thrown away immediately
Describe the solution you'd like
I would like to close the gap with DuckDB via some general-purpose improvements
Describe alternatives you've considered
I think the way to improve performance here is to defer decoding ("Materializing") the other columns until we know what the top 10 rows are.
Some wacky ideas:
Late materialization would look something like: first compute the top 10
row_id
values using only the sort column, then fetch the remaining columns for just those rows
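The deferred-decoding idea can be sketched over toy in-memory columns (names like `event_time` and `url` are illustrative only, not DataFusion's API): pass 1 reads only the sort column to find the winning row_ids, and pass 2 "decodes" the other columns just for those 10 rows.

```python
import heapq

# Toy columnar data: one Python list per column.
N = 100_000
columns = {
    "event_time": [(i * 7919) % 1000 for i in range(N)],
    "url": [f"http://example.com/{i}" for i in range(N)],
    "title": [f"page {i}" for i in range(N)],
}

def topk_row_ids(sort_col, k):
    # Pass 1: touch only the sort column; break ties by row_id so the
    # result is deterministic.
    smallest = heapq.nsmallest(k, enumerate(sort_col), key=lambda p: (p[1], p[0]))
    return [row_id for row_id, _ in smallest]

def materialize(columns, row_ids):
    # Pass 2: decode every column, but only at the winning row_ids --
    # 10 rows instead of 100,000.
    return [{name: col[row_id] for name, col in columns.items()}
            for row_id in row_ids]

ids = topk_row_ids(columns["event_time"], 10)
rows = materialize(columns, ids)
```

In a real engine pass 2 would be random-access reads into the file (cheap in DuckDB's native format, harder over Parquet), which is the "specialized file format" caveat mentioned above.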
Additional context
No response