Expression boundary analysis framework #3912

isidentical · 2022-10-20T20:04:44Z

This PR implements the initial revision of the proposed (by #3898) expression boundary analysis framework. The changes can be summarized as:

Removed expr_stats() method from PhysicalExpr.
Removed PhysicalExprStats.
Introduced a new boundaries() method to PhysicalExpr, which allows any expression to provide its own boundaries.
Migrated the existing implementations (comparisons, literals, columns) from ~temporary expression statistics APIs to proposed boundary APIs.
Changed the implementation of comparisons to now return a boundary of [true] (when it is always true), [false] (when it is always false), and [false, true] (when there is a partial overlap). So the framework is consistent with all expressions.
A shared context is now passed instead of a list of column boundaries. We intend to leverage this existing layout later in the process for composite expression chains.

datafusion/physical-expr/src/expressions/binary.rs

isidentical · 2022-10-25T15:31:46Z

Another interesting point that this PR might need is probably a specification / design document, @alamb what would work best for you (I can try to add a couple more example analysis, like a + b; or maybe a specification like the ones in here https://arrow.apache.org/datafusion/contributor-guide/specification/index.html) to push this forward?

alamb · 2022-10-25T16:37:25Z

I would personally recommend starting with a google doc if you can do that (as I have found it is easiest to quickly iterate with them) and then we convert that to a document in https://arrow.apache.org/datafusion/contributor-guide/specification/index.html

metesynnada · 2022-10-26T08:29:49Z

We are also considering a range optimization on hash join, on queries like

SELECT ...
FROM left, right
WHERE left.x > right.y AND
		left.x + left.y < right.x + 10 AND
		left.z = left.x

and it paves the way for a streaming execution of the HashJoinExec. We’d like to collaborate on this PR. I guess a google doc would be the perfect medium to share ideas @isidentical.

isidentical · 2022-10-26T15:29:30Z

@metesynnada I've started a draft of the whole CBOs and statistics (including the expression analysis framework) on a google doc, would love to hear more on what sort of stuff range optimization would need and how we could reshape this API to be able to easily used for that.

isidentical · 2022-11-02T16:03:10Z

A couple of questions:

Is apply() a good name? I thought about it a lot, but I am still not sure. Maybe something like update_context.
Should I still emphasize this is an experimental API (I think we should?)

isidentical · 2022-11-02T16:26:09Z

Also CC: @Dandandan @andygrove

alamb

Thanks! @isidentical -- I think this is looking quite good. I left some comments about the interface -- let us know what you think

datafusion/physical-expr/src/expressions/binary.rs

alamb · 2022-11-02T19:45:18Z

Is apply() a good name? I thought about it a lot, but I am still not sure. Maybe something like update_context.

How about analyze()?

Should I still emphasize this is an experimental API (I think we should?)

I agree

isidentical · 2022-11-03T20:21:49Z

How about analyze()?

analyze() is the endpoint we have that takes a context and then produces the boundaries. apply() in that sense is taking the context and updating the existing boundaries for that expression. E.g. where a + 1 > 20 AND a < 70, we need a way of changing the context to mark a's lower boundaries to be 19 (20 - 1). Normally if a was just a column (like a < 70), we could downcast it to a ColumnExpr during the comparison; learn its position in the current schema and update it directly on the context. But since a + 1 is the expression that has the lower boundary of 20; I was thinking we can provide an additional method where each physical expression can decide how they want to propagate the information to their respective sub-expressions until they reach a column.

Like the example in here.

mingmwang · 2022-11-04T06:58:58Z

We are also considering a range optimization on hash join, on queries like
SELECT ...
FROM left, right
WHERE left.x > right.y AND
		left.x + left.y < right.x + 10 AND
		left.z = left.x
and it paves the way for a streaming execution of the HashJoinExec. We’d like to collaborate on this PR. I guess a google doc would be the perfect medium to share ideas @isidentical.

Why this SQL can run HashJoinExec ? Could you please explain a bit? Since there is no equal-join conditions, we can not derive any equal join conditions from the original expression either.

yahoNanJing · 2022-11-04T07:00:57Z

SELECT ...
FROM left, right
WHERE left.x > right.y AND
		left.x + left.y < right.x + 10 AND
		left.z = left.x

Same question. For this non-equal join, how can we apply hash join?

mingmwang · 2022-11-04T08:35:11Z

Can we merge this PR first ? Then we can continue and do some experimentations with those APIs.

datafusion/physical-expr/src/expressions/column.rs

mingmwang · 2022-11-04T09:06:25Z

@isidentical One quick question here, a complex binary expressions mixed of AND/OR expressions might generate multiple discrete intervals, make it looks like histograms actually. How can our API models this ?

Column a boundaries [0, 100]

Expressions:
a = 1 or a = 9 or (a > 30 and a <= 50) or (a > 70 and a <= 80)

mingmwang · 2022-11-04T09:07:27Z

Another example is :
a in (1, 3, 5, 7, .....)

isidentical · 2022-11-04T17:27:50Z

Marking the PR draft until we reach a consensus on the API design. @alamb @Dandandan how do we feel about:

fn boundaries(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries>;
fn apply_boundaries(&self, context: &mut AnalysisContext, boundaries: &ExprBoundaries);

I know there are still some reservations on the context side, but I am out of ideas on how to make it simpler while preserving the same set of features (allowing dynamic column boundary updates from sub-expressions; being able to fork it at certain points (like ORs) and still able to delegate all this information back to the call side [so not only to sub-expressions but also the expression itself who is calling us]). I'd be happy to try any suggestions though.

@mingmwang on the point of discrete intervals, yes it is a matter of histograms. Currently we would have a non-uniform distribution but would represent it as a uniform range (a in (1, 2, 24, 25) would be represented as {min=1, max=25, distinct=4} which would imply a distribution of 1, 9, 17, 25 so not as precise). In the following phases, I hope to add a ValueDistribution construct to the ExprBoundaries so we would have a smaller-scoped histogram with all these ranges but we are not there yet.

alamb · 2022-11-07T20:59:14Z

I know there are still some reservations on the context side, but I am out of ideas on how to make it simpler while preserving the same set of features (allowing dynamic column boundary updates from sub-expressions; being able to fork it at certain points (like ORs) and still able to delegate all this information back to the call side [so not only to sub-expressions but also the expression itself who is calling us]). I'd be happy to try any suggestions though.

I feel like we are somewhat stuck because the usecase (at least to me) of intermediate column analysis isn't 100% clear. To truly understand I think I need some example code showing how this would be used. Until then I can't seem to quite follow the proposals (maybe because I am too stuck in previous patterns I have seen)

Here is what I propose:

We get some sort of specific usecase for the intermediate column range expressions
If we can't get that maybe we abandon the multi-column flexibility initially and work on getting something in with focus on specific columns and see where multi column would have helped

isidentical · 2022-11-07T23:02:41Z

@alamb I think that sounds like a good plan! The primary place where this would be essential (or at least, I'd interpret it as essential; but maybe we can find a simpler solution) is ANDs which we can work on in the subsequent PRs.

For this PR, I've dropped support for 'intermediate column updates': so apply() API is gone, context is now shared as a regular reference (&AnalysisContext) and all the phases for intermediate analyses are removed (e.g. new upper bound computation from comparison analyzer).

mingmwang · 2022-11-08T07:07:32Z

Can we get this PR merged this week?
I would like to start working on the Filter Selectivity estimation PR base on those new APIs.

mingmwang · 2022-11-08T07:24:10Z

Based my understanding, I think the selectivity and row count estimation is more important, the column boundaries estimation and derivation is error prone, especially after couple of complex expressions, or some advanced SQL operators, we might have to give up the column boundaries and just propagate the estimation of the row counts.

SELECT * FROM LINEITEM WHERE L_EXTENDEDPRICE*(1-L DISCOUNT) > 90000

mingmwang · 2022-11-08T07:40:02Z

And for a simple expressions like a + 100 < 20, can we safely derive the boundary of a [min, -80) ? Probably not, considering a + 100 might overflow the type's boundary.

alamb

Thanks @isidentical -- this looks like a good foundation to build on to me 👍

Perhaps as @mingmwang works with the code and analysis further improvements will become apparent

ursabot · 2022-11-09T17:02:05Z

Benchmark runs are scheduled for baseline = 0f18e76 and contender = 01941aa. 01941aa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

isidentical · 2022-11-09T17:06:42Z

Thanks everyone for working with me on this while we iterate the on the design for a few times! Hopefully the upcoming PRs would justify the APIs even more and opens up more use cases. If anyone is interested, I plan to add a few tickets for the parts that I am planning to work on (esp on the composite expressions) and if there is a special topic that interests you please let me know on #3898 so we can sync up.

isidentical commented Oct 20, 2022

View reviewed changes

datafusion/physical-expr/src/expressions/binary.rs Outdated Show resolved Hide resolved

github-actions bot added the physical-expr Physical Expressions label Oct 20, 2022

isidentical force-pushed the gh-3898 branch 2 times, most recently from f4050c9 to d95ce05 Compare October 22, 2022 19:15

alamb mentioned this pull request Oct 28, 2022

Representing statistics in logical or physical plans (or in both of them) #4003

Open

isidentical mentioned this pull request Oct 31, 2022

A framework for expression boundary analysis (and statistics) #3898

Closed

isidentical force-pushed the gh-3898 branch from d95ce05 to 0916b3d Compare November 2, 2022 15:55

isidentical marked this pull request as ready for review November 2, 2022 16:03

isidentical changed the title ~~[wip]: Expression boundary analysis framework~~ Expression boundary analysis framework Nov 2, 2022

isidentical requested a review from alamb November 2, 2022 16:25

alamb reviewed Nov 2, 2022

View reviewed changes

datafusion/physical-expr/src/expressions/binary.rs Outdated Show resolved Hide resolved

alamb mentioned this pull request Nov 2, 2022

minor: Use Operator::swap #4092

Merged

Dandandan reviewed Nov 4, 2022

View reviewed changes

datafusion/physical-expr/src/expressions/column.rs Outdated Show resolved Hide resolved

Dandandan reviewed Nov 4, 2022

View reviewed changes

datafusion/physical-expr/src/expressions/column.rs Outdated Show resolved Hide resolved

isidentical marked this pull request as draft November 4, 2022 17:19

isidentical added 2 commits November 8, 2022 01:58

Generic Expression Analysis Framework

3c127f8

Drop support for intermediate column updates on context

78b1295

isidentical force-pushed the gh-3898 branch from dcaab5a to 78b1295 Compare November 7, 2022 22:58

isidentical marked this pull request as ready for review November 7, 2022 23:08

alamb approved these changes Nov 9, 2022

View reviewed changes

alamb merged commit 01941aa into apache:master Nov 9, 2022

isidentical deleted the gh-3898 branch November 9, 2022 17:08

clflushopt mentioned this pull request Feb 11, 2025

Implement nested join optimization #3843

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expression boundary analysis framework #3912

Expression boundary analysis framework #3912

isidentical commented Oct 20, 2022 •

edited

Loading

isidentical commented Oct 25, 2022

alamb commented Oct 25, 2022

metesynnada commented Oct 26, 2022

isidentical commented Oct 26, 2022

isidentical commented Nov 2, 2022

isidentical commented Nov 2, 2022

alamb left a comment

alamb commented Nov 2, 2022

isidentical commented Nov 3, 2022

mingmwang commented Nov 4, 2022

yahoNanJing commented Nov 4, 2022 •

edited

Loading

mingmwang commented Nov 4, 2022

mingmwang commented Nov 4, 2022

mingmwang commented Nov 4, 2022

isidentical commented Nov 4, 2022

alamb commented Nov 7, 2022

isidentical commented Nov 7, 2022

mingmwang commented Nov 8, 2022

mingmwang commented Nov 8, 2022

mingmwang commented Nov 8, 2022

alamb left a comment

ursabot commented Nov 9, 2022

isidentical commented Nov 9, 2022 •

edited

Loading

Expression boundary analysis framework #3912

Expression boundary analysis framework #3912

Conversation

isidentical commented Oct 20, 2022 • edited Loading

isidentical commented Oct 25, 2022

alamb commented Oct 25, 2022

metesynnada commented Oct 26, 2022

isidentical commented Oct 26, 2022

isidentical commented Nov 2, 2022

isidentical commented Nov 2, 2022

alamb left a comment

Choose a reason for hiding this comment

alamb commented Nov 2, 2022

isidentical commented Nov 3, 2022

mingmwang commented Nov 4, 2022

yahoNanJing commented Nov 4, 2022 • edited Loading

mingmwang commented Nov 4, 2022

mingmwang commented Nov 4, 2022

mingmwang commented Nov 4, 2022

isidentical commented Nov 4, 2022

alamb commented Nov 7, 2022

isidentical commented Nov 7, 2022

mingmwang commented Nov 8, 2022

mingmwang commented Nov 8, 2022

mingmwang commented Nov 8, 2022

alamb left a comment

Choose a reason for hiding this comment

ursabot commented Nov 9, 2022

isidentical commented Nov 9, 2022 • edited Loading

isidentical commented Oct 20, 2022 •

edited

Loading

yahoNanJing commented Nov 4, 2022 •

edited

Loading

isidentical commented Nov 9, 2022 •

edited

Loading