Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expression boundary analysis framework #3912

Merged
merged 2 commits into from
Nov 9, 2022
Merged

Conversation

isidentical
Copy link
Contributor

@isidentical isidentical commented Oct 20, 2022

This PR implements the initial revision of the proposed (by #3898) expression boundary analysis framework. The changes can be summarized as:

  • Removed expr_stats() method from PhysicalExpr.

  • Removed PhysicalExprStats.

  • Introduced a new boundaries() method to PhysicalExpr, which allows any expression to provide its own boundaries.

  • Migrated the existing implementations (comparisons, literals, columns) from ~temporary expression statistics APIs to proposed boundary APIs.

  • Changed the implementation of comparisons to now return a boundary of [true] (when it is always true), [false] (when it is always false), and [false, true] (when there is a partial overlap). So the framework is consistent with all expressions.

  • A shared context is now passed instead of a list of column boundaries. We intend to leverage this existing layout later in the process for composite expression chains.

@github-actions github-actions bot added the physical-expr Physical Expressions label Oct 20, 2022
@isidentical isidentical force-pushed the gh-3898 branch 2 times, most recently from f4050c9 to d95ce05 Compare October 22, 2022 19:15
@isidentical
Copy link
Contributor Author

Another interesting point that this PR might need is probably a specification / design document, @alamb what would work best for you (I can try to add a couple more example analysis, like a + b; or maybe a specification like the ones in here https://arrow.apache.org/datafusion/contributor-guide/specification/index.html) to push this forward?

@alamb
Copy link
Contributor

alamb commented Oct 25, 2022

I would personally recommend starting with a google doc if you can do that (as I have found it is easiest to quickly iterate with them) and then we convert that to a document in https://arrow.apache.org/datafusion/contributor-guide/specification/index.html

@metesynnada
Copy link
Contributor

We are also considering a range optimization on hash join, on queries like

SELECT ...
FROM left, right
WHERE left.x > right.y AND
		left.x + left.y < right.x + 10 AND
		left.z = left.x

and it paves the way for a streaming execution of the HashJoinExec. We’d like to collaborate on this PR. I guess a google doc would be the perfect medium to share ideas @isidentical.

@isidentical
Copy link
Contributor Author

@metesynnada I've started a draft of the whole CBOs and statistics (including the expression analysis framework) on a google doc, would love to hear more on what sort of stuff range optimization would need and how we could reshape this API to be able to easily used for that.

@isidentical
Copy link
Contributor Author

A couple of questions:

  • Is apply() a good name? I thought about it a lot, but I am still not sure. Maybe something like update_context.
  • Should I still emphasize this is an experimental API (I think we should?)

@isidentical isidentical marked this pull request as ready for review November 2, 2022 16:03
@isidentical isidentical changed the title [wip]: Expression boundary analysis framework Expression boundary analysis framework Nov 2, 2022
@isidentical isidentical requested a review from alamb November 2, 2022 16:25
@isidentical
Copy link
Contributor Author

Also CC: @Dandandan @andygrove

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! @isidentical -- I think this is looking quite good. I left some comments about the interface -- let us know what you think

@alamb
Copy link
Contributor

alamb commented Nov 2, 2022

Is apply() a good name? I thought about it a lot, but I am still not sure. Maybe something like update_context.

How about analyze()?

Should I still emphasize this is an experimental API (I think we should?)

I agree

@alamb alamb mentioned this pull request Nov 2, 2022
@isidentical
Copy link
Contributor Author

How about analyze()?

analyze() is the endpoint we have that takes a context and then produces the boundaries. apply() in that sense is taking the context and updating the existing boundaries for that expression. E.g. where a + 1 > 20 AND a < 70, we need a way of changing the context to mark a's lower boundaries to be 19 (20 - 1). Normally if a was just a column (like a < 70), we could downcast it to a ColumnExpr during the comparison; learn its position in the current schema and update it directly on the context. But since a + 1 is the expression that has the lower boundary of 20; I was thinking we can provide an additional method where each physical expression can decide how they want to propagate the information to their respective sub-expressions until they reach a column.

Like the example in here.

@mingmwang
Copy link
Contributor

We are also considering a range optimization on hash join, on queries like

SELECT ...
FROM left, right
WHERE left.x > right.y AND
		left.x + left.y < right.x + 10 AND
		left.z = left.x

and it paves the way for a streaming execution of the HashJoinExec. We’d like to collaborate on this PR. I guess a google doc would be the perfect medium to share ideas @isidentical.

Why this SQL can run HashJoinExec ? Could you please explain a bit? Since there is no equal-join conditions, we can not derive any equal join conditions from the original expression either.

@yahoNanJing
Copy link
Contributor

yahoNanJing commented Nov 4, 2022

SELECT ...
FROM left, right
WHERE left.x > right.y AND
		left.x + left.y < right.x + 10 AND
		left.z = left.x

Same question. For this non-equal join, how can we apply hash join?

@mingmwang
Copy link
Contributor

Can we merge this PR first ? Then we can continue and do some experimentations with those APIs.

@mingmwang
Copy link
Contributor

@isidentical One quick question here, a complex binary expressions mixed of AND/OR expressions might generate multiple discrete intervals, make it looks like histograms actually. How can our API models this ?

Column a boundaries [0, 100]

Expressions:
a = 1 or a = 9 or (a > 30 and a <= 50) or (a > 70 and a <= 80)

@mingmwang
Copy link
Contributor

Another example is :
a in (1, 3, 5, 7, .....)

@isidentical isidentical marked this pull request as draft November 4, 2022 17:19
@isidentical
Copy link
Contributor Author

Marking the PR draft until we reach a consensus on the API design. @alamb @Dandandan how do we feel about:

  • fn boundaries(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries>;
  • fn apply_boundaries(&self, context: &mut AnalysisContext, boundaries: &ExprBoundaries);

I know there are still some reservations on the context side, but I am out of ideas on how to make it simpler while preserving the same set of features (allowing dynamic column boundary updates from sub-expressions; being able to fork it at certain points (like ORs) and still able to delegate all this information back to the call side [so not only to sub-expressions but also the expression itself who is calling us]). I'd be happy to try any suggestions though.

@mingmwang on the point of discrete intervals, yes it is a matter of histograms. Currently we would have a non-uniform distribution but would represent it as a uniform range (a in (1, 2, 24, 25) would be represented as {min=1, max=25, distinct=4} which would imply a distribution of 1, 9, 17, 25 so not as precise). In the following phases, I hope to add a ValueDistribution construct to the ExprBoundaries so we would have a smaller-scoped histogram with all these ranges but we are not there yet.

@alamb
Copy link
Contributor

alamb commented Nov 7, 2022

I know there are still some reservations on the context side, but I am out of ideas on how to make it simpler while preserving the same set of features (allowing dynamic column boundary updates from sub-expressions; being able to fork it at certain points (like ORs) and still able to delegate all this information back to the call side [so not only to sub-expressions but also the expression itself who is calling us]). I'd be happy to try any suggestions though.

I feel like we are somewhat stuck because the usecase (at least to me) of intermediate column analysis isn't 100% clear. To truly understand I think I need some example code showing how this would be used. Until then I can't seem to quite follow the proposals (maybe because I am too stuck in previous patterns I have seen)

Here is what I propose:

  1. We get some sort of specific usecase for the intermediate column range expressions
  2. If we can't get that maybe we abandon the multi-column flexibility initially and work on getting something in with focus on specific columns and see where multi column would have helped

@isidentical
Copy link
Contributor Author

@alamb I think that sounds like a good plan! The primary place where this would be essential (or at least, I'd interpret it as essential; but maybe we can find a simpler solution) is ANDs which we can work on in the subsequent PRs.

For this PR, I've dropped support for 'intermediate column updates': so apply() API is gone, context is now shared as a regular reference (&AnalysisContext) and all the phases for intermediate analyses are removed (e.g. new upper bound computation from comparison analyzer).

@isidentical isidentical marked this pull request as ready for review November 7, 2022 23:08
@mingmwang
Copy link
Contributor

Can we get this PR merged this week?
I would like to start working on the Filter Selectivity estimation PR base on those new APIs.

@mingmwang
Copy link
Contributor

Based my understanding, I think the selectivity and row count estimation is more important, the column boundaries estimation and derivation is error prone, especially after couple of complex expressions, or some advanced SQL operators, we might have to give up the column boundaries and just propagate the estimation of the row counts.

SELECT * FROM LINEITEM WHERE L_EXTENDEDPRICE*(1-L DISCOUNT) > 90000

@mingmwang
Copy link
Contributor

And for a simple expressions like a + 100 < 20, can we safely derive the boundary of a [min, -80) ? Probably not, considering a + 100 might overflow the type's boundary.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @isidentical -- this looks like a good foundation to build on to me 👍

Perhaps as @mingmwang works with the code and analysis further improvements will become apparent

@alamb alamb merged commit 01941aa into apache:master Nov 9, 2022
@ursabot
Copy link

ursabot commented Nov 9, 2022

Benchmark runs are scheduled for baseline = 0f18e76 and contender = 01941aa. 01941aa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@isidentical
Copy link
Contributor Author

isidentical commented Nov 9, 2022

Thanks everyone for working with me on this while we iterate the on the design for a few times! Hopefully the upcoming PRs would justify the APIs even more and opens up more use cases. If anyone is interested, I plan to add a few tickets for the parts that I am planning to work on (esp on the composite expressions) and if there is a special topic that interests you please let me know on #3898 so we can sync up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants