partition skipping filter #624
base: main
Conversation
@scovich I still have some more testing to add in unit tests before this is ready for review, but would you mind taking a quick look to see if it matches the design you were thinking of?
fn visit<'a>(&mut self, row_count: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<()> {
    let getter = getters[0];
    for i in 0..row_count {
We should consider passing (and updating) the existing selection vector instead of creating a new one, because not all rows of the data we visit are even valid add actions. That way, we only try to extract partition values for rows that have survived this far, and we clear the selection bit if the filter prunes out the file.
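The in-place update described here can be sketched as follows. This is a self-contained illustration with hypothetical names (`apply_filter_in_place` and its closure are not kernel APIs): the filter only runs for rows that survived earlier passes, and pruned files just get their bit cleared.

```rust
// Hypothetical sketch: clear bits in an existing selection vector rather
// than allocating a new one, so the filter is only evaluated for rows
// that are still selected (e.g. valid add actions).
fn apply_filter_in_place(selection: &mut [bool], keep: impl Fn(usize) -> bool) {
    for (row, selected) in selection.iter_mut().enumerate() {
        // Rows already filtered out are never passed to the filter.
        if *selected && !keep(row) {
            *selected = false;
        }
    }
}

fn main() {
    // Row 1 was already deselected; the filter prunes row 2.
    let mut selection = vec![true, false, true, true];
    apply_filter_in_place(&mut selection, |row| row != 2);
    assert_eq!(selection, vec![true, false, false, true]);
}
```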
Yes. I think we should have an additive vector to minimize what work we need to do, and to keep things simple.
Update: It's not just about minimizing work. Right now, we're attempting to apply partition filters to every action, even non-file actions. Which forces us to treat the partition values column as nullable even tho the Delta spec says it's required. That means extra complexity to compensate for possible nullness, and risk of missing a corrupt table whose add actions illegally lack the add.partitionValues
column.
Perhaps it would be simpler to inject partition pruning into the existing AddRemoveDedupVisitor::is_valid_add
method, instead of defining a whole new visitor? It already accepts a selection vector, and it already skips non-file actions. We would just have to add the partition values to the schema it works with. Because add and remove actions both always provide partition columns, we could partition prune both before calling AddRemoveDedupVisitor::self.check_and_record_seen
, to avoid bloating the hash table with pruned entries.
The one complication would be how to handle non-partition tables cleanly, if we try to avoid fetching the partition values columns for non-partitioned tables. My guess is, that's over-optimizing. We should go ahead and unconditionally fetch the empty column, and just let the visitor conditionally ignore it. Something like this:
fn is_valid_add<'a>(&mut self, i: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<bool> {
    // Add will have a path at index 0 if it is valid; otherwise, if it is a log batch, we may
    // have a remove with a path at index 5. In either case, extract the getters for partition
    // values and dv columns at indexes that immediately follow a valid path index.
    let (path, pv_getter, dv_getters, is_add) =
        if let Some(path) = getters[0].get_str(i, "add.path")? {
            (path, getters[1], &getters[2..5], true)
        } else if !self.is_log_batch {
            return Ok(false);
        } else if let Some(path) = getters[5].get_opt(i, "remove.path")? {
            (path, getters[6], &getters[7..10], false)
        } else {
            return Ok(false);
        };
    // Only consider adds and removes that survive partition pruning
    if !self.apply_partition_filters(i, pv_getter)? {
        return Ok(false);
    }
    let dv_unique_id = match dv_getters[0].get_opt(i, "deletionVector.storageType")? {
        Some(storage_type) => Some(DeletionVectorDescriptor::unique_id_from_parts(
            storage_type,
            dv_getters[1].get(i, "deletionVector.pathOrInlineDv")?,
            dv_getters[2].get_opt(i, "deletionVector.offset")?,
        )),
        None => None,
    };
    // Process both adds and removes, but only return not-already-seen adds
    let file_key = FileActionKey::new(path, dv_unique_id);
    Ok(!self.check_and_record_seen(file_key) && is_add)
}

fn apply_partition_filters<'a>(&mut self, i: usize, pv_getter: &'a dyn GetData<'a>) -> DeltaResult<bool> {
    let Some(partition_filter) = &self.partition_filter else {
        return Ok(true); // no filter => keep it
    };
    // extract and parse partition values, and apply the filter to them
    let partition_values: HashMap<String, String> = pv_getter.get(i, "partitionValues")?;
    todo!()
}
Also: I think today's approach only looks at adds, and removes would still bloat up the hash table. If so, we should fix that regardless of where the partition pruning code lives.
I plan to do this when I merge this with #607. I think I'd be okay to merge this first with this code and then I can move it over when I update my pr. Thoughts?
Update: Thinking more about this, I realized it's actually UNSAFE to partition-prune removes, for the same reason we can't apply data skipping over removes: Those removes need to suppress earlier incompatible adds we might encounter, if the table's schema was replaced after the most recent checkpoint.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main     #624      +/-   ##
==========================================
+ Coverage   84.08%   84.13%   +0.05%
==========================================
  Files          76       77       +1
  Lines       17526    17707     +181
  Branches    17526    17707     +181
==========================================
+ Hits        14736    14898     +162
- Misses       2077     2088      +11
- Partials      713      721       +8
☔ View full report in Codecov by Sentry.
This is great, thanks.
I think we need to figure out if we merge this or #607 first. I suspect the code flow will look rather different once both are in.
In particular, what do you think about AddRemoveDedupVisitor becoming the main point where we do both skipping and fixup expression calculation? (We'd probably want to rename it 😄)
The code flow would then be something like:
- call process_scan_batch on the scanner
- build initial selection vec based on stats (this should probably be folded into the new visitor as well btw)
- construct new visitor
- have new visitor visit actions

The new visitor does:
- resolving adds/removes to skip removed files
- visiting add.partitionValues
- applying the predicate to the extracted values (as in this PR, except updating the existing selection vector embedded in the visitor)
- computing the predicate fixup expression if not skipped

In the end we have a single selection vector that has filtered out any Add files that should be skipped either for stats or partitions, as well as the physical->logical transforms needed, and can return that to the engine.
I'm okay with merging this first and then I can update my PR to pull this logic into the visitor, or we can merge #607 and adjust this PR.
We'll need some tests for this too, probably both unit for the skipping and then at least one in read.rs.
Absolutely, I wasn't going to move it out of draft until after adding tests. I was mostly looking for an early check to make sure the approach matched the suggestion in the other PR linked above.
One thing to note tho -- the DefaultPredicateEvaluator itself is already tested pretty thoroughly, so any partition pruning unit tests only need to validate that the partition values map it takes as input is correct (correctly parsed, correct names, etc). The read.rs would then be a lightweight end-to-end test to make sure everything is plumbed through correctly.
  let filter = DefaultPredicateEvaluator::from(resolver);
- Some(filter.eval_expr(&self.predicate, false).unwrap_or(true))
+ Ok(filter.eval_expr(&self.predicate, false).unwrap_or(false))
This doesn't look right? If the expression evaluated to None, that means we don't know whether it can be skipped and we must not skip it (see my other comment about the importance of passing Scalar::Null values for partition values that are known to be NULL, so that IS [NOT] NULL predicates work correctly).
Meanwhile, if we just need to handle the extra Option/Result nesting, we can do:
let val = getter.get_map(i, "output")?.and_then(|m| {
...
Ok(filter.eval_expr(&self.predicate, false)).transpose()
});
I see your point now.
Something I was struggling with is I was testing with the multi-partitioned acceptance test examples, some of which do have null values for some partitions. In particular, looking at acceptance/tests/dat/out/reader_tests/generated/multi_partitioned, I was trying to use a predicate that column "letter" equals a literal "b". This does let the nulls through because in the equality we're using PartialCmp even for equality. I get that when doing a > or < comparison it's ill-defined for null values, but I would expect that checking equality should be valid. I tested adding in a partial_eq_scalars alongside partial_cmp_scalars and it resolves the problem I was having.
I might be coming at the problem from a different point of view than the delta-kernel design. If the intent is for nulls to make it through this kind of predicate, then I can adjust my datafusion code to add in additional predicates of "is not null" for each of these cases.
Comparisons with NULL get tricky... in SQL, any comparison against NULL produces NULL -- including e.g. NULL == NULL. There's a concept of "null-safe equal" in Spark (operator <=>) which always returns either TRUE or FALSE, and there are other ways in other systems to achieve a similar result. But yes -- if you want a null-safe equal (or comparison) in a system that doesn't explicitly claim to support it, you need to turn e.g. x > 10 into x IS NOT NULL AND x > 10.
Additionally, for our predicate filters, a missing value behaves like NULL some of the time (e.g. comparisons against it return NULL), but it's NOT the same when it comes to IS [NOT] NULL predicates. Because if a value is outright missing, we don't know whether that's because it was null, or because it was unavailable. So a missing input has to cause IS [NOT] NULL to return NULL, when it comes to data skipping. For other comparisons, the correct thing already happens naturally because the result is the same for missing vs. NULL.
To give an example, suppose the predicate was:
WHERE partition_col > 10 AND value_col = 42
Then for partition pruning, we won't have stats for value_col and we'll effectively end up with:
WHERE partition_col > 10 AND NULL
That's totally fine -- FALSE AND NULL is FALSE, so we can still skip. TRUE AND NULL is NULL, but that's fine because TRUE and NULL both force us to keep the file.
Complicated stuff! For more details, see extensive code comments including:
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/predicates/mod.rs#L25-L44
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/predicates/parquet_stats_skipping.rs#L109-L153
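The AND truth table the skipping logic relies on is small enough to sketch directly. This is a self-contained illustration (not kernel code), with None standing in for SQL NULL:

```rust
// SQL three-valued logic for AND, with None standing in for NULL.
fn kleene_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // FALSE dominates: FALSE AND anything is FALSE, so we can still skip.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        // Every remaining combination involves NULL and stays NULL.
        _ => None,
    }
}

fn main() {
    // WHERE partition_col > 10 AND value_col = 42, with no stats for value_col:
    assert_eq!(kleene_and(Some(false), None), Some(false)); // file can be skipped
    assert_eq!(kleene_and(Some(true), None), None);         // file must be kept
}
```

Data skipping then only acts on an actual Some(false); both Some(true) and None keep the file.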
I should add, data skipping and SQL WHERE have conflicting requirements:
- Data skipping must keep the file unless the predicate returns an actual FALSE
- WHERE discards the row unless the predicate returns an actual TRUE
For partition pruning, we should in theory use SQL WHERE semantics -- but that's only safe if we're certain that every column the predicate references is a partition column, which is not guaranteed. Hence my compromise suggestion, that we should ensure the partition values map always has a value (Scalar::Null by default) for every partition column.
Additionally, we will probably want to hoist the eval_sql_where function from ParquetStatsSkippingFilter (which can probably go away) up to the PredicateEvaluator trait so you can use it. That way, the required IS NOT NULL checks would be injected automatically. That would also allow us to simplify the can_statically_skip_all_files function in scan/mod.rs, because it could also use a DefaultPredicateEvaluator (with an empty column resolver) instead of weirdly using parquet stats skipping logic.
Thank you for taking the time to write this all up. From reading the delta spec, specifically this line in requirements for writers:
Values for all partition columns present in the schema MUST be present for all files in the table.
This would lead me to think that the right thing to do is that if any of the partitionValues are entirely missing, we should return an error because it doesn't match the spec.
Also, from my reading of the spec it looks like null values should be an empty string per partition value serialization. The acceptance tests currently have things like "partitionValues":{"letter":"a","date":null,"data":"x"}, which is also how my other examples show up. But to meet the spec it looks like we need to also support the empty string case. I don't immediately see any examples in the repo that have this for a unit test, so I'll be sure to add both variants (null and "").
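Handling both variants could be sketched like this. This is a hedged, self-contained illustration following the comment's reading of the spec (`PartitionValue` and `parse_partition_value` are hypothetical names, not kernel types; whether "" should instead mean a genuine empty string for string-typed columns is left aside here):

```rust
// Illustrative stand-in for a parsed partition value; not a kernel type.
#[derive(Debug, PartialEq)]
enum PartitionValue {
    Null,
    Value(String),
}

// Accept both variants seen in the wild: a JSON null, and (per the
// comment's reading of the spec) an empty string, both meaning NULL.
fn parse_partition_value(raw: Option<&str>) -> PartitionValue {
    match raw {
        None | Some("") => PartitionValue::Null,
        Some(s) => PartitionValue::Value(s.to_string()),
    }
}

fn main() {
    // e.g. "partitionValues":{"letter":"a","date":null}
    assert_eq!(parse_partition_value(Some("a")), PartitionValue::Value("a".into()));
    assert_eq!(parse_partition_value(None), PartitionValue::Null);
    assert_eq!(parse_partition_value(Some("")), PartitionValue::Null);
}
```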
Hoisting eval_sql_where up so DefaultPredicateEvaluator can use it: #627
Thank you for taking the time to write this all up. From reading the delta spec, specifically this line in requirements for writers:
Values for all partition columns present in the schema MUST be present for all files in the table.
This would lead me to think that the right thing to do is that if any of the partitionValues are entirely missing that we should return an error because it doesn't match the spec.

That sounds right to me as well. However:
- We may want to ask delta-rs folks if they've seen other behaviors in the wild that we need to consider tolerating.
- We still have to worry about the non-partition columns the predicate might mention -- they won't have entries in add.partitionValues and we want them to resolve as None (not Scalar::Null).
Maybe the simplest thing to do -- in this partition skipping code at least -- is to treat missing partition values as if they were value columns. They would simply not participate in skipping, and the query that comes later can error out as appropriate.
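That distinction between "known NULL" and "no information" can be sketched in a small, self-contained resolver (the `Scalar` enum and `resolve_column` here are illustrative, not the kernel's types):

```rust
use std::collections::HashMap;

// Illustrative stand-in for the kernel's Scalar; not the real type.
#[derive(Clone, Debug, PartialEq)]
enum Scalar {
    Null,
    String(String),
}

// A known partition column that is NULL resolves to Some(Scalar::Null), so
// IS [NOT] NULL predicates still work. A column with no entry at all
// resolves to None: its value is unknown, comparisons against it yield
// NULL, and it can never cause a file to be skipped.
fn resolve_column(
    partition_values: &HashMap<String, Option<String>>,
    col: &str,
) -> Option<Scalar> {
    match partition_values.get(col) {
        None => None,
        Some(None) => Some(Scalar::Null),
        Some(Some(v)) => Some(Scalar::String(v.clone())),
    }
}

fn main() {
    let mut pv = HashMap::new();
    pv.insert("letter".to_string(), Some("b".to_string()));
    pv.insert("date".to_string(), None);
    assert_eq!(resolve_column(&pv, "letter"), Some(Scalar::String("b".into())));
    assert_eq!(resolve_column(&pv, "date"), Some(Scalar::Null));
    assert_eq!(resolve_column(&pv, "value_col"), None);
}
```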
Thank you for all the thoughtful feedback. I have to step away from this for a few days to focus on other things, but it is high priority for me because it will greatly impact many of my workflows.
I've added a unit test. Aside from passing the selection vector around to reduce work, are there other show stoppers anyone sees remaining?
Aside from passing the selection vector around to reduce work, are there other show stoppers anyone sees remaining?
Show stoppers -- no. I added some comments on some potential ways to simplify and harden the code tho.
Nice, thanks. Just a few more things I think
kernel/src/scan/log_replay.rs (Outdated)
filter: DataSkippingFilter::new(engine, physical_predicate),
partition_filter: PartitionSkippingFilter::new(
    engine,
    partition_predicate.cloned(),
nit: it's nice to note that this is a cheap arc clone
assert_eq!(selection_vector.len(), actions.len());
assert_eq!(partition_selection_vector.len(), actions.len());

let selection_vector = data_selection_vector
why not just:
- let selection_vector = data_selection_vector
+ data_selection_vector.append(&mut partition_selection_vector);
+ let selection_vector = data_selection_vector;
I don't think we want to do this - we want to zip one iter with the other, not append it.
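The zip described here would look something like the following self-contained sketch (`combine_selection` is an illustrative name): a row survives only if both the data-skipping and partition-skipping vectors keep it, whereas `append` would produce a vector of twice the length.

```rust
// Combine two per-row selection vectors: a row stays selected only if it
// passes both the data-skipping and the partition-skipping filters.
fn combine_selection(data_sv: &[bool], partition_sv: &[bool]) -> Vec<bool> {
    assert_eq!(data_sv.len(), partition_sv.len());
    data_sv
        .iter()
        .zip(partition_sv)
        .map(|(d, p)| *d && *p)
        .collect()
}

fn main() {
    let data_sv = vec![true, true, false];
    let partition_sv = vec![true, false, true];
    assert_eq!(combine_selection(&data_sv, &partition_sv), vec![true, false, false]);
}
```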
Thanks for the feedback. I'm going to try to find some time this weekend to resolve the remaining comments.
Hi Tim. I'm going to go ahead and merge #607 to unblock some stuff. I'm happy to help merge/rebase this as we planned in #624 (comment), or you can as well. Just let me know how you'd like to proceed. Thanks!
That’s understandable. I’m sorry, I’ve been meaning to get to this sooner but had some high priority things come up.
@timsaucer FYI when you do get back to this, you should now be able to use
No worries at all, we appreciate your time! #607 is merged now, so you could begin the refactor. If you'd like me to help with that lmk
… is necessary to skip the other partition columns. Also, if the evaluation returns a None it means the evaluation is not possible, such as equality of a scalar value and a null. These should return false in the filter.
Force-pushed from 2e071d9 to 7122332.
Ok, finally getting back around to looking at this. I did a rebase and resolved the conflicts. Now I'm going to try to go back through all of the above comments.
… rather than assuming the caller knows what they're doing
Ok, that's all I can do today. I need to come back and address the question of looking at both adds and removes and seeing how to merge this with the AddRemoveDedupVisitor as suggested. Sorry, this is turning out to be a far more complicated PR than I had originally anticipated.
I need to come back and address the question of looking at both adds and removes and seeing how to merge this with the AddRemoveDedupVisitor as suggested.
Sorry, this is turning out to be a far more complicated PR than I had originally anticipated.
This has been a super helpful exploration, thanks for tackling it. It prompted us to take a hard look at this code, and already resulted in helpful changes to kernel -- and this PR didn't even merge yet.
After staring at your code and the data transform changes @nicklan recently merged, I think we can implement partition pruning with a very targeted change in AddRemoveDedupVisitor::is_valid_add. Today, that method has the following structure:
fn is_valid_add<'a>(&mut self, i: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<bool> {
    ...
    let (path, dv_getters, is_add) = ... figure out if this is an add or remove ...;
    ...
    // Check both adds and removes (skipping already-seen), but only transform and return adds
    let file_key = FileActionKey::new(path, dv_unique_id);
    if self.check_and_record_seen(file_key) || !is_add {
        return Ok(false);
    }
    ... call self.get_transform_expr to deal with partition values ...
    Ok(true)
}
Crucially, the new transform parameter @nicklan added contains the information we need to find and parse partition values. So I think we can inject something like the following, right before the call to check_and_record_seen (so we don't waste memory tracking pruned files):
let partition_values = match &self.transform {
    Some(transform) if is_add => {
        let partition_values = getters[1].get(i, "add.partitionValues")?;
        let partition_values = self.parse_partition_values(transform, &partition_values)?;
        if self.is_file_partition_pruned(&partition_values) {
            return Ok(false);
        }
        partition_values
    }
    _ => Default::default(),
};
Here:
- parse_partition_values steals the partition value discovery+parsing logic that already exists in AddRemoveDedupVisitor::get_transform_expr.
- And then is_file_partition_pruned would turn the parsed partition values into a DefaultPredicateEvaluator (just like this PR already does, just in a different place).
- If we're feeling fancy, we could even pass the extracted partition values back to get_transform_expr, so it doesn't have to parse them a second time.
We would also need to update the visitor (and its new method) to pass the physical predicate ExpressionRef as a predicate_filter, alongside the existing data skipping filter, for use by the default predicate evaluator.
Note: we can only prune adds (not removes) -- see my other comment.
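The pruning decision itself can be modeled in a self-contained toy (the `Pred` enum, `eval`, and `is_file_partition_pruned` here are illustrative stand-ins, not the kernel's Expression or DefaultPredicateEvaluator): a file is pruned only when the predicate evaluates to an actual FALSE, while TRUE and NULL (None) both keep it.

```rust
use std::collections::HashMap;

// Toy predicate over partition values; not a kernel type.
enum Pred {
    ColEq(&'static str, &'static str),
    And(Box<Pred>, Box<Pred>),
}

// Three-valued evaluation: Some(true)/Some(false) are definite, None is NULL.
fn eval(pred: &Pred, pv: &HashMap<&str, Option<&str>>) -> Option<bool> {
    match pred {
        Pred::ColEq(col, lit) => match pv.get(col) {
            Some(Some(v)) => Some(v == lit),
            Some(None) => None, // comparison against NULL yields NULL
            None => None,       // non-partition column: value unknown
        },
        Pred::And(a, b) => match (eval(a, pv), eval(b, pv)) {
            (Some(false), _) | (_, Some(false)) => Some(false),
            (Some(true), Some(true)) => Some(true),
            _ => None,
        },
    }
}

// Only a definite FALSE allows pruning; TRUE and NULL both keep the file.
fn is_file_partition_pruned(pred: &Pred, pv: &HashMap<&str, Option<&str>>) -> bool {
    eval(pred, pv) == Some(false)
}

fn main() {
    let pv: HashMap<_, _> = [("letter", Some("a"))].into_iter().collect();
    // letter = 'b' is definitely FALSE for this file, so it can be pruned.
    assert!(is_file_partition_pruned(&Pred::ColEq("letter", "b"), &pv));
    // letter = 'a' AND value_col = '42' is NULL (value_col unknown): keep it.
    let pred = Pred::And(
        Box::new(Pred::ColEq("letter", "a")),
        Box::new(Pred::ColEq("value_col", "42")),
    );
    assert!(!is_file_partition_pruned(&pred, &pv));
}
```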
    .filter(|f| partition_columns.contains(f.name()))
    .cloned()
    .peekable();
partition_fields.peek()?;
Maybe helpful to make it clear we're intentionally discarding a result?
- partition_fields.peek()?;
+ let _ = partition_fields.peek()?;
No worries! We really appreciate the time. The comment above from @scovich is hopefully helpful in simplifying things, and feel free to ping us here or on the "Delta Users" slack if you have more questions.
I don't mean to pass the buck on this, but I really don't understand the transform logic that was added in recently. I'm trying to grok it from the code comments, but I think they assume a level of familiarity with this code beyond what I have. If anyone has time to pick this up and pull it over the line, I'd really appreciate it. Otherwise I'm going to need to spend more time trying to understand the transform PR that went in.
What changes are proposed in this pull request?
This supersedes #615
This PR adds in a partition filter step that is similar to the data skipping. It adds in a row visitor that checks to see if any filters should be applied at a file level based on partition values. The approach here is based on the discussion in #607.
How was this change tested?
Tested in datafusion against an existing partitioned dataset and also against the integration test datasets for both multi-partition and single partition. I have also tested that combinations of data skipping and partition skipping are returning the correct number of files in the scan.
Remaining TODO items before this is ready for review/merge: