partition skipping filter #624
base: main
Conversation
@scovich I still have some more testing to add in unit tests before this is ready for review, but would you mind taking a quick look to see if it matches the design you were thinking of?
fn visit<'a>(&mut self, row_count: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<()> {
    let getter = getters[0];
    for i in 0..row_count {
We should consider passing (and updating) the existing selection vector instead of creating a new one, because not all rows of the data we visit are even valid add actions. That way, we only try to extract partition values for rows that have survived this far, and we clear the selection bit if the filter prunes out the file.
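The in-place update described here can be sketched as follows. This is a self-contained illustration with hypothetical names (`apply_filter_in_place` and its closure are not kernel APIs): the filter only runs for rows that survived earlier passes, and pruned files just get their bit cleared.

```rust
// Hypothetical sketch: clear bits in an existing selection vector rather
// than allocating a new one, so the filter is only evaluated for rows
// that are still selected (e.g. valid add actions).
fn apply_filter_in_place(selection: &mut [bool], keep: impl Fn(usize) -> bool) {
    for (row, selected) in selection.iter_mut().enumerate() {
        // Rows already filtered out are never passed to the filter.
        if *selected && !keep(row) {
            *selected = false;
        }
    }
}

fn main() {
    // Row 1 was already deselected; the filter prunes row 2.
    let mut selection = vec![true, false, true, true];
    apply_filter_in_place(&mut selection, |row| row != 2);
    assert_eq!(selection, vec![true, false, false, true]);
}
```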
Yes. I think we should have an additive vector to minimize what work we need to do, and to keep things simple.
Update: It's not just about minimizing work. Right now, we're attempting to apply partition filters to every action, even non-file actions. Which forces us to treat the partition values column as nullable even tho the Delta spec says it's required. That means extra complexity to compensate for possible nullness, and risk of missing a corrupt table whose add actions illegally lack the add.partitionValues
column.
Perhaps it would be simpler to inject partition pruning into the existing AddRemoveDedupVisitor::is_valid_add
method, instead of defining a whole new visitor? It already accepts a selection vector, and it already skips non-file actions. We would just have to add the partition values to the schema it works with. Because add and remove actions both always provide partition columns, we could partition prune both before calling AddRemoveDedupVisitor::self.check_and_record_seen
, to avoid bloating the hash table with pruned entries.
The one complication would be how to handle non-partition tables cleanly, if we try to avoid fetching the partition values columns for non-partitioned tables. My guess is, that's over-optimizing. We should go ahead and unconditionally fetch the empty column, and just let the visitor conditionally ignore it. Something like this:
fn is_valid_add<'a>(&mut self, i: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<bool> {
    // Add will have a path at index 0 if it is valid; otherwise, if it is a log batch, we may
    // have a remove with a path at index 5. In either case, extract the getters for partition
    // values and dv columns at indexes that immediately follow a valid path index.
    let (path, pv_getter, dv_getters, is_add) =
        if let Some(path) = getters[0].get_str(i, "add.path")? {
            (path, getters[1], &getters[2..5], true)
        } else if !self.is_log_batch {
            return Ok(false);
        } else if let Some(path) = getters[5].get_opt(i, "remove.path")? {
            (path, getters[6], &getters[7..10], false)
        } else {
            return Ok(false);
        };
    // Only consider adds and removes that survive partition pruning
    if !self.apply_partition_filters(i, pv_getter)? {
        return Ok(false);
    }
    let dv_unique_id = match dv_getters[0].get_opt(i, "deletionVector.storageType")? {
        Some(storage_type) => Some(DeletionVectorDescriptor::unique_id_from_parts(
            storage_type,
            dv_getters[1].get(i, "deletionVector.pathOrInlineDv")?,
            dv_getters[2].get_opt(i, "deletionVector.offset")?,
        )),
        None => None,
    };
    // Process both adds and removes, but only return not-already-seen adds
    let file_key = FileActionKey::new(path, dv_unique_id);
    Ok(!self.check_and_record_seen(file_key) && is_add)
}

fn apply_partition_filters<'a>(&mut self, i: usize, pv_getter: &'a dyn GetData<'a>) -> DeltaResult<bool> {
    let Some(partition_filter) = &self.partition_filter else {
        return Ok(true); // no filter => keep it
    };
    // extract and parse partition values, and apply the filter to them
    let partition_values: HashMap<String, String> = pv_getter.get(i, "partitionValues")?;
    todo!()
}
Also: I think today's approach only looks at adds, and removes would still bloat up the hash table. If so, we should fix that regardless of where the partition pruning code lives.
I plan to do this when I merge this with #607. I think I'd be okay to merge this first with this code and then I can move it over when I update my pr. Thoughts?
Update: Thinking more about this, I realized it's actually UNSAFE to partition-prune removes, for the same reason we can't apply data skipping over removes: Those removes need to suppress earlier incompatible adds we might encounter, if the table's schema was replaced after the most recent checkpoint.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main     #624      +/-   ##
==========================================
+ Coverage   84.08%   84.13%   +0.05%
==========================================
  Files          76       77       +1
  Lines       17526    17707     +181
  Branches    17526    17707     +181
==========================================
+ Hits        14736    14898     +162
- Misses       2077     2088      +11
- Partials      713      721       +8
☔ View full report in Codecov by Sentry.
This is great, thanks.
I think we need to figure out if we merge this or #607 first. I suspect the code flow will look rather different once both are in.
In particular, what do you think about AddRemoveDedupVisitor becoming the main point where we do both skipping and fixup expression calculation? (We'd probably want to rename it 😄)
The code flow would then be something like:
- call process_scan_batch on the scanner
- build initial selection vec based on stats (this should probably be folded into the new visitor as well btw)
- construct new visitor
- have new visitor visit actions

The new visitor does:
- resolving adds/removes to skip removed files
- visiting add.partitionValues
- applying the predicate to the extracted values (as in this PR, except updating the existing selection vector embedded in the visitor)
- computing the predicate fixup expression if not skipped

In the end we have a single selection vector that has filtered out any Add files that should be skipped either for stats or partitions, as well as the physical->logical transforms needed, and can return that to the engine.
I'm okay with merging this first and then I can update my PR to pull this logic into the visitor, or we can merge #607 and adjust this PR.
We'll need some tests for this too, probably both unit for the skipping and then at least one in read.rs.
Absolutely, I wasn't going to move it out of draft until after adding tests. I was mostly looking for an early check to make sure the approach matched the suggestion in the other PR linked above.
One thing to note tho -- the DefaultPredicateEvaluator itself is already tested pretty thoroughly, so any partition pruning unit tests only need to validate that the partition values map it takes as input is correct (correctly parsed, correct names, etc). The read.rs would then be a lightweight end-to-end test to make sure everything is plumbed through correctly.
  let filter = DefaultPredicateEvaluator::from(resolver);
- Some(filter.eval_expr(&self.predicate, false).unwrap_or(true))
+ Ok(filter.eval_expr(&self.predicate, false).unwrap_or(false))
This doesn't look right? If the expression evaluated to None, that means we don't know whether it can be skipped and we must not skip it (see my other comment about the importance of passing Scalar::Null values for partition values that are known to be NULL, so that IS [NOT] NULL predicates work correctly).
Meanwhile, if we just need to handle the extra Option/Result nesting, we can do:
let val = getter.get_map(i, "output")?.and_then(|m| {
...
Ok(filter.eval_expr(&self.predicate, false)).transpose()
});
I see your point now.
Something I was struggling with is I was testing with the multi-partitioned acceptance test examples, some of which do have null values for some partitions. In particular, looking at acceptance/tests/dat/out/reader_tests/generated/multi_partitioned, I was trying to use a predicate that column "letter" equals a literal "b". This does let the nulls through because in the equality we're using PartialCmp even for equality. I get that when doing a > or < comparison it's ill-defined for null values, but I would expect that checking equality should be valid. I tested adding in a partial_eq_scalars alongside partial_cmp_scalars and it resolves the problem I was having.
I might be coming at the problem from a different point of view than the delta-kernel design. If the intent is for nulls to make it through this kind of predicate, then I can adjust my datafusion code to add in additional predicates of "is not null" for each of these cases.
Comparisons with NULL get tricky... in SQL, any comparison against NULL produces NULL -- including e.g. NULL == NULL. There's a concept of "null-safe equal" in Spark (operator <=>) which always returns either TRUE or FALSE, and there are other ways in other systems to achieve a similar result. But yes -- if you want a null-safe equal (or comparison) in a system that doesn't explicitly claim to support it, you need to turn e.g. x > 10 into x IS NOT NULL AND x > 10.
Additionally, for our predicate filters, a missing value behaves like NULL some of the time (e.g. comparisons against it return NULL), but it's NOT the same when it comes to IS [NOT] NULL predicates. Because if a value is outright missing, we don't know whether that's because it was null, or because it was unavailable. So a missing input has to cause IS [NOT] NULL to return NULL, when it comes to data skipping. For other comparisons, the correct thing already happens naturally because the result is the same for missing vs. NULL.
To give an example, suppose the predicate was:
WHERE partition_col > 10 AND value_col = 42
Then for partition pruning, we won't have stats for value_col and we'll effectively end up with:
WHERE partition_col > 10 AND NULL
That's totally fine -- FALSE AND NULL is FALSE, so we can still skip. TRUE AND NULL is NULL, but that's fine because TRUE and NULL both force us to keep the file.
Complicated stuff! For more details, see extensive code comments including:
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/predicates/mod.rs#L25-L44
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/predicates/parquet_stats_skipping.rs#L109-L153
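The AND truth table the skipping logic relies on is small enough to sketch directly. This is a self-contained illustration (not kernel code), with None standing in for SQL NULL:

```rust
// SQL three-valued logic for AND, with None standing in for NULL.
fn kleene_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // FALSE dominates: FALSE AND anything is FALSE, so we can still skip.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        // Every remaining combination involves NULL and stays NULL.
        _ => None,
    }
}

fn main() {
    // WHERE partition_col > 10 AND value_col = 42, with no stats for value_col:
    assert_eq!(kleene_and(Some(false), None), Some(false)); // file can be skipped
    assert_eq!(kleene_and(Some(true), None), None);         // file must be kept
}
```

Data skipping then only acts on an actual Some(false); both Some(true) and None keep the file.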
I should add, data skipping and SQL WHERE have conflicting requirements:
- Data skipping must keep the file unless the predicate returns an actual FALSE
- WHERE discards the row unless the predicate returns an actual TRUE
For partition pruning, we should in theory use SQL WHERE semantics -- but that's only safe if we're certain that every column the predicate references is a partition column, which is not guaranteed. Hence my compromise suggestion, that we should ensure the partition values map always has a value (Scalar::Null by default) for every partition column.
Additionally, we will probably want to hoist the eval_sql_where function from ParquetStatsSkippingFilter (which can probably go away) up to the PredicateEvaluator trait so you can use it. That way, the required IS NOT NULL checks would be injected automatically. That would also allow us to simplify the can_statically_skip_all_files function in scan/mod.rs, because it could also use a DefaultPredicateEvaluator (with an empty column resolver) instead of weirdly using parquet stats skipping logic.
Thank you for taking the time to write this all up. From reading the delta spec, specifically this line in requirements for writers:
Values for all partition columns present in the schema MUST be present for all files in the table.
This would lead me to think that the right thing to do is that if any of the partitionValues are entirely missing, we should return an error because it doesn't match the spec.
Also, from my reading of the spec it looks like null values should be an empty string per partition value serialization. The acceptance tests currently have things like "partitionValues":{"letter":"a","date":null,"data":"x"}, which is also how my other examples show up. But to meet the spec it looks like we need to also support the empty string case. I don't immediately see any examples in the repo that have this for a unit test, so I'll be sure to add both variants (null and "").
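Handling both variants could be sketched like this. This is a hedged, self-contained illustration following the comment's reading of the spec (`PartitionValue` and `parse_partition_value` are hypothetical names, not kernel types; whether "" should instead mean a genuine empty string for string-typed columns is left aside here):

```rust
// Illustrative stand-in for a parsed partition value; not a kernel type.
#[derive(Debug, PartialEq)]
enum PartitionValue {
    Null,
    Value(String),
}

// Accept both variants seen in the wild: a JSON null, and (per the
// comment's reading of the spec) an empty string, both meaning NULL.
fn parse_partition_value(raw: Option<&str>) -> PartitionValue {
    match raw {
        None | Some("") => PartitionValue::Null,
        Some(s) => PartitionValue::Value(s.to_string()),
    }
}

fn main() {
    // e.g. "partitionValues":{"letter":"a","date":null}
    assert_eq!(parse_partition_value(Some("a")), PartitionValue::Value("a".into()));
    assert_eq!(parse_partition_value(None), PartitionValue::Null);
    assert_eq!(parse_partition_value(Some("")), PartitionValue::Null);
}
```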
Hoisting eval_sql_where up so DefaultPredicateEvaluator can use it: #627
Thank you for taking the time to write this all up. From reading the delta spec, specifically this line in requirements for writers:
Values for all partition columns present in the schema MUST be present for all files in the table.
This would lead me to think that the right thing to do is that if any of the partitionValues are entirely missing that we should return an error because it doesn't match the spec.

That sounds right to me as well. However:
- We may want to ask delta-rs folks if they've seen other behaviors in the wild that we need to consider tolerating.
- We still have to worry about the non-partition columns the predicate might mention -- they won't have entries in add.partitionValues and we want them to resolve as None (not Scalar::Null).
Maybe the simplest thing to do -- in this partition skipping code at least -- is to treat missing partition values as if they were value columns. They would simply not participate in skipping, and the query that comes later can error out as appropriate.
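That distinction between "known NULL" and "no information" can be sketched in a small, self-contained resolver (the `Scalar` enum and `resolve_column` here are illustrative, not the kernel's types):

```rust
use std::collections::HashMap;

// Illustrative stand-in for the kernel's Scalar; not the real type.
#[derive(Clone, Debug, PartialEq)]
enum Scalar {
    Null,
    String(String),
}

// A known partition column that is NULL resolves to Some(Scalar::Null), so
// IS [NOT] NULL predicates still work. A column with no entry at all
// resolves to None: its value is unknown, comparisons against it yield
// NULL, and it can never cause a file to be skipped.
fn resolve_column(
    partition_values: &HashMap<String, Option<String>>,
    col: &str,
) -> Option<Scalar> {
    match partition_values.get(col) {
        None => None,
        Some(None) => Some(Scalar::Null),
        Some(Some(v)) => Some(Scalar::String(v.clone())),
    }
}

fn main() {
    let mut pv = HashMap::new();
    pv.insert("letter".to_string(), Some("b".to_string()));
    pv.insert("date".to_string(), None);
    assert_eq!(resolve_column(&pv, "letter"), Some(Scalar::String("b".into())));
    assert_eq!(resolve_column(&pv, "date"), Some(Scalar::Null));
    assert_eq!(resolve_column(&pv, "value_col"), None);
}
```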
Thank you for all the thoughtful feedback. I have to step away from this for a few days to focus on other things, but it is high priority for me because it will greatly impact many of my workflows.
I've added a unit test. Aside from passing the selection vector around to reduce work, are there other show stoppers anyone sees remaining?
Aside from passing the selection vector around to reduce work, are there other show stoppers anyone sees remaining?
Show stoppers -- no. I added some comments on some potential ways to simplify and harden the code tho.
Nice, thanks. Just a few more things I think
kernel/src/scan/log_replay.rs (Outdated)
filter: DataSkippingFilter::new(engine, physical_predicate),
partition_filter: PartitionSkippingFilter::new(
    engine,
    partition_predicate.cloned(),
nit: it's nice to note that this is a cheap arc clone
assert_eq!(selection_vector.len(), actions.len());
assert_eq!(partition_selection_vector.len(), actions.len());

let selection_vector = data_selection_vector
why not just:
- let selection_vector = data_selection_vector
+ data_selection_vector.append(&mut partition_selection_vector);
+ let selection_vector = data_selection_vector;
I don't think we want to do this - we want to zip one iter with the other, not append it.
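The zip described here would look something like the following self-contained sketch (`combine_selection` is an illustrative name): a row survives only if both the data-skipping and partition-skipping vectors keep it, whereas `append` would produce a vector of twice the length.

```rust
// Combine two per-row selection vectors: a row stays selected only if it
// passes both the data-skipping and the partition-skipping filters.
fn combine_selection(data_sv: &[bool], partition_sv: &[bool]) -> Vec<bool> {
    assert_eq!(data_sv.len(), partition_sv.len());
    data_sv
        .iter()
        .zip(partition_sv)
        .map(|(d, p)| *d && *p)
        .collect()
}

fn main() {
    let data_sv = vec![true, true, false];
    let partition_sv = vec![true, false, true];
    assert_eq!(combine_selection(&data_sv, &partition_sv), vec![true, false, false]);
}
```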
Thanks for the feedback. I'm going to try to find some time this weekend to resolve the remaining comments.
Hi Tim. I'm going to go ahead and merge #607 to unblock some stuff. I'm happy to help merge/rebase this as we planned in #624 (comment), or you can as well. Just let me know how you'd like to proceed. Thanks!
That’s understandable. I’m sorry, I’ve been meaning to get to this sooner but had some high priority things come up.
@timsaucer FYI when you do get back to this, you should now be able to use
No worries at all, we appreciate your time! #607 is merged now, so you could begin the refactor. If you'd like me to help with that lmk
… is necessary to skip the other partition columns. Also, if the evaluation returns a None it means the evaluation is not possible, such as equality of a scalar value and a null. These should return false in the filter.
Force-pushed from 2e071d9 to 7122332.
Ok, finally getting back around to looking at this. I did a rebase and resolved the conflicts. Now I'm going to try to go back through all of the above comments.
… rather than assuming the caller knows what they're doing
Ok, that's all I can do today. I need to come back and address the question of looking at both adds and removes and seeing how to merge this with the AddRemoveDedupVisitor as suggested. Sorry, this is turning out to be a far more complicated PR than I had originally anticipated.
I need to come back and address the question of looking at both adds and removes and seeing how to merge this with the AddRemoveDedupVisitor as suggested.
Sorry, this is turning out to be a far more complicated PR than I had originally anticipated.
This has been a super helpful exploration, thanks for tackling it. It prompted us to take a hard look at this code, and already resulted in helpful changes to kernel -- and this PR didn't even merge yet.
After staring at your code and the data transform changes @nicklan recently merged, I think we can implement partition pruning with a very targeted change in AddRemoveDedupVisitor::is_valid_add. Today, that method has the following structure:
fn is_valid_add<'a>(&mut self, i: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<bool> {
    ...
    let (path, dv_getters, is_add) = ... figure out if this is an add or remove ...;
    ...
    // Check both adds and removes (skipping already-seen), but only transform and return adds
    let file_key = FileActionKey::new(path, dv_unique_id);
    if self.check_and_record_seen(file_key) || !is_add {
        return Ok(false);
    }
    ... call self.get_transform_expr to deal with partition values ...
    Ok(true)
}
Crucially, the new transform parameter @nicklan added contains the information we need to find and parse partition values. So I think we can inject something like the following, right before the call to check_and_record_seen (so we don't waste memory tracking pruned files):
let partition_values = match &self.transform {
    Some(transform) if is_add => {
        let partition_values = getters[1].get(i, "add.partitionValues")?;
        let partition_values = self.parse_partition_values(transform, &partition_values)?;
        if self.is_file_partition_pruned(&partition_values) {
            return Ok(false);
        }
        partition_values
    }
    _ => Default::default(),
};
Here:
- parse_partition_values steals the partition value discovery+parsing logic that already exists in AddRemoveDedupVisitor::get_transform_expr.
- And then is_file_partition_pruned would turn the parsed partition values into a DefaultPredicateEvaluator (just like this PR already does, just in a different place).
- If we're feeling fancy, we could even pass the extracted partition values back to get_transform_expr, so it doesn't have to parse them a second time.
We would also need to update the visitor (and its new method) to pass the physical predicate ExpressionRef as a predicate_filter, alongside the existing data skipping filter, for use by the default predicate evaluator.
Note: we can only prune adds (not removes) -- see my other comment.
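The pruning decision itself can be modeled in a self-contained toy (the `Pred` enum, `eval`, and `is_file_partition_pruned` here are illustrative stand-ins, not the kernel's Expression or DefaultPredicateEvaluator): a file is pruned only when the predicate evaluates to an actual FALSE, while TRUE and NULL (None) both keep it.

```rust
use std::collections::HashMap;

// Toy predicate over partition values; not a kernel type.
enum Pred {
    ColEq(&'static str, &'static str),
    And(Box<Pred>, Box<Pred>),
}

// Three-valued evaluation: Some(true)/Some(false) are definite, None is NULL.
fn eval(pred: &Pred, pv: &HashMap<&str, Option<&str>>) -> Option<bool> {
    match pred {
        Pred::ColEq(col, lit) => match pv.get(col) {
            Some(Some(v)) => Some(v == lit),
            Some(None) => None, // comparison against NULL yields NULL
            None => None,       // non-partition column: value unknown
        },
        Pred::And(a, b) => match (eval(a, pv), eval(b, pv)) {
            (Some(false), _) | (_, Some(false)) => Some(false),
            (Some(true), Some(true)) => Some(true),
            _ => None,
        },
    }
}

// Only a definite FALSE allows pruning; TRUE and NULL both keep the file.
fn is_file_partition_pruned(pred: &Pred, pv: &HashMap<&str, Option<&str>>) -> bool {
    eval(pred, pv) == Some(false)
}

fn main() {
    let pv: HashMap<_, _> = [("letter", Some("a"))].into_iter().collect();
    // letter = 'b' is definitely FALSE for this file, so it can be pruned.
    assert!(is_file_partition_pruned(&Pred::ColEq("letter", "b"), &pv));
    // letter = 'a' AND value_col = '42' is NULL (value_col unknown): keep it.
    let pred = Pred::And(
        Box::new(Pred::ColEq("letter", "a")),
        Box::new(Pred::ColEq("value_col", "42")),
    );
    assert!(!is_file_partition_pruned(&pred, &pv));
}
```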
    .filter(|f| partition_columns.contains(f.name()))
    .cloned()
    .peekable();
partition_fields.peek()?;
Maybe helpful to make it clear we're intentionally discarding a result?
- partition_fields.peek()?;
+ let _ = partition_fields.peek()?;
No worries! We really appreciate the time. The comment above from @scovich is hopefully helpful in simplifying things, and feel free to ping us here or on the "Delta Users" slack if you have more questions.
I don't mean to pass the buck on this, but I really don't understand the transform logic that was added in recently. I'm trying to grok it from the code comments, but I think they assume a level of familiarity with this code beyond what I have. If anyone has time to pick this up and pull it over the line, I'd really appreciate it. Otherwise I'm going to need to spend more time trying to understand the transform PR that went in.
What changes are proposed in this pull request?
This supersedes #615
This PR adds in a partition filter step that is similar to the data skipping. It adds in a row visitor that checks to see if any filters should be applied at a file level based on partition values. The approach here is based on the discussion in #607.
How was this change tested?
Tested in datafusion against an existing partitioned dataset and also against the integration test datasets for both multi-partition and single partition. I have also tested that combinations of data skipping and partition skipping are returning the correct number of files in the scan.
Remaining TODO items before this is ready for review/merge: