feat: add visitors CheckpointFileActionsVisitor and CheckpointNonFileActionsVisitor #738
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #738 +/- ##
==========================================
+ Coverage 84.29% 84.46% +0.16%
==========================================
Files 77 81 +4
Lines 19099 19584 +485
Branches 19099 19584 +485
==========================================
+ Hits 16100 16542 +442
- Misses 2201 2216 +15
- Partials 798 826 +28
0221955 to 19733cd
/// access stale snapshots. A remove action remains as a tombstone in a checkpoint file until
/// it expires, which happens when the current time exceeds the removal timestamp plus the
/// expiration threshold.
fn is_expired_tombstone<'a>(&self, i: usize, getter: &'a dyn GetData<'a>) -> DeltaResult<bool> {
By default, remove.deletion_timestamp is set to 0 in the Java kernel, so if the field is not present the remove action counts as expired and is excluded from the checkpoint file.
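To make the expiration rule and the default-to-0 behavior concrete, here is a minimal sketch. It is not the signature used in this PR: the free function, the `now_ms` and `retention_ms` parameters, and the strict comparison at the boundary are assumptions chosen to match the doc comment ("current time exceeds the removal timestamp plus the expiration threshold").

```rust
/// Hypothetical, simplified stand-in for the visitor method under review.
///
/// * `deletion_timestamp`: value read from the remove action, `None` if the field is absent.
/// * `now_ms`: current time in milliseconds since the epoch (assumed unit).
/// * `retention_ms`: tombstone expiration threshold in milliseconds (assumed unit).
fn is_expired_tombstone(deletion_timestamp: Option<i64>, now_ms: i64, retention_ms: i64) -> bool {
    // A missing timestamp falls back to 0, mirroring the Java kernel default
    // mentioned in the review comment.
    let removed_at = deletion_timestamp.unwrap_or(0);
    // Expired once the current time exceeds removal time plus the threshold.
    // saturating_add avoids overflow for very large timestamps.
    now_ms > removed_at.saturating_add(retention_ms)
}
```

Under these assumptions, a remove action with no timestamp is expired for any positive `now_ms` larger than the threshold, so it would be dropped from the checkpoint, while a tombstone exactly at the boundary is still kept.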
What changes are proposed in this pull request?
This PR implements the core visitor components necessary for writing checkpoint files. Resolves #737.
A complete V1 checkpoint encapsulates:
CheckpointFileActionsVisitor
This visitor selects the FILE actions with a selection vector, implementing the following logic:
CheckpointNonFileActionsVisitor
This visitor selects the NON-FILE actions with a selection vector, implementing the following logic:
Design Considerations
We split the functionality into two separate visitors for extensibility purposes. While V1 checkpoints will apply both visitors in a single pass, V2 checkpoints require processing file actions and non-file actions in different phases:
This separation provides the flexibility needed for future V2 checkpoint support.
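As a rough illustration of why the split helps, the sketch below models each visitor as a function that produces a selection vector over a batch. Everything here is hypothetical and heavily simplified: the `ActionKind` enum, the function names, and the idea that a V1 pass is the element-wise OR of the two vectors are assumptions for illustration, not the PR's implementation.

```rust
/// Hypothetical, simplified action tag; the real visitors read columnar data.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ActionKind {
    Add,
    Remove,
    Protocol,
    Metadata,
    Txn,
    /// Stand-in for an action that neither visitor selects (assumption).
    CommitInfo,
}

/// Selects file actions only (add and remove).
fn select_file_actions(batch: &[ActionKind]) -> Vec<bool> {
    batch
        .iter()
        .map(|a| matches!(a, ActionKind::Add | ActionKind::Remove))
        .collect()
}

/// Selects non-file actions only (protocol, metadata, txn).
fn select_non_file_actions(batch: &[ActionKind]) -> Vec<bool> {
    batch
        .iter()
        .map(|a| matches!(a, ActionKind::Protocol | ActionKind::Metadata | ActionKind::Txn))
        .collect()
}

/// V1-style single pass: a row is kept if either visitor selects it.
fn select_v1(batch: &[ActionKind]) -> Vec<bool> {
    select_file_actions(batch)
        .into_iter()
        .zip(select_non_file_actions(batch))
        .map(|(file, non_file)| file || non_file)
        .collect()
}
```

A V2-style writer could instead call `select_file_actions` and `select_non_file_actions` in separate phases and route each result to a different output, which is the flexibility the split is meant to preserve.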
Questions for Reviewers
The check_and_record_seen method on the CheckpointFileActionsVisitor is identical to the method on AddRemoveDedupVisitor. Should we refactor?
AddRemoveDedupVisitor is implemented in src/scan/log_replay.rs as it is only used there. Should we consider moving these new visitors once we build out the use-case?
selected_column_names_and_types method. Can this requirement be relaxed?
How was this change tested?
File Actions Visitor Tests:
test_parse_checkpoint_file_action_visitor
test_checkpoint_file_action_visitor_boundary_cases_for_tombstone_expiration
test_checkpoint_file_action_visitor_duplicate_file_actions_in_log_batch
(path, dvId) handling in log batches
test_checkpoint_file_action_visitor_file_actions_in_checkpoint_batch
test_checkpoint_file_action_visitor_with_deletion_vectors
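The (path, dvId) deduplication that the file-action tests above exercise can be sketched roughly as follows. The `FileActionKey` and `SeenTracker` types are hypothetical, and the rule that only log batches record new keys (while checkpoint batches only consult the set) is an assumption about the intended behavior, not something confirmed by this PR's text.

```rust
use std::collections::HashSet;

/// Hypothetical dedup key: file path plus optional deletion-vector id.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct FileActionKey {
    path: String,
    dv_unique_id: Option<String>,
}

/// Hypothetical holder for the seen-set used while replaying the log.
struct SeenTracker {
    seen: HashSet<FileActionKey>,
    is_log_batch: bool,
}

impl SeenTracker {
    fn new(is_log_batch: bool) -> Self {
        Self { seen: HashSet::new(), is_log_batch }
    }

    /// Returns true if `key` was already seen, meaning the action should be skipped.
    /// New keys are recorded only for log batches (assumption); checkpoint
    /// batches only check against what earlier log batches recorded.
    fn check_and_record_seen(&mut self, key: FileActionKey) -> bool {
        if self.seen.contains(&key) {
            return true;
        }
        if self.is_log_batch {
            self.seen.insert(key);
        }
        false
    }
}
```

Because the deletion-vector id is part of the key, the same path with a different deletion vector counts as a distinct file action in this sketch.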
Non-File Actions Visitor Tests:
test_parse_checkpoint_non_file_actions_visitor
test_checkpoint_non_file_actions_visitor_already_seen_actions
test_checkpoint_non_file_actions_visitor_duplicate_non_file_actions