feat: add visitors CheckpointFileActionsVisitor and CheckpointNonFileActionsVisitor #738

Open
wants to merge 4 commits into
base: main

sebastiantia
Collaborator

@sebastiantia sebastiantia commented Mar 12, 2025

What changes are proposed in this pull request?

This PR implements the core visitor components necessary for writing checkpoint files. Resolves #737.

A complete V1 checkpoint encapsulates:

  1. All FILE actions that make up the state of a version of a table:
  • Add actions (after action reconciliation)
  • Unexpired remove actions (remove tombstones)
  2. All NON-FILE actions that make up the state of a version of a table:
  • Protocol action
  • Metadata action
  • Txn actions

CheckpointFileActionsVisitor
This visitor selects the FILE actions with a selection vector, implementing the following logic:

  1. Processes add/remove actions with proper deduplication based on (path, deletion vector ID) pairs
  2. Optimization: Only records already-seen file paths for log file batches. Checkpoint batches are the last batches to be processed, and actions in checkpoint files do not conflict with other actions in checkpoint files, so they do not need to be tracked.
  3. Applies tombstone expiration logic by filtering out remove actions with deletion timestamps older than the minimum file retention timestamp
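
The three rules above can be sketched on simplified, row-level data. This is a minimal illustration rather than the PR's implementation: `FileAction`, `FileActionSelector`, and the `is_log_batch` parameter are hypothetical names, the real visitor works on columnar batches, and treating only strictly older timestamps as expired is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical row-level stand-in for a file action; the real visitor reads columnar batches.
enum FileAction {
    Add { path: String, dv_id: Option<String> },
    Remove { path: String, dv_id: Option<String>, deletion_timestamp: i64 },
}

/// Hypothetical selector mirroring the three rules listed above.
struct FileActionSelector {
    seen: HashSet<(String, Option<String>)>,
    minimum_file_retention_timestamp: i64,
}

impl FileActionSelector {
    fn new(minimum_file_retention_timestamp: i64) -> Self {
        Self { seen: HashSet::new(), minimum_file_retention_timestamp }
    }

    /// Returns true if the key was already seen. Keys are recorded only for log batches.
    fn check_and_record_seen(&mut self, key: (String, Option<String>), is_log_batch: bool) -> bool {
        if self.seen.contains(&key) {
            return true;
        }
        if is_log_batch {
            self.seen.insert(key);
        }
        false
    }

    /// Returns the selection-vector entry for one action. Call in newest-to-oldest order.
    fn select(&mut self, action: &FileAction, is_log_batch: bool) -> bool {
        let (key, expired) = match action {
            FileAction::Add { path, dv_id } => ((path.clone(), dv_id.clone()), false),
            FileAction::Remove { path, dv_id, deletion_timestamp } => (
                (path.clone(), dv_id.clone()),
                // Assumption: only strictly older timestamps count as expired.
                *deletion_timestamp < self.minimum_file_retention_timestamp,
            ),
        };
        // Record the key before the expiry check so an expired tombstone still shadows older adds.
        if self.check_and_record_seen(key, is_log_batch) {
            return false;
        }
        !expired
    }
}
```

In this sketch an expired remove is dropped from the selection but still recorded, so an older add for the same (path, deletion vector ID) pair is not resurrected.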

CheckpointNonFileActionsVisitor
This visitor selects the NON-FILE actions with a selection vector, implementing the following logic:

  1. Ensures exactly one protocol action is included (the newest one encountered)
  2. Ensures exactly one metadata action is included (the newest one encountered)
  3. Deduplicates transaction (txn) actions by app ID to include only the newest action for each app ID
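
A comparable sketch for the non-file rules, again on simplified rows visited newest-first. `NonFileAction` and `NonFileActionSelector` are hypothetical names used only for illustration; they are not the types added in this PR.

```rust
use std::collections::HashSet;

/// Hypothetical row-level stand-in for a non-file action.
enum NonFileAction {
    Protocol,
    Metadata,
    Txn { app_id: String },
}

/// Hypothetical selector: keeps the first (newest) protocol, metadata, and per-app txn action.
#[derive(Default)]
struct NonFileActionSelector {
    seen_protocol: bool,
    seen_metadata: bool,
    seen_txn_app_ids: HashSet<String>,
}

impl NonFileActionSelector {
    /// Returns the selection-vector entry for one action. Call in newest-to-oldest order.
    fn select(&mut self, action: &NonFileAction) -> bool {
        match action {
            // `replace` returns the previous flag, so only the first protocol action is kept.
            NonFileAction::Protocol => !std::mem::replace(&mut self.seen_protocol, true),
            NonFileAction::Metadata => !std::mem::replace(&mut self.seen_metadata, true),
            // `insert` returns true only when the app ID was not already present.
            NonFileAction::Txn { app_id } => self.seen_txn_app_ids.insert(app_id.clone()),
        }
    }
}
```

Because the state lives on the selector, an action seen in an earlier batch also suppresses duplicates in later batches.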

Design Considerations
We split the functionality into two separate visitors for extensibility purposes. While V1 checkpoints will apply both visitors in a single pass, V2 checkpoints require processing file actions and non-file actions in different phases:

  1. First, process file actions to write sidecar files
  2. Then, process non-file actions to write the top-level V2 checkpoint file

This separation provides the flexibility needed for future V2 checkpoint support.
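
One possible way to apply both visitors in a single V1 pass is to merge their selection vectors row by row. The helper below is a hypothetical sketch, not code from this PR; it assumes each visitor yields one boolean per row of the same batch.

```rust
/// Hypothetical helper: merges two per-row selection vectors for a single-pass V1 checkpoint.
fn combine_selection_vectors(file_actions: &[bool], non_file_actions: &[bool]) -> Vec<bool> {
    assert_eq!(
        file_actions.len(),
        non_file_actions.len(),
        "both vectors must cover the same batch"
    );
    // A row is kept if either visitor selected it.
    file_actions
        .iter()
        .zip(non_file_actions)
        .map(|(file, non_file)| *file || *non_file)
        .collect()
}
```

For V2, the two vectors would instead be consumed separately, one per phase.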

Questions for Reviewers

  1. Should we combine these visitors into a third visitor implementation for V1 checkpoints?
  2. The check_and_record_seen method on the CheckpointFileActionsVisitor is identical to the method on AddRemoveDedupVisitor. Should we refactor?
  3. The AddRemoveDedupVisitor is implemented in src/scan/log_replay.rs as it is only used there. Should we consider moving these new visitors once we build out the use-case?
  4. The visitor infrastructure expects the columns in the visited data to follow the same order as defined in the visitor's selected_column_names_and_types method. Can this requirement be relaxed?

How was this change tested?

File Actions Visitor Tests:

  • test_parse_checkpoint_file_action_visitor
    • Tests basic visitor functionality for file actions
  • test_checkpoint_file_action_visitor_boundary_cases_for_tombstone_expiration
    • Tests various timestamp scenarios for tombstone expiration
  • test_checkpoint_file_action_visitor_duplicate_file_actions_in_log_batch
    • Tests duplicate (path, dvId) handling in log batches
  • test_checkpoint_file_action_visitor_file_actions_in_checkpoint_batch
    • Tests action handling in checkpoint batches
  • test_checkpoint_file_action_visitor_with_deletion_vectors
    • Tests file actions with deletion vectors

Non-File Actions Visitor Tests:

  • test_parse_checkpoint_non_file_actions_visitor
    • Tests basic visitor functionality for non-file actions
  • test_checkpoint_non_file_actions_visitor_already_seen_actions
    • Tests skipping actions when non-file actions have been tracked in previously visited batches
  • test_checkpoint_non_file_actions_visitor_duplicate_non_file_actions
    • Tests multiple protocol, metadata, and txn actions in a batch

codecov bot commented Mar 12, 2025

Codecov Report

Attention: Patch coverage is 91.25683% with 32 lines in your changes missing coverage. Please review.

Project coverage is 84.46%. Comparing base (db86d97) to head (f91baeb).
Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/actions/visitors.rs | 91.23% | 10 Missing and 22 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #738      +/-   ##
==========================================
+ Coverage   84.29%   84.46%   +0.16%     
==========================================
  Files          77       81       +4     
  Lines       19099    19584     +485     
  Branches    19099    19584     +485     
==========================================
+ Hits        16100    16542     +442     
- Misses       2201     2216      +15     
- Partials      798      826      +28     


/// access stale snapshots. A remove action remains as a tombstone in a checkpoint file until
/// it expires, which happens when the current time exceeds the removal timestamp plus the
/// expiration threshold.
fn is_expired_tombstone<'a>(&self, i: usize, getter: &'a dyn GetData<'a>) -> DeltaResult<bool> {
Collaborator Author

@sebastiantia sebastiantia Mar 21, 2025

By default, remove.deletion_timestamp is set to 0 in the Java kernel. So if the field is not present, the remove action is treated as expired and excluded from the checkpoint file.
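
A minimal sketch of that default, assuming a free function in place of the visitor method and an `Option<i64>` in place of the `GetData` getter (both simplifications are hypothetical):

```rust
/// Hypothetical free-function version of the expiry check.
fn is_expired_tombstone(
    deletion_timestamp: Option<i64>,
    minimum_file_retention_timestamp: i64,
) -> bool {
    // Assumption: a missing timestamp is read as 0, mirroring the default described above.
    // Assumption: only strictly older timestamps count as expired.
    deletion_timestamp.unwrap_or(0) < minimum_file_retention_timestamp
}
```

With any positive retention timestamp, a remove action without the field is therefore reported as expired.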

Development

Successfully merging this pull request may close these issues.

New FileActionsVisitor and NonFileActionsVisitor visitors
1 participant