Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: handle v2 uuid named json/parquet checkpoints #3222

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

zeevm
Copy link
Contributor

@zeevm zeevm commented Feb 16, 2025

Description

Match V2 UUID named checkpoints and read JSON checkpoints

Related Issue(s)

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 16, 2025
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@zeevm zeevm changed the title Handle V2 UUID named JSON/Parquet checkpoints fix: Handle V2 UUID named JSON/Parquet checkpoints Feb 16, 2025
@zeevm zeevm changed the title fix: Handle V2 UUID named JSON/Parquet checkpoints fix: handle v2 uuid named json/parquet checkpoints Feb 16, 2025
Copy link

codecov bot commented Feb 16, 2025

Codecov Report

Attention: Patch coverage is 51.42857% with 17 lines in your changes missing coverage. Please review.

Project coverage is 72.28%. Comparing base (2f1a5ac) to head (ff94051).

Files with missing lines Patch % Lines
crates/core/src/kernel/snapshot/log_segment.rs 51.42% 17 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3222      +/-   ##
==========================================
- Coverage   72.31%   72.28%   -0.04%     
==========================================
  Files         138      138              
  Lines       45398    45432      +34     
  Branches    45398    45432      +34     
==========================================
+ Hits        32831    32840       +9     
- Misses      10489    10510      +21     
- Partials     2078     2082       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ion-elgreco
Copy link
Collaborator

@zeevm could you add some tests?

@ion-elgreco ion-elgreco reopened this Feb 16, 2025
@zeevm
Copy link
Contributor Author

zeevm commented Feb 17, 2025

@ion-elgreco
I don't have a V2 UUID checkpoint delta table sample I can use for a UT. I have one such table I got from a customer to debug an issue they're having but obviously I can't share that as a sample table.
I've experimented with databricks but couldn't get it to generate such checkpoint file, instead I got something like "00000000000000000001.00000000000000000006.compacted.json" as a checkpoint.

If anyone can tell me exactly how to generate such table in databricks I'll do it, otherwise I don't see how to add a UT, all I can do is verify the fix locally with my customers' table.

@rtyler
Copy link
Member

rtyler commented Feb 18, 2025

@zeevm I took a look at this pull request this morning, and I''m not sure if we can safely merge it. I believe these uuid checkpoints are v2 checkpoints which can be enabled via a table feature.

They're structurally different than v1 checkpoints and may include sidecars which contain additional checkpoint data that I believe delta-rs will currently ignore.

Without sharing the table data (obviously 😆) can you share the error that was run into? I can imagine a scenario where older versions of the transaction log were cleaned up and the historical information on a table would be in v2 checkpoints causing us problems reading the table 🤔

@rtyler rtyler self-assigned this Feb 18, 2025
@zeevm
Copy link
Contributor Author

zeevm commented Feb 18, 2025

@rtyler I've created such sample table I can share, try opening this table
delta_v2_uuid_json_checkpoint.zip

@zeevm
Copy link
Contributor Author

zeevm commented Feb 18, 2025

@rtyler I see all the "sidecar" code is already implemented, why would the library have a problem reading them?
The crate documentation says reader level 3 is supported, this includes V2 checkpoints, if the lib can't read V2 checkpoints why does it say in the doc it does?

@rtyler
Copy link
Member

rtyler commented Feb 19, 2025

I see all the "sidecar" code is already implemented, why would the library have a problem reading them?
The crate documentation says reader level 3 is supported, this includes V2 checkpoints, if the lib can't read V2 checkpoints why does it say in the doc it does?

There is an ability to parse a sidecar action in the transaction log, which doesn't mean much other than the transaction log wouldn't be unreadable to delta-rs if these actions are present in the table. "v2 checkpoints" are a table feature which require writer/reader version 3/7, which can definitely be confusing. Supporting reader v3, for example, does not mean that a reader supports all the table features which exist.

Thanks for the sample table, I'll take a looksee at it shortly.

@zeevm
Copy link
Contributor Author

zeevm commented Feb 21, 2025

@rtyler any insights with the table I provided?

@ion-elgreco
Copy link
Collaborator

Delta-kernel-rs supports v2 checkpoints reading. @roeap when do you expect your PR to be ready move over to the new log replay relying more on kernel, I assume v2 checkpoints will be auto supported by that change?

@ion-elgreco ion-elgreco marked this pull request as draft March 1, 2025 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error reading V2 UUID-named checkpoint
3 participants