
fix!: get prefix from offset path #699

Merged
roeap merged 41 commits into delta-io:main from feat/engine-fs on Mar 21, 2025
Conversation

@roeap (Collaborator) commented Feb 14, 2025

What changes are proposed in this pull request?

Our `list_from` implementation for the object_store-based filesystem client is currently broken, since it does not behave as documented/required for that function. Specifically, we should list all files in the parent folder, using the given path as the offset to start listing from.

In a follow-up PR we then need to lift the assumption that all URLs are always under the same store, to get proper URL handling.

This PR affects the following public APIs

  • `DefaultEngine::new` no longer requires a `table_root` parameter.
  • `list_from` consistently returns keys strictly greater than (>) the offset; previously the sync engine's client returned all keys greater than or equal to (>=) the offset.

How was this change tested?

Additional tests asserting consistent `list_from` behavior across all our file system client implementations.
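As a rough illustration of the contract these tests assert, a minimal sketch against `object_store`'s in-memory store (illustrative only; the test name, data layout, and setup are assumptions, not the kernel's actual test code):

```rust
use bytes::Bytes;
use futures::TryStreamExt;
use object_store::{memory::InMemory, path::Path, ObjectStore};

// Sketch: listing from an offset must return only keys strictly greater
// than (>) the offset, sorted by name.
#[tokio::test]
async fn list_from_returns_keys_strictly_greater_than_offset() {
    let store = InMemory::new();
    for name in ["_delta_log/0.json", "_delta_log/1.json", "_delta_log/2.json"] {
        store.put(&Path::from(name), Bytes::new().into()).await.unwrap();
    }

    let offset = Path::from("_delta_log/1.json");
    let listed: Vec<_> = store
        .list_with_offset(Some(&Path::from("_delta_log")), &offset)
        .try_collect()
        .await
        .unwrap();

    // The offset itself is excluded; only "_delta_log/2.json" remains.
    assert!(listed.iter().all(|meta| meta.location > offset));
    assert_eq!(listed.len(), 1);
}
```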


codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 8 lines in your changes missing coverage. Please review.

Project coverage is 84.36%. Comparing base (3065a76) to head (79bd24d).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
kernel/src/engine/default/filesystem.rs 72.22% 4 Missing and 1 partial ⚠️
kernel/src/engine/default/json.rs 0.00% 0 Missing and 1 partial ⚠️
kernel/src/engine/default/mod.rs 90.00% 0 Missing and 1 partial ⚠️
kernel/src/engine/sync/json.rs 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #699      +/-   ##
==========================================
- Coverage   84.37%   84.36%   -0.02%     
==========================================
  Files          80       81       +1     
  Lines       19221    19246      +25     
  Branches    19221    19246      +25     
==========================================
+ Hits        16218    16236      +18     
- Misses       2201     2205       +4     
- Partials      802      805       +3     

@github-actions bot added the breaking-change (Change that will require a version bump) label on Feb 14, 2025
@roeap roeap requested review from zachschuermann, scovich, nicklan and OussamaSaoudi and removed request for zachschuermann and scovich February 15, 2025 00:11
Signed-off-by: Robert Pack <[email protected]>
Comment on lines +349 to +353
/// List the paths in the same directory that are lexicographically greater than
/// (UTF-8 sorting) the given `path`. The result should also be sorted by the file name.
///
/// If the path is directory-like (ends with '/'), the result should contain
/// all the files in the directory.
@roeap (Collaborator, Author):
Not 100% sure this is the behavior we want to go for, but thought I'd put up the PR for discussion.

Collaborator:
IMO it would be better to split this behavior change to a different PR, if at all possible?

Collaborator:
if we keep this here, can we add it to the PR description? I think this is a breaking change
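For context, the "prefix from offset path" idea in the PR title can be sketched with `object_store` path types roughly like this (illustrative only; the kernel's actual implementation differs in its details and error handling):

```rust
use object_store::path::Path;

// Derive the parent directory of an offset path to use as the listing
// prefix; the offset itself is then passed along to `list_with_offset`.
fn prefix_from_offset(offset: &Path) -> Option<Path> {
    let mut parts: Vec<_> = offset.parts().collect();
    // Drop the file-name component; `None` means the offset has no parent.
    parts.pop()?;
    Some(Path::from_iter(parts))
}
```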

roeap and others added 8 commits March 19, 2025 11:51
…o#700)

Fixes delta-io#698

## What changes are proposed in this pull request?

Updates the `DataSkippingFilter` to treat all columns as nullable for
the purpose of parsing stats, as suggested in
delta-io#698 (comment).

This is particularly important for partition columns, which won't have
values present in stats. But stats are also usually only stored for the
first 32 columns, so we shouldn't rely on stats being present for
non-partition fields either.
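A rough sketch of the "treat everything as nullable" idea, using arrow-schema types for illustration (an assumption; the kernel applies this to its own schema types, and recursively):

```rust
use std::sync::Arc;
use arrow_schema::{Field, Schema};

// For stats parsing only, mark every top-level field nullable so that
// missing stats (partition columns, columns beyond the first 32) parse
// as null instead of failing.
fn as_all_nullable(schema: &Schema) -> Schema {
    let fields: Vec<Arc<Field>> = schema
        .fields()
        .iter()
        .map(|f| Arc::new(f.as_ref().clone().with_nullable(true)))
        .collect();
    Schema::new(fields)
}
```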

## How was this change tested?

I've added a new unit test.

I've also tested building duckdb-delta with this change (cherry-picked
onto 0.6.1) and verified that the code in delta-io#698 now works.
…_snapshot_schema (delta-io#683)

## What changes are proposed in this pull request?
When given a schema (e.g. in `global_scan_state`) the engine needs a way
to visit this schema. This introduces a new API `visit_schema` to allow
engines to visit any schema over FFI. An API called `visit_schema`
previously existed but visited the schema of a given _snapshot_; this
has now been renamed to `visit_snapshot_schema`.


### This PR affects the following public APIs
Renamed `visit_schema` to `visit_snapshot_schema` and now `visit_schema`
takes `SharedSchema` as an argument instead of a snapshot.


## How was this change tested?
updated read_table test
…o#654)

This change introduces arrow_53 and arrow_54 feature flags on kernel
which are _required_ when using default-engine or sync-engine.
Fundamentally, we must push users of the crate to select their arrow
major version through feature flags, since Cargo _will_ include multiple
major versions in the dependency tree, which can cause ABI breakages when
passing around symbols such as `RecordBatch`.

See delta-io#640
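For illustration, a consumer crate would then pick exactly one arrow major version in its manifest; a hypothetical fragment (the feature names follow the flags described above, the version requirement is a placeholder):

```toml
[dependencies]
# Exactly one of arrow_53 / arrow_54 must be selected when using
# default-engine or sync-engine.
delta_kernel = { version = "*", features = ["default-engine", "arrow_54"] }
```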

---------

Signed-off-by: R. Tyler Croy <[email protected]>
Our previous write protocol check was too strict. Now we just ensure that the protocol makes sense given what features are present/specified.

Made all existing `write.rs` tests also write to a protocol 1/1 table, and they all work.
Use new transform functionality to transform data over FFI. This lets us get rid of all the gross partition adding code in c :)

In particular: 
- remove `add_partition_columns` in `arrow.c`, we don't need it anymore
- expose ffi methods to get an expression evaluator and evaluate an
expression from `c`
- use the above to add an `apply_transform` function in `arrow.c`

## How was this change tested?

- existing tests
…ta-io#709)

## What changes are proposed in this pull request?
This PR removes the old `visit_snapshot_schema` introduced in delta-io#683 - we
should just go ahead and do the 'right thing' with having a
`visit_schema` (introduced in the other PR) and a `logical_schema()`
function (added here) in order to facilitate visiting the snapshot
schema. Additionally I've moved the schema-related things up from `scan`
module to top-level in ffi crate. Exact changes listed below; this PR
updates tests/examples to leverage the new changes.

### This PR affects the following public APIs

1. Remove `visit_snapshot_schema()` API
2. Add a new `logical_schema(snapshot)` API so you can get the schema of
a snapshot and use the `visit_schema` directly
3. Renames `free_global_read_schema` to just `free_schema`
4. Moves `SharedSchema` and `free_schema` up from `mod scan` into
top-level `ffi` crate.


## How was this change tested?
existing UT
zachschuermann and others added 15 commits March 19, 2025 19:53
…elta-io#679)


## What changes are proposed in this pull request?

### Summary

This PR introduces foundational changes required for V2 checkpoint read
support. The high-level changes required for v2 checkpoint support are:
Item 1: Allow log segments to be built with V2 checkpoint files.
Item 2: Allow log segment `replay` functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 2.

This PR **does not introduce full v2Checkpoints reader/writer support**
as we are missing support for Item 1, meaning log segments can never
have V2 checkpoint files in the first place. That functionality will be
completed in [PR
delta-io#685](delta-io#685) which is
stacked on top of this PR. However, the changes to log `replay` done
here are compatible with tables using V1 checkpoints, allowing us to
safely merge the changes here.
### Changes

For each batch of `EngineData` from a checkpoint file (sketched in code below):
1. Use the new `SidecarVisitor` to scan the batch for sidecar file paths embedded in sidecar actions.
2. If sidecar file paths exist:
    - Read the corresponding sidecar files.
    - Generate an iterator over batches of actions within the sidecar files.
    - Insert the sidecar batches that contain the add actions necessary to reconstruct the table's state into the top-level iterator.
    - **Note: the original checkpoint batch is still included in the iterator.**
3. If no sidecar file paths exist, move to the next batch and leave the original checkpoint batch in the iterator.
  
Notes:
- If the `checkpoint_read_schema` does not have file actions, we do not
need to scan the batch with the `SidecarVisitor` and can leave the batch
as-is in the top-level iterator.
- Multi-part checkpoints do not have sidecar actions, so we do not need
to scan the batch with the `SidecarVisitor` and can leave the batch
as-is in the top-level iterator.
- A batch may not include add actions, but other actions (like txn,
metadata, protocol). This is safe to leave in the iterator as the
non-file actions will be ignored.
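A hedged sketch of this per-batch flow; every name below is a hypothetical stand-in, not the kernel's actual types or functions:

```rust
// Hypothetical stand-ins for illustration only.
#[derive(Clone)]
struct Batch;

fn extract_sidecar_paths(_batch: &Batch) -> Vec<String> {
    Vec::new() // the real code scans the batch with SidecarVisitor
}

fn read_sidecar_files(_paths: &[String]) -> Vec<Batch> {
    Vec::new() // the real code reads the referenced sidecar files
}

// The original checkpoint batch always stays in the stream; batches read
// from referenced sidecar files are appended after it.
fn process_checkpoint_batch(batch: Batch, schema_has_file_actions: bool) -> Vec<Batch> {
    let mut out = vec![batch.clone()];
    if schema_has_file_actions {
        let sidecar_paths = extract_sidecar_paths(&batch);
        out.extend(read_sidecar_files(&sidecar_paths));
    }
    out
}
```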

resolves delta-io#670



## How was this change tested?
Although log segments cannot yet have V2 checkpoints, we can easily mock batches that include sidecar actions like those we would encounter in V2 checkpoints.

- `test_sidecar_to_filemeta_valid_paths`
  - Tests handling of sidecar paths, which can be:
    - a relative path within the `_delta_log/_sidecars` directory that is just a file name
    - a relative path that has a parent (i.e. a directory component)
    - an absolute path

**Unit tests for process_single_checkpoint_batch:**

- `test_checkpoint_batch_with_no_sidecars_returns_none`
- Verifies that if no sidecar actions are present, the checkpoint batch
is returned unchanged.
- `test_checkpoint_batch_with_sidecars_returns_sidecar_batches`
- Ensures that when sidecars are present, the corresponding sidecar
files are read, and their batches are returned.
- `test_checkpoint_batch_with_sidecar_files_that_do_not_exist`
- Tests behavior when sidecar files referenced in the checkpoint batch
do not exist, ensuring an error is returned.
- `test_reading_sidecar_files_with_predicate`
- Tests that sidecar files that do not match the passed predicate are
skipped correctly
  
**Unit tests for create_checkpoint_stream:**

- `test_create_checkpoint_stream_errors_when_schema_has_remove_but_no_sidecar_action`
  - Validates that if the schema includes the remove action, it must also contain the sidecar column.
- `test_create_checkpoint_stream_errors_when_schema_has_add_but_no_sidecar_action`
  - Validates that if the schema includes the add action, it must also contain the sidecar column.
- `test_create_checkpoint_stream_returns_checkpoint_batches_as_is_if_schema_has_no_file_actions`
  - Checks that if the schema has no file actions, the checkpoint batches are returned unchanged.
- `test_create_checkpoint_stream_returns_checkpoint_batches_if_checkpoint_is_multi_part`
  - Ensures that for multi-part checkpoints, the batch is not visited, and checkpoint batches are returned as-is.
- `test_create_checkpoint_stream_reads_parquet_checkpoint_batch_without_sidecars`
  - Tests reading a Parquet checkpoint batch and verifying it matches the expected result.
- `test_create_checkpoint_stream_reads_json_checkpoint_batch_without_sidecars`
  - Verifies that JSON checkpoint batches are read correctly.
- `test_create_checkpoint_stream_reads_checkpoint_batch_with_sidecar`
  - Ensures that checkpoint files containing sidecar references return the additional corresponding sidecar batches correctly.

## What changes are proposed in this pull request?



### Summary

This PR introduces foundational changes required for V2 checkpoint read
support. The high-level changes required for v2 checkpoint support are:
Item 1: Allow log segments to be built with V2 checkpoint files.
Item 2: Allow log segment replay functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 1.

This PR enables support for the `v2Checkpoints` reader/writer table
feature for delta kernel rust by
1. Allowing snapshots to now leverage UUID-named checkpoints as part of
their log segment.
2. Adding the `v2Checkpoints` feature to the list of supported reader
features.

- This PR is stacked on Item 2
[here](delta-io#679). Golden
table tests are included in this PR.

- More integration tests will be introduced in a follow-up PR tracked
here: delta-io#671

- This PR stacks changes on top of
delta-io#679. For the correct
file diff view, [please only review these
commits](https://github.com/delta-io/delta-kernel-rs/pull/685/files/501c675736dd102a691bc2132c6e81579cf4a1a6..3dcd0859be048dc05f3e98223d0950e460633b60)

resolves delta-io#688


### Changes

We already have the capability to recognize UUID-named checkpoint files
with the variant `LogPathFileType::UuidCheckpoint(uuid)`. This PR does
the following:
- Adds `LogPathFileType::UuidCheckpoint(_)` to the list of valid
checkpoint file types that are collected during log listing
  - This addition allows V2 checkpoints to be included in log segments.
- Adds `ReaderFeatures::V2Checkpoint` to the list of supported reader
features
- This addition allows protocol & metadata validation to pass for tables
with the `v2Checkpoints` reader feature
- Adds the `UnsupportedFeature` reader/writer feature for testing
purposes.
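In sketch form, the listing-time change amounts to accepting one more file type; the `UuidCheckpoint` variant is named in this PR, while the other variants here are illustrative stand-ins:

```rust
// Illustrative stand-in for the kernel's log-path file type enum.
enum LogPathFileType {
    Commit,
    SinglePartCheckpoint,
    MultiPartCheckpoint { part: u32, parts: u32 },
    UuidCheckpoint(String),
}

fn is_valid_checkpoint_file(file_type: &LogPathFileType) -> bool {
    matches!(
        file_type,
        LogPathFileType::SinglePartCheckpoint
            | LogPathFileType::MultiPartCheckpoint { .. }
            // Newly accepted in this PR, enabling V2 checkpoints in log segments.
            | LogPathFileType::UuidCheckpoint(_)
    )
}
```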



## How was this change tested?

Test coverage for the changes required to support building log segments
with V2 checkpoints:

- `test_uuid_checkpoint_patterns` (already exists, small update)
- Verifies the behavior of parsing log file paths that follow the
UUID-naming scheme
- `test_v2_checkpoint_supported`
- Tests the `ensure_read_supported()` func appropriately validates
protocol with `ReaderFeatures::V2Checkpoint`
- `build_snapshot_with_uuid_checkpoint_json`
- `build_snapshot_with_uuid_checkpoint_parquet` (already exists)
- `build_snapshot_with_correct_last_uuid_checkpoint`

Golden table tests:
- `v2-checkpoint-json`
- `v2-checkpoint-parquet`

Potential todos:
- Is it worth introducing a preference for V2 checkpoints vs. V1 checkpoints if both are present in the log for a version?
- What about a preference for checkpoints referenced by `_last_checkpoint`?
Updates HDFS dependencies to newest versions according to [compatibility
matrix](https://github.com/datafusion-contrib/hdfs-native-object-store?tab=readme-ov-file#compatibility).
## How was this change tested?

I expect the current CI pipeline to cover this, since there is an [HDFS
integration
test](delta-io@1f57962).
Also, I have run tests successfully (apart from code coverage due to
missing CI secret) on [my
fork](rzepinskip@d87922d).

---------

Co-authored-by: Nick Lanham <[email protected]>
…enabled (delta-io#664)

## What changes are proposed in this pull request?
This PR adds two functions to `TableConfiguration` (sketched below):
1) check whether the appendOnly table feature is supported
2) check whether the appendOnly table feature is enabled

It also enables writes on tables with the `AppendOnly` writer feature.
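A minimal, self-contained sketch of the two checks; all names are assumptions for illustration, not the kernel's actual API:

```rust
#[derive(PartialEq)]
enum WriterFeature {
    AppendOnly,
}

struct TableConfiguration {
    writer_features: Vec<WriterFeature>,
    append_only_property: Option<bool>, // the delta.appendOnly table property
}

impl TableConfiguration {
    /// Supported: the protocol lists appendOnly as a writer feature.
    fn is_append_only_supported(&self) -> bool {
        self.writer_features.contains(&WriterFeature::AppendOnly)
    }

    /// Enabled: supported *and* the table property is set to true.
    fn is_append_only_enabled(&self) -> bool {
        self.is_append_only_supported() && self.append_only_property.unwrap_or(false)
    }
}
```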

## How was this change tested?
I checked that writing is supported on a protocol with
`WriterFeatures::AppendOnly`.

---------

Co-authored-by: Zach Schuermann <[email protected]>

## What changes are proposed in this pull request?

This PR is part of building support for reading V2 checkpoints.
delta-io#498

This PR ports over existing delta-spark tests and the tables they
create. This test coverage is necessary to ensure that V2 checkpoint
files - whether written in JSON or Parquet, with or without sidecars -
are read correctly and reliably.

This PR stacks changes on top of
delta-io#685
resolves delta-io#671

## How are these tests generated?

The test cases are derived from `delta-spark`'s
[CheckpointSuite](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48)
which creates known valid tables, reads them, and asserts correctness.
The process for adapting these tests is as follows:

1. I modified specific test cases of interest in `delta-spark` to persist their generated tables.
2. These tables were then compressed into `.tar.zst` archives and copied over to delta-kernel-rs.
3. Each test in this PR loads a stored table, scans it, and asserts that the returned table state matches the expected state (derived from the corresponding table insertions in `delta-spark`).

For example, this delta-spark test:
```
     // Append operations and assertions on checkpoint versions
      spark.range(1).repartition(1).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 1)
      assert(getV2CheckpointProvider(deltaLog).sidecarFileStatuses.size == 1)
      assert(getNumFilesInSidecarDirectory() == 1)

      spark.range(30).repartition(9).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 2)
      assert(getNumFilesInSidecarDirectory() == 3)

      spark.range(100).repartition(9).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 3)
      assert(getNumFilesInSidecarDirectory() == 5)

      spark.range(100).repartition(11).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 4)
      assert(getNumFilesInSidecarDirectory() == 9)
    }

```
Translates to an expected table state in the kernel:
```
    let mut expected = [
        header,
        vec!["| 0   |".to_string(); 3],
        generate_rows(30),
        generate_rows(100),
        generate_rows(100),
        generate_rows(1000),
        vec!["+-----+".to_string()],
    ]
```


## How was this change tested?
Tables from test-cases of interest in delta-spark's
[`CheckpointSuite`](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48)
have been compressed into `.tar.zst` archives. They are read by the
kernel and the resulting tables are asserted for correctness.
- `v2_checkpoints_json_with_sidecars`
- `v2_checkpoints_parquet_with_sidecars`
- `v2_checkpoints_json_without_sidecars`
- `v2_checkpoints_parquet_without_sidecars`
- `v2_classic_checkpoint_json`
- `v2_classic_checkpoint_parquet`
- `v2_checkpoints_parquet_with_last_checkpoint`
- `v2_checkpoints_json_with_last_checkpoint`
## What changes are proposed in this pull request?

Add basic support for partition pruning by combining two pieces of
existing infra:
1. The log replay row visitor already needs to parse partition values
and already filters out unwanted rows
2. The default predicate evaluator works directly with scalars

Result: partition pruning gets applied during log replay, just before
deduplication so we don't have to remember pruned files.

WARNING: The implementation currently has a flaw, in case the history
contains a table-replace that affected partition columns. For example,
changing a value column into a non-nullable partition column, or an
incompatible type change to a partition column. In such cases, the
remove actions generated by the table-replace operation (for old files)
would have the wrong type or even be entirely absent. While the code can
handle an absent partition value, an incompatibly typed value would
cause a parsing error that fails the whole query. Note that stats-based
data skipping already has the same flaw, so we are not making the
problem worse. We will fix the problem for both as a follow-up item,
tracked by delta-io#712

NOTE: While this is a convenient way to achieve partition pruning in the
immediate term, Delta
[checkpoints](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1)
can provide strongly-typed `stats_parsed` and `partitionValues_parsed`
columns, which would have a completely different access path.
* For `stats` vs. `stats_parsed`, the likely solution is simple enough
because we already json-parse `stats` into a strongly-typed nested
struct in order to evaluate the data skipping predicate over its record
batch. We just avoid the parsing overhead if `stats_parsed` is already
available.
* The `partitionValues` field poses a bigger challenge, because it's a
string-string map, not a JSON literal. In order to turn it into a
strongly-typed nested struct, we would need a SQL expression that can
extract the string values and try-cast them to the desired types. That's
ugly enough we might prefer to keep completely different code paths for
parsed vs. string partition values, but then there's a risk that
partition pruning behavior changes depending on which path got invoked.
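In sketch form, the pruning described above boils down to evaluating the predicate over a map of parsed partition scalars and keeping the file unless the answer is a definite false (hypothetical stand-ins; the kernel's `Scalar` and predicate evaluator types differ):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the kernel's scalar type.
#[derive(Clone, PartialEq)]
enum Scalar {
    Long(i64),
    String(String),
    Null,
}

// Keep a file unless the predicate definitively excludes it; `None`
// ("can't decide", e.g. a missing partition value) keeps the file.
fn keep_file<P>(partition_values: &HashMap<String, Scalar>, predicate: P) -> bool
where
    P: Fn(&HashMap<String, Scalar>) -> Option<bool>,
{
    predicate(partition_values) != Some(false)
}
```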

## How was this change tested?

New unit tests, and adjusted one unit test that assumed no partition
pruning.
## What changes are proposed in this pull request?
Add `DeletionVectors` to supported writer features. We trivially support
DVs since we never write DVs. Note that as we implement DML in the future,
we need to ensure it correctly handles DVs.

## How was this change tested?
modified UT

## What changes are proposed in this pull request?
Support writer version 2 and `Invariant` table (writer) feature. Note
that we don't _actually_ support invariants, rather we enable writing to
tables **without invariants** with version=2 or Invariant feature
enabled.


### This PR affects the following public APIs

Enable writes to version=2/Invariant enabled.

## How was this change tested?
new UTs

resolves delta-io#706
## What changes are proposed in this pull request?

`MetadataValue::Number(i32)` was the previous variant for metadata
values, but identity column values are only longs, so
`MetadataValue::Number` was updated to be `MetadataValue::Number(i64)` instead.


## How was this change tested?

I ran the tests; this doesn't change any existing functionality, only the
type.
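The change in sketch form (the `String` variant is an illustrative stand-in; only the numeric variant is the subject of this PR):

```rust
enum MetadataValue {
    // Was Number(i32); identity-column values are longs, hence i64.
    Number(i64),
    String(String),
}
```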

---------

Co-authored-by: Robert Pack <[email protected]>
## What changes are proposed in this pull request?

The `actions-rs/toolchain` action is deprecated in favor of
`actions-rust-lang/setup-rust-toolchain`. This PR updates the usages of
the respective actions in the GitHub workflows. The new action already
includes an integration with the rust-cache action, so there is no need to set
that up separately anymore.

This also sets up a dependabot configuration for `cargo` and
`github-actions` which we may or may not choose to keep.

## How was this change tested?

no code changes.

---------

Co-authored-by: Zach Schuermann <[email protected]>
@roeap roeap requested a review from scovich March 19, 2025 19:18
@@ -39,7 +39,7 @@ impl FileSystemClient for SyncFilesystemClient {
     let all_ents: Vec<_> = std::fs::read_dir(path_to_read)?
         .filter(|ent_res| {
             match (ent_res, min_file_name) {
-                (Ok(ent), Some(min_file_name)) => ent.file_name() >= *min_file_name,
+                (Ok(ent), Some(min_file_name)) => ent.file_name() > *min_file_name,
Collaborator:
This seems like a behavior change? Does it need to be part of this refactor PR?

@roeap (Collaborator, Author):
Need is a strong word, but IIRC, if we split this, we would need some more refactoring to wrap the object store and filter out that one item. Since object_store behaves this way and this is the behavior documented in the Java kernel, I thought to just go for it.

@roeap (Collaborator, Author):
Come to think of it, this was required since I introduced tests that run against default-engine and sync-engine, which were inconsistent in that regard. With this change, both engines behave the same.

I then guess that for most (if not all) current users of one of our engines, things stay the same.

Collaborator:
sounds good, but then can we include this in the changes listed in the PR description?


} else {
    let mut parts = offset.parts().collect_vec();
    if parts.pop().is_none() {
        return Err(Error::generic(format!(
Collaborator:
Nit: this will double-format; use `Error::Generic`.
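To illustrate the nit, with assumed stand-in definitions (the premise being that a `generic` helper stringifies its argument, so passing it an already-formatted `String` does redundant work):

```rust
enum Error {
    Generic(String),
}

impl Error {
    // A convenience constructor like this re-stringifies its argument.
    fn generic(msg: impl ToString) -> Self {
        Error::Generic(msg.to_string())
    }
}

fn example(path: &str) -> Error {
    // Double work: format! builds a String, then generic() copies it again.
    let _doubled = Error::generic(format!("no parent path for {path}"));
    // Preferred: construct the variant directly with the formatted String.
    Error::Generic(format!("no parent path for {path}"))
}
```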

@roeap roeap requested review from scovich and nicklan March 20, 2025 01:39
@scovich (Collaborator) left a comment:
LGTM

@zachschuermann zachschuermann changed the title fix: get prefix from offset path fix!: get prefix from offset path Mar 20, 2025
@zachschuermann (Collaborator) left a comment:
LGTM


@roeap roeap merged commit ec47584 into delta-io:main Mar 21, 2025
21 checks passed
@roeap roeap deleted the feat/engine-fs branch March 21, 2025 03:06