fix!: get prefix from offset path #699
Conversation
5559d58 to 46ca271
Signed-off-by: Robert Pack <[email protected]>
46ca271 to a3a7671
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main     #699      +/-   ##
==========================================
- Coverage   84.37%   84.36%   -0.02%
==========================================
  Files          80       81       +1
  Lines       19221    19246      +25
  Branches    19221    19246      +25
==========================================
+ Hits        16218    16236      +18
- Misses       2201     2205       +4
- Partials      802      805       +3
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
/// List the paths in the same directory that are lexicographically greater than
/// (UTF-8 sorting) the given `path`. The result should also be sorted by the file name.
///
/// If the path is directory-like (ends with '/'), the result should contain
/// all the files in the directory.
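For illustration, here is a small self-contained sketch of the contract described in this doc comment; `list_from_sketch` is a hypothetical helper working on plain strings, not the kernel API:

```rust
/// Hypothetical helper illustrating the documented semantics: return the entries
/// strictly greater than the offset's file-name component, sorted by file name.
fn list_from_sketch<'a>(dir_entries: &[&'a str], offset: &str) -> Vec<&'a str> {
    // File-name component of the offset; empty for directory-like offsets ("foo/").
    let min_name = offset.rsplit('/').next().unwrap_or("");
    let mut out: Vec<&str> = dir_entries
        .iter()
        .copied()
        .filter(|name| *name > min_name) // strictly greater, byte-wise UTF-8 order
        .collect();
    out.sort_unstable(); // result is sorted by file name
    out
}

fn main() {
    let entries = ["010.json", "011.json", "012.json"];
    // The offset file itself is excluded (strictly greater-than).
    assert_eq!(list_from_sketch(&entries, "_delta_log/010.json"), ["011.json", "012.json"]);
    // A directory-like offset returns all files in the directory.
    assert_eq!(list_from_sketch(&entries, "_delta_log/"), entries);
}
```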
Not 100% sure this is the behavior we want to go for, but thought I'd put up the PR for discussion.
IMO it would be better to split this behavior change to a different PR, if at all possible?
If we keep this here, can we add it to the PR description? I think this is a breaking change.
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
…o#700) Fixes delta-io#698

## What changes are proposed in this pull request?
Updates the `DataSkippingFilter` to treat all columns as nullable for the purpose of parsing stats, as suggested in delta-io#698 (comment). This is particularly important for partition columns, which won't have values present in stats. But stats are also usually only stored for the first 32 columns, so we shouldn't rely on stats being present for non-partition fields either.

## How was this change tested?
I've added a new unit test. I've also tested building duckdb-delta with this change (cherry-picked onto 0.6.1) and verified that the code in delta-io#698 now works.
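As a rough illustration of the "treat everything as nullable" idea (using serde_json stand-ins rather than the kernel's actual stats schema and parsing code), a stats blob with no entry for a partition column should still parse, with the missing value simply treated as unknown:

```rust
use std::collections::HashMap;

use serde::Deserialize;

// Simplified stand-in for parsed per-file statistics; the real kernel type differs.
// Every column's stats are optional, so a missing entry (partition columns, or
// columns beyond the first 32) must not fail parsing.
#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct FileStats {
    num_records: Option<i64>,
    #[serde(default)]
    min_values: HashMap<String, serde_json::Value>,
    #[serde(default)]
    max_values: HashMap<String, serde_json::Value>,
}

fn main() -> Result<(), serde_json::Error> {
    // Stats for a table partitioned by `date`: the partition column has no stats entry.
    let raw = r#"{"numRecords": 3, "minValues": {"id": 1}, "maxValues": {"id": 7}}"#;
    let stats: FileStats = serde_json::from_str(raw)?;
    assert_eq!(stats.num_records, Some(3));
    assert!(stats.min_values.get("date").is_none()); // unknown, not an error
    Ok(())
}
```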
…_snapshot_schema (delta-io#683)

## What changes are proposed in this pull request?
When given a schema (e.g. in `global_scan_state`) the engine needs a way to visit this schema. This introduces a new API `visit_schema` to allow engines to visit any schema over FFI. An API called `visit_schema` previously existed but visited the schema of a given _snapshot_; this has now been renamed to `visit_snapshot_schema`.

### This PR affects the following public APIs
Renamed `visit_schema` to `visit_snapshot_schema`; `visit_schema` now takes `SharedSchema` as an argument instead of a snapshot.

## How was this change tested?
Updated the read_table test.
…o#654) This change introduces arrow_53 and arrow_54 feature flags on kernel which are _required_ when using default-engine or sync-engine. Fundamentally we must push users of the crate to select their arrow major version through feature flags, since Cargo _will_ include multiple major versions in the dependency tree, which can cause ABI breakages when passing around symbols such as `RecordBatch`. See delta-io#640

---------
Signed-off-by: R. Tyler Croy <[email protected]>
Our previous write protocol check was too strict. Now we just ensure that the protocol makes sense given what features are present/specified. Made all existing `write.rs` tests also write to a protocol 1/1 table, and they all work.
Use new transform functionality to transform data over FFI. This lets us get rid of all the gross partition adding code in c :) In particular:
- remove `add_partition_columns` in `arrow.c`, we don't need it anymore
- expose ffi methods to get an expression evaluator and evaluate an expression from `c`
- use the above to add an `apply_transform` function in `arrow.c`

## How was this change tested?
- existing tests
…ta-io#709)

## What changes are proposed in this pull request?
This PR removes the old `visit_snapshot_schema` introduced in delta-io#683 - we should just go ahead and do the 'right thing' with having a `visit_schema` (introduced in the other PR) and a `logical_schema()` function (added here) in order to facilitate visiting the snapshot schema. Additionally I've moved the schema-related things up from the `scan` module to top-level in the ffi crate. Exact changes are listed below; this PR updates tests/examples to leverage the new changes.

### This PR affects the following public APIs
1. Remove `visit_snapshot_schema()` API
2. Add a new `logical_schema(snapshot)` API so you can get the schema of a snapshot and use `visit_schema` directly
3. Renames `free_global_read_schema` to just `free_schema`
4. Moves `SharedSchema` and `free_schema` up from `mod scan` into the top-level `ffi` crate

## How was this change tested?
existing UT
release 0.8.0
…elta-io#679)

## What changes are proposed in this pull request?

### Summary
This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for v2 checkpoint support are:

Item 1. Allow log segments to be built with V2 checkpoint files
Item 2. Allow log segment `replay` functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 2. This PR **does not introduce full v2Checkpoints reader/writer support** as we are missing support for Item 1, meaning log segments can never have V2 checkpoint files in the first place. That functionality will be completed in [PR delta-io#685](delta-io#685) which is stacked on top of this PR. However, the changes to log `replay` done here are compatible with tables using V1 checkpoints, allowing us to safely merge the changes here.

### Changes
For each batch of `EngineData` from a checkpoint file:
1. Use the new `SidecarVisitor` to scan each batch for sidecar file paths embedded in sidecar actions.
2. If sidecar file paths exist:
   - Read the corresponding sidecar files.
   - Generate an iterator over batches of actions within the sidecar files.
   - Insert the sidecar batches that contain the add actions necessary to reconstruct the table's state into the top-level iterator.
   - **Note: the original checkpoint batch is still included in the iterator.**
3. If no sidecar file paths exist, move to the next batch and leave the original checkpoint batch in the iterator.

Notes:
- If the `checkpoint_read_schema` does not have file actions, we do not need to scan the batch with the `SidecarVisitor` and can leave the batch as-is in the top-level iterator.
- Multi-part checkpoints do not have sidecar actions, so we do not need to scan the batch with the `SidecarVisitor` and can leave the batch as-is in the top-level iterator.
- A batch may not include add actions, but other actions (like txn, metadata, protocol). This is safe to leave in the iterator as the non-file actions will be ignored.
resolves delta-io#670

## How was this change tested?
Although log segments can not yet have V2 checkpoints, we can easily mock batches that include sidecar actions that we can encounter in V2 checkpoints.

- `test_sidecar_to_filemeta_valid_paths` - Tests handling of sidecar paths that can be:
  - A relative path within the `_delta_log/_sidecars` directory that is just a file name
  - A relative path that has a parent (i.e. directory component)
  - An absolute path

**Unit tests for process_single_checkpoint_batch:**
- `test_checkpoint_batch_with_no_sidecars_returns_none` - Verifies that if no sidecar actions are present, the checkpoint batch is returned unchanged.
- `test_checkpoint_batch_with_sidecars_returns_sidecar_batches` - Ensures that when sidecars are present, the corresponding sidecar files are read, and their batches are returned.
- `test_checkpoint_batch_with_sidecar_files_that_do_not_exist` - Tests behavior when sidecar files referenced in the checkpoint batch do not exist, ensuring an error is returned.
- `test_reading_sidecar_files_with_predicate` - Tests that sidecar files that do not match the passed predicate are skipped correctly.

**Unit tests for create_checkpoint_stream:**
- `test_create_checkpoint_stream_errors_when_schema_has_remove_but_no_sidecar_action` - Validates that if the schema includes the remove action, it must also contain the sidecar column.
- `test_create_checkpoint_stream_errors_when_schema_has_add_but_no_sidecar_action` - Validates that if the schema includes the add action, it must also contain the sidecar column.
- `test_create_checkpoint_stream_returns_checkpoint_batches_as_is_if_schema_has_no_file_actions` - Checks that if the schema has no file actions, the checkpoint batches are returned unchanged.
- `test_create_checkpoint_stream_returns_checkpoint_batches_if_checkpoint_is_multi_part` - Ensures that for multi-part checkpoints, the batch is not visited, and checkpoint batches are returned as-is.
- `test_create_checkpoint_stream_reads_parquet_checkpoint_batch_without_sidecars` - Tests reading a Parquet checkpoint batch and verifying it matches the expected result.
- `test_create_checkpoint_stream_reads_json_checkpoint_batch_without_sidecars` - Verifies that JSON checkpoint batches are read correctly.
- `test_create_checkpoint_stream_reads_checkpoint_batch_with_sidecar` - Ensures that checkpoint files containing sidecar references return the additional corresponding sidecar batches correctly.
## What changes are proposed in this pull request?

### Summary
This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for v2 checkpoint support are:

Item 1. Allow log segments to be built with V2 checkpoint files
Item 2. Allow log segment replay functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 1. This PR enables support for the `v2Checkpoints` reader/writer table feature for delta kernel rust by
1. Allowing snapshots to leverage UUID-named checkpoints as part of their log segment.
2. Adding the `v2Checkpoints` feature to the list of supported reader features.

- This PR is stacked on Item 2 [here](delta-io#679). Golden table tests are included in this PR.
- More integration tests will be introduced in a follow-up PR tracked here: delta-io#671
- This PR stacks changes on top of delta-io#679. For the correct file diff view, [please only review these commits](https://github.com/delta-io/delta-kernel-rs/pull/685/files/501c675736dd102a691bc2132c6e81579cf4a1a6..3dcd0859be048dc05f3e98223d0950e460633b60)

resolves delta-io#688

### Changes
We already have the capability to recognize UUID-named checkpoint files with the variant `LogPathFileType::UuidCheckpoint(uuid)`. This PR does the following:
- Adds `LogPathFileType::UuidCheckpoint(_)` to the list of valid checkpoint file types that are collected during log listing
  - This addition allows V2 checkpoints to be included in log segments.
- Adds `ReaderFeatures::V2Checkpoint` to the list of supported reader features
  - This addition allows protocol & metadata validation to pass for tables with the `v2Checkpoints` reader feature.
- Adds the `UnsupportedFeature` reader/writer feature for testing purposes.
## How was this change tested?
Test coverage for the changes required to support building log segments with V2 checkpoints:
- `test_uuid_checkpoint_patterns` (already exists, small update) - Verifies the behavior of parsing log file paths that follow the UUID-naming scheme
- `test_v2_checkpoint_supported` - Tests that `ensure_read_supported()` appropriately validates a protocol with `ReaderFeatures::V2Checkpoint`
- `build_snapshot_with_uuid_checkpoint_json`
- `build_snapshot_with_uuid_checkpoint_parquet` (already exists)
- `build_snapshot_with_correct_last_uuid_checkpoint`

Golden table tests:
- `v2-checkpoint-json`
- `v2-checkpoint-parquet`

Potential todos:
- Is it worth introducing a preference for V2 checkpoints vs V1 checkpoints if both are present in the log for a version?
- What about a preference for checkpoints referenced by _last_checkpoint?
Updates HDFS dependencies to the newest versions according to the [compatibility matrix](https://github.com/datafusion-contrib/hdfs-native-object-store?tab=readme-ov-file#compatibility).

## How was this change tested?
I expect the current CI pipeline to cover this since there is a [HDFS integration test](delta-io@1f57962). Also, I have run tests successfully (apart from code coverage due to missing CI secret) on [my fork](rzepinskip@d87922d).

---------
Co-authored-by: Nick Lanham <[email protected]>
…enabled (delta-io#664)

## What changes are proposed in this pull request?
This PR adds two functions to TableConfiguration:
1) check whether the appendOnly table feature is supported
2) check whether the appendOnly table feature is enabled

It also enables writes on tables with the `AppendOnly` writer feature.

## How was this change tested?
I check that write is supported on a Protocol with `WriterFeatures::AppendOnly`.

---------
Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?
This PR is part of building support for reading V2 checkpoints. delta-io#498

This PR ports over existing delta-spark tests and the tables they create. This test coverage is necessary to ensure that V2 checkpoint files - whether written in JSON or Parquet, with or without sidecars - are read correctly and reliably.

This PR stacks changes on top of delta-io#685. resolves delta-io#671

# How are these tests generated?
The test cases are derived from `delta-spark`'s [CheckpointSuite](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48), which creates known valid tables, reads them, and asserts correctness. The process for adapting these tests is as follows:
1. I modified specific test cases of interest in `delta-spark` to persist their generated tables.
2. These tables were then compressed into `.tar.zst` archives and copied over to delta-kernel-rs.
3. Each test in this PR loads a stored table, scans it, and asserts that the returned table state matches the expected state (derived from the corresponding table insertions in `delta-spark`).

For example, this delta-spark test:
```
// Append operations and assertions on checkpoint versions
spark.range(1).repartition(1).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 1)
assert(getV2CheckpointProvider(deltaLog).sidecarFileStatuses.size == 1)
assert(getNumFilesInSidecarDirectory() == 1)

spark.range(30).repartition(9).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 2)
assert(getNumFilesInSidecarDirectory() == 3)

spark.range(100).repartition(9).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 3)
assert(getNumFilesInSidecarDirectory() == 5)

spark.range(100).repartition(11).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 4)
assert(getNumFilesInSidecarDirectory() == 9)
}
```

Translates to an expected table state in the kernel:

```
let mut expected = [
    header,
    vec!["| 0 |".to_string(); 3],
    generate_rows(30),
    generate_rows(100),
    generate_rows(100),
    generate_rows(1000),
    vec!["+-----+".to_string()],
]
```

## How was this change tested?
Tables from test cases of interest in delta-spark's [`CheckpointSuite`](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48) have been compressed into `.tar.zst` archives. They are read by the kernel and the resulting tables are asserted for correctness.
- `v2_checkpoints_json_with_sidecars`
- `v2_checkpoints_parquet_with_sidecars`
- `v2_checkpoints_json_without_sidecars`
- `v2_checkpoints_parquet_without_sidecars`
- `v2_classic_checkpoint_json`
- `v2_classic_checkpoint_parquet`
- `v2_checkpoints_parquet_with_last_checkpoint`
- `v2_checkpoints_json_with_last_checkpoint`
## What changes are proposed in this pull request?
Add basic support for partition pruning by combining two pieces of existing infra:
1. The log replay row visitor already needs to parse partition values and already filters out unwanted rows
2. The default predicate evaluator works directly with scalars

Result: partition pruning gets applied during log replay, just before deduplication, so we don't have to remember pruned files.

WARNING: The implementation currently has a flaw in case the history contains a table-replace that affected partition columns, for example changing a value column into a non-nullable partition column, or an incompatible type change to a partition column. In such cases, the remove actions generated by the table-replace operation (for old files) would have the wrong type or even be entirely absent. While the code can handle an absent partition value, an incompatibly typed value would cause a parsing error that fails the whole query. Note that stats-based data skipping already has the same flaw, so we are not making the problem worse. We will fix the problem for both as a follow-up item, tracked by delta-io#712

NOTE: While this is a convenient way to achieve partition pruning in the immediate term, Delta [checkpoints](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1) can provide strongly-typed `stats_parsed` and `partitionValues_parsed` columns, which would need a completely different access path.
* For `stats` vs. `stats_parsed`, the likely solution is simple enough, because we already json-parse `stats` into a strongly-typed nested struct in order to evaluate the data skipping predicate over its record batch. We would just avoid the parsing overhead if `stats_parsed` is already available.
* The `partitionValues` field poses a bigger challenge, because it's a string-string map, not a JSON literal. In order to turn it into a strongly-typed nested struct, we would need a SQL expression that can extract the string values and try-cast them to the desired types. That's ugly enough that we might prefer to keep completely different code paths for parsed vs. string partition values, but then there's a risk that partition pruning behavior changes depending on which path got invoked.

## How was this change tested?
New unit tests, and adjusted one unit test that assumed no partition pruning.
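To make the scalar-based idea concrete, here is a hypothetical, self-contained sketch (these types are illustrative stand-ins, not the kernel's actual `Scalar` or predicate evaluator): partition values arrive as strings, get parsed into scalars, and the predicate is evaluated directly on those scalars, keeping the file whenever the value is absent or unknown:

```rust
use std::collections::HashMap;

// Illustrative stand-ins; the kernel's scalar and predicate types are richer.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Integer(i64),
    String(String),
    Null,
}

enum Pred {
    Eq(&'static str, Scalar),
}

// Absent values (e.g. after a table-replace) parse to Null instead of failing.
fn parse_partition_value(raw: Option<&String>, want_int: bool) -> Scalar {
    match raw {
        None => Scalar::Null,
        Some(s) if want_int => s.parse().map(Scalar::Integer).unwrap_or(Scalar::Null),
        Some(s) => Scalar::String(s.clone()),
    }
}

/// Returns false only when the predicate definitely excludes the file.
fn keep_file(partition_values: &HashMap<String, String>, pred: &Pred) -> bool {
    match pred {
        Pred::Eq(col, want) => {
            let got = parse_partition_value(
                partition_values.get(*col),
                matches!(want, Scalar::Integer(_)),
            );
            got == Scalar::Null || got == *want
        }
    }
}

fn main() {
    let pv = HashMap::from([("year".to_string(), "2024".to_string())]);
    // A file from partition year=2024 is pruned by `year = 2023` during log replay.
    assert!(!keep_file(&pv, &Pred::Eq("year", Scalar::Integer(2023))));
    assert!(keep_file(&pv, &Pred::Eq("year", Scalar::Integer(2024))));
}
```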
## What changes are proposed in this pull request?
Add `DeletionVectors` to supported writer features. We trivially support DVs since we never write DVs. Note that as we implement DML in the future, we need to ensure it correctly handles DVs.

## How was this change tested?
modified UT
## What changes are proposed in this pull request?
Support writer version 2 and the `Invariant` table (writer) feature. Note that we don't _actually_ support invariants; rather, we enable writing to tables **without invariants** that have version=2 or the Invariant feature enabled.

### This PR affects the following public APIs
Enable writes to version=2/Invariant-enabled tables.

## How was this change tested?
new UTs

resolves delta-io#706
## What changes are proposed in this pull request?
`MetadataValue::Number(i32)` was the previous type for numeric metadata values, but identity columns are only longs, so `MetadataValue::Number` is now `MetadataValue::Number(i64)` instead.

## How was this change tested?
I ran the tests; this doesn't change any existing functionality, only the type.

---------
Co-authored-by: Robert Pack <[email protected]>
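A tiny sketch of the type change (a simplified stand-in, not the kernel's actual `MetadataValue` definition): identity-column values are 64-bit longs, so the numeric variant has to hold an `i64`:

```rust
// Simplified stand-in for the metadata value type; only the numeric variant matters here.
#[derive(Debug, PartialEq)]
enum MetadataValue {
    Number(i64), // previously i32; identity column values are longs and can exceed i32::MAX
    String(String),
}

fn main() {
    // e.g. an identity column high water mark larger than i32::MAX (2_147_483_647)
    let v = MetadataValue::Number(5_000_000_000);
    assert_eq!(v, MetadataValue::Number(5_000_000_000));
}
```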
## What changes are proposed in this pull request?
The `actions-rs/toolchain` action is deprecated in favor of `actions-rust-lang/setup-rust-toolchain`. This PR updates the usages of the respective actions in the GitHub workflows. The new action already includes an integration with the rust-cache action, so there is no need to set that up separately anymore. This also sets up a dependabot configuration for `cargo` and `github-actions`, which we may or may not choose to keep.

## How was this change tested?
No code changes.

---------
Co-authored-by: Zach Schuermann <[email protected]>
@@ -39,7 +39,7 @@ impl FileSystemClient for SyncFilesystemClient {
     let all_ents: Vec<_> = std::fs::read_dir(path_to_read)?
         .filter(|ent_res| {
             match (ent_res, min_file_name) {
-                (Ok(ent), Some(min_file_name)) => ent.file_name() >= *min_file_name,
+                (Ok(ent), Some(min_file_name)) => ent.file_name() > *min_file_name,
This seems like a behavior change? Does it need to be part of this refactor PR?
Need is a strong word, but IIRC, if we split this out, we need some more refactoring in the wrapping object store to filter out that one item. Since object_store behaves this way and this is the behavior documented in the Java kernel, I thought to just go for it.
Come to think of it, this was required since I introduced tests that run against default-engine and sync-engine, which were inconsistent in this regard. With this change, both engines behave the same.
I'd then guess that for most (if not all) current users of one of our engines, things stay the same.
Sounds good, but then can we include this in the changes listed in the PR description?
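For reference, a runnable approximation of the filter shown in the diff above (names and structure are simplified, and read errors are silently skipped here, whereas the real sync client surfaces them):

```rust
use std::ffi::OsString;
use std::io;
use std::path::Path;

// Simplified sketch of the sync-engine listing filter discussed above.
fn list_dir_after(dir: &Path, min_file_name: Option<&OsString>) -> io::Result<Vec<OsString>> {
    let mut names: Vec<OsString> = std::fs::read_dir(dir)?
        .filter_map(|ent| ent.ok().map(|e| e.file_name()))
        // `>` rather than `>=`: the offset entry itself is excluded, matching object_store
        // and the behavior documented in the Java kernel.
        .filter(|name| min_file_name.map_or(true, |min| name > min))
        .collect();
    names.sort(); // results sorted by file name
    Ok(names)
}

fn main() -> io::Result<()> {
    // List everything in the temp dir whose name sorts strictly after "a".
    let offset = OsString::from("a");
    for name in list_dir_after(&std::env::temp_dir(), Some(&offset))? {
        println!("{}", name.to_string_lossy());
    }
    Ok(())
}
```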
} else {
    let mut parts = offset.parts().collect_vec();
    if parts.pop().is_none() {
        return Err(Error::generic(format!(
Nit: this will double-format; use `Error::Generic` directly.
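To spell out the nit (with a stand-in error type; the kernel's `Error` is assumed to have a `Generic(String)` variant and a `generic(impl ToString)` helper, and the message text here is just illustrative):

```rust
#[derive(Debug)]
enum Error {
    Generic(String),
}

impl Error {
    // Convenience constructor: formats whatever it is given via `to_string`.
    fn generic(msg: impl ToString) -> Self {
        Error::Generic(msg.to_string())
    }
}

fn main() {
    let offset = "some/offset/path";
    // Double formatting: `format!` already produced a String, and `generic` then
    // runs `to_string` on it again.
    let _indirect = Error::generic(format!("cannot get prefix from offset: {offset}"));
    // Preferred: construct the variant directly with the already-formatted String.
    let _direct = Error::Generic(format!("cannot get prefix from offset: {offset}"));
}
```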
LGTM
LGTM
Co-authored-by: Zach Schuermann <[email protected]>
What changes are proposed in this pull request?
Our `list_from` implementation for the object_store-based filesystem client is currently broken, since it does not behave as documented/required for that function. Specifically, we should list all files in the parent folder, using the path as the offset to list from. In a follow-up PR we then need to lift the assumption that all URLs will always be under the same store to get proper URL handling.
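As a sketch of the "get prefix from offset path" idea in this PR's title (the actual kernel/object_store plumbing is omitted, `prefix_for_offset` is a hypothetical helper, and the `url` crate is assumed): the directory to list is everything up to the last '/' of the offset URL's path.

```rust
use url::Url;

// Hypothetical helper: derive the directory prefix to list from an offset URL.
fn prefix_for_offset(offset: &Url) -> Option<String> {
    let path = offset.path(); // e.g. "/my_table/_delta_log/00000000000000000010.json"
    // Everything up to and including the last '/' is the parent directory.
    path.rfind('/').map(|idx| path[..=idx].to_string())
}

fn main() {
    let offset = Url::parse("s3://bucket/my_table/_delta_log/00000000000000000010.json").unwrap();
    assert_eq!(prefix_for_offset(&offset).as_deref(), Some("/my_table/_delta_log/"));
}
```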
This PR affects the following public APIs
- `DefaultEngine::new` no longer requires a `table_root` parameter.
- `list_from` consistently returns keys greater than (`>`) the offset; previously the sync engine's client returned all keys (`>=`).

How was this change tested?
Additional tests asserting consistent `list_from` behavior for all our file system client implementations.