-
Notifications
You must be signed in to change notification settings - Fork 462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tracking] Kernelize! #3298
Comments
@roeap Based on the challenges you wrote, I think we should actually move forward with the current API as 1.0 for us. Because one thing I would not like to see is we hit many regressions after switching to kernel, and then this would set us back many months if not longer to get to a 1.0 release. I would rather see us doing this in 1.X work and then we can release 2.0 after that is stabilized |
@ion-elgreco - i do see the point, but are we under some time pressure? I think we still have a few correctness problems anyways that I personally would expect to handled in a 1.0. But yes, this is certainly not going to be done in a week or so. Personally I do also still struggle a bit with how wide our APIs on the delta table are. i.e. tailored to both log inspection and table scans. The SQL APIs are also still considered experimental IIRC, do we have a design for that? Finally getting to 1.0 is something I am tremendously looking forward to, but is there any motivation to rush it now that we are nearing the end? Knowing we might break things? W.r.t. to the challenges, I think we can cover most of this via a hybrid state where we leverage what kernel can do and layer our stuff on top of that. |
Which things exactly?
This worries me, I want a python 1.0 out as soon as possible
SQL apis? You mean parsing predicates, we also have this experimental querybuilder in python which I'm not a major fan of
But we shouldn't break things anymore for python users. I've been thinking about this for a while but I think we can do a similar thing as Polars does, make python 1.0 but the rust crates not. |
I guess most if not all of this is on the rust side, but may break things on the python side.
Do we have a pressing motivation for this? I agree that this is a priority, but are there time constraints?
^^ this
I may not be the best judge of this, but there are a few things I would take a close look at in terms of api maintenance. This may just be a personal feeling, but to me it seems we kept just adding more and more parameters to
With kernel specifically, these two should "collapse" since kernel handles file skipping.
Also, I think we may need to deprecate the "pyarrow" engine altogether as it cannot support all features. Just my personal preference (so nothing that matter all that much 😆), but I'd like to remove the |
Hmm, do you have repro for this? First time I hear about this
Yeah I dove into it, it's actually working fine and throwing the correct erros when you do try to time travel beyond such a state.
Yes, but I preferably don't want any SQL support in deltalake. We already have duckdb and datafusion that can read delta tables, so I never truly saw the purpose of this.
I think you missed this PR :) #3285 All of these options were dedicated to the pyarrow engine writer. Since that is now removed I was able to restructure and simplify everything. I could only do this since we are closing in on 1.0.
|
It's marked as experimental so in theory we can always remove it imho. But if others want to add things to it they can do that.
In python a |
Important
This is a living issue that we'll update with new issues / comments as we get more clarity on the concrete implementations.
Description
This is a tracking issue to align on and coordinate the integration of
delta-kernel-rs
intodelta-rs
.Motivation
While tremendous strides have been made by the community to support more and more delta features in
delta-rs
, we are still lagging behind with more features on the way that user will want to leverage. This is exactly the use case the kernel libraries aim to address - a correct and complete implementation of the Delta protocol.Kernel explicitly does not take an opinion on all io / execution related aspects that are needed to actually consume / work with delta tables. This is what
delta-rs
provides, leaving the current (high level) user facing APIs conceptually as is.Execution
In simplified terms adopting kernel mean carving out the functionality that currently resides in
core/src/kernel
(named so in preparation for being replaced by kernel)core/src/protocol
(mainly our snapshot code, that I wanted to update for quite a while now)core/src/schema
(only partition pruning remained in this module after previous updates)At the heart of the migration is creating a new snapshot implementation (RFC in #3137) which provides all required machinery (the engine) to kernel and exposes methods tailored to the needs of
delta-rs
.One potential avenue forward is to get the RFC merge-ready and merge it without being "hooked-up" to the rest of the crate. This PR also exposes a
Snapshot
trait (we already have something similar, but not quite fitting - I think) the we can hopefully leverage to refactor all the operations that require access to the snapshot - i.e. implement that trait for current snapshots ... This should hopefully surface any missing APIs in kernel that we may yet require for full adoption.Challenges
Txn
,CommtInfo
...Any feedback / concers around proceeding with this is highly appreciated.
Related Work
v2Checkpoint
reader/writer feature delta-kernel-rs#685Snapshot::try_new_from()
API delta-kernel-rs#549The text was updated successfully, but these errors were encountered: