Added support for Polars DataFrame and LazyFrame #1614
Really nice! I really like working with Polars.
First of all, thanks @yigal-rozenberg for working on this. I'm open to adding this. Could you add a section on Polars to the docs as well?
See https://py.iceberg.apache.org/api/#query-the-data for examples. You can find the file here: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md
I will provide the relevant documentation.
At first glance, this approach seems fairly inferior to using the Polars scan_iceberg functionality, which:
I think it might be better to document the existing Polars functionality rather than adding and documenting this pattern.
Thanks for the comment!
…he Table class with a to_polars method which returns a polars LazyFrame
The difference is that the approach as documented encourages folks to write their own filter predicates for pyiceberg before materializing a dataframe with polars, whereas the "polars way" (as a lazy dataframe API) would be to just create the lazyframe, construct your compute graph with whatever polars predicates/etc make sense for you, and rely on polars to push that down at …
Separately, rather than adding more library-specific conversion code, it might make sense for pyiceberg to start leveraging the PyCapsule protocol to allow any third-party library (dataframe or otherwise) that supports Arrow data to seamlessly consume pyiceberg constructs. Polars already supports the PyCapsule interface. See https://docs.pola.rs/user-guide/misc/arrow/#using-the-arrow-pycapsule-interface for details. Implementing the interface on e.g. pyiceberg tables would allow them to be passed directly to dataframe init in polars, just as you can with a pyarrow table today. It also doesn't assume anything about polars support and doesn't add a dependency on polars.
Love the idea!
Thanks for adding this! I've added a few comments.
Can you rebase off main? Looks like there's a conflict.
There's still a conflict with main. Could you also remove …
Co-authored-by: Kevin Liu <[email protected]>
LGTM. I fixed the merge conflict and some linter issues.
I'll let Fokko chime in before proceeding.
Polars (https://pola.rs) is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.
This change is a simple 'to_polars' addition to the Table API.
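from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # assumes a configured catalog named "default"
iceberg_table = catalog.load_table('data.data_points')
pdf = iceberg_table.scan().to_polars()  # materialize the scan as a Polars DataFrame
print(pdf)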