Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for Polars DataFrame and LazyFrame #1614

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

yigal-rozenberg
Copy link

Polars (https://pola.rs) is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.

this chnage is a simple 'to_polars' addiotn to the table api.

iceberg_table = catalog.load_table('data.data_points')
pdf = iceberg_table.scan().to_polars()
print(pdf)

@amitgilad3
Copy link
Contributor

Really nice - i really like working with polars

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks @yigal-rozenberg for working on this. I'm open to adding this. Could you add a section on Polars to the docs as well?

See https://py.iceberg.apache.org/api/#query-the-data for examples. You can find the file here: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

pyproject.toml Outdated Show resolved Hide resolved
@yigal-rozenberg
Copy link
Author

I will provide with the relevant documentation.

@corleyma
Copy link

corleyma commented Feb 5, 2025

At first glance, this approach seems fairly inferior to using the Polars scan_iceberg functionality, which:

  • returns a LazyFrame
  • on evaluation of the LazyFrame, will push down filters to pyiceberg to limit the data returned.

I think it might be better to document the existing polars functionality vs adding and documenting this pattern.

@yigal-rozenberg
Copy link
Author

Thanks for the comment!
I can add this to the documentation as an alternative.
The motivations was to extend existing PyIceberg table API to align with 'to_pandas' as an example.
Polars 'scan_iceberg' uses PyIceberg to create the LazyFrame:
https://github.com/pola-rs/polars/blob/9359ed576d972dce257346fcd62c8857f3d23277/py-polars/polars/io/iceberg.py#L139
The filtering can be done in PyIceberg, so aren't the2 approaches similar?

…he Table class with a to_polars method whihc returns a polars LazyFrame
@yigal-rozenberg yigal-rozenberg changed the title Added support for Polars DataFrame Added support for Polars DataFrame and LazyFarame Feb 6, 2025
@corleyma
Copy link

corleyma commented Feb 6, 2025

Polars 'scan_iceberg' uses PyIceberg to create the LazyFrame:
https://github.com/pola-rs/polars/blob/9359ed576d972dce257346fcd62c8857f3d23277/py-polars/polars/io/iceberg.py#L139
The filtering can be done in PyIceberg, so aren't the2 approaches similar?

The difference is the approach as documented is encouraging folks to write their own filter predicates for pyiceberg before materializing a dataframe with polars, whereas the "polars way" (as a lazy dataframe API) would be to just create the lazyframe, construct your compute graph with whatever polars predicates/etc make sense for you, and rely on polars to push that down at .collect() time to appropriately filter data before load where possible.

@corleyma
Copy link

corleyma commented Feb 6, 2025

Separately, rather than adding more library-specific conversion code, it might make sense for pyiceberg to start leveraging the PyCapsule protocol to allow any third party library (dataframe or otherwise) that supports Arrow data to seamlessly consume pyiceberg constructs.

Polars already supports the PyCapsule interface. See https://docs.pola.rs/user-guide/misc/arrow/#using-the-arrow-pycapsule-interface for details.

Implementing the interface on e.g. pyiceberg tables would allow them to be passed directly to dataframe init in polars, just like you can do a pyarrow table today. It also doesn't assume anything about polars support/doesn't add a dependency on polars.

@yigal-rozenberg
Copy link
Author

love the idea!
Will research in how to implement.
In the mean time, I believe this specific change request is straight forward, and allow both DataFrame, and LazyFrames to be utilized by Polars.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! I've added a few comments

pyproject.toml Outdated Show resolved Hide resolved
pyiceberg/table/__init__.py Show resolved Hide resolved
pyiceberg/table/__init__.py Show resolved Hide resolved
mkdocs/docs/api.md Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
mkdocs/docs/api.md Show resolved Hide resolved
mkdocs/docs/api.md Show resolved Hide resolved
pyiceberg/table/__init__.py Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
@kevinjqliu
Copy link
Contributor

can you rebase off main? looks like theres a conflict

@kevinjqliu
Copy link
Contributor

theres still conflict with main. could you also remove .vscode/settings.json?

@Fokko Fokko changed the title Added support for Polars DataFrame and LazyFarame Added support for Polars DataFrame and LazyFrame Feb 11, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I fixed the merge conflict and some linter issues.

I'll let Fokko chime in before proceeding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants