Filter rows directly from pa.RecordBatch #1621

Open
wants to merge 9 commits into main

Conversation

@gabeiglio (Contributor) commented on Feb 7, 2025

This PR from Apache Arrow was merged to allow filtering with a boolean expression directly on pa.RecordBatch.

I believe pyiceberg is currently using pyarrow version 19.0.0. Filtering from pa.RecordBatch was introduced in Python in version 17.0.0.
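For context, here is a minimal sketch of the API this PR relies on (illustrative data and column names, assuming PyArrow >= 17):

import pyarrow as pa
import pyarrow.compute as pc

# Illustrative only: with PyArrow >= 17, a compute expression can be passed
# directly to RecordBatch.filter instead of converting to a Table first.
batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "value": ["a", "b", "c"]})
filtered = batch.filter(pc.field("id") > 1)  # keeps rows where id > 1
print(filtered.to_pydict())  # {'id': [2, 3], 'value': ['b', 'c']}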

I have not run the integration tests because my Docker setup is currently broken. I believe this test should cover this change:

def test_read_multiple_batches_in_task_with_position_deletes(spark: SparkSession, session_catalog: RestCatalog) -> None:

Closes #1050

@Fokko (Contributor) commented on Feb 7, 2025

Thanks for fixing this @gabeiglio

In addition, I think we also need to bump the minimum version of Arrow here:

pyarrow = { version = ">=14.0.0,<20.0.0", optional = true }

@kevinjqliu (Contributor) commented

thanks for following up on that comment 😄

If we're bumping the minimum pyarrow version to 17, we might want to address this comment as well:
https://github.com/apache/iceberg-python/pull/1621/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1335-R1338

@gabeiglio (Contributor, Author) commented

@kevinjqliu IIUC, removing the schema casting will allow the pyarrow scanner to infer by itself whether it needs large types or not? So it is basically a matter of changing the test assertions to match the types returned by the scan?
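For illustration only (not pyiceberg's code), the point is that Arrow's regular and large types are distinct, so test assertions have to match whichever variant the scan actually returns:

import pyarrow as pa

# Illustrative only: string and large_string are different Arrow types, so
# schema equality checks in tests depend on which one the scanner yields.
small = pa.schema([pa.field("name", pa.string())])
large = pa.schema([pa.field("name", pa.large_string())])
print(small.equals(large))  # False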

@kevinjqliu (Contributor) commented

I believe so. We can also do this in a follow-up PR! I just saw that comment during code review.

@kevinjqliu (Contributor) commented

Looks like there's an issue in the CI tests.

@gabeiglio (Contributor, Author) commented

Yes, I think it would be better to split these changes into separate PRs, since there are a lot of changes to be made to the tests especially. (If that's okay, I'll open the other PR for the schema casting. @kevinjqliu @Fokko)

pyiceberg/io/pyarrow.py (outdated review comment, resolved)
@Fokko (Contributor) left a comment

One minor comment, looks great, and so much cleaner :)

@@ -1348,33 +1348,34 @@ def _task_to_record_batches(
next_index = 0
batches = fragment_scanner.to_batches()
for batch in batches:

Nit: I think we can rename batch here:

Suggested change
for batch in batches:
for current_batch in batches:

@kevinjqliu (Contributor) left a comment

Looks like CI is broken on poetry.

Comment on lines +1351 to +1352
current_index = next_index
next_index = current_index + len(batch)

Is this logically equivalent? It feels like there was a reason to write it the other way.

cc @sungwy, do you have context on this?
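For illustration, an assumed reconstruction of why the order could matter here (not the PR's actual code): if the batch is filtered inside the loop, the running offset has to be advanced by the unfiltered batch length before any rows are dropped:

import pyarrow as pa
import pyarrow.compute as pc

# Illustrative sketch: track absolute row offsets across batches while also
# filtering each batch. next_index is computed from the original batch length,
# before the filter (PyArrow >= 17 for expression filtering) shrinks the batch.
batches = [
    pa.RecordBatch.from_pydict({"id": [1, 2, 3]}),
    pa.RecordBatch.from_pydict({"id": [4, 5, 6]}),
]
next_index = 0
for batch in batches:
    current_index = next_index
    next_index = current_index + len(batch)  # unfiltered length
    filtered = batch.filter(pc.field("id") > 2)  # row count may shrink here
    print(current_index, next_index, len(filtered))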

@gabeiglio (Contributor, Author) replied

Oh, I wasn't planning on pushing this change 🤦. I'll revert it in the next commit if we want.

Successfully merging this pull request may close these issues.

[feat] push down filters and positional deletes to the record batch level