Filter rows directly from pa.RecordBatch #1621
Thanks for fixing this, @gabeiglio. In addition, I think we also need to bump the minimum version of Arrow here: Line 64 in dfbee4b
Thanks for following up on that comment 😄 If we're bumping the minimum pyarrow version to 17, we might want to address this comment as well.
@kevinjqliu IIUC, removing the schema casting will let the pyarrow scanner infer by itself whether it needs large types? So it is basically a matter of changing the test assertions to match the types the scan returns?
I believe so. We can also do this in a follow-up PR! I just saw that comment during code review.
Looks like there's an issue in the CI tests.
Yes, I think it would be better to split these changes into separate PRs, since there are a lot of changes to be made, especially to the tests. (If that's okay, I'll open the other PR for schema casting @kevinjqliu @Fokko)
One minor comment, looks great, and so much cleaner :)
@@ -1348,33 +1348,34 @@ def _task_to_record_batches(
    next_index = 0
    batches = fragment_scanner.to_batches()
    for batch in batches:
nit, I think we can drop the batch:
- for batch in batches:
+ for current_batch in batches:
Looks like CI is broken on poetry
    current_index = next_index
    next_index = current_index + len(batch)
is this logically equivalent? feels like there was a reason to write it the other way.
cc @sungwy do you have context on this?
Oh I wasn't planning on pushing this change 🤦. I'll revert it in the next commit if we want
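For anyone checking the equivalence question in this thread, here is a minimal sketch comparing the two orderings. It assumes the earlier form advanced next_index first and then recovered current_index by subtraction; that earlier form is not quoted in this thread, and the helper names offsets_new and offsets_old are hypothetical.

```python
def offsets_new(batch_lengths):
    """Form shown in the diff: record the start offset, then advance."""
    next_index = 0
    spans = []
    for n in batch_lengths:
        current_index = next_index
        next_index = current_index + n
        spans.append((current_index, next_index))
    return spans


def offsets_old(batch_lengths):
    """Assumed earlier form: advance first, then subtract to get the start."""
    next_index = 0
    spans = []
    for n in batch_lengths:
        next_index = next_index + n
        current_index = next_index - n
        spans.append((current_index, next_index))
    return spans


lengths = [3, 0, 5, 2]  # row counts of successive batches, including an empty one
assert offsets_new(lengths) == offsets_old(lengths)
print(offsets_new(lengths))  # [(0, 3), (3, 3), (3, 8), (8, 10)]
```

Under that assumption the two forms produce the same (current_index, next_index) pairs, provided len(batch) does not change between the two statements, i.e. batch is not reassigned in between.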
This PR from Apache Arrow was merged to allow filtering with a boolean expression directly on pa.RecordBatch.
I believe pyiceberg is currently using pyarrow version 19.0.0.
Filtering on pa.RecordBatch with an expression was introduced in Python in version 17.0.0.
I have not run the integration tests; for some reason my Docker setup is messed up. I believe this test should cover this change:
iceberg-python/tests/integration/test_deletes.py
Line 314 in dfbee4b
Closes #1050