Filter rows directly from pa.RecordBatch #1621

Open
wants to merge 9 commits into main

Conversation

@gabeiglio (Contributor) commented on Feb 7, 2025

This PR from Apache Arrow was merged to allow filtering with a boolean expression directly on pa.RecordBatch.

I believe pyiceberg is currently using pyarrow version 19.0.0. Filtering from pa.RecordBatch was introduced in Python in version 17.0.0.
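For context, here is a minimal sketch of the API this PR relies on (illustrative data and column names, assuming PyArrow >= 17):

import pyarrow as pa
import pyarrow.compute as pc

# Illustrative only: with PyArrow >= 17, a compute expression can be passed
# directly to RecordBatch.filter instead of converting to a Table first.
batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "value": ["a", "b", "c"]})
filtered = batch.filter(pc.field("id") > 1)  # keeps rows where id > 1
print(filtered.to_pydict())  # {'id': [2, 3], 'value': ['b', 'c']}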

I have not run the integration tests because my Docker setup is currently broken. I believe this test should cover this change:

def test_read_multiple_batches_in_task_with_position_deletes(spark: SparkSession, session_catalog: RestCatalog) -> None:

Closes #1050

@Fokko (Contributor) commented on Feb 7, 2025

Thanks for fixing this @gabeiglio

In addition, I think we also need to bump the minimum version of Arrow here:

pyarrow = { version = ">=14.0.0,<20.0.0", optional = true }

@kevinjqliu (Contributor) commented

thanks for following up on that comment 😄

If we're bumping the minimum pyarrow version to 17, we might want to address this comment as well:
https://github.com/apache/iceberg-python/pull/1621/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1335-R1338

@gabeiglio (Contributor, Author) commented

@kevinjqliu IIUC, removing the schema casting will allow the pyarrow scanner to infer by itself whether it needs large types or not? So it is basically a matter of changing the test assertions to match the types returned by the scan?
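For illustration only (not pyiceberg's code), the point is that Arrow's regular and large types are distinct, so test assertions have to match whichever variant the scan actually returns:

import pyarrow as pa

# Illustrative only: string and large_string are different Arrow types, so
# schema equality checks in tests depend on which one the scanner yields.
small = pa.schema([pa.field("name", pa.string())])
large = pa.schema([pa.field("name", pa.large_string())])
print(small.equals(large))  # False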

@kevinjqliu (Contributor) commented

I believe so. We can also do this in a follow-up PR! I just saw that comment during code review.

@kevinjqliu (Contributor) commented

Looks like there's an issue in the CI tests.

@gabeiglio (Contributor, Author) commented

Yes, I think it would be better to split these changes into separate PRs, since there are a lot of changes to be made to the tests especially. (If that's okay, I'll open the other PR for the schema casting. @kevinjqliu @Fokko)

pyiceberg/io/pyarrow.py (outdated review comment, resolved)
@Fokko (Contributor) left a comment

One minor comment, looks great, and so much cleaner :)

@@ -1348,33 +1348,34 @@ def _task_to_record_batches(
next_index = 0
batches = fragment_scanner.to_batches()
for batch in batches:

Nit: I think we can rename batch here:

Suggested change
for batch in batches:
for current_batch in batches:

@kevinjqliu (Contributor) left a comment

Looks like CI is broken on poetry.

Comment on lines +1351 to +1352
current_index = next_index
next_index = current_index + len(batch)

Is this logically equivalent? It feels like there was a reason to write it the other way.

cc @sungwy, do you have context on this?
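For illustration, an assumed reconstruction of why the order could matter here (not the PR's actual code): if the batch is filtered inside the loop, the running offset has to be advanced by the unfiltered batch length before any rows are dropped:

import pyarrow as pa
import pyarrow.compute as pc

# Illustrative sketch: track absolute row offsets across batches while also
# filtering each batch. next_index is computed from the original batch length,
# before the filter (PyArrow >= 17 for expression filtering) shrinks the batch.
batches = [
    pa.RecordBatch.from_pydict({"id": [1, 2, 3]}),
    pa.RecordBatch.from_pydict({"id": [4, 5, 6]}),
]
next_index = 0
for batch in batches:
    current_index = next_index
    next_index = current_index + len(batch)  # unfiltered length
    filtered = batch.filter(pc.field("id") > 2)  # row count may shrink here
    print(current_index, next_index, len(filtered))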

@gabeiglio (Contributor, Author) replied

Oh, I wasn't planning on pushing this change 🤦. I'll revert it in the next commit if we want.

Successfully merging this pull request may close these issues.

[feat] push down filters and positional deletes to the record batch level