
[SEDONA-660] Add GeoArrow export from Spark DataFrame #1767

Merged
merged 20 commits into apache:master on Jan 27, 2025

Conversation

paleolimbot
Member

@paleolimbot paleolimbot commented Jan 22, 2025

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

Closes #1756.

What changes were proposed in this PR?

Added Adapter.toArrow() to collect an RDD or DataFrame to an Arrow Table where Geometry columns are represented as geoarrow.wkb extension types.

How was this patch tested?

Tests will be added to python/tests/utils/test_geoarrow.py.

Did this PR include necessary documentation updates?

  • A new API is added to the Python bindings and will be documented

@paleolimbot
Member Author

paleolimbot commented Jan 22, 2025

For future me:

import os
import pyspark
from sedona.spark import SedonaContext

# Unset SPARK_HOME so the pip-installed pyspark is used
if "SPARK_HOME" in os.environ:
    del os.environ["SPARK_HOME"]

# e.g. "3.5.1" -> "3.5", to select the matching Sedona artifact
pyspark_version = pyspark.__version__[: pyspark.__version__.rfind(".")]

config = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        f"org.apache.sedona:sedona-spark-{pyspark_version}_2.12:1.7.0,"
        "org.datasyslab:geotools-wrapper:1.7.0-28.5",
    )
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
import pyarrow as pa
from pyspark.sql.types import StringType, StructType

from sedona.utils.geoarrow import dataframe_to_arrow

test_wkt = ["POINT (0 1)", "LINESTRING (0 1, 2 3)"]

schema = StructType().add("wkt", StringType())
wkt_df = sedona.createDataFrame(zip(test_wkt), schema)

# No geometry
dataframe_to_arrow(wkt_df)
#> pyarrow.Table
#> wkt: string
#> ----
#> wkt: [["POINT (0 1)"],["LINESTRING (0 1, 2 3)"]]

# With geometry (not yet implemented)
geo_df = wkt_df.selectExpr("ST_GeomFromText(wkt) AS geom")
dataframe_to_arrow(geo_df)

@james-willis
Contributor

Why not do all this in scala?

Comment on lines 30 to 32
col_is_geometry = [
    isinstance(f.dataType, GeometryType) for f in spark_schema.fields
]
Contributor

This doesn't seem to handle geometries in complex types (arrays, structs).

Member Author

The tools for dealing with this on the pyarrow side are not great and might be quite a lot of work. Is this an important use case or is there a way to structure a select() (or push something into Scala) that I'm missing?

Contributor

I think it is important. I use geometries in complex types frequently. At least arrays of geometries is something that sedona itself outputs.

Member Author

Got it! I may defer to a follow-up since that's a bit of a learning curve for me. I think I can figure out how to walk the PySpark schema to get the nested paths corresponding to the Geometry nodes, and I think I can figure out how to do the node search/replace from the pyarrow side. For conversion, we can either:

  • Go back to the version here where we transform the result on the Python side in C
  • Figure out how to replace the FieldSerializer/ArrowSerializer on the Scala side to do what we need it to do there (basically: write WKB and add field metadata)
  • Figure out how to issue a select() call in pyspark that does the transformation using the pyspark dataframe API
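A possible starting point for the schema walk mentioned above, using hypothetical stand-in classes in place of the real pyspark.sql.types nodes (GeometryType, ArrayType, StructField, StructType) just to illustrate the recursion over nested paths:

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# Hypothetical stand-ins for pyspark.sql.types nodes; the real
# implementation would walk StructType/ArrayType/GeometryType instead.
@dataclass
class Geometry:
    pass

@dataclass
class Array:
    element: object

@dataclass
class Field:
    name: str
    dtype: object

@dataclass
class Struct:
    fields: List[Field]

def geometry_paths(dtype, prefix=()) -> Iterator[Tuple[str, ...]]:
    """Yield the nested path of every Geometry node in a schema tree."""
    if isinstance(dtype, Geometry):
        yield prefix
    elif isinstance(dtype, Struct):
        for f in dtype.fields:
            yield from geometry_paths(f.dtype, prefix + (f.name,))
    elif isinstance(dtype, Array):
        yield from geometry_paths(dtype.element, prefix + ("element",))

schema = Struct([
    Field("geom", Geometry()),
    Field("props", Struct([Field("geoms", Array(Geometry()))])),
])
print(list(geometry_paths(schema)))
# [('geom',), ('props', 'geoms', 'element')]
```

The collected paths could then drive a node search/replace on the pyarrow side.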

@paleolimbot
Member Author

Why not do all this in scala?

Great point! It's pyspark, but I pushed a version that does the WKB in Spark, which is way better than what I had in mind 🙂

import pyarrow as pa
from pyspark.sql.types import StringType, StructType

from sedona.utils.geoarrow import dataframe_to_arrow

test_wkt = ["POINT (0 1)", "LINESTRING (0 1, 2 3)"]

schema = StructType().add("wkt", StringType())
wkt_df = sedona.createDataFrame(zip(test_wkt), schema)

geo_df = wkt_df.selectExpr("ST_GeomFromText(wkt) AS geom")
dataframe_to_arrow(geo_df)
#> pyarrow.Table
#> geom: extension<geoarrow.wkb<WkbType>>
#> ----
#> geom: [[01010000000000000000000000000000000000F03F],[0102000000020000000000000000000000000000000000F03F00000000000000400000000000000840]]
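As a sanity check on the hex output above: the storage is standard WKB, so the point encoding can be decoded with nothing but the stdlib struct module (assumes the basic WKB layout: one endianness byte, a uint32 geometry type where 1 = Point, then two doubles):

```python
import struct

def parse_wkb_point(hex_wkb: str):
    """Decode a hex-encoded WKB Point into an (x, y) tuple."""
    buf = bytes.fromhex(hex_wkb)
    endian = "<" if buf[0] == 1 else ">"  # 1 = little-endian, 0 = big-endian
    (geom_type,) = struct.unpack_from(endian + "I", buf, 1)
    if geom_type != 1:  # 1 = Point in the WKB spec
        raise ValueError(f"not a Point: geometry type {geom_type}")
    return struct.unpack_from(endian + "dd", buf, 5)

print(parse_wkb_point("01010000000000000000000000000000000000F03F"))
# (0.0, 1.0)  -- i.e. POINT (0 1), matching the first input geometry
```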

@jiayuasu
Member

Maybe relevant: can this solve the toPandas() issue of PySpark DataFrame (https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#enabling-for-conversion-to-from-pandas)?

In Sedona, when you call toPandas(), it gives you a GeoPandas DF, but this will throw an exception if spark.sql.execution.arrow.pyspark.enabled is set to true because Sedona Python does not work with GeoArrow.

CC @Kontinuation

@paleolimbot
Member Author

can this solve the toPandas() issue of PySpark DataFrame

That probably needs something in Java land to define the Arrow serialization of a user-defined type?

In Python-land it won't quite get us to a GeoSeries today but it's a lot closer. I think that needs the GeoPandasDtype to define a method (and possibly something in Pandas to get a GeoSeries instead of a Series) but I think everybody from GeoPandas is on board to make that happen.

@paleolimbot paleolimbot marked this pull request as ready for review January 23, 2025 20:29
@paleolimbot paleolimbot requested a review from jiayuasu as a code owner January 23, 2025 20:29
@paleolimbot
Member Author

paleolimbot commented Jan 23, 2025

I still have some questions here, but it does work!

Edit (more questions + example):

  • We need to propagate the CRS. We could add an argument so that a caller who already knows it can pass crs=... to avoid calculating it, but otherwise we should probably calculate it to ensure it gets propagated. We could update the AsBinary() call to be AsEWKB() and handle that from Python, or issue a separate Spark call?
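For reference, EWKB differs from the plain WKB above only by a flag bit in the type word plus an embedded SRID. A minimal sketch of sniffing the SRID out of an EWKB header (assumes the standard PostGIS EWKB layout; the hex string is a hand-built example of an SRID-4326 point, not Sedona output):

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # PostGIS EWKB: "SRID present" bit in the type word

def read_ewkb_srid(hex_ewkb: str):
    """Return the embedded SRID of an EWKB geometry, or None for plain WKB."""
    buf = bytes.fromhex(hex_ewkb)
    endian = "<" if buf[0] == 1 else ">"
    (raw_type,) = struct.unpack_from(endian + "I", buf, 1)
    if raw_type & EWKB_SRID_FLAG:
        # The 4-byte SRID immediately follows the type word
        (srid,) = struct.unpack_from(endian + "I", buf, 5)
        return srid
    return None

print(read_ewkb_srid("0101000020E61000000000000000000000000000000000F03F"))
# 4326
```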

Example converting to GeoPandas:

import geopandas

from sedona.utils.geoarrow import dataframe_to_arrow

# 200 million points
df = sedona.read.format("geoparquet").load("microsoft-buildings-point.parquet")

# 17s
table = dataframe_to_arrow(df.limit(10_000_000))

# 8s
geopandas.GeoDataFrame.from_arrow(table)

I think this is possibly a substantial speedup over following the existing documentation for Pandas conversion ( https://sedona.apache.org/1.7.0/tutorial/geopandas-shapely/#from-sedona-dataframe-to-geopandas ).

# 47 s
pdf = df.limit(10_000_000).toPandas()
geopandas.GeoDataFrame(pdf, geometry="geom")

@jiayuasu
Member

@paleolimbot do you have a benchmark number to show the performance gain? Will this benefit our integration with lonboard?

@paleolimbot
Member Author

do you have a benchmark number to show the performance gain?

My example above (end-to-end 10 million points from Sedona to GeoPandas) takes 25s after this PR (and 47s before, starting from a fresh session each time), although I am too new to Sedona to know if I am doing something obviously wrong or whether I'm benchmarking an unrepresentative workload.

Will this benefit our integration with lonboard?

I'll double check, but this should work out of the box (viz() accepts anything that implements __arrow_c_stream__, like a pyarrow.Table).

from lonboard import viz

table = dataframe_to_arrow(df.limit(10_000_000))
viz(table)
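For context, the duck typing viz() relies on here is the Arrow PyCapsule interface: any object exposing an __arrow_c_stream__ method is accepted. A minimal sketch of a wrapper that forwards the export (StreamWrapper and Dummy are illustrative names; a real table returns a PyCapsule, not a string):

```python
class StreamWrapper:
    """Forwards the Arrow PyCapsule stream export of a wrapped table-like object."""

    def __init__(self, table):
        self._table = table  # e.g. a pyarrow.Table

    def __arrow_c_stream__(self, requested_schema=None):
        # Consumers like viz() call this to obtain an ArrowArrayStream capsule.
        return self._table.__arrow_c_stream__(requested_schema)


class Dummy:
    """Hypothetical stand-in for a pyarrow.Table's capsule export."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule"  # a real table would return a PyCapsule here


print(StreamWrapper(Dummy()).__arrow_c_stream__())
```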

@jiayuasu
Member

@paleolimbot

This is awesome.

Let me know if this works. Then I will merge this PR.

I'll double check, but this should work out of the box (viz() accepts anything that implements __arrow_c_stream__, like a pyarrow.Table).

@paleolimbot
Member Author

It works! The two TODOs here are:

  • Document the function and make sure it can be imported with from sedona.spark import to_arrow
  • Ensure the CRS on the output is set properly

@paleolimbot
Member Author

I'd forgotten about CRS handling (which works in theory, but needs some explicit tests for some of the corner cases)!

Member

@jiayuasu jiayuasu left a comment

can you also add a short paragraph at the bottom of this page? https://sedona.apache.org/latest/tutorial/sql/#convert-between-dataframe-and-spatialrdd

@jiayuasu jiayuasu added this to the sedona-1.7.1 milestone Jan 27, 2025
@github-actions github-actions bot added the docs label Jan 27, 2025
@paleolimbot
Member Author

can you also add a short paragraph at the bottom of this page?

I added some documentation to the GeoPandas section (I'll add to the section you linked here for the partitioning PR!)

I'm comfortable with the tests and CRS propagation here! The EWKB bit was a bit of a ride but I think I was able to test the implementation adequately here.

@jiayuasu jiayuasu merged commit 7d32fe0 into apache:master Jan 27, 2025
25 checks passed
@paleolimbot paleolimbot deleted the python-geoarrow-serde branch February 3, 2025 16:39
Development

Successfully merging this pull request may close these issues.

Add GeoArrow IO to Sedona/Python
3 participants