[SEDONA-660] Add GeoArrow export from Spark DataFrame #1767
Conversation
For future me:

```python
import os

import pyspark

from sedona.spark import SedonaContext

if "SPARK_HOME" in os.environ:
    del os.environ["SPARK_HOME"]

pyspark_version = pyspark.__version__[:pyspark.__version__.rfind(".")]

config = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        f"org.apache.sedona:sedona-spark-{pyspark_version}_2.12:1.7.0,"
        "org.datasyslab:geotools-wrapper:1.7.0-28.5",
    )
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)

sedona = SedonaContext.create(config)
```

```python
import pyarrow as pa
from pyspark.sql.types import StringType, StructType

from sedona.utils.geoarrow import dataframe_to_arrow

test_wkt = ["POINT (0 1)", "LINESTRING (0 1, 2 3)"]
schema = StructType().add("wkt", StringType())
wkt_df = sedona.createDataFrame(zip(test_wkt), schema)

# No geometry
dataframe_to_arrow(wkt_df)
#> pyarrow.Table
#> wkt: string
#> ----
#> wkt: [["POINT (0 1)"],["LINESTRING (0 1, 2 3)"]]

# With geometry (not yet implemented)
geo_df = wkt_df.selectExpr("ST_GeomFromText(wkt) AS geom")
dataframe_to_arrow(geo_df)
```
Why not do all this in Scala?
python/sedona/utils/geoarrow.py (outdated)

```python
col_is_geometry = [
    isinstance(f.dataType, GeometryType) for f in spark_schema.fields
]
```
This doesn't seem to handle geometries in complex types (arrays, structs).
The tools for dealing with this on the pyarrow side are not great, and it might be quite a lot of work. Is this an important use case, or is there a way to structure a `select()` (or push something into Scala) that I'm missing?
I think it is important. I use geometries in complex types frequently; at the very least, arrays of geometries are something Sedona itself outputs.
Got it! I may defer to a follow-up since that's a bit of a learning curve for me. I think I can figure out how to walk the PySpark schema to get the nested paths corresponding to the Geometry nodes, and I think I can figure out how to do the node search/replace from the pyarrow side. For conversion, we can either:

- Go back to the version here where we transform the result on the Python side in C
- Figure out how to replace the `FieldSerializer`/`ArrowSerializer` on the Scala side to do what we need it to do there (basically: write WKB and add field metadata)
- Figure out how to issue a `select()` call in PySpark that does the transformation using the PySpark DataFrame API
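If it helps, here's a rough sketch of the schema walk mentioned above. To keep it self-contained, Spark's `StructType`/`ArrayType`/`MapType` are modeled with plain tuples (hypothetical stand-ins, not the real pyspark classes), and the string `"geometry"` stands in for Sedona's `GeometryType`:

```python
# Simplified model of a Spark schema: ("struct", [(name, child), ...]),
# ("array", element), ("map", key, value), and leaf type names as strings.
def geometry_paths(dtype, path=()):
    """Yield the nested path at which each geometry-typed node appears."""
    kind = dtype[0] if isinstance(dtype, tuple) else dtype
    if kind == "geometry":
        yield path
    elif kind == "struct":
        for name, child in dtype[1]:
            yield from geometry_paths(child, path + (name,))
    elif kind == "array":
        yield from geometry_paths(dtype[1], path + ("element",))
    elif kind == "map":
        yield from geometry_paths(dtype[2], path + ("value",))

schema = ("struct", [
    ("id", "string"),
    ("geom", "geometry"),
    ("parts", ("array", "geometry")),
    ("props", ("struct", [("centroid", "geometry"), ("name", "string")])),
])
print(list(geometry_paths(schema)))
#> [('geom',), ('parts', 'element'), ('props', 'centroid')]
```

The same recursion would apply to real pyspark `DataType` objects by dispatching on `isinstance` instead of tuple tags.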
Great point! It's PySpark, but I pushed a version that does the WKB in Spark, which is way better than what I had in mind 🙂

```python
import pyarrow as pa
from pyspark.sql.types import StringType, StructType

from sedona.utils.geoarrow import dataframe_to_arrow

test_wkt = ["POINT (0 1)", "LINESTRING (0 1, 2 3)"]
schema = StructType().add("wkt", StringType())
wkt_df = sedona.createDataFrame(zip(test_wkt), schema)

geo_df = wkt_df.selectExpr("ST_GeomFromText(wkt) AS geom")
dataframe_to_arrow(geo_df)
#> pyarrow.Table
#> geom: extension<geoarrow.wkb<WkbType>>
#> ----
#> geom: [[01010000000000000000000000000000000000F03F],[0102000000020000000000000000000000000000000000F03F00000000000000400000000000000840]]
```
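As a quick sanity check (pure stdlib, just for illustration), the first WKB value printed above decodes back to `POINT (0 1)`:

```python
import struct

# First WKB value from the output above
wkb = bytes.fromhex("01010000000000000000000000000000000000F03F")

byte_order = wkb[0]                               # 1 = little-endian
(geom_type,) = struct.unpack_from("<I", wkb, 1)   # 1 = Point
x, y = struct.unpack_from("<dd", wkb, 5)          # coordinate pair

print(byte_order, geom_type, x, y)
#> 1 1 0.0 1.0
```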
Maybe relevant: can this solve the `toPandas()` issue of PySpark DataFrames (https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#enabling-for-conversion-to-from-pandas)? In Sedona, when you call
That probably needs something in Java land to define the Arrow serialization of a user-defined type? In Python-land it won't quite get us to a
I still have some questions here, but it does work!

Edit (more questions + example):

Example converting to GeoPandas:

```python
import geopandas

from sedona.utils.geoarrow import dataframe_to_arrow

# 200 million points
df = sedona.read.format("geoparquet").load("microsoft-buildings-point.parquet")

# 17s
table = dataframe_to_arrow(df.limit(10_000_000))
# 8s
geopandas.GeoDataFrame.from_arrow(table)
```

I think this is possibly a substantial speedup over following the existing documentation for Pandas conversion (https://sedona.apache.org/1.7.0/tutorial/geopandas-shapely/#from-sedona-dataframe-to-geopandas):

```python
# 47s
pdf = df.limit(10_000_000).toPandas()
geopandas.GeoDataFrame(pdf, geometry="geom")
```
@paleolimbot do you have a benchmark number to show the performance gain? Will this benefit our integration with lonboard?
My example above (end-to-end, 10 million points from Sedona to GeoPandas) takes 25s after this PR (and 47s before, starting from a fresh session each time), although I am too new to Sedona to know if I am doing something obviously wrong or whether I'm benchmarking an unrepresentative workload.

I'll double check, but this should work out of the box:

```python
from lonboard import viz

table = dataframe_to_arrow(df.limit(10_000_000))
viz(table)
```
This is awesome. Let me know if this works. Then I will merge this PR.
I'd forgotten about CRS handling (which works in theory, but needs some explicit tests for some of the corner cases)!
Can you also add a short paragraph at the bottom of this page? https://sedona.apache.org/latest/tutorial/sql/#convert-between-dataframe-and-spatialrdd
I added some documentation to the GeoPandas section (I'll add to the section you linked here for the partitioning PR!). I'm comfortable with the tests and CRS propagation here. The EWKB bit was a bit of a ride, but I think I was able to test the implementation adequately here.
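For anyone following along, the EWKB wrinkle is that PostGIS-style EWKB sets a flag bit (`0x20000000`) in the geometry-type word and embeds an SRID before the coordinates, whereas ISO WKB does not. A minimal stdlib sketch of detecting that (not the PR's implementation, just an illustration of the format):

```python
import struct

EWKB_SRID_FLAG = 0x20000000

def ewkb_srid(buf: bytes):
    """Return the embedded SRID from an (E)WKB blob, or None for plain WKB."""
    fmt = "<I" if buf[0] == 1 else ">I"   # byte 0: 1 = little-endian
    (geom_type,) = struct.unpack_from(fmt, buf, 1)
    if geom_type & EWKB_SRID_FLAG:
        (srid,) = struct.unpack_from(fmt, buf, 5)
        return srid
    return None

# EWKB point with SRID 4326 vs. plain WKB point
ewkb_point = bytes.fromhex("0101000020E6100000") + struct.pack("<dd", 0.0, 1.0)
wkb_point = bytes.fromhex("0101000000") + struct.pack("<dd", 0.0, 1.0)
print(ewkb_srid(ewkb_point), ewkb_srid(wkb_point))
#> 4326 None
```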
Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

[SEDONA-660] my subject. Closes #1756.

What changes were proposed in this PR?

Added `Adapter.toArrow()` to collect an RDD or DataFrame to an Arrow Table where Geometry columns are represented as `geoarrow.wkb` extension types.

How was this patch tested?

Tests will be added to `python/tests/utils/test_geoarrow.py`.

Did this PR include necessary documentation updates?