Context
Now PipelineDP supports 3 execution modes: Apache Spark, Apache Beam, and without frameworks (here is an example of how to run on different frameworks).
Currently the API works with unstructured collections (RDD for Spark, PCollection for Beam, an iterable without frameworks), and the data semantics are specified with corresponding extractor functions (DataExtractors; a usage example is in DPEngine.aggregate).
DataFrame is a very common API, and it would be great if PipelineDP supported it natively. There are DataFrame APIs for both Beam and Spark.
Note: currently it's possible to apply PipelineDP transformations on DataFrames by specifying extractors that return the corresponding column value (example), but this approach has downsides:
- PipelineDP can't optimize using column operations.
- The DataFrame API is usually more expressive.
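A rough sketch of that workaround on a Spark DataFrame. It assumes a SparkSession `spark` and a DataFrame `df` with columns `user_id`, `movie_id`, `rating` (all made up for the example); parameter names follow the current PipelineDP API and may differ between versions:

```python
import pipeline_dp

# Budget and backend setup; SparkRDDBackend runs the computation on Spark RDDs.
budget_accountant = pipeline_dp.NaiveBudgetAccountant(total_epsilon=1.0,
                                                      total_delta=1e-6)
backend = pipeline_dp.SparkRDDBackend(spark.sparkContext)
dp_engine = pipeline_dp.DPEngine(budget_accountant, backend)

params = pipeline_dp.AggregateParams(
    noise_kind=pipeline_dp.NoiseKind.LAPLACE,
    metrics=[pipeline_dp.Metrics.COUNT],
    max_partitions_contributed=2,
    max_contributions_per_partition=1)

# The extractors simply return the corresponding column of each Row.
extractors = pipeline_dp.DataExtractors(
    privacy_id_extractor=lambda row: row["user_id"],
    partition_extractor=lambda row: row["movie_id"],
    value_extractor=lambda row: row["rating"])

# The DataFrame has to be converted to an RDD[Row] first; PipelineDP only sees
# per-element records, so no column-based optimizations are possible.
dp_result = dp_engine.aggregate(df.rdd, params, extractors)
budget_accountant.compute_budgets()
```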
PipelineDP APIs
PipelineDP has 2 APIs:
- DPEngine, the main function is DPEngine.aggregate.
- The higher-level "private" API (e.g. PrivateRDD for Spark).

Goals
The idea of this issue is to design a DataFrame API for PipelineDP. There are the following ideas:
- It should be similar for Pandas, Spark and Beam DataFrames.
- The low-level API (i.e. DPEngine.aggregate) might take DataFrames as input/output and accept private_id_column, partition_key_column, value_column instead of the corresponding extractors (a hypothetical sketch follows this list).
- Ideally (not necessarily in the 1st version), PipelineDP performs column-based operations on DataFrames instead of per-element operations, which should provide a performance speed-up.
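A hypothetical sketch of what the low-level DataFrame API could look like, reusing `dp_engine` and `params` from the sketch above; none of these parameter names exist in PipelineDP today, they only illustrate the idea:

```python
# Hypothetical only: aggregate takes/returns DataFrames, and column names
# replace the extractor functions.
dp_result_df = dp_engine.aggregate(
    df,                                    # Pandas / Spark / Beam DataFrame
    params,
    private_id_column="user_id",
    partition_key_column="movie_id",
    value_column="rating")
# dp_result_df is again a DataFrame: one row per partition key,
# one column per requested metric.
```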
Note: this is a very high-level issue, and it will require design discussions. I'm happy to participate and help with that.
dvadym added the Type: Epic 🤙 label (describes a large amount of functionality that will likely be broken down into smaller issues) on Mar 30, 2022
dvadym changed the title from DataFrame style API (WIP) to DataFrame style API on Mar 30, 2022
For Apache Spark it could be done by converting the initial pyspark.sql.DataFrame to RDD[Row] and then to List[RDD[Any]]. So it is not so hard to add syntax sugar like make_private_df(df: DataFrame, budget_accountant, privacy_id_columns) with groupBy and aggregate methods.
Under the hood it would convert the initial pyspark.sql.DataFrame to a list of PrivateRDDs and also store the schema of the initial data. Aggregations would be applied to the separate RDDs, with corresponding schema updates. toDF would combine the List[RDD] into an RDD[Row] and convert it to a Spark DataFrame using the stored schema.
It is a "spark-like" syntax. It could also be done the same way for Pandas, but without the need to store the schema, because Pandas is not lazy and the schema is inferred at runtime. I'm not sure about Beam, though, because I'm not familiar with it.