Is your feature request related to a problem or challenge?
arrow-rs is in the process of gaining support for Parquet modular encryption - see apache/arrow-rs#7278. It would be useful to be able to read and write encrypted Parquet files with DataFusion, but it's not clear how to integrate this feature due to the complex configuration required.
Examples of this complex configuration are:
Users may require different encryption or decryption keys to be specified per Parquet file
The encryption and decryption keys specified may depend on the file schema
The encryption keys may need to be generated per file by interacting with a user's key management service (KMS)
Decryption keys may need to be retrieved dynamically based on the metadata read from Parquet files and require interaction with a KMS. This process would be opaque to DataFusion, but requires the FileDecryptionProperties in arrow-rs to be created with a callback that can't be represented as a string configuration option (see arrow-rs#7257, "Allow retrieving Parquet decryption keys based on the key metadata").
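To make the last point concrete, here is a minimal sketch of why a key-retrieval callback can't be flattened into a string option. The `KeyRetriever` trait, `InMemoryKms` and `demo_kms` below are hypothetical stand-ins written for this illustration, not the arrow-rs API, and the in-memory map only imitates a remote KMS:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a key-retrieval callback (NOT the arrow-rs API).
/// It maps the key metadata stored in a Parquet file to raw key bytes.
trait KeyRetriever: Send + Sync {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String>;
}

/// Toy in-memory "KMS"; a real one would make a network call and
/// hold credentials, which is state a string option cannot carry.
struct InMemoryKms {
    keys: HashMap<Vec<u8>, Vec<u8>>,
}

impl KeyRetriever for InMemoryKms {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String> {
        self.keys
            .get(key_metadata)
            .cloned()
            .ok_or_else(|| format!("no key for metadata {:?}", key_metadata))
    }
}

/// Build a demo KMS holding a single 16-byte key under the id "kf".
fn demo_kms() -> InMemoryKms {
    let mut keys = HashMap::new();
    keys.insert(b"kf".to_vec(), b"0123456789012345".to_vec());
    InMemoryKms { keys }
}
```

The key is only known once the file's key metadata has been read, so the lookup has to run at read time rather than being fixed when the options are constructed.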
Currently all Parquet format options can be easily encoded as strings or primitive types, and live in datafusion-common, which has an optional dependency on the parquet crate, although TableParquetOptions is always defined even if the parquet feature is disabled.
We're experimenting with using encryption in DataFusion by adding encoded keys to the ParquetOptions struct, but this is quite limited and doesn't support the more complex configuration options I mention above.
Describe the solution you'd like
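One solution might be to allow users to arbitrarily customize the Parquet writing and reading options, e.g. by adding optional callback fields to TableParquetOptions that can modify the writer properties and reader options before they are used.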
These callbacks would probably need some other inputs like the file schema too. This would allow DataFusion users to specify encryption-specific options without DataFusion itself needing to know about them, and might be useful for applying other Parquet options that aren't already exposed in DataFusion. This also supports generating different encryption properties per file.
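A rough, self-contained sketch of the callback idea is below. Everything in it is an assumption made for illustration: `WriterPropsBuilder` and `ReaderOptions` stand in for the real parquet builder and reader option types, `FileContext` is a hypothetical per-file input carrying the path and column names, and the field names `writer_configuration` and `read_configuration` are invented here.

```rust
use std::sync::Arc;

// Stand-ins for the parquet writer builder and Arrow reader options (assumed shapes).
#[derive(Clone, Debug, Default, PartialEq)]
struct WriterPropsBuilder {
    encryption_key: Option<Vec<u8>>,
}

#[derive(Clone, Debug, Default, PartialEq)]
struct ReaderOptions {
    decryption_key: Option<Vec<u8>>,
}

/// Hypothetical per-file input handed to the callbacks.
struct FileContext {
    path: String,
    column_names: Vec<String>,
}

type WriterCallback =
    Arc<dyn Fn(&FileContext, WriterPropsBuilder) -> WriterPropsBuilder + Send + Sync>;
type ReaderCallback = Arc<dyn Fn(&FileContext, ReaderOptions) -> ReaderOptions + Send + Sync>;

/// Sketch of the extra fields; the real TableParquetOptions has many more.
#[derive(Clone, Default)]
struct TableParquetOptionsSketch {
    writer_configuration: Option<WriterCallback>,
    read_configuration: Option<ReaderCallback>,
}

impl TableParquetOptionsSketch {
    /// Apply the user's writer callback (if any) for one output file.
    fn writer_builder_for(&self, ctx: &FileContext) -> WriterPropsBuilder {
        let base = WriterPropsBuilder::default();
        match &self.writer_configuration {
            Some(cb) => cb(ctx, base),
            None => base,
        }
    }

    /// Apply the user's reader callback (if any) for one input file.
    fn reader_options_for(&self, ctx: &FileContext) -> ReaderOptions {
        let base = ReaderOptions::default();
        match &self.read_configuration {
            Some(cb) => cb(ctx, base),
            None => base,
        }
    }
}

/// Demo: only encrypt files whose schema contains an "ssn" column.
fn demo_options(key: Vec<u8>) -> TableParquetOptionsSketch {
    let cb: WriterCallback = Arc::new(move |ctx: &FileContext, mut b: WriterPropsBuilder| {
        if ctx.column_names.iter().any(|c| c.as_str() == "ssn") {
            b.encryption_key = Some(key.clone());
        }
        b
    });
    TableParquetOptionsSketch {
        writer_configuration: Some(cb),
        read_configuration: None,
    }
}
```

Because the callback receives per-file context, the same options value can yield different encryption properties for different files, which is the property the proposal is after.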
TableParquetOptions can currently be created from environment variables, which wouldn't be possible for these extra fields, but I don't think that should be a problem?
Another minor issue is that TableParquetOptions implements PartialEq, and I don't think it would be possible to sanely implement equality while allowing custom callbacks like this.
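To illustrate the equality problem: closures can't be compared by behavior, so about the only option is comparing the pointers that hold them. The sketch below uses `Arc::ptr_eq` on a hypothetical `OptionsWithCallback` struct (a stand-in, not the real TableParquetOptions); whether this counts as "sane" equality is exactly the open question.

```rust
use std::sync::Arc;

type Callback = Arc<dyn Fn(u32) -> u32 + Send + Sync>;

/// Hypothetical options struct with one plain field and one callback field.
#[derive(Clone)]
struct OptionsWithCallback {
    compression_level: u32,
    callback: Option<Callback>,
}

impl PartialEq for OptionsWithCallback {
    fn eq(&self, other: &Self) -> bool {
        // Callbacks are equal only if they are the same allocation;
        // two closures with identical bodies still compare unequal.
        let same_callback = match (&self.callback, &other.callback) {
            (Some(a), Some(b)) => Arc::ptr_eq(a, b),
            (None, None) => true,
            _ => false,
        };
        self.compression_level == other.compression_level && same_callback
    }
}
```

With this definition a clone compares equal to its source, but two options built from separately constructed, textually identical closures do not, which may surprise anyone relying on PartialEq today.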
Describe alternatives you've considered
@alamb also suggested in delta-io/delta-rs#3300 that it could be possible to use an Arc<dyn Any> to allow passing more complex configuration types through TableParquetOptions.
I'm not sure exactly what this would look like though. Maybe the option would still hold a callback function but just hidden behind the Any trait, or maybe we would want to limit this to encryption-specific configuration options, but I think we'd need to maintain the ability to generate ArrowReaderOptions and WriterProperties per file.
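One possible shape for the Arc<dyn Any> alternative, as a sketch only: the options struct carries an opaque extension value, and the code that knows the concrete type downcasts it. `OptionsWithExtension`, `EncryptionConfig` and their fields are hypothetical names invented for this example.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical encryption configuration that DataFusion would not need to understand.
#[derive(Debug, PartialEq)]
struct EncryptionConfig {
    kms_url: String,
    footer_key_id: String,
}

/// Stand-in for an options struct with an opaque extension slot.
#[derive(Clone, Default)]
struct OptionsWithExtension {
    extension: Option<Arc<dyn Any + Send + Sync>>,
}

impl OptionsWithExtension {
    /// Store any thread-safe value as the extension.
    fn with_extension<T: Any + Send + Sync>(value: T) -> Self {
        let ext: Arc<dyn Any + Send + Sync> = Arc::new(value);
        Self { extension: Some(ext) }
    }

    /// Try to view the extension as a concrete type; None if absent or a different type.
    fn extension_as<T: Any>(&self) -> Option<&T> {
        self.extension
            .as_ref()
            .and_then(|ext| ext.downcast_ref::<T>())
    }
}
```

The stored value could itself be a callback or a factory object, so this approach doesn't by itself rule out generating reader and writer options per file.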
Additional context
I have an example of what using a KMS might look like to read and write encrypted files, but this isn't yet merged in arrow-rs: https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs