Support integration with Parquet modular encryption #15216

Open
adamreeve opened this issue Mar 14, 2025 · 0 comments
Labels
enhancement New feature or request


Is your feature request related to a problem or challenge?

arrow-rs is in the process of gaining support for Parquet modular encryption (see apache/arrow-rs#7278). It would be useful to be able to read and write encrypted Parquet files with DataFusion, but it's not clear how to integrate this feature because of the complex configuration it requires.

Examples of this complex configuration are:

  • Users may require different encryption or decryption keys to be specified per Parquet file
  • The encryption and decryption keys specified may depend on the file schema
  • The encryption keys may need to be generated per file by interacting with a user's key management service (KMS)
  • Decryption keys may need to be retrieved dynamically based on the metadata read from Parquet files, which may require interaction with a KMS. This process would be opaque to DataFusion, but it requires the FileDecryptionProperties in arrow-rs to be created with a callback, which can't be represented as a string configuration option (see "Allow retrieving Parquet decryption keys based on the key metadata", arrow-rs#7257).
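
To make the last point concrete, here is a minimal sketch of a key-retrieval callback, using only the standard library. The `KeyRetriever` trait and `InMemoryKms` type below are hypothetical stand-ins written for illustration; they are not the arrow-rs API, and a real implementation would call out to the user's KMS rather than a map. The sketch assumes the key metadata stored in the file is simply a UTF-8 key identifier.

```rust
use std::collections::HashMap;

/// Hypothetical callback interface: given the key metadata read from a
/// Parquet file, return the raw decryption key bytes.
pub trait KeyRetriever: Send + Sync {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String>;
}

/// Toy stand-in for a KMS client (assumption: a real one would make a
/// network call and probably unwrap an encrypted data key).
#[derive(Default)]
pub struct InMemoryKms {
    keys: HashMap<String, Vec<u8>>,
}

impl InMemoryKms {
    /// Register a key under an identifier.
    pub fn insert_key(&mut self, key_id: &str, key: &[u8]) {
        self.keys.insert(key_id.to_string(), key.to_vec());
    }
}

impl KeyRetriever for InMemoryKms {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String> {
        // Assumption: the metadata is a UTF-8 key identifier.
        let key_id = std::str::from_utf8(key_metadata)
            .map_err(|e| format!("key metadata is not UTF-8: {e}"))?;
        self.keys
            .get(key_id)
            .cloned()
            .ok_or_else(|| format!("unknown key id: {key_id}"))
    }
}
```

The lookup depends on state held by the retriever object, which is why this kind of configuration can't be flattened into a string option.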

I have an example of what using a KMS to read and write encrypted files might look like, but it isn't merged in arrow-rs yet: https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs

Currently, all Parquet format options can be easily encoded as strings or primitive types and live in datafusion-common. That crate has an optional dependency on the parquet crate, although TableParquetOptions is always defined, even when the parquet feature is disabled.

We're experimenting with using encryption in DataFusion by adding encoded keys to the ParquetOptions struct, but this is quite limited and doesn't support the more complex configuration options I mention above.

Describe the solution you'd like

One solution might be to allow users to arbitrarily customize the Parquet writing and reading options, e.g. with something like:

--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -1615,6 +1615,12 @@ pub struct TableParquetOptions {
     /// )
     /// ```
     pub key_value_metadata: HashMap<String, Option<String>>,
+    /// Callback to modify the Parquet WriterPropertiesBuilder with custom configuration
+    #[cfg(feature = "parquet")]
+    pub writer_configuration: Option<Arc<dyn Fn(WriterPropertiesBuilder) -> WriterPropertiesBuilder>>,
+    /// Callback to modify the Parquet ArrowReaderOptions with custom configuration
+    #[cfg(feature = "parquet")]
+    pub read_configuration: Option<Arc<dyn Fn(ArrowReaderOptions) -> ArrowReaderOptions>>,
 }
 
 impl TableParquetOptions {

These callbacks would probably need some other inputs too, like the file schema. This approach would let DataFusion users specify encryption-specific options without DataFusion itself needing to know about them, and might be useful for applying other Parquet options that aren't already exposed in DataFusion. It also supports generating different encryption properties per file.
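
As a rough sketch of how a callback with extra per-file inputs might look, the following uses simplified stand-in types rather than the real parquet or DataFusion ones. `WriterPropsBuilder`, `FileContext`, `TableOptions`, and `placeholder_key` are all hypothetical names invented for this example, and the "key" is a placeholder string rather than real key material.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for parquet's WriterPropertiesBuilder (hypothetical).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct WriterPropsBuilder {
    pub footer_key: Option<Vec<u8>>,
    pub column_keys: HashMap<String, Vec<u8>>,
}

/// Hypothetical per-file inputs handed to the callback.
pub struct FileContext {
    pub path: String,
    pub column_names: Vec<String>,
}

/// Callback type: takes the file context and a builder, returns the builder.
pub type WriterConfigFn =
    Arc<dyn Fn(&FileContext, WriterPropsBuilder) -> WriterPropsBuilder + Send + Sync>;

/// Stand-in for TableParquetOptions, reduced to the one new field.
#[derive(Clone, Default)]
pub struct TableOptions {
    pub writer_configuration: Option<WriterConfigFn>,
}

impl TableOptions {
    /// Build writer properties for one file, applying the callback if set.
    pub fn writer_props_for(&self, ctx: &FileContext) -> WriterPropsBuilder {
        let base = WriterPropsBuilder::default();
        match &self.writer_configuration {
            Some(configure) => configure(ctx, base),
            None => base,
        }
    }
}

/// Placeholder for a KMS call; NOT real key generation.
fn placeholder_key(path: &str, name: &str) -> Vec<u8> {
    format!("{path}:{name}").into_bytes()
}

/// Options whose callback sets a footer key and one key per column,
/// so every file (and every schema) gets its own set of keys.
pub fn example_encrypting_options() -> TableOptions {
    let configure: WriterConfigFn =
        Arc::new(|ctx: &FileContext, mut builder: WriterPropsBuilder| {
            builder.footer_key = Some(placeholder_key(&ctx.path, "footer"));
            for column in &ctx.column_names {
                builder
                    .column_keys
                    .insert(column.clone(), placeholder_key(&ctx.path, column));
            }
            builder
        });
    TableOptions { writer_configuration: Some(configure) }
}
```

The `Send + Sync` bounds are an assumption on my part; they would likely be needed for the options to be shared across threads.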

TableParquetOptions can currently be created from environment variables, which wouldn't be possible for these extra fields, but I don't think that should be a problem?

Another minor issue is that TableParquetOptions implements PartialEq, and I don't think it would be possible to sanely implement equality while allowing custom callbacks like this.
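
One possible compromise, sketched below with hypothetical stand-in types, would be to compare callbacks by pointer identity. This is only an illustration of the trade-off, not a recommendation: two options holding clones of the same `Arc` compare equal, but two separately created closures with identical behaviour do not, which may or may not count as "sane" equality.

```rust
use std::sync::Arc;

/// Hypothetical callback type standing in for the proposed fields.
pub type ConfigFn = Arc<dyn Fn(u32) -> u32 + Send + Sync>;

/// Stand-in for TableParquetOptions with one plain field and one callback.
#[derive(Clone, Default)]
pub struct Options {
    pub compression_level: u32,
    pub configure: Option<ConfigFn>,
}

impl PartialEq for Options {
    fn eq(&self, other: &Self) -> bool {
        // Callbacks are equal only if they are the same allocation.
        let same_callback = match (&self.configure, &other.configure) {
            (None, None) => true,
            (Some(a), Some(b)) => Arc::ptr_eq(a, b),
            _ => false,
        };
        self.compression_level == other.compression_level && same_callback
    }
}
```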

Describe alternatives you've considered

@alamb also suggested in delta-io/delta-rs#3300 that it could be possible to use an Arc<dyn Any> to allow passing more complex configuration types through TableParquetOptions.

I'm not sure exactly what this would look like though. Maybe the option would still hold a callback function, just hidden behind the Any trait. Or maybe we would want to limit this to encryption-specific configuration options. Either way, I think we'd need to maintain the ability to generate ArrowReaderOptions and WriterProperties per file.
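
For what it's worth, here is a minimal standard-library sketch of the `Arc<dyn Any>` idea. `TableOptions`, its `extension` field, and `EncryptionConfig` are hypothetical names made up for this example. The options struct stores an opaque value, and code that knows the concrete type downcasts it back.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical encryption-specific configuration owned by the user.
#[derive(Debug, PartialEq)]
pub struct EncryptionConfig {
    pub kms_url: String,
    pub footer_key_id: String,
}

/// Stand-in for TableParquetOptions with an opaque extension slot.
#[derive(Clone, Default)]
pub struct TableOptions {
    pub extension: Option<Arc<dyn Any + Send + Sync>>,
}

impl TableOptions {
    /// Store any value; the options struct doesn't need to know its type.
    pub fn with_extension<T: Any + Send + Sync>(value: T) -> Self {
        let extension: Arc<dyn Any + Send + Sync> = Arc::new(value);
        Self { extension: Some(extension) }
    }

    /// Recover the value if it has the expected concrete type.
    pub fn extension_as<T: Any + Send + Sync>(&self) -> Option<Arc<T>> {
        self.extension.clone()?.downcast::<T>().ok()
    }
}
```

The stored type could itself be a struct holding callbacks, so this doesn't by itself rule out generating properties per file.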

Additional context

No response
