Is your feature request related to a problem or challenge?
We are using DataFusion as described in #13456 – i.e. querying Parquet files directly from object storage (S3). I've recently been exploring the possibility of introducing cached parquet metadata into query planning and execution (related: #12592).
After a bit of a journey, I was able to get a pre-deserialized `ParquetMetaData` plumbed into a custom `TableProvider`, which would then refer to the pre-loaded metadata throughout planning and execution (for execution, a custom `ParquetFileReaderFactory` and `AsyncFileReader` are also required).
For queries with a filter (e.g. `column = value`), this had the expected effect of removing the metadata calls from our traces – success!
However, I was surprised to see that traces for queries without a filter (only a limit) showed dramatically different object store calls. Specifically:
- Using the default setup, one query would make 6 object store calls (5 for metadata and 1 for data), and would scan ~137 MB of data.
- Using the caching setup, the same query would make 29 object store calls (all for data), and would scan ~4.2 MB of data.
I had thought my changes were fairly surgical and would not have introduced this kind of difference in query execution. After picking through the implementation again, I think this is caused by logic in `ParquetOpener` (datafusion/datafusion/datasource-parquet/src/opener.rs, lines 119 to 122 in efb75f3), which calls `should_enable_page_index` (defined in datafusion/datafusion/datasource-parquet/src/mod.rs, lines 537 to 547 in efb75f3).
The resulting `enable_page_index` seems ultimately to be what controls whether the page index is read and used inside `ParquetOpener`.
In my caching implementation, the whole `ParquetMetaData` is always available, and so presumably some of the range-pruning logic inside `ParquetOpener` takes effect where otherwise that metadata would not be loaded.
For the example query above, this leads to a significant (~30%) reduction in latency.
Describe the solution you'd like
The latency reduction we have observed comes almost entirely from using the page index even though there's no predicate.
I.e., although caching does reduce the number of object store calls for otherwise like-for-like execution plans (e.g. with a predicate), that reduction has only a small impact on latency.
For our use-case, we would get most of the benefit if there was some way to "force" the use of the page index, even without a pruning predicate (and/or for this decision to be made heuristically based on the likely savings).
And/or, the documentation around `enable_page_index` could clarify that actual use of the metadata will be heuristic.
Describe alternatives you've considered
Even without caching, it still takes quite a lot of plumbing to get the page index to be used. The smallest implementation I could come up with uses:
- A `FileFormat`, whose `create_physical_plan` injects a custom...
- `ParquetFileReaderFactory`, whose `create_reader`:
  - builds a `ParquetObjectReader` and sets `with_preload_column_index`/`with_preload_offset_index` to `true`;
  - returns a custom...
- `AsyncFileReader` (not necessary for functionality, but the `ParquetFileReader` used by `DefaultParquetFileReaderFactory` is private).

This is ~200 lines of boilerplate, consisting mostly of trait forwarding and copied implementations, in order to add `.with_preload_column_index(true).with_preload_offset_index(true)` at the right point.
I've also not done a thorough investigation into the performance – it's possible that not using the page index is the best choice in the general case.
Additional context
No response