Support metadata columns (location, size, last_modified) in ListingTableProvider
#15173
Labels
enhancement
New feature or request
Is your feature request related to a problem or challenge?
The ListingTableProvider in DataFusion provides an implementation of a TableProvider that organizes a collection of (potentially hive partitioned) files in an object store into a single table.

Similar to how hive partitions are injected into the listing table schema even though they don't actually exist in the physical parquet files, I'd like to be able to request that the ListingTable inject metadata columns that get their data from the ObjectMeta provided by the object store crate. Then I can query for and filter on those columns (location, size and last_modified).

I'd also like queries that filter on the metadata columns to be able to prune out files, similar to partition pruning. I.e. if I do SELECT * FROM my_listing_table WHERE last_modified > '2025-03-10' then only files that were modified after '2025-03-10' should be passed to the FileScanConfig to be read.

My scenario is that I'd like to efficiently ingest files from an object store bucket that I haven't seen before, and filtering on last_modified seems like a good solution.

This could potentially fold into the work ongoing in #13975 / #14057 / #14362 to mark these columns as proper system/metadata columns, but it fundamentally isn't blocked on that work. Since this would be an opt-in from the consumer, automatically filtering these columns out of a SELECT * doesn't seem required.

Describe the solution you'd like
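To make the intended behavior concrete, here is a minimal, standalone sketch of the two ideas in this request: metadata columns appended to the schema after partition columns in the order requested, and file-level pruning on last_modified before any file is scanned. It deliberately does not use DataFusion or object_store types. FileMeta is a simplified stand-in for ObjectMeta, the MetadataColumn variants and the helper functions table_columns and prune_by_last_modified are hypothetical names chosen for illustration, and none of this is the upstream implementation.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for the `MetadataColumn` enum described in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MetadataColumn {
    Location,
    Size,
    LastModified,
}

impl MetadataColumn {
    /// Column name as it would appear in the table schema (assumed naming).
    fn name(self) -> &'static str {
        match self {
            MetadataColumn::Location => "location",
            MetadataColumn::Size => "size",
            MetadataColumn::LastModified => "last_modified",
        }
    }
}

/// Simplified stand-in for the object store's `ObjectMeta` (assumption:
/// only the three fields relevant to this request, using std types).
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    location: String,
    size: u64,
    last_modified: SystemTime,
}

/// Column order: file columns, then partition columns, then metadata
/// columns in the order the caller requested them.
fn table_columns(
    file_cols: &[&str],
    partition_cols: &[&str],
    metadata_cols: &[MetadataColumn],
) -> Vec<String> {
    file_cols
        .iter()
        .chain(partition_cols.iter())
        .map(|c| (*c).to_string())
        .chain(metadata_cols.iter().map(|m| m.name().to_string()))
        .collect()
}

/// Keep only files modified strictly after `cutoff`, mirroring a
/// `WHERE last_modified > cutoff` filter applied before scanning.
fn prune_by_last_modified(files: &[FileMeta], cutoff: SystemTime) -> Vec<&FileMeta> {
    files.iter().filter(|f| f.last_modified > cutoff).collect()
}

fn main() {
    let day = Duration::from_secs(86_400);
    let files = vec![
        FileMeta {
            location: "bucket/a.parquet".to_string(),
            size: 1024,
            last_modified: UNIX_EPOCH + day * 10,
        },
        FileMeta {
            location: "bucket/b.parquet".to_string(),
            size: 2048,
            last_modified: UNIX_EPOCH + day * 20,
        },
    ];

    // Only files newer than the cutoff would be handed to the scan.
    let cutoff = UNIX_EPOCH + day * 15;
    for f in prune_by_last_modified(&files, cutoff) {
        println!("scan {} ({} bytes)", f.location, f.size);
    }

    let cols = table_columns(
        &["id"],
        &["year"],
        &[MetadataColumn::LastModified, MetadataColumn::Location],
    );
    println!("{:?}", cols);
}
```

In a real implementation the filter would presumably be evaluated against the listing results in the same place partition pruning happens, so that pruned files never reach the FileScanConfig.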
A new API on the ListingOptions struct that is passed to a ListingTableConfig, which in turn is passed to ListingTable::try_new.

The definition for MetadataColumn is a simple enum.

The order of the MetadataColumn values passed into with_metadata_cols denotes the order in which they will appear in the table schema. Metadata columns will be added after partition columns.

Describe alternatives you've considered
I considered what it might look like to make ListingTableProvider more extensible to be able to implement these changes without a core DataFusion change. I wasn't able to come up with anything simpler than the above though.

Another option might be to make a lot of the internals of ListingTableProvider public so that it is easier for people to maintain their own customized versions of ListingTableProvider.
Additional context
I've already implemented this in my project, and I will be upstreaming my change and linking it to this issue. To view what this looks like already implemented, see: spiceai#74
And to see the changes needed to integrate with it from a consuming project, see: spiceai/spiceai#4970 (It is quite contained, which I'm happy with)
This change will have no visible effect on consumers - they need to explicitly opt-in to see the metadata columns.