Fix spurious warnings and bogus index when reflecting Iceberg tables #520

Open

wants to merge 1 commit into master from fix-iceberg-indexes-return-partitions
Conversation

metadaddy
Member

@metadaddy metadaddy commented Jan 15, 2025

Description

Previously, during reflection, TrinoDialect.get_indexes() called _get_partitions(), which executed

SELECT * FROM {schema}."{table_name}$partitions"

and assumed that, if this query executed successfully, the table in hand was a Hive table. The issue (#518) was that this same query also succeeds for Iceberg tables, resulting in a spurious index (containing no columns) being added to the table, and a series of warnings from SQLAlchemy about index keys not being located in the column names for the table.

This PR adds a check to TrinoDialect.get_indexes() to ensure that the catalog in hand is a Hive catalog before calling _get_partitions().
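As pure logic, the guard described above might look like the following sketch (function name and the returned index shape are illustrative, not the actual PR code):

```python
def get_indexes_guarded(partition_names, is_hive_catalog):
    """Hypothetical sketch of the guard: only turn $partitions output into a
    reflected index when the catalog's connector is Hive, avoiding the bogus,
    column-less index on Iceberg tables."""
    if not is_hive_catalog or not partition_names:
        return []
    # Shape mirrors what a SQLAlchemy dialect's get_indexes() returns.
    return [{"name": "partition", "column_names": partition_names, "unique": False}]
```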

I looked at creating a test method, but I couldn't see how to do so. Instead, I bench-tested the fix against Hive and Iceberg catalogs with the test app from #518. The output for an Iceberg table is:

Calling reflect()

Listing tables:
Table name: drivestats

For a Hive table it is:

Calling reflect()

Listing tables:
Table name: drivestats
	Index name: partition
		Column name: drivestats.year
		Column name: drivestats.month

Non-technical explanation

This PR fixes spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

* Fix spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables. ({issue}`518`)

Fixes #518

@cla-bot cla-bot bot added the cla-signed label Jan 15, 2025
@metadaddy
Member Author

I'll see if I can figure out a way to make it work on 351.

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from c396080 to ac86862 Compare January 16, 2025 02:06
@metadaddy
Member Author

metadaddy commented Jan 16, 2025

The 351 check passes now. This is the query that the code uses to check whether the catalog's connector is Hive. If it returns 1, then it's Hive.

SELECT
    COUNT(*)
FROM "system"."metadata"."table_properties"
WHERE "catalog_name" = :catalog_name
  AND "property_name" = 'bucketing_version'

If there's a better way to do this that passes the 351 test, I'd be happy to change the code!
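In DB-API terms, that check could be sketched like this (assuming a DB-API cursor on a Trino connection; the parameter placeholder style depends on the driver, and the function name is hypothetical):

```python
from textwrap import dedent


def is_hive_catalog(cursor, catalog_name):
    """Sketch of the check described above: Hive is the connector that
    defines the 'bucketing_version' table property, so a count of 1
    identifies a Hive catalog."""
    query = dedent(
        """
        SELECT COUNT(*)
        FROM "system"."metadata"."table_properties"
        WHERE "catalog_name" = ?
          AND "property_name" = 'bucketing_version'
        """
    ).strip()
    cursor.execute(query, (catalog_name,))
    return cursor.fetchone()[0] == 1
```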

@hashhar
Member

hashhar commented Jan 16, 2025

For reference, the earlier alternative was reading from "system"."metadata"."catalogs". I think for this case it's fine to let the warnings happen on versions which don't have system.metadata.catalogs. I don't like relying on table properties, since nothing guarantees the property won't get added to Iceberg, for example.

For the test, you can then make it "pass" with mark.skipif (or whatever it's called).

cc: @damian3031 @ebyhr @dungdm93 what do you think?

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch 3 times, most recently from 264ea95 to 35343f1 Compare January 24, 2025 01:59
@metadaddy
Member Author

@hashhar I realized I could query "system"."information_schema"."columns" to see if the connector_name column is there. So, the warning is shown on old versions and all tests pass without any additional @pytest.mark.skipif annotations.
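The feature probe could be sketched like so (again assuming a DB-API cursor; the function name and exact filter columns are illustrative):

```python
def has_connector_name(cursor):
    """Sketch of the probe described above: ask the information schema
    whether this Trino version exposes a connector_name column on
    system.metadata.catalogs, instead of checking the server version."""
    cursor.execute(
        "SELECT COUNT(*) "
        'FROM "system"."information_schema"."columns" '
        "WHERE \"table_name\" = 'catalogs' "
        "AND \"column_name\" = 'connector_name'"
    )
    return cursor.fetchone()[0] > 0
```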

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from 35343f1 to 4d1f272 Compare January 24, 2025 02:05
@@ -229,6 +229,33 @@ def _get_partitions(
partition_names = [desc[0] for desc in res.cursor.description]
return partition_names

def _has_connector_name(self, connection: Connection):
Member

we now need to issue 3 queries instead of 1 in the old version.

I'd recommend not attempting any "detection" and instead just querying system.metadata.catalogs, ignoring the failure if it doesn't exist.
When it exists we identify if the catalog is Hive or something else.
When it doesn't exist we can determine Hive or something else by looking at the "format" of the output from the actual query to $partitions.

Also, I wonder if we can simply issue the query to $partitions and use the output shape to determine what connector it is. After all, we don't care what connector it is; we care that we return whatever the partition columns are.

Member Author

@hashhar said

I wonder if we can simply issue query to $partitions and use the output shape to determine what connector it is

Something like this?

        query = dedent(
            f"""
            SELECT * FROM {schema}."{table_name}$partitions"
            """
        ).strip()
        res = connection.execute(sql.text(query))
        partition_names = [desc[0] for desc in res.cursor.description]
        data_types = [desc[1] for desc in res.cursor.description]
        if (partition_names == ['partition', 'record_count', 'file_count', 'total_size', 'data']
                and data_types[0].startswith('row(')
                and data_types[1] == 'bigint'
                and data_types[2] == 'bigint'
                and data_types[3] == 'bigint'
                and data_types[4].startswith('row(')):
            # This is an Iceberg $partitions table - these are not partition names
            return None
        return partition_names

It looks like this is safe: the schema of the Iceberg $partitions table is documented, and a Hive partition column can never have the ROW data type.
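Factored out as a pure function, the shape check can be unit-tested without a server (a sketch; the expected names and types follow the documented Iceberg $partitions schema):

```python
def looks_like_iceberg_partitions(partition_names, data_types):
    """True when the $partitions result shape matches the documented Iceberg
    metadata table rather than a list of Hive partition columns."""
    return (
        partition_names == ["partition", "record_count", "file_count",
                            "total_size", "data"]
        and len(data_types) == 5
        and data_types[0].startswith("row(")
        and all(t == "bigint" for t in data_types[1:4])
        and data_types[4].startswith("row(")
    )
```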

Member Author

I just pushed the above.

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch 2 times, most recently from 37fa2f1 to 32ac7dc Compare February 20, 2025 02:46
@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from 32ac7dc to a373f1d Compare February 20, 2025 22:05
@metadaddy
Member Author

Python 3.10 should be happy now... We'll see!

@metadaddy
Member Author

All tests pass, @hashhar - ready for your renewed attention 🙂

@hashhar
Member

hashhar commented Feb 24, 2025

Thanks @metadaddy, I'll test locally once today and then merge. Sorry for the delays and the long back-and-forth.

@metadaddy
Member Author

Hi @hashhar - no problem at all; as much of the delay was on my side as yours, and I think we arrived at the correct solution!

@metadaddy
Member Author

Hi @hashhar - just bumping this so it doesn't get forgotten 😉
