Fix spurious warnings and bogus index when reflecting Iceberg tables #520

Open

wants to merge 1 commit into master from fix-iceberg-indexes-return-partitions
Conversation

metadaddy
Member

@metadaddy metadaddy commented Jan 15, 2025

Description

Previously, during reflection, TrinoDialect.get_indexes() called _get_partitions(), which executed

SELECT * FROM {schema}."{table_name}$partitions"

and assumed that, if this query executed successfully, the table in hand was a Hive table. The issue (#518) was that this same query also succeeds for Iceberg tables, resulting in a spurious index (containing no columns) being added to the table, and a series of warnings from SQLAlchemy about index keys not being located in the column names for the table.

This PR adds a check to TrinoDialect.get_indexes() to ensure that the catalog in hand is a Hive catalog before calling _get_partitions().
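As pure logic, the guard described above might look like the following sketch (function name and the returned index shape are illustrative, not the actual PR code):

```python
def get_indexes_guarded(partition_names, is_hive_catalog):
    """Hypothetical sketch of the guard: only turn $partitions output into a
    reflected index when the catalog's connector is Hive, avoiding the bogus,
    column-less index on Iceberg tables."""
    if not is_hive_catalog or not partition_names:
        return []
    # Shape mirrors what a SQLAlchemy dialect's get_indexes() returns.
    return [{"name": "partition", "column_names": partition_names, "unique": False}]
```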

I looked at creating a test method, but I couldn't see how to do so. Instead, I bench-tested the fix against Hive and Iceberg catalogs with the test app from #518. The output for an Iceberg table is:

Calling reflect()

Listing tables:
Table name: drivestats

For a Hive table it is:

Calling reflect()

Listing tables:
Table name: drivestats
	Index name: partition
		Column name: drivestats.year
		Column name: drivestats.month

Non-technical explanation

This PR fixes spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

* Fix spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables. ({issue}`518`)

Fixes #518

@cla-bot cla-bot bot added the cla-signed label Jan 15, 2025
@metadaddy
Member Author

I'll see if I can figure out a way to make it work on 351.

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from c396080 to ac86862 Compare January 16, 2025 02:06
@metadaddy
Member Author

metadaddy commented Jan 16, 2025

The 351 check passes now. This is the query that the code uses to check whether the catalog's connector is Hive. If it returns 1, then it's Hive.

SELECT
    COUNT(*)
FROM "system"."metadata"."table_properties"
WHERE "catalog_name" = :catalog_name
  AND "property_name" = 'bucketing_version'

If there's a better way to do this that passes the 351 test, I'd be happy to change the code!
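In DB-API terms, that check could be sketched like this (assuming a DB-API cursor on a Trino connection; the parameter placeholder style depends on the driver, and the function name is hypothetical):

```python
from textwrap import dedent


def is_hive_catalog(cursor, catalog_name):
    """Sketch of the check described above: Hive is the connector that
    defines the 'bucketing_version' table property, so a count of 1
    identifies a Hive catalog."""
    query = dedent(
        """
        SELECT COUNT(*)
        FROM "system"."metadata"."table_properties"
        WHERE "catalog_name" = ?
          AND "property_name" = 'bucketing_version'
        """
    ).strip()
    cursor.execute(query, (catalog_name,))
    return cursor.fetchone()[0] == 1
```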

@hashhar
Member

hashhar commented Jan 16, 2025

For reference, the earlier alternative was reading from "system"."metadata"."catalogs". I think for this case it's fine to let the warnings happen on versions which don't have system.metadata.catalogs. I don't like relying on table properties, since nothing guarantees the property won't get added to Iceberg, for example.

For the test, you can then make it "pass" with mark.skipif (or whatever it's called).

cc: @damian3031 @ebyhr @dungdm93 what do you think?

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch 3 times, most recently from 264ea95 to 35343f1 Compare January 24, 2025 01:59
@metadaddy
Member Author

@hashhar I realized I could query "system"."information_schema"."columns" to see if the connector_name column is there. So, the warning is shown on old versions and all tests pass without any additional @pytest.mark.skipif annotations.
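The feature probe could be sketched like so (again assuming a DB-API cursor; the function name and exact filter columns are illustrative):

```python
def has_connector_name(cursor):
    """Sketch of the probe described above: ask the information schema
    whether this Trino version exposes a connector_name column on
    system.metadata.catalogs, instead of checking the server version."""
    cursor.execute(
        "SELECT COUNT(*) "
        'FROM "system"."information_schema"."columns" '
        "WHERE \"table_name\" = 'catalogs' "
        "AND \"column_name\" = 'connector_name'"
    )
    return cursor.fetchone()[0] > 0
```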

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from 35343f1 to 4d1f272 Compare January 24, 2025 02:05
@@ -229,6 +229,33 @@ def _get_partitions(
partition_names = [desc[0] for desc in res.cursor.description]
return partition_names

def _has_connector_name(self, connection: Connection):
Member

we now need to issue 3 queries instead of 1 in the old version.

I'd recommend not attempting any "detection" and instead just querying system.metadata.catalogs, ignoring the failure if it doesn't exist.
When it exists we identify if the catalog is Hive or something else.
When it doesn't exist we can determine Hive or something else by looking at the "format" of the output from the actual query to $partitions.

Also, I wonder if we can simply issue the query to $partitions and use the output shape to determine what connector it is. After all, we don't care what connector it is; we care that we return whatever the partition columns are.

Member Author

@hashhar said

I wonder if we can simply issue query to $partitions and use the output shape to determine what connector it is

Something like this?

        query = dedent(
            f"""
            SELECT * FROM {schema}."{table_name}$partitions"
            """
        ).strip()
        res = connection.execute(sql.text(query))
        partition_names = [desc[0] for desc in res.cursor.description]
        data_types = [desc[1] for desc in res.cursor.description]
        if (partition_names == ['partition', 'record_count', 'file_count', 'total_size', 'data']
                and data_types[0].startswith('row(')
                and data_types[1] == 'bigint'
                and data_types[2] == 'bigint'
                and data_types[3] == 'bigint'
                and data_types[4].startswith('row(')):
            # This is an Iceberg $partitions table - these are not partition names
            return None
        return partition_names

It looks like this is safe: the schema of the Iceberg $partitions table is documented, and a Hive partition column can never have the ROW data type.
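Factored out as a pure function, the shape check can be unit-tested without a server (a sketch; the expected names and types follow the documented Iceberg $partitions schema):

```python
def looks_like_iceberg_partitions(partition_names, data_types):
    """True when the $partitions result shape matches the documented Iceberg
    metadata table rather than a list of Hive partition columns."""
    return (
        partition_names == ["partition", "record_count", "file_count",
                            "total_size", "data"]
        and len(data_types) == 5
        and data_types[0].startswith("row(")
        and all(t == "bigint" for t in data_types[1:4])
        and data_types[4].startswith("row(")
    )
```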

Member Author

I just pushed the above.

@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch 2 times, most recently from 37fa2f1 to 32ac7dc Compare February 20, 2025 02:46
@metadaddy metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from 32ac7dc to a373f1d Compare February 20, 2025 22:05
@metadaddy
Member Author

Python 3.10 should be happy now... We'll see!

@metadaddy
Member Author

All tests pass, @hashhar - ready for your renewed attention 🙂

@hashhar
Member

hashhar commented Feb 24, 2025

Thanks @metadaddy, I'll test locally once today and then merge. Sorry for the delays and the long back-and-forth.

@metadaddy
Member Author

Hi @hashhar - no problem at all; as much of the delay was on my side as yours, and I think we arrived at the correct solution!

@metadaddy
Member Author

Hi @hashhar - just bumping this so it doesn't get forgotten 😉
