[iceberg] Add UUID type support #23627

Draft: wants to merge 1 commit into base: master

ZacBlanco
Contributor

@ZacBlanco ZacBlanco commented Sep 11, 2024

Description

This PR adds support for reading and writing UUIDs in Iceberg and in Iceberg+Hive. To support this, the Parquet reader and writer also needed improvements to handle Parquet's UUID logical type.
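
For context, the Parquet format defines the UUID logical type as an annotation on a 16-byte `FIXED_LEN_BYTE_ARRAY`, with the value stored in big-endian order. The following is a minimal sketch of that byte layout only, not the code in this PR; `UuidBytesSketch`, `toBytes`, and `fromBytes` are hypothetical names used for illustration.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration; not a class from this PR.
final class UuidBytesSketch
{
    private UuidBytesSketch() {}

    // Encode a UUID as 16 bytes, most significant bits first.
    // ByteBuffer writes in big-endian order by default.
    static byte[] toBytes(UUID uuid)
    {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Decode 16 big-endian bytes back into a UUID.
    static UUID fromBytes(byte[] bytes)
    {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args)
    {
        UUID original = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff");
        byte[] encoded = toBytes(original);
        // Round trip: the decoded value should equal the original.
        System.out.println(fromBytes(encoded).equals(original));
    }
}
```

Because the encoding is fixed-width, a reader can decode each value as two 8-byte reads without consulting any per-value length.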

Motivation and Context

Presto has type support for UUIDs. We should support reading and writing them in some of the connectors.

Impact

  • Iceberg tables can now be created with UUID types.

Test Plan

  • Basic tests in the Iceberg module for round-trip UUID reading and writing.
  • Additional tests in the Parquet module for reading and writing UUID values.

I also added a benchmark for reading and writing UUID types and compared it to our current LongDecimal benchmark, to see the performance difference against another type that uses FIXED_LEN_BYTE_ARRAY as the underlying physical type.

These are the microbenchmark results from my local ARM machine, using a Corretto JDK 11 build with reader verification disabled:


Benchmark                       (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                     true              PLAIN  thrpt   10  158.193 ± 1.058  ops/s
BenchmarkUuidColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   10   32.211 ± 0.345  ops/s
BenchmarkUuidColumnReader.read                    false              PLAIN  thrpt   10   13.030 ± 0.463  ops/s
BenchmarkUuidColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   10   10.121 ± 0.377  ops/s


Benchmark                              (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt   Score   Error  Units
BenchmarkLongDecimalColumnReader.read                     true              PLAIN  thrpt   20  10.926 ± 0.178  ops/s
BenchmarkLongDecimalColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   20   8.948 ± 0.180  ops/s
BenchmarkLongDecimalColumnReader.read                    false              PLAIN  thrpt   20   8.632 ± 0.129  ops/s
BenchmarkLongDecimalColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   20   7.771 ± 0.066  ops/s
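
To make the comparison concrete, the ratio of optimized to non-optimized throughput can be computed from the scores above. This is only arithmetic on the reported numbers; `SpeedupSketch` is a hypothetical name. By this calculation the optimized reader is roughly 12x faster for PLAIN UUIDs and roughly 3x faster for DELTA_BYTE_ARRAY, against roughly 1.3x and 1.15x for LongDecimal.

```java
import java.util.Locale;

// Hypothetical helper for illustration; it only divides the scores reported above.
final class SpeedupSketch
{
    private SpeedupSketch() {}

    // Ratio of optimized-reader throughput to non-optimized throughput (both in ops/s).
    static double speedup(double optimizedOpsPerSec, double baselineOpsPerSec)
    {
        return optimizedOpsPerSec / baselineOpsPerSec;
    }

    private static void report(String label, double optimized, double baseline)
    {
        System.out.println(String.format(Locale.ROOT, "%-32s %.2fx", label, speedup(optimized, baseline)));
    }

    public static void main(String[] args)
    {
        // Scores copied from the JMH tables above.
        report("UUID PLAIN", 158.193, 13.030);
        report("UUID DELTA_BYTE_ARRAY", 32.211, 10.121);
        report("LongDecimal PLAIN", 10.926, 8.632);
        report("LongDecimal DELTA_BYTE_ARRAY", 8.948, 7.771);
    }
}
```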

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add UUID type support to the Parquet reader and writer. :pr:`23627`

Iceberg Connector Changes
* Add support for UUID-typed columns. :pr:`23627`

@ZacBlanco force-pushed the upstream-iceberg-uuid branch 3 times, most recently from 43be063 to 49f7ab3 on September 13, 2024 16:55

The Iceberg spec lists uuid as a valid schema type. Presto supports
UUID types, but there was no support for reading or writing them
in the connector.

This commit makes the necessary changes in the connector to create
tables with UUID columns and adds support for UUIDs in the Parquet reader.
This includes an implementation for UUIDs in the batch reader.

@steveburnett
Contributor

Nit suggestion for the release note entry to follow the Order of changes in the Release Notes Guidelines:

== RELEASE NOTES ==

General Changes
* Add UUID type support to the Parquet reader and writer. :pr:`23627`

Iceberg Connector Changes
* Add support for UUID-typed columns. :pr:`23627`

@ZacBlanco
Contributor Author

Fixed, thanks Steve!
