Python defined schema does not match created schema #1728

Open
2 of 3 tasks
grbinho opened this issue Feb 26, 2025 · 2 comments

grbinho commented Feb 26, 2025

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

Hi,

I'm using pyiceberg with Glue and S3. While creating a table that contains list types, I noticed that the element_id values in the created table do not match the IDs defined in the schema itself.
This specifically happens when using ListType with a primitive element type.
The element ID I give to the list type is not preserved.
Instead, it is assigned an ID that follows the last ID of the root fields.

My expectation is that IDs would be preserved.
Inserting data fails due to a schema mismatch.

Here is the example

Python object

NestedField(field_id=1, name='__export_id', field_type=StringType(), required=True), 
NestedField(field_id=2, name='__export_timestamp', field_type=TimestampType(), required=True), 
NestedField(field_id=3, name='id', field_type=IntegerType(), required=True), 
NestedField(field_id=4, name='first_name', field_type=StringType(), required=True), 
NestedField(field_id=5, name='last_name', field_type=StringType(), required=True), 
NestedField(field_id=6, name='email', field_type=StringType(), required=False), 
NestedField(field_id=7, name='telephone', field_type=StringType(), required=False), 
NestedField(field_id=8, name='timezone', field_type=StringType(), required=True), 
NestedField(field_id=9, name='has_access_to_all_future_projects', field_type=BooleanType(), required=True), 
NestedField(field_id=10, name='is_contractor', field_type=BooleanType(), required=True), 
NestedField(field_id=11, name='is_active', field_type=BooleanType(), required=True), 
NestedField(field_id=12, name='weekly_capacity', field_type=IntegerType(), required=True), 
NestedField(field_id=13, name='default_hourly_rate', field_type=DecimalType(precision=14, scale=2), required=False), 
NestedField(field_id=14, name='cost_rate', field_type=DecimalType(precision=14, scale=2), required=False), 
NestedField(field_id=15, name='roles', field_type=ListType(type='list', element_id=16, element_type=StringType(), element_required=False), required=True), 
NestedField(field_id=17, name='access_roles', field_type=ListType(type='list', element_id=18, element_type=StringType(), element_required=False), required=True), 
NestedField(field_id=19, name='created_at', field_type=TimestampType(), required=True), 
NestedField(field_id=20, name='updated_at', field_type=TimestampType(), required=True)

Iceberg metadata

{
 
    "schemas": [
        {
            "type": "struct",
            "fields": [
                {
                    "id": 1,
                    "name": "__export_id",
                    "type": "string",
                    "required": true,
                    "doc": "Unique identifier of the run that wrote this data."
                },
                {
                    "id": 2,
                    "name": "__export_timestamp",
                    "type": "timestamp",
                    "required": true,
                    "doc": "Timestamp of when export that wrote this data started."
                },
                {
                    "id": 3,
                    "name": "id",
                    "type": "int",
                    "required": true,
                    "doc": "Unique id of the user"
                },
                {
                    "id": 4,
                    "name": "first_name",
                    "type": "string",
                    "required": true,
                    "doc": "First name of the user"
                },
                {
                    "id": 5,
                    "name": "last_name",
                    "type": "string",
                    "required": true,
                    "doc": "Last name of the user"
                },
                {
                    "id": 6,
                    "name": "email",
                    "type": "string",
                    "required": false,
                    "doc": "Email address of the user"
                },
                {
                    "id": 7,
                    "name": "telephone",
                    "type": "string",
                    "required": false,
                    "doc": "The user's telephone number"
                },
                {
                    "id": 8,
                    "name": "timezone",
                    "type": "string",
                    "required": true,
                    "doc": "The user's timezone"
                },
                {
                    "id": 9,
                    "name": "has_access_to_all_future_projects",
                    "type": "boolean",
                    "required": true,
                    "doc": "Whether the user should be automatically added to future projects"
                },
                {
                    "id": 10,
                    "name": "is_contractor",
                    "type": "boolean",
                    "required": true,
                    "doc": "Whether the user is a contractor or an employee"
                },
                {
                    "id": 11,
                    "name": "is_active",
                    "type": "boolean",
                    "required": true,
                    "doc": "Whether the user is active or archived"
                },
                {
                    "id": 12,
                    "name": "weekly_capacity",
                    "type": "int",
                    "required": true,
                    "doc": "The number of hours per week this person is available to work (in seconds, in half hour increments)"
                },
                {
                    "id": 13,
                    "name": "default_hourly_rate",
                    "type": "decimal(14, 2)",
                    "required": false,
                    "doc": "The billable rate to use for this user when they are added to a project"
                },
                {
                    "id": 14,
                    "name": "cost_rate",
                    "type": "decimal(14, 2)",
                    "required": false,
                    "doc": "The cost rate to use for this user when calculating a projects cost vs billable amount"
                },
                {
                    "id": 15,
                    "name": "roles",
                    "type": {
                        "type": "list",
                        "element-id": 19,
                        "element": "string",
                        "element-required": false
                    },
                    "required": true,
                    "doc": "Descriptive names of the business roles assigned to this user. They have no effect on the permissions. Can be used for reporting."
                },
                {
                    "id": 16,
                    "name": "access_roles",
                    "type": {
                        "type": "list",
                        "element-id": 20,
                        "element": "string",
                        "element-required": false
                    },
                    "required": true,
                    "doc": "Access roles that determine users permissions."
                },
                {
                    "id": 17,
                    "name": "created_at",
                    "type": "timestamp",
                    "required": true,
                    "doc": "Date and time the time entry was created"
                },
                {
                    "id": 18,
                    "name": "updated_at",
                    "type": "timestamp",
                    "required": true,
                    "doc": "Date and time the time entry was updated"
                }
            ],
            "schema-id": 0,
            "identifier-field-ids": [
                1
            ]
        }
    ]
}
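
The renumbering in the metadata above is consistent with root fields being numbered first and list element IDs being handed out afterwards. Below is a minimal sketch of that ordering, assuming a single level of list nesting. The `Field` class and `reassign_ids` function are hypothetical stand-ins written for illustration; they model the observed output and are not PyIceberg's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; not a PyIceberg class."""
    field_id: int
    name: str
    element_id: Optional[int] = None  # set only for list-typed fields


def reassign_ids(fields: List[Field]) -> List[Field]:
    """Number all root fields first, then the list elements, starting at 1."""
    next_id = 1
    renumbered = []
    for f in fields:
        renumbered.append(Field(next_id, f.name, f.element_id))
        next_id += 1
    for f in renumbered:
        if f.element_id is not None:
            f.element_id = next_id
            next_id += 1
    return renumbered


# Fields 1-14 are primitives; placeholder names are used here for brevity.
declared = [Field(i, f"col_{i}") for i in range(1, 15)]
declared += [
    Field(15, "roles", element_id=16),
    Field(17, "access_roles", element_id=18),
    Field(19, "created_at"),
    Field(20, "updated_at"),
]

created = reassign_ids(declared)
by_name = {f.name: (f.field_id, f.element_id) for f in created}
print(by_name["roles"], by_name["access_roles"], by_name["updated_at"])
```

Under this model, `roles` keeps field ID 15 but its element moves from 16 to 19, and `access_roles` moves from 17 to 16 with its element at 20, which matches the metadata shown.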

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Fokko (Contributor) commented Mar 2, 2025

@grbinho Thanks for raising this. The IDs get re-assigned, so users don't have to worry about them. You mentioned that inserting the data fails; can you explain how you insert the data? If you use the PyIceberg API, the fields are resolved in the Arrow dataframe by name.
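
To illustrate what resolving by name means here, the sketch below maps incoming columns onto the created table's field IDs using only column names, so the IDs declared locally are never consulted. The `resolve_by_name` function and the sample data are hypothetical; this shows the idea only and is not PyIceberg's code.

```python
def resolve_by_name(table_fields, columns):
    """Map incoming columns to table field IDs using column names only.

    table_fields: {name: field_id} taken from the created table's schema.
    columns: {name: values} for the incoming data.
    """
    missing = [name for name in columns if name not in table_fields]
    if missing:
        raise KeyError(f"columns not in table schema: {missing}")
    return {table_fields[name]: values for name, values in columns.items()}


# Field IDs as they appear in the created table's metadata.
table_fields = {"roles": 15, "access_roles": 16, "created_at": 17}

# Incoming data carries names only; the values are made up for illustration.
columns = {"roles": [["admin"]], "created_at": ["2025-02-26T00:00:00"]}

resolved = resolve_by_name(table_fields, columns)
print(resolved)
```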

grbinho (Author) commented Mar 3, 2025

Hi @Fokko

I'm doing this:

mapped_data = [self.mapper(item, self.export_id, self.export_timestamp) for item in data]
arrow_data = pa.Table.from_pylist(mapped_data, schema=self.schema.as_arrow())

...

if arrow_data:
    with table.transaction() as tx:
        tx.append(arrow_data)
else:
    ...

self.mapper maps the data so that it matches the schema field names.
self.schema holds the Iceberg schema for the data.

And that is it.
Maybe this self.schema.as_arrow() call is causing the problem?
The error I was seeing was a mismatch in field IDs for lists and their element fields.

I will try removing that call later today if I get the chance and test that.
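
One way to pin down which IDs disagree is to compare the declared schema with the created one, both expressed in the JSON shape shown in the metadata above. The sketch below assumes that shape and handles only root fields and lists with primitive elements. `collect_ids` and `diff_ids` are hypothetical helper names, and the two input dicts are trimmed to the fields from this issue.

```python
def collect_ids(schema):
    """Return {path: id} for root fields and list elements of a schema dict."""
    ids = {}
    for field in schema["fields"]:
        ids[field["name"]] = field["id"]
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "list":
            ids[field["name"] + ".element"] = ftype["element-id"]
    return ids


def diff_ids(declared, created):
    """Return {path: (declared_id, created_id)} for paths whose IDs differ."""
    return {
        path: (declared[path], created[path])
        for path in declared
        if path in created and declared[path] != created[path]
    }


def list_of_strings(element_id):
    """Build a list-of-string type in the metadata JSON shape."""
    return {"type": "list", "element-id": element_id, "element": "string"}


# IDs as declared in the Python schema (only the list fields are shown).
declared_schema = {"fields": [
    {"id": 15, "name": "roles", "type": list_of_strings(16)},
    {"id": 17, "name": "access_roles", "type": list_of_strings(18)},
]}

# IDs as they appear in the created table's metadata.
created_schema = {"fields": [
    {"id": 15, "name": "roles", "type": list_of_strings(19)},
    {"id": 16, "name": "access_roles", "type": list_of_strings(20)},
]}

mismatches = diff_ids(collect_ids(declared_schema), collect_ids(created_schema))
for path, (declared_id, created_id) in mismatches.items():
    print(f"{path}: declared {declared_id}, created {created_id}")
```

If an Arrow schema built from the local definition carries the declared IDs, these are the paths where it would disagree with the table.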
