Substrait plan read relation baseSchema does not include the struct with type information #12244

Closed
richtia opened this issue Aug 29, 2024 · 5 comments · Fixed by #15011
Labels
bug Something isn't working

@richtia

richtia commented Aug 29, 2024

Describe the bug

DataFusion produces Substrait plans whose read relation baseSchema does not include the struct with type information:

                    "baseSchema": {
                      "names": [
                        "ps_partkey",
                        "ps_suppkey",
                        "ps_availqty",
                        "ps_supplycost",
                        "ps_comment"
                      ]
                    },

To be valid, it should look more like this:

                      "baseSchema": {
                        "names": ["PS_PARTKEY", "PS_SUPPKEY", "PS_AVAILQTY", "PS_SUPPLYCOST", "PS_COMMENT"],
                        "struct": {
                          "types": [{
                            "i64": {
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          }, {
                            "i64": {
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          }, {
                            "i64": {
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          }, {
                            "decimal": {
                              "scale": 2,
                              "precision": 15,
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          }, {
                            "string": {
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          }],
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      },

To Reproduce

Generate any Substrait plan that includes a read relation, and you'll see that the plan output doesn't include the struct field with type information in the baseSchema.

base_schema is a NamedStruct
https://substrait.io/relations/logical_relations/#__tabbed_1_1

https://substrait.io/types/named_structs/
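
As a rough illustration of the difference between the two baseSchema snippets above, the following sketch checks a JSON-form baseSchema for the struct that carries type information. The helper name `check_base_schema` is hypothetical, and the comparison of name and type counts assumes a flat schema with no nested struct fields; it is not a substitute for the substrait-validator.

```python
def check_base_schema(base_schema):
    """Return a list of problems found in a JSON-form baseSchema.

    Assumption: the schema is flat, so there is one name per top-level type.
    """
    problems = []
    names = base_schema.get("names", [])
    struct = base_schema.get("struct")
    if struct is None:
        # This is the situation reported in this issue: names without types.
        problems.append("missing 'struct' with type information")
        return problems
    types = struct.get("types", [])
    if len(names) != len(types):
        problems.append(f"{len(names)} names but {len(types)} types")
    return problems


# Shortened versions of the two snippets from the issue description.
names_only = {"names": ["ps_partkey", "ps_suppkey"]}
with_types = {
    "names": ["PS_PARTKEY", "PS_SUPPKEY"],
    "struct": {
        "types": [
            {"i64": {"nullability": "NULLABILITY_REQUIRED"}},
            {"i64": {"nullability": "NULLABILITY_REQUIRED"}},
        ],
        "nullability": "NULLABILITY_REQUIRED",
    },
}

print(check_base_schema(names_only))
print(check_base_schema(with_types))
```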

Expected behavior

No response

Additional context

You can also validate plans by running them through the substrait-validator:

import substrait_validator as sv
from datafusion import SessionContext
from datafusion import substrait as ss

# sql_query is any SQL string that reads from a table registered in ctx
ctx = SessionContext()
substrait_plan = ss.serde.serialize_to_plan(sql_query, ctx)
substrait_plan_bytes = substrait_plan.encode()

# Run the serialized plan through the validator
config = sv.Config()
sv.check_plan_valid(substrait_plan_bytes, config)
@Blizzara
Contributor

I believe this was fixed in #12245?

@amoeba
Member

amoeba commented Mar 4, 2025

The behavior was modified in #12245 and the original issue looks addressed, but the plans I'm getting still don't validate. DataFusion currently hardcodes the struct nullability to NULLABILITY_UNSPECIFIED; see

nullability: r#type::Nullability::Unspecified as i32,

The baseSchema I get with the latest datafusion looks like this,

"baseSchema": {
  "names": [
    "species",
    "island",
    "bill_length_mm",
    "bill_depth_mm",
    "body_mass_g",
    "sex",
    "year"
  ],
  "struct": {
    "types": [
      {
        "string": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "string": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "fp64": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "fp64": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "i32": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "string": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "i32": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      }
    ]
  }
},

The above implicitly sets the nullability on the baseSchema to NULLABILITY_UNSPECIFIED which, when validated as part of a larger plan, errors out with:

Error (code 0002):
  at plan.relations[0].rel_type<root>.input.rel_type<project>.input.rel_type<read>.base_schema.struct.nullability:
  illegal value: nullability information is required in this context (code 0002)
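
To locate the offending field without running the full validator, a minimal sketch along these lines can walk a plan in JSON form and report every baseSchema whose struct nullability is absent or unspecified. It assumes the plan has already been loaded into a Python dict and that an omitted nullability key stands for NULLABILITY_UNSPECIFIED (protobuf JSON output commonly omits default enum values). The helper name `find_unspecified` and the trimmed-down plan are hypothetical.

```python
UNSPECIFIED = "NULLABILITY_UNSPECIFIED"


def find_unspecified(node, path="plan"):
    """Return paths of baseSchema structs whose nullability is unspecified."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "baseSchema" and isinstance(value, dict):
                struct = value.get("struct", {})
                # A missing key is treated as the protobuf default (assumption).
                if struct.get("nullability", UNSPECIFIED) == UNSPECIFIED:
                    hits.append(f"{child}.struct.nullability")
            hits.extend(find_unspecified(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_unspecified(item, f"{path}[{index}]"))
    return hits


# Hypothetical, heavily trimmed plan with the same nesting as the error path.
plan = {
    "relations": [{
        "root": {"input": {"project": {"input": {"read": {
            "baseSchema": {
                "names": ["species"],
                "struct": {
                    "types": [{"string": {"nullability": "NULLABILITY_NULLABLE"}}]
                },
            }
        }}}}}
    }]
}

for hit in find_unspecified(plan):
    print(hit)
```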

I can get the plan to validate if I set nullability to NULLABILITY_REQUIRED,

                "baseSchema": {
                  "names": [...],
                  "struct": {
                    "types": [...],
                    "nullability": "NULLABILITY_REQUIRED"
                  }
                },
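
The manual edit described above can be sketched as a small post-processing step on the JSON form of the plan. This is only a workaround sketch, assuming the plan is available as a Python dict; `require_struct_nullability` is a hypothetical helper and says nothing about how DataFusion itself should emit the field.

```python
import copy

UNSPECIFIED = "NULLABILITY_UNSPECIFIED"
REQUIRED = "NULLABILITY_REQUIRED"


def require_struct_nullability(plan):
    """Return a copy of the plan with every baseSchema struct marked REQUIRED.

    Only the root struct of each baseSchema is touched; the nullability of
    the individual field types is left as it was.
    """
    patched = copy.deepcopy(plan)
    stack = [patched]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            base_schema = node.get("baseSchema")
            if isinstance(base_schema, dict):
                struct = base_schema.get("struct")
                if isinstance(struct, dict):
                    if struct.get("nullability", UNSPECIFIED) == UNSPECIFIED:
                        struct["nullability"] = REQUIRED
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return patched


# Hypothetical minimal read relation without struct nullability.
read_rel = {
    "read": {
        "baseSchema": {
            "names": ["year"],
            "struct": {"types": [{"i32": {"nullability": "NULLABILITY_NULLABLE"}}]},
        }
    }
}

fixed = require_struct_nullability(read_rel)
print(fixed["read"]["baseSchema"]["struct"]["nullability"])
```

Working on a deep copy keeps the original plan available for comparison with the patched one.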

Is the validator right and should DataFusion change its behavior here?

@Blizzara
Contributor

Blizzara commented Mar 4, 2025

I didn't find anywhere on substrait.io where this is stated explicitly, but the example there (https://substrait.io/tutorial/sql_to_substrait/#types-and-schemas) does use NULLABILITY_REQUIRED. I don't think it makes any difference in practice, given that it's the fields inside the struct type that matter, not the root struct itself.

If the validator is happier with NULLABILITY_REQUIRED, fine by me to change it to that.

@amoeba
Member

amoeba commented Mar 4, 2025

Thanks @Blizzara. I've asked for clarification on the Substrait Slack to be sure this is the right change. Once I'm sure, I'll file a PR and we can close this issue out. Hopefully I can also update the Substrait docs.

@amoeba
Member

amoeba commented Mar 5, 2025

I was able to confirm that this is probably something Substrait should make required, and I put up a PR with the change at #15011. That should close this issue out.
