Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RunEndEncodeTableColumns doesn't change the table's schema type to reflect that the column is run-end-encoded #45534

Open
lesterfan opened this issue Feb 13, 2025 · 1 comment

Comments

@lesterfan
Copy link

lesterfan commented Feb 13, 2025

Describe the bug, including details regarding any error messages, version, and platform.

This is half a bug report regarding the RunEndEncodeTableColumns gtest util and half a usage question.

If a string column in an arrow::Table is run-end encoded, should the corresponding schema type be arrow::utf8() or arrow::run_end_encoded(arrow::int32(), arrow::utf8())? In the current main branch, the RunEndEncodeTableColumns gtest util returns a table like

ree_table = col: string
----
col:
  [

    -- run_ends:
      [
        1,
        2,
        3,
        4
      ]
    -- values:
      [
        "a",
        "b",
        "c",
        "d"
      ]
  ]

whereas I would have expected a table like

ree_table = col: run_end_encoded<run_ends: int32, values: string>
  child 0, run_ends: int32 not null
  child 1, values: string
----
col:
  [

    -- run_ends:
      [
        1,
        2,
        3,
        4
      ]
    -- values:
      [
        "a",
        "b",
        "c",
        "d"
      ]
  ]

I'm not sure what the right behavior is here. My instinct is that the second is more correct since I see in the codebase that certain features are disabled for run-end-encoded types (example), so we would want the schema to be accurate to reflect what the library currently supports on the column. I'm not sure whether the existence of these unsupported features implies that having a plain string type in the schema is incorrect though. I definitely don't have a lot of context here though, so let me know if I'm missing something in my considerations here 🙂

Component(s)

C++

@lesterfan
Copy link
Author

I put up #45535 to change this function to also modify the schema type pending further discussion here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant