bug: type error when accessing `length()` without schema information #10651

choucavalier · 2025-01-02T13:42:04Z

What happened?

The following example is self-explanatory

By default, it's assumed that t.some_col is of type Column, while it could be of type ArrayColumn (or other more specific types for that matter).

This results in typing issues.

import ibis
import pandas as pd


d = pd.DataFrame(
  {
    "some_col": [
      [1, 2, 3],
      [4, 5],
    ]
  }
)
t = ibis.memtable(d)

t.some_col.length()  # ■ Cannot access attribute "length" for class "Column"

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

cpcloud · 2025-01-02T16:42:00Z

I can't reproduce this on main:

In [4]: import ibis
   ...: import pandas as pd
   ...:
   ...:
   ...: d = pd.DataFrame(
   ...:   {
   ...:     "some_col": [
   ...:       [1, 2, 3],
   ...:       [4, 5],
   ...:     ]
   ...:   }
   ...: )
   ...: t = ibis.memtable(d)

In [5]: t.some_col.length()
Out[5]:
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
          some_col
      0  [1, 2, 3]
      1     [4, 5]

ArrayLength(some_col): ArrayLength(r0.some_col)

choucavalier · 2025-01-02T16:42:57Z

The code works. It's the static type checking that outputs an error while it should not.

cpcloud · 2025-01-02T16:48:14Z

Schema information generally can't be made static in Python.

A type checker would need access to information from a running database.

choucavalier · 2025-01-02T16:49:54Z

Of course, but in this case I think the DX would be much better if instead of assuming that the column can't be a list, we assume it could be, and prevent the type checker from erroring

Right now I have to do

from typing import cast
from ibis.expr.types import ArrayColumn

t.select(t).filter(
  cast(ArrayColumn, t.my_list_column).length() > ibis.literal(0)
)

(type checking also requires me to write ibis.literal(...) for any constant value.

More and more projects use type checking, and not having it in ibis will prevent all of them from having a good DX and reduce ibis' adoption.

cpcloud · 2025-01-02T16:58:05Z

It sounds like you effectively want every type-specific method to exist on Column. Is that correct?

choucavalier · 2025-01-02T17:09:11Z

That could be a solution. Maybe there's another one which does not imply modifying the Column definition, but instead provide some type hints that could help the static type checkers understand that "by default this could actually be an ArrayColumn, or a StructColumn (or others) so let's assume that these collection-specific methods might exist because we don't have access to the schema of the table so let's trust this guy that he knows what he's doing when querying his own data. if he's trying to access the method length on a column that's actually not an ArrayColumn we'll throw an error at runtime. but let's not ask the guy to write type casting code everywhere or disable the type checker on all the lines that actually use ibis"

cpcloud · 2025-01-02T17:15:36Z

In that case a giant ir.StringColumn | ir.IntegerColum | ... with every single column subclass as the return value of Table.__getattr__ would probably suffice for this specific use case.

If this pattern shows up in a bunch of places maybe we can make a type alias to simplify our lives.

NickCrews · 2025-01-17T06:56:57Z

a giant ir.StringColumn | ir.IntegerColum | ...

Wouldn't pyright still complain in this case? if you try to do .upper() on something that is ir.StringColumn | ir.IntegerColumn, then it could be an ir.IntegerColumn, which would fail at runtime. Or is pyright optimistic?

I started #10682 as an attempt to slightly mitigate this pain point for some workflows. eg if/after that lands, then my_table.my_int_column.cast(int) now correctly types as a ir.IntegerValue (our type system doesn't currently allow us to be more specific and say ir.IntegerColumn ☹ ). But that PR can't work for complex dtypes like arrays, because we can't enumerate all possible array<int>, array<array<int>>, array<array<array<int>>>, etc options to @typing.overload

cpcloud · 2025-01-17T08:01:00Z

eg if/after that lands

It's been merged into main!

our type system doesn't currently allow us to be more specific

That is by design. ir.IntegerColumn is an implementation detail, not a type that represents values of a column in a database.

can't work for complex dtypes like arrays

I think this is true of any type system that supports recursive types, because there are infinitely many types in such systems.

NickCrews · 2025-01-18T06:33:01Z

ir.IntegerColumn is an implementation detail

Are you sure? ir.IntegerScalar does not have a .max() method, ir.IntegerColumn does. We talked about this here as well. I think if it were possible, I think it would be better if ibis's type system could discern between Columns and Scalars.

cpcloud · 2025-01-18T13:24:57Z

Are you sure?

Yes.

By implementation detail here, I mean that we don't expose those objects as inputs to expression APIs.

Casting is an in-database operation, and our backends don't and probably won't (and shouldn't) ever have knowledge of how Ibis models expressions.

romaingd · 2025-03-16T10:44:03Z

Hi! I'm hitting similar static type checking issues with many expressions. Long text ahead - I can open a separate issue if more appropriate. I am also willing to open a PR if there's an agreement on changes.

Consider the following expressions and their types as inferred by Pyright.

t = ibis.memtable({"a": [1, 2]})

x = t.a.name("b").min()  # Typing error
x = t.a.log(t.a)  # Typing error
x = t.a.log(t.a).min().log(ibis.literal(2))  # Typing error

a = typing.cast("ibis.expr.types.IntegerColumn", t.a)
y = a.name("y")  # Typed as Value - is IntegerColumn
y = a.log()  # Typed as NumericValue - is NumericColumn
y = a.isnull()  # Typed as BooleanValue - is BooleanColumn

Note that t.a is accurately inferred as a Column, and that all expressions are valid.

Nonetheless x gives typing errors because:

Value.name() is typed as returning Value, for which .min() is undefined - .min() is shape-specific, but shape is lost mid-way
Column.log() is undefined, as .log() is only defined for NumericValue - .log() is type-specific, but type can't be inferred statically

Even when the type checker is informed of the precise shape and dtype of t.a, y has lost some information - sometimes shape, sometimes dtype, sometimes both. This matters, because many operations are only defined on some combination of shape and dtype, e.g. .mean() - so any operation on y risks being a typing error.

Why this is a problem

Static type checking with tools like Pyright is heavily relied on by many Python projects, and only gets more popular with time. I agree with @choucavalier : valid Ibis basic operations being flagged as invalid is painful for DX and a real obstacle to wider adoption. Users must choose between living with a good chunk of their Ibis code being highlighted as invalid, or disabling type checking on Ibis altogether. Neither option is particularly appealing.

Note that this is completely separate from Ibis runtime operations type safety (which took me some time to figure out).

Suggestions

I believe that, in this context and at this scale, false positives (statically typed as valid, actually invalid at runtime) are better than false negatives (statically typed as invalid, actually valid at runtime).

Dtype-agnostic declarations

Given that value dtype usually can't be inferred statically, I see no other option than declaring dtype-specific methods (e.g. NumericColumn.mean()) at the dtype-agnostic appropriate level (e.g. Column.mean()). More specifically, Column.mean() could be an abstract method that is implemented only in NumericColumn.mean() because it doesn't make sense for other dtypes.

A similar choice was made in pandas for Series, or Polars for Expr.

Shape-aware typing

Contrary to dtype, shape can usually be inferred statically, and it's valuable typing information. Most (if not all?) Value methods can be typed as returning a specific shape: for example, value.isnull() will always have the same shape as value. Using appropriate shape typing, or Self / a TypeVar when fitting, provides useful information and should prevent the need for shape-agnostic declarations.

Proof of concept

Trying out on some methods, see how it looks in romaingd#1

T = TypeVar("T", "Value", "Column", "Scalar")

class Value(Expr):
    def name(self: Self, name: str, /) -> Self: ...

    def isnull(self: T) -> T: ...

	@abstractmethod  # Implemented in `NumericValue`
    def log(self: T, base: Value | None = None, /) -> T: ...

Results:

t = ibis.memtable({"a": [1, 2]})

x = t.a.name("b").min()  # Typed as Scalar
x = t.a.log(t.a)  # Typed as Column
x = t.a.log(t.a).min().log(ibis.literal(2))  # Typed as Scalar

a = typing.cast("ibis.expr.types.IntegerColumn", t.a)
y = a.name("y")  # Typed as IntegerColumn
y = a.log()  # Typed as Column
y = a.isnull()  # Typed as Column

What we gained:

No static typing error on valid operations (no false negatives)
Correct shape typing

What we lost:

Less static typing errors on invalid operations (more false positives)
Virtually no dtype typing - but the information was rarely available to begin with, and matters less now that dtype-specific methods are declared as dtype-agnostic
Some interface duplication, e.g. Value.log() (abstract) and NumericValue.log() (concrete)

Other options

"A giant ir.StringColumn | ir.IntegerColum | ... with every single column subclass as the return value of Table.__getattr__", as mentioned in bug: type error when accessing length() without schema information #10651 (comment). I don't think that would work. Type checkers, considering the possibility of t.a being a ir.StringColumn, would flag any numeric-only operation like t.a.mean() as invalid.
Using method(self: Self, ...) -> Self everywhere. While this is true of some methods, like .name(), it's incorrect for many others, like .mean() (changes shape) or .log() (changes dtype for ir.IntegerValue).
I have tried using generics, e.g. class Value(Generic[Datashape, Datatype]), but failed to achieve anything meaningful without essentially declaring every method in Value directly (i.e. shape-agnostic and type-agnostic). For similar reasons, something like Column[Boolean] did not solve the issues mentioned above. The lack of Higher Kinded TypeVars was really an obstacle. Maybe I just didn't see the correct way to do it.
Type stubs? Probably a lot more maintenance work, but no interface duplication in actual Ibis code.

choucavalier added the bug Incorrect behavior inside of ibis label Jan 2, 2025

github-project-automation bot added this to Ibis planning and roadmap Jan 2, 2025

github-project-automation bot moved this to backlog in Ibis planning and roadmap Jan 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: type error when accessing `length()` without schema information #10651

bug: type error when accessing `length()` without schema information #10651

choucavalier commented Jan 2, 2025

cpcloud commented Jan 2, 2025

choucavalier commented Jan 2, 2025

cpcloud commented Jan 2, 2025

choucavalier commented Jan 2, 2025 •

edited

Loading

cpcloud commented Jan 2, 2025

choucavalier commented Jan 2, 2025

cpcloud commented Jan 2, 2025

NickCrews commented Jan 17, 2025

cpcloud commented Jan 17, 2025

NickCrews commented Jan 18, 2025

cpcloud commented Jan 18, 2025 •

edited

Loading

romaingd commented Mar 16, 2025

bug: type error when accessing length() without schema information #10651

bug: type error when accessing length() without schema information #10651

Comments

choucavalier commented Jan 2, 2025

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

cpcloud commented Jan 2, 2025

choucavalier commented Jan 2, 2025

cpcloud commented Jan 2, 2025

choucavalier commented Jan 2, 2025 • edited Loading

cpcloud commented Jan 2, 2025

choucavalier commented Jan 2, 2025

cpcloud commented Jan 2, 2025

NickCrews commented Jan 17, 2025

cpcloud commented Jan 17, 2025

NickCrews commented Jan 18, 2025

cpcloud commented Jan 18, 2025 • edited Loading

romaingd commented Mar 16, 2025

Why this is a problem

Suggestions

Dtype-agnostic declarations

Shape-aware typing

Proof of concept

Other options

bug: type error when accessing `length()` without schema information #10651

bug: type error when accessing `length()` without schema information #10651

choucavalier commented Jan 2, 2025 •

edited

Loading

cpcloud commented Jan 18, 2025 •

edited

Loading