Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: type error when accessing length() without schema information #10651

Open
1 task done
choucavalier opened this issue Jan 2, 2025 · 12 comments
Open
1 task done
Labels
bug Incorrect behavior inside of ibis

Comments

@choucavalier
Copy link

What happened?

The following example is self-explanatory

By default, it's assumed that t.some_col is of type Column, while it could be of type ArrayColumn (or other more specific types for that matter).

This results in typing issues.

import ibis
import pandas as pd


d = pd.DataFrame(
  {
    "some_col": [
      [1, 2, 3],
      [4, 5],
    ]
  }
)
t = ibis.memtable(d)

t.some_col.length()  # ■ Cannot access attribute "length" for class "Column"

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@choucavalier choucavalier added the bug Incorrect behavior inside of ibis label Jan 2, 2025
@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

I can't reproduce this on main:

In [4]: import ibis
   ...: import pandas as pd
   ...:
   ...:
   ...: d = pd.DataFrame(
   ...:   {
   ...:     "some_col": [
   ...:       [1, 2, 3],
   ...:       [4, 5],
   ...:     ]
   ...:   }
   ...: )
   ...: t = ibis.memtable(d)

In [5]: t.some_col.length()
Out[5]:
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
          some_col
      0  [1, 2, 3]
      1     [4, 5]

ArrayLength(some_col): ArrayLength(r0.some_col)

@choucavalier
Copy link
Author

The code works. It's the static type checking that outputs an error while it should not.

@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

Schema information generally can't be made static in Python.

A type checker would need access to information from a running database.

@choucavalier
Copy link
Author

choucavalier commented Jan 2, 2025

Of course, but in this case I think the DX would be much better if instead of assuming that the column can't be a list, we assume it could be, and prevent the type checker from erroring

Right now I have to do

from typing import cast
from ibis.expr.types import ArrayColumn

t.select(t).filter(
  cast(ArrayColumn, t.my_list_column).length() > ibis.literal(0)
)

(type checking also requires me to write ibis.literal(...) for any constant value.

More and more projects use type checking, and not having it in ibis will prevent all of them from having a good DX and reduce ibis' adoption.

@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

It sounds like you effectively want every type-specific method to exist on Column. Is that correct?

@choucavalier
Copy link
Author

That could be a solution. Maybe there's another one which does not imply modifying the Column definition, but instead provide some type hints that could help the static type checkers understand that "by default this could actually be an ArrayColumn, or a StructColumn (or others) so let's assume that these collection-specific methods might exist because we don't have access to the schema of the table so let's trust this guy that he knows what he's doing when querying his own data. if he's trying to access the method length on a column that's actually not an ArrayColumn we'll throw an error at runtime. but let's not ask the guy to write type casting code everywhere or disable the type checker on all the lines that actually use ibis"

@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

In that case a giant ir.StringColumn | ir.IntegerColum | ... with every single column subclass as the return value of Table.__getattr__ would probably suffice for this specific use case.

If this pattern shows up in a bunch of places maybe we can make a type alias to simplify our lives.

@NickCrews
Copy link
Contributor

a giant ir.StringColumn | ir.IntegerColum | ...

Wouldn't pyright still complain in this case? if you try to do .upper() on something that is ir.StringColumn | ir.IntegerColumn, then it could be an ir.IntegerColumn, which would fail at runtime. Or is pyright optimistic?

I started #10682 as an attempt to slightly mitigate this pain point for some workflows. eg if/after that lands, then my_table.my_int_column.cast(int) now correctly types as a ir.IntegerValue (our type system doesn't currently allow us to be more specific and say ir.IntegerColumn ☹ ). But that PR can't work for complex dtypes like arrays, because we can't enumerate all possible array<int>, array<array<int>>, array<array<array<int>>>, etc options to @typing.overload

@cpcloud
Copy link
Member

cpcloud commented Jan 17, 2025

eg if/after that lands

It's been merged into main!

our type system doesn't currently allow us to be more specific

That is by design. ir.IntegerColumn is an implementation detail, not a type that represents values of a column in a database.

can't work for complex dtypes like arrays

I think this is true of any type system that supports recursive types, because there are infinitely many types in such systems.

@NickCrews
Copy link
Contributor

ir.IntegerColumn is an implementation detail

Are you sure? ir.IntegerScalar does not have a .max() method, ir.IntegerColumn does. We talked about this here as well. I think if it were possible, I think it would be better if ibis's type system could discern between Columns and Scalars.

@cpcloud
Copy link
Member

cpcloud commented Jan 18, 2025

Are you sure?

Yes.

By implementation detail here, I mean that we don't expose those objects as inputs to expression APIs.

Casting is an in-database operation, and our backends don't and probably won't (and shouldn't) ever have knowledge of how Ibis models expressions.

@romaingd
Copy link
Contributor

Hi! I'm hitting similar static type checking issues with many expressions. Long text ahead - I can open a separate issue if more appropriate. I am also willing to open a PR if there's an agreement on changes.

Consider the following expressions and their types as inferred by Pyright.

t = ibis.memtable({"a": [1, 2]})

x = t.a.name("b").min()  # Typing error
x = t.a.log(t.a)  # Typing error
x = t.a.log(t.a).min().log(ibis.literal(2))  # Typing error

a = typing.cast("ibis.expr.types.IntegerColumn", t.a)
y = a.name("y")  # Typed as Value - is IntegerColumn
y = a.log()  # Typed as NumericValue - is NumericColumn
y = a.isnull()  # Typed as BooleanValue - is BooleanColumn

Note that t.a is accurately inferred as a Column, and that all expressions are valid.

Nonetheless x gives typing errors because:

  • Value.name() is typed as returning Value, for which .min() is undefined - .min() is shape-specific, but shape is lost mid-way
  • Column.log() is undefined, as .log() is only defined for NumericValue - .log() is type-specific, but type can't be inferred statically

Even when the type checker is informed of the precise shape and dtype of t.a, y has lost some information - sometimes shape, sometimes dtype, sometimes both. This matters, because many operations are only defined on some combination of shape and dtype, e.g. .mean() - so any operation on y risks being a typing error.

Why this is a problem

Static type checking with tools like Pyright is heavily relied on by many Python projects, and only gets more popular with time. I agree with @choucavalier : valid Ibis basic operations being flagged as invalid is painful for DX and a real obstacle to wider adoption. Users must choose between living with a good chunk of their Ibis code being highlighted as invalid, or disabling type checking on Ibis altogether. Neither option is particularly appealing.

Note that this is completely separate from Ibis runtime operations type safety (which took me some time to figure out).

Suggestions

I believe that, in this context and at this scale, false positives (statically typed as valid, actually invalid at runtime) are better than false negatives (statically typed as invalid, actually valid at runtime).

Dtype-agnostic declarations

Given that value dtype usually can't be inferred statically, I see no other option than declaring dtype-specific methods (e.g. NumericColumn.mean()) at the dtype-agnostic appropriate level (e.g. Column.mean()). More specifically, Column.mean() could be an abstract method that is implemented only in NumericColumn.mean() because it doesn't make sense for other dtypes.

A similar choice was made in pandas for Series, or Polars for Expr.

Shape-aware typing

Contrary to dtype, shape can usually be inferred statically, and it's valuable typing information. Most (if not all?) Value methods can be typed as returning a specific shape: for example, value.isnull() will always have the same shape as value. Using appropriate shape typing, or Self / a TypeVar when fitting, provides useful information and should prevent the need for shape-agnostic declarations.

Proof of concept

Trying out on some methods, see how it looks in romaingd#1

T = TypeVar("T", "Value", "Column", "Scalar")

class Value(Expr):
    def name(self: Self, name: str, /) -> Self: ...

    def isnull(self: T) -> T: ...

	@abstractmethod  # Implemented in `NumericValue`
    def log(self: T, base: Value | None = None, /) -> T: ...

Results:

t = ibis.memtable({"a": [1, 2]})

x = t.a.name("b").min()  # Typed as Scalar
x = t.a.log(t.a)  # Typed as Column
x = t.a.log(t.a).min().log(ibis.literal(2))  # Typed as Scalar

a = typing.cast("ibis.expr.types.IntegerColumn", t.a)
y = a.name("y")  # Typed as IntegerColumn
y = a.log()  # Typed as Column
y = a.isnull()  # Typed as Column

What we gained:

  • No static typing error on valid operations (no false negatives)
  • Correct shape typing

What we lost:

  • Less static typing errors on invalid operations (more false positives)
  • Virtually no dtype typing - but the information was rarely available to begin with, and matters less now that dtype-specific methods are declared as dtype-agnostic
  • Some interface duplication, e.g. Value.log() (abstract) and NumericValue.log() (concrete)

Other options

  1. "A giant ir.StringColumn | ir.IntegerColum | ... with every single column subclass as the return value of Table.__getattr__", as mentioned in bug: type error when accessing length() without schema information #10651 (comment). I don't think that would work. Type checkers, considering the possibility of t.a being a ir.StringColumn, would flag any numeric-only operation like t.a.mean() as invalid.
  2. Using method(self: Self, ...) -> Self everywhere. While this is true of some methods, like .name(), it's incorrect for many others, like .mean() (changes shape) or .log() (changes dtype for ir.IntegerValue).
  3. I have tried using generics, e.g. class Value(Generic[Datashape, Datatype]), but failed to achieve anything meaningful without essentially declaring every method in Value directly (i.e. shape-agnostic and type-agnostic). For similar reasons, something like Column[Boolean] did not solve the issues mentioned above. The lack of Higher Kinded TypeVars was really an obstacle. Maybe I just didn't see the correct way to do it.
  4. Type stubs? Probably a lot more maintenance work, but no interface duplication in actual Ibis code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Status: backlog
Development

No branches or pull requests

4 participants