Add standard data ingestion pipelines to pylibcudf for ndarrays #18311
base: branch-25.06
Conversation
```python
try:
    import numpy as np
    np_error = None
except ImportError as err:
    np = None
    np_error = err
```
Note: I tried following the approach in #18020 to make the imports optional.
This captures, in `err` (if we have an import error), any live variables for the lifetime of the process. If that import happens to be done somewhere other than top level, that might be a lot of stuff.

I would prefer not saving the error, and instead raising when we come to use things if `np`/`cp` is `None`.

(We should probably do the same in the scalar handling.)
As a compromise, we could save the exception class and the string error message, but not the traceback frames, so we keep a more faithful representation of the original error. I don't have a strong opinion on how we reduce this, though.
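A sketch of that compromise, using an intentionally nonexistent module as a stand-in for the optional dependency:

```python
try:
    import _module_that_does_not_exist  # stand-in for the optional dependency
    cp_error = None
except ImportError as err:
    # Re-create the exception from its class and message only, so the
    # original traceback frames (and their local variables) can be
    # garbage-collected while the error text stays faithful.
    cp_error = type(err)(str(err))
```

The re-created exception carries no `__traceback__`, so nothing from the failed import's frame stack stays alive.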
`@@ -321,6 +338,80 @@ cdef class Column:`

```python
            []
        )

    @singledispatchmethod
    @classmethod
    def from_any(cls, obj):
```
Suggestion: maybe we can start with a `from_ndarray` method that accepts numpy and cupy arrays, and then build up a `from_any` as we add more ingestion methods (e.g. `from_pyobject`, `from_pandas`, etc.).
Would we eventually remove `from_ndarray` in favor of `from_any`? If so, I'd rather keep `from_any`, since it's ultimately what we want, and because we're planning on adding more ingestion methods during the 25.06 release.

If we plan to keep it, then I'll change `from_any` to `from_ndarray`. I wanted to add a stable API sooner rather than later for Curator's use case: they need to be able to create a list column from a cupy array. It should be public and stable so they can keep doing something like the following without it breaking between cudf releases:

```python
cudf.Series.from_pylibcudf(
    plc.Column.from_any(
        cupy_array
    )
)
```

*Also, I forgot to add a docstring; I will add that too.
Yes, but I'll start with `from_ndarray` in this PR and add `from_any` in a follow-up PR.

Narwhals failures are unrelated to this PR.
```python
import cupy as cp
import numpy as np
```
Could you define a fixture that does `pytest.importorskip("numpy")`/`pytest.importorskip("cupy")` (like in https://github.com/rapidsai/cudf/pull/18020/files#diff-4ca9193a8f5aa8079576b6ba20cffb2f875bf62fe8f75921f6f151250915947aR11), since technically these are optional dependencies?
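A sketch of what such fixtures might look like (fixture names are illustrative; the pattern follows #18020):

```python
import pytest

@pytest.fixture
def np():
    # Skip tests that need numpy when it isn't installed, since it is
    # an optional dependency of pylibcudf.
    return pytest.importorskip("numpy")

@pytest.fixture
def cp():
    # Same treatment for cupy.
    return pytest.importorskip("cupy")
```

Tests then take `np`/`cp` as arguments and are skipped automatically in environments without the optional dependency.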
```python
@classmethod
def from_cuda_array_interface_obj(cls, obj: Any) -> Column: ...
@classmethod
def from_ndarray(cls, obj: Any) -> Column: ...
```
Let's tighten this typing. We can define a `Protocol`:

```python
from typing import TypedDict, Any, Protocol, Union

class CAI(TypedDict):
    shape: tuple[int, ...]
    typestr: str
    data: tuple[int, bool]
    version: int
    strides: None | tuple[int, ...]
    # TODO: better type for this
    descr: None | tuple[Any, ...]
    mask: Union[None, "SupportsCAI"]
    stream: None | int

class SupportsCAI(Protocol):
    @property
    def __cuda_array_interface__(self) -> CAI: ...

@classmethod
def from_cuda_array_interface_obj(cls, obj: SupportsCAI) -> Column: ...
```
Similarly we can probably do the same for the array interface.
```diff
-    @staticmethod
-    def from_cuda_array_interface_obj(object obj):
+    @classmethod
+    def from_cuda_array_interface_obj(cls, object obj):
```
How about we call this `from_cuda_array_interface`?
```python
    @singledispatchmethod
    @classmethod
    def _from_ndarray(cls, obj):
        if np_error is not None:
            raise np_error
        if cp_error is not None:
            raise cp_error
        raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")

    if np is not None:
        @classmethod
        def from_numpy_array(cls, object obj):
            # TODO: Should expand to support __array_interface__
            raise NotImplementedError(
                "Converting to a pylibcudf Column from "
                "a numpy object is not yet implemented."
            )

        @_from_ndarray.register(np.ndarray)
        @classmethod
        def _(cls, obj):
            return cls.from_numpy_array(obj)
```
I think we don't need singledispatch, and should just handle anything that supports the array interface.
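A hedged sketch of that idea: branch on which interface the object exposes instead of registering concrete types with singledispatch. The function name and string return values are placeholders for the real conversion paths.

```python
def from_arraylike(obj):
    # Accept any provider of the CUDA array interface or the host
    # array interface, without naming concrete array classes.
    if hasattr(obj, "__cuda_array_interface__"):
        return "device"  # placeholder: build the column from device memory
    if hasattr(obj, "__array_interface__"):
        return "host"    # placeholder: build the column from host memory
    raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")
```

This also means third-party arrays (numba device arrays, for example) work without an explicit registration.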
```python
        flat_data = arr.ravel()

        num_rows, num_cols = arr.shape
        offsets = cp.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=cp.int32)
```
nit: use `size_type`.
`@@ -360,6 +377,110 @@ cdef class Column:`

```python
            []
        )

    @classmethod
    def from_ndarray(cls, obj):
```
I think we should have two methods:

- `from_array_interface`
- `from_cuda_array_interface`

Or maybe a single `from_arraylike`, where the object supports either the CUDA array interface or the array interface?
```python
    def _from_ndarray(cls, obj):
        if np_error is not None:
            raise np_error
        if cp_error is not None:
            raise cp_error
        raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")
```
This raises too eagerly in the case where we're trying to convert from a numpy array and the user has numpy but not cupy.
```python
        data_view = gpumemoryview(flat_data)
        offsets_view = gpumemoryview(offsets)
        typestr = arr.__cuda_array_interface__['typestr'][1:]

        data_col = cls(
            data_type=_datatype_from_dtype_desc(typestr),
            size=flat_data.size,
            data=data_view,
            mask=None,
            null_count=0,
            offset=0,
            children=[],
        )
```
I think if the data are C-contiguous we can do this without a copy. In that case, we can also avoid requiring cupy, because we can make the offsets column with `pylibcudf.filling.sequence`.
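The offsets for a C-contiguous `(num_rows, num_cols)` array are just multiples of `num_cols`, which is why a sequence (rather than a cupy `arange`) suffices. A pure-Python illustration of the values involved:

```python
def list_offsets(num_rows, num_cols):
    # For C-contiguous data, each row occupies num_cols consecutive
    # elements, so row i starts at i * num_cols in the flat buffer and
    # the data buffer itself needs no copy.
    return [i * num_cols for i in range(num_rows + 1)]
```

For example, `list_offsets(3, 4)` gives `[0, 4, 8, 12]`: three lists of four elements each, with the final offset marking the end of the buffer.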
```python
        num_rows, num_cols = arr.shape
        offsets = cp.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=cp.int32)
```
Someone needs to check that this will not produce a column with more than the maximum number of rows. The way we represent list columns in libcudf is that we have a column of `N` rows, but the offsets column has `N + 1` rows. If `N == size_type::max()`, then `N + 1` overflows, so we can't represent the offsets.
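A small illustration of the bound, assuming libcudf's `size_type` is a 32-bit signed integer (the validation helper is hypothetical):

```python
SIZE_TYPE_MAX = 2**31 - 1  # size_type::max() for an int32 size_type

def offsets_length(num_rows):
    # A list column of N rows needs an offsets child of N + 1 rows, so
    # the largest representable list column has SIZE_TYPE_MAX - 1 rows.
    if num_rows + 1 > SIZE_TYPE_MAX:
        raise ValueError("offsets column would overflow size_type")
    return num_rows + 1
```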
Description

Contributes to #15132 and #18214. This PR starts with cupy arrays and adds the skeleton code for array interface objects (e.g. numpy arrays). This is breaking because `from_cuda_array_interface_obj` now raises for multi-dimensional arrays.

I think this PR should be sufficient for Curator to replace their existing logic for converting cupy arrays to list columns with `from_pylibcudf` and `from_ndarray`. E.g.

Checklist