
Add standard data ingestion pipelines to pylibcudf for ndarrays #18311

Open. Wants to merge 8 commits into base: branch-25.06.

Conversation

@Matt711 (Contributor) commented Mar 18, 2025

Description

Contributes to #15132 and #18214. This PR starts with cupy arrays and adds the skeleton code for array-interface objects (e.g. numpy arrays). This change is breaking because from_cuda_array_interface_obj now raises for multi-dimensional arrays.

I think this PR should be sufficient for Curator to replace their existing logic for converting cupy arrays to list columns with from_pylibcudf and from_ndarray, e.g.:

import cudf
import pylibcudf as plc

cudf.Series.from_pylibcudf(
    plc.Column.from_ndarray(...) # call on cupy array
)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


@Matt711 Matt711 added feature request New feature or request breaking Breaking change labels Mar 18, 2025
@Matt711 Matt711 requested a review from a team as a code owner March 18, 2025 17:30
@Matt711 Matt711 requested review from wence- and mroeschke March 18, 2025 17:30
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025
Comment on lines +21 to +26
try:
    import numpy as np
    np_error = None
except ImportError as err:
    np = None
    np_error = err
Contributor Author:

Note: I tried following the approach in #18020 to make the imports optional.

Contributor:

This captures in err (if we have an import error) a traceback that keeps any live variables alive for the lifetime of the process. If that import happens not to be done at top level, that might be a lot of stuff.

I would prefer not saving the error and just if np/cp is None raising when we come to use things.

(We should probably do the same in the scalar handling).

Contributor:

As a compromise we could save the exception class and the string error message, but not the traceback frames, so we have a more faithful representation of the original error. I don't have a strong opinion on how we reduce this, though.
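A minimal sketch of this compromise, rebuilding the exception from its class and message so the traceback (and every frame it references) can be freed:

```python
try:
    import numpy as np
    np_error = None
except ImportError as err:
    np = None
    # Keep the exception class and message, but not the traceback:
    # the rebuilt exception has no __traceback__ attached.
    np_error = type(err)(*err.args)
```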

@@ -321,6 +338,80 @@ cdef class Column:
[]
)

@singledispatchmethod
@classmethod
def from_any(cls, obj):
Contributor:

Suggestion: Maybe we can start with a from_ndarray method that accepts numpy and cupy arrays, and then build up a from_any when we build up more ingestion methods (e.g. from_pyobject, from_pandas, etc.)

Contributor Author:

Would we eventually remove from_ndarray in favor of from_any? If so, I'd rather keep from_any since it's ultimately what we want, and because we're planning on adding more ingestion methods during the 25.06 release.

If we plan to keep it, then I'll change from_any to from_ndarray. I wanted to add a stable API sooner rather than later for Curator's use case. They need to be able to create a list column from a cupy array. It should be public and stable so they can keep doing something like the following without it breaking between cudf releases:

cudf.Series.from_pylibcudf(
    plc.Column.from_any(
        cupy_array
    )
)

Also, I forgot to add a docstring; I'll add that too.

Contributor:

Ah OK, I do see now that @vyasr proposed in #15132 that a from_any method would just dispatch, via singledispatch, to another public method such as from_ndarray. So we would have both a public from_any and a public from_ndarray on the Column class in this PR.

Contributor Author:

Yes, but I'll start with from_ndarray in this PR and add from_any in a follow-up PR.
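The dispatch design discussed in this thread can be sketched with plain functions (the real API would use classmethods on Column; here `list` stands in for np.ndarray/cp.ndarray so neither library is required):

```python
from functools import singledispatch


def from_ndarray(obj):
    # Stand-in for the concrete public constructor.
    return {"kind": "list_column", "data": list(obj)}


@singledispatch
def from_any(obj):
    # Fallback for unsupported input types.
    raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")


@from_any.register(list)
def _(obj):
    # from_any dispatches by input type to the specific constructor.
    return from_ndarray(obj)
```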

@Matt711 Matt711 changed the title Add standard data ingestion pipelines to pylibcudf Add standard data ingestion pipelines to pylibcudf for ndarrays Mar 18, 2025
@Matt711 Matt711 requested review from mroeschke and vyasr March 20, 2025 00:53
@Matt711 (Contributor Author) commented Mar 20, 2025

Narwhals failures are unrelated to this PR.

Comment on lines +3 to +4
import cupy as cp
import numpy as np
Contributor:

Could you define a fixture that does pytest.importorskip("numpy"/"cupy") (like in https://github.com/rapidsai/cudf/pull/18020/files#diff-4ca9193a8f5aa8079576b6ba20cffb2f875bf62fe8f75921f6f151250915947aR11), since technically these are optional dependencies?
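A minimal sketch of the fixture style being suggested (the fixture and test names are invented for the example):

```python
import pytest


@pytest.fixture
def np():
    # Skips the test (rather than erroring) when the optional dependency
    # is absent; returns the imported module otherwise.
    return pytest.importorskip("numpy")


@pytest.fixture
def cp():
    return pytest.importorskip("cupy")


def test_shape(np):
    arr = np.arange(6).reshape(2, 3)
    assert arr.shape == (2, 3)
```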

Comment on lines +38 to +41
@classmethod
def from_cuda_array_interface_obj(cls, obj: Any) -> Column: ...
@classmethod
def from_ndarray(cls, obj: Any) -> Column: ...
Contributor @wence- commented Mar 21, 2025:

Let's tighten this typing. We can define a Protocol:

from typing import TypedDict, Any, Protocol, Union


class CAI(TypedDict):
    shape: tuple[int, ...]
    typestr: str
    data: tuple[int, bool]
    version: int
    strides: None | tuple[int, ...]
    # TODO: better type for this
    descr: None | tuple[Any, ...]
    mask: Union[None, "SupportsCAI"]
    stream: None | int


class SupportsCAI(Protocol):
    @property
    def __cuda_array_interface__(self) -> CAI: ...

@classmethod
def from_cuda_array_interface_obj(cls, obj: SupportsCAI) -> Column: ...

Similarly we can probably do the same for the array interface.
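An analogous Protocol for the host-side interface might look like the following. This is a sketch; the field set follows numpy's `__array_interface__` (version 3) and the optional keys may need adjusting against the numpy documentation.

```python
from typing import Any, Protocol, TypedDict, Union


class AI(TypedDict):
    # Core fields of numpy's __array_interface__ (version 3).
    shape: tuple[int, ...]
    typestr: str
    data: tuple[int, bool]
    version: int
    strides: Union[None, tuple[int, ...]]
    descr: Union[None, list[Any]]


class SupportsAI(Protocol):
    @property
    def __array_interface__(self) -> AI: ...
```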

@staticmethod
def from_cuda_array_interface_obj(object obj):
@classmethod
def from_cuda_array_interface_obj(cls, object obj):
Contributor:

How about we call this from_cuda_array_interface?

Comment on lines +409 to +430
@singledispatchmethod
@classmethod
def _from_ndarray(cls, obj):
    if np_error is not None:
        raise np_error
    if cp_error is not None:
        raise cp_error
    raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")

if np is not None:
    @classmethod
    def from_numpy_array(cls, object obj):
        # TODO: Should expand to support __array_interface__
        raise NotImplementedError(
            "Converting to a pylibcudf Column from "
            "a numpy object is not yet implemented."
        )

    @_from_ndarray.register(np.ndarray)
    @classmethod
    def _(cls, obj):
        return cls.from_numpy_array(obj)
Contributor:

I think we don't need singledispatch, and should just handle anything that supports the array interface.

flat_data = arr.ravel()

num_rows, num_cols = arr.shape
offsets = cp.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=cp.int32)
Contributor:

nit: use size_type.


@@ -360,6 +377,110 @@ cdef class Column:
[]
)

@classmethod
def from_ndarray(cls, obj):
Contributor:

I think we should have two methods:

  • from_array_interface
  • from_cuda_array_interface

Or maybe from_arraylike where the object supports either the cuda array interface or the array interface?

Comment on lines +411 to +416
def _from_ndarray(cls, obj):
    if np_error is not None:
        raise np_error
    if cp_error is not None:
        raise cp_error
    raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")
Contributor:

This raises too eagerly in the case where we're trying to convert from a numpy array and the user has numpy but not cupy.

Comment on lines +441 to +453
data_view = gpumemoryview(flat_data)
offsets_view = gpumemoryview(offsets)
typestr = arr.__cuda_array_interface__['typestr'][1:]

data_col = cls(
    data_type=_datatype_from_dtype_desc(typestr),
    size=flat_data.size,
    data=data_view,
    mask=None,
    null_count=0,
    offset=0,
    children=[],
)
Contributor:

I think if the data are C-contiguous we can do this without a copy.

In that case, we can also avoid requiring cupy, because we can make the offsets column with pylibcudf.filling.sequence.
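A host-side sketch of this idea, with numpy standing in for the device arrays and an invented helper name: a C-contiguous 2-D array can be flattened without a copy, and the offsets are an arithmetic sequence that (on device) pylibcudf.filling.sequence could produce without requiring cupy.

```python
import numpy as np


def list_column_parts(arr: np.ndarray):
    if not arr.flags["C_CONTIGUOUS"]:
        # Only copies when the layout actually requires it.
        arr = np.ascontiguousarray(arr)
    num_rows, num_cols = arr.shape
    # Host-side equivalent of filling.sequence(num_rows + 1, 0, num_cols).
    offsets = np.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=np.int32)
    # reshape(-1) on a contiguous array returns a view, not a copy.
    return arr.reshape(-1), offsets
```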

Comment on lines +438 to +439
num_rows, num_cols = arr.shape
offsets = cp.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=cp.int32)
Contributor:

Someone needs to check that this will not produce a column with more than the maximum number of rows. The way we represent list columns in libcudf is that we have a column of N rows, but the offsets column has N+1 rows. If N == size_type::max(), then N+1 overflows, so we can't represent the offsets.
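A sketch of the guard being asked for. This encodes the C++ constraint in plain Python (the function name is invented): libcudf's size_type is a 32-bit signed int, the offsets column has N+1 entries, and its last entry is N * num_cols, so both must fit in size_type.

```python
SIZE_TYPE_MAX = 2**31 - 1  # numeric_limits<cudf::size_type>::max()


def check_list_column_size(num_rows: int, num_cols: int) -> None:
    # The offsets column holds num_rows + 1 entries, and its final entry
    # is num_rows * num_cols; both must be representable as size_type.
    if num_rows + 1 > SIZE_TYPE_MAX or num_rows * num_cols > SIZE_TYPE_MAX:
        raise ValueError(
            f"A {num_rows}x{num_cols} array cannot be represented as a "
            "list column: offsets would overflow size_type"
        )
```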
