
Add standard data ingestion pipelines to pylibcudf for ndarrays #18311

Open. Wants to merge 8 commits into base: branch-25.06.

Conversation

@Matt711 (Contributor) commented Mar 18, 2025

Description

Contributes to #15132 and #18214. This PR starts with cupy arrays and adds the skeleton code for array-interface objects (e.g. numpy arrays). This change is breaking because from_cuda_array_interface_obj now raises for multi-dimensional arrays.

I think this PR should be sufficient for Curator to replace their existing logic for converting cupy arrays to list columns with from_pylibcudf and from_ndarray, e.g.:

import cudf
import pylibcudf as plc

cudf.Series.from_pylibcudf(
    plc.Column.from_ndarray(...) # call on cupy array
)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


@Matt711 Matt711 added feature request New feature or request breaking Breaking change labels Mar 18, 2025
@Matt711 Matt711 requested a review from a team as a code owner March 18, 2025 17:30
@Matt711 Matt711 requested review from wence- and mroeschke March 18, 2025 17:30
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025
Comment on lines +21 to +26
try:
    import numpy as np
    np_error = None
except ImportError as err:
    np = None
    np_error = err
Contributor Author:

Note: I tried following the approach in #18020 to make the imports optional.

Contributor:

This captures in err (if we have an import error) a traceback that keeps any live variables alive for the lifetime of the process. If that import happens not to be done at top level, that might be a lot of stuff.

I would prefer not saving the error and just if np/cp is None raising when we come to use things.

(We should probably do the same in the scalar handling).

Contributor:

As a compromise we could save the exception class and the string error message, but not the traceback frames, so we have a more faithful representation of the original error. I don't have a strong opinion on how we reduce this, though.
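A minimal sketch of this compromise, rebuilding the exception from its class and message so the traceback (and every frame it references) can be freed:

```python
try:
    import numpy as np
    np_error = None
except ImportError as err:
    np = None
    # Keep the exception class and message, but not the traceback:
    # the rebuilt exception has no __traceback__ attached.
    np_error = type(err)(*err.args)
```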

@@ -321,6 +338,80 @@ cdef class Column:
[]
)

@singledispatchmethod
@classmethod
def from_any(cls, obj):
Contributor:

Suggestion: Maybe we can start with a from_ndarray method that accepts numpy and cupy arrays, and then build up a from_any when we build up more ingestion methods (e.g. from_pyobject, from_pandas, etc.)

Contributor Author:

Would we eventually remove from_ndarray in favor of from_any? If so, I'd rather keep from_any since it's ultimately what we want, and because we're planning on adding more ingestion methods during the 25.06 release.

If we plan to keep it, then I'll change from_any to from_ndarray. I wanted to add a stable API sooner rather than later for Curator's use case. They need to be able to create a list column from a cupy array. It should be public and stable so they can keep doing something like the following without it breaking between cudf releases:

cudf.Series.from_pylibcudf(
    plc.Column.from_any(
        cupy_array
    )
)

Also, I forgot to add a docstring; I'll add that too.

Contributor:

Ah OK, I do see now that @vyasr proposed in #15132 that a from_any method would just dispatch, via singledispatch, to another public method such as from_ndarray. So we would have both a public from_any and a public from_ndarray on the Column class in this PR.

Contributor Author:

Yes, but I'll start with from_ndarray in this PR and add from_any in a follow-up PR.
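The dispatch design discussed in this thread can be sketched with plain functions (the real API would use classmethods on Column; here `list` stands in for np.ndarray/cp.ndarray so neither library is required):

```python
from functools import singledispatch


def from_ndarray(obj):
    # Stand-in for the concrete public constructor.
    return {"kind": "list_column", "data": list(obj)}


@singledispatch
def from_any(obj):
    # Fallback for unsupported input types.
    raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")


@from_any.register(list)
def _(obj):
    # from_any dispatches by input type to the specific constructor.
    return from_ndarray(obj)
```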

@Matt711 Matt711 changed the title Add standard data ingestion pipelines to pylibcudf Add standard data ingestion pipelines to pylibcudf for ndarrays Mar 18, 2025
@Matt711 Matt711 requested review from mroeschke and vyasr March 20, 2025 00:53
@Matt711 (Contributor Author) commented Mar 20, 2025

Narwhals failures are unrelated to this PR.

Comment on lines +3 to +4
import cupy as cp
import numpy as np
Contributor:

Could you define a fixture that does pytest.importorskip("numpy"/"cupy") (like in https://github.com/rapidsai/cudf/pull/18020/files#diff-4ca9193a8f5aa8079576b6ba20cffb2f875bf62fe8f75921f6f151250915947aR11), since technically these are optional dependencies?
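A minimal sketch of the fixture style being suggested (the fixture and test names are invented for the example):

```python
import pytest


@pytest.fixture
def np():
    # Skips the test (rather than erroring) when the optional dependency
    # is absent; returns the imported module otherwise.
    return pytest.importorskip("numpy")


@pytest.fixture
def cp():
    return pytest.importorskip("cupy")


def test_shape(np):
    arr = np.arange(6).reshape(2, 3)
    assert arr.shape == (2, 3)
```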

Comment on lines +38 to +41
@classmethod
def from_cuda_array_interface_obj(cls, obj: Any) -> Column: ...
@classmethod
def from_ndarray(cls, obj: Any) -> Column: ...
Contributor @wence- commented Mar 21, 2025:

Let's tighten this typing. We can define a Protocol:

from typing import TypedDict, Any, Protocol, Union


class CAI(TypedDict):
    shape: tuple[int, ...]
    typestr: str
    data: tuple[int, bool]
    version: int
    strides: None | tuple[int, ...]
    # TODO: better type for this
    descr: None | tuple[Any, ...]
    mask: Union[None, "SupportsCAI"]
    stream: None | int


class SupportsCAI(Protocol):
    @property
    def __cuda_array_interface__(self) -> CAI: ...

@classmethod
def from_cuda_array_interface_obj(cls, obj: SupportsCAI) -> Column: ...

Similarly we can probably do the same for the array interface.
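An analogous Protocol for the host-side interface might look like the following. This is a sketch; the field set follows numpy's `__array_interface__` (version 3) and the optional keys may need adjusting against the numpy documentation.

```python
from typing import Any, Protocol, TypedDict, Union


class AI(TypedDict):
    # Core fields of numpy's __array_interface__ (version 3).
    shape: tuple[int, ...]
    typestr: str
    data: tuple[int, bool]
    version: int
    strides: Union[None, tuple[int, ...]]
    descr: Union[None, list[Any]]


class SupportsAI(Protocol):
    @property
    def __array_interface__(self) -> AI: ...
```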

@staticmethod
def from_cuda_array_interface_obj(object obj):
@classmethod
def from_cuda_array_interface_obj(cls, object obj):
Contributor:

How about we call this from_cuda_array_interface?

Comment on lines +409 to +430
@singledispatchmethod
@classmethod
def _from_ndarray(cls, obj):
    if np_error is not None:
        raise np_error
    if cp_error is not None:
        raise cp_error
    raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")

if np is not None:
    @classmethod
    def from_numpy_array(cls, object obj):
        # TODO: Should expand to support __array_interface__
        raise NotImplementedError(
            "Converting to a pylibcudf Column from "
            "a numpy object is not yet implemented."
        )

    @_from_ndarray.register(np.ndarray)
    @classmethod
    def _(cls, obj):
        return cls.from_numpy_array(obj)
Contributor:

I think we don't need singledispatch, and should just handle anything that supports the array interface.

flat_data = arr.ravel()

num_rows, num_cols = arr.shape
offsets = cp.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=cp.int32)
Contributor:

nit: use size_type.


@@ -360,6 +377,110 @@ cdef class Column:
[]
)

@classmethod
def from_ndarray(cls, obj):
Contributor:

I think we should have two methods:

  • from_array_interface
  • from_cuda_array_interface

Or maybe from_arraylike where the object supports either the cuda array interface or the array interface?

Comment on lines +411 to +416
def _from_ndarray(cls, obj):
    if np_error is not None:
        raise np_error
    if cp_error is not None:
        raise cp_error
    raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")
Contributor:

This raises too eagerly in the case where we're trying to convert from a numpy array and the user has numpy but not cupy.

Comment on lines +441 to +453
data_view = gpumemoryview(flat_data)
offsets_view = gpumemoryview(offsets)
typestr = arr.__cuda_array_interface__['typestr'][1:]

data_col = cls(
    data_type=_datatype_from_dtype_desc(typestr),
    size=flat_data.size,
    data=data_view,
    mask=None,
    null_count=0,
    offset=0,
    children=[],
)
Contributor:

I think if the data are C-contiguous we can do this without a copy.

In that case, we can also avoid requiring cupy, because we can make the offsets column with pylibcudf.filling.sequence.
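A host-side sketch of this idea, with numpy standing in for the device arrays and an invented helper name: a C-contiguous 2-D array can be flattened without a copy, and the offsets are an arithmetic sequence that (on device) pylibcudf.filling.sequence could produce without requiring cupy.

```python
import numpy as np


def list_column_parts(arr: np.ndarray):
    if not arr.flags["C_CONTIGUOUS"]:
        # Only copies when the layout actually requires it.
        arr = np.ascontiguousarray(arr)
    num_rows, num_cols = arr.shape
    # Host-side equivalent of filling.sequence(num_rows + 1, 0, num_cols).
    offsets = np.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=np.int32)
    # reshape(-1) on a contiguous array returns a view, not a copy.
    return arr.reshape(-1), offsets
```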

Comment on lines +438 to +439
num_rows, num_cols = arr.shape
offsets = cp.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=cp.int32)
Contributor:

Someone needs to check that this will not produce a column with more than the maximum number of rows. The way we represent list columns in libcudf is that we have a column of N rows, but the offsets column has N+1 rows. If N == size_type::max(), then N+1 overflows, so we can't represent the offsets.
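A sketch of the guard being asked for. This encodes the C++ constraint in plain Python (the function name is invented): libcudf's size_type is a 32-bit signed int, the offsets column has N+1 entries, and its last entry is N * num_cols, so both must fit in size_type.

```python
SIZE_TYPE_MAX = 2**31 - 1  # numeric_limits<cudf::size_type>::max()


def check_list_column_size(num_rows: int, num_cols: int) -> None:
    # The offsets column holds num_rows + 1 entries, and its final entry
    # is num_rows * num_cols; both must be representable as size_type.
    if num_rows + 1 > SIZE_TYPE_MAX or num_rows * num_cols > SIZE_TYPE_MAX:
        raise ValueError(
            f"A {num_rows}x{num_cols} array cannot be represented as a "
            "list column: offsets would overflow size_type"
        )
```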
