Add standard data ingestion pipelines to pylibcudf for ndarrays #18311

Open · wants to merge 8 commits into base: branch-25.06
Changes from 3 commits
6 changes: 4 additions & 2 deletions python/pylibcudf/pylibcudf/column.pyi
@@ -35,8 +35,10 @@ class Column:
    def from_scalar(scalar: Scalar, size: int) -> Column: ...
    @staticmethod
    def all_null_like(like: Column, size: int) -> Column: ...
    @staticmethod
    def from_cuda_array_interface_obj(obj: Any) -> Column: ...
    @classmethod
    def from_cuda_array_interface_obj(cls, obj: Any) -> Column: ...
    @classmethod
    def from_any(cls, obj: Any) -> Column: ...

class ListColumnView:
    def __init__(self, column: Column): ...
127 changes: 122 additions & 5 deletions python/pylibcudf/pylibcudf/column.pyx
@@ -15,7 +15,22 @@ from .scalar cimport Scalar
from .types cimport DataType, size_of, type_id
from .utils cimport int_to_bitmask_ptr, int_to_void_ptr

import functools
from functools import cache, singledispatchmethod


try:
    import numpy as np
    np_error = None
except ImportError as err:
    np = None
    np_error = err
Comment on lines +21 to +26
Contributor Author:
Note: Tried following the approach in #18020 to make imports optional.

Contributor:
This captures in err (if we have an import error) any live variables for the lifetime of the process. If that import happens to be done somewhere other than top level, that might be a lot of stuff.

I would prefer not saving the error and instead, if np/cp is None, raising when we come to use them.

(We should probably do the same in the scalar handling.)

Contributor:
As a compromise, we could save the exception class and the string error message, but not the traceback frames, so we have a more faithful representation of the original error. I don't have a strong opinion on how we reduce this, though.
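A minimal sketch of the pattern suggested here (illustrative only; the helper name is not part of the PR): import optionally, store no error object, and raise only at the point of use.

try:
    import numpy as np
except ImportError:
    np = None

def _require_numpy():
    # Hypothetical helper: fail only when numpy functionality is actually needed.
    if np is None:
        raise ImportError("numpy is required for this operation")
    return np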


try:
    import cupy as cp
    cp_error = None
except ImportError as err:
    cp = None
    cp_error = err

__all__ = ["Column", "ListColumnView", "is_c_contiguous"]

@@ -275,8 +290,8 @@ cdef class Column:
        c_result = make_column_from_scalar(dereference(slr.get()), size)
        return Column.from_libcudf(move(c_result))

    @staticmethod
    def from_cuda_array_interface_obj(object obj):
    @classmethod
    def from_cuda_array_interface_obj(cls, object obj):
Contributor:
How about we call this from_cuda_array_interface?

"""Create a Column from an object with a CUDA array interface.

Parameters
Expand Down Expand Up @@ -310,8 +325,10 @@ cdef class Column:
        ):
            raise ValueError("Data must be C-contiguous")

        if len(iface['shape']) > 1:
            raise ValueError("Data must be 1-dimensional")
        size = iface['shape'][0]
        return Column(
        return cls(
            data_type,
            size,
            data,
@@ -321,6 +338,106 @@
            []
        )

    @singledispatchmethod
    @classmethod
    def from_ndarray(cls, obj):
Contributor:
I think we should have two methods:

  • from_array_interface
  • from_cuda_array_interface

Or maybe from_arraylike where the object supports either the cuda array interface or the array interface?
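A rough sketch of the from_arraylike idea; from_cuda_array_interface and from_array_interface are the names proposed above and are assumptions here, not existing pylibcudf API.

class Column:
    # ... existing constructors ...

    @classmethod
    def from_arraylike(cls, obj):
        # Prefer the device interface when both are present, then fall back
        # to the host interface; dispatch to the (proposed) constructors.
        if hasattr(obj, "__cuda_array_interface__"):
            return cls.from_cuda_array_interface(obj)
        if hasattr(obj, "__array_interface__"):
            return cls.from_array_interface(obj)
        raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")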

"""
Create a Column from any object which supports the NumPy array interface.

Parameters
----------
obj : Any
The input array to be converted into a `pylibcudf.Column`.

Returns
-------
Column

Raises
------
TypeError
If the input type is neither `numpy.ndarray` nor `cupy.ndarray`.
ImportError
If NumPy or CuPy is required but not installed.

Notes
-----
- If `obj` is a 2D CuPy array, the resulting column is a list column.
- NumPy conversion logic is not yet implemented.
- Multi-dimensional arrays (ndim > 2) are not supported.
"""
if np_error is not None:
raise np_error
if cp_error is not None:
raise cp_error
raise TypeError(f"Cannot convert a {type(obj)} to a pylibcudf Column")

    if np is not None:
        @classmethod
        def from_numpy_array(cls, object obj):
            # TODO: Should expand to support __array_interface__
            raise NotImplementedError(
                "Converting to a pylibcudf Column from "
                "a numpy object is not yet implemented."
            )

        @from_ndarray.register(np.ndarray)
        @classmethod
        def _(cls, obj):
            return cls.from_numpy_array(obj)

    if cp is not None:
        @classmethod
        def _from_2d_cupy_array(cls, object arr):
            """Convert a 2D CuPy array to a Column."""
            flat_data = arr.ravel()

            num_rows, num_cols = arr.shape
            offsets = cp.arange(0, (num_rows + 1) * num_cols, num_cols, dtype=cp.int32)
Contributor:
nit: use size_type.

Comment on lines +438 to +439
Contributor:
Someone needs to check that this will not produce a column with more than the maximum number of rows. The way we represent list columns in libcudf, a column of N rows has an offsets child of N + 1 rows; if N == size_type::max(), then N + 1 overflows and we cannot represent the offsets.
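A hedged sketch of the kind of guard being asked for, assuming libcudf's size_type is a 32-bit signed integer:

import numpy as np

SIZE_TYPE_MAX = np.iinfo(np.int32).max  # libcudf size_type

def check_list_column_bounds(num_rows):
    # The offsets child needs num_rows + 1 entries, so num_rows must stay
    # strictly below size_type's maximum for the offsets to be representable.
    if num_rows >= SIZE_TYPE_MAX:
        raise ValueError("Too many rows to represent list offsets with size_type")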


            data_view = gpumemoryview(flat_data)
            offsets_view = gpumemoryview(offsets)
            typestr = arr.__cuda_array_interface__['typestr'][1:]

            data_col = cls(
                data_type=_datatype_from_dtype_desc(typestr),
                size=flat_data.size,
                data=data_view,
                mask=None,
                null_count=0,
                offset=0,
                children=[],
            )
Comment on lines +441 to +453
Contributor:
I think if the data are C-contiguous we can do this without a copy.

In that case, we can also avoid requiring cupy, because we can make the offsets column with pylibcudf.filling.sequence.
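One possible sketch of that suggestion: build the offsets with pylibcudf.filling.sequence instead of cupy. Constructing the scalars via pyarrow interop is an assumption here, just one convenient route.

import pyarrow as pa
import pylibcudf as plc

def offsets_from_sequence(num_rows, num_cols):
    # offsets = [0, num_cols, 2 * num_cols, ..., num_rows * num_cols]
    start = plc.interop.from_arrow(pa.scalar(0, type=pa.int32()))
    step = plc.interop.from_arrow(pa.scalar(num_cols, type=pa.int32()))
    return plc.filling.sequence(num_rows + 1, start, step)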

            offsets_col = cls(
                data_type=DataType(type_id.INT32),
                size=num_rows + 1,
                data=offsets_view,
                mask=None,
                null_count=0,
                offset=0,
                children=[],
            )
            return cls(
                data_type=DataType(type_id.LIST),
                size=num_rows,
                data=None,
                mask=None,
                null_count=0,
                offset=0,
                children=[offsets_col, data_col],
            )

        @from_ndarray.register(cp.ndarray)
        @classmethod
        def _(cls, obj):
            ndim = len(obj.shape)
            if ndim == 1:
                return cls.from_cuda_array_interface_obj(obj)
            elif ndim == 2:
                return cls._from_2d_cupy_array(obj)
            else:
                raise ValueError("Must pass a 1D or 2D CuPy array only")
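For reference, a usage sketch of from_ndarray as written in this commit, assuming cupy is installed and the singledispatch resolves as intended:

import cupy as cp
import pylibcudf as plc

col_1d = plc.Column.from_ndarray(cp.arange(4, dtype=cp.int64))
# 4-row INT64 column

col_2d = plc.Column.from_ndarray(cp.arange(6, dtype=cp.int64).reshape(2, 3))
# 2-row LIST column, each row a list of 3 INT64 values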

    cpdef DataType type(self):
        """The type of data in the column."""
        return self._data_type
@@ -407,7 +524,7 @@ cdef class ListColumnView:
        return lists_column_view(self._column.view())


@functools.cache
@cache
def _datatype_from_dtype_desc(desc):
    mapping = {
        'u1': type_id.UINT8,