[FEA] Accelerate cupy array creation from DataFrame.values #16483

bdice · 2024-08-02T15:53:46Z

Is your feature request related to a problem? Please describe.
Users with large numerical datasets (such as a dataframe with thousands of time-series columns) want to convert a cuDF dataframe to a cupy array as quickly as possible. Currently we have a raw Python loop that casts and assigns each column one at a time.

cudf/python/cudf/cudf/core/frame.py, lines 458 to 463 in a8a3670:

    for i, col in enumerate(self._data.values()):
        # TODO: col.values may fail if there is nullable data or an
        # unsupported dtype. We may want to catch and provide a more
        # suitable error.
        matrix[:, i] = to_array(col, dtype)
    return matrix

It should be possible to lower this into libcudf, and use a kernel that calls a batched memcpy from CCCL CUB to copy the same-type input columns into the matrix directly. Some columns may require casting, and that work could be launched in separate streams.

Describe the solution you'd like

template<typename T>
void table_to_array(cudf::table_view input, cudf::device_span<T> output) {
    CUDF_EXPECTS(std::all_of( ... /* all input columns are convertible to T */),
                 "All input columns must be convertible to T.",
                 cudf::data_type_error);
    // 1. Get boolean map of which columns already match the output type
    // 2. Call batchmemcpy on all matching columns
    // 3. Use a thrust transform with custom input and output iterators for casting all other types
    //    This is nontrivial but shouldn't be too hard. Some device-side type dispatch, maybe.
}
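
The three steps in the pseudocode above can be modeled on the host with NumPy, whose array API cupy mirrors. This is only a reference sketch of the intended semantics, not the proposed libcudf kernel: `table_to_array_reference` is a hypothetical name, a Python list of 1-D arrays stands in for `cudf::table_view`, and the per-column assignments stand in for the batched memcpy and the casting transform.

```python
import numpy as np


def table_to_array_reference(columns, out):
    """Fill the F-ordered matrix `out` from a list of equal-length 1-D arrays.

    Hypothetical host-side model of the proposed table_to_array; NumPy
    stands in for device memory.
    """
    n_rows, n_cols = out.shape
    if len(columns) != n_cols or any(len(c) != n_rows for c in columns):
        raise ValueError("columns do not match the shape of the output")
    if not out.flags.f_contiguous:
        raise ValueError("output must be F-ordered (column-major)")
    if not all(np.can_cast(c.dtype, out.dtype, casting="same_kind") for c in columns):
        raise TypeError("all input columns must be convertible to the output dtype")

    # 1. Boolean map of which columns already match the output type.
    matches = [c.dtype == out.dtype for c in columns]

    # 2. Matching columns are plain copies into contiguous output columns
    #    (the part a batched memcpy would handle in one launch).
    for i, col in enumerate(columns):
        if matches[i]:
            out[:, i] = col

    # 3. Remaining columns are cast element-wise while they are copied.
    for i, col in enumerate(columns):
        if not matches[i]:
            out[:, i] = col.astype(out.dtype)

    return matches


columns = [
    np.array([1.0, 2.0, 3.0], dtype=np.float64),
    np.array([4, 5, 6], dtype=np.int32),
]
matrix = np.empty((3, 2), dtype=np.float64, order="F")
matches = table_to_array_reference(columns, matrix)
```

Here the first column matches the output dtype and would go through the memcpy path, while the `int32` column would go through the casting path.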

Describe alternatives you've considered
~~We might be able to use cudf::contiguous_copy_column_device_views, but that requires all the types to be the same.~~ (I was wrong, this is not what I want.) I think the best performing solution would do casting of any compatible input type to the target type as it copies.

We could also make the API take a void * and a cudf::data_type output_dtype? I'm not sure. I think it is important for this to have an output parameter and let the data be allocated by cupy with matrix = cupy.empty(shape=(len(self), ncol), dtype=dtype, order="F") like we already do here.
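
One way to picture the `void *` plus `cudf::data_type` variant: the caller owns an untyped buffer and passes the element type separately, and the callee reinterprets that buffer as a column-major matrix before filling it. The sketch below models this on the host, with NumPy standing in for cupy. `fill_matrix` and its parameters are hypothetical names, and a `bytearray` plays the role of the raw pointer.

```python
import numpy as np


def fill_matrix(columns, out_buffer, out_dtype):
    """Write `columns` into caller-owned, untyped storage as an F-ordered matrix.

    Hypothetical model of an API taking `void* output` plus an output dtype.
    """
    if not columns:
        raise ValueError("at least one column is required")
    out_dtype = np.dtype(out_dtype)
    n_rows, n_cols = len(columns[0]), len(columns)
    expected_bytes = n_rows * n_cols * out_dtype.itemsize
    if len(out_buffer) != expected_bytes:
        raise ValueError(f"expected {expected_bytes} bytes, got {len(out_buffer)}")

    # Reinterpret the raw bytes as typed elements without copying, then view
    # them as (n_rows, n_cols) in column-major order.
    flat = np.frombuffer(out_buffer, dtype=out_dtype)
    matrix = flat.reshape((n_rows, n_cols), order="F")

    for i, col in enumerate(columns):
        matrix[:, i] = col  # assignment casts to out_dtype where needed
    return matrix


columns = [np.array([1, 2, 3], dtype=np.int64), np.array([0.5, 1.5, 2.5])]
buffer = bytearray(3 * 2 * 4)  # caller allocates: 3 rows x 2 columns x 4 bytes
matrix = fill_matrix(columns, buffer, np.float32)
```

Because the callee never allocates, the same call shape would work with memory obtained from `cupy.empty(..., order="F")`, which is the property the output parameter is meant to preserve.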


bdice · 2024-08-02T15:56:52Z

> We could also make the API take a void* output and a cudf::data_type output_dtype?

Yes, this is probably the right way. We ought to do host-side type-dispatch to determine which kernel to call, and device-side type-dispatch to handle casting the various input column types.

bdice · 2024-08-02T15:58:44Z

We might be able to shortcut this and just accelerate the "easy" path where all types are the same, to start out.
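
A rough sketch of what that split could look like, again using NumPy on the host as a stand-in for cupy and libcudf. `to_matrix` is a hypothetical helper. The point it illustrates is that column-major storage is just the columns laid end to end, so when no casting is needed a single bulk copy fills the whole matrix.

```python
import numpy as np


def to_matrix(columns, dtype):
    """Build an F-ordered matrix, taking a bulk-copy path when no cast is needed."""
    dtype = np.dtype(dtype)
    n_rows, n_cols = len(columns[0]), len(columns)
    out = np.empty((n_rows, n_cols), dtype=dtype, order="F")

    if all(col.dtype == dtype for col in columns):
        # Easy path: out.T is C-contiguous, so flattening it gives a 1-D view
        # of out's memory in which the columns appear back to back.
        flat = out.T.reshape(-1)
        np.concatenate(columns, out=flat)
        return out, "fast"

    # General path: per-column assignment, casting where required.
    for i, col in enumerate(columns):
        out[:, i] = col
    return out, "general"


same = [np.array([1, 2], dtype=np.int32), np.array([3, 4], dtype=np.int32)]
mixed = [np.array([1, 2], dtype=np.int32), np.array([3.0, 4.0])]
fast_result, fast_path = to_matrix(same, np.int32)
general_result, general_path = to_matrix(mixed, np.float64)
```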

mroeschke · 2024-08-02T16:59:14Z

xref #11648

bdice · 2024-08-02T19:07:44Z

Also xref #12928 - I don't think this is a duplicate issue, since that one focuses on transpose-related issues and this one offers concrete proposals for implementation. I think this might solve the same problem, but it may depend on the implementation choices.

bdice · 2025-03-14T17:30:27Z

Sharing some notes from a related offline discussion:

cuDF table -> cupy array:
- cannot be a view because columns aren't contiguous
- a batchmemcpy can generate contiguous output very efficiently if we produce F ordered output, otherwise a transpose is needed (out of scope for this issue)
- this path is the focus of this issue
cupy array -> cuDF table:
- can be a view for each column (if input array is F ordered)
- typically arrays are not F ordered unless you made them that way on purpose. In C-ordered cases, a transpose is needed.
cuDF lists column -> cupy array:
- can be a view (zero-copy, produces C ordered output)
- Snippet: ser.list.leaves.values.reshape(len(ser), -1)
cupy array -> cuDF lists column:
- can be a view (if input array is C ordered) but requires offsets construction
- [FEA] API for Creating List Columns from CuPy Arrays #18214
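
The view and copy distinctions in these notes come down to memory layout, which can be checked on the host with NumPy (cupy follows the same layout rules). This is an illustrative sketch only: the lists column is modeled as a flat leaves array plus offsets, all lists are assumed to have the same length, and the variable names are made up for the example.

```python
import numpy as np

n_rows, n_cols = 4, 3

# F-ordered array -> columns: each column is a contiguous, zero-copy view.
f_arr = np.asfortranarray(
    np.arange(n_rows * n_cols, dtype=np.float64).reshape(n_rows, n_cols)
)
f_col = f_arr[:, 1]
f_col_is_view = np.shares_memory(f_col, f_arr) and f_col.flags.c_contiguous

# C-ordered array -> columns: a column is strided, so producing contiguous
# columns requires a transpose (a copy).
c_arr = np.ascontiguousarray(f_arr)
c_col_is_contiguous = c_arr[:, 1].flags.c_contiguous

# Lists column -> array: the flat leaves of equal-length lists reshape into a
# C-ordered matrix without copying, mirroring
# ser.list.leaves.values.reshape(len(ser), -1).
leaves = np.arange(n_rows * n_cols, dtype=np.float64)
as_matrix = leaves.reshape(n_rows, -1)

# Array -> lists column: the leaves can be a view of a C-ordered array, but
# the offsets marking where each list starts and ends must be constructed.
offsets = np.arange(0, (n_rows + 1) * n_cols, n_cols, dtype=np.int32)
```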

bdice added the feature request New feature or request label Aug 2, 2024

bdice added the Performance Performance related issue label Aug 2, 2024

vyasr added this to cuDF Python Mar 13, 2025

github-project-automation bot moved this to Todo in cuDF Python Mar 13, 2025

vyasr assigned Matt711 Mar 13, 2025
