
Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Open · wants to merge 72 commits into base: main
Conversation

@mmichel11 (Contributor) commented Dec 19, 2024

High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel_for pattern through two ideas:

  • Process multiple input iterations per work-item. This involves a switch to an nd_range kernel combined with a sub-group / work-group strided indexing approach.
  • To generate wide loads for small types, implement a path that vectorizes loads / stores by processing adjacent indices within a single work-item. This is combined with the above approach to maximize hardware bandwidth utilization. Vectorization is only applied to fundamental types smaller than 4 bytes (e.g. uint16_t, uint8_t) backed by a contiguous container.

Implementation Details

  • Parallel for bricks have been reworked in the following manner:
    • Each brick contains a pack of ranges within its template parameters used to define tuning parameters.
    • The following static integral members are defined (implemented with inheritance):
      • __can_vectorize
      • __preferred_vector_size (1 if __can_vectorize is false)
      • __preferred_iters_per_item
    • The following public member functions are defined:
      • __scalar_path (explicitly called for small input sizes)
      • __vector_path (optional, for algorithms that are not vectorizable, e.g. binary_search)
      • An overloaded function call operator that dispatches to the appropriate strategy

To implement this approach, the parallel for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided loop utilities have been defined with the intention that they be applicable in other portions of the codebase as well. Tests have been expanded to ensure coverage of vectorization paths.

This PR supersedes #1870. Initially, the plan was to merge this PR into #1870, but after comparing the diffs, I believe the most straightforward approach is to target this directly to main.

@mmichel11 mmichel11 added this to the 2022.8.0 milestone Dec 19, 2024
@mmichel11 mmichel11 marked this pull request as ready for review December 19, 2024 19:17
@mmichel11 mmichel11 changed the title [Draft] Re-implement SYCL backend parallel_for to improve bandwidth utilization Re-implement SYCL backend parallel_for to improve bandwidth utilization Dec 19, 2024
@mmichel11 mmichel11 force-pushed the dev/mmichel11/parallel_for_vectorize branch from 085eaf5 to 505bdf3 Compare December 19, 2024 22:13
{
template <typename _Tp>
void
operator()(__lazy_ctor_storage<_Tp> __storage) const
Contributor

Why do you pass the __storage parameter by value?

Contributor Author

Great catch. I have made this an l-value reference.

__par_backend_hetero::access_mode::read_write>(
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f);
auto __n = __last1 - __first1;
if (__n <= 0)
Contributor

What is the case when __n < 0 is true?

Contributor Author


Never if a valid sequence is passed :) I switched to __n == 0.


// Path that intentionally disables vectorization for algorithms with a scattered access pattern (e.g. binary_search)
template <typename... _Ranges>
class walk_scalar_base
Contributor

Why is walk_scalar_base declared as a class but

template <typename _ExecutionPolicy, typename _F, typename _Range>
struct walk1_vector_or_scalar : public walk_vector_or_scalar_base<_Range>

declared as a struct?

Contributor Author

I have made them all structs for consistency.

__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
// This is needed to enable vectorization
auto __raw_ptr = __rng.begin();
Contributor

  1. I don't think __raw_ptr is a very good name: begin() is usually associated with iterators, while "raw" usually suggests a pointer.
  2. Do we really need the local variable __raw_ptr here? Can we pass __rng.begin() directly into the __vector_walk call instead?

Contributor Author · @mmichel11 commented Jan 6, 2025


In the contexts in which we vectorize, begin() does return a pointer, but I agree the name is confusing.

I have addressed this in a different way due to a performance issue. With uint8_t types, I found the compiler was not properly vectorizing even when calling begin() on the set of ranges within the kernel, leading to performance regressions (about 30% slower than where we should be). Calling begin() from the host and passing the result to the submitter for use in the kernel resolves the issue and gives us good performance.

Since begin() is called on all ranges and passed through the bricks from the submitter, I have switched from the _Rng naming to _Acc here, as the underlying type may not be a range. Additional template types are also needed.

Update
Please see the comment: #1976 (comment). All of the begin() calls in this context have been removed.

@SergeyKopienko (Contributor) commented Dec 23, 2024

So now we have 3 entities that define constexpr static bool __can_vectorize:

  1. class walk_vector_or_scalar_base
  2. class walk_scalar_base
  3. struct __brick_shift_left

Do these constexpr variables really have different semantics?

And if the semantics of these entities are the same, maybe it makes sense to do some re-design to have only one __can_vectorize entity?

@SergeyKopienko (Contributor)

In some ways the implementation details remind me of the tag dispatching designed by @rarutyun.
But with some differences: for example, walk2_vectors_or_scalars carries not only the information about whether vectorization or parallelization should be executed, but also two variants of the functional code and an operator() with a compile-time condition check to run one code path or the other.

But what if we instead of two different functions

    template <typename _IsFull, typename _ItemId>
    void
    __vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        // This is needed to enable vectorization
        auto __raw_ptr = __rng.begin();
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
                                                                                                 __raw_ptr);
    }

    // _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
    template <typename _IsFull, typename _ItemId>
    void
    __scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
    {

        __f(__rng[__idx]);
    }

we would have two functions with the same name and signature except for the first parameter type, which would be used as a tag?

Please take a look at __parallel_policy_tag_selector_t for details.

@SergeyKopienko (Contributor)

One more point: __vector_path and __scalar_path tell me about some path but not an implementation.
Maybe it is better to rename them to ..._impl?

Contributor · @danhoeflinger left a comment


First round of review. I've not gotten to all the details yet, but this is enough to be interesting.

@mmichel11 (Contributor Author)

So now we have 3 entity with defined constexpr static bool __can_vectorize :

  1. class walk_vector_or_scalar_base
  2. class walk_scalar_base
  3. struct __brick_shift_left

Does these constexpr-variables really has different semantic?

And if the semantic of these entities are the same, may be make sense to make some re-design to have only one entity __can_vectorize ?

These three cases are all unique when you consider that they define __can_vectorize, __preferred_vector_size, and __preferred_iters_per_item. These three fields are tightly coupled, so in my opinion it makes sense to define them together for readability. If we were to define a single __can_vectorize, then I think it would need to function more like a trait class dependent on the provided brick, as the brick itself plays a role in whether or not vectorization is possible. This design would not gain us much in my opinion, as we would still need specializations for the different cases.

The three unique cases I mention are the following:

  1. struct walk_vector_or_scalar_base - Vectorization is possible so long as the ranges meet the requirements to be vectorizable. This is then used to determine the iterations per item and the vector size.
  2. struct walk_scalar_base - Vectorization is not possible due to some limitation of the brick. Binary search is a good example since its accesses are non-sequential. The iterations per work-item are still set based on the size of the provided ranges.
  3. struct __brick_shift_left - This brick has a limitation that prevents vectorization and allows only one iteration per item to be processed; it is a special case.

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
With uint8_t types, the icpx compiler fails to vectorize even when
calling begin() on our range within a kernel to pull out a raw pointer.
To work around this issue, begin() needs to be called on the host and
passed to the kernel.

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
This is well beyond the cutoff point to invoke the large submitter and
prevents timeouts observed on devices with many compute units when
testing CPU paths.

Signed-off-by: Matthew Michel <[email protected]>
@mmichel11 mmichel11 force-pushed the dev/mmichel11/parallel_for_vectorize branch from eb2cdf8 to 2ccb478 Compare January 16, 2025 01:25
@mmichel11 (Contributor Author)

A point I would like to make and an open question from me after our offline discussion regarding type support.

Vectorization paths are only enabled for arithmetic types, as compilers only vectorize through this limited set of types (which seems to align with the types sycl::vec supports). Even something such as a wrapper struct around a uint8_t does not generate vector instructions, which is why I have enforced fundamental types in walk_vector_or_scalar_base. Vectorization is applied to 1- and 2-byte types, and I hypothesize user-provided structs will likely be bigger than this.

I do think vectorization through an arbitrary struct is possible so long as it is trivially copyable, and it may be worth investigating the performance of vectorizing through larger structs with multiple fields. This could be done by vectorizing loads / stores of the struct, serializing it into a byte stream and using some reinterpret casting to apply any functors to the struct. Careful attention would have to be paid to alignment.

Such an approach would likely have portability restrictions, be optimized for specific hardware vector instructions (e.g. PVC), and would certainly set a precedent in oneDPL, so I do not think it would be appropriate for our generic implementation but rather for a kernel template for a for_each or transform operation. I plan to mention this in a discussion I will create after the milestone.

What are others' thoughts here? With the supported types, I believe the usage of __lazy_ctor_storage can be removed from this PR as default constructability is not an issue (I have prepared a patch to do so).

std::size_t __remaining_elements = __idx >= __n ? 0 : __n - __idx;
std::uint8_t __elements_to_process =
std::min(static_cast<std::size_t>(__base_t::__preferred_vector_size), __remaining_elements);
const _Idx __output_start = __size - __idx - __elements_to_process;
Contributor · @danhoeflinger commented Jan 17, 2025


Can't this (won't this always) underflow for non-full cases?

Contributor


I believe I've tracked it down to the stride recommender: this is a std::size_t, at least sometimes.
Also, is it ever not a std::size_t? If it is always a std::size_t, then let's call it that rather than a template param.

Contributor Author · @mmichel11 commented Jan 17, 2025


For the non-full case, __idx < __size should always hold, which can be seen in the non-full case of __strided_loop. It ensures that we do not dispatch an index that is greater than or equal to the size. Actually, the logic above this that sets __remaining_elements is unnecessary (the ternary is always false) and can be removed.

Throughout this PR, I think I can make all of these brick _Idx types just std::size_t, as that is what gets passed through. The original implementations were templates, but I think that is unnecessary.

Update
I have made these changes.

Comment on lines +1280 to +1286
else
{
oneapi::dpl::__par_backend_hetero::__vector_reverse<__base_t::__preferred_vector_size>{}(
std::false_type{}, __elements_to_process, __rng1_vector);
for (std::uint8_t __i = 0; __i < __elements_to_process; ++__i)
__rng2[__output_start + __i] = __rng1_vector[__i].__v;
}
Contributor · @danhoeflinger commented Jan 17, 2025


I was unclear about why this is necessary given the logic already inside the __vector_... helpers, but I think it is because we can't reverse data that wasn't loaded / initialized. Instead we flip only within the first part of the local vector. However, it doesn't seem like this final assignment changes the offset for the final write the way I would expect it to.

I must be missing something else, because in theory you could still use __vector_store just with an offset of 0 instead of __output_start or something.

Also, it seems like we should be consistent and use __pstl_assign here rather than a directly written assignment unless there is a reason not to.

Contributor Author · @mmichel11 commented Jan 17, 2025


This case is tricky. Suppose we have a vector size of 4 with 3 remaining elements in the buffer. _IsFull will be false. We will want to store these three elements at indices 0, 1, 2 after reversing them in registers.

If we did a:

oneapi::dpl::__par_backend_hetero::__vector_store<__base_t::__preferred_vector_size>{__n}(
    __is_full, __output_start,
    oneapi::dpl::__par_backend_hetero::__lazy_store_transform_op<oneapi::dpl::__internal::__pstl_assign>{},
    __rng1_vector, __rng2);

here, then the vector operation would try to store 4 elements, as the gap between __n and __output_start (0) is large.

We could replace __n in the __vector_store construction with __remaining_elements, which would fix this similarly to the vector walk deleter, but the for loop felt clearer when implementing. Which makes more sense to you?

Good point on consistency with __pstl_assign. I will address it.
