-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement SYCL backend parallel_for
to improve bandwidth utilization
#1976
base: main
Are you sure you want to change the base?
Conversation
parallel_for
to improve bandwidth utilizationparallel_for
to improve bandwidth utilization
085eaf5
to
505bdf3
Compare
include/oneapi/dpl/pstl/utils.h
Outdated
{ | ||
template <typename _Tp> | ||
void | ||
operator()(__lazy_ctor_storage<_Tp> __storage) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you pass __storage
parameter by value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. I have made this a l-value reference.
__par_backend_hetero::access_mode::read_write>( | ||
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f); | ||
auto __n = __last1 - __first1; | ||
if (__n <= 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the case when __n < 0
is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never if a valid sequence is passed :) I switched to __n == 0
.
|
||
// Path that intentionally disables vectorization for algorithms with a scattered access pattern (e.g. binary_search) | ||
template <typename... _Ranges> | ||
class walk_scalar_base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why class walk_scalar_base
declared as class
but
template <typename _ExecutionPolicy, typename _F, typename _Range>
struct walk1_vector_or_scalar : public walk_vector_or_scalar_base<_Range>
declared as struct
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made them all structs for consistency.
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const | ||
{ | ||
// This is needed to enable vectorization | ||
auto __raw_ptr = __rng.begin(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think that
__raw_ptr
isn't very good name becausebegin()
usually linked in mind with iterator. Butraw
usually is some pointer. - Do we really need to have here local variable
__raw_ptr
? Can we pass__rng.begin()
instead of that variable into__vector_walk
call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the contexts in which we vectorize, begin() does return pointers, but I agree the name is confusing.
I have addressed this in a different way due to a performance issue. With uint8_t
types, I found the compiler was not properly vectorizing even when calling begin()
on the set of ranges within the kernel leading to performance regressions (about 30% slower than where we should be). Calling begin
from the host and passing it to the submitter to use in the kernel resolves the issue and gives us good performance.
Since begin()
is called on all ranges and passed through the bricks from the submitter, I have switched from the _Rng
naming to _Acc
here as the underlying type may not be a range. Additional template types are also needed.
Update
Please see the comment: #1976 (comment). All of the begin()
calls in this context have been removed.
So now we have 3 entity with defined
Does these constexpr-variables really has different semantic? And if the semantic of these entities are the same, may be make sense to make some re-design to have only one entity |
In some moments implementation details remind me But what if we instead of two different functions template <typename _IsFull, typename _ItemId>
void
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
// This is needed to enable vectorization
auto __raw_ptr = __rng.begin();
oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
__raw_ptr);
}
// _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
template <typename _IsFull, typename _ItemId>
void
__scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
{
__f(__rng[__idx]);
} we will have some two functions with the same name and the format excepting the first parameter type which will be used as some Please take a look at |
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
One more point: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round of review. I've not gotten to all the details yet, but this is enough to be interesting.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h
Outdated
Show resolved
Hide resolved
These three cases are all unique when you consider that they define The three unique cases I mention are the following:
|
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
With uint8_t types, the icpx compiler fails to vectorize even when calling begin() on our range within a kernel to pull out a raw pointer. To work around this issue, begin() needs to be called on the host and passed to the kernel Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
This reverts commit 1336735.
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
…_path_impl Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
This is well beyond the cutoff point to invoke the large submitter and prevents timeouts observed on devices with many compute units when testing CPU paths. Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
…e_range Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
eb2cdf8
to
2ccb478
Compare
Signed-off-by: Matthew Michel <[email protected]>
A point I would like to make and an open question from me after our offline discussion regarding type support. Vectorization paths are only enabled for arithmetic types as compilers only vectorize through these limited set of types (seems to align with sycl::vec supported types). Even something such as a wrapper struct that wraps around a I do think vectorization through an arbitrary struct is possible so long as it is trivially copyable, and it may be worth investigating performance vectorizing through larger structs with multiple fields. This could be done through vectorizing load / stores of the struct via serializing it into a bytestream and making use of some reinterpret casting to apply any functors to the struct. Careful attention would have to be made to alignment. Such approach would likely have portability restrictions, be optimized for specific hardware vector instructions (e.g. PVC), and would certainly be a precedent in oneDPL, so I do not think it would be appropriate for our generic implementation but rather in a kernel template for a What are others' thoughts here? With the supported types, I believe the usage of __lazy_ctor_storage can be removed from this PR as default constructability is not an issue (I have prepared a patch to do so). |
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
… variable warning Signed-off-by: Matthew Michel <[email protected]>
std::size_t __remaining_elements = __idx >= __n ? 0 : __n - __idx; | ||
std::uint8_t __elements_to_process = | ||
std::min(static_cast<std::size_t>(__base_t::__preferred_vector_size), __remaining_elements); | ||
const _Idx __output_start = __size - __idx - __elements_to_process; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cant this (wont this always) underflow for non-full cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I've tracked down to the stride recommender that this is a std::size_t
, at least sometimes.
Also, is it ever not a std::size_t
? If it is always a std::size_t
then lets call it that rather than a template param.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the non-full case, __idx < __size
should always hold which can be seen in the non-full case of __strided_loop
. It ensures that we do not dispatch an index that is greater than or equal to the size. Actually the logic I have above this that sets __remaining_elements
is unnecessary (the ternary is always false) and can be removed.
Throughout this PR I think I can make all of these brick _Idx
types just std::size_t
as that is what gets passed through. The original implementations were templates but I think it's unnecessary.
Update
I have made these changes.
else | ||
{ | ||
oneapi::dpl::__par_backend_hetero::__vector_reverse<__base_t::__preferred_vector_size>{}( | ||
std::false_type{}, __elements_to_process, __rng1_vector); | ||
for (std::uint8_t __i = 0; __i < __elements_to_process; ++__i) | ||
__rng2[__output_start + __i] = __rng1_vector[__i].__v; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unclear about why this is necessary with the logic already inside the __vector_...
helpers, but I think it is because we can't reverse data which wasn't loaded / initialized. Instead we flip only inside the first part of the local vector. However, it doesn't seem like this final assignment is changing the offset for the final write like I would expect it to.
I must be missing something else because you in theory could still use __vector_store
just with an offset of 0
instead of __output_start
or something.
Also, it seems like we should probably be consistent and use __pstl_assign
here rather than directly written assignment unless there is a reason not to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is tricky. Suppose we have a vector size of 4
with 3
remaining elements in the buffer. _IsFull
will be false. These three elements we will want to store at indices 0, 1, 2
after reversing in registers.
If we did a:
oneapi::dpl::__par_backend_hetero::__vector_store<__base_t::__preferred_vector_size>{__n}(
__is_full, __output_start,
oneapi::dpl::__par_backend_hetero::__lazy_store_transform_op<oneapi::dpl::__internal::__pstl_assign>{},
__rng1_vector, __rng2);
here then the vector operation would try to store 4
elements as the gap between n
and __output_start (0)
is large.
We could replace __n
in the __vector_store
construction with __remaining_elements
which would fix this similar to the vector walk deleter, but when implementing the for loop felt more clear. What makes more sense to you?
Good point on consistency with __pstl_assign
. I will address it.
…d::size_t Signed-off-by: Matthew Michel <[email protected]>
High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel for pattern through two ideas:
Implementation Details
binary_search
)To implement this approach, the parallel for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided loop utilities have been defined with the intention for these to be applicable in other portions of the codebase as well. Tests have been expanded to ensure coverage of vectorization paths.
This PR will supersedes #1870. Initially, the plan was to merge this PR into 1870 but after comparing the diff, I believe the most straightforward approach will be to target this directly to main.