Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Open: wants to merge 72 commits into base: main

Commits (72)
f3acdca
Optimize memory transactions in SYCL backend parallel for
mmichel11 Sep 5, 2024
06e06ff
clang-format
mmichel11 Sep 5, 2024
ab7a75f
Correct comment and error handling.
mmichel11 Sep 6, 2024
ec0761c
__num_groups bugfix
mmichel11 Sep 10, 2024
281f642
Introduce stride recommender for different targets and better distrib…
mmichel11 Sep 16, 2024
6ffb904
Cleanup
mmichel11 Sep 16, 2024
fad85fe
Unroll loop if possible
mmichel11 Sep 18, 2024
329f000
Revert "Unroll loop if possible"
mmichel11 Sep 18, 2024
420bd6c
Use a small and large kernel in parallel for
mmichel11 Sep 20, 2024
ef78c6a
Improve __iters_per_work_item heuristic.
mmichel11 Sep 20, 2024
7883c3e
Code cleanup
mmichel11 Sep 20, 2024
5c12d66
Clang format
mmichel11 Sep 23, 2024
36a602b
Update comments
mmichel11 Sep 23, 2024
4d645f6
Bugfix in comment
mmichel11 Sep 23, 2024
ca9a06f
More cleanup and better handle non-full case
mmichel11 Sep 23, 2024
3713d62
Rename __ndi to __item for consistency with codebase
mmichel11 Sep 24, 2024
305bf2b
Update all comments on kernel naming trick
mmichel11 Sep 24, 2024
3b50010
Handle non-full case in a cleaner way
mmichel11 Sep 24, 2024
8e5de99
Switch min tuple type utility to return size of type
mmichel11 Sep 24, 2024
65e0b05
Remove unnecessary template parameter
mmichel11 Sep 24, 2024
257815a
Make non-template function inline for ODR compliance
mmichel11 Sep 24, 2024
3929705
If the iters per work item is 1, then only compile the basic pfor kernel
mmichel11 Sep 24, 2024
31a7aae
Address several PR comments
mmichel11 Sep 25, 2024
08d24aa
Remove free function __stride_recommender
mmichel11 Sep 25, 2024
1748a6b
Accept ranges as forwarding references in __parallel_for_large_submitter
mmichel11 Sep 25, 2024
cc829e5
Address reviewer comments
mmichel11 Nov 6, 2024
8dc7706
Introduce vectorized for-path for small types and parallel_backend_sy…
mmichel11 Dec 16, 2024
1309f6a
Improve testing and cleanup of code
mmichel11 Dec 16, 2024
288499f
clang format
mmichel11 Dec 16, 2024
d683b72
Miscellaneous fixes identified during testing
mmichel11 Dec 17, 2024
b4cfcae
clang-format
mmichel11 Dec 17, 2024
62c104f
Fix ordering to __vector_load call
mmichel11 Dec 17, 2024
b525ab7
Add support for vectorization with C++20 parallel range APIs
mmichel11 Dec 17, 2024
7d16c16
Add device copyable specializations for new walk patterns
mmichel11 Dec 17, 2024
f9d63aa
Align vector_walk implementation with other vector functors
mmichel11 Dec 17, 2024
9aa36e1
Add back non-spirv path
mmichel11 Dec 17, 2024
b6d5d98
Further improve test coverage
mmichel11 Dec 17, 2024
4c1a974
Restore original shift_left due to implicit implementation requiremen…
mmichel11 Dec 17, 2024
bebd84b
Fix issues in vectorized rotate
mmichel11 Dec 18, 2024
02d0a18
Fix fpga parallel for compilation issues
mmichel11 Dec 18, 2024
1c3f455
Restore initial shift_left_right.pass.cpp
mmichel11 Dec 18, 2024
774e6f0
Fix test side issue when unnamed lambdas are disabled
mmichel11 Dec 18, 2024
cad0e1b
Add a vector path specialization for std::swap_ranges
mmichel11 Dec 18, 2024
0c2c9a8
General code cleanup
mmichel11 Dec 18, 2024
7aa5bf8
Bugfix with __pattern_swap using nanoranges
mmichel11 Dec 18, 2024
62a19fd
clang-format
mmichel11 Dec 19, 2024
b2128fe
Address applicable comments from PR #1870
mmichel11 Dec 20, 2024
2b1281b
Refactor __lazy_ctor_storage deleter
mmichel11 Jan 2, 2025
1c4ed8c
Address review comments
mmichel11 Jan 2, 2025
d0a66ae
Remove intrusive test macro and adjust input sizes in test framework
mmichel11 Jan 4, 2025
ac6d945
Make walk_scalar_base and walk_vector_or_scalar_base structs
mmichel11 Jan 4, 2025
4654b1d
Add missing max_n
mmichel11 Jan 4, 2025
59ea1ec
Add constructors for for-based bricks
mmichel11 Jan 4, 2025
bbee988
Remove extraneous {} and add constructor to custom_brick
mmichel11 Jan 6, 2025
33dc8b7
Limit recursive searching of __min_nested_type_size to tuples
mmichel11 Jan 6, 2025
8a0f4b5
Work around compiler vectorization issue
mmichel11 Jan 6, 2025
0f81298
Add missing decays
mmichel11 Jan 7, 2025
971edae
Add compile time check to ensure we do not get buffer pointer on host
mmichel11 Jan 7, 2025
e7309c9
Revert "Work around compiler vectorization issue"
mmichel11 Jan 7, 2025
d5c7157
Remove all begin() calls on views in vectorization paths
mmichel11 Jan 7, 2025
0280f7c
Remove unused __is_passed_directly_range utility
mmichel11 Jan 7, 2025
52ce868
Rename __scalar_path / __vector_path to __scalar_path_impl / __vector…
mmichel11 Jan 8, 2025
ab70533
Correct __vector_walk deleters and a type in __reverse_copy
mmichel11 Jan 8, 2025
a26cdba
Set upper limit of 10,000,000 for get_pattern_for_max_n
mmichel11 Jan 9, 2025
6db2d58
General cleanup and renaming for consistency
mmichel11 Jan 9, 2025
2e378ea
Explicitly list template types in specializations of __is_vectorizabl…
mmichel11 Jan 13, 2025
f387a4f
Remove unnecessary local variables
mmichel11 Jan 14, 2025
8a387b2
Remove unnecessary local variables in async and numeric headers
mmichel11 Jan 14, 2025
2ccb478
Correct optimization in __reverse_functor and improve explanation
mmichel11 Jan 16, 2025
af2e16f
Rename custom_brick to __custom_brick
mmichel11 Jan 16, 2025
6a4db2c
Rename __n to __full_range_size in vec utils and fix potential unused…
mmichel11 Jan 17, 2025
5e31e07
Remove unnecessary ternary operator and replace _Idx template with st…
mmichel11 Jan 17, 2025
@@ -27,7 +27,7 @@ namespace oneapi::dpl::experimental::kt::gpu::esimd::__impl
{

//------------------------------------------------------------------------
// Please see the comment for __parallel_for_submitter for optional kernel name explanation
// Please see the comment above __parallel_for_small_submitter for optional kernel name explanation
//------------------------------------------------------------------------

template <bool __is_ascending, ::std::uint8_t __radix_bits, ::std::uint16_t __data_per_work_item,
18 changes: 12 additions & 6 deletions include/oneapi/dpl/internal/async_impl/async_impl_hetero.h
@@ -44,7 +44,9 @@ __pattern_walk1_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For

auto __future_obj = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view());
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__buf.all_view())>{
__f, static_cast<std::size_t>(__n)},
__n, __buf.all_view());
return __future_obj;
}

@@ -67,7 +69,9 @@ __pattern_walk2_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view());
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view())>{__f, static_cast<std::size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view());

return __future.__make_future(__first2 + __n);
}
@@ -91,10 +95,12 @@ __pattern_walk3_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator3>();
auto __buf3 = __keep3(__first3, __first3 + __n);

auto __future =
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n,
__buf1.all_view(), __buf2.all_view(), __buf3.all_view());
auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view()), decltype(__buf3.all_view())>{
__f, static_cast<size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view(), __buf3.all_view());

return __future.__make_future(__first3 + __n);
}
37 changes: 26 additions & 11 deletions include/oneapi/dpl/internal/binary_search_impl.h
@@ -37,13 +37,19 @@ enum class search_algorithm
binary_search
};

template <typename Comp, typename T, search_algorithm func>
struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct __custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
{
Comp comp;
T size;
bool use_32bit_indexing;

__custom_brick(Comp comp, T size, bool use_32bit_indexing)
: comp(comp), size(size), use_32bit_indexing(use_32bit_indexing)
{
}

template <typename _Size, typename _ItemId, typename _Acc>
void
search_impl(_ItemId idx, _Acc acc) const
@@ -68,17 +74,23 @@ struct custom_brick
get<2>(acc[idx]) = (value != end_orig) && (get<1>(acc[idx]) == get<0>(acc[value]));
}
}

template <typename _ItemId, typename _Acc>
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_ItemId idx, _Acc acc) const
__scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const
Contributor:
I believe we could improve this code by replacing the run-time bool value use_32bit_indexing with a compile-time indexing-type specialization.
I found only 3 places with the code

const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();

so it is not a big deal to add an if statement outside and call __parallel_for in both branches with different index types, while inside the brick the condition check is eliminated entirely.

Contributor Author:
As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel, whereas your suggestion may improve kernel performance at the cost of increased JIT overhead.
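For illustration only, a minimal stand-alone sketch of the suggested restructuring (the names search_brick and run_search are hypothetical, a plain function call stands in for the __parallel_for submission, and the brick body is reduced to a lower bound): the index type becomes a template parameter, the run-time check moves outside, and each branch instantiates its own specialization. The trade-off from the reply is also visible here: two instantiations to compile instead of one kernel containing a per-element branch.

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical brick: the index type is a template parameter instead of a
// run-time bool member, so the per-element code contains no indexing branch.
template <typename IndexT>
struct search_brick
{
    int value; // value to locate

    // Lower bound computed with IndexT arithmetic only.
    IndexT
    operator()(const std::vector<int>& data) const
    {
        IndexT first = 0;
        IndexT last = static_cast<IndexT>(data.size());
        while (first < last)
        {
            IndexT mid = first + (last - first) / 2;
            if (data[mid] < value)
                first = mid + 1;
            else
                last = mid;
        }
        return first;
    }
};

// Hypothetical dispatcher: the 32-bit / 64-bit decision is made once, outside
// the brick; each branch would submit its own specialized kernel.
inline std::size_t
run_search(const std::vector<int>& data, int value)
{
    if (data.size() <= std::numeric_limits<std::uint32_t>::max())
        return search_brick<std::uint32_t>{value}(data); // 32-bit indexing specialization
    else
        return search_brick<std::uint64_t>{value}(data); // 64-bit indexing specialization
}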

{
if (use_32bit_indexing)
search_impl<std::uint32_t>(idx, acc);
else
search_impl<std::uint64_t>(idx, acc);
}
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_IsFull __is_full, _ItemId idx, _Acc acc) const
{
__scalar_path_impl(__is_full, idx, acc);
}
};
#endif

template <class _Tag, typename Policy, typename InputIterator1, typename InputIterator2, typename OutputIterator,
typename StrictWeakOrdering>
@@ -155,7 +167,8 @@ lower_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, ::std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::lower_bound>{comp, size, use_32bit_indexing},
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::lower_bound>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
@@ -187,7 +200,8 @@ upper_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::upper_bound>{comp, size, use_32bit_indexing},
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::upper_bound>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
@@ -217,10 +231,11 @@ binary_search_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, Input
auto result_buf = keep_result(result, result + value_size);
auto zip_vw = make_zip_view(input_buf.all_view(), value_buf.all_view(), result_buf.all_view());
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
}
76 changes: 53 additions & 23 deletions include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h
@@ -57,7 +57,10 @@ __pattern_walk1(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt
auto __buf = __keep(__first, __last);

oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, __exec, unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view())
_BackendTag{}, __exec,
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__buf.all_view())>{
__f, static_cast<std::size_t>(__n)},
__n, __buf.all_view())
.__deferrable_wait();
}

@@ -103,7 +106,9 @@ __pattern_walk2(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view());
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view())>{__f, static_cast<std::size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view());

// Call no wait, wait or deferrable wait depending on _WaitMode
__future.wait(_WaitMode{});
@@ -130,10 +135,25 @@
__pattern_swap(__hetero_tag<_BackendTag> __tag, _ExecutionPolicy&& __exec, _ForwardIterator1 __first1,
_ForwardIterator1 __last1, _ForwardIterator2 __first2, _Function __f)
{
return __pattern_walk2</*_WaitMode*/ __par_backend_hetero::__deferrable_mode,
__par_backend_hetero::access_mode::read_write,
__par_backend_hetero::access_mode::read_write>(
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f);
auto __n = __last1 - __first1;
if (__n == 0)
return __first2;

auto __keep1 =
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator1>();
auto __buf1 = __keep1(__first1, __last1);

auto __keep2 =
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator2>();
auto __buf2 = __keep2(__first2, __first2 + __n);

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__brick_swap<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view())>{__f, static_cast<std::size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view());
__future.wait(__par_backend_hetero::__deferrable_mode{});
return __first2 + __n;
}

//------------------------------------------------------------------------
@@ -160,9 +180,12 @@ __pattern_walk3(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt
auto __keep3 = oneapi::dpl::__ranges::__get_sycl_range<__acc_mode3, _ForwardIterator3>();
auto __buf3 = __keep3(__first3, __first3 + __n);

oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n,
__buf1.all_view(), __buf2.all_view(), __buf3.all_view())
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view()), decltype(__buf3.all_view())>{
__f, static_cast<std::size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view(), __buf3.all_view())
.__deferrable_wait();

return __first3 + __n;
@@ -1564,9 +1587,10 @@ __pattern_reverse(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterato
auto __keep = oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _Iterator>();
auto __buf = __keep(__first, __last);
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_functor<typename ::std::iterator_traits<_Iterator>::difference_type>{__n}, __n / 2,
__buf.all_view())
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_functor<typename std::iterator_traits<_Iterator>::difference_type,
decltype(__buf.all_view())>{__n},
__n / 2, __buf.all_view())
.__deferrable_wait();
}

@@ -1590,8 +1614,9 @@ __pattern_reverse_copy(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Bi
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator>();
auto __buf2 = __keep2(__result, __result + __n);
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type>{__n},
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_copy<typename std::iterator_traits<_BidirectionalIterator>::difference_type,
decltype(__buf1.all_view()), decltype(__buf2.all_view())>{__n},
__n, __buf1.all_view(), __buf2.all_view())
.__deferrable_wait();

@@ -1632,18 +1657,20 @@ __pattern_rotate(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterator
const auto __shift = __new_first - __first;
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__rotate_wrapper>(__exec),
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_Iterator>::difference_type>{__n, __shift}, __n,
__buf.all_view(), __temp_rng_w);
unseq_backend::__rotate_copy<typename std::iterator_traits<_Iterator>::difference_type,
decltype(__buf.all_view()), decltype(__temp_rng_w)>{__n, __shift},
__n, __buf.all_view(), __temp_rng_w);

//An explicit wait isn't required here because we are working with a temporary sycl::buffer and sycl accessors and
//SYCL runtime makes a dependency graph to prevent the races between two __parallel_for patterns.

using _Function = __brick_move<__hetero_tag<_BackendTag>, _ExecutionPolicy>;
auto __brick = unseq_backend::walk_n<_ExecutionPolicy, _Function>{_Function{}};

auto __temp_rng_rw =
oneapi::dpl::__ranges::all_view<_Tp, __par_backend_hetero::access_mode::read_write>(__temp_buf.get_buffer());
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), __brick,
auto __brick =
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__temp_rng_rw),
decltype(__buf.all_view())>{_Function{}, static_cast<std::size_t>(__n)};
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), __brick,
__n, __temp_rng_rw, __buf.all_view())
.__deferrable_wait();

@@ -1677,8 +1704,8 @@ __pattern_rotate_copy(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Bid

oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type>{__n,
__shift},
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type,
decltype(__buf1.all_view()), decltype(__buf2.all_view())>{__n, __shift},
__n, __buf1.all_view(), __buf2.all_view())
.__deferrable_wait();

@@ -1936,19 +1963,22 @@ __pattern_shift_left(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Rang
if (__n >= __mid)
{
using _Function = __brick_move<__hetero_tag<_BackendTag>, _ExecutionPolicy>;
auto __brick = oneapi::dpl::unseq_backend::walk_n<_ExecutionPolicy, _Function>{_Function{}};

//TODO: to consider use just "read" access mode for a source range and just "write" - for a destination range.
auto __src = oneapi::dpl::__ranges::drop_view_simple<_Range, _DiffType>(__rng, __n);
auto __dst = oneapi::dpl::__ranges::take_view_simple<_Range, _DiffType>(__rng, __size_res);

auto __brick =
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__src), decltype(__dst)>{
_Function{}, static_cast<std::size_t>(__size_res)};

oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
__brick, __size_res, __src, __dst)
.__deferrable_wait();
}
else //2. n < size/2; 'n' parallel copying
{
auto __brick = unseq_backend::__brick_shift_left<_ExecutionPolicy, _DiffType>{__size, __n};
auto __brick = unseq_backend::__brick_shift_left<_ExecutionPolicy, _DiffType, decltype(__rng)>{__size, __n};
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{},
oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__shift_left_right>(