Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976
Open
mmichel11 wants to merge 72 commits into main from dev/mmichel11/parallel_for_vectorize
Commits
f3acdca (mmichel11) Optimize memory transactions in SYCL backend parallel for
06e06ff (mmichel11) clang-format
ab7a75f (mmichel11) Correct comment and error handling.
ec0761c (mmichel11) __num_groups bugfix
281f642 (mmichel11) Introduce stride recommender for different targets and better distrib…
6ffb904 (mmichel11) Cleanup
fad85fe (mmichel11) Unroll loop if possible
329f000 (mmichel11) Revert "Unroll loop if possible"
420bd6c (mmichel11) Use a small and large kernel in parallel for
ef78c6a (mmichel11) Improve __iters_per_work_item heuristic.
7883c3e (mmichel11) Code cleanup
5c12d66 (mmichel11) Clang format
36a602b (mmichel11) Update comments
4d645f6 (mmichel11) Bugfix in comment
ca9a06f (mmichel11) More cleanup and better handle non-full case
3713d62 (mmichel11) Rename __ndi to __item for consistency with codebase
305bf2b (mmichel11) Update all comments on kernel naming trick
3b50010 (mmichel11) Handle non-full case in a cleaner way
8e5de99 (mmichel11) Switch min tuple type utility to return size of type
65e0b05 (mmichel11) Remove unnecessary template parameter
257815a (mmichel11) Make non-template function inline for ODR compliance
3929705 (mmichel11) If the iters per work item is 1, then only compile the basic pfor kernel
31a7aae (mmichel11) Address several PR comments
08d24aa (mmichel11) Remove free function __stride_recommender
1748a6b (mmichel11) Accept ranges as forwarding references in __parallel_for_large_submitter
cc829e5 (mmichel11) Address reviewer comments
8dc7706 (mmichel11) Introduce vectorized for-path for small types and parallel_backend_sy…
1309f6a (mmichel11) Improve testing and cleanup of code
288499f (mmichel11) clang format
d683b72 (mmichel11) Miscellaneous fixes identified during testing
b4cfcae (mmichel11) clang-format
62c104f (mmichel11) Fix ordering to __vector_load call
b525ab7 (mmichel11) Add support for vectorization with C++20 parallel range APIs
7d16c16 (mmichel11) Add device copyable specializations for new walk patterns
f9d63aa (mmichel11) Align vector_walk implementation with other vector functors
9aa36e1 (mmichel11) Add back non-spirv path
b6d5d98 (mmichel11) Further improve test coverage
4c1a974 (mmichel11) Restore original shift_left due to implicit implementation requiremen…
bebd84b (mmichel11) Fix issues in vectorized rotate
02d0a18 (mmichel11) Fix fpga parallel for compilation issues
1c3f455 (mmichel11) Restore initial shift_left_right.pass.cpp
774e6f0 (mmichel11) Fix test side issue when unnamed lambdas are disabled
cad0e1b (mmichel11) Add a vector path specialization for std::swap_ranges
0c2c9a8 (mmichel11) General code cleanup
7aa5bf8 (mmichel11) Bugfix with __pattern_swap using nanoranges
62a19fd (mmichel11) clang-format
b2128fe (mmichel11) Address applicable comments from PR #1870
2b1281b (mmichel11) Refactor __lazy_ctor_storage deleter
1c4ed8c (mmichel11) Address review comments
d0a66ae (mmichel11) Remove intrusive test macro and adjust input sizes in test framework
ac6d945 (mmichel11) Make walk_scalar_base and walk_vector_or_scalar_base structs
4654b1d (mmichel11) Add missing max_n
59ea1ec (mmichel11) Add constructors for for-based bricks
bbee988 (mmichel11) Remove extraneous {} and add constructor to custom_brick
33dc8b7 (mmichel11) Limit recursive searching of __min_nested_type_size to tuples
8a0f4b5 (mmichel11) Work around compiler vectorization issue
0f81298 (mmichel11) Add missing decays
971edae (mmichel11) Add compile time check to ensure we do not get buffer pointer on host
e7309c9 (mmichel11) Revert "Work around compiler vectorization issue"
d5c7157 (mmichel11) Remove all begin() calls on views in vectorization paths
0280f7c (mmichel11) Remove unused __is_passed_directly_range utility
52ce868 (mmichel11) Rename __scalar_path / __vector_path to __scalar_path_impl / __vector…
ab70533 (mmichel11) Correct __vector_walk deleters and a type in __reverse_copy
a26cdba (mmichel11) Set upper limit of 10,000,000 for get_pattern_for_max_n
6db2d58 (mmichel11) General cleanup and renaming for consistency
2e378ea (mmichel11) Explicitly list template types in specializations of __is_vectorizabl…
f387a4f (mmichel11) Remove unnecessary local variables
8a387b2 (mmichel11) Remove unnecessary local variables in async and numeric headers
2ccb478 (mmichel11) Correct optimization in __reverse_functor and improve explanation
af2e16f (mmichel11) Rename custom_brick to __custom_brick
6a4db2c (mmichel11) Rename __n to __full_range_size in vec utils and fix potential unused…
5e31e07 (mmichel11) Remove unnecessary ternary operator and replace _Idx template with st…
I believe we could improve this code by replacing the run-time `bool` value `use_32bit_indexing` with a compile-time specialization on the index type. I found only 3 places with this code, so it's not a big deal to add an `if` statement outside and call `__parallel_for` inside both branches with different index types. Inside the brick, we would then eliminate the condition check entirely.
As discussed offline, I will re-evaluate performance here and provide an update. The advantage of the current approach is that we compile only a single kernel, whereas your suggestion may improve kernel performance at the cost of increased JIT overhead.
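For readers following the thread, the trade-off can be illustrated with a minimal host-only C++ sketch. This is not the oneDPL code: `brick_for`, `for_runtime_flag`, and `for_dispatch` are hypothetical names, plain loops stand in for SYCL kernels, and the 32-bit threshold is an assumption. It only shows the shape of the two options: a single instantiation that checks the flag per element, versus a single outer `if` that selects between two instantiations specialized on the index type.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a brick. The index type is a template parameter,
// so the loop body contains no check on the index width.
template <typename Index, typename F>
void brick_for(std::size_t n, F f)
{
    const Index count = static_cast<Index>(n);
    for (Index i = 0; i < count; ++i)
        f(i);
}

// Run-time flag (analogous to the current approach): one instantiation,
// but the flag is tested for every element.
template <typename F>
void for_runtime_flag(std::size_t n, bool use_32bit_indexing, F f)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (use_32bit_indexing)
            f(static_cast<std::uint32_t>(i));
        else
            f(static_cast<std::uint64_t>(i));
    }
}

// Compile-time index type (analogous to the suggestion): the branch is taken
// once, outside the loop, at the price of two instantiations of brick_for.
template <typename F>
void for_dispatch(std::size_t n, F f)
{
    if (n <= std::numeric_limits<std::uint32_t>::max())
        brick_for<std::uint32_t>(n, f);
    else
        brick_for<std::uint64_t>(n, f);
}
```

Both variants accept a generic callable, for example `for_dispatch(v.size(), [&](auto i) { v[i] += 1; })`. In a SYCL setting each additional instantiation would presumably correspond to an additional kernel to compile, which is the JIT overhead mentioned above.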