More data types supported in cv::cuda::transpose() #3371
base: 4.x
Conversation
proposal for #22782
CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );
//CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );
What if elemSize == 5?
What kind of error message would we show to the user?
It is handled by
...
else
CV_Error(Error::StsUnsupportedFormat, "");
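For illustration only (the wording is just a sketch, not what the PR currently contains), the empty string could carry an explicit message:
else
    CV_Error(Error::StsUnsupportedFormat, "cv::cuda::transpose: unsupported element size / channel combination");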
if (src.empty())
    dst.release();
Empty input should be an error as nobody wants to process "nothing" in real use cases: opencv/opencv#8300
OK, I thought it was better since "Allow empty matrices in most functions" (personally I prefer a no-op rather than an exception)
I will also add a fast path for size == (1, 1), where transpose is just a copy
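Roughly like this (just a sketch using the surrounding function's variable names; the final code may differ):
// a 1x1 matrix is its own transpose, so a plain device copy is enough
if (src.rows == 1 && src.cols == 1)
{
    src.copyTo(_dst, stream);
    return;
}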
(srcType == CV_32SC1) || (srcType == CV_32SC3) || (srcType == CV_32SC4) ||
(srcType == CV_32FC1) || (srcType == CV_32FC3) || (srcType == CV_32FC4);
const bool isElemSizeSupportedByNppi =
    ((elemSize != 0) && !(elemSize%1) && ((elemSize/1)<=4)) ||
Apologies if I have completely misunderstood the logic, but:
Isn't elemSize%1 == 0 always?
When can elemSize == 0?
Isn't this already taken care of by
(srcType == CV_8UC1) || (srcType == CV_8UC3) || (srcType == CV_8UC4) || (srcType == CV_16UC1)
You are right, but it was on purpose for code clarity/readability.
elemSize%1 and elemSize/1 will be optimized out by the compiler; this is just to balance with the 2, 4, 8 cases.
OK to get rid of the (elemSize != 0) check.
For clarity I would use elemSize1 and channels?
For what CV data type will isNppiNativelySupported == false and isElemSizeSupportedByNppi == true or isElemSizeSupportedByGridTranspose == true?
There is overlap between isNppiNativelySupported, isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose
The idea is that if anything goes wrong, they can be disabled independently.
isNppiNativelySupported means that the data type is directly mapped to an nppiTranspose call.
isElemSizeSupportedByNppi means that we can cheat on the real data type to use another data type of the same size (assuming that NPP is only using memory copies, otherwise it would not work).
isElemSizeSupportedByGridTranspose is the fallback if we don't rely on the NPP implementation.
For instance, 16UC2 is not supported by NPP but can be mapped to 32SC1 (isNppiNativelySupported == false and isElemSizeSupportedByNppi == true)
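Schematically, the dispatch is intended to look like this (simplified sketch, not the exact PR code):
if (isNppiNativelySupported)
{
    // srcType maps 1:1 to an nppiTranspose_* variant, e.g. CV_32FC1 -> nppiTranspose_32f_C1R
}
else if (isElemSizeSupportedByNppi)
{
    // reinterpret the data as another type of the same element size and let NPP move the bytes,
    // e.g. CV_16UC2 (4 bytes per element) dispatched as nppiTranspose_8u_C4R
}
else if (isElemSizeSupportedByGridTranspose)
{
    // pure CUDA fallback: gridTranspose() templated on a type of matching size
}
else
    CV_Error(Error::StsUnsupportedFormat, "");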
Is there a complete overlap, i.e. are isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose redundant unless isNppiNativelySupported is manually set to false?
There are cases where (isNppiNativelySupported == false) and (isElemSizeSupportedByNppi == true) (e.g. 16UC2).
But there are currently no cases where (isElemSizeSupportedByNppi == false) and (isElemSizeSupportedByGridTranspose == true). This is a fallback implementation if:
- a (future) OpenCV global option disables NPP
- performance tests reveal that some NPP calls are slower than gridTranspose(). Specific cases could then be excluded from isElemSizeSupportedByNppi.
Is 16UC2 handled by the block below?
else if (!(elemSize%2) && ((elemSize/2)==2))
nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
If so, wouldn't it be better to have this under
else if (srcType == CV_16UC2)
Exactly, this is not a native nppi call (isNppiNativelySupported == false), but it can be handled by cheating on the data type (isElemSizeSupportedByNppi == true).
That's why the tests inside if (isElemSizeSupportedByNppi) explicitly focus on elemSize rather than srcType.
I'm not sure if this makes it easier or harder for the next person who looks at it to see what's going on, what do you think @alalek?
Either way I think having overlap between both makes it difficult to understand; maybe if the "cheat" only included the cases where it was applicable it would be more obvious?
Would it be possible to incorporate the newer stream API for newer versions of CUDA as well?
Fixed a typo in gridTranspose() usage. Added a fast path for single cell/row/col matrix. Throw an error for an empty matrix instead of a no-op. Code style.
I am currently based on the 4.x branch; usage of _Ctx seems to be in master.
@@ -201,6 +199,9 @@ void cv::cuda::transpose(InputArray _src, OutputArray _dst, Stream& stream)
else if (!(elemSize%8) && ((elemSize/8)==2))
    nppSafeCall( nppiTranspose_32f_C4R(src.ptr<Npp32f>(), static_cast<int>(src.step),
                                       dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );

if (!stream)
Should this be applied to gridTranspose() as well? I can't understand why it was previously missing; surely the result could have been in flight on return from this function when the default stream is passed?
I have the opposite question: why is it ever needed? I left them here, but I don't understand the purpose of this extra synchronization when the default stream (synchronous by default) is used.
Kernel launches are asynchronous with respect to the host. The default stream syncs with other streams by default in legacy mode.
My interpretation is that the OpenCV API works on the assumption that if a stream isn't passed, the user wants synchronization.
Do you mean that, now that both cudaStreamPerThread and cudaStreamLegacy exist, people using OpenCV always expect the cudaStreamLegacy behaviour with the default stream, thus requiring the cudaDeviceSynchronize()?
If so, I agree.
No, users of the API will expect functions to which they don't pass a stream to be synchronous with respect to the host when they return. This would not be the case with either cudaStreamPerThread or cudaStreamLegacy, as the kernel launch is asynchronous in both cases.
In the example from the docs
1) k_1<<<1, 1, 0, s>>>();
2) k_2<<<1, 1>>>();
3) k_3<<<1, 1, 0, s>>>();
4) ...
k_2 waits on k_1 because k_2 is in the legacy default stream, then k_3 waits on the legacy stream. Because of the specific way this has been set up, k_1 and k_2 have finished executing before k_3 begins executing; however the result from k_3 may still be in flight after control has returned to the host when you reach line 4.
Now I haven't used per-thread default streams (I always use explicit streams), but my understanding is that if the CUDA_API_PER_THREAD_DEFAULT_STREAM macro was used to enable per-thread default streams, k_1 would run before k_3 but both would be asynchronous with respect to k_2. Either way, when control returns to the host on line 4 they may all still be in flight.
On the other hand, if we have the following, which is our case if no stream is passed:
1) k_1<<<1, 1>>>();
2) k_2<<<1, 1>>>();
3) k_3<<<1, 1>>>();
4) ...
in either case (cudaStreamPerThread or cudaStreamLegacy), all three kernels may still be in flight when control returns to the host on line 4 if we don't explicitly synchronize.
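To make the point concrete (illustrative sketch only; transposeKernel and the launch parameters are placeholders, not code from the PR):
// a launch on the default stream returns control to the host immediately,
// so the result may still be in flight unless we synchronize explicitly
transposeKernel<<<grid, block>>>(srcPtr, dstPtr); // asynchronous, for both legacy and per-thread default streams
cudaDeviceSynchronize();                          // blocks the host until the kernel has completed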
OK, anyway, I just checked and gridTranspose() does ultimately call cudaDeviceSynchronize() for a null stream (see https://github.com/opencv/opencv_contrib/blob/4.x/modules/cudev/include/opencv2/cudev/grid/detail/transpose.hpp)
@cudawarped Do you have open issues with the PR? I want to merge if you do not mind.
@asmorkalov I am not 100% convinced that the isNppiNativelySupported and isElemSizeSupportedByNppi flags are needed. I can understand that @chacha21 wants a fallback option in case something goes wrong and because of that wants to separate out the two logic paths, but I am not sure if the redundant calls under isElemSizeSupportedByNppi make the function harder to maintain, what do you think?
Additionally, elemSize1 is unused, the isElemSizeSupportedByNppi logic path won't be fully tested, and given the names of the nppi functions it may make more sense to just compare the element size directly instead of deriving it with modulo/division arithmetic on elemSize, i.e.
else if (!(elemSize%2) && ((elemSize/2)==2))
nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
could be
else if (elemSize==4)
nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
Whatever the decision, I will certainly agree: I explained my initial code structure, but I am OK to adapt to a more "OpenCV style".
- Conditions like if (!(elemSize%2) && ((elemSize/2)==2)) look very cryptic. I understand the logic behind it, but if (elemSize==4) is definitely more readable.
- The only reason for the fallback I see is the 2-channel matrix. All other cases are handled with regular types. I propose to merge "native" support and just "support" into a single case and use the relevant Npp call. No duplicated/dead branches, more obvious testing.
if (rows == 1), the stride has no importance for the reshape()
@cudawarped Do you have open issues with the PR? I want to merge if you do not mind.
I could not test performance with GPUs other than mine (an old GTX 750). I am pretty confident, but it would be good if someone could have a look with more modern ones.
Transpose perf test results for GF 1080 with CUDA 10.2, Ubuntu 18.04:
The execution time is very small. Not sure if it's a real speedup or some system side effect. In any case, the PR does not introduce regressions.
More obvious elemSize test logic to dispatch to the right nppiTranspose variant. Added modern NPP _Ctx stream support.
else if (elemSize == 1)
    nppSafeCall( nppiTranspose_8u_C1R(src.ptr<Npp8u>(), static_cast<int>(src.step),
                                      dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 2)
    nppSafeCall( nppiTranspose_16u_C1R(src.ptr<Npp16u>(), static_cast<int>(src.step),
                                       dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 3)
    nppSafeCall( nppiTranspose_8u_C3R(src.ptr<Npp8u>(), static_cast<int>(src.step),
                                      dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 4)
    nppSafeCall( nppiTranspose_32s_C1R(src.ptr<Npp32s>(), static_cast<int>(src.step),
                                       dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 6)
    nppSafeCall( nppiTranspose_16u_C3R(src.ptr<Npp16u>(), static_cast<int>(src.step),
                                       dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 8)
    nppSafeCall( nppiTranspose_16u_C4R(src.ptr<Npp16u>(), static_cast<int>(src.step),
                                       dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 12)
    nppSafeCall( nppiTranspose_32s_C3R(src.ptr<Npp32s>(), static_cast<int>(src.step),
                                       dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 16)
    nppSafeCall( nppiTranspose_32s_C4R(src.ptr<Npp32s>(), static_cast<int>(src.step),
                                       dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
The interpretation block does the transpose a second time if elemSize duplicates one of the srcType cases.
Well... no? This is still an "else if".
else if (elemSize == 1)
    nppSafeCall( nppiTranspose_8u_C1R_Ctx(src.ptr<Npp8u>(), static_cast<int>(src.step),
                                          dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 2)
    nppSafeCall( nppiTranspose_16u_C1R_Ctx(src.ptr<Npp16u>(), static_cast<int>(src.step),
                                           dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 3)
    nppSafeCall( nppiTranspose_8u_C3R_Ctx(src.ptr<Npp8u>(), static_cast<int>(src.step),
                                          dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 4)
    nppSafeCall( nppiTranspose_32s_C1R_Ctx(src.ptr<Npp32s>(), static_cast<int>(src.step),
                                           dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 6)
    nppSafeCall( nppiTranspose_16u_C3R_Ctx(src.ptr<Npp16u>(), static_cast<int>(src.step),
                                           dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 8)
    nppSafeCall( nppiTranspose_16u_C4R_Ctx(src.ptr<Npp16u>(), static_cast<int>(src.step),
                                           dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 12)
    nppSafeCall( nppiTranspose_32s_C3R_Ctx(src.ptr<Npp32s>(), static_cast<int>(src.step),
                                           dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 16)
    nppSafeCall( nppiTranspose_32s_C4R_Ctx(src.ptr<Npp32s>(), static_cast<int>(src.step),
                                           dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz, ctx) );
The same here.
if (!stream)
    CV_CUDEV_SAFE_CALL( cudaDeviceSynchronize() );
}
CV_CUDEV_SAFE_CALL( cudaDeviceSynchronize() );
A synchronous NPP call is used if !stream. No need to sync again, right?
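In other words (sketch of the suggested cleanup, not the PR code):
if (!stream)
    CV_CUDEV_SAFE_CALL( cudaDeviceSynchronize() );
// no second, unconditional cudaDeviceSynchronize() after the block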
if (elemSize == 1)
    gridTranspose(globPtr<unsigned char>(src), globPtr<unsigned char>(dst), stream);
else if (elemSize == 2)
    gridTranspose(globPtr<unsigned short>(src), globPtr<unsigned short>(dst), stream);
else if (elemSize == 4)
{
    gridTranspose(globPtr<int>(src), globPtr<int>(dst), stream);
}
else // if (elemSize == 8)
{
    gridTranspose(globPtr<double>(src), globPtr<double>(dst), stream);
}
gridTranspose(globPtr<signed int>(src), globPtr<signed int>(dst), stream);
else if (elemSize == 8)
    gridTranspose(globPtr<double>(src), globPtr<double>(dst), stream);
*/
Please remove dead code.
#endif
cudaStream_t _stream = StreamAccessor::getStream(stream);

if (!_stream || !useNppStreamCtx)
Would it be better to always use the old API when CUDA SDK < 10.1 and the newer one otherwise, regardless of whether there is a stream or not? That way it should be easier to carve out the older code when it is deprecated in newer CUDA versions.
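For example (sketch of the idea; USE_NPP_STREAM_CONTEXT and ctx are the names used in this PR, the rest is illustrative):
#if USE_NPP_STREAM_CONTEXT
    // CUDA/NPP >= 10.1: always use the _Ctx variants and pass the stream (possibly the null stream) via the context
    nppSafeCall( nppiTranspose_8u_C1R_Ctx(src.ptr<Npp8u>(), static_cast<int>(src.step),
                                          dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz, ctx) );
#else
    // older toolkits: legacy API bound to the global NPP stream
    nppSafeCall( nppiTranspose_8u_C1R(src.ptr<Npp8u>(), static_cast<int>(src.step),
                                      dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
#endif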
void cv::cuda::transpose(InputArray _src, OutputArray _dst, Stream& stream)
{
    GpuMat src = getInputMat(_src, stream);
#define USE_NPP_STREAM_CONTEXT (NPP_VERSION >= (10 * 1000 + 1 * 100 + 0))
I think it would be better to define this globally somewhere, although I'm not sure where, maybe modules/cudev/include/opencv2/cudev/common.hpp.
@chacha21 I like the changes apart from my comments. For reference I was thinking the modification could look like this. I would wait for @asmorkalov to take a look before making any more updates as he will ultimately decide on the required format.
syncOutput(dst, _dst, stream);
const int srcType = src.type();
Is this redundant?
(elemSize == 1) || (elemSize == 2) || (elemSize == 3) || (elemSize == 4) ||
(elemSize == 6) || (elemSize == 8) || (elemSize == 12) || (elemSize == 16);

if (!isSupported)
Would CV_Assert() be better here?
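i.e. (sketch of the alternative):
CV_Assert(isSupported); // reports the failed condition in the exception message
// versus the current explicit check:
// if (!isSupported)
//     CV_Error(Error::StsUnsupportedFormat, "");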
Fixes opencv/opencv#22782
proposal for opencv/opencv#22782
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.