GPU Version #21
That is something I would love to do. But due to time constraints, I do not think that will happen soon unless another person contributes the GPU implementation. However, by the end of the year there will be some improvements in speed regarding the CPU version.
Hi Pablo,
Cheers
I think GPU will help a lot to speed up the nonlinear scale space computation, as well as keypoint selection and descriptor computation.
I might attempt a GPU implementation, together with Alessandro Pieropan, but that depends on whether we'll find time for it. Most things are relatively easy to parallelize, even if there are some exceptions. Preferably, we would like an implementation that results in exactly the same features as the original implementation. It's relatively easy to port the code to a GPU, but exploiting the full power of a GPU, without short-cuts, is way harder.
A GPU implementation would be highly appreciated and really interesting for the community. My main obstacle to doing that these days is really lack of time, and the fact that I still have some CPU optimizations that need to be integrated. However, if someone is interested in taking the lead on the GPU implementation, I can help test that the performance matches the CPU version.
To speed up the CPU version, you should focus on the memory footprint and access patterns. For example, instead of first creating the pyramid and then detecting features, it is better to interleave the two. If you have something in the CPU cache, you should complete as many operations on it as possible before the data is eventually written back to main memory. Also try to reduce the number of memory buffers, possibly by using different names to keep the code readable while still sharing the same physical memory.
Hi Pablo,
While working through AKAZE, there seems to be a simple optimization in the Compute_Multiscale_Derivatives area. The original code has the following:

evolution_[i].Lx = evolution_[i].Lx * (sigma_size_);

First off, it seems Lx and Ly will not be used again after this, so we can just take that scaling out. Also, we can multiply by sigma_size_^4 during the determinant phase:

sigma_size_quad = sigma_size_ * sigma_size_ * sigma_size_ * sigma_size_;
for (int ix = 0; ix < evo[i].Ldet.rows; ix++) {

so we can pretty much remove all five scaling passes. Thanks!
Hi Jay Hsia,
Yes, the modifications that you are suggesting work fine. I will incorporate those changes, since they will speed things up a bit.
Regards,
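The algebra behind the suggestion above is that (s²·Lxx)(s²·Lyy) − (s²·Lxy)² = s⁴·(Lxx·Lyy − Lxy²), so the per-derivative scalings can be folded into a single factor on the determinant. A minimal sketch, using a simplified `Evolution` struct in place of the real cv::Mat-based one:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Simplified stand-in for one evolution level (the real code uses cv::Mat).
struct Evolution {
    std::vector<float> Lxx, Lyy, Lxy;  // second-order derivatives
    std::vector<float> Ldet;           // Hessian determinant response
};

// Original scheme (simplified): scale each derivative by sigma_size^2 first,
// then compute the determinant.
void DetHessianScaled(Evolution& e, float sigma_size) {
    const float s2 = sigma_size * sigma_size;
    e.Ldet.resize(e.Lxx.size());
    for (size_t i = 0; i < e.Lxx.size(); ++i) {
        float lxx = e.Lxx[i] * s2, lyy = e.Lyy[i] * s2, lxy = e.Lxy[i] * s2;
        e.Ldet[i] = lxx * lyy - lxy * lxy;
    }
}

// Suggested scheme: skip the per-derivative scaling and multiply the
// determinant by sigma_size^4 once.
void DetHessianFolded(Evolution& e, float sigma_size) {
    const float sigma_quad = std::pow(sigma_size, 4.0f);
    e.Ldet.resize(e.Lxx.size());
    for (size_t i = 0; i < e.Lxx.size(); ++i)
        e.Ldet[i] = sigma_quad * (e.Lxx[i] * e.Lyy[i] - e.Lxy[i] * e.Lxy[i]);
}
```

Both versions produce the same response up to floating-point rounding, which is why the change preserves the detected features.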
On a Tesla K40c I can compute the scale space in about 6.7 ms, excluding GPU-to-CPU transfers. The sequential nature of the computations makes it hard to limit the number of kernel calls, which leads to considerable overhead at the finer scales. Preferably, one would like to work on all scales of each octave in parallel, since the resolution of the smaller images is too low for spatial parallelism alone to be enough.
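The launch-overhead argument above can be made concrete with a back-of-envelope launch-count model. This is purely illustrative: the octave/sublevel counts and the kernels-per-level figure are assumptions for the sake of the arithmetic, not measurements from any implementation.

```cpp
#include <cassert>

// One set of kernel launches per evolution level: a pyramid with `octaves`
// octaves of `sublevels` levels each costs octaves * sublevels * kernels
// launches, so fixed per-launch overhead dominates at the small, fine scales.
int LaunchesPerLevel(int octaves, int sublevels, int kernels_per_level) {
    return octaves * sublevels * kernels_per_level;
}

// Batching all levels of an octave into fused kernels (working on all scales
// of an octave at once) cuts the launch count by a factor of `sublevels`.
int LaunchesPerOctave(int octaves, int kernels_per_level) {
    return octaves * kernels_per_level;
}
```

For example, with a hypothetical 4 octaves, 4 sublevels, and 6 kernels per level, per-level launching needs 96 launches versus 24 for per-octave batching; at a few microseconds of fixed cost per launch, that difference is significant against a 6.7 ms total.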
Any plans on having a GPU-accelerated version of the algorithm?