
GPU Version #21

Open
expd opened this issue Nov 3, 2015 · 9 comments

Comments

@expd

expd commented Nov 3, 2015

Are there any plans for a GPU-accelerated version of the algorithm?

@pablofdezalc
Owner

That is something I would love to do, but due to time constraints I do not think it will happen soon unless someone else contributes the GPU implementation. However, by the end of the year there will be some speed improvements to the CPU version.

@jhhsia

jhhsia commented Nov 3, 2015

Hi Pablo,

Which areas of AKAZE do you think a GPU would contribute most to?

Cheers

@pablofdezalc
Owner

I think a GPU will help a lot to speed up the nonlinear scale space computation, as well as keypoint selection and descriptor computation.

@Celebrandil

I might attempt a GPU implementation together with Alessandro Pieropan, but that depends on whether we'll find time for it. Most things are relatively easy to parallelize, even if there are some exceptions. Preferably, we would like an implementation that results in exactly the same features as the original implementation. It's relatively easy to port the code to a GPU, but exploiting the full power of a GPU, without shortcuts, is much harder.

@pablofdezalc
Owner

A GPU implementation would be highly appreciated and really interesting for the community. My main obstacle these days is simply lack of time, plus some CPU optimizations that still need to be integrated. However, if someone is interested in taking the lead on the GPU implementation, I can help test that its performance matches the CPU version.

@Celebrandil

To speed up the CPU version, you should focus on the memory footprint and access patterns. For example, instead of first creating the whole pyramid and then detecting features, it is better to interleave the two. If you have something in CPU cache, you should complete as many operations on it as possible before the data is eventually written back to main memory. Also try to reduce the number of memory buffers; you can use different names to keep the code readable while still sharing the same physical memory.
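To make the interleaving idea concrete, here is a minimal sketch of the loop structure being suggested. All names (`EvolutionLevel`, `diffuse`, `detect`) and the per-pixel operations are hypothetical placeholders, not the actual AKAZE code; the point is only the shape of the loop, in which each level is diffused and consumed for detection back to back while it is still hot in cache:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-level storage; in a real implementation the scratch
// buffers could be shared (aliased) across levels instead of per-level.
struct EvolutionLevel {
    std::vector<float> Lt;    // diffused image for this level
    std::vector<float> Ldet;  // detector response for this level
};

// Placeholder for a diffusion step from the previous level.
void diffuse(const std::vector<float>& prev, std::vector<float>& out) {
    out = prev;
    for (float& v : out) v *= 0.5f;  // stand-in for real smoothing
}

// Placeholder for a detector-response computation.
void detect(const std::vector<float>& Lt, std::vector<float>& Ldet) {
    Ldet.resize(Lt.size());
    for (std::size_t i = 0; i < Lt.size(); ++i)
        Ldet[i] = Lt[i] * Lt[i];  // stand-in for a real response
}

// Interleaved loop: level i is diffused and its response computed
// immediately, rather than building the whole pyramid in a first pass
// and detecting features in a second pass over cold memory.
void build_and_detect(const std::vector<float>& image,
                      std::vector<EvolutionLevel>& levels) {
    const std::vector<float>* prev = &image;
    for (auto& lvl : levels) {
        diffuse(*prev, lvl.Lt);
        detect(lvl.Lt, lvl.Ldet);  // consume Lt while still in cache
        prev = &lvl.Lt;
    }
}
```

The same two-pass-to-one-pass restructuring applies whatever the real diffusion and detection steps are; only the data-flow order changes, not the results.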

@jhhsia

jhhsia commented Nov 25, 2015

Hi Pablo,

While working through AKAZE, there seems to be a simple optimization in the Compute_Multiscale_Derivatives area. The original code has the following:

evolution_[i].Lx = evolution_[i].Lx*((sigma_size_));
evolution_[i].Ly = evolution_[i].Ly*((sigma_size_));
evolution_[i].Lxx = evolution_[i].Lxx*((sigma_size_)*(sigma_size_));
evolution_[i].Lxy = evolution_[i].Lxy*((sigma_size_)*(sigma_size_));
evolution_[i].Lyy = evolution_[i].Lyy*((sigma_size_)*(sigma_size_));

First off, it seems Lx and Ly will not be used again after this, so we can just take them out. Also, we can fold the sigma_size_ scaling into the determinant computation instead:

float sigma_size_quad = sigma_size_ * sigma_size_ * sigma_size_ * sigma_size_;  // sigma_size_^4

for (int ix = 0; ix < evo[i].Ldet.rows; ix++) {
  const float* lxx = evo[i].Lxx.ptr<float>(ix);
  const float* lxy = evo[i].Lxy.ptr<float>(ix);
  const float* lyy = evo[i].Lyy.ptr<float>(ix);
  float* ldet = evo[i].Ldet.ptr<float>(ix);
  for (int jx = 0; jx < evo[i].Ldet.cols; jx++)
    ldet[jx] = (lxx[jx]*lyy[jx] - lxy[jx]*lxy[jx]) * sigma_size_quad;
}

So we can pretty much remove all five scaling operations. I have only run this against a few test images, and the results seem to be identical. Does this seem correct?

Thanks!
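The algebra behind the optimization above can be checked with a small self-contained sketch (the function names and the sample derivative values are illustrative, not taken from AKAZE): pre-multiplying Lxx, Lxy, and Lyy by sigma_size_^2 each and then taking the Hessian determinant gives the same result as computing the determinant of the unscaled derivatives and multiplying once by sigma_size_^4.

```cpp
#include <cmath>

// Original scheme: each second-order derivative is pre-scaled by
// sigma_size^2, then the Hessian determinant is taken.
float det_scale_first(float lxx, float lxy, float lyy, int sigma_size) {
    const float s2 = static_cast<float>(sigma_size * sigma_size);
    lxx *= s2; lxy *= s2; lyy *= s2;
    return lxx * lyy - lxy * lxy;
}

// Proposed scheme: one multiply by sigma_size^4 at determinant time.
// Note the explicit products: in C++ '^' is XOR, not exponentiation.
float det_scale_last(float lxx, float lxy, float lyy, int sigma_size) {
    const float s2 = static_cast<float>(sigma_size * sigma_size);
    const float sigma_size_quad = s2 * s2;  // sigma_size^4
    return (lxx * lyy - lxy * lxy) * sigma_size_quad;
}
```

Since det(s²·H) = s⁴·det(H) for a 2×2 matrix, the two agree up to floating-point rounding, which is consistent with the identical results reported on the test images.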

@pablofdezalc
Owner

Hi Jay Hsia,

Yes, the modifications you are suggesting work fine. I will incorporate those changes, since they will speed things up a bit.

Regards,
Pablo

@Celebrandil

On a Tesla K40c I can compute the scale space in about 6.7 ms, excluding GPU-to-CPU transfers. The sequential nature of the computations makes it hard to limit the number of kernel calls, which leads to considerable overhead at the finer scales. Preferably, one would like to work on all scales of each octave in parallel, since the resolution of the smaller images is too low for spatial parallelism alone to be enough.
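One way to batch the scales of an octave, sketched below with entirely hypothetical names (`OctaveBatch`, `scale_space_response`): store every scale slice of an octave in one contiguous buffer, so a single pass covers all of them. On a GPU this would map to a single kernel launch over a 3-D grid (the scale index becoming, e.g., blockIdx.z), replacing one launch per scale and amortizing the launch overhead that dominates at low resolutions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical batched layout: all scales of one octave share a buffer.
struct OctaveBatch {
    int width = 0, height = 0, num_scales = 0;
    std::vector<float> data;  // num_scales slices of width*height floats

    OctaveBatch(int w, int h, int s)
        : width(w), height(h), num_scales(s),
          data(static_cast<std::size_t>(w) * h * s, 0.0f) {}

    // Pointer to the start of scale slice s.
    float* scale(int s) {
        return data.data() +
               static_cast<std::size_t>(s) * width * height;
    }
};

// One pass over the whole octave. On a GPU, the outer loop disappears:
// the scale index comes from the third grid dimension, so all scales
// are processed by one kernel launch instead of num_scales launches.
void scale_space_response(OctaveBatch& batch) {
    for (int s = 0; s < batch.num_scales; ++s) {
        float* slice = batch.scale(s);
        for (int i = 0; i < batch.width * batch.height; ++i)
            slice[i] = slice[i] * slice[i];  // placeholder per-pixel op
    }
}
```

The placeholder per-pixel operation stands in for whatever response is computed per scale; the layout, not the operation, is the point of the sketch.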
