Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added functionality to allow X to be a precomputed distance matrix #55

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jjmaldonis
Copy link

This is useful if the distance metric is expensive to calculate and the user wants to speed up multiple runs of DeBaCl by pre-calculating the distance matrix and inputting precomputed values.

In addition, in some uncommon cases, data may not be easily represented in N dimensional space but distance values for the data may be readily available. This code provides functionality for these less common use cases.

Scikit-learn provides this functionality for many of it's methods, as does Leland McInnes's hdbscan library. I use these libraries regularly and have found the precomputed option quite useful.

I would be happy to explain anything that isn't clear enough, and I hope the code is written to be as computationally inexpensive as possible.

This is useful if the distance metric is expensive to calculate and
the user wants to speed up multiple runs of DeBaCl by inputting
precomputed distance values.
In some odd instances, data may not be easily represented in N
dimensional space but distance values for the data may be
readily available. This code provides functionality for those
less common use cases.
@bpkent
Copy link
Collaborator

bpkent commented Jan 27, 2016

Hi @jjmaldonis, thanks for contributing! My first priority is to push a new version to PyPI with a critical bugfix, but I agree with your point about making DeBaCl easier to use for people who have pre-computed the distance matrix, so I'll get back to this review in a couple days.

@bpkent bpkent self-assigned this Feb 3, 2016
@bpkent bpkent added this to the v1.2 milestone Feb 3, 2016
@@ -1260,6 +1260,72 @@ def construct_tree(X, k, prune_threshold=None, num_levels=None, verbose=False):
return tree


def construct_tree_from_precomputed_matrix(X, p, k, prune_threshold=None, num_levels=None, verbose=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change "precomputed_matrix" to "distance_matrix"? I think it would be a little more explicit.

@bpkent
Copy link
Collaborator

bpkent commented Feb 3, 2016

One general comment, in addition to the in-line suggestions I've made above: the two new functionalities should be covered by unit testing before being merged into master. If you're interested in working on that, super. If not, totally fine, just let me know, and I'll work on it from my end.

Thanks again for the PR, I think this is a good addition!
-Brian

@jjmaldonis
Copy link
Author

Hi Brain, thanks for all the comments. I have fixed everything, but I haven't worked on the unit tests yet. I don't have experience working with them except for a few minutes on another project, so I will have to read your tests and give it a try. I've got a few other things I need to finish so it might not be until next week.

If I remember correctly, the reason I implemented the order lines to reorder the neighbors was so that the neighbor list is exactly consistent with the other results (which I believe are all in increasing order). As you pointed out, I definitely do not need to do that for at least the radii calculation, and it looks like the order of the neighbors isn't important (and judging from your edits I think that's true). However, just a note that while the results don't change, the order of the neighbors isn't identical to the other return statements. After looking at my test runs with your edits, everything looks good. I'll work on the unit tests in the next week or so.

@jjmaldonis
Copy link
Author

Hi Brian,

I have finally written the test functions (and they pass). I had never written real test functions before, so hopefully I covered what was necessary. I was unsure if I should add 'precomputed' to the list in for method in ['brute_force', 'kd_tree', 'ball_tree']: within the TestSimilarityGraphs. test_knn_graph function, or copy-paste the contents of the for-loop to deal with the 'precomputed' option separately. I thought that because the inputs for the 'precomputed' options were different, I shouldn't treat it within the same for-loop, so I went with the copy-paste option. If you would rather have an if statement in the for loop to check for the 'precomputed' option, I can change it.

I see that there is now a merge conflict with some of your recent commits. If this is something you want me to fix just let me know.

Hopefully it's good to go now, but if you have corrections let me know.

Thanks,
Jason

@bpkent
Copy link
Collaborator

bpkent commented Mar 15, 2016

Very cool. I'm busy with work at the moment, but I'll take a look at this update soon. Thanks again for submitting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants