Added functionality to allow X to be a precomputed distance matrix #55
This is useful if the distance metric is expensive to calculate and the user wants to speed up multiple runs of DeBaCl by inputting precomputed distance values. In some uncommon cases, data may not be easily represented in N-dimensional space, but distance values for the data may be readily available. This code provides functionality for those less common use cases.
Hi @jjmaldonis, thanks for contributing! My first priority is to push a new version to PyPI with a critical bugfix, but I agree with your point about making DeBaCl easier to use for people who have precomputed the distance matrix, so I'll get back to this review in a couple of days.
@@ -1260,6 +1260,72 @@ def construct_tree(X, k, prune_threshold=None, num_levels=None, verbose=False):
    return tree


def construct_tree_from_precomputed_matrix(X, p, k, prune_threshold=None, num_levels=None, verbose=False):
Can we change "precomputed_matrix" to "distance_matrix"? I think it would be a little more explicit.
One general comment, in addition to the in-line suggestions I've made above: the two new functionalities should be covered by unit testing before being merged into master. If you're interested in working on that, super. If not, totally fine, just let me know, and I'll work on it from my end. Thanks again for the PR, I think this is a good addition!
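As a rough illustration of the kind of check such unit tests might include, the sketch below validates that an input really looks like a distance matrix (square, symmetric, non-negative, zero diagonal). The helper name `is_valid_distance_matrix` and its tolerance argument are hypothetical and are not part of DeBaCl; the PR's actual tests may look quite different.

```python
def is_valid_distance_matrix(dist, tol=1e-12):
    """Hypothetical helper: True if `dist` is square, symmetric,
    non-negative, and has a zero diagonal (within `tol`)."""
    n = len(dist)
    # Every row must have exactly n entries for the matrix to be square.
    if any(len(row) != n for row in dist):
        return False
    for i in range(n):
        # The distance from a point to itself must be zero.
        if abs(dist[i][i]) > tol:
            return False
        for j in range(n):
            # Distances are non-negative and symmetric.
            if dist[i][j] < 0 or abs(dist[i][j] - dist[j][i]) > tol:
                return False
    return True


good = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.5],
        [2.0, 1.5, 0.0]]
asymmetric = [[0.0, 1.0],
              [3.0, 0.0]]
not_square = [[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5]]
```

A test for the new function could then assert that valid input is accepted and that malformed input (asymmetric or non-square) is rejected before any tree is built.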
Hi Brian, thanks for all the comments. I have fixed everything, but I haven't worked on the unit tests yet. I don't have experience working with them except for a few minutes on another project, so I will have to read your tests and give it a try. I've got a few other things I need to finish, so it might not be until next week. If I remember correctly, the reason I implemented the
Hi Brian, I have finally written the test functions (and they pass). I had never written real test functions before, so hopefully I covered what was necessary. I was unsure if I should add I see that there is now a merge conflict with some of your recent commits. If this is something you want me to fix just let me know. Hopefully it's good to go now, but if you have corrections let me know. Thanks,
Very cool. I'm busy with work at the moment, but I'll take a look at this update soon. Thanks again for submitting!
This is useful if the distance metric is expensive to calculate and the user wants to speed up multiple runs of DeBaCl by pre-calculating the distance matrix and inputting precomputed values.
In addition, in some uncommon cases, data may not be easily represented in N-dimensional space, but distance values for the data may be readily available. This code provides functionality for these less common use cases.
Scikit-learn provides this functionality for many of its methods, as does Leland McInnes's hdbscan library. I use these libraries regularly and have found the precomputed option quite useful.
I would be happy to explain anything that isn't clear enough, and I hope the code is written to be as computationally inexpensive as possible.
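To make the motivation concrete, here is a minimal standalone sketch (plain Python, not DeBaCl code) of why precomputing helps: the expensive metric is evaluated once per pair of points, and the resulting matrix is then reused to derive k-nearest-neighbor radii for several values of `k`. The function names `pairwise_distances` and `knn_radius` are illustrative only, and counting each point as its own first neighbor is an assumption, not a statement about DeBaCl's convention.

```python
import math


def pairwise_distances(points, metric):
    """Evaluate `metric` once per unordered pair and return the
    full symmetric distance matrix as a list of lists."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(points[i], points[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist


def knn_radius(dist, k):
    """Distance from each point to its k-th nearest neighbor.
    Assumption: the point itself counts as the first neighbor."""
    return [sorted(row)[k - 1] for row in dist]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]

# The (possibly expensive) metric is evaluated only here.
dist = pairwise_distances(points, math.dist)

# Later runs with different k reuse the same matrix.
radii_k2 = knn_radius(dist, 2)
radii_k3 = knn_radius(dist, 3)
```

Any callable that takes two items and returns a non-negative number can stand in for `math.dist`, which is what makes this approach work for data with no natural coordinate representation.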