MRG: Evidence Accumulation Clustering #1830
Conversation
The algorithm works, but isn't very fast or accurate. Not fast because I haven't optimised, not accurate due to the poor final clusterer (I think)
…ther bug somewhere. The common clustering test now tells you which clustering algorithm failed (if one does).
… until I finish the MST algorithm. This required a change to the test, which should be removed after the change.
I had a read through. Looks very interesting. The API makes sense to me so far 👍
This looks rather interesting - I have two questions before reading the papers:
to clarify the …
sklearn/cluster/eac.py
accumulation." Pattern Recognition, 2002. Proceedings. 16th International | ||
Conference on. Vol. 4. IEEE, 2002. | ||
""" | ||
X = np.asarray(X) |
If you're using k-means, then sparse matrix support can be added quite easily by doing `atleast2d_or_csr` here.
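A minimal sketch of that suggestion, assuming the `atleast2d_or_csr` helper from the scikit-learn utils of that era (since superseded by `check_array`); the `fit` wrapper is hypothetical:

```python
from sklearn.utils import atleast2d_or_csr

def fit(self, X):
    # Accept dense arrays and CSR sparse matrices alike; the k-means
    # base clusterers downstream can handle both representations.
    X = atleast2d_or_csr(X)
    ...
```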
@satra Thanks for your comments. You are right on the multiple X question -- I've used that myself, but (1) I can't think of a very clear way to do it and (2) it isn't the "base" algorithm. If you have an idea for solving (1) I'm happy to include it. @everyone_else, thanks for your comments, I'll finish up the PR.
@robertlayton: how about having a function that takes …
```python
# Co-association matrix, originally zeros everywhere
C = np.zeros((n_samples, n_samples), dtype='float')
num_initial_clusterers = 0
for model in initial_clusterers:
```
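For context, a self-contained sketch of the accumulation step this loop is building; `C`, `n_samples`, and `initial_clusterers` come from the diff, the rest is assumed:

```python
import numpy as np

def coassociation_matrix(X, initial_clusterers):
    """Fraction of base clusterings in which each pair of samples
    lands in the same cluster (the EAC co-association matrix)."""
    n_samples = X.shape[0]
    C = np.zeros((n_samples, n_samples))
    for model in initial_clusterers:
        labels = model.fit(X).labels_
        # Add 1 for every pair assigned to the same cluster.
        C += (labels[:, None] == labels[None, :])
    return C / len(initial_clusterers)
```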
Is it worth doing this fitting in parallel?
Definitely. I was going with "get it right, then optimise".
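A hedged sketch of how that parallel fitting could look with joblib (which scikit-learn relies on; at the time it was vendored as `sklearn.externals.joblib`). The helper name is an assumption:

```python
from joblib import Parallel, delayed

def _fit_labels(model, X):
    # Fit one base clusterer and return its label assignment.
    return model.fit(X).labels_

# Fit all base clusterers across the available cores.
all_labels = Parallel(n_jobs=-1)(
    delayed(_fit_labels)(model, X) for model in initial_clusterers)
```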
Question without reading paper or code: is the MST clustering just single-link agglomerative? Sounds like a very interesting algorithm btw :)
No, but I started some work to have a fast version of more agglomerative methods. I'd love to sprint with someone on that.
For calculating the minimum spanning tree, I see three options:

1. This version in scipy, but it requires v0.11, which is higher than scikit-learn's current dependency. (Currently it is 0.7, which is probably a little low, but there hasn't been a need for higher so far, I believe.)

Thoughts on the best option? I'd rather not reimplement it myself -- it's tricky to optimise properly, and pointless if others have already solved this problem.
I haven't looked at it, @GaelVaroquaux, but I'd vote for backporting scipy.
Do you only need the euclidean case? For high-dims, that shouldn't really make a (practical) difference, though...
I think it would be better to use the "proper" method, even if the euclidean case works practically.
What would be the process of backporting from scipy? Any examples I could use?
(1) is very easy. I believe all it takes is to replace the … Also need to add a …
OK, that doesn't look too bad. I will try to find some time tomorrow to do the … What's a good default for n_neighbors?
I did some experiments on a variety of datasets and found that 15-20 neighbors is usually sufficient to recover the true minimum spanning tree.
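Putting those two threads together, a sketch of the approach under discussion (assumes SciPy >= 0.11 for `minimum_spanning_tree`; the 20-neighbor setting follows the experiment above, and the toy data is illustrative):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(200, 2)

# Sparse k-NN distance graph: with 15-20 neighbors this usually
# contains every edge of the true MST, so we never need the full
# dense n_samples x n_samples distance matrix.
G = kneighbors_graph(X, n_neighbors=20, mode='distance')
mst = minimum_spanning_tree(G)  # sparse matrix of MST edge weights
```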
OK, I just took a stab at #1. I'm not sure if I'm missing something obvious, but I can't do a nearest neighbour from a sparse distance matrix (i.e. n_samples by n_samples). Thoughts? e.g. this won't work (after I put 'precomputed' in the list of valid distance metrics):

```python
X = atleast2d_or_csr(X)
X = pairwise_distances(X, metric=metric)
clf = NearestNeighbors(n_neighbors, algorithm='brute', metric='precomputed')
X = clf.fit(X).kneighbors_graph(X._fit_X, n_neighbors, mode='distance')
```

Trying to fix this leads me down a path of updating nearest neighbours and so on. (argsort is used in neighbors/base.py line 301, which is invalid for a sparse matrix.) Am I missing something obvious?
Hi,

```python
X = atleast2d_or_csr(X)
clf = NearestNeighbors(5, algorithm='brute')
G = clf.fit(X).kneighbors_graph(X, mode='distance')
```
That will work if X is n_samples by n_features, and will even work in the case of n_samples by n_samples (i.e. a distance matrix) when n_samples is small. The problem is that nearest neighbours will interpret this as n_samples by n_features, and that could be infeasible for large n_samples. I need to accept n_samples by n_samples, as that is what is returned by the evidence_accumulation_clustering algorithm.
With a dense matrix:

```python
# X is n_samples x n_samples
# k is number of neighbors
G = X[np.argsort(X, 1) <= k]
```

There might be a clever way to do this in the sparse case as well.
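As written, that snippet masks on sorted index values rather than ranks and flattens the result, so a working version of the same idea (my reconstruction, not from the PR) would rank each row first:

```python
import numpy as np

def knn_sparsify(X, k):
    """Keep the k smallest entries in each row of a dense
    n_samples x n_samples distance matrix, zeroing the rest."""
    # Double argsort converts each entry into its rank within its row.
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)
    # Rank 0 is the self-distance; ranks 1..k-1 are the neighbors.
    return np.where(ranks < k, X, 0.0)
```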
OK, I'll open a new PR to get this change in (i.e. "Allow NearestNeighbors …").
@robertlayton are you still interested in completing this contribution? It received wide acclaim above, although there may be variants to the EAC algorithm introduced in the literature that are worth looking into. Should we open this up to another contributor to complete?
I'd open this up to another contributor -- I'm not active in the research area at the moment, and wouldn't have the time to investigate further options.
Should we move this to scikit-learn-extras for now? Adding a tag, feel free to argue ;)
@amueller go for it!
Can we close?
I elect to close as this PR is marked for scikit-learn-extras.
Evidence accumulation clustering: EAC, an ensemble based clustering framework:
Fred, Ana L. N., and Anil K. Jain. "Data clustering using evidence accumulation." Proceedings of the 16th International Conference on Pattern Recognition (ICPR 2002), Vol. 4. IEEE, 2002.
Basic overview of algorithm:

1. Generate a large number of clusterings of the dataset.
2. Accumulate the results in a co-association matrix, whose (i, j) entry is the fraction of clusterings that place samples i and j in the same cluster.
3. Cluster the co-association matrix to produce the final clustering.
This seems to work really well, like a kernel method, making the clustering "easier" than it was for the original dataset.
The defaults of the algorithm are set up to follow those used by Fred and Jain (2002), whereby the clustering in step 1 is k-means with k selected randomly between 10 and 30. The clustering in step 3 is the MST algorithm, which I have yet to implement (will do in this PR).
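A sketch of those step-1 defaults; only the 10-30 range for k comes from the description, while the ensemble size and variable names are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
n_runs = 150  # assumed; the paper's exact ensemble size may differ

# k-means base clusterers with k drawn uniformly from [10, 30]
# for each run, as in Fred and Jain (2002).
initial_clusterers = [
    KMeans(n_clusters=rng.randint(10, 31), n_init=1,
           random_state=rng.randint(2 ** 31 - 1))
    for _ in range(n_runs)
]
```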
After initial feedback, I think people are happy with the API.
TODO:
There is a published improvement to the speed of the algorithm (I don't have the paper on hand) that should be incorporated (will be done in a later PR).