
MRG: Evidence Accumulation Clustering #1830

Closed
wants to merge 55 commits

Conversation

robertlayton
Member

Evidence accumulation clustering (EAC), an ensemble-based clustering framework:
Fred, Ana L. N., and Anil K. Jain. "Data clustering using evidence
accumulation." Proceedings of the 16th International Conference on
Pattern Recognition, Vol. 4. IEEE, 2002.

Basic overview of algorithm:

  1. Cluster the data many times using a clustering algorithm with randomly (within reason) selected parameters.
  2. Create a co-association matrix, which records the number of times each pair of instances were clustered together.
  3. Cluster this matrix.

This seems to work really well: like a kernel method, it makes the clustering "easier" than it was on the original dataset.

The defaults of the algorithm are set up to follow those used by Fred and Jain (2002), whereby the clustering in step 1 is k-means with k selected randomly between 10 and 30. The clustering in step 3 is the MST algorithm, which I have yet to implement (will do in this PR).
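To make the steps concrete, here is a minimal sketch of steps 1 and 2 under those defaults (illustrative only; `coassociation_matrix`, `n_runs` and `k_range` are placeholder names, not this PR's API):

    import numpy as np
    from sklearn.cluster import KMeans

    def coassociation_matrix(X, n_runs=150, k_range=(10, 30), random_state=0):
        # Steps 1 and 2: cluster repeatedly with a randomly chosen k and
        # count how often each pair of samples shares a cluster.
        rng = np.random.RandomState(random_state)
        n_samples = X.shape[0]
        C = np.zeros((n_samples, n_samples))
        for _ in range(n_runs):
            k = rng.randint(k_range[0], k_range[1] + 1)
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=rng.randint(2 ** 31 - 1)).fit(X).labels_
            C += labels[:, None] == labels[None, :]
        return C / n_runs  # fraction of runs in which each pair co-occurred

Step 3 then clusters C itself (1 - C can be treated as a distance matrix), which is where the MST clusterer discussed below comes in.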

After initial feedback, I think people are happy with the API.

TODO:

  • MST algorithm from the paper, which was used as the final clusterer. Completed in PR MRG: Minimal spanning tree (backport from scipy 0.13) #1991
  • A published improvement to the speed of the algorithm (I don't have the paper on hand) should be incorporated (will be done in a later PR)
  • Examples/Usage
  • Narrative documentation
  • Revert test_clustering, line 508, to only check for SpectralClustering
  • Use a sparse matrix for the co-association matrix

bob and others added 8 commits February 28, 2013 22:57
The algorithm works, but isn't very fast or accurate.
Not fast because I haven't optimised, not accurate due to the poor final clusterer (I think)
…ther bug somewhere.

The common clustering test now tells you which clustering algorithm failed (if one does).
… until I finish the MST algorithm.

This required a change to the test, which should be removed after the change.
@jaquesgrobler
Member

I had a read through. Looks very interesting. The API makes sense to me so far 👍
Seems clear enough and isn't hard to follow.
Nice work :)

@satra
Member

satra commented Apr 19, 2013

This looks rather interesting - I have two questions before reading the papers:

  • Could this be used in general across any set of clusters/clustering algorithms?
  • Could this be used in some way to do online learning along these lines (http://arxiv.org/pdf/1209.0237v1.pdf)?

@satra
Member

satra commented Apr 19, 2013

To clarify the "across any set of clusters" comment: currently the API is given a single X and many clustering algorithms. What if it were given a single algorithm but many Xs? In principle, it seems that should work as well.

X = np.asarray(X)
Member


If you're using k-means, then sparse matrix support can be added quite easily by doing atleast2d_or_csr here.
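For context, a sketch of that suggestion; `atleast2d_or_csr` was the input-validation helper in `sklearn.utils` at the time (it has since been replaced by `check_array`):

    from sklearn.utils import atleast2d_or_csr

    # Accepts a dense 2-d array as-is and converts sparse input to CSR,
    # which k-means can consume directly.
    X = atleast2d_or_csr(X)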

@robertlayton
Member Author

@satra Thanks for your comments. You are right on the multiple X question -- I've used that myself, but (1) I can't think of a very clear way to do it and (2) it isn't the "base" algorithm. If you have an idea for solving (1) I'm happy to include it.

@everyone_else, thanks for your comments, I'll finish up the PR.

@satra
Member

satra commented Apr 20, 2013

@robertlayton: how about having a function that takes C, X and clustering_algo and updates C and returns it? i believe you already have it inside the eac function.
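A sketch of that suggestion (hypothetical name, not code from the PR):

    def update_coassociation(C, X, clustering_algo):
        # Fit one clusterer on X and accumulate its co-assignments into C.
        labels = clustering_algo.fit(X).labels_
        C += labels[:, None] == labels[None, :]
        return C

Calling this once per (X, algorithm) pair would cover both many algorithms on one X and one algorithm on many Xs.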

# Co-association matrix, originally zeros everywhere
C = np.zeros((n_samples, n_samples), dtype='float')
num_initial_clusterers = 0
for model in initial_clusterers:
Member


Is it worth doing this fitting in parallel?

Member Author


Definitely. I was going with "get it right, then optimise".
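For illustration, a hypothetical sketch of that optimisation using joblib (assumes the individual fits are independent; not part of this PR):

    from joblib import Parallel, delayed
    from sklearn.base import clone

    def _fit_one(model, X):
        # Clone so each worker fits its own copy of the estimator.
        return clone(model).fit(X)

    fitted_models = Parallel(n_jobs=-1)(
        delayed(_fit_one)(model, X) for model in initial_clusterers)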

@amueller
Member

Question without reading paper or code: is the MST clustering just single-link agglomerative?
If so, can code be reused / refactored from WARD?

Sounds like a very interesting algorithm btw :)
Are there any links to Buhmann's work?

@GaelVaroquaux
Member

If so, can code be reused / refactored from WARD?

No, but I started some work to have a fast version of more agglomerative clustering: https://github.com/GaelVaroquaux/scikit-learn/tree/hc_linkage

I'd love to sprint with someone on that.

@robertlayton
Member Author

For calculating the minimum spanning tree, I see three options:

1. This version in scipy here, but it requires v0.11, which is higher than scikit-learn's current dependency. (Currently it is 0.7, which is probably a little low, but there hasn't been a need for higher so far, I believe.)
2. Use @GaelVaroquaux's code from here
3. Use @amueller's code from here

Thoughts on the best option? I'd rather not reimplement it myself -- it's tricky to optimise properly and pointless if others have already solved this problem.
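For reference, a rough sketch of what option 1 looks like with scipy >= 0.11 (`threshold` here is a hypothetical cut parameter, not something taken from the paper or this PR):

    import numpy as np
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
    from sklearn.metrics import pairwise_distances

    D = pairwise_distances(X)                # dense n_samples x n_samples
    mst = minimum_spanning_tree(D).toarray()

    # Single-link-style clustering: cut MST edges longer than a threshold;
    # the connected components of what remains are the clusters.
    mst[mst > threshold] = 0
    n_clusters, labels = connected_components(mst, directed=False)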

@amueller
Member

I haven't looked at @GaelVaroquaux's code, but I'd vote for backporting scipy.

@amueller
Member

Do you only need the euclidean case? For high-dims, that shouldn't really make a (practical) difference, though...

@robertlayton
Member Author

I think it would be better to use the "proper" method, even if the euclidean case works practically.

@robertlayton
Member Author

What would be the process of backporting from scipy? Any examples I could use?

@jakevdp
Member

jakevdp commented Sep 25, 2013

(1) is very easy. I believe all it takes is to replace the pairwise_distances line with a suitable kneighbors_graph call. I took a similar approach in astroML: https://github.com/astroML/astroML/blob/master/astroML/clustering/mst_clustering.py#L78

Also need to add an n_neighbors parameter to the class constructor.
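A sketch of the suggested change (illustrative, not the final PR code):

    from sklearn.neighbors import kneighbors_graph

    # Before: dense and O(n_samples ** 2) in memory
    # D = pairwise_distances(X, metric=metric)

    # After: a sparse graph keeping only each sample's n_neighbors nearest
    # neighbours, with the distances as edge weights.
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode='distance')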

@robertlayton
Member Author

OK, that doesn't look too bad. I will try to find some time tomorrow to do the change/testing/example update.

What's a good default for n_neighbors?


@jakevdp
Member

jakevdp commented Sep 29, 2013

I did some experiments on a variety of datasets and found that 15-20 neighbors is usually sufficient to recover the true minimum spanning tree.

@robertlayton
Member Author

OK, I just took a stab at #1. I'm not sure if I'm missing something obvious but I can't do a nearest neighbour from a sparse distance matrix (i.e. n_samples by n_samples). Thoughts?

e.g. this won't work (after I put 'precomputed' in the list of valid metrics):

    X = atleast2d_or_csr(X)
    X = pairwise_distances(X, metric=metric)  # dense n_samples x n_samples
    clf = NearestNeighbors(n_neighbors, algorithm='brute', metric='precomputed')
    X = clf.fit(X).kneighbors_graph(clf._fit_X, n_neighbors, mode='distance')

Trying to fix this leads me down a path of updating nearest neighbours and so on. (argsort is used in neighbors/base.py line 301, which is invalid for a sparse matrix.)

Am I missing something obvious?

@jakevdp
Member

jakevdp commented Oct 13, 2013

Hi,
I think something like this should work for either sparse or dense input. For dense input, though, it would be better to not use brute force.

    X = atleast2d_or_csr(X)
    clf = NearestNeighbors(5, algorithm='brute')
    G = clf.fit(X).kneighbors_graph(X, mode='distance')

@robertlayton
Member Author

That will work if X is n_samples by n_features, and will even work in the case of n_samples by n_samples (i.e. a distance matrix) when n_samples is small. The problem is that nearest neighbours will interpret this as n_samples by n_features, and that could be infeasible for large n_samples.

I need to accept n_samples by n_samples, as that is what is returned by the evidence_accumulation_clustering algorithm.

@jakevdp
Member

jakevdp commented Oct 16, 2013

With a dense n_samples x n_samples distance matrix, you could compute the graph with something like this:

    # X is n_samples x n_samples, k is the number of neighbors.
    # Rank each distance within its row (argsort alone returns indices,
    # not ranks), then keep only each row's k smallest entries.
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)
    G = np.where(ranks < k, X, 0)

There might be a clever way to do this in the sparse case as well

@robertlayton
Member Author

OK, I'll open a new PR to get this change in (i.e. "Allow NearestNeighbors to accept a precomputed distance matrix").


@jnothman
Member

jnothman commented Jan 9, 2017

@robertlayton are you still interested in completing this contribution? It received wide acclaim above, although there may be variants to the EAC algorithm introduced in the literature that are worth looking into. Should we open this up to another contributor to complete?

@robertlayton
Member Author

I'd open this up to another contributor -- I'm not active in the research area at the moment, and wouldn't have the time to investigate further options.

@amueller
Member

Should we move this to scikit-learn-extras for now? Adding a tag, feel free to argue ;)

@amueller amueller added the Move to scikit-learn-extra This PR should be moved to the scikit-learn-extras repository label Jul 14, 2019
@robertlayton
Member Author

@amueller go for it!

Base automatically changed from master to main January 22, 2021 10:48
@lorentzenchr
Member

Can we close?

@thomasjpfan
Member

I elect to close, as this PR is marked for scikit-learn-extra. I opened scikit-learn-contrib/scikit-learn-extra#134 to keep track of the move.

Labels
module:cluster, module:utils, Move to scikit-learn-extra, New Feature