
[MRG] Remove preprocessing the data for RCA #194

Merged

Conversation

@wdevazelhes (Member) commented Apr 23, 2019

Fixes #125

This PR removes the preprocessing of the data in the algorithms.

TODO:

if self.pca_comps is not None:
    pca = decomposition.PCA(n_components=self.pca_comps)
    X_t = pca.fit_transform(X)
    M_pca = pca.components_
@wdevazelhes (Member Author):

Note that this code was also providing a PCA initialization at the same time, so for now we'll remove it. But I think I'll do the PR about initialization before merging this PR into master, and then we can merge it into this PR to keep the same possibility of initializing with PCA.
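For context (a minimal sketch, not part of this PR's diff): once the pca_comps preprocessing is removed, a user who wants the old behaviour can simply run PCA themselves before fitting RCA. The dataset and n_components value below are purely illustrative; the RCA_Supervised parameters follow the usage shown later in this thread.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from metric_learn import RCA_Supervised

X, y = make_classification(n_samples=100, n_features=20, random_state=42)

# Reduce the dimension manually, as the removed pca_comps option used to do.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

# Fit RCA on the PCA-reduced data.
rca = RCA_Supervised(num_chunks=30, chunk_size=2)
X_rca = rca.fit_transform(X_reduced, y)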

@bellet (Member) commented Apr 23, 2019

There is still a warning that mentions pca_comps which needs to be updated. If the matrix is low-rank (due to the dimension being high compared to the number of points), we can suggest that the user apply a dimensionality reduction technique such as PCA as preprocessing, unless we have a better idea.
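A rough sketch of the kind of check and suggestion described above (the helper name and the exact condition are illustrative assumptions, not the library's actual code):

import warnings
import numpy as np


def warn_if_low_rank(X):
    # Hypothetical helper: if the data matrix is rank-deficient (e.g. more
    # features than points), suggest dimensionality reduction as preprocessing.
    n_samples, n_features = X.shape
    if np.linalg.matrix_rank(X) < n_features:
        warnings.warn("The input data is low-rank: consider reducing its "
                      "dimensionality (e.g. with PCA) before fitting RCA.")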

@wdevazelhes (Member Author)

I think this PR is ready to merge as soon as the initialization PR is merged (#195)

# Conflicts:
#	metric_learn/rca.py
#	test/metric_learn_test.py
#	test/test_base_metric.py
@wdevazelhes (Member Author)

@bellet @perimosocordiae I guess this PR is ready to merge now

wdevazelhes changed the title from "[WIP] Remove preprocessing the data for RCA" to "[MRG] Remove preprocessing the data for RCA" on Jun 11, 2019
    X_t = pca.fit_transform(X)
    M_pca = pca.components_
else:
    X_t = X - X.mean(axis=0)
Member:

why is this centering step gone?

@wdevazelhes (Member Author):

I guess because we should remove any preprocessing step, but I agree I didn't mention it at all. Maybe we should keep the ChangedBehaviorWarning message below, but replace "no longer trained on a preprocessed version" with "no longer trained on centered data by default", and encourage using a StandardScaler if needed?

Member:

Fair enough (I double-checked, and this centering is not part of standard RCA).
Maybe keep the ChangedBehaviorWarning but change it to "no longer center the data before training RCA" (no need to mention the scaler, I think).
And in the deprecation warning, add the fact that PCA preprocessing should now be done by the user.

Finally, have you checked the influence of removing the centering step on the examples?
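As an aside on the centering discussion: a minimal sketch of how a user could reproduce the old centering outside the estimator, assuming scikit-learn's StandardScaler (with_std=False gives mean removal only, equivalent to X - X.mean(axis=0)); the data here is just a placeholder.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).randn(50, 5)

# Mean-center the data, as RCA used to do internally by default.
X_centered = StandardScaler(with_std=False).fit_transform(X)

# Equivalent manual centering.
assert np.allclose(X_centered, X - X.mean(axis=0))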

else:
    X_t = X - X.mean(axis=0)
    M_pca = None
warnings.warn("RCA will no longer be trained on a preprocessed version "
Member:

Do we need a ChangedBehaviorWarning? The default behavior was pca_comps=None.
Maybe the information in this warning should rather go into the deprecation warning?
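For illustration only, folding that information into the pca_comps deprecation warning could look roughly like this; the helper and the wording are assumptions, not the library's actual message.

import warnings


def _warn_deprecated_pca_comps(pca_comps):
    # Hypothetical helper: warn when the deprecated parameter is set.
    if pca_comps is not None:
        warnings.warn(
            "'pca_comps' is deprecated and no longer has any effect: RCA does "
            "not apply PCA preprocessing anymore. If you need dimensionality "
            "reduction, apply PCA to the data yourself before fitting.",
            DeprecationWarning)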

@wdevazelhes (Member Author):

see above

@bellet (Member) commented Jun 12, 2019

I think we just need to address #194 (comment), and then we can merge.

@wdevazelhes (Member Author)

> I think we just need to address #194 (comment), and then we can merge.

Done. I'll just quickly check what it changes on iris, for instance, as a sanity check, as you recommended.

@wdevazelhes (Member Author)

With the same example as in plot_metric_learning_examples.py, it seems that removing the centering works fine:

from metric_learn import RCA_Supervised
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Generate a dataset where some features have a much larger scale.
X, y = make_classification(n_samples=100, n_classes=3, n_clusters_per_class=2,
                           n_informative=3, class_sep=4., n_features=5,
                           n_redundant=0, shuffle=True, random_state=42,
                           scale=[1, 1, 20, 20, 20])

# t-SNE embedding of the raw data.
tsne = TSNE()
X_e = tsne.fit_transform(X)
plt.figure()
plt.scatter(X_e[:, 0], X_e[:, 1], c=y)
plt.show()

# t-SNE embedding after the RCA transformation.
rca = RCA_Supervised(num_chunks=30, chunk_size=2)
X_r = rca.fit_transform(X, y)
X_er = tsne.fit_transform(X_r)
plt.figure()
plt.scatter(X_er[:, 0], X_er[:, 1], c=y)
plt.show()

Here are the t-SNE plots after the RCA transformation:

For this PR: [t-SNE scatter plot of the RCA-transformed data]

For master: [t-SNE scatter plot of the RCA-transformed data]

@wdevazelhes (Member Author)

So I guess we're OK to merge now

@bellet bellet merged commit 8518517 into scikit-learn-contrib:master Jun 12, 2019
Successfully merging this pull request may close these issues.

Do we want to preprocess data within our algorithms?