[MRG] Remove preprocessing the data for RCA #194
Conversation
```python
if self.pca_comps is not None:
  pca = decomposition.PCA(n_components=self.pca_comps)
  X_t = pca.fit_transform(X)
  M_pca = pca.components_
```
Note that this code was also providing a PCA initialization at the same time, so for now we'll remove it. But I think I'll do the PR about initialization before merging this PR into master, and then we can merge it into this PR to keep the same possibility of initialization with PCA.
There is still a warning that mentions pca_comps to update. If the matrix is low-rank (due to a high dimension compared to the number of points), we can suggest that the user apply a dimensionality reduction technique such as PCA as preprocessing, unless we have a better idea.
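A minimal sketch of the user-side PCA preprocessing such a warning could suggest (the shapes here are illustrative, not from the PR):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# 20 samples in 50 dimensions: the data covariance is low-rank,
# which is the situation where the warning would suggest PCA.
X = rng.randn(20, 50)

# Reduce dimension before fitting RCA, instead of relying on pca_comps.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (20, 10)
```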
# Conflicts:
#   metric_learn/rca.py
#   test/metric_learn_test.py
#   test/test_base_metric.py
I think this PR is ready to merge as soon as the initialization PR is merged (#195)
# Conflicts:
#   metric_learn/rca.py
#   test/metric_learn_test.py
#   test/test_base_metric.py
@bellet @perimosocordiae I guess this PR is ready to merge now |
```python
  X_t = pca.fit_transform(X)
  M_pca = pca.components_
else:
  X_t = X - X.mean(axis=0)
```
why is this centering step gone?
I guess because we should remove any preprocessing step, but I agree I didn't mention it at all. Maybe we should keep the ChangedBehaviorWarning message below, but replace "no longer trained on a preprocessed version" with "no longer trained on centered data by default", and encourage using a StandardScaler if needed?
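For reference, a small sketch (names and data are illustrative) of how a user could reproduce the removed centering step with StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 5) * 10 + 3  # hypothetical uncentered data

# with_std=False removes only the mean, matching the old
# X - X.mean(axis=0) step that RCA used to do internally.
scaler = StandardScaler(with_mean=True, with_std=False)
X_centered = scaler.fit_transform(X)

print(np.allclose(X_centered, X - X.mean(axis=0)))  # True
```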
Fair enough (I double-checked, and this centering is not part of standard RCA).
Maybe keep the ChangedBehaviorWarning but change it to "no longer center the data before training RCA" (no need to mention a scaler, I think).
And in the deprecation warning, add that PCA preprocessing should now be done by the user.
Finally, have you checked the influence of removing the centering step on the examples?
metric_learn/rca.py
```python
else:
  X_t = X - X.mean(axis=0)
  M_pca = None
  warnings.warn("RCA will no longer be trained on a preprocessed version "
```
Do we need a ChangedBehaviorWarning? The default behavior was pca_comps=None.
Maybe the information in this warning should rather go into the deprecation warning?
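A hypothetical sketch (not the actual metric-learn code) of folding the preprocessing message into the pca_comps deprecation warning:

```python
import warnings

def fit_rca(X, pca_comps=None):
    # Illustrative only: raise a single DeprecationWarning that also tells
    # the user to do PCA preprocessing themselves from now on.
    if pca_comps is not None:
        warnings.warn(
            "'pca_comps' is deprecated; RCA will no longer preprocess the "
            "data itself. Apply PCA (e.g. sklearn.decomposition.PCA) before "
            "fitting if you need dimensionality reduction.",
            DeprecationWarning)
    return X

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fit_rca([[0.0, 1.0]], pca_comps=2)
print(caught[0].category.__name__)  # DeprecationWarning
```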
see above
I think we should address #194 (comment), and then we can merge.
Done. I'll just quickly check what it changes on iris, for instance, as a sanity check, as you recommended.
With the same example as before:

```python
from metric_learn import RCA_Supervised
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=100, n_classes=3, n_clusters_per_class=2,
                           n_informative=3, class_sep=4., n_features=5,
                           n_redundant=0, shuffle=True, random_state=42,
                           scale=[1, 1, 20, 20, 20])

tsne = TSNE()
X_e = tsne.fit_transform(X)
plt.figure()
plt.scatter(X_e[:, 0], X_e[:, 1], c=y)
plt.show()

rca = RCA_Supervised(num_chunks=30, chunk_size=2)
X_r = rca.fit_transform(X, y)
X_er = tsne.fit_transform(X_r)
plt.figure()
plt.scatter(X_er[:, 0], X_er[:, 1], c=y)
plt.show()
```

Here are the t-SNE plots after RCA transformation:
So I guess we're OK to merge now |
Fixes #125
This PR removes the preprocessing of the data in the algorithms.
TODO: