RFC: Composable input/output pipeline #234
---
By "exact Scikit-Learn API," you mean "the exact names of Scikit-learn transformers," right? And the proposed "Scikit-learn-like interface" is public, correct?
If the answers to both of those questions are affirmative, 👎 to that interface. How can this RFC be reworked to conform with the "exact Scikit-Learn API"?
---
By "exact Scikit-Learn API" I mean roughly:

- a `fit` method with the signature `fit(X, y=None) -> self`
- a `transform` method with the signature `transform(X) -> X'`
- an `inverse_transform` method with the signature `inverse_transform(X') -> X`

And no other methods (that are part of the API).

Also, `inverse_transform(transform(X)) == X` is not a requirement of the API, but it certainly is the spirit of it (and how most sklearn transformers are implemented, when it makes sense to do so).

Yes, it would be public and consist of `ArrayTransformer` and `DatasetTransformer`. These can be base classes, protocols/interfaces or just duck-typing (an implementation detail).

The idea is that users could take our default data validation/transformations and mix and match with their own for use cases like multi-output models or multi-output class reweighting.
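As a concrete illustration, a transformer following this exact API to the letter might look like the sketch below. The `LogTransformer` name and its log-scaling behavior are hypothetical examples, not part of the RFC:

```python
import numpy as np

# Hypothetical transformer following the exact Scikit-Learn transformer API:
# fit(X, y=None) -> self, transform(X) -> X', inverse_transform(X') -> X,
# and nothing else. The log1p behavior is purely illustrative.
class LogTransformer:
    def fit(self, X, y=None):
        # Nothing to learn here; real transformers store fitted state.
        return self

    def transform(self, X):
        return np.log1p(X)

    def inverse_transform(self, Xt):
        # Satisfies the "spirit": inverse_transform(transform(X)) == X.
        return np.expm1(Xt)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
t = LogTransformer().fit(X)
assert np.allclose(t.inverse_transform(t.transform(X)), X)
```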
This proposed interface violates the sklearn interface in several ways:

- `transform` operates on `X`, `y` and `sample_weight` in `ArrayTransformer`.
- `transform` operates on a `tf.data.Dataset` object in `DatasetTransformer`.
- `inverse_transform` operates on `y'` (y predictions) and/or `y` itself, so `inverse_transform(transform(X)) != X`.
- Transformers have access to `model` (this is an addition to the API; it doesn't strictly break the implementation, but it does break the spirit).

I don't think it can be easily reworked: there are too many limitations in the sklearn transformer API (and the authors have said as much themselves).
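To make the mismatch concrete, here is a minimal, hypothetical implementation of the proposed `ArrayTransformer` shape. The method signatures are my reading of the discussion, not verbatim from the RFC:

```python
import numpy as np

# Hypothetical minimal ArrayTransformer. Unlike an sklearn transformer,
# transform operates on the (X, y, sample_weight) triple rather than X
# alone, and inverse_transform operates on y predictions, so
# inverse_transform(transform(X)) != X by construction.
class IdentityArrayTransformer:
    def fit(self, X, y=None, sample_weight=None):
        return self

    def transform(self, X, y=None, sample_weight=None):
        return X, y, sample_weight  # the whole triple, not just X'

    def inverse_transform(self, y_pred):
        return y_pred  # touches y', never X

X, y = np.ones((2, 2)), np.zeros(2)
t = IdentityArrayTransformer().fit(X, y)
out = t.transform(X, y)
assert isinstance(out, tuple) and len(out) == 3
```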
The good news is that the Scikit-Learn API is a strict subset of this API, so we could provide wrappers to convert the Scikit-Learn transformer API into this one (e.g. by specifying what `X` is to your transformer and dispatching the methods).

But let me ask: why is it important to adhere to the Scikit-Learn API? I know there are good reasons, I just want to understand which you are thinking of.

My thoughts are that if we can't interoperate directly (i.e. use `make_pipeline` and other facilities of sklearn), there is not much value in being semi-compatible. I also think there aren't that many things (beyond pipelines) that would be useful here. But this may be short-sighted; I'm open to other opinions.
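The wrapper idea mentioned above might look roughly like this. `SklearnToArrayTransformer` and its `field` parameter are hypothetical names for the "specify what X is and dispatch the methods" adapter:

```python
# Hypothetical adapter: reuse an unmodified sklearn-style transformer
# inside the richer interface by declaring which field it should see.
class SklearnToArrayTransformer:
    def __init__(self, transformer, field="X"):
        self.transformer = transformer
        self.field = field  # which input the sklearn transformer treats as X

    def fit(self, X, y=None, sample_weight=None):
        self.transformer.fit(X if self.field == "X" else y)
        return self

    def transform(self, X, y=None, sample_weight=None):
        if self.field == "X":
            return self.transformer.transform(X), y, sample_weight
        return X, self.transformer.transform(y), sample_weight


class TimesTwo:  # stand-in for any sklearn-style transformer
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X * 2


t = SklearnToArrayTransformer(TimesTwo(), field="y").fit([1], y=[2])
X, y, sw = t.transform([1], y=[2])
assert (X, y, sw) == ([1], [2, 2], None)
```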
---
This library has one job: to bring the Scikit-learn API to Keras. That means there needs to be a really good reason to break the Scikit-learn API.
That reason needs to consider alternative implementations, including the one that exactly follows the Scikit-learn API. I imagine some questions that need to be answered include the following:
---
👎
Why not follow the Scikit-Learn API?
Yes, TF uses `set_model`. Why should their boilerplate be followed?

---
This would require users to give us a class to initialize (this object is being passed to the constructor that creates `model`).

I guess we could use parameter routing to allow setting any other parameters. Does this look any better?

An alternative would be to require users to bind any other parameters using `functools.partial` or something.
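For illustration, the `functools.partial` alternative could look like this. `MyTransformer` and its `scale` parameter are hypothetical:

```python
from functools import partial

# Hypothetical user transformer whose constructor needs `model` (supplied
# by the wrapper) plus extra parameters the user wants to control.
class MyTransformer:
    def __init__(self, model, scale=1.0):
        self.model = model
        self.scale = scale

# Instead of parameter routing, the user pre-binds the extras and hands
# the wrapper a factory that only needs `model`:
factory = partial(MyTransformer, scale=0.5)
t = factory(model="a keras model would go here")
assert t.scale == 0.5
```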
---
Why not rename these functions to `inverse_transform`?
---
Because their signature and usage do not match `inverse_transform`'s signature. They wouldn't work in an sklearn `Pipeline`, nor would they be chainable with sklearn estimators.
---
These issues seem to be around modular validation/transforming. This RFC proposes one solution. What are other solutions? Why does this RFC represent the best solution?
---
This is, of course, a great question. I think general solutions would be variations of the same idea, perhaps less structured (e.g. hardcoding that `tf.data.Dataset` inputs should skip `BaseWrapper._validate_data`). I'd have to think a bit to see if I can come up with any other structured approaches that might also solve the issues.
---
Why not have two Scikit-learn transformers, one for validation and one to change the data?
---
We would need:
I think this is too many interfaces. The only difference between a "validation" and a "transformation" is that the transformation needs to return the data, but the validation does not. So by having the validation return the data we can collapse those two concepts into one. Validation transformers simply inspect the data but do not modify it.
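For illustration, a validation-only step under this collapsed interface might look like the following. `ShapeValidator` and its specific check are hypothetical:

```python
import numpy as np

# Hypothetical validation-only transformer: it inspects the data and
# returns it unchanged, so it composes with real transformations in the
# same pipeline without needing a separate "validator" interface.
class ShapeValidator:
    def __init__(self, n_features):
        self.n_features = n_features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if X.ndim != 2 or X.shape[1] != self.n_features:
            raise ValueError(
                f"expected 2D input with {self.n_features} features, got {X.shape}"
            )
        return X  # validation: pass the data through unmodified

X = np.zeros((4, 3))
assert ShapeValidator(3).fit(X).transform(X) is X
```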
---
What use case/issue motivates having separate classes for Dataset/array-like inputs? Why not collapse them into one class?

Why is 4 interfaces "too many"? Why is the current framework with `target_encoder_` and `feature_encoder_` not sufficient?

I'm asking questions to get answers encoded in the RFC (an RFC is a really good idea to encode these decisions).