ENH: add dependency injection point to transform X & y together #167

Open

adriangb wants to merge 50 commits into master

Conversation

adriangb (Owner)

Resolves #160

codecov-io commented Jan 16, 2021

Codecov Report

Merging #167 (6eee3c4) into master (e36f03f) will decrease coverage by 0.43%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
- Coverage   99.70%   99.26%   -0.44%     
==========================================
  Files           6        5       -1     
  Lines         669      678       +9     
==========================================
+ Hits          667      673       +6     
- Misses          2        5       +3     
Impacted Files                    Coverage Δ
scikeras/utils/transformers.py    100.00% <100.00%> (ø)
scikeras/wrappers.py              99.44% <100.00%> (+<0.01%) ⬆️
scikeras/_saving_utils.py         96.20% <0.00%> (-3.80%) ⬇️
scikeras/__init__.py

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e36f03f...6eee3c4. Read the comment docs.

adriangb (Owner, Author)

@stsievert if you are able to, a review of this new interface would be much appreciated. The best place to start is probably sections 6 and 7 of this notebook.

Comment on lines 412 to 437
class ClassWeightDataTransformer(BaseEstimator, TransformerMixin):
"""Default dataset_transformer for KerasClassifier.

This transformer implements handling of the `class_weight` parameter
for single output classifiers.
"""

def __init__(self, class_weight: Optional[Union[str, Dict[int, float]]] = None):
self.class_weight = class_weight

def fit(
self,
data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]],
dummy: None = None,
) -> "ClassWeightDataTransformer":
return self

def transform(
self, data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]
) -> Tuple[np.ndarray, Union[np.ndarray, None], Union[np.ndarray, None]]:
X, y, sample_weight = data
if self.class_weight is None or y is None:
return (X, y, sample_weight)
sample_weight = 1 if sample_weight is None else sample_weight
sample_weight *= compute_sample_weight(class_weight=self.class_weight, y=y)
return (X, y, sample_weight)
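
For illustration, a quick usage sketch of the transformer above (the values are hypothetical; assumes the class is in scope):

import numpy as np

# imbalanced target: three samples of class 0, one of class 1
t = ClassWeightDataTransformer(class_weight="balanced")
X, y = np.zeros((4, 2)), np.array([0, 0, 0, 1])
X_out, y_out, sample_weight = t.fit_transform((X, y, None))
# "balanced" upweights the rare class: sample_weight ≈ [0.67, 0.67, 0.67, 2.0]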
adriangb (Owner, Author)

There is another option here that may be cleaner: adding another dependency injection point that runs in "parallel" to feature_encoder and target_encoder, specifically for sample_weight. But since one would need at least y to process sample_weight anyway, it might be a bit redundant.
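
Purely as a sketch of that alternative (the class and its name are hypothetical, not part of this PR): a separate encoder running alongside feature_encoder and target_encoder that derives sample_weight from y:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.class_weight import compute_sample_weight

class SampleWeightEncoder(BaseEstimator, TransformerMixin):  # hypothetical name
    def __init__(self, class_weight=None):
        self.class_weight = class_weight

    def fit(self, y, dummy=None):
        return self

    def transform(self, y):
        # as noted above, this still needs y, which is what makes it redundant
        if self.class_weight is None or y is None:
            return None
        return compute_sample_weight(class_weight=self.class_weight, y=y)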

.. code-block::

                     ↗ feature_encoder ↘
    your data → sklearn-ecosystem → SciKeras dataset_transformer → Keras
                     ↘ target_encoder  ↗
stsievert (Collaborator)

I think this diagram is really useful.

Would this be a better diagram?

                                   ↗ feature_encoder ↘
    SciKeras.fit(features, labels)                    dataset_transformer → Keras.fit(dataset)
                                   ↘ target_encoder  ↗ 

adriangb (Owner, Author)

I do like the addition of .fit. My only worry is that users might think that SciKeras always creates a tf.data.Dataset, which is not the case; by default it gives numpy arrays to Keras.fit. Do you think Keras.fit(dataset or np.array) makes that clear? It could also be dicts of lists or something, but that's at least less common.

stsievert (Collaborator)

What can dataset_transformer return? Only datasets/ndarrays? Or does it support the other arguments of Keras.fit?

adriangb (Owner, Author)

Anything that Keras.fit will accept. Internally, it looks something like this:

X, y, sample_weight = dataset_transformer.fit_transform((X, y, sample_weight))
model.fit(x=X, y=y, sample_weight=sample_weight)  # aka Keras.fit

adriangb (Owner, Author) commented Jan 25, 2021

Maybe Keras.fit(data) is a good way to specify this? That way there's no confusion in interpreting dataset as tf.data.Dataset. I can also add a small code block like in #167 (comment) if that helps explain it.

adriangb (Owner, Author)

@stsievert thank you for the review, the feedback was very valuable. I think I was able to incorporate most of the smaller suggestions.

For the larger stuff, I moved most of the pseudocode and general background out of the notebook into the main docs, and added links back to the docs instead of duplicating the information, as you suggested.

Having read the documentation, do you think this data_transformer interface makes sense? And in particular, do you feel that using it to handle class_weight is problematic?

(eg. to get the number of outputs) but *will* be able to inject data into the model building
function (more on this below). On the other hand,
``data_transformer`` *will* get access to the built Model, but it cannot pass any data to the
model building function.
stsievert (Collaborator)

One important thing to note is that feature_encoder and target_encoder are run before building the Keras Model, while data_transformer is run after the Model is built.

This isn't clear in the example (though it's clear in the edit I've made).

This means that the former two will not have access to the Model
(eg. to get the number of outputs) but will be able to inject data into the model building
function (more on this below). On the other hand,
data_transformer will get access to the built Model, but it cannot pass any data to the
model building function.

I would only say "the output of dataset_transformer gets passed directly to tf.keras.Model.fit through self.model_.fit." All the detail about the number of outputs etc. is confusing. I think I'd refer directly to the advanced examples.

adriangb (Owner, Author)

This isn't clear in the example (though it's clear in the edit I've made).

Thank you for pointing that out! The example is much better with this detail.

I would only say

That seems reasonable, I'll cut it down to ~ what you are suggesting.

stsievert (Collaborator) commented Jan 27, 2021
Thanks. Nothing I say is hard and fast; all my comments are illustrations. If I want a change to be hard and fast, I'll submit a PR.
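
To illustrate the ordering discussed above, a minimal sketch (assuming the tuple interface from this PR; the subclass and its check are hypothetical): because dataset_transformer is constructed after the Model is built, it can close over self.model_, which feature_encoder and target_encoder cannot:

from sklearn.preprocessing import FunctionTransformer
from scikeras.wrappers import KerasClassifier

class MyClassifier(KerasClassifier):  # hypothetical subclass
    @property
    def dataset_transformer(self):
        n_outputs = len(self.model_.outputs)  # the Model is already built here

        def check(data):
            X, y, sample_weight = data
            assert n_outputs == 1, "this sketch only handles single-output models"
            return (X, y, sample_weight)

        return FunctionTransformer(check)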

As per the table above, if you find that your target is classified as
``"multiclass-multioutput"`` or ``"unknown"``, you will have to implement your own data processing routine.

Whole dataset manipulation via data_transformer
stsievert (Collaborator)

I think this section should go directly below the paragraph on dataset transformers.

stsievert (Collaborator)

I'm also unsure about the level of nesting here. It looks like "multi-input" and "whole dataset" are on the same level, something I wouldn't have expected with the titles.

It might be worth collapsing this section into the "data transformers" section. This section really only mentions the signature.

adriangb (Owner, Author)

I'll move it as you suggest. It then becomes duplicated with the paragraph discussed in #167 (comment), which warrants removing the latter.

This is the last step before passing the data to Keras, and it allows for the greatest
degree of customization because SciKeras does not make any assumptions about the output data
and passes it directly to :py:func:`tensorflow.keras.Model.fit`.
Its signature is ``dataset_transformer.fit_transform((X, y, sample_weight))``,
stsievert (Collaborator)

Its signature is this:

from sklearn.base import BaseEstimator

class DatasetTransformer(BaseEstimator):
    def fit(self, X, y, sample_weight=None) -> "DatasetTransformer":
        ...
        return self

    def transform(self, X, y, sample_weight):  # return a valid input for keras.Model.fit
        ...
        return X, y, sample_weight  # option 1
        return tensorflow_dataset  # option 2

adriangb (Owner, Author) commented Jan 27, 2021

I may have designed it wrong, and I'm not a fan of the tuple input either (especially because fit requires a dummy argument), but I believe it is required so that these can be chained in a pipeline.

With the signature you are proposing, this fails:

from sklearn.base import BaseEstimator
from sklearn.pipeline import make_pipeline

class DatasetTransformer(BaseEstimator):
    def fit(self, X, y, sample_weight=None) -> "DatasetTransformer":
        return self

    def transform(self, X, y, sample_weight):  # return a valid input for keras.Model.fit
        return X, y, sample_weight

p = make_pipeline(DatasetTransformer(), DatasetTransformer())

X = [1]
y = [1]

p.fit_transform(X, y)
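# raises TypeError: transform() missing 2 required positional arguments,
# because sklearn's Pipeline only forwards X between steps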

If instead you accept a tuple as your input, then you can chain them:

class DatasetTransformer(BaseEstimator):
    def fit(self, data, dummy=None) -> "DatasetTransformer":
        return self

    def transform(self, data):  # return a valid input for keras.Model.fit
        X, y, sample_weight = data
        return X, y, sample_weight

p = make_pipeline(DatasetTransformer(), DatasetTransformer())

X = [1]
y = [1]

p.fit_transform((X, y, None))
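# works: the whole (X, y, sample_weight) tuple travels through the pipeline as a single X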

Am I missing something?

stsievert (Collaborator)

With the example, I only meant "illustrate the signature with code, not with words." I am unsure what's happening with tuple inputs.

adriangb (Owner, Author)

Oh I see. Yes, that is a good suggestion.

stsievert (Collaborator) commented Jan 27, 2021

Maybe it'd be clearer to have the user return a dict. I'd also modify the tuple input to this transform; why not pass keyword arguments?

import numpy as np
import tensorflow as tf
from typing import Dict, Optional, Union

ValidKerasInput = Union[np.ndarray, tf.data.Dataset]  # etc.

def transform(
    self,
    X: ValidKerasInput,
    y: ValidKerasInput,
    sample_weight: Optional[ValidKerasInput] = None,
) -> Dict[str, ValidKerasInput]:
    ...
    return {"X": X, "y": y, "sample_weight": None}

I'm not sure this signature is correct.

adriangb (Owner, Author)

How would it be messy?

It's minor, but it would now be a dict with dozens of keys, some values being arrays, etc. It can easily be documented, but I can see how at first glance it might be more confusing than a 3 element tuple or a 3 key dict.

Doesn't a user-passed kwarg already get passed to fit?

Yes, but these parameters can't be calculated based on the data, which becomes especially problematic when doing cross-validation and the like. This is the basic problem in #131.

Of course, if users want to pass a fixed parameter like batch_size=42 they can do so via fit (or better yet, via the constructor).

stsievert (Collaborator)

It's minor, but it would now be a dict with dozens of keys,

Why would it have to be dozens of keys? I'm thinking of this code:

# line 885 of wrappers.py
transform_kwargs = self.dataset_transformer_.transform(x=X, y=y, sample_weight=sample_weight)
kwargs.update(transform_kwargs)
self._fit_keras_model(**kwargs)

adriangb (Owner, Author)

I think that implementation in #167 (comment) is not compatible with the sklearn ecosystem, including pipelines, which I think are an important feature. This would be fine:

self.dataset_transformer_.transform(dict(x=X, y=y, sample_weight=sample_weight))

Regarding the multiple keys, what I was thinking was to pass all Model.fit keys. In _fit_keras_model:

params = self.get_params()
fit_args = route_params(params, destination="fit", pass_filter=self._fit_kwargs)
fit_args["epochs"] = initial_epoch + epochs
fit_args["initial_epoch"] = initial_epoch
fit_args.update(kwargs)
fit_args["x"] = X
fit_args["y"] = y
fit_args["sample_weight"] = sample_weight
fit_args = self.dataset_transformer_.transform(fit_args)
if self._random_state is not None:
    with TFRandomState(self._random_state):
        hist = self.model_.fit(**fit_args)

This means that users can not only add keys, but also modify existing keys, to their liking.

Regarding the dozens of keys comment, I wasn't referring to the keys assigned via kwargs.update(transform_kwargs) above; I was more thinking of when a user creates a transformer like:

from typing import Any, Dict

def transform(data: Dict[str, Any]) -> Dict[str, Any]:
    len(data.keys())  # many keys, array-values, etc.
    ...

# for context:
class SomeCustomEstimator(BaseWrapper):
    @property
    def data_transformer(self):
        return FunctionTransformer(transform)
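
As a sketch of the dict-based flow being discussed (assuming transform receives the full Model.fit kwargs as in the snippet above; the helper name is made up), a transformer could rewrite existing keys, e.g. packing everything into a tf.data.Dataset:

from typing import Any, Dict

import tensorflow as tf
from sklearn.preprocessing import FunctionTransformer

def pack_into_dataset(fit_args: Dict[str, Any]) -> Dict[str, Any]:  # hypothetical
    X = fit_args.pop("x")
    y = fit_args.pop("y", None)
    sample_weight = fit_args.pop("sample_weight", None)
    data = (X, y) if sample_weight is None else (X, y, sample_weight)
    fit_args["x"] = tf.data.Dataset.from_tensor_slices(data).batch(
        fit_args.get("batch_size") or 32
    )
    fit_args["batch_size"] = None  # batching is now handled by the Dataset
    return fit_args

dataset_transformer = FunctionTransformer(pack_into_dataset)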

stsievert (Collaborator)

This means that users can not only add keys, but also modify existing keys, to their liking.

Yes, I think that's a good idea. It also leads to simple documentation (something like "the dataset transformer is passed a dict. Normally, this dictionary is unmodified and passed directly to keras.Model.fit. If desired, it's possible to modify this dictionary through a Scikit-learn transformation: [example].")

adriangb (Owner, Author)

Great. I'm going to loop back to #131 and #159 to see if the issue reporters have any input on this implementation.

github-actions bot commented Jan 28, 2021

📝 Docs preview for commit f5df4c4 at: https://www.adriangb.com/scikeras/refs/pull/167/merge/

Successfully merging this pull request may close these issues.

Dealing with variable length inputs