
Dealing with variable length inputs #160

Closed
eliorc opened this issue Jan 6, 2021 · 16 comments · May be fixed by #167

eliorc commented Jan 6, 2021

Let's assume we are working with variable-length inputs. One of the strongest features of tf.data.Dataset is the ability to pad batches as they come (see the sketch at the end of this comment).

But since scikit-learn's API is mainly focused around dataframes and arrays, incorporating this is kind of hard. Obviously, you can pad everything up front, but that can be a huge waste of memory. I'm trying to work with the sklearn.pipeline.Pipeline object, and I thought to myself: "alright, I'll just create a custom transformer at the end of my pipeline, just before the model, and make it return a tf.data.Dataset object to later plug into my model". But this is not possible, since the .transform signature only accepts X and not y, while you need both to build a tf.data.Dataset.

So assume we have 4 features for each data point, each with its own sequence length; for example, a datapoint might look like this:

sample_features = {'a': [1,2,3], 'b': [1,2,3,4,5], 'c': 1, 'd': [1,2]}
sample_label = 0

How will I be able to manage this kind of dataset under scikit-learn + SciKeras?
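
For concreteness, this is the per-batch padding I mean (a minimal sketch with made-up data and just two of the features; from_generator and padded_batch are standard tf.data APIs):

import tensorflow as tf

def gen():
    # made-up samples: each feature is a variable-length sequence
    yield {"a": [1, 2, 3], "b": [1, 2, 3, 4, 5]}, 0
    yield {"a": [4, 5], "b": [6]}, 1

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        {
            "a": tf.TensorSpec(shape=[None], dtype=tf.int32),
            "b": tf.TensorSpec(shape=[None], dtype=tf.int32),
        },
        tf.TensorSpec(shape=[], dtype=tf.int32),
    ),
)
# each batch is padded only up to the longest sequence in that batch
ds = ds.padded_batch(2)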

adriangb (Owner) commented Jan 6, 2021

It's tricky: Scikit-Learn does not really support this, and our "external" wrapper API is mostly based on Scikit-Learn's. This is partly because all of the validation that occurs on X and y is array-based (like making sure they're the same length).

That said, we do want to try to support Keras API features where there is no way to accomplish the same thing in Scikit-Learn, and tf.data.Dataset is one of those things. In the future, we would probably make BaseWrapper.{fit,predict,...} accept X=tf.data.Dataset....

For now, what I think you could do to make this work with SciKeras is override {fit,predict,transform} to access the Keras Model directly. Most things should still work:

from scikeras.wrappers import KerasClassifier


class DataSetWrapper(KerasClassifier):

    def fit(self, X, y, sample_weight=None, warm_start=False, epochs=1, initial_epoch=0, **kwargs):
        if not ((self.warm_start or warm_start) and self.initialized_):
            self._initialize(X, y)  # this instantiates the Keras Model

        # you probably need to override target_encoder_ and feature_encoder_
        # or just replace them with FunctionTransformer() here
        y = self.target_encoder_.transform(y)
        X = self.feature_encoder_.transform(X)

        self._fit_keras_model(
            X,
            y,
            sample_weight=sample_weight,
            warm_start=warm_start,
            epochs=epochs,
            initial_epoch=initial_epoch,
            **kwargs,
        )

That said, this does nothing to fix things outside of SciKeras; I think you will run into problems elsewhere in the sklearn ecosystem.

For the future, aside from accepting tf.data.Datasets directly (which would essentially be something like the code above, just cutting out the length checks and such), we could create Dataset wrappers like skorch's.

eliorc (Author) commented Jan 7, 2021

First of all, thanks for the elaborate answer; in general I think it is of utmost importance that different ML frameworks learn to cooperate.

I can't see, though, how this helps me. In my real use case, the data starts in table form, where the time dimension is the rows. But as I formulated the problem, a sequence of rows is actually one example from the model's perspective.
So I've written an sklearn.pipeline.Pipeline flow in which I do all my munging and cleaning in the table format, and I got stuck at the point where I want to convert these rows into sequences (as I said, variable-length sequences).
At that point, since I know how to deal with sequences in tf.data.Dataset, I thought I'd try to go that way, but as I said it is not possible.

So how would I go about incorporating a SciKeras model at the end of the pipeline? The missing piece is the transformation of tabular data to sequential, which I believe is quite a generic problem that needs solving.

adriangb (Owner) commented Jan 7, 2021

I'm having a bit of trouble picturing your data flow, so sorry if I'm slow and make you explain things multiple times 😅 , please bear with me.

It sounds like your data is tabular/representable as numpy arrays (without padding) up to some point in your pipeline when you want to apply a transformation to get data like you described in #160 (comment). Is this correct? If so, where in your pipeline do you need this transformation (i.e. before all of the sklearn preprocessing, in the middle of it, or right before/within the model)?

eliorc (Author) commented Jan 7, 2021

It's okay :)

So the data starts in a tabular fashion, and there is a column such that, if you group by its values, you can create the examples as they should appear in your training set. These groups are of variable length. In my pipeline, I'm doing most of the preprocessing on the tabular form, as it is readable and easy (ordinal encoding, scaling and stuff like that).
Then, I want to use a model with a sequential nature, so what I wanted to do, as the last step before the model (aka the last transformer in my Pipeline object before the SciKeras model), is get the data into the form I specified in the first comment so I can later ingest it into my model.

adriangb (Owner) commented Jan 7, 2021

I think you are in luck then, we have something in SciKeras specifically for this purpose: https://scikeras.readthedocs.io/en/latest/advanced.html#data-transformers.

Basically you get to insert a transformer within SciKeras itself, as if it were an extra step in the pipeline. But this transformer runs just before data is passed to Keras, so it can return dicts, objects, tf.data.Datasets, etc. Here's an example notebook: https://colab.research.google.com/github/adriangb/scikeras/blob/master/notebooks/DataTransformers.ipynb
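
For instance, here is a minimal sketch (a toy example of mine, not taken from the notebook) of plugging a dict-producing transformer into the feature_encoder slot, assuming the last column of X carries an integer group id and the rows are already sorted by it:

import tensorflow as tf
from sklearn.preprocessing import FunctionTransformer
from scikeras.wrappers import KerasClassifier

def rows_to_feature_dict(X):
    # split each feature column into one ragged sequence per group id
    ids = X[:, -1].astype("int64")
    return {
        f"feature_{i}": tf.RaggedTensor.from_value_rowids(X[:, i], ids)
        for i in range(X.shape[1] - 1)
    }

class SequenceClassifier(KerasClassifier):

    @property
    def feature_encoder(self):
        # runs inside SciKeras, right before data reaches the Keras Model,
        # so returning a dict of ragged tensors is fine here
        return FunctionTransformer(rows_to_feature_dict)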

eliorc (Author) commented Jan 7, 2021

Looks interesting, I'll check it out and report :)

eliorc (Author) commented Jan 8, 2021

Okay, so I'm on this and I'm wondering...
It looks like the data transformations are done separately for inputs and outputs, so tf.data.Dataset is out of the question (since the dataset needs to hold both X and y) - am I right? Or is there any way to use tf.data.Dataset here?

My second question is: how can I use this while avoiding tf.data.Dataset? Since each sample has a different length of inputs, I can't convert my inputs into a dictionary of features...
Because sample one will be {'a': [1,2,3], 'b': [2]} and sample two will be {'a': [2,3,4,5,6,7], 'b': [1,2,3,4]}.

So I'm not really sure how, in general, one would deal with variable-length inputs in SciKeras.

Maybe with an example it will be easier; imagine this table:

a  b  id
1  2  xxx
2  2  xxx
1  3  yyy
2  3  yyy
3  3  yyy

From the machine learning model's perspective, these are two samples: in the first, x == {'a': [1, 2], 'b': [2, 2]}, and in the second, x == {'a': [1, 2, 3], 'b': [3, 3, 3]}.

This is what I'm trying to do.
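
In code, the grouping I'm after would be something like this pandas sketch (column names match the table above):

import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 1, 2, 3],
    "b": [2, 2, 3, 3, 3],
    "id": ["xxx", "xxx", "yyy", "yyy", "yyy"],
})

# one training example per id, each feature a variable-length list
examples = [
    {"a": group["a"].tolist(), "b": group["b"].tolist()}
    for _, group in df.groupby("id", sort=False)
]
# examples[0] == {'a': [1, 2], 'b': [2, 2]}
# examples[1] == {'a': [1, 2, 3], 'b': [3, 3, 3]}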

stsievert (Collaborator) commented:

> Or is there any way to use tf.data.Dataset here?

I think that'd be a good feature. Skorch also supports use of PyTorch datasets/dataloaders (source). @eliorc would that resolve your issue?

adriangb (Owner) commented Jan 9, 2021

I agree that long term we will probably end up supporting tf.data.Dataset as an input to SciKeras.

In the meantime, I think that using ragged tensors and encoding your observation splits into your data might work?

Example
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from scikeras.wrappers import KerasClassifier
from scikeras.utils.transformers import ClassifierLabelEncoder
import tensorflow as tf


class FeatureEncoder(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # the last column of X holds the observation (group) id;
        # the remaining columns are the features
        Xr = tf.RaggedTensor.from_value_rowids(X[:, :-1], X[:, -1].astype("int64"))
        # longest sequence length across observations
        self.max_ragged_length_ = int(Xr.bounding_shape()[1])
        return self

    def transform(self, X):
        # one ragged row of feature vectors per observation id
        Xr = tf.RaggedTensor.from_value_rowids(
            X[:, :-1].astype("float32"), X[:, -1].astype("int64")
        )
        return Xr

    def get_metadata(self):
        # merged into the `meta` dict that get_model receives
        return {"max_ragged_length_": self.max_ragged_length_}

class TargetEncoder(BaseEstimator, TransformerMixin):

    def fit(self, y):
        # first column holds the labels, second the observation id
        self._clf_enc = ClassifierLabelEncoder().fit(y[:, 0])
        return self

    def transform(self, y):
        # column_stack with string labels turned the ids into strings; cast back
        obs = y[:, 1].astype("int64")
        labels = self._clf_enc.transform(y[:, 0]).reshape(-1)
        # keep one label per observation: the label of each group's last row
        yr = tf.RaggedTensor.from_value_rowids(labels, obs)
        return labels[yr.row_limits().numpy() - 1].reshape(-1)


class MyModel(KerasClassifier):

    @property
    def feature_encoder(self):
        return FeatureEncoder()

    @property
    def target_encoder(self):
        return TargetEncoder()


def get_model(meta):
    # meta carries whatever FeatureEncoder.get_metadata returned
    max_ragged_length = meta["max_ragged_length_"]  # available if you need it
    model = tf.keras.Sequential([
        # the time dimension is ragged; each timestep has 2 features
        tf.keras.layers.Input(shape=(None, 2), dtype=tf.float32, ragged=True),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    return model


X = np.array([[1, 2, 1, 2, 3], [2, 2, 3, 3, 3]]).T
y = ["xxx", "xxx", "yyy", "yyy", "yyy"]
# stack the observations onto the data
# this is just so the transformers know where to split
# if you can groupby or something, then you don't need this
obs = [0, 0, 1, 1, 1]
y = np.column_stack([y, obs])
X = np.column_stack([X, obs])


clf = MyModel(get_model, loss="binary_crossentropy")

clf.fit(X, y)

eliorc (Author) commented Jan 9, 2021

RaggedTensor looks like a nice solution for the time being. I would prefer my inputs as dictionaries, since that makes the tf model subclassing implementation clearer, but I guess for now I'll manage with this.

Let's say I were to try to contribute the incorporation of tf.data.Dataset into SciKeras; how would I go about that? What would be the constraints? We have to remember that when using tf.data.Dataset we must have X and y together, not separately like scikit-learn manages them.

adriangb (Owner) commented Jan 9, 2021

It would be great if you could contribute that!

I think that there are two separate use cases where tf.data.Dataset comes into play, which require different solutions:

  1. Compatibility with Keras/TF. This is the case where someone has an existing working Keras setup and wants to migrate seamlessly to SciKeras. This requires SciKeras' external API ({fit,predict,etc}) to be modified to accept tf.data.Dataset objects directly and pass them untouched to Model. This probably consists of skipping the checks for array dtype, shape, etc., as well as skipping the transformers. This means that these users would miss out on many of the features that SciKeras provides, like handling string/object labels in classification, but that's probably OK since the data is already prepared for Keras. We might also need to think about how to make stuff like scoring work, but I think many parts of the sklearn ecosystem will not be compatible with this approach anyway.
  2. Use cases like yours, where your data is tabular up until you're ready to pass it to your model. I think this is the more common use case, since there is no point in integrating with the sklearn ecosystem if your data is never array-like. I think what we can do here is add another dependency injection point for users, one that runs after the current two transformers and operates on X & y together, allowing users to make a tf.data.Dataset or whatever else they need. This has the advantage of being fully compatible with the rest of the sklearn ecosystem, taking advantage of SciKeras' built-in checks, etc. (see the sketch below).
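
For illustration, a user-facing sketch of what option 2 could enable (the whole-dataset transformer here is hypothetical, not an existing SciKeras hook):

import tensorflow as tf
from sklearn.preprocessing import FunctionTransformer

def to_dataset(data):
    # hypothetical third injection point: receives X & y together,
    # after feature_encoder_/target_encoder_ have already run
    X, y = data
    return tf.data.Dataset.from_tensor_slices((X, y)).batch(32)

# a user would hand this to the wrapper as the extra transformer
dataset_transformer = FunctionTransformer(to_dataset)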

What do you think?

eliorc (Author) commented Jan 10, 2021

"the current two transformers" do you mean the feature and target encoders?
I'll try to dig into the internals of scikeras maybe next weekend (weekend in Israel is Friday-Saturday) if would save me some time if you can hyperlink me to the code that lives between the transformers to the scikeras model

Any way, I guess this issue can be closed :)

adriangb (Owner) commented Jan 10, 2021

"the current two transformers" do you mean the feature and target encoders?

Yep

> I'll try to dig into the internals of SciKeras, maybe next weekend (the weekend in Israel is Friday-Saturday).

Please enjoy your weekend! No rush.

> It would save me some time if you could hyperlink me to the code that lives between the transformers and the SciKeras model.

Sure thing. The gist of it is that these are dependency injection points for users to insert custom data transformations. Calling BaseWrapper.fit instantiates and fits the transformers here. Adding another transformer would just consist of adding some default transformers (sklearn.preprocessing.FunctionTransformer) and a couple of lines to instantiate and fit the new one. I think the hardest part is going to be figuring out the signature of the transformer, since it'll be non-standard (sklearn's transform accepts only 1 parameter; we need 2, or a tuple).
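
Something like this, perhaps (a sketch of the wiring only; passing X & y as one tuple is one possible way around sklearn's one-argument convention):

from sklearn.preprocessing import FunctionTransformer

def apply_extra_step(X, y, dataset_transformer=None):
    # default to a no-op so existing behavior is unchanged
    dataset_transformer = dataset_transformer or FunctionTransformer()
    # pass X & y as a single tuple to satisfy sklearn's 1-parameter transform()
    return dataset_transformer.fit_transform((X, y))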

adriangb (Owner) commented Jan 16, 2021

Hi @eliorc, I opened #166 to enable Datasets as inputs. I do think it's useful for the case where one has a Dataset and just wants to use GridSearchCV or something else that operates on the estimator itself. But I'm not sure it would solve your use case, since as far as I know there is no way to "combine" X & y in an sklearn Pipeline?

Edit: #167 adds the "whole dataset" transformer, as described in #160 (comment)

adriangb (Owner) commented Jan 21, 2021

@eliorc can you check if this example (see section 4. Ragged datasets with tf.data.Dataset) satisfies your use case? Thanks!

eliorc (Author) commented Jan 24, 2021

@adriangb I've put a reminder for myself and I'll try to look at this next weekend.

EDIT

Just so as not to leave my end open: in the end I did not have time to see this through... though I'm still interested in the incorporation of tf.data.Dataset in SciKeras :D
