
Dealing with variable length inputs #160

Closed
eliorc opened this issue Jan 6, 2021 · 16 comments · May be fixed by #167

eliorc commented Jan 6, 2021

Let's assume we are working with variable-length inputs. One of the strongest features of tf.data.Dataset is the ability to pad batches as they come (see the sketch at the end of this comment).

But since scikit-learn's API is mainly focused around dataframes and arrays, incorporating this is kind of hard. Obviously, you can pad everything up front, but that can be a huge waste of memory. I'm trying to work with the sklearn.pipeline.Pipeline object, and I thought to myself: "alright, I'll just create a custom transformer at the end of my pipeline, just before the model, and make it return a tf.data.Dataset object to later plug into my model". But this is not possible, since the .transform signature only accepts X and not y, while you need both to build a tf.data.Dataset.

So assume we have 4 features for each data point, each with its own sequence length; for example, a datapoint might look like this:

sample_features = {'a': [1,2,3], 'b': [1,2,3,4,5], 'c': 1, 'd': [1,2]}
sample_label = 0

How will I be able to manage this kind of dataset under scikit-learn + SciKeras?
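
For concreteness, this is the per-batch padding I mean (a minimal sketch with made-up data and just two of the features; from_generator and padded_batch are standard tf.data APIs):

import tensorflow as tf

def gen():
    # made-up samples: each feature is a variable-length sequence
    yield {"a": [1, 2, 3], "b": [1, 2, 3, 4, 5]}, 0
    yield {"a": [4, 5], "b": [6]}, 1

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        {
            "a": tf.TensorSpec(shape=[None], dtype=tf.int32),
            "b": tf.TensorSpec(shape=[None], dtype=tf.int32),
        },
        tf.TensorSpec(shape=[], dtype=tf.int32),
    ),
)
# each batch is padded only up to the longest sequence in that batch
ds = ds.padded_batch(2)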

adriangb (Owner) commented Jan 6, 2021

It's tricky: Scikit-Learn does not really support this, and our "external" wrapper API is mostly based on Scikit-Learn's. This is partly because all of the validation that occurs on X and y is array-based (like making sure they're the same length).

That said, we do want to try to support Keras API features where there is no way to accomplish the same thing in Scikit-Learn, and tf.data.Dataset is one of those things. In the future, we would probably make BaseWrapper.{fit,predict,...} accept X=tf.data.Dataset....

For now, what I think you could do to make this work with SciKeras is override {fit,predict,transform} to access the Keras Model directly. Most things should still work:

from scikeras.wrappers import KerasClassifier


class DataSetWrapper(KerasClassifier):

    def fit(self, X, y, sample_weight=None, warm_start=False, epochs=1, initial_epoch=0, **kwargs):
        if not ((self.warm_start or warm_start) and self.initialized_):
            self._initialize(X, y)  # this instantiates the Keras Model

        # you probably need to override target_encoder_ and feature_encoder_
        # or just replace them with FunctionTransformer() here
        y = self.target_encoder_.transform(y)
        X = self.feature_encoder_.transform(X)

        self._fit_keras_model(
            X,
            y,
            sample_weight=sample_weight,
            warm_start=warm_start,
            epochs=epochs,
            initial_epoch=initial_epoch,
            **kwargs,
        )

That said, this does nothing to fix things outside of SciKeras; I think you will run into problems elsewhere in the sklearn ecosystem.

For the future, aside from accepting tf.data.Datasets directly (which would essentially be something like the code above, just cutting out the length checks and such), we could create Dataset wrappers like skorch's.

eliorc (Author) commented Jan 7, 2021

First of all, thanks for the elaborate answer; in general I think it is of utmost importance that different ML frameworks learn to cooperate.

I can't see, though, how this helps me. In my real use case, the data starts in table form, where the time dimension is the rows. But as I formulated the problem, a sequence of rows is actually one example from the model's perspective.
So I've written an sklearn.pipeline.Pipeline flow in which I do all my munging and cleaning in the table format, and I got stuck at the point where I want to convert these rows into sequences (as I said, variable-length sequences).
At that point, since I know how to deal with sequences in tf.data.Dataset, I thought I'd try to go that way, but as I said it is not possible.

So how would I go about incorporating a SciKeras model at the end of the pipeline? The missing piece is the transformation of tabular data to sequential, which I believe is quite a generic problem that needs solving.

adriangb (Owner) commented Jan 7, 2021

I'm having a bit of trouble picturing your data flow, so sorry if I'm slow and make you explain things multiple times 😅 , please bear with me.

It sounds like your data is tabular/representable as numpy arrays (without padding) up to some point in your pipeline when you want to apply a transformation to get data like you described in #160 (comment). Is this correct? If so, where in your pipeline do you need this transformation (i.e. before all of the sklearn preprocessing, in the middle of it, or right before/within the model)?

eliorc (Author) commented Jan 7, 2021

It's okay :)

So the data starts in a tabular fashion, and there is a column such that, if you group by its values, you can create the examples as they should appear in your training set. These groups are of variable length. In my pipeline, I'm doing most of the preprocessing on the tabular form, as it is readable and easy (ordinal encoding, scaling and stuff like that).
Then, I want to use a model with a sequential nature, so what I wanted to do, as the last step before the model (aka the last transformer in my Pipeline object before the SciKeras model), is get the data into the form I specified in the first comment so I can later ingest it into my model.

adriangb (Owner) commented Jan 7, 2021

I think you are in luck then, we have something in SciKeras specifically for this purpose: https://scikeras.readthedocs.io/en/latest/advanced.html#data-transformers.

Basically you get to insert a transformer within SciKeras itself, as if it were an extra step in the pipeline. But this transformer runs just before data is passed to Keras, so it can return dicts, objects, tf.data.Datasets, etc. Here's an example notebook: https://colab.research.google.com/github/adriangb/scikeras/blob/master/notebooks/DataTransformers.ipynb
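
For instance, here is a minimal sketch (a toy example of mine, not taken from the notebook) of plugging a dict-producing transformer into the feature_encoder slot, assuming the last column of X carries an integer group id and the rows are already sorted by it:

import tensorflow as tf
from sklearn.preprocessing import FunctionTransformer
from scikeras.wrappers import KerasClassifier

def rows_to_feature_dict(X):
    # split each feature column into one ragged sequence per group id
    ids = X[:, -1].astype("int64")
    return {
        f"feature_{i}": tf.RaggedTensor.from_value_rowids(X[:, i], ids)
        for i in range(X.shape[1] - 1)
    }

class SequenceClassifier(KerasClassifier):

    @property
    def feature_encoder(self):
        # runs inside SciKeras, right before data reaches the Keras Model,
        # so returning a dict of ragged tensors is fine here
        return FunctionTransformer(rows_to_feature_dict)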

eliorc (Author) commented Jan 7, 2021

Looks interesting, I'll check it out and report :)

eliorc (Author) commented Jan 8, 2021

Okay, so I'm on this and I'm wondering...
It looks like the data transformations are done separately for inputs and outputs, so tf.data.Dataset is out of the question (since the dataset needs to hold both X and y) - am I right? Or is there any way to use tf.data.Dataset here?

My second question is: how can I use this while avoiding tf.data.Dataset? Since each sample has a different length of inputs, I can't convert my inputs into a dictionary of features...
Because sample one will be {'a': [1,2,3], 'b': [2]} and sample two will be {'a': [2,3,4,5,6,7], 'b': [1,2,3,4]}.

So I'm not really sure how, in general, one would deal with variable-length inputs in SciKeras.

Maybe with an example it will be easier; imagine this table:

a  b  id
1  2  xxx
2  2  xxx
1  3  yyy
2  3  yyy
3  3  yyy

From the machine learning model's perspective, these are two samples: in the first, x == {'a': [1, 2], 'b': [2, 2]}, and in the second, x == {'a': [1, 2, 3], 'b': [3, 3, 3]}.

This is what I'm trying to do.
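
In code, the grouping I'm after would be something like this pandas sketch (column names match the table above):

import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 1, 2, 3],
    "b": [2, 2, 3, 3, 3],
    "id": ["xxx", "xxx", "yyy", "yyy", "yyy"],
})

# one training example per id, each feature a variable-length list
examples = [
    {"a": group["a"].tolist(), "b": group["b"].tolist()}
    for _, group in df.groupby("id", sort=False)
]
# examples[0] == {'a': [1, 2], 'b': [2, 2]}
# examples[1] == {'a': [1, 2, 3], 'b': [3, 3, 3]}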

stsievert (Collaborator) commented:

> Or is there any way to use tf.data.Dataset here?

I think that'd be a good feature. Skorch also supports use of PyTorch datasets/dataloaders (source). @eliorc would that resolve your issue?

adriangb (Owner) commented Jan 9, 2021

I agree that long term we will probably end up supporting tf.data.Dataset as an input to SciKeras.

In the meantime, I think that using ragged tensors and encoding your observation splits into your data might work?

Example
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from scikeras.wrappers import KerasClassifier
from scikeras.utils.transformers import ClassifierLabelEncoder
import tensorflow as tf


class FeatureEncoder(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # the last column of X holds the observation (group) id;
        # the remaining columns are the features
        Xr = tf.RaggedTensor.from_value_rowids(X[:, :-1], X[:, -1].astype("int64"))
        # longest sequence length across observations
        self.max_ragged_length_ = int(Xr.bounding_shape()[1])
        return self

    def transform(self, X):
        # one ragged row of feature vectors per observation id
        Xr = tf.RaggedTensor.from_value_rowids(
            X[:, :-1].astype("float32"), X[:, -1].astype("int64")
        )
        return Xr

    def get_metadata(self):
        # merged into the `meta` dict that get_model receives
        return {"max_ragged_length_": self.max_ragged_length_}

class TargetEncoder(BaseEstimator, TransformerMixin):

    def fit(self, y):
        # first column holds the labels, second the observation id
        self._clf_enc = ClassifierLabelEncoder().fit(y[:, 0])
        return self

    def transform(self, y):
        # column_stack with string labels turned the ids into strings; cast back
        obs = y[:, 1].astype("int64")
        labels = self._clf_enc.transform(y[:, 0]).reshape(-1)
        # keep one label per observation: the label of each group's last row
        yr = tf.RaggedTensor.from_value_rowids(labels, obs)
        return labels[yr.row_limits().numpy() - 1].reshape(-1)


class MyModel(KerasClassifier):

    @property
    def feature_encoder(self):
        return FeatureEncoder()

    @property
    def target_encoder(self):
        return TargetEncoder()


def get_model(meta):
    # meta carries whatever FeatureEncoder.get_metadata returned
    max_ragged_length = meta["max_ragged_length_"]  # available if you need it
    model = tf.keras.Sequential([
        # the time dimension is ragged; each timestep has 2 features
        tf.keras.layers.Input(shape=(None, 2), dtype=tf.float32, ragged=True),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    return model


X = np.array([[1, 2, 1, 2, 3], [2, 2, 3, 3, 3]]).T
y = ["xxx", "xxx", "yyy", "yyy", "yyy"]
# stack the observations onto the data
# this is just so the transformers know where to split
# if you can groupby or something, then you don't need this
obs = [0, 0, 1, 1, 1]
y = np.column_stack([y, obs])
X = np.column_stack([X, obs])


clf = MyModel(get_model, loss="binary_crossentropy")

clf.fit(X, y)

eliorc (Author) commented Jan 9, 2021

RaggedTensor looks like a nice solution for the time being. I would prefer my inputs as dictionaries, since that makes the tf model subclassing implementation clearer, but I guess for now I'll manage with this.

Let's say I were to try to contribute the incorporation of tf.data.Dataset into SciKeras; how would I go about that? What would be the constraints? We have to remember that when using tf.data.Dataset we must have X and y together, not separately like scikit-learn manages them.

adriangb (Owner) commented Jan 9, 2021

It would be great if you could contribute that!

I think that there are two separate use cases where tf.data.Dataset comes into play, which require different solutions:

  1. Compatibility with Keras/TF. This is the case where someone has an existing working Keras setup and wants to migrate seamlessly to SciKeras. This requires SciKeras' external API ({fit,predict,etc}) to be modified to accept tf.data.Dataset objects directly and pass them untouched to Model. This probably consists of skipping the checks for array dtype, shape, etc., as well as skipping the transformers. This means that these users would miss out on many of the features that SciKeras provides, like handling string/object labels in classification, but that's probably OK since the data is already prepared for Keras. We might also need to think about how to make stuff like scoring work, but I think many parts of the sklearn ecosystem will not be compatible with this approach anyway.
  2. Use cases like yours, where your data is tabular up until you're ready to pass it to your model. I think this is the more common use case, since there is no point in integrating with the sklearn ecosystem if your data is never array-like. I think what we can do here is add another dependency injection point for users, one that runs after the current two transformers and operates on X & y together, allowing users to make a tf.data.Dataset or whatever else they need. This has the advantage of being fully compatible with the rest of the sklearn ecosystem, taking advantage of SciKeras' built-in checks, etc. (see the sketch below).
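
For illustration, a user-facing sketch of what option 2 could enable (the whole-dataset transformer here is hypothetical, not an existing SciKeras hook):

import tensorflow as tf
from sklearn.preprocessing import FunctionTransformer

def to_dataset(data):
    # hypothetical third injection point: receives X & y together,
    # after feature_encoder_/target_encoder_ have already run
    X, y = data
    return tf.data.Dataset.from_tensor_slices((X, y)).batch(32)

# a user would hand this to the wrapper as the extra transformer
dataset_transformer = FunctionTransformer(to_dataset)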

What do you think?

eliorc (Author) commented Jan 10, 2021

"the current two transformers" do you mean the feature and target encoders?
I'll try to dig into the internals of scikeras maybe next weekend (weekend in Israel is Friday-Saturday) if would save me some time if you can hyperlink me to the code that lives between the transformers to the scikeras model

Any way, I guess this issue can be closed :)

adriangb (Owner) commented Jan 10, 2021

"the current two transformers" do you mean the feature and target encoders?

Yep

> I'll try to dig into the internals of SciKeras, maybe next weekend (the weekend in Israel is Friday-Saturday).

Please enjoy your weekend! No rush.

> It would save me some time if you could hyperlink me to the code that lives between the transformers and the SciKeras model.

Sure thing. The gist of it is that these are dependency injection points for users to insert custom data transformations. Calling BaseWrapper.fit instantiates and fits the transformers here. Adding another transformer would just consist of adding some default transformers (sklearn.preprocessing.FunctionTransformer) and a couple of lines to instantiate and fit the new one. I think the hardest part is going to be figuring out the signature of the transformer, since it'll be non-standard (sklearn's transform accepts only 1 parameter; we need 2, or a tuple).
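
Something like this, perhaps (a sketch of the wiring only; passing X & y as one tuple is one possible way around sklearn's one-argument convention):

from sklearn.preprocessing import FunctionTransformer

def apply_extra_step(X, y, dataset_transformer=None):
    # default to a no-op so existing behavior is unchanged
    dataset_transformer = dataset_transformer or FunctionTransformer()
    # pass X & y as a single tuple to satisfy sklearn's 1-parameter transform()
    return dataset_transformer.fit_transform((X, y))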

adriangb (Owner) commented Jan 16, 2021

Hi @eliorc, I opened #166 to enable Datasets as inputs. I do think it's useful for the case where one has a Dataset and just wants to use GridSearchCV or something else that operates on the estimator itself. But I'm not sure it would solve your use case, since as far as I know there is no way to "combine" X & y in an sklearn Pipeline?

Edit: #167 adds the "whole dataset" transformer, as described in #160 (comment)

adriangb (Owner) commented Jan 21, 2021

@eliorc can you check if this example (see section 4. Ragged datasets with tf.data.Dataset) satisfies your use case? Thanks!

eliorc (Author) commented Jan 24, 2021

@adriangb I've put a reminder for myself and I'll try to look at this next weekend.

EDIT

Just so as not to leave my end open: in the end I did not have time to see this through... though I'm still interested in the incorporation of tf.data.Dataset in SciKeras :D
