Dealing with variable length inputs #160
It's tricky: Scikit-Learn does not really support this, and our "external" wrapper API is mostly based on Scikit-Learn; this is partly because of all of the validation that occurs on the inputs. That said, we do want to try to support Keras API features where there is no way to accomplish the same thing in Scikit-Learn. For now, what I think you could do to make this work with SciKeras is override fit:

from scikeras.wrappers import KerasClassifier

class DataSetWrapper(KerasClassifier):
    def fit(self, X, y, sample_weight=None, **kwargs):
        if not (self.warm_start and self.initialized_):
            self._initialize(X, y)  # this instantiates the Keras Model
        # you probably need to override target_encoder_ and feature_encoder_
        # or just replace them with FunctionTransformer() here
        y = self.target_encoder_.transform(y)
        X = self.feature_encoder_.transform(X)
        self._fit_keras_model(
            X,
            y,
            sample_weight=sample_weight,
            warm_start=self.warm_start,
            epochs=self.epochs,
            initial_epoch=0,
            **kwargs,
        )
        return self

That said, this does nothing to fix things outside of SciKeras; I think you will run into problems elsewhere in the sklearn ecosystem. For the future, aside from accepting tf.data.Datasets directly (which would essentially be something like the code above, just cutting out the length checks and such), we could create Dataset wrappers like skorch's.
First of all, thanks for the elaborate answer; in general I think it is of utmost importance that different ML frameworks learn to cooperate. I can't see though how this helps me. In my real use case, the data starts in table form, where the time dimension is the rows. But as I formulated the problem, a sequence of rows is actually one example from the model's perspective. So how would I go about incorporating a SciKeras model at the end of the pipeline? The missing piece is the transformation of tabular data to sequential data, which I believe is quite a generic problem that needs solving.
I'm having a bit of trouble picturing your data flow, so sorry if I'm slow and make you explain things multiple times 😅, please bear with me. It sounds like your data is tabular/representable as numpy arrays (without padding) up to some point in your pipeline when you want to apply a transformation to get data like you described in #160 (comment). Is this correct? If so, where in your pipeline do you need this transformation (i.e. before all of the sklearn preprocessing, in the middle of it, or right before/within the model)?
It's okay :) So the data starts in a tabular fashion, and there is a column where, if you group by its values, you can create the examples as they should appear in your training set. These groups are of variable length. So in my pipeline, I'm doing most of the preprocessing on the tabular form, as it is readable and easy (ordinal encoding, scaling and stuff like that).
I think you are in luck then, we have something in SciKeras specifically for this purpose: https://scikeras.readthedocs.io/en/latest/advanced.html#data-transformers. Basically you get to insert a transformer within SciKeras itself, as if it were an extra step in the pipeline. But this transformer is run just before passing data to Keras, so it can return dicts, objects, tf.data.Datasets, etc. Here's an example notebook: https://colab.research.google.com/github/adriangb/scikeras/blob/master/notebooks/DataTransformers.ipynb
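To make the hook concrete, here is a minimal sketch; only the feature_encoder override pattern comes from the linked docs, while to_keras_format and its reshaping are placeholder assumptions:

import numpy as np
from sklearn.preprocessing import FunctionTransformer
from scikeras.wrappers import KerasClassifier

def to_keras_format(X):
    # placeholder transformation; in practice this could return ragged tensors,
    # dicts of arrays, or a tf.data.Dataset, since it runs right before Keras
    return np.asarray(X, dtype=np.float32)

class MyClassifier(KerasClassifier):
    @property
    def feature_encoder(self):
        # this transformer is applied inside SciKeras, after any sklearn Pipeline steps
        return FunctionTransformer(func=to_keras_format)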
Looks interesting, I'll check it out and report back :)
Okay, so I'm on this and I'm wondering... My second question is how I can use this to avoid padding. So I'm not really sure how, in general, one would deal with variable length inputs in SciKeras. Maybe with an example it will be easier, imagine this table:
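(An illustrative stand-in for the table, built with hypothetical column names and the same toy values used in the reply below; "group_id" is the column you group by to form examples:)

import pandas as pd

df = pd.DataFrame({
    "feature_1": [1, 2, 1, 2, 3],
    "feature_2": [2, 2, 3, 3, 3],
    "group_id":  [0, 0, 1, 1, 1],
    "label":     ["xxx", "xxx", "yyy", "yyy", "yyy"],
})

# one variable-length sequence of rows per group, plus one label per group
sequences = [g[["feature_1", "feature_2"]].to_numpy() for _, g in df.groupby("group_id")]
labels = df.groupby("group_id")["label"].last().to_numpy()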
From the machine learning model's perspective, these are two samples: the first has two rows and the second has three. This is what I'm trying to do.
I agree that long term we probably will end up supporting tf.data.Dataset. In the meantime, I think that using ragged tensors and encoding your observation splits into your data might work? Example:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from scikeras.wrappers import KerasClassifier
from scikeras.utils.transformers import ClassifierLabelEncoder
import tensorflow as tf
class FeatureEncoder(BaseEstimator, TransformerMixin):
def fit(self, X):
X = tf.RaggedTensor.from_value_rowids(X[:, :-1], X[:, -1])
self.max_ragged_length_ = X.bounding_shape()[-1].numpy()
return self
def transform(self, X):
Xr = tf.RaggedTensor.from_value_rowids(X[:, :-1], X[:, -1])
return Xr
def get_metadata(self):
return {"max_ragged_length_": self.max_ragged_length_}
class TargetEncoder(BaseEstimator, TransformerMixin):
def fit(self, y):
self._clf_enc = ClassifierLabelEncoder().fit(y[:, 0])
return self
def transform(self, y):
        obs = y[:, 1].astype(np.int64)  # column_stack with string labels casts obs to str; rowids must be integers
y = self._clf_enc.transform(y[:, 0]).reshape(-1, )
yr = tf.RaggedTensor.from_value_rowids(y, obs)
y = y[yr.row_limits().numpy() - 1].reshape(-1, )
return y
class MyModel(KerasClassifier):
@property
def feature_encoder(self):
return FeatureEncoder()
@property
def target_encoder(self):
return TargetEncoder()
def get_model(meta):
max_ragged_length = meta["max_ragged_length_"]
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=[2, max_ragged_length], dtype=tf.float32, ragged=True),
tf.keras.layers.LSTM(64),
tf.keras.layers.Dense(1, activation='sigmoid')
])
return model
X = np.array([[1, 2, 1, 2, 3], [2, 2, 3, 3, 3]]).T
y = ["xxx", "xxx", "yyy", "yyy", "yyy"]
# stack the observations onto the data
# this is just so the transformers know where to split
# if you can groupby or something, then you don't need this
obs = [0, 0, 1, 1, 1]
y = np.column_stack([y, obs])
X = np.column_stack([X, obs])
clf = MyModel(get_model, loss="binary_crossentropy")
clf.fit(X, y)
Let's say I was to try to contribute the incorporation of tf.data.Dataset support.
It would be great if you could contribute that! I think that there are two separate use cases where tf.data.Dataset could come into play: accepting a Dataset directly as the input X (e.g. so that GridSearchCV can operate on the estimator itself), and building a Dataset from X & y inside SciKeras, after the current two transformers, right before the data is handed to Keras.
What do you think?
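A rough sketch of what the second use case could look like; this is not an existing SciKeras API, and the class name and the (X, y) hand-off are assumptions:

import tensorflow as tf
from sklearn.base import BaseEstimator, TransformerMixin

class WholeDatasetTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical hook that receives X and y together and returns a tf.data.Dataset."""

    def __init__(self, batch_size=32):
        self.batch_size = batch_size

    def fit(self, data, dummy=None):
        return self

    def transform(self, data):
        X, y = data  # assumes the wrapper would pass features and targets as one object
        # batching happens lazily inside tf.data, so Keras never sees one giant array
        return tf.data.Dataset.from_tensor_slices((X, y)).batch(self.batch_size)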
"the current two transformers" do you mean the feature and target encoders? Any way, I guess this issue can be closed :) |
Yep
Please enjoy your weekend! No rush.
Sure thing. The gist of it is that these are dependency injection points for users to insert custom data transformations.
Hi @eliorc, I opened #166 to enable Datasets as inputs. I do think it's useful for the case where one has a Dataset and just wants to use GridSearchCV or something that operates on the estimator itself. But I'm not sure if it would solve your use case, since as far as I know there is no way to "combine" X & y in an sklearn Pipeline? Edit: #167 adds the "whole dataset" transformer, as described in #160 (comment)
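For reference, here is the limitation in a plain sklearn Pipeline, shown with a hypothetical transformer just to illustrate the signatures:

from sklearn.base import BaseEstimator, TransformerMixin

class ToDataset(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # y is available here, but only during fitting
        return self

    def transform(self, X):
        # Pipeline never passes y to transform, so there is no way to return
        # a combined (features, labels) tf.data.Dataset from this step
        return X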
@eliorc can you check if this example (see section
@adriangb I've set myself a reminder and I'll try to look at this next weekend. EDIT: Just to not leave my end open, in the end I did not have time to see this through... Though still interested in the incorporation of tf.data.Dataset.
Let's assume we are working with variable length inputs. One of the strongest parts of using tf.data.Dataset is the ability to pad batches as they come. But since scikit-learn's API is mainly focused around dataframes and arrays, incorporating this is kind of hard. Obviously, you can pad everything, but this can be a huge waste of memory. I'm trying to work with the sklearn.pipeline.Pipeline object, and I thought to myself "alright, I'll just create a custom transformer at the end of my pipeline, just before the model, and make it return a tf.data.Dataset object to later plug into my model". But this is not possible, since the .transform signature only accepts X and not y, while you'll need both to work with tf.data.Dataset.

So assume we have 4 features for each data point, and each has its own sequence length; for example, a datapoint might look like this:

How will I be able to manage this kind of dataset under scikit-learn + SciKeras?
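A small sketch of the per-batch padding referred to above; the sequences and batch size are made up for illustration:

import tensorflow as tf

sequences = [[1, 2], [3, 4, 5, 6], [7]]

def gen():
    for s in sequences:
        yield s

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# each batch is padded only to the longest sequence it contains,
# instead of padding the whole dataset to the global maximum length
for batch in ds.padded_batch(2):
    print(batch.shape)  # (2, 4) then (1, 1)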