No default loss for KerasClassifier #206

Open
stsievert opened this issue Feb 24, 2021 · 7 comments · May be fixed by #208

Comments

@stsievert
Collaborator

Why does KerasClassifier not have a default loss?

scikeras/scikeras/wrappers.py

Lines 1229 to 1231 in d83bffa

loss: Union[
Union[str, tf.keras.losses.Loss, Type[tf.keras.losses.Loss], Callable], None
] = None,

Shouldn't there at least be a sensible default that works in most use cases?
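For concreteness, here is a minimal sketch (the model and shapes are made up, not taken from the SciKeras docs) of what the current signature means in practice: since loss defaults to None, the user has to spell it out, or compile the model themselves inside the build function.

```python
from tensorflow import keras
from scikeras.wrappers import KerasClassifier

def build_model():
    # Toy architecture for the sketch: 20 features, 3 classes.
    return keras.Sequential([
        keras.layers.Input(shape=(20,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(3, activation="softmax"),
    ])

# There is no fallback for loss, so it must be supplied here (or the model
# must already be compiled inside build_model).
clf = KerasClassifier(model=build_model, loss="sparse_categorical_crossentropy")
```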

Reference issues/PRs
dask/dask-ml#794

@adriangb
Owner

adriangb commented Feb 24, 2021

Keras itself does not provide a default loss. The loss depends on the dataset, so I don't think it makes sense to provide a concrete loss as a default. At most, we could accept a string like "auto" or "infer" that could be passed to target_encoder, which could then determine an appropriate loss for the specific dataset.
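For illustration, a heuristic along those lines might look something like the strawman sketch below; infer_loss and its mapping are not part of SciKeras.

```python
from sklearn.utils.multiclass import type_of_target

def infer_loss(y):
    """Guess a default Keras loss name from the target alone (a sketch)."""
    target_type = type_of_target(y)
    if target_type == "binary":
        return "binary_crossentropy"
    if target_type == "multiclass":
        return "sparse_categorical_crossentropy"
    if target_type == "multilabel-indicator":
        # One-hot multiclass and true multilabel targets look identical here.
        return "categorical_crossentropy"
    if target_type in ("continuous", "continuous-multioutput"):
        return "mean_squared_error"
    raise ValueError(f"No default loss for target type {target_type!r}")
```

Note that even this simple mapping already bakes in assumptions about the model's output layer (e.g., a single sigmoid unit for binary targets), which is where the trouble starts.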

But this also sounds like something users could implement, or that could be shown in notebooks.

If we had to implement a default, I guess it would be "sparse_categorical_crossentropy" since that should also work for binary datasets. For KerasRegressor, it could be "mean_squared_error".

Do you have any suggestions for a default loss you would like for KerasClassifier and KerasRegressor?

@stsievert
Collaborator Author

Why should the user have to specify the loss parameter for their classifier to work? Why is it important to expose the loss function to the user?

Aren't particular classes of loss functions pretty commonly used for classification regardless of the dataset (like cross-entropy)?

How does the Scikit-learn API choose the loss for its classifiers when the loss function can be varied (e.g., SGDClassifier)?

@adriangb
Owner

Why is it important to expose the loss function to the user?

Because the loss function depends not only on the input data (which is not too hard to parse, especially since Scikit-Learn has tools like type_of_target) but also on the Model architecture, which is considerably harder to parse and bin into clear choices. I believe this is why Keras does not provide a default loss function.
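As a quick illustration of that split (a sketch; the return values in the comments are scikit-learn's documented target types):

```python
from sklearn.utils.multiclass import type_of_target

type_of_target([0, 1, 1, 0])              # "binary"
type_of_target([0, 2, 1])                 # "multiclass"
type_of_target([[1, 0, 0], [0, 1, 0]])    # "multilabel-indicator": one-hot multiclass
                                          # and true multilabel look the same here
type_of_target([1.5, 0.2, 3.3])           # "continuous"
```

None of this tells us whether the user's Model ends in Dense(1, "sigmoid"), Dense(n, "softmax"), multiple outputs, or something else entirely.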

How does the Scikit-learn API choose the loss for their classifier when the loss function can be varied?

Scikit-Learn has full control of the internals. They can choose/build a model to match a loss function; SciKeras/Keras have to work with whatever Model users create. That said, I don't really know what Scikit-Learn does internally, so I could be missing something.

I don't mean to be overly negative about the idea. I do think that abstracting things like the loss function away from users is good: the simpler it is for users to get started, the better. But I would like to work on core parts of SciKeras (like #167) that enable Scikit-Learn/Keras interoperability before adding more convenience layers.

@stsievert
Collaborator Author

stsievert commented Feb 25, 2021

To be clear, BaseWrapper should absolutely not perform any heuristic processing. It should leave everything as untouched as possible. If a user wants to do anything other than standard classification/regression (or something else entirely), that's the class they should use.

I think KerasClassifier and KerasRegressor should work well for most use cases. I think a default loss should be provided to make that happen.

the loss function depends ... on the Model architecture

Correct me where I'm wrong:

  • All models minimize a function loss(y, model(x)). One example is (y - model(x))**2 for y.shape == model(x).shape == (100, ).
  • In most cases, the model is trying to match the output exactly (same shape, same dtype, etc). In these cases, if the model were perfect then model(x) == y.
  • Typically, the loss only depends on the target shape/dtype/etc. It does not depend on the model architecture at all.
    • This is a little different for classification with one-hot encoding.
    • That is, a lot of classification models have y = [0, 1, 2] and model(x) = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] (or some floating point approximation of that); see the short check after this list.
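A small check of that last point (a sketch using the two Keras cross-entropy variants; the probabilities are made up):

```python
import numpy as np
import tensorflow as tf

y_int = np.array([0, 1, 2])                       # integer-encoded targets
y_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])            # the same targets, one-hot
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])               # a hypothetical model(x)

sparse = tf.keras.losses.sparse_categorical_crossentropy(y_int, probs)
dense = tf.keras.losses.categorical_crossentropy(y_onehot, probs)
print(sparse.numpy())  # ≈ [0.105, 0.223, 0.511]
print(dense.numpy())   # same values: both are -log(prob of the true class)
```

So the choice of loss is driven by how the target is encoded relative to the model's output, not by the rest of the architecture.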

My question: why can't you perform some heuristic processing to determine the right loss function? I am more than okay with loss="auto" or loss="heuristic". I think this question is especially relevant because the target transformer does similar heuristic processing.

@adriangb
Owner

adriangb commented Feb 25, 2021

Correct me where I'm wrong

You are mostly right. But the devil is in the details. One-hot encoded targets are a good example. Some other sticky points I can think of:

  • A binary target can be handled by a Dense(1, activation="sigmoid") output with loss="binary_crossentropy" OR a Dense(2, activation="softmax") output with loss="{sparse_}categorical_crossentropy" (see the sketch after this list). But at the point where we would be injecting the heuristic, the model is not yet built, so we don't know which one the user chose! Besides, even though the first option is better, the second is valid, and I'm not sure SciKeras should force users' hands like that.
  • Multi-output models: you need to decide if you want to average the losses (the default, IIRC) and if you want the same or a different loss for each output. We don't provide out-of-the-box support for multi-output models in KerasClassifier, but we would want to make sure that it's possible to support them with some custom code.
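Here is a sketch of that first point with two toy models (both made up for illustration): they are fit on the same 0/1 target, but each needs a different loss string, and a heuristic that only sees y cannot tell them apart.

```python
from tensorflow import keras

n_features = 20  # placeholder for the sketch

# Option A: one sigmoid unit; expects loss="binary_crossentropy".
model_a = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model_a.compile(optimizer="adam", loss="binary_crossentropy")

# Option B: two softmax units; expects loss="sparse_categorical_crossentropy"
# (or "categorical_crossentropy" with one-hot targets).
model_b = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(2, activation="softmax"),
])
model_b.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```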

So to answer your question: there is no particular reason why we can't implement a heuristic for this. I just think that the implementation is harder than it sounds, and it will be hard to think of all use cases and sharp edges.

@stsievert
Collaborator Author

it will be hard to think of all use cases and sharp edges.

I'm not proposing to cover every possible scenario. I think covering the majority of use cases would really enhance the usefulness of SciKeras.

Wouldn't loss="categorical_crossentropy" cover all one-hot encoded 0/1 targets? It worked in my (brief) experiments.

@adriangb
Owner

I agree that we don't need to cover every scenario, but we at least have to think about how failures might happen and what the errors would look like to users. The last thing we want is to make things more confusing for users.

If you want to tackle this in the short term, a PR would be welcome. You might have ideas about implementation that I can't think of.

Wouldn't loss="categorical_crossentropy" cover all one-hot encoded 0/1 targets?

Theoretically, for a single output, yes. In practice, "binary_crossentropy" might be optimized differently (there'd be no reason to have a parallel implementation otherwise).
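For a single binary output the numbers do line up; here is a quick check (made-up predictions, reshaped so both losses return per-sample values):

```python
import numpy as np
import tensorflow as tf

y_col = np.array([[0.0], [1.0], [1.0]])        # 0/1 targets, one column
p_col = np.array([[0.1], [0.8], [0.6]])        # P(class 1) from a sigmoid head
y_onehot = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
p_twocol = np.hstack([1 - p_col, p_col])       # the same predictions, softmax-style

bce = tf.keras.losses.binary_crossentropy(y_col, p_col)
cce = tf.keras.losses.categorical_crossentropy(y_onehot, p_twocol)
print(bce.numpy())  # ≈ [0.105, 0.223, 0.511]
print(cce.numpy())  # same values, but computed by a separate implementation
```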
