No default loss for KerasClassifier #206
Comments
Keras itself does not provide a default loss. The loss depends on the dataset, so I don't think it makes sense to provide a concrete loss as a default. At most, we could accept a string like But this also sounds like something users could implement, or that could be shown in notebooks. If we had to implement a default, I guess it would be Do you have any suggestions for a default loss you would like for
Why should the user specify the Aren't particular classes of loss functions pretty commonly used for classification regardless of dataset (like cross entropy)? How does the Scikit-learn API choose the loss for its classifiers when the loss function can vary? (e.g., SGDClassifier)
Because the loss function depends not only on the input data (which is not too hard to parse, especially since Scikit-Learn has tools like
Scikit-Learn has full control of the internals. They can choose/build a model to match a loss function; SciKeras/Keras have to work with whatever Model users create. That said, I don't really know what Scikit-Learn does internally, so I could be missing something. I don't mean to be overly negative on the idea. I do think that abstracting things like the loss function from users is good. The simpler it is for users to get started, the better. But I would like to work on core parts of SciKeras (like #167) that enable Scikit-Learn/Keras interoperability first before adding more convenience layers.
To be clear, I think
Correct me where I'm wrong:
My question: why can't you perform some heuristic processing to determine the right loss function? I am more than okay with
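A rough sketch of the kind of heuristic being proposed might look like the following. The function name and the inference rules are mine, not SciKeras API, and the sketch deliberately covers only the common cases:

```python
import numpy as np

def guess_loss(y):
    """Naively pick a Keras loss name from the target's shape and values.

    Illustrative heuristic only, not SciKeras code: it handles the
    common cases and silently guesses on ambiguous ones (e.g. a 2D
    0/1 matrix could be one-hot multiclass or multilabel).
    """
    y = np.asarray(y)
    if y.ndim == 1 or (y.ndim == 2 and y.shape[1] == 1):
        # 1D targets: class indices.
        if len(np.unique(y)) <= 2:
            return "binary_crossentropy"
        return "sparse_categorical_crossentropy"
    if y.ndim == 2:
        # A 2D 0/1 matrix with exactly one 1 per row *looks* one-hot...
        if np.isin(y, [0, 1]).all() and (y.sum(axis=1) == 1).all():
            return "categorical_crossentropy"
        # ...otherwise we guess multilabel / independent binary outputs.
        return "binary_crossentropy"
    raise ValueError(f"cannot infer a loss for target of shape {y.shape}")
```

For example, `guess_loss([0, 1, 1, 0])` returns `"binary_crossentropy"` and `guess_loss(np.eye(3))` returns `"categorical_crossentropy"`; the next comment explains why rules like these are less robust than they look.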
You are mostly right. But the devil is in the details. One-hot encoded targets are a good example. Some other sticky points I can think of:
So to answer your question, there is no particular reason why we can't implement a heuristic for this; I just think the implementation is harder than it sounds, and it will be hard to anticipate all the use cases and sharp edges.
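The one-hot sticking point can be made concrete: the same 2D 0/1 target matrix is simultaneously a valid one-hot multiclass encoding and a valid multilabel target, and nothing in the array itself distinguishes the two readings (illustration only, not SciKeras code):

```python
import numpy as np

# Three samples, three columns of 0/1 values.
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

# Reading 1: one-hot multiclass labels -> categorical_crossentropy.
as_class_indices = y.argmax(axis=1)  # array([0, 1, 2])

# Reading 2: three independent binary labels -> binary_crossentropy.
# Every row is also a perfectly legal multilabel assignment.
is_valid_multilabel = bool(np.isin(y, [0, 1]).all())  # True

# Both readings are internally consistent, so a heuristic that only
# sees `y` has to guess which loss the user actually meant.
print(as_class_indices, is_valid_multilabel)
```

This is the kind of ambiguity that makes a "sensible default" hard to get right without knowing the user's intent or the Model's output layer.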
I'm not proposing to cover every possible scenario. I think covering the majority of use cases would really enhance the usability of SciKeras. Wouldn't
I agree that we don't need to cover every scenario, but we at least have to think of how the failure might happen, and what the errors might look like to users. The last thing we want is to make it more confusing for users. If you want to tackle this in the short term, a PR would be welcome. You might have ideas about implementation that I can't think of.
Theoretically, for a single output, yes. In practice,
Why does KerasClassifier not have a default loss?
scikeras/scikeras/wrappers.py
Lines 1229 to 1231 in d83bffa
Shouldn't there be at least a sensible default that works in most use cases?
Reference issues/PRs
dask/dask-ml#794