ModelCheckpoint used with save_best_only doesn't handle interruptions, even with BackupAndRestore #430
Comments
@gowthamkpr,
@nicolasnn - Thanks for reporting this issue with detailed examples. I am working on an update to
Thanks Ramesh!
I have run into the same issue. @sampathweb are you still planning to fix this? @rchao
Hello, just curious if this issue has been fixed or if it is still in the works?
Fixing this requires us to update
System information.
Describe the problem.
I am using ModelCheckpoint callback to save the best model, combined with BackupAndRestore callback to handle interruptions.
The problem appears when the training script is run again after an interruption. The model restored by BackupAndRestore does not carry over the previous values of the losses and metrics. As a result, ModelCheckpoint saves the model on the first epoch of the new run regardless of the loss value, and it can even overwrite the "best" model with a worse one.
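To make the mechanism concrete, here is a toy, framework-free sketch of the comparison logic. `BestOnlyTracker`, `run`, and the loss values are made up for illustration; this is not the Keras implementation, only a model of "save when the monitored value beats the best seen so far, starting from inf".

```python
import math


class BestOnlyTracker:
    """Toy stand-in for a save-best-only checkpoint (hypothetical, not Keras code)."""

    def __init__(self):
        # A new process always starts from inf; nothing is carried over.
        self.best = math.inf

    def should_save(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            return True
        return False


def run(tracker, losses, disk):
    """Feed one loss per epoch and 'save' to the shared disk dict on improvement."""
    for loss in losses:
        if tracker.should_save(loss):
            disk["best_model_loss"] = loss


disk = {}

# First run: the best model on disk ends up with loss 1.2.
run(BestOnlyTracker(), [2.5, 1.2, 1.4], disk)

# Script restarted after an interruption: the tracker is rebuilt with best = inf,
# so the first epoch counts as an improvement and overwrites the better model.
run(BestOnlyTracker(), [2.0], disk)

print(disk["best_model_loss"])  # 2.0, worse than the 1.2 saved by the first run
```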
Describe the current behavior.
Describe the expected behavior.
Standalone code to reproduce the issue.
First training
The output is:
2nd training
The output is:
The problem lies at "val_loss improved from inf to 2.09097": BackupAndRestore doesn't restore the previous best value of val_loss. The comparison starts again from inf, so ModelCheckpoint doesn't do what it is supposed to do, and it even overwrites the "best" model with a worse one.
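Until this is handled by the callbacks themselves, one possible workaround is to persist the best value in a small side file and feed it back on restart. The sketch below is untested against this exact setup and rests on assumptions: the installed Keras version accepts the `initial_value_threshold` argument of ModelCheckpoint and exposes the current best as `checkpoint.best`. The helper names (`load_best`, `save_best`, `make_callbacks`) and the JSON file are hypothetical.

```python
import json
import math
import os


def load_best(path):
    """Return the best monitored value saved by a previous run, or None."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)["best"]


def save_best(path, value):
    """Persist the best monitored value, skipping None and the inf placeholder."""
    if value is None:
        return
    value = float(value)
    if not math.isfinite(value):
        return
    # Write to a temp file first so an interruption cannot leave a partial file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"best": value}, f)
    os.replace(tmp, path)


def make_callbacks(backup_dir, checkpoint_path, best_path):
    """Build the callback list; Keras is imported lazily (assumed to be installed)."""
    from tensorflow import keras

    checkpoint = keras.callbacks.ModelCheckpoint(
        checkpoint_path,
        monitor="val_loss",
        save_best_only=True,
        # Assumption: seeds the comparison with the previous best instead of inf.
        initial_value_threshold=load_best(best_path),
    )
    # Listed after `checkpoint`, so it runs once `checkpoint.best` has been updated.
    record_best = keras.callbacks.LambdaCallback(
        on_epoch_end=lambda epoch, logs: save_best(best_path, checkpoint.best)
    )
    backup = keras.callbacks.BackupAndRestore(backup_dir=backup_dir)
    return [backup, checkpoint, record_best]
```

The returned list would be passed as `callbacks=` to `model.fit`. The side file and the backup directory are written at slightly different moments, so an interruption between the two could leave them one epoch apart.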