To find a good first guess for parameters other than location and scale, we use the following loss function:

mesmer/mesmer/mesmer_x/train_l_distrib_mesmerx.py, lines 1173 to 1193 in 72d83a5
First, we calculate the differences between all samples and the bounds of the distribution's support, and of these differences we take the minimum values. Take for example a truncated normal distribution:
```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

n = 251
pred = np.ones(n)

loc = 0.0
scale = 0.1
a = -1.2  # nr of stds from loc at which to truncate
b = 1.2
targ = scipy.stats.truncnorm.rvs(a=a, b=b, loc=loc, scale=scale, size=n, random_state=rng)
```
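The minimum differences described above could be computed as follows. This is my reconstruction of the quantities `val_bot` and `val_top` referenced below, not the exact code in `train_l_distrib_mesmerx.py`:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
targ = scipy.stats.truncnorm.rvs(a=-1.2, b=1.2, loc=0.0, scale=0.1, size=251, random_state=rng)

# support of the candidate distribution for the current parameters
bot, top = scipy.stats.truncnorm.support(a=-1.2, b=1.2, loc=0.0, scale=0.1)

# signed distances of the most extreme samples to the bounds;
# a negative value means that sample lies outside the support
val_bot = np.min(targ - bot)  # distance of the smallest sample to the lower bound
val_top = np.min(top - targ)  # distance of the largest sample to the upper bound
```

Since these samples are drawn from the candidate distribution itself, both distances come out non-negative here; during optimization they can be negative.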
The parameters a and b give you the lower and upper bound of the distribution (in terms of the number of stds from loc) and thus define the so-called support of the distribution. To optimize for a and b, we calculate how far the most extreme samples lie from the bounds of the distribution that results from the chosen parameters.
If the smallest sample is smaller than the lower bound of the resulting distribution, the difference we calculated (val_bot) will be negative. In this case our loss is $\propto$ np.exp(-val_bot) (I am ignoring the margin for now for simplicity). This should result in a being adjusted such that the difference decreases exponentially and the lower bound moves closer to this most extreme sample, until that sample lies inside the support of the distribution. The same is true for the upper bound. These are the first two cases in the code block above.
If both differences are positive, meaning that all samples already lie within the support of the distribution, our loss is 1 / (val_bot + margin0 * s) + 1 / (val_top + margin0 * s), so larger differences lead to a smaller loss. This means we actually optimize for the support of the distribution being as large as possible.
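Putting the three cases together, the current loss behaves roughly as in the following sketch (a hypothetical reconstruction using the names from above; the defaults for margin0 and s are made up for illustration):

```python
import numpy as np

def old_loss(val_bot, val_top, margin0=0.05, s=1.0):
    # sketch of the case-based first-guess loss, not the exact implementation
    if val_bot < 0:
        # smallest sample below the lower bound: exponential penalty
        return np.exp(-val_bot)
    if val_top < 0:
        # largest sample above the upper bound: exponential penalty
        return np.exp(-val_top)
    # all samples inside the support: loss shrinks as the support grows
    return 1 / (val_bot + margin0 * s) + 1 / (val_top + margin0 * s)

# widening the support from distances (0.1, 0.1) to (1.0, 1.0)
# strictly decreases the loss, i.e. a larger support is always preferred
assert old_loss(1.0, 1.0) < old_loss(0.1, 0.1)
```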
Now I see several problems with this:
The definition with if cases is not optimal. (1) As long as val_bot is negative, coefficients that influence only val_top have no effect. Sometimes only the bottom support (the first if statement) will be optimized, since this is the first condition we check. (2) Jumping between cases can lead to local minima. For example, if within three iterations we switch from case 1 to case 2 and back to case 1, it can happen that the loss for case 2 is smaller than the loss for case 1 in the other two iterations, so that the minimization terminates (this happened for the truncated normal I tested).
I am not sure about optimizing for a large support. For a GEV (where the shape parameter also controls the upper or lower bound of the support) this might be less sensitive, but for the truncated normal it is essential to find these bounds (I see that this might not be a use case for us). Moreover, a loss function without any local or global minimum, like this one, can lead to non-convergence.
I want to propose a new loss function that is continuous and does not favor a large support, namely simply a quadratic function in two dimensions, i.e. a paraboloid. This would optimize the distances of the most extreme samples to the bounds towards 0 (plus some margin, maybe?), thus making the bounds of the distribution close to the most extreme samples.
It is again three cases, since if one side of the distribution is infinite, we should only optimize the other end. I see that this can still lead to jumping between cases and potential local minima arising from that, but I would hope it is much rarer, as I would expect that one bound being infinite is a rather "stable" property of a distribution. I tested this for the GEV and the truncnorm, and it works more reliably than the old version (if not always better in the sense that the coefficients come out closer to the real ones, since some local minima were apparently pretty good).
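A minimal sketch of what the proposed quadratic loss could look like (my own illustration of the idea, not the exact proposed code; val_bot is +inf when the lower bound of the support is -inf, and likewise for val_top):

```python
import numpy as np

def quadratic_loss(val_bot, val_top):
    # paraboloid in (val_bot, val_top) with its minimum where both
    # distances to the support bounds are zero
    if np.isinf(val_bot):  # lower bound of the support is -inf
        return val_top**2
    if np.isinf(val_top):  # upper bound of the support is +inf
        return val_bot**2
    return val_bot**2 + val_top**2

# minimal when the most extreme samples sit exactly on the bounds ...
assert quadratic_loss(0.0, 0.0) == 0.0
# ... and a too-wide support is penalized the same as samples outside it
assert quadratic_loss(1.0, 0.0) == quadratic_loss(-1.0, 0.0) == 1.0
```

Unlike the exponential/reciprocal version, this function is continuous across the case boundary at val_bot = 0 (and val_top = 0) and has a unique minimum, which should help the optimizer converge.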