Power Transformer Stability #440

Open
veni-vidi-vici-dormivi opened this issue May 6, 2024 · 13 comments
Labels
enhancement New feature or request topic-MESMER_M

Comments

@veni-vidi-vici-dormivi
Collaborator

veni-vidi-vici-dormivi commented May 6, 2024

I played around a bit with our power transformer. Try the following example:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
from mesmer.mesmer_m.power_transformer import PowerTransformerVariableLambda, lambda_function

np.random.seed(0)
n_years = 10_000

# make skewed data
skew = -5
local_monthly_residuals = scipy.stats.skewnorm.rvs(skew, size=n_years)
yearly_T = np.random.randn(n_years)

# find coefficients
pt = PowerTransformerVariableLambda()
pt.coeffs_ = pt._yeo_johnson_optimize_lambda(local_monthly_residuals, yearly_T)

# transform data
lmbdaleft = lambda_function(pt.coeffs_, yearly_T)
transformed = pt._yeo_johnson_transform(local_monthly_residuals, lmbdaleft)

# plot
fig, axs = plt.subplots(3)
axs[0].hist(local_monthly_residuals, bins=100)
axs[0].set_title('original (left skewed) data, skew = {}'.format(round(scipy.stats.skew(local_monthly_residuals), 2)))
axs[1].hist(transformed, bins=100)
axs[1].set_title('transformed data, skew = {}'.format(round(scipy.stats.skew(transformed),2)))
axs[2].hist(local_monthly_residuals-transformed, bins=100)
axs[2].set_title('difference')
plt.tight_layout()
plt.show()

[figure: histograms of the original (left-skewed) data, the transformed data, and their difference]
The data is nicely de-skewed.

But let's try it with 250 data points, which is how many datapoints we have for each month and gridcell. The output looks like this:
[figure: the same three histograms for n_years = 250]
The power transformer fails to normalize the data.

Arguably, our data is probably never skewed this heavily, but I wanted to leave this here for future reference in case we want to revisit it.

Edited to not use n_years × 12 samples, since this is not what we have in the application.

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi added wontfix This will not be worked on topic-MESMER_M labels May 6, 2024
@veni-vidi-vici-dormivi
Collaborator Author

Apparently it's not only the sample size; the optimization is generally rather sensitive to initial conditions. Similar things happen for some variations of the bounds, the first guess, or the yearly_T.

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi changed the title Estimation of coefficients for power transformer depends heavily on sample size Estimation of coefficients for power transformer depends heavily on initial conditions May 6, 2024
@mathause
Member

mathause commented May 6, 2024

Interesting - what coeffs do you get out?

@veni-vidi-vici-dormivi
Collaborator Author

veni-vidi-vici-dormivi commented May 7, 2024

In the first case the coefficients come out around (0, 0.1) and in the second around (0.96, 0.013). Here I set the bounds to (0, 1) for the first and to (-0.1, 0.1) for the second coefficient; the first guess is (1, 0). When I do the same without any bounds, even the example with 10,000 samples is much less normalized, with coefficients coming out at (0.6, 0.2). When I try with bounds of (0, 100) and (-100, 100), the data is more skewed after the transform than before, and the coefficients are (1.4, 96.2).

@veni-vidi-vici-dormivi
Collaborator Author

Maybe we should check the scipy.optimize.OptimizeResult.success attribute...
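
A minimal sketch of what such a check could look like (the wrapper name and warning text are made up, not part of the current code):

import warnings

from scipy.optimize import minimize


def _optimize_lambda_checked(neg_log_likelihood, first_guess, bounds):
    # keep the full OptimizeResult instead of only `.x` so that `.success` can be inspected
    res = minimize(
        neg_log_likelihood,
        first_guess,
        bounds=bounds,
        method="SLSQP",
    )
    if not res.success:
        warnings.warn(f"Lambda optimization did not converge: {res.message}")
    return res.x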

@veni-vidi-vici-dormivi
Collaborator Author

@mathause and I looked into this a little more.

The first interesting thing is that, as Mathias mentioned already, lambda does not need to be between 0 and 2. Thus, our lambda function is maybe not very well suited to estimate lambda. We should look into the possible benefits of making it a rational function (having possible $\xi_0 < 0$) or maybe a completely different lambda function.
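
For reference, the current lambda function is, as far as I can tell, the logistic form

$$\lambda(T) = \frac{2}{1 + \xi_0 \exp(\xi_1 T)},$$

which stays within $(0, 2)$ as long as $\xi_0 > 0$; allowing $\xi_0 < 0$ would drop that guarantee.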

Secondly, the definition of the loss function

n_samples = local_monthly_residuals.shape[0]
loglikelihood = (
    -n_samples / 2 * np.log(transformed_local_monthly_resids.var())
)
loglikelihood += (
    (lambdas - 1)
    * np.sign(local_monthly_residuals)
    * np.log1p(np.abs(local_monthly_residuals))
).sum()

is not intuitively clear to us. It is, however, the same as in sklearn, but in our opinion not the same as in the original paper:

[image: log-likelihood expression from the original Yeo-Johnson paper]

@mathause
Member

mathause commented May 8, 2024

Additional comments:

@mathause
Member

Can you try your example with n_years = 250 again, commenting out the jac=rosen_der part? That just worked for me.

@veni-vidi-vici-dormivi
Collaborator Author

True, works for me too. But it's at least not a universal fix because when I do it with skew = 5 it does not work anymore.

@mathause
Member

Can you try with the following diff (from main):

diff --git a/mesmer/mesmer_m/power_transformer.py b/mesmer/mesmer_m/power_transformer.py
index d746921..a3020cc 100644
--- a/mesmer/mesmer_m/power_transformer.py
+++ b/mesmer/mesmer_m/power_transformer.py
@@ -144,7 +144,7 @@ class PowerTransformerVariableLambda(PowerTransformer):
         local_yearly_T = local_yearly_T[~np.isnan(local_yearly_T)]
 
         # choosing bracket -2, 2 like for boxcox
-        bounds = np.c_[[0, -0.1], [1, 0.1]]
+        bounds = np.c_[[0, -0.1], [np.inf, 0.1]]
         # first guess is that data is already normal distributed
         first_guess = np.array([1, 0])
 
@@ -153,7 +153,7 @@ class PowerTransformerVariableLambda(PowerTransformer):
             first_guess,
             bounds=bounds,
             method="SLSQP",
-            jac=rosen_der,
+            # jac=rosen_der,
         ).x
 
     def _yeo_johnson_transform(self, local_monthly_residuals, lambdas):

@veni-vidi-vici-dormivi
Collaborator Author

veni-vidi-vici-dormivi commented May 15, 2024

Aha! I had already tried the infinite upper bound, but not without rosen_der; that seems to work well! (The infinite upper bound should have been implemented in #427.)

@veni-vidi-vici-dormivi
Collaborator Author

I actually had a feeling that the PRs above would not be the end of this. It turns out that using method = "SLSQP" leads to numerical instabilities in the coefficients (different results between operating systems, see #430). It happens here in the code:

return minimize(
    _neg_log_likelihood,
    first_guess,
    bounds=bounds,
    method="SLSQP",
).x

The scipy optimize.minimize documentation says that bounds can be used with the methods Nelder-Mead, L-BFGS-B, TNC, SLSQP, Powell, trust-constr, and COBYLA. However, SLSQP, trust-constr, and COBYLA are later listed under "constrained minimization", while the others are listed under "bound-constrained minimization". I think the constrained methods are preferred when one passes an actual constraint to the constraints parameter of optimize.minimize.

Concerning the bound-constrained methods, Nelder-Mead is listed as the most robust one, but "if numerical computation of derivative can be trusted, other algorithms using the first and/or second derivatives information might be preferred for their better performance in general".

I am not exactly sure whether that is the case here. We minimize the negative log-likelihood, which takes in the variance of the transformed residuals (a constant), the original residuals, and lambda (derived from a logistic function). In general, it is not guaranteed that the negative log-likelihood is twice differentiable (see here).

I tried the other options:

  • TNC is the slowest of the four options,
  • L-BFGS-B is also slower than Nelder-Mead (this would be the default option),
  • Powell is faster than Nelder-Mead (and also does not need a differentiable function).

All of them are actually faster than SLSQP. Moreover, testing on the coarse grid data I get several warnings with SLSQP that the coefficients went outside the bounds, which does not seem to be a problem with the other methods.

I think our best option for now is to use Nelder-Mead, because it seems to be the safest/most robust one judging from the documentation, and it solves the instability issue. Of course it would be better to read into all of this and make a more informed choice, but at the moment I think it is not worth the time.
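
For reference, the change would just swap the method in the call quoted above (a sketch, assuming we keep the current bounds and first guess):

return minimize(
    _neg_log_likelihood,
    first_guess,
    bounds=bounds,
    # Nelder-Mead supports bounds and avoids the cross-platform instability
    method="Nelder-Mead",
).x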

@mathause
Member

mathause commented May 23, 2024

Thanks for looking into this! I tried to figure out the difference between bounds and constraints. If I understand correctly, bounds restrict the values individual params can take, while constraints restrict the values of all params (e.g. x[0] + x[1] < 50 or so). So we need bounds but not constraints (although I think bounds could be expressed as constraints).
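
To make the distinction concrete, a small toy example (nothing from our code, just plain scipy usage):

from scipy.optimize import minimize


def objective(x):
    return (x[0] - 2) ** 2 + (x[1] - 3) ** 2


# bounds: each parameter is restricted individually (what we need for the coefficients)
res_bounds = minimize(objective, x0=[0.5, 0.0], bounds=[(0, 1), (-0.1, 0.1)])

# constraints: a joint restriction on the parameters, e.g. x[0] + x[1] <= 50
res_constr = minimize(
    objective,
    x0=[0.5, 0.0],
    constraints={"type": "ineq", "fun": lambda x: 50 - (x[0] + x[1])},
)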


Not for now, but one thing we could play around with: would it be more efficient to reformulate the logistic lambda function such that no bounds are needed? I.e. write it as $2 / (1 + \exp(- (\beta_0 + \beta_1 \cdot x)))$ instead of $2 / (1 + \beta_0 \cdot \exp(\beta_1 \cdot x))$ and then use an unbounded optimization? (As this formulation ensures the expression cannot be negative.)
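
A minimal sketch of that reformulation (the function name is made up and this is untested against our pipeline):

import numpy as np


def lambda_function_reparam(coeffs, yearly_T):
    # reparameterized logistic: 2 / (1 + exp(-(beta_0 + beta_1 * T)))
    # stays within (0, 2) for any real beta_0, beta_1, so no bounds are needed
    beta_0, beta_1 = coeffs
    return 2.0 / (1.0 + np.exp(-(beta_0 + beta_1 * yearly_T)))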

@veni-vidi-vici-dormivi
Collaborator Author

That's a very interesting point! I'll leave the issue open for now.

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi changed the title Estimation of coefficients for power transformer depends heavily on initial conditions Power Transformer Stability May 24, 2024
@veni-vidi-vici-dormivi veni-vidi-vici-dormivi added the enhancement New feature or request label Jun 21, 2024