
[Draft] pseudo-p significance calculation #281

Open
wants to merge 9 commits into main

Conversation

JosiahParry

This PR drafts a function calculate_significance() to provide a consistent way to calculate pseudo-p values from a reference distribution.

It is based on the discussion at #199
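As a sketch of the idea (the function name and signature here are assumptions for illustration, not the PR's actual code), a pseudo-p value compares an observed statistic against a reference distribution of simulated replicates using (M + 1) / (R + 1):

```python
import numpy as np

def pseudo_p(test_stat, reference_distribution, alternative="greater"):
    """Illustrative pseudo-p value (M + 1) / (R + 1), where R is the number
    of replicates and M counts replicates at least as extreme as the
    observed statistic."""
    reps = np.asarray(reference_distribution)
    if alternative == "greater":
        extreme = int((reps >= test_stat).sum())
    elif alternative == "lesser":
        extreme = int((reps <= test_stat).sum())
    else:
        raise ValueError(f"unknown alternative: {alternative}")
    return (extreme + 1) / (reps.size + 1)

rng = np.random.default_rng(0)
reps = rng.normal(size=999)
p = pseudo_p(2.0, reps, "greater")
```

The "+1" terms count the observed statistic itself as one realization of the null, which keeps the p-value strictly positive.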


codecov bot commented Feb 15, 2024

Codecov Report

Attention: Patch coverage is 0%, with 62 lines in your changes missing coverage. Please review.

Project coverage is 72.3%. Comparing base (f9236f9) to head (1c59fa7).
Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #281     +/-   ##
=======================================
- Coverage   73.6%   72.3%   -1.4%     
=======================================
  Files         24      25      +1     
  Lines       3316    3378     +62     
  Branches     520     529      +9     
=======================================
  Hits        2441    2441             
- Misses       708     770     +62     
  Partials     167     167             
Files Coverage Δ
esda/significance.py 0.0% <0.0%> (ø)

@ljwolf
Member

ljwolf commented Feb 15, 2024

OK, done here on the logic & implementation. Thank you @JosiahParry for getting the ball rolling here 😄 Very much appreciated!

I've re-implemented the percentile-based two-sided test from scratch using scipy.stats.scoreatpercentile. This approach finds the percentile of the test statistic in the reference distribution and counts how many replicates fall outside of (p, 1-p); over simulations, this count is always 2*directed. Second, I modified your folding approach to fold around the mean of the replicates, rather than zero (since the expected value of local stats generally isn't zero), and kept it as a folded option for testing.

I don't think we should expose the folded variant to the user in the classes, since the power in each direction is dependent on the symmetry of the distribution. For example, in the illustration below, the smallest replicate, when folded, is not "extreme," but this is accounted for in the percentile-based method.

[figure: sketch of a skewed reference distribution, showing that the smallest replicate is not "extreme" once folded, while the percentile-based method accounts for it]

The percentile will always equal the folded version for symmetric distributions, but the folded version becomes a directed test as skew increases. I think that the (over+under)/all is also the intended estimand of the directed approach, after re-reading the Hope paper referred to in #199
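A minimal sketch of the two counting schemes described above (illustrative only, not the code in this PR; function names are assumptions):

```python
import numpy as np

def folded_p(test_stat, reps):
    # fold the replicates around their mean (not zero, since the expected
    # value of local stats generally isn't zero), then run a "greater" test
    reps = np.asarray(reps)
    dev = np.abs(reps - reps.mean())
    obs = abs(test_stat - reps.mean())
    return (int((dev >= obs).sum()) + 1) / (reps.size + 1)

def two_sided_p(test_stat, reps):
    # count replicates outside the (p, 1-p) percentile interval, where p
    # is the percentile of the observed statistic: k in each tail
    reps = np.asarray(reps)
    k = min(int((reps >= test_stat).sum()), int((reps <= test_stat).sum()))
    return (2 * k + 1) / (reps.size + 1)

rng = np.random.default_rng(1)
symmetric = rng.normal(size=9999)
# for a symmetric reference distribution the two agree closely; for a
# skewed one (e.g. rng.chisquare(2, size=9999)) the folded version drifts
# toward a directed test
```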

If other maintainers approve these four options (greater, lesser, two-sided, and directed) for the user classes and a folded option for this function only (for replication/testing purposes) I can start propagating this across the user classes.

Member

@knaaptime knaaptime left a comment


Sweet. Presume the stuff in main gets moved to a test file or something, but this is great.

> If other maintainers approve these four options (greater, lesser, two-sided, and directed) for the user classes and a folded option for this function only (for replication/testing purposes) I can start propagating this across the user classes.

+1

percentile = (reference_distribution < test_stat).mean(axis=1)
bounds = np.column_stack((1 - percentile, percentile)) * 100
bounds.sort(axis=1)
lows, highs = np.row_stack(
Member


This may be vectorisable, but I couldn't quickly figure it out; the following did not generate the same results as the code below:

stats.scoreatpercentile(reference_distribution, bounds, axis=1)
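One non-vectorised fallback (a sketch, not the PR's code; the helper name is hypothetical) is to apply np.percentile row by row with each row's own bounds, matching the per-row semantics the loop needs:

```python
import numpy as np

def row_percentile_bounds(reference_distribution, bounds):
    # bounds[i] holds the (low, high) percentiles for row i, in [0, 100]
    lows = np.empty(len(reference_distribution))
    highs = np.empty(len(reference_distribution))
    for i, (row, (lo, hi)) in enumerate(zip(reference_distribution, bounds)):
        lows[i] = np.percentile(row, lo)
        highs[i] = np.percentile(row, hi)
    return lows, highs

ref = np.tile(np.arange(101.0), (2, 1))  # two identical rows: 0..100
bounds = np.array([[25.0, 75.0], [10.0, 90.0]])
lows, highs = row_percentile_bounds(ref, bounds)
```

A single scoreatpercentile call with axis=1 applies every percentile to every row, which is likely why the vectorised attempt disagreed with the loop.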

esda/significance.py: three outdated review threads (resolved)
@ljwolf
Member

ljwolf commented Mar 5, 2024

I am still working on this, but I recall now why the implementation of an "alternative" argument was a bit trickier than I expected: because we allow the user to "discard" the random replicates rather than store them, we have to push the significance logic all the way down to the conditional randomization numba function. This may have significant performance implications, since we currently only count the number of larger random stats in each innermost loop.

It seems clear to me that

  1. if the test statistic is as large as k realizations,
  2. then there are always 2k simulations outside of the (k/n, 1-(k/n)) interval, plus the test statistic itself.
  3. So, the proper p-value for the two-sided test is (2*n_greater+1)/(n_samples + 1),
  4. which is off from two times our current p-value by 1/(n_samples+1).

If 1-4 are correct, this means we don't need to change any of the numba code. The correction can be calculated as 2*directed - (1/(n_samples+1)) after the numba calculation. Do I have this right @sjsrey @knaaptime @martinfleis @jGaboardi?

So, if we implement our current test for local stats without flipping (as greater), generate 1-p_sim (as lesser), and implement the above correction for the two-sided test (2*p_sim - (1/(n_samples+1))), none of the numba code needs to change.

Is that OK w/ other maintainers?
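If steps 1-4 hold, the claim is pure algebra: 2*(n_greater+1)/(n_samples+1) - 1/(n_samples+1) = (2*n_greater+1)/(n_samples+1). A quick numeric check (illustrative, not esda code):

```python
n_samples = 999
for n_greater in (0, 10, 100, 499):
    p_sim = (n_greater + 1) / (n_samples + 1)          # current directed p_sim
    two_sided = (2 * n_greater + 1) / (n_samples + 1)  # step 3 above
    corrected = 2 * p_sim - 1 / (n_samples + 1)        # post-numba correction
    assert abs(two_sided - corrected) < 1e-12
```

So the two-sided p-value can indeed be recovered from the existing one-sided count after the fact.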

@martinfleis
Member

This is too much stats for me to say anything useful.

@jGaboardi
Member

> This is too much stats for me to say anything useful.

Same for me.

Member

@jGaboardi jGaboardi left a comment


Since the topic is over my head, my approval is based on a general review.

@ljwolf
Member

ljwolf commented Mar 6, 2024

One further wrinkle as well: some global Moran tests already support directed testing via a binary option. Notably, if two_tailed=False, they pick the test direction based on whether the global I is positive or negative. It's also useful to note that this means we currently pick the smaller one-tailed p-value and, if the test is two-tailed, multiply it by two.

For us to roll out the testing across all the classes, we need to consider whether this option should be deprecated in favor of an explicit "alternative" option. Right now, there's no way to force a direction on these tests.

@sjsrey
Member

sjsrey commented Mar 7, 2024

> I am still working on this, but I recall now why the implementation of an "alternative" argument was a bit trickier than I expected... because we allow for the user to "discard" the random replicates, rather than store them, we have to push the significance logic all the way down to the conditional randomization numba function. This may have significant performance implications, since we're currently only counting the number of larger random stats in each innermost loop.
>
> It seems clear to me that
>
>   1. if the test statistic is as large as k realizations,
>   2. then there are always 2k simulations outside of the (k/n, 1-(k/n)) interval, plus the test statistic itself.
>   3. So, the proper p-value for the two-sided test is (2*n_greater+1)/(n_samples + 1),
>   4. which is off from two times our current p-value by 1/(n_samples+1).
>
> If 1-4 are correct, this means we don't need to change any of the numba code. The correction can be calculated as 2*directed - (1/(n_samples+1)) after the numba calculation. Do I have this right @sjsrey @knaaptime @martinfleis @jGaboardi?
>
> So, if we implement our current test for local stats without flipping (as greater), generate 1-p_sim (as lesser), and implement the above correction for the two-sided test (2*p_sim - (1/(n_samples+1))), none of the numba code needs to change.
>
> Is that OK w/ other maintainers?

I think this is OK.

One thing to check is whether:

>   3. So, the proper p-value for the two-sided test is (2*n_greater+1)/(n_samples + 1),

should instead be 2*(n_greater+1)/(n_samples+1).

@ljwolf
Member

ljwolf commented Mar 7, 2024

> One thing to check is if:

Sure, that is what I initially thought & what @JosiahParry suggested.

The reason I'm thinking it's actually 2*p_sim - (1/(n_permutations + 1)) is that using 2*p_sim amounts to counting the test stat twice: 2*p_directed = 2*(n_outside + 1)/(n+1) = (2*n_outside + 2)/(n+1). The difference will be vanishingly small as the number of permutations increases, but it's the principle...

Thinking about it another way: in the percentile-based version of the test, you compute the percentile p for the test statistic, count how many null statistics are outside of (p, 1-p), and add one, since the test stat is always at least at its own percentile. This p-value is smaller than 2*p_sim by 1/(n_samples+1); counting the test statistic twice in the percentile test would reproduce 2*p_sim exactly.

The simulation code at the end of esda/simulation.py should illustrate this.

@weikang9009
Member

@ljwolf I was looking at the discussions in this PR and the other related issue. The correction for the two-sided test 2*p_sim - (1/(n_samples+1)) looks correct to me.

@JosiahParry
Author

Thank you for the explanation @ljwolf. I think I'm almost there/onboard!

It's worth calling out explicitly that this formula can result in a p-value > 1.0, which should also be handled, e.g.

p_sim = 0.65
nsim = 999

(p_corrected = (2*p_sim - (1/(nsim + 1))))
#> [1] 1.299

if (p_corrected > 1) {
  1.0
} else {
  p_corrected
}
#> [1] 1

Additionally, would you mind elaborating on why it is - (1/(nsim + 1)) as opposed to + (1/(nsim + 1))? To me, it makes more sense to penalize smaller numbers of simulations rather than larger numbers of simulations. For example, subtracting the second term results in smaller p-values for smaller numbers of simulations and larger ones for larger numbers of simulations.

calc_p_sim <- function(p_sim, nsim) {
  (p_corrected = (2*p_sim - (1/(nsim + 1))))

  if (p_corrected > 1) {
    1.0
  } else {
    p_corrected
  }

}

calc_p_sim(0.05, 49)
#> [1] 0.08
calc_p_sim(0.05, 99)
#> [1] 0.09
calc_p_sim(0.05, 999)
#> [1] 0.099

Comment on lines +28 to +29
the directed p-value is half of the two-sided p-value, and corresponds to running the
lesser and greater tests, then picking the smaller significance value. This is not advised.
Author


Note that this will be untrue if the adjustment is added.

Comment on lines +9 to +10
Pseudo-p values are calculated using the formula (M + 1) / (R + 1), where R is the number of simulations
and M is the number of times that the simulated value was equal to, or more extreme than, the observed test statistic.
Author


TODO: describe the adjustment here

Comment on lines +19 to +23
One of 'two-sided', 'lesser', or 'greater'. Indicates the alternative hypothesis.
- 'two-sided': the observed test-statistic is an extreme value of the reference distribution.
- 'lesser': the observed test-statistic is small relative to the reference distribution.
- 'greater': the observed test-statistic is large relative to the reference distribution.
- 'directed': the observed test statistic is either small or large relative to the reference distribution.
Author


This states that the valid arguments are 'two-sided', 'lesser', or 'greater'; however, a 'directed' option is documented as well, though it is unclear to me how 'directed' and 'two-sided' differ.

@weikang9009
Member

> It's worth calling out explicitly this formula can result in a p-value > 1.0 which should also be handled e.g.

The formula 2*p_sim - (1/(n_samples+1)) will not result in a p-value larger than 1, since p_sim is equivalent to the directed case, which always picks the smaller side; that means p_sim is at most (n_samples/2 + 1)/(n_samples + 1), so the corrected value never exceeds 1.

> Additionally, would you mind elaborating why it is - (1/(nsim + 1)) as opposed to + (1/(nsim + 1))? To me, it makes more sense to penalize smaller numbers of simulations rather than larger number of simulations. For example subtracting the second term results in smaller p values for smaller numbers of simulations and larger ones for larger numbers of simulations.

This is based on the equation for the two-sided case p_two_sided = (2*n_greater+1)/(n_samples + 1) and the current implementation of p_sim: p_sim = (n_greater+1)/(n_samples + 1)

@ljwolf please correct me if I'm wrong
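That bound can be checked at the worst case (a sketch; ties between replicates and the observed statistic are ignored here):

```python
n_samples = 999
worst_extreme = n_samples // 2                 # the smaller tail, ignoring ties
p_sim = (worst_extreme + 1) / (n_samples + 1)  # 0.5 in this worst case
corrected = 2 * p_sim - 1 / (n_samples + 1)    # stays strictly below 1
assert corrected <= 1.0
```

So no clamping is needed when p_sim is the directed (smaller-tail) value.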

@ljwolf
Member

ljwolf commented May 1, 2024

Yep, exactly right!

@JosiahParry, think about it in terms of the percentiles; maybe that will be clearer.

The correction term adjusts for double counting the "observed" stat: if you just did 2*p_sim, that computes the number of "outside" simulated statistics, plus the observed stat twice (as if we see it at both percentile p and 1-p). However, we actually only see the observed stat once, at percentile p; the two-tailed boundary in the "other" tail is derived from 1-p.

Like, imagine the number of simulations is 100. Sort them from smallest to largest. If 20 simulations are as big as the test statistic, then the test statistic is at the 80th percentile. At the (100-80)th percentile, there are n*(1-.8) = 20 smaller observations.

So, we see 20 more extreme large observations, 1 test statistic, and 20 more extreme smaller observations as if the test statistic were in the other tail.

2*p_sim*(n+1) would count 20 more extreme large stats, 20 more extreme small stats, and two test statistics, which is one too many. So, we need to deflate that count by 1, which deflates the p-value by 1/(n+1).
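That counting argument can be made concrete with made-up replicate values (illustrative, not esda code):

```python
import numpy as np

reps = np.arange(1.0, 101.0)   # 100 sorted "replicates"
test_stat = 80.5               # 20 replicates (81..100) exceed it

n_upper = int((reps > test_stat).sum())      # 20 extreme large replicates
n_lower = int((reps < reps[n_upper]).sum())  # 20 mirrored small replicates
outside = n_upper + n_lower                  # 40 extreme replicates in total

# counting the observed statistic once:
p_two_sided = (outside + 1) / (len(reps) + 1)    # 41/101
# naive doubling counts the observed statistic twice:
p_doubled = 2 * (n_upper + 1) / (len(reps) + 1)  # 42/101
assert abs(p_doubled - p_two_sided - 1 / (len(reps) + 1)) < 1e-12
```

The gap between the naive doubled value and the properly counted one is exactly 1/(n+1).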

@knaaptime
Member

For some reason this keeps reminding me of Downey's Euro Problem example, probably just because it reminds me that "the statistician's" default test is always two-sided.
[screenshot: excerpt from Downey's Euro Problem example]
