On the Harmonic-Mean of p-values

Tags: : Statistics, Machine Learning

The Harmonic-Mean p-value (HMP), as the name suggests, is the harmonic mean of p-values. It can replace other aggregations, such as Fisher’s combination and be used as a test-statistic for signal detection, a.k.a. global null testing. It is an elegant test statistic: it is easy to compute, and it’s null distribution can be easily derived for independent tests. It is thus a useful tool for signal-detection/meta-analysis.

Our interest in the HMP began with Daniel Wilson’s PNAS paper [1]. In there, and the accompanying Wikipedia entry, Wilson claims the following properties of the HMP:

  1. The HMP offers strong FWER control.
  2. The HMP is a more powerful test than Bonferroni and Simes’.
  3. The HMP is valid when p-values are statistically dependent.

Let’s parse them one at a time:

Regarding Strong FWER control: Strong FWER control means that the probability of any false positive is controlled, even in the presence of signal in some of the variables. This is not true for the HMP, because unless embedded in a closed testing procedure, HMP is a global test, and not a multiple test. Weak FWER control means that the probability of a false positive is under control, only if no variable carries signal. Wilson, however, does show that HMP offers more than weak FWER control. Because HMP is decreasing as more hypotheses are intersected, this means that if HMP rejects some conjunction hypothesis, then all the hypotheses in the closure of the conjunction will also be rejected. Wilson calls it strong control, but we think that “weak FWER control in selected subsets” may be more accurate.

Here is a formal definition, followed by an example. Denote a set of null-hypotheses, the number of false-nulls (i.e., signals, effects, associations,…), in , and the number of false rejections of some inference algorithm, such as HMP. Weak FWER means that if then Strong FWER means that Weak FWER in the Selected means that if where denotes a rejected hypothesis.

Here is an example in genetics. Based on SNP-wise p-values, a researcher declares genes A and B as significant. Weak FWER means the inference is valid only if there is no effects/associations anywhere in A, nor B, nor any other SNP outside of . Weak FWER in the selected means the inference is valid if there is no effects/associations anywhere in A,B. Effects/associations that are outside are allowed. Strong FWER means that inference on SNPs, within is valid, even in the presence of true associated SNPs within .

Readers familiar with neuroimaging will recognize that weak-FWER is exactly the type of guarantees provided when inferring on clusters using random-field theory. Indeed, the assumed null in cluster inference, a.k.a. topological inference, is that “there is no effect anywhere in the cluster”. In [4] we called it the “Spatial Specificity Paradox”: the larger the cluster, the harder it is to pin-point the origin of brain activation.

The neuroimaging example is also useful to demonstrate another property of weak-FWER-in-the-selected: signal outside selected clusters does not invalidate inference. For this reason, some may prefer calling these guarantees “weak-within–strong-between…”.

Regarding HMP vs. Bonferroni: We do not see how a global test like HMP can be compared to Bonferroni corrected multiple tests. Regarding HMP vs. Simes: The comparison in [1] is an unfair one. When all elementary hypotheses are tested, neither test dominates the other.

Regarding validity under dependence: Wilson argues, both in [1] and in his Wikipedia edit, that HMP is valid under arbitrary statistical dependencies between p-values. While there may be dependencies that do not invalidate the HMP false-positive rates, a simple simulation under compound-symmetry ravels that the HMP is not valid under dependence. Does theory support Wilson’s claim? Yes and no. On the one hand, the distribution of harmonic means does tend to be robust to dependence. This is because the harmonic mean is driven by the smaller entries; in the Gaussian case, where dependence=correlations, the smaller entries roughly behave like white noise. On the other hand, it was not too hard to design an example where false positive control is invalidated, as we show in [2].

All of the above does not mean the HMP is useless. Far from it. Goeman and Solari [3] show how to estimate the signal’s prevalence, with strong FWER control, by embedding a global test in a closed testing procedure. In Rosenblatt et al. [4] discussed in a previous blog post, we showed how we used this idea to estimate the amount of brain activation in selected regions. Ebrahimpoor et al. [5] use the same rationale to infer on the proportion of associated SNPs within genes. While the type of correlations in brain scans and genomes may (possibly) invalidate error guarantees, this embedding is still of value in many other applications.

We thus conclude that HMP and closed testing have great potential, but it has to be handled with more care than implied in [1].


[1] Daniel J. Wilson “The harmonic mean p-value for combining dependent tests” Proceedings of the National Academy of Sciences Jan 2019, 116 (4) 1195-1200;

[2] Jelle J. Goeman, Jonathan D. Rosenblatt, Thomas E. Nichols “The harmonic mean p-value: Strong versus weak control, and the assumption of independence” Proceedings of the National Academy of Sciences Oct 2019, 201909339;

[3] Goeman, Jelle J., and Aldo Solari. “Multiple testing for exploratory research.” Statistical Science 26.4 (2011): 584-597.

[4] Rosenblatt, Jonathan D., et al. “All-Resolutions inference for brain imaging.” Neuroimage 181 (2018): 786-796.

[5] Ebrahimpoor, Mitra, et al. “Simultaneous Enrichment Analysis of all Possible Gene-sets: Unifying Self-Contained and Competitive Methods.” Briefings in bioinformatics (2019).

Written on October 28, 2019