<p>Jonathan D. Rosenblatt. Stats, R, and possibly beach volley.</p>
<h1>Why model measurements with complex numbers?</h1>
<p><em>2021-04-01</em></p>
<p>I recently developed a mild obsession with understanding complex numbers.
It all started while working with <a href="https://in.bgu.ac.il/en/engn/ece/pages/staff/RoutenbergTirza.aspx">Tirza Routenberg</a> and Neta Zimerman on the analysis of seismic array data.
The array processing community will usually model measurements as complex numbers.
This approach is so natural in the signal processing community that the canonical reference for array processing, [vanTrees2002], never even stops to explain why.
Being the statistician that I am, it puzzled me: the measurements are the instantaneous compression of soil; why would you want to represent that with a complex number?</p>
<p>The following is my current understanding of the reasons that the signal processing and array processing communities model measurements as complex numbers.
The tl;dr version is that:
(a) The merit of complex numbers is due to their representation via complex exponentials: shifting a complex-valued sine wave in time is merely multiplying its complex exponential representation by some (complex) constant.
(b) Any real-valued function/signal may be mapped to its baseband representation, a.k.a. its complex envelope, without loss of information.
(c) These mappings/representations are useful since follow-up processing will typically include linear systems (convolutions), deconvolutions, Fourier transforms, etc., which are easier, both computationally and analytically, when operating with complex exponentials.</p>
<p>Now with details.</p>
<h1 id="useful-for-computations">Useful for Computations</h1>
<p>The usual argument one receives when asking a physicist or an electrical engineer “why complex” is:<br />
(a) Some measurements are complex.<br />
(b) Super useful for handling waves.<br />
(c) Super useful for linear systems.<br />
Let’s parse them one by one.</p>
<h2 id="some-measurements-are-complex">Some measurements are complex</h2>
<p>Some measurements, by their nature, respect the arithmetic of complex numbers.
This is the case when measuring current and voltage.
This is not the case in acoustics/seismology, where measurements represent the compression of air/soil.</p>
<h2 id="super-useful-for-waves">Super useful for waves</h2>
<p>This is an important point which may be initially unclear to someone who, like myself, never really understood the difference between a wave and any other function of time.
A <a href="https://en.wikipedia.org/wiki/Wave">wave</a> is a function of time and space, but it is not an arbitrary function.
It represents a disturbance that propagates in time and space, so adjacent values are interconnected.
It is not only smooth, but also has to satisfy the <a href="https://en.wikipedia.org/wiki/Wave_equation">wave equation</a>.
Without going into the details of partial differential equations, I will just say that a sine wave satisfies the wave equation, and that any solution, i.e., any wave, can be represented as a superposition of scaled and shifted sine waves.
Sine waves and their shifts are best represented with complex numbers, as I will soon demonstrate.
For a full explanation I recommend [Smith2002].</p>
<h2 id="super-useful-for-convolutions">Super useful for convolutions</h2>
<p>Say you are analyzing the effect of a <a href="https://en.wikipedia.org/wiki/Linear_time-invariant_system">linear time invariant system</a> (LTI), a.k.a. a convolution.
It is a well known fact that <strong>Fourier diagonalizes the convolution</strong>.
Put differently, the spectrum of the output of an LTI is the spectrum of its input, multiplied point-wise by some other function.
We will show this later in this post.</p>
<h1 id="sensor-arrays">Sensor Arrays</h1>
<p>As the name suggests, the field of <a href="https://en.wikipedia.org/wiki/Sensor_array">sensor arrays</a> deals with measurements from, well, sensor arrays.
It turns out that when analyzing data from an array of sensors, complex numbers soon arise.
Why is this?
Consider the <strong>real-valued</strong> measurement, \(f_k(t)\) of sensor \(k\) at time \(t\).
Sensors \(k\) and \(k'\) measure the <strong>same function</strong> at different locations.
Because this is the same function, their measurements differ only by a temporal lag: \(f_{k'}(t)=f_k(t-\tau_{k'})\).
Now enter a crucial fact about sine waveforms.
Say \(f(t)\) is a sine wave in the complex plane: \(f(t)=\cos(t) + i \sin(t)\) where \(i:=\sqrt{-1}\).
In <a href="https://en.wikipedia.org/wiki/Euler%27s_formula">complex exponential</a> notation this is \(f(t)=e^{i t}\).
Now, presenting the time shift in complex exponential notation: \(f(t-\tau)=e^{i (t-\tau)}=e^{it} e^{-i\tau}=f(t)e^{-i\tau}\).
This is why we say that <strong>a shift in time is a multiplication in frequency</strong>.<br />
For some intuition, imagine that \(f(t)\) is the helix around a screw.
To shift time, i.e., to evaluate \(f(t)\) at \(t-\tau\), one can either look at position \(t-\tau\), or keep looking at position \(t\) but advance the screw a distance of \(\tau\).
The “advancing of the screw” is the effect of the complex multiplication.</p>
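<p>The shift-is-multiplication fact is easy to check numerically. Here is a minimal sketch (Python/NumPy; the grid and the shift \(\tau\) are arbitrary illustrative choices) verifying that \(f(t-\tau)=f(t)e^{-i\tau}\) for \(f(t)=e^{it}\):</p>

```python
import numpy as np

# A complex sine wave f(t) = exp(i t), sampled on a grid.
t = np.linspace(0, 10, 1001)
f = np.exp(1j * t)

tau = 0.7  # an arbitrary time shift

# Shifting in time...
shifted = np.exp(1j * (t - tau))

# ...equals multiplying by the complex constant exp(-i tau).
multiplied = f * np.exp(-1j * tau)

assert np.allclose(shifted, multiplied)
```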
<p>Since our sensor array measures time-shifted versions of the same signal, \(\{f(t-\tau_k)\}_k\), it would be nice if \(f(t)\) could be decomposed as a linear combination of complex exponentials.
But this is what the real-to-complex Fourier transform does!
Now, since the first thing that will be done to array measurements is to Fourier transform them, why not start directly there?
This is known, at least in [vanTrees2002], as the <strong>frequency domain snapshot model</strong>.</p>
<h1 id="signal-processing">Signal Processing</h1>
<p>The previous discussion implies that for array signal processing, and given some assumptions that we skipped, one should adopt the frequency domain snapshot model.
But if you ever practiced signal processing, you may know that the <strong>time domain snapshot model</strong> is no less popular, and not only for arrays; it, too, will often use complex numbers.
So again, why model measurements as complex numbers? In particular, when no arrays are involved?</p>
<p>One reason has to do with the fact that the term <strong>signal processing</strong> includes both <a href="https://en.wikipedia.org/wiki/Data_transmission">digital communication</a>, and data analysis.</p>
<p>In digital communication, it is quite common that one needs to transmit a message over an analogue channel (e.g. radio).
Because the channel is analogue, it essentially transmits waves.
The message to be transmitted has to be encoded by shifting and scaling wave functions.
This practice is known as <a href="https://en.wikipedia.org/wiki/Modulation">modulation</a>, and since it involves shifting and scaling wave functions, it is, again, best done using complex exponentials.
Is this also the case for data analysis?</p>
<p>When signal processing for data analysis, there is no transmission, only reception.
One may argue that in data analysis we are “decrypting nature’s messages”, but this romantic view has its limitations: we do not know the encoding mechanism used by nature, and the task is not decoding.</p>
<p>So why model measurements as complex numbers?
My answer to this is the <strong>complex envelope</strong>.
In my view, the matter is best described in [Schreier2010], and the argument is essentially that there is nothing to lose.
The complex envelope is also known under the more informative name of <strong>equivalent baseband signal</strong>.
It is essentially a representation of the real-valued signal using a minimal spectrum.
Minimal in the sense that negative frequencies are canceled, and the remaining ones are shifted down to baseband, around the origin.
The price to pay for this “spectral compactness”, is that the signal is no longer real valued.
One can always convert from the complex envelope to the real-valued signal, and vice-versa.</p>
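<p>Here is a sketch of that round trip (Python/NumPy; the sampling rate, carrier frequency, and amplitude function below are arbitrary illustrative choices, and the envelope is obtained via an FFT-based analytic signal):</p>

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: zero out negative frequencies, double the positive ones."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1
    h[1:(n + 1) // 2] = 2
    if n % 2 == 0:
        h[n // 2] = 1
    return np.fft.ifft(X * h)

fs, fc = 1000.0, 100.0                     # sampling and carrier frequency (illustrative)
t = np.arange(0, 1, 1 / fs)
a = 1 + 0.5 * np.cos(2 * np.pi * 2 * t)    # slowly varying amplitude
x = a * np.cos(2 * np.pi * fc * t)         # real-valued passband signal

# Complex envelope: demodulate the analytic signal down to baseband...
z = analytic_signal(x) * np.exp(-2j * np.pi * fc * t)

# ...and back: re-modulate and take the real part.
x_back = np.real(z * np.exp(2j * np.pi * fc * t))

assert np.allclose(x, x_back)              # nothing was lost
```

The reconstruction is exact because the real part of the analytic signal is the original signal itself.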
<h2 id="the-baseband-noise">The baseband noise</h2>
<p>I may have convinced you, and myself, that the complex envelope loses nothing, and may facilitate further processing which is easier with complex numbers.
This is true, but an important detail to mind when adopting this <strong>time domain complex envelope snapshot model</strong>, is the noise.
In the real-valued time domain, we usually model noise as a white Gaussian process.
But what is the complex envelope of a Gaussian white noise process?
Is it Gaussian?
Is it white?
The answer may be found in [vanTrees2002] circa Eq(5.79), or more rigorously in [Viswanathan2006].
The answer is approximately affirmative, meaning that one may use a white (proper) Gaussian process as the complex envelope of the real-valued noise process.</p>
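<p>Here is a small simulation sketch (Python/NumPy; my own illustration, not taken from the references) of the “approximately affirmative” part: the complex representation of real white Gaussian noise is empirically <em>proper</em>, i.e., its pseudo-variance is negligible relative to its variance:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.standard_normal(n)   # real-valued white Gaussian noise

# Analytic signal via FFT: zero negative frequencies, double positive ones.
X = np.fft.fft(x)
h = np.zeros(n)
h[0] = 1
h[1:n // 2] = 2
h[n // 2] = 1                # n is even here
z = np.fft.ifft(X * h)

var = np.mean(np.abs(z) ** 2)      # ordinary variance, E|z|^2
pseudo = np.abs(np.mean(z ** 2))   # pseudo-variance, |E z^2|; zero for proper noise

assert pseudo / var < 0.05         # propriety holds to good approximation
```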
<h2 id="complex-sinusoids-diagonalize-the-convolution">Complex Sinusoids Diagonalize the Convolution</h2>
<p>Complex sinusoids diagonalize the convolution.
Put differently: complex sinusoids are eigenfunctions of any linear time invariant system (LTI).
We said this earlier informally, but now we can be more formal about it.</p>
<p>Consider the operation of an LTI, \(\mathcal{H}\) on a complex sinusoid, \(f(t)=e^{i \omega t}\).
Denote \(g:=\mathcal{H}\{f\}\), \(f^*(t)=f(t-\tau)\), \(g^*(t)=g(t-\tau)\).
By definition
\(g^*=\mathcal{H}\{f^*\}=\mathcal{H}\{f(t-\tau)\}=\mathcal{H}\{e^{-i\omega\tau}f(t)\}\).
By linearity of \(\mathcal{H}\): \(\mathcal{H}\{e^{-i\omega\tau}f(t)\}=e^{-i\omega\tau}\mathcal{H}\{f(t)\}=e^{-i\omega\tau}g\).
Now, since \(g^*=g(t-\tau)\), we have \(g(t-\tau)=e^{-i\omega\tau}g(t)\) for all \(\tau\), which forces \(g(t)=g(0)e^{i\omega t}\): a complex sinusoid of the same frequency \(\omega\).</p>
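<p>A numerical illustration of the eigenfunction property (Python/NumPy; the filter is an arbitrary choice, and the LTI is implemented as a circular convolution so that the statement is exact on the discrete grid):</p>

```python
import numpy as np

n = 256
k = 5                                  # integer frequency, so the sinusoid is periodic on the grid
t = np.arange(n)
f = np.exp(2j * np.pi * k * t / n)     # complex sinusoid

rng = np.random.default_rng(0)
h = np.zeros(n)
h[:8] = rng.standard_normal(8)         # an arbitrary (real) filter, zero-padded

# Circular convolution = an LTI system on the periodic grid.
g = np.fft.ifft(np.fft.fft(h) * np.fft.fft(f))

# g is f times a complex constant: the filter's frequency response at k.
lam = np.fft.fft(h)[k]
assert np.allclose(g, lam * f)
```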
<h1 id="conclusions">Conclusions</h1>
<ol>
<li>
<p>For array processing, where time-shifting is key, the frequency domain snapshot model is a natural approach.</p>
</li>
<li>
<p>For the general analysis of real-valued signals, complex modeling may be less obvious, but it brings benefits.
Thinking of the complex envelope of a real-valued signal is harmless, provided you use the right noise model.</p>
</li>
</ol>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>I am thankful for the fruitful conversations on the matter with <a href="http://www.ee.bgu.ac.il/~tirzar/">Tirza Routenberg</a>, <a href="https://ece.illinois.edu/about/directory/faculty/jontalle">Jont Allen</a>, <a href="https://roy.lederman.name/">Roy Lederman</a>, and <a href="https://profiles.ucsd.edu/armin.schwartzman">Armin Schwartzman</a>.</p>
<h1 id="references">References</h1>
<p>[Smith2002] Steven Smith, Digital Signal Processing: A Practical Guide for Engineers and Scientists, 1st edition (Amsterdam; Boston: Newnes, 2002).</p>
<p>[Schreier2010] Peter J. Schreier and Louis L. Scharf, Statistical Signal Processing of Complex-Valued Data: The Theory of Improper and Noncircular Signals (Cambridge: Cambridge University Press, 2010).</p>
<p>[vanTrees2002] Harry L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory, 1st edition (New York: Wiley-Interscience, 2002).</p>
<p>[Viswanathan2006] R. Viswanathan, “On the Autocorrelation of Complex Envelope of White Noise,” IEEE Transactions on Information Theory 52, no. 9 (September 2006): 4298–99.</p>
<h1>Domain Adaptation for Environmental Monitoring</h1>
<p><em>2020-08-16</em></p>
<p>Environmental monitoring from satellite imagery essentially means that instead of directly measuring pollution (for instance), you predict it from satellite imagery.
When an epidemiologist controls for ambient temperature, you can be pretty sure that such an indirect measurement of temperature is involved.</p>
<p>Predicting pollution is typically addressed as a supervised learning problem: use the pollution measured in ground stations as labels, and predict it wherever ground stations are not available.
But what if some pollution monitoring stations are far away from the residences of the subjects involved in the epidemiological study? Would we not want to down-weight those stations in the learning?</p>
<p>The idea of weighted learning is not a new one.
In the Machine Learning literature, it has recently been popularized in the context of “Domain Adaptation”, and in particular “Covariate Shift”, where the distribution of the covariates in the train set differs from that of the test set.
This is exactly the case in environmental monitoring.</p>
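<p>To make the idea of weighted learning concrete, here is a hedged sketch (Python/NumPy, on synthetic data; not the data or the estimators of [1]) of importance-weighted least squares under covariate shift. The density ratio is known by construction here; in practice it must be estimated:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):          # the true (nonlinear) relationship; our linear model is misspecified
    return np.sin(x)

# Covariate shift: train covariates concentrate near 0, test covariates near 2.
x_tr = rng.normal(0.0, 1.0, 2000)
y_tr = f(x_tr) + 0.1 * rng.standard_normal(2000)
x_te = rng.normal(2.0, 1.0, 5000)

def gauss(x, mu):  # standard-deviation-1 Gaussian density
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

# Importance weights: test density over train density.
w = gauss(x_tr, 2.0) / gauss(x_tr, 0.0)

X = np.column_stack([np.ones_like(x_tr), x_tr])
beta_plain = np.linalg.lstsq(X, y_tr, rcond=None)[0]
beta_w = np.linalg.lstsq(X * np.sqrt(w)[:, None], y_tr * np.sqrt(w), rcond=None)[0]

Xte = np.column_stack([np.ones_like(x_te), x_te])
mse_plain = np.mean((Xte @ beta_plain - f(x_te)) ** 2)
mse_weighted = np.mean((Xte @ beta_w - f(x_te)) ** 2)
# The weighted fit should predict better where the test data actually lives.
```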
<p>In our latest contribution [1], we call upon recent ideas from the domain adaptation literature to estimate the quality of predicted temperatures in France.
We show that naive performance estimates are biased, even if cross-validated.
We then plug our performance estimators into the Empirical Risk Minimization framework, in order to learn better predictors.
En passant, we discuss the matter of h-blocking and other data-splitting schemes designed for unbiased performance estimation (briefly: don’t).</p>
<p>[1] R. Sarafian, I. Kloog, E. Sarafian, I. Hough and J. D. Rosenblatt, “A Domain Adaptation Approach for Performance Estimation of Spatial Predictions,” IEEE Transactions on Geoscience and Remote Sensing, doi: 10.1109/TGRS.2020.3012575.</p>
<h1>MultiNav: Navigating Multivariate Data</h1>
<p><em>2020-04-28</em></p>
<p>In statistical process control (SPC), a.k.a. anomaly detection, one compares some statistic to its “in control” distribution.
Many statistical and BI software suites (e.g. Tableau, PowerBI) can do SPC.
Almost all of these, however, focus on univariate processes, which are easier to visualize and discuss than multivariate processes.
The purpose of our <em>Multinav</em> package is to ingest multivariate data streams, compute multivariate summaries, and allow us to query the data interactively when an anomaly is detected.</p>
<p>In the next example we look into an agricultural IoT use-case, and try to detect sensors with anomalous readings.
The full details are available in <a href="https://efratvil.github.io/MultiNav/Documentation/">the official documentation</a>.</p>
<p>The data consists of hourly measurements, over 5 days, of the girth of 100 avocado trees.
For each such tree, an anomaly score is computed using four variations of Hotelling’s \(T^2\) statistic.
MultiNav thus returns the following dashboard:</p>
<iframe src="https://efratvil.github.io/demos/MultiNav/MultiNav_simple_demo.html" width="800" height="600"></iframe>
<p>Crucially for our purposes, this dashboard is interactive.
One may hover over a particular score, in order to inspect the raw evolution of the measurement in time, and compare it to the group via a functional boxplot.</p>
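<p>For concreteness, here is a sketch (Python/NumPy) of the plain-vanilla Hotelling \(T^2\) anomaly score; MultiNav’s four variations differ in their details, so this is only illustrative:</p>

```python
import numpy as np

def hotelling_t2(X):
    """Hotelling's T2 score of each row, relative to the sample mean and covariance."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    Sinv = np.linalg.inv(S)
    D = X - mu
    # Row-wise quadratic form D_i' Sinv D_i.
    return np.einsum('ij,jk,ik->i', D, Sinv, D)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))   # 100 units, 4 features each
X[0] += 5                           # plant one anomalous unit

scores = hotelling_t2(X)
assert scores.argmax() == 0         # the planted anomaly gets the largest score
```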
<p>There is a lot more that MultiNav can do.
The details can be found in the inline-help, and <a href="https://efratvil.github.io/MultiNav/Documentation/">the official documentation</a>.
For the purposes of this blog post, we merely summarize the reasons we find MultiNav very useful:</p>
<ol>
<li>Interactive visualizations that link anomaly score with raw data.</li>
<li>Functional boxplot for querying the source of the anomalies.</li>
<li>Out-of-the-box multivariate scoring functions, with the possibility of extension by the user.</li>
<li>The interactive components are embeddable in Shiny apps, for a full-blown interactive dashboard.</li>
</ol>
<h1>On the Harmonic-Mean of p-values</h1>
<p><em>2019-10-28</em></p>
<p>The <em>Harmonic-Mean p-value</em> (HMP), as the name suggests, is the harmonic mean of p-values.
It can replace other aggregations, such as <a href="https://en.wikipedia.org/wiki/Fisher%27s_method">Fisher’s combination</a>
and be used as a test-statistic for <em>signal detection</em>, a.k.a. <em>global null testing</em>.
It is an elegant test statistic: it is easy to compute, and its null distribution can be easily derived for independent tests.
It is thus a useful tool for signal-detection/meta-analysis.</p>
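<p>In symbols, \(\text{HMP}=m/\sum_{i=1}^m 1/p_i\). A minimal sketch (Python/NumPy; note this computes only the raw statistic, which must still be compared against its null distribution to yield a test):</p>

```python
import numpy as np

def hmp(p):
    """Harmonic mean of p-values (the raw statistic, not a calibrated p-value)."""
    p = np.asarray(p, dtype=float)
    return len(p) / np.sum(1.0 / p)

p = np.array([0.001, 0.2, 0.4, 0.9])
print(hmp(p))   # ≈ 0.00397; dominated by the smallest p-value
```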
<p>Our interest in the HMP began with Daniel Wilson’s PNAS paper [1].
In there, and the accompanying <a href="https://en.wikipedia.org/w/index.php?title=Harmonic_mean_p-value&oldid=921890236">Wikipedia entry</a>, Wilson claims the following properties of the HMP:</p>
<ol>
<li>The HMP offers strong FWER control.</li>
<li>The HMP is a more powerful test than Bonferroni and Simes’.</li>
<li>The HMP is valid when p-values are statistically dependent.</li>
</ol>
<p>Let’s parse them one at a time:</p>
<p>Regarding <strong>Strong FWER control</strong>:
<em>Strong FWER control</em> means that the probability of any false positive is controlled, even in the presence of signal in some of the variables.
This is not true for the HMP because, unless embedded in a closed testing procedure, the HMP is a global test, not a multiple test.
<em>Weak FWER control</em> means that the probability of a false positive is under control, only if no variable carries signal.
Wilson, however, does show that HMP offers more than weak FWER control.
Because the HMP decreases as more hypotheses are intersected, if the HMP rejects some conjunction hypothesis, then all the hypotheses in the closure of that conjunction will also be rejected.
Wilson calls it <em>strong control</em>, but we think that <em>“weak FWER control in selected subsets”</em> may be more accurate.</p>
<p>Here is a formal definition, followed by an example.
Denote \(B\) a set of \(m\) null-hypotheses, \(m_1(B)\) the number of false-nulls (i.e., signals, effects, associations,…), in \(B\), and \(V\) the number of false rejections of some inference algorithm, such as HMP.
<em>Weak FWER</em> means that if \(m_1(B)=0\) then \(P(V(B)>0)\leq \alpha.\)
<em>Strong FWER</em> means that \(\forall m_1(B), P(V(B)>0)\leq \alpha.\)
<em>Weak FWER in the Selected</em> means that \(P(\exists R_i: m_1(R_i)=0)\leq \alpha,\) where \(R_i\) denotes a rejected hypothesis.</p>
<p>Here is an example in genetics.
Based on SNP-wise p-values, a researcher declares genes A and B as significant.
<em>Weak FWER</em> means the inference is valid only if there are no effects/associations anywhere: not in A, not in B, and not in any SNP outside of \(A \cup B\).
<em>Weak FWER in the selected</em> means the inference is valid if there are no effects/associations anywhere in A or B. Effects/associations outside \(A \cup B\) are allowed.
<em>Strong FWER</em> means that inference <strong>on SNPs</strong>, within \(A \cup B\) is valid, even in the presence of true associated SNPs within \(A \cup B\).</p>
<p>Readers familiar with neuroimaging will recognize that weak-FWER is exactly the type of guarantees provided when inferring on clusters using random-field theory.
Indeed, the assumed null in <em>cluster inference</em>, a.k.a. <em>topological inference</em>, is that “there is no effect anywhere in the cluster”.
In [4] we called it the “Spatial Specificity Paradox”: the larger the cluster, the harder it is to pin-point the origin of brain activation.</p>
<p>The neuroimaging example is also useful to demonstrate another property of weak-FWER-in-the-selected: signal outside selected clusters does not invalidate inference.
For this reason, some may prefer calling these guarantees “weak-within–strong-between…”.</p>
<p>Regarding <strong>HMP vs. Bonferroni</strong>: We do not see how a global test like HMP can be compared to Bonferroni corrected multiple tests.
Regarding HMP vs. Simes: The comparison in [1] is an unfair one. When all elementary hypotheses are tested, neither test dominates the other.</p>
<p>Regarding validity under <strong>dependence</strong>:
Wilson argues, both in [1] and in his <a href="https://en.wikipedia.org/w/index.php?title=Extensions_of_Fisher%27s_method&oldid=901235957">Wikipedia edit</a>, that HMP is valid under arbitrary statistical dependencies between p-values.
While there may be dependencies that do not invalidate the HMP’s false-positive rates, a simple simulation under compound symmetry reveals that the HMP is not valid under dependence.
Does theory support Wilson’s claim?
Yes and no.
On the one hand, the distribution of harmonic means does tend to be robust to dependence.
This is because the harmonic mean is driven by the smallest entries; in the Gaussian case, where dependence means correlation, the smallest entries roughly behave like white noise.
On the other hand, it was not too hard to design an example where false positive control is invalidated, as we show in [2].</p>
<p>All of the above does not mean the HMP is useless.
Far from it.
Goeman and Solari [3] show how to estimate the signal’s prevalence, with strong FWER control, by embedding a global test in a closed testing procedure.
In Rosenblatt et al. [4], discussed in a <a href="http://www.john-ros.com/cherry-brain/">previous blog post</a>, we used this idea to estimate the amount of brain activation in selected regions.
Ebrahimpoor et al. [5] use the same rationale to infer on the proportion of associated SNPs within genes.
While the type of correlations in brain scans and genomes may (possibly) invalidate error guarantees, this embedding is still of value in many other applications.</p>
<p>We thus conclude that the HMP and closed testing have great potential, but they have to be handled with more care than implied in [1].</p>
<hr />
<p>[1] Daniel J. Wilson
“The harmonic mean p-value for combining dependent tests”
Proceedings of the National Academy of Sciences Jan 2019, 116 (4) 1195-1200;</p>
<p>[2] Jelle J. Goeman, Jonathan D. Rosenblatt, Thomas E. Nichols
“The harmonic mean p-value: Strong versus weak control, and the assumption of independence”
Proceedings of the National Academy of Sciences Oct 2019, 201909339;</p>
<p>[3] Goeman, Jelle J., and Aldo Solari. “Multiple testing for exploratory research.” Statistical Science 26.4 (2011): 584-597.</p>
<p>[4] Rosenblatt, Jonathan D., et al. “All-Resolutions inference for brain imaging.” Neuroimage 181 (2018): 786-796.</p>
<p>[5] Ebrahimpoor, Mitra, et al. “Simultaneous Enrichment Analysis of all Possible Gene-sets: Unifying Self-Contained and Competitive Methods.” Briefings in Bioinformatics (2019).</p>
<h1>Better-Than-Chance Classification for Signal Detection</h1>
<p><em>2019-10-15</em></p>
<p>In 2012 my friend <a href="https://scholar.google.co.il/citations?user=6TFj2D8AAAAJ&hl=en">Roee Gilron</a> told me about a popular workflow for detecting activation in the brain: fit a classifier, then use a permutation test to check if its cross-validated accuracy is better than chance level.
“That can’t be right” I said. “So much power is left on the table!”
“Can you show it to me?” Roee replied.
Well, I did. And 7 years later, our results have been published by <a href="https://academic.oup.com/biostatistics/advance-article/doi/10.1093/biostatistics/kxz035/5587128?searchresult=1">Biostatistics</a>.</p>
<p>Roee’s question led to a mass of simulations, which led to new questions, which led to new simulations.
This question also attracted the interest of my other colleagues: <a href="https://en-social-sciences.tau.ac.il/profile/rmukamel">Roy Mukamel</a>, <a href="https://www.universiteitleiden.nl/en/staffmembers/jelle-goeman">Jelle Goeman</a>, and <a href="https://en.stat.huji.ac.il/people/yuval-binyamini">Yuval Benjamini</a>.</p>
<p>The core of the work is the comparison of power, of two main approaches:
(1) Detecting signal using a supervised-classifier, as described above.
(2) Detecting signal using multivariate hypothesis testing, such as Hotelling’s \(T^2\) test.
We call the former an <em>accuracy test</em>, and the latter a <em>two-group test</em>.
We studied the <em>high-dimension-small-sample setup</em>, where the dimension of each measurement is comparable to the number of measurements.
This setup is consistent with applications in brain-imaging and genetics.</p>
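<p>To fix ideas, here is a sketch (Python/NumPy, on synthetic data) of the two statistics computed on one dataset. The nearest-centroid classifier and the leave-one-out scheme are illustrative choices of mine, not the exact procedures studied in the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 5
x = rng.standard_normal((n, d))
y = rng.standard_normal((n, d)) + 0.8     # mean shift between the two groups

# (2) Two-group approach: Hotelling's T2 statistic (no cross-validation needed).
def hotelling_t2(x, y):
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * np.cov(x, rowvar=False) +
              (ny - 1) * np.cov(y, rowvar=False)) / (nx + ny - 2)
    diff = x.mean(axis=0) - y.mean(axis=0)
    return (nx * ny / (nx + ny)) * diff @ np.linalg.solve(pooled, diff)

# (1) Accuracy approach: leave-one-out accuracy of a nearest-centroid classifier.
def loo_accuracy(x, y):
    z = np.vstack([x, y])
    lab = np.r_[np.zeros(len(x)), np.ones(len(y))]
    hits = 0
    for i in range(len(z)):
        mask = np.arange(len(z)) != i
        m0 = z[mask & (lab == 0)].mean(axis=0)
        m1 = z[mask & (lab == 1)].mean(axis=0)
        pred = 0 if np.sum((z[i] - m0) ** 2) < np.sum((z[i] - m1) ** 2) else 1
        hits += (pred == lab[i])
    return hits / len(z)

t2 = hotelling_t2(x, y)    # compare against its (known) null distribution
acc = loo_accuracy(x, y)   # compare against chance via a permutation test
```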
<p>Here is a VERY short summary of our conclusions.</p>
<ul>
<li>Accuracy tests are underpowered compared to two-group tests.</li>
<li>In high-dimension, covariance regularization is crucial. The statistical literature has many two-group tests designed for high-dimension.</li>
<li>The optimal regularization for testing differs from the optimal regularization for prediction.</li>
<li>The interplay between the direction of the signal and the principal components of the noise has a considerable effect on power.</li>
<li>Two-group tests do not require cross-validation. They are thus considerably faster to compute.</li>
<li>If insisting on accuracy-tests instead of two-group tests, then resampling with-replacement has more power than without-replacement. In particular, the <em>leave-one-out Bootstrap</em> is better than <em>cross-validation</em>.</li>
</ul>
<p>The intuition for our main findings is the following:</p>
<ol>
<li>Estimating accuracies adds a discretization stage, which reduces power and is needless for testing.</li>
<li>In high-dim, there is barely enough data to estimate the covariances in the original space, let alone in augmented feature spaces. Kernel tricks and deep nets may work fine in low-dim, but are hopeless in high-dim.</li>
</ol>
<p>Given these findings, the tremendous popularity of accuracy tests is quite puzzling.
We dare conjecture that it is partially due to the growing popularity of machine-learning, and the reversal of the inference cascade:
Researchers fit a classifier, and only then check whether there is any difference between the populations.
Were researchers to start by testing for any difference between populations, and only then fit a classifier, a two-group test would be the natural starting point.</p>
<p>The full details can be found in [1].</p>
<p>[1] Jonathan D Rosenblatt, Yuval Benjamini, Roee Gilron, Roy Mukamel, Jelle J Goeman, Better-than-chance classification for signal detection, Biostatistics,
https://doi.org/10.1093/biostatistics/kxz035</p>
<h1>Web-Technologies for Interactive Multivariate Monitoring System</h1>
<p><em>2019-05-15</em></p>
<p>I am glad to announce our latest contribution, just published in Computers and Electronics in Agriculture [1].
The described system will soon be available as an R package.</p>
<p>Think of rain as Irrigation 1.0; sprinklers as Irrigation 2.0; <a href="https://en.wikipedia.org/wiki/Drip_irrigation">drippers</a> as 3.0.
In Irrigation 4.0, farmers irrigate upon demand.
To assess a plant’s “demand” for water, several technologies are used.
All of these require assurance of the quality of the data that enters the irrigation algorithms.
The scale of modern fields is such that data-quality needs to be assured automatically, not manually.</p>
<p>Any modern age BI system will allow data screening using If-Then rules.
These rules may filter “technical” anomalies, but will be unable to capture “statistical” anomalies, such as a sick plant or an over-reactive sensor.</p>
<p>To help agriculturists assure data quality for demand-based irrigation, we teamed up with <a href="https://www.phytech.com/">Phytech</a>, a manufacturer of <a href="https://en.wikipedia.org/wiki/Dendrometry">dendrometers</a>, to develop a system for the detection of anomalous sensors.</p>
<p><img src="../images/dendrometers.jpg" alt="Dendromers (top) with an example of their raw readings of the plant's girth (bottom)." /></p>
<p>The building blocks of our system:</p>
<ol>
<li>Measure the plants’ health via a network of dendrometers.</li>
<li>Faulty sensors are detected using various anomaly detection algorithms, borrowing ideas from robust multivariate statistics and social-network analysis.</li>
<li>Once an anomaly has been detected, the sensor can be interrogated in any browser, using web technologies; <a href="https://d3js.org/">D3.js</a> in particular.</li>
</ol>
<p><img src="../images/dendrometers-dashboard.jpg" alt="A view of various anomaly scoring algorithms (bottom), with an interactive functional boxplot to view the raw readings of a selected sensor (top)." /></p>
<hr />
<p>[1] Vilenski, Efrat, Peter Bak, and Jonathan D. Rosenblatt. “Multivariate anomaly detection for ensuring data quality of dendrometer sensor networks.” Computers and Electronics in Agriculture 162 (2019): 412-421.</p>
<h1>Gaussian Markov Random Fields versus Linear Mixed Models for Spatial-Data</h1>
<p><em>2019-03-04</em></p>
<p>It was <a href="http://in.bgu.ac.il/en/humsos/geog/Pages/staff/kloog.aspx">Prof. Itai Kloog</a> who introduced us to the problem of pollution assessment.
In short: pollution is assessed from satellite imagery, calibrated from ground monitoring stations.
In statistical parlance: we predict pollution using image quality (<a href="https://earthobservatory.nasa.gov/global-maps/MODAL2_M_AER_OD">AOD</a>) as a predictor.</p>
<p>How to fit a predictive model?
In particular, given the spatial correlation in the model’s errors.</p>
<p>First approach: <strong>A Gaussian Random Field</strong>.
In particular, with an assumed <a href="https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function">Matern Covariance Function</a>.
This is a standard approach in spatial statistics.
Predictions that account for covariance in errors are known by statisticians as <a href="https://en.wikipedia.org/wiki/Best_linear_unbiased_prediction">BLUPs</a>, by applied mathematicians as Linear Projection Operators, by Geostatisticians as <a href="https://en.wikipedia.org/wiki/Kriging">Kriging</a>, and by Numerical Analysts as <a href="https://en.wikipedia.org/wiki/Radial_basis_function">Radial Basis Functions Interpolator</a>.
A desirable property of BLUPs is that predictions vary smoothly in space, as one would expect.
A downside of this technique is that the (inverse) covariance matrix required for predictions has dimension equal to the number of ground monitoring stations.
When calibrating the predictor at a continental scale, with thousands of stations, fitting such a predictor may take days.</p>
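To make the computational concern concrete, here is a minimal simple-kriging (BLUP) sketch in Python, with an assumed Matérn(ν=3/2) covariance and made-up station locations; the dense linear solve against the n-by-n covariance is the step that scales poorly with the number of stations.

```python
import numpy as np

def matern32(d, sigma2=1.0, rho=1.0):
    """Matern covariance with smoothness nu = 3/2 (illustrative parameters)."""
    a = np.sqrt(3.0) * d / rho
    return sigma2 * (1.0 + a) * np.exp(-a)

rng = np.random.default_rng(1)
n = 200                                      # number of "monitoring stations"
sites = rng.uniform(0, 10, size=(n, 2))      # made-up station locations
d = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
C = matern32(d) + 1e-6 * np.eye(n)           # dense n x n covariance + jitter

# toy observations drawn from the field itself
y = np.linalg.cholesky(C) @ rng.standard_normal(n)

# simple-kriging / BLUP prediction at a new location s0
s0 = np.array([5.0, 5.0])
c0 = matern32(np.linalg.norm(sites - s0, axis=1))
y_hat = c0 @ np.linalg.solve(C, y)           # O(n^3): the computational bottleneck
```

With thousands of stations, the `solve` against a dense covariance is exactly the days-long fitting step mentioned above.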
<p>Second approach: a <strong>Linear Mixed Model</strong> (LMM).
This is the currently dominant approach in pollution prediction.
The idea is to fit region-wise random effects, thus capturing spatial correlations in prediction errors.
The upside: the implied error correlation is sparse.
This allows the use of algorithms tailored for sparse matrices [1], making fitting and prediction much faster.
The downside: prediction surfaces have a “slab” structure, instead of varying smoothly in space.
This is not an “aesthetics” argument, but rather, a very practical one:
Slab surfaces allow very different predictions for spatially adjacent stations. This is an undesirable property of the modeling approach.</p>
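A minimal sketch of the covariance implied by region-wise random effects (region assignments and variances are made up for illustration): the error covariance is block-diagonal, so it is exactly zero across regions, even for spatially adjacent stations, which is what produces the “slab” prediction surfaces.

```python
import numpy as np

regions = np.repeat(np.arange(4), 5)           # 4 regions, 5 stations each
n = len(regions)
sigma2_b, sigma2_e = 1.0, 0.2                  # random-effect and noise variances

# design matrix of the region random effects
Z = (regions[:, None] == np.arange(4)[None, :]).astype(float)

# implied error covariance: sigma2_b within a region, 0 across regions
V = sigma2_b * Z @ Z.T + sigma2_e * np.eye(n)  # block-diagonal, hence sparse
```

Stations 4 and 5 may be neighbors on the map, yet `V[4, 5] == 0` because they fall in different regions; their predictions can therefore differ sharply.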
<p><img src="../images/sp_re.jpg" alt="The "slab" prediction surface of the LMM (bottom) vs. the smooth prediction surface of the GMRF (top)" /></p>
<p>Can the benefits of smooth predictions be married with the fast computations with sparse matrices?
The answer is affirmative, via <a href="https://en.wikipedia.org/wiki/Markov_random_field#Gaussian">Gaussian Markov Random Fields</a>.
The fundamental observation is that Gaussian Markov fields have sparse precision matrices, so they are easy to compute with.
The Integrated Nested Laplace Approximations (INLA) of [2] does just that: it approximates a Matern Gaussian field, with a Markov Gaussian field.
This framework is accompanied with an excellent R implementation: <a href="http://www.r-inla.org/">R-INLA</a>.</p>
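A toy illustration of why sparse precision matrices help, using the simplest GMRF, a stationary AR(1) process (illustrative parameters, not the INLA construction): its precision matrix is tridiagonal, so linear solves are fast, even though the implied covariance is dense.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

n, phi = 1000, 0.9
# precision matrix of a stationary AR(1) with unit innovation variance:
# tridiagonal, hence ~3n nonzeros out of n^2 entries
main = np.full(n, 1 + phi**2)
main[0] = main[-1] = 1.0
Q = sparse.diags([np.full(n - 1, -phi), main, np.full(n - 1, -phi)],
                 offsets=[-1, 0, 1], format="csc")

b = np.ones(n)
x = spsolve(Q, b)            # banded solve: O(n), vs O(n^3) for a dense covariance

nnz_ratio = Q.nnz / (n * n)  # fraction of nonzeros in the precision matrix
```

The covariance \(Q^{-1}\) is dense (every pair of points is correlated), yet all computations can be routed through the sparse \(Q\).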
<p>Using the INLA approximation, we were able to fit a (Markov) Gaussian random field to our data, and return smooth pollution prediction surfaces within the hour.
The covariance implied by the INLA approximation of the Matern field is presented in the following figure.
Notice the sparsity, which facilitates computations.</p>
<p><img src="../images/prec.jpg" alt="Error covariance implied by LMM vs. GMRF" /></p>
<p>When verifying the statistical errors in our prediction surfaces, we find they are not only computationally feasible, but also more accurate than the dominant LMM approach.
The cross-validated accuracy, as a function of extrapolation distance, is given in the following figure: <img src="../images/h_res.jpg" alt="Test error as function of extrapolation distance (h). GMRF dominates LMM for all distances." /></p>
<p>The full details can be found in the paper [3].</p>
<hr />
<p>[1] Davis, Timothy A. Direct methods for sparse linear systems. Vol. 2. Siam, 2006.</p>
<p>[2] Rue, Havard, Sara Martino, and Nicolas Chopin. “Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations.” Journal of the royal statistical society: Series b (statistical methodology) 71.2 (2009): 319-392.</p>
<p>[3] Sarafian, Ron, et al. “Gaussian Markov Random Fields versus Linear Mixed Models for satellite-based PM2.5 assessment: Evidence from the Northeastern USA.” Atmospheric Environment (2019).</p>The Spatial Specificity Paradox in brain imaging, remedied with valid, infinitely-circular, inference2018-07-27T00:00:00+00:002018-07-27T00:00:00+00:00http://johnros.github.io/cherry-brain<p>The most prevalent mode of inference in brain imaging is inference on supra-threshold clusters, with random field theory providing error guarantees.
This introduces a <em>spatial specificity paradox</em>. The larger the detected cluster, the less we know about the exact location of the activation. This is because the null hypothesis tested is “no activation in the whole cluster”, so the alternative is “at least one voxel in the cluster is active”. This observation is not new, but merely an implication of the random field assumption.</p>
<p>To deal with this paradox, we propose to estimate the amount of active voxels in the cluster. If this proportion is large, then there is no real paradox. If this proportion is small, we may want to “drill down” to sub-clusters. This introduces a circularity problem, a.k.a. selective inference: we are not making inference on arbitrary voxels, but rather, voxels that belong to statistically significant clusters.</p>
<p>In the spirit of the FDR, we call the proportion of active voxels in a cluster the <em>True Discovery Proportion</em> (TDP). We use recent results from the multiple testing literature, namely the <em>All-Resolutions Inference</em> (ARI) framework of Goeman and others [1,2], to estimate the TDP in selected clusters.</p>
<p>The ARI framework has many benefits for our purpose:</p>
<ol>
<li>It takes voxel-wise p-values and returns TDP lower bounds.</li>
<li>The algorithm is very fast, as implemented in the <a href="https://cran.r-project.org/package=hommel">hommel</a> R package.</li>
<li>For brain imaging, we wrote a wrapper package, <a href="https://cran.r-project.org/package=ARIbrain">ARIbrain</a>.</li>
<li>The TDP bounds using ARI come with statistical error guarantees. Namely, with probability \(1-\alpha\) (over repeated experiments), no cluster will have an over estimated TDP.</li>
<li>The above guarantee applies no matter how clusters have been selected.</li>
</ol>
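The flavor of the TDP bound can be sketched with a simplified, extra-conservative variant that replaces Hommel’s \(h(\alpha)\) (as computed by the hommel package) with the total number of hypotheses \(m\); the p-values and cluster below are made up for illustration, and this is a sketch in the spirit of ARI, not the package’s algorithm.

```python
import numpy as np

def tdp_count_lower_bound(p_cluster, m, alpha=0.05):
    """Conservative Simes-based lower bound on the number of true discoveries
    in a cluster, in the spirit of All-Resolutions Inference (ARI).
    NOTE (assumption): the real ARI bound uses Hommel's h(alpha) <= m;
    substituting m only makes the bound more conservative."""
    p = np.asarray(p_cluster, dtype=float)
    s = len(p)
    best = 0
    for u in range(1, s + 1):
        hits = int(np.sum(m * p <= u * alpha))  # p-values small enough at level u
        best = max(best, 1 - u + hits)
    return max(best, 0)

# a "cluster" of 10 voxels, 6 with strong signal, out of m = 1000 voxels total
p_cluster = [1e-8, 1e-7, 1e-6, 1e-6, 1e-5, 1e-5, 0.2, 0.4, 0.6, 0.9]
lb = tdp_count_lower_bound(p_cluster, m=1000)   # lower bound on true discoveries
tdp_lb = lb / len(p_cluster)                    # lower bound on the TDP
```

Because the bound holds simultaneously over all subsets, one may re-apply it to any sub-cluster of this cluster without paying for the re-selection.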
<p>The last item has quite surprising implications.
It means that ARI provides statistical guarantees if selecting clusters and estimating TDP from the same data, and particularly if clusters are selected using random-field-theory significance tests.
It also means that one may create sub-clusters within significant clusters, and estimate TDP again, without losing error guarantees(!).
It also means that one may select clusters using the TDP itself, and if unsatisfied with results, re-select clusters using whichever criterion, ad infinitum.</p>
<p>Here is an example of TDP bounds in sub-clusters, within the originally selected clusters: <img src="../images/gonogo_perc4bis.png" alt="here" />.</p>
<p>How does this re-selection not invalidate error guarantees?
Put differently, how does ARI deal with this <strong>infinite circularity</strong>?
The fundamental idea is similar to <a href="https://en.wikipedia.org/wiki/Scheff%C3%A9%27s_method">Scheffe’s method</a> in post-hoc inference.
The idea is to provide statistical guarantees on TDP to <strong>all possible cluster selections</strong>.
This means that any cluster a practitioner may create, has already been accounted for by the ARI algorithm.</p>
<p>Providing valid lower TDP bounds is clearly not the only task at hand.
Indeed, bounding all TDPs at \(0\) satisfies the desired error guarantees, for all possible cluster selections.
The real matter is power: are the TDP bounds provided by ARI tight, given the massive number of implied clusters being considered?
The answer is that the TDP bounds are indeed tight, at least in large clusters where the spatial specificity paradox is indeed a concern.</p>
<p>Two fundamental ingredients allow ARI to provide informative TDP bounds, even after this infinitely circular inference.
The first ingredient is that we do not consider all possible brain maps; rather, we assume that the brain map is <strong>smooth enough</strong>.
This smoothness is implied by assuming that the brain map satisfies the <em>Simes Inequality</em>, which excludes extremely oscillatory brain maps that would require more conservative bounds.
The Simes inequality is implied by the <em>Positive Regression Dependence on Subsets</em> condition, which is frequently used in brain imaging, since it is required for <a href="https://en.wikipedia.org/wiki/False_discovery_rate">FDR control</a> using the <a href="https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure">Benjamini-Hochberg</a> algorithm.</p>
<p>The second ingredient, is that the TDP bounds are provided by inverting a <a href="https://en.wikipedia.org/wiki/Closed_testing_procedure">closed testing procedure</a>, which is a powerful algorithm for multiple testing correction.</p>
<p>The compounding of a closed testing procedure in a smooth-enough random field, implies that the true TDP cannot be too far from the observed TDP, so that it may be bound while being both informative, and statistically valid.</p>
<p>The full details can be found in our recent contribution, now accepted to Neuroimage [3].</p>
<hr />
<p>[1] Goeman, Jelle, et al. “Simultaneous Control of All False Discovery Proportions in Large-Scale Multiple Hypothesis Testing.” arXiv preprint arXiv:1611.06739 (2016).</p>
<p>[2] Goeman, Jelle J., and Aldo Solari. “Multiple testing for exploratory research.” Statistical Science 26.4 (2011): 584-597.</p>
<p>[3] Rosenblatt, Jonathan D., et al. “All-Resolutions Inference for brain imaging.” NeuroImage (2018). https://doi.org/10.1016/j.neuroimage.2018.07.060</p>Ranting on MVPA2017-09-03T00:00:00+00:002017-09-03T00:00:00+00:00http://johnros.github.io/mvpa-rant<p>The use of MVPA for signal detection/localization in neuroimaging has troubled me for a long time.
Somehow the community refuses to acknowledge that for the purpose of localization, multivariate tests (e.g. Hotelling’s \(T^2\)) are preferable.
Why are multivariate tests preferable to accuracy tests?</p>
<ol>
<li>They are more powerful.</li>
<li>They are easier to interpret.</li>
<li>They are easier to implement.</li>
<li>Because they are not cross-validated:
<ol>
<li>They are computationally faster.</li>
<li>They do not suffer from biases in the cross-validation scheme.</li>
</ol>
</li>
</ol>
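For concreteness, here is a textbook two-sample Hotelling \(T^2\) test, on simulated “voxel patterns” from two conditions (sample sizes, dimension and mean shift are made up); note that no cross-validation loop appears anywhere.

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(x, y):
    """Two-sample Hotelling T^2 test; returns (T2, p-value).
    Assumes equal covariances and n1 + n2 - 2 > p (textbook version)."""
    n1, p = x.shape
    n2, _ = y.shape
    diff = x.mean(axis=0) - y.mean(axis=0)
    # pooled sample covariance
    S = (((n1 - 1) * np.cov(x, rowvar=False) +
          (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2))
    t2 = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(S, diff)
    # exact F reference distribution under Gaussianity
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    pval = stats.f.sf(f, p, n1 + n2 - p - 1)
    return t2, pval

rng = np.random.default_rng(3)
p_vox = 5                                     # voxels in the "searchlight"
a = rng.standard_normal((30, p_vox))          # condition A
b = rng.standard_normal((30, p_vox)) + 0.8    # condition B, mean-shifted
t2, pval = hotelling_two_sample(a, b)
```

One fit, one closed-form p-value: no folds, no resampling, and no classifier whose accuracy needs its own null distribution.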
<p>I read and referee papers where authors go to great lengths to interpret their “funky” results.
To them I say:
Your cross-validation scheme is biased and your test statistic is leaving power on the table!
Please consult a statistician and replace your MVPA with a multivariate test.
For a more “scientific explanation” read [1] and [2].</p>
<p>If you justify the use of the prediction accuracy because it is also an effect-size, then please acknowledge that <em>effect size</em> is a different problem than <em>localization</em> and read the multivariate effect size literature (e.g. [3]).</p>
<p>When would I really want to use the prediction accuracy as a test statistic?
When doing actual decoding and not localization, such as brain-computer interfaces.</p>
<hr />
<p>[1] Rosenblatt, Jonathan, Roee Gilron, and Roy Mukamel. “Better-Than-Chance Classification for Signal Detection.” arXiv preprint arXiv:1608.08873 (2016).</p>
<p>[2] Gilron, Roee, et al. “What’s in a pattern? Examining the type of signal multivariate analysis uncovers at the group level.” NeuroImage 146 (2017): 113-120.</p>
<p>[3] Olejnik, Stephen, and James Algina. “Measures of effect size for comparative studies: Applications, interpretations, and limitations.” Contemporary educational psychology 25.3 (2000): 241-286.</p>A surprising result on the power of the t-test2017-08-30T00:00:00+00:002017-08-30T00:00:00+00:00http://johnros.github.io/wilcoxon-power<p>In our recent contribution [1], just published in <a href="http://amstat.tandfonline.com/doi/full/10.1080/00031305.2017.1360795">The American Statistician</a>, we revisit the power analysis of the t-test.</p>
<p>The fundamental observation is that the t-test has been proposed, and studied, as a detector of <em>shift alternatives</em>.
By shift alternatives, a statistician means that if two populations differ, then they differ by their mean.
Put differently, it is assumed the factor of interest has the effect of <em>shifting</em> a distribution.
For many phenomena, however, we would not expect an effect of shift-type.
Consider a clinical trial: if we expect a drug to affect only part of the population, we are no longer looking for a shift alternative, but rather, a <em>mixture alternative</em>.</p>
<p>We show that, for mixture alternatives, much of the folklore on the t-test no longer holds (nor should it).
We show that Wilcoxon’s signed-rank test may be more powerful than a t-test under a Gaussian null.
This is because Wilcoxon’s signed-rank test may capture the asymmetry in the mixture before the t-test captures the change in mean.</p>
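This can be explored with a small Monte-Carlo sketch (sample size, mixing fraction and shift are illustrative choices, not the paper’s settings): a fraction of the population responds to treatment, the rest does not, and we compare the rejection rates of the two tests.

```python
import numpy as np
from scipy import stats

def power_sim(n=100, pi=0.1, delta=3.0, reps=500, alpha=0.05, seed=0):
    """Monte-Carlo power of the t-test vs. Wilcoxon's signed-rank test
    under a mixture alternative: a fraction `pi` of the population is
    shifted by `delta`; the rest stays standard normal."""
    rng = np.random.default_rng(seed)
    rej_t = rej_w = 0
    for _ in range(reps):
        x = rng.standard_normal(n)
        responders = rng.random(n) < pi       # who responds to "treatment"
        x[responders] += delta
        rej_t += stats.ttest_1samp(x, 0.0).pvalue < alpha
        rej_w += stats.wilcoxon(x).pvalue < alpha
    return rej_t / reps, rej_w / reps

power_t, power_w = power_sim()
```

Varying `pi` and `delta` lets one map out the regimes where the asymmetry of the mixture favors one test over the other.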
<p>This has interesting implications.
A practitioner may be willing to give up some power and opt for Wilcoxon’s test, because they will not assume Gaussianity.
Our results show that it is possible that this practitioner did not lose any power, but in fact, has gained some.
Not because the null is non-Gaussian, but rather, because the alternative is of mixture-type.</p>
<p>After so much has been said on the t-test, I am rather proud that we can still innovate on the matter.</p>
<p>[1] Rosenblatt, Jonathan D., and Yoav Benjamini. “On Mixture Alternatives and Wilcoxon’s Signed-Rank Test.” The American Statistician, August 1, 2017. doi:10.1080/00031305.2017.1360795.</p>