Jonathan D. Rosenblatt

Benchmarking BLAS&LAPACK implementation for use with R, Python and Julia

2021-09-30T00:00:00+00:00

Guest post by Guy Barel.

Disclaimer: This project spanned a long period, and we were not as tidy as we should have been. This means that we are unable to provide the code used to produce the plots. If you decide to reproduce these results, and find similar/contradicting findings, please share with us.

Linear algebra libraries contain essential methods for statistical software. Machine learning and AI use numerical linear algebra operations on data sets represented as matrices and vectors, making numerical linear algebra a core component of the field. In recent years, the HPC community has devoted considerable research attention to obtaining high performance in matrix computations. In addition, the growth of data set dimensions brings an urgent need to analyze data sets in a reasonable time. Therefore, knowing which libraries are fastest is crucial for any user interested in the field.

The following is a benchmark I conducted during my fourth-year engineering project at BGU under the guidance of Dr. Jonathan Rosenblatt. The benchmark evaluates the performance of several matrix operations, in terms of execution time, for a range of linear algebra libraries: the standard Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK), and high-performance libraries (OpenBLAS, ATLAS, and Intel MKL). Moreover, the library evaluation is combined with three programming languages, R, Python, and Julia, offering a more extensive analysis than related publications.

In order to optimize execution time, the combination of two main parameters is examined: \((i)\) the selection of a linear algebra library and \((ii)\) the preferred programming language.

For benchmarking purposes, it is necessary to replace the underlying BLAS implementation of a programming language, and this replacement process has proven particularly challenging. For this purpose, I use a wrapper library called FlexiBLAS, which exchanges the BLAS implementation at run-time via an environment variable and requires neither relinking nor recompilation. FlexiBLAS is a lightweight wrapper around all BLAS and LAPACK subroutines; it uses a plugin framework built on the POSIX features for loading, unloading, and searching for functions or symbols inside ELF files at run-time. The major advantage of this approach is the relatively simple process of exchanging the BLAS library; besides, numerical experiments show no significant overhead introduced by this approach [MartinKöhler2013].
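As a rough illustration of the run-time switching described above, the sketch below launches a child process with the `FLEXIBLAS` environment variable set. It is not the setup used in this benchmark: the backend names are assumptions (the valid names depend on the local FlexiBLAS configuration), and the child command is a placeholder that only echoes the variable it received.

```python
import os
import subprocess
import sys


def run_with_backend(backend, command):
    """Run `command` in a child process with the FLEXIBLAS variable set.

    `backend` is a hypothetical backend name; valid names depend on the
    local FlexiBLAS configuration.
    """
    env = dict(os.environ)        # copy, so the parent environment is untouched
    env["FLEXIBLAS"] = backend    # picked up by FlexiBLAS when the child loads it
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


# Placeholder child: it only echoes the variable it received. In a real
# benchmark this would be an R, Python, or Julia script linked against FlexiBLAS.
child = [sys.executable, "-c", "import os; print(os.environ['FLEXIBLAS'])"]

for name in ["NETLIB", "OPENBLAS"]:   # assumed names, for illustration only
    print(name, "->", run_with_backend(name, child).strip())
```

Because the variable is set per process, the same script can be timed against several backends without rebuilding anything.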

For benchmarking linear algebra libraries, hardware and software aspects significantly affect the overall results. The use of different CPUs, programming languages, or operating systems will generate distinct performance profiles. This benchmark was carried out on a 64-bit machine featuring four 16-core Intel Xeon E7-4850v4 processors at 2.1GHz 115W, 1M L1, 4M L2, and 40M 20-way set associative shared cache, a total of 64 cores / 128 threads. All tests were performed on Linux OS Ubuntu 18.04, using the R, Julia, and Python programming languages.

Benchmark

Each benchmark experiment consists of 10 runs of the chosen operation over all the BLAS implementations mentioned. For matrix multiplication, singular value decomposition (SVD), and Cholesky decomposition, the matrices have \(2^7 , 2^9 , 2^{11} , 2^{12}\), and \(2^{13}\) rows/columns. For the dot product, the sizes are \(2^{15} , 2^{17} , 2^{19} , 2^{20}\), and \(2^{21}\). For each test, I generated a random matrix as input. Then, I ran each operation ten times, recorded the elapsed running time, and computed the average time. Below are the results of the tests.
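Since the original code is not available, the following is only a minimal sketch of the timing procedure described above, not the code behind the plots. The function names are hypothetical, and `numpy_benchmarks` assumes NumPy is installed; NumPy hands these operations to whichever BLAS/LAPACK it is linked against.

```python
import time


def average_time(operation, data, n_runs=10):
    """Average elapsed wall-clock time of operation(data) over n_runs runs."""
    elapsed = []
    for _ in range(n_runs):
        start = time.perf_counter()
        operation(data)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / n_runs


def numpy_benchmarks(sizes=(2**7, 2**9), n_runs=10):
    """Time three matrix operations with NumPy (assumed to be installed)."""
    import numpy as np

    rng = np.random.default_rng(0)
    results = {}
    for n in sizes:
        a = rng.standard_normal((n, n))
        spd = a @ a.T + n * np.eye(n)   # symmetric positive definite, for Cholesky
        results[("matmul", n)] = average_time(lambda m: m @ m, a, n_runs)
        results[("svd", n)] = average_time(np.linalg.svd, a, n_runs)
        results[("cholesky", n)] = average_time(np.linalg.cholesky, spd, n_runs)
    return results
```

Calling `numpy_benchmarks()` returns a dictionary of average times keyed by operation and matrix size; the default sizes are kept small so the sketch runs quickly.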

Results

I chose to present the results in two different forms: \((1)\) programming-language oriented and \((2)\) library oriented. This way, we can better understand the individual influence of the programming languages and of the libraries on the computation time. Each of these plots displays results for all the available BLAS implementations, with the matrix dimension on the horizontal axis and the elapsed time in seconds on the vertical axis.

Programming Languages oriented results

Dot-Product results

*Figure 1 Dot-product results*

Matrix multiplication results

*Figure 2 Matrix multiplication results*

SVD results

*Figure 3 SVD results*

Cholesky results

*Figure 4 Cholesky results*

Libraries oriented results

*Figure 5 BLAS&LAPACK results*

*Figure 6 MKL results*

*Figure 7 ATLAS results*

*Figure 8 OpenBLAS results*

Conclusion

Unsurprisingly, the standard BLAS and LAPACK are dominated by all other alternatives. ATLAS improves on the standard BLAS and LAPACK but is dominated by MKL and OpenBLAS, which have relatively similar performance. Python and Julia are essentially indistinguishable and ahead of R for most of the tests. Surprisingly, R obtains much better results under OpenBLAS for most numerical tests, and achieves better computation times than Python and Julia.

This benchmark shows that linear algebra computations can be made significantly faster by wisely choosing the programming language and the linear algebra library.

References

[MartinKöhler2013] Martin Köhler, J. S. (2013). FlexiBLAS - A flexible BLAS library with runtime exchangeable backends.

Why model measurements with complex numbers?

2021-04-01T00:00:00+00:00

I recently developed a mild obsession with understanding complex numbers. It all started while working with Tirza Routenberg and Neta Zimerman on the analysis of seismic array data. The array processing community will usually model measurements as complex numbers. This approach is so natural in the signal processing community that the canonical reference for array processing, [vanTrees2002], never even stops to explain why. Being the statistician that I am, it puzzled me: the measurements are the instantaneous compression of soil, so why would you want to represent that with a complex number?

The following is my current understanding for the reasons that the signal processing, and array processing communities, will model measurements as complex numbers. The tl;dr version is that: (a) The merit of complex numbers is due to their representation via complex exponentials. Shifting a complex valued sine-wave in time, is merely multiplying its complex exponential representation by some (complex) constant. (b) Any real-valued function/signal, may be mapped to its baseband representation, a.k.a., its complex envelope, without loss of information. (c) These mappings/representations are useful since followup processing will typically include linear systems (convolutions), deconvolutions, Fourier transforms, etc, which are easier, both computationally and analytically, when operating with complex exponentials.

Now with details.

Useful for Computations

The usual argument one receives when asking a physicist or an electrical engineer “why complex” is:
(a) Some measurements are complex.
(b) Super useful for handling waves.
(c) Super useful for linear systems.
Let’s parse them one by one.

Some measurements are complex

Some measurements, by their nature, respect the arithmetic of complex numbers. This is the case when measuring current and voltage. This is not the case in acoustics/seismology, where measurements represent the compression of air/soil.

Super useful for waves

This is an important point which may be initially unclear to someone, like myself, who never really understood the difference between a wave and any other function of time. A wave is a function of time and space, but it is not an arbitrary function. It represents a disturbance that propagates in time and space, so adjacent values are interconnected. It is not only smooth, but also has to satisfy the wave equation. Without going into the details of partial differential equations, I will just say that a sine wave satisfies the wave equation, and thus any solution, i.e., any wave, is usually recovered by presenting the solution as a combination of scaled and shifted sine waves. Sine waves and their shifts are best represented with complex numbers, as I will soon demonstrate. For a full explanation I recommend [Smith2002].

Super useful for convolutions

Say you are analyzing the effect of a linear time invariant system (LTI), aka a convolution. It is a well known fact that Fourier diagonalizes the convolution. Put differently, the spectrum of the output of an LTI, is a point-wise multiplication with some other function. We will show this later in this post.

Sensor Arrays

As the name suggests, the field of sensor arrays deals with measurements from, well, sensor arrays. It turns out that when analyzing data from an array of sensors, complex numbers soon arise. Why is this? Consider the real-valued measurement, \(f_k(t)\), of sensor \(k\) at time \(t\). Sensors \(k\) and \(k'\) measure the same function at different locations. Because this is the same function, measurements differ only in their temporal lag: \(f_{k'}(t)=f_k(t-\tau_k)\). Now enter a crucial fact about sine waveforms. Say \(f(t)\) is a sine wave in the complex plane: \(f(t)=\cos(t) + i \sin(t)\) where \(i:=\sqrt{-1}\). In complex exponential notation this is \(f(t)=e^{i t}\). Now presenting the time shift in complex exponential notation: \(f(t-\tau)=\sqrt{\cos^2(t-\tau)+\sin^2(t-\tau)}\, e^{i (t-\tau)}=e^{it} e^{-i\tau}=f(t)e^{-i\tau}\). This is why we say that a shift in time is a multiplication in frequency.
For some intuition, imagine that \(f(t)\) is the helix around a screw. To shift time, i.e. evaluate \(f(t)\) at \(t-\tau\) one can either look at position \(t-\tau\), or keep looking at position \(t\), but advance the screw a distance of \(\tau\). The “advancing of the screw” is the effect of the complex multiplication.
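The shift-as-multiplication property is easy to check numerically. The sketch below uses arbitrary values of \(t\) and \(\tau\) (chosen only for illustration) and compares evaluating the complex sinusoid at the shifted time with multiplying it by the unit-modulus constant \(e^{-i\tau}\).

```python
import cmath


def f(t):
    """Complex sinusoid f(t) = cos(t) + i sin(t) = exp(i t)."""
    return cmath.exp(1j * t)


t, tau = 0.7, 0.25   # arbitrary illustrative values

shifted = f(t - tau)                       # evaluate the signal at the shifted time
multiplied = f(t) * cmath.exp(-1j * tau)   # or multiply by a constant phase factor

print(abs(shifted - multiplied))   # zero up to floating-point rounding
```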

Since our sensor array measures time-shifted versions of the same signal, \(\{f(t-\tau_k)\}_k\), it would be nice if \(f(t)\) could be decomposed as a linear combination of complex exponentials. But this is what the real-to-complex Fourier transform does! Now since the first thing that will be done to array measurements is Fourier transforming them, why not start directly there? This is known, at least in [vanTrees2002], as the frequency domain snapshot model.

Signal Processing

The previous discussion implies that for array signal processing, and given some assumptions that we skipped, one should adopt the frequency domain snapshot model. But if you ever practiced signal processing, you may know that the time domain snapshot model is no less popular, not only for arrays, and will often use complex numbers. So again, why model measurements as complex numbers? In particular, when no arrays are involved?

One reason has to do with the fact that the term signal processing includes both digital communication, and data analysis.

In digital communication, it is quite common that one needs to transmit a message over an analogue channel (e.g. radio). Because the channel is analogue, it essentially transmits waves. The message to be transmitted has to be encoded by shifting and scaling wave functions. This practice is known as modulation, and it is best done using complex exponentials because it, again, involves shifting and scaling wave functions. Is this also the case for data analysis?

When doing signal processing for data analysis, there is no transmission, only reception. One may argue that in data analysis we are “decrypting nature’s messages”, but this romantic view has its limitations: we do not know the encoding mechanism used by nature, and the task is not decoding.

So why model measurements as complex numbers? My answer to this is the complex envelope. In my view, the matter is best described in [Schreier2010], and the argument is essentially that there is nothing to lose. The complex envelope is also known under the more informative name of equivalent baseband signal. It is essentially a representation of the real-valued signal using a minimal spectrum. Minimal in the sense that negative frequencies are canceled, and the remaining ones are shifted to some origin. The price to pay for this “spectral compactness” is that the signal is no longer real valued. One can always convert from the complex envelope to the real-valued signal, and vice versa.
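One common way to write this down is sketched here, with notation that is not taken from the references. Let \(x(t)\) be the real-valued signal, \(\hat{x}(t)\) its Hilbert transform, and \(\omega_0\) a reference frequency chosen by the analyst; all three symbols are assumptions of this sketch.

\[
x_a(t) = x(t) + i\,\hat{x}(t), \qquad
\tilde{x}(t) = x_a(t)\, e^{-i\omega_0 t}, \qquad
x(t) = \operatorname{Re}\left\{\tilde{x}(t)\, e^{i\omega_0 t}\right\}.
\]

The analytic signal \(x_a\) has no negative frequencies, the multiplication by \(e^{-i\omega_0 t}\) shifts the remaining ones toward the origin to give the complex envelope \(\tilde{x}\), and the last identity is the way back to the real-valued signal.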

The baseband noise

I may have convinced you, and myself, that the complex envelope loses nothing, and may facilitate further processing which is easier with complex numbers. This is true, but an important detail to mind when adopting this time domain complex envelope snapshot model is the noise. In the real-valued time domain, we usually use a white, Gaussian noise process to model noise. But what is the complex envelope of a Gaussian white noise process? Is it Gaussian? Is it white? The answer may be found in [vanTrees2002] circa Eq(5.79), or more rigorously in [Viswanathan2006]. The answer is approximately affirmative, meaning that one may use a white (proper) Gaussian process as the complex envelope of the real-valued noise process.

Complex Sinusoids Diagonalize the Convolution

Complex sinusoids diagonalize the convolution. Put differently: complex sinusoids are eigenfunctions of any linear time invariant system (LTI). We said this earlier informally, but now we can be more formal about it.

Consider the operation of an LTI, \(\mathcal{H}\), on a complex sinusoid, \(f(t)=e^{i \omega t}\). Denote \(g:=\mathcal{H}\{f\}\), \(f^*(t)=f(t-\tau)\), \(g^*(t)=g(t-\tau)\). Because \(f\) is a complex sinusoid, \(f^*(t)=f(t-\tau)=e^{-i\omega\tau}f(t)\). By linearity of \(\mathcal{H}\): \(\mathcal{H}\{f^*\}=\mathcal{H}\{e^{-i\omega\tau}f(t)\}=e^{-i\omega\tau}\mathcal{H}\{f(t)\}=e^{-i\omega\tau}g\). By time-invariance, \(\mathcal{H}\{f^*\}=g^*\). Combining the two, \(g(t-\tau)=e^{-i\omega\tau}g(t)\) for every \(\tau\), which is satisfied if \(g\) is a complex sinusoid of the same frequency, \(g(t)=c\,e^{i\omega t}\) for some complex constant \(c\).
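A discrete-time analogue of this argument can be checked numerically. In the sketch below, the LTI is a short FIR filter with a made-up impulse response `h`, and the frequency `omega` is arbitrary; both are assumptions for illustration. The output of the filter on a complex sinusoid is compared with the same sinusoid multiplied by a single complex constant.

```python
import cmath


def lti_output(h, f, n):
    """Output at time n of the FIR system y[n] = sum_k h[k] * f(n - k)."""
    return sum(h_k * f(n - k) for k, h_k in enumerate(h))


omega = 0.9                     # arbitrary frequency
h = [0.5, -0.2, 0.1, 0.3]       # made-up impulse response (hypothetical filter)


def f(n):
    """Complex sinusoid sampled at integer times."""
    return cmath.exp(1j * omega * n)


# Candidate eigenvalue: the frequency response of h at omega.
H = sum(h_k * cmath.exp(-1j * omega * k) for k, h_k in enumerate(h))

# The output equals H * f(n) at every time index we check.
max_error = max(abs(lti_output(h, f, n) - H * f(n)) for n in range(20))
print(max_error)   # zero up to floating-point rounding
```

The constant `H` plays the role of the eigenvalue: the filter changes the amplitude and phase of the sinusoid, but not its frequency.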

Conclusions

For array processing, where time-shifting is key, a frequency-domain-snapshot-model is a natural approach.
For general analysis of real-valued signals, complex modeling may be less obvious but brings benefits. Thinking of the complex envelope of a real-valued signal is harmless, provided you take care to use the right noise model.

Acknowledgements

I am thankful for the fruitful conversations on the matter with Tirza Routenberg, Jont Allen, Roy Lederman, and Armin Schwartzman.

References

[Smith2002] Steven Smith, Digital Signal Processing: A Practical Guide for Engineers and Scientists, 1st edition (Amsterdam; Boston: Newnes, 2002).

[Schreier2010] Peter J. Schreier and Louis L. Scharf, Statistical Signal Processing of Complex-Valued Data: The Theory of Improper and Noncircular Signals (Cambridge: Cambridge University Press, 2010).

[vanTrees2002] Harry L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory, 1st edition (New York: Wiley-Interscience, 2002).

[Viswanathan2006] R. Viswanathan, “On the Autocorrelation of Complex Envelope of White Noise,” IEEE Transactions on Information Theory 52, no. 9 (September 2006): 4298–99.

Domain Adaptation for Environmental Monitoring

2020-08-16T00:00:00+00:00

Environmental monitoring from satellite imagery essentially means that instead of directly measuring pollution (for instance), you predict it from satellite imagery. When an epidemiologist controls for ambient temperature, you can be pretty sure that such an indirect measurement of temperature is involved.

Predicting pollution is typically addressed as a supervised learning problem: use the pollution measured in ground stations as labels, and predict it wherever ground stations are not available. But what if some pollution monitoring stations are far away from the residences of the subjects involved in the epidemiological study? Would we not want to down-weight those stations in the learning?

The idea of weighted learning is not a new one. In the Machine Learning literature, it has been recently popularized in the context of “Domain Adaptation”, and in particular “Covariate Shift”: where the distributions of the covariates in the train set differ from those of the test set. This is exactly the case in environmental monitoring.

In our latest contribution [1], we call upon recent ideas from the domain adaptation literature to estimate the quality of predicted temperatures in France. We show that naive performance estimates are biased, even if cross-validated. We then plug our performance estimators into the Empirical Risk Minimization framework, in order to learn better predictors. En passant, we discuss the matter of h-blocking, and other data splitting schemes designed for unbiased performance estimation (briefly: don’t).

[1] R. Sarafian, I. Kloog, E. Sarafian, I. Hough and J. D. Rosenblatt, “A Domain Adaptation Approach for Performance Estimation of Spatial Predictions,” in IEEE Transactions on Geoscience and Remote Sensing, doi: 10.1109/TGRS.2020.3012575.

MultiNav: Navigating Multivariate Data

2020-04-28T00:00:00+00:00

In statistical process control (SPC), a.k.a. anomaly-detection, one compares some statistic to its “in control” distribution. Many statistical and BI software suites (e.g. Tableau, PowerBI) can do SPC. Almost all of these, however, focus on univariate processes, which are easier to visualize and discuss than multivariate processes. The purpose of our MultiNav package is to ingest multivariate data streams, compute multivariate summaries, and allow us to query the data interactively when an anomaly is detected.

In the next example we look into an agricultural IoT use-case, and try to detect sensors with anomalous readings. The full details are available in the official documentation.

The data consists of hourly measurements, during 5 days, of the girth of 100 avocado trees. For each such tree, an anomaly score is computed using four variations of Hotelling’s \(T^2\) statistic. MultiNav thus returns the following dashboard:

Crucially for our purposes, this dashboard is interactive. One may hover over a particular score, in order to inspect the raw evolution of the measurement in time, and compare it to the group via a functional boxplot.

There is a lot more that MultiNav can do. The details can be found in the inline help and the official documentation. For the purpose of this blog, we merely summarize the reasons we find MultiNav very useful:

Interactive visualizations that link anomaly score with raw data.
Functional boxplot for querying the source of the anomalies.
Out-of-the-box multivariate scoring functions, with the possibility of extension by the user.
The interactive components are embeddable in Shiny apps, for a full-blown interactive dashboard.

On the Harmonic-Mean of p-values

2019-10-28T00:00:00+00:00

The Harmonic-Mean p-value (HMP), as the name suggests, is the harmonic mean of p-values. It can replace other aggregations, such as Fisher’s combination, and be used as a test statistic for signal detection, a.k.a. global null testing. It is an elegant test statistic: it is easy to compute, and its null distribution can be easily derived for independent tests. It is thus a useful tool for signal-detection/meta-analysis.
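To make the definition concrete, here is a minimal sketch of the unweighted statistic. The function name and the p-values are made up for illustration, and the sketch computes only the statistic, not its null distribution or any rejection threshold.

```python
def harmonic_mean_p(p_values):
    """Unweighted harmonic mean of a collection of p-values in (0, 1]."""
    p_values = list(p_values)
    if not p_values or any(p <= 0 or p > 1 for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    return len(p_values) / sum(1.0 / p for p in p_values)


p = [0.5, 0.5, 0.01]            # made-up p-values
print(harmonic_mean_p(p))       # dominated by the smallest p-value
print(sum(p) / len(p))          # arithmetic mean, for comparison
```

The harmonic mean is pulled toward the smallest p-values, which is why it can pick up a signal carried by only a few of the tests.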

Our interest in the HMP began with Daniel Wilson’s PNAS paper [1]. In there, and the accompanying Wikipedia entry, Wilson claims the following properties of the HMP:

The HMP offers strong FWER control.
The HMP is a more powerful test than Bonferroni and Simes’.
The HMP is valid when p-values are statistically dependent.

Let’s parse them one at a time:

Regarding Strong FWER control: Strong FWER control means that the probability of any false positive is controlled, even in the presence of signal in some of the variables. This is not true for the HMP, because unless embedded in a closed testing procedure, HMP is a global test, and not a multiple test. Weak FWER control means that the probability of a false positive is under control, only if no variable carries signal. Wilson, however, does show that HMP offers more than weak FWER control. Because HMP is decreasing as more hypotheses are intersected, this means that if HMP rejects some conjunction hypothesis, then all the hypotheses in the closure of the conjunction will also be rejected. Wilson calls it strong control, but we think that “weak FWER control in selected subsets” may be more accurate.

Here is a formal definition, followed by an example. Denote by \(B\) a set of \(m\) null-hypotheses, by \(m_1(B)\) the number of false-nulls (i.e., signals, effects, associations,…) in \(B\), and by \(V\) the number of false rejections of some inference algorithm, such as HMP. Weak FWER means that if \(m_1(B)=0\) then \(P(V(B)>0)\leq \alpha.\) Strong FWER means that \(\forall m_1(B), P(V(B)>0)\leq \alpha.\) Weak FWER in the selected means that \(P(\exists R_i: m_1(R_i)=0)\leq \alpha,\) where \(R_i\) denotes a rejected hypothesis.

Here is an example in genetics. Based on SNP-wise p-values, a researcher declares genes A and B as significant. Weak FWER means the inference is valid only if there are no effects/associations anywhere in A, nor B, nor any other SNP outside of \(A \cup B\). Weak FWER in the selected means the inference is valid if there are no effects/associations anywhere in A and B. Effects/associations that are outside \(A \cup B\) are allowed. Strong FWER means that inference on SNPs within \(A \cup B\) is valid, even in the presence of truly associated SNPs within \(A \cup B\).

Readers familiar with neuroimaging will recognize that weak-FWER is exactly the type of guarantees provided when inferring on clusters using random-field theory. Indeed, the assumed null in cluster inference, a.k.a. topological inference, is that “there is no effect anywhere in the cluster”. In [4] we called it the “Spatial Specificity Paradox”: the larger the cluster, the harder it is to pin-point the origin of brain activation.

The neuroimaging example is also useful to demonstrate another property of weak-FWER-in-the-selected: signal outside selected clusters does not invalidate inference. For this reason, some may prefer calling these guarantees “weak-within–strong-between…”.

Regarding HMP vs. Bonferroni: We do not see how a global test like HMP can be compared to Bonferroni corrected multiple tests. Regarding HMP vs. Simes: The comparison in [1] is an unfair one. When all elementary hypotheses are tested, neither test dominates the other.

Regarding validity under dependence: Wilson argues, both in [1] and in his Wikipedia edit, that HMP is valid under arbitrary statistical dependencies between p-values. While there may be dependencies that do not invalidate the HMP false-positive rates, a simple simulation under compound-symmetry reveals that the HMP is not valid under dependence. Does theory support Wilson’s claim? Yes and no. On the one hand, the distribution of harmonic means does tend to be robust to dependence. This is because the harmonic mean is driven by the smaller entries; in the Gaussian case, where dependence=correlations, the smaller entries roughly behave like white noise. On the other hand, it was not too hard to design an example where false positive control is invalidated, as we show in [2].
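A simulation of this kind can be sketched as follows. This is not the simulation from [2]: the number of tests, the correlation, the one-sided p-values, and the Monte Carlo calibration of the threshold under independence are all assumptions made for illustration. The idea is to fix the rejection threshold using independent null p-values, and then check how often the same threshold is crossed when the null test statistics are equicorrelated.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def hmp(p_values):
    """Harmonic mean of p-values."""
    return len(p_values) / sum(1.0 / p for p in p_values)


def null_hmp_sample(m, rho, rng):
    """HMP of m one-sided null p-values from equicorrelated Gaussians."""
    shared = rng.gauss(0.0, 1.0)   # common factor inducing correlation rho
    z = [rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
         for _ in range(m)]
    # One-sided p-values, clipped away from 0 to keep 1/p finite.
    return hmp([max(1.0 - STD_NORMAL.cdf(z_i), 1e-300) for z_i in z])


def rejection_rate(m, rho, threshold, n_sim, rng):
    """Fraction of null simulations whose HMP falls below the threshold."""
    hits = sum(null_hmp_sample(m, rho, rng) < threshold for _ in range(n_sim))
    return hits / n_sim


rng = random.Random(1)
m, alpha, n_sim = 50, 0.05, 2000   # illustrative settings

# Calibrate the threshold by Monte Carlo under independence (rho = 0).
calibration = sorted(null_hmp_sample(m, 0.0, rng) for _ in range(n_sim))
threshold = calibration[int(alpha * n_sim)]

for rho in (0.0, 0.5):
    print(rho, rejection_rate(m, rho, threshold, n_sim, rng))
```

Comparing the printed rate for the correlated case with the nominal level gives a quick feel for how sensitive the calibration is to dependence in a given setting.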

All of the above does not mean the HMP is useless. Far from it. Goeman and Solari [3] show how to estimate the signal’s prevalence, with strong FWER control, by embedding a global test in a closed testing procedure. In Rosenblatt et al. [4], discussed in a previous blog post, we showed how we used this idea to estimate the amount of brain activation in selected regions. Ebrahimpoor et al. [5] use the same rationale to infer on the proportion of associated SNPs within genes. While the type of correlations in brain scans and genomes may (possibly) invalidate error guarantees, this embedding is still of value in many other applications.

We thus conclude that HMP and closed testing have great potential, but they have to be handled with more care than implied in [1].

[1] Daniel J. Wilson “The harmonic mean p-value for combining dependent tests” Proceedings of the National Academy of Sciences Jan 2019, 116 (4) 1195-1200;

[2] Jelle J. Goeman, Jonathan D. Rosenblatt, Thomas E. Nichols “The harmonic mean p-value: Strong versus weak control, and the assumption of independence” Proceedings of the National Academy of Sciences Oct 2019, 201909339;

[3] Goeman, Jelle J., and Aldo Solari. “Multiple testing for exploratory research.” Statistical Science 26.4 (2011): 584-597.

[4] Rosenblatt, Jonathan D., et al. “All-Resolutions inference for brain imaging.” Neuroimage 181 (2018): 786-796.

[5] Ebrahimpoor, Mitra, et al. “Simultaneous Enrichment Analysis of all Possible Gene-sets: Unifying Self-Contained and Competitive Methods.” Briefings in bioinformatics (2019).

Better-Than-Chance Classification for Signal Detection

2019-10-15T00:00:00+00:00

In 2012 my friend Roee Gilron told me about a popular workflow for detecting activation in the brain: fit a classifier, then use a permutation test to check if its cross-validated accuracy is better than chance level. “That can’t be right” I said. “So much power is left on the table!” “Can you show it to me?” Roee replied. Well, I did. And 7 years later, our results have been published by Biostatistics.

Roee’s question led to a mass of simulations, which led to new questions, which led to new simulations. This question also attracted the interest of my other colleagues: Roy Mukamel, Jelle Goeman, and Yuval Benjamini.

The core of the work is the comparison of the power of two main approaches: (1) Detecting signal using a supervised classifier, as described above. (2) Detecting signal using multivariate hypothesis testing, such as Hotelling’s \(T^2\) test. We call the former an accuracy test, and the latter a two-group test. We studied the high-dimension-small-sample setup, where the dimension of each measurement is comparable to the number of measurements. This setup is consistent with applications in brain-imaging and genetics.
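For reference, the classical two-sample Hotelling statistic can be written as follows. The notation is introduced here for illustration only: \(n_1, n_2\) are the group sizes, \(\bar{x}_1, \bar{x}_2\) the group mean vectors, and \(S\) the pooled sample covariance.

\[
T^2 = \frac{n_1 n_2}{n_1+n_2}\, (\bar{x}_1-\bar{x}_2)^\top S^{-1} (\bar{x}_1-\bar{x}_2).
\]

When the dimension is comparable to \(n_1+n_2\), the matrix \(S\) is poorly conditioned, and it is singular once the dimension exceeds \(n_1+n_2-2\); this is one way to see why the treatment of the covariance matters so much in the high-dimensional setup.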

Here is a VERY short summary of our conclusions.

Accuracy tests are underpowered compared to two-group tests.
In high dimension, covariance regularization is crucial. The statistical literature has many two-group tests designed for high dimension.
The optimal regularization for testing, and for prediction are different.
The interplay between the direction of the signal and the principal components of the noise has a considerable effect on power.
Two-group tests do not require cross-validation. They are thus considerably faster to compute.
If insisting on accuracy-tests instead of two-group tests, then resampling with-replacement has more power than without-replacement. In particular, the leave-one-out Bootstrap is better than cross-validation.

The intuition for our main findings is the following:

Estimating accuracies adds a discretization stage which reduces power and is needless for testing.
In high-dim, there is barely enough data to estimate the covariances in the original space, let alone in augmented feature spaces. Kernel tricks and deep-nets may work fine in low-dim, but are hopeless in high-dim.

Given these findings, the tremendous popularity of accuracy tests is quite puzzling. We dare conjecture that it is partially due to the growing popularity of machine-learning, and the reversal of the inference cascade: researchers fit a classifier, and then check if there is any difference between populations. Were researchers to start by testing for any difference between populations, and only then fit a classifier, then a two-group test would be a natural starting point.

The full details can be found in [1].

[1] Jonathan D Rosenblatt, Yuval Benjamini, Roee Gilron, Roy Mukamel, Jelle J Goeman, Better-than-chance classification for signal detection, Biostatistics, https://doi.org/10.1093/biostatistics/kxz035

Web-Technologies for Interactive Multivariate Monitoring System

2019-05-15T00:00:00+00:00

I am glad to announce our latest contribution, just published in Computers and Electronics in Agriculture [1]. The described system will soon be available as an R package.

Think of rain as Irrigation 1.0; sprinklers as Irrigation 2.0; drippers as 3.0. In Irrigation 4.0, farmers irrigate upon demand. To assess a plant’s “demand” for water, several technologies are used. All of these require assurance of the quality of the data that enters irrigation algorithms. The scale of modern fields is such that data quality needs to be assured automatically, not manually.

Any modern age BI system will allow data screening using If-Then rules. These rules may filter “technical” anomalies, but will be unable to capture “statistical anomalies” such as a sick plant, or over-reactive sensor.

To help agriculturists assure data quality for demand-based irrigation, we teamed up with Pytech, a manufacturer of dendrometers, to develop a system for the detection of anomalous sensors.

The building blocks of our system:

Measure the plants’ health via a network of dendrometers.
Detect faulty sensors using various anomaly detection algorithms, borrowing ideas from robust multivariate statistics and social-network analysis.
Once an anomaly has been detected, an interrogation of the sensor is made possible in any browser, using web-technologies; D3.JS in particular.

[1] Vilenski, Efrat, Peter Bak, and Jonathan D. Rosenblatt. “Multivariate anomaly detection for ensuring data quality of dendrometer sensor networks.” Computers and Electronics in Agriculture 162 (2019): 412-421.

Gaussian Markov Random Fields versus Linear Mixed Models for Spatial-Data

2019-03-04T00:00:00+00:00

It was Prof. Itai Kloog who introduced us to the problem of pollution assessment. In short: pollution is assessed from satellite imagery, calibrated from ground monitoring stations. In statistical parlance: we predict pollution using image quality (AOD) as a predictor.

How should one fit a predictive model, in particular given the spatial correlation in the model’s errors?

First approach: a Gaussian Random Field, in particular with an assumed Matern covariance function. This is a standard approach in spatial statistics. Predictions that account for covariance in errors are known by statisticians as BLUPs, by applied mathematicians as Linear Projection Operators, by geostatisticians as Kriging, and by numerical analysts as Radial Basis Function Interpolators. The desirable property of BLUPs is that predictions vary smoothly in space, as one would expect. A downside of this technique is that the dimension of the (inverse) covariance matrix required for predictions is of the order of the number of ground monitoring stations. When calibrating the predictor at a continental scale, with thousands of stations, fitting such a predictor may take days.

Second approach: a Linear Mixed Model (LMM). This is the currently dominant approach in pollution prediction. The idea is to fit region-wise random effects, thus capturing spatial correlations in prediction errors. The upside: the implied error correlation is sparse. This allows the use of algorithms tailored for sparse matrices [1], making fitting and predicting much faster. The downside: prediction surfaces have a “slab” structure, instead of varying smoothly in space. This is not an “aesthetics” argument, but rather a very practical one: slab surfaces allow very different predictions for spatially adjacent stations. This is an undesirable property of the modeling approach.

Can the benefits of smooth predictions be married with the fast computations with sparse matrices? The answer is affirmative, via Gaussian Markov Random Fields. The fundamental observation is that Gaussian Markov fields have sparse precision matrices, so they are easy to compute with. The Integrated Nested Laplace Approximations (INLA) of [2] does just that: it approximates a Matern Gaussian field, with a Markov Gaussian field. This framework is accompanied with an excellent R implementation: R-INLA.
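A one-dimensional toy case, not taken from [2], gives a feel for this sparsity. Consider a stationary AR(1) process, which is a Gaussian Markov chain, with unit marginal variance and lag-one correlation \(\rho\) (both are assumptions of the example). Its covariance matrix is dense, while its precision matrix is tridiagonal:

\[
\Sigma_{ij} = \rho^{|i-j|}, \qquad
\Sigma^{-1} = \frac{1}{1-\rho^2}
\begin{pmatrix}
1 & -\rho & & & \\
-\rho & 1+\rho^2 & -\rho & & \\
& \ddots & \ddots & \ddots & \\
& & -\rho & 1+\rho^2 & -\rho \\
& & & -\rho & 1
\end{pmatrix}.
\]

Every observation is correlated with every other one, yet each row of the precision matrix has at most three non-zero entries, because each observation is conditionally independent of the rest given its two neighbours.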

Using the INLA approximation, we were able to fit a (Markov) Gaussian random field to our data, and return smooth pollution prediction surfaces within the hour. The covariance implied by the INLA approximation of the Matern field is presented in the following figure. Notice the sparsity, which facilitates computations.

When verifying the statistical errors in our prediction surfaces, we find that our predictions are not only computationally feasible, but also more accurate than those of the dominant LMM approach. The cross-validated accuracy, as a function of extrapolation distance, is given in the following figure:

The full details can be found in the paper [3].

[1] Davis, Timothy A. Direct methods for sparse linear systems. Vol. 2. SIAM, 2006.

[2] Rue, Havard, Sara Martino, and Nicolas Chopin. “Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71.2 (2009): 319-392.

[3] Sarafian, Ron, et al. “Gaussian Markov Random Fields versus Linear Mixed Models for satellite-based PM2.5 assessment: Evidence from the Northeastern USA.” Atmospheric Environment (2019).

The Spatial Specificity Paradox in brain imaging, remedied with valid, infinitely-circular, inference

2018-07-27T00:00:00+00:00

The most prevalent mode of inference in brain imaging is inference on supra-threshold clusters, with random field theory providing error guarantees. This introduces a spatial specificity paradox: the larger the detected cluster, the less we know about the exact location of the activation. This is because the null hypothesis tested is “no activation in the whole cluster”, so the alternative is “at least one voxel is active in the cluster”. This observation is not new, but merely an implication of the random field assumption.

To deal with this paradox, we propose to estimate the proportion of active voxels in the cluster. If this proportion is large, then there is no real paradox. If this proportion is small, we may want to “drill down” to sub-clusters. This introduces a circularity problem, a.k.a. selective inference: we are not making inference on arbitrary voxels, but rather on voxels that belong to statistically significant clusters.

In the spirit of the FDR, we call the proportion of active voxels in a cluster the True Discovery Proportion (TDP). We use recent results from the multiple testing literature, namely the All Resolution Inference (ARI) framework of Goeman and others [1,2], to estimate the TDP in selected clusters.

The ARI framework has many benefits for our purpose:

It takes voxel-wise p-values and returns TDP lower bounds.
The algorithm is very fast, as implemented in the hommel R package.
For brain imaging, we wrote a wrapper package, ARIbrain.
The TDP bounds using ARI come with statistical error guarantees. Namely, with probability \(1-\alpha\) (over repeated experiments), no cluster will have an overestimated TDP.
The above guarantee applies no matter how clusters have been selected.
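As a rough illustration of how voxel-wise p-values turn into a TDP lower bound, here is a plain-Python sketch of the Simes-based shortcut formulas given in the ARI literature [1,3]. The function names `hommel_h` and `tdp_lower_bound` and the toy p-values are hypothetical, and this naive version is not the fast algorithm implemented in hommel or ARIbrain.

```python
def hommel_h(pvalues, alpha=0.05):
    """Largest i in {0, ..., m} such that p_(m-i+j) > j * alpha / i for all j = 1..i."""
    p = sorted(pvalues)
    m = len(p)
    for i in range(m, 0, -1):
        if all(p[m - i + j - 1] > j * alpha / i for j in range(1, i + 1)):
            return i
    return 0

def tdp_lower_bound(pvalues, cluster, alpha=0.05):
    """Lower bound on the proportion of truly active voxels among the indices in `cluster`."""
    h = hommel_h(pvalues, alpha)
    size = len(cluster)
    best = 0
    for u in range(1, size + 1):
        n_small = sum(1 for i in cluster if h * pvalues[i] <= u * alpha)
        best = max(best, 1 - u + n_small)
    return best / size

# Toy "brain" with six voxels: three strong signals and three nulls.
pvals = [0.001, 0.002, 0.003, 0.6, 0.7, 0.8]
bound_signal = tdp_lower_bound(pvals, [0, 1, 2])     # 1.0
bound_mixed = tdp_lower_bound(pvals, [0, 1, 2, 3])   # 0.75
bound_null = tdp_lower_bound(pvals, [3, 4, 5])       # 0.0
```

The `cluster` argument can be any set of voxel indices, chosen in any way, which is the point of the last item in the list: the bound is computed from all p-values jointly and holds simultaneously for every possible selection.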

The last item has quite surprising implications. It means that ARI provides statistical guarantees even when selecting clusters and estimating TDP from the same data, and particularly if clusters are selected using random-field-theory significance tests. It also means that one may create sub-clusters within significant clusters, and estimate TDP again, without losing error guarantees(!). It also means that one may select clusters using the TDP itself, and if unsatisfied with the results, re-select clusters using whichever criterion, ad infinitum.

Here is an example of TDP bounds in sub-clusters, within the originally selected clusters:

How does this re-selection not invalidate error guarantees? Put differently, how does ARI deal with this infinite circularity? The fundamental idea is similar to Scheffé’s method in post-hoc inference. The idea is to provide statistical guarantees on the TDP for all possible cluster selections. This means that any cluster a practitioner may create has already been accounted for by the ARI algorithm.

Providing valid lower TDP bounds is clearly not the only task at hand. Indeed, bounding all TDPs at \(0\) satisfies the desired error guarantees, for all possible cluster selections. The real matter is power: are the TDP bounds provided by ARI tight, given the massive number of implied clusters being considered? The answer is that the TDP bounds are indeed tight, at least in large clusters where the spatial specificity paradox is indeed a concern.

Two fundamental ingredients allow ARI to provide informative TDP bounds, even after this infinitely circular inference. The first ingredient is that we do not consider all possible brain maps, but rather assume that the brain map is smooth enough. This smoothness is implied by assuming that the brain map satisfies the Simes inequality, which excludes extremely oscillatory brain maps that would require more conservative bounds. The Simes inequality is implied by the Positive Regression Dependence on Subsets condition, which is frequently used in brain imaging, since it is required for FDR control using the Benjamini-Hochberg algorithm.
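For reference, the Simes inequality is usually stated as follows. The notation is ours and is not taken from the paper: let \(T\) be the set of truly inactive voxels, and let \(p_{(1:T)} \le \dots \le p_{(|T|:T)}\) be their ordered p-values. The assumption is then that

\[
P\left( \bigcup_{i=1}^{|T|} \left\{ p_{(i:T)} \le \frac{i \alpha}{|T|} \right\} \right) \le \alpha .
\]

In words: among the inactive voxels, the chance that the \(i\)-th smallest p-value falls below \(i\alpha/|T|\) for some \(i\) is at most \(\alpha\).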

The second ingredient is that the TDP bounds are provided by inverting a closed testing procedure, which is a powerful algorithm for multiple testing correction.

The combination of a closed testing procedure with a smooth-enough random field implies that the true TDP cannot be too far from the observed TDP, so that it may be bounded in a way that is both informative and statistically valid.

The full details can be found in our recent contribution, now accepted to Neuroimage [3].

[1] Goeman, Jelle, et al. “Simultaneous Control of All False Discovery Proportions in Large-Scale Multiple Hypothesis Testing.” arXiv preprint arXiv:1611.06739 (2016).

[2] Goeman, Jelle J., and Aldo Solari. “Multiple testing for exploratory research.” Statistical Science 26.4 (2011): 584-597.

[3] Rosenblatt, J. D., Finos, L., Weeda, W. D., Solari, A., & Goeman, J. J. (2018). All-Resolutions Inference for brain imaging. NeuroImage. https://doi.org/10.1016/j.neuroimage.2018.07.060

Ranting on MVPA

2017-09-03T00:00:00+00:00

The use of MVPA for signal detection/localization in neuroimaging has troubled me for a long time. Somehow the community refuses to acknowledge that for the purpose of localization, multivariate tests (e.g. Hotelling’s \(T^2\)) are preferable. Why are multivariate tests preferable to accuracy tests?

They are more powerful.
They are easier to interpret.
They are easier to implement.
Because they are not cross-validated:
1. They are computationally faster.
2. They do not suffer biases in the cross-validation scheme.
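To give a sense of how little code a multivariate test requires, here is a minimal sketch of a two-sample Hotelling \(T^2\) test with a permutation p-value. The function names, the simulated data, and the choice of a permutation null are our illustrative assumptions; it also assumes more observations than features, so that the pooled covariance is invertible.

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 statistic for samples x (n1 x p) and y (n2 x p)."""
    n1, n2 = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((n1 - 1) * np.atleast_2d(np.cov(x, rowvar=False))
              + (n2 - 1) * np.atleast_2d(np.cov(y, rowvar=False))) / (n1 + n2 - 2)
    return (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(pooled, diff)

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """P-value of T^2 under random relabelling of the two conditions."""
    rng = np.random.default_rng(seed)
    observed = hotelling_t2(x, y)
    stacked = np.vstack([x, y])
    n1 = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(stacked))
        exceed += hotelling_t2(stacked[perm[:n1]], stacked[perm[n1:]]) >= observed
    return (1 + exceed) / (1 + n_perm)

# Simulated region of interest: 30 trials per condition, 3 voxels, shifted mean.
rng = np.random.default_rng(1)
cond_a = rng.normal(size=(30, 3))
cond_b = rng.normal(size=(30, 3)) + 1.5
t2 = hotelling_t2(cond_a, cond_b)
p_value = permutation_pvalue(cond_a, cond_b)
```

No classifier is trained and no folds are defined, so there is no cross-validation scheme to tune or to bias the result.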

I read and referee papers where authors go to great lengths to interpret their “funky” results. To them I say: Your cross validation scheme is biased and your test statistic is leaving power on the table! Please consult a statistician and replace your MVPA with a multivariate test. For a more “scientific explanation” read [1] and [2].

If you justify the use of the prediction accuracy because it is also an effect-size, then please acknowledge that effect size is a different problem than localization and read the multivariate effect size literature (e.g. [3]).

When would I really want to use the prediction accuracy as a test statistic? When doing actual decoding and not localization, such as brain-computer interfaces.

[1] Rosenblatt, Jonathan, Roee Gilron, and Roy Mukamel. “Better-Than-Chance Classification for Signal Detection.” arXiv preprint arXiv:1608.08873 (2016).

[2] Gilron, Roee, et al. “What’s in a pattern? Examining the type of signal multivariate analysis uncovers at the group level.” NeuroImage 146 (2017): 113-120.

[3] Olejnik, Stephen, and James Algina. “Measures of effect size for comparative studies: Applications, interpretations, and limitations.” Contemporary educational psychology 25.3 (2000): 241-286.