# Benchmarking BLAS&LAPACK implementation for use with R, Python and Julia

Guest post by Guy Barel.

# Why model measurements with complex numbers?

I recently developed a mild obsession with understanding complex numbers. It all started while working with Tirza Routenberg and Neta Zimerman on the analysis of seismic array data. The array processing community will usually model measurements as complex numbers. This approach is so natural in the signal processing community that the canonical reference for array processing, [vanTrees2002], never even stops to explain why. Being the statistician that I am, it puzzled me: the measurements are the instantaneous compression of soil, so why would you want to represent that with a complex number?

# Domain Adaptation for Environmental Monitoring

Environmental monitoring from satellite imagery essentially means that instead of directly measuring pollution (for instance), you predict it from satellite imagery. When an epidemiologist controls for ambient temperature, you can be pretty sure that such an indirect measurement of temperature is involved.

# MultiNav: Navigating Multivariate Data

In statistical process control (SPC), a.k.a. anomaly detection, one compares some statistic to its “in control” distribution. Many statistical and BI software suites (e.g. Tableau, PowerBI) can do SPC. Almost all of these, however, focus on univariate processes, which are easier to visualize and discuss than multivariate processes. The purpose of our MultiNav package is to ingest multivariate data streams, compute multivariate summaries, and allow us to query the data interactively when an anomaly is detected.
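To make "comparing a statistic to its in-control distribution" concrete, here is a minimal univariate sketch in Python. It is not the MultiNav API: the function names, the three-standard-deviation rule, and the toy data are all assumptions made for illustration, and the multivariate case the package targets is not covered.

```python
from statistics import mean, stdev


def control_limits(in_control, k=3.0):
    """Estimate lower/upper limits from data assumed to be in control.

    Assumption: limits are mean +/- k sample standard deviations.
    """
    center = mean(in_control)
    spread = stdev(in_control)
    return center - k * spread, center + k * spread


def flag_anomalies(stream, limits):
    """Return the indices of observations that fall outside the limits."""
    lower, upper = limits
    return [i for i, x in enumerate(stream) if not lower <= x <= upper]


# Hypothetical in-control history and a new stream to monitor.
in_control = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]
limits = control_limits(in_control)
new_stream = [10.0, 10.3, 12.5, 9.7, 7.0]
print(flag_anomalies(new_stream, limits))
```

With several variables, the per-observation value would be replaced by a multivariate summary, but the comparison against an in-control reference stays the same.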

# On the Harmonic-Mean of p-values

The Harmonic-Mean p-value (HMP), as the name suggests, is the harmonic mean of p-values. It can replace other aggregations, such as Fisher’s combination, and be used as a test statistic for signal detection, a.k.a. global null testing. It is an elegant test statistic: it is easy to compute, and its null distribution can be easily derived for independent tests. It is thus a useful tool for signal detection and meta-analysis.
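As a small illustration of how easy the statistic is to compute, here is a sketch in Python. The function name and the example p-values are made up for this illustration, and only the unweighted harmonic mean is shown. The result is a test statistic that still has to be compared against its null distribution, which this sketch does not compute.

```python
from typing import Sequence


def harmonic_mean_p(p_values: Sequence[float]) -> float:
    """Return the unweighted harmonic mean of a collection of p-values."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # Harmonic mean: number of tests divided by the sum of reciprocals.
    return len(p_values) / sum(1.0 / p for p in p_values)


# Hypothetical p-values from four independent tests.
p_values = [0.01, 0.20, 0.50, 0.80]
hmp = harmonic_mean_p(p_values)
print(f"harmonic mean of p-values: {hmp:.4f}")
```

Note how the harmonic mean is dominated by the smallest p-values: a single small p-value pulls the aggregate down far more than it would pull an arithmetic mean.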

# Better-Than-Chance Classification for Signal Detection

In 2012 my friend Roee Gilron told me about a popular workflow for detecting activation in the brain: fit a classifier, then use a permutation test to check if its cross-validated accuracy is better than chance level. “That can’t be right,” I said. “So much power is left on the table!” “Can you show it to me?” Roee replied. Well, I did. And 7 years later, our results have been published in Biostatistics.

# Web-Technologies for Interactive Multivariate Monitoring System

I am glad to announce our latest contribution, just published in Computers and Electronics in Agriculture [1]. The described system will soon be available as an R package.

# Gaussian Markov Random Fields versus Linear Mixed Models for Spatial-Data

It was Prof. Itai Kloog who introduced us to the problem of pollution assessment. In short: pollution is assessed from satellite imagery, calibrated from ground monitoring stations. In statistical parlance: we predict pollution using image quality (AOD) as a predictor.

# The Spatial Specificity Paradox in brain imaging, remedied with valid, infinitely-circular, inference

The most prevalent mode of inference in brain imaging is inference on supra-threshold clusters, with random field theory providing error guarantees. This introduces a spatial specificity paradox: the larger the detected cluster, the less we know about the exact location of the activation. This is because the null hypothesis tested is “no activation in the whole cluster”, so the alternative is “at least one voxel is active in the cluster”. This observation is not new, but merely an implication of the random field assumption.

# Ranting on MVPA

The use of MVPA for signal detection/localization in neuroimaging has troubled me for a long time. Somehow the community refuses to acknowledge that for the purpose of localization, multivariate tests (e.g. Hotelling’s $$T^2$$) are preferable. Why are multivariate tests preferable to accuracy tests?

# A surprising result on the power of the t-test

In our recent contribution [1], just published in The American Statistician, we revisit the power analysis of the t-test.

# Sampling as an Epidemic Process

Respondent driven sampling (RDS) is an approach to sampling design and analysis that uses chain-referral over the networks of social relationships connecting members of the target population. It is especially useful when sampling stigmatized groups, such as injection drug users, sex workers, and men who have sex with men. In our latest contribution, just published in Biometrics, Yakir Berchenko, Simon Frost and I take a look at RDS and cast the sampling as a stochastic epidemic. This view allows us to analyze RDS using the likelihood framework, which was previously impossible. In particular, it allows us to debias population prevalence estimates, and to estimate the population size! The likelihood framework also allows us to add Bayesian regularization, or to debias risk estimates à la AIC or cross-validation, none of which was possible without the sampling distribution.

# Intro to dimensionality reduction

I gave a guest lecture on dimensionality reduction at Amir Geva’s “Clustering and Unsupervised Computer Learning” graduate course. I tried to give a quick overview of major dimensionality reduction algorithms. In particular, I like to present algorithms via the problem they aim to solve, and not via how they solve it.

# What is a pattern? MVPA cast as a hypothesis test

In our recent contribution [1], just published in Neuroimage, we cast the popular Multi-Voxel Pattern Analysis framework (MVPA) in terms of hypothesis testing. We do so because MVPA is typically used for signal localization, i.e., the detection of “information encoding” regions.

# Almost-embarrassingly-parallel algorithms for machine learning

Most machine learning algorithms are optimization problems. If they are not, they can often be cast as such. Optimization problems are notoriously hard to distribute. That is why machine learning from distributed BigData databases is so challenging.

# Interactive Plotting with R

Efrat is an MSc student in my group. She works on integrating advanced Multivariate Process Control capabilities in interactive dashboards. During her work she acquired impressive expertise in interactive plotting with R and D3JS.

# Quality Engineering Class Notes

Now that I am a member of the Industrial Engineering Dept. at Ben Gurion University, I am naturally looking into statistical aspects of Industrial Engineering, and process control in particular. This being the case, I started teaching Quality Engineering. While preparing the course, I read the classical introductory literature and felt it failed to convey the beauty of the field, by focusing on too many little details. I thus went ahead and wrote my own book, which can be found online.

# Disambiguating Bayesian Statistics

The term “Bayesian Statistics” is mentioned in any introductory statistics course and appears in countless papers and books, in many contexts and with many meanings. Since it carries different meanings for different authors, I will try to suggest several different interpretations I have encountered.

# ICML 2015

This week I attended the ICML 2015 conference in Lille, France. Here are some impressions…

# Analyzing your data in the Amazon Cloud with R

If you want to use R and: your data does not fit on your hard disk, or you want to do some ad-hoc distributed computations, or you need 256 GB of RAM for fitting your model, or you want your data to be accessible from anywhere in the world, or you heard about “AWS” and want to know how it may help your statistical needs… Then it is time to remind you of an old post of mine explaining how to set up an environment for data analysis with R in the AWS cloud.