Jekyll2019-06-09T10:09:03+00:00http://johnros.github.io/feed.xmlJonathan D. RosenblattStats, R, and possibly beach volley.Web-Technologies for Interactive Multivariate Monitoring System2019-05-15T00:00:00+00:002019-05-15T00:00:00+00:00http://johnros.github.io/dendrometers<p>I am glad to announce our latest contribution, just published in Computers and Electronics in Agriculture [1].
The described system will soon be available as an R package.</p>
<p>Think of rain as Irrigation 1.0; sprinklers as Irrigation 2.0; <a href="https://en.wikipedia.org/wiki/Drip_irrigation">drippers</a> as 3.0.
In Irrigation 4.0, farmers irrigate upon demand.
To assess a plant’s “demand” for water, several technologies are used.
All of these require assuring the quality of the data that enters the irrigation algorithms.
The scale of modern fields is such that data quality needs to be assured automatically, not manually.</p>
<p>Any modern BI system will allow data screening using If-Then rules.
These rules may filter “technical” anomalies, but they cannot capture “statistical” anomalies such as a sick plant or an over-reactive sensor.</p>
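To make the distinction concrete, here is a minimal Python sketch (the sensor names and thresholds are made up): a range-based If-Then rule flags a physically impossible reading, but says nothing about a sensor that misbehaves while staying within range.

```python
# Hypothetical readings (stem-girth, in arbitrary units) from four sensors.
readings = {"s1": 12.1, "s2": 11.8, "s3": 55.0, "s4": 12.0}

def rule_filter(value, lo=0.0, hi=40.0):
    """Flag 'technical' anomalies: readings outside a plausible physical range."""
    return not (lo <= value <= hi)

flagged = {s: v for s, v in readings.items() if rule_filter(v)}
print(flagged)  # only s3 is caught; a drifting-but-in-range sensor would pass
```

A sensor that drifts, or a sick plant whose readings stay in range, sails through such rules; detecting those requires the statistical machinery described next.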
<p>To help growers assure data quality for demand-based irrigation, we teamed up with <a href="https://www.phytech.com/">Phytech</a>, a manufacturer of <a href="https://en.wikipedia.org/wiki/Dendrometry">dendrometers</a>, to develop a system for the detection of anomalous sensors.</p>
<p><img src="../images/dendrometers.jpg" alt="Dendrometers (top) with an example of their raw readings of the plant's girth (bottom)." /></p>
<p>The building blocks of our system:</p>
<ol>
<li>Measure the plants’ health via a network of dendrometers.</li>
<li>Faulty sensors are detected using various anomaly-detection algorithms, borrowing ideas from robust multivariate statistics and social-network analysis.</li>
<li>Once an anomaly has been detected, the sensor can be interrogated in any browser, using web technologies; <a href="https://d3js.org/">D3.js</a> in particular.</li>
</ol>
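As a rough illustration of the second building block (not the paper's actual algorithms), one can score sensors by a Mahalanobis-type distance from the bulk. In the Python sketch below the data are simulated, and a robust fit such as the Minimum Covariance Determinant would replace the plain sample moments used here:

```python
import numpy as np

# Simulated sketch: score each sensor by its (squared) Mahalanobis distance
# from the cloud of sensors' daily summary features. A robust covariance
# estimate (e.g. MCD) would replace the sample mean/covariance in practice.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 sensors, 3 summary features each
X[0] += 8.0                            # plant one grossly anomalous sensor

mu = X.mean(axis=0)
prec = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", X - mu, prec, X - mu)  # squared distances

print(int(np.argmax(d2)))  # the planted anomaly scores highest: 0
```

Sensors with extreme scores are then surfaced to the dashboard for visual interrogation.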
<p><img src="../images/dendrometers-dashboard.jpg" alt="A view of various anomaly-scoring algorithms (bottom), with an interactive functional boxplot to view the raw readings of a selected sensor (top)." /></p>
<hr />
<p>[1] Vilenski, Efrat, Peter Bak, and Jonathan D. Rosenblatt. “Multivariate anomaly detection for ensuring data quality of dendrometer sensor networks.” Computers and Electronics in Agriculture 162 (2019): 412-421.</p>Gaussian Markov Random Fields versus Linear Mixed Models for Spatial-Data2019-03-04T00:00:00+00:002019-03-04T00:00:00+00:00http://johnros.github.io/pm25<p>It was <a href="http://in.bgu.ac.il/en/humsos/geog/Pages/staff/kloog.aspx">Prof. Itai Kloog</a> who introduced us to the problem of pollution assessment.
In short: pollution is assessed from satellite imagery, calibrated from ground monitoring stations.
In statistical parlance: we predict pollution using image quality (<a href="https://earthobservatory.nasa.gov/global-maps/MODAL2_M_AER_OD">AOD</a>) as a predictor.</p>
<p>How should one fit a predictive model, given the spatial correlation in the model’s errors?</p>
<p>First approach: <strong>A Gaussian Random Field</strong>.
In particular, with an assumed <a href="https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function">Matern Covariance Function</a>.
This is a standard approach in spatial statistics.
Predictions that account for covariance in errors are known by statisticians as <a href="https://en.wikipedia.org/wiki/Best_linear_unbiased_prediction">BLUPs</a>, by applied mathematicians as Linear Projection Operators, by Geostatisticians as <a href="https://en.wikipedia.org/wiki/Kriging">Kriging</a>, and by Numerical Analysts as <a href="https://en.wikipedia.org/wiki/Radial_basis_function">Radial Basis Functions Interpolator</a>.
A desirable property of BLUPs is that predictions vary smoothly in space, as one would expect.
A downside of this technique is that the (inverse) covariance matrix required for predictions has dimension of the order of the number of ground monitoring stations.
When calibrating the predictor at a continental scale, with thousands of stations, fitting such a predictor may take days.</p>
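A minimal kriging sketch in Python (station locations, values, and covariance parameters are toy numbers): the BLUP at a new location is a weighted combination of station residuals, with weights obtained by solving against the station-to-station covariance matrix. The n-by-n solve is exactly what becomes expensive with thousands of stations.

```python
import numpy as np

def k(h, sigma2=1.0, rho=1.0):
    """Matern covariance with smoothness 1/2, i.e. the exponential kernel."""
    return sigma2 * np.exp(-np.abs(h) / rho)

s = np.array([0.0, 1.0, 2.5])     # monitoring-station locations (toy, 1-D)
y = np.array([0.3, 0.9, 0.4])     # observed residuals at the stations
s0 = 1.2                          # prediction location

K = k(s[:, None] - s[None, :])    # station-to-station covariance (n x n)
k0 = k(s0 - s)                    # station-to-target covariance
w = np.linalg.solve(K, k0)        # kriging weights: K^{-1} k0
y_hat = w @ y                     # the BLUP at s0

print(round(float(y_hat), 3))
```

Incidentally, with the exponential kernel in one dimension the underlying process is Markov, so the far station at 0.0 receives essentially zero weight (it is "screened" by the station at 1.0); this screening structure is precisely what Gaussian Markov random fields exploit.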
<p>Second approach: a <strong>Linear Mixed Model</strong> (LMM).
This is the currently dominant approach in pollution prediction.
The idea is to fit region-wise random effects, thus capturing spatial correlations in prediction errors.
The upside: the implied error correlation is sparse.
This allows the use of algorithms tailored for sparse matrices [1], making fitting and prediction much faster.
The downside: prediction surfaces have a “slab” structure, instead of varying smoothly in space.
This is not an “aesthetics” argument, but rather a very practical one:
slab surfaces allow very different predictions for spatially adjacent stations. This is an undesirable property of the modeling approach.</p>
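The sparsity claim can be seen in a few lines of Python (region assignments and variances are toy numbers, not the paper's model): region-wise random effects imply a block-diagonal error covariance, so stations correlate only within their region, and two adjacent stations straddling a region boundary are treated as independent.

```python
import numpy as np

# Toy LMM structure: 5 stations, 2 regions, a random intercept per region.
region = np.array([0, 0, 1, 1, 1])       # region label per station
Z = np.eye(2)[region]                    # random-effect design matrix (5 x 2)
tau2, sigma2 = 1.0, 0.25                 # random-effect / noise variances

V = tau2 * Z @ Z.T + sigma2 * np.eye(5)  # implied error covariance

print(V[1, 2])  # 0.0: stations 2 and 3 may be adjacent, yet uncorrelated
```

The zero cross-region covariance is what produces the "slab" prediction surface.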
<p><img src="../images/sp_re.jpg" alt="The "slab" prediction surface of the LMM (bottom) vs. the smooth prediction surface of the GMRF (top)" /></p>
<p>Can the benefits of smooth predictions be married with the fast computations afforded by sparse matrices?
The answer is affirmative, via <a href="https://en.wikipedia.org/wiki/Markov_random_field#Gaussian">Gaussian Markov Random Fields</a>.
The fundamental observation is that Gaussian Markov fields have sparse precision matrices, so they are easy to compute with.
The Integrated Nested Laplace Approximation (INLA) of [2] does just that: it approximates a Matern Gaussian field with a Markov Gaussian field.
This framework is accompanied by an excellent R implementation: <a href="http://www.r-inla.org/">R-INLA</a>.</p>
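The key fact, sketched here in Python on the simplest possible example, is that a Gaussian Markov process has a sparse precision matrix even when its covariance is dense. A stationary AR(1) chain (the one-dimensional Markov case) has the fully dense covariance phi^|i-j|, yet a tridiagonal precision; INLA exploits the analogous sparsity in two dimensions.

```python
import numpy as np

# Stationary AR(1) with coefficient phi: dense covariance, sparse precision.
n, phi = 6, 0.8
idx = np.arange(n)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi**2)  # dense
Q = np.linalg.inv(Sigma)                                           # precision

# The precision is tridiagonal: 3n - 2 = 16 nonzeros, rather than n^2 = 36.
print(np.count_nonzero(np.abs(Q) > 1e-9))
```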
<p>Using the INLA approximation, we were able to fit a (Markov) Gaussian random field to our data, and return smooth pollution prediction surfaces within the hour.
The covariance implied by the INLA approximation of the Matern field is presented in the following figure.
Notice the sparsity, which facilitates computations.</p>
<p><img src="../images/prec.jpg" alt="Error covariance implied by LMM vs. GMRF" /></p>
<p>When verifying the statistical errors in our prediction surfaces, we find they are not only computationally feasible, but also more accurate than the dominant LMM approach.
The cross-validated accuracy, as a function of extrapolation distance, is given in the following figure: <img src="../images/h_res.jpg" alt="Test error as function of extrapolation distance (h). GMRF dominates LMM for all distances." /></p>
<p>The full details can be found in the paper [3].</p>
<hr />
<p>[1] Davis, Timothy A. Direct methods for sparse linear systems. Vol. 2. Siam, 2006.</p>
<p>[2] Rue, Havard, Sara Martino, and Nicolas Chopin. “Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations.” Journal of the royal statistical society: Series b (statistical methodology) 71.2 (2009): 319-392.</p>
<p>[3] Sarafian, Ron, et al. “Gaussian Markov Random Fields versus Linear Mixed Models for satellite-based PM2.5 assessment: Evidence from the Northeastern USA.” Atmospheric Environment (2019).</p>The Spatial Specificity Paradox in brain imaging, remedied with valid, infinitely-circular, inference2018-07-27T00:00:00+00:002018-07-27T00:00:00+00:00http://johnros.github.io/cherry-brain<p>The most prevalent mode of inference in brain imaging is inference on supra-threshold clusters, with random field theory providing error guarantees.
This introduces a <em>spatial specificity paradox</em>. The larger the detected cluster, the less we know about the exact location of the activation. This is because the null hypothesis tested is “no activation in the whole cluster”, so the alternative is “at least one voxel in the cluster is active”. This observation is not new, but merely an implication of the random field assumption.</p>
<p>To deal with this paradox, we propose to estimate the amount of active voxels in the cluster. If this proportion is large, then there is no real paradox. If this proportion is small, we may want to “drill down” to sub-clusters. This introduces a circularity problem, a.k.a. selective inference: we are not making inference on arbitrary voxels, but rather, voxels that belong to statistically significant clusters.</p>
<p>In the spirit of the FDR, we call the proportion of active voxels in a cluster the <em>True Discovery Proportion</em> (TDP). We use recent results from the multiple testing literature, namely the <em>All-Resolutions Inference</em> (ARI) framework of Goeman and others [1,2], to estimate the TDP in selected clusters.</p>
<p>The ARI framework has many benefits for our purpose:</p>
<ol>
<li>It takes voxel-wise p-values and returns TDP lower bounds.</li>
<li>The algorithm is very fast, as implemented in the <a href="https://cran.r-project.org/package=hommel">hommel</a> R package.</li>
<li>For brain imaging, we wrote a wrapper package, <a href="https://cran.r-project.org/package=ARIbrain">ARIbrain</a>.</li>
<li>The TDP bounds using ARI come with statistical error guarantees. Namely, with probability <script type="math/tex">1-\alpha</script> (over repeated experiments), no cluster will have an overestimated TDP.</li>
<li>The above guarantee applies no matter how clusters have been selected.</li>
</ol>
<p>The last item has quite surprising implications.
It means that ARI provides statistical guarantees when selecting clusters and estimating TDP from the same data, in particular when clusters are selected using random-field-theory significance tests.
It also means that one may create sub-clusters within significant clusters, and estimate TDP again, without losing error guarantees(!).
It also means that one may select clusters using the TDP itself and, if unsatisfied with the results, re-select clusters using whichever criterion, ad infinitum.</p>
<p>Here is an example of TDP bounds in sub-clusters, within the originally selected clusters: <img src="../images/gonogo_perc4bis.png" alt="here" />.</p>
<p>How does this re-selection not invalidate error guarantees?
Put differently, how does ARI deal with this <strong>infinite circularity</strong>?
The fundamental idea is similar to <a href="https://en.wikipedia.org/wiki/Scheff%C3%A9%27s_method">Scheffe’s method</a> in post-hoc inference.
The idea is to provide statistical guarantees on TDP to <strong>all possible cluster selections</strong>.
This means that any cluster a practitioner may create, has already been accounted for by the ARI algorithm.</p>
<p>Providing valid lower TDP bounds is clearly not the only task at hand.
Indeed, bounding all TDPs at <script type="math/tex">0</script> satisfies the desired error guarantees, for all possible cluster selections.
The real matter is power: are the TDP bounds provided by ARI tight, given the massive number of implied clusters being considered?
The answer is that the TDP bounds are indeed tight, at least in large clusters, where the spatial specificity paradox is a real concern.</p>
<p>Two fundamental ingredients allow ARI to provide informative TDP bounds, even after this infinitely circular inference.
The first ingredient is that we do not consider all possible brain maps; rather, we assume that the brain map is <strong>smooth enough</strong>.
This smoothness is implied by assuming that the brain map satisfies the <em>Simes Inequality</em>, which excludes extremely oscillatory brain maps that would require more conservative bounds.
The Simes inequality is implied by the <em>Positive Regression Dependence on Subsets</em> condition, which is frequently used in brain imaging, since it is required for <a href="https://en.wikipedia.org/wiki/False_discovery_rate">FDR control</a> using the <a href="https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure">Benjamini-Hochberg</a> algorithm.</p>
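For concreteness, here is the single-set ingredient, the Simes test, as a Python sketch with made-up p-values (ARI builds on this inequality via closed testing; this is not the full ARI algorithm): reject the global null over a set of voxels if any sorted p-value satisfies p_(i) &lt;= i * alpha / n.

```python
def simes_reject(pvals, alpha=0.05):
    """Simes global-null test: reject if some sorted p_(i) <= i * alpha / n."""
    p = sorted(pvals)
    n = len(p)
    return any(p[i] <= (i + 1) * alpha / n for i in range(n))

print(simes_reject([0.004, 0.03, 0.5, 0.9]))  # True: 0.004 <= 1 * 0.05 / 4
```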
<p>The second ingredient, is that the TDP bounds are provided by inverting a <a href="https://en.wikipedia.org/wiki/Closed_testing_procedure">closed testing procedure</a>, which is a powerful algorithm for multiple testing correction.</p>
<p>The compounding of a closed testing procedure in a smooth-enough random field, implies that the true TDP cannot be too far from the observed TDP, so that it may be bound while being both informative, and statistically valid.</p>
<p>The full details can be found in our recent contribution, now accepted to Neuroimage [3].</p>
<hr />
<p>[1] Goeman, Jelle, et al. “Simultaneous Control of All False Discovery Proportions in Large-Scale Multiple Hypothesis Testing.” arXiv preprint arXiv:1611.06739 (2016).</p>
<p>[2] Goeman, Jelle J., and Aldo Solari. “Multiple testing for exploratory research.” Statistical Science 26.4 (2011): 584-597.</p>
<p>[3] Rosenblatt, J. D., Finos, L., Weeda, W. D., Solari, A., & Goeman, J. J. (2018). All-Resolutions Inference for brain imaging. NeuroImage. https://doi.org/10.1016/j.neuroimage.2018.07.060</p>Ranting on MVPA2017-09-03T00:00:00+00:002017-09-03T00:00:00+00:00http://johnros.github.io/mvpa-rant<p>The use of MVPA for signal detection/localization in neuroimaging has troubled me for a long time.
Somehow the community refuses to acknowledge that for the purpose of localization, multivariate tests (e.g. Hotelling’s <script type="math/tex">T^2</script>) are preferable.
Why are multivariate tests preferable to accuracy tests?</p>
<ol>
<li>They are more powerful.</li>
<li>They are easier to interpret.</li>
<li>They are easier to implement.</li>
<li>Because they are not cross-validated:
<ol>
<li>They are computationally faster.</li>
<li>They do not suffer from biases in the cross-validation scheme.</li>
</ol>
</li>
</ol>
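As a concrete instance of such a multivariate test, here is a one-sample Hotelling's T² sketch in Python on toy data (not an MVPA pipeline), testing whether the mean vector is zero; no cross-validation is involved.

```python
import numpy as np
from scipy import stats

def hotelling_t2(X):
    """One-sample Hotelling T^2 test of H0: mean vector = 0."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    t2 = n * xbar @ np.linalg.solve(S, xbar)
    f = (n - p) / (p * (n - 1)) * t2          # F(p, n-p) under H0
    return t2, stats.f.sf(f, p, n - p)

rng = np.random.default_rng(1)
X = rng.normal(loc=0.5, size=(40, 3))         # a true mean shift of 0.5
t2, pval = hotelling_t2(X)
print(pval < 0.05)  # True: the shift is detected, with no cross-validation
```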
<p>I read and referee papers where authors go to great lengths to interpret their “funky” results.
To them I say:
Your cross validation scheme is biased and your test statistic is leaving power on the table!
Please consult a statistician and replace your MVPA with a multivariate test.
For a more “scientific explanation” read [1] and [2].</p>
<p>If you justify the use of the prediction accuracy because it is also an effect-size, then please acknowledge that <em>effect size</em> is a different problem than <em>localization</em> and read the multivariate effect size literature (e.g. [3]).</p>
<p>When would I really want to use the prediction accuracy as a test statistic?
When doing actual decoding and not localization, such as brain-computer interfaces.</p>
<hr />
<p>[1] Rosenblatt, Jonathan, Roee Gilron, and Roy Mukamel. “Better-Than-Chance Classification for Signal Detection.” arXiv preprint arXiv:1608.08873 (2016).</p>
<p>[2] Gilron, Roee, et al. “What’s in a pattern? Examining the type of signal multivariate analysis uncovers at the group level.” NeuroImage 146 (2017): 113-120.</p>
<p>[3] Olejnik, Stephen, and James Algina. “Measures of effect size for comparative studies: Applications, interpretations, and limitations.” Contemporary educational psychology 25.3 (2000): 241-286.</p>A surprising result on the power of the t-test2017-08-30T00:00:00+00:002017-08-30T00:00:00+00:00http://johnros.github.io/wilcoxon-power<p>In our recent contribution [1], just published in <a href="http://amstat.tandfonline.com/doi/full/10.1080/00031305.2017.1360795">The American Statistician</a>, we revisit the power analysis of the t-test.</p>
<p>The fundamental observation is that the t-test has been proposed, and studied, as a detector of <em>shift alternatives</em>.
By shift alternatives, a statistician means that if two populations differ, then they differ by their mean.
Put differently, it is assumed the factor of interest has the effect of <em>shifting</em> a distribution.
For many phenomena, however, we would not expect an effect of shift-type.
Consider a clinical trial: if we expect a drug to affect only part of the population, we are no longer looking for a shift alternative, but rather, a <em>mixture alternative</em>.</p>
<p>We show that for mixture alternatives, much of the folklore on the t-test no longer holds (nor should it).
We show that Wilcoxon’s signed-rank test may be more powerful than the t-test under a Gaussian null.
This is because Wilcoxon’s signed-rank test may capture the asymmetry in the mixture before the t-test captures the change in the mean.</p>
<p>This has interesting implications.
A practitioner may be willing to give up some power and opt for Wilcoxon’s test, so as not to assume Gaussianity.
Our results show that it is possible that this practitioner did not lose any power, but in fact, has gained some.
Not because the null is non-Gaussian, but rather, because the alternative is of mixture-type.</p>
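A small Python simulation sketches the setting (the sample size, mixture fraction, and effect size are toy parameters, not the paper's): under a mixture alternative only a fraction of subjects respond, and we compare the empirical rejection rates of the t-test and Wilcoxon's signed-rank test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, eps, mu, reps = 50, 0.2, 2.0, 300   # toy parameters

t_rej = w_rej = 0
for _ in range(reps):
    responds = rng.random(n) < eps             # which subjects respond
    x = rng.normal(size=n) + mu * responds     # mixture alternative
    t_rej += stats.ttest_1samp(x, 0.0).pvalue < 0.05
    w_rej += stats.wilcoxon(x).pvalue < 0.05

print(t_rej / reps, w_rej / reps)              # empirical power of each test
```

Which test wins depends on the mixture fraction and effect size; the point of the paper is that the rank test can come out ahead even under a Gaussian null.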
<p>After so much has been said on the t-test, I am rather proud that we can still innovate on the matter.</p>
<p>[1] Rosenblatt, Jonathan D., and Yoav Benjamini. “On Mixture Alternatives and Wilcoxon’s Signed-Rank Test.” The American Statistician, August 1, 2017. doi:10.1080/00031305.2017.1360795.</p>Sampling as an Epidemic Process2017-02-22T00:00:00+00:002017-02-22T00:00:00+00:00http://johnros.github.io/rds<p><em>Respondent driven sampling</em> (RDS) is an approach to sampling design and analysis which utilizes the networks of social relationships that connect members of the target population, using chain-referral.
It is especially useful when sampling stigmatized groups, such as injection drug users, sex workers, and men who have sex with men, etc.
In our latest contribution, just published in <a href="http://onlinelibrary.wiley.com/doi/10.1111/biom.12678/abstract">Biometrics</a>, <a href="https://scholar.google.co.il/citations?user=U3ykKLQAAAAJ&hl=en">Yakir Berchenko</a>, <a href="http://www.infectiousdisease.cam.ac.uk/directory/sdf22@cam.ac.uk">Simon Frost</a>, and I take a look at RDS and cast the sampling as a <strong>stochastic epidemic</strong>.
This view allows us to analyze RDS using the likelihood framework, which was previously impossible.
In particular, this allows us to debias population prevalence estimates, and estimate the population size!
The likelihood framework also allows us to add Bayesian regularization, and to debias risk estimates a-la AIC or cross-validation, both previously impossible without a sampling distribution.</p>
<p>I particularly like this project, because it is a real end-to-end statistical challenge with nice theory, computational considerations, and a deliverable R package:</p>
<ul>
<li>
<p>A widely applicable problem:
sampling in hidden populations is both very important and a real challenge to classical sampling techniques.
RDS is also a potential tool to analyze “Facebook samples”, which are becoming more prevalent.</p>
</li>
<li>
<p>The theory:
viewing the sampling as a stochastic epidemic, an idea due to Yakir, allows us to link the sampling literature to the vast corpus of knowledge on epidemics, software reliability, and counting processes.</p>
</li>
<li>
<p>A computational challenge:
The likelihood function implied by the stochastic epidemic is, essentially, a <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">stochastic differential equation</a>.
The counting-processes literature allowed us to state the likelihood directly, observe that it is separable, and solve the maximum-likelihood problem efficiently.</p>
</li>
<li>
<p>An R package:
Our RDS estimator, with the numerical “tricks” above, is implemented in the <strong>chords</strong> package, available from <a href="https://CRAN.R-project.org/package=chords">CRAN</a>.</p>
</li>
</ul>Intro to dimensionality reduction2017-01-02T00:00:00+00:002017-01-02T00:00:00+00:00http://johnros.github.io/intro-to-dim-reduce<p>Gave a guest lecture on dimensionality reduction at <a href="http://www.ee.bgu.ac.il/~geva/">Amir Geva’s</a> “Clustering and Unsupervised Computer Learning” graduate course.
I tried to give a quick overview of major dimensionality reduction algorithms.
In particular, I like to present algorithms via the problem they are <strong>aimed</strong> to solve, and not via <strong>how</strong> they solve it.</p>
<p><a href="https://github.com/johnros/dim_reduce/blob/master/dim_reduce.pdf">Class notes may be found here</a>.</p>What is a pattern? MVPA cast as a hypothesis test2016-11-30T00:00:00+00:002016-11-30T00:00:00+00:00http://johnros.github.io/what-is-a-pattern<p>In our recent contribution [1], just published in <a href="http://www.sciencedirect.com/science/article/pii/S1053811916306401">Neuroimage</a>, we cast the popular Multi-Voxel Pattern Analysis (MVPA) framework in terms of hypothesis testing.
We do so because MVPA is typically used for signal localization, i.e., the detection of “information encoding” regions.</p>
<p>Our major conclusion is that <strong>group MVPA tests a qualitatively different hypothesis than that tested in univariate analysis</strong>.
We show that in regions detected with MVPA, subjects may have actually responded very differently.<br />
In particular, an “information encoding” region may be one where some subjects show an <strong>increase</strong> in blood oxygenation (BOLD), while others a <strong>decrease</strong>.</p>
<p>This is a surprising result, since it means that the shift from analyzing one voxel at a time to several voxels at a time also entailed a re-definition of “what is an activation?”.
In particular, the MVPA definition of activation is much harder to interpret biologically.</p>
<p>Clearly, the choice of the null and alternative, i.e., the definition of signal, is case dependent, and should be left to the neuroscientist’s best judgement.
It is our hope, that our observation will facilitate such an informed choice.</p>
<p>En passant, we observe that recurring patterns between subjects imply that activation patterns are <strong>asymmetrically distributed</strong> about the null.
Following this observation, we call upon the statistical literature to offer several measures of multivariate symmetry.
These allow the researcher to quantify the degree of multivariate “agreement” between subjects, instead of committing a-priori to a particular notion of “agreement” to be tested.</p>
<p>[1] Gilron, Roee, Jonathan Rosenblatt, Oluwasanmi Koyejo, Russell A. Poldrack, and Roy Mukamel. “What’s in a Pattern? Examining the Type of Signal Multivariate Analysis Uncovers at the Group Level.” NeuroImage 146 (February 1, 2017): 113–20.</p>Almost-embarrassingly-parallel algorithms for machine learning2016-06-12T00:00:00+00:002016-06-12T00:00:00+00:00http://johnros.github.io/parallelized-learning<p>Most machine learning algorithms are optimization problems.
If they are not, they can often be cast as such.
Optimization problems are notoriously hard to distribute.
That is why machine learning from distributed BigData databases is so challenging.</p>
<p>If data is distributed along observations (and not variables), one simple algorithm is to learn your favorite model using the data on each machine, and then aggregate over machines.
If your favorite model is in a finite-dimensional parametric class, you can even aggregate by simple averaging over machines.</p>
<p>This averaging approach is known as <em>divide and conquer</em>, <em>one-shot averaging</em>, and <em>embarrassingly parallel learning</em>, among other names.
It is attractive because of its low communication requirements and its simplicity of implementation.
Indeed, it can be implemented over any distributed abstraction layer such as Spark, Hadoop, Condor, SGE, and more.
It can also be implemented on top of most popular distributed databases such as Amazon-Redshift and HP-Vertica.
It also covers a wide range of learning algorithms, such as Ordinary Least Squares, Generalized Linear Models, and Linear SVM.</p>
<p>In our latest contribution, just <a href="http://imaiai.oxfordjournals.org/content/early/2016/06/09/imaiai.iaw013.abstract?keytype=ref&ijkey=TbndI5rIDAxDEzz">published in Information and Inference, a Journal of the IMA</a>, we perform a statistical analysis of the error of such an algorithm and compare it with a non-distributed (centralized) solution.</p>
<p>Our findings can be summarized as follows:
When there are many more observations per machine than parameters to estimate, there is no (first-order) accuracy loss in distributing the data.
When the number of observations is not much greater than the number of parameters, there is indeed an accuracy loss. This loss is greater for non-linear models than for linear ones.</p>
<p>If it is unclear why accuracy is lost when averaging, think of linear regression.
The (squared) risk minimizer is <script type="math/tex">\beta^*=\Sigma^{-1} \alpha</script>, where <script type="math/tex">\Sigma= E[x x']</script> and <script type="math/tex">\alpha=E[x y]</script>.
The empirical risk minimizer, <script type="math/tex">\hat{\beta}=(X'X)^{-1} X'y</script>, is merely its empirical equivalent.
If rows of the <script type="math/tex">X</script> matrix are distributed over machines, which do not communicate, then instead of the full <script type="math/tex">(X'X)^{-1}</script> we can only compute machine-wise estimates.
It turns out that even in this simple linear regression problem, aggregating the various machine-wise <script type="math/tex">\hat{\beta}</script>’s, e.g., by averaging, is less accurate than computing <script type="math/tex">\hat{\beta}</script> with the whole data.</p>
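The split-and-average idea can be sketched in a few lines of Python on simulated data (machine counts and dimensions are toy numbers): each machine fits OLS on its own shard, the aggregator averages the coefficient vectors, and we compare against the centralized full-data fit.

```python
import numpy as np

# Simulated regression data, with rows sharded over 10 "machines".
rng = np.random.default_rng(2)
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(3000, 3))
y = X @ beta + rng.normal(size=3000)

machines = np.array_split(np.arange(3000), 10)
local = [np.linalg.lstsq(X[m], y[m], rcond=None)[0] for m in machines]
beta_avg = np.mean(local, axis=0)                 # one-shot average
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]  # centralized fit

# With many observations per machine relative to p, the two agree to
# first order; the gap is a higher-order term.
print(np.max(np.abs(beta_avg - beta_full)) < 0.05)  # True
```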
<p>The statistical analysis of the split-and-average algorithm has several implications:
It informs the practitioner which algorithms can be safely computed in parallel, and which need more attention.
Put differently: no learning algorithm is truly <strong>embarrassingly parallel</strong>, but some are <strong>almost embarrassingly parallel</strong>.</p>
<p>Equipped with guarantees on the learning error, one can apply our results to compute the required number of machines that achieves a given error.
Since increasing the number of machines increases the error, but decreases the learning time, our results can also be seen as a <strong>learning accuracy-complexity curve</strong>.
Finally, the error decomposition for split-and-average algorithms also implies a Gaussian limit. Our results can thus be used also for inference and model selection.</p>
<p>To prove our results we mostly used the classical asymptotic statistics of <a href="https://en.wikipedia.org/wiki/Lucien_Le_Cam">Lucien Le Cam</a> and <a href="https://en.wikipedia.org/wiki/Peter_J._Huber">Peter Huber</a>.
We take particular pride in the use of classical statistical theory to analyze cutting-edge learning algorithms for BigData.</p>Multivariate difference between male and female brain2016-02-18T00:00:00+00:002016-02-18T00:00:00+00:00http://johnros.github.io/genders-and-brains<p>In their recent, high-impact, <a href="http://www.pnas.org/content/112/50/15468.abstract">PNAS publication</a>, a Tel Aviv University research group led by <a href="http://people.socsci.tau.ac.il/mu/daphnajoel/">Prof. Daphna Joel</a> claims that no difference exists between male and female brains.
This was a very high profile study as can be seen by the mentions in
<a href="https://www.newscientist.com/article/dn28582-scans-prove-theres-no-such-thing-as-a-male-or-female-brain/">The New Scientists</a>,
<a href="https://www.theguardian.com/science/2015/dec/01/brain-sex-many-ways-to-be-male-and-female">TheGuardian</a>,
<a href="http://medicalxpress.com/news/2015-11-male-female-brain-valid-distinction.html">MedicalPress</a>,
<a href="http://www.israelscienceinfo.com/en/medecine/femmes-et-sciences-pour-luniversite-de-tel-aviv-les-cerveaux-feminins-et-masculins-sont-un-patchwork-de-caracteristiques/">IsraelScienceInfo</a>,
<a href="http://www.dailymail.co.uk/sciencetech/article-3340123/Male-vs-female-brain-Not-valid-distinction-study-says.html">DailyMail</a>,
<a href="http://www.jpost.com/Business-and-Innovation/Health-and-Science/TAU-neuroscientists-Brains-are-not-gendered-435882">TheJerusalemPost</a>,
<a href="http://www.cbc.ca/news/technology/brain-sex-differences-1.3344954">CBCNews</a>, and many more.</p>
<p>This publication contradicts much of the corpus of knowledge on brains and gender, and thus took the scientific community by surprise. How can this be?</p>
<p>In short and as put by Carl Sagan:
“<strong>Absence of evidence is not evidence of absence</strong>”.</p>
<p>Indeed, by performing many univariate analyses, the authors show that males and females do not show any particular pattern in brain structure, at least as recorded by MRI scans.
It is, however, quite possible for two multivariate data sets to be nicely separated, but not so in any of the “raw” univariate measurements.
The following figure is a toy example of a dataset which cannot be separated by any single (raw) variable, but certainly can when considering two variables simultaneously.</p>
<p><img src="../images/overlap.png" alt="Multivariate separability" /></p>
<p>I suspect this is what happened in the case of “Sex Beyond the Genitalia”. When I <a href="http://www.pnas.org/content/early/2016/03/15/1523961113.full?sid=71a90a9a-ec35-45a3-a11a-63d0fc116fa9">reanalyzed the same data</a>, the <strong>multivariate</strong> brain structures of males and females were different enough that the gender could be inferred from the MRI data alone, with <script type="math/tex">\sim 80\%</script> accuracy(!).</p>
<p>It also seems I was not the only one troubled by Joel et al.’s findings. See
<a href="http://www.pnas.org/content/early/2016/03/15/1525534113.full?sid=71a90a9a-ec35-45a3-a11a-63d0fc116fa9">Del Giudice et al.’s</a> comment, as well as
<a href="http://www.pnas.org/content/early/2016/03/15/1523888113.full">Chekroud et al.’s</a>,
<a href="http://www.pnas.org/content/early/2016/03/07/1524418113.extract">Marek Glazerman’s</a>, and
<a href="https://www.psychologytoday.com/blog/sexual-personalities/201512/statistical-abracadabra-making-sex-differences-disappear">David Schmidt’s</a>.</p>
<p>In <a href="http://www.pnas.org/content/early/2016/03/15/1600792113.full?sid=71a90a9a-ec35-45a3-a11a-63d0fc116fa9#ref-8">their reply to the critics</a>, Joel et al. no longer insist that
“<em>human brains do not belong to one of two distinct categories: male brain/female brain</em>”, but rather soften the claim:
“<em>it is unclear what the biological meaning of the new space is and in what sense brains that seem close in this space are more similar than brains that seem distant</em>”.</p>
<p>I agree. For the purpose of <strong>interpreting</strong> the dimensions in which males and females differ, some feature selection can be introduced.
I will leave that for future neuroimaging research.</p>
<p><strong>Edit</strong>(19.3.2016):
Here is the code that generated the above figure:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">mvtnorm</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">gridExtra</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1e3</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">999</span><span class="p">)</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rmvnorm</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="o">*</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="o">=</span><span class="n">plogis</span><span class="p">(</span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">as.factor</span><span class="w">
</span><span class="n">xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="m">.1</span><span class="o">=</span><span class="n">X</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">x</span><span class="m">.2</span><span class="o">=</span><span class="n">X</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">empty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">plot.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.grid.major</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">panel.grid.minor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.border</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span><span class="c1"># scatterplot of x and y variables</span><span class="w">
</span><span class="n">scatter</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">xy</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="m">.2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"orange"</span><span class="p">,</span><span class="w">
</span><span class="s2">"purple"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">legend.justification</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="c1"># marginal density of x - plot on top</span><span class="w">
</span><span class="n">plot_top</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">xy</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="s2">"purple"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="w">
</span><span class="c1"># marginal density of y - plot on the right</span><span class="w">
</span><span class="n">plot_right</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">xy</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="m">.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_fill_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="s2">"purple"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="w">
</span><span class="n">grid.arrange</span><span class="p">(</span><span class="n">plot_top</span><span class="p">,</span><span class="w"> </span><span class="n">empty</span><span class="p">,</span><span class="w"> </span><span class="n">scatter</span><span class="p">,</span><span class="w"> </span><span class="n">plot_right</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">widths</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w">
</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">heights</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>