Jekyll2017-08-07T17:07:18+00:00http://johnros.github.io/johnros.github.comStats, R, and possibly beach volley.
A surprising result on the power of the t-test2017-08-30T00:00:00+00:002017-08-30T00:00:00+00:00http://johnros.github.io/wilcoxon-power<p>In our recent contribution [1], just published in <a href="http://amstat.tandfonline.com/doi/full/10.1080/00031305.2017.1360795">The American Statistician</a>, we revisit the power analysis of the t-test.</p>
<p>The fundamental observation is that the t-test has been proposed, and studied, as a detector of <em>shift alternatives</em>.
By shift alternatives, a statistician means that if two populations differ, then they differ by their mean.
Put differently, it is assumed the factor of interest has the effect of <em>shifting</em> a distribution.
For many phenomena, however, we would not expect an effect of shift-type.
Consider a clinical trial: if we expect a drug to affect only part of the population, we are no longer looking for a shift alternative, but rather, a <em>mixture alternative</em>.</p>
<p>We show that, for mixture alternatives, much of the folklore on the t-test no longer holds (nor should it).
We show that Wilcoxon’s signed-rank test may be more powerful than a t-test under a Gaussian null.
This is because Wilcoxon’s signed-rank test may capture the asymmetry in the mixture before the t-test captures the change in mean.</p>
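<p>A small Monte Carlo sketch can make this claim concrete. The code below is not taken from the paper: the sample size, the fraction of responders (<code>prop</code>), the effect among responders (<code>shift</code>), and the number of replications are all illustrative assumptions. It draws observations from a standard Gaussian in which a random fraction is shifted, and estimates the power of the one-sided t-test and of Wilcoxon’s signed-rank test at the 5% level.</p>

```r
# Monte Carlo power comparison under a mixture alternative.
# All numeric settings below are illustrative assumptions, not values from the paper.
set.seed(1)
n     <- 20    # sample size
prop  <- 0.1   # fraction of the population that responds
shift <- 3     # effect among responders, in standard deviations
n_rep <- 2000  # Monte Carlo replications
alpha <- 0.05  # significance level

one_replication <- function() {
  # Each observation responds with probability `prop`; responders are shifted.
  responder <- rbinom(n, size = 1, prob = prop)
  x <- rnorm(n, mean = responder * shift, sd = 1)
  c(t_test   = t.test(x, mu = 0, alternative = "greater")$p.value < alpha,
    wilcoxon = wilcox.test(x, mu = 0, alternative = "greater")$p.value < alpha)
}

# Rejection rate of each test over the replications = estimated power.
power <- rowMeans(replicate(n_rep, one_replication()))
print(power)
```

<p>Which test comes out ahead depends on the chosen values; varying <code>prop</code> and <code>shift</code> is a quick way to explore the regimes discussed in the paper.</p>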
<p>This has interesting implications.
A practitioner may be willing to sacrifice some power and opt for Wilcoxon’s test because they are unwilling to assume Gaussianity.
Our results show that this practitioner may not have lost any power, but may in fact have gained some.
Not because the null is non-Gaussian, but rather, because the alternative is of mixture-type.</p>
<p>After so much has been said about the t-test, I am rather proud that we can still innovate on the matter.</p>
<p>[1] Rosenblatt, Jonathan D., and Yoav Benjamini. “On Mixture Alternatives and Wilcoxon’s Signed-Rank Test.” The American Statistician, August 1, 2017. doi:10.1080/00031305.2017.1360795.</p>
Sampling as an Epidemic Process2017-02-22T00:00:00+00:002017-02-22T00:00:00+00:00http://johnros.github.io/rds<p><em>Respondent driven sampling</em> (RDS) is an approach to sampling design and analysis which utilizes the networks of social relationships that connect members of the target population, using chain-referral.
It is especially useful when sampling stigmatized groups, such as injection drug users, sex workers, and men who have sex with men.
In our latest contribution, just published in <a href="http://onlinelibrary.wiley.com/doi/10.1111/biom.12678/abstract">Biometrics</a>, <a href="https://scholar.google.co.il/citations?user=U3ykKLQAAAAJ&hl=en">Yakir Berchenko</a>, <a href="http://www.infectiousdisease.cam.ac.uk/directory/sdf22@cam.ac.uk">Simon Frost</a>, and I take a look at RDS and cast the sampling as a <strong>stochastic epidemic</strong>.
This view allows us to analyze RDS using the likelihood framework, which was previously impossible.
In particular, this allows us to debias population prevalence estimates, and estimate the population size!
The likelihood framework also allows us to add Bayesian regularization, and to debias risk estimates à la AIC or cross-validation; none of this was possible without the sampling distribution.</p>
<p>I particularly like this project, because it is a real end-to-end statistical challenge with nice theory, computational considerations, and a deliverable R package:</p>
<ul>
<li>
<p>A widely applicable problem:
sampling in hidden populations is both very important, and a real challenge to classical sampling techniques.
RDS is also a potential tool to analyze “Facebook-samples”, which are becoming more prevalent.</p>
</li>
<li>
<p>The theory:
viewing the sampling as a stochastic epidemic, an idea due to Yakir, allows us to link the sampling literature to the vast corpus of knowledge on epidemics, software reliability, and counting processes.</p>
</li>
<li>
<p>A computational challenge:
The likelihood function implied by the stochastic epidemic is, essentially, a <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">stochastic differential equation</a>.
The counting processes literature allowed us to state the likelihood directly, observe it is separable, and solve the maximum-likelihood problem efficiently.</p>
</li>
<li>
<p>An R package:
Our RDS estimator, with the numerical “tricks” above, is implemented in the <strong>chords</strong> package, available from <a href="https://CRAN.R-project.org/package=chords">CRAN</a>.</p>
</li>
</ul>
Intro to dimensionality reduction2017-01-02T00:00:00+00:002017-01-02T00:00:00+00:00http://johnros.github.io/intro-to-dim-reduce<p>I gave a guest lecture on dimensionality reduction at <a href="http://www.ee.bgu.ac.il/~geva/">Amir Geva’s</a> “Clustering and Unsupervised Computer Learning” graduate course.
I tried to give a quick overview of major dimensionality reduction algorithms.
In particular, I like to present algorithms via the problem they <strong>aim</strong> to solve, and not via <strong>how</strong> they solve it.</p>
<p><a href="https://github.com/johnros/dim_reduce/blob/master/dim_reduce.pdf">Class notes may be found here</a>.</p>
What is a pattern? MVPA cast as a hypothesis test2016-11-30T00:00:00+00:002016-11-30T00:00:00+00:00http://johnros.github.io/what-is-a-pattern<p>In our recent contribution [1], just published in <a href="http://www.sciencedirect.com/science/article/pii/S1053811916306401">NeuroImage</a>, we cast the popular Multi-Voxel Pattern Analysis (MVPA) framework in terms of hypothesis testing.
We do so because MVPA is typically used for signal localization, i.e., the detection of “information encoding” regions.</p>
<p>Our major conclusion is that <strong>group MVPA tests a qualitatively different hypothesis than that tested in univariate analysis</strong>.
We show that in regions detected with MVPA, subjects may actually have responded very differently.<br />
In particular, an “information encoding” region may be one where some subjects show an <strong>increase</strong> in blood oxygenation (BOLD), while others a <strong>decrease</strong>.</p>
<p>This is a surprising result, since it means that the shift from analyzing one voxel at a time to analyzing several voxels at a time also entailed a redefinition of “what is an activation?”.
In particular, the MVPA definition of activation is such that it is much harder to interpret biologically.</p>
<p>Clearly, the choice of the null and alternative, i.e., the definition of signal, is case dependent, and should be left to the neuroscientist’s best judgement.
It is our hope that our observation will facilitate such an informed choice.</p>
<p>En passant, we observe that recurring patterns between subjects imply that activation patterns are <strong>asymmetrically distributed</strong> about the null.
Following this observation, we call upon the statistical literature to offer several measures of multivariate symmetry.
These allow the researcher to quantify the degree of multivariate “agreement” between subjects, instead of committing a-priori to a particular notion of “agreement” to be tested.</p>
<p>[1] Gilron, Roee, Jonathan Rosenblatt, Oluwasanmi Koyejo, Russell A. Poldrack, and Roy Mukamel. “What’s in a Pattern? Examining the Type of Signal Multivariate Analysis Uncovers at the Group Level.” NeuroImage 146 (February 1, 2017): 113–20.</p>
Almost-embarrassingly-parallel algorithms for machine learning2016-06-12T00:00:00+00:002016-06-12T00:00:00+00:00http://johnros.github.io/parallelized-learning<p>Most machine learning algorithms are optimization problems.
If they are not, they can often be cast as such.
Optimization problems are notoriously hard to distribute.
That is why machine learning from distributed BigData databases is so challenging.</p>
<p>If data is distributed along observations (and not variables), one simple algorithm is to learn your favorite model using the data on each machine, and then aggregate over machines.
If your favorite model is in a finite-dimensional parametric class, you can even aggregate by simple averaging over machines.</p>
<p>This averaging approach is known as <em>divide and conquer</em>, <em>one-shot averaging</em>, and <em>embarrassingly parallel learning</em>, among others.
It is attractive because of its low communication requirements and ease of implementation.
Indeed, it can be implemented over any distributed abstraction layer such as Spark, Hadoop, Condor, SGE, and more.
It can also be implemented on top of most popular distributed databases such as Amazon-Redshift and HP-Vertica.
It also covers a wide range of learning algorithms, such as Ordinary Least Squares, Generalized Linear Models, and Linear SVM.</p>
<p>In our latest contribution, just <a href="http://imaiai.oxfordjournals.org/content/early/2016/06/09/imaiai.iaw013.abstract?keytype=ref&ijkey=TbndI5rIDAxDEzz">published in Information and Inference, a Journal of the IMA</a>, we perform a statistical analysis of the error of such an algorithm and compare it with a non-distributed (centralized) solution.</p>
<p>Our findings can be summarized as follows:
When there are many more observations per machine than parameters to estimate, there is no (first order) accuracy loss in distributing the data.
When the number of observations per machine is not much greater than the number of parameters, there is indeed an accuracy loss. This loss is greater for non-linear models than for linear ones.</p>
<p>If it is unclear why accuracy is lost when averaging, think of linear regression.
The (squared) risk minimizer is <script type="math/tex">\beta^*=\Sigma^{-1} \alpha</script>, where <script type="math/tex">\Sigma= E[x x']</script> and <script type="math/tex">\alpha=E[x y]</script>.
The empirical risk minimizer, <script type="math/tex">\hat{\beta}=(X'X)^{-1} X'y</script>, is merely its empirical equivalent.
If rows of the <script type="math/tex">X</script> matrix are distributed over machines, which do not communicate, then instead of the full <script type="math/tex">(X'X)^{-1}</script> we can only compute machine-wise estimates.
It turns out that even in this simple linear regression problem, aggregating the various machine-wise <script type="math/tex">\hat{\beta}</script>, e.g., by averaging, is less accurate than computing <script type="math/tex">\hat{\beta}</script> with the whole data.</p>
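<p>The linear regression argument above can be checked with a short simulation. This is only a sketch under assumed settings, not the analysis from the paper: the Gaussian design, the number of machines <code>m</code>, the number of observations per machine <code>n_per</code>, and the number of parameters <code>p</code> are illustrative choices, picked so that each machine holds only a few times more observations than parameters.</p>

```r
# Split-and-average OLS versus centralized OLS: squared estimation error.
# All settings are illustrative assumptions.
set.seed(1)
p     <- 10            # number of parameters
m     <- 20            # number of machines
n_per <- 30            # observations per machine (not much larger than p)
N     <- m * n_per     # total number of observations
beta  <- rep(1, p)     # true coefficients
n_rep <- 200           # Monte Carlo replications

one_replication <- function() {
  X <- matrix(rnorm(N * p), nrow = N, ncol = p)
  y <- drop(X %*% beta) + rnorm(N)
  machine <- rep(seq_len(m), each = n_per)  # rows are split across machines

  # Centralized solution: OLS on the whole data.
  beta_central <- lm.fit(X, y)$coefficients

  # Distributed solution: OLS on each machine, then a simple average.
  beta_machines <- sapply(seq_len(m), function(j) {
    rows <- machine == j
    lm.fit(X[rows, , drop = FALSE], y[rows])$coefficients
  })
  beta_avg <- rowMeans(beta_machines)

  c(centralized = sum((beta_central - beta)^2),
    averaged    = sum((beta_avg - beta)^2))
}

# Mean squared estimation error of each approach.
mse <- rowMeans(replicate(n_rep, one_replication()))
print(mse)
```

<p>Increasing <code>n_per</code> while holding <code>p</code> fixed should shrink the gap between the two errors, in line with the first finding above.</p>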
<p>The statistical analysis of the split-and-average algorithm has several implications:
It informs the practitioner which algorithms can be safely computed in parallel, and which need more attention.
Put differently: no learning algorithm is truly <strong>embarrassingly-parallel</strong>, but some are <strong>almost-embarrassingly-parallel</strong>.</p>
<p>Equipped with guarantees on the learning error, one can apply our results to compute the required number of machines that achieves a given error.
Since increasing the number of machines increases the error but decreases the learning time, our results can also be seen as a <strong>learning accuracy-complexity curve</strong>.
Finally, the error decomposition for split-and-average algorithms also implies a Gaussian limit. Our results can thus be used also for inference and model selection.</p>
<p>To prove our results we mostly used the classical asymptotic statistics of <a href="https://en.wikipedia.org/wiki/Lucien_Le_Cam">Lucien Le Cam</a> and <a href="https://en.wikipedia.org/wiki/Peter_J._Huber">Peter Huber</a>.
We take particular pride in the use of classical statistical theory to analyze cutting-edge learning algorithms for BigData.</p>
Multivariate difference between male and female brain2016-02-18T00:00:00+00:002016-02-18T00:00:00+00:00http://johnros.github.io/genders-and-brains<p>In their recent, high-impact <a href="http://www.pnas.org/content/112/50/15468.abstract">PNAS publication</a>, a Tel Aviv University research group led by <a href="http://people.socsci.tau.ac.il/mu/daphnajoel/">Prof. Daphna Joel</a> claims that no difference exists between male and female brains.
This was a very high-profile study, as can be seen from the mentions in
<a href="https://www.newscientist.com/article/dn28582-scans-prove-theres-no-such-thing-as-a-male-or-female-brain/">New Scientist</a>,
<a href="https://www.theguardian.com/science/2015/dec/01/brain-sex-many-ways-to-be-male-and-female">The Guardian</a>,
<a href="http://medicalxpress.com/news/2015-11-male-female-brain-valid-distinction.html">Medical Xpress</a>,
<a href="http://www.israelscienceinfo.com/en/medecine/femmes-et-sciences-pour-luniversite-de-tel-aviv-les-cerveaux-feminins-et-masculins-sont-un-patchwork-de-caracteristiques/">IsraelScienceInfo</a>,
<a href="http://www.dailymail.co.uk/sciencetech/article-3340123/Male-vs-female-brain-Not-valid-distinction-study-says.html">DailyMail</a>,
<a href="http://www.jpost.com/Business-and-Innovation/Health-and-Science/TAU-neuroscientists-Brains-are-not-gendered-435882">TheJerusalemPost</a>,
<a href="http://www.cbc.ca/news/technology/brain-sex-differences-1.3344954">CBCNews</a>, and many more.</p>
<p>This publication contradicts much of the corpus of knowledge on brains and gender, and thus took the scientific community by surprise. How can this be?</p>
<p>In short, and as put by Carl Sagan:
“<strong>Absence of evidence is not evidence of absence</strong>”.</p>
<p>Indeed, by performing many univariate analyses, the authors show that males and females do not show any particular pattern in the brain’s structure, at least as recorded by MRI scans.
It is, however, quite possible for two multivariate data sets to be nicely separated, but not so in any of the “raw” univariate measurements.
The following figure is a toy example of a dataset which cannot be separated by any single (raw) variable, but certainly can when considering two variables simultaneously.</p>
<p><img src="../images/overlap.png" alt="Multivariate separability" /></p>
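<p>The point made by the toy example can also be put in numbers. The sketch below simulates data in the same spirit as the figure, but it is an illustration under assumptions of its own: it uses base R’s <code>rnorm</code>, so the draws differ from those in the figure, and the classifiers are simple sign rules chosen for brevity (the helper <code>accuracy</code> is hypothetical, not part of any package). It compares how well the class label is recovered from each variable alone and from the two variables together.</p>

```r
# Univariate versus bivariate separability on simulated toy data.
# The threshold rules below are an illustrative choice, not a fitted model.
set.seed(999)
n    <- 1e3
X    <- matrix(rnorm(2 * n), ncol = 2)  # two independent standard Gaussians
beta <- 10 * c(1, 1)
y    <- rbinom(n, size = 1, prob = plogis(drop(X %*% beta)))

# Fraction of observations whose label agrees with the sign of a score.
accuracy <- function(score) mean((score > 0) == (y == 1))

acc <- c(x1_only = accuracy(X[, 1]),            # first variable alone
         x2_only = accuracy(X[, 2]),            # second variable alone
         both    = accuracy(X[, 1] + X[, 2]))   # both variables together
print(round(acc, 2))
```

<p>Each single variable should classify markedly worse than the pair, even though the two classes overlap heavily in both marginal densities.</p>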
<p>I suspect this is what happened in the case of “Sex Beyond the Genitalia”. When I <a href="http://www.pnas.org/content/early/2016/03/15/1523961113.full?sid=71a90a9a-ec35-45a3-a11a-63d0fc116fa9">reanalyzed the same data</a>, the <strong>multivariate</strong> brain structures of males and females were different enough that gender could be inferred from the MRI data alone, with <script type="math/tex">\sim 80\%</script> accuracy (!).</p>
<p>It also seems I was not the only one troubled by Joel et al.’s findings. Here are the comments by
<a href="http://www.pnas.org/content/early/2016/03/15/1525534113.full?sid=71a90a9a-ec35-45a3-a11a-63d0fc116fa9">Del Giudice et al.</a>,
<a href="http://www.pnas.org/content/early/2016/03/15/1523888113.full">Chekroud et al.</a>,
<a href="http://www.pnas.org/content/early/2016/03/07/1524418113.extract">Marek Glezerman</a>, and
<a href="https://www.psychologytoday.com/blog/sexual-personalities/201512/statistical-abracadabra-making-sex-differences-disappear">David Schmitt</a>.</p>
<p>In <a href="http://www.pnas.org/content/early/2016/03/15/1600792113.full?sid=71a90a9a-ec35-45a3-a11a-63d0fc116fa9#ref-8">their reply to the critics</a>, Joel et al. no longer insist that
“<em>human brains do not belong to one of two distinct categories: male brain/female brain</em>”, but rather soften their claims:
“<em>it is unclear what the biological meaning of the new space is and in what sense brains that seem close in this space are more similar than brains that seem distant</em>”.</p>
<p>I agree. For the purposes of <strong>interpreting</strong> the dimensions in which males and females differ, some feature selection can be introduced.
I will leave that for future neuroimaging research.</p>
<p><strong>Edit</strong>(19.3.2016):
Here is the code that generated the above figure:</p>
<div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">mvtnorm</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">gridExtra</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1e3</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">999</span><span class="p">)</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rmvnorm</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="o">*</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="o">=</span><span class="n">plogis</span><span class="p">(</span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">as.factor</span><span class="w">
</span><span class="n">xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="m">.1</span><span class="o">=</span><span class="n">X</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">x</span><span class="m">.2</span><span class="o">=</span><span class="n">X</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">empty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">plot.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.grid.major</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">panel.grid.minor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.border</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span><span class="c1"># scatterplot of x and y variables
</span><span class="n">scatter</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">xy</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="m">.2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"orange"</span><span class="p">,</span><span class="w">
</span><span class="s2">"purple"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">legend.justification</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="c1"># marginal density of x - plot on top
</span><span class="n">plot_top</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">xy</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="s2">"purple"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="w">
</span><span class="c1"># marginal density of y - plot on the right
</span><span class="n">plot_right</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">xy</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="m">.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_fill_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="s2">"purple"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="w">
</span><span class="n">grid.arrange</span><span class="p">(</span><span class="n">plot_top</span><span class="p">,</span><span class="w"> </span><span class="n">empty</span><span class="p">,</span><span class="w"> </span><span class="n">scatter</span><span class="p">,</span><span class="w"> </span><span class="n">plot_right</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">widths</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w">
</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">heights</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span></code></pre>
</div>
Interactive Plotting with R2016-01-05T00:00:00+00:002016-01-05T00:00:00+00:00http://johnros.github.io/interactive-plot-r<p><a href="https://www.linkedin.com/in/efratvilenski">Efrat</a> is an MSc student in my group.
She works on integrating advanced Multivariate Process Control capabilities in interactive dashboards.
During her work she acquired impressive expertise in interactive plotting with R and <a href="http://d3js.org/">D3JS</a>.</p>
<p>Yesterday, 2016-3-01, she gave a workshop on the topic for <a href="http://www.statistics.org.il/">The Israeli Statistical Association</a>.
About 60 participants attended the Google Campus at Tel Aviv to hear about R, plotting, and JavaScript.</p>
<p>Her slides, code, and links can be found <a href="http://efratvil.github.io/R-Israel-Jan-2015/links.html">here</a>.</p>
Quality Engineering Class Notes2015-11-20T00:00:00+00:002015-11-20T00:00:00+00:00http://johnros.github.io/quality-engineering<p>Now that I am a member of the <a href="http://in.bgu.ac.il/engn/iem/Pages/default.aspx">Industrial Engineering Dept.</a> at Ben Gurion University, I am naturally looking into statistical aspects of Industrial Engineering, and process control in particular.
This being the case, I started teaching Quality Engineering.
While preparing the course, I read the classical introductory literature and I felt it failed to convey the beauty of the field, by focusing on too many little details.
I thus went ahead and wrote my own book, which can be found <a href="https://github.com/johnros/qualityEngineering/blob/master/Class_notes/notes.pdf">online</a>.</p>
<p>How does it differ from the existing literature?</p>
<ul>
<li>Being an introductory text, it covers a very wide scope of topics. The focus is on the underlying ideas and terminology, and details are given in the references.</li>
<li>Topics covered: History of quality engineering, exploratory data analysis, process control charts, design of experiments, acceptance sampling, and reliability.</li>
<li>The design of experiments and reliability chapters have much wider scopes than typically found in quality engineering textbooks.</li>
<li>I read many books and papers while researching the literature, and I tried to bring the most recent and clearest references on each topic.</li>
</ul>
<p>I hope readers will find my notes useful.
Being experimental, they may still contain mistakes. I would be very thankful to anyone who informs me of any mistakes found.</p>Disambiguating Bayesian Statistics2015-09-02T00:00:00+00:002015-09-02T00:00:00+00:00http://johnros.github.io/disambiguating-bayesian-statistics<p>The term “Bayesian Statistics” is mentioned in any introductory course to statistics and appears in countless papers and in books, in many contexts and with many meanings.
Since it carries different meanings for different authors, I will try to suggest several different interpretations I have encountered.</p>
<p>First, I will classify several possible generative models according to the following <script type="math/tex">4</script> attributes:</p>
<ol>
<li>Is there a probability on the parameter space (a “prior”)?</li>
<li>Is the prior subjective (epistemic, beliefs) or objective (physical)?</li>
<li>Is the prior parametric or non-parametric?</li>
<li>Is the prior simple or composite?</li>
</ol>
<p>We consider these attributes on some common modelling approaches:</p>
<ul>
<li>Neyman-Pearson frequentist inference has <em>no prior</em> on the parameter space.
<strong>Example</strong>: MLE for the mean of a normal population.</li>
<li>A pure/subjective/DeFinetti Bayesian has a <em>simple, subjective</em> prior. It may be parametric or not.
<strong>Example</strong>: <a href="https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">MAP</a> estimate of the mean of a normal population.</li>
<li>A Semi-Bayesian (a.k.a. pseudo-Bayesian, semi-Empirical-Bayesian) is essentially a frequentist with a <em>composite, parametric, objective</em> prior.
Sometimes referred to as Empirical-Bayesian [1], albeit not in its original historical sense [2].
<strong>Example</strong>: Type-II Maximum Likelihood [2] or Restricted-Maximum-Likelihood estimation of the variance components in a mixed effects model.</li>
<li>Empirical Bayesian in its original historical sense has an <em>objective, non-parametric prior</em>, specified up to its first two moments [2].
It differs from the (original) Semi-Bayesian, in that the prior is non-parametric.
<strong>Example</strong>: Sample coverage estimation: “What is the probability that the next sample will be of an unseen species?” [3].</li>
<li>A <em>parametric, composite, subjective</em> prior is something interesting.
It is hard to interpret since it implies that “my beliefs are sharp, but I don’t know what they are (yet?)”.
It is typically encountered as a mathematical regularization device.
<strong>Example</strong>: Ridge regression; the Gaussian prior on the coefficients can hardly be interpreted as a limiting frequency, thus it is subjective. The variance of the prior is unspecified, usually estimated using cross-validation, thus a composite prior.</li>
</ul>
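<p>To make the ridge example concrete, here is a minimal sketch for a single coefficient with no intercept. Under a Gaussian prior with variance <script type="math/tex">\tau^2</script> and Gaussian noise with variance <script type="math/tex">\sigma^2</script>, the MAP estimate is the ridge estimate with penalty <script type="math/tex">\lambda = \sigma^2/\tau^2</script>. The simulated data, the penalty grid, and the function names are all assumptions made for illustration.</p>

```python
import random


def ridge_slope(xs, ys, lam):
    """Ridge (equivalently, Gaussian-prior MAP) estimate of a single slope."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, folds=5):
    """Mean squared prediction error of ridge_slope under K-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for k in range(folds):
        test_idx = set(range(k, n, folds))  # every folds-th point is held out
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        beta = ridge_slope(train_x, train_y, lam)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in test_idx)
    return total / n


# Simulated data (an assumption): true slope 0.5, unit noise.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(40)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]

# Composite prior: the prior variance (equivalently lam) is not fixed in
# advance, but picked from a grid by cross-validation.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))
print(best_lam, ridge_slope(xs, ys, best_lam))
```

<p>A simple prior would fix <code>lam</code> before seeing the data; letting cross-validation choose it is what makes the prior composite.</p>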
<h1 id="critique">Critique</h1>
<p>Q: Epistemic probability? Is there really such a thing as epistemic probability? Is it not just the limiting frequency of events in our accumulative experience?</p>
<p>A: If it were epistemic and universal (objective), then the distinction might be only a matter for philosophers. The fact that it is personal (subjective) has real practical implications.</p>
<hr />
<p>Q: Objective probabilities?!? Except for non-parametric, I am always assuming the distribution of the population. Isn’t this subjective? Am I not subjectively assuming the differentiability of the CDF for density estimation? We are thus, all subjective Bayesians. Yes, even Fisher!</p>
<p>A: I would say one cannot approach data completely assumption-free, so saying we are all subjective Bayesians might be true, but non-informative. I feel different “philosophies” are useful to convey what kind of argument we are making with the data, namely, the answers to the 4 questions above. Is my answer true, but non-informative as well? :-)</p>
<hr />
<p>Q: If we acknowledge that “Empirical Bayes” has caught a new meaning since its initial introduction, why bother with history? Why not use the composite epistemic meaning only?</p>
<p>A: Because it seems there is no agreed-upon “new meaning”. Some will use it for composite-epistemic priors, but some will use it for physical ones. The lowest common denominator is the fact that the prior is composite and not simple (whether it be parametric or not).</p>
<h1 id="references">References</h1>
<p>[1] E.L. Lehmann and George Casella, Theory of Point Estimation, 2nd ed. (Springer, 1998).<br />
[2] I. J. Good, “Introduction to Robbins (1955) An empirical Bayes approach to statistics,” Breakthroughs in Statistics: Foundations and basic theory (1992): 379.<br />
[3] Bradley Efron and Ronald Thisted, “Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?,” Biometrika 63, no. 3 (December 1976): 435-447.<br />
[4] Stephen E. Fienberg, “When Did Bayesian Inference Become ‘Bayesian’?,” Bayesian Analysis 1, no. 1 (March 2006): 1-40, doi:10.1214/06-BA101.</p>ICML 20152015-07-07T00:00:00+00:002015-07-07T00:00:00+00:00http://johnros.github.io/icml-2015<p>This week I attended the <a href="http://icml.cc/2015/">ICML2015</a> conference in Lille, France.
Here are some impressions…</p>
<p><script type="math/tex">1,600</script> attendees.
About <script type="math/tex">270</script> presentations out of about <script type="math/tex">1,037</script> submitted.
A massive, very well organized event.
Needless to say that deep learning was the dominant topic by a wide margin, but since it is not quite my field of interest, it is not emphasized in the following highlights.</p>
<p>If there is a particular talk or slide deck you want that I did not link to, feel free to drop me a note.</p>
<h3 id="tutorials">Tutorials</h3>
<p>I attended the NLP tutorial by <a href="http://cs.stanford.edu/~pliang/">Percy Liang</a>,
the Structured Prediction tutorial by <a href="http://www.umiacs.umd.edu/~hal/">Hal Daume III</a> and <a href="http://hunch.net/~jl/">John Langford</a>,
and the Convex Optimization tutorial by <a href="http://www.cs.ubc.ca/~schmidtm/">Mark Schmidt</a> and <a href="http://www.maths.ed.ac.uk/~richtarik/">Peter Richtarik</a>.
All slides are available <a href="http://icml.cc/2015/?page_id=97">here</a>.</p>
<p>The NLP tutorial was very high level, and great for someone like me who lacks the basic terminology.</p>
<p>The Structured Prediction tutorial had a varying level.
Hal presented some basic approaches, and then focused on his own work on an efficient algorithm for structured prediction which does sequential labeling.
My own insight is that Hal’s algorithm can be seen as an optimization scheme for outcomes that have a tree-like graphical model (thus the sequential approach).
I may be very wrong on this one, as I am still struggling to define the borders between Structured Learning and MANOVA.
John gave an exposition of the <em>L2S</em> (learning to search) algorithm, which seemed very impressive, but I was clearly missing some background to fully understand the talk.</p>
<p>The convex optimization talks were brilliant!
A nice unified review from historical approaches (dating back to <a href="http://gallica.bnf.fr/ark:/12148/bpt6k2982c/f540.image.langEN">Cauchy</a>) to cutting edge approaches.
I think that Mark Schmidt’s <a href="http://icml.cc/2015/tutorials/2015_ICML_ConvexOptimization_I.pdf">slides</a> will definitely serve me to learn and to teach convex optimization in the future.</p>
<h3 id="keynote--leon-bottou">Keynote: Leon Bottou</h3>
<p>Leon gave an <a href="http://icml.cc/2015/invited/LeonBottouICML2015.pdf">inspiring talk</a> about the next big challenges in ML.
He pointed out two:</p>
<ul>
<li>Incorporating ML modules in <strong>big</strong> software systems requires a level of stability currently unavailable. There seems to be a lot to learn from other industries on abstraction and modularity of their building blocks. ML modules are currently “weak contracts” whereas large complicated systems require modules with “strong contracts”.</li>
<li>Even with all the cross validation, train-test, and other risk estimators, ML is still biased towards particular data sets. There is a lot to learn from other scientific disciplines on the design of data collection, and more generally, on “enriching our experimental repertoire, redefining our scientific processes, and still maintain our progress speed”.</li>
</ul>
<h3 id="faster-cover-trees">Faster Cover Trees</h3>
<p><a href="https://izbicki.me/">Mike Izbicki</a> presented his <a href="https://github.com/mikeizbicki/HLearn">Faster Cover Trees</a>, a data structure for fast queries (possibly also insertions) on pairwise distances.
Clearly, this is very useful for neighborhood-type algorithms such as KNN.
Interestingly, he came across his algorithm while trying to write a Haskell implementation of the classical <a href="https://en.wikipedia.org/wiki/Cover_tree">cover tree</a>.</p>
<h3 id="optimal-and-adaptive-algorithms-for-online-boosting">Optimal and Adaptive Algorithms for Online Boosting</h3>
<p>If you want an online variant of gradient boosting, read <a href="http://arxiv.org/abs/1502.02651">this paper</a>.
Beautifully presented by <a href="http://www.cs.princeton.edu/~haipengl/">Haipeng Luo</a>, and clearly deserving an ICML best paper award.</p>
<h3 id="low-rank-approximation-using-error-correcting-coding-matrices">Low Rank Approximation using Error Correcting Coding Matrices</h3>
<p>A nice presentation by Shashanka Ubaru about <a href="http://www.ece.umn.edu/users/arya/papers/LowRankICML.pdf">projections</a> using error correcting matrices that preserve geometry. The motivation is to reduce the computational cost of performing random projections à la <a href="https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma">Johnson-Lindenstrauss</a>.</p>
<h3 id="bayesian-and-empirical-bayesian-forests">Bayesian and Empirical Bayesian Forests</h3>
<p>A <a href="http://arxiv.org/abs/1502.02312">cool idea</a> presented by <a href="http://faculty.chicagobooth.edu/matt.taddy/">Matt Taddy</a>.
There is no need to re-learn the root of each tree at every computing node when <a href="">bagging</a> trees in parallel.
This can lead to great speedups since after each split, only small subsets of the data are considered for splitting. The deeper the common roots of the bagged forest, the larger the computational gain.
As a side comment, Taddy also mentioned that for very large data sets (on the scale of eBay), there is no benefit in the variable subsampling that distinguishes random forests from bagging.</p>
<h3 id="is-feature-selection-secure-against-training-data-poisoning">Is Feature Selection Secure against Training Data Poisoning?</h3>
<p><a href="https://pralab.diee.unica.it/en/BattistaBiggio">Battista Biggio</a> presented an interesting framework to analyze vulnerability of ML software modules to cyber attacks.
My take home message: data poisoning attacks are simply an adversarial introduction of outliers.
As with any other outliers, one can use robust methods as protection.</p>
<h3 id="privacy-for-free-posterior-sampling-and-stochastic-gradient-monte-carlo">Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo</h3>
<p>A great <a href="http://arxiv.org/abs/1502.07645">talk</a> presented by <a href="http://www.cs.cmu.edu/~yuxiangw/">Yu-Xiang Wang</a>.
The take home message:
Sampling from the posterior distribution both yields a good estimator and preserves differential privacy.</p>
<h3 id="rademacher-observations-private-data-and-boosting">Rademacher Observations, Private Data, and Boosting</h3>
<p><a href="http://giorgiopatrini.org/">Giorgio Patrini</a> presented the idea of <a href="http://arxiv.org/abs/1502.02322">Rademacher Observations</a> (“rados”).
At first I was not sure if rados are actually <a href="http://people.csail.mit.edu/dannyf/intro1.ppt">coresets</a>, or <a href="https://en.wikipedia.org/wiki/Random_projection">random projections</a>, which have already been proposed in the context of privacy preservation.
I then understood it is none of the above.
“Rados” are actually <em>sufficient statistics</em> computed on <em>subsets</em> of data.
As such, they are statistical aggregates that allow estimation, while preserving privacy.</p>
<h3 id="the-composition-theorem-for-differential-privacy">The Composition Theorem for Differential Privacy</h3>
<p><a href="http://web.engr.illinois.edu/~swoh/">Sewoong Oh</a> presented a calculus for computing the privacy vulnerability when performing multiple differentially private queries.
It bears a strong similarity to the accumulation of error in multiple testing.</p>
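<p>To illustrate how privacy loss accumulates over queries, here is a small sketch of the two textbook bounds: basic composition, where the <script type="math/tex">\varepsilon</script>’s simply add up (much like a Bonferroni union bound), and the advanced composition bound of Dwork, Rothblum and Vadhan. This is not the bound from the talk, which as far as I understand is sharper; the function names and the numbers are assumptions.</p>

```python
import math


def basic_composition(eps, k):
    """k queries, each eps-DP: the total privacy loss is at most k * eps."""
    return k * eps


def advanced_composition(eps, k, delta_slack):
    """Advanced composition: k eps-DP queries are (eps_total, delta_slack)-DP."""
    return (eps * math.sqrt(2 * k * math.log(1 / delta_slack))
            + k * eps * (math.exp(eps) - 1))


eps, k = 0.1, 100
print(basic_composition(eps, k))
print(advanced_composition(eps, k, delta_slack=1e-6))
```

<p>For many queries with a small per-query <script type="math/tex">\varepsilon</script>, the second bound grows roughly like <script type="math/tex">\sqrt{k}</script> instead of <script type="math/tex">k</script>, at the price of a small failure probability.</p>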
<h3 id="keynote-social-phenomena-in-global-networks">Keynote: Social Phenomena in Global Networks</h3>
<p>In his keynote address, <a href="http://www.cs.cornell.edu/home/kleinber/">Jon Kleinberg</a> gave a clear and interesting exposition of the history of network analysis:</p>
<ul>
<li>Relation between local and global observed structure.</li>
<li>Relation between observed phenomena and generative models.</li>
</ul>
<p>I hope the slides will be made available.</p>
<h3 id="hashtags-clicks-and-likes-supervision-for-content-based-posts">Hashtags, Clicks and Likes: Supervision for Content-based Posts</h3>
<p>The <a href="https://sites.google.com/site/extremeclassification/">Extreme Classification: Learning with a Very Large Number of Labels workshop</a> dealt with the issue of learning a large set of labels.</p>
<p><a href="https://scholar.google.com/citations?user=lMkTx0EAAAAJ">Jason Weston</a>, from Facebook, presented the challenge in classifying the content of Facebook posts.
User hashtags serve as (weak) labels, and the problem is to provide the <script type="math/tex">k</script> best associated tags.
The ERM problem is then defined with respect to an ingenious loss he previously presented, termed <a href="http://www.australianscience.com.au/research/google/37180.pdf">WSABIE</a>.
His algorithm consists of learning an embedding of words into some vector space.
Words are then aggregated to embed posts.
Hashtags are also embedded in the same space, so that finding the hashtags associated with a particular post is reduced to computing dot products.
Interestingly, he noted that learning the word embeddings proved more accurate than out-of-the-box embeddings like <a href="https://code.google.com/p/word2vec/">word2vec</a>.
I permit myself to think of his approach as an efficient implementation of <a href="https://en.wikipedia.org/wiki/Canonical_correlation">canonical correlation analysis</a> with respect to the WSABIE ranking loss.</p>
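<p>The prediction step described above can be sketched in a few lines. The embeddings below are made-up 2-dimensional vectors, and averaging the word vectors is my own assumption about the aggregation; in the talk, the embeddings are learned by minimizing a ranking loss.</p>

```python
def embed_post(words, word_vecs):
    """Aggregate word embeddings into a post embedding (here: a plain average)."""
    known = [word_vecs[w] for w in words if w in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[j] for v in known) / len(known) for j in range(dim)]


def top_k_hashtags(post_vec, tag_vecs, k):
    """Rank hashtags by their dot product with the post embedding."""
    def score(tag):
        return sum(p * t for p, t in zip(post_vec, tag_vecs[tag]))
    return sorted(tag_vecs, key=score, reverse=True)[:k]


# Hypothetical embeddings, for illustration only.
word_vecs = {"beach": [1.0, 0.0], "volley": [0.9, 0.1], "regression": [0.0, 1.0]}
tag_vecs = {"#sports": [1.0, 0.0], "#stats": [0.0, 1.0], "#food": [-1.0, -1.0]}

post = embed_post(["beach", "volley", "today"], word_vecs)
print(top_k_hashtags(post, tag_vecs, 2))  # ['#sports', '#stats']
```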
<h3 id="the-frank-wolfe-algorithm-recent-results-and-applications-to-high-dimensional-similarity-learning-and-distributed-optimization">The Frank-Wolfe Algorithm: Recent Results and Applications to High-Dimensional Similarity Learning and Distributed Optimization</h3>
<p>I partially attended the <a href="https://sites.google.com/site/gretaproject/greed-is-great-icml">greedy optimization algorithms workshop</a> and particularly liked <a href="http://perso.telecom-paristech.fr/~abellet/">Aurelien Bellet’s</a> presentation on the <a href="https://en.wikipedia.org/wiki/Frank%E2%80%93Wolfe_algorithm">Frank-Wolfe</a> algorithm for similarity learning.
The fundamental idea is that once the similarity learning problem has been posed as an ERM problem over a space of matrices, it can be efficiently solved using a first order optimizer which advances in the direction of the largest coordinate in the gradient. This is called the Sparse Frank-Wolfe algorithm.
It should also be noted that the matrices are not constrained to correspond to inner products (thus symmetric and PSD) since the similarity metrics being learnt may not be proper dot products.</p>
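<p>To see why following the largest gradient coordinate yields sparse iterates, here is a toy Frank-Wolfe sketch over an L1 ball of vectors. It is not Aurelien’s matrix-valued algorithm; the quadratic objective, the radius, and the step size <script type="math/tex">2/(t+2)</script> are assumptions chosen for illustration.</p>

```python
def frank_wolfe_l1(grad, dim, radius, steps):
    """Minimize a smooth convex function over the L1 ball of the given radius."""
    x = [0.0] * dim
    for t in range(steps):
        g = grad(x)
        # Linear minimization over the L1 ball: the best vertex sits on the
        # coordinate with the largest absolute gradient, with opposite sign.
        j = max(range(dim), key=lambda i: abs(g[i]))
        vertex = [0.0] * dim
        vertex[j] = -radius if g[j] > 0 else radius
        step = 2.0 / (t + 2)
        # Convex combination of the current iterate and the chosen vertex.
        x = [(1 - step) * xi + step * vi for xi, vi in zip(x, vertex)]
    return x


# Toy objective (an assumption): f(x) = 0.5 * ||x - target||^2.
target = [0.8, 0.6, 0.0]


def objective(x):
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def gradient(x):
    return [xi - ti for xi, ti in zip(x, target)]


x_hat = frank_wolfe_l1(gradient, dim=3, radius=1.0, steps=200)
print(x_hat, objective(x_hat))
```

<p>Each step touches a single coordinate, so after <script type="math/tex">t</script> steps the iterate has at most <script type="math/tex">t</script> non-zero entries.</p>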
<h3 id="julias-approach-to-open-source-machine-learning">Julia’s Approach to Open Source Machine Learning</h3>
<p><a href="http://www.johnmyleswhite.com/">John Myles White</a> presented the <a href="https://en.wikipedia.org/wiki/Julia_(programming_language)">Julia language</a> at the <a href="https://mloss.org/workshop/icml15/">Machine Learning Open Source Software 2015</a> workshop.
I will not go into detail. If you are unfamiliar with the language, you should check it out.
It also got me thinking: when are we going to see machine-learning driven <a href="https://en.wikipedia.org/wiki/Type_inference">type-inference</a> in compilers?
In reply to my question, John said the JavaScript compiler already has some probabilistic rules.
Can ML provide classifications that are reliable enough to be used for type-inference within a compiler?
This is obviously related to <a href="#keynote--leon-bottou">Leon Bottou’s talk</a> about “weak” and “strong” contracts.</p>
<h3 id="resource-efficuent-learning">Resource Efficient Learning</h3>
<p><a href="https://sites.google.com/site/icml2015budgetedml/">This workshop</a> fit exactly within my research interests.
I learned a lot from it.
A couple of speakers presented loss functions which account for the learning cost within the optimization. I admit I have trouble with this approach because of the units…
It is my feeling that prediction error and estimation cost cannot be compounded in one loss function.</p>
<p>In a different line of work, <a href="http://web.eecs.umich.edu/~jabernet/">Jake Abernethy</a> presented a very nice sampling scheme where a researcher publishes his willingness to pay for each data point, depending on its value.
Subjects can then decide whether or not to sell their data at the researcher’s bidding price.
Perhaps not surprisingly, a researcher’s willingness to pay depends on the gradient of the empirical risk at the requested data point.
This clearly assumes that the data can be validated, so that cheating is not an issue.
I wonder what would happen when cheating is possible?
Maybe a game-theoretic analysis?
Maybe a Bitcoin-style decentralized verification mechanism?</p>
<p><a href="http://research.microsoft.com/en-us/um/people/oferd/">Ofer Dekel</a> from Microsoft presented a formal framework for simultaneous pruning of random forests.
I found the theory beautiful with non-intuitive results.
The idea is to prune simultaneously a <em>whole forest</em>, and not tree-wise.
Ofer casts the problem as an optimization problem, and indeed achieves astonishing complexity reductions, without compromising statistical accuracy.</p>
<p><a href="http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html">Yoshua Bengio</a> presented a review of a mass of research aimed at learning and servicing with deep networks, with a “small footprint”.
Many useful references were presented. You should follow his website as he posts his talks regularly.
I was particularly impressed by several ideas:</p>
<ul>
<li>Conditional Computations is the idea that a different model is fit at different locations of the feature space. While powerful and attractive, it turns out that the logic of CPUs and GPUs is optimized for “predictable” computations, and not the random access type required when conditioning on feature values.</li>
<li>Low Precision training is the idea that we do not really require floating point precision for machine learning.
Several authors have successfully learned models with <script type="math/tex">16</script> and <script type="math/tex">8</script> bit precision.
Some have allowed a varying precision for different network layers, known as Dynamic Fixed Point.
Bengio is now taking this to the limit and learning using <script type="math/tex">0,1</script> weights. Besides the small memory footprint, this has the advantage that multiplication operations are no longer required, thus improving performance.</li>
</ul>
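<p>The claim that binary weights remove the multiplications is easy to check on a single neuron: with weights in <script type="math/tex">\{0,1\}</script>, the dot product reduces to summing the inputs whose weight is <script type="math/tex">1</script>. A minimal sketch, with made-up weights and inputs:</p>

```python
def dot_with_multiplies(weights, inputs):
    """Ordinary dot product: one multiplication per weight."""
    return sum(w * x for w, x in zip(weights, inputs))


def dot_binary(weights, inputs):
    """Same result for 0/1 weights, using additions only."""
    return sum(x for w, x in zip(weights, inputs) if w == 1)


# Hypothetical values, for illustration only.
weights = [1, 0, 1, 1, 0]
inputs = [0.5, -2.0, 1.5, 3.0, 4.0]
print(dot_binary(weights, inputs), dot_with_multiplies(weights, inputs))
```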
<p><a href="http://www0.cs.ucl.ac.uk/staff/c.archambeau/">Cedric Archambeau</a> from Amazon presented the challenges in <a href="https://aws.amazon.com/blogs/aws/amazon-machine-learning-make-data-driven-decisions-at-scale/">Amazon Machine Learning</a>.
The end goal is to let the user choose (a) time to train, (b) predictive performance, (c) time to predict. Given these, the number of machines, algorithm, and other hardware and algorithmic considerations can be abstracted away for better servicing, and better optimization of the AWS infrastructure.</p>
<p>The last talk I attended, given by <a href="http://research.microsoft.com/en-us/um/people/manik/">Manik Varma</a>, turned out to be one of the best delivered, most interesting, and most inspiring I have heard in a while.
Manik <a href="http://machinelearning.wustl.edu/mlpapers/papers/icml2013_jose13?ref=driverlayer.com/web">presented</a> an efficiently learnable hypothesis class that unifies trees, DNNs, and kernel SVMs.
I did not find the ICML slides, but luckily, I found a <a href="http://techtalks.tv/talks/local-deep-kernel-learning-for-efficient-non-linear-svm-prediction/58381/">past recording</a> of the talk.
I have also learned that this is the classifier used in Windows 8 to detect malware at runtime!</p>