ICML 2015

I have attended this week the ICML2015 conference in Lille France. Here are some impressions…

\(1,600\) attendants. About \(270\) presentations out of about \(1,037\) submitted. A massive, very well organized event. Needless to say that deep learning was the dominant topic by a wide margin, but since it is not quite my field of interest, it is not emphasized in the following highlights.

If there is a particular talk or slide deck you want, and I did not link to, feel free to drop me a note.

Tutorials

I attended the NLP tutorial by Percy Liang, the Structured Prediction tutorial by Hal Daume III and John Langford, and the Convex Optimization tutorial by Mark Schmidt and Peter Richtarik . All slide are available here.

The NLP tutorial was very high level, and great for someone like me who lacks the basic terminology.

The Structured Prediction tutorial had a varying level. Hal presented some basic approaches, and then focused on his own work on an efficient algorithm for structured prediction which does sequential labeling. My own insight is that Hal’s algorithm can be seen as an optimization scheme for outcomes that have a tree-like graphical model (thus the sequential approach). I may be very wrong on this one, as I am still struggling to define the borders between Structured Learning and MANOVA. John gave an exposition of the L2S (learning to search) algorithm, which seemed very impressive, but I was clearly missing some background to fully understand the talk.

The convex optimization talks were brilliant! A nice unified review from historical approaches (dating back to Cauchy) to cutting edge approaches. I think that Mark Schmidt’s slides will definitely serve me to learn and to teach convex optimization in the future.

Keynote- Leon Bottou

Leon gave an inspiring talk about the next big challenges in ML. He pointed out two:

Incorporating ML modules in big software systems requires a level of stability currently unavailable. There seems to be a lot to learn from other industries on abstraction and modularity of their building blocks. ML modules are currently “weak contracts” whereas large complicated systems require modules with “strong contracts”.
Even with all the cross validation, train-test, and other risk estimators, ML is still biased towards particular data sets. There is a lot to learn from other scientific disciplines on the design of data collection, and more generally, on “enriching our experimental repertoire, redefining our scientific processes, and still maintain our progress speed”.

Faster Cover Trees

Mike Izbicki presented his Faster Cover Trees data structure, which is a data structure for fast queries (possibly also insertions) on pairwise distances. Clearly, very useful for neighborhood type algorithms such as KNN. Interestingly, he came across his algorithm while trying to write a Haskal implementation of the classical cover tree.

Optimal and Adaptive Algorithms for Online Boosting

If you want an online variant of gradient boosting, read this paper. Beautifully presented by Haipeng Luo, and clearly deserving an ICML best paper award.

Low Rank Approximation using Error Correcting Coding Matrices

A nice presentation by Shashanka Ubaru about projections using error correcting matrices that preserve geometry. The motivation is the reduction of the computation cost of performing random projections a-la Johnson-Lindenstrauss.

Bayesian and Empirical Bayesian Forests

A cool idea presented by Taddy Matthew. There is no need to re-learn the root of each tree at every computing node when bagging trees in parallel. This can lead to great speedups since after each split, only small subsets of the data are considered for splitting. The deeper the common roots of the bagged forest, the larger the computational gain. As a side comment, Taddy also mentioned that for very large data sets (in the scale of eBay), there is not benefit in the variable subsampling the distinguishes between bagging and random forests.

Is Feature Selection Secure against Training Data Poisoning?

Battista Biggio presented an interesting framework to analyze vulnerability of ML software modules to cyber attacks. My take home message: data poisoning attacks are simply an adversarial introduction of outliers. Like any other outliers, one can use robust methods as a protection.

Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo

A great talk presented by Yu-Xiang Wang. The take home message: Sampling from the posterior distribution is both a good estimator, and preserves differential privacy.

Rademacher Observations, Private Data, and Boosting

Giorgio Patrini presented the idea of Rademacher Observations (“rados”). At first I was not sure if rados are actually coresets, or random projections, which have already been proposed in the context of privacy preservation. I then understood it is none the above. “Rados” are actually sufficient statistics computed on subsets of data. As such, they are statistical aggregates that allow estimation, while preserving privacy.

The Composition Theorem for Differential Privacy

Sewoong Oh presented a calculus for computing the privacy vulnerability when performing multiple deferentially private queries. Bears strong similarity with the accumulation of error in multiple testing.

In his keynote address, Jon Kleinberg gave a clear and interesting exposition of the history of network analysis:

Relation between local and global observed structure.
Relation between observed phenomena and generative models.

I hope the slides will be made available.

Hashtags, Clicks and Likes: Supervision for Content-based Posts

The Extreme Classification: Learning with a Very Large Number of Labels workshop dealt with the issue of learning a large set of labels.

Jason Weston’s,from Facebook, presented the challenge in classifying the content of Facebook posts. User hashtags serve as (weak) labels, and the problem is to provide the \(k\) best associated tags. The ERM problem is then defined with respect to an ingenious loss he previously presented, termed WSABIE. His algorithm consists of learning an embedding of words into some vector space. Words are then aggregated to embed posts. Hashtags are also embedded in the same space, so that finding the hashtags associated with a particular post is reduced to computing dot products. Interestingly, he noted that learning the word embeddings proved more accurate than out-of-the-box embeddings like word2vec. I permit myself to think of his approach as an efficient implementation of canonical correlation analysis with respect to the Wsbie ranking loss.

The Frank-Wolfe Algorithm: Recent Results and Applications to High-Dimensional Similarity Learning and Distributed Optimization

I partially attended the greedy optimization algorithms workshop and particularly liked Aurelien Bellet’s presentation on the Frank-Wolfe algorithm for similarity learning. The fundamental idea is that once the similarity learning problem has been posed as an ERM problem over a space of matrices, it can be efficiently solved using a first order optimizer which advances in the direction of the largest coordinate in the gradient. This is called the Sparse Frank-Wolfe algorithm. It should also be noted that the matrices are not constrained to correspond to inner products (thus symmetric and PSD) since the similarity metrics being learnt may not be proper dot products.

Julia’s Approach to Open Source Machine Learning

John Myles White presented the Julia language at the Machine Learning Open Source Software 2015 workshop. I will not go in detail. If you are unfamiliar with the language, you should check it out. It also got me thinking- when are we going to see machine-learning driven type-inference in compilers? In reply to my question, John said the JavaScript compiler already has some probabilistic rules. Can ML provide classifications that are reliable enough to be used for type-inference within a compiler? This is obviously related to Leon Bottou’s talk about “weak” and “hard” contracts.

Resource Efficuent Learning

This workshop fit exactly within my research interests. I have learned from it a lot. A couple of speakers presented loss function which account for the learning cost within the optimization. I admit I have trouble with this approach because of the units… It is my feeling that prediction error and estimation cost cannot be compounded in one loss function.

In a different line of work, Jake Abernethy presented a very nice sampling scheme where a researcher publishes his willingness to pay for each data point, depending on its value. Subjects can they decide whether they sell their data or not at the researcher’s bidding price. Perhaps not surprisingly, a researcher’s willingness to pay depends on the gradient of the empirical risk at the requested data point. This clearly assumes that the data can be validated, so that cheating is not an issue. I wonder what would happen when cheating is possible? Maybe a game-theoretic analysis? Maybe a bit-coin decentralized verification mechanism?

Ofer Dekel from Microsoft present a formal framework for simultaneous pruning of random forests. I found the theory beautiful with non-intuitive results. The idea is to prune simultaneously a whole forest, and not tree-wise. Ofer casts the problem as an optimization problem, and indeed achieves astonishing complexity reductions, without compromising statistical accuracy.

Yoshua Bengio presented a review of a mass of research aimed at learning and servicing with deep networks, with a “small footprint”. Many useful references were presented. You should follow his website as he posts his talks regularly. I was particularly impressed by several idea:

Conditional Computations is the idea that a different model is fit at different locations of the feature space. While powerful and attractive, it turns out the the logic of CPUs and GPUs is such that is optimized for “predictable” computations, and not the random access type required when conditioning on feature values.
Low Precision training is the idea that we do not really require floating point precision for machine learning. Several authors have successfully learned models with \(16\) and \(8\) bit precision. Some have allowed a varying precision for different network layers, known as Dynamic Fixed Point. Bengio is not taking this to the limit and learning using \(0,1\) weights. Besides the small memory footprint, this has the advantage that multiplication operations are no longer required, thus improving performance.

Cedric Archambeau from Amazon presented the challenges in Amazon Machine Learning. The end goal is to let the user choose (a) time to train, (b) predictive performance, (c) time to predict. Given these, the number of machines, algorithm, and other hardware and algorithmic consideration can be abstracted away for better servicing, and better optimization of the AWS infrastructure.

The last talk I attended, was given by Manik Varma turned out to be one of the best given, most interesting, and inspiring I heard in a while. Manik presented an efficiently learnable hypothesis class that unifies trees, DNNs, kernel SVMs. I did not find the ICML slides, but luckily, I found a past recording of the talk. I have also learned that this is the classifier used in Windows 8 to detect malware at runtime!

Written on July 7, 2015