The Building Blocks of Machine Learning
Tags: Machine Learning
One can easily be confused by the sea of methods and terms in machine learning. I find the endless terminology confusing and counterproductive: one might have a perfect understanding of a method “A”, yet be unaware that the new state-of-the-art algorithm, “B++”, is merely a small twist on the familiar “A”. I have spent hours disambiguating terms only to realize that a simple idea was occluded by terminology.
In this post, I try to collect the fundamental ideas underlying machine learning. Most algorithms out there can be shown to be some combination of these ideas:
Empirical Risk Minimization
This is possibly the most fundamental concept in machine learning. The idea stems from (statistical) decision theory, and consists of defining the loss incurred by making an error, $l(y, \hat{y})$. Once defined, one naturally seeks predictions, $\hat{y} = f(x)$, that are accurate, i.e., that minimize the average loss known as the risk: $R(f) = \mathbb{E}[l(y, f(x))]$, and $f^* = \arg\min_f R(f)$.
As the whole population is typically inaccessible, we cannot compute the risk. Instead, we have access to a sample, and so we minimize the empirical risk, $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} l(y_i, f(x_i))$, and take $\hat{f} = \arg\min_{f} \hat{R}_n(f)$.
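The idea can be sketched in a few lines. The setup below is purely illustrative (a linear predictor with squared loss, synthetic data): we minimize the empirical risk by gradient descent.

```python
import numpy as np

# Illustrative setup: a linear predictor f(x) = x @ w and squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def empirical_risk(w, X, y):
    """Average squared loss over the sample."""
    return np.mean((y - X @ w) ** 2)

# Minimize the empirical risk by gradient descent.
w = np.zeros(3)
for _ in range(500):
    grad = -2 * X.T @ (y - X @ w) / len(y)
    w -= 0.1 * grad

print(np.round(w, 2))  # close to w_true
```

The minimizer recovers something close to the true coefficients because, in this toy case, the hypothesis class contains the data-generating model.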
Typical examples of algorithms that cannot be cast as pure ERM problems are non-parametric methods like k-nearest neighbours and kernel methods. This is because optimization over an infinite-dimensional function space is typically impossible (with exceptions; see the kernel trick).
Inductive Bias
Inductive bias is the idea that without constraining the class of prediction functions (the “hypothesis class” in the learning literature), there is no point in learning. Indeed, our predictions and approximations would overfit the sample, and perform poorly on new data.
Inductive bias can be introduced in several ways:
- By restricting the functional form of your predictors (the “hypothesis class”).
- By preferring simple solutions, i.e., adding regularization.
Ridge regression demonstrates both forms of inductive bias: we constrain the predictor to be linear in the features, and we also add an $l_2$ regularization penalty.
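A minimal ridge sketch on synthetic data (names and numbers are illustrative): the linear form is the hypothesis-class restriction, and the `lam * np.eye(d)` term is the regularization.

```python
import numpy as np

# Minimize ||y - Xw||^2 + lam * ||w||^2 over linear predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form minimizer of the penalized empirical risk."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)    # lam = 0: ordinary least squares
w_reg = ridge(X, y, 10.0)   # lam > 0: coefficients shrink toward zero
print(np.linalg.norm(w_reg) < np.linalg.norm(w_ols))  # True
```

The second call has a strictly smaller coefficient norm: that shrinkage is exactly the “preference for simple solutions” at work.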
Model Selection
Having defined a loss function, we obviously seek predictors and estimates that minimize the risk. We may think that choosing the model with the smallest empirical risk, $\hat{R}_n$, is a good way to compare and choose models. This, however, is a poor strategy that will certainly lead to overfitting, because $\hat{R}_n$ is a downward-biased estimate of $R$. The “richer” the hypothesis class, the larger the bias, leading us to prefer ever more complicated models.
To choose the best model, we would like an unbiased estimate of $R$. Notable methods that aim at estimating the risk of the selected model, a.k.a. the generalization error, or test error, are:
- Train, validate, test samples.
- Cross Validation.
- Mallows’s $C_p$, AIC, BIC, MDL, SRM, DIC, HQC, and others.
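As one concrete instance, here is a hand-rolled K-fold cross-validation sketch (function names are illustrative): the risk of each candidate model, here polynomial fits of increasing degree, is estimated by the average loss on held-out folds.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(size=60)

def cv_risk(X, y, degree, k=5):
    """Estimate the risk of a degree-`degree` polynomial fit by k-fold CV."""
    folds = np.array_split(np.arange(len(y)), k)
    losses = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(y)), val)
        coefs = np.polyfit(X[train, 0], y[train], degree)   # fit on k-1 folds
        preds = np.polyval(coefs, X[val, 0])                # predict held-out fold
        losses.append(np.mean((y[val] - preds) ** 2))
    return np.mean(losses)

# Pick the degree with the smallest estimated (not empirical!) risk.
best = min(range(1, 6), key=lambda d: cv_risk(X, y, d))
```

Note the contrast with naive empirical-risk comparison: the degree-5 fit always has smaller *training* error, yet cross-validation will typically prefer a lower degree here.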
Dimensionality Reduction
The fact that a scientist (or machine) collected a set of variables clearly does not mean that we need them all, or that we need them on their original scale. Indeed, variables may carry redundant information. It is thus of great interest, both for interpretability and for computational speedups, to remove this redundancy by reducing the dimension of the data.
Some examples include:
- Variable selection such as stepwise-regression, and LASSO.
- Linear-space embeddings such as Principal Component Analysis, Factor Analysis, Independent Component Analysis, Multidimensional Scaling, Partial Least Squares, and Canonical Correlation Analysis.
- Non-linear-space embeddings such as Local MDS, Locally Linear Embeddings, and Self Organizing Maps.
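To make the linear case concrete, here is a minimal PCA sketch via the SVD, on synthetic data that truly lives in 2 dimensions but is observed in 10 (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))        # 2 true underlying factors
X = latent @ rng.normal(size=(2, 10))     # observed in 10 dimensions
X += 0.01 * rng.normal(size=X.shape)      # small measurement noise

Xc = X - X.mean(axis=0)                   # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                         # 200 x 2 reduced representation

# Fraction of variance captured by the top k principal components.
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
```

Because the data is essentially 2-dimensional, the top two components capture nearly all of the variance, and `Z` is a faithful low-dimensional summary.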
Basis Augmentation
Basis augmentation consists of generating new variables from non-linear combinations of existing ones. It is motivated by the observation that the relation between variables may be linear in some non-linear transformation of them. Common examples include polynomial features and spline bases.
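A small illustration of the idea, assuming a quadratic truth: $y$ depends on $x$ non-linearly, but is exactly linear in the augmented basis $(1, x, x^2)$, so plain least squares recovers the relation.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 0.5 * x - 2.0 * x ** 2 + 0.1 * rng.normal(size=100)

# Augment the design matrix with the basis (1, x, x^2).
B = np.column_stack([np.ones_like(x), x, x ** 2])
coefs, *_ = np.linalg.lstsq(B, y, rcond=None)
print(np.round(coefs, 1))  # approximately [1.0, 0.5, -2.0]
```

A linear fit in `x` alone would miss the curvature entirely; the augmentation moves the non-linearity into the features so the model can stay linear.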
The Kernel Trick
As previously stated, it is typically impossible to solve a risk minimization problem over an infinite-dimensional function space. Exceptions arise, however, when using a particular type of regularization. Indeed, it was Grace Wahba’s observation, when studying smoothing splines, that with the right regularization, the solution to a risk minimization problem in an infinite-dimensional function space is contained within a simple finite-dimensional space (that of cubic splines). This observation was later generalized to what is currently known as the kernel trick.
Informally speaking, the kernel trick states that by choosing an appropriate regularization, we can constrain the solution of an infinite-dimensional ERM problem to a simple functional subspace, known as a Reproducing Kernel Hilbert Space (RKHS). The reason RKHSs appear in this context is that functions in an RKHS can be represented as linear combinations of kernel evaluations between points in the space, which is a far-from-trivial property of a function.
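Kernel ridge regression is the simplest place to see this. The sketch below assumes a Gaussian (RBF) kernel and illustrative data: the infinite-dimensional solution has the form $f(x) = \sum_i \alpha_i\, k(x_i, x)$, so only an $n \times n$ linear system needs to be solved.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(-3, 3, size=80))
y = np.sin(x) + 0.1 * rng.normal(size=80)

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix between two sets of 1-D points."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Solve (K + lam * I) alpha = y: an n-dimensional problem standing in
# for an infinite-dimensional regularized ERM problem.
lam = 0.1
K = rbf_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict(x_new):
    return rbf_kernel(x_new, x) @ alpha

mse = float(np.mean((predict(x) - y) ** 2))
```

The fitted function is a weighted sum of kernel bumps centered at the training points, exactly the representation the RKHS guarantees.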
Generative Models
A data scientist trained as a statistician will first think of a sampling distribution, a.k.a. a generative model. This may be overkill for the simple purposes of discriminative analysis, dimensionality reduction, or clustering. If, however, a generative model can be assumed, it immediately lends itself to learning using likelihood principles.
Assuming the generative model has a latent variable allows the design of algorithms that pool information from different samples in a manner that no algorithm designer could have thought of. Examples include:
- Finite Mixtures, Hidden Markov Models, Hierarchical Dirichlet Processes, and Stochastic Blockmodels, used for classification and clustering.
- Factor Analysis, Independent Component Analysis, and Canonical Correlation Analysis, used for dimensionality reduction.
- Latent Dirichlet Allocation, and Probabilistic Latent Semantic Indexing used for topic modeling.
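The first of these, a finite mixture, makes the latent-variable idea concrete. Below is a minimal EM sketch for a two-component 1-D Gaussian mixture with unit variances (all numbers illustrative): the latent variable says which component generated each point, and EM alternates between soft assignments and reweighted parameter updates.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])

mu = np.array([-1.0, 1.0])   # initial component means
pi = np.array([0.5, 0.5])    # mixing weights; variances fixed at 1
for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted updates of means and weights.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

print(np.round(np.sort(mu), 1))  # near the true means -2 and 3
```

No sample is ever hard-assigned to a cluster: each point contributes to both means in proportion to its responsibilities, which is exactly the kind of information pooling the latent variable buys.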