The Building Blocks of Machine Learning

Tags: : Machine Learning

One can easily be confused by the sea of methods and terms in macine learning. I find the endless terminology confusing and counter productive. One might have a perfect understanding of a method “A”, but is unaware that the new state of the art algorithm, “B++”, is merely a small twist to his familiar “A”. I have spent hours trying to disambiguate terms just to realize that a simple idea was ocluded by terminology.

In this post, I try to collect the fundemantal ideas underlying machine-learning. Most algorithms out there can be shown to be some compounding of these ideas:

Risk Minimization

\[\newcommand{\loss}{l} % A loss function \newcommand{\risk}{R} % The risk function \newcommand{\riskn}{\mathbb{R}} % The empirical risk \newcommand{\argmin}[2]{\mathop{argmin} _{#1}\set{#2}} % The argmin operator \newcommand{\set}[1]{\left\{ #1 \right\}} % A set \newcommand{\rv}[1]{\mathbf{#1}} % A random variable \newcommand{\x}{\rv x} % The random variable x \newcommand{\y}{\rv y} % The random variable x \newcommand{\X}{\rv X} % The random variable x \newcommand{\Y}{\rv Y} % The random variable y \newcommand{\X}{\rv z} % The random variable x \newcommand{\Y}{\rv Z} % The random variable y \newcommand{\estim}[1]{\widehat{#1}} % An estimator\]

This is possibly the most fundamental concept in machine learning. The idea stems from (statistical) desicision theory, and consists of defining what is the loss incurred by making an error, $\loss(Z;\theta)$. Once defined, one would naturally seek to make predictions, $\theta^$, that are accurate, i.e., minimize the average loss known as the risk: $\risk(\theta):= \int \loss(Z;\theta) dZ$, and $\theta^:=\argmin{\theta}{\risk(\theta)}$.

As the whole population is typically inaccesible, we cannot compute the risk. Instead, we have access to a sample, and so we instead minimize the empirical risk $\riskn(\theta):= \frac 1n \sum \loss(Z_i;\theta)$, and $\estim{\theta^*}:= \argmin{\theta}{\riskn(\theta)}$.

The vast majority of supervised and unsupervised learning algorithms are merely empirical risk minimizers (ERM). Some examples include Ordinary Least Squares, Maximum Likelihood estimation, PCA.

Typical examples of algorithms that cannot be cast as pure ERM problems are non-parametric methods like k-nearest neighbour and kernel methods. This is because optimization in an infinite dimension is typically impossible (with exceptions, see the kernel trick).

Inductive Bias

Inductive bias is the idea that without constraining the class of prediction function (“hypothesis” in the learning literature), there is non point in learning. Indeed, our predictions and approximation will be overfitted to the sample, and perform poorly on new data.

Inductive bias can be introduced in seveal ways:

By restricting the functional form of your predictors (the “hypothesis class”).
By preferring simple solutions, i.e., adding regularization.

Ridge regression demonstrates these two forms of inductive bias: we constrain the predictor to be linear in the features, and also add $l_2$ regularization.

Risk Estimation

Having defind a loss function, we obviously seek for predictors and estimates that minimize the risk. We may think that choosing the model with the smallest empirical risk, $\riskn{\theta)}$, may be a good way to compare and choose models. This, however, is a poor strategy that will certainly lead to overfitting. This is because $\riskn(\estim{\theta^})$ is a biased estimate of $\risk(\estim{\theta^})$. The “richer” the hypothesis class, the larger the bias, leading us to prefer more complicated models.

To choose the best model, we would like some unbiased esimate of $\risk(\estim{\theta^*})$. Notable methods that aim at estimating the risk of the selected model, a.k.a. the generalization error, or test error are:

Train, validate, test samples.
Cross Validation.
Jackkinifing.
Mallow’s Cp, AIC, BIC, MDL, SRM, DIC, HQC, and others.

Dimensionality Reduction

The fact that a scientists/machine collected a set of $p$ variables, does clearly not mean that we need them all, or that we need them in the original scale. Indeed, variables may include redundant information. It is thus of great interest, both for interpretability and for computational speedups, to remove the redundancy in the data by reducing its dimension.

Some examples include:

Variable selection such as stepwise-regression, and LASSO.
Linear-space embeddings such as Principal Component Analysis, Factor Analysis, Independent Component Analysis, Multidimensional Scaling, Partial Least Squares, and Canonical Correlation Analysis
Non-linear-space embeddings such as Local MDS, Locally Linear Embeddings, and Self Organizing Maps.

Basis Augmentation

Basis augmentation consists of generating new variables from non linear combinations of existing ones. This is motivated by the observation that relation between variables may be linear in some non-linear transformation of these. Examples include:

The Kernel Trick

As previously stated, it is typically impossible so solve some risk miminization problem over an infinte dimensional function space. Exceptions arise, however, if using a particular type of regularization. Indeed, it was Grace Whaba’s observation, when studying smoothing splines, that with the right regularization, the solution to a risk minimization problem in an infinite dimensional function space, is contained within a finite dimensional simple space (that of cubic splines). This observation was later generalized to what is currently known as the Kernel Trick.

Informally speaking, the kernel trick states that by choosing an appropriate regularization, we can constrain the solution of an infinite dimensional ERM problem to a simple functional subspace, known as a Reproducing Kernel Hilbert Space. The reason that RKHS spaces appear in this context is that functions in RKHS can be represented as linear combinations of distances between points in the space, which is a far- from-trivial property of a function.

Generative Models

Written on