Chapter 9 Multivariate Data Analysis

The term “multivariate data analysis” is so broad and so overloaded, that we start by clarifying what is discussed and what is not discussed in this chapter. Broadly speaking, we will discuss statistical inference, and leave more “exploratory flavored” matters like clustering, and visualization, to the Unsupervised Learning Chapter 11.

We start with an example.

Example 9.1 Consider the problem of a patient monitored in the intensive care unit. At every minute the monitor takes \(p\) physiological measurements: blood pressure, body temperature, etc. The total number of minutes in our data is \(n\), so that in total, we have \(n \times p\) measurements, arranged in a matrix. We also know the typical \(p\)-vector of typical measurements for this patient when healthy, denoted \(\mu_0\).

Formally, let \(y\) be single (random) measurement of a \(p\)-variate random vector. Denote \(\mu:=E[y]\). Here is the set of problems we will discuss, in order of their statistical difficulty.

Signal Detection: a.k.a. multivariate test, or global test, or omnibus test. Where we test whether \(\mu\) differs than some \(\mu_0\).
Signal Counting: a.k.a. prevalence estimation, or \(\pi_0\) estimation. Where we count the number of entries in \(\mu\) that differ from \(\mu_0\).
Signal Identification: a.k.a. selection, or multiple testing. Where we infer which of the entries in \(\mu\) differ from \(\mu_0\). In the ANOVA literature, this is known as a post-hoc analysis, which follows an omnibus test.
Estimation: Estimating the magnitudes of entries in \(\mu\), and their departure from \(\mu_0\). If estimation follows a signal detection or signal identification stage, this is known as selective estimation.

Example 9.2 Consider the problem of detecting regions of cognitive function in the brain using fMRI. Each measurement is the activation level at each location in a brain’s region. If the region has a cognitive function, the mean activation differs than \(\mu_0=0\) when the region is evoked.

Example 9.3 Consider the problem of detecting cancer encoding regions in the genome. Each measurement is the vector of the genetic configuration of an individual. A cancer encoding region will have a different (multivariate) distribution between sick and healthy. In particular, \(\mu\) of sick will differ from \(\mu\) of healthy.

Example 9.4 Consider the problem of the simplest multiple regression. The estimated coefficient, \(\hat \beta\) are a random vector. Regression theory tells us that its covariance is \(Var[\hat \beta|X]=(X'X)^{-1}\sigma^2\), and null mean of \(\beta\). We thus see that inference on the vector of regression coefficients, is nothing more than a multivaraite inference problem.

9.1 Signal Detection

Signal detection deals with the detection of the departure of \(\mu\) from some \(\mu_0\), and especially, \(\mu_0=0\). This problem can be thought of as the multivariate counterpart of the univariate hypothesis t-test.

9.1.1 Hotelling’s T2 Test

The most fundamental approach to signal detection is a mere generalization of the t-test, known as Hotelling’s \(T^2\) test.

Recall the univariate t-statistic of a data vector \(x\) of length \(n\): \[\begin{align} t^2(x):= \frac{(\bar{x}-\mu_0)^2}{Var[\bar{x}]}= (\bar{x}-\mu_0)Var[\bar{x}]^{-1}(\bar{x}-\mu_0), \tag{9.1} \end{align}\] where \(Var[\bar{x}]=S^2(x)/n\), and \(S^2(x)\) is the unbiased variance estimator \(S^2(x):=(n-1)^{-1}\sum (x_i-\bar x)^2\).

Generalizing Eq(9.1) to the multivariate case: \(\mu_0\) is a \(p\)-vector, \(\bar x\) is a \(p\)-vector, and \(Var[\bar x]\) is a \(p \times p\) matrix of the covariance between the \(p\) coordinated of \(\bar x\). When operating with vectors, the squaring becomes a quadratic form, and the division becomes a matrix inverse. We thus have \[\begin{align} T^2(x):= (\bar{x}-\mu_0)' Var[\bar{x}]^{-1} (\bar{x}-\mu_0), \tag{9.2} \end{align}\] which is the definition of Hotelling’s \(T^2\) one-sample test statistic. We typically denote the covariance between coordinates in \(x\) with \(\hat \Sigma(x)\), so that \(\widehat \Sigma_{k,l}:=\widehat {Cov}[x_k,x_l]=(n-1)^{-1} \sum (x_{k,i}-\bar x_k)(x_{l,i}-\bar x_l)\). Using the \(\Sigma\) notation, Eq.(9.2) becomes \[\begin{align} T^2(x):= n (\bar{x}-\mu_0)' \hat \Sigma(x)^{-1} (\bar{x}-\mu_0), \end{align}\] which is the standard notation of Hotelling’s test statistic.

For inference, we need the null distribution of Hotelling’s test statistic. For this we introduce some vocabulary¹⁷:

Low Dimension: We call a problem low dimensional if \(n \gg p\), i.e. \(p/n \approx 0\). This means there are many observations per estimated parameter.
High Dimension: We call a problem high dimensional if \(p/n \to c\), where \(c\in (0,1)\). This means there are more observations than parameters, but not many.
Very High Dimension: We call a problem very high dimensional if \(p/n \to c\), where \(1<c<\infty\). This means there are less observations than parameters.

Hotelling’s \(T^2\) test can only be used in the low dimensional regime. For some intuition on this statement, think of taking \(n=20\) measurements of \(p=100\) physiological variables. We seemingly have \(20\) observations, but there are \(100\) unknown quantities in \(\mu\). Say you decide that \(\mu\) differs from \(\mu_0\) based on the coordinate with maximal difference between your data and \(\mu_0\). Do you know how much variability to expect of this maximum? Try comparing your intuition with a quick simulation. Did the variabilty of the maximum surprise you? Hotelling’s \(T^2\) is not the same as the maxiumum, but the same intuition applies. This criticism is formalized in Bai and Saranadasa (1996). In modern applications, Hotelling’s \(T^2\) is rarely recommended. Luckily, many modern alternatives are available. See J. Rosenblatt, Gilron, and Mukamel (2016) for a review.

9.1.2 Various Types of Signal to Detect

In the previous, we assumed that the signal is a departure of \(\mu\) from some \(\mu_0\). For vactor-valued data \(y\), that is distributed \(\mathcal F\), we may define “signal” as any departure from some \(\mathcal F_0\). This is the multivaraite counterpart of goodness-of-fit (GOF) tests.

Even when restricting “signal” to departures of \(\mu\) from \(\mu_0\), “signal” may come in various forms:

Dense Signal: when the departure is in a large number of coordinates of \(\mu\).
Sparse Signal: when the departure is in a small number of coordinates of \(\mu\).

Process control in a manufactoring plant, for instance, is consistent with a dense signal: if a manufacturing process has failed, we expect a change in many measurements (i.e. coordinates of \(\mu\)). Detection of activation in brain imaging is consistent with a dense signal: if a region encodes cognitive function, we expect a change in many brain locations (i.e. coordinates of \(\mu\).) Detection of disease encodig regions in the genome is consistent with a sparse signal: if susceptibility of disease is genetic, only a small subset of locations in the genome will encode it.

Hotelling’s \(T^2\) statistic is best for dense signal. The next test, is a simple (and forgotten) test best with sparse signal.

9.1.3 Simes’ Test

Hotelling’s \(T^2\) statistic has currently two limitations: It is designed for dense signals, and it requires estimating the covariance, which is a very difficult problem.

An algorithm, that is sensitive to sparse signal and allows statistically valid detection under a wide range of covariances (even if we don’t know the covariance) is known as Simes’ Test. The statistic is defined vie the following algorithm:

Compute \(p\) variable-wise p-values: \(p_1,\dots,p_j\).
Denote \(p_{(1)},\dots,p_{(j)}\) the sorted p-values.
Simes’ statistic is \(p_{Simes}:=min_j\{p_{(j)} \times p/j\}\).
Reject the “no signal” null hypothesis at significance \(\alpha\) if \(p_{Simes}<\alpha\).

9.1.4 Signal Detection with R

We start with simulating some data with no signal. We will convince ourselves that Hotelling’s and Simes’ tests detect nothing, when nothing is present. We will then generate new data, after injecting some signal, i.e., making \(\mu\) depart from \(\mu_0=0\). We then convince ourselves, that both Hotelling’s and Simes’ tests, are indeed capable of detecting signal, when present.

Generating null data:

library(mvtnorm)
n <- 100 # observations
p <- 18 # parameter dimension
mu <- rep(0,p) # no signal: mu=0
x <- rmvnorm(n = n, mean = mu)
dim(x)

## [1] 100  18

lattice::levelplot(x) # just looking at white noise.

Now making our own Hotelling one-sample \(T^2\) test using Eq.((9.2)).

hotellingOneSample <- function(x, mu0=rep(0,ncol(x))){
  n <- nrow(x)
  p <- ncol(x)
  stopifnot(n > 5* p)
  bar.x <- colMeans(x)
  Sigma <- var(x)
  Sigma.inv <- solve(Sigma)
  T2 <- n * (bar.x-mu0) %*% Sigma.inv %*% (bar.x-mu0)
  p.value <- pchisq(q = T2, df = p, lower.tail = FALSE)
  return(list(statistic=T2, pvalue=p.value))
}
hotellingOneSample(x)

## $statistic
##          [,1]
## [1,] 17.94698
## 
## $pvalue
##           [,1]
## [1,] 0.4591507

Things to note:

stopifnot(n > 5 * p) is a little verification to check that the problem is indeed low dimensional. Otherwise, the \(\chi^2\) approximation cannot be trusted.
solve returns a matrix inverse.
%*% is the matrix product operator (see also crossprod()).
A function may return only a single object, so we wrap the statistic and its p-value in a list object.

Just for verification, we compare our home made Hotelling’s test, to the implementation in the rrcov package. The statistic is clearly OK, but our \(\chi^2\) approximation of the distribution leaves room to desire. Personally, I would never trust a Hotelling test if \(n\) is not much greater than \(p\), in which case I would use a high-dimensional adaptation (see Bibliography).

rrcov::T2.test(x)

## 
##  One-sample Hotelling test
## 
## data:  x
## T2 = 17.94698, F = 0.82584, df1 = 18, df2 = 82, p-value = 0.6653
## alternative hypothesis: true mean vector is not equal to (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)' 
## 
## sample estimates:
##                       [,1]       [,2]      [,3]       [,4]        [,5]
## mean x-vector -0.002990837 0.06118945 0.2153931 -0.0487018 -0.06665223
##                    [,6]       [,7]       [,8]        [,9]      [,10]
## mean x-vector 0.1297716 0.04790746 0.08328193 -0.08511783 0.02627395
##                     [,11]     [,12]     [,13]      [,14]      [,15]
## mean x-vector -0.09120245 0.1603455 0.1298308 0.08406651 -0.0559366
##                    [,16]       [,17]      [,18]
## mean x-vector 0.03307495 -0.04411712 -0.1185427

Let’s do the same with Simes’:

Simes <- function(x){
  p.vals <- apply(x, 2, function(z) t.test(z)$p.value) # Compute variable-wise pvalues
  p <- ncol(x)
  p.Simes <- p * min(sort(p.vals)/seq_along(p.vals)) # Compute the Simes statistic
  return(c(pvalue=p.Simes))
}
Simes(x)

##    pvalue 
## 0.4937222

And now we verify that both tests can indeed detect signal when present. Are p-values small enough to reject the “no signal” null hypothesis?

mu <- rep(x = 10/p,times=p) # inject signal
x <- rmvnorm(n = n, mean = mu)
hotellingOneSample(x)

## $statistic
##          [,1]
## [1,] 804.2087
## 
## $pvalue
##               [,1]
## [1,] 4.038321e-159

Simes(x)

##      pvalue 
## 2.76579e-10

… yes. All p-values are very small, so that all statistics can detect the non-null distribution.

9.2 Signal Counting

There are many ways to approach the signal counting problem. For the purposes of this book, however, we will not discuss them directly, and solve the signal counting problem as a signal identification problem: if we know where \(\mu\) departs from \(\mu_0\), we only need to count coordinates to solve the signal counting problem.

Remark. In the sparsity or multiple-testing literature, what we call “signal counting” is known as “adapting to sparsit”, or “adaptivity”.

9.3 Signal Identification

The problem of signal identification is also known as selective testing, or more commonly as multiple testing.

In the ANOVA literature, an identification stage will typically follow a detection stage. These are known as the omnibus F test, and post-hoc tests, respectively. In the multiple testing literature there will typically be no preliminary detection stage. It is typically assumed that signal is present, and the only question is “where?”

The first question when approaching a multiple testing problem is “what is an error”? Is an error declaring a coordinate in \(\mu\) to be different than \(\mu_0\) when it is actually not? Is an error an overly high proportion of falsely identified coordinates? The former is known as the family wise error rate (FWER), and the latter as the false discovery rate (FDR).

Remark. These types of errors have many names in many communities. See the Wikipedia entry on ROC for a table of the (endless) possible error measures.

9.3.1 Signal Identification in R

One (of many) ways to do signal identification involves the stats::p.adjust function. The function takes as inputs a \(p\)-vector of the variable-wise p-values. Why do we start with variable-wise p-values, and not the full data set?

Because we want to make inference variable-wise, so it is natural to start with variable-wise statistics.
Because we want to avoid dealing with covariances if possible. Computing variable-wise p-values does not require estimating covariances.
So that the identification problem is decoupled from the variable-wise inference problem, and may be applied much more generally than in the setup we presented.

We start be generating some high-dimensional multivariate data and computing the coordinate-wise (i.e. hypothesis-wise) p-value.

library(mvtnorm)
n <- 1e1
p <- 1e2
mu <- rep(0,p)
x <- rmvnorm(n = n, mean = mu)
dim(x)

## [1]  10 100

lattice::levelplot(x)

We now compute the pvalues of each coordinate. We use a coordinate-wise t-test. Why a t-test? Because for the purpose of demonstration we want a simple test. In reality, you may use any test that returns valid p-values.

t.pval <- function(y) t.test(y)$p.value
p.values <- apply(X = x, MARGIN = 2, FUN = t.pval) 
plot(p.values, type='h')

Things to note:

t.pval is a function that merely returns the p-value of a t.test.
We used the apply function to apply the same function to each column of x.
MARGIN=2 tells apply to compute over columns and not rows.
The output, p.values, is a vector of 100 p-values.

We are now ready to do the identification, i.e., find which coordinate of \(\mu\) is different than \(\mu_0=0\). The workflow for identification has the same structure, regardless of the desired error guarantees:

Compute an adjusted p-value.
Compare the adjusted p-value to the desired error level.

If we want \(FWER \leq 0.05\), meaning that we allow a \(5\%\) probability of making any mistake, we will use the method="holm" argument of p.adjust.

alpha <- 0.05
p.values.holm <- p.adjust(p.values, method = 'holm' )
which(p.values.holm < alpha)

## integer(0)

If we want \(FDR \leq 0.05\), meaning that we allow the proportion of false discoveries to be no larger than \(5\%\), we use the method="BH" argument of p.adjust.

alpha <- 0.05
p.values.BH <- p.adjust(p.values, method = 'BH' )
which(p.values.BH < alpha)

## integer(0)

We now inject some strong signal in \(\mu\) just to see that the process works. We will artificially inject signal in the first 10 coordinates.

mu[1:10] <- 2 # inject signal in first 10 variables
x <- rmvnorm(n = n, mean = mu) # generate data
p.values <- apply(X = x, MARGIN = 2, FUN = t.pval) 
p.values.BH <- p.adjust(p.values, method = 'BH' )
which(p.values.BH < alpha)

##  [1]  1  2  3  4  5  6  7  8  9 10

Indeed- we are now able to detect that the first coordinates carry signal, because their respective coordinate-wise null hypotheses have been rejected.

9.4 Signal Estimation (*)

The estimation of the elements of \(\mu\) is a seemingly straightforward task. This is not the case, however, if we estimate only the elements that were selected because they were significant (or any other data-dependent criterion). Clearly, estimating only significant entries will introduce a bias in the estimation. In the statistical literature, this is known as selection bias. Selection bias also occurs when you perform inference on regression coefficients after some model selection, say, with a lasso, or a forward search¹⁸.

Selective inference is a complicated and active research topic so we will not offer any off-the-shelf solution to the matter. The curious reader is invited to read Rosenblatt and Benjamini (2014), Javanmard and Montanari (2014), or Will Fithian’s PhD thesis (Fithian 2015) for more on the topic.

9.5 Bibliographic Notes

For a general introduction to multivariate data analysis see Anderson-Cook (2004). For an R oriented introduction, see Everitt and Hothorn (2011). For more on the difficulties with high dimensional problems, see Bai and Saranadasa (1996). For some cutting edge solutions for testing in high-dimension, see J. Rosenblatt, Gilron, and Mukamel (2016) and references therein. Simes’ test is not very well known. It is introduced in Simes (1986), and proven to control the type I error of detection under a PRDS type of dependence in Benjamini and Yekutieli (2001). For more on multiple testing, and signal identification, see Efron (2012). For more on the choice of your error rate see Rosenblatt (2013). For an excellent review on graphical models see Kalisch and Bühlmann (2014). Everything you need on graphical models, Bayesian belief networks, and structure learning in R, is collected in the Task View.

9.6 Practice Yourself

Generate multivariate data with:
```
set.seed(3)
mu<-rexp(50,6)
multi<-  rmvnorm(n = 100, mean = mu) 
```
1. Use Hotelling’s test to determine if \(\mu\) equals \(\mu_0=0\). Can you detect the signal?
2. Perform t.test on each variable and extract the p-value. Try to identify visually the variables which depart from \(\mu_0\).
3. Use p.adjust to identify in which variables there are any departures from \(\mu_0=0\). Allow 5% probability of making any false identification.
4. Use p.adjust to identify in which variables there are any departures from \(\mu_0=0\). Allow a 5% proportion of errors within identifications.
Generate multivariate data from two groups: rmvnorm(n = 100, mean = rep(0,10)) for the first, and rmvnorm(n = 100, mean = rep(0.1,10)) for the second.
1. Do we agree the groups differ?
2. Implement the two-group Hotelling test described in Wikipedia: (https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution#Two-sample_statistic).
3. Verify that you are able to detect that the groups differ.
4. Perform a two-group t-test on each coordinate. On which coordinates can you detect signal while controlling the FWER? On which while controlling the FDR? Use p.adjust.
Return to the previous problem, but set n=9. Verify that you cannot compute your Hotelling statistic.

References

Anderson-Cook, Christine M. 2004. “An Introduction to Multivariate Statistical Analysis.” Journal of the American Statistical Association 99 (467). American Statistical Association: 907–9.

Bai, Zhidong, and Hewa Saranadasa. 1996. “Effect of High Dimension: By an Example of a Two Sample Problem.” Statistica Sinica. JSTOR, 311–29.

Benjamini, Yoav, and Daniel Yekutieli. 2001. “The Control of the False Discovery Rate in Multiple Testing Under Dependency.” Annals of Statistics. JSTOR, 1165–88.

Efron, Bradley. 2012. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Vol. 1. Cambridge University Press.

Everitt, Brian, and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer Science & Business Media.

Fithian, William. 2015. “Topics in Adaptive Inference.” PhD thesis, STANFORD UNIVERSITY.

Javanmard, Adel, and Andrea Montanari. 2014. “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression.” Journal of Machine Learning Research 15 (1): 2869–2909.

Kalisch, Markus, and Peter Bühlmann. 2014. “Causal Structure Learning and Inference: A Selective Review.” Quality Technology & Quantitative Management 11 (1). Taylor & Francis: 3–21.

Rosenblatt, Jonathan. 2013. “A Practitioner’s Guide to Multiple Testing Error Rates.” arXiv Preprint arXiv:1304.4920.

Rosenblatt, Jonathan D, and Yoav Benjamini. 2014. “Selective Correlations; Not Voodoo.” NeuroImage 103. Elsevier: 401–10.

Rosenblatt, Jonathan, Roee Gilron, and Roy Mukamel. 2016. “Better-Than-Chance Classification for Signal Detection.” arXiv Preprint arXiv:1608.08873.

Simes, R John. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3). Oxford University Press: 751–54.

This vocabulary is not standard in the literature, so when you read a text, you will need to verify yourself what the author means.↩
You might find this shocking, but it does mean that you cannot trust the summary table of a model that was selected from a multitude of models.↩