# Chapter 14 Sparse Representations

Analyzing “big data” in R is a challenge because the workspace is memory resident, i.e., all your objects are stored in RAM. As a rule of thumb, fitting models requires about 5 times the size of the data. This means that if you have 1 GB of data, you might need about 5 GB to fit a linear model. We will discuss how to compute *out of RAM* in the Memory Efficiency Chapter 15. In this chapter, we discuss efficient representations of your data, so that it takes less memory. The fundamental idea is that if your data is *sparse*, i.e., there are many zero entries in your data, then a naive `data.frame` or `matrix` will consume memory for all these zeroes. If, however, you have many recurring zeroes, it is more efficient to save only the non-zero entries.

When we say *data*, we actually mean the `model.matrix`. The `model.matrix` is a matrix that R constructs from your data, converting all your factors to numeric variables that can be computed with. *Dummy coding* of your factors, for instance, is done in your `model.matrix`. If you have a factor with many levels, you can imagine that after dummy coding it, many zeroes will be present.

The **Matrix** package replaces the `matrix` class with several sparse representations of matrix objects.

When using sparse representation, and the **Matrix** package, you will need an implementation of your favorite model fitting algorithm (e.g., `lm`) that is adapted to these sparse representations; otherwise, R will cast the sparse matrix into a regular (non-sparse) matrix, and you will have saved nothing in RAM.

*Remark.* If you are familiar with MATLAB, you should know that one of its great capabilities is the excellent treatment of sparse matrices with the `sparse` function.

Before we go into details, here is a simple example. We will create a factor of letters with the `letters` function. Clearly, this factor can take only \(26\) values. This means that \(25/26\) of the `model.matrix` will be zeroes after dummy coding. We will compare the memory footprint of the naive `model.matrix` with the sparse representation of the same matrix.

```
library(magrittr)
reps <- 1e6 # number of samples
y <- rnorm(reps)
x <- letters %>%
  sample(reps, replace = TRUE) %>%
  factor
```

The object `x` is a factor of letters:

`head(x)`

```
## [1] n x z f a i
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
```

We dummy code `x` with the `model.matrix` function.

```
X.1 <- model.matrix(~x-1)
head(X.1)
```

```
## xa xb xc xd xe xf xg xh xi xj xk xl xm xn xo xp xq xr xs xt xu xv xw xx
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## xy xz
## 1 0 0
## 2 0 0
## 3 0 1
## 4 0 0
## 5 0 0
## 6 0 0
```

We call **MatrixModels** for an implementation of `model.matrix` that supports sparse representations.

```
suppressPackageStartupMessages(library(MatrixModels))
X.2 <- as(x, "sparseMatrix") %>% t # makes a sparse dummy model.matrix
head(X.2)
```

```
## 6 x 26 sparse Matrix of class "dgCMatrix"
## [[ suppressing 26 column names 'a', 'b', 'c' ... ]]
##
## [1,] . . . . . . . . . . . . . 1 . . . . . . . . . . . .
## [2,] . . . . . . . . . . . . . . . . . . . . . . . 1 . .
## [3,] . . . . . . . . . . . . . . . . . . . . . . . . . 1
## [4,] . . . . . 1 . . . . . . . . . . . . . . . . . . . .
## [5,] 1 . . . . . . . . . . . . . . . . . . . . . . . . .
## [6,] . . . . . . . . 1 . . . . . . . . . . . . . . . . .
```

Notice that the matrices have the same dimensions:

`dim(X.1)`

`## [1] 1000000 26`

`dim(X.2)`

`## [1] 1000000 26`

The memory footprints of the matrices, given by the `pryr::object_size` function, are very different.

`pryr::object_size(X.1)`

`## 272 MB`

`pryr::object_size(X.2)`

`## 12 MB`

Things to note:

- The sparse representation takes far less memory than the non-sparse one.
- The `as(,"sparseMatrix")` coercion builds the dummy-variable representation of the factor `x`.
- The **pryr** package provides many facilities for inspecting the memory footprint of your objects and code.

With a sparse representation, we not only saved on RAM, but also on the computing time of fitting a model. Here is the timing of the non-sparse representation:

`system.time(lm.1 <- lm(y ~ X.1)) `

```
## user system elapsed
## 2.149 0.414 2.561
```

Well, actually, `lm` is a wrapper for the `lm.fit` function. If we bypass all the overhead of `lm` and call `lm.fit` directly, we gain some time:

`system.time(lm.1 <- lm.fit(y=y, x=X.1))`

```
## user system elapsed
## 0.702 0.067 0.769
```

We now do the same with the sparse representation:

`system.time(lm.2 <- MatrixModels:::lm.fit.sparse(X.2,y))`

```
## user system elapsed
## 0.123 0.004 0.126
```

It only remains to verify that the returned coefficients are the same:

`all.equal(lm.2, unname(lm.1$coefficients), tolerance = 1e-12)`

`## [1] TRUE`

You can also visualize the non-zero entries, i.e., the sparsity structure.

`image(X.2[1:26,1:26])`

## 14.1 Sparse Matrix Representations

We first distinguish between the two main goals of an efficient representation: (i) efficient writing, i.e., modification; (ii) efficient reading, i.e., access. For our purposes, we will typically want efficient reading, since the `model.matrix` will not change while a model is being fitted.

Representations designed for writing include the *dictionary of keys*, the *list of lists*, and the *coordinate list*. Representations designed for efficient reading include the *compressed sparse row* and the *compressed sparse column*.

### 14.1.1 Coordinate List Representation

A *coordinate list representation*, also known as *COO* or *triplet representation*, is simply a list of the non-zero entries. Each element in the list is a triplet of the row, column, and value of a non-zero entry in the matrix. For instance, the matrix \[ \begin{bmatrix}
0 & a_2 & 0 \\
0 & 0 & b_3
\end{bmatrix} \] will be \[ \begin{bmatrix}
1 & 2 & a_2 \\
2 & 3 & b_3
\end{bmatrix}. \]
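
To make this concrete, here is a minimal sketch in R that builds the \(2 \times 3\) matrix above in triplet form with the **Matrix** package's `spMatrix` constructor; the values \(a_2 = 2\) and \(b_3 = 3\) are chosen arbitrarily for the illustration.

```
library(Matrix)
# The 2x3 matrix above in triplet (COO) form; a_2 = 2 and b_3 = 3 are
# arbitrary illustrative values.
A <- spMatrix(nrow = 2, ncol = 3,
              i = c(1, 2),  # row indices of the non-zero entries
              j = c(2, 3),  # column indices
              x = c(2, 3))  # the values themselves
class(A) # "dgTMatrix": a general double matrix in triplet representation
str(A)   # slots @i, @j, @x hold the (row, column, value) triplets (0-based indices)
```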

### 14.1.2 Compressed Row Oriented Representation

*Compressed row oriented representation*, also known as *compressed sparse row* or *CSR*, is similar to *COO*, but with a compressed row vector. Instead of holding the row index of each non-zero entry, the row vector holds the positions in the column vector at which a new row begins. See the sketch below.
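
A minimal sketch, continuing the COO example above: it assumes the `A` object created there, and that your version of the **Matrix** package supports coercion to the `RsparseMatrix` class.

```
A.csr <- as(A, "RsparseMatrix") # row-compressed ("dgRMatrix") version of A
A.csr@j # column index of each non-zero entry (0-based): 1 2
A.csr@x # the values: 2 3
A.csr@p # row "pointers": where each row starts in @j and @x: 0 1 2
```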

### 14.1.3 Compressed Column Oriented Representation

A *compressed column oriented representation*, also known as *compressed sparse column* or *CSC*, compresses the column vector, unlike *CSR*, where the row vector is compressed. The nature of statistical applications is such that the CSC representation is typically the most economical, justifying its popularity.
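
In the same spirit, a sketch of the CSC slots, again assuming the `A` object from the COO example above:

```
A.csc <- as(A, "CsparseMatrix") # column-compressed ("dgCMatrix") version of A
A.csc@i # row index of each non-zero entry (0-based): 0 1
A.csc@x # the values: 2 3
A.csc@p # column "pointers": where each column starts in @i and @x: 0 0 1 2
```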

### 14.1.4 Sparse Algorithms

We will go into the details of some algorithms in the Numerical Linear Algebra Chapter 17. For our current purposes two things need to be emphasized:

1. Working with sparse representations requires using a function that is aware of the representation you are using.
2. A mathematician may write \(Ax=b \Rightarrow x=A^{-1}b\). This is a predicate of \(x\), i.e., a property that \(x\) satisfies, which helps with its analysis. A computer, however, would **never** compute \(A^{-1}\) in order to find \(x\), but rather use one of endlessly many numerical algorithms. A computer will typically “search” various \(x\)’s until it finds the one that fulfils the predicate.
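
As a small illustration of the second point, here is a hedged sketch: `Matrix::solve` solves a sparse system via a factorization, without ever forming \(A^{-1}\) explicitly. The matrix and right-hand side below are made up for the example.

```
library(Matrix)
set.seed(1)
n <- 1e3
A <- rsparsematrix(n, n, density = 0.01) # a random sparse matrix
A <- crossprod(A) + Diagonal(n)          # make it symmetric positive definite
b <- rnorm(n)
x <- solve(A, b)      # factorize-and-solve; A^{-1} is never computed
max(abs(A %*% x - b)) # residual; should be near machine precision
```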

## 14.2 Sparse Matrices and Sparse Models in R

### 14.2.1 The Matrix Package

The **Matrix** package provides facilities to deal with real (stored as double precision), logical, and so-called “pattern” (binary) dense and sparse matrices. There are also provisions for integer and complex (stored as double precision complex) matrices.

The sparse matrix classes include:

- `TsparseMatrix`: a virtual class of the various sparse matrices in triplet representation.
- `CsparseMatrix`: a virtual class of the various sparse matrices in CSC representation.
- `RsparseMatrix`: a virtual class of the various sparse matrices in CSR representation.

For matrices of real numbers, stored in *double precision*, the **Matrix** package provides the following (non virtual) classes:

- `dgTMatrix`: a **general** sparse matrix of **doubles**, in **triplet** representation.
- `dgCMatrix`: a **general** sparse matrix of **doubles**, in **CSC** representation.
- `dsCMatrix`: a **symmetric** sparse matrix of **doubles**, in **CSC** representation.
- `dtCMatrix`: a **triangular** sparse matrix of **doubles**, in **CSC** representation.

Why bother distinguishing between the different shapes of the matrix? Because the more structure is assumed on a matrix, the more our (statistical) algorithms can be optimized. For our purposes, `dgCMatrix` will be the most useful.
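
A minimal sketch tying these class names to objects we have already met; it assumes the `X.2` matrix created at the start of the chapter.

```
class(X.2)                              # "dgCMatrix": general doubles, CSC
X.2.triplet <- as(X.2, "TsparseMatrix") # the same data in triplet form
class(X.2.triplet)                      # "dgTMatrix"
S <- crossprod(X.2)                     # X'X is symmetric, so Matrix stores it as a
class(S)                                # "dsCMatrix"
```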

### 14.2.2 The glmnet Package

As previously stated, efficient storage of the `model.matrix` is only half of the story. We now need implementations of our favorite statistical algorithms that make use of this representation. At the time of writing, a very useful package that does so is **glmnet**, which allows fitting linear and generalized linear models with ridge, lasso, and elastic-net regularization, all using the sparse matrices of the **Matrix** package.

The following example is taken from John Myles White’s blog, and compares the runtime of fitting an OLS model using `glmnet` with both sparse and dense matrix representations.

```
suppressPackageStartupMessages(library('glmnet'))
set.seed(1)
performance <- data.frame()

for (sim in 1:10){
  n <- 10000
  p <- 500
  nzc <- trunc(p / 10)
  x <- matrix(rnorm(n * p), n, p) # make a dense matrix
  iz <- sample(1:(n * p),
               size = n * p * 0.85,
               replace = FALSE)
  x[iz] <- 0 # sparsify by injecting zeroes
  sx <- Matrix(x, sparse = TRUE) # save as a sparse object
  beta <- rnorm(nzc)
  fx <- x[, seq(nzc)] %*% beta
  eps <- rnorm(n)
  y <- fx + eps # make data

  # Now to the actual model fitting:
  sparse.times <- system.time(fit1 <- glmnet(sx, y)) # sparse glmnet
  full.times <- system.time(fit2 <- glmnet(x, y))    # dense glmnet

  sparse.size <- as.numeric(object.size(sx))
  full.size <- as.numeric(object.size(x))

  performance <- rbind(performance, data.frame(Format = 'Sparse',
                                               UserTime = sparse.times[1],
                                               SystemTime = sparse.times[2],
                                               ElapsedTime = sparse.times[3],
                                               Size = sparse.size))
  performance <- rbind(performance, data.frame(Format = 'Full',
                                               UserTime = full.times[1],
                                               SystemTime = full.times[2],
                                               ElapsedTime = full.times[3],
                                               Size = full.size))
}
```

Things to note:

- The simulation calls `glmnet` twice: once with the non-sparse object `x`, and once with its sparse version `sx`.
- The degree of sparsity of `sx` is \(85\%\). We know this because we “injected” zeroes into \(0.85\) of the locations of `x`.
- Because `y` is continuous, `glmnet` will fit a simple OLS model. We will see later how to use it to fit GLMs and to use lasso, ridge, and elastic-net regularization.

We now inspect the computing time, and the memory footprint, only to discover that sparse representations make a BIG difference.

```
suppressPackageStartupMessages(library('ggplot2'))
ggplot(performance, aes(x = Format, y = ElapsedTime, fill = Format)) +
  stat_summary(fun.data = 'mean_cl_boot', geom = 'bar') +
  stat_summary(fun.data = 'mean_cl_boot', geom = 'errorbar') +
  ylab('Elapsed Time in Seconds')
```

```
ggplot(performance, aes(x = Format, y = Size / 1000000, fill = Format)) +
  stat_summary(fun.data = 'mean_cl_boot', geom = 'bar') +
  stat_summary(fun.data = 'mean_cl_boot', geom = 'errorbar') +
  ylab('Matrix Size in MB')
```

How do we perform other types of regression with **glmnet**? We simply use the `family` and `alpha` arguments of `glmnet::glmnet`. The `family` argument governs the type of GLM to fit: logistic, Poisson, probit, or other types of GLM. The `alpha` argument controls the type of regularization: set `alpha=0` for ridge, `alpha=1` for lasso, and any value in between for elastic-net regularization.
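
For instance, here is a minimal sketch of lasso- and ridge-regularized logistic regressions, reusing the `sx` and `y` objects from the simulation above; the binarized response `y.class` is invented purely for the illustration.

```
y.class <- as.numeric(y > 0) # an artificial binary response, for illustration only
fit.lasso <- glmnet(sx, y.class, family = "binomial", alpha = 1) # alpha = 1: lasso
fit.ridge <- glmnet(sx, y.class, family = "binomial", alpha = 0) # alpha = 0: ridge
```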

### 14.2.3 The MatrixModels Package

The **MatrixModels** package is designed to fit various models (linear, non-linear, generalized) using sparse matrices. The function `MatrixModels::glm4` can easily replace `stats::glm` for all your needs. Unlike **glmnet**, however, the **MatrixModels** package will not offer you model regularization.
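
A minimal sketch of `glm4`, with the data re-created so the chunk is self-contained (the earlier simulation overwrote `x` and `y`); note the `sparse` argument, which asks `glm4` to build a sparse model matrix internally.

```
suppressPackageStartupMessages(library(MatrixModels))
# Re-create a smaller version of the chapter's first example.
set.seed(1)
reps <- 1e5
df <- data.frame(y = rnorm(reps),
                 x = factor(sample(letters, reps, replace = TRUE)))
fit.sparse <- glm4(y ~ x, family = gaussian(), data = df, sparse = TRUE)
```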

### 14.2.4 The SparseM Package

The **SparseM** package provides basic linear algebra with sparse matrices, using its own sparse matrix classes.
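
A brief, hedged sketch of what this looks like: `as.matrix.csr` converts a dense matrix to SparseM’s own CSR class, and `slm.fit` is a sparse analogue of `lm.fit`. The data below are made up for the illustration.

```
suppressPackageStartupMessages(library(SparseM))
set.seed(1)
X <- matrix(rnorm(1e4), 1e3, 10) # a small dense design matrix
X[abs(X) < 1] <- 0               # sparsify it
X.csr <- as.matrix.csr(X)        # convert to SparseM's CSR class
class(X.csr)                     # "matrix.csr"
y <- rnorm(1e3)
fit <- slm.fit(X.csr, y)         # sparse least squares, analogous to lm.fit
```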

## 14.3 Beyond Sparsity

When you think about it, a sparse matrix representation is nothing but a combination of lossless compression with accompanying matrix algorithms. Can this combination be leveraged when matrices are not sparse? At the time of writing, I am unaware of R objects that exploit this idea, but it is generally possible. See Elgohary et al. (2017) for details.

## 14.4 Bibliographic Notes

The best place to start reading on sparse representations and algorithms is the vignettes of the **Matrix** package. Gilbert, Moler, and Schreiber (1992) is also a great read for some general background. For the theory on solving sparse linear systems see Davis (2006). For general numerical linear algebra see Gentle (2012).

## 14.5 Practice Yourself

1. What is the CSC representation of the following matrix: \[\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 6 \\ 1 & 0 & 1 \end{bmatrix}\]
2. Write a function that takes two matrices in CSC and returns their matrix product.

### References

Davis, Timothy A. 2006. *Direct Methods for Sparse Linear Systems*. SIAM.

Elgohary, Ahmed, Matthias Boehm, Peter J Haas, Frederick R Reiss, and Berthold Reinwald. 2017. “Scaling Machine Learning via Compressed Linear Algebra.” *ACM SIGMOD Record* 46 (1). ACM: 42–49.

Gentle, James E. 2012. *Numerical Linear Algebra for Applications in Statistics*. Springer Science & Business Media.

Gilbert, John R, Cleve Moler, and Robert Schreiber. 1992. “Sparse Matrices in Matlab: Design and Implementation.” *SIAM Journal on Matrix Analysis and Applications* 13 (1). SIAM: 333–56.

Shah, Viral, and John R Gilbert. 2004. “Sparse Matrices in Matlab* P: Design and Implementation.” In *International Conference on High-Performance Computing*, 144–55. Springer.