Chapter 15 Parallel Computing

You would think that because you have an expensive multicore computer, your computations would speed up. Well, no. At least not unless you make sure they do. By default, no matter how many cores you have, the operating system will allocate each R session to a single core.

For starters, we need to distinguish between two types of parallelism:

  1. Explicit parallelism: where the user handles the parallelisation.
  2. Implicit parallelism: where the parallelisation is abstracted away from the user.

Clearly, implicit parallelism is more desirable, but the state of mathematical computing is such that no sufficiently general implicit parallelism framework exists. The R Consortium is currently financing a major project for a “Unified Framework For Distributed Computing in R”, so we can expect things to change soon. In the meantime, most parallel implementations in R are explicit.

15.1 Implicit Parallelism

We will not elaborate on implicit parallelism, except for mentioning the following:

  • You can enjoy parallel linear algebra by replacing R’s default BLAS and LAPACK libraries with optimised, multithreaded implementations (such as OpenBLAS); see the snippet after this list for checking which libraries your session uses.
  • You should read the “Parallel computing: Implicit parallelism” section in the excellent High Performance Computing task view, for the latest developments in implicit parallelism.
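
Assuming a reasonably recent R installation (the BLAS/LAPACK paths are printed by sessionInfo() since R 3.4.0), you can check which linear algebra libraries your session is linked against; a minimal sketch:

library(base) # the functions below are in base R
sessionInfo() # prints, among other things, the BLAS and LAPACK library paths in use
La_version()  # the LAPACK version the session is using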

15.2 Explicit Parallelism

R provides many frameworks for explicit parallelism. Because the parallelism is initiated by the user, we first need to decide when to parallelise. As a rule of thumb, you want to parallelise when you encounter a CPU bottleneck, not a memory bottleneck. Memory bottlenecks are relieved with sparsity (Chapter 13) or with efficient memory usage (Chapter 14).

Several ways to diagnose your bottleneck include:

  • Keep the Windows Task Manager, or the Linux top utility, open, and watch the CPU and RAM loads.
  • If the computation takes a long time and, when you stop it by pressing ESC, R becomes responsive immediately, you have a CPU bottleneck. If R is not immediately responsive, you have a memory bottleneck.
  • Profile your code; see Hadley Wickham’s profiling guide. A minimal timing-and-profiling sketch follows this list.
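
Here is a small sketch of the timing-and-profiling route using only base R; the sorting call is just a placeholder for your own computation:

# Time a call: user+system time close to the elapsed time suggests a CPU bottleneck
system.time(replicate(100, sort(runif(1e5))))

# Sample the call stack to see where the time is actually spent
Rprof("profile.out")
invisible(replicate(100, sort(runif(1e5))))
Rprof(NULL)
summaryRprof("profile.out")$by.self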

For reasons detailed in Kane et al. (2013), we will present the foreach parallelisation package (Analytics and Weston 2015). It will allow us to:

  1. Decouple our parallel algorithm from the parallelisation mechanism: we write parallelisable code once, and can then switch the underlying parallelisation mechanism.

  2. Combine with the big.matrix object from Chapter 14 for shared-memory parallelisation: all the machines may see the same data, so that we don’t need to export objects from machine to machine.

What do we mean by “switch the underlying parallelisation mechanism”? It means there are several packages that handle the communication between machines. Some are very general and will work on any cluster. Some are more specific and will work only on a single multicore machine (not a cluster) with a particular operating system. These mechanisms include multicore, snow, parallel, and Rmpi. The compatibility between these mechanisms and foreach is provided by another set of packages: doMC, doMPI, doRedis, doParallel, and doSNOW.

Remark. I personally prefer the multicore mechanism, with the doMC adapter for foreach. I will not use this combo, however, because multicore will not work on Windows machines. I will thus use the more general snow and doParallel combo. If you do happen to run on Linux or Unix, you will want to replace all the doParallel functionality with doMC.
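
For completeness, here is a minimal sketch of the doMC registration on Linux/Unix (assuming the doMC package is installed); the foreach code in the rest of the chapter stays unchanged:

library(doMC)             # loads foreach as well
registerDoMC(cores = 2)   # forked workers; not available on Windows
foreach(i = 1:3) %dopar% sqrt(i)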

Let’s start with a simple example, taken from “Getting Started with doParallel and foreach”.

library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
result <- foreach(i=1:3) %dopar% sqrt(i)
class(result)
## [1] "list"
result
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051

Things to note:

  • makeCluster creates an object with the information about our cluster. On a single machine it is very simple. On a cluster of machines, you will need to specify the IP addresses, or other identifiers, of the machines.
  • registerDoParallel is used to inform the foreach package of the presence of our cluster.
  • The foreach function handles the looping. In particular, note the %dopar% operator that ensures that the looping is done in parallel. %dopar% can be replaced by %do% if you want serial looping (like the for loop), for instance, for debugging.
  • The output of the various machines is collected by foreach into a list object.
  • In this simple example, no data is shared between machines, so we are not putting the shared-memory capabilities to the test.
  • We can check how many workers were involved using the getDoParWorkers() function.
  • We can check the parallelisation mechanism used with the getDoParName() function. Both are demonstrated below.
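
Continuing the example above, with the doParallel cluster still registered:

getDoParWorkers()   # how many workers are registered (2 in our case)
getDoParName()      # which parallelisation mechanism is registered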

Here is a more involved example. We now perform bootstrap inference on the coefficients of a logistic regression. Bootstrapping means that in each iteration we resample the data and refit the model.

x <- iris[which(iris[, 5] != "setosa"), c(1, 5)]  # 100 rows: two species, one predictor
trials <- 1e4
ptime <- system.time({
  r <- foreach(icount(trials), .combine = cbind) %dopar% {
    ind <- sample(100, 100, replace = TRUE)  # resample the rows with replacement
    result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(result1)
  }
})[3]
ptime
## elapsed 
##  24.235

Things to note:

  • As usual, we use the foreach function with the %dopar% operator to loop in parallel.
  • The icount function (from the iterators package, which is loaded with doParallel) generates a counter that iterates trials times.
  • The .combine=cbind argument tells the foreach function how to combine the outputs of the different machines, so that the returned object is not the default list. A sketch of other common .combine choices follows this list.
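
To get a feel for .combine, here is a small sketch of common alternatives; any function that combines results will do:

foreach(i = 1:3, .combine = c)     %dopar% sqrt(i)   # concatenate into a vector
foreach(i = 1:3, .combine = rbind) %dopar% sqrt(i)   # bind results as rows of a matrix
foreach(i = 1:3, .combine = "+")   %dopar% sqrt(i)   # reduce with a running sum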

How long would that have taken in a simple (serial) loop? We only need to replace %dopar% with %do% to test.

stime <- system.time({
  r <- foreach(icount(trials), .combine = cbind) %do% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(result1)
  }
})[3]
stime
## elapsed 
##  45.089

Yes. Parallelising is clearly faster.
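
The speedup can be computed directly from the two timings we saved:

stime / ptime   # speedup factor of the parallel run over the serial run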

Let’s see how we can combine the power of bigmemory and foreach by using a file-mapped big.matrix object, which is shared by all machines. The following example is taken from Kane et al. (2013), and uses the big.matrix object we created in Chapter 14.

library(bigmemory)
x <- attach.big.matrix("airline.desc")

library(foreach)
library(doSNOW)
cl <- makeSOCKcluster(rep("localhost", 4)) # a cluster of 4 worker processes on this machine
registerDoSNOW(cl) # register the workers for foreach()

Get a “descriptor” of the big.matrix object; each worker will use it to attach the shared matrix.

xdesc <- describe(x) 

Split the row indices of the data by the values of BENE_AGE_CAT_CD.

G <- split(1:nrow(x), x[, "BENE_AGE_CAT_CD"]) 

Define a function that computes quantiles of CAR_LINE_ICD9_DGNS_CD.

GetDepQuantiles <- function(rows, data) {
  quantile(data[rows, "CAR_LINE_ICD9_DGNS_CD"],
           probs = c(0.5, 0.9, 0.99),
           na.rm = TRUE)
}

We are all set up to loop, in parallel, and compute quantiles of CAR_LINE_ICD9_DGNS_CD for each value of BENE_AGE_CAT_CD.

qs <- foreach(g = G, .combine = rbind) %dopar% {
  require("bigmemory")            # load bigmemory on the worker
  x <- attach.big.matrix(xdesc)   # attach the shared matrix via its descriptor
  GetDepQuantiles(rows = g, data = x)
}
qs
qs
##          50% 90% 99%
## result.1 558 793 996
## result.2 518 789 996
## result.3 514 789 996
## result.4 511 789 996
## result.5 511 790 996
## result.6 518 796 995
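
When you are done with the cluster, it is good practice to release the workers:

stopCluster(cl)   # shut down the 4 worker processes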

15.2.1 Caution: Implicit with Explicit Parallelism

When an optimised, multithreaded BLAS (Section 15.1) is combined with explicit parallelisation (Section 15.2), each foreach worker may itself spawn several BLAS threads. The number of threads then multiplies, the cores are oversubscribed, and the computation may actually slow down. If you use both, cap the number of threads used by the implicit mechanism on each worker, typically at one.
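
A minimal sketch, assuming the RhpcBLASctl package is installed, of capping the implicit parallelism before registering explicit workers:

library(RhpcBLASctl)
blas_set_num_threads(1)  # limit the multithreaded BLAS to a single thread
omp_set_num_threads(1)   # and the same for OpenMP-enabled code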

15.3 Bibliographic Notes

For a brief and excellent explanation of parallel computing in R, see Schmidberger et al. (2009). For a full review, see Chapple et al. (2016). For an up-to-date list of packages supporting parallel programming, see the High Performance Computing task view.

15.4 Practice Yourself

References

Analytics, Revolution, and Steve Weston. 2015. Foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.

Chapple, Simon R., Eilidh Troup, Thorsten Forster, and Terence Sloan. 2016. Mastering Parallel Programming with R. Packt Publishing Ltd.

Kane, Michael J., John Emerson, Stephen Weston, and others. 2013. “Scalable Strategies for Computing with Massive Data.” Journal of Statistical Software 55 (14): 1–19.

Schmidberger, Markus, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, and Ulrich Mansmann. 2009. “State of the Art in Parallel Computing with R.” Journal of Statistical Software 47 (1).