Chapter 11 Plotting

Whether you are doing EDA, or preparing your results for publication, you need plots. R has many plotting mechanisms, allowing the user a tremendous amount of flexibility, while abstracting away a lot of the tedious details. To be concrete, many of the plots in R are simply impossible to produce with Excel, SPSS, or SAS, and would take a tremendous amount of work to produce with Python, Java and lower level programming languages.

In this text, we will focus on two plotting packages. The basic graphics package, distributed with the base R distribution, and the ggplot2 package.

Before going into the details of the plotting packages, we start with some philosophy. The graphics package originates from the mainframe days. Computers had no graphical interface, and the output of the plot was immediately sent to a printer. Once a plot has been produced with the graphics package, just like a printed output, it cannot be queried nor changed, except for further additions.

The philosophy of R is that everyting is an object. The graphics package does not adhere to this philosophy, and indeed it was soon augmented with the grid package (R Core Team 2016), that treats plots as objects. grid is a low level graphics interface, and users may be more familiar with the lattice package built upon it (Sarkar 2008).

lattice is very powerful, but soon enough, it was overtaken in popularity by the ggplot2 package (Wickham 2009). ggplot2 was the PhD project of Hadley Wickham, a name to remember… Two fundamental ideas underlay ggplot2: (i) everything is an object, and (ii), plots can be described by a simple grammar, i.e., a language to describe the building blocks of the plot. The grammar in ggplot2 are is the one stated by L. Wilkinson (2006). The objects and grammar of ggplot2 have later evolved to allow more complicated plotting and in particular, interactive plotting.

Interactive plotting is a very important feature for EDA, and reporting. The major leap in interactive plotting was made possible by the advancement of web technologies, such as JavaScript and D3.JS. Why is this? Because an interactive plot, or report, can be seen as a web-site. Building upon the capabilities of JavaScript and your web browser to provide the interactivity, greatly facilitates the development of such plots, as the programmer can rely on the web-browsers capabilities for interactivity.

11.1 The graphics System

The R code from the Basics Chapter 3 is a demonstration of the graphics package and plotting system. We make a quick review of the basics.

11.1.1 Using Existing Plotting Functions

11.1.1.1 Scatter Plot

A simple scatter plot.

attach(trees)
plot(Girth ~ Height)

Various types of plots.

par.old <- par(no.readonly = TRUE)
par(mfrow=c(2,3))
plot(Girth, type='h', main="type='h'") 
plot(Girth, type='o', main="type='o'") 
plot(Girth, type='l', main="type='l'")
plot(Girth, type='s', main="type='s'")
plot(Girth, type='b', main="type='b'")
plot(Girth, type='p', main="type='p'")

par(par.old)

Things to note:

  • The par command controls the plotting parameters. mfrow=c(2,3) is used to produce a matrix of plots with 2 rows and 3 columns.
  • The par.old object saves the original plotting setting. It is restored after plotting using par(par.old).
  • The type argument controls the type of plot.
  • The main argument controls the title.
  • See ?plot and ?par for more options.

Control the plotting characters with the pch argument, and size with the cex argument.

plot(Girth, pch='+', cex=3)

Control the line’s type with lty argument, and width with lwd.

par(mfrow=c(2,3))
plot(Girth, type='l', lty=1, lwd=2)
plot(Girth, type='l', lty=2, lwd=2)
plot(Girth, type='l', lty=3, lwd=2)
plot(Girth, type='l', lty=4, lwd=2)
plot(Girth, type='l', lty=5, lwd=2)
plot(Girth, type='l', lty=6, lwd=2)

Add line by slope and intercept with abline.

plot(Girth)
abline(v=14, col='red') # vertical line at 14.
abline(h=9, lty=4,lwd=4, col='pink') # horizontal line at 9.
abline(a = 0, b=1) # linear line with intercept a=0, and slope b=1.

plot(Girth)
points(x=1:30, y=rep(12,30), cex=0.5, col='darkblue')
lines(x=rep(c(5,10), 7), y=7:20, lty=2 )
lines(x=rep(c(5,10), 7)+2, y=7:20, lty=2 )
lines(x=rep(c(5,10), 7)+4, y=7:20, lty=2 , col='darkgreen')
lines(x=rep(c(5,10), 7)+6, y=7:20, lty=4 , col='brown', lwd=4)

Things to note:

  • points adds points on an existing plot.
  • lines adds lines on an existing plot.
  • col controls the color of the element. It takes names or numbers as argument.
  • cex controls the scale of the element. Defaults to cex=1.

Add other elements.

plot(Girth)
segments(x0=rep(c(5,10), 7), y0=7:20, x1=rep(c(5,10), 7)+2, y1=(7:20)+2 ) # line segments
arrows(x0=13,y0=16,x1=16,y1=17) # arrows
rect(xleft=10, ybottom=12,  xright=12, ytop=16) # rectangle
polygon(x=c(10,11,12,11.5,10.5), y=c(9,9.5,10,10.5,9.8), col='grey') # polygon
title(main='This plot makes no sense', sub='Or does it?') 
mtext('Printing in the margins', side=2) # math text
mtext(expression(alpha==log(f[i])), side=4)

Things to note:

  • The following functions add the elements they are named after: segments, arrows, rect, polygon, title.
  • mtext adds mathematical text, which needs to be wrapped in expression(). For more information for mathematical annotation see ?plotmath.

Add a legend.

plot(Girth, pch='G',ylim=c(8,77), xlab='Tree number', ylab='', type='b', col='blue')
points(Volume, pch='V', type='b', col='red')
legend(x=2, y=70, legend=c('Girth', 'Volume'), pch=c('G','V'), col=c('blue','red'), bg='grey')

Adjusting Axes with xlim and ylim.

plot(Girth, xlim=c(0,15), ylim=c(8,12))

Use layout for complicated plot layouts.

A<-matrix(c(1,1,2,3,4,4,5,6), byrow=TRUE, ncol=2)
layout(A,heights=c(1/14,6/14,1/14,6/14))

oma.saved <- par("oma")
par(oma = rep.int(0, 4))
par(oma = oma.saved)
o.par <- par(mar = rep.int(0, 4))
for (i in seq_len(6)) {
    plot.new()
    box()
    text(0.5, 0.5, paste('Box no.',i), cex=3)
}

Always detach.

detach(trees)

11.1.2 Exporting a Plot

The pipeline for exporting graphics is similar to the export of data. Instead of the write.table or save functions, we will use the pdf, tiff, png, functions. Depending on the type of desired output.

Check and set the working directory.

getwd()
setwd("/tmp/")

Export tiff.

tiff(filename='graphicExample.tiff')
plot(rnorm(100))
dev.off()

Things to note:

  • The tiff function tells R to open a .tiff file, and write the output of a plot.
  • Only a single (the last) plot is saved.
  • dev.off to close the tiff device, and return the plotting to the R console (or RStudio).

If you want to produce several plots, you can use a counter in the file’s name. The counter uses the printf format string.

tiff(filename='graphicExample%d.tiff') #Creates a sequence of files
plot(rnorm(100))
boxplot(rnorm(100))
hist(rnorm(100))
dev.off()

To see the list of all open devices use dev.list(). To close all device, (not only the last one), use graphics.off().

See ?pdf and ?jpeg for more info.

11.1.3 Fancy graphics Examples

11.1.3.1 Line Graph

x = 1995:2005
y = c(81.1, 83.1, 84.3, 85.2, 85.4, 86.5, 88.3, 88.6, 90.8, 91.1, 91.3)
plot.new()
plot.window(xlim = range(x), ylim = range(y))
abline(h = -4:4, v = -4:4, col = "lightgrey")
lines(x, y, lwd = 2)
title(main = "A Line Graph Example",
        xlab = "Time",
        ylab = "Quality of R Graphics")
axis(1)
axis(2)
box()

Things to note:

  • plot.new creates a new, empty, plotting device.
  • plot.window determines the limits of the plotting region.
  • axis adds the axes, and box the framing box.
  • The rest of the elements, you already know.

11.1.3.2 Rosette

n = 17
theta = seq(0, 2 * pi, length = n + 1)[1:n]
x = sin(theta)
y = cos(theta)
v1 = rep(1:n, n)
v2 = rep(1:n, rep(n, n))
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1), asp = 1)
segments(x[v1], y[v1], x[v2], y[v2])
box()

11.1.3.3 Arrows

plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1))
arrows(.05, .075, .45, .9, code = 1)
arrows(.55, .9, .95, .075, code = 2)
arrows(.1, 0, .9, 0, code = 3)
text(.5, 1, "A", cex = 1.5)
text(0, 0, "B", cex = 1.5)
text(1, 0, "C", cex = 1.5)

11.1.3.4 Arrows as error bars

x = 1:10
y = runif(10) + rep(c(5, 6.5), c(5, 5))
yl = y - 0.25 - runif(10)/3
yu = y + 0.25 + runif(10)/3
plot.new()
plot.window(xlim = c(0.5, 10.5), ylim = range(yl, yu))
arrows(x, yl, x, yu, code = 3, angle = 90, length = .125)
points(x, y, pch = 19, cex = 1.5)
axis(1, at = 1:10, labels = LETTERS[1:10])
axis(2, las = 1)
box()

11.1.3.5 Histogram

A histogram is nothing but a bunch of rectangle elements.

plot.new()
plot.window(xlim = c(0, 5), ylim = c(0, 10))
rect(0:4, 0, 1:5, c(7, 8, 4, 3), col = "lightblue")
axis(1)
axis(2, las = 1)

11.1.3.5.1 Spiral Squares
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1), asp = 1)
x = c(-1, 1, 1, -1)
y = c( 1, 1, -1, -1)
polygon(x, y, col = "cornsilk")
vertex1 = c(1, 2, 3, 4)
vertex2 = c(2, 3, 4, 1)
for(i in 1:50) {
    x = 0.9 * x[vertex1] + 0.1 * x[vertex2]
    y = 0.9 * y[vertex1] + 0.1 * y[vertex2]
    polygon(x, y, col = "cornsilk")
}

11.1.3.6 Circles

Circles are just dense polygons.

R = 1
xc = 0
yc = 0
n = 72
t = seq(0, 2 * pi, length = n)[1:(n-1)]
x = xc + R * cos(t)
y = yc + R * sin(t)
plot.new()
plot.window(xlim = range(x), ylim = range(y), asp = 1)
polygon(x, y, col = "lightblue", border = "navyblue")

11.1.3.7 Spiral

k = 5
n = k * 72
theta = seq(0, k * 2 * pi, length = n)
R = .98^(1:n - 1)
x = R * cos(theta)
y = R * sin(theta)
plot.new()
plot.window(xlim = range(x), ylim = range(y), asp = 1)
lines(x, y)

11.2 The ggplot2 System

The philosophy of ggplot2 is very different from the graphics device. Recall, in ggplot2, a plot is a object. It can be queried, it can be changed, and among other things, it can be plotted.

ggplot2 provides a convenience function for many plots: qplot. We take a non-typical approach by ignoring qplot, and presenting the fundamental building blocks. Once the building blocks have been understood, mastering qplot will be easy.

The following is taken from UCLA’s idre.

A ggplot2 object will have the following elements:

  • Data the data frame holding the data to be plotted.
  • Aes defines the mapping between variables to their visualization.
  • Geoms are the objects/shapes you add as layers to your graph.
  • Stats are statistical transformations when you are not plotting the raw data, such as the mean or confidence intervals.
  • Faceting splits the data into subsets to create multiple variations of the same graph (paneling).

The nlme::Milk dataset has the protein level of various cows, at various times, with various diets.

library(nlme)
data(Milk)
head(Milk)
## Grouped Data: protein ~ Time | Cow
##   protein Time Cow   Diet
## 1    3.63    1 B01 barley
## 2    3.57    2 B01 barley
## 3    3.47    3 B01 barley
## 4    3.65    4 B01 barley
## 5    3.89    5 B01 barley
## 6    3.73    6 B01 barley
library(ggplot2)
ggplot(data = Milk, aes(x=Time, y=protein)) +
  geom_point()

Things to note:

  • The ggplot function is the constructor of the ggplot2 object. If the object is not assigned, it is plotted.
  • The aes argument tells R that the Time variable in the Milk data is the x axis, and protein is y.
  • The geom_point defines the Geom, i.e., it tells R to plot the points as they are (and not lines, histograms, etc.).
  • The ggplot2 object is build by compounding its various elements separated by the + operator.
  • All the variables that we will need are assumed to be in the Milk data frame. This means that (a) the data needs to be a data frame (not a matrix for instance), and (b) we will not be able to use variables that are not in the Milk data frame.

Let’s add some color.

ggplot(data = Milk, aes(x=Time, y=protein)) +
  geom_point(aes(color=Diet))

The color argument tells R to use the variable Diet as the coloring. A legend is added by default. If we wanted a fixed color, and not a variable dependent color, color would have been put outside the aes function.

ggplot(data = Milk, aes(x=Time, y=protein)) +
  geom_point(color="green")

Let’s save the ggplot2 object so we can reuse it. Notice it is not plotted.

p <- ggplot(data = Milk, aes(x=Time, y=protein)) +
  geom_point()

We can change^{In the Object-Oriented Programming lingo, this is known as mutating} existing plots using the + operator. Here, we add a smoothing line to the plot p.

p + geom_smooth(method = 'gam')

Things to note:

  • The smoothing line is a layer added with the geom_smooth() function.
  • Lacking arguments of its own, the new layer will inherit the aes of the original object, x and y variables in particular.

To split the plot along some variable, we use faceting, done with the facet_wrap function.

p + facet_wrap(~Diet)

Instead of faceting, we can add a layer of the mean of each Diet subgroup, connected by lines.

p + stat_summary(aes(color=Diet), fun.y="mean", geom="line")

Things to note:

  • stat_summary adds a statistical summary.
  • The summary is applied along Diet subgroups, because of the color=Diet aesthetic, which has already split the data.
  • The summary to be applied is the mean, because of fun.y="mean".
  • The group means are connected by lines, because of the geom="line" argument.

What layers can be added using the geoms family of functions?

  • geom_bar: bars with bases on the x-axis.
  • geom_boxplot: boxes-and-whiskers.
  • geom_errorbar: T-shaped error bars.
  • geom_histogram: histogram.
  • geom_line: lines.
  • geom_point: points (scatterplot).
  • geom_ribbon: bands spanning y-values across a range of x-values.
  • geom_smooth: smoothed conditional means (e.g. loess smooth).

To demonstrate the layers added with the geoms_* functions, we start with a histogram.

pro <- ggplot(Milk, aes(x=protein))
pro + geom_histogram(bins=30)

A bar plot.

ggplot(Milk, aes(x=Diet)) +
  geom_bar()

A scatter plot.

tp <- ggplot(Milk, aes(x=Time, y=protein))
tp + geom_point()

A smooth regression plot, reusing the tp object.

tp + geom_smooth(method='gam')

And now, a simple line plot, reusing the tp object, and connecting lines along Cow.

tp + geom_line(aes(group=Cow))

The line plot is completely incomprehensible. Better look at boxplots along time (even if omitting the Cow information).

tp + geom_boxplot(aes(group=Time))

We can do some statistics for each subgroup. The following will compute the mean and standard errors of protein at each time point.

ggplot(Milk, aes(x=Time, y=protein)) +
  stat_summary(fun.data = 'mean_se')

Some popular statistical summaries, have gained their own functions:

  • mean_cl_boot: mean and bootstrapped confidence interval (default 95%).
  • mean_cl_normal: mean and Gaussian (t-distribution based) confidence interval (default 95%).
  • mean_dsl: mean plus or minus standard deviation times some constant (default constant=2).
  • median_hilow: median and outer quantiles (default outer quantiles = 0.025 and 0.975).

For less popular statistical summaries, we may specify the statistical function in stat_summary. The median is a first example.

ggplot(Milk, aes(x=Time, y=protein)) +
  stat_summary(fun.y="median", geom="point")

We can also define our own statistical summaries.

medianlog <- function(y) {median(log(y))}
ggplot(Milk, aes(x=Time, y=protein)) +
  stat_summary(fun.y="medianlog", geom="line")

Faceting allows to split the plotting along some variable. face_wrap tells R to compute the number of columns and rows of plots automatically.

ggplot(Milk, aes(x=protein, color=Diet)) +
  geom_density() +
  facet_wrap(~Time)

facet_grid forces the plot to appear allow rows or columns, using the ~ syntax.

ggplot(Milk, aes(x=Time, y=protein)) +
  geom_point() +
  facet_grid(Diet~.) # `.~Diet` to split along columns and not rows.

To control the looks of the plot, ggplot2 uses themes.

ggplot(Milk, aes(x=Time, y=protein)) +
  geom_point() +
  theme(panel.background=element_rect(fill="lightblue"))

ggplot(Milk, aes(x=Time, y=protein)) +
  geom_point() +
  theme(panel.background=element_blank(),
        axis.title.x=element_blank())

Saving plots can be done using ggplot2::ggsave, or with pdf like the graphics plots:

pdf(file = 'myplot.pdf')
print(tp) # You will need an explicit print command!
dev.off()

Finally, what every user of ggplot2 constantly uses, is the (excellent!) online documentation at http://docs.ggplot2.org.

11.2.1 Extensions of the ggplot2 System

Because ggplot2 plots are R objects, they can be used for computations and altered. Many authors, have thus extended the basic ggplot2 functionality. A list of ggplot2 extensions is curated by Daniel Emaasit at http://www.ggplot2-exts.org. The RStudio team has its own list of recommended packages at RStartHere.

11.3 Interactive Graphics

As already mentioned, the recent and dramatic advancement in interactive visualization was made possible by the advances in web technologies, and the D3.JS JavaScript library in particular. This is because it allows developers to rely on existing libraries designed for web browsing instead of re-implementing interactive visualizations. These libraries are more visually pleasing, and computationally efficient, than anything they could have developed themselves.

The htmlwidgets package does not provide visualization, but rather, it facilitates the creation of new interactive visualizations. This is because it handles all the technical details that are required to use R output within JavaScript visualization libraries.

For a list of interactive visualization tools that rely on htmlwidgets see the RStartsHere page. In the following sections, we discuss a selected subset.

11.3.1 Plotly

You can create nice interactive graphs using plotly::plot_ly:

library(plotly)
set.seed(100)
d <- diamonds[sample(nrow(diamonds), 1000), ]
plot_ly(data = d, x = ~carat, y = ~price, color = ~carat, size = ~carat, text = ~paste("Clarity: ", clarity))

More conveniently, any ggplot2 graph can be made interactive using plotly::ggplotly:

p <- ggplot(data = d, aes(x = carat, y = price)) +
  geom_smooth(aes(colour = cut, fill = cut), method = 'loess') + 
  facet_wrap(~ cut) # make ggplot
ggplotly(p) # from ggplot to plotly

How about exporting plotly objects? Well, a plotly object is nothing more than a little web site: an HTML file. When showing a plotly figure, RStudio merely servers you as a web browser. You could, alternatively, export this HTML file to send your colleagues as an email attachment, or embed it in a web site. To export these, use the plotly::export or the htmlwidgets::saveWidget functions.

For more on plotly see https://plot.ly/r/.

11.4 Bibliographic Notes

For the graphics package, see R Core Team (2016). For ggplot2 see Wickham (2009). For the theory underlying ggplot2, i.e. the Grammer of Graphics, see L. Wilkinson (2006). A video by one of my heroes, Brian Caffo, discussing graphics vs. ggplot2.

11.5 Practice Yourself

  1. Go to the Fancy Graphics Section 11.1.3. Try parsing the commands in your head.

  2. Recall the medianlog example and replace the medianlog function with a harmonic mean.

    medianlog <- function(y) {median(log(y))}
    ggplot(Milk, aes(x=Time, y=protein)) +
      stat_summary(fun.y="medianlog", geom="line")
    ```
  3. Write a function that creates a boxplot from scratch. See how I built a line graph in Section 11.1.3.

  4. Export my plotly example using the RStudio interface and send it to yourself by email.

ggplot2:

  1. Read about the “oats” dataset using ? MASS::oats.
    1. Inspect, visually, the dependency of the yield (Y) in the Varieties (V) and the Nitrogen treatment (N).
    2. Compute the mean and the standard error of the yield for every value of Varieties and Nitrogen treatment.
    3. Change the axis labels to be informative with labs function and give a title to the plot with ggtitle function.
  2. Read about the “mtcars” data set using ? mtcars.
    1. Inspect, visually, the dependency of the Fuel consumption (mpg) in the weight (wt)
    2. Inspect, visually, the assumption that the Fuel consumption also depends on the number of cylinders.
    3. Is there an interaction between the number of cylinders to the weight (i.e. the slope of the regression line is different between the number of cylinders)? Use geom_smooth.

References

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wilkinson, Leland. 2006. The Grammar of Graphics. Springer Science & Business Media.