Chapter 21 The Hadleyverse

The Hadleyverse, short for “Hadley Wickham’s universe”, is a set of packages that make it easier to handle data. If you are developing packages, you should be careful since using these packages may create many dependencies and compatibility issues. If you are analyzing data, and the portability of your functions to other users, machines, and operating systems is not of a concern, you will LOVE these packages. The term Hadleyverse refers to all of Hadley’s packages, but here, we mention only a useful subset, which can be collectively installed via the tidyverse package:

ggplot2 for data visualization. See the Plotting Chapter 12.
dplyr for data manipulation.
tidyr for data tidying.
readr for data import.
stringr for character strings.
anytime for time data.

21.1 readr

The readr package (Wickham, Hester, and Francois 2016) replaces base functions for importing and exporting data such as read.table. It is faster, with a cleaner syntax.

We will not go into the details and refer the reader to the official documentation here and the R for data sciecne book.

21.2 dplyr

When you think of data frame operations, think dplyr (Wickham and Francois 2016). Notable utilities in the package include:

select() Select columns from a data frame.
filter() Filter rows according to some condition(s).
arrange() Sort / Re-order rows in a data frame.
mutate() Create new columns or transform existing ones.
group_by() Group a data frame by some factor(s) usually in conjunction to summary.
summarize() Summarize some values from the data frame or across groups.
inner_join(x,y,by="col")return all rows from ‘x’ where there are matching values in ‘x’, and all columns from ‘x’ and ‘y’. If there are multiple matches between ‘x’ and ‘y’, all combination of the matches are returned.
left_join(x,y,by="col") return all rows from ‘x’, and all columns from ‘x’ and ‘y’. Rows in ‘x’ with no match in ‘y’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned.
right_join(x,y,by="col") return all rows from ‘y’, and all columns from ‘x’ and y. Rows in ‘y’ with no match in ‘x’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned.
anti_join(x,y,by="col") return all rows from ‘x’ where there are not matching values in ‘y’, keeping just columns from ‘x’.

The following example involve data.frame objects, but dplyr can handle other classes. In particular data.tables from the data.table package (Dowle and Srinivasan 2017), which is designed for very large data sets.

dplyr can work with data stored in a database. In which case, it will convert your command to the appropriate SQL syntax, and issue it to the database. This has the advantage that (a) you do not need to know the specific SQL implementation of your database, and (b), you can enjoy the optimized algorithms provided by the database supplier. For more on this, see the databses vignette.

The following examples are taken from Kevin Markham. The nycflights13::flights has delay data for US flights.

library(nycflights13)
flights

## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, ind <int>

The data is of class tbl_df which is an extension of the data.frame class, designed for large data sets. Notice that the printing of flights is short, even without calling the head function. This is a feature of the tbl_df class ( print(data.frame) would try to load all the data, thus take a long time).

class(flights) # a tbl_df is an extension of the data.frame class

## [1] "tbl_df"     "tbl"        "data.frame"

Let’s filter the observations from the first day of the first month. Notice how much better (i.e. readable) is the dplyr syntax, with piping, compared to the basic syntax.

flights[flights$month == 1 & flights$day == 1, ] # old style

library(dplyr) 
filter(flights, month == 1, day == 1) #dplyr style
flights %>% filter(month == 1, day == 1) # dplyr with piping.

More filtering.

filter(flights, month == 1 | month == 2) # First OR second month.
slice(flights, 1:10) # selects first ten rows.

arrange(flights, year, month, day) # sort
arrange(flights, desc(arr_delay)) # sort descending

select(flights, year, month, day) # select columns year, month, and day
select(flights, year:day) # select column range
select(flights, -(year:day)) # drop columns
rename(flights, c(tail_num = "tailnum")) # rename column

# add a new computed colume
mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60) 

# you can refer to columns you just created! (gain)
mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# keep only new variables, not all data frame.
transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# simple statistics
summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE)
  )
# random subsample 
sample_n(flights, 10) 
sample_frac(flights, 0.01)

We now perform operations on subgroups. we group observations along the plane’s tail number (tailnum), and compute the count, average distance traveled, and average delay. We group with group_by, and compute subgroup statistics with summarise.

by_tailnum <- group_by(flights, tailnum)

delay <- summarise(by_tailnum,
  count = length(),
  avg.dist = mean(distance, na.rm = TRUE),
  avg.delay = mean(arr_delay, na.rm = TRUE))

delay

We can group along several variables, with a hierarchy. We then collapse the hierarchy one by one.

daily <- group_by(flights, year, month, day)
per_day   <- summarise(daily, flights = n())
per_month <- summarise(per_day, flights = sum(flights))
per_year  <- summarise(per_month, flights = sum(flights))

Things to note:

Every call to summarise collapses one level in the hierarchy of grouping. The output of group_by recalls the hierarchy of aggregation, and collapses along this hierarchy.

We can use dplyr for two table operations, i.e., joins. For this, we join the flight data, with the airplane data in airplanes.

library(dplyr) 
airlines

## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

# select the subset of interesting flight data. 
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier) 

# join on left table with automatic matching.
flights2 %>% left_join(airlines)

## # A tibble: 336,776 x 9
##     year month   day  hour origin dest  tailnum carrier name               
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>              
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines I…
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines I…
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines …
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways    
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines In…
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines I…
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways    
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airline…
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways    
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines …
## # … with 336,766 more rows

flights2 %>% left_join(weather)

## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <dbl> <dbl> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

# join with named matching
flights2 %>% left_join(planes, by = "tailnum")

## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type 
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixe…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixe…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixe…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA> 
## # … with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

# join with explicit column matching
flights2 %>% left_join(airports, by= c("dest" = "faa"))

## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Geor…  30.0 -95.3
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Geor…  30.0 -95.3
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>   NA    NA  
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chic…  42.0 -87.9
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort…  26.1 -80.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Wash…  38.9 -77.5
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orla…  28.4 -81.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chic…  42.0 -87.9
## # … with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## #   dst <chr>, tzone <chr>

Types of join with SQL equivalent.

# Create simple data
(df1 <- data_frame(x = c(1, 2), y = 2:1))

## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     2
## 2     2     1

(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))

## # A tibble: 2 x 3
##       x     a b    
##   <dbl> <dbl> <chr>
## 1     1    10 a    
## 2     3    10 a

# Return only matched rows
df1 %>% inner_join(df2) # SELECT * FROM x JOIN y ON x.a = y.a

## # A tibble: 1 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a

# Return all rows in df1.
df1 %>% left_join(df2) # SELECT * FROM x LEFT JOIN y ON x.a = y.a

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>

# Return all rows in df2.
df1 %>% right_join(df2) # SELECT * FROM x RIGHT JOIN y ON x.a = y.a

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     3    NA    10 a

# Return all rows. 
df1 %>% full_join(df2) # SELECT * FROM x FULL JOIN y ON x.a = y.a

## # A tibble: 3 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA> 
## 3     3    NA    10 a

# Like left_join, but returning only columns in df1
df1 %>% semi_join(df2, by = "x")  # SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)

## # A tibble: 1 x 2
##       x     y
##   <dbl> <int>
## 1     1     2

21.3 tidyr

21.4 reshape2

21.5 stringr

21.6 anytime

21.7 Biblipgraphic Notes

21.8 Practice Yourself

References

Dowle, Matt, and Arun Srinivasan. 2017. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.

Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.