Chapter 23 The Hadleyverse

The Hadleyverse, short for “Hadley Wickham’s universe”, is a set of packages that make it easier to handle data. If you are developing packages, you should be careful since using these packages may create many dependencies and compatibility issues. If you are analyzing data, and the portability of your functions to other users, machines, and operating systems is not of a concern, you will LOVE these packages. The term Hadleyverse refers to all of Hadley’s packages, but here, we mention only a useful subset, which can be collectively installed via the tidyverse package:

  • ggplot2 for data visualization. See the Plotting Chapter 11.
  • dplyr for data manipulation.
  • tidyr for data tidying.
  • readr for data import.
  • stringr for character strings.
  • anytime for time data.

23.1 readr

The readr package (Wickham, Hester, and Francois 2016) replaces base functions for importing and exporting data such as read.table. It is faster, with a cleaner syntax.

We will not go into the details and refer the reader to the official documentation here and the R for data sciecne book.

23.2 dplyr

When you think of data frame operations, think dplyr (Wickham and Francois 2016). Notable utilities in the package include:

  • select() Select columns from a data frame.
  • filter() Filter rows according to some condition(s).
  • arrange() Sort / Re-order rows in a data frame.
  • mutate() Create new columns or transform existing ones.
  • group_by() Group a data frame by some factor(s) usually in conjunction to summary.
  • summarize() Summarize some values from the data frame or across groups.
  • inner_join(x,y,by="col")return all rows from ‘x’ where there are matching values in ‘x’, and all columns from ‘x’ and ‘y’. If there are multiple matches between ‘x’ and ‘y’, all combination of the matches are returned.
  • left_join(x,y,by="col") return all rows from ‘x’, and all columns from ‘x’ and ‘y’. Rows in ‘x’ with no match in ‘y’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned.
  • right_join(x,y,by="col") return all rows from ‘y’, and all columns from ‘x’ and y. Rows in ‘y’ with no match in ‘x’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned.
  • anti_join(x,y,by="col") return all rows from ‘x’ where there are not matching values in ‘y’, keeping just columns from ‘x’.

The following example involve data.frame objects, but dplyr can handle other classes. In particular data.tables from the data.table package (Dowle and Srinivasan 2017), which is designed for very large data sets.

dplyr can work with data stored in a database. In which case, it will convert your command to the appropriate SQL syntax, and issue it to the database. This has the advantage that (a) you do not need to know the specific SQL implementation of your database, and (b), you can enjoy the optimized algorithms provided by the database supplier. For more on this, see the databses vignette.

The following examples are taken from Kevin Markham. The nycflights13::flights has delay data for US flights.

library(nycflights13)
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The data is of class tbl_df which is an extension of the data.frame class, designed for large data sets. Notice that the printing of flights is short, even without calling the head function. This is a feature of the tbl_df class ( print(data.frame) would try to load all the data, thus take a long time).

class(flights) # a tbl_df is an extension of the data.frame class
## [1] "tbl_df"     "tbl"        "data.frame"

Let’s filter the observations from the first day of the first month. Notice how much better (i.e. readable) is the dplyr syntax, with piping, compared to the basic syntax.

flights[flights$month == 1 & flights$day == 1, ] # old style

library(dplyr) 
filter(flights, month == 1, day == 1) #dplyr style
flights %>% filter(month == 1, day == 1) # dplyr with piping.

More filtering.

filter(flights, month == 1 | month == 2) # First OR second month.
slice(flights, 1:10) # selects first ten rows.

arrange(flights, year, month, day) # sort
arrange(flights, desc(arr_delay)) # sort descending

select(flights, year, month, day) # select columns year, month, and day
select(flights, year:day) # select column range
select(flights, -(year:day)) # drop columns
rename(flights, tail_num = tailnum) # rename column

# add a new computed colume
mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60) 

# you can refer to columns you just created! (gain)
mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# keep only new variables, not all data frame.
transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# simple statistics
summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE)
  )
# random subsample 
sample_n(flights, 10) 
sample_frac(flights, 0.01) 

We now perform operations on subgroups. we group observations along the plane’s tail number (tailnum), and compute the count, average distance traveled, and average delay. We group with group_by, and compute subgroup statistics with summarise.

by_tailnum <- group_by(flights, tailnum)

delay <- summarise(by_tailnum,
  count = n(),
  avg.dist = mean(distance, na.rm = TRUE),
  avg.delay = mean(arr_delay, na.rm = TRUE))

delay
## # A tibble: 4,044 x 4
##    tailnum count avg.dist  avg.delay
##      <chr> <int>    <dbl>      <dbl>
##  1  D942DN     4 854.5000 31.5000000
##  2  N0EGMQ   371 676.1887  9.9829545
##  3  N10156   153 757.9477 12.7172414
##  4  N102UW    48 535.8750  2.9375000
##  5  N103US    46 535.1957 -6.9347826
##  6  N104UW    47 535.2553  1.8043478
##  7  N10575   289 519.7024 20.6914498
##  8  N105UW    45 524.8444 -0.2666667
##  9  N107US    41 528.7073 -5.7317073
## 10  N108UW    60 534.5000 -1.2500000
## # ... with 4,034 more rows

We can group along several variables, with a hierarchy. We then collapse the hierarchy one by one.

daily <- group_by(flights, year, month, day)
per_day   <- summarise(daily, flights = n())
per_month <- summarise(per_day, flights = sum(flights))
per_year  <- summarise(per_month, flights = sum(flights))

Things to note:

  • Every call to summarise collapses one level in the hierarchy of grouping. The output of group_by recalls the hierarchy of aggregation, and collapses along this hierarchy.

We can use dplyr for two table operations, i.e., joins. For this, we join the flight data, with the airplane data in airplanes.

library(dplyr) 
airlines  
## # A tibble: 16 x 2
##    carrier                        name
##      <chr>                       <chr>
##  1      9E           Endeavor Air Inc.
##  2      AA      American Airlines Inc.
##  3      AS        Alaska Airlines Inc.
##  4      B6             JetBlue Airways
##  5      DL        Delta Air Lines Inc.
##  6      EV    ExpressJet Airlines Inc.
##  7      F9      Frontier Airlines Inc.
##  8      FL AirTran Airways Corporation
##  9      HA      Hawaiian Airlines Inc.
## 10      MQ                   Envoy Air
## 11      OO       SkyWest Airlines Inc.
## 12      UA       United Air Lines Inc.
## 13      US             US Airways Inc.
## 14      VX              Virgin America
## 15      WN      Southwest Airlines Co.
## 16      YV          Mesa Airlines Inc.
# select the subset of interesting flight data. 
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier) 

# join on left table with automatic matching.
flights2 %>% left_join(airlines) 
## Joining, by = "carrier"
## # A tibble: 336,776 x 9
##     year month   day  hour origin  dest tailnum carrier
##    <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
##  1  2013     1     1     5    EWR   IAH  N14228      UA
##  2  2013     1     1     5    LGA   IAH  N24211      UA
##  3  2013     1     1     5    JFK   MIA  N619AA      AA
##  4  2013     1     1     5    JFK   BQN  N804JB      B6
##  5  2013     1     1     6    LGA   ATL  N668DN      DL
##  6  2013     1     1     5    EWR   ORD  N39463      UA
##  7  2013     1     1     6    EWR   FLL  N516JB      B6
##  8  2013     1     1     6    LGA   IAD  N829AS      EV
##  9  2013     1     1     6    JFK   MCO  N593JB      B6
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA
## # ... with 336,766 more rows, and 1 more variables: name <chr>
flights2 %>% left_join(weather) 
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin  dest tailnum carrier  temp  dewp humid
##    <dbl> <dbl> <int> <dbl>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
##  1  2013     1     1     5    EWR   IAH  N14228      UA    NA    NA    NA
##  2  2013     1     1     5    LGA   IAH  N24211      UA    NA    NA    NA
##  3  2013     1     1     5    JFK   MIA  N619AA      AA    NA    NA    NA
##  4  2013     1     1     5    JFK   BQN  N804JB      B6    NA    NA    NA
##  5  2013     1     1     6    LGA   ATL  N668DN      DL 39.92 26.06 57.33
##  6  2013     1     1     5    EWR   ORD  N39463      UA    NA    NA    NA
##  7  2013     1     1     6    EWR   FLL  N516JB      B6 39.02 26.06 59.37
##  8  2013     1     1     6    LGA   IAD  N829AS      EV 39.92 26.06 57.33
##  9  2013     1     1     6    JFK   MCO  N593JB      B6 39.02 26.06 59.37
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA 39.92 26.06 57.33
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
# join with named matching
flights2 %>% left_join(planes, by = "tailnum") 
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin  dest tailnum carrier year.y
##     <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>  <int>
##  1   2013     1     1     5    EWR   IAH  N14228      UA   1999
##  2   2013     1     1     5    LGA   IAH  N24211      UA   1998
##  3   2013     1     1     5    JFK   MIA  N619AA      AA   1990
##  4   2013     1     1     5    JFK   BQN  N804JB      B6   2012
##  5   2013     1     1     6    LGA   ATL  N668DN      DL   1991
##  6   2013     1     1     5    EWR   ORD  N39463      UA   2012
##  7   2013     1     1     6    EWR   FLL  N516JB      B6   2000
##  8   2013     1     1     6    LGA   IAD  N829AS      EV   1998
##  9   2013     1     1     6    JFK   MCO  N593JB      B6   2004
## 10   2013     1     1     6    LGA   ORD  N3ALAA      AA     NA
## # ... with 336,766 more rows, and 7 more variables: type <chr>,
## #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## #   speed <int>, engine <chr>
# join with explicit column matching
flights2 %>% left_join(airports, by= c("dest" = "faa")) 
## # A tibble: 336,776 x 15
##     year month   day  hour origin  dest tailnum carrier
##    <int> <int> <int> <dbl>  <chr> <chr>   <chr>   <chr>
##  1  2013     1     1     5    EWR   IAH  N14228      UA
##  2  2013     1     1     5    LGA   IAH  N24211      UA
##  3  2013     1     1     5    JFK   MIA  N619AA      AA
##  4  2013     1     1     5    JFK   BQN  N804JB      B6
##  5  2013     1     1     6    LGA   ATL  N668DN      DL
##  6  2013     1     1     5    EWR   ORD  N39463      UA
##  7  2013     1     1     6    EWR   FLL  N516JB      B6
##  8  2013     1     1     6    LGA   IAD  N829AS      EV
##  9  2013     1     1     6    JFK   MCO  N593JB      B6
## 10  2013     1     1     6    LGA   ORD  N3ALAA      AA
## # ... with 336,766 more rows, and 7 more variables: name <chr>, lat <dbl>,
## #   lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>

Types of join with SQL equivalent.

# Create simple data
(df1 <- data_frame(x = c(1, 2), y = 2:1))
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     2
## 2     2     1
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))
## # A tibble: 2 x 3
##       x     a     b
##   <dbl> <dbl> <chr>
## 1     1    10     a
## 2     3    10     a
# Return only matched rows
df1 %>% inner_join(df2) # SELECT * FROM x JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 1 x 4
##       x     y     a     b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10     a
# Return all rows in df1.
df1 %>% left_join(df2) # SELECT * FROM x LEFT JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a     b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10     a
## 2     2     1    NA  <NA>
# Return all rows in df2.
df1 %>% right_join(df2) # SELECT * FROM x RIGHT JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a     b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10     a
## 2     3    NA    10     a
# Return all rows. 
df1 %>% full_join(df2) # SELECT * FROM x FULL JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 3 x 4
##       x     y     a     b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10     a
## 2     2     1    NA  <NA>
## 3     3    NA    10     a
# Like left_join, but returning only columns in df1
df1 %>% semi_join(df2, by = "x")  # SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <int>
## 1     1     2

23.3 tidyr

23.4 reshape2

23.5 stringr

23.6 anytime

23.7 Biblipgraphic Notes

23.8 Practice Yourself

Analytics, Revolution, and Steve Weston. 2015. Foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.

Anderson-Cook, Christine M. 2004. “An Introduction to Multivariate Statistical Analysis.” Journal of the American Statistical Association 99 (467). American Statistical Association: 907–9.

Arlot, Sylvain, Alain Celisse, and others. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4. The author, under a Creative Commons Attribution License: 40–79.

Bai, Zhidong, and Hewa Saranadasa. 1996. “Effect of High Dimension: By an Example of a Two Sample Problem.” Statistica Sinica. JSTOR, 311–29.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. doi:10.18637/jss.v067.i01.

Benjamini, Yoav, and Daniel Yekutieli. 2001. “The Control of the False Discovery Rate in Multiple Testing Under Dependency.” Annals of Statistics. JSTOR, 1165–88.

Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2017. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Chapple, Simon R, Eilidh Troup, Thorsten Forster, and Terence Sloan. 2016. Mastering Parallel Programming with R. Packt Publishing Ltd.

Conway, Drew, and John White. 2012. Machine Learning for Hackers. “ O’Reilly Media, Inc.”

Davis, Timothy A. 2006. Direct Methods for Sparse Linear Systems. SIAM.

Dowle, Matt, and Arun Srinivasan. 2017. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.

Efron, Bradley. 2012. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Vol. 1. Cambridge University Press.

Everitt, Brian, and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer Science & Business Media.

Fithian, William. 2015. “Topics in Adaptive Inference.” PhD thesis, STANFORD UNIVERSITY.

Foster, Dean P, and Robert A Stine. 2004. “Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy.” Journal of the American Statistical Association 99 (466). Taylor & Francis: 303–13.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer series in statistics Springer, Berlin.

Gentle, James E. 2012. Numerical Linear Algebra for Applications in Statistics. Springer Science & Business Media.

Gilbert, John R, Cleve Moler, and Robert Schreiber. 1992. “Sparse Matrices in Matlab: Design and Implementation.” SIAM Journal on Matrix Analysis and Applications 13 (1). SIAM: 333–56.

Golub, Gene H, and Charles F Van Loan. 2012. Matrix Computations. Vol. 3. JHU Press.

Graham, RL. 1988. “Isometric Embeddings of Graphs.” Selected Topics in Graph Theory 3. Academic Press San Diego, CA: 133–50.

Greene, William H. 2003. Econometric Analysis. Pearson Education India.

Hotelling, Harold. 1933. “Analysis of a Complex of Statistical Variables into Principal Components.” Journal of Educational Psychology 24 (6). Warwick & York: 417.

Izenman, Alan Julian. 2008. “Modern Multivariate Statistical Techniques.” Regression, Classification and Manifold Learning. Springer.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 6. Springer.

Javanmard, Adel, and Andrea Montanari. 2014. “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression.” Journal of Machine Learning Research 15 (1): 2869–2909.

Kalisch, Markus, and Peter Bühlmann. 2014. “Causal Structure Learning and Inference: A Selective Review.” Quality Technology & Quantitative Management 11 (1). Taylor & Francis: 3–21.

Kane, Michael J, John Emerson, Stephen Weston, and others. 2013. “Scalable Strategies for Computing with Massive Data.” Journal of Statistical Software 55 (14): 1–19.

Lantz, Brett. 2013. Machine Learning with R. Packt Publishing Ltd.

Leisch, Friedrich. 2002. “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis.” In Compstat, 575–80. Springer.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.

McCullagh, Peter. 1984. “Generalized Linear Models.” European Journal of Operational Research 16 (3). Elsevier: 285–92.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. MIT press.

Nadler, Boaz. 2008. “Finite Sample Approximation Results for Principal Component Analysis: A Matrix Perturbation Approach.” The Annals of Statistics. JSTOR, 2791–2817.

Pearson, Karl. 1901. “LIII. on Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11). Taylor & Francis: 559–72.

Pinero, Jose, and Douglas Bates. 2000. “Mixed-Effects Models in S and S-Plus (Statistics and Computing).” Springer, New York.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rabinowicz, Assaf, and Saharon Rosset. 2018. “Assessing Prediction Error at Interpolation and Extrapolation Points.” arXiv Preprint arXiv:1802.00996.

Ripley, Brian D. 2007. Pattern Recognition and Neural Networks. Cambridge university press.

Robinson, George K. 1991. “That Blup Is a Good Thing: The Estimation of Random Effects.” Statistical Science. JSTOR, 15–32.

Rosenblatt, Jonathan. 2013. “A Practitioner’s Guide to Multiple Testing Error Rates.” arXiv Preprint arXiv:1304.4920.

Rosenblatt, Jonathan D, and Yoav Benjamini. 2014. “Selective Correlations; Not Voodoo.” NeuroImage 103. Elsevier: 401–10.

Rosenblatt, Jonathan, Roee Gilron, and Roy Mukamel. 2016. “Better-Than-Chance Classification for Signal Detection.” arXiv Preprint arXiv:1608.08873.

Rosset, Saharon, and Ryan J Tibshirani. 2018. “From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation.” Journal of the American Statistical Association, no. just-accepted. Taylor & Francis.

Sammut, Claude, and Geoffrey I Webb. 2011. Encyclopedia of Machine Learning. Springer Science & Business Media.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.

Schmidberger, Markus, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, and Ulrich Mansmann. 2009. “State of the Art in Parallel Computing with R.” Journal of Statistical Software 47 (1).

Searle, Shayle R, George Casella, and Charles E McCulloch. 2009. Variance Components. Vol. 391. John Wiley & Sons.

Shah, Viral, and John R Gilbert. 2004. “Sparse Matrices in Matlab* P: Design and Implementation.” In International Conference on High-Performance Computing, 144–55. Springer.

Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press.

Shawe-Taylor, John, and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge university press.

Simes, R John. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3). Oxford University Press: 751–54.

Small, Christopher G. 1990. “A Survey of Multidimensional Medians.” International Statistical Review/Revue Internationale de Statistique. JSTOR, 263–77.

Tukey, John W. 1977. Exploratory Data Analysis. Reading, Mass.

Vapnik, Vladimir. 2013. The Nature of Statistical Learning Theory. Springer science & business media.

Venables, William N, and Brian D Ripley. 2013. Modern Applied Statistics with S-Plus. Springer Science & Business Media.

Venables, William N, David M Smith, R Development Core Team, and others. 2004. “An Introduction to R.” Network Theory Limited.

Wang, Chun, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, and Jun Yan. 2015. “Statistical Methods and Computing for Big Data.” arXiv Preprint arXiv:1502.07989.

Weihs, Claus, Olaf Mersmann, and Uwe Ligges. 2013. Foundations of Statistical Algorithms: With References to R Packages. CRC Press.

Weiss, Robert E. 2005. Modeling Longitudinal Data. Springer Science & Business Media.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2014. Advanced R. CRC Press.

Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.

Wilcox, Rand R. 2011. Introduction to Robust Estimation and Hypothesis Testing. Academic Press.

Wilkinson, GN, and CE Rogers. 1973. “Symbolic Description of Factorial Models for Analysis of Variance.” Applied Statistics. JSTOR, 392–99.

Wilkinson, Leland. 2006. The Grammar of Graphics. Springer Science & Business Media.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.

———. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. CRC Press.

References

Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.

Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Dowle, Matt, and Arun Srinivasan. 2017. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.