library(tidyverse)
r-square intro
a few basic building blocks
SQUARE consultants
square.research.vub.be
Compiled on R 4.4.1
What-Why-Who
This site aims to introduce researchers to the tidyverse ecosystem in R.
Our target audience is primarily the research community of the VUB / UZ Brussel, particularly those who have some basic experience in R and want to know more.
We invite you to help improve this document by sending us feedback: square@vub.be
Advanced R course: the tidyverse
- Advanced: experience required to keep up
- an R primer is given as context and can serve as a quick refresher
- Not advanced: it is still about simple stuff
- data manipulation
- data visualization
- no statistics
- Looking for truly advanced material?
- it should not be necessary for you
- Wickham, H. (2019). Advanced R, Second Edition. CRC Press.
- chapters
- functional programming
- object oriented programming
- meta programming (expressions, quasiquotation, evaluation, …)
First Tidyverse Steps
- Data manipulation and visualization
- simple but generally usable
- important part of most analyses
- often neglected in statistics courses
- Our focus: tidyverse
- a set of R packages (~ functions)
- bridges the gap between raw data and modeling
tidyverse: Why it exists
- R: a flexible, open source statistical programming tool (2000)
- flexible: a lot is possible, in different ways
- open source: many contributors writing code their own way
- users have to adapt to each package / function
- Commit to shared rules (without reducing R's flexibility)
- contract with user → consistent input (tidy data)
- contract with developer → consistent specification
- predictable function names
- intuitive/sensible arguments and defaults
- contract with developer → consistent output
- predictable (constancy of data type by default)
- reusable
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
as.is = !stringsAsFactors, tryLogical = TRUE,
row.names, col.names, na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = FALSE,
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read_table(
file,
col_names = TRUE,
col_types = NULL,
locale = default_locale(),
na = "NA",
skip = 0,
n_max = Inf,
guess_max = min(n_max, 1000),
progress = show_progress(),
comment = "",
show_col_types = should_show_types(),
skip_empty_rows = TRUE
)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
class(mtcars)
[1] "data.frame"
str(mtcars[,c('mpg','cyl')])
'data.frame': 32 obs. of 2 variables:
$ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl: num 6 6 4 6 8 6 8 4 4 6 ...
str(mtcars[,c('mpg')])
num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# str(select(mtcars,'mpg'))
str(mtcars |> select('mpg'))
'data.frame': 32 obs. of 1 variable:
$ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
str(mtcars |> pull('mpg'))
num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
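The "constancy of data type" point can be sketched a little further. The example below is only an illustration, not part of the original material: the name `cars_tbl` is an assumption introduced here. It contrasts base R subsetting, which drops a single column to a vector unless `drop = FALSE` is given, with tibble subsetting, which keeps returning a tibble.

```r
library(tibble)

# Hypothetical name: mtcars converted to a tibble for comparison
cars_tbl <- as_tibble(mtcars)

# Base R data frame: selecting a single column drops to a vector by default
is.data.frame(mtcars[, "mpg"])                # FALSE
is.data.frame(mtcars[, "mpg", drop = FALSE])  # TRUE

# Tibble: `[` always returns a tibble; use `[[` or pull() to get the vector
is.data.frame(cars_tbl[, "mpg"])              # TRUE
is.numeric(cars_tbl[["mpg"]])                 # TRUE
```

The return type of the tibble calls does not depend on how many columns are selected, which is what makes them easier to use inside other functions.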
tidyverse
= first successful attempt to make R more consistent
- earlier attempts failed
- tidyverse well thought through
- tidyverse makes sense for most
- tidyverse supported and promoted by RStudio
- Ecosystem emerges, following the tidyverse rules
- tibble for data representation
- tidyr for tidying data
- dplyr for manipulating data frames
- ggplot2 for visualizing data
- stringr for dealing with text
- readr for reading in data
- forcats for dealing with factors
- purrr for functional programming (advanced)
- …
Find convenient cheat sheets here or directly in RStudio (Help → Cheat Sheets).
Set up tidyverse packages
- Install the tidyverse package (at least once)
install.packages('tidyverse')
- Load the tidyverse package (once per R session)
  - the individual packages that are loaded by default are listed
  - conflicts are listed
library(tidyverse)
- Conflicts result from identical function names in different packages
  - resolve conflicts
    - explicit referencing of the package with ::
      - e.g., stats::filter() or dplyr::filter()
    - creating a new default
      - e.g., select <- dplyr::select
- Conflicts can be checked for the tidyverse
tidyverse_conflicts()
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
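As a minimal sketch of resolving the `filter()` conflict with explicit `::` references: the two functions do entirely different things, so naming the package removes any ambiguity. The object names `four_cyl` and `smoothed` are assumptions made for this illustration.

```r
library(dplyr)

# dplyr's filter(): keep the rows that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)
# [1] 11

# stats' filter(): linear filtering of a series, here a 3-point moving average
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(smoothed)
# [1] NA  2  3  4 NA
```

With the package prefix the call works the same way regardless of which packages happen to be loaded, and in which order.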
- The tidyverse ecosystem includes broom, conflicted, cli, dbplyr, dplyr, dtplyr, forcats, ggplot2, googledrive, googlesheets4, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, ragg, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, xml2, tidyverse.
tidyverse_packages()
tidy data (input) - where it started
- Hadley Wickham’s ggplot2 (he now works at RStudio)
  - consistent input makes it easier to write visualization functions
  - enforces the use of ‘tidy’ data
- Tidy data
  - each observation of interest is assigned its own row
  - columns add properties to these observations (cell values)
  - tabular, possibly disentangled into multiple tables
- tibble
  - data.frame 2.0
  - does less, complains more
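To make the tidy data idea concrete, here is a small sketch with made-up data (the table `wide`, its column names and its values are all hypothetical). A wide table with one column per visit is reshaped with `tidyr::pivot_longer()` so that every observation, here a patient at a visit, gets its own row.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per patient, one column per visit
wide <- tribble(
  ~patient, ~visit1, ~visit2,
  "p1",     5.1,     4.8,
  "p2",     6.3,     6.0
)

# Tidy version: one row per observation (patient x visit)
long <- pivot_longer(
  wide,
  cols      = c(visit1, visit2),
  names_to  = "visit",
  values_to = "score"
)
long
```

The long table has four rows and the columns `patient`, `visit` and `score`, which is the shape most tidyverse functions expect as input.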
tidy output - its extension
- Max Kuhn’s caret (he now works at RStudio) turned into the broom package
  - homogenizes statistical output
  - output can potentially serve as input
- Tidy output
  - one-row model information: glance
  - multiple-row statistical summary: tidy
  - model-based extended data: augment
- Output is also turned into a tibble
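The three kinds of tidy output can be illustrated with a simple linear model. This is only a sketch: the model `mpg ~ wt` and the name `fit` are choices made here for illustration. Note that broom is installed with the tidyverse but is not attached by `library(tidyverse)`, so it is loaded explicitly.

```r
library(broom)

# Illustrative model: fuel consumption as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)

glance(fit)   # one row: model-level summaries such as r.squared and AIC
tidy(fit)     # one row per coefficient: estimate, std.error, statistic, p.value
augment(fit)  # one row per observation: the data plus .fitted, .resid, ...
```

Each result is a tibble, so it can be passed straight on to dplyr or ggplot2.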
tibbles: the tidyverse data type
- The tibble package offers the tidyverse data type, a tibble
- A tibble is a data frame, but not necessarily the other way around
- A data frame is R’s data type for analysis
  - a list of equally sized vectors
    - numeric vector (either double, integer, or complex)
    - factor (ordered, not ordered)
    - boolean vector
    - character
- A tibble enhances a data frame
  - for convenience and consistency
  - no row names; they must be part of the data
  - different default behavior
    - printing, naming, …
  - less forgiving
  - example: print
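A short sketch of what "less forgiving" and "no row names" mean in practice. The objects `df`, `tbl` and `cars_tbl` and the column names `letter`, `number` and `model` are assumptions made for this example.

```r
library(tibble)

df  <- data.frame(letter = c("a", "b", "c"), number = 1:3)
tbl <- as_tibble(df)

# Partial matching of column names: a data frame silently accepts it ...
df$num    # returns the `number` column
# ... a tibble does not: it returns NULL and warns about an unknown column
tbl$num

# Row names are not kept as an attribute; they become an ordinary column
cars_tbl <- as_tibble(mtcars, rownames = "model")
names(cars_tbl)[1]
# [1] "model"
```

The stricter behavior surfaces typos in column names early instead of letting them pass unnoticed.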
- Create a tibble with the tibble() or tribble() function
  - notice: class() shows both data.frame and tbl_df
  - notice: no row names, all info made explicit as data
  - compare with a data frame
mytibble <- tibble(
  colA = c("a","b","c"),
  colB = c(1:3)
)
(mytibble <- tribble(
  ~colA, ~colB,
  "a", 1,
  "b", 2,
  "c", 3
))
# A tibble: 3 × 2
colA colB
<chr> <dbl>
1 a 1
2 b 2
3 c 3
class(mytibble)
[1] "tbl_df" "tbl" "data.frame"
mydf <- data.frame(colA=c('a','b','c'),colB=1:3)
class(mydf)
[1] "data.frame"
- No need to think much about tibbles
  - a tibble is a data frame
  - tidyverse functions accept plain data frames, and many of them return tibbles
pipes: a convenient way of chaining functions
- The magrittr package offers the pipe operator %>%; base R (4.1 and later) has its own native pipe |>
  - pushes the left-hand side as first argument into the right-hand side
  - e.g., object %>% function
  - borrowed from functional programming
- tidyverse functions always take their input as first argument
  - function(input, …)
  - pipes are convenient to chain functions
  - e.g., object %>% function %>% function’ %>% function’’ …
- Pipes read from left to right
- most base R use reads inside-out
- compare
mtcars %>% pull(mpg) %>% mean()
mean(mtcars$mpg)
- especially of interest with multiple steps, serves readability
- example: root sum of squares for two sets of 10, sampled from standard normal
x1 <- rnorm(10); x2 <- rnorm(10)
sqrt(sum((x1-x2)^2))
[1] 6.000135
(x1-x2)^2 %>% sum( ) %>% sqrt( )
[1] 6.000135
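The same computation can be written with either pipe. The sketch below assumes R 4.1 or later for `|>`; the seed is arbitrary and only makes the example reproducible, and the names `nested`, `piped` and `native` are introduced here for comparison. The extra parentheses around the squared difference are not strictly required, since `^` binds more tightly than both pipes, but they make the left-hand side explicit.

```r
library(magrittr)  # provides %>%; the native pipe |> needs no package

set.seed(1)  # arbitrary seed, for reproducibility only
x1 <- rnorm(10)
x2 <- rnorm(10)

nested <- sqrt(sum((x1 - x2)^2))                 # reads inside-out
piped  <- ((x1 - x2)^2) %>% sum() %>% sqrt()     # magrittr pipe
native <- ((x1 - x2)^2) |> sum() |> sqrt()       # base R native pipe

all.equal(nested, piped)
all.equal(nested, native)
```

Both comparisons return `TRUE`: the pipes change how the code reads, not what it computes.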
Example: tidyverse
- Create factors for all variables with fewer than 4 distinct values
  - for data.frame mtcars
    - change the elements (mutate)
      - for all variables (across)
        - where the variable has < 4 distinct values
        - to factor
    - and show the structure (glimpse)
mtcars %>%
  mutate(
    across(
      where(~n_distinct(.)<4),
      as.factor)) %>%
  select(1:4) %>% glimpse
Rows: 32
Columns: 4
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
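As a possible variation on the example above, the same selection can be written with an explicit anonymous function (`\(x)`, available from R 4.1) instead of the `~` shorthand. The name `converted` is an assumption of this sketch. Listing the factor columns afterwards shows which variables were affected.

```r
library(dplyr)

# Convert every variable with fewer than 4 distinct values to a factor
converted <- mtcars |>
  mutate(across(where(\(x) n_distinct(x) < 4), as.factor))

# Names of the columns that were turned into factors
converted |> select(where(is.factor)) |> names()
# [1] "cyl"  "vs"   "am"   "gear"
```

Only `cyl` appears in the original output because of `select(1:4)`; the other three converted variables sit further to the right in `mtcars`.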