# base R
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = FALSE,
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
r-square intro
a few basic building blocks
SQUARE consultants
square.research.vub.be
Compiled on R 4.5.2
Important: What-Why-Who
This page aims to introduce researchers to the tidyverse ecosystem in R.
Our target audience is primarily the research community of the VUB / UZ Brussel, particularly those who have some basic experience in R and want to know more.
We invite you to help improve this document by sending us feedback: square@vub.be
Advanced R course: the tidyverse
- Advanced: experience required to keep up
  - an R primer is given as context and can give you a quick refresher
- Not advanced: it is still about simple stuff
  - data manipulation
  - data visualization
  - no statistics
- Want something more advanced?
  - should not be necessary for you
  - Wickham, H. (2019). Advanced R, Second Edition. CRC Press.
    - chapters
      - functional programming
      - object oriented programming
      - meta programming (expressions, quasiquotation, evaluation, …)
R
- R is free, open source, with a large community
- R is a programming tool:
  - aims at data manipulation, visualization, analysis
  - all are best done with coding
    - efficiently and correctly process data and statistics
    - maintain structure and transparency, to support reproducibility
- R & Python, and AI
  - R works similarly to, and together with, Python
  - R editors are also free: RStudio, Positron
  - coding is made a lot easier with AI
tidyverse: Tidy what ?!
- Tidyverse contains R packages that share the tidy philosophy
  - R packages contain sets of related functions
  - functions are self-contained blocks of code to turn input into output
tidyverse: Why it exists
- R: a flexible, open source statistical programming tool (2000)
  - flexible: a lot is possible, in different ways
  - open source: many contributors writing code their own way
  - users have to adapt to each package / function
- Commit to shared rules (without reducing R's flexibility)
  - contract with user → consistent input (tidy data)
  - contract with developer → consistent specification
    - predictable function names
    - intuitive/sensible arguments and defaults
  - contract with developer → consistent output
    - predictable (constancy of data type by default)
    - reusable
- a few examples
# tidyverse
read_table(
  file,
  col_names = TRUE,
  col_types = NULL,
  locale = default_locale(),
  na = "NA",
  skip = 0,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  progress = show_progress(),
  comment = "",
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE
)
# an example data frame
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
class(mtcars)
[1] "data.frame"
# base R
str(mtcars[,c('mpg','cyl')])
'data.frame': 32 obs. of 2 variables:
 $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl: num 6 6 4 6 8 6 8 4 4 6 ...
str(mtcars[,c('mpg')])
 num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# tidyverse
str(mtcars |> select('mpg'))
'data.frame': 32 obs. of 1 variable:
 $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
str(mtcars |> pull('mpg'))
 num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
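The "constancy of data type" contract can be checked directly. The sketch below is only an illustration on `mtcars`; the object names (`two_cols`, `one_col`, and so on) are made up for this example. It shows that base R's `[` returns a data frame or a vector depending on how many columns are requested, while `select()` and `pull()` each have one fixed return type.

```r
library(dplyr)

# Base R: the same `[` call changes its return type with the number of columns
two_cols     <- mtcars[, c("mpg", "cyl")]       # data frame
one_col      <- mtcars[, "mpg"]                 # numeric vector (dimension dropped)
one_col_kept <- mtcars[, "mpg", drop = FALSE]   # data frame, but only on request

# dplyr: select() always returns a data frame, pull() always returns a vector
selected <- mtcars |> select(mpg)
pulled   <- mtcars |> pull(mpg)

class(two_cols)
class(one_col)
class(selected)
class(pulled)
```

Because the return type does not depend on the data, code built on `select()` or `pull()` behaves the same whether one column or several are involved.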
tidyverse = first successful attempt to make R more consistent
- earlier attempts failed
- tidyverse well thought through
- tidyverse makes sense for most
- tidyverse supported and promoted by Posit (RStudio)
- Ecosystem emerges, following the tidyverse rules
  - tibble for data representation
  - tidyr for tidying data
  - dplyr for manipulating data frames
  - ggplot2 for visualizing data
  - stringr for dealing with text
  - readr for reading in data
  - forcats for dealing with factors
  - purrr for functional programming (advanced)
  - …
Find convenient cheat sheets here or directly in RStudio (Help → Cheat Sheets).
tidyverse: visualization and manipulation
- Data manipulation (dplyr, tidyr) and visualization (ggplot2)
  - simple but generally usable
  - important part of most analyses
  - often neglected in statistics courses
- Bridges the gap between raw data and modeling
Set up tidyverse packages
- Install the tidyverse package (at least once)
install.packages('tidyverse')
- Load the tidyverse package (once per R session)
  - the individual packages that are loaded by default are listed
  - conflicts are listed
library(tidyverse)
- Conflicts result from identical function names
  - resolve conflicts
    - explicit referencing of package with ::
      - e.g., stats::filter( ) or dplyr::filter( )
    - creating new default
      - e.g., select <- dplyr::select
- Conflicts can be checked for tidyverse
tidyverse_conflicts( )
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
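To make the `filter()` conflict concrete, here is a minimal sketch that calls both functions with explicit `::` references. The object names `four_cyl` and `smoothed` are made up for this illustration.

```r
library(dplyr)  # attaching dplyr masks stats::filter() and stats::lag()

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# stats::filter() applies a linear filter to a series,
# here a centered 3-point moving average over 1, 2, 3, 4, 5
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

nrow(four_cyl)
as.numeric(smoothed)
```

With the package prefix, the code no longer depends on the order in which packages were loaded.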
tidyverse ecosystem includes
broom, conflicted, cli, dbplyr, dplyr, dtplyr, forcats, ggplot2, googledrive, googlesheets4, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, ragg, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, xml2, tidyverse.
tidyverse_packages( )
tidy data (input) - where it started
- Hadley Wickham’s ggplot2 (he now works at Posit (RStudio))
  - consistent input, easier to write visualization functions
  - enforce use of ‘tidy’ data
- Tidy data
  - each research unit in focus is assigned to a row
  - properties of research units are spread over columns
  - cells link a research unit (row) to a property (column)
  - research unit specific tables, to be linked by key variables
Patient as research unit in a cross-over design, not ideal.
# A tibble: 6 × 7
  patient_id   bmi score_base score_trt score_ctrl period_trt period_ctrl
  <chr>      <dbl>      <dbl>     <dbl>      <dbl> <chr>      <chr>
1 p01         23.8         82        86         84 first      second
2 p02         27.1         75        74         73 second     first
3 p03         21.5         88        91         90 first      second
4 p04         30.2         69        71         70 second     first
5 p05         24.7         91        94         93 first      second
6 p06         28           73        76         75 second     first
Observation as research unit in a cross-over design, much better.
# A tibble: 18 × 6
   obs_id patient_id   bmi period treatment score
    <int> <chr>      <dbl> <chr>  <chr>     <dbl>
 1      1 p01         23.8 t_0    base         82
 2      2 p01         23.8 t_1    trt          86
 3      3 p01         23.8 t_2    ctrl         84
 4      4 p02         27.1 t_0    base         75
 5      5 p02         27.1 t_1    ctrl         73
 6      6 p02         27.1 t_2    trt          74
 7      7 p03         21.5 t_0    base         88
 8      8 p03         21.5 t_1    trt          91
 9      9 p03         21.5 t_2    ctrl         90
10     10 p04         30.2 t_0    base         69
11     11 p04         30.2 t_1    ctrl         70
12     12 p04         30.2 t_2    trt          71
13     13 p05         24.7 t_0    base         91
14     14 p05         24.7 t_1    trt          94
15     15 p05         24.7 t_2    ctrl         93
16     16 p06         28   t_0    base         73
17     17 p06         28   t_1    ctrl         75
18     18 p06         28   t_2    trt          76
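Moving from one row per patient to one row per observation is a reshaping step. The sketch below uses `tidyr::pivot_longer()` on the first two patients from the tables above; the object names `wide` and `long` are made up for this illustration. It is deliberately partial: it stacks only the score columns and does not reconstruct the `period` or `obs_id` columns.

```r
library(tibble)
library(tidyr)

# Two patients in the wide, one-row-per-patient layout
wide <- tribble(
  ~patient_id, ~bmi, ~score_base, ~score_trt, ~score_ctrl,
  "p01",       23.8, 82,          86,         84,
  "p02",       27.1, 75,          74,         73
)

# Stack the three score columns: one row per patient and treatment
long <- wide |>
  pivot_longer(
    cols         = starts_with("score_"),
    names_to     = "treatment",
    names_prefix = "score_",
    values_to    = "score"
  )

long
```

Each patient now contributes three rows, and `bmi` is repeated on every row, just as in the observation-level table.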
Maybe combine the two, but disentangled: one table per research unit, linked by patient_id.
patient
# A tibble: 6 × 3
  patient_id   bmi score_base
  <chr>      <dbl>      <dbl>
1 p01         23.8         82
2 p02         27.1         75
3 p03         21.5         88
4 p04         30.2         69
5 p05         24.7         91
6 p06         28           73
observations
# A tibble: 12 × 5
   obs_id patient_id period treatment score
    <int> <chr>      <chr>  <chr>     <dbl>
 1      2 p01        t_1    trt          86
 2      3 p01        t_2    ctrl         84
 3      5 p02        t_1    ctrl         73
 4      6 p02        t_2    trt          74
 5      8 p03        t_1    trt          91
 6      9 p03        t_2    ctrl         90
 7     11 p04        t_1    ctrl         70
 8     12 p04        t_2    trt          71
 9     14 p05        t_1    trt          94
10     15 p05        t_2    ctrl         93
11     17 p06        t_1    ctrl         75
12     18 p06        t_2    trt          76
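Tables for different research units are brought together through their key variable when needed. A minimal sketch, assuming the two tables are stored as `patient` and `observations` and share the key `patient_id`; only the first two patients are typed in here, and `combined` is a made-up name.

```r
library(tibble)
library(dplyr)

# Patient-level table: one row per patient
patient <- tribble(
  ~patient_id, ~bmi, ~score_base,
  "p01",       23.8, 82,
  "p02",       27.1, 75
)

# Observation-level table: one row per measurement
observations <- tribble(
  ~obs_id, ~patient_id, ~period, ~treatment, ~score,
  2L,      "p01",       "t_1",   "trt",      86,
  3L,      "p01",       "t_2",   "ctrl",     84,
  5L,      "p02",       "t_1",   "ctrl",     73,
  6L,      "p02",       "t_2",   "trt",      74
)

# Add the patient properties to every observation via the key variable
combined <- observations |>
  left_join(patient, by = "patient_id")

combined
```

Keeping the tables separate avoids repeating patient properties in storage, while the join restores them for analysis.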
tidy output - its extension
- Max Kuhn’s caret (he now works at Posit (RStudio)) evolved into tidymodels, which includes the broom package
  - homogenize statistical output
  - output to potentially serve as input
- Tidy output
  - one-row model information: glance
  - multiple-row statistical summary: tidy
  - model-based extended data: augment
Regression analysis can be summarized
Call:
lm(formula = score ~ treatment + bmi, data = dta)

Residuals:
    Min      1Q  Median      3Q     Max
-4.7868 -2.4515 -0.9560  0.4421  9.0415

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   149.192     13.178  11.321 1.26e-06 ***
treatmenttrt    1.167      2.891   0.404 0.695944
bmi            -2.641      0.503  -5.251 0.000527 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.007 on 9 degrees of freedom
Multiple R-squared: 0.755, Adjusted R-squared: 0.7005
F-statistic: 13.87 on 2 and 9 DF, p-value: 0.001784
A tidy type of summary
# A tibble: 3 × 5
  term         estimate std.error statistic    p.value
  <chr>           <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    149.      13.2      11.3   0.00000126
2 treatmenttrt     1.17     2.89      0.404 0.696
3 bmi             -2.64     0.503    -5.25  0.000527
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.755         0.701  5.01      13.9 0.00178     2  -34.6  77.3  79.2
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
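The three broom verbs can be tried on any fitted model. The data set `dta` used above is not shown on this page, so the sketch below swaps in a different model on `mtcars` purely as an illustration; `fit`, `coefs`, `model_info` and `extended` are made-up names.

```r
library(broom)

# An illustrative linear model (not the cross-over model from above)
fit <- lm(mpg ~ wt + am, data = mtcars)

coefs      <- tidy(fit)     # one row per model term
model_info <- glance(fit)   # one row for the whole model
extended   <- augment(fit)  # the data, extended with fitted values, residuals, ...

coefs
model_info
head(extended)
```

All three results are tibbles, so they can go straight into dplyr or ggplot2 without any reshaping.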
tibbles: the tidyverse data type
- The tibble package offers the tidyverse data type, a tibble
- A tibble is a data frame, not necessarily the other way around
- A data frame is R’s data type for analysis
  - a list of equally sized vectors
    - numeric vector (either double, integer, or complex)
    - factor (ordered, not ordered)
    - logical (boolean) vector
    - character
- A tibble enhances a data frame
  - for convenience and consistency
  - no row-names, must be part of data
  - different default behavior
    - printing, naming, …
  - less forgiving
  - example: print
- Create tibble with tibble( ) or tribble( ) function
  - notice: class( ) shows both data.frame and tbl_df
  - notice: no row names, all info made explicit as data
  - compare with data frame
mytibble <- tibble(
  colA = c("a","b","c"),
  colB = c(1:3)
)
(mytibble <- tribble(
  ~colA, ~colB,
  "a", 1,
  "b", 2,
  "c", 3
))
# A tibble: 3 × 2
  colA   colB
  <chr> <dbl>
1 a         1
2 b         2
3 c         3
class(mytibble)
[1] "tbl_df"     "tbl"        "data.frame"
mydf <- data.frame(colA=c('a','b','c'),colB=1:3)
class(mydf)
[1] "data.frame"
- No need to think much about tibbles
  - a tibble is a data frame
  - tidyverse functions accept plain data frames, and many of them return tibbles
  - does less, complains more
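Two of the differences between a tibble and a plain data frame can be seen in a few lines. This is a small sketch with made-up object names (`df`, `tb`, `cars_tb`), not an exhaustive comparison.

```r
library(tibble)

df <- data.frame(colA = c("a", "b", "c"), colB = 1:3)
tb <- as_tibble(df)

# Single-column subsetting: the data frame drops to a vector, the tibble stays a tibble
class(df[, "colA"])
class(tb[, "colA"])

# Row names are not kept implicitly; they are turned into an explicit column on request
cars_tb <- as_tibble(head(mtcars), rownames = "model")
names(cars_tb)
```

The `rownames = "model"` argument is how the car names in `mtcars` become ordinary data instead of row labels.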
pipes: a convenient way of chaining functions
- The magrittr package offers the pipe operator %>%; base R (since version 4.1) offers the native pipe |>
  - pushes left hand side as first argument into right hand side
  - e.g., object %>% function ~ function(object, …)
  - borrowed from functional programming
- tidyverse functions always take their input as first argument
  - function(input, …)
  - pipes are convenient to chain functions
  - e.g., object %>% function %>% function’ %>% function’’ …
- Pipes read from left to right
  - most base R use reads inside-out
  - compare
mtcars %>% pull(mpg) %>% mean()
mean(mtcars$mpg)
  - especially of interest with multiple steps, serves readability
  - example: root sum of squared differences for two sets of 10, sampled from standard normal
x1 <- rnorm(10); x2 <- rnorm(10)
sqrt(sum((x1-x2)^2))
[1] 3.652338
(x1-x2)^2 %>% sum( ) %>% sqrt( )
[1] 3.652338
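The nested call, the magrittr pipe and the native pipe give the same result, which can be verified directly. In the sketch below the seed is an arbitrary choice made only for reproducibility, so the value will differ from the one printed above; the object names are made up. The extra parentheses around the squared differences are not strictly needed, since `^` binds more tightly than the pipe, but they make the grouping explicit.

```r
library(magrittr)

set.seed(1)  # arbitrary seed, for reproducibility only
x1 <- rnorm(10)
x2 <- rnorm(10)

# Base R: read inside-out
nested <- sqrt(sum((x1 - x2)^2))

# magrittr pipe: read left to right
piped_magrittr <- ((x1 - x2)^2) %>% sum() %>% sqrt()

# Native pipe (R >= 4.1): same left-to-right reading, no package needed
piped_native <- ((x1 - x2)^2) |> sum() |> sqrt()

all.equal(nested, piped_magrittr)
all.equal(nested, piped_native)
```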
Example: tidyverse / dplyr
- Create factors for all variables with fewer than 4 distinct values
  - for data.frame mtcars
  - change the elements (mutate)
    - for all variables (across)
      - where variable . has < 4 distinct values
      - to factor
  - and show the structure (glimpse)
mtcars %>%
  mutate(
    across(
      where(~n_distinct(.)<4),
      as.factor)) %>%
  glimpse()
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
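To see why exactly these four columns were converted, the number of distinct values per column can be counted first. This is a small sketch with made-up object names (`n_levels`, `converted`, `factor_cols`); it writes the condition with R's anonymous function shorthand `\(x)` instead of the `~` formula used above, which should be equivalent here.

```r
library(dplyr)

# Number of distinct values in every column of mtcars
n_levels <- mtcars |>
  summarise(across(everything(), n_distinct))
n_levels

# Same conversion as above, condition written as an anonymous function
converted <- mtcars |>
  mutate(across(where(\(x) n_distinct(x) < 4), as.factor))

# Names of the columns that ended up as factors
factor_cols <- names(converted)[sapply(converted, is.factor)]
factor_cols
```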