r-square intro

a few basic building blocks
Author

Wilfried Cools & Lara Stas

Published

December 7, 2025

SQUARE consultants
square.research.vub.be

Compiled on R 4.5.2

What-Why-Who

This page aims to introduce researchers to the tidyverse ecosystem in R.

Our target audience is primarily the research community of the VUB / UZ Brussel, particularly those who have some basic experience in R and want to know more.

We invite you to help improve this document by sending us feedback: square@vub.be

Advanced R course: the tidyverse

  • Advanced: experience required to keep up
    • an R primer is given as context and can serve as a quick refresher
  • Not advanced: it is still about simple stuff
    • data manipulation
    • data visualization
    • no statistics
  • Want more advanced material?
    • should not be necessary for you
    • Wickham, H. (2019). Advanced R, Second Edition. CRC Press.
    • chapters
      • functional programming
      • object oriented programming
      • meta programming (expressions, quasiquotation, evaluation, …)

R

  • R is free, open source, with a large community
  • R is a programming tool:
    • aims at data manipulation, visualization, analysis
    • all are best done with coding
      • efficiently and correctly process data and statistics
      • maintain structure and transparency, to support reproducibility
  • R & Python, and AI
    • R works similarly to, and together with, Python
    • R editors are also free: RStudio, Positron
    • coding is made a lot easier with AI

tidyverse: Tidy what ?!

  • Tidyverse contains R packages that share the tidy philosophy
  • R packages contain sets of related functions
  • functions are self-contained blocks of code to turn input into output
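
As a minimal sketch of that last point: a function takes input and returns output. The name standardize below is hypothetical and used only for illustration; it is not part of any package.

```r
# Hypothetical helper, for illustration only:
# input  = a numeric vector
# output = the same values expressed as z-scores
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

z <- standardize(mtcars$mpg)
mean(z)  # numerically zero
sd(z)    # unit standard deviation
```

Packages such as dplyr bundle many functions of this kind around one task.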

tidyverse: Why it exists

  • R: a flexible, open source statistical programming tool (2000)
    • flexible: a lot is possible, in different ways
    • open source: many contributors writing code their own way
    • users have to adapt to each package / function
  • Commit to shared rules (without reducing R's flexibility)
    • contract with user → consistent input (tidy data)
    • contract with developer → consistent specification
      • predictable function names
      • intuitive/sensible arguments and defaults
    • contract with developer → consistent output
      • predictable (constancy of data type by default)
      • reusable
  • A few examples
    
# base R
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = FALSE,
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
# tidyverse
read_table(
  file,
  col_names = TRUE,
  col_types = NULL,
  locale = default_locale(),
  na = "NA",
  skip = 0,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  progress = show_progress(),
  comment = "",
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE
)
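
The two signatures above differ in their defaults, which a small sketch can make concrete. The temporary file and its contents are made up for illustration; the calls assume the readr package is installed.

```r
library(readr)

# Write a tiny whitespace-separated file with a header line (made-up data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("x y", "1 2", "3 4"), tmp)

# Base R: header = FALSE by default, so the header line is read as data
# and the columns get the default names V1 and V2.
base_default <- read.table(tmp)
names(base_default)

# Base R with the header requested explicitly.
base_header <- read.table(tmp, header = TRUE)

# readr: col_names = TRUE by default, and the result is a tibble.
tidy_default <- read_table(tmp, show_col_types = FALSE)
class(tidy_default)
```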
# an example data frame
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
class(mtcars)
[1] "data.frame"
# base R
str(mtcars[,c('mpg','cyl')])
'data.frame':   32 obs. of  2 variables:
 $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl: num  6 6 4 6 8 6 8 4 4 6 ...
str(mtcars[,c('mpg')])
 num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# tidyverse
str(mtcars |> select('mpg'))
'data.frame':   32 obs. of  1 variable:
 $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
str(mtcars |> pull('mpg'))
 num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  • tidyverse = first successful attempt to make R more consistent
    • earlier attempts failed
    • tidyverse well thought through
    • tidyverse makes sense for most
    • tidyverse supported and promoted by Posit (formerly RStudio)
  • Ecosystem emerges, following the tidyverse rules
    • tibble for data representation
    • tidyr for tidying data
    • dplyr for manipulating data frames
    • ggplot2 for visualizing data
    • stringr for dealing with texts
    • readr for reading in data
    • forcats for dealing with factors
    • purrr for functional programming (advanced)

Find convenient cheat sheets here or directly in RStudio (Help → Cheat Sheets).

tidyverse: visualization and manipulation

  • Data manipulation (dplyr, tidyr) and visualization (ggplot2)
    • simple but generally usable
    • important part of most analyses
    • often neglected in statistics courses
  • Bridges the gap between raw data and modeling

Set up tidyverse packages

  • Install the tidyverse package (at least once)
install.packages('tidyverse')
  • Load the tidyverse package (once per R session)
    • the individual packages that are loaded by default are listed
    • conflicts are listed
library(tidyverse)
  • Conflicts result from identical function names
    • resolve conflicts
      • explicit referencing of package with ::
        • e.g., stats::filter( ) or dplyr::filter( )
      • creating new default
        • e.g., select <- dplyr::select
  • Conflicts can be checked for tidyverse
tidyverse_conflicts( )
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  • tidyverse ecosystem includes
    broom, conflicted, cli, dbplyr, dplyr, dtplyr, forcats, ggplot2, googledrive, googlesheets4, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, ragg, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, xml2, tidyverse.
tidyverse_packages( )
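
To illustrate the explicit referencing with :: mentioned above, here is a minimal sketch that calls both versions of filter( ) side by side. The object names are chosen for illustration only.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition.
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() applies a linear filter to a series,
# here a moving average over three neighbouring values.
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))
moving_avg
```

With the package prefix, the result no longer depends on which package was loaded last.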

tidy data (input) - where it started

  • Hadley Wickham’s ggplot2 (Wickham now works at Posit, formerly RStudio)
    • consistent input, easier to write visualization functions
    • enforce use of ‘tidy’ data
  • Tidy data
    • each research unit in focus is assigned its own row
    • properties of research units are spread over columns
    • each cell links a research unit (row) to a property (column)
    • each type of research unit gets its own table, linked to the others by key variables
Patient as research unit in a cross-over design, not ideal.
# A tibble: 6 × 7
  patient_id   bmi score_base score_trt score_ctrl period_trt period_ctrl
  <chr>      <dbl>      <dbl>     <dbl>      <dbl> <chr>      <chr>      
1 p01         23.8         82        86         84 first      second     
2 p02         27.1         75        74         73 second     first      
3 p03         21.5         88        91         90 first      second     
4 p04         30.2         69        71         70 second     first      
5 p05         24.7         91        94         93 first      second     
6 p06         28           73        76         75 second     first      
Observation as research unit in a cross-over design, much better.
# A tibble: 18 × 6
   obs_id patient_id   bmi period treatment score
    <int> <chr>      <dbl> <chr>  <chr>     <dbl>
 1      1 p01         23.8 t_0    base         82
 2      2 p01         23.8 t_1    trt          86
 3      3 p01         23.8 t_2    ctrl         84
 4      4 p02         27.1 t_0    base         75
 5      5 p02         27.1 t_1    ctrl         73
 6      6 p02         27.1 t_2    trt          74
 7      7 p03         21.5 t_0    base         88
 8      8 p03         21.5 t_1    trt          91
 9      9 p03         21.5 t_2    ctrl         90
10     10 p04         30.2 t_0    base         69
11     11 p04         30.2 t_1    ctrl         70
12     12 p04         30.2 t_2    trt          71
13     13 p05         24.7 t_0    base         91
14     14 p05         24.7 t_1    trt          94
15     15 p05         24.7 t_2    ctrl         93
16     16 p06         28   t_0    base         73
17     17 p06         28   t_1    ctrl         75
18     18 p06         28   t_2    trt          76
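
A possible way to move from the patient-level layout towards the observation-level layout is tidyr's pivot_longer( ). The sketch below is only a partial illustration under stated assumptions: it uses the first two patients, keeps the baseline score as a column rather than as a separate row, and leaves the period labels as "first"/"second".

```r
library(tibble)
library(tidyr)

# First two patients of the wide table shown earlier.
wide <- tribble(
  ~patient_id, ~bmi, ~score_base, ~score_trt, ~score_ctrl, ~period_trt, ~period_ctrl,
  "p01",       23.8, 82,          86,         84,          "first",     "second",
  "p02",       27.1, 75,          74,         73,          "second",    "first"
)

# Split names like score_trt into a value column (score) and a treatment label (trt).
long <- wide |>
  pivot_longer(
    cols = c(score_trt, score_ctrl, period_trt, period_ctrl),
    names_to = c(".value", "treatment"),
    names_sep = "_"
  )
long
```

Each patient now contributes one row per treatment, with score and period as regular columns.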
Alternatively, keep both views, disentangled into two linked tables:
patient
# A tibble: 6 × 3
  patient_id   bmi score_base
  <chr>      <dbl>      <dbl>
1 p01         23.8         82
2 p02         27.1         75
3 p03         21.5         88
4 p04         30.2         69
5 p05         24.7         91
6 p06         28           73

observations
# A tibble: 12 × 5
   obs_id patient_id period treatment score
    <int> <chr>      <chr>  <chr>     <dbl>
 1      2 p01        t_1    trt          86
 2      3 p01        t_2    ctrl         84
 3      5 p02        t_1    ctrl         73
 4      6 p02        t_2    trt          74
 5      8 p03        t_1    trt          91
 6      9 p03        t_2    ctrl         90
 7     11 p04        t_1    ctrl         70
 8     12 p04        t_2    trt          71
 9     14 p05        t_1    trt          94
10     15 p05        t_2    ctrl         93
11     17 p06        t_1    ctrl         75
12     18 p06        t_2    trt          76
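
Linking the two tables by their key variable can be sketched with a dplyr join. The example uses the first three patients from the tables above; the object name combined is chosen for illustration.

```r
library(tibble)
library(dplyr)

# Patient-level table: one row per patient.
patient <- tibble(
  patient_id = c("p01", "p02", "p03"),
  bmi        = c(23.8, 27.1, 21.5),
  score_base = c(82, 75, 88)
)

# Observation-level table: one row per patient and period.
observations <- tibble(
  obs_id     = c(2L, 3L, 5L, 6L, 8L, 9L),
  patient_id = rep(c("p01", "p02", "p03"), each = 2),
  period     = rep(c("t_1", "t_2"), times = 3),
  treatment  = c("trt", "ctrl", "ctrl", "trt", "trt", "ctrl"),
  score      = c(86, 84, 73, 74, 91, 90)
)

# Add the patient properties to every observation, matching on the key.
combined <- observations |>
  left_join(patient, by = "patient_id")
combined
```

Patient properties are stored once, and repeated only when an analysis needs them next to the observations.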

tidy output - its extension

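  • The broom package (originally by David Robinson) extends the tidy idea to model output; it is now part of tidymodels, the successor to Max Kuhn’s caret (Kuhn now works at Posit, formerly RStudio)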
    • homogenize statistical output
    • output to potentially serve as input
  • Tidy output
    • glance( ): one-row model information
    • tidy( ): multiple-row statistical summary
    • augment( ): original data extended with model-based values
Regression analysis can be summarized

Call:
lm(formula = score ~ treatment + bmi, data = dta)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7868 -2.4515 -0.9560  0.4421  9.0415 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   149.192     13.178  11.321 1.26e-06 ***
treatmenttrt    1.167      2.891   0.404 0.695944    
bmi            -2.641      0.503  -5.251 0.000527 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.007 on 9 degrees of freedom
Multiple R-squared:  0.755, Adjusted R-squared:  0.7005 
F-statistic: 13.87 on 2 and 9 DF,  p-value: 0.001784
A tidy type of summary
# A tibble: 3 × 5
  term         estimate std.error statistic    p.value
  <chr>           <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    149.      13.2      11.3   0.00000126
2 treatmenttrt     1.17     2.89      0.404 0.696     
3 bmi             -2.64     0.503    -5.25  0.000527  
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.755         0.701  5.01      13.9 0.00178     2  -34.6  77.3  79.2
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
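
The output above is shown without the code that produced it. The sketch below is an assumption about how it could be obtained: dta is rebuilt here from the 12 treatment and control observations together with each patient's bmi, which appears consistent with the 9 residual degrees of freedom, but the original construction of dta is not shown on this page.

```r
library(tibble)
library(broom)

# Assumed reconstruction of dta from the tables shown earlier.
dta <- tibble(
  patient_id = rep(c("p01", "p02", "p03", "p04", "p05", "p06"), each = 2),
  bmi        = rep(c(23.8, 27.1, 21.5, 30.2, 24.7, 28), each = 2),
  treatment  = c("trt", "ctrl", "ctrl", "trt", "trt", "ctrl",
                 "ctrl", "trt", "trt", "ctrl", "ctrl", "trt"),
  score      = c(86, 84, 73, 74, 91, 90, 70, 71, 94, 93, 75, 76)
)

fit <- lm(score ~ treatment + bmi, data = dta)

summary(fit)                  # classic printed summary
coef_table  <- tidy(fit)      # one row per model term
model_row   <- glance(fit)    # one row for the whole model
with_fitted <- augment(fit)   # data plus fitted values and residuals
```

Because all three results are tibbles, they can be filtered, joined, or plotted like any other data.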

tibbles: the tidyverse data type

  • The tibble package offers the tidyverse data type, a tibble

  • A tibble is a data frame, not necessarily the other way around

  • A data frame is R’s data type for analysis

    • a list of equally sized vectors
      • numeric vector (double or integer)
      • factor (ordered, not ordered)
      • logical (boolean) vector
      • character
  • A tibble enhances a data frame

    • for convenience and consistency
    • no row names; such information must be stored as a column of the data
    • different default behavior
      • printing, naming, …
      • less forgiving
    • example: print
  • Create tibble with tibble( ) or tribble( ) function

    • notice: class( ) shows both data.frame and tbl_df
    • notice: no row names, all info made explicit as data
    • compare with a data frame
mytibble <- tibble(
  colA = c("a","b","c"),
  colB = c(1:3)
)
(mytibble <- tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
))
# A tibble: 3 × 2
  colA   colB
  <chr> <dbl>
1 a         1
2 b         2
3 c         3
class(mytibble)
[1] "tbl_df"     "tbl"        "data.frame"
mydf <- data.frame(colA=c('a','b','c'),colB=1:3)
class(mydf)
[1] "data.frame"
  • No need to think much about tibbles
    • a tibble is a data frame
    • many tidyverse functions return tibbles (e.g., readr); dplyr verbs keep the class of their input
    • does less, complains more
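
What "does less, complains more" means in practice can be sketched with two small comparisons. The column names name and value are made up for illustration.

```r
library(tibble)

mydf  <- data.frame(name = c("a", "b", "c"), value = 1:3)
mytbl <- as_tibble(mydf)

# Selecting a single column with [ , ]:
df_col  <- mydf[, "value"]    # data frame: silently drops to a vector
tbl_col <- mytbl[, "value"]   # tibble: stays a tibble

# Abbreviated column names with $:
df_partial  <- mydf$val                      # data frame: partial matching succeeds
tbl_partial <- suppressWarnings(mytbl$val)   # tibble: NULL, with a warning

class(df_col)
class(tbl_col)
```

The tibble never changes type behind your back and refuses to guess which column was meant.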

pipes: a convenient way of chaining functions

  • The magrittr package offers the pipe operator %>%
    • since R 4.1, base R also offers the native pipe |>
    • both push the left hand side as first argument into the right hand side
      • e.g., object %>% f(…) is equivalent to f(object, …)
    • borrowed from functional programming
  • tidyverse functions always take their input as first argument
    • function(input, …)
    • pipes are convenient to chain functions
      • e.g., object %>% function %>% function’ %>% function’’ …
  • Pipes read from left to right
    • most base R use reads inside-out
    • compare
      • mtcars %>% pull(mpg) %>% mean()
      • mean(mtcars$mpg)
    • especially of interest with multiple steps, serves readability
    • example: root of the sum of squared differences for two sets of 10, sampled from a standard normal
x1 <- rnorm(10); x2 <- rnorm(10)
sqrt(sum((x1-x2)^2))
[1] 3.652338
(x1-x2)^2 %>% sum( ) %>% sqrt( )
[1] 3.652338
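
A minimal sketch comparing the nested call with both pipe operators. The seed is an assumption added here for reproducibility, and the squared differences are wrapped in parentheses to make the order of evaluation explicit.

```r
library(magrittr)

set.seed(1)  # assumption: fixed seed, not part of the original example
x1 <- rnorm(10)
x2 <- rnorm(10)

# Nested call, read inside-out.
nested <- sqrt(sum((x1 - x2)^2))

# magrittr pipe, read left to right.
piped_magrittr <- ((x1 - x2)^2) %>% sum() %>% sqrt()

# Native pipe (R >= 4.1): the right hand side must be a function call.
piped_native <- ((x1 - x2)^2) |> sum() |> sqrt()

all.equal(nested, piped_magrittr)
all.equal(nested, piped_native)
```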

Example: tidyverse / dplyr

  • Create factors for all variables with fewer than 4 distinct values
    • for data.frame mtcars
    • change the elements (mutate)
      • for all variables (across)
        • where the variable (.) has fewer than 4 distinct values
        • to factor
    • and show the structure (glimpse)
mtcars %>% 
    mutate(
        across(
            where(~n_distinct(.)<4),
            as.factor)) %>% 
    glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
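
The same transformation can be sketched with an explicit anonymous function instead of the formula shorthand, followed by a check of which columns ended up as factors. The object names converted and factor_cols are chosen for illustration; the \(x) syntax assumes R 4.1 or later.

```r
library(dplyr)

# Convert every column with fewer than 4 distinct values to a factor.
converted <- mtcars |>
  mutate(across(where(\(x) n_distinct(x) < 4), as.factor))

# List the columns that are now factors.
factor_cols <- converted |>
  select(where(is.factor)) |>
  names()
factor_cols
```

Checking the result this way is a cheap safeguard when the threshold or the data change.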