r-square intro

a few basic building blocks

Author

Wilfried Cools & Lara Stas

Published

November 22, 2024

SQUARE consultants

square.research.vub.be

Compiled on R 4.4.1

What-Why-Who

This site aims to introduce researchers to the tidyverse ecosystem in R.

Our target audience is primarily the research community of the VUB / UZ Brussel, particularly those who have some basic experience in R and want to know more.

We invite you to help improve this document by sending us feedback: square@vub.be

Advanced R course: the tidyverse

Advanced: experience required to keep up
- an R primer is given as context and can serve as a quick refresher
Not advanced: it is still about simple stuff
- data manipulation
- data visualization
- no statistics
Looking for something more advanced?
- should not be necessary for you
- Wickham, H. (2019). Advanced R, Second Edition. CRC Press.
- chapters
  - functional programming
  - object oriented programming
  - meta programming (expressions, quasiquotation, evaluation, …)

First Tidyverse Steps

Data manipulation and visualization
- simple but generally usable
- important part of most analyses
- often neglected in statistics courses
Our focus: tidyverse
- a set of R packages (~ functions)
- bridges the gap between raw data and modeling

tidyverse: Why it exists

R: a flexible, open-source statistical programming tool (2000)
- flexible: a lot is possible, in different ways
- open source: many contributors writing code their own way
- users have to adapt to each package / function
Commit to shared rules (without reducing R's flexibility)
- contract with user → consistent input (tidy data)
- contract with developer → consistent specification
  - predictable function names
  - intuitive/sensible arguments and defaults
- contract with developer → consistent output
  - predictable (constancy of data type by default)
  - reusable

library(tidyverse)

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = FALSE,
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read_table(
  file,
  col_names = TRUE,
  col_types = NULL,
  locale = default_locale(),
  na = "NA",
  skip = 0,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  progress = show_progress(),
  comment = "",
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE
)
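The two signatures above differ in more than length: read.table( ) defaults to header = FALSE, while read_table( ) defaults to col_names = TRUE and returns a tibble. A minimal sketch of that difference, assuming a small whitespace-separated file written to a temporary path (the file contents and the names base_df and tidy_tb are made up for illustration):

```r
library(readr)

# Write a tiny whitespace-separated file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("id score", "1 2.5", "2 3.5"), path)

# Base R: the header has to be requested explicitly
base_df <- read.table(path, header = TRUE)

# readr: the first line is treated as column names by default
tidy_tb <- read_table(path, show_col_types = FALSE)

class(base_df)   # a plain data.frame
class(tidy_tb)   # a tibble (which is also a data.frame)
```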

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

class(mtcars)

[1] "data.frame"

str(mtcars[,c('mpg','cyl')])

'data.frame':   32 obs. of  2 variables:
 $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl: num  6 6 4 6 8 6 8 4 4 6 ...

str(mtcars[,c('mpg')])

 num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...

# str(select(mtcars,'mpg'))
str(mtcars |> select('mpg'))

'data.frame':   32 obs. of  1 variable:
 $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...

str(mtcars |> pull('mpg'))

 num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
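The output above illustrates the "constancy of data type" contract: base R subsetting silently drops a single column to a vector, whereas select( ) keeps the container and pull( ) is the explicit way to extract a vector. The same idea can be sketched with a tibble, where [ never drops; the name cars_tb is only chosen for this illustration:

```r
library(tibble)

# Base R: single-column subsetting drops to a vector unless drop = FALSE
class(mtcars[, "mpg"])                 # "numeric"
class(mtcars[, "mpg", drop = FALSE])   # "data.frame"

# Tibble: [ always returns a tibble; use [[ (or pull) to get the vector
cars_tb <- as_tibble(mtcars)
class(cars_tb[, "mpg"])                # "tbl_df" "tbl" "data.frame"
class(cars_tb[["mpg"]])                # "numeric"
```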

tidyverse = first successful attempt to make R more consistent
- earlier attempts failed
- tidyverse well thought through
- tidyverse makes sense for most
- tidyverse supported and promoted by RStudio
Ecosystem emerges, following the tidyverse rules
- tibble for data representation
- tidyr for tidying data
- dplyr for manipulating data frames
- ggplot2 for visualizing data
- stringr for dealing with texts
- readr for reading in data
- forcats for dealing with factors
- purrr for functional programming (advanced)
- …

Convenient cheat sheets are available online or directly in RStudio (Help → Cheat Sheets).

Set up tidyverse packages

Install the tidyverse package (at least once)

install.packages('tidyverse')

Load the tidyverse package (once per R session)
- the individual packages that are loaded by default are listed
- conflicts are listed

library(tidyverse)

Conflicts result from identical function names
- resolve conflicts
  - explicit referencing of package with ::
    - e.g., stats::filter( ) or dplyr::filter( )
  - creating new default
    - e.g., select <- dplyr::select
Conflicts can be checked for tidyverse

tidyverse_conflicts( )

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
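To make the filter( ) conflict concrete, here is a small sketch of explicit referencing with ::. Both functions are called on purpose to show that they do unrelated things; the object names four_cyl and smoothed are illustrative only.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that match a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() applies a linear filter to a series,
# here a 3-point moving average
smoothed <- stats::filter(1:10, rep(1/3, 3))
smoothed
```

Writing the package name in front removes any dependence on the order in which packages were loaded.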

tidyverse ecosystem includes
broom, conflicted, cli, dbplyr, dplyr, dtplyr, forcats, ggplot2, googledrive, googlesheets4, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, ragg, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, xml2, tidyverse.

tidyverse_packages( )

tidy data (input) - where it started

ggplot2, by Hadley Wickham (who now works at RStudio)
- consistent input makes it easier to write visualization functions
- enforces the use of ‘tidy’ data
Tidy data
- each observation in focus gets its own row
- columns hold the properties of these observations (cell values)
- tabular, possibly disentangled into multiple tables
tibble
- data.frame 2.0
- do less, complain more
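As a rough illustration of what "one row per observation" means, the sketch below reshapes a small wide table (one column per measurement occasion) into tidy form with tidyr's pivot_longer( ). The data and the column names id, t1, t2, time and score are invented for this example.

```r
library(tidyr)
library(tibble)

# Wide layout: one row per subject, one column per time point
wide <- tibble(id = 1:2, t1 = c(5, 7), t2 = c(6, 9))

# Tidy layout: one row per subject-by-time observation
long <- pivot_longer(
  wide,
  cols      = c(t1, t2),
  names_to  = "time",
  values_to = "score"
)
long
```

Here the observation in focus is a single measurement, so each of the four measurements gets its own row.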

tidy output - its extension

The broom package, originally written by David Robinson; Max Kuhn (caret) now works at RStudio on the related tidymodels packages
- homogenize statistical output
- output to potentially serve as input
Tidy output
- glance( ): one-row model information
- tidy( ): multiple-row statistical summary
- augment( ): data extended with model-based values
Output also turned into a tibble
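The three kinds of tidy output can be sketched on a simple linear model. The model mpg ~ wt is just an arbitrary example, and the object names are chosen for illustration.

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

model_info <- glance(fit)    # one row: R squared, AIC, ...
coef_table <- tidy(fit)      # one row per model term
extended   <- augment(fit)   # one row per observation, with fitted values and residuals

nrow(model_info)
nrow(coef_table)
nrow(extended)
```

Because each result is a tibble, it can be passed straight on to dplyr or ggplot2 functions.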

tibbles: the tidyverse data type

The tibble package offers the tidyverse data type, a tibble
A tibble is a data frame, not necessarily the other way around
A data frame is R’s data type for analysis
- a list of equally sized vectors
  - numeric vector (double or integer; complex is a separate type)
  - factor (ordered, not ordered)
  - logical vector
  - character
A tibble enhances a data frame
- for convenience and consistency
- no row-names, must be part of data
- different default behavior
  - printing, naming, …
  - less forgiving
- example: print
Create tibble with tibble( ) or tribble( ) function
- notice: class( ) shows both data.frame and tbl_df
- notice: no row names, all info made explicit as data
- compare with a data frame

mytibble <- tibble(
  colA = c("a","b","c"),
  colB = c(1:3)
)
(mytibble <- tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
))

# A tibble: 3 × 2
  colA   colB
  <chr> <dbl>
1 a         1
2 b         2
3 c         3

class(mytibble)

[1] "tbl_df"     "tbl"        "data.frame"

mydf <- data.frame(colA=c('a','b','c'),colB=1:3)
class(mydf)

[1] "data.frame"

No need to think much about tibbles
- a tibble is a data frame
- tidyverse functions accept plain data frames as well as tibbles; as_tibble( ) converts explicitly
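Two of the differences listed above, "no row names" and "less forgiving", can be sketched as follows. The column name car and the object names are assumptions made for this illustration.

```r
library(tibble)

# Row names become an explicit column when converting to a tibble
cars_tb <- as_tibble(mtcars, rownames = "car")
cars_tb$car[1:2]

# A data frame silently completes a partial column name with $ ...
mtcars$dis[1:2]        # partial match for disp

# ... a tibble returns NULL and warns about the unknown column
partial <- suppressWarnings(cars_tb$dis)
is.null(partial)
```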

pipes: a convenient way of chaining functions

The magrittr package offers the pipe operator
- %>% from magrittr, or |> native to base R since version 4.1.0
- pushes the left-hand side as first argument into the right-hand side
  - e.g., object %>% function( )
- borrowed from functional programming
tidyverse functions always take their input as first argument
- function(input, …)
- pipes are convenient to chain functions
  - e.g., object %>% function( ) %>% function’( ) %>% function’’( ) …
Pipes read from left to right
- most base R use reads inside-out
- compare
  - mtcars %>% pull(mpg) %>% mean()
  - mean(mtcars$mpg)
- especially of interest with multiple steps, serves readability
- example: root sum of squares for two sets of 10, sampled from standard normal

x1 <- rnorm(10); x2 <- rnorm(10)

sqrt(sum((x1-x2)^2))

[1] 6.000135

(x1-x2)^2 %>% sum( ) %>% sqrt( )

[1] 6.000135
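The same computation can be written with the native pipe as well. The sketch below fixes a seed (an arbitrary choice, so its numbers differ from the output above) and checks that all three forms agree. Note the extra parentheses around the squared difference: ^ binds more tightly than a pipe, but * and + do not, so wrapping the left-hand side is the safer habit.

```r
library(magrittr)

set.seed(1)   # arbitrary seed, only to make the sketch reproducible
x1 <- rnorm(10); x2 <- rnorm(10)

nested   <- sqrt(sum((x1 - x2)^2))                  # reads inside-out
magrittr <- ((x1 - x2)^2) %>% sum() %>% sqrt()      # magrittr pipe
native   <- ((x1 - x2)^2) |> sum() |> sqrt()        # base R pipe (R >= 4.1.0)

all.equal(nested, magrittr)
all.equal(nested, native)
```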

Example: tidyverse

Create factors for all variables with fewer than 4 distinct values
- for data.frame mtcars
- change the elements (mutate)
  - for all variables (across)
    - where the variable (.) has fewer than 4 distinct values
    - to factor
- and show the structure (glimpse)

mtcars %>% 
    mutate(
        across(
            where(~n_distinct(.)<4),
            as.factor)) %>% 
    select(1:4) %>% glimpse

Rows: 32
Columns: 4
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
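Since the output above only shows the first four columns, a quick way to see every column that was converted is to select the factors afterwards. This sketch repeats the transformation with the native pipe and an anonymous function (\(x), available from R 4.1.0) in place of the formula shorthand; the names converted and factor_cols are illustrative.

```r
library(dplyr)

converted <- mtcars |>
  mutate(across(where(\(x) n_distinct(x) < 4), as.factor))

# Names of the columns that ended up as factors
factor_cols <- converted |>
  select(where(is.factor)) |>
  names()

factor_cols   # "cyl" "vs" "am" "gear"
```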