`library(tidyverse)`

# r-square intro

Compiled on R 4.3.1

This site aims to introduce researchers to the `tidyverse` ecosystem in R.

Our target audience is primarily the research community of the VUB / UZ Brussel, particularly those who have some basic experience in R and want to know more.

We invite you to help improve this document by sending us feedback: square@vub.be

## First Tidyverse Steps

- Data manipulation and visualisation
  - simple but very important
  - bridges the gap between raw data and modeling
  - important part of most analyses
  - often neglected in statistics courses
- Our focus: tidyverse
  - a set of R packages (~ functions)
  - in between raw data and modeling

## tidyverse: Why it exists

- R: a flexible open source statistical programming tool (2000)
  - open source: many contributors writing code their own way
  - users have to adapt to each package / function
- Commit to shared rules (without reducing R's flexibility)
  - consistency in terms of input and output
    - contract with user → tidy data
    - contract with developer
  - consistency & intuitive/sensible defaults
    - constancy of data type by default
  - consistency in function names and (order of) arguments

- `tidyverse` = first successful attempt to make R more consistent
  - earlier attempts failed
  - tidyverse well thought through
  - tidyverse makes sense for most
  - tidyverse supported and promoted by RStudio
- Ecosystem emerges, following the tidyverse rules
  - `tibble` for data representation
  - `tidyr` for tidying data
  - `dplyr` for manipulating data frames
  - `ggplot2` for visualizing data
  - `stringr` for dealing with text
  - `readr` for reading in data
  - `forcats` for dealing with factors
  - `purrr` for functional programming (advanced)
  - …

Find convenient cheat sheets online or directly in RStudio (Help → Cheat Sheets).

## Set up tidyverse packages

The R-primer page (see menu: context) can serve as a basic introduction to the use of R in general.

- Install the `tidyverse` package (at least once)

`install.packages('tidyverse')`

- Load the `tidyverse` package (once per R session)
  - the individual packages that are loaded by default are listed
  - conflicts are listed

`library(tidyverse)`

- Conflicts result from identical function names
  - resolve conflicts
    - explicit referencing of package with `::`
      - e.g., `stats::filter( )` or `dplyr::filter( )`
    - creating new default
      - e.g., `select <- dplyr::select`
- Conflicts can be checked for tidyverse

`tidyverse_conflicts( )`

```
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```
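As a minimal sketch of explicit referencing with `::`, the two `filter( )` functions from the conflict list can be called side by side. The choice of `mtcars` and of a moving average of width 3 is only for illustration.

```r
# dplyr::filter() keeps the rows of a data frame that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() applies a linear filter to a series,
# here a centered moving average of width 3
smoothed <- stats::filter(1:10, rep(1/3, 3))
smoothed
```

With the package prefix, each call is unambiguous regardless of which packages are loaded or in which order.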

The `tidyverse` ecosystem includes broom, conflicted, cli, dbplyr, dplyr, dtplyr, forcats, ggplot2, googledrive, googlesheets4, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, purrr, ragg, readr, readxl, reprex, rlang, rstudioapi, rvest, stringr, tibble, tidyr, xml2, tidyverse. The full list is returned by

`tidyverse_packages( )`

## tidy data (input)

- Hadley Wickham's `ggplot2` (Wickham now works at RStudio)
  - consistent input, easier to write visualization functions
  - enforce use of 'tidy' data
- Tidy data
  - each observation of interest is assigned its own row
  - columns add properties to these observations (cell values)
  - tabular, possibly disentangled into multiple tables
- `tibble`
  - data.frame 2.0
  - do less, complain more
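The idea of one row per observation can be sketched with `tidyr::pivot_longer( )`. The data below are hypothetical (patient names, visits, and scores are made up for illustration) and start in a 'wide' layout with one column per visit.

```r
library(tidyr)
library(tibble)

# Hypothetical 'wide' data: one row per patient, one column per visit
wide <- tribble(
  ~patient, ~visit1, ~visit2,
  "p1",     5.1,     4.8,
  "p2",     6.3,     6.0
)

# Tidy version: one row per observation (a patient at a visit),
# with the visit and the measured score as separate columns
long <- pivot_longer(
  wide,
  cols = c(visit1, visit2),
  names_to = "visit",
  values_to = "score"
)
long
```

In the long layout the visit becomes a property of each observation, which is the form most tidyverse functions expect as input.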

## tidy output

- Max Kuhn's `caret` (Kuhn now works at RStudio) turned into the `broom` package
  - homogenize statistical output
  - output to potentially serve as input
- Tidy output
  - one-row model information: `glance`
  - multiple row statistical summary: `tidy`
  - model based extended data: `augment`
- Output also turned into a tibble
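The three kinds of tidy output can be illustrated on a simple linear model. This is a sketch only: the model `mpg ~ wt` on `mtcars` is an arbitrary choice, and `broom` has to be loaded separately because it is not attached by `library(tidyverse)`.

```r
library(broom)

# A simple linear model on the built-in mtcars data (illustrative choice)
fit <- lm(mpg ~ wt, data = mtcars)

model_info <- glance(fit)   # one row of model-level information
coef_table <- tidy(fit)     # one row per estimated coefficient
extended   <- augment(fit)  # the data extended with fitted values and residuals

model_info
coef_table
extended
```

All three results are tibbles, so they can be passed straight on to `dplyr` or `ggplot2`.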

## tibbles: the tidyverse data type

The `tibble` package offers the tidyverse data type, a `tibble`.

- A `tibble` is a `data frame`, not necessarily the other way around
- A `data frame` is R's data type for analysis
  - a list of equally sized vectors
    - numeric vector (either double, integer, or complex)
    - factor (ordered, not ordered)
    - boolean (logical) vector
    - character
- A `tibble` enhances a `data frame`
  - for convenience and consistency
  - no row-names, must be part of data
  - different default behavior
    - printing, naming, …
  - less forgiving
  - example: print
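Two of the differences in default behavior can be sketched as follows. The column names `letter` and `number` are hypothetical and chosen so that partial matching is unambiguous.

```r
library(tibble)

df <- data.frame(letter = c("a", "b", "c"), number = 1:3)
tb <- as_tibble(df)

# Constancy of data type: selecting one column with [ , ]
# drops to a vector for a data frame, but stays a tibble for a tibble
df_col <- df[, "letter"]
tb_col <- tb[, "letter"]
class(df_col)
class(tb_col)

# Less forgiving: $ partially matches names in a data frame,
# while a tibble returns NULL and warns about the unknown column
df_partial <- df$let
tb_partial <- tb$let
df_partial
tb_partial
```

The tibble behavior is stricter, but it makes the type of the result predictable and surfaces misspelled column names early.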

- Create `tibble` with `tibble( )` or `tribble( )` function
  - notice: `class( )` shows both `data.frame` and `tbl_df`
  - notice: no row names, all info made explicit as data
  - compare with dataframe

```
# column-wise definition
mytibble <- tibble(
  colA = c("a","b","c"),
  colB = c(1:3)
)
# row-wise definition; the outer parentheses print the result
(mytibble <- tribble(
  ~colA, ~colB,
  "a", 1,
  "b", 2,
  "c", 3
))
```

```
# A tibble: 3 × 2
colA colB
<chr> <dbl>
1 a 1
2 b 2
3 c 3
```

`class(mytibble)`

`[1] "tbl_df" "tbl" "data.frame"`

```
mydf <- data.frame(colA=c('a','b','c'),colB=1:3)
class(mydf)
```

`[1] "data.frame"`

- No need to think much about `tibbles`
  - a `tibble` is a `data frame`
  - tidyverse functions automatically enhance `data frames` to `tibbles`

## pipes: a convenient way of chaining functions

- The `magrittr` package offers the pipe operator `%>%`; base R (version 4.1 and later) offers the native pipe `|>`
  - pushes left hand side into right hand side
    - e.g., object %>% function
  - borrowed from functional programming
- `tidyverse` functions always take their input as first argument
  - function(input, …)
  - pipes are convenient to chain functions
    - e.g., object %>% function %>% function %>% function …
- Pipes read from left to right
  - most base R use reads inside-out
  - compare `mtcars %>% pull(mpg) %>% mean()` with `mean(mtcars$mpg)`
  - especially of interest with multiple steps, serves readability
  - example: root of the sum of squared differences between two sets of 10 values, sampled from a standard normal

`x1 <- rnorm(10); x2 <- rnorm(10)`

`sqrt(sum((x1-x2)^2))`

`[1] 4.570047`

`(x1-x2)^2 %>% sum( ) %>% sqrt( )`

`[1] 4.570047`
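The same computation can also be written with the native pipe `|>`, which needs no package in R 4.1 or later. The sketch below fixes an arbitrary seed, so the numbers differ from the output above; the extra parentheses around the squared differences are added only to make the grouping explicit.

```r
set.seed(1)  # arbitrary seed, only to make the comparison reproducible
x1 <- rnorm(10); x2 <- rnorm(10)

# inside-out: read from the innermost call outwards
nested <- sqrt(sum((x1 - x2)^2))

# left to right: each result is passed on as first argument of the next call
piped <- ((x1 - x2)^2) |> sum() |> sqrt()

all.equal(nested, piped)
```

Note that the native pipe requires the parentheses of the function call on its right hand side, e.g. `sum()` rather than `sum`.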

## Example: tidyverse

- Create factors for all variables with fewer than 4 distinct values
  - for data.frame `mtcars`
    - change the elements (`mutate`)
      - for all variables (`across`)
        - where variable `.` has < 4 distinct values
        - to factor
    - and show the structure (`glimpse`)

```
mtcars %>%
  mutate(
    across(
      where(~n_distinct(.)<4),
      as.factor)) %>%
  select(1:4) %>% glimpse
```

```
Rows: 32
Columns: 4
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
```