Data Wrangling with R’s tidyverse

dplyr, tidyr and friends
Author

Wilfried Cools & Lara Stas

Published

December 27, 2023

SQUARE consultants
square.research.vub.be

Compiled on R 4.3.1

What-Why-Who

This site aims to introduce researchers to data manipulation in R with the dplyr, tidyr, and stringr packages of the tidyverse ecosystem.

Our target audience is primarily the research community at VUB / UZ Brussel, those who have some basic experience in R and want to know more.

We invite you to help improve this document by sending us feedback: wilfried.cools@vub.be

Key Message
  • Data manipulation can prepare data and/or its summarizing statistics
    • for modeling purposes
    • for visualisation purposes
  • Data manipulation is inherent to data analysis, not just a precursor
    • no -fit’s all data representation-
      • note: raw data should not be altered and kept safe
    • flexible use of data manipulation
      • supports more informative and complete modeling
      • elicits better visualisation of data and statistics
  • Data manipulation is best done with coding
    • efficiently and correctly process data and statistics
    • maintain structure and transparency, to support reproducibility
  • Data manipulation is easier and more intuitive when maintaining tidy data.
    • tidy data: meaning appropriately mapped into structure
      • each row an observation as research unit,
      • each column a variable as property,
      • each cell a particular value, linking row to column
      • note: data can be split into multiple tables (relational data).
    • aim for tidy data registration (avoid tedious manipulations)
  • Workflow (Hadley Wickham):

R’s tidyverse packages: dplyr and tidyr

  • Current focus on dplyr and tidyr on manipulating and tidying data in the tidyverse eco-system (Hadley Wickham etal.)

  • Data manipulation can be done in base R, or other packages

  • dplyr and tidyr, the current defaults

    • inspired heavily on relational database logic
    • developed purposefully
      • largely consistent
      • well appreciated defaults
      • easy and intuitive to build (if you get it)
      • without loosing much flexibility
  • dplyr and tidyr, part of the tidyverse ecosystem includes:

    • ggplot2 for visualizing data and statistics [check Visualization]
    • stringr for dealing with texts
    • forcats for dealing with factors

Convenient cheat sheets at https://rstudio.com/resources/cheatsheets/.

Getting ahead of ourselves with dplyr

toy dataset

  • The infamous mtcars data are used.
    • observe it’s structure with str( ) and first 6 observations head( ) function.
    • note: available data with data( )
  • Have a tidyverse look at the data with glimpse( )
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
  • Have a tidyverse look at the data with slice_head( )
mtcars %>% slice_head(n=6)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

examplary data manipulation

  • Get the minimum value of the ratio of mpg over hp for each combination of am and cyl.
    • take the mtcars data,
    • select variables mpg, cyl, hp, am, and rename hp to hpow,
    • subset rows where hpow bigger than 3.5,
    • create new variable mpgr as the ratio mpg on hpow,
    • summarize mpgr as the minimum for every combination of cyl and am,
    • and reshape the result into a table with one row per cyl-value (4,6,8) and a column for each am value (0,1),
    • with column variable names renamed to am0 and am1.
mtcars %>%
    select(mpg, cyl, hpow=hp, am) %>%
    filter(hpow > 3.5) %>%
    mutate(mpgr = mpg/hpow) %>%
    group_by(cyl, am) %>%
    summarize(min=min(mpgr)) %>%
    pivot_wider(names_from=am,
        values_from=min) %>%
    select(cyl,am0=`0`,am1=`1`)
cyl am0 am1
4 0.2216495 0.1963303
6 0.1560976 0.1125714
8 0.1068571 NA

dplyr package, functions to manipulate data

  • dplyr reflects the apply function in base R
    • d is for data frames
  • Focus on manipulating data frames (tibbles):
    • subsetting, altering, summarizing, ordering, combining, reshaping
  • The main -verbs- (see example above)
    • filter( ) : conditional selection of cases
    • select( ) : conditional selection of variables, allows reordering and renaming
    • mutate( ) : creation of new variables based on existing variables
    • summarise( ) : reduce sets of values to single values
  • The verb to structure data (see example above)
    • group_by( ) : internal grouping, undo with ungroup( )
    • works preceding main verbs
  • The verbs to enhance control on scope (advanced)
    • across( ) : new way of scoping (instead of *_it, *_at, *_all)
      • works for selection in mutate( ) and summarize( )
  • Additional dplyr verbs:
    • arrange : ordering of cases
    • sample_n and sample_frac : random sampling
    • slice, transmute, rename, relocate, …
  • Verbs to extend data
    • bind_rows and bind_cols : append data of same structure
    • left_join, right_join, inner_join, full_join, semi_join and anti_join : join data using indicator variable(s)
  • Note: only the core of dplyr is discussed, much more is possible

group_by( )

  • Grouping prepares data for group specific operations

intro

  • Get a glimpse of the data as before,
    • number of rows and columns
      • in tidy data: observations and variables
    • number of groups, and grouping variables
      • 4 groups: 2 am x 2 vs
      • Note: grouping structure part of glimpse-output
  • The width is set for presentation purposes
tst <- mtcars %>% group_by(am,vs)
glimpse(tst,width=40)
Rows: 32
Columns: 11
Groups: am, vs [4]
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 1…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4…
$ disp <dbl> 160.0, 160.0, 108.0, 258.…
$ hp   <dbl> 110, 110, 93, 110, 175, 1…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.21…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.4…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2…
  • Actions on grouped data are grouped too,
    • e.g., a frequency table, count the number of observations (count( ))
    • grouped data result in grouped counts
tst %>% count()
am vs n
0 0 12
0 1 7
1 0 6
1 1 7
  • Remove grouping with ungroup( )
    • this is good practice to avoid unwanted effects !
tst <- tst %>% ungroup( )
tst %>% count()
n
32
  • Alternatively, overwrite the initial grouping
    • the last grouping is used by default
    • additional arguments, for example .add and .drop, can change that
      • a first groups by vs
      • a second groups by am and vs
mtcars %>% group_by(am) %>% group_by(vs)
mtcars %>% group_by(am) %>% group_by(vs, .add=TRUE)
  • Transformed variables can also be used for grouping
    • e.g., cutting the mpg in 3 groups with cut( ) then use count( )
      Notice the intervals that are created.
tst <- mtcars %>% group_by(mpg3 = cut(mpg, 3))
tst %>% count()
mpg3 n
<10.4,18.2] 14
<18.2,26.1] 13
<26.1,33.9] 5

exercises

  • Embedded within the next sections

filter( )

  • Filtering returns rows using matching conditions

intro

  • Get a subset of rows that includes only those rows with mpg above 30
mtcars %>% filter(mpg > 30)
mpg cyl disp hp drat wt qsec vs am gear carb
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
  • Include more than just one condition,
    • take only rows with mpg above 20 AND qsec below or equal to 18
    • note: consecutive filtering achieves the same.
      • & for and
      • for or

      • ! for not
mtcars %>% filter(mpg > 20 & qsec <= 18)
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
  • More complex conditions can be specified
    • take rows with mpg above 30 OR qsec below 20 AND am equal to 0
    • all the rules of logic apply, parentheses included
mtcars %>% filter(mpg > 30 | (qsec > 20 & am==0))
mpg cyl disp hp drat wt qsec vs am gear carb
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
  • Grouping also works here
    • get all distinct values of cyl per level of gear
    • note: this selects the first unique rows
mtcars %>% group_by(gear) %>% distinct(cyl)
gear cyl
4 6
4 4
3 6
3 8
3 4
5 4
5 8
5 6

exercises

  • The starwars dataset is already part of tidyverse, so you should have it available !
  • Have a glimpse at the data, what data types are included ?
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
  • Do note, different data types are included in the tibble (data frame)
    • chr for characters, int for integers, dbl for doubles, we miss the lgl for a boolean
    • notice that even a vector of type list can be included.
  • Filter the rows to subset the data and retain only characters with light skin and brown eye color
# A tibble: 7 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Biggs Da…    183    84 black      light      brown             24 male  mascu…
3 Cordé        157    NA brown      light      brown             NA fema… femin…
4 Dormé        165    NA brown      light      brown             NA fema… femin…
5 Raymus A…    188    79 brown      light      brown             NA male  mascu…
6 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
7 Padmé Am…    165    45 brown      light      brown             46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Arrange the data according the character’s height, largest on top ! (google it!!)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
 2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
 3 Lama Su     229    88 none       grey       black           NA   male  mascu…
 4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 7 Taun We     213    NA none       grey       black           NA   fema… femin…
 8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
 9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Who is smallest (comes on top after arranging) ?
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Slice the data and keep only the 5th to 10th observation ! (?slice)
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Owen Lars    178   120 brown, gr… light      blue              52 male  mascu…
3 Beru Whi…    165    75 brown      light      blue              47 fema… femin…
4 R5-D4         97    32 <NA>       white, red red               NA none  mascu…
5 Biggs Da…    183    84 black      light      brown             24 male  mascu…
6 Obi-Wan …    182    77 auburn, w… fair       blue-gray         57 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Slice the first 2 observations for each gender (group your data) !
    • what other functions are discussed at ?slice_head ?
# A tibble: 6 × 14
# Groups:   gender [3]
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Beru Whi…    165    75 brown      light      blue              47 fema… femin…
3 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
4 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
5 Ric Olié     183    NA brown      fair       blue              NA <NA>  <NA>  
6 Quarsh P…    183    NA black      dark       brown             62 <NA>  <NA>  
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Use slice_sample( ) to randomly select 5 observations !
# A tibble: 5 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Ki-Adi-M…    198  82   white      pale       yellow            92 male  mascu…
2 Wicket S…     88  20   brown      brown      brown              8 male  mascu…
3 Sebulba      112  40   none       grey, red  orange            NA male  mascu…
4 Shaak Ti     178  57   none       red, blue… black             NA fema… femin…
5 Luminara…    170  56.2 black      yellow     blue              58 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Use slice_max( ) to select 3 observations with highest values on height !
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Yarael P…    264    NA none       white      yellow            NA male  mascu…
2 Tarfful      234   136 brown      brown      blue              NA male  mascu…
3 Lama Su      229    88 none       grey       black             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • Get the top 3 (highest mass) for each species !
    • ignore characters with missing data for mass
    • note, -not missing- are those who are not ! missing is.na()
# A tibble: 40 × 14
# Groups:   species [32]
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 2 Dexter …    198   102 none       brown      yellow            NA male  mascu…
 3 Ki-Adi-…    198    82 white      pale       yellow            92 male  mascu…
 4 Zam Wes…    168    55 blonde     fair, gre… yellow            NA fema… femin…
 5 IG-88       200   140 none       metal      red               15 none  mascu…
 6 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
 7 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 8 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 9 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
10 Wicket …     88    20 brown      brown      brown              8 male  mascu…
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

select( )

  • Extract columns (variables) by name (or position), rename and/or reorder them

intro

  • Select the variable mpg
    • notice that even with one column, the result remains a dataframe (not a vector), this is tidyverse policy !
  • An operation on a data with a certain type should result in data of the same type.
    • if you take one column from a matrix you have a one column matrix, not a vector.
    • if you take one column from a data frame, again, you end up with a one-column data frame, not a vector.
mtcars %>% select(mpg)