Contemporary R programming

a step into the tidyverse
Author

Wilfried Cools

Published

December 28, 2023

SQUARE consultant
square.research.vub.be

Compiled on R 4.3.1

Programming in R

Programming in R is quite similar to programming in other languages,
especially Python, Matlab, …

Learning how to program is quite similar too

  • program… a lot
  • keep solving your problems
  • re-write !!

Make the computer do the work for you

  • create your own algorithms → to process input to output
  • talk to the computer
    • split up a problem into small(er) steps
    • for each step, make everything explicit
  • gain automation
    • gain performance
      • tweak & rerun
      • weed away errors
    • gain reproducibility
  • avoid
    • copy-pasting in your code ? Again !!
    • typing in specific values ? Again !!
    • make one change, and thus many others ? Again !!
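As a small sketch of the "avoid copy-pasting" advice, the base R snippet below replaces three near-identical lines with one iteration over column names. The choice of columns and the names `columns` and `col_means` are illustrative assumptions, not part of the original slides.

```r
# copy-pasted version: one line per column, each a chance for a typo
mean_mpg <- mean(mtcars$mpg)
mean_hp  <- mean(mtcars$hp)
mean_wt  <- mean(mtcars$wt)

# automated version: name the columns once, iterate over them
columns   <- c("mpg", "hp", "wt")   # illustrative choice of columns
col_means <- sapply(mtcars[columns], mean)
col_means
```

Adding a column now means changing `columns` only, not writing another line.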

Get the most out of what you can do with the computer

  • readable code

    • by future you, by peers / reviewers
    • from well-documented to self-explanatory
  • easily extendable and general code

    • modularity / encapsulation
    • avoid hard coding
      → use variables for flexibility
  • efficient code (speed)

  • iterations (do x for every instance of y)

The Essence of R Programming

Functions and arguments

Define your own functions

  • to avoid repetition (reusable)
  • to increase readability
  • to reduce errors
  • to encapsulate code (scoping)

Use built-in functions whenever you can

  • base R
  • packages

Arguments → conditional implementation ~ flexibility
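A minimal sketch of how an argument can make the implementation conditional. The function name `centre` and its `robust` argument are hypothetical, chosen only for illustration.

```r
# hypothetical helper: the `robust` argument switches the implementation
centre <- function(values, robust = FALSE) {
  if (robust) {
    median(values)   # less sensitive to outliers
  } else {
    mean(values)
  }
}

centre(c(1, 2, 3, 100))                 # uses the mean
centre(c(1, 2, 3, 100), robust = TRUE)  # uses the median
```

The default value keeps the simple call short, while the argument offers flexibility when needed.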

Iterations and Functions

example: cumulative sum:

  • assume 4 numbers
my_sum <- c(10,20,30,40)
  • manually get cumulative sum
c(10,10+20,10+20+30,10+20+30+40)
  • existing function for the cumulative sum
cumsum(my_sum)
  • iterate to get the cumulative sum
out <- numeric()
for(it in seq_along(my_sum)){
  out <- c(out,sum(out[it-1],my_sum[it]))
  }
out
[1]  10  30  60 100
  • define a function: <- function( )
my_cumsum <- function(values){
  out <- numeric()
  # seq_along() also handles an empty input, unlike 1:length(values)
  for(it in seq_along(values)){ out <- c(out,sum(out[it-1],values[it])) }
  return(out) }
  • use that function
# call and reuse
my_cumsum(my_sum)
my_cumsum(c(5,4,3))
  • use that function multiple times
# for all at once (map() comes from purrr, part of the tidyverse)
map(
    list(
        a=my_sum,
        b=c(5,4,3)
    ),
    my_cumsum)
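For comparison, the same idea can be sketched in base R only, assuming no tidyverse is loaded. The name `my_cumsum2` is hypothetical; `Reduce()` and `lapply()` are base R functions.

```r
my_sum <- c(10, 20, 30, 40)

# Reduce() with accumulate = TRUE keeps every intermediate sum
my_cumsum2 <- function(values) Reduce(`+`, values, accumulate = TRUE)

# lapply() applies the function to each list element, like purrr::map()
results <- lapply(list(a = my_sum, b = c(5, 4, 3)), my_cumsum2)
results
```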

Packages and Environments

  • Bring in functions defined in packages
    • locally install.packages('tidyverse')
    • in your workspace library(tidyverse)
  • Note: because tidyverse includes dplyr, the function select is understood.
mtcars %>% select(mpg,cyl) %>% slice(1:2)
              mpg cyl
Mazda RX4      21   6
Mazda RX4 Wag  21   6

Different package - same function name

library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:gtExtras':

    select
The following object is masked from 'package:dplyr':

    select
mtcars %>% select(mpg,cyl) %>% slice(1:2)
Error in select(., mpg, cyl) : unused arguments (mpg, cyl)
  • Functions are defined within environments
getAnywhere(select)
3 differing objects matching 'select' were found
in the following places
  package:MASS
  package:gtExtras
  package:dplyr
  namespace:dplyr
  namespace:MASS
  namespace:tidyselect
Use [] to view one of them

Environments and Namespaces

Packages can be made explicit with ::

environment(select)
<environment: namespace:MASS>
environment(dplyr::select)
<environment: namespace:dplyr>
mtcars %>% dplyr::select(mpg,cyl)
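The masking mechanism can also be sketched without extra packages. Below, a deliberately silly user-defined `sd` (a hypothetical function, for illustration only) masks `stats::sd`, and `::` still reaches the original.

```r
# a user-defined function that masks stats::sd in the workspace
sd <- function(x) "my own sd"   # hypothetical, for illustration only

sd(c(1, 2, 3))          # finds the workspace definition first
stats::sd(c(1, 2, 3))   # :: reaches the version in the stats namespace

environment(sd)         # where the masking function was defined
environment(stats::sd)  # <environment: namespace:stats>

rm(sd)                  # remove the mask; sd() is stats::sd again
```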

Explicit: better but cumbersome

  • combine often used libraries on top
  • use :: for unique / rare use
  • overwrite function name to be sure
select <- dplyr::select
mtcars %>% select(mpg,cyl) %>% slice(1:2)
              mpg cyl
Mazda RX4      21   6
Mazda RX4 Wag  21   6

A package can be made default

environment(select)
<environment: namespace:dplyr>
environment(dplyr::select)
<environment: namespace:dplyr>
mtcars %>% dplyr::select(mpg,cyl)

Modularity and Flexibility

Solve big problems

  • by solving many small problems (chain)
  • by extending small problems (embed)

Chunks of code (e.g., functions),
each with simple input and output

Make code run for the more general case

Link chunks of code automatically

  • output is input
  • using arguments to functions

Define once so changes are made once

  • DRY (Don’t Repeat Yourself)
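A sketch of chaining small chunks, where the output of one step is the input of the next. The helper names `keep_complete` and `standardise` are hypothetical, and the native pipe `|>` assumes R 4.1 or later.

```r
# two small, reusable steps (hypothetical helper names)
keep_complete <- function(values) values[!is.na(values)]
standardise   <- function(values) (values - mean(values)) / sd(values)

raw <- c(2, 4, NA, 6)

# chain: the output of one step is the input of the next
z <- raw |> keep_complete() |> standardise()
z
```

Each helper is defined once, so a change to how missing values are handled is made in one place.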

Readability

Consistent naming of variables and functions

  • nouns for variable names
  • verbs for functions
  • fixed composition order
    e.g., lm_dta_sub - glm_dtb_ext
  • combine what belongs together
    e.g., 1st, 2nd and 3rd element

Short and meaningful naming

Functions instead of code

Isolate the core of the program

mtcars[mtcars$mpg > 21 & mtcars$hp < 60, c(1,4,6)]
mtcars[mtcars$mpg > 21 & mtcars$hp < 60, c('mpg','hp','wt')]
mtcars %>% filter(mpg>21,hp<60) %>% select(mpg,hp,wt)
             mpg hp    wt
Honda Civic 30.4 52 1.615
mtcars %>% 
  filter(mpg>21,hp<60) %>% 
  select(mpg,hp,wt)
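One way to read "functions instead of code" is to wrap the selection above in a function with named arguments. This is a base R sketch, and `select_cars` and its arguments are hypothetical names.

```r
# hypothetical wrapper: thresholds and columns become arguments
select_cars <- function(data, min_mpg, max_hp,
                        columns = c("mpg", "hp", "wt")) {
  keep <- data$mpg > min_mpg & data$hp < max_hp
  data[keep, columns]
}

select_cars(mtcars, min_mpg = 21, max_hp = 60)
```

The call now states what is selected, and the thresholds are no longer hard coded.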

R Programming Specifics

R as a tool

  • dedicated to statistics, but capable of much more
  • open source (almost fully)
  • highly modular
  • uses vectorisation

Wickham, H. (2019). Advanced R, Second Edition. CRC Press.

R as a language

  • functional / kinda object oriented
  • use of lexical scoping
  • dynamically-typed
  • specific choices for memory use
    • copy on modify
    • modify in place when unique reference
    • modify in place for environments
    • lists store references, not values
  • built on C / Fortran
    • can be fast !! but often is not
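The difference between copy on modify and modify in place can be observed directly. The sketch below only uses base R, and the object names are illustrative.

```r
# vectors: modifying a copy leaves the original untouched
x <- c(1, 2, 3)
y <- x
y[1] <- 99
x   # still 1 2 3

# environments: both names refer to the same object
e1 <- new.env()
e1$value <- 1
e2 <- e1
e2$value <- 2
e1$value   # now 2, changed through e2
```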

Speed

It matters how you do things

  • Make use of vectorisation.
    c(1:5)^2 → 1, 4, 9, 16, 25
    really!! make use of it
  • Avoid growing objects in loops: every change creates a modified copy.
nr_iter <- 100000
# vectorisation
system.time(out <- (1:nr_iter)^2)
   user  system elapsed 
      0       0       0 
# pre-allocating memory
out2 <- numeric(nr_iter)
system.time(for(it in 1:nr_iter) out2[it] <- it^2)
   user  system elapsed 
   0.11    0.00    0.08 
# growing output
out3 <- numeric()
system.time(for(it in 1:nr_iter) out3 <- c(out3,it^2))
   user  system elapsed 
  16.66    2.77   22.20 
# using tidyverse
system.time(out4 <- map_dbl(1:nr_iter,~.x^2))
   user  system elapsed 
   0.22    0.01    0.27 
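As a complement, the sketch below checks that the faster approaches give the same result, and adds base R's `vapply()` as an alternative to `map_dbl()`. The smaller `nr_iter` is an assumption to keep it quick, and no timings are claimed.

```r
nr_iter <- 1000   # small illustrative size

# vectorised reference result
out_vec <- (1:nr_iter)^2

# pre-allocate the full result, then fill it in place
out_pre <- numeric(nr_iter)
for (it in seq_len(nr_iter)) out_pre[it] <- it^2

# vapply(): iteration with a declared type and length for each result
out_app <- vapply(seq_len(nr_iter), function(it) it^2, numeric(1))

all.equal(out_vec, out_pre)
all.equal(out_vec, out_app)
```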

R Objects

R workspaces contain R objects

  • data structures
  • functions are objects too
  • new object types can be created

Objects differ in how they are used

  • inspect objects
  • extract information from objects

R for you - typically using data frames

  • data frames are lists
    heterogeneous
  • matrices are more efficient
    homogeneous
  • and vectors
    • (mostly) double for numeric
    • factor for categorical
      fixed set of categories (stored as integer codes)
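These properties can be inspected directly. A short base R sketch follows, in which the factor `gear` is made up for illustration.

```r
is.list(mtcars)              # a data frame is a list of columns
typeof(mtcars$mpg)           # "double"
typeof(as.matrix(mtcars))    # one type for the whole matrix

# a factor stores integer codes plus a fixed set of levels
gear <- factor(c("four", "three", "four"))
levels(gear)                 # "four" "three"
as.integer(gear)             # 1 2 1
```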

Quick perspective on objects

R Functions

R workspaces contain R functions

  • check with
    lsf.str()
    lsf.str("package:dplyr")
  • look at the function, e.g., lm
  • look for information, e.g., ?lm
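A sketch of inspecting functions as objects. `add_one` is a hypothetical function defined only so the workspace contains something to list.

```r
# hypothetical function, defined only to have something in the workspace
add_one <- function(x, by = 1) x + by

lsf.str()                # functions in the current environment
names(formals(add_one))  # argument names: "x" "by"
body(add_one)            # the code of the function

names(formals(lm))       # arguments of a function from a package
environment(lm)          # <environment: namespace:stats>
```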

Go to the help file

  • conditional on arguments (input)
  • give a return value (output)
  • with examples

Most functions in packages

  • load into workspace
    library or require
  • preferably use heavily used packages
  • maybe prioritise tidyverse
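The practical difference between `library` and `require` shows when a package is missing: `library` stops with an error, while `require` returns `FALSE`. A minimal sketch, where `notARealPackage` is a made-up name assumed not to be installed:

```r
# requireNamespace() checks availability without attaching the package
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats   # TRUE, stats ships with R

# require() returns FALSE when a package is missing; library() would stop
# "notARealPackage" is a made-up name used only to trigger the failure
has_fake <- suppressWarnings(
  require("notARealPackage", character.only = TRUE, quietly = TRUE)
)
has_fake    # FALSE

# typical pattern: fail early with a clear message
if (!has_stats) stop("package 'stats' is needed for this script")
```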