Writing Efficient R Code

Yonghui Dong / 2018-02-01


My notes on Writing Efficient R Code.

1. Use an up-to-date version of R
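A quick way to check which version a session is running, as a minimal sketch. The `"3.4.0"` threshold is only an illustrative assumption, not a recommendation taken from these notes.

```r
# Print the full version string of the running R session
print(R.version.string)

# Compare against a minimum version; "3.4.0" is an arbitrary example threshold
min_version <- "3.4.0"
is_recent <- getRversion() >= min_version
print(is_recent)
```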

2. The Art of Benchmarking

To make your code faster, you first need to know how long it takes to run. Benchmarking means timing each candidate solution so that you can pick the fastest one.

  1. Two steps:
  1. Two functions:
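As a minimal sketch of the idea, the snippet below times two ways of summing the integers from 1 to n with base R's `system.time()`. The functions `sum_loop()` and `sum_vec()` are hypothetical examples written for this illustration, and the timings will vary from machine to machine.

```r
n <- 1e6

# Solution A: accumulate the sum in an explicit loop
sum_loop <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# Solution B: vectorised built-in (converted to double to avoid integer overflow)
sum_vec <- function(n) sum(as.numeric(seq_len(n)))

# Time each solution once; system.time() returns user, system and elapsed time
time_loop <- system.time(res_loop <- sum_loop(n))
time_vec  <- system.time(res_vec <- sum_vec(n))

# Both solutions must agree before their timings are worth comparing
stopifnot(isTRUE(all.equal(res_loop, res_vec)))

print(time_loop["elapsed"])
print(time_vec["elapsed"])
```

A single run is noisy, so repeating each measurement several times gives a more trustworthy comparison.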

3. Fine Tuning: Efficient Base R

R is flexible: you can often solve a single problem in many different ways, and some of them can be several orders of magnitude faster than others.

In R, memory allocation happens automatically. Each time a variable is created or grows, R has to request memory in RAM to store it, and this is time-consuming. Minimizing the number of allocations can therefore speed up your code.

  1. Three important rules:
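To illustrate the cost of memory allocation described above, here is a small sketch that compares growing a vector inside a loop with pre-allocating it. The functions `grow()` and `prealloc()` are hypothetical names used only for this example.

```r
n <- 1e4

# Growing: each c() call allocates a new, longer vector and copies the old one
grow <- function(n) {
  x <- NULL
  for (i in seq_len(n)) x <- c(x, i)
  x
}

# Pre-allocating: memory is requested once, then filled in place
prealloc <- function(n) {
  x <- integer(n)
  for (i in seq_len(n)) x[i] <- i
  x
}

# Same result, different allocation pattern
stopifnot(identical(grow(n), prealloc(n)))

print(system.time(grow(n)))
print(system.time(prealloc(n)))
```

The gap between the two typically widens as `n` increases, because the growing version copies an ever longer vector on every iteration.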

4. Diagnosing Problems: Code Profiling

Profiling helps you locate the bottlenecks in your code. The general idea is to run the code and record what is currently being executed every few milliseconds.

This can be done with the profvis R package, which is integrated into RStudio: you can highlight the code that you want to profile and run the profiler on it.
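Outside the RStudio menu, profvis can also be called directly. The sketch below assumes the profvis package is installed and only runs the profiler in an interactive session. `slow_col_means()` is a hypothetical workload made up for this example.

```r
# Hypothetical workload: column means computed with an explicit loop
slow_col_means <- function(df) {
  out <- numeric(ncol(df))
  for (j in seq_along(df)) {
    out[j] <- mean(df[[j]])
  }
  out
}

df <- as.data.frame(matrix(rnorm(1e5), ncol = 10))
col_means <- slow_col_means(df)

# Profile the call only if profvis is available and a viewer can be shown
if (interactive() && requireNamespace("profvis", quietly = TRUE)) {
  p <- profvis::profvis(slow_col_means(df))
  print(p)  # opens the interactive profile view
}
```

A workload this small may finish before the profiler collects many samples, so real profiling is more informative on code that runs for at least a few hundred milliseconds.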

5. Turbo Charged Code: Parallel Programming

Some problems can be solved faster by using multiple cores on your machine. By default, R uses only one core.

  1. How many cores does this machine have?

The parallel package has a function detectCores() that determines the number of cores in a machine.
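A minimal sketch of the call. Note that `detectCores()` can return `NA` when the count cannot be determined, and that the `logical = FALSE` argument (physical cores only) is honoured on some platforms but not all.

```r
library(parallel)

# Number of logical cores (includes hyper-threaded cores where applicable)
n_logical <- detectCores()

# Number of physical cores, where the platform can report it
n_physical <- detectCores(logical = FALSE)

print(n_logical)
print(n_physical)
```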

  1. What sort of problems benefit from parallel computing?

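Not every analysis can make use of multiple cores; many statistical algorithms are inherently sequential and can only use one. A useful rule of thumb: if a loop gives the same result whether it runs forwards or backwards, its iterations do not depend on each other, and there is a good chance that it can be run on multiple cores.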
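The forwards/backwards rule of thumb can be checked directly. In this sketch (all variable names are illustrative), the first loop has independent iterations and survives reversal, while the second depends on the previous iteration and does not.

```r
n <- 10

# Independent iterations: each element depends only on its own index
squares_fwd <- numeric(n)
for (i in 1:n) squares_fwd[i] <- i^2

squares_bwd <- numeric(n)
for (i in n:1) squares_bwd[i] <- i^2

same_when_independent <- identical(squares_fwd, squares_bwd)

# Dependent iterations: each element needs the previous one
doubling_fwd <- numeric(n)
doubling_fwd[1] <- 1
for (i in 2:n) doubling_fwd[i] <- doubling_fwd[i - 1] * 2

doubling_bwd <- numeric(n)
doubling_bwd[1] <- 1
for (i in n:2) doubling_bwd[i] <- doubling_bwd[i - 1] * 2

same_when_dependent <- identical(doubling_fwd, doubling_bwd)

print(same_when_independent)  # TRUE: a candidate for parallel execution
print(same_when_dependent)    # FALSE: order matters, so it must run sequentially
```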

  1. The parallel package - parApply() & parSapply() & parLapply()

They are the parallel versions of apply(), sapply() and lapply(), respectively. Each takes a cluster object as its first argument.
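A minimal sketch of `parSapply()` and `parLapply()` next to their sequential counterparts. The two-worker cluster and the `square()` helper are assumptions made for this illustration. The helper uses no global variables, so nothing needs to be exported to the workers.

```r
library(parallel)

# Hypothetical helper; self-contained so it can be shipped to the workers as-is
square <- function(x) x^2
inputs <- 1:8

cl <- makeCluster(2)  # two workers: an arbitrary choice for this sketch

par_vec  <- parSapply(cl, inputs, square)  # parallel sapply(): returns a vector
par_list <- parLapply(cl, inputs, square)  # parallel lapply(): returns a list

stopCluster(cl)       # always release the workers when finished

# The parallel results match the sequential ones
seq_vec  <- sapply(inputs, square)
seq_list <- lapply(inputs, square)
stopifnot(isTRUE(all.equal(par_vec, seq_vec)))
stopifnot(isTRUE(all.equal(par_list, seq_list)))
```

For a task this small, the cost of starting the cluster and sending data to the workers is likely to outweigh any gain; parallel versions pay off when each call does substantial work.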

6. The examples

m <- matrix(rnorm(100000), ncol = 10)
d <- as.data.frame(m)
# (1) Time and compare a matrix and a data frame (point 2; point 3, rule 3)
library("microbenchmark")
microbenchmark(apply(m, 1, mean),
               apply(d, 1, mean),
               times = 10) # Run each expression 10 times
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max neval cld
##  apply(m, 1, mean) 37.86911 43.05239 47.54156 45.05385 48.33339 72.60366    10   a
##  apply(d, 1, mean) 44.16448 48.30704 50.05870 49.54362 50.95960 60.98107    10   a
# (2) Parallel computing (point 5), in five steps
library("parallel")                  # Step 1: load the package
copies_of_r <- detectCores() - 1     # Step 2: choose the number of workers, leaving one core free
cl <- makeCluster(copies_of_r)       # Step 3: create a cluster object
m_mean <- parApply(cl, m, 1, mean)   # Step 4: swap apply() for parApply()
stopCluster(cl)                      # Step 5: stop the cluster and release the workers