Why use Rcpp?

The power of Rcpp comes from the fact the C++ code has a fundamentally different modus operandi than R. That is, C++ is compiled whereas R is interpreted. Understanding this difference is the key to understanding why an R user might want to rewrite parts of their code in an compiled language—namely that it will most likely run much faster. Leveraging C++ code at workflow bottlenecks is a great way to speed things up!

Scripts versus compiled code

This talk is too high-level of a view to get into the gritty details, but here are the broad-brush definitions:

Scripting (Interpreted) Language

A language where the user feeds instructions to an interpreter at run-time. The interpreter translates the code into specific instructions for the processor in real-time. The user can interface with the interpreter interactively. These programs tend to be slower than compiled code.

Compiled Language

A programming language in which code must be transformed from human-readable source code to machine-readable object code in advance of running the program for the first time. These languages tend to be faster-running. However, interactive coding sessions are not usually possible.

Examples:

Scripting Compiled
R C/C++
Python Java
Perl FORTRAN
Matlab/Octave Julia
JavaScript Haskell

Note: These definitions are somewhat fluid. Examples are based on common usage of each language.

Quick sys-reqs note

R is a scripting language and C/C++ are compiled languages. This means that we can run any R code in an interpreter without any special preparation. However, any C++ code we make must be compiled first into machine-readable instructions specific to our chip architecture. To do this, we need a special tool called a compiler. Standard Windows and Mac installs don’t tend to have one handy—this is one reason why you download Windows binaries from CRAN by default.

If you are a Windows user, you will probably need Rtools to play along. Mac users need Xcode (I think). devtools will let us know if we have everything we need. From R:

# Check to make sure we have everything we need
devtools::has_devel()
## '/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
##   --quiet CMD SHLIB foo.c
## 
## [1] TRUE

If this returns FALSE, go download Rtools or Xcode for Win or Mac, respectively.

Compiling code (outside of R)

If you have a compiler, creating object code is easy (but can get complicated!). Here is an example of how we turn a C++ source code file into a machine-readable binary program. From the terminal:

# Compile C++ source file to binary program
g++ heaps.cpp -o heaps

Now we can call that program from the terminal:

# Get all possible permutations of the set {1,2,3} using Heap's algorithm
./heaps 1 2 3
##  1 2 3
##  2 1 3
##  3 1 2
##  1 3 2
##  2 3 1
##  3 2 1

NOTE: At no point did we need to invoke an interpreter, as we do with R, Python, etc.

Basic C++ syntax

C++ syntax is different from R in many ways. However, the Rcpp API helps a lot to make C++ code feel more R-like. This is not the venue to go over all the details of the C++ language, but we will do a classic ‘Hello World’ example so you can get the flavor.

The contents of our hello.cpp:

// hello.cpp: A Hello World program for C++
// The next line is a preprocessor directive, kind of like R's `library()`.
#include <cstdio>

/* This is a multi-line comment. */
int main (void) 
{
  printf("Hello world!\n");
  return 0;
}

And then from the terminal:

# Compile our program
g++ hello.cpp -o hello

# Call our program
hello
## Hello world!

Compiled code in your R scripts

C-R API

TL;DR: Don’t use it. The Rcpp API is a much easier way to write C/C++ code for R. The exception might be if you are maintaining old code that is already in this ecosystem.

The Rcpp API

Workflow

Prototyping

You can prototype C++ code very easily using Rcpp. There are two ways you should be aware of:

  • Rcpp::cppFunction()
    • Create functions in an interactive session.
    • The function takes a character string which is a C++ function definition.
    • Example below.
  • Rcpp::sourceCpp()
    • Much like base::source() but with C++ code.
    • Also can be called interactively, but expects a separate file with (almost) pure C++ source code.
    • Rstudio has some cool features here. (We will do a live demo together.)
Packaging

Once you have some solid functions, you might want to create a package to house and share them. The power of Rcpp really starts to shine when you start creating packages with it. After all, the idea of using compiled code in your R scripts is that it can help solve some speed bottlenecks. You are probably not going to optimize your code to be blazing-fast for a one-off problem. Therefore, this is code that you will probably want to recycle.

You can use Rcpp::Rcpp.package.skeleton() to create a new package skeleton with some Rcpp niceties built in. Meanwhile, you can use devtools::use_rcpp() to add those niceties to an existing package. RcppArmadillo has RcppArmadillo::RcppArmadillo.package.skeleton() as well if you are interested in using the Armadillo C++ linear algebra library.

A basic function

Rcpp::cppFunction('
void Hello_World (std::string my_input) 
{
  // This will write to the R console.
  Rcpp::Rcout << "Hello " << my_input << "!" << std::endl;
}
')

Hello_World("WMRUG attendees")
## Hello WMRUG attendees!

Another basic function

Rcpp::cppFunction('
Rcpp::NumericVector calc_hypotenuse (Rcpp::NumericVector a, Rcpp::NumericVector b) 
{
  // All variables must be "declared" before use
  Rcpp::NumericVector c;
  
  // Use `=` instead of `<-` for assignment
  c = sqrt(pow(a,2) + pow(b,2));

  // You must always explicitly return the value with the `return` keyword
  return c;
}
')

calc_hypotenuse(1:3, 1:3)
## [1] 1.414214 2.828427 4.242641

Data structures

You need to know the equivalent Rcpp C++ data structure to the R data structure that you are working with. Since all variables are declared in C++, you cannot rely on implied data types or coersion.

Syntactic sugar

As I mentioned before, Rcpp adds a lot of syntactic sugar to make C++ feel more like R. For example, Rcpp uses operator overloading to vectorize mathematical and logical operations. Many of your favorite functions are also made available through Rcpp’s C++ library.

Demonstrations

Demo 1: Fibonacci Numbers

Fibonacci’s Sequence:

\[ F_n = \begin{cases} 1 , & \text{if } n\leq 2\\ F_{n-1} + F_{n-2}, & \text{otherwise} \end{cases} \]

So the first 10 numbers of the sequence will be \(1,1,2,3,5,8,13,21,34,55\).

This is fun because it is recursive. Therefore a programmer may take a recursive, memoized, or iterative approach to solving. We can use this to compare R and C++ under different conditions.

First, let’s define a function to verify that our outputs are correct:

verify_results <- function(FUN) {
  FUN_vec <- Vectorize(FUN, "n")
  
  identical(
    as.integer(FUN_vec(1:10)),
    c(1L, 1L, 2L, 3L, 5L, 8L, 13L, 21L, 34L, 55L)
  )
}

Recursive

Now, let’s do a simple recursive function. First in R:

fibR_rec <- function(n) {
  if (n <= 2)
    return(1)
  
  sys.function()(n - 1) + sys.function()(n - 2)
}

Then in C++:

Rcpp::cppFunction("
int fibCpp_rec(const int n) {
  if (n <= 2)
    return 1;

  return fibCpp_rec(n - 1) + fibCpp_rec(n - 2);

}"
)

Now we confirm they are correct:

verify_results(fibR_rec) && verify_results(fibCpp_rec)
## [1] TRUE

Both are right. But how do they compare performance-wise?

microbenchmark::microbenchmark(
  fibR_rec(10),
  fibCpp_rec(10),
  fibR_rec(20),
  fibCpp_rec(20)
)
## Unit: nanoseconds
##            expr     min        lq       mean    median        uq      max
##    fibR_rec(10)   62158   64959.0   78211.32   66409.0   72212.5   908321
##  fibCpp_rec(10)     985    1280.5    4138.72    1736.5    7391.5    26179
##    fibR_rec(20) 8210515 8647271.0 9206183.60 9347091.0 9625065.5 10494225
##  fibCpp_rec(20)   12607   12907.0   15665.03   13360.5   19057.0    22614
##  neval
##    100
##    100
##    100
##    100

OK, so the C++ wins here. This is because there is a lot of overhead to all those recursive function calls in R. Less so in C++. You will find that for loops and recursive function calls are usually the bottlenecks in your R code. If you cannot vectorize them somehow, you might want to think about making a C++ function to speed things up.

Iterative

The iterative approach is probably a little smarter. Let’s try again in R:

fibR_it <- function(n) {
  
  a <- 0L
  b <- 1L
  
  for (i in seq_len(n)) {
    c <- a + b
    a <- b
    b <- c
  }
  
  a
}

In C++

Rcpp::cppFunction("
int fibCpp_it(const int n) 
{
int a = 0, b = 1, c = 0;

for (int i=0; i<n; i++)
{
  c = a + b;
  a = b;
  b = c;
}

return a;
}
")

And does the iterative approach provide correct answers?

verify_results(fibR_it) && verify_results(fibCpp_it)
## [1] TRUE

Now how do they compare as far as speed?

microbenchmark::microbenchmark(
  fibR_it(10),
  fibCpp_it(10),
  times = 1000
)
## Unit: nanoseconds
##           expr min  lq     mean median     uq   max neval
##    fibR_it(10) 883 951 1050.179  987.5 1046.5 30526  1000
##  fibCpp_it(10) 835 898 1003.798  950.0 1018.0  9885  1000

Hmm. Suddenly the C++ code is lagging. This is probably because assignment is highly optimized in R and there is some overhead to calling the C++ function. Let’s do it over a few more loops to see if that changes things:

microbenchmark::microbenchmark(
  fibR_it(30),
  fibCpp_it(30)
)
## Unit: nanoseconds
##           expr  min   lq    mean median   uq   max neval
##    fibR_it(30) 1924 2042 2207.11   2076 2152 11319   100
##  fibCpp_it(30)  844  936 1136.71   1019 1111  9148   100

Now we see the C++ code is catching up!

Demo 2: Linear Algebra

This second demo is a package that uses the Armadillo linear algebra library to create multivariate t distributed random matrices.

This was done as an exercise to explore performance gains for various tasks. It was also my first attempt at incorporating compiled code into an R package. Follow the link for the package vignette: mvrt.