The power of Rcpp comes from the fact the C++ code has a fundamentally different modus operandi than R. That is, C++
is compiled whereas R is interpreted. Understanding this difference is the key to understanding why an R user might want to rewrite parts of their code in an compiled language—namely that it will most likely run much faster. Leveraging C++
code at workflow bottlenecks is a great way to speed things up!
This talk is too high-level of a view to get into the gritty details, but here are the broad-brush definitions:
A language where the user feeds instructions to an interpreter at run-time. The interpreter translates the code into specific instructions for the processor in real-time. The user can interface with the interpreter interactively. These programs tend to be slower than compiled code.
A programming language in which code must be transformed from human-readable source code to machine-readable object code in advance of running the program for the first time. These languages tend to be faster-running. However, interactive coding sessions are not usually possible.
Examples:
Scripting | Compiled |
---|---|
R | C/C++ |
Python | Java |
Perl | FORTRAN |
Matlab/Octave | Julia |
JavaScript | Haskell |
Note: These definitions are somewhat fluid. Examples are based on common usage of each language.
R is a scripting language and C
/C++
are compiled languages. This means that we can run any R code in an interpreter without any special preparation. However, any C++
code we make must be compiled first into machine-readable instructions specific to our chip architecture. To do this, we need a special tool called a compiler. Standard Windows and Mac installs don’t tend to have one handy—this is one reason why you download Windows binaries from CRAN by default.
If you are a Windows user, you will probably need Rtools to play along. Mac users need Xcode (I think). devtools
will let us know if we have everything we need. From R:
# Check to make sure we have everything we need
devtools::has_devel()
## '/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore \
## --quiet CMD SHLIB foo.c
##
## [1] TRUE
If this returns FALSE
, go download Rtools or Xcode for Win or Mac, respectively.
If you have a compiler, creating object code is easy (but can get complicated!). Here is an example of how we turn a C++
source code file into a machine-readable binary program. From the terminal:
# Compile C++ source file to binary program
g++ heaps.cpp -o heaps
Now we can call that program from the terminal:
# Get all possible permutations of the set {1,2,3} using Heap's algorithm
./heaps 1 2 3
## 1 2 3
## 2 1 3
## 3 1 2
## 1 3 2
## 2 3 1
## 3 2 1
NOTE: At no point did we need to invoke an interpreter, as we do with R, Python, etc.
C++
syntaxC++
syntax is different from R in many ways. However, the Rcpp API helps a lot to make C++
code feel more R-like. This is not the venue to go over all the details of the C++
language, but we will do a classic ‘Hello World’ example so you can get the flavor.
The contents of our hello.cpp
:
// hello.cpp: A Hello World program for C++
// The next line is a preprocessor directive, kind of like R's `library()`.
#include <cstdio>
/* This is a multi-line comment. */
int main (void)
{
printf("Hello world!\n");
return 0;
}
And then from the terminal:
# Compile our program
g++ hello.cpp -o hello
# Call our program
hello
## Hello world!
TL;DR: Don’t use it. The Rcpp API is a much easier way to write C
/C++
code for R. The exception might be if you are maintaining old code that is already in this ecosystem.
You can prototype C++
code very easily using Rcpp. There are two ways you should be aware of:
Rcpp::cppFunction()
C++
function definition.Rcpp::sourceCpp()
base::source()
but with C++
code.C++
source code.Once you have some solid functions, you might want to create a package to house and share them. The power of Rcpp really starts to shine when you start creating packages with it. After all, the idea of using compiled code in your R scripts is that it can help solve some speed bottlenecks. You are probably not going to optimize your code to be blazing-fast for a one-off problem. Therefore, this is code that you will probably want to recycle.
You can use Rcpp::Rcpp.package.skeleton()
to create a new package skeleton with some Rcpp niceties built in. Meanwhile, you can use devtools::use_rcpp()
to add those niceties to an existing package. RcppArmadillo has RcppArmadillo::RcppArmadillo.package.skeleton()
as well if you are interested in using the Armadillo C++
linear algebra library.
Rcpp::cppFunction('
void Hello_World (std::string my_input)
{
// This will write to the R console.
Rcpp::Rcout << "Hello " << my_input << "!" << std::endl;
}
')
Hello_World("WMRUG attendees")
## Hello WMRUG attendees!
Rcpp::cppFunction('
Rcpp::NumericVector calc_hypotenuse (Rcpp::NumericVector a, Rcpp::NumericVector b)
{
// All variables must be "declared" before use
Rcpp::NumericVector c;
// Use `=` instead of `<-` for assignment
c = sqrt(pow(a,2) + pow(b,2));
// You must always explicitly return the value with the `return` keyword
return c;
}
')
calc_hypotenuse(1:3, 1:3)
## [1] 1.414214 2.828427 4.242641
You need to know the equivalent Rcpp C++
data structure to the R data structure that you are working with. Since all variables are declared in C++
, you cannot rely on implied data types or coersion.
As I mentioned before, Rcpp adds a lot of syntactic sugar to make C++
feel more like R. For example, Rcpp uses operator overloading to vectorize mathematical and logical operations. Many of your favorite functions are also made available through Rcpp’s C++
library.
Fibonacci’s Sequence:
\[ F_n = \begin{cases} 1 , & \text{if } n\leq 2\\ F_{n-1} + F_{n-2}, & \text{otherwise} \end{cases} \]
So the first 10 numbers of the sequence will be \(1,1,2,3,5,8,13,21,34,55\).
This is fun because it is recursive. Therefore a programmer may take a recursive, memoized, or iterative approach to solving. We can use this to compare R and C++
under different conditions.
First, let’s define a function to verify that our outputs are correct:
verify_results <- function(FUN) {
FUN_vec <- Vectorize(FUN, "n")
identical(
as.integer(FUN_vec(1:10)),
c(1L, 1L, 2L, 3L, 5L, 8L, 13L, 21L, 34L, 55L)
)
}
Now, let’s do a simple recursive function. First in R:
fibR_rec <- function(n) {
if (n <= 2)
return(1)
sys.function()(n - 1) + sys.function()(n - 2)
}
Then in C++
:
Rcpp::cppFunction("
int fibCpp_rec(const int n) {
if (n <= 2)
return 1;
return fibCpp_rec(n - 1) + fibCpp_rec(n - 2);
}"
)
Now we confirm they are correct:
verify_results(fibR_rec) && verify_results(fibCpp_rec)
## [1] TRUE
Both are right. But how do they compare performance-wise?
microbenchmark::microbenchmark(
fibR_rec(10),
fibCpp_rec(10),
fibR_rec(20),
fibCpp_rec(20)
)
## Unit: nanoseconds
## expr min lq mean median uq max
## fibR_rec(10) 62158 64959.0 78211.32 66409.0 72212.5 908321
## fibCpp_rec(10) 985 1280.5 4138.72 1736.5 7391.5 26179
## fibR_rec(20) 8210515 8647271.0 9206183.60 9347091.0 9625065.5 10494225
## fibCpp_rec(20) 12607 12907.0 15665.03 13360.5 19057.0 22614
## neval
## 100
## 100
## 100
## 100
OK, so the C++
wins here. This is because there is a lot of overhead to all those recursive function calls in R. Less so in C++
. You will find that for
loops and recursive function calls are usually the bottlenecks in your R code. If you cannot vectorize them somehow, you might want to think about making a C++
function to speed things up.
The iterative approach is probably a little smarter. Let’s try again in R:
fibR_it <- function(n) {
a <- 0L
b <- 1L
for (i in seq_len(n)) {
c <- a + b
a <- b
b <- c
}
a
}
In C++
…
Rcpp::cppFunction("
int fibCpp_it(const int n)
{
int a = 0, b = 1, c = 0;
for (int i=0; i<n; i++)
{
c = a + b;
a = b;
b = c;
}
return a;
}
")
And does the iterative approach provide correct answers?
verify_results(fibR_it) && verify_results(fibCpp_it)
## [1] TRUE
Now how do they compare as far as speed?
microbenchmark::microbenchmark(
fibR_it(10),
fibCpp_it(10),
times = 1000
)
## Unit: nanoseconds
## expr min lq mean median uq max neval
## fibR_it(10) 883 951 1050.179 987.5 1046.5 30526 1000
## fibCpp_it(10) 835 898 1003.798 950.0 1018.0 9885 1000
Hmm. Suddenly the C++
code is lagging. This is probably because assignment is highly optimized in R and there is some overhead to calling the C++
function. Let’s do it over a few more loops to see if that changes things:
microbenchmark::microbenchmark(
fibR_it(30),
fibCpp_it(30)
)
## Unit: nanoseconds
## expr min lq mean median uq max neval
## fibR_it(30) 1924 2042 2207.11 2076 2152 11319 100
## fibCpp_it(30) 844 936 1136.71 1019 1111 9148 100
Now we see the C++
code is catching up!
This second demo is a package that uses the Armadillo linear algebra library to create multivariate t distributed random matrices.
This was done as an exercise to explore performance gains for various tasks. It was also my first attempt at incorporating compiled code into an R package. Follow the link for the package vignette: mvrt
.