In this lecture, we review some popular computer languages commonly used in computational statistics. We want to understand why some languages are faster than others and what the remedies are for the slower ones.
1 Types of computer languages
Compiled languages (low-level languages): C/C++, Fortran, …
Directly compiled to machine code that is executed by the CPU
Pros: fast, memory efficient
Cons: longer development time, hard to debug
Interpreted languages (high-level languages): R, Matlab, Python, SAS IML, JavaScript, …
Typical execution of a high-level language such as R, Python, and Matlab.
To improve the efficiency of interpreted languages such as R or Matlab, the conventional wisdom is to avoid loops as much as possible, i.e., vectorize code (see the sketch after this list).
> The only loop you are allowed to have is that for an iterative algorithm.
When looping is unavoidable, we need to code in C, C++, or Fortran. This creates the notorious two-language problem.
Success stories: the popular glmnet package in R is coded in Fortran; the tidyverse and data.table packages make heavy use of Rcpp/C++.
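A minimal illustration of the vectorization advice, written with the RCall.jl interface used later in these notes (the function names loop_sum_sq and vec_sum_sq are made up for this sketch): both compute the same quantity, but the vectorized version pushes the loop down into R's compiled C code.

using RCall

R"""
x <- runif(1e6)

# looped version: the for loop itself is interpreted, hence slow
loop_sum_sq <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi^2
  s
}

# vectorized version: the loop runs inside R's compiled C code
vec_sum_sq <- function(x) sum(x^2)
"""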
High-level languages have made many efforts to close the performance gap with low-level languages such as C, C++, or Fortran, with varying levels of success.
Matlab has employed JIT (just-in-time compilation) technology since 2002.
Since R 3.4.0 (Apr 2017), the JIT bytecode compiler is enabled by default at its level 3 (a sketch of controlling this from an R session follows this list).
Cython is a compiler system based on Python.
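A minimal sketch of checking or setting R's bytecode compilation level, again via the RCall.jl interface; the toy function f is made up for illustration.

using RCall

R"""
library(compiler)
enableJIT(3)    # set the JIT level; returns the previous level (3 is the default since R 3.4.0)
f  <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
cf <- cmpfun(f) # explicitly byte-compile a function
"""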
Modern languages such as Julia try to solve the two-language problem, that is, to achieve efficiency without vectorizing code.
Julia execution.
3 Gibbs sampler example by Doug Bates
Doug Bates (member of R-Core, author of the popular R packages Matrix, lme4, RcppEigen, etc.) wrote:
As some of you may know, I have had a (rather late) mid-life crisis and run off with another language called Julia.
The task is to create a Gibbs sampler for the density
\[
f(x, y) = k x^2 \exp(-x y^2 - y^2 + 2y - 4x), \quad x > 0,
\]
using the conditional distributions
\[
\begin{aligned}
X \mid Y &\sim \Gamma \left( 3, \frac{1}{y^2 + 4} \right) \\
Y \mid X &\sim N \left( \frac{1}{1+x}, \frac{1}{2(1+x)} \right).
\end{aligned}
\]
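As a quick check, the conditionals follow from the joint density by keeping only the factors involving the variable of interest and completing the square in \(y\):
\[
\begin{aligned}
f(x \mid y) &\propto x^2 e^{-(y^2 + 4)x}
  &&\Rightarrow\quad X \mid Y = y \sim \Gamma\left(3, \tfrac{1}{y^2+4}\right), \\
f(y \mid x) &\propto e^{-(x+1)\left(y - \frac{1}{x+1}\right)^2}
  &&\Rightarrow\quad Y \mid X = x \sim N\left(\tfrac{1}{1+x}, \tfrac{1}{2(1+x)}\right).
\end{aligned}
\]
Note that the second parameter of the Gamma above is the scale \(1/(y^2+4)\); the R code below passes the rate \(y^2 + 4\) instead.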
R solution. The RCall.jl package allows us to execute R code without leaving the Julia environment. We first define an R function Rgibbs().
using RCall

# show R information
R"""
sessionInfo()
"""
RObject{VecSxp}
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.2.2
# define a function for Gibbs sampling
R"""
library(Matrix)
Rgibbs <- function(N, thin) {
  mat <- matrix(0, nrow = N, ncol = 2)
  x <- y <- 0
  for (i in 1:N) {
    for (j in 1:thin) {
      x <- rgamma(1, 3, y * y + 4) # 3rd arg is rate
      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
    }
    mat[i, ] <- c(x, y)
  }
  mat
}
"""
RObject{ClosSxp}
function (N, thin)
{
mat <- matrix(0, nrow = N, ncol = 2)
x <- y <- 0
for (i in 1:N) {
for (j in 1:thin) {
x <- rgamma(1, 3, y * y + 4)
y <- rnorm(1, 1/(x + 1), 1/sqrt(2 * (x + 1)))
}
mat[i, ] <- c(x, y)
}
mat
}
Generate a sample of size 10,000 with a thinning of 500. How long does it take?
R"""system.time(Rgibbs(10000, 500))"""
RObject{RealSxp}
user system elapsed
9.605 0.574 10.180
This is a Julia function for the same Gibbs sampler:
using Distributions

function jgibbs(N, thin)
    mat = zeros(N, 2)
    x = y = 0.0
    for i in 1:N
        for j in 1:thin
            x = rand(Gamma(3, 1 / (y * y + 4)))
            y = rand(Normal(1 / (x + 1), 1 / sqrt(2(x + 1))))
        end
        mat[i, 1] = x
        mat[i, 2] = y
    end
    mat
end
jgibbs (generic function with 1 method)
Generate the same number of samples. How long does it take?
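One way to time it is sketched below (the warm-up call excludes compilation time from the measurement; the resulting number depends on the machine and is not reproduced here):

jgibbs(100, 5)              # warm-up run, forces JIT compilation of jgibbs
@elapsed jgibbs(10000, 500) # elapsed wall-clock time in seconds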
We see a 40-80 fold speedup of Julia over R on this example, with similar coding effort!
4 Comparing C, C++, R, Python, and Julia
To better understand how these languages work, we consider a simple task: summing a vector.
Let’s first generate the data: 1 million double-precision numbers from the uniform distribution on [0, 1).
using Random

Random.seed!(257)    # seed
x = rand(1_000_000)  # 1 million random numbers in [0, 1)
sum(x)
500156.3977288406
In this class, we extensively use the package BenchmarkTools.jl for robust benchmarking. It is the analog of the microbenchmark and bench packages in R.
using BenchmarkTools
4.1 C
We use the low-level C code as the baseline for comparison. In Julia, we can easily run compiled C code using the ccall function.
using Libdl

C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file

# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):
open(`gcc -std=c99 -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)
clang: warning: argument unused during compilation: '-msse3' [-Wunused-command-line-argument]
c_sum (generic function with 1 method)
# make sure it gives the same answer
c_sum(x)
500156.39772885485
# dollar sign to interpolate array x into local scope for benchmarking
bm = @benchmark c_sum($x)
# a dictionary to store median runtimes (in milliseconds)
benchmark_result = Dict()
benchmark_result["C"] = median(bm.times) / 1e6
0.81175
4.2 R, builtin sum
Next we compare to the built-in sum function in R, which is implemented in C.
R"""library(bench)library(tidyverse)# interpolate x into R workspacey <- $xrbm_builtin <- bench::mark(sum(y)) %>% print(width = Inf)""";
# store median runtime (in milliseconds)
@rget rbm_loop   # dataframe
benchmark_result["R loop"] = median(rbm_loop[!, :median]) * 1000
7.746868010144681
4.4 R, Rcpp
The Rcpp package provides an easy way to incorporate C++ code in R.
R"""library(Rcpp)cppFunction('double rcpp_sum(NumericVector x) { int n = x.size(); double total = 0; for(int i = 0; i < n; ++i) { total += x[i]; } return total;}')rcpp_sum"""
RObject{ClosSxp}
function (x)
.Call(<pointer: 0x2c81fcd00>, x)
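Benchmarking rcpp_sum and storing its median runtime follows the same pattern (a sketch; the object name rbm_rcpp and the dictionary key "R Rcpp" are assumptions):

R"""
rbm_rcpp <- bench::mark(rcpp_sum(y)) %>% print(width = Inf)
""";

# store median runtime (in milliseconds)
@rget rbm_rcpp   # dataframe
benchmark_result["R Rcpp"] = median(rbm_rcpp[!, :median]) * 1000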
To summarize the benchmark results:
C and R builtin are at the baseline C performance (gold standard).
Python builtin and Python loop are >50 fold slower than C because the loop is interpreted.
R loop is about 10 times slower than C; this reflects the performance of the JIT bytecode generated by R's compiler package (turned on by default since R 3.4.0, Apr 2017).
Julia loop is close to C performance, because Julia code is JIT compiled.
Julia builtin, Python numpy, and Julia turbo are 4-5 fold faster than C because of SIMD.
@turbo (SIMD) and @tturbo (multithreading + SIMD), offered by the LoopVectorization.jl package, yield the top performance on this machine (a sketch follows this list).
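A minimal sketch of such a SIMD-accelerated sum, assuming LoopVectorization.jl is installed (the function name turbo_sum and the dictionary key "Julia turbo" are illustrative):

using BenchmarkTools, LoopVectorization

function turbo_sum(x::Vector{Float64})
    s = 0.0
    @turbo for i in eachindex(x)  # swap @turbo for @tturbo to also use multiple threads
        s += x[i]
    end
    s
end

bm_turbo = @benchmark turbo_sum($x)
benchmark_result["Julia turbo"] = median(bm_turbo.times) / 1e6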
5 Take home message for computational scientists
High-level language (R, Python, Matlab) programmers should be familiar with existing high-performance packages. Don't reinvent the wheel.
R: RcppEigen, tidyverse, …
Python: numpy, scipy, …
In most research projects, looping is unavoidable; then we need to use a low-level language.
R: Rcpp, …
Python: Cython, …
In this course, we use Julia, which circumvents the two-language problem, so we can spend more time on algorithms rather than on juggling \(\ge 2\) languages.