Computer Languages (R, Python, Julia, C/C++)

System information (for reproducibility)

In [1]:
versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4
In [2]:
# load packages
using BenchmarkTools, Distributions, Libdl, PyCall, Random, RCall, Statistics

Types of computer languages

  • Compiled languages (low-level languages): C/C++, Fortran, ...

    • Directly compiled to machine code that is executed by the CPU
    • Pros: fast, memory efficient
    • Cons: longer development time, hard to debug
  • Interpreted languages (high-level languages): R, Matlab, Python, SAS IML, JavaScript, ...

    • Interpreted by an interpreter
    • Pros: fast prototyping
    • Cons: excruciatingly slow for loops
  • Mixed (dynamic) languages: Matlab (JIT), R (compiler package), Julia, Cython, Java, ...

    • Pros and cons: between the compiled and interpreted languages
  • Script languages: Linux shell scripts, Perl, ...

    • Extremely useful for some data preprocessing and manipulation
  • Database languages: SQL, Hive (Hadoop).

    • Data analysis cannot happen if we do not know how to retrieve data from databases.

    203B (Introduction to Data Science) covers some basics of scripting and databases.

How do high-level languages work?

Comics: The Life of a Bytecode Language

  • Typical execution of a high-level language such as R, Python, and Matlab.

  • To improve the efficiency of interpreted languages such as R or Matlab, the conventional wisdom is to avoid loops as much as possible, i.e., to vectorize code (see the short R illustration after this list).

    The only loop you are allowed to have is the one driving an iterative algorithm.

  • When looping is unavoidable, we need to code the loops in C, C++, or Fortran. This creates the notorious two-language problem.

    Success stories: the popular glmnet package in R is coded in Fortran; the tidyverse and data.table packages make heavy use of Rcpp/C++.

  • High-level languages have made many efforts to bring themselves closer to the performance of low-level languages such as C, C++, or Fortran, with varying levels of success.

    • Matlab has employed JIT (just-in-time compilation) technology since 2002.
    • Since R 3.4.0 (Apr 2017), the JIT bytecode compiler is enabled by default at its level 3.
    • Cython is a compiler system based on Python.
  • Modern languages such as Julia try to solve the two-language problem, that is, to achieve efficiency without vectorizing code.

  • Julia execution.
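
As a quick illustration of the vectorization point above, here is a minimal sketch (using RCall, loaded earlier in this notebook): in R, an explicit loop is interpreted element by element, while the equivalent vectorized expression runs a single compiled C routine and is typically an order of magnitude or more faster.

R"""
v <- runif(1e6)
scale_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- 2 * v[i]  # interpreted, one iteration at a time
  out
}
print(system.time(scale_loop(v)))  # slow: loop body is interpreted
print(system.time(2 * v))          # fast: the loop runs inside compiled C code
"""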

Gibbs sampler example by Doug Bates

  • Doug Bates (member of R-Core, author of the popular R packages Matrix, lme4, RcppEigen, etc.)

    As some of you may know, I have had a (rather late) mid-life crisis and run off with another language called Julia.

    -- Doug Bates (on the `knitr` Google Group)

  • An example from Dr. Doug Bates's slides Julia for R Programmers.

  • The task is to create a Gibbs sampler for the density
    $$ f(x, y) = k x^2 \exp(- x y^2 - y^2 + 2y - 4x), \quad x > 0, $$ using the conditional distributions $$ \begin{eqnarray*} X \mid Y &\sim& \Gamma \left( 3, \frac{1}{y^2 + 4} \right) \\ Y \mid X &\sim& N \left(\frac{1}{1+x}, \frac{1}{2(1+x)} \right). \end{eqnarray*} $$
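
    To see where these conditionals come from, hold one variable fixed in the joint density: $$ \begin{eqnarray*} f(x \mid y) &\propto& x^2 \exp\{-(y^2 + 4) x\}, \quad x > 0, \\ f(y \mid x) &\propto& \exp\{-(x + 1) y^2 + 2 y\} \propto \exp\left\{-(x + 1)\left(y - \frac{1}{x + 1}\right)^2\right\}. \end{eqnarray*} $$ The first line is the kernel of a Gamma density with shape $3$ and scale $1/(y^2 + 4)$; the second, after completing the square in $y$, is the kernel of a normal density with mean $1/(1 + x)$ and variance $1/(2(1 + x))$.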

  • R solution. The RCall.jl package allows us to execute R code without leaving the Julia environment. We first define an R function Rgibbs().
In [3]:
# show R information
R"""
sessionInfo()
"""
Out[3]:
RObject{VecSxp}
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Applications/Julia-1.7.app/Contents/Resources/julia/lib/julia/libblastrampoline.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.1.2
In [4]:
# define a function for Gibbs sampling

R"""
library(Matrix)

Rgibbs <- function(N, thin) {
  mat <- matrix(0, nrow=N, ncol=2)
  x <- y <- 0
  for (i in 1:N) {
    for (j in 1:thin) {
      x <- rgamma(1, 3, y * y + 4) # 3rd arg is rate
      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
    }
    mat[i,] <- c(x, y)
  }
  mat
}
"""
Out[4]:
RObject{ClosSxp}
function (N, thin) 
{
    mat <- matrix(0, nrow = N, ncol = 2)
    x <- y <- 0
    for (i in 1:N) {
        for (j in 1:thin) {
            x <- rgamma(1, 3, y * y + 4)
            y <- rnorm(1, 1/(x + 1), 1/sqrt(2 * (x + 1)))
        }
        mat[i, ] <- c(x, y)
    }
    mat
}

Generate a sample of size 10,000 with a thinning of 500. How long does it take?

In [5]:
R"""
system.time(Rgibbs(10000, 500))
"""
Out[5]:
RObject{RealSxp}
   user  system elapsed 
 22.384   1.678  24.168 
  • This is a Julia function for the same Gibbs sampler:
In [6]:
function jgibbs(N, thin)
    mat = zeros(N, 2)
    x = y = 0.0
    for i in 1:N
        for j in 1:thin
            x = rand(Gamma(3, 1 / (y * y + 4)))
            y = rand(Normal(1 / (x + 1), 1 / sqrt(2(x + 1))))
        end
        mat[i, 1] = x
        mat[i, 2] = y
    end
    mat
end
Out[6]:
jgibbs (generic function with 1 method)

Generate the same number of samples. How long does it take?

In [7]:
jgibbs(100, 5); # warm-up
@elapsed jgibbs(10000, 500)
Out[7]:
0.278172357

We see roughly an 80-fold speedup of Julia over R on this example (about 24.2 s versus 0.28 s), with similar coding effort!

Comparing C, R, Python, and Julia

To better understand how these languages work, we consider a simple task: summing a vector.

Let's first generate the data: 1 million double-precision numbers from the uniform distribution on [0, 1).

In [8]:
Random.seed!(123) # seed
x = rand(1_000_000) # 1 million random numbers in [0, 1)
sum(x)
Out[8]:
500220.6882968192

In this class, we use the BenchmarkTools.jl package for robust benchmarking. It is the analog of the microbenchmark and bench packages in R.
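
For quick, informal timings, BenchmarkTools also provides the @btime macro, which prints the minimum run time and memory use of an expression; a one-line sketch:

@btime sum($x)  # the $ interpolates x so we time the call itself, not global-variable lookup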

C

We use low-level C code as the baseline for comparison. In Julia, we can easily run compiled C code using the ccall function.

In [9]:
using Libdl

C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file

# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

open(`gcc -std=c99 -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)
Out[9]:
c_sum (generic function with 1 method)
In [10]:
# make sure it gives same answer
c_sum(x)
Out[10]:
500220.6882968189
In [11]:
# dollar sign to interpolate array x into local scope for benchmarking
bm = @benchmark c_sum($x)
Out[11]:
BenchmarkTools.Trial: 4121 samples with 1 evaluation.
 Range (min … max):  1.128 ms … 1.727 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.183 ms               GC (median):    0.00%
 Time  (mean ± σ):   1.208 ms ± 64.200 μs   GC (mean ± σ):  0.00% ± 0.00%

       ▂█▆▃                                                   
  ▃▃▂▂▂████▄▅▆▅▆▅▅▄▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  1.13 ms        Histogram: frequency by time        1.48 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [12]:
benchmark_result = Dict() # a dictionary to store median runtime (in milliseconds)

# store median runtime (in milliseconds)
benchmark_result["C"] = median(bm.times) / 1e6
Out[12]:
1.183365

R, builtin sum

Next we compare to the built-in sum function in R, which is implemented in C.

In [13]:
R"""
library(bench)
y <- $x
rbm_builtin <- bench::mark(sum(y))
"""
Out[13]:
RObject{VecSxp}
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 sum(y)        2.3ms 2.47ms      405.        0B        0   203     0      502ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
In [14]:
# store median runtime (in milliseconds)
@rget rbm_builtin # dataframe
benchmark_result["R builtin"] = median(rbm_builtin[!, :median]) * 1000
Out[14]:
2.4676880001379686

R, handwritten loop

A handwritten loop in R is much slower.

In [15]:
R"""
sum_r <- function(x) {
  s <- 0
  for (xi in x) {
    s <- s + xi
  }
  s
}
library(bench)
y <- $x
rbm_loop <- bench::mark(sum_r(y))
"""
Out[15]:
RObject{VecSxp}
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 sum_r(y)     12.8ms   13ms      76.5    12.7KB        0    39     0      510ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
In [16]:
# store median runtime (in milliseconds)
@rget rbm_loop # dataframe
benchmark_result["R loop"] = median(rbm_loop[!, :median]) * 1000
Out[16]:
13.009848000137936

R, Rcpp

The Rcpp package is the easiest way to incorporate C++ code in R.

In [17]:
R"""
library(Rcpp)

cppFunction('double rcpp_sum(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')
rcpp_sum
"""
Out[17]:
RObject{ClosSxp}
function (x) 
.Call(<pointer: 0x11e6d0980>, x)
In [18]:
R"""
rbm_rcpp <- bench::mark(rcpp_sum(y))
"""
Out[18]:
RObject{VecSxp}
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 rcpp_sum(y)  1.16ms 1.21ms      808.    2.49KB        0   405     0      501ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
In [19]:
# store median runtime (in milliseconds)
@rget rbm_rcpp # dataframe
benchmark_result["R Rcpp"] = median(rbm_rcpp[!, :median]) * 1000
Out[19]:
1.2130129999832207

Python, builtin sum

The built-in sum function in Python.

In [20]:
PyCall.pyversion
Out[20]:
v"3.7.6"
In [21]:
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")
bm = @benchmark $pysum($x)
Out[21]:
BenchmarkTools.Trial: 67 samples with 1 evaluation.
 Range (min … max):  73.561 ms … 84.218 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     75.373 ms               GC (median):    0.00%
 Time  (mean ± σ):   75.705 ms ±  1.583 ms   GC (mean ± σ):  0.00% ± 0.00%

              ▃ █▁█▁▃                                         
  ▄▄▄▇▁▄▄▁▇▄▄▄█▇█████▇▇▁▁▁▁▄▄▄▁▇▇▄▄▄▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄ ▁
  73.6 ms         Histogram: frequency by time        79.8 ms <

 Memory estimate: 240 bytes, allocs estimate: 6.
In [22]:
# store median runtime (in milliseconds)
benchmark_result["Python builtin"] = median(bm.times) / 1e6
Out[22]:
75.373378

Python, handwritten loop

In [23]:
py"""
def py_sum(A):
    s = 0.0
    for a in A:
        s += a
    return s
"""

sum_py = py"py_sum"

bm = @benchmark $sum_py($x)
Out[23]:
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):   93.824 ms … 170.494 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):      98.450 ms                GC (median):    0.00%
 Time  (mean ± σ):   101.546 ms ±  11.700 ms   GC (mean ± σ):  0.00% ± 0.00%

   ▄██                                                         
  ▃███▅▆▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  93.8 ms          Histogram: frequency by time          170 ms <

 Memory estimate: 240 bytes, allocs estimate: 6.
In [24]:
# store median runtime (in milliseconds)
benchmark_result["Python loop"] = median(bm.times) / 1e6
Out[24]:
98.449899

Python, numpy

Numpy is the high-performance scientific computing library for Python.

In [25]:
# bring in sum function from Numpy 
numpy_sum = pyimport("numpy")."sum"
Out[25]:
PyObject <function sum at 0x7f9b3a9fdb90>
In [26]:
bm = @benchmark $numpy_sum($x)
Out[26]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  390.693 μs … 4.333 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     459.754 μs               GC (median):    0.00%
 Time  (mean ± σ):   481.130 μs ± 94.829 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▆█                                                          
  ▁▃███▇▆▅▅▅▄▅▆▆▅▅▄▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  391 μs          Histogram: frequency by time          797 μs <

 Memory estimate: 240 bytes, allocs estimate: 6.
In [27]:
# store median runtime (in milliseconds)
benchmark_result["Python numpy"] = median(bm.times) / 1e6
Out[27]:
0.459754

Numpy performance is on par with the Julia built-in sum function. Both are about 3 times faster than the C loop above, likely because they use SIMD instructions.

Julia, builtin sum

The @time, @elapsed, and @allocated macros in Julia report run times and memory allocation.

In [28]:
@time sum(x) # no compilation time after first run
  0.000450 seconds (1 allocation: 16 bytes)
Out[28]:
500220.6882968192
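
@elapsed and @allocated are similar to @time but return the measurement as a value instead of printing it, which is convenient for storing results programmatically. A brief sketch:

t = @elapsed sum(x)    # run time in seconds, returned as a Float64
b = @allocated sum(x)  # bytes allocated by the call, returned as an integer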

For more robust benchmarking, we use the BenchmarkTools.jl package.

In [29]:
bm = @benchmark sum($x)
Out[29]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  270.633 μs … 836.460 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     321.719 μs                GC (median):    0.00%
 Time  (mean ± σ):   339.574 μs ±  50.124 μs   GC (mean ± σ):  0.00% ± 0.00%

      ▁▃▆█▇▇█▇▅▂                                                 
  ▁▂▄▆██████████▆▇▇▆▆▅▅▅▅▅▄▅▅▅▅▅▅▅▄▅▄▄▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▄
  271 μs           Histogram: frequency by time          498 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [30]:
benchmark_result["Julia builtin"] = median(bm.times) / 1e6
Out[30]:
0.3217185

Julia, handwritten loop

Let's also write a simple loop and benchmark it.

In [31]:
function jl_sum(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end

bm = @benchmark jl_sum($x)
Out[31]:
BenchmarkTools.Trial: 3971 samples with 1 evaluation.
 Range (min … max):  1.165 ms … 2.392 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.231 ms               GC (median):    0.00%
 Time  (mean ± σ):   1.254 ms ± 85.659 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▄▂  ▃▅█▇▂▂▁                                              
  ▅██▇████████▆▆▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▂▃▃▂▃▃▂▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  1.17 ms        Histogram: frequency by time        1.58 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [32]:
benchmark_result["Julia loop"] = median(bm.times) / 1e6
Out[32]:
1.231345

Exercise: annotate the loop with @simd and benchmark again.
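
One possible sketch of the annotated loop (whether @simd helps depends on the CPU and on the compiler being allowed to reorder the floating-point additions):

function jl_sum_simd(A)
    s = zero(eltype(A))
    @simd for i in eachindex(A)  # @simd permits reassociation, enabling vectorized summation
        @inbounds s += A[i]
    end
    s
end

# bm = @benchmark jl_sum_simd($x)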

Summary

In [33]:
sort(collect(benchmark_result), by = x -> x[2])
Out[33]:
9-element Vector{Pair{Any, Any}}:
  "Julia builtin" => 0.3217185
   "Python numpy" => 0.459754
              "C" => 1.183365
         "R Rcpp" => 1.2130129999832207
     "Julia loop" => 1.231345
      "R builtin" => 2.4676880001379686
         "R loop" => 13.009848000137936
 "Python builtin" => 75.373378
    "Python loop" => 98.449899
  • C, R Rcpp, and the R builtin sum are all compiled C/C++ code and represent the baseline performance (gold standard).

  • Python builtin and Python loop are roughly 60-80 fold slower than C on this example because the loop is interpreted.

  • R loop is about 10 fold slower than C here; this reflects the performance of the JIT bytecode generated by its compiler package (turned on by default since R 3.4.0, Apr 2017).

  • Julia loop is close to C performance, because Julia code is JIT compiled.

  • Julia builtin and Python numpy are roughly 3 fold faster than the C loop because of SIMD.

Take home message for computational scientists

  • High-level language (R, Python, Matlab) programmers should be familiar with existing high-performance packages. Don't reinvent the wheel.

    • R: RcppEigen, tidyverse, ...
    • Python: numpy, scipy, ...
  • In most research projects, looping is unavoidable. Then we need to reach for a low-level language.

    • R: Rcpp, ...
    • Python: Cython, ...
  • In this course, we use Julia, which seems to circumvent the two-language problem, so we can spend more time on algorithms and less on juggling $\ge 2$ languages.