Computer Languages (R, Python, Julia, C/C++)

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 4, 2024

1 Introduction

System information (for reproducibility):

versioninfo()

Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()

  Activating project at `~/Documents/github.com/ucla-biostat-257/2024spring/slides/02-langs`

Status `~/Documents/github.com/ucla-biostat-257/2024spring/slides/02-langs/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [31c24e10] Distributions v0.25.107
  [bdcacae8] LoopVectorization v0.12.169
  [438e738f] PyCall v1.96.4
  [6f49c342] RCall v0.14.1
  [8f399da3] Libdl
  [9a3f8284] Random
  [10745b16] Statistics v1.10.0

In this lecture, we review some popular computer languages commonly used in computational statistics. We want to understand why some languages are faster than others and what are the remedies for the slow languages.

1.1 Poll: Multiple Answer

Which languages have you programmed before?
A. C or C++
B. Fortran
C. Julia
D. Matlab
E. Python

2 Types of computer languages

Compiled languages (low-level languages): C/C++, Fortran, …
- Directly compiled to machine code that is executed by CPU
- Pros: fast, memory efficient
- Cons: longer development time, hard to debug
Interpreted language (high-level languages): R, Matlab, Python, SAS IML, JavaScript, …
- Interpreted by interpreter
- Pros: fast prototyping
- Cons: excruciatingly slow for loops
Mixed (dynamic) languages: Matlab (JIT), R (compiler package), Julia, Cython, JAVA, …
- Pros and cons: between the compiled and interpreted languages
Script languages: Linux shell scripts, Perl, …
- Extremely useful for some data preprocessing and manipulation
Database languages: SQL, Hive (Hadoop).
- Data analysis never happens if we do not know how to retrieve data from databases
203B (Introduction to Data Science) covers some basics on scripting and database.

3 How high-level languages work?

Comics: The Life of a Bytecode Language

Typical execution of a high-level language such as R, Python, and Matlab.

To improve efficiency of interpreted languages such as R or Matlab, conventional wisdom is to avoid loops as much as possible. Aka, vectorize code > The only loop you are allowed to have is that for an iterative algorithm.
When looping is unavoidable, need to code in C, C++, or Fortran. This creates the notorious two language problem.
Success stories: the popular glmnet package in R is coded in Fortran; tidyverse and data.table packages use a lot RCpp/C++.

High-level languages have made many efforts to bring themselves closer to the performance of low-level languages such as C, C++, or Fortran, with a variety levels of success.
- Matlab has employed JIT (just-in-time compilation) technology since 2002.
- Since R 3.4.0 (Apr 2017), the JIT bytecode compiler is enabled by default at its level 3.
- Cython is a compiler system based on Python.
Modern languages such as Julia tries to solve the two language problem. That is to achieve efficiency without vectorizing code.
Julia execution.

4 Gibbs sampler example by Doug Bates

Doug Bates (member of R-Core, author of popular R packages Matrix, lme4, RcppEigen, etc)

As some of you may know, I have had a (rather late) mid-life crisis and run off with another language called Julia.

– Doug Bates (on the knitr Google Group)
An example from Dr. Doug Bates’s slides Julia for R Programmers.
The task is to create a Gibbs sampler for the density
\[ f(x, y) = k x^2 \exp(- x y^2 - y^2 + 2y - 4x), \quad x > 0 \] using the conditional distributions \[ \begin{eqnarray*} X | Y &\sim& \Gamma \left( 3, \frac{1}{y^2 + 4} \right) \\ Y | X &\sim& N \left(\frac{1}{1+x}, \frac{1}{2(1+x)} \right). \end{eqnarray*} \]

4.1 Poll: Multiple choice

Which language will be faster for this Gibbs sampler: R or Julia?
A. R is much faster than Julia
B. R is moderately faster than Julia
C. R is about same speed as Julia
D. R is moderately slower than Julia
E. R is much slower than Julia

4.2 R solution

The RCall.jl package allows us to execute R code without leaving the Julia environment. We first define an R function Rgibbs().

using RCall

# show R information
R"""
sessionInfo()
"""

RObject{VecSxp}
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] C

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.3.2

# define a function for Gibbs sampling

R"""
library(Matrix)

Rgibbs <- function(N, thin) {
  mat <- matrix(0, nrow=N, ncol=2)
  x <- y <- 0
  for (i in 1:N) {
    for (j in 1:thin) {
      x <- rgamma(1, 3, y * y + 4) # 3rd arg is rate
      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
    }
    mat[i,] <- c(x, y)
  }
  mat
}
"""

RObject{ClosSxp}
function (N, thin) 
{
    mat <- matrix(0, nrow = N, ncol = 2)
    x <- y <- 0
    for (i in 1:N) {
        for (j in 1:thin) {
            x <- rgamma(1, 3, y * y + 4)
            y <- rnorm(1, 1/(x + 1), 1/sqrt(2 * (x + 1)))
        }
        mat[i, ] <- c(x, y)
    }
    mat
}

To generate a sample of size 10,000 with a thinning of 500. How long does it take?

R"""
system.time(Rgibbs(10000, 500))
"""

RObject{RealSxp}
   user  system elapsed 
 11.362   0.772  12.240

4.3 Julia solution

This is a Julia function for the same Gibbs sampler:

using Distributions

function jgibbs(N, thin)
    mat = zeros(N, 2)
    x = y = 0.0
    for i in 1:N
        for j in 1:thin
            x = rand(Gamma(3, 1 / (y * y + 4)))
            y = rand(Normal(1 / (x + 1), 1 / sqrt(2(x + 1))))
        end
        mat[i, 1] = x
        mat[i, 2] = y
    end
    mat
end

jgibbs (generic function with 1 method)

Generate the same number of samples. How long does it take?

jgibbs(100, 5); # warm-up
@elapsed jgibbs(10000, 500)

0.119380334

We see 40-80 fold speed up of Julia over R on this example, with similar coding effort!

5 Comparing C, C++, R, Python, and Julia

To better understand how these languages work, we consider a simple task: summing a vector.

Let’s first generate data: 1 million double precision numbers from uniform [0, 1].

using Random

Random.seed!(257) # seed
x = rand(1_000_000) # 1 million random numbers in [0, 1)
sum(x)

500156.3977288406

In this class, we extensively use package BenchmarkTools.jl for robust benchmarking. It’s the analog of the microbenchmark or bench package in R.

using BenchmarkTools

5.1 C

We would use the low-level C code as the baseline for copmarison. In Julia, we can easily run compiled C code using the ccall function.

using Libdl

C_code = """
#include <stddef.h>

double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file

# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):
open(`gcc -std=c99 -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X :: Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

clang: warning: argument unused during compilation: '-msse3' [-Wunused-command-line-argument]

c_sum (generic function with 1 method)

# make sure it gives same answer
c_sum(x)

500156.39772885485

# dollar sign to interpolate array x into local scope for benchmarking
bm = @benchmark c_sum($x)

BenchmarkTools.Trial: 5650 samples with 1 evaluation.
 Range (min … max):  811.625 μs …  1.107 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     858.041 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   880.589 μs ± 43.926 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
             █▇                                                 
  ▂▂▂▂▂▂▂▁▁▁▁██▅▃▃▃▂▂▂▂▂▂▂▂▂▂▃▂▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  812 μs          Histogram: frequency by time         1.01 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.

# a dictionary to store median runtime (in milliseconds)
benchmark_result = Dict() 

# store median runtime (in milliseconds)
benchmark_result["C"] = median(bm.times) / 1e6

0.858041

5.2 R, buildin sum

Next we compare to the build in sum function in R, which is implemented using C.

R"""
library(bench)
library(tidyverse)

# interpolate x into R workspace
y <- $x
rbm_builtin <- bench::mark(sum(y)) %>%
  print(width = Inf)
""";

┌ Warning: RCall.jl: -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
│ v dplyr     1.1.3     v readr     2.1.4
│ v forcats   1.0.0     v stringr   1.5.0
│ v ggplot2   3.4.4     v tibble    3.2.1
│ v lubridate 1.9.2     v tidyr     1.3.0
│ v purrr     1.0.2     
│ -- Conflicts ------------------------------------------ tidyverse_conflicts() --
│ x tidyr::expand() masks Matrix::expand()
│ x dplyr::filter() masks stats::filter()
│ x dplyr::lag()    masks stats::lag()
│ x tidyr::pack()   masks Matrix::pack()
│ x tidyr::unpack() masks Matrix::unpack()
│ i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
└ @ RCall /Users/huazhou/.julia/packages/RCall/dDAVd/src/io.jl:172

# A tibble: 1 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 sum(y)       1.35ms   1.41ms      702.        0B        0   352     0
  total_time result    memory             time             gc                
    <bch:tm> <list>    <list>             <list>           <list>            
1      501ms <dbl [1]> <Rprofmem [0 x 3]> <bench_tm [352]> <tibble [352 x 3]>

# store median runtime (in milliseconds)
@rget rbm_builtin # dataframe
benchmark_result["R builtin"] = median(rbm_builtin[!, :median]) * 1000

1.4116505335550755

5.3 R, handwritten loop

Handwritten loop in R is much slower.

R"""
sum_r <- function(x) {
  s <- 0
  for (xi in x) {
    s <- s + xi
  }
  s
}
library(bench)
y <- $x
rbm_loop <- bench::mark(sum_r(y)) %>%
  print(width = Inf)
""";

# A tibble: 1 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 sum_r(y)      7.4ms   7.73ms      130.    13.1KB        0    65     0
  total_time result    memory              time            gc               
    <bch:tm> <list>    <list>              <list>          <list>           
1      501ms <dbl [1]> <Rprofmem [42 x 3]> <bench_tm [65]> <tibble [65 x 3]>

# store median runtime (in milliseconds)
@rget rbm_loop # dataframe
benchmark_result["R loop"] = median(rbm_loop[!, :median]) * 1000

7.727639022050425

5.4 R, Rcpp

Rcpp package provides an easy way to incorporate C++ code in R.

R"""
library(Rcpp)

cppFunction('double rcpp_sum(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')

rcpp_sum
"""

RObject{ClosSxp}
function (x) 
.Call(<pointer: 0x3218908c0>, x)

R"""
rbm_rcpp <- bench::mark(rcpp_sum(y)) %>%
  print(width = Inf)
""";

# A tibble: 1 x 13
  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 rcpp_sum(y)    800us    855us     1159.    2.49KB        0   580     0
  total_time result    memory             time             gc                
    <bch:tm> <list>    <list>             <list>           <list>            
1      500ms <dbl [1]> <Rprofmem [1 x 3]> <bench_tm [580]> <tibble [580 x 3]>

# store median runtime (in milliseconds)
@rget rbm_rcpp # dataframe
benchmark_result["R Rcpp"] = median(rbm_rcpp[!, :median]) * 1000

0.8551369828637689

5.5 Python, builtin sum

Built in function sum in Python.

using PyCall

PyCall.pyversion

v"3.10.12"

# get the Python built-in "sum" function:
pysum = pybuiltin("sum")
bm = @benchmark $pysum($x)

BenchmarkTools.Trial: 125 samples with 1 evaluation.
 Range (min … max):  38.618 ms …  41.136 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.884 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   39.921 ms ± 462.277 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
                           ▆█▃▂▂   ▅▆  ▃                        
  ▄▁▁▁▄▁▁▁▄▁▅▁▁▄▁▄▇▄▁▄▄▇▇▇▇█████▅▇████▄█▇▄▄▇▄▄▇▅▁▇█▄▁▁▁▇▁▄▄▅▁▄ ▄
  38.6 ms         Histogram: frequency by time           41 ms <
 Memory estimate: 240 bytes, allocs estimate: 6.

# store median runtime (in miliseconds)
benchmark_result["Python builtin"] = median(bm.times) / 1e6

39.884417

5.6 Python, handwritten loop

py"""
def py_sum(A):
    s = 0.0
    for a in A:
        s += a
    return s
"""

sum_py = py"py_sum"

bm = @benchmark $sum_py($x)

BenchmarkTools.Trial: 103 samples with 1 evaluation.
 Range (min … max):  46.605 ms …  50.299 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.906 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.963 ms ± 494.630 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                    ▅▆▂ █▅▂▂ ▂  ▂▂              
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▄▅█▇▇███▅████▇██▇██▄▁▅▅▄▁▄▄▅▄▁▅ ▄
  46.6 ms         Histogram: frequency by time         50.1 ms <
 Memory estimate: 240 bytes, allocs estimate: 6.

# store median runtime (in miliseconds)
benchmark_result["Python loop"] = median(bm.times) / 1e6

48.90625

5.7 Python, numpy

Numpy is the high-performance scientific computing library for Python.

# bring in sum function from Numpy 
numpy_sum = pyimport("numpy")."sum"

PyObject <function sum at 0x324f7f630>

bm = @benchmark $numpy_sum($x)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  160.708 μs … 299.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     176.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   177.161 μs ±   3.582 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
                              ▂█▇▅▂   ▃▄▂                        
  ▂▂▁▁▁▂▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▂▂▂▂▃█████▅▄▇███▆▄▃▃▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  161 μs           Histogram: frequency by time          189 μs <
 Memory estimate: 240 bytes, allocs estimate: 6.

# store median runtime (in miliseconds)
benchmark_result["Python numpy"] = median(bm.times) / 1e6

0.17625

Numpy performance is on a par with Julia build-in sum function. Both are about 3 times faster than C, possibly because of more cache-friendly recursive mapreduce implementation.

5.8 Julia, builtin sum

@time, @elapsed, @allocated macros in Julia report run times and memory allocation.

@time sum(x) # no compilation time after first run

  0.000181 seconds (1 allocation: 16 bytes)

500156.3977288406

For more robust benchmarking, we use BenchmarkTools.jl package.

bm = @benchmark sum($x)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  159.709 μs … 255.375 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     173.458 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   174.443 μs ±   5.856 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  ▄     ▁      ▃▂         █▂  ▆▂▅   ▃▂  ▁▂                      ▁
  ██▅▃▄▃█▆▅▄▅▄▄██▆▅▆▅▇█▇▆▇██▇███████████████▇▇▇▇▆▆▇▇▅▇▇▆▆▆▆▆▅▄▅ █
  160 μs        Histogram: log(frequency) by time        194 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

benchmark_result["Julia builtin"] = median(bm.times) / 1e6

0.173458

5.9 Julia, handwritten loop

Let’s also write a loop and benchmark.

function jl_sum(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end

bm = @benchmark jl_sum($x)

BenchmarkTools.Trial: 6032 samples with 1 evaluation.
 Range (min … max):  811.583 μs …  1.013 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     815.167 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   828.116 μs ± 29.946 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  ██▄▄▂▂▁▁ ▁▁▁    ▂ ▂                                          ▁
  ███████████████▇██████████▇█▇▇█▇▇▇▇▇▇▇▆▇▇▆▆▇▆▅▅▅▆▆▆▆▆▆▅▅▅▄▆▅ █
  812 μs        Histogram: log(frequency) by time       949 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

benchmark_result["Julia loop"] = median(bm.times) / 1e6

0.815167

Exercise: annotate the loop by @simd or LoopVectorization.@turbo and benchmark again.

using LoopVectorization

function jl_sum_turbo(A)
    s = zero(eltype(A))
    @turbo for i in eachindex(A)
      s += A[i]
    end
    s
end

bm = @benchmark jl_sum_turbo($x)
bm

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  89.291 μs … 139.209 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   91.962 μs ±   3.997 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  █▃   ▂       ▄▃▃▃▃▂          ▁▁▁▁ ▁▁▁▁                       ▁
  ██▇▆▅█▆▅▅▄▄▅▇████████▇█▇▆▆▇███████████▇▇▆▆▆▇▆▇▇▇▇▇▇▇▆▅▅▆▅▅▅▅ █
  89.3 μs       Histogram: log(frequency) by time       105 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

benchmark_result["Julia turbo"] = median(bm.times) / 1e6

0.0895

@tturbo macro turns on multi-threading. Note for this to function, Julia needs to be started with multi-threading.

function jl_sum_tturbo(A)
    s = zero(eltype(A))
    @tturbo for i in eachindex(A)
      s += A[i]
    end
    s
end

bm = @benchmark jl_sum_tturbo($x)
bm

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  44.041 μs … 85.042 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.333 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.388 μs ±  2.770 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  ▇█▄             ▂▁▂▁▁▁▁                                     ▁
  ████▆▇████▇▆▆▅▇█████████▇▇▆▆▅▆▆▅▆▆▇█▇▇▆▆▇▇▇▇▇▇▆▆▅▃▅▅▄▅▄▅▅▆▅ █
  44 μs        Histogram: log(frequency) by time      57.2 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

benchmark_result["Julia tturbo"] = median(bm.times) / 1e6

0.044333

5.10 Summary

sort(collect(benchmark_result), by = x -> x[2])

11-element Vector{Pair{Any, Any}}:
   "Julia tturbo" => 0.044333
    "Julia turbo" => 0.0895
  "Julia builtin" => 0.173458
   "Python numpy" => 0.17625
     "Julia loop" => 0.815167
         "R Rcpp" => 0.8551369828637689
              "C" => 0.858041
      "R builtin" => 1.4116505335550755
         "R loop" => 7.727639022050425
 "Python builtin" => 39.884417
    "Python loop" => 48.90625

C, R builtin, and RCpp are the baseline C or C++ performance (gold standard).
Python builtin and Python loop are >50 fold slower than C because the loop is interpreted.
R loop is about 10 times slower than C and indicates the performance of JIT bytecode generated by its compiler package (turned on by default since R v3.4.0 (Apr 2017)).
Julia loop is close to C performance, because Julia code is JIT compiled.
Julia builtin and Python numpy are 4-5 fold faster than C because of more cache-friendly algorithm (recursive mapreduce).
@turbo (SIMD) and @tturbo (multithreading+SIMD), offered by the LoopVectorization.jl package. yields the top performance on this machine.

6 Take home message for computational scientists

High-level language (R, Python, Matlab) programmers should be familiar with existing high-performance packages. Don’t reinvent wheels.
- R: RcppEigen, tidyverse, …
- Python: numpy, scipy, …
In most research projects, looping is unavoidable. Then we need to use a low-level language.
- R: Rcpp, …
- Python: Cython, …
In this course, we use Julia, which circumvents the two language problem. So we can spend more time on algorithms, not on juggling \(\ge 2\) languages.