Course Introduction

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 6, 2023

System information (for reproducibility):

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

Load packages:

using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/01-intro`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/01-intro/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [6f49c342] RCall v0.13.14
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random

1 Basic information

  • Tue/Thu 1pm-2:50pm @ CHS 41-268A
  • Instructor: Dr. Hua Zhou

2 What is statistics?

  • Statistics, the science of data analysis, is the applied mathematics in the 21st century.

  • People (scientists, government, health professionals, companies) collect data in order to answer certain questions. Statisticians' job is to help them extract knowledge and insights from data.

  • If existing software tools readily solve the problem, all the better.

  • Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).

  • This entails at least two essential skills: programming and fundamental knowledge of algorithms.

3 What is this course about?

  • Not a course on statistical packages. It does not answer questions such as how to fit a linear mixed model in R, Julia, SAS, SPSS, or Stata.

  • Not a pure programming course, although programming is important and we do homework in Julia.

  • Not a course on data science. BIOSTAT 203B (Introduction to Data Science) in winter quarter focuses on some software tools for data scientists.

  • This course focuses on algorithms, mostly those in numerical linear algebra and numerical optimization.

4 Learning objectives

  1. Be highly appreciative of this quote by James Gentle

    The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

    Examples: \(\mathbf{X}^T \mathbf{W} \mathbf{X}\), \(\operatorname{tr} (\mathbf{A} \mathbf{B})\), \(\operatorname{diag} (\mathbf{A} \mathbf{B})\), multivariate normal density, …

  2. Become memory-conscious. You care about looping order. You benchmark hot functions fanatically to make sure they are not allocating.

  3. No inversion mentality. Whenever you see a matrix inverse in a mathematical expression, your brain reacts with matrix decomposition, iterative solvers, etc. For R users, that means you almost never use the solve(M) function to obtain the inverse of a matrix \(\boldsymbol{M}\).

    Examples: \((\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\), \(\mathbf{y}^T \boldsymbol{\Sigma}^{-1} \mathbf{y}\), Newton-Raphson algorithm, …

  4. Master some basic strategies to solve big data problems.

    Examples: how Google solves the PageRank problem with \(10^{9}\) webpages, linear regression with \(10^7\) observations, etc.

  5. Not afraid of optimization; treat it as a technology. Be able to recognize the major optimization classes and choose the best solver(s) accordingly.

  6. Be immune to the language fight.
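To make objectives 1 and 3 concrete, here is a small Julia sketch (illustrative, not part of the course code). Mathematically equivalent expressions can have very different costs: evaluating \(\operatorname{tr}(\mathbf{A}\mathbf{B})\) via the full product costs \(O(n^3)\) flops, while summing elementwise products costs \(O(n^2)\); likewise, a factorization-based solve avoids forming an explicit inverse.

```julia
using LinearAlgebra

n = 1000
A, B = randn(n, n), randn(n, n)

# tr(A * B) forms the entire n×n product just to read its diagonal: O(n^3) flops
t1 = tr(A * B)
# summing elementwise products computes only the needed terms: O(n^2) flops
t2 = sum(transpose(A) .* B)
t1 ≈ t2   # same value, very different cost

# "no inversion mentality": a factorization-based solve beats an explicit inverse
b = randn(n)
x1 = inv(A) * b   # wasteful: computes all n^2 entries of the inverse
x2 = A \ b        # LU factorization plus two triangular solves
x1 ≈ x2
```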

5 Course logistics

6 How to get started

All course materials are in GitHub repo https://github.com/ucla-biostat-257/2023spring. Lecture notes are Jupyter Notebooks (.ipynb files) and Quarto Markdown (.qmd files) under the slides folder. It is a good idea to learn by running through the code examples. You can do this in several ways.

6.1 Run Jupyter Notebook in Binder

A quick and easy way to run the Jupyter Notebooks is Binder, a free service that allows us to run Jupyter Notebooks in the cloud. Simply follow the Binder link at the schedule page.

If you want the JupyterLab interface, replace tree with lab in the URL.

6.2 Run Jupyter Notebook locally on your own computer

  1. Download and install Julia v1.8.x from https://julialang.org/downloads/. On Mac, use the Bash command
sudo ln -s /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia /usr/local/bin/julia

to create a symbolic link so the julia command is available anywhere in the terminal.

  2. Install the IJulia package. Open the Julia REPL, type ] to enter the package mode, then type
add IJulia
build IJulia
  3. Git clone the course material.
git clone https://github.com/ucla-biostat-257/2023spring.git biostat-257-2023spring

You can change biostat-257-2023spring to any other directory name you prefer.

  4. In the terminal, enter the folder of the ipynb file you want to run, e.g., biostat-257-2023spring/slides/01-intro/.

  5. Open the Julia REPL and type

using IJulia
jupyterlab(dir = pwd())

to open JupyterLab in the browser, or

using IJulia
notebook(dir = pwd())

to open a Jupyter notebook.

  6. Course material is updated frequently. Remember to git pull to obtain the most recent material.

6.3 Run Quarto Markdown locally on your own computer

  1. Download and install Julia v1.8.x from https://julialang.org/downloads/. On Mac, use the Bash command
sudo ln -s /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia /usr/local/bin/julia

to create a symbolic link so the julia command is available anywhere in the terminal.

  2. Follow the instructions to install Quarto.

  3. Git clone the course material.

git clone https://github.com/ucla-biostat-257/2023spring.git biostat-257-2023spring

You can change biostat-257-2023spring to any other directory name you prefer.

  4. Double click the file 2023spring.Rproj to open the project in RStudio.

  5. Navigate to the slides folder and run/render the qmd files as you wish.

7 In-class discussion

The logistic regression is typically estimated by the Fisher scoring algorithm, or iteratively reweighted least squares (IWLS), which iterates according to \[ \boldsymbol{\beta}^{(t)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}, \] where \(\mathbf{z}^{(t)}\) are pseudo-responses and \(\mathbf{W}^{(t)} = \text{diag}(\mathbf{w}^{(t)})\) is a diagonal matrix with nonnegative weights on the diagonal. Superscript \(t\) is the iterate number.
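For logistic regression specifically, the ingredients of this update, evaluated at the current iterate, are \(w_i = p_i (1 - p_i)\) and \(z_i = \eta_i + (y_i - p_i) / w_i\), where \(\boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta}\) is the linear predictor and \(p_i\) are the fitted probabilities. A minimal sketch of one iteration (illustrative code, not the course implementation; it solves the weighted normal equations by Cholesky rather than an explicit inverse):

```julia
using LinearAlgebra

# One IWLS / Fisher scoring step for logistic regression (illustrative sketch)
function iwls_step(X, y, β)
    η = X * β                   # linear predictor at the current iterate
    p = @. 1 / (1 + exp(-η))    # fitted probabilities
    w = @. p * (1 - p)          # diagonal of W
    z = @. η + (y - p) / w      # pseudo-responses
    # solve (X' W X) β = X' W z by Cholesky, without forming an inverse
    cholesky(Symmetric(X' * (w .* X))) \ (X' * (w .* z))
end
```

Iterating β ← iwls_step(X, y, β) from β = 0 drives the score \(\mathbf{X}^T (\mathbf{y} - \mathbf{p})\) to zero.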

Question: How much speedup can we achieve, by careful consideration of flops and memory usage, over the following naive implementation?

inv(X' * diagm(w) * X) * X' * diagm(w) * z

7.1 Experiment

First generate some data.

using LinearAlgebra, Random

# Random seed for reproducibility
Random.seed!(257)
# samples, features
n, p = 5000, 100
# design matrix
X = [ones(n) randn(n, p - 1)]
# pseudo-responses
z = randn(n)
# weights
w = 0.25 * rand(n);

7.2 Method 1

The following code literally translates the mathematical expression into code.

# method 1 
res1 = inv(X' * diagm(w) * X) * X' * diagm(w) * z
100-element Vector{Float64}:
  0.03628463723928781
 -0.03627208624567234
  0.0012135776263959386
 -0.00979071963439499
 -0.0225073664180177
 -0.039421539594143795
 -0.00993327709139708
 -0.004858946353764455
 -0.03245127282471158
  0.03149939821184099
  0.011118504820408902
  0.01700396581224958
  0.025334011646021427
  ⋮
 -0.02349680348970219
  0.0039118268589749955
  0.01866508163371693
  0.02450830413510585
 -0.02940780331256664
  0.00821450419241371
  0.008802962928228762
  0.01274601062082029
 -0.00840781793349544
 -0.004671426276903783
  0.008116101716043015
  0.0042616424275055955
using BenchmarkTools

bm1 = @benchmark ((inv($X' * diagm($w) * $X) * $X') * diagm($w)) * $z
bm1
BenchmarkTools.Trial: 66 samples with 1 evaluation.
 Range (min … max):  72.731 ms … 122.000 ms   GC (min … max): 7.88% … 40.78%
 Time  (median):     75.049 ms                GC (median):    7.57%
 Time  (mean ± σ):   76.449 ms ±   7.398 ms   GC (mean ± σ):  8.89% ±  5.23%
    ▅█                                                         
  ▄▇██▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  72.7 ms         Histogram: frequency by time          111 ms <
 Memory estimate: 393.12 MiB, allocs estimate: 18.
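A back-of-envelope accounting of those 393 MiB (my own estimate, not from the notes): each diagm(w) call materializes a dense \(5000 \times 5000\) Float64 matrix, about 191 MiB, and the naive expression calls it twice. A Diagonal wrapper stores only the underlying vector:

```julia
using LinearAlgebra

n = 5000
w = rand(n)
# diagm builds a dense n×n matrix: 8n² bytes ≈ 191 MiB for n = 5000
sizeof(diagm(w))
# Diagonal is a thin wrapper around the existing vector: roughly 8n bytes
Base.summarysize(Diagonal(w))
```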

Several unwise choices of algorithms waste lots of flops. The memory allocations caused by intermediate results also slow down the program because of the need for garbage collection. This is a common mistake beginner programmers in high-level languages make. For example, the following R code (same algorithm on the same data) shows similar allocation. The R code is much slower than Julia, possibly because of the outdated BLAS/LAPACK library it uses.

using RCall

R"""
library(bench)
library(tidyverse)

# Interpolate Julia variables into R workspace
X <- $X
w <- $w
z <- $z

rbm <- bench::mark(
  solve(t(X) %*% diag(w) %*% X) %*% t(X) %*% diag(w) %*% z,
  iterations = 10
  ) %>% 
  print(width = Inf)
""";
┌ Warning: RCall.jl: ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
│ ✔ dplyr     1.1.0     ✔ readr     2.1.4
│ ✔ forcats   1.0.0     ✔ stringr   1.5.0
│ ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
│ ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
│ ✔ purrr     1.0.1     
│ ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
│ ✖ dplyr::filter() masks stats::filter()
│ ✖ dplyr::lag()    masks stats::lag()
│ ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
└ @ RCall ~/.julia/packages/RCall/Wyd74/src/io.jl:172
# A tibble: 1 × 13
  expression                                                    min   median
  <bch:expr>                                               <bch:tm> <bch:tm>
1 solve(t(X) %*% diag(w) %*% X) %*% t(X) %*% diag(w) %*% z    1.86s    1.86s
  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result         
      <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>         
1     0.537     401MB     1.07    10    20      18.6s <dbl [100 × 1]>
  memory              time            gc               
  <list>              <list>          <list>           
1 <Rprofmem [54 × 3]> <bench_tm [10]> <tibble [10 × 3]>
┌ Warning: RCall.jl: Warning: Some expressions had a GC in every iteration; so filtering is disabled.
└ @ RCall ~/.julia/packages/RCall/Wyd74/src/io.jl:172

7.3 Method 2

In the following code, we make smarter choices of algorithms (rearranging the order of evaluation; utilizing data structures such as diagonal, triangular, and positive definite matrices) and get rid of memory allocation by pre-allocating intermediate arrays.

# preallocation
XtWt = Matrix{Float64}(undef, p, n)
XtX = Matrix{Float64}(undef, p, p)
Xtz = Vector{Float64}(undef, p)

function myfun(X, z, w, XtWt, XtX, Xtz)
    # XtWt = X' * W
    mul!(XtWt, transpose(X), Diagonal(w))
    # XtX = X' * W * X
    mul!(XtX, XtWt, X)
    # Xtz = X' * W * z
    mul!(Xtz, XtWt, z)
    # Cholesky on XtX
    LAPACK.potrf!('U', XtX)
    # Two triangular solves to solve (XtX) \ (Xtz)
    BLAS.trsv!('U', 'T', 'N', XtX, Xtz)
    BLAS.trsv!('U', 'N', 'N', XtX, Xtz)
end

# First check correctness vs Method 1
res2 = myfun(X, z, w, XtWt, XtX, Xtz)
100-element Vector{Float64}:
  0.03628463723928781
 -0.036272086245672526
  0.0012135776263959776
 -0.00979071963439495
 -0.022507366418017816
 -0.039421539594143844
 -0.009933277091397097
 -0.004858946353764469
 -0.03245127282471156
  0.03149939821184098
  0.01111850482040888
  0.017003965812249514
  0.025334011646021358
  ⋮
 -0.02349680348970239
  0.0039118268589750085
  0.018665081633716888
  0.02450830413510586
 -0.029407803312566747
  0.008214504192413714
  0.00880296292822884
  0.012746010620820297
 -0.008407817933495484
 -0.004671426276903789
  0.008116101716042993
  0.004261642427505555
bm2 = @benchmark myfun($X, $z, $w, $XtWt, $XtX, $Xtz)
bm2
BenchmarkTools.Trial: 4081 samples with 1 evaluation.
 Range (min … max):  1.210 ms … 1.431 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.223 ms              GC (median):    0.00%
 Time  (mean ± σ):   1.225 ms ± 9.452 μs   GC (mean ± σ):  0.00% ± 0.00%
             ▁▁▄▄▅▆█▆▃▂▁▁▁                                
  ▂▂▃▃▃▄▅▆▇▇███████████████▇▆▅▅▆▅▄▄▄▄▃▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▁▂ ▄
  1.21 ms        Histogram: frequency by time       1.25 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.

In R, a better implementation is

solve(t(X * w) %*% X, t(X) %*% (z * w))

It’s much faster than the naive implementation. To achieve zero memory allocation, some low-level coding using C++ and RcppEigen is necessary.

R"""
bench::mark(
  solve(t(X * w) %*% X, t(X) %*% (z * w)),
  ) %>% 
  print(width = Inf)
""";
# A tibble: 1 × 13
  expression                                   min   median `itr/sec` mem_alloc
  <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 solve(t(X * w) %*% X, t(X) %*% (z * w))   20.1ms   20.6ms      48.5    11.6MB
  `gc/sec` n_itr  n_gc total_time result          memory             
     <dbl> <int> <dbl>   <bch:tm> <list>          <list>             
1     4.41    22     2      453ms <dbl [100 × 1]> <Rprofmem [10 × 3]>
  time            gc               
  <list>          <list>           
1 <bench_tm [24]> <tibble [24 × 3]>

7.4 Conclusion

By careful consideration of the computational algorithms, we achieve a whopping speedup (in Julia) of

# speed-up
median(bm1.times) / median(bm2.times)
61.35592400710615

For PhD students, that means that instead of waiting about two months (61 days) for your simulations to finish, you only need one day!