Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/01-intro`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/01-intro/Project.toml`
[6e4b80f9] BenchmarkTools v1.3.2
[6f49c342] RCall v0.13.14
[37e2e46d] LinearAlgebra
[9a3f8284] Random
1 Basic information
Tue/Thu 1pm-2:50pm @ CHS 41-268A
Instructor: Dr. Hua Zhou
2 What is statistics?
Statistics, the science of data analysis, is the applied mathematics of the 21st century.
People (scientists, government, health professionals, companies) collect data in order to answer certain questions. Statisticians' job is to help them extract knowledge and insights from data.
If existing software tools readily solve the problem, all the better.
Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).
This entails at least two essential skills: programming and fundamental knowledge of algorithms.
3 What is this course about?
Not a course on statistical packages. It does not answer questions such as "How do I fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?"
Not a pure programming course, although programming is important and we do homework in Julia.
Become memory-conscious. You care about looping order. You fanatically benchmark hot functions to make sure they are not allocating.
No inversion mentality. Whenever you see a matrix inverse in a mathematical expression, your brain reacts with matrix decompositions, iterative solvers, etc. For R users, that means you almost never use the solve(M) function to obtain the inverse of a matrix \(\boldsymbol{M}\).
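For example, to solve \(\boldsymbol{M} \boldsymbol{x} = \boldsymbol{b}\), a factorization-based solve is cheaper and numerically more stable than forming the inverse explicitly. A minimal Julia sketch (the test matrix below is made positive definite purely for illustration, so a Cholesky factorization applies):

```julia
using LinearAlgebra, Random

Random.seed!(257)
n = 500
A = randn(n, n)
M = A'A + n * I          # a symmetric positive definite test matrix
b = randn(n)

x1 = inv(M) * b                   # inversion mentality: wasteful
x2 = M \ b                        # factorization-based solve: preferred
x3 = cholesky(Symmetric(M)) \ b   # exploits positive definiteness

x1 ≈ x2 ≈ x3                      # all three agree numerically
```

The `\` solve costs roughly one factorization plus two triangular solves, while `inv(M) * b` pays for a full inverse and a dense matrix-vector multiply, and is typically less accurate.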
Master some basic strategies to solve big data problems.
Examples: how Google solves the PageRank problem with \(10^{9}\) webpages, how to fit a linear regression with \(10^7\) observations, etc.
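To preview the flavor: PageRank reduces to finding the leading eigenvector of a giant column-stochastic matrix, which the power method computes using nothing but sparse matrix-vector products. A toy sketch (the 4-page link graph and the damping factor 0.85 are illustrative assumptions, not course data):

```julia
using LinearAlgebra, SparseArrays

# Toy link graph: entry (i, j) = 1 means page j links to page i
A = sparse([2, 3, 3, 4, 1, 2, 4, 1], [1, 1, 2, 2, 3, 3, 3, 4], 1.0, 4, 4)
P = A ./ sum(A, dims = 1)   # normalize columns: column-stochastic matrix

# Power method with damping: only mat-vec products, never a dense solve
function pagerank(P, d = 0.85; iters = 100)
    n = size(P, 1)
    x = fill(1 / n, n)              # start from the uniform distribution
    for _ in 1:iters
        x = d .* (P * x) .+ (1 - d) / n
        x ./= sum(x)                # keep x a probability vector
    end
    return x
end

x = pagerank(P)
```

The same iteration scales to \(10^9\) pages precisely because it never forms, inverts, or factorizes the transition matrix.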
Not afraid of optimization; treat it as a technology. Be able to recognize the major classes of optimization problems and choose the best solver(s) accordingly.
Jupyter notebooks will be posted/updated before each lecture.
6 How to get started
All course materials are in GitHub repo https://github.com/ucla-biostat-257/2023spring. Lecture notes are Jupyter Notebooks (.ipynb files) and Quarto Markdown (.qmd files) under the slides folder. It is a good idea to learn by running through the code examples. You can do this in several ways.
6.1 Run Jupyter Notebook in Binder
A quick and easy way to run the Jupyter Notebooks is Binder, a free service that allows us to run Jupyter Notebooks in the cloud. Simply follow the Binder link on the schedule page.
If you want the JupyterLab interface, replace tree with lab in the URL.
6.2 Run Jupyter Notebook locally on your own computer
You can change biostat-257-2023spring to any other directory name you prefer.
Double click the file 2023spring.Rproj to open the project in RStudio.
Navigate to the slides folder and run/render the qmd files as you wish.
7 In-class discussion
Logistic regression is typically fitted by the Fisher scoring algorithm, also known as iteratively reweighted least squares (IWLS), which iterates according to \[
\boldsymbol{\beta}^{(t)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)},
\] where \(\mathbf{z}^{(t)}\) are pseudo-responses and \(\mathbf{W}^{(t)} = \text{diag}(\mathbf{w}^{(t)})\) is a diagonal matrix with nonnegative weights on the diagonal. Superscript \(t\) is the iterate number.
Question: How much speedup can we achieve over the following naive implementation by carefully considering flops and memory usage?
inv(X'*diagm(w) * X) * X'*diagm(w) * z
7.1 Experiment
First generate some data.
using LinearAlgebra, Random

# Random seed for reproducibility
Random.seed!(257)
# samples, features
n, p = 5000, 100
# design matrix
X = [ones(n) randn(n, p - 1)]
# pseudo-responses
z = randn(n)
# weights
w = 0.25 * rand(n);
7.2 Method 1
The following code literally translates the mathematical expression into code.
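The benchmark cell itself does not survive in this transcript; presumably it is along the lines of the following reconstruction (my assumption), using BenchmarkTools' @benchmark macro with $-interpolation so that global-variable access does not distort the timing:

```julia
using BenchmarkTools, LinearAlgebra, Random

# regenerate the data from the experiment above
Random.seed!(257)
n, p = 5000, 100
X = [ones(n) randn(n, p - 1)]
z = randn(n)
w = 0.25 * rand(n)

# the naive computation: the two dense 5000×5000 diagm(w) matrices
# dominate the ~393 MiB of allocations reported below
@benchmark inv($X' * diagm($w) * $X) * $X' * diagm($w) * $z
```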
BenchmarkTools.Trial: 66 samples with 1 evaluation.
Range (min … max): 72.731 ms … 122.000 ms ┊ GC (min … max): 7.88% … 40.78%
Time (median): 75.049 ms ┊ GC (median): 7.57%
Time (mean ± σ): 76.449 ms ± 7.398 ms ┊ GC (mean ± σ): 8.89% ± 5.23%
▅█▁
▄▇███▆▃▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
72.7 ms Histogram: frequency by time 111 ms <
Memory estimate: 393.12 MiB, allocs estimate: 18.
Several unwise algorithmic choices waste lots of flops. The memory allocations caused by intermediate results also slow down the program because of the need for garbage collection. This is a common mistake beginner programmers in high-level languages make. For example, the following R code (same algorithm on the same data) shows similar allocation. The R code is much slower than Julia, possibly because of the outdated BLAS/LAPACK library being used.
┌ Warning: RCall.jl: Warning: Some expressions had a GC in every iteration; so filtering is disabled.
└ @ RCall ~/.julia/packages/RCall/Wyd74/src/io.jl:172
7.3 Method 2
In the following code, we make smarter choices of algorithms (rearranging the order of evaluation; utilizing data structures such as diagonal, triangular, and positive definite matrices) and get rid of memory allocation by pre-allocating intermediate arrays.
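A sketch of what such an implementation might look like (my reconstruction, not necessarily the course's exact Method 2): scale the rows of \(\mathbf{X}\) and \(\mathbf{z}\) by \(\sqrt{w_i}\) so that \(\mathbf{X}^T \mathbf{W} \mathbf{X} = (\sqrt{\mathbf{W}} \mathbf{X})^T (\sqrt{\mathbf{W}} \mathbf{X})\), form the normal equations with in-place mul!, and solve them by an in-place Cholesky factorization:

```julia
using LinearAlgebra, Random

Random.seed!(257)
n, p = 5000, 100
X = [ones(n) randn(n, p - 1)]
z = randn(n)
w = 0.25 * rand(n)

# pre-allocated workspaces, reusable across IWLS iterations
Xw  = similar(X)                      # will hold sqrt.(w) .* X
zw  = similar(z)                      # will hold sqrt.(w) .* z
XtX = Matrix{Float64}(undef, p, p)
Xtz = Vector{Float64}(undef, p)

function iwls_step!(Xw, zw, XtX, Xtz, X, w, z)
    @. Xw = sqrt(w) * X               # row-scale X by √w (fused broadcast)
    @. zw = sqrt(w) * z
    mul!(XtX, transpose(Xw), Xw)      # X' W X, written in place
    mul!(Xtz, transpose(Xw), zw)      # X' W z, written in place
    # Cholesky solve of the positive definite normal equations;
    # overwrites Xtz with the solution β
    ldiv!(cholesky!(Symmetric(XtX)), Xtz)
    return Xtz
end

β = iwls_step!(Xw, zw, XtX, Xtz, X, w, z)
```

Compared with Method 1, the diagonal weight matrix is never materialized as a dense \(n \times n\) array, and the fused broadcasts together with the in-place mul!, cholesky!, and ldiv! calls keep per-iteration allocations essentially at zero.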