Summary of linear regression

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 23, 2023

System information (for reproducibility):

versioninfo()

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()

  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/15-linreg`

Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/15-linreg/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [7522ee7d] SweepOperator v0.3.3
  [37e2e46d] LinearAlgebra

1 Comparing methods for linear regression

Methods for solving linear regression \(\widehat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\):

Method	Flops	Remarks	Software	Stability
Sweep	\(np^2 + p^3\)	\((X^TX)^{-1}\) available	SAS	less stable
Cholesky	\(np^2 + p^3/3\)			less stable
QR by Householder	\(2np^2 - (2/3)p^3\)		R	stable
QR by MGS	\(2np^2\)	\(Q_1\) available		stable
QR by SVD	\(4n^2p + 8np^2 + 9p^3\)	\(X = UDV^T\)		most stable

Remarks:

When \(n \gg p\), sweep and Cholesky are twice faster than QR and need less space.
Sweep and Cholesky are based on the Gram matrix \(\mathbf{X}^T \mathbf{X}\), which can be dynamically updated with incoming data. They can handle huge \(n\), moderate \(p\) data sets that cannot fit into memory.
QR methods are more stable and produce numerically more accurate solution.
Although sweep is slower than Cholesky, it yields standard errors and so on.
MGS appears slower than Householder, but it yields \(\mathbf{Q}_1\).

There is simply no such thing as a universal ‘gold standard’ when it comes to algorithms.

2 Benchmark

using SweepOperator, BenchmarkTools, LinearAlgebra

linreg_cholesky(y::Vector, X::Matrix) = cholesky!(X'X) \ (X'y)

linreg_qr(y::Vector, X::Matrix) = X \ y

function linreg_sweep(y::Vector, X::Matrix)
    p = size(X, 2)
    xy = [X y]
    tableau = xy'xy
    sweep!(tableau, 1:p)
    return tableau[1:p, end]
end

function linreg_svd(y::Vector, X::Matrix)
    xsvd = svd(X)
    return xsvd.V * ((xsvd.U'y) ./ xsvd.S)
end

linreg_svd (generic function with 1 method)

using Random

Random.seed!(123) # seed

n, p = 10, 3
X = randn(n, p)
y = randn(n)

# check these methods give same answer
@show linreg_cholesky(y, X)
@show linreg_qr(y, X)
@show linreg_sweep(y, X)
@show linreg_svd(y, X);

linreg_cholesky(y, X) = [-0.07196570434574735, -0.13575699455859386, -0.18820199689456507]
linreg_qr(y, X) = [-0.07196570434574738, -0.1357569945585939, -0.18820199689456502]
linreg_sweep(y, X) = [-0.07196570434574734, -0.1357569945585939, -0.188201996894565]
linreg_svd(y, X) = [-0.07196570434574735, -0.13575699455859377, -0.1882019968945651]

n, p = 1000, 300
X = randn(n, p)
y = randn(n)

@benchmark linreg_cholesky(y, X)

BenchmarkTools.Trial: 6524 samples with 1 evaluation.
 Range (min … max):  670.209 μs …  30.713 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     723.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   765.436 μs ± 611.688 μs  ┊ GC (mean ± σ):  1.68% ± 6.51%
   ▅▇██▃                                                        ▂
  ██████▇▇▇██▇▆▄▃▅▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▄▁▄▅▄▅▅▆▄▆▅▅▆ █
  670 μs        Histogram: log(frequency) by time       1.56 ms <
 Memory estimate: 708.17 KiB, allocs estimate: 4.

@benchmark linreg_sweep(y, X)

BenchmarkTools.Trial: 962 samples with 1 evaluation.
 Range (min … max):  5.001 ms …   6.292 ms  ┊ GC (min … max): 0.00% … 14.58%
 Time  (median):     5.139 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.197 ms ± 227.243 μs  ┊ GC (mean ± σ):  1.10% ±  3.57%
    ▁▄▆██▇▇▅▄▂                                                 
  ▆▇███████████▁▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▄▆█▆▅▅▆▆▆▆▆▆▁▇▆▄ █
  5 ms         Histogram: log(frequency) by time      6.17 ms <
 Memory estimate: 2.99 MiB, allocs estimate: 6.

@benchmark linreg_qr(y, X)

BenchmarkTools.Trial: 685 samples with 1 evaluation.
 Range (min … max):  7.029 ms …   8.988 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.183 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.307 ms ± 309.303 μs  ┊ GC (mean ± σ):  1.00% ± 2.96%
    ▂▅▇██▅▃▃                   ▁                               
  ▅▇█████████▆█▄▇▇▇▆▇█▅▅▄▇▆▅▇█▇█▇█▇▄▄▄▆▅▆▄▅▁█▆▇▇▅▅▄▅▅▁▁▁▁▄▄▁▆ █
  7.03 ms      Histogram: log(frequency) by time      8.48 ms <
 Memory estimate: 3.95 MiB, allocs estimate: 1828.

@benchmark linreg_svd(y, X)

BenchmarkTools.Trial: 229 samples with 1 evaluation.
 Range (min … max):  20.554 ms … 46.199 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.851 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.905 ms ±  1.970 ms  ┊ GC (mean ± σ):  0.62% ± 1.37%
   █▅            ▃▂                                            
  ▇██▅▄▁▁▁▂▃▃▄▅▃▄███▃▁▃▁▄▄▃▁▂▄▅▆▄▄▄▁▁▂▁▂▃▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂ ▃
  20.6 ms         Histogram: frequency by time        25.6 ms <
 Memory estimate: 8.06 MiB, allocs estimate: 14.