LU decomposition (Gaussian elimination) is not used in statistics as often because most of the time statisticians deal with positive (semi)definite matrices.
That is why we often criticize the use of the solve() function in base R code, which inverts a matrix using LU decomposition. First, we most likely do not need the matrix inverse at all. Second, we are most likely dealing with a pd/psd matrix and should use the Cholesky decomposition instead.
For example, in the normal equation \[
\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}
\] for linear regression, the coefficient matrix \(\mathbf{X}^T \mathbf{X}\) is symmetric and positive semidefinite. How to exploit this structure?
2 Cholesky decomposition
Theorem: Let \(\mathbf{A} \in \mathbb{R}^{n \times n}\) be symmetric and positive definite. Then \(\mathbf{A} = \mathbf{L} \mathbf{L}^T\), where \(\mathbf{L}\) is lower triangular with positive diagonal entries and is unique.

Proof (by induction):
If \(n=1\), then \(\ell = \sqrt{a}\). For \(n>1\), the block equation \[
\begin{eqnarray*}
\begin{pmatrix}
a_{11} & \mathbf{a}^T \\ \mathbf{a} & \mathbf{A}_{22}
\end{pmatrix} =
\begin{pmatrix}
\ell_{11} & \mathbf{0}_{n-1}^T \\ \mathbf{l} & \mathbf{L}_{22}
\end{pmatrix}
\begin{pmatrix}
\ell_{11} & \mathbf{l}^T \\ \mathbf{0}_{n-1} & \mathbf{L}_{22}^T
\end{pmatrix}
\end{eqnarray*}
\] has solution \[
\begin{eqnarray*}
\ell_{11} &=& \sqrt{a_{11}} \\
\mathbf{l} &=& \ell_{11}^{-1} \mathbf{a} \\
\mathbf{L}_{22} \mathbf{L}_{22}^T &=& \mathbf{A}_{22} - \mathbf{l} \mathbf{l}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T.
\end{eqnarray*}
\]
Now \(a_{11}>0\) (why?), so \(\ell_{11}\) and \(\mathbf{l}\) are uniquely determined. \(\mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T\) is positive definite because \(\mathbf{A}\) is positive definite (why?). By induction hypothesis, \(\mathbf{L}_{22}\) exists and is unique.
The constructive proof completely specifies the algorithm:
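A minimal Julia sketch of this recursion (ours, written for clarity rather than speed; in practice one calls `cholesky` from `LinearAlgebra`, which wraps the tuned LAPACK routine) might look like:

```julia
using LinearAlgebra

# Unblocked Cholesky following the proof: take ℓ_kk = √a_kk, scale the column
# below the diagonal by 1/ℓ_kk, then update the Schur complement A22 - l*l'.
# Overwrites the lower triangle of A with L.
function chol_unblocked!(A::Matrix{Float64})
    n = size(A, 1)
    for k in 1:n
        A[k, k] > 0 || error("matrix is not positive definite")
        A[k, k] = sqrt(A[k, k])
        for i in (k + 1):n
            A[i, k] /= A[k, k]
        end
        for j in (k + 1):n, i in j:n
            A[i, j] -= A[i, k] * A[j, k]
        end
    end
    LowerTriangular(A)
end

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # a classic pd example
L = chol_unblocked!(copy(A))
L * L' ≈ A                                # true
```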
Computational cost: \[
\frac{1}{2} [2(n-1)^2 + 2(n-2)^2 + \cdots + 2 \cdot 1^2] \approx \frac{1}{3} n^3 \quad \text{flops}
\] plus \(n\) square roots. This is about half the cost of LU decomposition (\(\approx \frac{2}{3} n^3\) flops), obtained by exploiting symmetry.
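The approximation comes from the sum-of-squares identity \[
\sum_{k=1}^{n-1} 2k^2 = \frac{(n-1)n(2n-1)}{3} \approx \frac{2}{3} n^3,
\qquad \text{so} \qquad
\frac{1}{2} \sum_{k=1}^{n-1} 2k^2 \approx \frac{1}{3} n^3.
\]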
In general, Cholesky decomposition is very stable. Failure of the decomposition simply means that \(\mathbf{A}\) is not positive definite; it is thus an efficient way to test positive definiteness.
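For instance, in Julia one can attempt the factorization without throwing on failure (a small illustration of ours; `isposdef` in `LinearAlgebra` does essentially this internally):

```julia
using LinearAlgebra

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # positive definite
B = [1.0 2; 2 1]                          # indefinite: eigenvalues 3 and -1

issuccess(cholesky(Symmetric(A), check = false))   # true
issuccess(cholesky(Symmetric(B), check = false))   # false, factorization broke down
isposdef(B)                                        # false
```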
3 Pivoting
When \(\mathbf{A}\) does not have full rank, e.g., \(\mathbf{X}^T \mathbf{X}\) with an \(\mathbf{X}\) that does not have full column rank, we encounter \(a_{kk} = 0\) for some \(k\) during the procedure.
Symmetric pivoting. At each stage \(k\), we permute both row and column such that \(\max_{k \le i \le n} a_{ii}\) becomes the pivot. If we encounter \(\max_{k \le i \le n} a_{ii} = 0\), then \(\mathbf{A}[k:n,k:n] = \mathbf{0}\) (why?) and the algorithm terminates.
With symmetric pivoting: \[
\mathbf{P} \mathbf{A} \mathbf{P}^T = \mathbf{L} \mathbf{L}^T,
\] where \(\mathbf{P}\) is a permutation matrix and \(\mathbf{L} \in \mathbb{R}^{n \times r}\), \(r = \text{rank}(\mathbf{A})\).
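The snippet below verifies this identity in Julia. The definitions of `A` and `Achol` do not appear in this excerpt, so the setup shown here is only a plausible reconstruction (a rank-deficient psd matrix factored with pivoted Cholesky; `RowMaximum()` requires Julia ≥ 1.8, older versions use `Val(true)`):

```julia
using LinearAlgebra

# A rank-deficient psd matrix: X has two columns, so A = X * X' is 3×3 with rank 2.
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
A = X * X'

# Pivoted Cholesky; check = false so rank deficiency does not throw,
# tol is the threshold below which remaining pivots are treated as zero.
Achol = cholesky(Symmetric(A), RowMaximum(), check = false, tol = 1e-10)
Achol.rank   # 2, the numerical rank
Achol.p      # permutation vector encoding P
```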
```julia
# P A P' = L U
A[Achol.p, Achol.p] ≈ Achol.L * Achol.U
```
true
5 Applications
No inversion mentality: Whenever we see a matrix inverse, we should think in terms of solving linear equations instead of actually forming the inverse. If the matrix is positive (semi)definite, use the Cholesky decomposition, which is about twice as cheap as LU decomposition.
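A small Julia illustration of this mentality (the matrix and right-hand side here are our own example):

```julia
using LinearAlgebra

A = [4.0 12 -16; 12 37 -43; -16 -43 98]   # symmetric positive definite
b = [1.0, 2.0, 3.0]

x1 = inv(A) * b                       # inversion mentality: forms A⁻¹ explicitly
x2 = cholesky(Symmetric(A)) \ b       # solves A x = b directly from the Cholesky factor
x1 ≈ x2                               # true, but x2 is cheaper and numerically preferable
```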
5.1 Multivariate normal density
The log-density of the multivariate normal \(\text{MVN}(\mathbf{0}, \Sigma)\), where \(\Sigma\) is p.d., evaluated at \(\mathbf{y} \in \mathbb{R}^n\), is \[
- \frac{n}{2} \log (2\pi) - \frac{1}{2} \log \det \Sigma - \frac{1}{2} \mathbf{y}^T \Sigma^{-1} \mathbf{y}.
\]
Method 1: (a) form \(\Sigma^{-1}\) explicitly (\(\approx 2n^3\) flops), (b) compute the quadratic form \(\mathbf{y}^T \Sigma^{-1} \mathbf{y}\) (\(2n^2\) flops), and (c) compute \(\log \det \Sigma\) via a separate factorization (\(O(n^3)\) flops).

Method 2: (a) Cholesky decomposition \(\Sigma = \mathbf{L} \mathbf{L}^T\) (\(n^3/3\) flops), (b) solve \(\mathbf{L} \mathbf{x} = \mathbf{y}\) by forward substitution (\(n^2\) flops), (c) compute the quadratic form \(\mathbf{x}^T \mathbf{x} = \mathbf{y}^T \Sigma^{-1} \mathbf{y}\) (\(2n\) flops), and (d) compute the determinant from the Cholesky factor (\(n\) flops).
Which method is better?
```julia
using LinearAlgebra

# this is a person w/o numerical analysis training
function logpdf_mvn_1(y::Vector, Σ::Matrix)
    n = length(y)
    - (n//2) * log(2π) - (1//2) * logdet(Symmetric(Σ)) - (1//2) * transpose(y) * inv(Σ) * y
end

# this is a computing-savvy person
function logpdf_mvn_2(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * abs2(norm(Σchol.L \ y))
end

# better memory efficiency - input Σ is overwritten
function logpdf_mvn_3(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky!(Symmetric(Σ)) # Σ is overwritten
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * dot(y, Σchol \ y)
end
```
logpdf_mvn_3 (generic function with 1 method)
```julia
using BenchmarkTools, Distributions, Random

Random.seed!(257) # seed
n = 1000

# a pd matrix
Σ = convert(Matrix{Float64}, Symmetric([i * (n - j + 1) for i in 1:n, j in 1:n]))
y = rand(MvNormal(Σ)) # one random sample from N(0, Σ)

# at least they should give the same answer
@show logpdf_mvn_1(y, Σ)
@show logpdf_mvn_2(y, Σ)
Σc = copy(Σ)
@show logpdf_mvn_3(y, Σc);
```
```
BenchmarkTools.Trial: 2279 samples with 1 evaluation.
 Range (min … max):  2.031 ms … 5.622 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.046 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.052 ms ± 123.650 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram: frequency by time, 2.03 ms – 2.09 ms]

 Memory estimate: 7.94 KiB, allocs estimate: 1.
```
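The `@benchmark` call that produced this report is not shown in this excerpt (nor which of the three methods it timed); a typical invocation with BenchmarkTools looks like the following, where `$` interpolation avoids timing global-variable lookups:

```julia
@benchmark logpdf_mvn_2($y, $Σ)

# for the mutating version, give each evaluation a fresh copy of Σ
@benchmark logpdf_mvn_3($y, Σc) setup = (Σc = copy($Σ)) evals = 1
```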
To evaluate the same multivariate normal density at many observations \(\mathbf{y}_1, \mathbf{y}_2, \ldots\), we pre-compute the Cholesky decomposition once (\(n^3/3\) flops); each subsequent evaluation then costs only \(n^2\) flops.
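A sketch of this pattern (the closure-returning helper `mvn_logpdf_factory` is our own name, not from the original):

```julia
using LinearAlgebra

# Factor Σ once (n³/3 flops) and return a closure whose evaluations
# cost one triangular solve (n² flops) each.
function mvn_logpdf_factory(Σ::Matrix)
    Σchol = cholesky(Symmetric(Σ))
    n = size(Σ, 1)
    c = - (n / 2) * log(2π) - (1 / 2) * logdet(Σchol)   # y-independent part
    y -> c - (1 / 2) * abs2(norm(Σchol.L \ y))
end

Σ = [4.0 12 -16; 12 37 -43; -16 -43 98]
logpdf_Σ = mvn_logpdf_factory(Σ)
logpdf_Σ([1.0, 0.0, -1.0])   # first observation
logpdf_Σ([0.5, 0.5, 0.5])    # second observation, no re-factorization
```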
5.2 Linear regression
Cholesky decomposition is one approach to solving the linear regression (least squares) problem \[
\widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
\]
Assume \(\mathbf{X} \in \mathbb{R}^{n \times p}\) and \(\mathbf{y} \in \mathbb{R}^n\).
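A minimal Julia sketch of the Cholesky approach (form the normal equations, factor \(\mathbf{X}^T \mathbf{X}\), and solve by two triangular solves); the data here are simulated purely for illustration:

```julia
using LinearAlgebra, Random

Random.seed!(257)
n, p = 100, 5
X = randn(n, p)
y = randn(n)

# form the normal equations and solve them via Cholesky
xtx = X' * X                       # p×p, symmetric pd when X has full column rank
xty = X' * y
β̂ = cholesky(Symmetric(xtx)) \ xty

β̂ ≈ X \ y                          # agrees with the QR-based backslash solution
```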