Biostat/Biomath M257 Homework 3
Due May 5 @ 11:59PM
System information (for reproducibility):
versioninfo()
Load packages:
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
using LinearAlgebra, Random
using BenchmarkTools, Distributions
Consider a linear mixed effects model \[
\mathbf{Y}_i = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{Z}_i \boldsymbol{\gamma} + \boldsymbol{\epsilon}_i, \quad i=1,\ldots,n,
\] where
- \(\mathbf{Y}_i \in \mathbb{R}^{n_i}\) is the response vector of the \(i\)-th individual,
- \(\mathbf{X}_i \in \mathbb{R}^{n_i \times p}\) is the fixed-effect predictor matrix of the \(i\)-th individual,
- \(\mathbf{Z}_i \in \mathbb{R}^{n_i \times q}\) is the random-effect predictor matrix of the \(i\)-th individual,
- \(\boldsymbol{\epsilon}_i \in \mathbb{R}^{n_i}\) are multivariate normal \(N(\mathbf{0}_{n_i},\sigma^2 \mathbf{I}_{n_i})\),
- \(\boldsymbol{\beta} \in \mathbb{R}^p\) are fixed effects, and
- \(\boldsymbol{\gamma} \in \mathbb{R}^q\) are random effects assumed to be \(N(\mathbf{0}_q, \boldsymbol{\Sigma}_{q \times q})\), independent of \(\boldsymbol{\epsilon}_i\).
1 Q1 Formula (10 pts)
Write down the log-likelihood of the \(i\)-th datum \((\mathbf{y}_i, \mathbf{X}_i, \mathbf{Z}_i)\) given parameters \((\boldsymbol{\beta}, \boldsymbol{\Sigma}, \sigma^2)\).
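Hint (a standard fact about Gaussian linear models, not the full derivation): integrating out the random effects \(\boldsymbol{\gamma}\) shows that, marginally, \[
\mathbf{Y}_i \sim N\left(\mathbf{X}_i \boldsymbol{\beta}, \, \boldsymbol{\Omega}_i\right), \quad \boldsymbol{\Omega}_i = \mathbf{Z}_i \boldsymbol{\Sigma} \mathbf{Z}_i^T + \sigma^2 \mathbf{I}_{n_i},
\] so the required log-likelihood is the multivariate normal log-density with this mean and covariance, evaluated at \(\mathbf{y}_i\). (This is the same \(\boldsymbol{\Omega}\) that the reference code in Q3 builds explicitly.)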
2 Q2 Start-up code
Use the following template to define a type LmmObs
that holds an LMM datum \((\mathbf{y}_i, \mathbf{X}_i, \mathbf{Z}_i)\).
# define a type that holds LMM datum
struct LmmObs{T <: AbstractFloat}
    # data
    y :: Vector{T}
    X :: Matrix{T}
    Z :: Matrix{T}
    # working arrays
    # whatever intermediate vectors/arrays you may want to pre-allocate
    storage_p  :: Vector{T}
    storage_q  :: Vector{T}
    xtx        :: Matrix{T}
    ztx        :: Matrix{T}
    ztz        :: Matrix{T}
    storage_qq :: Matrix{T}
end
# constructor
function LmmObs(
    y::Vector{T},
    X::Matrix{T},
    Z::Matrix{T}
    ) where T <: AbstractFloat
    storage_p  = Vector{T}(undef, size(X, 2))
    storage_q  = Vector{T}(undef, size(Z, 2))
    xtx        = transpose(X) * X
    ztx        = transpose(Z) * X
    ztz        = transpose(Z) * Z
    storage_qq = similar(ztz)
    LmmObs(y, X, Z, storage_p, storage_q, xtx, ztx, ztz, storage_qq)
end
Write a function, with interface
logl!(obs, β, L, σ²)
that evaluates the log-likelihood of the \(i\)-th datum. Here L is the lower-triangular Cholesky factor from the Cholesky decomposition \(\boldsymbol{\Sigma} = \mathbf{L} \mathbf{L}^T\). Make your code efficient in the \(n_i \gg q\) case. Think of the intensive longitudinal measurement setting.
function logl!(
    obs :: LmmObs{T},
    β   :: Vector{T},
    L   :: Matrix{T},
    σ²  :: T) where T <: AbstractFloat
    n, p, q = size(obs.X, 1), size(obs.X, 2), size(obs.Z, 2)
    # TODO: compute and return the log-likelihood
    sleep(1e-3) # wait 1 ms as if your code takes 1 ms
    return zero(T)
end
Hint: This function shouldn’t be very long. Mine, obeying the 92-character rule, is 30 lines. If you find yourself writing very long code, you’re on the wrong track. Think about the algorithm (flop count) first, then use BLAS functions to reduce memory allocations.
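To make the flop-count hint concrete, here is a minimal, allocating sketch of one possible \(O(n_i q^2)\) approach, based on the Woodbury identity and the matrix determinant lemma applied to \(\boldsymbol{\Omega} = \sigma^2 \mathbf{I}_n + \mathbf{Z} \mathbf{L} \mathbf{L}^T \mathbf{Z}^T\). The name logl_naive and its temporaries are illustrative only; this version ignores the allocation budget of Q5, so treat it as a starting point, not the graded solution.
# an allocating reference sketch (illustrative; NOT the zero-allocation solution):
# with M = I_q + σ⁻² L'Z'Z L, the Woodbury identity and determinant lemma give
#   logdet(Ω) = n log σ² + logdet(M)
#   r'Ω⁻¹r    = σ⁻² r'r - σ⁻⁴ u'M⁻¹u,   where r = y - Xβ and u = L'Z'r
function logl_naive(obs::LmmObs{T}, β::Vector{T}, L::Matrix{T},
        σ²::T) where T <: AbstractFloat
    n     = size(obs.Z, 1)
    r     = obs.y - obs.X * β                      # residual, O(np) flops
    u     = transpose(L) * (transpose(obs.Z) * r)  # u = L'Z'r, O(nq) flops
    M     = I + transpose(L) * obs.ztz * L ./ σ²   # q × q, O(q³) flops
    Mchol = cholesky(Symmetric(M))
    quad  = (dot(r, r) - dot(u, Mchol \ u) / σ²) / σ²
    return (-n * log(2π * σ²) - logdet(Mchol) - quad) / 2
end
The pre-allocated fields storage_p, storage_q, storage_qq, and ztz in LmmObs presumably exist so that, in your final version, each of these intermediate products can be formed in place (e.g. with mul!) instead of allocating.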
3 Q3 Correctness (15 pts)
Compare your result (both accuracy and timing) to the Distributions.jl package using the following data.
Random.seed!(257)
# dimension
n, p, q = 2000, 5, 3
# predictors
X = [ones(n) randn(n, p - 1)]
Z = [ones(n) randn(n, q - 1)]
# parameter values
β  = [2.0; -1.0; rand(p - 2)]
σ² = 1.5
Σ  = fill(0.1, q, q) + 0.9I
# generate y
y = X * β + Z * rand(MvNormal(Σ)) + sqrt(σ²) * randn(n)
# form an LmmObs object
obs = LmmObs(y, X, Z)
This is the standard way to evaluate the log-density of a multivariate normal, using the Distributions.jl package. Let’s evaluate the log-likelihood of this datum.
μ   = X * β
Ω   = Z * Σ * transpose(Z) + σ² * I
mvn = MvNormal(μ, Symmetric(Ω)) # MVN(μ, Ω)
logpdf(mvn, y)
Check that your answer matches that from Distributions.jl
L = Matrix(cholesky(Σ).L)
logl!(obs, β, L, σ²)
You will lose all 15 + 30 + 30 = 75 points if the following statement throws an AssertionError.
@assert logl!(obs, β, Matrix(cholesky(Σ).L), σ²) ≈ logpdf(mvn, y)
4 Q4 Efficiency (30 pts)
Benchmark your code and compare it to the Distributions.jl function logpdf.
# benchmark the `logpdf` function in Distributions.jl
bm1 = @benchmark logpdf($mvn, $y)
# benchmark your implementation
L   = Matrix(cholesky(Σ).L)
bm2 = @benchmark logl!($obs, $β, $L, $σ²)
The number of points you get is \[ \frac{x}{1000} \times 30, \] where \(x\) is the speedup of your program over the standard method.
# this is the points you'll get
clamp(median(bm1).time / median(bm2).time / 1000 * 30, 0, 30)
Hint: I am using 1000 as the denominator because I expect your code to be at least \(1000 \times\) faster than the standard method.
5 Q5 Memory (30 pts)
You want to avoid memory allocation in the “hot” function logl!. You will lose 1 point for each 1 KiB (= 1024 bytes) of memory allocated. In other words, the number of points you get for this question is
clamp(30 - median(bm2).memory / 1024, 0, 30)
Hint: I am able to reduce the memory allocation to 0 bytes.
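For intuition on how 0 bytes is achievable: in-place routines such as LinearAlgebra.mul! write into a pre-allocated output array rather than allocating a new one. A tiny standalone demonstration (the arrays here are hypothetical, not the homework data; it uses the LinearAlgebra and BenchmarkTools packages loaded above):
A   = randn(2000, 3)              # stand-in for a Z-like matrix
At  = transpose(A)                # lazy transpose, no copy
v   = randn(2000)
out = Vector{Float64}(undef, 3)   # pre-allocated destination
@benchmark mul!($out, $At, $v)    # memory estimate: 0 bytes
Combining mul! (and scalar reductions such as dot) with the storage_* fields of LmmObs is one way to keep the hot path allocation-free.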
6 Q6 Misc (15 pts)
Coding style, Git workflow, etc. For reproducibility, make sure we (the TA and myself) can run your Jupyter Notebook. That is how we grade Q4 and Q5. If we cannot run it, you will get zero points.