Biostat/Biomath M257 Homework 4

Due May 12 @ 11:59PM

Author

Student Name and UID

Published

May 2, 2023

System information (for reproducibility):

versioninfo()

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()

  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/hw/hw4`

Status `~/Documents/github.com/ucla-biostat-257/2023spring/hw/hw4/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [944b1d66] CodecZlib v0.7.1
  [0b1a1467] KrylovKit v0.6.0
  [bdcacae8] LoopVectorization v0.12.158
  [b51810bb] MatrixDepot v1.0.10
  [295af30f] Revise v3.5.2
  [b8865327] UnicodePlots v3.5.2
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [2f01184e] SparseArrays

We are going to try different numerical methods learnt in class on the Google PageRank problem.

0.1 Q1 (5 pts) Recognize structure

Let $\mathbf{A} \in \{0,1\}^{n \times n}$ be the connectivity matrix of $n$ web pages with entries \[ \begin{eqnarray*} a_{ij}= \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. \end{eqnarray*} \] $r_i = \sum_j a_{ij}$ is the out-degree of page $i$. That is $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules.

From a page $i$ with $r_i>0$
- with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page
- with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page
From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page

The process defines a Markov chain on the space of $n$ pages. Write the transition matrix $\mathbf{P}$ of the Markov chain as a sparse matrix plus rank 1 matrix.

0.2 Q2 Relate to numerical linear algebra

According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\mathbf{x} = (x_1,\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is more visited. It is well-known that $\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\mathbf{P}$. That is $\mathbf{P}^T \mathbf{x} = \mathbf{x}$. Therefore $\mathbf{x}$ can be solved as an eigen-problem. It can also be cast as solving a linear system. Since the row sums of $\mathbf{P}$ are 1, $\mathbf{P}$ is rank deficient. We can replace the first equation by the $\sum_{i=1}^n x_i = 1$.

Hint: For iterative solvers, we don’t need to replace the 1st equation. We can use the matrix $\mathbf{I} - \mathbf{P}^T$ directly if we start with a vector with all positive entries.

0.3 Q3 (10 pts) Explore data

Obtain the connectivity matrix A from the SNAP/web-Google data in the MatrixDepot package.

using MatrixDepot

md = mdopen("SNAP/web-Google")
# display documentation for the SNAP/web-Google data
mdinfo(md)

[ Info: verify download of index files...
[ Info: reading database
[ Info: adding metadata...
[ Info: adding svd data...
[ Info: writing database
[ Info: used remote sites are sparse.tamu.edu with MAT index and math.nist.gov with HTML index

1 SNAP/web-Google

1.0.0.0.0.1 MatrixMarket matrix coordinate pattern general

UF Sparse Matrix Collection, Tim Davis
http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google
name: SNAP/web-Google
[Web graph from Google]
id: 2301
date: 2002
author: Google
ed: J. Leskovec
fields: name title A id date author ed kind notes
kind: directed graph

notes:
Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets,
Jure Leskovec http://snap.stanford.edu/data/index.html
email jure at cs.stanford.edu
Google web graph
Dataset information
Nodes represent web pages and directed edges represent hyperlinks between them.
The data was released in 2002 by Google as a part of Google Programming
Contest.
Dataset statistics
Nodes 875713
Edges 5105039
Nodes in largest WCC 855802 (0.977)
Edges in largest WCC 5066842 (0.993)
Nodes in largest SCC 434818 (0.497)
Edges in largest SCC 3419124 (0.670)
Average clustering coefficient 0.6047
Number of triangles 13391903
Fraction of closed triangles 0.05523
Diameter (longest shortest path) 22
90-percentile effective diameter 8.1
Source (citation)
J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large
Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.
arXiv.org:0810.1355, 2008.
Google programming contest, 2002
http://www.google.com/programming-contest/
Files
File Description
web-Google.txt.gz Webgraph from the Google programming contest, 2002

916428 916428 5105039

# connectivity matrix
A = md.A

916428×916428 SparseArrays.SparseMatrixCSC{Bool, Int64} with 5105039 stored entries:
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿

Compute summary statistics:

How much memory does A take? If converted to a Matrix{Float64} (don’t do it!), how much memory will it take?
number of web pages
number of edges (web links)
number of dangling nodes (pages with no out links)
histogram of in-degrees
list the top 20 pages with the largest in-degrees?
histogram of out-degrees
which the top 20 pages with the largest out-degrees?
visualize the sparsity pattern of $\mathbf{A}$ or a submatrix of $\mathbf{A}$ say A[1:10000, 1:10000].

Hint: For plots, you can use the UnicodePlots.jl package (spy, histogram, etc), which is fast for large data.

1.1 Q4 (5 pts) Dense linear algebra?

Consider the following methods to obtain the page ranks of the SNAP/web-Google data.

A dense linear system solver such as LU decomposition.
A dense eigen-solver for asymmetric matrix.

For the LU approach, estimate (1) the memory usage and (2) how long it will take assuming that the LAPACK functions can achieve the theoretical throughput of your computer.

1.2 Q5 (75 pts) Iterative solvers

Set the teleportation parameter at $p = 0.85$. Consider the following methods for solving the PageRank problem.

An iterative linear system solver such as GMRES.
An iterative eigen-solver such as Arnoldi method.

For iterative methods, we have many choices in Julia. See a list of existing Julia packages for linear solvers at this page. The start-up code below uses the KrylovKit.jl package. You can use other packages if you prefer. Make sure to utilize the special structure of $\mathbf{P}$ (sparse + rank 1) to speed up the matrix-vector multiplication.

1.2.1 Step 1 (15 pts)

Let’s implement a type PageRankImPt that mimics the matrix $\mathbf{M} = \mathbf{I} - \mathbf{P}^T$. For iterative methods, all we need to provide are methods for evaluating $\mathbf{M} \mathbf{v}$ and $\mathbf{M}^T \mathbf{v}$ for arbitrary vector $\mathbf{v}$.

using BenchmarkTools, LinearAlgebra, SparseArrays, Revise

# a type for the matrix M = I - P^T in PageRank problem
struct PageRankImPt{TA <: Number, IA <: Integer, T <: AbstractFloat} <: AbstractMatrix{T}
    A         :: SparseMatrixCSC{TA, IA} # adjacency matrix
    telep     :: T
    # working arrays
    # TODO: whatever intermediate arrays you may want to pre-allocate
end

# constructor
function PageRankImPt(A::SparseMatrixCSC, telep::T) where T <: AbstractFloat
    n = size(A, 1)
    # TODO: initialize and pre-allocate arrays
    PageRankImPt(A, telep)
end

LinearAlgebra.issymmetric(::PageRankImPt) = false
Base.size(M::PageRankImPt) = size(M.A)
# TODO: implement this function for evaluating M[i, j]
Base.getindex(M::PageRankImPt, i, j) = M.telep

# overwrite `out` by `(I - Pt) * v`
function LinearAlgebra.mul!(
        out :: Vector{T}, 
        M   :: PageRankImPt{<:Number, <:Integer, T}, 
        v   :: Vector{T}
        ) where T <: AbstractFloat
    # TODO: implement mul!(out, M, v)
    sleep(1e-2) # wait 10 ms as if your code takes 1ms
    return out
end

# overwrite `out` by `(I - P) * v`
function LinearAlgebra.mul!(
        out :: Vector{T}, 
        Mt  :: Transpose{T, PageRankImPt{TA, IA, T}}, 
        v   :: Vector{T}
        ) where {TA<:Number, IA<:Integer, T <: AbstractFloat}
    M = Mt.parent
    # TODO: implement mul!(out, transpose(M), v)
    sleep(1e-2) # wait 10 ms as if your code takes 1ms
    out
end

To check correctness. Note \[ \mathbf{M}^T \mathbf{1} = \mathbf{0} \] and \[ \mathbf{M} \mathbf{x} = \mathbf{0} \] for stationary distribution $\mathbf{x}$.

Download the solution file pgrksol.csv.gz. Do not put this file in your Git. You will lose points if you do. You can add a line pgrksol.csv.gz to your .gitignore file.

using CodecZlib, DelimitedFiles

isfile("pgrksol.csv.gz") || download("https://raw.githubusercontent.com/ucla-biostat-257/2023spring/master/hw/hw4/pgrksol.csv.gz")
xsol = open("pgrksol.csv.gz", "r") do io
    vec(readdlm(GzipDecompressorStream(io)))
end

916428-element Vector{Float64}:
 3.3783428216975054e-5
 2.0710155392568165e-6
 3.663065984832893e-6
 7.527510785028837e-7
 8.63328599674051e-7
 1.769418252415541e-6
 2.431230382883396e-7
 6.368417180141445e-7
 4.744973703681939e-7
 2.6895486110647536e-7
 3.18574314847409e-6
 7.375106374416742e-7
 2.431230382883396e-7
 ⋮
 1.1305006040148547e-6
 4.874825281822915e-6
 3.167946973112519e-6
 9.72688040308568e-7
 6.588614479285245e-7
 7.737011774300648e-7
 2.431230382883396e-7
 1.6219204214797293e-6
 3.912130060551738e-7
 2.431230382883396e-7
 7.296033831163157e-6
 6.330939996912478e-7

You will lose all 35 points (Steps 1 and 2) if the following statements throw AssertError.

M = PageRankImPt(A, 0.85)
n = size(M, 1)

#@assert transpose(M) * ones(n) ≈ zeros(n)
@assert norm(transpose(M) * ones(n)) < 1e-12

#@assert M * xsol ≈ zeros(n)
@assert norm(M * xsol) < 1e-12

1.2.2 Step 2 (20 pts)

We want to benchmark the hot functions mul! to make sure they are efficient and allocate no memory.

M = PageRankImPt(A, 0.85)
n = size(M, 1)
v, out = ones(n), zeros(n)
bm_mv = @benchmark mul!($out, $M, $v) setup=(fill!(out, 0); fill!(v, 1))

BenchmarkTools.Trial: 373 samples with 1 evaluation.
 Range (min … max):  10.912 ms …  12.788 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.061 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.956 ms ± 302.479 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                            ▆█▅                 
  ▄▁▁▁▁▆▅▆▆▄▁▅▆▆▄▅▆██▆▄▄▅▅▁▁▁▅▅▅▁▁▁▅▄▁▄▄▆▁▄▇████▇▆▅▄▄▄▅▁▁▁▁▄▁▆ ▆
  10.9 ms       Histogram: log(frequency) by time      12.5 ms <
 Memory estimate: 144 bytes, allocs estimate: 5.

bm_mtv = @benchmark mul!($out, $(transpose(M)), $v) setup=(fill!(out, 0); fill!(v, 1))

BenchmarkTools.Trial: 374 samples with 1 evaluation.
 Range (min … max):  10.966 ms …  12.571 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.058 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.960 ms ± 272.533 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                           ▄█▄                  
  ▄▁▁▁▆▅▅▅▁▆▄▆▅▅▄▆█▇▄▄▆▅▁▄▄▄▅▅▅▅▁▆▅▄▁▆▄▅▁▄▅████▇▁▁▁▄▄▁▁▁▄▁▁▁▁▅ ▆
  11 ms         Histogram: log(frequency) by time      12.5 ms <
 Memory estimate: 144 bytes, allocs estimate: 5.

You will lose 1 point for each 100 bytes memory allocation. So the points you will get is

clamp(10 - median(bm_mv).memory / 100, 0, 10) + 
clamp(10 - median(bm_mtv).memory / 100, 0, 10)

17.12

Hint: My median run times are about 10 ms and memory allocations are 0 bytes.

1.2.3 Step 3 (20 pts)

Let’s first try to solve the PageRank problem by the GMRES method for solving linear equations.

using KrylovKit

# normalize in-degrees to be the start point
x0   = vec(sum(A, dims = 1)) .+ 1.0
x0 ./= sum(x0)

# right hand side
b = zeros(n)

# warm up (compilation)
linsolve(M, b, x0, issymmetric = false, isposdef = false, maxiter = 1) 
# output is complex eigenvalue/eigenvector
(x_gmres, info), time_gmres, = @timed linsolve(M, b, x0, issymmetric = false, isposdef = false)

(value = ([3.5373439728225696e-5, 1.1625074089088258e-6, 7.4732619144138796e-6, 6.642899479479004e-7, 9.964349219218506e-7, 2.6571597917916015e-6, 1.660724869869751e-7, 6.642899479479004e-7, 3.321449739739502e-7, 3.321449739739502e-7  …  2.989304765765552e-6, 1.3285798958958008e-6, 2.4910873048046265e-6, 4.982174609609253e-7, 1.660724869869751e-7, 2.4910873048046265e-6, 3.321449739739502e-7, 1.660724869869751e-7, 7.4732619144138796e-6, 4.982174609609253e-7], ConvergenceInfo: one converged value after 0 iterations and 1 applications of the linear map;
norms of residuals are given by (2.4845421886e-314,).
), time = 0.042158375, bytes = 27077034, gctime = 0.009942417, gcstats = Base.GC_Diff(27077034, 3, 0, 87986, 2, 50, 9942417, 1, 0))

Check correctness. You will lose all 20 points if the following statement throws AssertError.

@assert norm(x_gmres - xsol) < 1e-8

LoadError: AssertionError: norm(x_gmres - xsol) < 1.0e-8

GMRES should be reasonably fast. The points you’ll get is

clamp(20 / time_gmres * 20, 0, 20)

20.0

Hint: My runtime is about 3-4 seconds.

1.2.4 Step 4 (20 pts)

Let’s first try to solve the PageRank problem by the Arnoldi method for solving eigen problems.

# warm up (compilation)
eigsolve(M, x0, 1, :SR, issymmetric = false, maxiter = 1)
# output is complex eigenvalue/eigenvector
(vals, vecs, info), time_arnoldi, = @timed eigsolve(M, x0, 1, :SR, issymmetric = false)

(value = ([0.0], [[0.005716124042701269, 0.00018785384177891497, 0.0012076318400073105, 0.00010734505244509426, 0.00016101757866764137, 0.00042938020978037703, 2.6836263111273564e-5, 0.00010734505244509426, 5.367252622254713e-5, 5.367252622254713e-5  …  0.00048305273600292417, 0.00021469010489018851, 0.00040254394666910346, 8.050878933382069e-5, 2.6836263111273564e-5, 0.00040254394666910346, 5.367252622254713e-5, 2.6836263111273564e-5, 0.0012076318400073105, 8.050878933382069e-5]], ConvergenceInfo: one converged value after 1 iterations and 1 applications of the linear map;
norms of residuals are given by (0.0,).
), time = 0.030408917, bytes = 32416471, gctime = 0.0, gcstats = Base.GC_Diff(32416471, 4, 0, 54556, 19, 0, 0, 0, 0))

Check correctness. You will lose all 20 points if the following statement throws AssertError.

@assert abs(Real(vals[1])) < 1e-8

x_arnoldi   = abs.(Real.(vecs[1]))
x_arnoldi ./= sum(x_arnoldi)
@assert norm(x_arnoldi - xsol) < 1e-8

LoadError: AssertionError: norm(x_arnoldi - xsol) < 1.0e-8

Arnoldi should be reasonably fast. The points you’ll get is

clamp(20 / time_arnoldi * 20, 0, 20)

20.0

Hint: My runtime is about 6-7 seconds.

1.3 Q6 (5 pts) Results

List the top 20 pages you found and their corresponding PageRank score. Do they match the top 20 pages ranked according to in-degrees?

1.4 Q7 Be proud of yourself

Go to your resume/cv and claim you have experience performing analysis on a network of one million nodes.