Algorithms

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 6, 2023

System information (for reproducibility):

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/07-algo`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/07-algo/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random

1 Definition

  • An algorithm is loosely defined as a set of instructions for doing something: Input \(\to\) Output.

  • Knuth: (1) finiteness, (2) definiteness, (3) input, (4) output, (5) effectiveness.

2 Measure of efficiency

  • A basic unit for measuring algorithmic efficiency is the flop.

A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, and the usually accompanying fetch and store.

  • Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This results in a factor of up to 2 difference in flop counts.
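As a concrete illustration of flop counting (a hypothetical helper, not from the lecture): a length-\(n\) dot product costs \(n\) multiplications and \(n\) additions, about \(2n\) flops.

```julia
# Naive dot product with a flop counter: one multiply + one add per element,
# so about 2n flops for vectors of length n.
function dot_with_flops(x::Vector{Float64}, y::Vector{Float64})
    s = 0.0
    flops = 0
    for i in eachindex(x, y)
        s += x[i] * y[i]   # one multiplication + one addition
        flops += 2
    end
    return s, flops
end

s, flops = dot_with_flops(ones(1000), ones(1000))
# flops == 2000, i.e. 2n for n = 1000
```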

  • How to measure efficiency of an algorithm? Big O notation. If \(n\) is the size of a problem, an algorithm has order \(O(f(n))\), where the leading term in the number of flops is \(c \cdot f(n)\). For example,

    • matrix-vector multiplication A * b, where A is \(m \times n\) and b is \(n \times 1\), takes \(2mn\) or \(O(mn)\) flops
    • matrix-matrix multiplication A * B, where A is \(m \times n\) and B is \(n \times p\), takes \(2mnp\) or \(O(mnp)\) flops
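The \(2mnp\) count for matrix-matrix multiplication can be verified with a naive triple loop (a sketch with a hypothetical `matmul_flops` helper, not how BLAS actually implements it):

```julia
# Naive (m×n) * (n×p) matrix multiplication with a flop counter.
# The inner statement does one multiply + one add, executed m*n*p times,
# giving 2mnp flops in total.
function matmul_flops(A::Matrix{Float64}, B::Matrix{Float64})
    m, n = size(A)
    n2, p = size(B)
    @assert n == n2 "inner dimensions must match"
    C = zeros(m, p)
    flops = 0
    for i in 1:m, j in 1:p, k in 1:n
        C[i, j] += A[i, k] * B[k, j]  # one multiply + one add
        flops += 2
    end
    return C, flops
end

A3 = randn(3, 4)
B3 = randn(4, 5)
C, flops = matmul_flops(A3, B3)
# flops == 2 * 3 * 4 * 5 == 120
```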
  • A hierarchy of computational complexity:
    Let \(n\) be the problem size.

    • Exponential order: \(O(b^n)\) (NP-hard=“horrible”)
    • Polynomial order: \(O(n^q)\) (doable)
    • \(O(n \log n )\) (fast)
    • Linear order \(O(n)\) (fast)
    • Log order \(O(\log n)\) (super fast)
  • Classification of data sets by Huber.

Data Size   Bytes                      Storage Mode
Tiny        \(10^2\)                   Piece of paper
Small       \(10^4\)                   A few pieces of paper
Medium      \(10^6\) (megabytes)       A floppy disk
Large       \(10^8\)                   Hard disk
Huge        \(10^9\) (gigabytes)       Hard disk(s)
Massive     \(10^{12}\) (terabytes)    RAID storage
  • Difference of \(O(n^2)\) and \(O(n\log n)\) on massive data. Suppose we have a teraflop supercomputer capable of doing \(10^{12}\) flops per second. For a problem of size \(n=10^{12}\), \(O(n \log n)\) algorithm takes about \[10^{12} \log (10^{12}) / 10^{12} \approx 27 \text{ seconds}.\] \(O(n^2)\) algorithm takes about \(10^{12}\) seconds, which is approximately 31710 years!
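
The back-of-envelope arithmetic above can be checked directly (natural log, as in the \(27\)-second figure):

```julia
# Check the O(n log n) vs O(n^2) comparison for n = 10^12 on a 10^12 flops/sec machine
n, rate = 1e12, 1e12

t_nlogn = n * log(n) / rate       # seconds for an O(n log n) algorithm
t_n2    = n^2 / rate              # seconds for an O(n^2) algorithm

t_nlogn                           # ≈ 27.6 seconds
t_n2 / (3600 * 24 * 365)          # ≈ 31710 years
```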

  • QuickSort and FFT (invented by statistician John Tukey!) are celebrated algorithms that turn \(O(n^2)\) operations into \(O(n \log n)\). Another example is Strassen’s method, which turns \(O(n^3)\) matrix multiplication into \(O(n^{\log_2 7})\).
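To see where the \(\log_2 7\) exponent comes from, here is one level of Strassen’s scheme on a \(2 \times 2\) block: 7 multiplications instead of 8. Applied recursively to \(2 \times 2\) block matrices, this yields \(O(n^{\log_2 7}) \approx O(n^{2.807})\). (The `strassen2x2` helper below is a sketch for illustration, not a production implementation.)

```julia
# One level of Strassen's scheme: 7 multiplies (M1..M7) instead of 8.
function strassen2x2(A::Matrix{Float64}, B::Matrix{Float64})
    a11, a21, a12, a22 = A   # column-major unpacking of a 2×2 matrix
    b11, b21, b12, b22 = B
    M1 = (a11 + a22) * (b11 + b22)
    M2 = (a21 + a22) * b11
    M3 = a11 * (b12 - b22)
    M4 = a22 * (b21 - b11)
    M5 = (a11 + a12) * b22
    M6 = (a21 - a11) * (b11 + b12)
    M7 = (a12 - a22) * (b21 + b22)
    [M1 + M4 - M5 + M7  M3 + M5;
     M2 + M4            M1 - M2 + M3 + M6]
end

A, B = randn(2, 2), randn(2, 2)
strassen2x2(A, B) ≈ A * B   # true
```

In the recursion, each multiplication above is itself a block multiplication of half-size matrices, giving the recurrence \(T(n) = 7\,T(n/2) + O(n^2)\).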

  • One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

  • For example, compare flops of the two mathematically equivalent expressions: (A * B) * x and A * (B * x) where A and B are matrices and x is a vector.
using BenchmarkTools, Random

Random.seed!(123) # seed
n = 1000
A = randn(n, n)
B = randn(n, n)
x = randn(n)

# (A * B) * x: 2n^3 + 2n^2 flops, i.e. O(n^3)
bm1 = @benchmark ($A * $B) * $x
bm1
BenchmarkTools.Trial: 703 samples with 1 evaluation.
 Range (min … max):  6.437 ms … 18.581 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.003 ms                GC (median):    0.00%
 Time  (mean ± σ):   7.119 ms ± 537.078 μs   GC (mean ± σ):  1.62% ± 3.36%
                      ▃█▆                                 
  ▃▂▂▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▄█████▇▅▃▁▂▁▁▁▁▂▂▂▁▁▂▂▂▁▂▃▄▄▃▄▄▄▃▃▃▃▂▂▂ ▃
  6.44 ms         Histogram: frequency by time        7.89 ms <
 Memory estimate: 7.64 MiB, allocs estimate: 3.
# A * (B * x): 2n^2 + 2n^2 flops, i.e. O(n^2)
bm2 = @benchmark $A * ($B * $x)
bm2
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  207.584 μs … 2.529 ms   GC (min … max): 0.00% … 90.81%
 Time  (median):     219.708 μs               GC (median):    0.00%
 Time  (mean ± σ):   243.737 μs ± 57.677 μs   GC (mean ± σ):  0.09% ±  0.91%
    ▂▆▅▁                                            ▁▃▃▁▂▂▃▃▂ ▁
  ▅▆████▆▄▅▄▄▅▅▇█▆▅▁▄▃▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▃▄▁▁▁▁▁▁▁▃▃▁▄█████████ █
  208 μs        Histogram: log(frequency) by time       372 μs <
 Memory estimate: 15.88 KiB, allocs estimate: 2.
# Speed up
median(bm1.times) / median(bm2.times)
31.872799351867023
# Fortunately, Julia evaluates A * B * x in the more efficient order:
# the three-argument method *(A, B, x) lowers to A * (B * x)
# No such luck with some other languages
@code_lowered A * B * x
CodeInfo(
1 ─ %1 = B * x
 %2 = A * %1
└──      return %2
)

3 Performance of computer systems

  • FLOPS (floating point operations per second) is a measure of computer performance.

  • For example, this laptop has the Apple M2 Max CPU with 8 performance cores and 4 efficiency cores.

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  • In Julia, LinearAlgebra.peakflops computes the peak flop rate of the computer by running a double-precision matrix-matrix multiplication, BLAS.gemm!. About 380 GFLOPS DP.
using LinearAlgebra

# Double-precision throughput
LinearAlgebra.peakflops(2^14) # matrix size 2^14
3.788241800490073e11

We roughly estimate the single-precision throughput by (# flops / run time). About 716 GFLOPS SP.

using BenchmarkTools, Random

# Generate matrix data
Random.seed!(257)
n = 2^14
A = randn(Float32, n, n)
B = randn(Float32, n, n)
C = Matrix{Float32}(undef, n, n)

# Single-precision throughput
bm = @benchmark mul!(C, A, B)
bm
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 12.285 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.
# Estimate single precision throughput by # flops / runtime
(2n^3) / (minimum(bm.times) / 1e9)
7.15992391532777e11

In a later lecture, we’ll see that graphics processing units (GPUs) offer a much larger throughput in single precision.