Algorithms

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 6, 2023

System information (for reproducibility):

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/07-algo`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/07-algo/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random

1 Definition

  • An algorithm is loosely defined as a set of instructions for doing something: Input \(\to\) Output.

  • Knuth: (1) finiteness, (2) definiteness, (3) input, (4) output, (5) effectiveness.

2 Measure of efficiency

  • A basic unit for measuring algorithmic efficiency is the flop.

A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, and the usually accompanying fetch and store.

  • Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This results in a factor of up to 2 difference in flop counts.
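As a concrete illustration of flop counting (a hypothetical helper, not from the lecture): a length-\(n\) dot product costs \(n\) multiplications and \(n\) additions, about \(2n\) flops.

```julia
# Naive dot product with a flop counter: one multiply + one add per element,
# so about 2n flops for vectors of length n.
function dot_with_flops(x::Vector{Float64}, y::Vector{Float64})
    s = 0.0
    flops = 0
    for i in eachindex(x, y)
        s += x[i] * y[i]   # one multiplication + one addition
        flops += 2
    end
    return s, flops
end

s, flops = dot_with_flops(ones(1000), ones(1000))
# flops == 2000, i.e. 2n for n = 1000
```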

  • How to measure efficiency of an algorithm? Big O notation. If \(n\) is the size of a problem, an algorithm has order \(O(f(n))\), where the leading term in the number of flops is \(c \cdot f(n)\). For example,

    • matrix-vector multiplication A * b, where A is \(m \times n\) and b is \(n \times 1\), takes \(2mn\) or \(O(mn)\) flops
    • matrix-matrix multiplication A * B, where A is \(m \times n\) and B is \(n \times p\), takes \(2mnp\) or \(O(mnp)\) flops
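The \(2mnp\) count for matrix-matrix multiplication can be verified with a naive triple loop (a sketch with a hypothetical `matmul_flops` helper, not how BLAS actually implements it):

```julia
# Naive (m×n) * (n×p) matrix multiplication with a flop counter.
# The inner statement does one multiply + one add, executed m*n*p times,
# giving 2mnp flops in total.
function matmul_flops(A::Matrix{Float64}, B::Matrix{Float64})
    m, n = size(A)
    n2, p = size(B)
    @assert n == n2 "inner dimensions must match"
    C = zeros(m, p)
    flops = 0
    for i in 1:m, j in 1:p, k in 1:n
        C[i, j] += A[i, k] * B[k, j]  # one multiply + one add
        flops += 2
    end
    return C, flops
end

A3 = randn(3, 4)
B3 = randn(4, 5)
C, flops = matmul_flops(A3, B3)
# flops == 2 * 3 * 4 * 5 == 120
```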
  • A hierarchy of computational complexity:
    Let \(n\) be the problem size.

    • Exponential order: \(O(b^n)\) (NP-hard=“horrible”)
    • Polynomial order: \(O(n^q)\) (doable)
    • \(O(n \log n )\) (fast)
    • Linear order \(O(n)\) (fast)
    • Log order \(O(\log n)\) (super fast)
  • Classification of data sets by Huber.

Data Size   Bytes                      Storage Mode
Tiny        \(10^2\)                   Piece of paper
Small       \(10^4\)                   A few pieces of paper
Medium      \(10^6\) (megabytes)       A floppy disk
Large       \(10^8\)                   Hard disk
Huge        \(10^9\) (gigabytes)       Hard disk(s)
Massive     \(10^{12}\) (terabytes)    RAID storage
  • Difference of \(O(n^2)\) and \(O(n\log n)\) on massive data. Suppose we have a teraflop supercomputer capable of doing \(10^{12}\) flops per second. For a problem of size \(n=10^{12}\), \(O(n \log n)\) algorithm takes about \[10^{12} \log (10^{12}) / 10^{12} \approx 27 \text{ seconds}.\] \(O(n^2)\) algorithm takes about \(10^{12}\) seconds, which is approximately 31710 years!
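
The back-of-envelope arithmetic above can be checked directly (natural log, as in the \(27\)-second figure):

```julia
# Check the O(n log n) vs O(n^2) comparison for n = 10^12 on a 10^12 flops/sec machine
n, rate = 1e12, 1e12

t_nlogn = n * log(n) / rate       # seconds for an O(n log n) algorithm
t_n2    = n^2 / rate              # seconds for an O(n^2) algorithm

t_nlogn                           # ≈ 27.6 seconds
t_n2 / (3600 * 24 * 365)          # ≈ 31710 years
```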

  • QuickSort and FFT (invented by statistician John Tukey!) are celebrated algorithms that turn \(O(n^2)\) operations into \(O(n \log n)\). Another example is Strassen’s method, which turns \(O(n^3)\) matrix multiplication into \(O(n^{\log_2 7})\).
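To see where the \(\log_2 7\) exponent comes from, here is one level of Strassen’s scheme on a \(2 \times 2\) block: 7 multiplications instead of 8. Applied recursively to \(2 \times 2\) block matrices, this yields \(O(n^{\log_2 7}) \approx O(n^{2.807})\). (The `strassen2x2` helper below is a sketch for illustration, not a production implementation.)

```julia
# One level of Strassen's scheme: 7 multiplies (M1..M7) instead of 8.
function strassen2x2(A::Matrix{Float64}, B::Matrix{Float64})
    a11, a21, a12, a22 = A   # column-major unpacking of a 2×2 matrix
    b11, b21, b12, b22 = B
    M1 = (a11 + a22) * (b11 + b22)
    M2 = (a21 + a22) * b11
    M3 = a11 * (b12 - b22)
    M4 = a22 * (b21 - b11)
    M5 = (a11 + a12) * b22
    M6 = (a21 - a11) * (b11 + b12)
    M7 = (a12 - a22) * (b21 + b22)
    [M1 + M4 - M5 + M7  M3 + M5;
     M2 + M4            M1 - M2 + M3 + M6]
end

A, B = randn(2, 2), randn(2, 2)
strassen2x2(A, B) ≈ A * B   # true
```

In the recursion, each multiplication above is itself a block multiplication of half-size matrices, giving the recurrence \(T(n) = 7\,T(n/2) + O(n^2)\).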

  • One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

  • For example, compare flops of the two mathematically equivalent expressions: (A * B) * x and A * (B * x) where A and B are matrices and x is a vector.
using BenchmarkTools, Random

Random.seed!(123) # seed
n = 1000
A = randn(n, n)
B = randn(n, n)
x = randn(n)

# (A * B) * x: 2n^3 + 2n^2 flops, i.e. O(n^3)
bm1 = @benchmark ($A * $B) * $x
bm1
BenchmarkTools.Trial: 703 samples with 1 evaluation.
 Range (min … max):  6.437 ms … 18.581 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.003 ms                GC (median):    0.00%
 Time  (mean ± σ):   7.119 ms ± 537.078 μs   GC (mean ± σ):  1.62% ± 3.36%
                      ▃█▆                                 
  ▃▂▂▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▄█████▇▅▃▁▂▁▁▁▁▂▂▂▁▁▂▂▂▁▂▃▄▄▃▄▄▄▃▃▃▃▂▂▂ ▃
  6.44 ms         Histogram: frequency by time        7.89 ms <
 Memory estimate: 7.64 MiB, allocs estimate: 3.
# A * (B * x): 2n^2 + 2n^2 flops, i.e. O(n^2)
bm2 = @benchmark $A * ($B * $x)
bm2
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  207.584 μs … 2.529 ms   GC (min … max): 0.00% … 90.81%
 Time  (median):     219.708 μs               GC (median):    0.00%
 Time  (mean ± σ):   243.737 μs ± 57.677 μs   GC (mean ± σ):  0.09% ±  0.91%
    ▂▆▅▁                                            ▁▃▃▁▂▂▃▃▂ ▁
  ▅▆████▆▄▅▄▄▅▅▇█▆▅▁▄▃▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▃▄▁▁▁▁▁▁▁▃▃▁▄█████████ █
  208 μs        Histogram: log(frequency) by time       372 μs <
 Memory estimate: 15.88 KiB, allocs estimate: 2.
# Speed up
median(bm1.times) / median(bm2.times)
31.872799351867023
# Fortunately, Julia evaluates A * B * x in the more efficient order:
# the three-argument method *(A, B, x) lowers to A * (B * x)
# No such luck with some other languages
@code_lowered A * B * x
CodeInfo(
1 ─ %1 = B * x
 %2 = A * %1
└──      return %2
)

3 Performance of computer systems

  • FLOPS (floating point operations per second) is a measure of computer performance.

  • For example, this laptop has the Apple M2 Max CPU with 8 performance cores and 4 efficiency cores.

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  • In Julia, LinearAlgebra.peakflops computes the peak flop rate of the computer by running a double-precision matrix-matrix multiplication, BLAS.gemm!. About 380 GFLOPS DP.
using LinearAlgebra

# Double-precision throughput
LinearAlgebra.peakflops(2^14) # matrix size 2^14
3.788241800490073e11

We roughly estimate the single-precision throughput by (# flops / run time). About 716 GFLOPS SP.

using BenchmarkTools, Random

# Generate matrix data
Random.seed!(257)
n = 2^14
A = randn(Float32, n, n)
B = randn(Float32, n, n)
C = Matrix{Float32}(undef, n, n)

# Single-precision throughput
bm = @benchmark mul!(C, A, B)
bm
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 12.285 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.
# Estimate single precision throughput by # flops / runtime
(2n^3) / (minimum(bm.times) / 1e9)
7.15992391532777e11

In a later lecture, we’ll see that graphics processing units (GPUs) offer a much larger throughput in single precision.