Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/07-algo`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/07-algo/Project.toml`
[6e4b80f9] BenchmarkTools v1.3.2
[37e2e46d] LinearAlgebra
[9a3f8284] Random
1 Definition
An algorithm is loosely defined as a set of instructions for doing something: Input \(\to\) Output.
A basic unit for measuring algorithmic efficiency is the flop.
A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, and the usually accompanying fetch and store.
Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This convention can result in flop counts that differ by up to a factor of 2.
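The FMA convention corresponds to a real machine instruction. As a minimal illustration (our addition, not from the lecture), Julia's muladd function computes a * b + c and is free to use a single hardware fused multiply-add instruction where available:

# muladd(a, b, c) computes a * b + c, possibly as one fused
# multiply-add instruction on hardware that supports FMA
muladd(2.0, 3.0, 1.0)    # 7.0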
How to measure efficiency of an algorithm? Big O notation. If \(n\) is the size of a problem, an algorithm has order \(O(f(n))\), where the leading term in the number of flops is \(c \cdot f(n)\). For example,
matrix-vector multiplication A * b, where A is \(m \times n\) and b is \(n \times 1\), takes \(2mn\) or \(O(mn)\) flops
matrix-matrix multiplication A * B, where A is \(m \times n\) and B is \(n \times p\), takes \(2mnp\) or \(O(mnp)\) flops
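As a quick sanity check of the \(2mn\) count, one can benchmark a matrix-vector multiplication and convert the runtime into an achieved flop rate. The sketch below is our addition; the sizes are illustrative choices, not from the lecture.

using BenchmarkTools, Random

# Benchmark matrix-vector multiplication A * b (illustrative sizes)
Random.seed!(257)
m, n = 2000, 2000
A = randn(m, n)
b = randn(n)
bm = @benchmark $A * $b
# Achieved flop rate: 2mn flops / (runtime in seconds); bm.times is in ns
(2 * m * n) / (minimum(bm.times) / 1e9)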
A hierarchy of computational complexity (let \(n\) be the problem size):
Exponential order: \(O(b^n)\) (NP-hard, "horrible")
Polynomial order: \(O(n^q)\) (doable)
\(O(n \log n)\) (fast)
Linear order: \(O(n)\) (fast)
Logarithmic order: \(O(\log n)\) (super fast)
Consider the difference between \(O(n^2)\) and \(O(n \log n)\) on massive data. Suppose we have a teraflop supercomputer capable of doing \(10^{12}\) flops per second. For a problem of size \(n = 10^{12}\), an \(O(n \log n)\) algorithm takes about
\[10^{12} \log(10^{12}) / 10^{12} \approx 27 \text{ seconds},\]
while an \(O(n^2)\) algorithm takes about \((10^{12})^2 / 10^{12} = 10^{12}\) seconds, which is approximately 31,710 years!
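This back-of-envelope arithmetic is easy to reproduce (log is the natural logarithm, as in the computation above):

# O(n log n) vs O(n^2) on a teraflop machine, n = 10^12
n, rate = 1e12, 1e12
n * log(n) / rate                  # ≈ 27.6 seconds
(n^2 / rate) / (3600 * 24 * 365)   # ≈ 31710 years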
QuickSort and the fast Fourier transform (FFT, invented by statistician John Tukey!) are celebrated algorithms that turn \(O(n^2)\) operations into \(O(n \log n)\). Another example is Strassen's method, which turns \(O(n^3)\) matrix multiplication into \(O(n^{\log_2 7})\).
One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.
The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
For example, compare the flop counts of the two mathematically equivalent expressions (A * B) * x and A * (B * x), where A is \(m \times n\), B is \(n \times p\), and x is a \(p \times 1\) vector: the first costs \(2mnp + 2mp\) flops, while the second costs only \(2np + 2mn\) flops.
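The cell defining A, B, and x is not shown above; a minimal setup along these lines (the size 1000 is our assumption) makes the snippets below runnable.

using Random

# Illustrative data for the two expressions below (n = 1000 is an assumption)
Random.seed!(257)
n = 1000
A = randn(n, n)
B = randn(n, n)
x = randn(n)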
# Fortunately, Julia parses the following code in the more efficient way
# No luck with some other languages
@code_lowered A * B * x
CodeInfo(
1 ─ %1 = B * x
│ %2 = A * %1
└── return %2
)
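To see the cost difference directly, one can time both associations with BenchmarkTools; this benchmark is our addition, not output from the lecture.

using BenchmarkTools

# O(mnp) + O(mp): forms the n × n product A * B first
@btime ($A * $B) * $x;
# O(np) + O(mn): two matrix-vector multiplications
@btime $A * ($B * $x);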
3 Performance of computer systems
FLOPS (floating point operations per second) is a measure of computer performance.
For example, this laptop has the Apple M2 Max CPU with 8 performance cores and 4 efficiency cores.
versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 1 on 8 virtual cores
Environment:
JULIA_EDITOR = code
In Julia, LinearAlgebra.peakflops computes the peak flop rate of the computer by running double-precision matrix-matrix multiplication via BLAS.gemm!. On this machine it reports about 385 GFLOPS in double precision (DP).
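The corresponding call is not shown above; a minimal invocation would look like the following (the problem size 2^14 is our assumption, chosen to match the single-precision benchmark below).

using LinearAlgebra

# Peak double-precision flop rate, measured by timing BLAS.gemm!
LinearAlgebra.peakflops(2^14)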
We roughly estimate the single-precision throughput as (# flops) / (runtime), which gives about 716 GFLOPS in single precision (SP), as computed below.
using BenchmarkTools, LinearAlgebra, Random

# Generate matrix data
Random.seed!(257)
n = 2^14
A = randn(Float32, n, n)
B = randn(Float32, n, n)
C = Matrix{Float32}(undef, n, n)

# Single-precision throughput of C = A * B via mul!
bm = @benchmark mul!(C, A, B)
bm
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 12.285 s (0.00% GC) to evaluate,
with a memory estimate of 0 bytes, over 0 allocations.
# Estimate single-precision throughput by (# flops) / (runtime in seconds);
# bm.times records times in nanoseconds
(2n^3) / (minimum(bm.times) / 1e9)
7.15992391532777e11
In a later lecture, we'll see that graphics processing units (GPUs) offer a much higher throughput in single precision.