GPU Computing in Julia

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 18, 2023

This lecture introduces GPU computing in Julia.

1 GPGPU

GPUs are ubiquitous in modern computers. The following table lists NVIDIA GPUs found in typical computer systems today.

NVIDIA GPUs           H100 PCIe             RTX 6000             RTX 5000
Computers             servers, cluster      desktop              laptop
Main usage            scientific computing  daily work, gaming   daily work
Memory                80 GB                 48 GB                16 GB
Memory bandwidth      2 TB/sec              960 GB/sec           576 GB/sec
Number of cores       ???                   ???                  ???
Processor clock       ??? GHz               ??? GHz              ??? GHz
Peak DP performance   26 TFLOPS             ??? TFLOPS           ??? TFLOPS
Peak SP performance   51 TFLOPS             91.1 TFLOPS          42.6 TFLOPS

2 GPU architecture vs CPU architecture

  • GPUs contain thousands of processing cores on a single card; several cards can fit in a desktop PC

  • Each core carries out the same operations in parallel on different input data – the single program, multiple data (SPMD) paradigm (see the sketch after the figures below)

  • Extremely high arithmetic intensity is achievable, provided one can transfer data onto, and results off of, the processors quickly

[Figures: Intel i7 die vs. NVIDIA Fermi die; Einstein vs. Rain Man (CPU vs. GPU analogy)]
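
This SPMD picture maps directly onto Julia’s dot broadcasting: one scalar program applied element-by-element to many data. A minimal CPU-side sketch (the GPU array types introduced below reuse exactly this syntax, with different cores handling different elements); the function f is a hypothetical example:

f(x) = sin(x) + cos(x)   # the "single program": a scalar kernel
a = rand(Float32, 10^6)  # the "multiple data"
b = f.(a)                # broadcasting applies f to every element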

3 GPGPU in Julia

GPU support in Julia is under active development. Check the JuliaGPU organization for currently available packages.

There are multiple paradigms to program GPUs in Julia, depending on the specific hardware; all of them expose a similar array interface (see the sketch after the following list).

  • CUDA is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, …

    The CUDA.jl package allows defining arrays on Nvidia GPUs and overloads many common operations.

  • The AMDGPU.jl package allows defining arrays on AMD GPUs and overloads many common operations.

  • The Metal.jl package allows defining arrays on Apple Silicon and overloads many common operations.

  • The oneAPI.jl package allows defining arrays on Intel GPUs and overloads many common operations.
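
All four packages follow the same array-programming pattern: wrap host data in a device array type, then compute with ordinary Julia syntax. A minimal sketch of that common pattern, written for the Metal.jl backend used below (substitute CuArray, ROCArray, or oneArray for the other vendors):

using Metal              # or: CUDA, AMDGPU, oneAPI
a = rand(Float32, 1024)  # ordinary CPU array
ad = MtlArray(a)         # move data to the GPU
bd = 2f0 .* ad .+ 1f0    # broadcast compiles to and runs as a GPU kernel
b = Array(bd)            # move the result back to the CPU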

I’ll illustrate using Metal.jl on my MacBook Pro running macOS Ventura 13.3.1. It has an Apple M2 Max chip with 38 GPU cores.

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/09-juliagpu`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/09-juliagpu/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [bdcacae8] LoopVectorization v0.12.157
  [dde4c033] Metal v0.3.0
  [37e2e46d] LinearAlgebra

4 Query GPU devices in the system

using Metal

Metal.versioninfo()
macOS 13.3.0, Darwin 21.5.0

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1

1 device:
- Apple M2 Max (64.000 KiB allocated)
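
Individual devices can also be queried programmatically. A minimal sketch, assuming the devices() and device() helpers provided by Metal.jl (names may vary across Metal.jl versions):

Metal.devices()  # list all Metal-capable devices in the system
Metal.device()   # the device on which new MtlArrays are allocated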

5 Transfer data between main memory and GPU

# generate SP data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = MtlArray(x)
3×3 MtlMatrix{Float32}:
 0.940656   0.585455    0.374012
 0.0798393  0.109429    0.3644
 0.468515   0.00697732  0.404183
# generate array on GPU directly
# yd = Metal.ones(3, 3)
yd = MtlArray(ones(Float32, 3, 3))
3×3 MtlMatrix{Float32}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
# collect data from GPU to CPU
x = collect(xd)
3×3 Matrix{Float32}:
 0.940656   0.585455    0.374012
 0.0798393  0.109429    0.3644
 0.468515   0.00697732  0.404183
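
Transfers across the CPU-GPU interconnect are expensive relative to on-device arithmetic, so the usual pattern is to move data once, compute on the device, and collect only the final result. A minimal sketch of that round trip:

xd = MtlArray(rand(Float32, 3, 3))  # one CPU-to-GPU transfer
yd = xd .+ 1f0                      # computed entirely on the GPU
y = Array(yd)                       # one GPU-to-CPU transfer (equivalent to collect)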

6 Linear algebra

using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)
n = 2^14
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z);
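
Note that everything here is single precision (Float32). Apple’s Metal API does not support Float64, so double-precision arrays cannot be placed on this GPU.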

6.1 Dot product

# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark Metal.@sync dot($xd, $yd)
BenchmarkTools.Trial: 672 samples with 1 evaluation.
 Range (minmax):  7.302 ms 18.474 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.435 ms                GC (median):    0.00%
 Time  (mean ± σ):   7.441 ms ± 431.791 μs   GC (mean ± σ):  0.00% ± 0.00%
               ▁     ▁▁          ▁▁▇▇█▅▂▄                   
  ▂▁▂▁▁▂▂▃▄▄▅▆▆██▆█████▇▇▄▃▅▃▄▄▄██████████▇▇▅▅▃▄▃▂▄▃▁▃▁▁▁▃ ▄
  7.3 ms          Histogram: frequency by time        7.53 ms <
 Memory estimate: 25.48 KiB, allocs estimate: 1128.
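(The small host-side memory estimate most likely comes from encoding and launching the GPU kernel; the array data itself never leaves the device.)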
# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)
BenchmarkTools.Trial: 145 samples with 1 evaluation.
 Range (minmax):  34.249 ms 35.262 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     34.511 ms                GC (median):    0.00%
 Time  (mean ± σ):   34.545 ms ± 224.030 μs   GC (mean ± σ):  0.00% ± 0.00%
   ▁▃▄█▄ ▁▃▄         ▃        ▄               ▁  ▃       ▃▁    
  ▆█████▆███▄▆▄▄▇▇▇▁▄█▇▄▇▄▄▇▁█▄▆▄▄▄▄▄▄▆▆▄▄▄▆▄█▇▇█▁▄▁▁▄▁▄██▆▇ ▄
  34.2 ms         Histogram: frequency by time         34.9 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
4.641776054702962
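
As a sanity check, the device and host results should agree up to single-precision rounding. A minimal sketch, relying on the fact that dot on GPU arrays returns a scalar on the host:

# compare GPU and CPU dot products; loose tolerance for a Float32 reduction over n^2 terms
isapprox(dot(x, y), dot(xd, yd); rtol = 1f-3)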

6.2 Broadcast

# SP broadcast on GPU: z .= x .* y
# why is there allocation?
bm_gpu = @benchmark Metal.@sync $zd .= $xd .* $yd
BenchmarkTools.Trial: 138 samples with 1 evaluation.
 Range (minmax):  36.298 ms 38.745 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     36.338 ms                GC (median):    0.00%
 Time  (mean ± σ):   36.417 ms ± 285.914 μs   GC (mean ± σ):  0.00% ± 0.00%
  ▅█                                                         
  ███▇▄▄▇▁▄▁▁▄▆▄▁▆▁▄▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▄ ▄
  36.3 ms       Histogram: log(frequency) by time      37.8 ms <
 Memory estimate: 6.88 KiB, allocs estimate: 281.
# SP broadcast on CPU: z .= x .* y
bm_cpu = @benchmark $z .= $x .* $y
BenchmarkTools.Trial: 141 samples with 1 evaluation.
 Range (minmax):  34.799 ms 36.508 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     35.513 ms                GC (median):    0.00%
 Time  (mean ± σ):   35.507 ms ± 187.628 μs   GC (mean ± σ):  0.00% ± 0.00%
                          ▃▇▇ ▂                                
  ▃▁▁▁▁▁▃▁▃▁▁▃▃▃▃▄▃▄▁▃▆▅▇▇█████▄▇▆▄▄▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  34.8 ms         Histogram: frequency by time         36.4 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
0.9772861479998408
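
There is essentially no GPU speedup here: an elementwise multiply does very little arithmetic per byte moved, so both the CPU and the GPU are limited by memory bandwidth rather than by compute.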

6.3 Matrix multiplication

# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (minmax):  902.111 ms941.063 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     910.987 ms                GC (median):    0.00%
 Time  (mean ± σ):   917.558 ms ±  16.523 ms   GC (mean ± σ):  0.00% ± 0.00%
  █   █                                    █       █  
  █▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█ ▁
  902 ms           Histogram: frequency by time          941 ms <
 Memory estimate: 800 bytes, allocs estimate: 40.

For this problem size on this machine, the GPU achieves a staggering throughput of nearly 10 TFLOPS in single precision. (An n × n matrix multiplication costs 2n^3 flops, hence the numerator in the calculation below.)

# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
9.750571600938021e12
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 12.522 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
7.024372790292876e11

We see a >10x speedup from the GPU in this matrix multiplication example.

# cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
# xtxd = xd'xd + I
# @benchmark Metal.@sync cholesky($(xtxd))
# xtx = collect(xtxd)
# @benchmark cholesky($(Symmetric(xtx)))

A GPU speedup for Cholesky is unavailable in Metal.jl at the moment.

7 Evaluation of elementary and special functions on GPU

# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark Metal.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu
BenchmarkTools.Trial: 125 samples with 1 evaluation.
 Range (minmax):  39.883 ms 48.114 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     39.916 ms                GC (median):    0.00%
 Time  (mean ± σ):   40.051 ms ± 745.504 μs   GC (mean ± σ):  0.00% ± 0.00%
   █▇                                                          
  ▅██▅▃▃▁▁▂▂▃▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂▂▁▁▂▂▃▃▂▂▁▃▁▁▁▁▃▁▁▁▃▁▁▂▁▁▁▁▁▁▁▃ ▂
  39.9 ms         Histogram: frequency by time         40.6 ms <
 Memory estimate: 8.13 KiB, allocs estimate: 310.
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (minmax):  2.450 s 2.454 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.453 s              GC (median):    0.00%
 Time  (mean ± σ):   2.452 s ± 2.005 ms   GC (mean ± σ):  0.00% ± 0.00%
                                       █                █  
  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.45 s        Histogram: frequency by time        2.45 s <
 Memory estimate: 0 bytes, allocs estimate: 0.
# Speed up
median(bm_cpu.times) / median(bm_gpu.times)
61.45153880560752

GPUs bring a great speedup (>50x here) to the massive evaluation of elementary math functions.
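
Part of why these elementwise chains run so well is that Julia fuses the entire dotted expression into a single kernel, so intermediates such as sin.(x) are never materialized. The @. macro is an equivalent, often more readable, spelling:

# equivalent fused form: one GPU kernel, no temporary arrays
@. zd = log(yd + sin(xd))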