GPUs are ubiquitous in modern computers. The table below lists representative NVIDIA GPUs found in today’s typical computer systems.
NVIDIA GPUs           H100 PCIe             RTX 6000             RTX 5000
Computers             servers, cluster      desktop              laptop
Main usage            scientific computing  daily work, gaming   daily work
Memory                80 GB                 48 GB                16 GB
Memory bandwidth      2 TB/sec              960 GB/sec           576 GB/sec
Number of cores       ???                   ???                  ???
Processor clock       ??? GHz               ??? GHz              ??? GHz
Peak DP performance   26 TFLOPS             ??? TFLOPS           ??? TFLOPS
Peak SP performance   51 TFLOPS             91.1 TFLOPS          42.6 TFLOPS
2 GPU architecture vs CPU architecture
GPUs contain thousands of processing cores on a single card; several cards can fit in a desktop PC
Each core carries out the same operations in parallel on different input data – the single program, multiple data (SPMD) paradigm
Extremely high arithmetic intensity can be achieved if one can transfer the data onto, and the results off of, the device quickly (see the back-of-the-envelope calculation below)
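To make the arithmetic-intensity point concrete, here is a back-of-the-envelope calculation (an illustrative sketch, not part of the original notes) for the n × n single-precision matrix multiplication benchmarked later in this section:
# flops vs. bytes moved for an n-by-n single-precision matrix multiply C = A * B
n = 2^14
flops = 2 * n^3                     # one multiply and one add per inner-product term
bytes = 3 * n^2 * sizeof(Float32)   # read A and B, write C, each touched once
flops / bytes                       # ≈ n / 6 ≈ 2700 flops per byte transferred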
3 GPGPU in Julia
GPU support in Julia is under active development. Check JuliaGPU for currently available packages.
There are multiple paradigms for programming GPUs in Julia, depending on the specific hardware.
CUDA is an ecosystem exclusive to Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, …
The CUDA.jl package allows defining arrays on Nvidia GPUs and overloads many common operations.
The AMDGPU.jl package allows defining arrays on AMD GPUs and overloads many common operations.
The Metal.jl package allows defining arrays on Apple Silicon and overloads many common operations.
The oneAPI.jl package allows defining arrays on Intel GPUs and overloads many common operations.
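The usage pattern is the same across these array packages: copy data to the device, operate on it with overloaded array functions, and copy results back to the host. A minimal sketch with CUDA.jl, assuming an Nvidia GPU is available (the Metal.jl session below follows the same pattern):
using CUDA, LinearAlgebra
x  = rand(Float32, 1024, 1024)
xd = CuArray(x)            # host-to-device copy
yd = xd * xd'              # matrix multiply runs on the GPU (cuBLAS)
zd = exp.(xd) .+ 1f0       # broadcasts compile into a fused GPU kernel
y  = Array(yd)             # device-to-host copy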
I’ll illustrate using Metal.jl on my MacBook Pro running macOS Ventura 13.3.1. It has an Apple M2 Max chip with 38 GPU cores.
versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
JULIA_NUM_THREADS = 8
JULIA_EDITOR = code
using BenchmarkTools, LinearAlgebra, Metal, Random

Random.seed!(257)

n = 2^14
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z);
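One caveat before benchmarking (an added note, not part of the original code): operations on GPU arrays are queued asynchronously, so timings must wrap the call in the backend's synchronization macro, Metal.@sync here, to measure the actual GPU execution rather than just the launch. A rough illustration with the arrays defined above (the first call also pays a one-time compilation cost):
# without synchronization we mostly time the kernel launch
@time zd .= xd .+ yd
# with synchronization we wait until the GPU has finished the work
@time Metal.@sync zd .= xd .+ yd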
6.1 Dot product
# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark Metal.@sync dot($xd, $yd)
BenchmarkTools.Trial: 672 samples with 1 evaluation.
Range (min … max): 7.302 ms … 18.474 ms  ┊ GC (min … max): 0.00% … 0.00%
Time (median):     7.435 ms              ┊ GC (median):    0.00%
Time (mean ± σ):   7.441 ms ± 431.791 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
▁ ▁▁ ▁▁▇▂▇▇▇█▅▂▄
▂▁▂▁▁▂▂▃▄▄▅▆▆██▆█████▇▇▄▃▅▃▄▄▄█████████████▇▇▅▅▃▄▃▂▄▃▁▃▁▁▁▃ ▄
7.3 ms Histogram: frequency by time 7.53 ms <
Memory estimate: 25.48 KiB, allocs estimate: 1128.
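The dot product performs only two flops per pair of Float32 values streamed from memory, so it is bandwidth bound rather than compute bound. A rough effective-bandwidth estimate from the benchmark above (an illustrative calculation, not in the original notes):
# effective GPU memory bandwidth in GB/sec: 2n^2 Float32 values read once each
(2 * n^2 * sizeof(Float32)) / (minimum(bm_gpu.times) / 1e9) / 1e9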
# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)
BenchmarkTools.Trial: 145 samples with 1 evaluation.
Range (min … max): 34.249 ms … 35.262 ms  ┊ GC (min … max): 0.00% … 0.00%
Time (median):     34.511 ms              ┊ GC (median):    0.00%
Time (mean ± σ):   34.545 ms ± 224.030 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
▁▃▄█▄ ▁▃▄ ▃ ▁ ▄ ▁ ▃ ▃▁
▆█████▆███▄▆▄▄▇▇▇▁▄█▇▄▇▇▄█▄▇▁█▄▆▄▄▄▄▄▄▆▆▄▄▄▆▄█▇▇█▁▄▁▁▄▁▄██▆▇ ▄
34.2 ms Histogram: frequency by time 34.9 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
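As a quick sanity check (not part of the original benchmarks), the GPU and CPU dot products should agree up to Float32 roundoff, and the speedup can be read off the two benchmark objects:
# relative difference between GPU and CPU results (should be tiny)
abs(dot(xd, yd) - dot(x, y)) / abs(dot(x, y))
# dot-product speedup of GPU over CPU
minimum(bm_cpu.times) / minimum(bm_gpu.times)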
6.2 Matrix multiplication
# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
Range (min … max): 902.111 ms … 941.063 ms  ┊ GC (min … max): 0.00% … 0.00%
Time (median):     910.987 ms               ┊ GC (median):    0.00%
Time (mean ± σ):   917.558 ms ± 16.523 ms   ┊ GC (mean ± σ):  0.00% ± 0.00%
█ █ ██ █ █
█▁▁▁█▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█ ▁
902 ms Histogram: frequency by time 941 ms <
Memory estimate: 800 bytes, allocs estimate: 40.
For this problem size on this machine, we see the GPU achieve a staggering throughput of nearly 10 TFLOPS in single precision!
# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
9.750571600938021e12
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 12.522 s (0.00% GC) to evaluate,
with a memory estimate of 0 bytes, over 0 allocations.
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
7.024372790292876e11
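The speedup quoted below can also be computed directly from the two benchmark objects (a quick check using bm_gpu and bm_cpu from the matrix multiplication benchmarks above):
# matrix-multiplication speedup of GPU over CPU
minimum(bm_cpu.times) / minimum(bm_gpu.times)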
We see a >10x (roughly 14x) speedup from the GPU in this matrix multiplication example.
6.3 Cholesky decomposition
# Cholesky on Gram matrix
# This one doesn't seem to work on the Apple M2 chip yet
# xtxd = xd'xd + I
# @benchmark Metal.@sync cholesky($(xtxd))
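As a workaround (a sketch, assuming only the factorization itself is unsupported on the GPU and that Metal.jl handles the transposed matrix multiply), one can form the Gram matrix on the device and fall back to the CPU for the Cholesky factorization:
# form the Gram matrix on the GPU, copy it back, and factorize on the CPU
xtx = Array(xd'xd)               # device matmul, then device-to-host copy
xtx += I                         # shift the diagonal to guarantee positive definiteness
chol = cholesky(Symmetric(xtx))  # LAPACK Cholesky on the host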