Conjugate Gradient and Krylov Subspace Methods

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

May 4, 2023

System information (for reproducibility):

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/18-cg`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/18-cg/Project.toml`
  [2169fc97] AlgebraicMultigrid v0.5.1
  [7d9fca2a] Arpack v0.5.4
  [6e4b80f9] BenchmarkTools v1.3.2
  [42fd0dbc] IterativeSolvers v0.9.2
  [7a12625a] LinearMaps v3.10.0
  [b51810bb] MatrixDepot v1.0.10
  [af69fa37] Preconditioners v0.6.0
  [b8865327] UnicodePlots v3.5.2
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [2f01184e] SparseArrays

1 Introduction

  • Conjugate gradient is the top-notch iterative method for solving large, structured linear systems \(\mathbf{A} \mathbf{x} = \mathbf{b}\), where \(\mathbf{A}\) is positive definite (pd).
    Earlier we talked about Jacobi, Gauss-Seidel, and successive over-relaxation (SOR) as the classical iterative solvers. They are rarely used in practice due to slow convergence.

    Kershaw’s results for a fusion problem.

Method                                    Number of Iterations
Gauss-Seidel                              208,000
Block SOR methods                         765
Incomplete Cholesky conjugate gradient    25
  • History: Hestenes (UCLA professor!) and Stiefel proposed the conjugate gradient method in the 1950s.

Hestenes and Stiefel (1952), Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards.

  • Solving the linear equation \(\mathbf{A} \mathbf{x} = \mathbf{b}\), where \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is pd, is equivalent to \[ \begin{eqnarray*} \text{minimize} \,\, f(\mathbf{x}) = \frac 12 \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}. \end{eqnarray*} \] Denote \(\nabla f(\mathbf{x}) = \mathbf{A} \mathbf{x} - \mathbf{b} =: r(\mathbf{x})\); setting this gradient to zero recovers \(\mathbf{A} \mathbf{x} = \mathbf{b}\), and positive definiteness of \(\mathbf{A}\) makes the stationary point the unique minimizer.

2 Conjugate gradient (CG) method

  • Consider a simple idea: coordinate descent, that is, update the components \(x_j\) one at a time. This is the same as the Gauss-Seidel iteration. It usually takes too many iterations.

  • A set of vectors \(\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}\) is said to be conjugate with respect to \(\mathbf{A}\) if \[ \begin{eqnarray*} \mathbf{p}^{(i)T} \mathbf{A} \mathbf{p}^{(j)} = 0, \quad \text{for all } i \ne j. \end{eqnarray*} \] For example, eigenvectors of \(\mathbf{A}\) are conjugate to each other. Why?
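    (Why: a symmetric \(\mathbf{A}\) admits an orthogonal set of eigenvectors \(\mathbf{v}_1, \ldots, \mathbf{v}_n\); if \(\mathbf{A} \mathbf{v}_j = \lambda_j \mathbf{v}_j\), then \(\mathbf{v}_i^T \mathbf{A} \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0\) for \(i \ne j\).)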

  • Conjugate direction method: Given a set of conjugate vectors \(\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}\), at iteration \(t\), we search along the conjugate direction \(\mathbf{p}^{(t)}\) \[ \begin{eqnarray*} \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}, \end{eqnarray*} \] where \[ \begin{eqnarray*} \alpha^{(t)} = - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}} \end{eqnarray*} \] is the optimal step length.

  • Theorem: In conjugate direction method, \(\mathbf{x}^{(t)}\) converges to the solution in at most \(n\) steps.

    Intuition: in the basis of conjugate directions the quadratic \(f\) decouples into \(n\) one-dimensional problems, so each exact line search settles one direction for good.

  • Conjugate gradient method. Idea: generate \(\mathbf{p}^{(t)}\) using only \(\mathbf{p}^{(t-1)}\) \[ \begin{eqnarray*} \mathbf{p}^{(t)} = - \mathbf{r}^{(t)} + \beta^{(t)} \mathbf{p}^{(t-1)}, \end{eqnarray*} \] where \(\beta^{(t)}\) is determined by the conjugacy condition \(\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t)} = 0\) \[ \begin{eqnarray*} \beta^{(t)} = \frac{\mathbf{r}^{(t)T} \mathbf{A} \mathbf{p}^{(t-1)}}{\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t-1)}}. \end{eqnarray*} \]

  • CG algorithm (preliminary version):

    1. Given \(\mathbf{x}^{(0)}\)
    2. Initialize: \(\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}\), \(\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}\), \(t=0\)
    3. While \(\mathbf{r}^{(t)} \ne \mathbf{0}\)
      1. \(\alpha^{(t)} \gets - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}\)
      2. \(\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}\)
      3. \(\mathbf{r}^{(t+1)} \gets \mathbf{A} \mathbf{x}^{(t+1)} - \mathbf{b}\)
      4. \(\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{A} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}\)
      5. \(\mathbf{p}^{(t+1)} \gets - \mathbf{r}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}\)
      6. \(t \gets t+1\)

    Remark: The initial conjugate direction \(\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}\) is crucial.

  • Theorem: With CG algorithm

    1. The residuals \(\mathbf{r}^{(0)}, \ldots, \mathbf{r}^{(t)}\) are mutually orthogonal.
    2. \(\{\mathbf{r}^{(0)},\ldots,\mathbf{r}^{(t)}\}\) is contained in the Krylov subspace of degree \(t\) for \(\mathbf{r}^{(0)}\), denoted by \[ \begin{eqnarray*} {\cal K}(\mathbf{r}^{(0)}; t) = \text{span} \{\mathbf{r}^{(0)},\mathbf{A} \mathbf{r}^{(0)}, \mathbf{A}^2 \mathbf{r}^{(0)}, \ldots, \mathbf{A}^{t} \mathbf{r}^{(0)}\}. \end{eqnarray*} \]
    3. \(\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(t)}\}\) is contained in \({\cal K}(\mathbf{r}^{(0)}; t)\).
    4. \(\mathbf{p}^{(0)}, \ldots, \mathbf{p}^{(t)}\) are conjugate with respect to \(\mathbf{A}\).
      Consequently, the iterates \(\mathbf{x}^{(t)}\) converge to the solution in at most \(n\) steps.
  • CG algorithm (economical version): saves one matrix-vector multiplication per iteration. (A minimal Julia sketch of this version appears at the end of this section.)

    1. Given \(\mathbf{x}^{(0)}\)
    2. Initialize: \(\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}\), \(\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}\), \(t=0\)
    3. While \(\mathbf{r}^{(t)} \ne \mathbf{0}\)
      1. \(\alpha^{(t)} \gets \frac{\mathbf{r}^{(t)T} \mathbf{r}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}\)
      2. \(\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}\)
      3. \(\mathbf{r}^{(t+1)} \gets \mathbf{r}^{(t)} + \alpha^{(t)} \mathbf{A} \mathbf{p}^{(t)}\)
      4. \(\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{r}^{(t+1)}}{\mathbf{r}^{(t)T} \mathbf{r}^{(t)}}\)
      5. \(\mathbf{p}^{(t+1)} \gets - \mathbf{r}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}\)
      6. \(t \gets t+1\)
  • Computation cost per iteration is one matrix-vector multiplication: \(\mathbf{A} \mathbf{p}^{(t)}\).
    Consider the PageRank problem: \(\mathbf{A}\) has dimension \(n \approx 10^{10}\) but is highly structured (sparse + low rank), so each matrix-vector multiplication takes only \(O(n)\) operations.

  • Theorem: If \(\mathbf{A}\) has \(r\) distinct eigenvalues, \(\mathbf{x}^{(t)}\) converges to the solution \(\mathbf{x}^*\) in at most \(r\) steps.
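
A minimal Julia sketch of the economical CG algorithm above (the function name cg_naive and the tol/maxiter defaults are illustrative, not part of any package; for real problems use cg from IterativeSolvers.jl as in the examples below):

using LinearAlgebra

# naive implementation of the economical CG algorithm (illustrative sketch)
function cg_naive(A, b; x = zeros(length(b)), tol = 1e-8, maxiter = length(b))
    r = A * x - b                        # initial residual r⁰ = A x⁰ - b
    p = -r                               # initial conjugate direction
    rsold = dot(r, r)
    for t in 1:maxiter
        Ap = A * p                       # the only matrix-vector multiplication per iteration
        α  = rsold / dot(p, Ap)          # optimal step length
        x .+= α .* p                     # update iterate
        r .+= α .* Ap                    # update residual without recomputing A * x
        rsnew = dot(r, r)
        sqrt(rsnew) < tol && break       # stop when the residual is small
        p .= -r .+ (rsnew / rsold) .* p  # new conjugate direction
        rsold = rsnew
    end
    return x
end

On a small pd test matrix, cg_naive(A, b) should agree with the direct solve A \ b up to the tolerance.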

3 Pre-conditioned conjugate gradient (PCG)

  • Summary of conjugate gradient method for solving \(\mathbf{A} \mathbf{x} = \mathbf{b}\) or equivalently minimizing \(\frac 12 \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}\):

    • Each iteration needs one matrix-vector multiplication: \(\mathbf{A} \mathbf{p}^{(t)}\). For structured \(\mathbf{A}\), often \(O(n)\) cost per iteration.
    • Guaranteed to converge in \(n\) steps.
  • Two important bounds for conjugate gradient algorithm:

    Let \(\lambda_1 \le \cdots \le \lambda_n\) be the ordered eigenvalues of a pd \(\mathbf{A}\).
    \[ \begin{eqnarray*} \|\mathbf{x}^{(t+1)} - \mathbf{x}^*\|_{\mathbf{A}}^2 &\le& \left( \frac{\lambda_{n-t} - \lambda_1}{\lambda_{n-t} + \lambda_1} \right)^2 \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}}^2 \\ \|\mathbf{x}^{(t+1)} - \mathbf{x}^*\|_{\mathbf{A}}^2 &\le& 2 \left( \frac{\sqrt{\kappa(\mathbf{A})}-1}{\sqrt{\kappa(\mathbf{A})}+1} \right)^{t} \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}}^2, \end{eqnarray*} \] where \(\kappa(\mathbf{A}) = \lambda_n/\lambda_1\) is the condition number of \(\mathbf{A}\).

  • Messages:

    • Roughly speaking, if the eigenvalues of \(\mathbf{A}\) occur in \(r\) distinct clusters, the CG iterates will approximately solve the problem after \(O(r)\) steps.
    • CG converges fast when \(\mathbf{A}\) has a small condition number (\(\lambda_1 \approx \lambda_n\)).
  • Pre-conditioning: Change of variables \(\widehat{\mathbf{x}} = \mathbf{C} \mathbf{x}\) via a nonsingular \(\mathbf{C}\) and solve \[ (\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}) \widehat{\mathbf{x}} = \mathbf{C}^{-T} \mathbf{b}. \] Choose \(\mathbf{C}\) such that

    • \(\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}\) has small condition number, or
    • \(\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}\) has clustered eigenvalues
    • Inexpensive solution of \(\mathbf{C}^T \mathbf{C} \mathbf{y} = \mathbf{r}\)
  • Preconditioned CG does not make use of \(\mathbf{C}\) explicitly, but rather the matrix \(\mathbf{M} = \mathbf{C}^T \mathbf{C}\).
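
    To see why, run plain CG on the transformed system and map the quantities back to the original variables via \(\mathbf{x} = \mathbf{C}^{-1} \widehat{\mathbf{x}}\), \(\mathbf{r} = \mathbf{C}^T \widehat{\mathbf{r}} = \mathbf{A} \mathbf{x} - \mathbf{b}\), and \(\mathbf{p} = \mathbf{C}^{-1} \widehat{\mathbf{p}}\). For instance, the step length becomes \[ \begin{eqnarray*} \widehat{\alpha} = \frac{\widehat{\mathbf{r}}^T \widehat{\mathbf{r}}}{\widehat{\mathbf{p}}^T (\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}) \widehat{\mathbf{p}}} = \frac{\mathbf{r}^T \mathbf{M}^{-1} \mathbf{r}}{\mathbf{p}^T \mathbf{A} \mathbf{p}}, \end{eqnarray*} \] so \(\mathbf{C}\) only ever appears through solves with \(\mathbf{M} = \mathbf{C}^T \mathbf{C}\).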

  • Preconditioned CG (PCG) algorithm:

    1. Given \(\mathbf{x}^{(0)}\), pre-conditioner \(\mathbf{M}\)
    2. \(\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}\)
    3. solve \(\mathbf{M} \mathbf{y}^{(0)} = \mathbf{r}^{(0)}\) for \(\mathbf{y}^{(0)}\)
    4. \(\mathbf{p}^{(0)} \gets - \mathbf{y}^{(0)}\), \(t=0\)
    5. While \(\mathbf{r}^{(t)} \ne \mathbf{0}\)
      1. \(\alpha^{(t)} \gets \frac{\mathbf{r}^{(t)T} \mathbf{y}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}\)
      2. \(\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}\)
      3. \(\mathbf{r}^{(t+1)} \gets \mathbf{r}^{(t)} + \alpha^{(t)} \mathbf{A} \mathbf{p}^{(t)}\)
      4. Solve \(\mathbf{M} \mathbf{y}^{(t+1)} = \mathbf{r}^{(t+1)}\) for \(\mathbf{y}^{(t+1)}\)
      5. \(\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{y}^{(t+1)}}{\mathbf{r}^{(t)T} \mathbf{y}^{(t)}}\)
      6. \(\mathbf{p}^{(t+1)} \gets - \mathbf{y}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}\)
      7. \(t \gets t+1\)

    Remark: Only extra cost in the pre-conditioned CG algorithm is the need to solve the linear system \(\mathbf{M} \mathbf{y} = \mathbf{r}\).
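
A minimal Julia sketch of the PCG iteration above (illustrative names, not a package implementation; in practice use cg(A, b, Pl = ...) from IterativeSolvers.jl as in the next example). Here M is assumed to be any object supporting M \ r, e.g. a Diagonal matrix or a Cholesky factorization:

using LinearAlgebra

# naive PCG following the algorithm above (illustrative sketch)
function pcg_naive(A, b, M; x = zeros(length(b)), tol = 1e-8, maxiter = length(b))
    r = A * x - b
    y = M \ r                            # solve M y = r
    p = -y
    rzold = dot(r, y)
    for t in 1:maxiter
        Ap = A * p
        α  = rzold / dot(p, Ap)
        x .+= α .* p
        r .+= α .* Ap
        norm(r) < tol && break
        y = M \ r                        # the only extra cost per iteration
        rznew = dot(r, y)
        p .= -y .+ (rznew / rzold) .* p  # new conjugate direction
        rzold = rznew
    end
    return x
end

For example, pcg_naive(A, b, Diagonal(A)) corresponds to the diagonal (Jacobi) pre-conditioner used below.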

  • Pre-conditioning is more like an art than science. Some choices include

    • Incomplete Cholesky. \(\mathbf{A} \approx \tilde{\mathbf{L}} \tilde{\mathbf{L}}^T\), where \(\tilde{\mathbf{L}}\) is a sparse approximate Cholesky factor. Then \(\tilde{\mathbf{L}}^{-1} \mathbf{A} \tilde{\mathbf{L}}^{-T} \approx \mathbf{I}\) (perfectly conditioned) and \(\mathbf{M} \mathbf{y} = \tilde{\mathbf{L}} \tilde {\mathbf{L}}^T \mathbf{y} = \mathbf{r}\) is easy to solve.
    • Banded pre-conditioners.
    • Choose \(\mathbf{M}\) as a coarsened version of \(\mathbf{A}\).
    • Subject knowledge. Knowledge about the structure and origin of a problem is often the key to devising an efficient pre-conditioner. For example, see the work of Stein, Chen, and Anitescu (2012) on pre-conditioning large covariance matrices. http://epubs.siam.org/doi/abs/10.1137/110834469

3.1 Example of PCG

Preconditioners.jl wraps a bunch of preconditioners.

We use the Wathen matrix (sparse and positive definite) as a test matrix.

using BenchmarkTools, MatrixDepot, IterativeSolvers, LinearAlgebra, SparseArrays

# Wathen matrix of dimension 30401 x 30401
A = matrixdepot("wathen", 100)
[ Info: verify download of index files...
[ Info: reading database
[ Info: adding metadata...
[ Info: adding svd data...
[ Info: writing database
[ Info: used remote sites are sparse.tamu.edu with MAT index and math.nist.gov with HTML index
30401×30401 SparseMatrixCSC{Float64, Int64} with 471601 stored entries:
⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦
using UnicodePlots
spy(A)
          ┌──────────────────────────────────────────┐    
        1 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ > 0
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ < 0
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
   30 401 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
          └──────────────────────────────────────────┘1⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀30 401⠀    
          ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀471 601 ≠ 0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
# sparsity level
count(!iszero, A) / length(A)
0.0005102687577359558
# rhs
b = ones(size(A, 1))
# solve Ax=b by CG
xcg = cg(A, b);
@benchmark cg($A, $b)
BenchmarkTools.Trial: 37 samples with 1 evaluation.
 Range (min … max):  135.707 ms … 162.602 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     136.210 ms                GC (median):    0.00%
 Time  (mean ± σ):   137.365 ms ±   4.789 ms   GC (mean ± σ):  0.00% ± 0.00%
  █   ▄▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  136 ms           Histogram: frequency by time          163 ms <
 Memory estimate: 951.14 KiB, allocs estimate: 16.

3.1.1 Diagonal preconditioner

Compute the diagonal preconditioner:

using Preconditioners

# Diagonal preconditioner
@time p = DiagonalPreconditioner(A)
dump(p)
  0.017532 seconds (59.32 k allocations: 4.745 MiB, 96.44% compilation time)
DiagonalPreconditioner{Float64, Vector{Float64}}
  D: Array{Float64}((30401,)) [10.22544568279155, 54.5357103082216, 12.826351887882891, 13.871499760487152, 7.031155075854652, 23.627993977404333, 12.034480062736154, 40.55589969052182, 18.89433417600622, 60.21388258151134  …  10.432731311110011, 14.590070405614327, 67.38097751883308, 22.720770184399075, 53.796463464628665, 18.271684760432244, 43.652521924343304, 12.531173817675096, 23.180405103257222, 4.346325956860729]
# solve Ax=b by PCG
xpcg = cg(A, b, Pl = p)
# same answer?
norm(xcg - xpcg)
5.398453019613479e-7
# PCG with diagonal preconditioner is about 10-fold faster than CG (median ≈13 ms vs ≈136 ms)
@benchmark cg($A, $b, Pl = $p)
BenchmarkTools.Trial: 394 samples with 1 evaluation.
 Range (min … max):  12.435 ms … 14.974 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     12.660 ms                GC (median):    0.00%
 Time  (mean ± σ):   12.705 ms ± 256.504 μs   GC (mean ± σ):  0.12% ± 1.18%
     ▃▆▇█▃▃▂                                                  
  ▄▆▆████████▅▄▆▅▄▄▁▆▁▁▄▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄ ▆
  12.4 ms       Histogram: log(frequency) by time      14.4 ms <
 Memory estimate: 951.14 KiB, allocs estimate: 16.

3.1.2 Incomplete Cholesky preconditioner

Compute the incomplete Cholesky preconditioner:

@time p = CholeskyPreconditioner(A, 2)
dump(p)
  0.324356 seconds (2.18 M allocations: 127.970 MiB, 4.41% gc time, 94.51% compilation time)
CholeskyPreconditioner{LimitedLDLFactorizations.LimitedLDLFactorization{Float64, Int64, Vector{Int64}, Vector{Int64}}}
  ldlt: LimitedLDLFactorizations.LimitedLDLFactorization{Float64, Int64, Vector{Int64}, Vector{Int64}}
    __factorized: Bool true
    n: Int64 30401
    colptr: Array{Int64}((30402,)) [1, 13, 25, 37, 54, 65, 77, 89, 101, 118  …  265018, 265021, 265025, 265028, 265032, 265035, 265037, 265038, 265038, 265038]
    rowind: Array{Int64}((281402,)) [3, 4, 5, 11, 12, 1557, 1558, 1559, 1599, 4639  …  30394, 30395, 30396, 30396, 30397, 30398, 30398, 30399, 30400, 30400]
    Lrowind: SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
      parent: Array{Int64}((281402,)) [3, 4, 5, 11, 12, 1557, 1558, 1559, 1599, 4639  …  30394, 30395, 30396, 30396, 30397, 30398, 30398, 30399, 30400, 30400]
      indices: Tuple{UnitRange{Int64}}
        1: UnitRange{Int64}
          start: Int64 1
          stop: Int64 265037
      offset1: Int64 0
      stride1: Int64 1
    lvals: Array{Float64}((281402,)) [0.5231042887635314, -0.1875, 0.10189571123646846, 0.08151656898917478, -0.040758284494587384, 0.5231042887635315, -0.18749999999999997, -0.040758284494587384, 0.10189571123646846, -0.2092417155054126  …  -0.17192378697755945, -0.1888058368108667, 0.10581078637004135, -0.22960757659012265, -0.11535093883371636, 0.034455993692922716, -0.12663483184497698, -0.20831489235167305, 0.12034152957261698, -0.21809675337242582]
    Lnzvals: SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
      parent: Array{Float64}((281402,)) [0.5231042887635314, -0.1875, 0.10189571123646846, 0.08151656898917478, -0.040758284494587384, 0.5231042887635315, -0.18749999999999997, -0.040758284494587384, 0.10189571123646846, -0.2092417155054126  …  -0.17192378697755945, -0.1888058368108667, 0.10581078637004135, -0.22960757659012265, -0.11535093883371636, 0.034455993692922716, -0.12663483184497698, -0.20831489235167305, 0.12034152957261698, -0.21809675337242582]
      indices: Tuple{UnitRange{Int64}}
        1: UnitRange{Int64}
          start: Int64 1
          stop: Int64 265037
      offset1: Int64 0
      stride1: Int64 1
    nnz_diag: Int64 30401
    adiag: Array{Float64}((30401,)) [10.22544568279155, 54.5357103082216, 12.826351887882891, 13.871499760487152, 7.031155075854652, 23.627993977404333, 12.034480062736154, 40.55589969052182, 18.89433417600622, 60.21388258151134  …  10.432731311110011, 14.590070405614327, 67.38097751883308, 22.720770184399075, 53.796463464628665, 18.271684760432244, 43.652521924343304, 12.531173817675096, 23.180405103257222, 4.346325956860729]
    D: Array{Float64}((30401,)) [75.00968538766729, 79.28803789535466, 59.62498031365897, 22.949266214514232, 50.60628284427254, 36.75761101860171, 78.33305246002078, 34.86264016145856, 17.27517849252949, 48.75799934815756  …  28.782792441979897, 34.33421950635495, 16.447573622258805, 36.327469417210914, 10.945522146089326, 11.293138280275826, 8.854221803177621, 25.99847665656258, 6.449420388294696, 19.312488612406398]
    P: Array{Int64}((30401,)) [5112, 5414, 5290, 5291, 5292, 5114, 5416, 5294, 5295, 5296  …  29131, 29261, 29429, 29563, 29731, 29865, 30033, 30167, 30335, 19305]
    α: Float64 0.0
    α_increase_factor: Float64 10.0
    α_out: Float64 0.0
    memory: Int64 2
    Pinv: Array{Int64}((30401,)) [22084, 22085, 22100, 22092, 22094, 22093, 22170, 22010, 22012, 22011  …  6621, 6624, 6622, 6659, 6634, 6642, 6635, 6640, 6637, 6638]
    wa1: Array{Float64}((30401,)) [103.69296599267771, 106.44816995776756, 110.33169793814027, 53.3764134358725, 99.75060865171298, 52.38804878006546, 100.57304424859754, 61.178790281202524, 39.935584789491216, 93.08817154191605  …  86.25209476169196, 102.41490150032965, 54.87554684589229, 104.8838993162198, 37.10541247414502, 24.72310285309287, 30.787411382686372, 84.36228558059938, 28.087650385878014, 39.1592873847572]
    s: Array{Float64}((30401,)) [0.09820313482278212, 0.09692390432842846, 0.09520282796197492, 0.13687536986346988, 0.1001249293950945, 0.13816049749365397, 0.09971470344844803, 0.1278496541068593, 0.15824134871852277, 0.10364604868319821  …  0.1076750728623675, 0.09881398731434961, 0.13499278879235466, 0.0976440057297431, 0.16416530132383536, 0.20111687498358188, 0.18022432706761254, 0.10887441421031661, 0.18868713690339678, 0.1598021470338802]
    w: Array{Float64}((30401,)) [0.07133070293419384, 0.030722517422547106, 0.148526060651414, -0.13896315848176116, -0.19193396037159913, 0.03396652202626995, 0.08119181932892428, 0.40147227326691004, -0.10650960441032821, -0.22198751251497612  …  -0.0755470567316685, 0.08539740502292292, -0.062307730383196605, -0.013189064080058314, -0.02627564441741181, -0.06844204226109589, -0.10184966947092891, 0.0050825490642796185, -0.036405969309298646, 0.003202680142822931]
    indr: Array{Int64}((30401,)) [30400, 30400, 30400, 30400, 30398, 30398, 30396, 30394, 30396, 30346  …  220591, 220592, 220594, 220595, 220597, 220598, 220600, 220601, 220601, 220601]
    indf: Array{Int64}((30401,)) [11, 23, 35, 52, 63, 75, 87, 99, 116, 127  …  265016, 265019, 265023, 265026, 265030, 265033, 265035, 0, 0, 0]
    list: Array{Int64}((30401,)) [4, 12, 4, 4637, 3, 9, 8, 12, 482, 68  …  7373, 30388, 18410, 18409, 7345, 0, 7339, 30398, 0, 0]
    pos: Array{Int64}((30401,)) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  30392, 30393, 30394, 30395, 30396, 30397, 30398, 30399, 30400, 30401]
    neg: Array{Int64}((0,)) Int64[]
    computed_posneg: Bool false
  memory: Int64 2

Pre-conditioned conjugate gradient:

# solve Ax=b by PCG
xpcg = cg(A, b, Pl = p)
# same answer?
norm(xcg - xpcg)
5.402457562263577e-7
# PCG with incomplete Cholesky preconditioner is about 10-fold faster than CG (median ≈14 ms vs ≈136 ms)
@benchmark cg($A, $b, Pl = $p)
BenchmarkTools.Trial: 361 samples with 1 evaluation.
 Range (min … max):  13.589 ms … 16.957 ms   GC (min … max): 0.00% … 18.78%
 Time  (median):     13.835 ms                GC (median):    0.00%
 Time  (mean ± σ):   13.876 ms ± 298.116 μs   GC (mean ± σ):  0.18% ±  1.61%
          ▁▅▆█                                               
  ▃▃▃▄▄▅▆▇████▇▅▄▃▄▃▃▂▃▃▁▃▂▂▂▁▁▁▁▁▁▁▁▁▃▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▂▂ ▃
  13.6 ms         Histogram: frequency by time         14.8 ms <
 Memory estimate: 951.14 KiB, allocs estimate: 16.

3.1.3 AMG preconditioner

Let’s try the algebraic multigrid (AMG) preconditioner.

using AlgebraicMultigrid

@time ml = AMGPreconditioner{RugeStuben}(A) # Construct a Ruge-Stuben solver
  1.000778 seconds (5.70 M allocations: 357.450 MiB, 5.08% gc time, 96.34% compilation time)
AMGPreconditioner{RugeStuben, AlgebraicMultigrid.MultiLevel{AlgebraicMultigrid.Pinv{Float64}, GaussSeidel{SymmetricSweep}, GaussSeidel{SymmetricSweep}, SparseMatrixCSC{Float64, Int64}, Adjoint{Float64, SparseMatrixCSC{Float64, Int64}}, SparseMatrixCSC{Float64, Int64}, AlgebraicMultigrid.MultiLevelWorkspace{Vector{Float64}, 1}}, AlgebraicMultigrid.V}(Multilevel Solver
-----------------
Operator Complexity: 1.133
Grid Complexity: 1.171
No. of Levels: 8
Coarse Solver: Pinv
Level     Unknowns     NonZeros
-----     --------     --------
    1        30401       471601 [88.28%]
    2         3589        44617 [ 8.35%]
    3         1083        12821 [ 2.40%]
    4          348         3702 [ 0.69%]
    5          114         1038 [ 0.19%]
    6           38          292 [ 0.05%]
    7           14           90 [ 0.02%]
    8            5           25 [ 0.00%]
, AlgebraicMultigrid.V())
# use AMG preconditioner in CG
xamg = cg(A, b, Pl = ml)
# same answer?
norm(xcg - xamg)
5.302707566339999e-7
@benchmark cg($A, $b, Pl = $ml)
BenchmarkTools.Trial: 110 samples with 1 evaluation.
 Range (min … max):  45.135 ms … 51.864 ms   GC (min … max): 0.00% … 12.55%
 Time  (median):     45.363 ms                GC (median):    0.00%
 Time  (mean ± σ):   45.508 ms ± 689.680 μs   GC (mean ± σ):  0.13% ±  1.20%
    ▁▅▂▁█ ▁▆                                                    
  ▄▇███████▅▁▄▃▁▃▁▅▁▄▁▁▁▁▃▁▁▁▃▃▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃
  45.1 ms         Histogram: frequency by time         47.2 ms <
 Memory estimate: 951.14 KiB, allocs estimate: 16.

4 Other Krylov subspace methods

  • We learned about CG/PCG, which is for solving \(\mathbf{A} \mathbf{x} = \mathbf{b}\) with \(\mathbf{A}\) pd.

  • MINRES (minimum residual method): symmetric indefinite \(\mathbf{A}\).

  • Bi-CG (bi-conjugate gradient): unsymmetric \(\mathbf{A}\).

  • Bi-CGSTAB (Bi-CG stabilized): improved version of Bi-CG.

  • GMRES (generalized minimum residual method): current de facto method for unsymmetric \(\mathbf{A}\). E.g., PageRank problem.

  • Lanczos method: top eigen-pairs of a large symmetric matrix.

  • Arnoldi method: top eigen-pairs of a large unsymmetric matrix.

  • Lanczos bidiagonalization algorithm: top singular triplets of a large matrix.

  • LSQR: least squares problem \(\min \|\mathbf{y} - \mathbf{X} \beta\|_2^2\) (or its damped version with ridge penalty \(\lambda^2 \|\beta\|_2^2\)). Algebraically equivalent to applying CG to the normal equation \((\mathbf{X}^T \mathbf{X} + \lambda^2 \mathbf{I}) \beta = \mathbf{X}^T \mathbf{y}\).

  • LSMR: least squares problem \(\min \|\mathbf{y} - \mathbf{X} \beta\|_2^2\) (or its damped version). Algebraically equivalent to applying MINRES to the normal equation \((\mathbf{X}^T \mathbf{X} + \lambda^2 \mathbf{I}) \beta = \mathbf{X}^T \mathbf{y}\).
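
Several of these solvers are available in Julia through IterativeSolvers.jl. A small sketch on a made-up, well-conditioned test matrix (the matrix and variable names are purely illustrative):

using IterativeSolvers, LinearAlgebra, Random, SparseArrays

Random.seed!(257)
n = 1_000
A = sprandn(n, n, 0.01) + 10I   # unsymmetric, diagonally dominant sparse test matrix
b = randn(n)

# GMRES for a general (unsymmetric) A
x_gmres = gmres(A, b)
@show norm(A * x_gmres - b)

# MINRES for a symmetric (possibly indefinite) matrix
S = Symmetric(Matrix(A + A') / 2)
x_minres = minres(S, b)
@show norm(S * x_minres - b);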

5 Software

5.1 Matlab

  • Iterative methods for solving linear equations:
    pcg, bicg, bicgstab, gmres, …

  • Iterative methods for top eigen-pairs and singular pairs:
    eigs, svds, …

  • Pre-conditioner:
    cholinc, luinc (ichol, ilu in recent MATLAB releases), …

  • Get familiar with the reverse communication interface (RCI) for utilizing iterative solvers:

x = gmres(A, b)
x = gmres(@Afun, b)
eigs(A)
eigs(@Afun)

5.2 Julia

5.2.1 Least squares example

using BenchmarkTools, IterativeSolvers, LinearAlgebra, Random, SparseArrays

Random.seed!(257) # seed
n, p = 10000, 5000
X = sprandn(n, p, 0.001) # iid standard normals with sparsity 0.001
β = ones(p)
y = X * β + randn(n)

β̂_qr = X \ y
# least squares by QR
@benchmark $X \ $y
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.577 s … 1.673 s   GC (min … max): 0.50% … 5.86%
 Time  (median):     1.621 s               GC (median):    2.89%
 Time  (mean ± σ):   1.623 s ± 49.826 ms   GC (mean ± σ):  3.07% ± 2.91%
  █                                             █        █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█ ▁
  1.58 s         Histogram: frequency by time        1.67 s <
 Memory estimate: 1.43 GiB, allocs estimate: 178.
β̂_lsqr = lsqr(X, y)
@show norm(β̂_qr - β̂_lsqr)
# least squares by lsqr
@benchmark lsqr($X, $y)
norm(β̂_qr - β̂_lsqr) = 0.00010166339934556622
BenchmarkTools.Trial: 162 samples with 1 evaluation.
 Range (min … max):  29.940 ms … 33.276 ms   GC (min … max): 0.00% … 6.83%
 Time  (median):     30.848 ms                GC (median):    0.00%
 Time  (mean ± σ):   31.050 ms ± 709.774 μs   GC (mean ± σ):  0.97% ± 2.05%
               ▁▁▁▃ ▄                                      
  ▃▁▄▅▄▃▅▃▅▁▅▇█████▆███▅▄▄▃▄▄▁▁▁▁▁▃▁▁▃▃▃▁▃▁▃▃▅▄▁▄▄▁▄▅▄▄▅▃▁▃▃ ▃
  29.9 ms         Histogram: frequency by time         32.8 ms <
 Memory estimate: 17.54 MiB, allocs estimate: 1820.
β̂_lsmr = lsmr(X, y)
@show norm(β̂_qr - β̂_lsmr)
# least squares by lsmr
@benchmark lsmr($X, $y)
norm(β̂_qr - β̂_lsmr) = 0.10052598447267552
BenchmarkTools.Trial: 231 samples with 1 evaluation.
 Range (min … max):  21.098 ms … 26.241 ms   GC (min … max): 0.00% … 18.29%
 Time  (median):     21.373 ms                GC (median):    0.00%
 Time  (mean ± σ):   21.654 ms ± 961.050 μs   GC (mean ± σ):  1.07% ±  3.62%
   ▆█▇▄▂▂                                                     
  ███████▇▄▆▆▄▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆▆█▄▁▁▁▁▄▄ ▆
  21.1 ms       Histogram: log(frequency) by time      25.8 ms <
 Memory estimate: 6.27 MiB, allocs estimate: 801.

5.2.2 Use LinearMaps in iterative solvers

In many applications, it is advantageous to define linear maps instead of forming the actual (sparse) matrix. For a linear map, we only need to specify how it acts on a vector (right-multiplication) and how its adjoint acts (left-multiplication). The LinearMaps.jl package is exactly for this purpose and interfaces nicely with IterativeSolvers.jl, Arpack.jl, and other iterative solver packages.

Applications:
1. The matrix is not sparse but admits special structure, e.g., sparse + low rank (PageRank), Kronecker products, etc.
2. Less memory usage.
3. Linear algebra on a standardized (centered and scaled) sparse matrix.

Consider the differencing operator that takes differences between neighboring pixels

\[ \mathbf{D} = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \\ & & & - 1 & 1 \\ 1 & & & & -1 \end{pmatrix}. \]

using LinearMaps, IterativeSolvers

# Overwrite y with A * x
# left difference assuming periodic boundary conditions
function leftdiff!(y::AbstractVector, x::AbstractVector) 
    N = length(x)
    length(y) == N || throw(DimensionMismatch())
    @inbounds for i in 1:N
        y[i] = x[i] - x[mod1(i - 1, N)]
    end
    return y
end

# Overwrite y with A' * x
# minus right difference
function mrightdiff!(y::AbstractVector, x::AbstractVector) 
    N = length(x)
    length(y) == N || throw(DimensionMismatch())
    @inbounds for i in 1:N
        y[i] = x[i] - x[mod1(i + 1, N)]
    end
    return y
end

# define linear map
D = LinearMap{Float64}(leftdiff!, mrightdiff!, 100; ismutating=true) 
100×100 FunctionMap{Float64,true}(leftdiff!, mrightdiff!; issymmetric=false, ishermitian=false, isposdef=false)

Linear maps can be used like a regular matrix.

@show size(D)
v = ones(size(D, 2)) # vector of all 1s
@show D * v
@show D' * v;
size(D) = (100, 100)
D * v = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
D' * v = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

If we form the corresponding dense matrix, it will look like

Matrix(D)
100×100 Matrix{Float64}:
  1.0   0.0   0.0   0.0   0.0   0.0  …   0.0   0.0   0.0   0.0   0.0  -1.0
 -1.0   1.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0  -1.0   1.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0  -1.0   1.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0  -1.0   1.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0  -1.0   1.0  …   0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0  -1.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0  …   0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  ⋮                             ⋮    ⋱         ⋮                      
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0  …   0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      1.0   0.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0  …  -1.0   1.0   0.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0  -1.0   1.0   0.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0  -1.0   1.0   0.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0  -1.0   1.0   0.0
  0.0   0.0   0.0   0.0   0.0   0.0      0.0   0.0   0.0   0.0  -1.0   1.0

If we form the corresponding sparse matrix, it will look like

using SparseArrays
sparse(D)
100×100 SparseMatrixCSC{Float64, Int64} with 200 stored entries:
⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈
⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⣄
using UnicodePlots
spy(sparse(D))
       ┌──────────────────────────────────────────┐    
     1 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ > 0
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ < 0
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
   100 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    
       └──────────────────────────────────────────┘1⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀100⠀    
       ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀200 ≠ 0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀    

Compute the top singular values using an iterative method (Arnoldi).

using Arpack
Arpack.svds(D, nsv = 3)
(SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}([-0.10000000000003784 -0.12610107456829575 0.06401967660582586; 0.10000000000003065 0.12987207165686732 -0.05597539641260025; … ; -0.1000000000000521 -0.11708293685000366 0.07931951776561391; 0.10000000000004498 0.12183241414850406 -0.07181130038329539], [1.999999999999991, 1.999013120731463, 1.999013120731457], [-0.10000000000003396 0.10000000000002691 … -0.10000000000004843 0.10000000000004114; -0.12804975793830697 0.1315662173203481 … -0.11951664975119619 0.12402794466205336; 0.06002715628725726 -0.05186839557967478 … 0.07560271445022253 -0.06794901723277247]), 3, 32, 526, [0.04258949839207237, 0.14721604804533064, -0.011022034322333656, 0.05622695730309966, -0.18486947987539046, -0.06893624219396152, -0.058094777857784594, -0.029289396615192375, -0.07254854992344173, -0.00975416583922165  …  0.024831913415879165, -0.13578084697251516, 0.02681766807280268, -0.08554983305702116, 0.04044885519724034, 0.04633598288333846, 0.04804027361050814, 0.08825454846822017, 0.04845254033517858, 0.16996721894328326])
using LinearAlgebra
# check solution against the direct method for SVD
svdvals(Matrix(D))
100-element Vector{Float64}:
 2.0
 1.9990131207314632
 1.999013120731463
 1.996053456856543
 1.996053456856543
 1.99112392920616
 1.99112392920616
 1.9842294026289555
 1.9842294026289555
 1.9753766811902753
 1.9753766811902753
 1.9645745014573774
 1.9645745014573774
 ⋮
 0.37476262917144926
 0.31286893008046185
 0.31286893008046174
 0.25066646712860857
 0.25066646712860846
 0.18821662663702862
 0.1882166266370286
 0.12558103905862672
 0.12558103905862666
 0.06282151815625658
 0.06282151815625657
 1.7635897249928907e-16

Compute the top eigenvalues of the Gram matrix D'D using an iterative method (Arnoldi).

Arpack.eigs(D'D, nev = 3, which = :LM)
([3.9999999999999907, 3.996053456856537, 3.9960534568565333], [0.09999999999998786 -0.009399690541663409 -0.14110863126584505; -0.09999999999998202 0.0005208581322834926 0.14142039706777043; … ; 0.10000000000000063 -0.027011172214567336 -0.13881785395111446; -0.09999999999999405 0.018241426666785143 0.14023997416271491], 3, 29, 477, [-0.11143136146466825, -0.09379191938271311, -0.26066819357025, -0.04821186674086208, -0.00044236861085249316, 0.14800776405222132, -0.11112031736267719, 0.07442363053822573, -0.04267710288965029, 0.23889296427901438  …  0.0065084972593795155, -0.05077322319654768, -0.1460426446406352, -0.026520846605774986, 0.11477751117771083, -0.060738880152664966, 0.16090083149646847, 0.13006045875353353, -0.12879338556316872, 0.05985490000934182])

6 Further reading