A general linear program takes the form \[\begin{eqnarray*}
&\text{minimize}& \mathbf{c}^T \mathbf{x} \\
&\text{subject to}& \mathbf{A} \mathbf{x} = \mathbf{b} \\
& & \mathbf{G} \mathbf{x} \preceq \mathbf{h}.
\end{eqnarray*}\] Why is a linear program a convex optimization problem? The objective is linear, hence convex, and the feasible set \(\{\mathbf{x}: \mathbf{A} \mathbf{x} = \mathbf{b}, \mathbf{G} \mathbf{x} \preceq \mathbf{h}\}\) is a polyhedron, which is a convex set.
The standard form of an LP is \[\begin{eqnarray*}
&\text{minimize}& \mathbf{c}^T \mathbf{x} \\
&\text{subject to}& \mathbf{A} \mathbf{x} = \mathbf{b} \\
& & \mathbf{x} \succeq \mathbf{0}.
\end{eqnarray*}\] To transform a general linear program into the standard form, we introduce the slack variables \(\mathbf{s} \succeq \mathbf{0}\) such that \(\mathbf{G} \mathbf{x} + \mathbf{s} = \mathbf{h}\). Then we write \(\mathbf{x} = \mathbf{x}^+ - \mathbf{x}^-\), where \(\mathbf{x}^+ \succeq \mathbf{0}\) and \(\mathbf{x}^- \succeq \mathbf{0}\). This yields the problem \[\begin{eqnarray*}
&\text{minimize}& \mathbf{c}^T (\mathbf{x}^+ - \mathbf{x}^-) \\
&\text{subject to}& \mathbf{A} (\mathbf{x}^+ - \mathbf{x}^-) = \mathbf{b} \\
& & \mathbf{G} (\mathbf{x}^+ - \mathbf{x}^-) + \mathbf{s} = \mathbf{h} \\
& & \mathbf{x}^+ \succeq \mathbf{0}, \mathbf{x}^- \succeq \mathbf{0}, \mathbf{s} \succeq \mathbf{0}
\end{eqnarray*}\] in \(\mathbf{x}^+\), \(\mathbf{x}^-\), and \(\mathbf{s}\).
Slack variables are often used to transform a complicated inequality constraint to simple non-negativity constraints.
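To make the block structure of this reduction concrete, here is a minimal Julia sketch; the helper name standard_form is made up for illustration and assumes dense problem data \((\mathbf{c}, \mathbf{A}, \mathbf{b}, \mathbf{G}, \mathbf{h})\).
```julia
using LinearAlgebra

# Assemble standard-form data for the stacked variable z = (x⁺, x⁻, s) ⪰ 0,
# so that  min c_stdᵀ z  s.t.  A_std z = b_std,  z ⪰ 0
# is equivalent to  min cᵀx  s.t.  Ax = b, Gx ⪯ h.
function standard_form(c, A, b, G, h)
    m = size(G, 1)
    c_std = [c; -c; zeros(m)]                # cᵀ(x⁺ - x⁻)
    A_std = [A -A zeros(size(A, 1), m);      # A(x⁺ - x⁻)     = b
             G -G I(m)]                      # G(x⁺ - x⁻) + s = h
    b_std = [b; h]
    return c_std, A_std, b_std
end
```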
The inequality form of an LP is \[\begin{eqnarray*}
&\text{minimize}& \mathbf{c}^T \mathbf{x} \\
&\text{subject to}& \mathbf{G} \mathbf{x} \preceq \mathbf{h}.
\end{eqnarray*}\]
Some software, e.g., solveLP in R, requires an LP to be written in either standard or inequality form. However, a good modeling tool should do this transformation for you!
A piecewise-linear minimization problem \[\begin{eqnarray*}
&\text{minimize}& \max_{i=1,\ldots,m} (\mathbf{a}_i^T \mathbf{x} + b_i)
\end{eqnarray*}\] can be transformed to an LP \[\begin{eqnarray*}
&\text{minimize}& t \\
&\text{subject to}& \mathbf{a}_i^T \mathbf{x} + b_i \le t, \quad i = 1,\ldots,m,
\end{eqnarray*}\] in \(\mathbf{x}\) and \(t\). Similarly, \[
\text{minimize} \quad \max_{i=1,\ldots,m} |\mathbf{a}_i^T \mathbf{x} + b_i|
\] and \[
\text{minimize} \quad \max_{i=1,\ldots,m} (\mathbf{a}_i^T \mathbf{x} + b_i)_+
\] can also be formulated as LPs.
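As a quick illustration, a DCP tool such as Convex.jl (used later in these notes) carries out this epigraph reformulation automatically; below is a minimal sketch on made-up data, with the GLPK solver as an assumption.
```julia
using Convex, GLPK

# Toy problem data (assumptions for illustration)
m, n = 5, 3
A = randn(m, n)
b = randn(m)

x = Variable(n)
# Convex.jl internally introduces t and the constraints aᵢᵀx + bᵢ ≤ t
problem = minimize(maximum(A * x + b))
solve!(problem, GLPK.Optimizer)
x.value
```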
Any convex optimization problem \[\begin{eqnarray*}
&\text{minimize}& f_0(\mathbf{x}) \\
&\text{subject to}& f_i(\mathbf{x}) \le 0, \quad i=1,\ldots,m \\
&& \mathbf{a}_i^T \mathbf{x} = b_i, \quad i=1,\ldots,p,
\end{eqnarray*}\] where \(f_0,\ldots,f_m\) are convex functions, can be transformed to the epigraph form \[\begin{eqnarray*}
&\text{minimize}& t \\
&\text{subject to}& f_0(\mathbf{x}) - t \le 0 \\
& & f_i(\mathbf{x}) \le 0, \quad i=1,\ldots,m \\
& & \mathbf{a}_i^T \mathbf{x} = b_i, \quad i=1,\ldots,p
\end{eqnarray*}\] in variables \(\mathbf{x}\) and \(t\). That is why people often say a linear objective is universal for convex optimization.
A linear-fractional program \[\begin{eqnarray*}
&\text{minimize}& \frac{\mathbf{c}^T \mathbf{x} + d}{\mathbf{e}^T \mathbf{x} + f} \\
&\text{subject to}& \mathbf{A} \mathbf{x} = \mathbf{b} \\
& & \mathbf{G} \mathbf{x} \preceq \mathbf{h} \\
& & \mathbf{e}^T \mathbf{x} + f > 0
\end{eqnarray*}\] can be transformed to an LP \[\begin{eqnarray*}
&\text{minimize}& \mathbf{c}^T \mathbf{y} + d z \\
&\text{subject to}& \mathbf{G} \mathbf{y} - z \mathbf{h} \preceq \mathbf{0} \\
& & \mathbf{A} \mathbf{y} - z \mathbf{b} = \mathbf{0} \\
& & \mathbf{e}^T \mathbf{y} + f z = 1 \\
& & z \ge 0
\end{eqnarray*}\] in \(\mathbf{y}\) and \(z\), via the change of variables \[\begin{eqnarray*}
\mathbf{y} = \frac{\mathbf{x}}{\mathbf{e}^T \mathbf{x} + f}, \quad z = \frac{1}{\mathbf{e}^T \mathbf{x} + f}.
\end{eqnarray*}\] See Section 4.3.2 of Boyd and Vandenberghe (2004) for a proof.
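After solving the LP, the original variable is recovered from an optimal \((\mathbf{y}^\star, z^\star)\) with \(z^\star > 0\) by undoing the change of variables: \[
\mathbf{x}^\star = \frac{\mathbf{y}^\star}{z^\star}.
\]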
2 LP example: compressed sensing
Compressed sensing (Candes and Tao 2006; Donoho 2006) tries to address a fundamental question: how can we compress and transmit a complex signal (e.g., musical clips, mega-pixel images) so that it can be decoded to recover the original signal?
Suppose a signal \(\mathbf{x} \in \mathbb{R}^n\) is sparse with \(s\) non-zeros. We under-sample the signal by multiplying it by a (flat) measurement matrix \(\mathbf{A} \in \mathbb{R}^{m\times n}\) with iid normal entries, \(\mathbf{y} = \mathbf{A} \mathbf{x}\). Candes, Romberg, and Tao (2006) show that the solution to \[\begin{eqnarray*}
&\text{minimize}& \|\mathbf{x}\|_1 \\
&\text{subject to}& \mathbf{A} \mathbf{x} = \mathbf{y}
\end{eqnarray*}\] exactly recovers the true signal under certain conditions on \(\mathbf{A}\) when \(n \gg s\) and \(m \approx s \ln(n/s)\). Why is sparsity a reasonable assumption? Virtually all real-world images have low information content.
The \(\ell_1\) minimization problem is readily cast as an LP by writing \(\mathbf{x} = \mathbf{x}^+ - \mathbf{x}^-\): \[\begin{eqnarray*}
&\text{minimize}& \mathbf{1}^T (\mathbf{x}^+ + \mathbf{x}^-) \\
&\text{subject to}& \mathbf{A} (\mathbf{x}^+ - \mathbf{x}^-) = \mathbf{y} \\
& & \mathbf{x}^+ \succeq \mathbf{0}, \mathbf{x}^- \succeq \mathbf{0}.
\end{eqnarray*}\]
Let’s try a numerical example.
2.1 Generate a sparse signal and sub-sampling
# Load packages: Random for seeding; CairoMakie (assumed plotting backend) for Figure/Axis/lines!
using CairoMakie, Random
# random seed
Random.seed!(123)
# Size of signal
n = 1024
# Sparsity (# nonzeros) in the signal
s = 20
# Number of samples (undersample by a factor of 8)
m = 128
# Generate and display the signal
x0 = zeros(n)
x0[rand(1:n, s)] = randn(s)
# Generate the random sampling matrix
A = randn(m, n) / m
# Subsample by multiplexing
y = A * x0
# plot the true signal
f = Figure()
Axis(
    f[1, 1],
    title = "True Signal x0",
    xlabel = "x",
    ylabel = "y"
)
lines!(1:n, x0)
f
2.2 Solve LP by DCP (disciplined convex programming) interface Convex.jl
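A minimal sketch of the solve step via Convex.jl; the GLPK solver is an assumption, and any LP-capable solver supported by Convex.jl works.
```julia
using Convex, GLPK

# Estimate the sparse signal from the undersampled measurements y = A * x0
# by ℓ₁ minimization:  minimize ‖x‖₁  subject to  Ax = y
x = Variable(n)
problem = minimize(norm(x, 1), A * x == y)
solve!(problem, GLPK.Optimizer)
```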
# Display the solution
f = Figure()
Axis(
    f[1, 1],
    title = "Reconstructed signal overlayed with x0",
    xlabel = "x",
    ylabel = "y"
)
scatter!(1:n, x0, label = "truth")
lines!(1:n, vec(x.value), label = "recovery")
axislegend(position = :lt)
f
3 LP example: quantile regression
In linear regression, we model the mean of the response variable as a function of covariates. In many situations, the error variance is not constant, the distribution of \(y\) may be asymmetric, or we simply care about the quantile(s) of the response variable. Quantile regression offers a better modeling tool in these applications.
In \(\tau\)-quantile regression, we minimize the loss function \[\begin{eqnarray*}
f(\beta) = \sum_{i=1}^n \rho_\tau (y_i - \mathbf{x}_i^T \beta),
\end{eqnarray*}\] where \(\rho_\tau(z) = z (\tau - 1_{\{z < 0\}})\). Writing \(\mathbf{y} - \mathbf{X} \beta = \mathbf{r}^+ - \mathbf{r}^-\), this is equivalent to the LP \[\begin{eqnarray*}
&\text{minimize}& \tau \mathbf{1}^T \mathbf{r}^+ + (1-\tau) \mathbf{1}^T \mathbf{r}^- \\
&\text{subject to}& \mathbf{r}^+ - \mathbf{r}^- = \mathbf{y} - \mathbf{X} \beta \\
& & \mathbf{r}^+ \succeq \mathbf{0}, \mathbf{r}^- \succeq \mathbf{0}
\end{eqnarray*}\] in \(\mathbf{r}^+\), \(\mathbf{r}^-\), and \(\beta\).
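A minimal Convex.jl sketch of this LP, assuming a design matrix X, response y, and quantile level τ are already defined; the GLPK solver is an assumption.
```julia
using Convex, GLPK

n, p = size(X)
β    = Variable(p)
rpos = Variable(n)    # r⁺
rneg = Variable(n)    # r⁻
problem = minimize(
    τ * sum(rpos) + (1 - τ) * sum(rneg),
    rpos - rneg == y - X * β,
    rpos >= 0,
    rneg >= 0,
)
solve!(problem, GLPK.Optimizer)
β.value               # estimated regression coefficients
```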
4 LP Example: \(\ell_1\) regression
A popular method in robust statistics is least absolute deviation (LAD) regression, which minimizes the \(\ell_1\) norm of the residual vector, \(\|\mathbf{y} - \mathbf{X} \beta\|_1\). This is equivalent to the LP \[\begin{eqnarray*}
&\text{minimize}& \mathbf{1}^T (\mathbf{r}^+ + \mathbf{r}^-) \\
&\text{subject to}& \mathbf{r}^+ - \mathbf{r}^- = \mathbf{y} - \mathbf{X} \beta \\
& & \mathbf{r}^+ \succeq \mathbf{0}, \mathbf{r}^- \succeq \mathbf{0}
\end{eqnarray*}\] in \(\mathbf{r}^+\), \(\mathbf{r}^-\), and \(\beta\).
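With a DCP tool such as Convex.jl the split into \(\mathbf{r}^+\) and \(\mathbf{r}^-\) is handled automatically; a minimal sketch, assuming X, y, and the GLPK solver:
```julia
using Convex, GLPK

β = Variable(size(X, 2))
problem = minimize(norm(y - X * β, 1))    # ℓ₁ (LAD) regression
solve!(problem, GLPK.Optimizer)
β.value
```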
5 LP Example: \(\ell_\infty\) regression
Minimizing the worst possible residual \(\|\mathbf{y} - \mathbf{X} \beta\|_\infty\) is equivalent to the LP \[\begin{eqnarray*}
&\text{minimize}& t \\
&\text{subject to}& -t \le y_i - \mathbf{x}_i^T \beta \le t, \quad i = 1,\dots,n
\end{eqnarray*}\] in variables \(\beta\) and \(t\).
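Again the epigraph reformulation can be left to the DCP tool; a minimal sketch under the same assumptions as above:
```julia
using Convex, GLPK

β = Variable(size(X, 2))
problem = minimize(norm(y - X * β, Inf))  # minimize the worst-case residual
solve!(problem, GLPK.Optimizer)
β.value
```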
6 LP Example: Dantzig selector
Candes and Tao (2007) propose a variable selection method called the Dantzig selector that solves \[\begin{eqnarray*}
&\text{minimize}& \|\mathbf{X}^T (\mathbf{y} - \mathbf{X} \beta)\|_\infty \\
&\text{subject to}& \sum_{j=2}^p |\beta_j| \le t,
\end{eqnarray*}\] which can be transformed to an LP. Indeed, they name the method after George Dantzig, who invented the simplex method for efficiently solving LPs in the 1950s.
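A Convex.jl sketch of the constrained formulation above, assuming X (intercept in the first column), y, and a bound t are given; the GLPK solver is an assumption.
```julia
using Convex, GLPK

p = size(X, 2)
β = Variable(p)
problem = minimize(
    norm(X' * (y - X * β), Inf),
    norm(β[2:p], 1) <= t,     # leave the intercept β₁ unpenalized
)
solve!(problem, GLPK.Optimizer)
β.value
```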
7 LP Example: 1-norm SVM
In two-class classification problems, we are given training data \((\mathbf{x}_i, y_i)\), \(i=1,\ldots,n\), where \(\mathbf{x}_i \in \mathbb{R}^p\) are feature vectors and \(y_i \in \{-1, 1\}\) are class labels. Zhu, Rosset, Tibshirani, and Hastie (2004) propose the 1-norm support vector machine (svm) that achieves the dual purpose of classification and feature selection. Denote the solution of the optimization problem \[\begin{eqnarray*}
&\text{minimize}& \sum_{i=1}^n \left[ 1 - y_i \left( \beta_0 + \sum_{j=1}^p x_{ij} \beta_j \right) \right]_+ \\
&\text{subject to}& \|\beta\|_1 = \sum_{j=1}^p |\beta_j| \le t
\end{eqnarray*}\] by \(\hat \beta_0(t)\) and \(\hat \beta(t)\). The 1-norm SVM classifies a future feature vector \(\mathbf{x}\) by the sign of the fitted value \[\begin{eqnarray*}
\hat f(\mathbf{x}) = \hat \beta_0 + \mathbf{x}^T \hat \beta.
\end{eqnarray*}\]
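A Convex.jl sketch of the 1-norm SVM, assuming a feature matrix X (n × p), labels y in \(\{-1, 1\}\), and a bound t; the GLPK solver is an assumption.
```julia
using Convex, GLPK, LinearAlgebra

n, p = size(X)
β0 = Variable()
β  = Variable(p)
margins = diagm(y) * (X * β + β0 * ones(n))     # yᵢ (β₀ + xᵢᵀβ)
problem = minimize(
    sum(pos(ones(n) - margins)),                # hinge loss [1 - yᵢ f(xᵢ)]₊
    norm(β, 1) <= t,
)
solve!(problem, GLPK.Optimizer)
# classify by the sign of the fitted values
sign.(evaluate(X * β + β0 * ones(n)))
```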
Many more applications of LP: Airport scheduling (Copenhagen airport uses Gurobi), airline flight scheduling, NFL scheduling, match.com, \(\LaTeX\), …
In fact, any loss/penalty or loss/constraint combination of the form \[
\{\ell_1, \ell_\infty, \text{quantile}\} \times \{\ell_1, \ell_\infty, \text{quantile}\},
\] possibly with affine (equality and/or inequality) constraints, can be formulated as an LP.