# Discrete Minimax by Quadratic Majorization Jan de Leeuw Version 15, December 12, 2016 Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory [gifi.stat.ucla.edu/minimax](http://gifi.stat.ucla.edu/minimax) has a pdf version, the complete Rmd file with all code chunks, the bib file, and the R source code. #Introduction Suppose $\mathcal{I}$ is the closed interval $[L,U]$, and $f:\mathcal{I}\rightarrow\mathbb{R}$. A function $g:\mathcal{I}\otimes\mathcal{I}\rightarrow\mathbb{R}$ is a *majorization scheme* for $f$ on $\mathcal{I}$ if * $g(x,x) = f(x)$ for all $x\in\mathcal{I}$, * $g(x,y) \geq f(x)$ for all $x,y\in\mathcal{I}$. In other words for each $y$ the global minimum of $g(x,y)-f(x)$ over $x\in\mathcal{I}$ is zero, and it is attained at $y$. If the functions $f$ and $g$ are differentiable and the minimum is attained at an interior point of the interval, we have $\mathcal{D}_1g(x,y)=\mathcal{D}f(x)$. If the functions are in addition twice differentiable we have $\mathcal{D}_{11}g(x,y)\geq \mathcal{D}^2f(x)$. The majorization conditions are not symmetric in $x$ and $y$, and consequently it sometimes is more clear to write $g_y(x)$ for $g(x,y)$, so that $g_y:\mathcal{I}\rightarrow\mathbb{R}$. We say that $g_y$ majorizes $f$ on $\mathcal{I}$ at $y$, or with *support point* $y$. A *majorization algorithm* is of the form $$x^{(k+1)}\in\mathop{\mathbf{argmin}}_{x\in\mathcal{I}}g(x,x^{(k)})$$ It then follows that $$\label{E:sandwich} f(x^{(k+1)})\leq g(x^{(k+1)},x^{(k)})\leq g(x^{(k)},x^{(k)})=f(x^{(k)}),$$ Thus a majorization step decreases the value of the objective function. The chain $\eqref{E:sandwich}$ is called the *sandwich inequality*. In $\eqref{E:sandwich}$ the inequality $f(x^{(k+1)})\leq g(x^{(k+1)},x^{(k)})$ follows from majorization, the inequality $g(x^{(k+1)},x^{(k)})\leq g(x^{(k)},x^{(k)})$ follows from minimization. This explains why majorization algorithms are also called *MM algorithms* (@lange_16). Using *MM* has the advantage that it can be used for the dual family of minorization-maximization algorithms. #Majorizing a Maximum **Theorem 1: [Majorization of Maximum]** Suppose $p(x)=\max_{i=1}^n f_i(x)$, and $g_i$ majorizes $f_i$ on $\mathcal{I}$ at $y$. Define $q(x,y):=\max_{i=1}^n g_i(x,y)$. Then $q$ majorizes $p$ on $\mathcal{I}$ at $y$. **Proof:** First $$q(y,y)=\max_{i=1}^n g_i(y,y)=\max_{i=1}^n f_i(y)=p(y).$$ Second $$p(x)=\max_{i=1}^n f_i(x)=f_k(x)\leq g_k(x,y)\leq\max_{i=1}^n g_i(x,y)=q(x,y).$$ **QED** ##Quadratic Majorization of a Maximum Suppose the majorizing function $g_i$ are quadratics. We have $$\label{E:gi} g_i(x,y)=f_i(y)+f'_i(y)(x-y)+\frac12 K_i(x-y)^2.$$ In a majorization step we have to compute $$x^{(k+1)}=\mathop{\mathbf{argmin}}_{L\leq x\leq U} \max_{i=1}^n f_i(x^{(k)})+f'_i(x^{(k)})(x-x^{(k)})+\frac12 K_i(x-x^{(k)})^2.$$ Now it is easy, at least conceptually, to design a majorization algorithm. For each $i Figure 1: Uniform Piecewise Quadratic Majorizers The iterations of the majorization algorithm, started from -1.5, 0.5, and 0.0, are given next. Convergence is fast and ultimately seems to be quadratic. Note that, in this example at least, convergence is to the point where the two quadratic majorizers are equal. At this point $$f(x^{(k)})+f'(x^{(k)})(x^{(k+1)}-x^{(k)})=-f(x^{(k)})-f'(x^{(k)})(x^{(k+1)}-x^{(k)}),$$ or $$x^{(k+1)}=x^{(k)}-\frac{f(x^{(k)})}{f'(x^{(k)})},$$ which is a Newton step. r h <- myIterator (xinit = -1.5, f = myAbs, a = a, k = 2, lw = -2, up = 2, verbose = TRUE, final = FALSE)   ## Iteration: 1 xold: -1.50000000 xnew: -1.17391304 cnew: 0.32608696 rate: 0.00000000 ## Iteration: 2 xold: -1.17391304 xnew: -1.03230713 cnew: 0.14160592 rate: 0.43425814 ## Iteration: 3 xold: -1.03230713 xnew: -1.00145595 cnew: 0.03085117 rate: 0.21786642 ## Iteration: 4 xold: -1.00145595 xnew: -1.00000317 cnew: 0.00145278 rate: 0.04709010 ## Iteration: 5 xold: -1.00000317 xnew: -1.00000000 cnew: 0.00000317 rate: 0.00218127 ## Iteration: 6 xold: -1.00000000 xnew: -1.00000000 cnew: 0.00000000 rate: 0.00000475  r h <- myIterator (xinit = 0.5, f = myAbs, a = a, k = 2, lw = -2, up = 2, verbose = TRUE, final = FALSE)   ## Iteration: 1 xold: 0.50000000 xnew: 0.47916667 cnew: 0.02083333 rate: 0.00000000 ## Iteration: 2 xold: 0.47916667 xnew: 0.45323351 cnew: 0.02593316 rate: 1.24479167 ## Iteration: 3 xold: 0.45323351 xnew: 0.42125533 cnew: 0.03197818 rate: 1.23310004 ## Iteration: 4 xold: 0.42125533 xnew: 0.38228601 cnew: 0.03896932 rate: 1.21862221 ## Iteration: 5 xold: 0.38228601 xnew: 0.33548832 cnew: 0.04679769 rate: 1.20088533 ## Iteration: 6 xold: 0.33548832 xnew: 0.28029309 cnew: 0.05519523 rate: 1.17944358 ## Iteration: 7 xold: 0.28029309 xnew: 0.21660081 cnew: 0.06369228 rate: 1.15394535 ## Iteration: 8 xold: 0.21660081 xnew: 0.14499646 cnew: 0.07160436 rate: 1.12422348 ## Iteration: 9 xold: 0.14499646 xnew: 0.06691911 cnew: 0.07807734 rate: 1.09039932 ## Iteration: 10 xold: 0.06691911 xnew: -0.00060751 cnew: 0.06752663 rate: 0.86486843 ## Iteration: 11 xold: -0.00060751 xnew: 0.00000000 cnew: 0.00060751 rate: 0.00899663 ## Iteration: 12 xold: 0.00000000 xnew: 0.00000000 cnew: 0.00000000 rate: 0.00000074  r h <- myIterator (xinit = 0.0, f = myAbs, a = a, k = 2, lw = -2, up = 2, verbose = TRUE, final = FALSE)   ## Iteration: 1 xold: 0.00000000 xnew: 0.00000000 cnew: 0.00000000 rate: 0.00000000  We can find better quadratic majorizers of our cubics by using the results of @deleeuw_E_16q. Define the sharp quadratic majorizers for which $$K(y):=\max_{L\leq x\leq U} f''(y)+\frac13 f'''(y)(x-y)=f''(y)+\frac13\max\ (f'''(L-y), f'''(U-y))$$ For$\frac16(x^3-x)$this gives$K_1(y)=y+\frac13(U-y)$and for$-\frac16(x^3-x)$it gives$K_2(y)=-y-\frac13(L-y)$. If$L=-2$and$U=+2$then$K_1(y)=\frac23 y+\frac23$and$K_2(y)=-\frac23 y+\frac23$. At$y=-\frac32$this means$K_1=-\frac13$and$K_2=\frac53$. At$y=\frac12$we have$K_1=1$and$K_2=\frac13$. At$y=0$we have$K_1=K_2=\frac23$. The piecewise quadratc majorizers are in figure 2. Note that for both$y=-\frac32$and for$y=\frac12$there are two interior knots. Figure 2: Sharp Piecewise Quadratic Majorizers As expected, using no lifitng and sharp majorization requires fewer iterations to reach the roots, although of course each iteration is more work. r h <- myIterator (xinit = -1.5, f = myAbs, a = a, k = c(-1/3,5/3), lw = -2, up = 2, verbose = TRUE, final = FALSE)   ## Iteration: 1 xold: -1.50000000 xnew: -1.08333333 cnew: 0.41666667 rate: 0.00000000 ## Iteration: 2 xold: -1.08333333 xnew: -1.00057225 cnew: 0.08276108 rate: 0.19862660 ## Iteration: 3 xold: -1.00057225 xnew: -1.00000000 cnew: 0.00057225 rate: 0.00691450 ## Iteration: 4 xold: -1.00000000 xnew: -1.00000000 cnew: 0.00000000 rate: 0.00000033  r h <- myIterator (xinit = 0.5, f = myAbs, a = a, k = c(1,1/3), lw = -2, up = 2, verbose = TRUE, final = FALSE)   ## Iteration: 1 xold: 0.50000000 xnew: 0.37500000 cnew: 0.12500000 rate: 0.00000000 ## Iteration: 2 xold: 0.37500000 xnew: 0.08593750 cnew: 0.28906250 rate: 2.31250000 ## Iteration: 3 xold: 0.08593750 xnew: 0.00534433 cnew: 0.08059317 rate: 0.27880882 ## Iteration: 4 xold: 0.00534433 xnew: 0.00002796 cnew: 0.00531637 rate: 0.06596546 ## Iteration: 5 xold: 0.00002796 xnew: 0.00000000 cnew: 0.00002796 rate: 0.00525925 ## Iteration: 6 xold: 0.00000000 xnew: 0.00000000 cnew: 0.00000000 rate: 0.00002796  r h <- myIterator (xinit = 0.0, f = myAbs, a = a, k = c(2/3,2/3), lw = -2, up = 2, verbose = TRUE, final = FALSE)   ## Iteration: 1 xold: 0.00000000 xnew: 0.00000000 cnew: 0.00000000 rate: 0.00000000  The root-finding example can easily be extended to more interesting cases. If we have a convex function$f$with a quadratic majorizer, then$-F$is concave and, if so desired, it can be lifted to have the same quadratic term. ##MDS in the Chebyshev Norm Consider the multidimensional scaling (MDS) problem of minimizing the loss function $$\sigma(x):=\max_{i=1}^n w_i\left|\delta_i-\sqrt{x'A_ix}\right|.$$ Here the$w_i$are positive weights, the$\delta_i$are positive dissimilarties, and the$\sqrt{x'A_ix}$are Euclidean distances. We give the quadratic majorizations that are needed. The first majorization, based on Cauchy-Schwartz, is actually linear. This is the basic SMACOF majorization, due to @deleeuw_C_77. $$\delta_i-\sqrt{x'A_{i}x}\leq \delta_i-\frac{1}{\sqrt{y'A_{i}y}}x'A_{i}y.$$ The second majorization, based on the AM/GM inequality, is quadratic. It was first used in the MDS context by @heiser_91. $$\sqrt{x'A_ix}-\delta_i\leq \frac12\frac{1}{\sqrt{y'A_iy}}\{x'A_ix+y'A_iy\}-\delta_i$$ We can add an appropriate quadratic term to the first (linear) majorization to achieve multivariate lifting. We will not give a full implementation in this paper, but in principle it is quite straightforward. It also is of some interestto observe that the alternative loss funcion $$\sigma(x):=\max_{i=1}^n w_i\left|\delta_i^2-x'A_ix\right|$$ is already a maximum of quadratics, and consequently minimizing it does not require quadratic majorization. Without multivariate lifting, however, the regions where one quadratic function dominates all others can be quite complicated. #Appendix: Code ##auxilary.R r mprint <- function (x, d = 2, w = 5) { print (noquote (formatC ( x, di = d, wi = w, fo = "f" ))) } myIterator <- function (xinit, f, eps = 1e-6, itmax = 100, verbose = FALSE, final = TRUE, ...) { xold <- xinit cold <- Inf itel <- 1 repeat { xnew <- f (xold, ...) cnew <- abs (xnew - xold) rate <- cnew / cold if (verbose) cat( "Iteration: ", formatC (itel, width = 3, format = "d"), "xold: ", formatC ( xold, digits = 8, width = 12, format = "f" ), "xnew: ", formatC ( xnew, digits = 8, width = 12, format = "f" ), "cnew: ", formatC ( cnew, digits = 8, width = 12, format = "f" ), "rate: ", formatC ( rate, digits = 8, width = 12, format = "f" ), "\n" ) if ((cnew < eps) || (itel == itmax)) break xold <- xnew cold <- cnew itel <- itel + 1 } if (final) cat( "Iteration: ", formatC (itel, width = 3, format = "d"), "xinit: ", formatC ( xinit, digits = 8, width = 12, format = "f" ), "xfinal: ", formatC ( xnew, digits = 8, width = 12, format = "f" ), "change: ", formatC ( cnew, digits = 8, width = 12, format = "f" ), "rate: ", formatC ( rate, digits = 8, width = 12, format = "f" ), "\n" ) return (list ( itel = itel, xinit = xinit, xfinal = xnew, change = cnew, rate = rate )) } cobwebPlotter <- function (xold, func, lowx = 0, hghx = 1, lowy = lowx, hghy = hghx, eps = 1e-10, itmax = 25, ...) { x <- seq (lowx, hghx, length = 100) y <- sapply (x, function (x) func (x, ...)) plot ( x, y, xlim = c(lowx , hghx), ylim = c(lowy, hghy), type = "l", col = "RED", lwd = 2 ) abline (0, 1, col = "BLUE") base <- 0 itel <- 1 repeat { xnew <- func (xold, ...) if (itel > 1) { lines (matrix(c(xold, xold, base, xnew), 2, 2)) } lines (matrix(c(xold, xnew, xnew, xnew), 2, 2)) if ((abs (xnew - xold) < eps) || (itel == itmax)) { break () } base <- xnew xold <- xnew itel <- itel + 1 } }  ##minimax.R r minQuadratic <- function (y, f, g, k, lw, up) { func <- function (x, f, g, k) { return (f + g * (x - y) + 0.5 * k * (x - y) ^ 2) } fup <- func (up, f, g, k) flw <- func (lw, f, g, k) if (k <= 0) { if (fup <= flw) return (list (x = up, f = fup)) if (fup >= flw) return (list (x = lw, f = flw)) } xmn <- y - g / k fmn <- func (xmn, f, g, k) if (xmn >= up) return (list (x = up, f = fup)) if (xmn <= lw) return (list (x = lw, f = flw)) return (list (x = xmn, f = fmn)) } minimaxQ <- function (y, f, g, k, lw, up) { n <- length (f) res <- Inf for (i in 1:n) { xlw <- lw xup <- up fail <- FALSE for (j in 1:n) { if (j == i) next df <- f[i] - f[j] dg <- g[i] - g[j] if (dg > 0) xlw <- max (xlw, y - df / dg) if (dg < 0) xup <- min (xup, y - df / dg) if ((dg == 0) && (f[i] < f[j])) fail <- TRUE } if (xlw >= xup) fail <- TRUE if (!fail) h <- minQuadratic (y, f[i], g[i], k, xlw, xup) if (h$f < res) { res <- h$f sol <- h$x } } return (sol) } minimaxB <- function (y, f, g, k, lw, up) { n <- length (f) for (i in 1:(n - 1)) { for (j in (i + 1):n) { df <- f[i] - f[j] dg <- g[i] - g[j] dk <- k[i] - k[j] rt <- y + solve (polynomial (c (df, dg, dk / 2))) rt <- rt[which ((rt <= up) & (rt >= lw) & (!is.complex (rt)))] } } rt <- sort (unique (c(lw, up, rt))) m <- length (rt) fmin <- Inf for (j in 1:(m - 1)) { x <- (rt[j] + rt[j + 1]) / 2 smax <- -Inf for (i in 1:n) { si <- f[i] + g[i] * (x - y) + 0.5 * k[i] * (x - y) ^ 2 if (si > smax) { smax <- si l <- i } } h <- minQuadratic (y, f[l], g[l], k[l], rt[j], rt[j + 1]) if (h$f < fmin) { fmin <- h$f xmin <- h\$x } } return (xmin) }  #References