--- title: "Discrete Minimax by Quadratic Majorization" author: "Jan de Leeuw" date: "Version 15, December 12, 2016" output: html_document: keep_md: yes number_sections: yes toc: yes toc_depth: 3 pdf_document: keep_tex: yes number_sections: yes toc: yes toc_depth: 3 fontsize: 12pt graphics: yes bibliography: minimax.bib abstract: We construct piecewise quadratic majorizers for minimax problems. This is appled to finding roots of cubics. An application to a Chebyshev versions of MDS loss is also outlined. --- {r function_code, echo = FALSE} source("auxilary.R") source("minimax.R")  {r packages, echo = FALSE} options (digits = 10) library(captioner) library(polynom)  {r equation_numbering, echo = FALSE} figure_nums <- captioner (prefix = "Figure") theorem_nums <- captioner (prefix = "Theorem") figure_nums (name = "uniform", caption = "Uniform Piecewise Quadratic Majorizers", display = FALSE) figure_nums (name = "sharp", caption = "Sharp Piecewise Quadratic Majorizers", display = FALSE) theorem_nums(name = "major_max", caption = "[Majorization of Maximum]", display = FALSE)  Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory [gifi.stat.ucla.edu/minimax](http://gifi.stat.ucla.edu/minimax) has a pdf version, the complete Rmd file with all code chunks, the bib file, and the R source code. #Introduction Suppose $\mathcal{I}$ is the closed interval $[L,U]$, and $f:\mathcal{I}\rightarrow\mathbb{R}$. A function $g:\mathcal{I}\otimes\mathcal{I}\rightarrow\mathbb{R}$ is a *majorization scheme* for $f$ on $\mathcal{I}$ if * $g(x,x) = f(x)$ for all $x\in\mathcal{I}$, * $g(x,y) \geq f(x)$ for all $x,y\in\mathcal{I}$. In other words for each $y$ the global minimum of $g(x,y)-f(x)$ over $x\in\mathcal{I}$ is zero, and it is attained at $y$. If the functions $f$ and $g$ are differentiable and the minimum is attained at an interior point of the interval, we have $\mathcal{D}_1g(x,y)=\mathcal{D}f(x)$. If the functions are in addition twice differentiable we have $\mathcal{D}_{11}g(x,y)\geq \mathcal{D}^2f(x)$. The majorization conditions are not symmetric in $x$ and $y$, and consequently it sometimes is more clear to write $g_y(x)$ for $g(x,y)$, so that $g_y:\mathcal{I}\rightarrow\mathbb{R}$. We say that $g_y$ majorizes $f$ on $\mathcal{I}$ at $y$, or with *support point* $y$. A *majorization algorithm* is of the form $$x^{(k+1)}\in\mathop{\mathbf{argmin}}_{x\in\mathcal{I}}g(x,x^{(k)})$$ It then follows that $$\label{E:sandwich} f(x^{(k+1)})\leq g(x^{(k+1)},x^{(k)})\leq g(x^{(k)},x^{(k)})=f(x^{(k)}),$$ Thus a majorization step decreases the value of the objective function. The chain $\eqref{E:sandwich}$ is called the *sandwich inequality*. In $\eqref{E:sandwich}$ the inequality $f(x^{(k+1)})\leq g(x^{(k+1)},x^{(k)})$ follows from majorization, the inequality $g(x^{(k+1)},x^{(k)})\leq g(x^{(k)},x^{(k)})$ follows from minimization. This explains why majorization algorithms are also called *MM algorithms* (@lange_16). Using *MM* has the advantage that it can be used for the dual family of minorization-maximization algorithms. #Majorizing a Maximum **r theorem_nums("major_max", display = "f")** Suppose $p(x)=\max_{i=1}^n f_i(x)$, and $g_i$ majorizes $f_i$ on $\mathcal{I}$ at $y$. Define $q(x,y):=\max_{i=1}^n g_i(x,y)$. Then $q$ majorizes $p$ on $\mathcal{I}$ at $y$. **Proof:** First $$q(y,y)=\max_{i=1}^n g_i(y,y)=\max_{i=1}^n f_i(y)=p(y).$$ Second $$p(x)=\max_{i=1}^n f_i(x)=f_k(x)\leq g_k(x,y)\leq\max_{i=1}^n g_i(x,y)=q(x,y).$$ **QED** ##Quadratic Majorization of a Maximum Suppose the majorizing function $g_i$ are quadratics. We have $$\label{E:gi} g_i(x,y)=f_i(y)+f'_i(y)(x-y)+\frac12 K_i(x-y)^2.$$ In a majorization step we have to compute $$x^{(k+1)}=\mathop{\mathbf{argmin}}_{L\leq x\leq U} \max_{i=1}^n f_i(x^{(k)})+f'_i(x^{(k)})(x-x^{(k)})+\frac12 K_i(x-x^{(k)})^2.$$ Now it is easy, at least conceptually, to design a majorization algorithm. For each i {r first_distraction, fig.align= "center", fig.width = 10, fig.height = 10, echo = FALSE, cache = FALSE} a <- c(0,-1,0,1)/6 x <- seq (-2, 2, length = 1000) f <- polynomial (a) fx <- predict (f, x) gx <- - predict (f, x) df <- deriv (f) y <- .5 ffx <- predict (f, y) + predict (df, y) * (x-y) + (x - y) ^ 2 ggx <- -predict (f, y) - predict (df, y) * (x-y) + (x - y) ^ 2 plot(x, pmax(fx, gx), type = "l", col = "RED", lwd = 2) lines (x, pmax (ffx, ggx), col = "BLUE", lwd = 2) y <- -1.5 ffx <- predict (f, y) + predict (df, y) * (x-y) + (x - y) ^ 2 ggx <- -predict (f, y) - predict (df, y) * (x-y) + (x - y) ^ 2 lines (x, pmax (ffx, ggx), col = "BLUE", lwd = 2) y <- 0 ffx <- predict (f, y) + predict (df, y) * (x-y) + (x - y) ^ 2 ggx <- -predict (f, y) - predict (df, y) * (x-y) + (x - y) ^ 2 lines (x, pmax (ffx, ggx), col = "BLUE", lwd = 2) abline(v=.5) abline(v=-1.5) abline(v=0)  r figure_nums("uniform") The iterations of the majorization algorithm, started from -1.5, 0.5, and 0.0, are given next. Convergence is fast and ultimately seems to be quadratic. Note that, in this example at least, convergence is to the point where the two quadratic majorizers are equal. At this point $$f(x^{(k)})+f'(x^{(k)})(x^{(k+1)}-x^{(k)})=-f(x^{(k)})-f'(x^{(k)})(x^{(k+1)}-x^{(k)}),$$ or $$x^{(k+1)}=x^{(k)}-\frac{f(x^{(k)})}{f'(x^{(k)})},$$ which is a Newton step. {r ad_hoc, echo = FALSE} source("auxilary.R") a <- c(0,-1,0,1)/6 myAbs <- function (y, a, k, lw, up) { f <- polynomial (a) df <- deriv (f) ff <- c(predict (f, y), -predict (f, y)) gg <- c(predict (df, y), -predict (df, y)) return (minimaxQ (y, ff, gg, k, lw, up)) }  {r ad_hoc_2} h <- myIterator (xinit = -1.5, f = myAbs, a = a, k = 2, lw = -2, up = 2, verbose = TRUE, final = FALSE) h <- myIterator (xinit = 0.5, f = myAbs, a = a, k = 2, lw = -2, up = 2, verbose = TRUE, final = FALSE) h <- myIterator (xinit = 0.0, f = myAbs, a = a, k = 2, lw = -2, up = 2, verbose = TRUE, final = FALSE)  We can find better quadratic majorizers of our cubics by using the results of @deleeuw_E_16q. Define the sharp quadratic majorizers for which $$K(y):=\max_{L\leq x\leq U} f''(y)+\frac13 f'''(y)(x-y)=f''(y)+\frac13\max\ (f'''(L-y), f'''(U-y))$$ For\frac16(x^3-x)$this gives$K_1(y)=y+\frac13(U-y)$and for$-\frac16(x^3-x)$it gives$K_2(y)=-y-\frac13(L-y)$. If$L=-2$and$U=+2$then$K_1(y)=\frac23 y+\frac23$and$K_2(y)=-\frac23 y+\frac23$. At$y=-\frac32$this means$K_1=-\frac13$and$K_2=\frac53$. At$y=\frac12$we have$K_1=1$and$K_2=\frac13$. At$y=0$we have$K_1=K_2=\frac23$. The piecewise quadratc majorizers are in figure r figure_nums("sharp", display = "n"). Note that for both$y=-\frac32$and for$y=\frac12there are two interior knots. {r next_distraction, fig.align= "center", fig.width = 10, fig.height = 10, echo = FALSE, cache = FALSE} a <- c(0,-1,0,1)/6 x <- seq (-2, 2, length = 1000) f <- polynomial (a) fx <- predict (f, x) gx <- - predict (f, x) df <- deriv (f) y <- .5 k1 <- 1 k2 <- 1/3 ffx <- predict (f, y) + predict (df, y) * (x-y) + 0.5 * k1 * (x - y) ^ 2 ggx <- -predict (f, y) - predict (df, y) * (x-y) + 0.5 * k2 * (x - y) ^ 2 plot(x, pmax(fx, gx), type = "l", col = "RED", lwd = 2) lines (x, pmax (ffx, ggx), col = "BLUE", lwd = 2) y <- -1.5 k1 <- -1/3 k2 <- 5/3 ffx <- predict (f, y) + predict (df, y) * (x-y) + 0.5 * k1 * (x - y) ^ 2 ggx <- -predict (f, y) - predict (df, y) * (x-y) + 0.5 * k2 * (x - y) ^ 2 lines (x, pmax (ffx, ggx), col = "BLUE", lwd = 2) y <- 0 k1 <- 2/3 k2 <- 2/3 ffx <- predict (f, y) + predict (df, y) * (x-y) + 0.5 * k1 * (x - y) ^ 2 ggx <- -predict (f, y) - predict (df, y) * (x-y) + 0.5 * k2 * (x - y) ^ 2 lines (x, pmax (ffx, ggx), col = "BLUE", lwd = 2) abline(v=.5) abline(v=-1.5) abline(v=0)  r figure_nums("sharp") As expected, using no lifitng and sharp majorization requires fewer iterations to reach the roots, although of course each iteration is more work. {r ad_hoc_3, echo = FALSE} source("auxilary.R") a <- c(0,-1,0,1)/6 myAbs <- function (y, a, k, lw, up) { f <- polynomial (a) df <- deriv (f) ff <- c(predict (f, y), -predict (f, y)) gg <- c(predict (df, y), -predict (df, y)) return (minimaxB (y, ff, gg, k, lw, up)) }  {r ad_hoc_4} h <- myIterator (xinit = -1.5, f = myAbs, a = a, k = c(-1/3,5/3), lw = -2, up = 2, verbose = TRUE, final = FALSE) h <- myIterator (xinit = 0.5, f = myAbs, a = a, k = c(1,1/3), lw = -2, up = 2, verbose = TRUE, final = FALSE) h <- myIterator (xinit = 0.0, f = myAbs, a = a, k = c(2/3,2/3), lw = -2, up = 2, verbose = TRUE, final = FALSE)  The root-finding example can easily be extended to more interesting cases. If we have a convex functionf$with a quadratic majorizer, then$-F$is concave and, if so desired, it can be lifted to have the same quadratic term. ##MDS in the Chebyshev Norm Consider the multidimensional scaling (MDS) problem of minimizing the loss function $$\sigma(x):=\max_{i=1}^n w_i\left|\delta_i-\sqrt{x'A_ix}\right|.$$ Here the$w_i$are positive weights, the$\delta_i$are positive dissimilarties, and the$\sqrt{x'A_ix}\$ are Euclidean distances. We give the quadratic majorizations that are needed. The first majorization, based on Cauchy-Schwartz, is actually linear. This is the basic SMACOF majorization, due to @deleeuw_C_77. $$\delta_i-\sqrt{x'A_{i}x}\leq \delta_i-\frac{1}{\sqrt{y'A_{i}y}}x'A_{i}y.$$ The second majorization, based on the AM/GM inequality, is quadratic. It was first used in the MDS context by @heiser_91. $$\sqrt{x'A_ix}-\delta_i\leq \frac12\frac{1}{\sqrt{y'A_iy}}\{x'A_ix+y'A_iy\}-\delta_i$$ We can add an appropriate quadratic term to the first (linear) majorization to achieve multivariate lifting. We will not give a full implementation in this paper, but in principle it is quite straightforward. It also is of some interestto observe that the alternative loss funcion $$\sigma(x):=\max_{i=1}^n w_i\left|\delta_i^2-x'A_ix\right|$$ is already a maximum of quadratics, and consequently minimizing it does not require quadratic majorization. Without multivariate lifting, however, the regions where one quadratic function dominates all others can be quite complicated. #Appendix: Code ##auxilary.R {r file_auxilary, code = readLines("auxilary.R")}  ##minimax.R {r file_auxilary, code = readLines("minimax.R")}  #References