---
title: "Some Majorization Theory for Weighted Least Squares"
author: "Jan de Leeuw"
date: "Version 13, May 26, 2017"
output:
html_document:
keep_md: yes
number_sections: yes
toc: yes
pdf_document:
keep_tex: yes
number_sections: yes
toc: yes
toc_depth: 3
fontsize: 12pt
graphics: yes
bibliography: wls.bib
abstract: In many situations in numerical analysis least squares loss functions with
diagonal weight matrices are much easier to minimize than least square loss functions
with full positive semi-definite weight matrices. We use majorization to replace
problems with a full weight matrix by a sequence of diagonal weight matrix problems.
Diagonal weights which optimally approximate the full weights are computed using
a simple semi-definite programming procedure.
---
```{r function_code, echo = FALSE}
source("maxR.R")
source("jbkPava.R")
```
```{r packages, echo = FALSE}
options (digits = 10)
library (captioner)
mprint <- function (x,
d = 6,
w = 8,
f = "") {
print (noquote (formatC (
x,
di = d,
wi = w,
fo = "f",
flag = f
)))
}
```
```{r equation_numbering, echo = FALSE}
figure_nums <- captioner (prefix = "Figure")
theorem_nums <- captioner (prefix = "Theorem")
lemma_nums <- captioner (prefix = "Lemma")
corollary_nums <- captioner (prefix = "Corollary")
definition_nums <- captioner (prefix = "Definition")
```
Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory [gifi.stat.ucla.edu/wls](http://gifi.stat.ucla.edu/wls) has a pdf version, the complete Rmd file, the R code, and the bib file.
#Introduction
In this paper we study the problem of minimizing a weighted least squares loss function $\sigma(x)=x'Wx$ over a set $\mathcal{X}\subseteq\mathbb{R}^n$.
Here $W$ is a positive semi-definite matrix of order $n$.
Examples are linear least squares, where
$\mathcal{X}$ consists of the vectors $y-A\beta$, or isotone regression, where $\mathcal{X}=y-\mathcal{K}$ with $\mathcal{K}$ the cone of isotone vectors, or nonlinear least
squares, where the elements of $\mathcal{X}$ are of the form $y-F(\theta)$ for $F:\mathbb{R}^p\rightarrow\mathbb{R}^n$ and $\theta\in\mathbb{R}^p$.
The idea is that weighted least squares problems with general $W$ are more difficult to solve than problems with a diagonal $W$. We use majorization theory to
reduce problems with non-diagonal weights to sequences of problems with diagonal weights. General discussions of majorization (also known as MM) algorithms
are in @deleeuw_B_16b and @lange_16. Majorization theory has been used earlier in the context of weighted least squares problems by @heiser_95,
@kiers_97 and @groenen_giaquinto_kiers_03, although their applications are limited to linear regression and low-rank matrix approximation (also known as principal component
analysis) with rather simple bounds.
#Majorization
Write $x=z+(x-z)$ and thus
$$
\sigma(x)=(z+(x-z))'W(z+(x-z))=\sigma(z)+2z'W(x-z)+(x-z)'W(x-z).
$$
Now suppose we have a diagonal $D$ satisfying $D\gtrsim W$ in the Loewner sense, i.e. $D-W$ is positive semi-definite. Then $\sigma(x)\leq\eta(x,z)$ for all $x,z\in\mathcal{X}$, where
$$
\eta(x,z):=\sigma(z)+2z'W(x-z)+(x-z)'D(x-z),
$$
and, of course, $\sigma(x)=\eta(x,x)$ for all $x\in\mathcal{X}$. Thus $\eta:\mathcal{X}\otimes\mathcal{X}\rightarrow\mathbb{R}$ is a *majorization scheme* in the sense of
@deleeuw_B_16b, chapter 7. The corresponding majorization algorithm is
$$
x^{(k+1)}\in\mathop{\mathbf{Argmin}}_{x\in\mathcal{X}}\ \eta(x,x^{(k)}),
$$
and the *sandwich inequality* is
$$
\sigma(x^{(k+1)})\leq\eta(x^{(k+1)},x^{(k)})\leq\eta(x^{(k)},x^{(k)})=\sigma(x^{(k)}).
$$
The first inequality is strict if the minimizer of $\eta(x,x^{(k)})$ is unique, the second is strict if $D-W$ is positive definite.
Define $r(z):=(I-D^{-1}W)z$. Then, completing the square, minimizing $\eta(x,z)$ over $x$ is the same as minimizing
$$
\rho(x,z):=(x-r(z))'D(x-r(z))
$$
over $x$.
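To verify the completing-the-square step, note that $x-r(z)=(x-z)+D^{-1}Wz$, so that
$$
\rho(x,z)=(x-z)'D(x-z)+2(x-z)'Wz+z'WD^{-1}Wz,
$$
and thus $\eta(x,z)=\sigma(z)+\rho(x,z)-z'WD^{-1}Wz$, where the last term does not depend on $x$.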
Consequently an iteration of the majorization algorithm finds $x^{(k+1)}$ by projecting $r(x^{(k)})$ on $\mathcal{X}$, using the metric $D$.
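As a small self-contained illustration (with made-up data; the constraint set, iteration count, and variable names here are ours, not part of the paper), take $\mathcal{X}=\{y-b\mid b\geq 0\}$, i.e. $x\leq y$ elementwise. Because $D$ is diagonal, the $D$-metric projection on this set is a componentwise truncation, and the algorithm becomes:

```r
# Sketch of the majorization iteration x^{(k+1)} = proj_X(r(x^{(k)})) in the
# metric D, for the illustrative set X = {x : x <= y} and the scalar bound
# D = lambda_max(W) I. All data are hypothetical.
set.seed(1)
n <- 5
s <- matrix(rnorm(n * n), n, n)
w <- crossprod(s)                    # a positive semi-definite weight matrix
d <- rep(max(eigen(w)$values), n)    # diagonal of D, so that D >= W (Loewner)
y <- rnorm(n)
sigma <- function(x) sum(x * (w %*% x))
x <- y                               # feasible start
for (k in 1:200) {
  r <- x - drop(w %*% x) / d         # r(z) = (I - D^{-1} W) z
  x <- pmin(r, y)                    # projection on {x <= y} in the metric D
}
```

With a non-diagonal $D$ the projection would itself be a quadratic program; the diagonal structure is what makes each step cheap.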
##Singularity
It should be mentioned that our majorization method also applies if $W$ is singular. In fact we can show that $D$ will be positive definite in all situations of interest.
Suppose $D$ is singular, i.e. it has zero diagonal elements. Then $D\gtrsim W$ implies that $W$ has zeroes in the same diagonal positions, and because $W$ is
positive semi-definite the off-diagonal elements of $W$ in the corresponding rows and columns are zero as well. Thus, if $I$ is the index set for which the diagonal elements of $W$ are positive, it suffices to minimize $z_I'W_I^{\ }z_I^{\ }$, using for majorization a positive definite $D_I\gtrsim W_I$. Thus majorization can also be used to regularize singular problems.
If $W$ is indefinite it is still possible to choose a positive definite diagonal $D$ such that $D-W\gtrsim 0$. We will still have a majorization algorithm
that generates a decreasing sequence of loss function values, but the loss function may no longer be bounded below, and thus the sequence of loss function values may
not converge.
#Rate of Convergence
##Linear Least Squares
For simplicity we start with the linear case. Thus $\mathcal{X}=\{x\mid x=y-A\beta\}$. Assume $A$ to be of full column rank, and assume $D$ to be positive definite.
We can write the majorization scheme as
$$
\eta(\beta,\gamma)=\sigma(\gamma)+2(y-A\gamma)'WA(\gamma-\beta)+(\gamma-\beta)'A'DA(\gamma-\beta).
$$
Thus
$$
\mathcal{D}_1\eta(\beta,\gamma)=-2A'W(y-A\gamma)-2A'DA(\gamma-\beta),
$$
which is zero for
$$
\beta=\gamma+(A'DA)^{-1}A'W(y-A\gamma)=(A'DA)^{-1}A'Wy+(I-(A'DA)^{-1}A'WA)\gamma.
$$
The iterations take the form
$$
\beta^{(k+1)}=(A'DA)^{-1}A'Wy+(I-(A'DA)^{-1}A'WA)\beta^{(k)},
$$
and thus
\begin{equation}\label{E:lrate}
\beta^{(k+1)}-\beta^{(k)}=(I-(A'DA)^{-1}A'WA)(\beta^{(k)}-\beta^{(k-1)}).
\end{equation}
The speed of convergence of the iteration is determined by the eigenvalues of $(A'DA)^{-1}A'WA$, which are less than or equal to one. Thus the smaller $D$ is, given that $D-W\gtrsim 0$, the faster the convergence will be. The linear convergence rate of our majorization algorithm is the largest eigenvalue of $I-(A'DA)^{-1}A'WA$.
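A quick numerical check of this rate (with made-up $A$, $W$, and $y$; the ridge term added to $W$ merely keeps the example well-conditioned):

```r
# Verify that the iteration converges to the weighted least squares solution
# and that the rate matrix I - (A'DA)^{-1} A'WA has spectral radius below one.
set.seed(2)
n <- 20; p <- 3
a <- matrix(rnorm(n * p), n, p)
s <- matrix(rnorm(n * n), n, n)
w <- crossprod(s) / n + diag(n)          # positive definite weights
y <- rnorm(n)
d <- diag(max(eigen(w)$values), n)       # scalar majorizing D
m <- solve(crossprod(a, d %*% a))        # (A'DA)^{-1}
rate <- max(abs(eigen(diag(p) - m %*% crossprod(a, w %*% a))$values))
beta <- matrix(0, p, 1)
for (k in 1:500) {
  beta <- beta + m %*% crossprod(a, w %*% (y - a %*% beta))
}
bstar <- solve(crossprod(a, w %*% a), crossprod(a, w %*% y))  # direct solution
```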
##Non-linear Least Squares
For non-linear least squares $\mathcal{X}=\{x\mid x=y-F(\theta)\}$. The majorization scheme is
$$
\eta(\theta,\xi)=\sigma(\xi)-2(y-F(\xi))'W(F(\theta)-F(\xi))+(F(\theta)-F(\xi))'D(F(\theta)-F(\xi)).
$$
Minimizing over $\theta$ for fixed $\xi$ means solving
$$
\mathcal{D}_1\eta(\theta,\xi)=-2G(\theta)'W(y-F(\xi))+2G(\theta)'D(F(\theta)-F(\xi))=0,
$$
where $G(\theta)$ is short for $\mathcal{D}F(\theta)$. This implicitly defines $\theta$ as a function of $\xi$, and we have, at a fixed point where $\theta(\xi)=\xi$,
$$
\mathcal{D}\theta(\xi)=-\left[\mathcal{D}_{11}\eta(\xi,\xi)\right]^{-1}\mathcal{D}_{12}\eta(\xi,\xi).
$$
What remains to be done is computing expressions for the derivatives. We have
$$
\mathcal{D}_{12}\eta(\xi,\xi)=2G(\xi)'WG(\xi)-2G(\xi)'DG(\xi),
$$
$$
\mathcal{D}_{11}\eta(\xi,\xi)=2G(\xi)'DG(\xi)-2H(\xi),
$$
where
$$
H(\xi)=\sum_{i=1}^n\mathcal{D}^2f_i(\xi)\sum_{j=1}^nw_{ij}(y_j-f_j(\xi)).
$$
Thus
\begin{equation}\label{E:nlrate}
\mathcal{D}\theta(\xi)=[G(\xi)'DG(\xi)-H(\xi)]^{-1}(G(\xi)'DG(\xi)-G(\xi)'WG(\xi)).
\end{equation}
In the linear case $H(\xi)=0$ and $G(\xi)=A$, and we find the result in equation $\eqref{E:lrate}$. Again we see that, in general terms, it will be good to have $D$ as small as possible. Note that if we have perfect fit, i.e. if $y=F(\xi)$, then again $H(\xi)=0$ and
$$
\mathcal{D}\theta(\xi)=(I-[G(\xi)'DG(\xi)]^{-1}G(\xi)'WG(\xi)).
$$
Both in the linear and nonlinear case the derivatives $G(\xi)$ and $A$ complicate the comparison of different diagonal matrices of weights, but the general idea that we
want both $D-W\gtrsim 0$ and $D$ as small as possible seems to be a useful summary of the results so far.
#Minimum Trace Majorization Bound
Let us look at various ways in which we can have a small $D$ while satisfying $D-W\gtrsim 0$. There are many ways, of course, to measure how large $D$ is. In the context of linear iterations we have seen that what matters is the size of $A'DA$ relative to $A'WA$, and something very similar is true in the non-linear case. But in this section we will look for diagonal matrices $D$ that are small in a more general sense, independent of the value of any derivatives or coefficient matrices.
The set of all scalar matrices $D=\lambda I$ with $D\gtrsim W$ is easy to describe. We have $D\gtrsim W$ if and only if $\lambda\geq\lambda_{\text{max}}(W)$, the largest eigenvalue of $W$. Thus we also have $D\gtrsim W$ if $D=\|W\|I$, with $\|\cdot\|$ any matrix norm. Perhaps the easiest such bound is $D=\mathbf{tr}(W)I$, although that bound will usually be quite loose.
Another easy bound uses $V=\mathbf{diag}(W)$. Now $V^{-\frac12}WV^{-\frac12}$ is a correlation matrix, which means its largest eigenvalue is less than or equal to $n$. Thus
$D=nV$ satisfies $D\gtrsim W$. In general, however, it seems that requiring $D$ to be scalar, and working with the largest eigenvalue, or even bound on the largest eigenvalue, will lead to large $D$, and consequently slow convergence of the majorization algorithm.
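The three easy bounds can be compared directly on a small example (hypothetical $W$; variable names are ours):

```r
# Compare traces of three easy majorizing matrices: the best scalar bound
# lambda_max(W) I, the trace bound tr(W) I, and D = n V with V = diag(W).
set.seed(3)
s <- matrix(rnorm(36), 6, 6)
w <- crossprod(s)
n <- nrow(w)
lmax <- max(eigen(w)$values)
bounds <- list(lmax_scalar  = diag(lmax, n),
               trace_scalar = diag(sum(diag(w)), n),
               n_times_diag = n * diag(diag(w)))
# every candidate must satisfy D - W >= 0 in the Loewner order
loewner_ok <- sapply(bounds, function(d) min(eigen(d - w)$values) >= -1e-8)
traces <- sapply(bounds, function(d) sum(diag(d)))
```

All three pass the Loewner check, but the scalar $\lambda_{\text{max}}(W)$ bound has the smallest trace of the three.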
So we will attempt to do better. For now we adopt the trace as a simple measure of the size of $D$, which does not take specific aspects of the least squares problem into account.
The problem we solve in this section is to compute $\min_D\mathbf{tr}\ D$ over $D-W\gtrsim 0$. Let's call this the *minimum trace majorization bound* or MTMB problem. Before we start, let us note the similarity of the MTMB problem to *minimum trace factor analysis* (MTFA), in which we compute $\max_D\mathbf{tr}\ D$ over $W-D\gtrsim 0$ (and $D\gtrsim 0$ in the case of constrained MTFA). See @watson_92 and @jamshidian_bentler_98 for MTFA algorithms. Both MTFA and MTMB are examples of *semidefinite programming* or SDP (@vandenberghe_boyd_96).
We first analyse the MTMB problem in more detail. The Lagrangian is
$$
\mathcal{L}(D,R)=\mathbf{tr}\ D-\mathbf{tr}\ R(D-W),
$$
where $R\gtrsim 0$ is a symmetric matrix of Lagrange multipliers (or dual variables). The necessary and sufficient conditions for a minimum are
\begin{align*}
D-W&\gtrsim 0,\\
R&\gtrsim 0,\\
\mathbf{diag}(R)&=I,\\
R(D-W)&=0.
\end{align*}
Because
$$
\max_{R\gtrsim 0}\mathcal{L}(D,R)=\begin{cases}\mathbf{tr}\ D&\text{ if }D-W\gtrsim 0,\\+\infty&\text{ otherwise},\end{cases}
$$
the primal problem is $\min_D\max_{R\gtrsim 0}\mathcal{L}(D,R)$.
The dual function is
$$
\min_D\mathcal{L}(D,R)=\begin{cases}\mathbf{tr}\ RW&\text{ if }\mathbf{diag}(R)=I,\\-\infty&\text{ otherwise},\end{cases}
$$
and thus the dual problem is to maximize $\mathbf{tr}\ RW$ over $R\gtrsim 0$ with $\mathbf{diag}(R)=I$, i.e. over all correlation matrices. Because both the primal and the dual problem are strictly feasible, the optimal values of the two problems are equal.
##Extreme Correlation Matrices
Correlation matrices, i.e. positive semi-definite matrices with a unit diagonal, form a compact convex set $\mathcal{E}_{n\times n}$ in the space of all symmetric matrices. @laurent_poljak_96 call this set the *elliptope*. A linear function on a compact convex set attains its maximum at one of the extreme points of the set, i.e. at a point that cannot be represented as a convex combination of two other points in the set. Thus the solution of the dual problem of maximizing $\mathbf{tr}\ RW$ is an extreme point of the
elliptope. This makes it interesting for us to look at the structure of such extreme points.
Correlation matrices of rank one are of the form $xx'$, where $x$ is a vector with elements $\pm 1$. Such matrices are called *cut matrices* in @laurent_poljak_96, and their convex hull is the *cut polytope*. The *max-cut* problem maximizes a linear function over the cut polytope, and maximizing that same linear function over the elliptope is clearly a convex relaxation of the max-cut problem. Cut matrices are extreme points of the elliptope, but having rank one is
sufficient, not necessary, for being an extreme point. A comprehensive recent paper on the rank and structure of the extreme points of the elliptope, which also reviews previous results on the topic, is @li_tam_94.
It was first shown by @grone_pierce_watkins_90 that the elliptope $\mathcal{E}_{n\times n}$ contains an extreme point of rank $r$ if and only if $r(r+1)\leq 2n$. The face structure of the elliptope was analyzed in @laurent_poljak_96, and necessary and sufficient conditions for a correlation matrix of rank $r$ to be an extreme point are given in @ycart_85, @li_tam_94, @partharasarathy_02, and @huerlimann_15.
#Algorithm
Although we have a choice of many different SDP algorithms, we use a special purpose method to solve MTMB. It seems to work well. We maximize $\mathbf{tr}\ U'UW$ by cycling over the columns of $U$, changing one of them at a time, requiring throughout that $u_i'u_i=1$. The solution for column $u_i$, keeping all other columns at their current values, is
given by computing $v=\sum_{k\not= i}w_{ik}u_k$ and setting $u_i=v/\|v\|$. The appendix has a simple R function `maxR()` that does exactly this.
If we have found the dual solution, we use $R(D-W)=0$ to find the solution of the primal problem: because $\mathbf{diag}(R)=I$, taking the diagonal of $RD=RW$ gives $d_i=(RW)_{ii}$.
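The following is a minimal standalone sketch of this column-cycling scheme, recovering the primal $D$ from the diagonal of $RW$ at the end. It is illustrative only; the paper's `maxR()` in the appendix is the reference implementation, and the function and variable names here are ours.

```r
# Coordinate ascent on tr(U'UW) over unit-length columns of U, starting at
# R = U'U = I, followed by primal recovery d_i = (RW)_ii.
mtmb_sketch <- function(w, itmax = 1000, eps = 1e-10) {
  n <- nrow(w)
  u <- diag(n)                                 # start at R = I
  fold <- sum(diag(crossprod(u) %*% w))
  for (it in 1:itmax) {
    for (i in 1:n) {
      v <- u[, -i, drop = FALSE] %*% w[-i, i]  # v = sum_{k != i} w_{ik} u_k
      if (sqrt(sum(v ^ 2)) > 0) u[, i] <- v / sqrt(sum(v ^ 2))
    }
    fnew <- sum(diag(crossprod(u) %*% w))
    if (fnew - fold < eps) break               # no further dual improvement
    fold <- fnew
  }
  r <- crossprod(u)
  list(r = r, d = diag(r %*% w), f = fnew)
}
```

For a rank-one $W=xx'$ with positive $x$ this reproduces the known minimum trace $(\sum_i|x_i|)^2$, attained at $d_i=|x_i|\sum_j|x_j|$.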
##Examples
We give some examples of the application of our dual algorithm. Our first matrix $W$ is
```{r first_w, echo = FALSE}
x<-diff(diag(6))
w<-crossprod(x)
```
```{r mprint_first_w, echo = FALSE}
mprint (w, w = 3, d = 1, f = "+")
e <- max(eigen(w)$values)
```
This has largest eigenvalue `r e`, and thus the best scalar $D$ has trace `r nrow(w)*e`. Using the trace of $W$ for a scalar $D$ gives $D$ a trace of `r nrow(w)*sum(diag(w))`. The iterations of our algorithm, which always start with $R=I$, are
```{r mtmb_1, echo = FALSE}
h<-maxR(w, verbose = TRUE)
```
The optimum $R$ is a cut matrix, given as
```{r mtmb_1_r, echo = FALSE}
mprint(h$r, w = 3, d = 1, f = "+")
```
and the diagonal of the corresponding $D$ is
```{r mtmb_1_d, echo = FALSE}
mprint(h$d)
```
This is close to the best scalar $D$, because in this first example all eigenvalues of $W$ are close together.
In the second example we take $W$ of order 6, but of rank one.
```{r mtmb_2_w, echo = FALSE}
x <- (1:6) / 5
w<-outer(x, x)
mprint (w, w = 3, d = 1, f = "+")
e <- max (eigen (w)$values)
```
Its largest eigenvalue (and its trace) is `r e`. The iterations are
```{r mtmb_2, echo = FALSE}
h<-maxR(w, verbose = TRUE)
```
The cut matrix we find is
```{r mtmb_2_r, echo = FALSE}
mprint(h$r, w = 3, d = 1, f = "+")
```
which corresponds with $D$ equal to
```{r mtmb_2_d, echo = FALSE}
mprint(h$d)
```
#Majorization Examples
##Monotone Regression
Consider the problem of minimizing $(y-x)'W(y-x)$ over all $x$ satisfying $x_1\leq\cdots\leq x_n$. If $W$ is diagonal, this problem can be solved quickly (in linear time) with the
usual monotone regression algorithms, and thus we can use majorization to reduce the problem to a sequence of such monotone regressions.
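To make the loop concrete, here is a standalone sketch (the paper's `jbkWPava()`, built on `jbkPava()`, is the reference implementation; `wpava()` below is a plain textbook pooled-adjacent-violators routine for the diagonal subproblems, and all names and data are ours):

```r
# Weighted isotone regression: minimize sum d_i (u_i - x_i)^2, x nondecreasing.
wpava <- function(y, w) {
  x <- y; wt <- w; idx <- as.list(seq_along(y))
  i <- 1
  while (i < length(x)) {
    if (x[i] > x[i + 1]) {                   # pool adjacent violators
      pooled <- (wt[i] * x[i] + wt[i + 1] * x[i + 1]) / (wt[i] + wt[i + 1])
      x   <- c(x[seq_len(i - 1)], pooled, x[-seq_len(i + 1)])
      wt  <- c(wt[seq_len(i - 1)], wt[i] + wt[i + 1], wt[-seq_len(i + 1)])
      idx <- c(idx[seq_len(i - 1)], list(c(idx[[i]], idx[[i + 1]])),
               idx[-seq_len(i + 1)])
      i <- max(i - 1, 1)                     # recheck the previous pair
    } else i <- i + 1
  }
  unlist(lapply(seq_along(x), function(b) rep(x[b], length(idx[[b]]))))
}
# Majorization loop: each step is a diagonal-weight isotone regression.
monreg_major <- function(y, w, d, itmax = 1000, eps = 1e-10) {
  x <- wpava(y, d)                           # feasible start
  for (it in 1:itmax) {
    z <- y - x                               # current residual
    target <- x + drop(w %*% z) / d          # y - r(z), the projection target
    xnew <- wpava(target, d)
    if (max(abs(xnew - x)) < eps) break
    x <- xnew
  }
  x
}
```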
```{r monreg_example, echo = FALSE}
x<-matrix(0,10,10)
x[outer(1:10,1:10,"<=")]<-1
w<-crossprod(x)
e<-max(eigen(w)$values)
f<-sum(diag(w))
y<-c(1,3,2,3,3,1,1,4,4,1)
dopt <- maxR(w)$d
```
In our example we use $W$ equal to
```{r monreg_w, echo = FALSE}
mprint(w, d=0, w=2)
```
and $y$ equal to
```{r monreg_y, echo = FALSE}
mprint(y, d=0, w=2)
```
Matrix $W$ has a largest eigenvalue equal to `r e` and a trace equal to `r f`. The trace-optimal $D$ is
```{r monreg_d, echo = FALSE}
mprint(dopt, d=0, w=2)
```
```{r monreg_comp_1, echo = FALSE}
h1 <- jbkWPava (y, w, d = rep(e, 10), itmax = 10000, verbose = FALSE)
h2 <- jbkWPava (y, w, d = rep(f, 10), itmax = 10000, verbose = FALSE)
h3 <- jbkWPava (y, w, d = dopt, itmax = 10000, verbose = FALSE)
```
We make three runs of the majorization version `jbkWPava()`, which uses `jbkPava()` from @deleeuw_E_17g for the diagonal monotone regressions. Using a scalar matrix with the trace
requires `r h2$itel` iterations, using the largest eigenvalue uses `r h1$itel` iterations, and using the trace-optimal $D$ uses `r h3$itel` iterations. Of course this does not address the question of whether the gain in the number of iterations offsets the extra computation needed for either the largest eigenvalue or the trace-optimal $D$.
##Multidimensional Scaling
Suppose $\Delta_r$ with $r=1,...,R$ are independent and identically distributed symmetric and hollow dissimilarity matrices of order $n$. Define for $i