\documentclass[12pt,]{article}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\else % if luatex or xelatex
\ifxetex
\usepackage{mathspec}
\else
\usepackage{fontspec}
\fi
\defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\usepackage[margin=1in]{geometry}
\usepackage{hyperref}
\hypersetup{unicode=true,
pdftitle={Some Majorization Theory for Weighted Least Squares},
pdfauthor={Jan de Leeuw},
pdfborder={0 0 0},
breaklinks=true}
\urlstyle{same} % don't use monospace font for urls
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{{#1}}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{{#1}}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{{#1}}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{{#1}}}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{{#1}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{{#1}}}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{{#1}}}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{{#1}}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{{#1}}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{{#1}}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{{#1}}}
\newcommand{\ImportTok}[1]{{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{{#1}}}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{{#1}}}}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{{#1}}}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{{#1}}}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{{#1}}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{{#1}}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{{#1}}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{{#1}}}}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{{#1}}}}
\newcommand{\BuiltInTok}[1]{{#1}}
\newcommand{\ExtensionTok}[1]{{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{{#1}}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.77,0.63,0.00}{{#1}}}
\newcommand{\RegionMarkerTok}[1]{{#1}}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{{#1}}}}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{{#1}}}}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.94,0.16,0.16}{{#1}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.64,0.00,0.00}{\textbf{{#1}}}}
\newcommand{\NormalTok}[1]{{#1}}
\usepackage{graphicx,grffile}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{5}
% Redefines (sub)paragraphs to behave more like sections
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
%%% Use protect on footnotes to avoid problems with footnotes in titles
\let\rmarkdownfootnote\footnote%
\def\footnote{\protect\rmarkdownfootnote}
%%% Change title format to be more compact
\usepackage{titling}
% Create subtitle command for use in maketitle
\newcommand{\subtitle}[1]{
\posttitle{
\begin{center}\large#1\end{center}
}
}
\setlength{\droptitle}{-2em}
\title{Some Majorization Theory for Weighted Least Squares}
\pretitle{\vspace{\droptitle}\centering\huge}
\posttitle{\par}
\author{Jan de Leeuw}
\preauthor{\centering\large\emph}
\postauthor{\par}
\predate{\centering\large\emph}
\postdate{\par}
\date{Version 13, May 26, 2017}
\begin{document}
\maketitle
\begin{abstract}
In many situations in numerical analysis least squares loss functions
with diagonal weight matrices are much easier to minimize than least
square loss functions with full positive semi-definite weight matrices.
We use majorization to replace problems with a full weight matrix by a
sequence of diagonal weight matrix problems. Diagonal weights which
optimally approximate the full weights are computed using a simple
semi-definite programming procedure.
\end{abstract}
{
\setcounter{tocdepth}{3}
\tableofcontents
}
Note: This is a working paper which will be expanded/updated frequently.
All suggestions for improvement are welcome. The directory
\href{http://gifi.stat.ucla.edu/wls}{gifi.stat.ucla.edu/wls} has a pdf
version, the complete Rmd file, the R code, and the bib file.
\section{Introduction}\label{introduction}
In this paper we study the problem of minimizing a weighted least
squares loss function \(\sigma(x)=x'Wx\) over a set
\(\mathcal{X}\subseteq\mathbb{R}^n\). Here \(W\) is a positive
semi-definite matrix of order \(n\).
Examples are linear least squares, where \(\mathcal{X}\) is of the form
\(y-A\beta\), or isotone regression, where \(\mathcal{X}=y-\mathcal{K}\)
with \(K\) the cone of isotone vectors, or nonlinear least squares,
where the elements of \(\mathcal{X}\) are of the form \(y-F(\theta)\)
for \(F:\mathbb{R}^p\rightarrow\mathbb{R}^n\) and
\(\theta\in\mathbb{R}^p\).
The idea is that weighted least squares problems with general \(W\) are
more difficult to solve than problems with a diagonal \(W\). We use
majorization theory to reduce problems with non-diagonal weights to
sequences of problems with diagonal weights. General discussions of
majorization (also known as MM) algorithms are in De Leeuw (2016) and
Lange (2016). Majorization theory has been used earlier in the context
of weighted least squares problems by Heiser (1995), Kiers (1997) and
Groenen, Giaquinto, and Kiers (2003), although their applications are
limited to linear regression and low-rank matrix approximation (also
known as principal component analysis) with rather simple bounds.
\section{Majorization}\label{majorization}
Write \(x=z+(x-z)\) and thus \[
\sigma(x)=(z+(x-z))'W(z+(x-z))=\sigma(z)+2z'W(x-z)+(x-z)'W(x-z).
\] Now suppose we have a diagonal \(D\) satisfying \(D\gtrsim W\) in the
Loewner sense, i.e. \(D-W\) is positive semi-definite. Then
\(\sigma(x)\leq\eta(x,z)\) for all \(x,z\in\mathcal{X}\), where \[
\eta(x,z):=\sigma(z)+2z'W(x-z)+(x-z)'D(x-z),
\] and, of course, \(\sigma(x)=\eta(x,x)\) for all \(x\in\mathcal{X}\).
Thus \(\eta:\mathcal{X}\otimes\mathcal{X}\rightarrow\mathbb{R}\) is a
\emph{majorization scheme} in the sense of De Leeuw (2016), chapter 7.
The corresponding majorization algorithm is \[
x^{(k+1)}\in\mathop{\mathbf{Argmin}}_{x\in\mathcal{X}}\ \eta(x,x^{(k)}),
\] and the \emph{sandwich inequality} is \[
\sigma(x^{(k+1)})\leq\eta(x^{(k+1)},x^{(k)})\leq\eta(x^{(k)},x^{(k)})=\sigma(x^{(k)}).
\] The first inequality is strict if the minimizer of
\(\eta(x,x^{(k)})\) is unique, the second is strict if \(D-W\) is
positive definite.
Define \(r(z):=(I-D^{-1}W)z\). Then, completing the square, minimizing
\(\eta(x,z)\) over \(x\) is the same as minimizing \[
\rho(x,z):=(x-r(z))'D(x-r(z))
\] over \(x\).
Consequently an iteration of the majorization algorithm finds
\(x^{(k+1)}\) by projecting \(r(x^{(k)})\) on \(\mathcal{X}\), using the
metric \(D\).
\subsection{Singularity}\label{singularity}
It should be mentioned that our majorization method also applies if
\(W\) is singular. In fact we can show that \(D\) will be positive
definite in all situations of interest. Suppose \(D\) is singular,
i.e.~it has zero diagonal elements. Then \(D\gtrsim W\) implies that
\(W\) also has the same zero diagonal elements, and because \(W\) is
positive semi-definite all off-diagonal elements of \(W\) corresponding
with these diagonal elements are zero as well. If \(I\) is the index set
for which the diagonal elements of \(W\) are zero, then we must minimize
\(z_I'W_I^{\ }z_I^{\ }\) and for majorization use a positive definite
\(D_I\gtrsim W_I\). Thus majorization can be also be used to regularize
singular problems.
If \(W\) is indefinite it is still possible to choose a positive
definite diagonal \(D\) such that \(D-W\gtrsim 0\). We will still have a
majorization algorithm that generates a decreasing sequence of loss
function values, but the loss function may no longer be bounded below,
and thus the sequence of loss function values may not converge.
\section{Rate of Convergence}\label{rate-of-convergence}
\subsection{Linear Least Squares}\label{linear-least-squares}
For simplicity we start with the linear case. Thus
\(\mathcal{X}=\{x\mid x=y-A\beta\}\). Assume \(A\) to be of full column
rank, and assume \(D\) to be positive definite. We can write the
majorization scheme as \[
\eta(\beta,\gamma)=\sigma(\gamma)+2(y-A\gamma)'WA(\gamma-\beta)+(\gamma-\beta)'A'DA(\gamma-\beta).
\] Thus \[
\mathcal{D}_1\eta(\beta,\gamma)=-2A'W(y-A\gamma)-2A'DA(\gamma-\beta),
\] which is zero for \[
\beta=\gamma+(A'DA)^{-1}A'W(y-A\gamma)=(A'DA)^{-1}A'Wy+(I-(A'DA)^{-1}A'WA)\gamma.
\] The iterations take the form \[
\beta^{(k+1)}=(A'DA)^{-1}A'Wy+(I-(A'DA)^{-1}A'WA)\beta^{(k)},
\] and thus
\begin{equation}\label{E:lrate}
\beta^{(k+1)}-\beta^{(k)}=(I-(A'DA)^{-1}A'WA)(\beta^{(k)}-\beta^{(k-1)}).
\end{equation}
The speed of convergence of the iteration is determined by the
eigenvalues of \((A'DA)^{-1}A'WA\), which are less than or equal to one.
Thus the smaller \(D\) is, given that \(D-W\gtrsim 0\), the faster the
convergence will be. The linear convergence rate of our majorization
algorithm is the largest eigenvalue of \(I-(A'DA)^{-1}A'WA\).
\subsection{Non-linear Least Squares}\label{non-linear-least-squares}
For non-linear least squares \(\mathcal{X}=\{x\mid y-F(\theta)\}\). The
majorization scheme is \[
\eta(\theta,\xi)=\sigma(\xi)-2(y-F(\xi))'W(F(\theta)-F(\xi))+(F(\theta)-F(\xi))'D(F(\theta)-F(\xi)).
\] Minimizing over \(\theta\) for fixed \(\xi\) means solving \[
\mathcal{D}_1\eta(\theta,\xi)=-2G(\theta)'W(y-F(\xi))+2G(\theta)'D(F(\theta)-F(\xi))=0,
\] where \(G\theta)\) is short for \(\mathcal{D}F(\theta)\). This
implicitly defines \(\theta\) as a function of \(\xi\), and we have, at
a fixed point where \(\theta(\xi)=\xi\), \[
\mathcal{D}\theta(\xi)=-\left[\mathcal{D}_{11}\eta(\xi,\xi)\right]^{-1}\mathcal{D}_{12}\eta(\xi,\xi).
\] What remains to be done is computing expressions for the derivatives.
We have \[
\mathcal{D}_{12}\eta(\xi,\xi)=2G(\xi)'WG(\xi)-2G(\xi)'DG(\xi),
\] \[
\mathcal{D}_{11}\eta(\theta,\xi)=2G(\xi)'DG(\xi)-2H(\xi),
\] where \[
H(\xi)=\sum_{i=1}^n\mathcal{D}^2f_i(\xi)\sum_{j=1}^nw_{ij}(y_j-f_j(\xi)).
\] Thus
\begin{equation}\label{E:nlrate}
\mathcal{D}\theta(\xi)=[G(\xi)'DG(\xi)-H(\xi)]^{-1}(G(\xi)'DG(\xi)-G(\xi)'WG(\xi)).
\end{equation}
In the linear case \(H(\xi)=0\) and \(G(\xi)=A\), and we find the result
in equation \(\eqref{E:lrate}\). Again we see that, in general terms, it
will be good to have \(D\) as small as possible. Note that if we have
perfect fit, i.e.~if \(y=F(\xi)\), then again \(H(\xi)=0\) and \[
\mathcal{D}\theta(\xi)=(I-[G(\xi)'DG(\xi)]^{-1}G(\xi)'WG(\xi)).
\] Both in the linear and nonlinear case the derivatives \(G(\xi)\) and
\(A\) complicate the comparison of different diagonal matrices of
weights, but the general idea that we want both \(D-W\gtrsim 0\) and
\(D\) as small as possible seems to be a useful summary of the results
so far.
\section{Minimum Trace Majorization
Bound}\label{minimum-trace-majorization-bound}
Let us look at various ways in which we can have a small \(D\), while
satisfying \(D-W\gtrsim 0\).There are many ways, of course, to measure
how large \(D\) is. In the context of linear iterations, we have seen
that what matters is the size of \(A'DA\) relative to \(A'WA\), and
something very similar is true in the non-linear case. But in this
section we will look for diagonal matrices \(D\) that are small in a
more general sense, independent of the value of any derivatives or
coefficient matrices.
The set of all scalar matrices \(D=\lambda I\) with \(D\gtrsim W\) is
easy to describe. We have \(D\gtrsim W\) if and only if
\(\lambda\geq\lambda_{\text{max}}(W)\), the largest eigenvalue of \(W\).
Thus we also have \(D\gtrsim W\) if \(D=\|W\|I\), with \(W\) any matrix
norm. Perhaps the easiest bound is \(D=\mathbf{tr}(W)I\), although that
will tend to be really unsharp.
Another easy bound uses \(V=\mathbf{diag}(W)\). Now
\(V^{-\frac12}WV^{-\frac12}\) is a correlation matrix, which means its
largest eigenvalue is less than or equal to \(n\). Thus \(D=nV\)
satisfies \(D\gtrsim W\). In general, however, it seems that requiring
\(D\) to be scalar, and working with the largest eigenvalue, or even
bound on the largest eigenvalue, will lead to large \(D\), and
consequently slow convergence of the majorization algorithm. So we will
attempt to do better. For now we adopt the trace as a simple measure of
the size of \(D\), which does not take specific aspects of the least
squares problem into account.
What we try to solve in this section is to compute
\(\min_D\mathbf{tr}\ D\) over \(D-W\gtrsim 0\). Let's call this the
\emph{minimum trace majorization bound} or MTMB problem. Before we
start, let us note the similarity of the MTMB problem to \emph{minimum
trace factor analysis} (MTFA), in which we compute
\(\max_D\mathbf{tr}\ D\) over \(W-D\gtrsim 0\) (and \(D\gtrsim 0\) in
the case of constrained MTFA). See Watson (1992) and Jamshidian and
Bentler (1998) for MTFA algorithms. Both MTFA and MTMB are examples of
\emph{semidefinite programming} or SDP (Vandenberghe and Boyd (1996)).
We first analyse the MTMB problem a bit more in detail. The Lagrangian
is \[
\mathcal{L}(D,R)=\mathbf{tr}\ D-\mathbf{tr}\ R(D-W),
\] where \(R\gtrsim 0\) is a symmetric matrix of Lagrange multipliers
(or dual variables). The nececessary and sufficient conditions for a
minimum are
\begin{align*}
D-W&\gtrsim 0,\\
R&\gtrsim 0,\\
\mathbf{diag}(R)&=I,\\
R(D-W)=0.
\end{align*}
Because \[
\max_{R\gtrsim 0}\mathcal{L}(D,R)=\begin{cases}\mathbf{tr}\ D&\text{ if }D-W\gtrsim 0,\\+\infty&\text{ otherwise},\end{cases}
\] the primal problem is \(\min_D\max_{R\gtrsim 0}\mathcal{L}(D,R)\).
The dual function is \[
\min_D\mathcal{L}(D,R)=\begin{cases}\mathbf{tr}\ RW&\text{ if }\mathbf{diag}(R)=I,\\-\infty&\text{ otherwise},\end{cases}
\] and thus dual probem is to maximize \(\mathbf{tr}\ RW\) over
\(R\gtrsim 0\) with \(\mathbf{diag}(R)=I\), i.e.~over all correlation
matrices. Because both the primal and the dual problem are structly
feasible, the optimal value of both problems are equal.
\subsection{Extreme Correlation
Matrices}\label{extreme-correlation-matrices}
Correlation matrices, i.e.~positive semi-definite matrices with a unit
diagonal, form a compact convex \(\mathcal{E}_{n\times n}\) set in the
space of all symmetric matrices. Laurent and Poljak (1996) call this set
the \emph{elliptope}. A linear function on a compact convex set attains
its maximum at one of the extreme poonts of the set, i.e.~one of the
points that cannot be represented as a convex combination of two other
points in the set. Thus the solution of the dual problem of maximizing
\(\mathbf{tr}\ RW\) is an extreme point of the elliptope. This makes it
interesting for us to look at the structure of such extreme points
Correlation matrices of rank one are of the form \(xx'\), where \(x\) is
a vector with element \(\pm 1\). Such matrices are called \emph{cut
matrices} in Laurent and Poljak (1996), and their convex hull is the
\emph{cut polytope}. The max-cut problem is maximizing a linear function
over the cut polytope, and clearly maximizing that linear function over
the elliptope is a convex relaxation of the \emph{max cut} problem. Cut
matrices are extreme points of the elliptope, but having rank one is
only sufficient, not necessary, for extreme points. A comprehensive
recent paper on the rank and structure of the extreme points of the
elliptope, which also reviews previous results on the topic, is Li and
Tam (1994).
It was first shown by Grone, Pierce, and Watkins (1990) that the
elliptope \(\mathcal{E}_{n\times n}\) contains an extreme point of rank
\(r\) if and only if \(r(r+1)\leq 2n\). The face structure of the
elliptope was analyzed in Laurent and Poljak (1996), and necessary and
sufficient conditions for a correlation matrix of rank \(r\) to be an
extreme point are given in Ycart (1985), Li and Tam (1994),
Parthasarathy (2002), and HÃ¼rlimann (2015).
\section{Algorithm}\label{algorithm}
Although we have a choice of many different SDP algorithms, we use a
special purpose method to solve MTMB. It seems to work well. We maximize
\(\mathbf{tr}\ U'UW\) by cycling over the columns of \(U\), changing one
of them at a time, requiring throughout that \(u_i'u_i=1\). Clearly the
solution for column \(u_i\), keeping all other column at their current
value, is given by \(v=\sum_{k\not= i}w_{ij}u_k\) and then
\(u_i=v/\|v\|\). The appendix has a simple R function \texttt{maxR()} to
do exactly this
If we have found the solution, we use \(R(D-W)=0\) to find the solution
of the primal problem.
\subsection{Examples}\label{examples}
We give some example of application of our dual algorithm. Our first
matrix \(W\) is
\begin{verbatim}
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] +1.0 -1.0 +0.0 +0.0 +0.0 +0.0
## [2,] -1.0 +2.0 -1.0 +0.0 +0.0 +0.0
## [3,] +0.0 -1.0 +2.0 -1.0 +0.0 +0.0
## [4,] +0.0 +0.0 -1.0 +2.0 -1.0 +0.0
## [5,] +0.0 +0.0 +0.0 -1.0 +2.0 -1.0
## [6,] +0.0 +0.0 +0.0 +0.0 -1.0 +1.0
\end{verbatim}
This has largest eigenvalue 3.7320508076, and thus the best scalar \(D\)
has trace 22.3923048454. Using the trace of \(W\) for a scalar \(D\)
gives a trace of \(D\) of 60. The iterations of our algorithm, which
always start with \(R=I\), are
\begin{verbatim}
## itel 1 fold 10.000000 fnew 17.656854
## itel 2 fold 17.656854 fnew 19.431440
## itel 3 fold 19.431440 fnew 19.804726
## itel 4 fold 19.804726 fnew 19.914796
## itel 5 fold 19.914796 fnew 19.962473
## itel 6 fold 19.962473 fnew 19.983844
## itel 7 fold 19.983844 fnew 19.993073
## itel 8 fold 19.993073 fnew 19.997032
## itel 9 fold 19.997032 fnew 19.998729
## itel 10 fold 19.998729 fnew 19.999455
## itel 11 fold 19.999455 fnew 19.999767
## itel 12 fold 19.999767 fnew 19.999900
## itel 13 fold 19.999900 fnew 19.999957
## itel 14 fold 19.999957 fnew 19.999982
## itel 15 fold 19.999982 fnew 19.999992
## itel 16 fold 19.999992 fnew 19.999997
## itel 17 fold 19.999997 fnew 19.999999
## itel 18 fold 19.999999 fnew 19.999999
\end{verbatim}
The optimum \(R\) is a cut matrix, given as
\begin{verbatim}
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] +1.0 -1.0 +1.0 -1.0 +1.0 -1.0
## [2,] -1.0 +1.0 -1.0 +1.0 -1.0 +1.0
## [3,] +1.0 -1.0 +1.0 -1.0 +1.0 -1.0
## [4,] -1.0 +1.0 -1.0 +1.0 -1.0 +1.0
## [5,] +1.0 -1.0 +1.0 -1.0 +1.0 -1.0
## [6,] -1.0 +1.0 -1.0 +1.0 -1.0 +1.0
\end{verbatim}
and the diagonal of the corresponding \(D\) is
\begin{verbatim}
## [1] 2.000000 4.000000 4.000000 4.000000 4.000000 2.000000
\end{verbatim}
This is close to the best scalar \(D\), because in this first example
all eigenvalues of \(W\) are close.
In the second example we take \(W\) of order 6, but of rank one.
\begin{verbatim}
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] +0.0 +0.1 +0.1 +0.2 +0.2 +0.2
## [2,] +0.1 +0.2 +0.2 +0.3 +0.4 +0.5
## [3,] +0.1 +0.2 +0.4 +0.5 +0.6 +0.7
## [4,] +0.2 +0.3 +0.5 +0.6 +0.8 +1.0
## [5,] +0.2 +0.4 +0.6 +0.8 +1.0 +1.2
## [6,] +0.2 +0.5 +0.7 +1.0 +1.2 +1.4
\end{verbatim}
Its largest eigenvalue (and its trace) is 3.64. The iterations are
\begin{verbatim}
## itel 1 fold 3.640000 fnew 17.132389
## itel 2 fold 17.132389 fnew 17.633437
## itel 3 fold 17.633437 fnew 17.639916
## itel 4 fold 17.639916 fnew 17.639999
## itel 5 fold 17.639999 fnew 17.640000
\end{verbatim}
The cut matrix we find is
\begin{verbatim}
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] +1.0 +1.0 +1.0 +1.0 +1.0 +1.0
## [2,] +1.0 +1.0 +1.0 +1.0 +1.0 +1.0
## [3,] +1.0 +1.0 +1.0 +1.0 +1.0 +1.0
## [4,] +1.0 +1.0 +1.0 +1.0 +1.0 +1.0
## [5,] +1.0 +1.0 +1.0 +1.0 +1.0 +1.0
## [6,] +1.0 +1.0 +1.0 +1.0 +1.0 +1.0
\end{verbatim}
which corresponds with \(D\) equal to
\begin{verbatim}
## [1] 0.840000 1.680000 2.520000 3.360000 4.200000 5.040000
\end{verbatim}
\section{Majorization Examples}\label{majorization-examples}
\subsection{Monotone Regression}\label{monotone-regression}
Consider the problem of minimizing \((y-x)'W(y-x)\) over all \(x\)
satisfying \(x_1\leq\cdots\leq x_n\). If \(W\) is diagonal, this problem
can be solved quickly (in linear time) with the usual monotone
regression algorithms, and thus we can use majorization to reduce the
problem to a sequence of such monotone regressions.
In our example we use \(W\) equal to
\begin{verbatim}
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 1 1 1 1 1 1 1 1
## [2,] 1 2 2 2 2 2 2 2 2 2
## [3,] 1 2 3 3 3 3 3 3 3 3
## [4,] 1 2 3 4 4 4 4 4 4 4
## [5,] 1 2 3 4 5 5 5 5 5 5
## [6,] 1 2 3 4 5 6 6 6 6 6
## [7,] 1 2 3 4 5 6 7 7 7 7
## [8,] 1 2 3 4 5 6 7 8 8 8
## [9,] 1 2 3 4 5 6 7 8 9 9
## [10,] 1 2 3 4 5 6 7 8 9 10
\end{verbatim}
and \(y\) equal to
\begin{verbatim}
## [1] 1 3 2 3 3 1 1 4 4 1
\end{verbatim}
Matrix \(W\) has a largest eigenvalue equal to 44.7660686527 and a trace
equal to 55. The trace-optimal \(D\) is
\begin{verbatim}
## [1] 10 19 27 34 40 45 49 52 54 55
\end{verbatim}
We make three runs of the majorization version \texttt{jbkWPava()},
which uses \texttt{jbkPava()} from De Leeuw (2017) for the diagonal
monotone regression. Using a scalar matrix with the trace requires 355
iterations, using the largest eigenvalue uses 296 iterations, and using
the trace-optimal \(D\) uses 113 iterations. Of course this does not
address the question if the gain in the number of iterations offsets the
extra computation needed for either the largest eigenvalue or the
trace-optimal \(D\).
\subsection{Multidimensional Scaling}\label{multidimensional-scaling}
Suppose \(\Delta_r\) with \(r=1,...,R\) are independent and identically
distributed symmetric and hollow dissimilarity matrices of order \(n\).
Define for \(i