0\) for \(x\in\mathcal{X}\). Then \[ g(x,y)=f(y)+a(x)-f(y)b(x) \] D-majorizes \(f\) on \(\mathcal{X}\). Clearly \[ g(y,y)=f(y)+a(y)-\frac{a(y)}{b(y)}b(y)=f(y). \] Moreover \(g(x,y)\leq g(y,y)\) can be written as \[ f(y)+a(x)-f(y)b(x)\leq f(y), \] which implies \(f(x)\leq f(y)\). Suppose \[ g(x,y)=f(y)+f'(y)(x-y)+\frac12 a(x-y)^2, \] with \(a>0\). Now \(g(x,y)\leq f(y)\) if and only if \[ (x-y)(f'(y)+\frac12a(x-y))\leq 0, \] i.e.~if and only if \(x\) is in the interval between \(y\) and \(y-\frac{2f'(y)}{a}\). For a cubic \[ f(x)=f(y)+f'(y)(x-y)+\frac12 f''(y)(x-y)^2+\frac16 f'''(x-y)^3 \] we require that \[ (x-y)(f'(y)+\frac12 f''(y)(x-y)+\frac16 f'''(x-y)^2)\leq 0 \] for all \(x\) in the interval between \(y\) and \(y-\frac{2f'(y)}{a}\). Suppose \(f\) is increasing and differentiable. We have \(f'(y)\geq 0\) for all \(y\). Thus for any \[ x\in[y-\frac{2f'(y)}{a},y] \] we have \(x\leq y\) and consequently \(f(x)\leq f(y)\). In other words: an increasing function is Dinkelbach majorized by any convex quadratic. Consider \(f\) with \(f(x)=\frac14 x^4\) and let us majorize at \(y=1\) by a convex quadratic \(g\) which has both \(g(1)=f(1)=\frac14\) and \(g'(1)=f'(1)=1.\) It follows that \[ g(x)=ax^2+(1-2a)x+(a-\frac34) \] for some \(a\geq 0\). Note that if we also require that \(g''(1)\geq f''(1)\) then we need \(a\geq\frac32.\) If we compute \(h=g-f\) we find \[ h(x)=-\frac14(x-1)^2(x^2+2x+(3-4a)). \] Majorization at \(y\) would mean \(h(x)\geq 0\) for all \(x\), which is clearly impossible, because \(h(x)\) will be negative for \(x\) very large and \(x\) very small. We have \(h(x)\geq 0\) if and only if \(q(x)\mathop{=}\limits^{\Delta}x^2+2x+(3-4a)\leq 0\). The quadratic \(q\) can only be non-positive if it has two real roots, which happens if \(a\geq\frac12\), and then \(q\) is non-positive between its two real roots, which are \(-1-\sqrt{4a-2}\) and \(-1+\sqrt{4a-2}.\) The sublevel set \(\mathcal{L}(1)=\{x\mid g(x)\leq\frac14\}\) is the interval \([\frac{a-1}{a},1]\). 
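The computations in this example are easy to check numerically. A minimal sketch, using numpy; the helper names \(f\) and \(g\) are ours, not part of the text:

```python
import numpy as np

# Numerical check of the quartic example:
# f(x) = x^4/4, majorized at y = 1 by g(x) = a x^2 + (1-2a) x + (a - 3/4).
def f(x):
    return 0.25 * x**4

def g(x, a):
    return a * x**2 + (1 - 2 * a) * x + (a - 0.75)

a = 1.5
x = np.linspace(-3.0, 3.0, 1001)

# h = g - f should equal -(1/4)(x-1)^2 (x^2 + 2x + (3-4a))
h_direct = g(x, a) - f(x)
h_factored = -0.25 * (x - 1)**2 * (x**2 + 2 * x + (3 - 4 * a))
assert np.allclose(h_direct, h_factored)

# the roots of q(x) = x^2 + 2x + (3-4a), real because a >= 1/2
r1, r2 = -1 - np.sqrt(4 * a - 2), -1 + np.sqrt(4 * a - 2)
assert np.allclose([r1**2 + 2 * r1 + 3 - 4 * a,
                    r2**2 + 2 * r2 + 3 - 4 * a], 0.0)

# the sublevel set L(1) = {x : g(x) <= 1/4} has endpoints (a-1)/a and 1
assert abs(g((a - 1) / a, a) - 0.25) < 1e-12
assert abs(g(1.0, a) - 0.25) < 1e-12
```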
For S-majorization we need \(h(x)\geq 0\) for all \(x\in\mathcal{L}(1)\), which means that the interval \([\frac{a-1}{a},1]\) must be a subset of the interval \([-1-\sqrt{4a-2},-1+\sqrt{4a-2}].\) Thus we must have \(1\leq -1+\sqrt{4a-2}\) as well as \(\frac{a-1}{a}\geq -1-\sqrt{4a-2}.\) The first inequality gives \(a\geq\frac32\), the second one \(a\geq\frac12\). Thus we have S-majorization at \(y=1\) if and only if \(a\geq\frac32\). Figure 1 shows the S-majorization with \(a=3/2\), first globally, and then in closeup with \(\mathcal{L}(1)=[\frac13,1]\) on the horizontal axis. Note that the S-majorizer is certainly not a majorizer. Also note that \(a=3/2\) actually makes \(g''(1)=f''(1)\), which means the quadratic S-majorizer is the quadratic approximation used in Newton's method. \begin{center}\includegraphics{bras_files/figure-latex/qtwothree-1} \end{center} Figure 1: S-majorization at y = 1 for a = 1.5. For D-majorization we need \(f(x)\leq f(1)=\frac14\) for all \(x\in\mathcal{L}(1)\), which is the case if \(\frac{a-1}{a}\geq -1\), i.e.~\(g\) D-majorizes \(f\) at \(y=1\) if and only if \(a\geq\frac12.\) In Figure 2 we see that a D-majorizer can actually be a minorizer! Of course the S-majorization in Figure 1 is also a D-majorization. \begin{center}\includegraphics{bras_files/figure-latex/qhalf-1} \end{center} Figure 2: D-majorization at y = 1 for a = 0.5. \chapter{Background}\label{background} \section{Introduction}\label{introduction-7} \section{Analysis}\label{analysis} \subsection{SemiContinuities}\label{semicontinuities} The \emph{lower limit} or \emph{limit inferior} of a sequence \(\{x_i\}\) is defined as \[ \liminf_{i\rightarrow\infty}x_i=\lim_{i\rightarrow\infty}\left[\inf_{k\geq i} x_k\right]=\sup_{i\geq 0}\left[\inf_{k\geq i} x_k\right]. \] Alternatively, the limit inferior is the smallest cluster point or subsequential limit \[ \liminf_{i\rightarrow\infty}x_i=\min\{y\mid x_{i_\nu}\rightarrow y\text{ for some subsequence }x_{i_\nu}\}. 
\] In the same way \[ \limsup_{i\rightarrow\infty}x_i=\lim_{i\rightarrow\infty}\left[\sup_{k\geq i} x_k\right]=\inf_{i\geq 0}\left[\sup_{k\geq i} x_k\right]. \] We always have \[ \inf_i x_i\leq\liminf_{i\rightarrow\infty} x_i\leq\limsup_{i\rightarrow\infty} x_i\leq\sup_i x_i. \] Also if \(\liminf_{i\rightarrow\infty} x_i=\limsup_{i\rightarrow\infty} x_i\) then \[ \lim_{i\rightarrow\infty}x_i=\liminf_{i\rightarrow\infty} x_i=\limsup_{i\rightarrow\infty} x_i. \] The \emph{lower limit} or \emph{limit inferior} of a function at a point \(\overline{x}\) is defined as \[ \liminf_{x\rightarrow \overline{x}}f(x)=\sup_{\delta>0}\left[\inf_{x\in\mathcal{B}(\overline{x},\delta)}f(x)\right], \] where \[ \mathcal{B}(\overline{x},\delta)\mathop{=}\limits^{\Delta}\{x\mid\|x-\overline{x}\|\leq\delta\}. \] Alternatively \[ \liminf_{x\rightarrow \overline{x}}f(x)=\min\{y\mid x_i\rightarrow \overline{x}\text{ and }f(x_i)\rightarrow y\}. \] In the same way \[ \limsup_{x\rightarrow \overline{x}}f(x)=\inf_{\delta>0}\left[\sup_{x\in\mathcal{B}(\overline{x},\delta)}f(x)\right]. \] A function is \emph{lower semi-continuous} at \(\overline{x}\) if \[ \liminf_{x\rightarrow \overline{x}}f(x)\geq f(\overline{x}). \] Since we always have \(\liminf_{x\rightarrow \overline{x}}f(x)\leq f(\overline{x})\) we can also define lower semicontinuity as \[ \liminf_{x\rightarrow \overline{x}}f(x)=f(\overline{x}). \] A function is \emph{upper semi-continuous} at \(\overline{x}\) if \[ \limsup_{x\rightarrow \overline{x}}f(x)\leq f(\overline{x}), \] or, equivalently, if \[ \limsup_{x\rightarrow \overline{x}}f(x)= f(\overline{x}). \] We have \[ \liminf_{x\rightarrow \overline{x}}f(x)\leq\limsup_{x\rightarrow \overline{x}}f(x). \] A function is \emph{continuous} at \(\overline{x}\) if and only if it is both lower semicontinuous and upper semicontinuous, i.e.~if \[ f(\overline{x})=\lim_{x\rightarrow \overline{x}} f(x)=\liminf_{x\rightarrow \overline{x}}f(x)=\limsup_{x\rightarrow \overline{x}} f(x). 
\] \subsection{Directional Derivatives}\label{directional-derivatives} The notation and terminology are by no means standard. We generally follow \citet{demyanov_10}. The \emph{lower Dini directional derivative} of \(f\) at \(x\) in the direction \(z\) is \[ \delta^-f(x,z)\mathop{=}\limits^{\Delta}\liminf_{\alpha\downarrow 0}\frac{f(x+\alpha z)-f(x)}{\alpha}=\sup_{\delta>0}\inf_{0<\alpha<\delta}\frac{f(x+\alpha z)-f(x)}{\alpha}, \] and the corresponding \emph{upper Dini directional derivative} is \[ \delta^+f(x,z)\mathop{=}\limits^{\Delta}\limsup_{\alpha\downarrow 0}\frac{f(x+\alpha z)-f(x)}{\alpha}=\inf_{\delta>0}\sup_{0<\alpha<\delta}\frac{f(x+\alpha z)-f(x)}{\alpha}. \] If \[ \delta f(x,z)=\lim_{\alpha\downarrow 0}\frac{f(x+\alpha z)-f(x)}{\alpha} \] exists, i.e.~if \(\delta^+f(x,z)=\delta^-f(x,z)\), then we simply write \(\delta f(x,z)\) for the Dini directional derivative of \(f\) at \(x\) in the direction \(z\). \citet{penot_13} calls this the \emph{radial derivative} and \citet{schirotzek_07} calls it the \emph{directional Gateaux derivative}. If \(\delta f(x,z)\) exists, then \(f\) is \emph{Dini directionally differentiable} at \(x\) in the direction \(z\), and if \(\delta f(x,z)\) exists at \(x\) for all \(z\) we say that \(f\) is \emph{Dini directionally differentiable} at \(x.\) \citet{delfour_12} calls \(f\) \emph{semidifferentiable} at \(x\). In a similar way we can define the \emph{Hadamard lower and upper directional derivatives}. 
They are \[ d^-f(x,z)\mathop{=}\limits^{\Delta}\liminf_{\substack{\alpha\downarrow 0\\u\rightarrow z}}\frac{f(x+\alpha u)-f(x)}{\alpha}= \sup_{\delta>0}\inf_{\substack{u\in\mathcal{B}(z,\delta)\\\alpha\in(0,\delta)}}\frac{f(x+\alpha u)-f(x)}{\alpha}, \] and \[ d^+f(x,z)\mathop{=}\limits^{\Delta}\limsup_{\substack{\alpha\downarrow 0\\u\rightarrow z}}\frac{f(x+\alpha u)-f(x)}{\alpha}= \inf_{\delta>0}\sup_{\substack{u\in\mathcal{B}(z,\delta)\\\alpha\in(0,\delta)}}\frac{f(x+\alpha u)-f(x)}{\alpha}. \] The Hadamard directional derivative \(df(x,z)\) exists if both \(d^+f(x,z)\) and \(d^-f(x,z)\) exist and are equal. In that case \(f\) is \emph{Hadamard directionally differentiable} at \(x\) in the direction \(z\), and if \(df(x,z)\) exists at \(x\) for all \(z\) we say that \(f\) is Hadamard directionally differentiable at \(x.\) Generally we have \[ d^-f(x,z)\leq\delta^-f(x,z)\leq\delta^+f(x,z)\leq d^+f(x,z). \] The \emph{classical directional derivative} of \(f\) at \(x\) in the direction \(z\) is \[ \Delta f(x,z)\mathop{=}\limits^{\Delta}\lim_{\alpha\rightarrow 0}\frac{f(x+\alpha z)-f(x)}{\alpha}. \] Note that for the absolute value function at zero we have \(\delta f(0,1)=df(0,1)=1\), while \(\Delta f(0,1)=\lim_{\alpha\rightarrow 0}\mathbf{sign}(\alpha)\) does not exist. The classical directional derivative is not particularly useful in the context of optimization problems. \subsection{Differentiability and Derivatives}\label{differentiability-and-derivatives} The function \(f\) is \emph{Gateaux differentiable} at \(x\) if and only if the Dini directional derivative \(\delta f(x,z)\) exists for all \(z\) and is linear in \(z\). Thus \(\delta f(x,z)=G(x)z\) for some matrix \(G(x)\), the \emph{Gateaux derivative} of \(f\) at \(x\). The function \(f\) is \emph{Hadamard differentiable} at \(x\) if the Hadamard directional derivative \(df(x,z)\) exists for all \(z\) and is linear in \(z\). 
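The distinction between the one-sided Dini limit and the two-sided classical limit can be seen numerically for the absolute value function. A small sketch, using numpy; the variable names are ours:

```python
import numpy as np

# Difference quotients for f(x) = |x| at x = 0 in the direction z = 1.
# The one-sided (Dini) quotient converges to 1; the two-sided quotient
# equals sign(alpha), so the classical limit does not exist.
f = abs
x, z = 0.0, 1.0
alphas = 10.0 ** -np.arange(1, 8)   # 0.1, 0.01, ..., 1e-7

one_sided = (np.array([f(x + a * z) for a in alphas]) - f(x)) / alphas
assert np.allclose(one_sided, 1.0)    # delta f(0,1) = 1

two_sided = (np.array([f(x + a * z) for a in -alphas]) - f(x)) / (-alphas)
assert np.allclose(two_sided, -1.0)   # sign(alpha) for alpha < 0
```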
Function \(f\) is \emph{locally Lipschitz} at \(z\) if there is a ball \(\mathcal{B}(z,\delta)\) and a \(\gamma>0\) such that \(\|f(x)-f(y)\|\leq\gamma\|x-y\|\) for all \(x,y\in\mathcal{B}(z,\delta).\) If \(f\) is locally Lipschitz and Gateaux differentiable then it is Hadamard differentiable. The function \(f\) is \emph{Frechet differentiable} at \(x\) if there is a linear map \(L\) such that \[ \lim_{h\rightarrow 0}\frac{\|f(x+h)-f(x)-Lh\|}{\|h\|}=0. \] If the Gateaux derivative of \(f\) is continuous then \(f\) is Frechet differentiable. The function \(f\) is Hadamard differentiable if and only if it is Frechet differentiable. For a real-valued \(f\) the derivative is represented by the \emph{gradient}, and for a vector-valued \(f\) by the \emph{Jacobian}. \subsection{Taylor's Theorem}\label{taylors-theorem} Suppose \(f:\mathcal{X}\rightarrow\mathbb{R}\) is \(p+1\) times continuously differentiable in the open set \(\mathcal{X}\subseteq\mathbb{R}^n\). Define, for all \(0\leq s\leq p\), \[ h_s(x,y)\mathop{=}\limits^{\Delta}\frac{1}{s!}\langle\mathcal{D}^sf(y),(x-y)^{s}\rangle, \] as the inner product of the \(s\)-dimensional array of partial derivatives \(\mathcal{D}^sf(y)\) and the \(s\)-dimensional outer power of \(x-y.\) Both arrays are super-symmetric, and have dimension \(n^s.\) By convention \(h_0(x,y)=f(y)\). Also define the \emph{Taylor Polynomials} \[ g_p(x,y)\mathop{=}\limits^{\Delta}\sum_{s=0}^ph_s(x,y) \] and the \emph{remainder} \[ r_p(x,y)\mathop{=}\limits^{\Delta}f(x)-g_p(x,y). \] Assume \(\mathcal{X}\) contains the line segment with endpoints \(x\) and \(y\). Then \emph{Lagrange's form of the remainder} says there is a \(0\leq\lambda\leq 1\) such that \[ r_p(x,y)=\frac{1}{(p+1)!}\langle\mathcal{D}^{p+1}f(x+\lambda(y-x)),(x-y)^{p+1}\rangle, \] and the \emph{integral form of the remainder} says \[ r_p(x,y)=\frac{1}{p!}\int_0^1(1-\lambda)^p\langle\mathcal{D}^{p+1}f(y+\lambda(x-y)),(x-y)^{p+1}\rangle d\lambda. \] \subsection{Implicit Functions}\label{implicit-functions} The classical implicit function theorem is discussed in all analysis books. I am particularly fond of \citet{spivak_65}. 
The history of the theorem, and many of its variations, is discussed in \citet{krantz_parks_03}, and a comprehensive modern treatment, using the tools of convex and variational analysis, is in \citet{dontchev_rockafellar_14}. Suppose \(f:\mathbb{R}^n\otimes\mathbb{R}^m\mapsto\mathbb{R}^m\) is continuously differentiable in an open set containing \((x,y)\) where \(f(x,y)=0.\) Define the \(m\times m\) matrix \[ A(x,y)\mathop{=}\limits^{\Delta}\mathcal{D}_2f(x,y) \] and suppose that \(A(x,y)\) is non-singular. Then there is an open set \(\mathcal{X}\) containing \(x\) and an open set \(\mathcal{Y}\) containing \(y\) such that for every \(x\in\mathcal{X}\) there is a unique \(y(x)\in\mathcal{Y}\) with \(f(x,y(x))=0.\) The function \(y:\mathbb{R}^n\mapsto\mathbb{R}^m\) is differentiable. If we differentiate \(f(x,y(x))=0\) we find \[ \mathcal{D}_1f(x,y(x))+\mathcal{D}_2f(x,y(x))\mathcal{D}y(x)=0, \] and thus \[ \mathcal{D}y(x)=-[\mathcal{D}_2f(x,y(x))]^{-1}\mathcal{D}_1f(x,y(x)). \] As an example consider the eigenvalue problem \begin{align*} A(x)y&=\lambda y,\\ y'y&=1, \end{align*} where \(A\) is a symmetric matrix depending on a real parameter \(x\). Then \[ \begin{bmatrix} A(x)-\lambda I&-y\\ y'&0 \end{bmatrix} \begin{bmatrix} \mathcal{D}y(x)\\ \mathcal{D}\lambda(x) \end{bmatrix} = \begin{bmatrix} -\mathcal{D}A(x)y\\ 0 \end{bmatrix}, \] which works out to \begin{align*} \mathcal{D}\lambda(x)&=y(x)'\mathcal{D}A(x)y(x),\\ \mathcal{D}y(x)&=-(A(x)-\lambda(x)I)^+\mathcal{D}A(x)y(x). \end{align*} \subsection{Necessary and Sufficient Conditions for a Minimum}\label{necessary-and-sufficient-conditions-for-a-minimum} Directional derivatives can be used to provide simple necessary or sufficient conditions for a minimum (\citet{demyanov_09}, propositions 8 and 9). \textbf{Result:} If \(x\) is a local minimizer of \(f\) then \(\delta^-f(x,z)\geq 0\) and \(d^-f(x,z)\geq 0\) for all directions \(z\). If \(d^-f(x,z)>0\) for all \(z\not= 0\) then \(f\) has a strict local minimum at \(x\). 
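The eigenvalue derivative formula from the implicit function example above is easy to check by finite differences. A sketch, assuming the particular family \(A(x)=A_0+xB\) with \(A_0\) and \(B\) symmetric, so that \(\mathcal{D}A(x)=B\); the helper names are ours:

```python
import numpy as np

# Finite-difference check of Dlambda(x) = y(x)' DA(x) y(x)
# for the family A(x) = A0 + x B, where DA(x) = B.
rng = np.random.default_rng(0)
A0 = rng.standard_normal((5, 5)); A0 = A0 + A0.T   # symmetric
B = rng.standard_normal((5, 5)); B = B + B.T       # symmetric

def largest_eig(x):
    # eigh returns eigenvalues in ascending order
    vals, vecs = np.linalg.eigh(A0 + x * B)
    return vals[-1], vecs[:, -1]

x, eps = 0.3, 1e-6
lam, y = largest_eig(x)
analytic = y @ B @ y
numeric = (largest_eig(x + eps)[0] - largest_eig(x - eps)[0]) / (2 * eps)
assert abs(analytic - numeric) < 1e-5
```

The sign ambiguity of the computed eigenvector is harmless here, because the derivative formula is quadratic in \(y\).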
The special case of a quadratic deserves some separate study, because the quadratic model is so prevalent in optimization. So let us look at \(f(x)=c+b'x+\frac12 x'Ax\), with \(A\) symmetric. Use the eigen-decomposition \(A=K\Lambda K'\) to change variables to \(\tilde x\mathop{=}\limits^{\Delta}K'x\), also using \(\tilde b\mathop{=}\limits^{\Delta}K'b\). Then \(f(\tilde x)=c+\tilde b'\tilde x+\frac12\tilde x'\Lambda\tilde x\), which we can write as \begin{align*} f(\tilde x)=\ &c-\frac12\sum_{i\in I_+\cup I_-}\frac{\tilde b_i^2}{\lambda_i}\\ &+\frac12\sum_{i\in I_+}|\lambda_i|(\tilde x_i+\frac{\tilde b_i}{\lambda_i})^2\\ &-\frac12\sum_{i\in I_-}|\lambda_i|(\tilde x_i+\frac{\tilde b_i}{\lambda_i})^2\\ &+\sum_{i\in I_0}\tilde b_i\tilde x_i. \end{align*} Here \begin{align*} I_+&\mathop{=}\limits^{\Delta}\{i\mid\lambda_i>0\},\\ I_-&\mathop{=}\limits^{\Delta}\{i\mid\lambda_i<0\},\\ I_0&\mathop{=}\limits^{\Delta}\{i\mid\lambda_i=0\}. \end{align*} \begin{itemize} \tightlist \item If \(I_-\) is non-empty we have \(\inf_x f(x)=-\infty\). \item If \(I_-\) is empty, then \(f\) attains its minimum if and only if \(\tilde b_i=0\) for all \(i\in I_0\). Otherwise again \(\inf_x f(x)=-\infty.\) \end{itemize} If the minimum is attained, then \[ \min_x f(x) = c-\frac12 b'A^+b, \] with \(A^+\) the Moore-Penrose inverse. Thus the minimum is attained if and only if \(A\) is positive semi-definite and \((I-A^+A)b=0\). \section{Point-to-set Maps}\label{point-to-set-maps} \subsection{Continuities}\label{continuities} \subsection{Marginal Functions and Solution Maps}\label{marginal-functions-and-solution-maps} Suppose \(f:\mathbb{R}^n\otimes\mathbb{R}^m\rightarrow\mathbb{R}\) and \(g(x)=\min_y f(x,y)\). Suppose the minimum is attained at a unique \(y(x)\), where \(\mathcal{D}_2f(x,y(x))=0.\) Then obviously \(g(x)=f(x,y(x))\). 
Differentiating \(g\) gives \[ \mathcal{D}g(x)=\mathcal{D}_1f(x,y(x))+\mathcal{D}_2f(x,y(x))\mathcal{D}y(x)=\mathcal{D}_1f(x,y(x)).\tag{1} \] To differentiate the solution map we need second derivatives of \(f\). Differentiating the implicit definition \(\mathcal{D}_2f(x,y(x))=0\) gives \[ \mathcal{D}_{21}f(x,y(x))+\mathcal{D}_{22}f(x,y(x))\mathcal{D}y(x)=0, \] or \[ \mathcal{D}y(x)=-[\mathcal{D}_{22}f(x,y(x))]^{-1}\mathcal{D}_{21}f(x,y(x)).\tag{2} \] Now combine both \((1)\) and \((2)\) to obtain \[ \mathcal{D}^2g(x)=\mathcal{D}_{11}f(x,y(x))-\mathcal{D}_{12}f(x,y(x))[\mathcal{D}_{22}f(x,y(x))]^{-1}\mathcal{D}_{21}f(x,y(x)).\tag{3} \] We see that if \(\mathcal{D}^2f(x,y(x))\gtrsim 0\) then \(0\lesssim\mathcal{D}^2g(x)\lesssim\mathcal{D}_{11}f(x,y(x))\). Now consider a minimization problem with constraints. Suppose \(h_1,\cdots,h_p\) are twice continuously differentiable functions on \(\mathbb{R}^m\), and suppose \[ \mathcal{Y}=\{y\in\mathbb{R}^m\mid h_1(y)=\cdots=h_p(y)=0\}. \] Define \[ g(x)\mathop{=}\limits^{\Delta}\min_{y\in\mathcal{Y}}f(x,y), \] and \[ y(x)\mathop{=}\limits^{\Delta}\mathop{\mathbf{argmin}}\limits_{y\in\mathcal{Y}} f(x,y), \] where again we assume the minimizer is unique and satisfies \begin{align*} \mathcal{D}_2f(x,y(x))-\sum_{s=1}^p\lambda_s(x)\mathcal{D}h_s(y(x))&=0,\\ h_s(y(x))&=0. \end{align*} Differentiate again, and define \begin{align*} A(x)&\mathop{=}\limits^{\Delta}\mathcal{D}_{22}f(x,y(x)),\\ H_s(x)&\mathop{=}\limits^{\Delta}\mathcal{D}^2h_s(y(x)),\\ E(x)&\mathop{=}\limits^{\Delta}-\mathcal{D}_{21}f(x,y(x)),\\ B(x)&\mathop{=}\limits^{\Delta}\left[\mathcal{D}h_1(y(x))\mid\cdots\mid\mathcal{D}h_p(y(x))\right], \end{align*} where \(B(x)\) has the gradients of the \(h_s\) at \(y(x)\) as its columns, and \[ C(x)\mathop{=}\limits^{\Delta}A(x)-\sum_{s=1}^p\lambda_s(x)H_s(x). 
\] Then \[ \begin{bmatrix} C(x)&-B(x)\\ B'(x)&0 \end{bmatrix} \begin{bmatrix}\mathcal{D}y(x)\\\mathcal{D}\lambda(x)\end{bmatrix}= \begin{bmatrix}E(x)\\0\end{bmatrix}, \] which leads to \begin{align*} \mathcal{D}y(x)&=C^{-1}(x)\left\{I-B(x)\left[B'(x)C^{-1}(x)B(x)\right]^{-1}B'(x)C^{-1}(x)\right\}E(x),\\ \mathcal{D}\lambda(x)&=-\left[B'(x)C^{-1}(x)B(x)\right]^{-1}B'(x)C^{-1}(x)E(x). \end{align*} There is an alternative way of arriving at basically the same result. Suppose the manifold \(\mathcal{Y}\) is parametrized locally as \(y=F(w)\), so that \(h_s(F(w))=0\) for all \(w\), and thus \(\mathcal{D}h_s(F(w))\mathcal{D}F(w)=0\). Define \(\phi(x,w)\mathop{=}\limits^{\Delta}f(x,F(w))\) and \[ w(x)\mathop{=}\limits^{\Delta}\mathop{\mathbf{argmin}}\limits_{w}\phi(x,w), \] so that \(y(x)=F(w(x))\). Then \[ \mathcal{D}w(x)=-[\mathcal{D}_{22}\phi(x,w(x))]^{-1}\mathcal{D}_{21}\phi(x,w(x)), \] where, by the chain rule, \[ \mathcal{D}_{22}\phi(x,w)=(\mathcal{D}F(w))'\mathcal{D}_{22}f(x,F(w))\mathcal{D}F(w)+\sum_i\left[\mathcal{D}_2f(x,F(w))\right]_i\mathcal{D}^2F_i(w). \] \subsection{Solution Maps}\label{solution-maps} \section{Basic Inequalities}\label{basic-inequalities} \subsection{Jensen's Inequality}\label{jensens-inequality-1} \subsection{The AM-GM Inequality}\label{the-am-gm-inequality} The Arithmetic-Geometric Mean Inequality is simple, but quite useful for majorization. For completeness, we give the statement and proof here. \textbf{Theorem:} If \(x\geq 0\) and \(y\geq 0\) then \(\sqrt{xy}\leq\frac{1}{2}(x+y),\) with equality if and only if \(x=y.\) \textbf{Proof:} Expand \((\sqrt{x}-\sqrt{y})^2\geq 0,\) and collect terms. \textbf{QED} \textbf{Corollary:} \(|xy|\leq\frac{1}{2}(x^2+y^2).\) \textbf{Proof:} Just a simple rewrite of the theorem. \textbf{QED} \subsection{Polar Norms and the Cauchy-Schwarz Inequality}\label{polar-norms-and-the-cauchy-schwarz-inequality-1} \textbf{Theorem:} Suppose \(x,y\in\mathbb{R}^n.\) Then \((x'y)^2\leq (x'x)(y'y),\) with equality if and only if \(x\) and \(y\) are proportional. \textbf{Proof:} The result is trivially true if either \(x\) or \(y\) is zero. Thus we suppose both are non-zero. 
We have \(h(\lambda)\mathop{=}\limits^{\Delta}(x-\lambda y)'(x-\lambda y)\geq 0\) for all \(\lambda.\) Thus \[ \min_\lambda h(\lambda)=x'x-\frac{(x'y)^2}{y'y}\geq 0, \] which is the required result. \textbf{QED} \subsection{Young's Inequality}\label{youngs-inequality} The AM-GM inequality is a very special case of Young's inequality. We derive it in a general form, using the coupling functions introduced by Moreau. Suppose \(f\) is a real-valued function on \(\mathcal{X}\) and \(g\) is a real-valued function on \(\mathcal{X}\times\mathcal{Y}\), called the \emph{coupling} function. Here \(\mathcal{X}\) and \(\mathcal{Y}\) are arbitrary. Define the \emph{\(g\)-conjugate} of \(f\) by \[ f^\circ_g(y)\mathop{=}\limits^{\Delta}\sup_{x\in\mathcal{X}}\ \{g(x,y)-f(x)\}. \] Then \(g(x,y)-f(x)\leq f^\circ_g(y)\) and thus \(g(x,y)\leq f(x)+f^\circ_g(y)\), which is the generalized Young's inequality. We can also write this in the form that directly suggests minorization \[ f(x)\geq g(x,y)-f^\circ_g(y). \] The classical coupling function is \(g(x,y)=xy\) with both \(x\) and \(y\) in the positive reals. If we take \(f(x)=\frac{1}{p}x^p\), with \(p>1\), then \[ f^\circ_g(y)=\sup_{x}\ \{xy-\frac{1}{p}x^p\}. \] The \(\sup\) is attained for \(x=y^{\frac{1}{p-1}}\), from which we find \(f^\circ_g(y)=\frac{1}{q}y^q,\) with \(q\mathop{=}\limits^{\Delta}\frac{p}{p-1}\). Thus if \(p,q>1\) are such that \[ \frac{1}{p}+\frac{1}{q}=1, \] then for all \(x,y>0\) we have \[ xy\leq\frac{x^p}{p}+\frac{y^q}{q}, \] with equality if and only if \(y=x^{p-1}\). \section{Fixed Point Problems and Methods}\label{background:fixedpoint} As we have emphasized before, the algorithms discussed in this book are all special cases of block relaxation methods. But block relaxation methods are often appropriately analyzed as \emph{fixed point methods}, which define an even wider class of iterative methods. 
Thus we will not discuss actual fixed point algorithms that are not block relaxation methods, but we will use general results on fixed point methods to analyze block relaxation methods. A (stationary, one-step) fixed point method on \(\mathcal{X}\subseteq\mathbb{R}^n\) is defined as a map \(A:\mathcal{X}\rightarrow\mathcal{X}\). Depending on the context we refer to \(A\) as the \emph{update map} or \emph{algorithmic map}. Iterative sequences are generated by starting with \(x^{(0)}\in\mathcal{X}\) and then setting \[ x^{(k)}=A(x^{(k-1)})=A^k(x^{(0)}) \] for \(k=1,2,\cdots\). Such a sequence is also called the \emph{Picard sequence} generated by the map. If the sequence \(x^{(k)}\) converges to, say, \(x_\infty\), and if \(A\) is continuous, then \(x_\infty=A(x_\infty)\), and thus \(x_\infty\) is a \emph{fixed point} of \(A\) on \(\mathcal{X}\). The set of all \(x\in\mathcal{X}\) such that \(x=A(x)\) is called the \emph{fixed point set} of \(A\) on \(\mathcal{X}\), and is written as \(\mathcal{F}_\mathcal{X}(A)\). The literature on fixed point methods is truly gigantic. There are textbooks, conferences, and dedicated journals. A nice and compact treatment, mostly on existence theorems for fixed points, is \citet{smart_74}. An excellent modern overview, concentrating on metrical fixed point theory and iterative computation, is \citet{berinde_07}. The first key result in fixed point theory is the \emph{Brouwer Fixed Point Theorem}, which says that for compact convex \(\mathcal{X}\) and continuous \(A\) there is at least one \(x\in\mathcal{F}_\mathcal{X}(A)\). The second is the \emph{Banach Fixed Point Theorem}, which says that if \(\mathcal{X}\) is a non-empty complete metric space and \(A\) is a \emph{contraction}, i.e.~\(d(A(x),A(y))\leq\kappa\ d(x,y)\) for all \(x,y\in\mathcal{X}\) and some \(0\leq\kappa<1\), then the Picard sequence \(x^{(k)}\) converges from any starting point \(x^{(0)}\) to the unique fixed point of \(A\) in \(\mathcal{X}\). 
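Banach's theorem is easy to illustrate with a one-dimensional Picard sequence. A minimal sketch, using the contraction \(A(x)=\cos x\) on \([0,1]\); the example and the helper name are ours, not from the text:

```python
import math

# Picard iteration for the contraction A(x) = cos(x) on [0, 1].
# There |A'(x)| = |sin(x)| <= sin(1) < 1, and cos maps [0, 1] into itself,
# so Banach's theorem applies: the iterates converge from any starting
# point to the unique fixed point x = cos(x).
def picard(A, x0, k):
    x = x0
    for _ in range(k):
        x = A(x)
    return x

x_inf = picard(math.cos, 0.5, 100)
assert abs(x_inf - math.cos(x_inf)) < 1e-12
```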
Much of the fixed point literature is concerned with relaxing the contraction assumption and choosing more general spaces on which the various mappings are defined. I shall discuss some of the generalizations that we will use later in this book. First, we can generalize to point-to-set maps \(A:\mathcal{X}\rightarrow 2^\mathcal{X}\), where \(2^\mathcal{X}\) is the \emph{power set} of \(\mathcal{X}\), i.e.~the set of all subsets. Point-to-set maps are also called \emph{correspondences} or \emph{multivalued maps}. The Picard sequence is now defined by \(x^{(k)}\in A(x^{(k-1)})\) and we have a fixed point \(x\in\mathcal{F}_\mathcal{X}(A)\) if and only if \(x\in A(x)\). The generalization of the Brouwer Fixed Point Theorem is the \emph{Kakutani Fixed Point Theorem}. It assumes that \(\mathcal{X}\) is non-empty, compact and convex and that \(A(x)\) is non-empty and convex for each \(x\in\mathcal{X}\). In addition, the map \(A\) must be \emph{closed} or \emph{upper semi-continuous} on \(\mathcal{X}\), i.e.~whenever \(x_n\rightarrow x_\infty\) and \(y_n\in A(x_n)\) and \(y_n\rightarrow y_\infty\) we have \(y_\infty\in A(x_\infty)\). Under these conditions Kakutani's Theorem asserts the existence of a fixed point. Our discussion of the global convergence of block relaxation algorithms, in a later chapter, will be framed using fixed points of point-to-set maps, assuming the closedness of maps. In another generalization of iterative algorithms we get rid of the one-step and the stationary assumptions. The iterative sequence is \[ x^{(k)}\in A_k(x^{(0)},x^{(1)},\cdots,x^{(k-1)}). \] Thus the iterations have perfect memory, and the update map can change in each iteration. In an \(\ell\)-step method, memory is less than perfect, because the update is a function of only the previous \(\ell\) elements in the sequence. Formally, for \(k\geq\ell\), \[ x^{(k)}\in A_k(x^{(k-\ell)},\cdots,x^{(k-1)}), \] with some special provisions for \(k<\ell\). 
Any \(\ell\)-step method on \(\mathcal{X}\) can be rewritten as a one-step method on \[ \underbrace{\mathcal{X}\otimes\mathcal{X}\otimes\cdots\otimes\mathcal{X}}_{\ell\ \text{times}}. \] This makes it possible to limit our discussion to one-step methods. In fact, we will mostly discuss block-relaxation methods which are stationary one-step fixed point methods. For non-stationary methods it is somewhat more complicated to define fixed points. In that case it is natural to define a set \(\mathcal{S}\subseteq\mathcal{X}\) of \emph{desirable points} or \emph{targets}, which for stationary algorithms will generally, but not necessarily, coincide with the fixed points of \(A\). The questions we will then have to answer are if and under what conditions our algorithms converge to desirable points, and if they converge how fast the convergence will take place. \subsection{Subsequential Limits}\label{subsequential-limits} \section{Convex Functions}\label{convex-functions} \section{Composition}\label{composition-1} \subsection{Differentiable Convex Functions}\label{differentiable-convex-functions} If a function \(f\) attains its minimum on a convex set \(\mathcal{X}\) at \(x\), and \(f\) is differentiable at \(x\), then \((y-x)'\mathcal{D}f(x)\geq 0\) for all \(y\in\mathcal{X}\). If \(f\) attains its minimum on \([a,b]\) at \(a\), and \(f\) is differentiable at \(a\), then \(f'(a)\geq 0\). Or, more precisely, if \(f\) is differentiable from the right at \(a\), then \(f'_R(a)\geq 0\). Suppose \(\mathcal{X}=\{x\mid x'x\leq 1\}\) is the unit ball and a differentiable \(f\) attains its minimum at \(x\) with \(x'x=1.\) Then \((y-x)'\mathcal{D}f(x)\geq 0\) for all \(y\in\mathcal{X}.\) This is true if and only if \begin{align*} \min_{y\in\mathcal{X}}(y-x)'\mathcal{D}f(x)&= \min_{y\in\mathcal{X}}y'\mathcal{D}f(x)-x'\mathcal{D}f(x)=\\&=-\|\mathcal{D}f(x)\|-x'\mathcal{D}f(x)=0. \end{align*} By Cauchy-Schwarz this means that \(\mathcal{D}f(x)=\lambda x\), with \(\lambda= x'\mathcal{D}f(x)=-\|\mathcal{D}f(x)\|.\) As an aside, if a differentiable function \(f\) attains its minimum on the unit sphere \(\{x\mid x'x=1\}\) at \(x\) then \(f(x/\|x\|)\) attains its minimum at \(x\). Setting the derivative equal to zero shows that we must have \(\mathcal{D}f(x)(I-xx')=0\), which again translates to \(\mathcal{D}f(x)=\lambda x\), with \(\lambda= x'\mathcal{D}f(x)\), and thus \(|\lambda|=\|\mathcal{D}f(x)\|.\) \section{Zangwill Theory}\label{zangwill-theory} \subsection{Algorithms as Point-to-set Maps}\label{algorithms-as-point-to-set-maps} The theory studies iterative algorithms with the following properties. An algorithm works in a space \(\Omega.\) It consists of a triple \((\mathcal{A},\psi,\mathcal{P}),\) with \(\mathcal{A}\) a mapping of \(\Omega\) into the set of nonempty subsets of \(\Omega,\) with \(\psi\) a real-valued function on \(\Omega,\) and with \(\mathcal{P}\) a subset of \(\Omega.\) We call \(\mathcal{A}\) the \emph{algorithmic map} or the \emph{update}, \(\psi\) the \emph{evaluation function} and \(\mathcal{P}\) the \emph{desirable points}. The algorithm works as follows. \begin{enumerate} \def\labelenumi{\arabic{enumi}.} \tightlist \item start at an arbitrary \(\omega^{(0)}\in\Omega,\) \item if \(\omega^{(k)}\in\mathcal{P},\) then we stop, \item otherwise we construct the \emph{successor} by the rule \(\omega^{(k+1)}\in\mathcal{A}(\omega^{(k)}).\) \end{enumerate} We study properties of the sequences \(\omega^{(k)}\) generated by the algorithm, in particular their convergence. \subsection{Convergence of Function Values}\label{convergence-of-function-values} For this method we have our first (rather trivial) convergence theorem. 
\textbf{Theorem:} If \begin{itemize} \tightlist \item \(\Omega\) is compact, \item \(\psi\) is jointly continuous on \(\Omega\), \end{itemize} then \begin{itemize} \tightlist \item the sequence \(\{\psi^{(k)}\}\) converges to, say, \(\psi^\infty,\) \item the sequence \(\{\omega^{(k)}\}\) has at least one convergent subsequence, \item if \(\omega^\infty\) is an accumulation point of \(\{\omega^{(k)}\},\) then \(\psi(\omega^\infty)=\psi^\infty.\) \end{itemize} \textbf{Proof:} Compactness and continuity imply that \(\psi\) is bounded below on \(\Omega,\) and the minima in each of the substeps exist. This means that \(\{\psi^{(k)}\}\) is nonincreasing and bounded below, and thus convergent. Existence of convergent subsequences is guaranteed by Bolzano-Weierstrass, and if we have a subsequence \(\{\omega^{(k)}\}_{k\in\mathcal{K}}\) converging to \(\omega^\infty\) then by continuity \(\{\psi(\omega^{(k)})\}_{k\in\mathcal{K}}\) converges to \(\psi(\omega^\infty).\) But all subsequences of a convergent sequence converge to the same point, and thus \(\psi(\omega^\infty)=\psi^\infty.\) \textbf{QED} \textbf{Remark:} The assumptions in the theorem can be relaxed. Continuity is not necessary. Think of the function \(\psi(x)=\sum_{s=1}^p\text{sign}(x_s),\) which clearly is minimized in a single cycle. Typically, however, statistical problems exhibit continuity, so it may not be worthwhile to actually relax the assumption. Compactness is more interesting. If we define the level sets \[ \Omega_0\mathop{=}\limits^{\Delta}\{\omega\in\Omega\mid\psi(\omega)\leq\psi^{(0)}\}, \] then obviously it is sufficient to assume that \(\Omega_0\) is compact. The same thing is true for the even weaker assumption that the iterates \(\omega^{(k)}\) are in a compact set. We do not assume the sets \(\Omega_s\) are connected, i.e.~they could be discrete. For instance, we could minimize \(\|X\beta-y-\delta\|^2\) over \(\beta\) and \(\delta\), under the constraint that the elements of \(\delta\) take only two different unspecified values. 
This is not difficult to do with block-relaxation, but generally problems with discrete characteristics present us with special problems and complications. Thus, in most instances, we have connected sets in mind. For discrete components, the usual topology of \(\mathbb{R}^s\) may not be the most natural one to study convergence. In several problems in statistics the sets \(\Omega_s\) can be infinite-dimensional. This is true, for instance, in much of semi-parametric and non-parametric statistics. We mostly ignore the complications arising from infinite dimensionality (again, of a topological nature), because in actual computations we work with finite-dimensional approximations anyway. Theorem \ref{T:triv} is very general, but the conclusions are quite weak. We have convergence of the function values, but about the sequence \(\{\omega^{(k)}\}\) we only know that it has one or more accumulation points, and that all accumulation points have the same function value. We have not established other desirable properties of these accumulation points. In order to prove global convergence (i.e.~convergence from any initial point) we use the general theory developed initially by \citet{zangwill_69} (and later by \citet{polak_69}, \citet{meyer_76}, \citet{meyer_75}, and others). The best introduction and overview is perhaps the volume edited by Huard \citep{huard_79a}. 
\subsection{Convergence of Solutions}\label{convergence-of-solutions} \textbf{Theorem: {[}Zangwill{]}} If \begin{itemize} \tightlist \item \(\mathcal{A}\) is \emph{uniformly compact} on \(\Omega,\) i.e.~there is a compact \(\Omega_0\subseteq\Omega\) such that \(\mathcal{A}(\omega)\subseteq\Omega_0\) for all \(\omega\in\Omega,\) \item \(\mathcal{A}\) is \emph{upper-semicontinuous} or \emph{closed} on \(\Omega-\mathcal{P},\) i.e.~if \(\xi_i\in\mathcal{A}(\omega_i)\) and \(\xi_i\rightarrow\xi\) and \(\omega_i\rightarrow\omega\) then \(\xi\in\mathcal{A}(\omega),\) \item \(\mathcal{A}\) is \emph{strictly monotonic} on \(\Omega-\mathcal{P}\), i.e.~\(\xi\in\mathcal{A}(\omega)\) implies \(\psi(\xi)<\psi(\omega)\) if \(\omega\) is not a \emph{desirable point}, \end{itemize} then all accumulation points of the sequence \(\{\omega^{(k)}\}\) generated by the algorithm are desirable points. \textbf{Proof:} Compactness implies that \(\{\omega^{(k)}\}\) has a convergent subsequence. Suppose its index-set is \(\mathcal{K}=\{k_1,k_2,\cdots\}\) and that it converges to \(\omega_\mathcal{K}.\) Since \(\{\psi(\omega^{(k)})\}\) converges to, say, \(\psi_\infty,\) we see that also \[ \{\psi(\omega^{(k_1)}),\psi(\omega^{(k_2)}),\cdots\} \rightarrow\psi_\infty. \] Now consider \(\{\omega^{(k_1+1)},\omega^{(k_2+1)},\cdots\},\) which must again have a convergent subsequence. Suppose its index-set is \(\mathcal{L}=\{\ell_1+1,\ell_2+1,\cdots\}\) and that it converges to \(\omega_\mathcal{L}.\) Then \(\psi(\omega_\mathcal{K})=\psi(\omega_\mathcal{L})=\psi_\infty.\) Assume \(\omega_\mathcal{K}\) is not a desirable point. 
Now \[ \{\omega^{(\ell_1)},\omega^{(\ell_2)},\cdots\} \rightarrow \omega_\mathcal{K} \] and \[ \{\omega^{(\ell_1+1)},\omega^{(\ell_2+1)},\cdots\} \rightarrow\omega_\mathcal{L}, \] with \(\omega^{(\ell_j+1)}\in\mathcal{A}(\omega^{(\ell_j)}).\) Thus, by upper-semicontinuity, \(\omega_\mathcal{L}\in\mathcal{A}(\omega_\mathcal{K}).\) If \(\omega_\mathcal{K}\) is not a fixed point, then strict monotonicity gives \(\psi(\omega_\mathcal{L})\) \textless{} \(\psi(\omega_\mathcal{K}),\) which contradicts our earlier \(\psi(\omega_\mathcal{K})=\psi(\omega_\mathcal{L}).\) \textbf{QED} The concept of closedness of a map can be illustrated with the following picture, showing a map which fails to be closed at at least one point. \begin{figure} {\centering \includegraphics{graphics/bookfig3} } \caption{The map is not closed}\label{fig:bookfig3} \end{figure} We have already seen another example: Powell's coordinate descent example shows that the algorithm map is not closed at six of the edges of the cube \(\{\pm 1,\pm 1,\pm 1\}.\) It is easy to see that desirable points are generalized fixed points, in the sense that \(\omega\in\mathcal{P}\) if and only if \(\omega\in\mathcal{A}(\omega).\) According to Zangwill's theorem each accumulation point is a generalized fixed point. This, however, does not prove convergence, because there can be many accumulation points. If we redefine fixed points as points such that \(\mathcal{A}(\omega)=\{\omega\},\) then we can strengthen the theorem {[}Meyer, 1976{]}. \textbf{Theorem: {[}Meyer{]}} Suppose the conditions of Zangwill's theorem are satisfied for the stronger definition of a fixed point, i.e.~\(\xi\in\mathcal{A}(\omega)\) implies \(\psi(\xi)\) \textless{} \(\psi(\omega)\) if \(\omega\) is not a fixed point. Then, in addition to what we had before, \(\{\omega^{(k)}\}\) is \emph{asymptotically regular}, i.e.~\(\|\omega^{(k)}-\omega^{(k+1)}\|\rightarrow 0.\) \textbf{Proof:} We use the notation in the proof of Zangwill's theorem.
Suppose \(\|\omega^{(\ell_i+1)}-\omega^{(\ell_i)}\|\) \textgreater{} \(\delta\) \textgreater{} \(0.\) Then \(\|\omega_\mathcal{L}-\omega_\mathcal{K}\|\geq\delta.\) But \(\omega_\mathcal{K}\) is a fixed point (in the strong sense) and thus \(\omega_\mathcal{L}\in\mathcal{A}(\omega_\mathcal{K})=\{\omega_\mathcal{K}\},\) a contradiction. \textbf{QED} \textbf{Theorem: {[}Ostrowski{]}} If the bounded sequence \(\Omega=\{\omega^{(k)}\}\) satisfies \(\|\omega^{(k)}-\omega^{(k+1)}\|\rightarrow 0,\) then the derived set \(\Omega'\) is either a point or a continuum. \textbf{Proof:} We follow Ostrowski {[}1966, pages 203--204{]}. A \emph{continuum} is a closed set which is not the union of two or more disjoint closed sets. Of course the derived set is closed. Suppose it is the union of the disjoint closed sets \(C_1\) and \(C_2,\) which are a distance of at least \(p\) apart. We can choose \(k_0\) such that \(\|\omega^{(k)}-\omega^{(k+1)}\|\leq\frac{p}{3}\) for all \(k\geq k_0.\) But the sequence has accumulation points in both \(C_1\) and \(C_2,\) so for \(k\geq k_0\) it must travel between the \(\frac{p}{3}\)-neighborhoods of \(C_1\) and \(C_2\) infinitely often, in steps of length at most \(\frac{p}{3}.\) It therefore has infinitely many elements, and hence an accumulation point, at distance at least \(\frac{p}{3}\) from both \(C_1\) and \(C_2,\) which contradicts \(\Omega'=C_1\cup C_2.\) \textbf{QED} Only if the derived set is a single point do we have actual convergence. Thus Meyer's theorem still does not prove actual convergence, but it is close enough for all practical purposes. Observe that boundedness is essential in Theorem \ref{T:ostrow}, otherwise the derived set may be empty: think of the sequence of partial sums of the series \(\sum\frac{1}{k}.\) \section{Rates of Convergence}\label{rates-of-convergence} The basic result we use is due to Perron and \citet{ostrowski_66}. \textbf{Theorem:} \begin{itemize} \tightlist \item If the iterative algorithm \(x^{(k+1)}=\mathcal{A}(x^{(k)})\) converges to \(x_\infty,\) \item and \(\mathcal{A}\) is differentiable at \(x_\infty,\) \item and \(0<\rho=\|\mathcal{D}\mathcal{A}(x_\infty)\|<1,\) \end{itemize} then the algorithm is linearly convergent with rate \(\rho.\) \textbf{Proof:} \textbf{QED} The norm in the theorem is the spectral radius, i.e.~the maximum of the moduli of the eigenvalues.
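As a concrete illustration of this rate result, take the scalar Picard iteration \(x^{(k+1)}=\cos(x^{(k)})\), a toy example of our own choosing (the book's own code is in \texttt{R}; we use a plain Python sketch here for a quick numerical check). The map has a unique fixed point \(x_\infty\) with \(0<\sin(x_\infty)<1\), so the theorem predicts linear convergence with rate \(\sin(x_\infty)\approx 0.674\):

```python
import math

# Scalar illustration of the Perron-Ostrowski rate: for the Picard
# iteration x_{k+1} = cos(x_k) the fixed point x_inf satisfies
# x_inf = cos(x_inf), and the ratio of successive errors converges to
# |A'(x_inf)| = sin(x_inf), which is strictly between 0 and 1.
def iterate(f, x, n):
    xs = [x]
    for _ in range(n):
        x = f(x)
        xs.append(x)
    return xs

xs = iterate(math.cos, 1.0, 60)
x_inf = xs[-1]                        # converged to about 0.739085
errs = [abs(x - x_inf) for x in xs[:40]]
rates = [errs[k + 1] / errs[k] for k in range(39)]
print(rates[-1], math.sin(x_inf))     # both approximately 0.6736
```

The observed error ratio stabilizes at the predicted rate after a handful of iterations.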
Let us call the derivative of \(\mathcal{A}\) the \emph{iteration matrix} and write it as \(\mathcal{M}.\) In general block relaxation methods have linear convergence, and the linear convergence can be quite slow. In cases where the accumulation points are a continuum we have sublinear rates. The same can happen if the local minimum is not strict, or if we are converging to a saddle point. Generalization to non-differentiable maps. Points of attraction and repulsion. Superlinear convergence, etc. \subsection{Over- and Under-Relaxation}\label{over--and-under-relaxation} \subsection{Acceleration of Convergence of Fixed Point Methods}\label{acceleration-of-convergence-of-fixed-point-methods} \section{Matrix Algebra}\label{matrix-algebra} \subsection{Eigenvalues and Eigenvectors of Symmetric Matrices}\label{eigenvalues-and-eigenvectors-of-symmetric-matrices} In this section we give a fairly complete introduction to eigenvalue problems and generalized eigenvalue problems. We use a constructive variational approach, basically using the Rayleigh quotient and deflation. This works best for positive semi-definite matrices, but after dealing with those we discuss several generalizations. Suppose \(A\) is a positive semi-definite matrix of order \(n\). Consider the problem of maximizing the quadratic form \(f(x)=x'Ax\) on the sphere \(x'x=1\). At the maximum, which is always attained, we have \(Ax=\lambda x\), with \(\lambda\) a Lagrange multiplier, as well as \(x'x=1\). It follows that \(\lambda=f(x)\). Note that the maximum is not necessarily attained at a unique value. Also, the maximum is zero if and only if \(A\) is zero. Any pair \((x,\lambda)\) such that \(Ax=\lambda x\) and \(x'x=1\) is called an \emph{eigen-pair} of \(A\). The members of the pair are the \emph{eigenvector} \(x\) and the corresponding \emph{eigenvalue} \(\lambda\).
\textbf{Result 1:} Suppose \((x_1,\lambda_1)\) and \((x_2,\lambda_2)\) are two eigen-pairs, with \(\lambda_1\not=\lambda_2.\) Then premultiplying both sides of \(Ax_2=\lambda_2x_2\) by \(x_1'\) gives \(\lambda_1x_1'x_2=\lambda_2x_1'x_2\), and thus \(x_1'x_2=0\). This shows that \(A\) cannot have more than \(n\) distinct eigenvalues. If there were \(p>n\) distinct eigenvalues, then the \(n\times p\) matrix \(X\), which has the corresponding eigenvectors as columns, would have column-rank \(p\), while its row-rank is at most \(n<p\), which is impossible because row-rank and column-rank are equal. In words: one cannot have more than \(n\) orthonormal vectors in \(n-\)dimensional space. Suppose the distinct values are \(\tilde\lambda_1>\cdots>\tilde\lambda_p\), with \(p\leq n.\) Thus each of the eigenvalues \(\lambda_i\) is equal to one of the \(\tilde\lambda_s.\) \textbf{Result 2:} If \((x_1,\lambda)\) and \((x_2,\lambda)\) are two eigen-pairs with the same eigenvalue \(\lambda\) then any linear combination \(\alpha x_1+\beta x_2,\) suitably normalized, is also an eigenvector with eigenvalue \(\lambda\). Thus the eigenvectors corresponding with an eigenvalue \(\lambda\) form a linear subspace of \(\mathbb{R}^n\), with dimension, say, \(1\leq n_s\leq n\). This subspace can be given an orthonormal basis in an \(n\times n_s\) matrix \(X_s\). The number \(n_s\) is the \emph{multiplicity} of \(\tilde\lambda_s,\) and by implication of any eigenvalue \(\lambda_i\) equal to \(\tilde\lambda_s\). Of course these results are only useful if eigen-pairs exist. We have shown that at least one eigen-pair exists, the one corresponding to the maximum of \(f\) on the sphere. We now give a procedure to compute additional eigen-pairs. Consider the following algorithm for generating a sequence \(A^{(k)}\) of matrices. We start with \(k=1\) and \(A^{(1)}=A\). 1. \textbf{Test:} If \(A^{(k)}=0\) stop. 2. \textbf{Maximize:} Compute the maximum of \(x'A^{(k)}x\) over \(x'x=1\). Suppose this is attained at an eigen-pair \((x^{(k)},\lambda^{(k)})\).
If the maximizer is not unique, select an arbitrary one. 3. \textbf{Orthogonalize:} Replace \(x^{(k)}\) by \(x^{(k)}-\sum_{\ell=1}^{k-1}((x^{(\ell)})'x^{(k)})x^{(\ell)}.\) 4. \textbf{Deflate:} Set \(A^{(k+1)}=A^{(k)}-\lambda^{(k)}x^{(k)}(x^{(k)})'\). 5. \textbf{Update:} Go back to step 1 with \(k\) replaced by \(k+1\). If \(k=1\) then in step (2) we compute the largest eigenvalue of \(A\) and a corresponding eigenvector. In that case there is no step (3). Step (4) constructs \(A^{(2)}\) by \emph{deflation}, which basically removes the contribution of the largest eigenvalue and corresponding eigenvector. If \(x\) is an eigenvector of \(A\) with eigenvalue \(\lambda<\lambda^{(1)}\), then \[ A^{(2)}x=Ax-\lambda^{(1)}x^{(1)}(x^{(1)})'x=Ax=\lambda x \] by result (1) above. Also, of course, \(A^{(2)}x^{(1)}=0\), so \(x^{(1)}\) is an eigenvector of \(A^{(2)}\) with eigenvalue \(0\). If \(x\not= x^{(1)}\) is an eigenvector of \(A\) with eigenvalue \(\lambda=\lambda^{(1)}\), then by result (2) we can choose \(x\) such that \(x'x^{(1)}=0,\) and thus \[ A^{(2)}x=Ax-\lambda x^{(1)}(x^{(1)})'x=\lambda(I-x^{(1)}(x^{(1)})')x=\lambda x. \] We see that \(A^{(2)}\) has the same eigenvectors as \(A\), with the same multiplicities, except for \(\lambda^{(1)}\), whose multiplicity is decreased by one, and zero, whose multiplicity is increased by one. Now if \(x^{(2)}\) is the eigenvector corresponding with \(\lambda^{(2)}\), the largest eigenvalue of \(A^{(2)}\), then by result (1) \(x^{(2)}\) is automatically orthogonal to \(x^{(1)}\), which is an eigenvector of \(A^{(2)}\) with eigenvalue zero. Thus step (3) is never strictly necessary, although it will lead to more precise numerical computation. Following the steps of the algorithm we see that it defines \(p\) orthonormal matrices \(X_s\), which moreover satisfy \(X_s'X_t=0\) for \(s\not= t\), and with \(\sum_{s=1}^p n_s=\mathbf{rank}(A)\).
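The maximize-and-deflate steps above can be sketched numerically. In the following Python sketch (ours, for illustration only; the exact maximization in step (2) is replaced by power iteration, an extra assumption) we recover both eigen-pairs of a small matrix with known eigenvalues \(3\) and \(1\):

```python
import math

# Sketch of the maximize-and-deflate procedure for a 2 x 2 positive
# semi-definite matrix. Step (2) is carried out by power iteration,
# an assumption made purely for this illustration.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def normalize(x):
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]

def max_eigenpair(A, iters=500):
    x = normalize([1.0, 0.3])          # arbitrary starting vector
    for _ in range(iters):
        x = normalize(matvec(A, x))
    Ax = matvec(A, x)
    lam = sum(xi * yi for xi, yi in zip(x, Ax))  # Rayleigh quotient
    return lam, x

A = [[2.0, 1.0], [1.0, 2.0]]           # eigenvalues 3 and 1
lam1, x1 = max_eigenpair(A)
# Step (4), deflation: A2 = A - lam1 * x1 x1'
A2 = [[A[i][j] - lam1 * x1[i] * x1[j] for j in range(2)] for i in range(2)]
lam2, x2 = max_eigenpair(A2)
print(round(lam1, 6), round(lam2, 6))  # 3.0 1.0
```

As the text observes, \(x^{(2)}\) comes out orthogonal to \(x^{(1)}\) without an explicit orthogonalization step.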
Also \[ A=\sum_{s=1}^p \tilde\lambda_sP_s,\tag{1a} \] where \(P_s\) is the projector \(X_sX_s'\). This is the \emph{eigen decomposition} or the \emph{spectral decomposition} of a positive semi-definite \(A\). Our algorithm stops when \(A^{(k)}=0\), which is the same as \(\sum_{s=1}^p n_s=\mathbf{rank}(A)\). If \(\mathbf{rank}(A)<n\) then zero is also an eigenvalue, with multiplicity \(n-\mathbf{rank}(A)\), and with the null space of \(A\) as its eigenspace. We also show that eigenvectors corresponding to distinct eigenvalues are linearly independent. Suppose \(x_1,\cdots,x_m\) are eigenvectors with distinct eigenvalues \(\lambda_1,\cdots,\lambda_m\), and suppose that only \(p<m\) of them, say \(x_1,\cdots,x_p\), are linearly independent. Then for \(j>p\) there exist \(\alpha_{js}\) such that \[ x_j=\sum_{s=1}^p\alpha_{js} x_s.\tag{1} \] Premultiply \((1)\) with \(\lambda_j\) to get \[ \lambda_jx_j=\sum_{s=1}^p\alpha_{js} \lambda_jx_s.\tag{2} \] Premultiply \((1)\) by \(A\) to get \[ \lambda_jx_j=\sum_{s=1}^p\alpha_{js} \lambda_sx_s.\tag{3} \] Subtract \((2)\) from \((3)\) to get \[ \sum_{s=1}^p\alpha_{js}(\lambda_s-\lambda_j)x_s = 0, \] which implies that \(\alpha_{js}(\lambda_s-\lambda_j)=0,\) because the \(x_s\) are linearly independent. Since the eigenvalues are unequal, this implies \(\alpha_{js}=0\) and thus \(x_j=0\) for all \(j>p\), contradicting that the \(x_j\) are eigenvectors. Thus \(p=m\) and the \(x_j\) are linearly independent. \textbf{Note 030615} Add small amount on defective matrices. Add stuff on characteristic and minimal polynomials. Talk about using the SVD instead. \subsection{Modified Eigenvalue Problems}\label{modified-eigenvalue-problems} Suppose we know an eigen decomposition \(B=K\Phi K'\) of a real symmetric matrix \(B\) of order \(n\), and we want to find an eigen decomposition of the rank-one modification \(A=B+\gamma cc'\), where \(\gamma\not= 0\). The problem was first discussed systematically by \citet{golub_73}. Also see \citet{bunch_nielsen_sorensen_78} for a more detailed treatment and implementation. Eigen-pairs of \(A\) must satisfy \[ (B+\gamma cc')x=\lambda x. \] Change variables to \(y=K'x\) and define \(d\mathop{=}\limits^{\Delta}K'c\). For the time being suppose all elements of \(d\) are non-zero and all elements of \(\Phi\) are different, with \(\phi_1>\cdots>\phi_n\).
We must solve \[ (\Phi+\gamma dd')y=\lambda y, \] which we can also write as \[ (\Phi-\lambda I)y=-\gamma (d'y)d. \] Suppose \((y,\lambda)\) is a solution with \(d'y=0\). Then \((\Phi-\lambda I)y=0\) and because all \(\phi_k\) are different \(y\) must be a vector with a single non-zero element, say \(y_k\). But then \(d'y=y_kd_k\), which is non-zero. Thus \(d'y\) is non-zero at a solution, and because eigenvectors are determined up to a scalar factor we may as well require \(d'y=1\). Now solve \begin{align*} (\Phi-\lambda I)y&=-\gamma d,\\ d'y&=1. \end{align*} At a solution we must have \(\lambda\not=\phi_i\), because otherwise \(d_i\) would be zero. Thus \[ y_i=-\gamma \frac{d_i}{\phi_i-\lambda}, \] and we can find \(\lambda\) by solving \[ 1+\gamma\sum_{i=1}^n\frac{d_i^2}{\phi_i-\lambda}=0. \] If we define \[ f(\lambda)=\sum_{i=1}^n\frac{d_i^2}{\phi_i-\lambda}, \] then we must solve \(f(\lambda)=-\frac{1}{\gamma}\). Let's first look at a particular \(f\). \begin{center}\includegraphics{bras_files/figure-latex/secular1-1} \end{center} Figure 1: Linear Secular Equation We have \(f'(\lambda)>0\) for all \(\lambda\), and \[ \lim_{\lambda\rightarrow-\infty}f(\lambda)=\lim_{\lambda\rightarrow+\infty}f(\lambda)=0. \] There are vertical asymptotes at all \(\phi_i\), and between \(\phi_{i+1}\) and \(\phi_i\) the function increases from \(-\infty\) to \(+\infty\). For \(\lambda<\phi_n\) the function increases from 0 to \(+\infty\) and for \(\lambda>\phi_1\) it increases from \(-\infty\) to 0. Thus the equation \(f(\lambda)=-1/\gamma\) has one solution in each of the \(n-1\) open intervals between the \(\phi_i\). If \(\gamma<0\) it has an additional solution smaller than \(\phi_n\) and if \(\gamma>0\) it has a solution larger than \(\phi_1\). If \(\gamma<0\) then \[ \lambda_n<\phi_n<\lambda_{n-1}<\cdots<\phi_2<\lambda_1<\phi_1, \] and if \(\gamma>0\) then \[ \phi_n<\lambda_n<\phi_{n-1}\cdots<\phi_2<\lambda_2<\phi_1<\lambda_1.
\] Finding the actual eigenvalues in their intervals can be done with any root-finding method. Of course some will be better than others for solving this particular problem. See \citet{melman_95}, \citet{melman_97}, and \citet{melman_98} for suggestions and comparisons. We still have to deal with the assumptions that the elements of \(d\) are non-zero and that all \(\phi_i\) are different. Suppose \(p\) elements of \(d\) are zero; without loss of generality they are the last \(p\). Partition \(\Phi\) and \(d\) accordingly. Then we need to solve the modified eigen-problem for \[ \begin{bmatrix} \Phi_1+\gamma d_1d_1'&0\\ 0&\Phi_2 \end{bmatrix}. \] But this is a direct sum of smaller matrices, and the eigenvalue problems for \(\Phi_2\) and \(\Phi_1+\gamma d_1d_1'\) can be solved separately. If not all \(\phi_i\) are different we can partition the matrix into blocks corresponding with the, say, \(p\) different eigenvalues. \[ \begin{bmatrix} \phi_1I+\gamma d_1d_1'&\gamma d_1d_2'&\cdots&\gamma d_1d_p'\\ \gamma d_2d_1'&\phi_2I+\gamma d_2d_2'&\cdots&\gamma d_2d_p'\\ \vdots&\vdots&\ddots&\vdots\\ \gamma d_pd_1'&\gamma d_pd_2'&\cdots&\phi_pI+\gamma d_pd_p' \end{bmatrix}. \] Now use the \(p\) matrices \(L_s\) which are square orthonormal of order \(n_s\), and have their first column equal to \(d_s/\|d_s\|.\) Form the direct sum \(L\) of the \(L_s\) and compute \(L'(\Phi+\gamma dd')L\). This gives \[ \begin{bmatrix} \phi_1I+\gamma\|d_1\|^2 e_1e_1'&\gamma\|d_1\|\|d_2\|e_1e_2'&\cdots&\gamma\|d_1\|\|d_p\|e_1e_p'\\ \gamma\|d_2\|\|d_1\|e_2e_1'&\phi_2I+\gamma\|d_2\|^2e_2e_2'&\cdots&\gamma\|d_2\|\|d_p\|e_2e_p'\\ \vdots&\vdots&\ddots&\vdots\\ \gamma \|d_p\|\|d_1\|e_pe_1'&\gamma \|d_p\|\|d_2\|e_pe_2'&\cdots&\phi_pI+\gamma\|d_p\|^2e_pe_p' \end{bmatrix} \] where \(e_s\) is the first unit vector in block \(s\), i.e.~the vector of length \(n_s\) which is zero except for its first element, which is one.
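As a numerical aside on the non-degenerate case: every solution of \(f(\lambda)=-1/\gamma\) is bracketed by the interlacing analysis above, so even plain bisection finds it reliably. The Python sketch below (a made-up example with \(\Phi=\mathbf{diag}(3,1)\), \(\gamma=1\), and \(d=(1,1)\), so that the modified matrix \(\begin{bmatrix}4&1\\1&2\end{bmatrix}\) has eigenvalues \(3\pm\sqrt{2}\) in closed form) recovers both roots:

```python
import math

# Secular equation for the rank-one update Phi + gamma*d d', with
# Phi = diag(3, 1), gamma = 1, d = (1, 1); the updated eigenvalues are
# 3 +/- sqrt(2) in closed form, one in (phi_2, phi_1), one above phi_1.
phi = [3.0, 1.0]
d = [1.0, 1.0]
gamma = 1.0

def g(lam):
    # f(lambda) + 1/gamma; the roots of g are the updated eigenvalues
    return sum(di * di / (p - lam) for di, p in zip(d, phi)) + 1.0 / gamma

def bisect(lo, hi, tol=1e-12):
    # g changes sign on [lo, hi]; shrink the bracket by bisection
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

lam_low = bisect(1.0 + 1e-9, 3.0 - 1e-9)   # root between phi_2 and phi_1
lam_high = bisect(3.0 + 1e-9, 100.0)       # root above phi_1 (gamma > 0)
print(lam_low, lam_high)
```

For \(\gamma>0\) one root lies in \((\phi_2,\phi_1)\) and one above \(\phi_1\), in agreement with the interlacing inequalities.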
A row and column permutation makes the matrix a direct sum of the \(p\) diagonal matrices \(\phi_s I\) of order \(n_s-1\) and the \(p\times p\) matrix \[ \begin{bmatrix} \phi_1+\gamma\|d_1\|^2 &\gamma\|d_1\|\|d_2\|&\cdots&\gamma\|d_1\|\|d_p\|\\ \gamma\|d_2\|\|d_1\|&\phi_2+\gamma\|d_2\|^2&\cdots&\gamma\|d_2\|\|d_p\|\\ \vdots&\vdots&\ddots&\vdots\\ \gamma \|d_p\|\|d_1\|&\gamma \|d_p\|\|d_2\|&\cdots&\phi_p+\gamma\|d_p\|^2 \end{bmatrix}. \] This last matrix satisfies our assumptions of different diagonal elements and non-zero off-diagonal elements, and consequently can be analyzed by using our previous results. A very similar analysis is possible for the modified singular value decomposition, for which we refer to \citet{bunch_nielsen_78}. \subsection{Quadratic on a Sphere}\label{quadratic-on-a-sphere-1} Another problem naturally leading to a different secular equation is finding stationary values of a quadratic function \(f\) defined by \[ f(x)=\frac12 x'Ax-b'x + c \] on the unit sphere \(\{x\mid x'x=1\}\). This was first studied by \citet{forsythe_golub_65}. Their treatment was subsequently simplified and extended by \citet{spjotvoll_72} and \citet{gander_81}. The problem has recently received some attention because of the development of trust region methods for optimization, and, indeed, because of Nesterov majorization. The stationary equations are \begin{align*} (A-\lambda I)x&=b,\\ x'x&=1. \end{align*} Suppose \(A=K\Phi K'\) with \(\phi_1\geq\cdots\geq\phi_n\), change variables to \(y=K'x\), and define \(d\mathop{=}\limits^{\Delta}K'b\). Then we must solve \begin{align*} (\Phi-\lambda I)y&=d,\\ y'y&=1. \end{align*} Assume for now that the elements of \(d\) are non-zero. Then \(\lambda\) cannot be equal to one of the \(\phi_i\). Thus \[ y_i=\frac{d_i}{\phi_i-\lambda} \] and we must have \(h(\lambda)=1\), where \[ h(\lambda)\mathop{=}\limits^{\Delta}\sum_{i=1}^n\frac{d_i^2}{(\phi_i-\lambda)^2}. \] Again, let's look at an example of a particular \(h\).
The plots in Figure 1 show both \(f\) and \(h\). We see that \(h(\lambda)=1\) has 12 solutions, so the remaining question is which one corresponds with the minimum of \(f\). \begin{center}\includegraphics{bras_files/figure-latex/secular2-1} \end{center} Figure 1: Quadratic Secular Equation Again \(h\) has vertical asymptotes at the \(\phi_i\). Between two asymptotes \(h\) decreases from \(+\infty\) to a minimum, and then increases again to \(+\infty\). Note that \[ h'(\lambda)=2\sum_{i=1}^n\frac{d_i^2}{(\phi_i-\lambda)^3}, \] and \[ h''(\lambda)=6\sum_{i=1}^n\frac{d_i^2}{(\phi_i-\lambda)^4}, \] and thus \(h\) is convex in each of the intervals between asymptotes. Also \(h\) is convex and increasing from zero to \(+\infty\) on \((-\infty,\phi_n)\) and convex and decreasing from \(+\infty\) to zero on \((\phi_1,+\infty)\). \subsection{Generalized Inverses}\label{generalized-inverses} \subsection{Partitioned Matrices}\label{partitioned-matrices} \section{Matrix Differential Calculus}\label{matrix-differential-calculus} \subsection{Matrix Derivatives}\label{matrix-derivatives} A matrix, of course, is just an element of a finite dimensional linear vector space. We write \(X\in\mathbb{R}^{n\times m}\), and we use the inner product \(\langle X,Y\rangle=\mathbf{tr}\ X'Y,\) and corresponding norm \(\|X\|=\sqrt{\mathbf{tr}\ X'X}.\) Thus derivatives of real-valued functions of matrices, or derivatives of matrix-valued functions of matrices, are covered by the usual definitions and formulas. Nevertheless there is a surprisingly huge literature on differential calculus for real-valued functions of matrices, and matrix-valued functions of matrices. One of the reasons for the proliferation of publications is that a matrix-valued function of matrices can be thought of as a function from matrix space \(\mathbb{R}^{n\times m}\) to matrix space \(\mathbb{R}^{p\times q}\), but also as a function from vector space \(\mathbb{R}^{nm}\) to vector space \(\mathbb{R}^{pq}\).
There are obvious isomorphisms between the two representations, but they naturally lead to different notations. We will consistently choose the matrix-space formulation, and consequently minimize the role of the \(\mathbf{vec}()\) operator and of special constructs such as the commutation and duplication matrix, which belong to the vector-space formulation. Nevertheless, having a compendium of the standard real-valued and matrix-valued functions available is of some interest. The main reference is the book by \citet{magnus_neudecker_99}. We will avoid using differentials and the \(\mathbf{vec}()\) operator. Suppose \(F\) is a matrix valued function of a single variable \(x\). In other words \(F:\mathbb{R}\rightarrow\mathbb{R}^{n\times m}\) is a matrix of functions, as in \[ F(x)= \begin{bmatrix} f_{11}(x)&f_{12}(x)&\cdots&f_{1m}(x)\\ f_{21}(x)&f_{22}(x)&\cdots&f_{2m}(x)\\ \vdots&\vdots&\ddots&\vdots\\ f_{n1}(x)&f_{n2}(x)&\cdots&f_{nm}(x) \end{bmatrix}. \] Now the derivatives of any order of \(F\), if they exist, are also matrix valued functions \[ \mathcal{D}^sF(x)= \begin{bmatrix} \mathcal{D}^sf_{11}(x)&\mathcal{D}^sf_{12}(x)&\cdots&\mathcal{D}^sf_{1m}(x)\\ \mathcal{D}^sf_{21}(x)&\mathcal{D}^sf_{22}(x)&\cdots&\mathcal{D}^sf_{2m}(x)\\ \vdots&\vdots&\ddots&\vdots\\ \mathcal{D}^sf_{n1}(x)&\mathcal{D}^sf_{n2}(x)&\cdots&\mathcal{D}^sf_{nm}(x) \end{bmatrix}.
\] If \(F\) is a function of a vector \(x\in\mathbb{R}^p\) then partial derivatives are defined similarly, as in \[ \mathcal{D}_{i_1\cdots i_s}F(x)= \begin{bmatrix} \mathcal{D}_{i_1\cdots i_s}f_{11}(x)&\mathcal{D}_{i_1\cdots i_s}f_{12}(x)&\cdots&\mathcal{D}_{i_1\cdots i_s}f_{1m}(x)\\ \mathcal{D}_{i_1\cdots i_s}f_{21}(x)&\mathcal{D}_{i_1\cdots i_s}f_{22}(x)&\cdots&\mathcal{D}_{i_1\cdots i_s}f_{2m}(x)\\ \vdots&\vdots&\ddots&\vdots\\ \mathcal{D}_{i_1\cdots i_s}f_{n1}(x)&\mathcal{D}_{i_1\cdots i_s}f_{n2}(x)&\cdots&\mathcal{D}_{i_1\cdots i_s}f_{nm}(x) \end{bmatrix}, \] with \(1\leq i_s\leq p.\) The notation becomes slightly more complicated if \(F\) is a function of a \(p\times q\) matrix \(X\), i.e.~an element of \(\mathbb{R}^{p\times q}\). It then makes sense to write the partials as \(\mathcal{D}_{(i_1,j_1)\cdots(i_s,j_s)}F(X)\) where \(1\leq i_s\leq p\) and \(1\leq j_s\leq q.\) \subsection{Derivatives of Eigenvalues and Eigenvectors}\label{derivatives-of-eigenvalues-and-eigenvectors} This appendix summarizes some of the results in \citet{deleeuw_R_07c}, \citet{deleeuw_R_08b}, and \citet{deleeuw_sorenson_U_12b}. We refer to those reports for more extensive calculations and applications. Suppose \(A\) and \(B\) are two real symmetric matrices depending smoothly on a real parameter \(\theta\). The notation below suppresses the dependence on \(\theta\) of the various quantities we talk about, but it is important to remember that all eigenvalues and eigenvectors we talk about are functions of \(\theta\). The \emph{generalized eigenvalue} \(\lambda_s\) and the corresponding \emph{generalized eigenvector} \(x_s\) are defined implicitly by \(Ax_s=\lambda_sBx_s\). Moreover the eigenvector is identified by \(x_s'Bx_s^{\ }=1\). We suppose that in a neighborhood of \(\theta\) the eigenvalue \(\lambda_s\) is unique and \(B\) is positive definite. A precise discussion of the required assumptions is, for example, in \citet{wilkinson_65} or \citet{kato_76}. 
Differentiating \(Ax_s=\lambda_sBx_s\) gives the equation \[ (A-\lambda_sB)(\mathcal{D}x_s)=-((\mathcal{D}A)-\lambda_s(\mathcal{D}B))x_s+(\mathcal{D}\lambda_s)Bx_s, \tag{1} \] while \(x_s'Bx_s=1\) gives \[ x_s'B(\mathcal{D}x_s)=-\frac12 x_s'(\mathcal{D}B)x_s. \tag{2} \] Premultiplying \(\mathbf{(1)}\) by \(x_s'\) gives \[ \mathcal{D}\lambda_s=x_s'((\mathcal{D}A)-\lambda_s(\mathcal{D}B))x_s. \] Now suppose \(AX=BX\Lambda\) with \(X'BX=I\). Then from \(\mathbf{(1)}\), for \(t\not= s\), premultiplying by \(x_t'\) gives \[ (\lambda_t-\lambda_s)x_t'B(\mathcal{D}x_s)=-x_t'((\mathcal{D}A)-\lambda_s(\mathcal{D}B))x_s. \] If we define \(g\) by \[ g_t\mathop{=}\limits^{\Delta}\begin{cases}\frac{1}{\lambda_t-\lambda_s}x_t'((\mathcal{D}A)-\lambda_s(\mathcal{D}B))x_s&\text{ for }t\not= s,\\ \frac12 x_t'(\mathcal{D}B)x_t&\text{ for }t=s, \end{cases} \] then \(X'B(\mathcal{D}x_s)=-g\) and thus \(\mathcal{D}x_s=-Xg\). A first important special case is the \emph{ordinary eigenvalue problem}, in which \(B=I,\) which obviously does not depend on \(\theta\), and consequently has \(\mathcal{D}B=0\). Then \[ \mathcal{D}\lambda_s=x_s'(\mathcal{D}A)x_s, \] while \[ g_t\mathop{=}\limits^{\Delta}\begin{cases}\frac{1}{\lambda_t-\lambda_s}x_t'(\mathcal{D}A)x_s&\text{ for }t\not= s,\\ 0&\text{ for }t=s. \end{cases} \] If we use the Moore-Penrose inverse, the derivative of the eigenvector can be written as \[ \mathcal{D}x_s=-(A-\lambda_s I)^+(\mathcal{D}A)x_s. \] Written in a different way this expression is \[ \mathcal{D}x_s=\sum_{t\not =s}\frac{u_{st}}{\lambda_s-\lambda_t}x_t, \] with \(U\mathop{=}\limits^{\Delta}X'(\mathcal{D}A)X\), so that \(\mathcal{D}\lambda_s=u_{ss}\). The next important special case is the \emph{singular value problem}. The singular values and vectors of an \(n\times m\) rectangular \(Z\), with \(n\geq m\), solve the equations \(Zy_s=\lambda_s x_s\) and \(Z'x_s=\lambda_s y_s\).
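Before developing the singular value case, it is worth checking the formula \(\mathcal{D}\lambda_s=x_s'(\mathcal{D}A)x_s\) numerically. The Python sketch below uses a hypothetical one-parameter family \(A(\theta)\) of our own making, and compares the formula against a central finite difference:

```python
import math

# Check D lambda = x'(DA)x on the family A(theta) = [[2 + theta, 1], [1, 2]].
# At theta = 0 the largest eigenvalue is 3, with eigenvector x = (1,1)/sqrt(2),
# and DA = [[1, 0], [0, 0]], so the formula predicts D lambda = 1/2.

def lam_max(theta):
    # largest eigenvalue of the 2 x 2 symmetric matrix A(theta), closed form
    tr = 4.0 + theta
    det = (2.0 + theta) * 2.0 - 1.0
    return 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))

h = 1e-6
fd = (lam_max(h) - lam_max(-h)) / (2.0 * h)   # central finite difference
x = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
pred = x[0] * 1.0 * x[0]                      # x'(DA)x with DA = e1 e1'
print(fd, pred)  # both approximately 0.5
```

The agreement is to finite-difference accuracy, as expected for an isolated eigenvalue.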
It follows that \(Z'Zy_s=\lambda_s^2y_s\), i.e.~the right singular vectors are the eigenvectors and the singular values are the square roots of the eigenvalues of \(A=Z'Z\). Now we can apply our previous results on eigenvalues and eigenvectors. If \(A=Z'Z\) then \(\mathcal{D}A=Z'(\mathcal{D}Z)+(\mathcal{D}Z)'Z\). We have, at an isolated singular value \(\lambda_s\), \[ \mathcal{D}\lambda_s^2=y_s'(Z'(\mathcal{D}Z)+(\mathcal{D}Z)'Z)y_s=2\lambda_s x_s'(\mathcal{D}Z)y_s, \] and thus \[ \mathcal{D}\lambda_s=x_s'(\mathcal{D}Z)y_s. \] For the singular vectors our previous results on eigenvectors give \[ \mathcal{D}y_s=-(Z'Z-\lambda_s^2 I)^+(Z'(\mathcal{D}Z)+(\mathcal{D}Z)'Z)y_s, \] and in the same way \[ \mathcal{D}x_s=-(ZZ'-\lambda_s^2 I)^+(Z(\mathcal{D}Z)'+(\mathcal{D}Z)Z')x_s. \] Now let \(Z=X\Lambda Y'\), with \(X\) and \(Y\) square orthonormal, and with \(\Lambda\) an \(n\times m\) diagonal matrix (with \(\mathbf{rank}(Z)\) positive diagonal entries in non-increasing order along the diagonal). Also define \(U\mathop{=}\limits^{\Delta}X'(\mathcal{D}Z)Y\). Then \(\mathcal{D}\lambda_s=u_{ss}\), and \[ \mathcal{D}y_s =\sum_{t\not= s}\frac{\lambda_su_{st}+\lambda_t u_{ts}}{\lambda_s^2-\lambda_t^2} y_t, \] and \[ \mathcal{D}x_s =\sum_{t\not= s}\frac{\lambda_t u_{st}+\lambda_su_{ts}}{\lambda_s^2-\lambda_t^2} x_t. \] Note that if \(Z\) is symmetric we have \(X=Y\) and \(U\) is symmetric, so we recover our previous result for eigenvectors. Also note that if the parameter \(\theta\) is actually element \((i,j)\) of \(Z\), i.e.~if we are computing partial derivatives, then \(u_{st}=x_{is}y_{jt}\). The results on eigen and singular value decomposition can be applied in many different ways, mostly by simply using the product rule for derivatives. For a square symmetric \(A\) of order \(n\), for example, we have \[ f(A)\mathop{=}\limits^{\Delta}\sum_{s=1}^n f(\lambda_s)x_sx_s'.
\] and thus \[ \mathcal{D}f(A)=\sum_{s=1}^n\left\{f'(\lambda_s)(\mathcal{D}\lambda_s)x_sx_s'+f(\lambda_s)\left(x_s(\mathcal{D}x_s)'+(\mathcal{D}x_s)x_s'\right)\right\}. \] The generalized inverse of a rectangular \(Z\) is \[ Z^+\mathop{=}\limits^{\Delta}\sum_{s=1}^r \frac{1}{\lambda_s}y_sx_s', \] where \(r=\mathbf{rank}(Z)\). Summation is over the positive singular values, and for differentiability we must assume that the rank of \(Z\) is constant in a neighborhood of \(\theta\). The Procrustes transformation of a rectangular \(Z\), which is the projection of \(Z\) on the Stiefel manifold of orthonormal matrices, is \[ \mathbf{proc}(Z)\mathop{=}\limits^{\Delta}Z(Z'Z)^{-\frac12}=\sum_{s=1}^m x_sy_s', \] where we assume for differentiability that \(Z\) is of full column rank. The projection of \(Z\) on the set of all matrices of rank less than or equal to \(r\), which is of key importance in PCA and MDS, is \[ \Pi_r(Z)\mathop{=}\limits^{\Delta}\sum_{s=1}^r\lambda_s x_sy_s'=Z\sum_{s=1}^ry_sy_s', \] where summation is over the \(r\) largest singular values. \section{Graphics and Code}\label{graphics-and-code} \subsection{Multidimensional Scaling}\label{multidimensional-scaling} Many of the examples in the book are taken from the area of \emph{multidimensional scaling (MDS)}. In this appendix we describe the basic MDS notation and terminology. Our approach to MDS is based on Kruskal {[}1964ab{]}, using terminology and notation of De Leeuw {[}1977{]} and De Leeuw and Heiser {[}1982{]}. For a more recent and more extensive discussion of MDS see Borg and Groenen {[}2005{]}. The data in an MDS problem consist of information about the \emph{dissimilarities} between pairs of \(n\) \emph{objects}. Dissimilarities are like distances, in the sense that they give some information about physical or psychological closeness, but they need not satisfy any of the distance axioms.
In metric MDS the dissimilarity between objects \(i\) and \(j\) is a given number \(\delta_{ij}\), usually positive and symmetric, with possibly some of the dissimilarities missing. In non-metric MDS we only have a partial order on some or all of the \(n^2\) dissimilarities. We want to represent the \(n\) objects as \(n\) points in a \emph{metric space} in such a way that the \emph{distances} between the points approximate the dissimilarities between the objects. An MDS loss function is typically of the form \(\sigma(X,\Delta)=\|\Delta-D(X)\|\) for some norm, or pseudo-norm, on the space of \(n\times n\) matrices. Here \(X\) are the \(n\) points in the metric space, with \(D(X)\) the symmetric, non-negative, and hollow matrix of distances. The MDS problem is to minimize loss over all mappings \(X\) and all feasible \(\Delta\). In the metric MDS problems \(\Delta\) is fixed at the observed data, in non-metric MDS any monotone transformation of \(\Delta\) is feasible. The definition of MDS we have given leaves room for all kinds of metric spaces and all kinds of norms to measure loss. In almost all applications both in this book and elsewhere, we are interested in \emph{Euclidean MDS}, where the metric space is \(\mathbb{R}^p\), and in loss functions that use the (weighted) sum of squares of residuals \(r_{ij}(X,\Delta)\). Thus the loss function has the general form \[ \sigma(X,\Delta)= \sum_{i=1}^n\sum_{j=1}^n w_{ij}^{\ }r_{ij}^2(X,\Delta), \] where \(X\) is an \(n\times p\) matrix called the \emph{configuration}. The most popular choices for the residuals are \begin{align*} R_1(X,\Delta)&=\Delta-D(X),\\ R_2(X,\Delta)&=\Delta^2-D^2(X),\\ R_0(X,\Delta)&=\log\Delta-\log D(X),\\ R_S(X,\Delta)&=-\frac12 J_n(\Delta^2-D^2(X))J_n. \end{align*} Here \(\Delta^2\) and \(\log\Delta\) are \emph{elementwise} transformations of the dissimilarities, with corresponding transformations \(D^2\) and \(\log D\) of the distances. 
In \(R_S\) we use the centering operator \(J_n=I-\frac{1}{n}ee'\). For Euclidean distances, and centered \(X\), \[ R_S(X,\Delta)=\Gamma(\Delta)-XX', \] with \(\Gamma(\Delta)\mathop{=}\limits^{\Delta}-\frac12 J_n\Delta^2J_n\). Metric Euclidean MDS, using \(R_S\) with unit weights, means finding the best rank \(p\) approximation to \(\Gamma(\Delta)\), which can be done by finding the \(p\) dominant eigenvalues and corresponding eigenvectors. This is also known as \emph{Classical MDS} {[}Torgerson, 1958{]}. The loss function \(\sigma_1\) that uses \(R_1\) is called \emph{stress} {[}Kruskal, 1964ab{]}, the function \(\sigma_2\) that uses \(R_2\) is \emph{sstress} {[}Takane et al, 1977{]}, and loss \(\sigma_S\) that uses \(R_S\) is \emph{strain} {[}De Leeuw and Heiser, 1982{]}. \(R_0\) has been nameless so far, but it has been proposed by Ramsay {[}1977{]}. Because of its limiting properties (see below), we will call it \emph{strull}. Both \(R_1\) and \(R_2\) are obviously special cases of \[ R_r(X,\Delta)=\Delta^r-D^r(X), \] for which the corresponding loss function \(\sigma_r\) is called \emph{r-stress}. Because \[ \lim_{r\rightarrow 0}\frac{R_r(X,\Delta)}{r}= \log\Delta-\log D(X) \] we see that \(\sigma_0\) is a limiting case of \(\frac{1}{r^2}\sigma_r\). There is some matrix notation that is useful in dealing with Euclidean MDS. Suppose \(e_i\) and \(e_j\) are unit vectors, with all elements equal to zero, except one element which is equal to one. Then \[ d_{ij}^2(X)=(e_i-e_j)'XX'(e_i-e_j)=\mathbf{tr}\ X'A_{ij}X=\mathbf{tr}\ A_{ij}C(X), \] where \(A_{ij}\mathop{=}\limits^{\Delta}(e_i-e_j)(e_i-e_j)'\) and \(C(X)\mathop{=}\limits^{\Delta}XX'\). If we define \[ A^{*p}_{ij}\mathop{=}\limits^{\Delta}\underbrace{A_{ij}\oplus\cdots\oplus A_{ij}}_{p\text{ times}}, \] and \(\overline{x}=\mathbf{vec}(X)\) then \(d_{ij}^2(X)=\overline{x}'A^{*p}_{ij}\overline{x}\), which allows us to work with vectors in \(\mathbb{R}^{np}\) instead of matrices in \(\mathbb{R}^{n\times p}\).
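The identity \(d_{ij}^2(X)=\mathbf{tr}\ X'A_{ij}X\) is easily verified numerically; the Python sketch below uses a small made-up configuration of three points in \(\mathbb{R}^2\):

```python
# Check d_ij^2(X) = tr X'A_ij X for a small configuration (made-up points).
X = [[0.0, 0.0], [3.0, 4.0], [1.0, -1.0]]  # 3 points in R^2
i, j = 0, 1
direct = sum((X[i][k] - X[j][k]) ** 2 for k in range(2))
# A_ij = (e_i - e_j)(e_i - e_j)'; compute tr X'A_ij X entry by entry
e = [0.0] * 3
e[i], e[j] = 1.0, -1.0
Aij = [[e[r] * e[c] for c in range(3)] for r in range(3)]
trace = sum(
    sum(Aij[r][c] * X[c][k] for c in range(3)) * X[r][k]
    for r in range(3) for k in range(2)
)
print(direct, trace)  # both 25.0
```

Both expressions give the squared distance between points \(i\) and \(j\), as the algebra requires.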
\subsection{Cobweb Plots}\label{cobweb-plots} Suppose we have a one-dimensional Picard sequence which starts at \(x^{(0)}\), and then is defined by \[ x^{(k+1)}=f(x^{(k)}). \] The cobweb plot draws the line \(y=x\) and the function \(y=f(x)\). A fixed point is a point where the line and the function intersect. We visualize the iteration by starting at \((x^{(0)},f(x^{(0)}))=(x^{(0)},x^{(1)})\), then draw a horizontal line to \((f(x^{(0)}),f(x^{(0)}))=(x^{(1)},x^{(1)})\), then draw a vertical line to \((x^{(1)},f(x^{(1)}))=(x^{(1)},x^{(2)})\), and so on. For a convergent sequence we will see zig-zagging parallel to the axes in smaller and smaller steps to a point where the function and the line intersect. An illustration will make this clear. The Newton iteration for the square root of \(a\) is \[ x^{(k+1)}=\frac12\left(x^{(k)}+\frac{a}{x^{(k)}}\right). \] The iterations for \(a=.5\) starting with \(x^{(0)}=.1\) are in the cobweb plot in figure \ref{fig:cobnewton}. \begin{figure} {\centering \includegraphics{bras_files/figure-latex/cobnewton-1} } \caption{Cobweb plot for Newton Square Root Iteration}\label{fig:cobnewton} \end{figure} In the code section there is \texttt{R} code for a general cobweb plotter with a variable number of parameters. 
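The Newton square-root iteration drawn in the cobweb plot can also be run numerically. Below is a minimal sketch, with \(a=0.5\) and \(x^{(0)}=0.1\) as in figure \ref{fig:cobnewton}; the function name \texttt{newtonSqrt} and the stopping rule are assumptions of this sketch.

```r
# The Picard iteration x <- (x + a / x) / 2 from the text, run until
# successive iterates agree to within eps (or itmax is reached).
newtonSqrt <- function (a, x, itmax = 50, eps = 1e-12) {
  for (itel in 1:itmax) {
    xnew <- (x + a / x) / 2
    if (abs (xnew - x) < eps) break
    x <- xnew
  }
  xnew
}
newtonSqrt (0.5, 0.1)   # converges to sqrt(0.5)
```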
\chapter{Code}\label{code} \begin{Shaded} \begin{Highlighting}[] \NormalTok{blockRelax <-} \StringTok{ }\NormalTok{function (f,} \NormalTok{x,} \NormalTok{g,} \DataTypeTok{itmax =} \DecValTok{100}\NormalTok{,} \DataTypeTok{eps =} \FloatTok{1e-8}\NormalTok{,} \DataTypeTok{verbose =} \OtherTok{TRUE}\NormalTok{) \{} \NormalTok{k <-}\StringTok{ }\KeywordTok{split} \NormalTok{(}\DecValTok{1}\NormalTok{:}\KeywordTok{length} \NormalTok{(x), g)} \NormalTok{m <-}\StringTok{ }\KeywordTok{length} \NormalTok{(k)} \NormalTok{fold <-}\StringTok{ }\KeywordTok{f} \NormalTok{(x)} \NormalTok{itel <-}\StringTok{ }\DecValTok{1} \NormalTok{blockFun <-}\StringTok{ }\NormalTok{function (z, g, y, i) \{} \NormalTok{y[i] <-}\StringTok{ }\NormalTok{z} \KeywordTok{return} \NormalTok{(}\KeywordTok{g} \NormalTok{(y))} \NormalTok{\}} \NormalTok{repeat \{} \NormalTok{for (i in }\DecValTok{1}\NormalTok{:m) \{} \NormalTok{kk <-}\StringTok{ }\NormalTok{k[[i]]} \NormalTok{o <-} \StringTok{ }\KeywordTok{optim} \NormalTok{(} \NormalTok{x[kk],} \NormalTok{blockFun,} \DataTypeTok{gr =} \OtherTok{NULL}\NormalTok{,} \DataTypeTok{g =} \NormalTok{f,} \DataTypeTok{y =} \NormalTok{x,} \DataTypeTok{i =} \NormalTok{kk,} \DataTypeTok{method =} \StringTok{"BFGS"} \NormalTok{)} \NormalTok{x[kk] <-}\StringTok{ }\NormalTok{o$par} \NormalTok{fnew <-}\StringTok{ }\NormalTok{o$value} \NormalTok{\}} \NormalTok{if (verbose)} \KeywordTok{cat}\NormalTok{(} \StringTok{"Iteration: "}\NormalTok{,} \KeywordTok{formatC} \NormalTok{(itel, }\DataTypeTok{width =} \DecValTok{3}\NormalTok{, }\DataTypeTok{format =} \StringTok{"d"}\NormalTok{),} \StringTok{"fold: "}\NormalTok{,} \KeywordTok{formatC} \NormalTok{(} \NormalTok{fold,} \DataTypeTok{digits =} \DecValTok{8}\NormalTok{,} \DataTypeTok{width =} \DecValTok{12}\NormalTok{,} \DataTypeTok{format =} \StringTok{"f"} \NormalTok{),} \StringTok{"fnew: "}\NormalTok{,} \KeywordTok{formatC} \NormalTok{(} \NormalTok{fnew,} \DataTypeTok{digits =} \DecValTok{8}\NormalTok{,} \DataTypeTok{width 
=} \DecValTok{12}\NormalTok{,} \DataTypeTok{format =} \StringTok{"f"} \NormalTok{),} \StringTok{"}\CharTok{\textbackslash{}n}\StringTok{"} \NormalTok{)} \NormalTok{if ((itel ==}\StringTok{ }\NormalTok{itmax) ||}\StringTok{ }\NormalTok{((fold -}\StringTok{ }\NormalTok{fnew) <}\StringTok{ }\NormalTok{eps))} \NormalTok{break} \NormalTok{itel <-}\StringTok{ }\NormalTok{itel +}\StringTok{ }\DecValTok{1} \NormalTok{fold <-}\StringTok{ }\NormalTok{fnew} \NormalTok{\}} \KeywordTok{return} \NormalTok{(}\KeywordTok{list} \NormalTok{(}\DataTypeTok{x =} \NormalTok{x, }\DataTypeTok{f =} \NormalTok{fnew))} \NormalTok{\}} \end{Highlighting} \end{Shaded} Code Segment 1: Block Relaxation \begin{Shaded} \begin{Highlighting}[] \NormalTok{bls <-} \StringTok{ }\NormalTok{function (z,} \NormalTok{y,} \DataTypeTok{xold =} \KeywordTok{rep}\NormalTok{(}\DecValTok{0}\NormalTok{, }\KeywordTok{ncol}\NormalTok{(y)),} \DataTypeTok{blocks =} \KeywordTok{as.list}\NormalTok{(}\DecValTok{1}\NormalTok{:}\KeywordTok{ncol}\NormalTok{(y)),} \DataTypeTok{itmax =} \DecValTok{100}\NormalTok{,} \DataTypeTok{eps =} \FloatTok{1e-10}\NormalTok{,} \DataTypeTok{verbose =} \OtherTok{TRUE}\NormalTok{) \{} \NormalTok{nblocks <-}\StringTok{ }\KeywordTok{length} \NormalTok{(blocks)} \NormalTok{fold <-}\StringTok{ }\KeywordTok{sum}\NormalTok{((z -}\StringTok{ }\NormalTok{y %*%}\StringTok{ }\NormalTok{xold) ^}\StringTok{ }\DecValTok{2}\NormalTok{)} \NormalTok{xopt <-}\StringTok{ }\KeywordTok{qr.solve}\NormalTok{(y, z)} \NormalTok{eold <-}\StringTok{ }\KeywordTok{sqrt} \NormalTok{(}\KeywordTok{sum} \NormalTok{((xold -}\StringTok{ }\NormalTok{xopt) ^}\StringTok{ }\DecValTok{2}\NormalTok{))} \NormalTok{itel <-}\StringTok{ }\DecValTok{1} \NormalTok{repeat \{} \NormalTok{xwork <-}\StringTok{ }\NormalTok{xold} \NormalTok{for (i in }\DecValTok{1}\NormalTok{:nblocks) \{} \NormalTok{u <-}\StringTok{ }\KeywordTok{drop} \NormalTok{(y %*%}\StringTok{ }\NormalTok{xwork)} \NormalTok{yact <-}\StringTok{ }\NormalTok{y[, blocks[[i]], 
drop =}\StringTok{ }\OtherTok{FALSE}\NormalTok{]} \NormalTok{xact <-}\StringTok{ }\NormalTok{xwork[blocks[[i]]]} \NormalTok{yres <-}\StringTok{ }\NormalTok{z -}\StringTok{ }\NormalTok{(u -}\StringTok{ }\NormalTok{yact %*%}\StringTok{ }\NormalTok{xact)} \NormalTok{xwork[blocks[[i]]] <-}\StringTok{ }\KeywordTok{qr.solve} \NormalTok{(yact, yres)} \NormalTok{\}} \NormalTok{xnew <-}\StringTok{ }\NormalTok{xwork} \NormalTok{fnew <-}\StringTok{ }\KeywordTok{sum}\NormalTok{((z -}\StringTok{ }\NormalTok{y %*%}\StringTok{ }\NormalTok{xnew) ^}\StringTok{ }\DecValTok{2}\NormalTok{)} \NormalTok{enew <-}\StringTok{ }\KeywordTok{sqrt} \NormalTok{(}\KeywordTok{sum} \NormalTok{((xold -}\StringTok{ }\NormalTok{xnew) ^}\StringTok{ }\DecValTok{2}\NormalTok{))} \NormalTok{if (verbose) \{} \KeywordTok{cat}\NormalTok{(} \StringTok{"itel: "}\NormalTok{,} \KeywordTok{formatC}\NormalTok{(itel, }\DataTypeTok{digits =} \DecValTok{3}\NormalTok{, }\DataTypeTok{width =} \DecValTok{3}\NormalTok{),} \StringTok{"fold: "}\NormalTok{,} \KeywordTok{formatC}\NormalTok{(} \NormalTok{fold,} \DataTypeTok{digits =} \DecValTok{6}\NormalTok{,} \DataTypeTok{width =} \DecValTok{10}\NormalTok{,} \DataTypeTok{format =} \StringTok{"f"} \NormalTok{),} \StringTok{"fnew: "}\NormalTok{,} \KeywordTok{formatC}\NormalTok{(} \NormalTok{fnew,} \DataTypeTok{digits =} \DecValTok{6}\NormalTok{,} \DataTypeTok{width =} \DecValTok{10}\NormalTok{,} \DataTypeTok{format =} \StringTok{"f"} \NormalTok{),} \StringTok{"ratio: "}\NormalTok{,} \KeywordTok{formatC}\NormalTok{(} \NormalTok{enew /}\StringTok{ }\NormalTok{eold,} \DataTypeTok{digits =} \DecValTok{6}\NormalTok{,} \DataTypeTok{width =} \DecValTok{10}\NormalTok{,} \DataTypeTok{format =} \StringTok{"f"} \NormalTok{),} \StringTok{"}\CharTok{\textbackslash{}n}\StringTok{"} \NormalTok{)} \NormalTok{\}} \NormalTok{if ((}\KeywordTok{abs}\NormalTok{(fold -}\StringTok{ }\NormalTok{fnew) <}\StringTok{ }\NormalTok{eps) ||}\StringTok{ }\NormalTok{(itel ==}\StringTok{ }\NormalTok{itmax))} 
\NormalTok{break()} \NormalTok{fold <-}\StringTok{ }\NormalTok{fnew} \NormalTok{eold <-}\StringTok{ }\NormalTok{enew} \NormalTok{xold <-}\StringTok{ }\NormalTok{xnew} \NormalTok{itel <-}\StringTok{ }\NormalTok{itel +}\StringTok{ }\DecValTok{1} \NormalTok{\}} \KeywordTok{return} \NormalTok{(xnew)} \NormalTok{\}} \end{Highlighting} \end{Shaded} Code Segment 2: Block Least Squares \begin{Shaded} \begin{Highlighting}[] \NormalTok{blockRate <-} \StringTok{ }\NormalTok{function (f,} \NormalTok{x,} \DataTypeTok{blocks =} \KeywordTok{as.list} \NormalTok{(}\DecValTok{1}\NormalTok{:}\KeywordTok{length}\NormalTok{(x)),} \DataTypeTok{numerical =} \OtherTok{FALSE}\NormalTok{,} \DataTypeTok{product_form =} \OtherTok{FALSE}\NormalTok{) \{} \NormalTok{if (numerical) \{} \NormalTok{h <-}\StringTok{ }\KeywordTok{hessian} \NormalTok{(f, x)} \NormalTok{\} else \{} \NormalTok{h <-}\StringTok{ }\KeywordTok{f} \NormalTok{(x)} \NormalTok{\}} \NormalTok{nvar <-}\StringTok{ }\KeywordTok{length} \NormalTok{(x)} \NormalTok{nblocks <-}\StringTok{ }\KeywordTok{length} \NormalTok{(blocks)} \NormalTok{nb <-}\StringTok{ }\DecValTok{1}\NormalTok{:nblocks} \NormalTok{nn <-}\StringTok{ }\DecValTok{1}\NormalTok{:nvar} \NormalTok{g <-} \StringTok{ }\KeywordTok{sapply} \NormalTok{(nn, function (i)} \KeywordTok{which} \NormalTok{(}\KeywordTok{sapply} \NormalTok{(blocks, function (x)} \KeywordTok{any} \NormalTok{(i ==}\StringTok{ }\NormalTok{x))))} \NormalTok{if (product_form) \{} \NormalTok{sder <-}\StringTok{ }\KeywordTok{diag} \NormalTok{(nvar)} \NormalTok{for (i in nb) \{} \NormalTok{bi <-}\StringTok{ }\NormalTok{blocks [[i]]} \NormalTok{ei <-}\StringTok{ }\KeywordTok{ifelse} \NormalTok{(}\KeywordTok{outer}\NormalTok{(nn, bi, }\StringTok{"=="}\NormalTok{), }\DecValTok{1}\NormalTok{, }\DecValTok{0}\NormalTok{)} \NormalTok{sder <-} \StringTok{ }\NormalTok{(}\KeywordTok{diag}\NormalTok{(nvar) -}\StringTok{ }\NormalTok{ei %*%}\StringTok{ }\KeywordTok{solve} \NormalTok{(h[bi, bi], h[bi, , }\DataTypeTok{drop
=} \OtherTok{FALSE}\NormalTok{])) %*%}\StringTok{ }\NormalTok{sder} \NormalTok{\}} \NormalTok{\} else \{} \NormalTok{alow <-}\StringTok{ }\NormalTok{h *}\StringTok{ }\KeywordTok{ifelse} \NormalTok{(}\KeywordTok{outer} \NormalTok{(g, g, }\StringTok{">="}\NormalTok{), }\DecValTok{1}\NormalTok{, }\DecValTok{0}\NormalTok{)} \NormalTok{sder <-}\StringTok{ }\NormalTok{-}\KeywordTok{solve} \NormalTok{(alow, h -}\StringTok{ }\NormalTok{alow)} \NormalTok{\}} \KeywordTok{return} \NormalTok{(sder)} \NormalTok{\}} \end{Highlighting} \end{Shaded} Code Segment 3: Block Rate \begin{Shaded} \begin{Highlighting}[] \NormalTok{cobwebPlotter <-} \StringTok{ }\NormalTok{function (xold,} \NormalTok{func,} \DataTypeTok{lowx =} \DecValTok{0}\NormalTok{,} \DataTypeTok{hghx =} \DecValTok{1}\NormalTok{,} \DataTypeTok{lowy =} \NormalTok{lowx,} \DataTypeTok{hghy =} \NormalTok{hghx,} \DataTypeTok{eps =} \FloatTok{1e-10}\NormalTok{,} \DataTypeTok{itmax =} \DecValTok{25}\NormalTok{,} \NormalTok{...) \{} \NormalTok{x <-}\StringTok{ }\KeywordTok{seq} \NormalTok{(lowx, hghx, }\DataTypeTok{length =} \DecValTok{100}\NormalTok{)} \NormalTok{y <-}\StringTok{ }\KeywordTok{sapply} \NormalTok{(x, function (x)} \KeywordTok{func} \NormalTok{(x, ...))} \KeywordTok{plot} \NormalTok{(} \NormalTok{x,} \NormalTok{y,} \DataTypeTok{xlim =} \KeywordTok{c}\NormalTok{(lowx , hghx),} \DataTypeTok{ylim =} \KeywordTok{c}\NormalTok{(lowy, hghy),} \DataTypeTok{type =} \StringTok{"l"}\NormalTok{,} \DataTypeTok{col =} \StringTok{"RED"}\NormalTok{,} \DataTypeTok{lwd =} \DecValTok{2} \NormalTok{)} \KeywordTok{abline} \NormalTok{(}\DecValTok{0}\NormalTok{, }\DecValTok{1}\NormalTok{, }\DataTypeTok{col =} \StringTok{"BLUE"}\NormalTok{)} \NormalTok{base <-}\StringTok{ }\DecValTok{0} \NormalTok{itel <-}\StringTok{ }\DecValTok{1} \NormalTok{repeat \{} \NormalTok{xnew <-}\StringTok{ }\KeywordTok{func} \NormalTok{(xold, ...)} \NormalTok{if (itel >}\StringTok{ }\DecValTok{1}\NormalTok{) \{} \KeywordTok{lines} 
\NormalTok{(}\KeywordTok{matrix}\NormalTok{(}\KeywordTok{c}\NormalTok{(xold, xold, base, xnew), }\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{))} \NormalTok{\}} \KeywordTok{lines} \NormalTok{(}\KeywordTok{matrix}\NormalTok{(}\KeywordTok{c}\NormalTok{(xold, xnew, xnew, xnew), }\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{))} \NormalTok{if ((}\KeywordTok{abs} \NormalTok{(xnew -}\StringTok{ }\NormalTok{xold) <}\StringTok{ }\NormalTok{eps) ||}\StringTok{ }\NormalTok{(itel ==}\StringTok{ }\NormalTok{itmax)) \{} \NormalTok{break ()} \NormalTok{\}} \NormalTok{base <-}\StringTok{ }\NormalTok{xnew} \NormalTok{xold <-}\StringTok{ }\NormalTok{xnew} \NormalTok{itel <-}\StringTok{ }\NormalTok{itel +}\StringTok{ }\DecValTok{1} \NormalTok{\}} \NormalTok{\}} \end{Highlighting} \end{Shaded} Code Segment 4: Cobweb Plotter \chapter{NEWS}\label{news} 001 03/10/16 \begin{itemize} \tightlist \item First translation from gitbook \end{itemize} 002 03/16/16 \begin{itemize} \tightlist \item Now generates legal tex \item crossrefs sorted out \end{itemize} 003 03/17/16 \begin{itemize} \tightlist \item bibtex file completed \end{itemize} \renewcommand\bibname{References} \bibliography{bras.bib} \end{document}