---
title: "Derivatives of Low Rank PSD Approximation"
author: "Jan de Leeuw"
date: "Version 01, November 23, 2016"
output:
  html_document:
    keep_md: yes
    number_sections: yes
    toc: yes
    toc_depth: 3
  pdf_document:
    keep_tex: yes
    number_sections: yes
    toc: yes
    toc_depth: 3
fontsize: 12pt
graphics: yes
bibliography: rank.bib
abstract: In De Leeuw (2008) we studied the derivatives of the least squares rank $p$ approximation in the case of general rectangular matrices. We modify these results for the symmetric positive semi-definite case, using basically the same derivation. We apply the formulas to compute an expression for the convergence rate of Thomson's iterative principal component algorithm for factor analysis.
---

```{r function_code, echo = FALSE}
source("auxilary.R")
source("speed.R")
source("derivative.R")
source("thomson.R")
```

```{r packages, echo = FALSE}
options (digits = 10)
library(captioner)
theorem_nums <- captioner (prefix = "Theorem")
corollary_nums <- captioner (prefix = "Corollary")
theorem_nums(name = "projection_derivative", caption = "[Projection Derivative]", display = FALSE)
theorem_nums(name = "projection_eigen_structure", caption = "[Projection Eigen Structure]", display = FALSE)
corollary_nums(name = "projection_eigen_values", caption = "[Projection Eigen Values]", display = FALSE)
```

Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory [gifi.stat.ucla.edu/rank](http://gifi.stat.ucla.edu/rank) has a pdf version, the complete Rmd file with all code chunks, the bib file, and the R source code.

#Introduction

We have a symmetric matrix $B$ that we want to approximate, in the least squares sense, by a symmetric positive semi-definite matrix of rank less than or equal to $p$. This is a problem that arises in various forms of multivariate analysis, specifically in factor analysis and multidimensional scaling. A brief history is in @deleeuw_R_74c.
Thus the problem is to minimize
\begin{equation}\label{E:stress}
\sigma(X):=\mathbf{SSQ}\ (B-XX')
\end{equation}
over $X\in\mathbb{R}^{n\times p}$. The stationary equations are
\begin{equation}\label{E:steq}
BX=X(X'X).
\end{equation}
Clearly the loss function is invariant under rotation, which means we can require without loss of generality that $X'X$ is diagonal. Thus each column of the optimal $X$ is either zero or an eigenvector corresponding with a positive eigenvalue of $B$. At stationary points $\sigma(X)=\mathbf{SSQ}(B)-\mathbf{tr}\ X'BX$. If the signature of $B$ is $(n_+,n_-,n_0)$ then the optimal $X$ has $\min(n_+,p)$ non-zero columns equal to $\sqrt{\lambda_s}k_s$, where the $k_s$ are normalized eigenvectors corresponding to positive eigenvalues $\lambda_s$, and $\max(0,p-n_+)$ zero columns. Although the optimal $X$ is never unique, the optimal $XX'$ is unique if and only if $\lambda_p>\lambda_{p+1}$. If $B=K\Lambda K'$ is a complete eigen-decomposition, with eigenvalues in decreasing order along the diagonal, then the optimal $XX'$ is
\begin{equation}\label{E:decomp}
\Gamma_p(B)=K_p\Lambda_pK_p',
\end{equation}
where $\Lambda_p$ contains the $p$ largest eigenvalues and $K_p$ the corresponding normalized eigenvectors. To guarantee differentiability we assume throughout that $\lambda_p>\lambda_{p+1}$ as well as $\lambda_p>0$.

#Derivative

##Formula

Theorem `r theorem_nums("projection_derivative", display = "n")` gives an expression for the first order term in the Taylor expansion
\begin{equation}\label{E:taylor}
\Gamma_p(B+E)=\Gamma_p(B)+\mathcal{D}\Gamma_p(B)(E)+o(\|E\|).
\end{equation}
Note that $\Gamma_p$ is defined on the space of real doubly-centered symmetric matrices $\mathcal{S}^{n\times n}$, with values in that same space. The derivative $\mathcal{D}\Gamma_p(B)$ at $B$ is a linear operator from $\mathcal{S}^{n\times n}$ to $\mathcal{S}^{n\times n}$, which we apply to the real symmetric and doubly-centered perturbation $E$.
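As a concrete illustration, the best rank-$p$ positive semi-definite approximation can be computed from the eigendecomposition in a few lines of R. This sketch is not the paper's lrank() from auxilary.R; the name gammaP and the clamping of negative eigenvalues at zero are choices made here for illustration.

```r
# Illustrative sketch of the best rank-p PSD approximation Gamma_p(B).
# Not the paper's lrank() from auxilary.R -- a minimal stand-in.
gammaP <- function (b, p) {
  e <- eigen (b, symmetric = TRUE)        # eigenvalues in decreasing order
  lbd <- pmax (e$values[1:p], 0)          # keep the p largest, clamped at zero
  k <- e$vectors[, 1:p, drop = FALSE]     # corresponding eigenvectors K_p
  k %*% diag (lbd, nrow = p) %*% t (k)    # K_p Lambda_p K_p'
}
```

If $B$ is already positive semi-definite of rank at most $p$, then gammaP(b, p) reproduces $B$ up to rounding error.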
The symmetry makes our results different from those of @deleeuw_R_08b. Define the matrix $\Xi:=K'EK$ with elements $\xi_{st}$.

**`r theorem_nums("projection_derivative", display = "f")`**
\begin{equation}\label{E:gamder}
\mathcal{D}\Gamma_p(B)(E)=\sum_{s=1}^p\xi_{ss}k_sk_s'-\sum_{s=1}^p\sum_{\substack{t=1\\ t\not=s}}^n\frac{\lambda_s}{\lambda_t-\lambda_s}\xi_{st}(k_tk_s'+k_sk_t').
\end{equation}

**Proof:** From @deleeuw_R_07c (and many other places) we know that
\begin{align*}
\lambda_s(B+E)&=\lambda_s+\xi_{ss}+o(\|E\|),\\
k_s(B+E)&=k_s-(B-\lambda_sI)^+Ek_s+o(\|E\|).
\end{align*}
Thus
\begin{align*}
\Gamma_p(B+E)&=\sum_{s=1}^p (\lambda_s+\xi_{ss})(k_s-(B-\lambda_sI)^+Ek_s)(k_s-(B-\lambda_sI)^+Ek_s)'+o(\|E\|)\\
&=\Gamma_p(B)+\sum_{s=1}^p\left\{\xi_{ss}k_sk_s'-\lambda_sk_sk_s'E(B-\lambda_sI)^+-\lambda_s(B-\lambda_sI)^+Ek_sk_s'\right\}+o(\|E\|)\\
&=\Gamma_p(B)+\sum_{s=1}^p\left\{\xi_{ss}k_sk_s'-\sum_{\substack{t=1\\ t\not=s}}^n\frac{\lambda_s}{\lambda_t-\lambda_s}\xi_{st}(k_sk_t'+k_tk_s')\right\}+o(\|E\|).
\end{align*}
**QED**

Note that if $B$ and $E$ are SDC, then so is $\mathcal{D}\Gamma_p(B)(E)$. We generate some random data to illustrate theorem `r theorem_nums("projection_derivative", display = "n")`.

```{r first_example_data}
set.seed(12345)
eps <- .001
b <- crossprod (matrix (rnorm(400), 100, 4)) / 100
e <- eps * crossprod (matrix (rnorm(400), 100, 4)) / 100
```

Then compute $\Gamma_p(B + E)$.

```{r first_example_result_1, echo = FALSE}
mprint (eb <- lrank (b + e))
```

Then compute the zero-order approximation $\Gamma_p(B)$.

```{r first_example_result_0, echo = FALSE}
mprint (ee <- lrank (b))
```

And then compute the first-order approximation $\Gamma_p(B)+\mathcal{D}\Gamma_p(B)(E)$, using the program perturb() from the appendix.

```{r first_example_result_2, echo = FALSE}
mprint (bb <- lrank (b) + perturb (b, e)$mm)
```

Clearly the first-order approximation is better than the zero-order one.
For zero order the sum of the absolute values of the approximation errors is `r noquote (formatC (sum(abs(eb-ee)), di = 10, wi = 12, fo = "f"))`, and for first order it is `r noquote (formatC (sum(abs(eb-bb)), di = 10, wi = 12, fo = "f"))`.

##Structure

To get more insight into the structure of the derivative we look at its eigenvalues and eigenvectors, i.e. those SDC perturbations $E$ for which $\mathcal{D}\Gamma_p(B)(E)=\omega E$ for some $\omega$. We investigate, again following @deleeuw_R_08b, perturbation matrices $E$ of the form $E=k_\alpha k_\beta'+k_\beta k_\alpha'$, with the $k_\alpha$ the eigenvectors of $B$. It suffices to look at those $\frac12 n(n-1)$ perturbations for which $\alpha<\beta$. Note they are orthogonal and consequently are a basis for the space of SDC matrices. Also
\begin{equation}\label{E:kron}
\xi_{st}=k_s'Ek_t=k_s'k_\alpha k_\beta'k_t+k_s'k_\beta k_\alpha'k_t=\delta^{s\alpha}\delta^{t\beta}+\delta^{s\beta}\delta^{t\alpha}.
\end{equation}

**`r theorem_nums("projection_eigen_structure", display = "f")`** For $\alpha<\beta$
\begin{equation}\label{E:struct}
\mathcal{D}\Gamma_p(B)(k_\alpha k_\beta'+k_\beta k_\alpha')=
\begin{cases}
k_\alpha k_\beta'+k_\beta k_\alpha'&\text{ if }1\leq\alpha<\beta\leq p,\\
\frac{\lambda_\alpha}{\lambda_\alpha-\lambda_\beta}(k_\alpha k_\beta'+k_\beta k_\alpha')&\text{ if }1\leq\alpha\leq p<\beta\leq n,\\
0&\text{ if }p<\alpha<\beta\leq n.
\end{cases}
\end{equation}

**Proof:** Because $\alpha<\beta$ equation $\eqref{E:kron}$ gives $\xi_{ss}=2\delta^{s\alpha}\delta^{s\beta}=0$, so the first term of $\eqref{E:gamder}$ vanishes and $\mathcal{D}\Gamma_p(B)(E)$ is minus
\begin{equation}\label{E:term}
\sum_{s=1}^p\sum_{\substack{t=1\\ t\not=s}}^n\frac{\lambda_s}{\lambda_t-\lambda_s}(\delta^{s\alpha}\delta^{t\beta}+\delta^{s\beta}\delta^{t\alpha})(k_sk_t'+k_tk_s').
\end{equation}
If $\alpha>p$ then $\beta>p$ and thus $\eqref{E:term}$ is zero.
If $\alpha\leq p$ and $\beta>p$ then $\delta^{s\beta}=0$ and
$$\sum_{s=1}^p\sum_{\substack{t=1\\ t\not=s}}^n\frac{\lambda_s}{\lambda_t-\lambda_s}(\delta^{s\alpha}\delta^{t\beta}+\delta^{s\beta}\delta^{t\alpha})(k_sk_t'+k_tk_s')=\frac{\lambda_\alpha}{\lambda_\beta-\lambda_\alpha}(k_\alpha k_\beta'+k_\beta k_\alpha').$$
If $\alpha<\beta\leq p$ then
$$\sum_{s=1}^p\sum_{\substack{t=1\\ t\not=s}}^n\frac{\lambda_s}{\lambda_t-\lambda_s}(\delta^{s\alpha}\delta^{t\beta}+\delta^{s\beta}\delta^{t\alpha})(k_sk_t'+k_tk_s')=\frac{\lambda_\alpha}{\lambda_\beta-\lambda_\alpha}(k_\alpha k_\beta'+k_\beta k_\alpha')+\frac{\lambda_\beta}{\lambda_\alpha-\lambda_\beta}(k_\alpha k_\beta'+k_\beta k_\alpha')=-(k_\alpha k_\beta'+k_\beta k_\alpha').$$
Because these double sums enter $\eqref{E:gamder}$ with a negative sign, this establishes all three cases. **QED**

**`r corollary_nums("projection_eigen_values", display = "f")`** The derivative $\mathcal{D}\Gamma_p(B)$ has $\frac12 p(p-1)$ eigenvalues equal to $+1$, $\frac12 (n-p)(n-p-1)$ eigenvalues equal to zero, and $p(n-p)$ eigenvalues equal to $\frac{\lambda_\alpha}{\lambda_\alpha-\lambda_\beta}$, where $1\leq\alpha\leq p<\beta\leq n$ and the $\lambda_\alpha$ are the ordered eigenvalues of $B$.

Note that if $B$ is positive definite then the $p(n-p)$ eigenvalues for $\alpha\leq p$ and $\beta>p$ are larger than one, and if $B$ is positive semi-definite of rank $p$ they are equal to one.

#Thomson's Factor Analysis Algorithm

@thomson_34 proposed an alternating least squares algorithm to find factor analysis solutions. We minimize the sum of squares
$$\sigma(X,\Delta)=\mathbf{SSQ}(B-\Delta-XX')$$
over loadings $X$ and the diagonal matrix of uniquenesses $\Delta$. The algorithm starts with $\Delta^{(0)}$ and alternates
\begin{align*}
C^{(k)}&=\Gamma_p(B-\Delta^{(k)}),\\
\Delta^{(k+1)}&=\mathbf{diag}(B-C^{(k)}).
\end{align*}
By substituting the first step into the second we can think of the algorithm as a map $\tau$ that updates the uniquenesses in a vector $\delta^{(k)}$ to a vector $\delta^{(k+1)}$.
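The alternating updates above can be sketched directly in R. This is an illustrative stand-in, not the paper's thomson() from thomson.R (which also reports the loss and an estimated convergence rate); the function name thomsonSketch, the zero start for the uniquenesses, and the stopping rule are assumptions made here.

```r
# Illustrative sketch of Thomson's alternating least squares iteration.
# Not the paper's thomson() from thomson.R -- a minimal stand-in that
# only iterates the map tau on the vector of uniquenesses delta.
thomsonSketch <- function (b, p, eps = 1e-10, itmax = 5000) {
  delta <- rep (0, nrow (b))                       # Delta^(0) = 0, an assumption
  for (itel in 1:itmax) {
    e <- eigen (b - diag (delta), symmetric = TRUE)
    lbd <- pmax (e$values[1:p], 0)
    k <- e$vectors[, 1:p, drop = FALSE]
    cmat <- k %*% diag (lbd, nrow = p) %*% t (k)   # C = Gamma_p (B - Delta)
    dnew <- diag (b - cmat)                        # Delta^(k+1) = diag (B - C)
    if (max (abs (dnew - delta)) < eps) break      # (near) fixed point of tau
    delta <- dnew
  }
  list (delta = dnew, itel = itel)
}
```

At convergence the returned delta is, up to the tolerance, a fixed point of $\tau$: applying one more alternation reproduces it.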
A straightforward application of theorem `r theorem_nums("projection_derivative", display = "n")` shows
$$\mathcal{D}_j\tau_i(\delta)=\sum_{s=1}^p\left\{k_{is}^2k_{js}^2-2\sum_{\substack{t=1\\ t\not=s}}^n\frac{\lambda_s}{\lambda_t-\lambda_s}k_{is}k_{it}k_{js}k_{jt}\right\},$$
using the eigenvalues and eigenvectors of $B-\Delta$. As an example we use the Harman.Burt data from the psych package (@revelle_16).

```{r factor_analysis, cache = TRUE}
data(Harman, package = "psych")
h3 <- thomson (Harman.Burt, p = 3, verbose = FALSE, eps = 1e-15, itmax = 1000)
h2 <- thomson (Harman.Burt, p = 2, verbose = FALSE, eps = 1e-15, itmax = 1000)
h1 <- thomson (Harman.Burt, p = 1, verbose = FALSE, eps = 1e-15, itmax = 1000)
```

In one dimension we need `r h1$itel` iterations to get to loss function value `r h1$f`, with an estimated convergence rate of `r h1$r`. In two dimensions `r h2$itel` iterations get us to loss `r h2$f` with estimated rate `r h2$r`. And in three dimensions `r h3$itel` iterations get us to `r h3$f` and rate `r h3$r`. The estimated convergence rates correspond closely to the largest eigenvalues of the Jacobian, computed by the function tderivative().

```{r jacobian_eigenvalues}
mprint (eigen (tderivative (Harman.Burt, h1$u, p=1))$values)
mprint (eigen (tderivative (Harman.Burt, h2$u, p=2))$values)
mprint (eigen (tderivative (Harman.Burt, h3$u, p=3))$values)
```

#Appendix: Code

##auxilary.R

```{r file_auxilary, code = readLines("auxilary.R")}
```

##thomson.R

```{r file_thomson, code = readLines("thomson.R")}
```

#References