# Weighted Low-rank Approximation using Majorization

Jan de Leeuw

Version 18, June 02, 2017

Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory [gifi.stat.ucla.edu/linlow](http://gifi.stat.ucla.edu/linlow) has a pdf version, the complete Rmd file with all code chunks, the R code, and the bib file.

#Introduction

In a rank-$p$ matrix approximation problem we have an $n\times m$ matrix $X$ and we want to find the $n\times m$ matrix $\widehat{X}$ of rank $p$ that is closest to $X$ in the least squares sense (@eckart_young_36, @keller_62). The problem is usually parametrized as $\widehat X=AB'$, where $A$ is $n\times p$ and $B$ is $m\times p$. If closeness is measured by the usual Frobenius norm, i.e. the unweighted sum of squares, then the solution is given by the first $p$ singular values of $X$ and their corresponding left and right singular vectors. Generalizations to more general Euclidean norms are discussed, with some historical references, in @deleeuw_A_84b (and undoubtedly in hundreds of other places).

In this paper we generalize to simple weighted least squares, in which the square of each residual $x_{ij}-\widehat x_{ij}$ is weighted by a non-negative weight $w_{ij}$. Thus the loss function we must minimize is
$$
\sigma(A,B):=\sum_{i=1}^n\sum_{j=1}^mw_{ij}(x_{ij}-\sum_{s=1}^pa_{is}b_{js})^2.
$$
This general problem was perhaps first tackled by @gabriel_zamir_79, using an *alternating least squares* algorithm of the form (with superscript $k$ indicating iteration count)
\begin{align*}
A^{(k)}&=\mathop{\mathbf{argmin}}_A\ \sigma(A,B^{(k)}),\\
B^{(k+1)}&=\mathop{\mathbf{argmin}}_B\ \sigma(A^{(k)},B).
\end{align*}
Both subproblems of this *criss-cross regression* algorithm are weighted linear least squares problems, with a possibly non-sparse matrix of weights. We will go a different route, by concentrating on simplifying the weights.

#Majorization using Weights

*Majorization algorithms* are discussed in general in @deleeuw_C_94c and @heiser_95, and more recently and more extensively in @lange_16 and @deleeuw_B_16b. We assume the basic majorization approach is familiar and go straight away to the details.

Our first job is to construct a majorization scheme. Suppose $A^{(k)}$ and $B^{(k)}$ are the solutions after $k$ iterations. Define
$$
\widehat x_{ij}^{(k)}:=\sum_{s=1}^pa_{is}^{(k)}b_{js}^{(k)}.
$$
Then
$$
\sum_{s=1}^pa_{is}b_{js}=\widehat x_{ij}^{(k)}+(\sum_{s=1}^pa_{is}b_{js}-\widehat x_{ij}^{(k)}),
$$
and consequently
$$
\sigma(A,B)=\sigma(A^{(k)},B^{(k)})-2\sum_{i=1}^n\sum_{j=1}^mw_{ij}(x_{ij}-\widehat x_{ij}^{(k)})(\sum_{s=1}^pa_{is}b_{js}-\widehat x_{ij}^{(k)})+\sum_{i=1}^n\sum_{j=1}^mw_{ij}(\sum_{s=1}^pa_{is}b_{js}-\widehat x_{ij}^{(k)})^2.
$$
If $c_{ij}\geq w_{ij}$ and $c_{ij}>0$ we have
$$
\sigma(A,B)\leq\sigma(A^{(k)},B^{(k)})-2\sum_{i=1}^n\sum_{j=1}^mw_{ij}(x_{ij}-\widehat x_{ij}^{(k)})(\sum_{s=1}^pa_{is}b_{js}-\widehat x_{ij}^{(k)})+\sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-\widehat x_{ij}^{(k)})^2.
$$
Define
$$
r_{ij}^{(k)}:=\frac{w_{ij}}{c_{ij}}(x_{ij}-\widehat x_{ij}^{(k)}),
$$
and
$$
h_{ij}^{(k)}:=\widehat x_{ij}^{(k)}+\frac{w_{ij}}{c_{ij}}(x_{ij}-\widehat x_{ij}^{(k)}).
$$
Note that $h_{ij}^{(k)}$ is a convex combination of $\widehat x_{ij}^{(k)}$ and $x_{ij}$.
Completing the square now gives
$$
\sigma(A,B)\leq\sigma(A^{(k)},B^{(k)})-\sum_{i=1}^n\sum_{j=1}^mc_{ij}\{r_{ij}^{(k)}\}^2+\sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-h_{ij}^{(k)})^2.
$$
In the majorization step we minimize
$$
\eta(A,B,A^{(k)},B^{(k)}):=\sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-h_{ij}^{(k)})^2,
$$
which is a matrix approximation problem similar to our original problem, with adjusted weights and an adjusted target. In the next iteration we define a new target, solve the matrix approximation problem again, and so on.

It seems that our developments so far have not gained us much. Instead of one matrix approximation problem we now have a sequence of them, all of the same structure. The gain will have to come from the choice of the $c_{ij}$, which we will try to construct in such a way that the matrix approximation problems in each iteration become simpler.

##Rank-one Weights

First suppose $c_{ij}>0$ and there are $u>0$ and $v>0$ such that $c_{ij}=u_iv_j$. Define
$$
g_{ij}^{(k)}:=\sqrt{u_iv_j}\ h_{ij}^{(k)},
$$
as well as the new variables $\alpha_{is}:=\sqrt{u_i}a_{is}$ and $\beta_{js}:=\sqrt{v_j}b_{js}$. Then
$$
\sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-h_{ij}^{(k)})^2=\sum_{i=1}^n\sum_{j=1}^m(\sum_{s=1}^p\alpha_{is}\beta_{js}-g_{ij}^{(k)})^2.
$$
This shows that the matrix approximation problems that must be solved in each majorization step are unweighted, which means they can be solved rapidly using a partial singular value decomposition.

##The Symmetric Case

If $X$ is symmetric our loss function becomes
$$
\sigma(A)=\sum_{i=1}^n\sum_{j=1}^n w_{ij}(x_{ij}-\sum_{s=1}^p a_{is}a_{js})^2.
$$
Now choose $c_{ij}=u_iu_j\geq w_{ij}$, and basically the same majorization scheme can be applied. The matrix approximation problem in each iteration is now solved by using the first $p$ eigenvalues and corresponding eigenvectors of the current target (@deleeuw_R_74c).

##Some History

From an historical point of view the most important special case is that in which $w_{ij}$ is either one or zero, with zero indicating *missing data*. It was analyzed by @thomson_34 in the case of factor analysis, using basically the same *alternating least squares* algorithm used by @yates_33 in factorial experiments. We derive this algorithm from our majorization perspective by choosing $c_{ij}=1$, no matter whether $w_{ij}=1$ or $w_{ij}=0$. Thus
$$
r_{ij}^{(k)}=w_{ij}(x_{ij}-\widehat x_{ij}^{(k)})=\begin{cases}0&\text{ if }w_{ij}=0,\\x_{ij}-\widehat x_{ij}^{(k)}&\text{ if }w_{ij}=1,\end{cases}
$$
and
$$
h_{ij}^{(k)}=\widehat x_{ij}^{(k)}+r_{ij}^{(k)}=\begin{cases}\widehat x_{ij}^{(k)}&\text{ if }w_{ij}=0,\\x_{ij}&\text{ if }w_{ij}=1.\end{cases}
$$
This is exactly the Thomson-Yates method for dealing with data that are missing, either by design, by accident, or structurally from properties of the model. The equivalence of the alternating least squares and majorization methods in this case was already observed by @kiers_97.

@kiers_97, following some suggestions by @heiser_95, proposes to use $c_{ij}=\max_{i=1}^n\max_{j=1}^m w_{ij}$. Thus all $c_{ij}$ are the same, say equal to $c$, and
$$
h_{ij}^{(k)}=\widehat x_{ij}^{(k)}+\frac{w_{ij}}{c}(x_{ij}-\widehat x_{ij}^{(k)}).
$$
This is improved by @groenen_giaquinto_kiers_03 by using $c_{ij}=c_i:=\max_{j=1}^m w_{ij}$, which means that we can write
$$
h_{ij}^{(k)}=\widehat x_{ij}^{(k)}+\frac{w_{ij}}{c_i}(x_{ij}-\widehat x_{ij}^{(k)}).
$$
Both @kiers_97 and @groenen_giaquinto_kiers_03 compare their majorization algorithms with the alternating least squares method of @gabriel_zamir_79.
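Both of these choices are special cases of a rank-one bound $c_{ij}=u_iv_j$. To make the update concrete, the following R code is a minimal sketch, and not any of the published implementations, of the majorization iteration for a given rank-one bound with $u_iv_j\geq w_{ij}$; the function name `wLowRank()` and its arguments are ours. The choice of @kiers_97 corresponds to constant `u` and `v`, that of @groenen_giaquinto_kiers_03 to `u <- apply(w, 1, max)` and `v <- rep(1, ncol(w))`.

```r
## Minimal sketch: majorization with a fixed rank-one bound c_ij = u_i v_j >= w_ij.
## Each step forms the target h, rescales it, and computes an unweighted
## rank-p approximation by a truncated singular value decomposition.
wLowRank <- function(x, w, u, v, p, itmax = 1000, eps = 1e-10) {
  su <- sqrt(u)
  sv <- sqrt(v)
  f <- w / outer(u, v)                  # f_ij = w_ij / c_ij
  xhat <- matrix(0, nrow(x), ncol(x))   # arbitrary start of rank at most p
  sold <- sum(w * (x - xhat) ^ 2)
  for (it in 1:itmax) {
    h <- xhat + f * (x - xhat)          # target h^(k)
    g <- outer(su, sv) * h              # scaled target g^(k)
    z <- svd(g, nu = p, nv = p)         # unweighted rank-p approximation of g
    xhat <- (z$u %*% (z$d[1:p] * t(z$v))) / outer(su, sv)
    snew <- sum(w * (x - xhat) ^ 2)     # weighted loss sigma(A, B)
    if (sold - snew < eps) break
    sold <- snew
  }
  list(xhat = xhat, loss = snew, iterations = it)
}
```

Each iteration does not increase the weighted loss and costs one truncated singular value decomposition of an $n\times m$ matrix, which is what makes rank-one bounds attractive.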
The conclusion seems to be that the algorithm of @kiers_97, which uses a scalar bound for the $w_{ij}$, is slower than criss-cross regression, while the algorithm of @groenen_giaquinto_kiers_03, which uses a row-wise bound, is actually faster than criss-cross regression. It is clear that both @kiers_97 and @groenen_giaquinto_kiers_03 use $c_{ij}$ of the form $c_{ij}=u_iv_j$ for some simple specific choices of $u$ and $v$. We will now look at some general theory for weight majorizations of this form, and see if we can come up with a possible improvement.

##Rate of Convergence

The algorithmic map $\mathcal{A}:\mathbb{R}^{n\times m}\rightarrow\mathbb{R}^{n\times m}$ in the case that $C=uv'$ is defined as the composition of three maps.

* The first is the affine map $\mathcal{G}(Z):=U^\frac12(Z+F\star(X-Z))V^\frac12$, where $\star$ is used for the Hadamard product, and $F$ has elements $f_{ij}=w_{ij}/c_{ij}$. The diagonal matrices $U$ and $V$ have the elements of $u$ and $v$ on the diagonal.
* The second map $\Gamma_p$ non-linearly projects its argument, using the Frobenius norm, on the set of all matrices of rank less than or equal to $p$.
* And finally we have the linear map $\mathcal{S}(Z):=U^{-\frac12}ZV^{-\frac12}$.

Combining these three maps, all from $\mathbb{R}^{n\times m}$ to $\mathbb{R}^{n\times m}$, gives
$$
\mathcal{A}(Z)=\mathcal{S}(\Gamma_p(\mathcal{G}(Z))).
$$
The asymptotic error constant of the iterative majorization procedure is the spectral radius of the derivative of the algorithmic map $\mathcal{A}$. By the chain rule
$$
\mathcal{A}(Z+\Delta)=\mathcal{A}(Z)+\mathcal{S}(\mathcal{D}\Gamma_p(\mathcal{G}(Z))\mathcal{H}(\Delta))+o(\|\Delta\|),
$$
where $\mathcal{H}(\Delta)=U^\frac12(E\star\Delta)V^\frac12$ and $E$ is the matrix with elements $1-w_{ij}/c_{ij}$. The derivatives of $\Gamma_p$ in the rectangular case are computed in @deleeuw_R_08b. From that paper, using the representation $\Gamma_p(\mathcal{G}(Z))=\mathcal{G}(Z)L_p^{\ }L_p'$, with $L_p$ the normalized eigenvectors corresponding with the $p$ largest eigenvalues of $\mathcal{G}(Z)'\mathcal{G}(Z)$,
$$
\mathcal{D}\Gamma_p(\mathcal{G}(Z))\mathcal{H}(\Delta)=\mathcal{H}(\Delta)L_p^{}L_p'-\mathcal{G}(Z)\{\Xi_p^{}(Z,\Delta)+\Xi_p'(Z,\Delta)\},
$$
where
$$
\Xi_p(Z,\Delta)=\sum_{s=1}^p (\mathcal{G}(Z)'\mathcal{G}(Z)-\lambda_s^2I)^+(\mathcal{G}(Z)'\mathcal{H}(\Delta)+\mathcal{H}(\Delta)'\mathcal{G}(Z))l_s^{}l_s',
$$
with $\lambda_s^2$ the $s$-th largest eigenvalue of $\mathcal{G}(Z)'\mathcal{G}(Z)$ and $l_s$ the corresponding column of $L_p$.

#One-sided Approximation

As a consequence of our previous discussions it makes sense to look for $u\geq 0$ and $v\geq 0$ such that $u_iv_j\geq w_{ij}$ for all $i$ and $j$, while, subject to these constraints, the matrix with elements $u_iv_j$ is as small as possible. There are many ways in which we can formalize "as small as possible". In @deleeuw_E_17m convex analysis and duality were used in a more general context. In this paper we use the notion of *approximation from above*, i.e. we minimize some norm $\|W-uv'\|$ over $uv'\geq W$ and $u,v\geq 0$. In order to make this a convex problem we take logarithms. Thus we minimize a norm of the matrix with elements $\log w_{ij}-\log u_i-\log v_j$ over $\log u_i + \log v_j\geq\log w_{ij}$ for all $i$ and $j$. One-sided discrete linear approximation for general $\ell_p$ norms is discussed in @watson_73. We will use least squares on the logarithms, which leads to a quadratic programming problem. But the $\ell_1$ and $\ell_\infty$ norms are interesting as well, and lead to linear programming problems.

Using logarithms only makes sense if $w_{ij}>0$. For $w_{ij}=0$ there is an obvious way to proceed, by just skipping those elements.
The $u_i$ and $v_j$ computed over the non-zero $w_{ij}$ will automatically satisfy $u_iv_j>w_{ij}=0$ for the zero weights as well. Suppose $\mathcal{M}$ is the set of pairs $(i,j)$ such that $w_{ij}>0$. To minimize
$$
\mathop{\sum\sum}_{(i,j)\in\mathcal{M}}\ (\log w_{ij}-\log u_i-\log v_j)^2
$$
over $\log u_i + \log v_j\geq\log w_{ij}$ for all $(i,j)\in\mathcal{M}$ we use the `lsi()` function from the `lsei` package (@wang_lawson_hanson_15).

#Example

The data are crash data on New Zealand roads, from the VGAM package (@yee_10). Data are cross-classified by time (hour of the day) and day of the week, accumulated over 2009. We use `crashi`, which gives the numbers of injuries by car.

```
##    Mon Tue Wed Thu Fri Sat Sun
## 0   16  10  22  12  29  55  55
## 1   13  11  15  23  23  42  64
## 2    5   8  16  13  24  37  64
## 3    6   4   8  12  19  31  45
## 4    7   6  11  16  11  35  35
## 5   12  14  14  18  19  27  35
## 6   37  37  32  45  32  21  36
## 7   66  79  92  75  73  40  33
## 8  117 138 132 138 122  59  36
## 9   67  81  68  75  72  59  45
## 10  67  70  62  76  72  84  57
## 11  80  80  50  80  74 114  86
## 12  75  85  86  87  94 101  93
## 13  77  69  84  90  90 105  80
## 14  84  87  98  85 104 103  96
## 15 112 136 134 156 158 120 103
## 16 115 110 138 144 146 106  90
## 17 127 130 140 149 155 104  97
## 18  63  69  91  97 142  83  64
## 19  47  63  53  57  67  69  52
## 20  25  46  62  55  68  70  44
## 21  34  42  49  53  85  62  33
## 22  24  26  35  52  67  54  18
## 23  28  23  20  49  61  69  29
```

We pretend the data are independent Poisson counts, and we choose weights equal to $x_{ij}^{-1}$. The optimal rank-one approximation from above is computed with the program `mbound()` in the appendix.
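The appendix with `mbound()` is not reproduced here. As an illustration only, the sketch below shows how such a bound can be computed with `lsi()` from the `lsei` package, assuming `lsi(a, b, e, f)` minimizes $\|ax-b\|$ subject to $ex\geq f$; the function name `rankOneBound()` and the gauge-fixing convention $\log v_m=0$ are ours, and we assume, as in the example, that every row and column of $W$ has at least one positive weight.

```r
## Minimal illustrative sketch, not the mbound() from the appendix.
## Least squares on the logarithms: minimize the sum over (i, j) in M of
## (log w_ij - log u_i - log v_j)^2 subject to log u_i + log v_j >= log w_ij.
library(lsei)

rankOneBound <- function(w) {
  n <- nrow(w)
  m <- ncol(w)
  pos <- which(w > 0, arr.ind = TRUE)   # the index set M of positive weights
  k <- nrow(pos)
  b <- log(w[pos])
  a <- matrix(0, k, n + m)              # row for (i, j) selects log u_i + log v_j
  a[cbind(1:k, pos[, 1])] <- 1
  a[cbind(1:k, n + pos[, 2])] <- 1
  # fix log v_m = 0 to remove the indeterminacy u -> tu, v -> v / t,
  # so that the design matrix has full column rank
  a <- a[, -(n + m)]
  x <- lsi(a, b, e = a, f = b)          # constrained least squares
  u <- exp(x[1:n])
  v <- exp(c(x[n + seq_len(m - 1)], 0))
  list(u = u, v = v)
}
```

For the crash data one could, for example, run `data(crashi, package = "VGAM")` followed by `rankOneBound(1 / as.matrix(crashi))` to obtain a rank-one matrix $uv'$ bounding the weights $x_{ij}^{-1}$ from above.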