Note: This is a working paper which will be expanded/updated frequently. The directory gifi.stat.ucla.edu/full has a pdf copy of this article, the complete Rmd file with all code chunks, the bib file, the LaTeX style file, and the R source code.

1 Problem

We use the abbreviation MDS for the minimization of Kruskal’s stress \begin{equation} \sigma(X):=\mathop{\sum\sum}_{1\leq i<j\leq n} w_{ij}(\delta_{ij}-d_{ij}(X))^2\label{E:stress} \end{equation}

over the \(n\times p\) configurations \(X\) (Kruskal (1964a), Kruskal (1964b)). Symmetric matrix \(W=\{w_{ij}\}\) has known non-negative weights, symmetric matrix \(\Delta=\{\delta_{ij}\}\) has non-negative dissimilarities, and the symmetric matrix-valued function \(D(X)=\{d_{ij}(X)\}\) has Euclidean distances between the points, i.e. the rows of \(X\).

In the usual applications of MDS the user chooses dimensionality \(p\) to be much smaller than \(n\), because the name of the game is dimensionality reduction. In this paper we give a detailed analysis of full dimensional scaling or FDS, which is MDS with \(p=n\).
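For reference, stress is easy to compute directly from a configuration. The following is a minimal R sketch (our helper, not the code from the repository mentioned above); `delta` and `w` are vectors over the pairs \(i<j\), in the order used by `dist()`.

```r
## Minimal sketch (not the paper's code): evaluate Kruskal's stress for an
## n x p configuration x; delta and w are vectors over the pairs i < j,
## in the order produced by dist().
stress <- function (x, delta, w = rep (1, length (delta))) {
  d <- as.vector (dist (x))   # Euclidean distances between the rows of x
  sum (w * (delta - d) ^ 2)
}
```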

2 Notation

We introduce some notation which is useful for all Euclidean MDS problems, not just FDS problems. This notation has been more or less standard since De Leeuw (1977). Suppose the \(e_i\) are unit vectors of length \(n\), i.e. \(e_i\) has all its elements equal to zero except element \(i\), which is equal to one. Also define the matrices \(A_{ij}:=(e_i-e_j)(e_i-e_j)'\). Thus \(A_{ij}\) has four nonzero elements: elements \((i,i)\) and \((j,j)\) are \(+1\), elements \((i,j)\) and \((j,i)\) are \(-1\). With these definitions \begin{equation} d_{ij}^2(X)=(e_i-e_j)'XX'(e_i-e_j)=\mathbf{tr}\ X'A_{ij}X=\mathbf{tr}\ A_{ij}XX'.\label{E:matrixd} \end{equation}
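The identity \(\eqref{E:matrixd}\) is easily verified numerically. In this hypothetical sketch `aij()` builds \(A_{ij}\); the helper name is ours.

```r
## Build A_ij = (e_i - e_j)(e_i - e_j)' and check d_ij^2(X) = tr A_ij XX'.
aij <- function (i, j, n) {
  u <- rep (0, n)
  u [i] <- 1
  u [j] <- -1
  outer (u, u)
}
x <- matrix (rnorm (8), 4, 2)
sum (diag (aij (1, 3, 4) %*% tcrossprod (x)))  # tr A_13 XX'
as.matrix (dist (x)) [1, 3] ^ 2                # equals d_13^2(X)
```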

Now expand stress. Assume, without loss of generality, that \[ \mathop{\sum\sum}_{1\leq i<j\leq n} w_{ij}\delta_{ij}^2=1, \] and define

\begin{align} \rho(X)&:=\mathop{\sum\sum}_{1\leq i<j\leq n} w_{ij}\delta_{ij}d_{ij}(X),\label{E:rho}\\ \eta(X)&:=\mathop{\sum\sum}_{1\leq i<j\leq n} w_{ij}d_{ij}^2(X).\label{E:eta} \end{align} Then \begin{equation} \sigma(X)=1-2\rho(X)+\eta(X). \end{equation} To develop some convenient matrix notation we define \begin{align} V&:=\mathop{\sum\sum}_{1\leq i<j\leq n} w_{ij}A_{ij},\\ B(X)&:=\mathop{\sum\sum}_{1\leq i<j\leq n} \begin{cases}w_{ij}\frac{\delta_{ij}}{d_{ij}(X)}A_{ij}&\text{ if }d_{ij}(X)>0,\\ 0&\text{ if }d_{ij}(X)=0,\end{cases} \end{align} so that \begin{align} \rho(X)&=\mathbf{tr}\ X'B(X)X,\\ \eta(X)&=\mathbf{tr}\ X'VX. \end{align}

Note that \(V\) has off-diagonal elements equal to \(-w_{ij}\) and \(B(X)\) has off-diagonal elements \[b_{ij}(X)=-w_{ij}\frac{\delta_{ij}}{d_{ij}(X)}\] if \(d_{ij}(X)>0\) and zero otherwise. Diagonal elements are filled in to make rows and columns sum to zero. Both \(V\) and \(B(X)\) are doubly-centered and positive semi-definite. If we assume without loss of generality that the MDS problem is irreducible (De Leeuw (1977)) then \(\mathbf{rank}(V)=n-1\) and the only vectors in the null space are multiples of \(e:=(1,1,\cdots,1)\). If all weights are equal to one then \(V=nI-ee'\).
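In code, \(V\) can be built directly from the weight matrix \(W\). A small sketch (the helper name `vmat` is ours) which also illustrates \(V=nI-ee'\) for unit weights:

```r
## Off-diagonal elements of V are -w_ij; the diagonal is filled so that
## rows and columns sum to zero.
vmat <- function (w) {
  v <- -w
  diag (v) <- 0
  diag (v) <- -rowSums (v)
  v
}
vmat (1 - diag (4))   # equals 4 * diag (4) - 1, i.e. nI - ee' for n = 4
```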

3 Local minima of stress

De Leeuw (1984) proves a key necessary condition for a local minimum in MDS. Also see De Leeuw, Groenen, and Mair (2016) for a more recent proof. We also review some additional results on maxima and minima of stress.

Theorem 1: [De Leeuw 84, De Leeuw 93, De Leeuw and Groenen 97]

  1. If \(\sigma\) has a local minimum at \(X\) then \(d_{ij}(X)>0\) for all \(i,j\) such that \(w_{ij}\delta_{ij}>0\).
  2. If \(\sigma\) has a local minimum at \(X\) then \(\sigma\) is differentiable in a neighborhood of \(X\). Thus \(B(X)X=VX\).
  3. If \(\sigma\) has a local minimum at a column-centered \(X\) and \(\mathbf{rank}(X)=n-1\) then \(\sigma(X)=0\).
  4. \(\sigma\) has a local maximum at \(X\) if and only if \(X=0\).

Let’s look at a small example with four points, all dissimilarities equal, and all weights equal to one. There is a local minimum with the four points in the corners of a square, with stress equal to 0.0285955, and another local minimum with three points forming an equilateral triangle and the fourth point in its center, with stress equal to 0.0669873. We can compute stress for all configurations of the form \(\alpha X+\beta Y\), where \(X\) and \(Y\) are the two local minima. Figure 1 has a contour plot of \(\sigma(\alpha,\beta)\), showing the local minima at \((1,0)\) and \((0,1)\), and the local maximum at \((0,0)\).
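The two stress values can be checked with the `stress()` sketch from section 1. Because \(\sigma(\alpha X)=1-2\alpha\rho(X)+\alpha^2\eta(X)\), the optimal scale for a fixed shape is \(\alpha=\rho(X)/\eta(X)\); the configurations and the helper below are ours, not the paper's code.

```r
## Check the two local minimum values; six equal dissimilarities, normalized
## so that the sum of w * delta^2 equals one.
delta <- rep (1 / sqrt (6), 6)
optscale <- function (x, delta) {
  d <- as.vector (dist (x))
  x * sum (delta * d) / sum (d ^ 2)       # alpha = rho / eta
}
square <- matrix (c (-1, 1, 1, -1, -1, -1, 1, 1), 4, 2)
theta <- 2 * pi * (0:2) / 3
center <- rbind (cbind (cos (theta), sin (theta)), c (0, 0))
stress (optscale (square, delta), delta)  # 0.0285955
stress (optscale (center, delta), delta)  # 0.0669873
```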


Figure 1: Plane spanned by two local minima, equal dissimilarities

Alternatively, we can plot stress on the line connecting \(X\) and \(Y\). Note that although the only local maximum of stress in configuration space is at the origin, stress can very well have local maxima when restricted to lines.
Figure 2: Line connecting two local minima, equal dissimilarities

4 Cross Product Space

So far we have formulated the MDS problem in configuration space. Stress is a function of \(X\), the \(n\times p\) configuration matrix. We now consider an alternative formulation, where stress is a function of a positive semi-definite \(C\) of order \(n\). The relevant definitions are \begin{equation} \sigma(C):=1-2\rho(C)+\eta(C), \end{equation} where \begin{align*} \rho(C)&:=\mathbf{tr}\ B(C)C,\\ \eta(C)&:=\mathbf{tr}\ VC, \end{align*} with \begin{equation*} B(C):=\mathop{\sum\sum}_{1\leq i<j\leq n} \begin{cases}w_{ij}\frac{\delta_{ij}}{d_{ij}(C)}A_{ij}&\text{ if }d_{ij}(C)>0,\\ 0&\text{ if }d_{ij}(C)=0,\end{cases} \end{equation*}

and \(d_{ij}^2(C):=\mathbf{tr}\ A_{ij}C\).

We call the space of all positive semi-definite \(n\times n\) matrices cross product space. The problem of minimizing \(\sigma\) over \(n\times p\)-dimensional configuration space is equivalent to the problem of minimizing \(\sigma\) over the set of matrices \(C\) in \(n\times n\)-dimensional cross product space that have rank less than or equal to \(p\). The corresponding solutions are related by the simple relationship \(C=XX'\).

Theorem 2: [De Leeuw 93] Stress is convex on cross product space.

Proof: First, \(\eta\) is linear in \(C\). Second, \[ \rho(C)=\mathop{\sum\sum}_{1\leq i<j\leq n} w_{ij}\delta_{ij}\sqrt{\mathbf{tr}\ A_{ij}C}. \] This is a weighted sum of square roots of non-negative functions that are linear in \(C\), and it is consequently concave. Thus \(-2\rho\) is convex, and since \(\eta\) is linear, \(\sigma=1-2\rho+\eta\) is convex. QED
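Theorem 2 can be illustrated numerically. In the hypothetical sketch below stress is evaluated directly in cross product space, using \(d_{ij}^2(C)=c_{ii}+c_{jj}-2c_{ij}\), and convexity is checked at a random convex combination (unit weights, normalized dissimilarities).

```r
## Stress as a function of C in cross product space; delta is a vector over
## the pairs i < j in dist() order, weights are all one.
stressC <- function (cc, delta) {
  d2 <- outer (diag (cc), diag (cc), "+") - 2 * cc   # matrix of d_ij^2(C)
  d <- sqrt (pmax (d2 [lower.tri (d2)], 0))
  1 - 2 * sum (delta * d) + sum (d ^ 2)
}
c1 <- tcrossprod (matrix (rnorm (8), 4, 2))
c2 <- tcrossprod (matrix (rnorm (8), 4, 2))
delta <- rep (1 / sqrt (6), 6)
stressC ((c1 + c2) / 2, delta) <=
  (stressC (c1, delta) + stressC (c2, delta)) / 2    # TRUE, by Theorem 2
```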

Unfortunately the subset of cross product space of all matrices with rank less than or equal to \(p\) is far from simple (see Dattorro (2015)), so computational approaches to MDS prefer to work in configuration space.

5 Full-dimensional Scaling

Cross product space, the set of all positive semi-definite matrices, is a closed convex cone \(\mathcal{K}\) in the linear space of all \(n\times n\) symmetric matrices. This has an interesting consequence.

Theorem 3: [De Leeuw 93] Full-dimensional scaling, i.e. minimizing \(\sigma\) over \(\mathcal{K}\), is a convex programming problem. Thus in FDS all local minima are global. If \(w_{ij}\delta_{ij}>0\) for all \(i,j\) then the minimum is unique.

This result has been around since about 1985. De Leeuw (1993) gives a proof, but the report it appeared in remained unpublished. A published proof is in De Leeuw and Groenen (1997). Another treatment of FDS, with a somewhat different emphasis, is in De Leeuw (2014).

Now, by a familiar theorem (Theorem 31.4 in Rockafellar (1970)), a matrix \(C\) minimizes \(\sigma\) over \(\mathcal{K}\) if and only if \begin{align} C&\in\mathcal{K},\\ V-B(C)&\in\mathcal{K},\\ \mathbf{tr}\ C(V-B(C))&=0. \end{align}

We give a computational proof of this result for FDS that actually yields a bit more.

Theorem 4: [Expansion] For \(\Delta\in\mathcal{K}\) we have \begin{equation} \sigma(C+\epsilon\Delta)=\sigma(C)-2\epsilon^{\frac12}\mathop{\sum\sum}_{\mathbf{tr}\ A_{ij}C=0}w_{ij}\delta_{ij}\sqrt{\mathbf{tr}\ A_{ij}\Delta}+\epsilon\ \mathbf{tr}\ (V-B(C))\Delta +o(\epsilon).\label{E:expand} \end{equation}

Proof: Expand term by term. If \(\mathbf{tr}\ A_{ij}C>0\) then \(d_{ij}(C+\epsilon\Delta)=d_{ij}(C)+\frac12\epsilon\frac{\mathbf{tr}\ A_{ij}\Delta}{d_{ij}(C)}+o(\epsilon)\), and summing over these pairs changes \(-2\rho\) by \(-\epsilon\ \mathbf{tr}\ B(C)\Delta+o(\epsilon)\). If \(\mathbf{tr}\ A_{ij}C=0\) then \(d_{ij}(C+\epsilon\Delta)=\epsilon^{\frac12}\sqrt{\mathbf{tr}\ A_{ij}\Delta}\) exactly. Finally \(\eta(C+\epsilon\Delta)=\eta(C)+\epsilon\ \mathbf{tr}\ V\Delta\) because \(\eta\) is linear. Collecting terms gives \(\eqref{E:expand}\). QED

Theorem 5: [Convex] Suppose \(C\) is a solution to the problem of minimizing \(\sigma\) over \(\mathcal{K}\). Then

  1. \(\mathbf{tr}\ A_{ij}C > 0\) for all \(i,j\) for which \(w_{ij}\delta_{ij}>0\).
  2. \(V-B(C)\) is positive semi-definite.
  3. \(\mathbf{tr}\ C(V-B(C))=0\).
  4. If \(C\) is positive definite then \(V=B(C)\) and \(\sigma(C)=0\).

Proof: The \(\epsilon^\frac12\) term in \(\eqref{E:expand}\) needs to vanish at a local minimum. This proves the first part. It follows that at a local minimum

\begin{equation*} \sigma(C+\epsilon\Delta)=\sigma(C)+ \epsilon\ \mathbf{tr}\ (V-B(C))\Delta+o(\epsilon). \end{equation*}

If \(V-B(C)\) is not positive semi-definite, then there is a \(\Delta\in\mathcal{K}\) such that \(\mathbf{tr}\ (V-B(C))\Delta < 0\). Thus \(C\) cannot be the minimum, which proves the second part. If we choose \(\Delta=C\) we find

\begin{equation*} \sigma((1+\epsilon)C)=\sigma(C)+ \epsilon\ \mathbf{tr}\ (V-B(C))C+o(\epsilon). \end{equation*}

and choosing \(\epsilon\) small and negative shows we must have \(\mathbf{tr}\ (V-B(C))C=0\) for \(C\) to be a minimum. This proves the third part. Finally, if \(\sigma\) has a minimum at \(C\), and \(C\) is positive definite, then from parts 2 and 3 we have \(V=B(C)\). Comparing off-diagonal elements shows \(\delta_{ij}=d_{ij}(C)\) whenever \(w_{ij}>0\), and thus \(\sigma(C)=0\). QED

If \(C\) is the solution of the FDS problem, then \(\mathbf{rank}(C)\) defines the Gower rank of the dissimilarities. The number of positive eigenvalues of the negative of the doubly-centered matrix of squared dissimilarities, the matrix factored in classical MDS, defines the Torgerson rank of the dissimilarities. The Gower conjecture is that the Gower rank is less than or equal to the Torgerson rank. No proof and no counterexamples have been found.
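The Torgerson rank only requires one eigendecomposition. A minimal sketch, with a hypothetical helper name:

```r
## Torgerson rank: the number of positive eigenvalues of -(1/2) J (Delta^2) J,
## where J = I - ee'/n is the centering projector and Delta is the n x n
## matrix of dissimilarities.
torgersonRank <- function (delta, eps = 1e-10) {
  n <- nrow (delta)
  j <- diag (n) - 1 / n
  b <- -0.5 * j %*% (delta ^ 2) %*% j
  sum (eigen (b, symmetric = TRUE)$values > eps)
}
```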

We compute the FDS solution using the smacof algorithm \begin{equation} X^{(k+1)}=V^+B(X^{(k)})X^{(k)} \end{equation}

in the space of all \(n\times n\) configurations, using the identity matrix as a default starting point. Since we work in configuration space, not in cross product space, this does not guarantee convergence to the unique FDS solution, but after convergence we can easily check the necessary and sufficient conditions of Theorem 5.
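For unit weights \(V=nJ\), with \(J\) the centering projector, so \(V^+=J/n\). The following is a minimal sketch of the iteration under that assumption; the function name and convergence test are ours, and the R source mentioned in the note at the top of this paper is the authoritative implementation.

```r
## smacof iteration in the space of n x n configurations, unit weights.
## delta is the n x n matrix of dissimilarities, zero on the diagonal;
## we assume all off-diagonal distances stay positive during the iterations.
fullMDS <- function (delta, itmax = 1000, eps = 1e-10) {
  n <- nrow (delta)
  x <- diag (n)                     # identity as default starting point
  for (k in 1:itmax) {
    d <- as.matrix (dist (x))
    b <- -delta / (d + diag (n))    # off-diagonal -delta_ij / d_ij; adding
    diag (b) <- 0                   # diag(n) only guards the zero diagonal
    diag (b) <- -rowSums (b)        # fill diagonal: rows sum to zero
    xnew <- (b %*% x) / n           # V+ B(X) X, since B(X) X is centered
    change <- max (abs (xnew - x))
    x <- xnew
    if (change < eps) break
  }
  x
}
```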

As a small example, consider four points with all dissimilarities equal to one, except \(\delta_{14}\), which is equal to three. Clearly the triangle inequality is violated, and thus no Euclidean configuration can fit these dissimilarities perfectly.

The FDS solution turns out to have rank two, thus the Gower rank is two. The singular values of the FDS solution are

## [1] 0.4508464709 0.2125310645 0.0000001303

Gower rank two also follows from the eigenvalues of the matrix \(B(C)\), which are

## [1] 1.0000000000 1.0000000000 0.9205543464
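Checks of this kind can be reproduced, at least approximately, with the `fullMDS()` sketch above; the number of iterations and the rounding will generally differ from the output shown here.

```r
## Hypothetical reproduction of the rank check for this example.
delta <- matrix (1, 4, 4)
delta [1, 4] <- delta [4, 1] <- 3
diag (delta) <- 0
delta <- delta / sqrt (sum (delta ^ 2) / 2)  # normalize sum w delta^2 to one
x <- fullMDS (delta)
svd (x)$d    # two clearly nonzero singular values, so the Gower rank is two
```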

6 Another planar example

Suppose \(X\) and \(Y\) are two linearly independent configurations and consider all configurations \(Z=\alpha X+\beta Y\) in the plane spanned by \(X\) and \(Y\). Define \[ B(\gamma):=\mathop{\sum\sum}_{1\leq i<j\leq n}\begin{cases}w_{ij}\frac{\delta_{ij}}{d_{ij}(\alpha X+\beta Y)} \begin{bmatrix}\mathbf{tr}\ X'A_{ij}X&\mathbf{tr}\ X'A_{ij}Y\\\mathbf{tr}\ Y'A_{ij}X&\mathbf{tr}\ Y'A_{ij}Y\end{bmatrix}&\text{ if }d_{ij}(\alpha X+\beta Y)>0,\\ 0&\text{ if }d_{ij}(\alpha X+\beta Y)=0, \end{cases} \] where \(\gamma:=\begin{bmatrix}\alpha\\\beta\end{bmatrix}\). Also \[ U:=\begin{bmatrix}\mathbf{tr}\ X'VX&\mathbf{tr}\ X'VY\\\mathbf{tr}\ Y'VX&\mathbf{tr}\ Y'VY\end{bmatrix}. \] Then \begin{equation} \sigma(\gamma):=1-2\gamma'B(\gamma)\gamma+\gamma'U\gamma,\label{E:gamma} \end{equation}

and the stationary equations are \[ B(\gamma)\gamma=U\gamma. \] We have already seen an example of \(\sigma\) on what can be called combination space in Figure 1, where \(X\) and \(Y\) are two local minima. Another interesting example uses a configuration \(X\) and its gradient \(Y=\mathcal{D}\sigma(X)\). In that case finding the minimum of \(\sigma\) over combination space means finding the optimal step size for the method of steepest descent.
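Evaluating \(\sigma\) on combination space is straightforward. A sketch, with our own helper name and unit weights, that can be used to generate contour plots such as Figure 1:

```r
## Stress on the plane of linear combinations alpha * x + beta * y; delta is
## a vector over the pairs i < j in dist() order, weights are all one.
planeStress <- function (alpha, beta, x, y, delta) {
  d <- as.vector (dist (alpha * x + beta * y))
  sum ((delta - d) ^ 2)
}
## e.g. outer (alphas, betas, Vectorize (planeStress, c ("alpha", "beta")),
##             x = x, y = y, delta = delta) gives a grid for contour()
```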

For a configuration with three points equally spaced on a horizontal line and the fourth point above the middle of the three

##            [,1]       [,2]
## [1,] -0.1006209 -0.1006209
## [2,]  0.0000000 -0.1006209
## [3,]  0.1006209 -0.1006209
## [4,]  0.0000000  0.3018627

we have gradient

##         [,1]        [,2]
## 1  0.3250910 -0.08772788
## 2  0.3638044 -0.07804082
## 3 -0.6888953 -0.08772788
## 4  0.0000000  0.25349657
The contour plot for the linear combinations is