Abstract

We give an algorithm, with R code, to minimize the multidimensional scaling stress loss function under the condition that some or all of the fitted distances are between given positive upper and lower bounds. This paper combines theory, algorithms, code, and results of De Leeuw (2017b) and De Leeuw (2017a).Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory gifi.stat.ucla.edu/updown has a pdf version, the complete Rmd file with all code chunks, the bib file, and the R source code.

In this paper we study the minimization of *stress*, defined as \[
\mathbf{stress}(X):=\frac14\sum_{i=1}^n\sum_{j=1}^n w_{ij}(\delta_{ij}-d_{ij}(X))^2
\] over \(n\times p\) *configuration matrices* \(X\). Here \(W=\{w_{ij}\}\) and \(\Delta=\{\delta_{ij}\}\) are given symmetric, hollow, and non-negative matrices of, respectively, *weights* and *dissimilarities*. The \(d_{ij}(X)\) are Euclidean distances between the points with coordinates in the rows of \(X\). Thus \[
d_{ij}^2(X):=(x_i-x_j)'(x_i-x_j)=(e_i-e_j)'XX'(e_i-e_j)=\mathbf{tr}\ X'A_{ij}X.
\] Here \(e_i\) and \(e_j\) are unit vectors, with all elements equal to zero except for the single element \(i\) or \(j\), which is equal to one. Also \(A_{ij}:=(e_i-e_j)(e_i-e_j)'\).

The problem of minimizing **stress** is well-studied (see, for example, Borg and Groenen (2005)). In this paper we analyze the problem of minimizing **stress** over all configurations \(X\) for which \(\alpha_{ij}\leq d_{ij}(X)\leq\beta_{ij}\) for some or all pairs \(i,j\), where the \(\alpha_{ij}\) and \(\beta_{ij}\) are given bounds satisfying \(0\leq\alpha_{ij}<\beta_{ij}\leq+\infty\). We suppose the constraints are strongly consistent, i.e.Â there is at least one \(n\times p\) configuration matrix \(X\) such that \(\alpha_{ij}<d_{ij}(X)<\beta_{ij}\).

These optimization problems are non-standard in various ways. Although the set of \(X\) such that \(d_{ij}(X)\leq\beta_{ij}\) is convex, the constraints \(d_{ij}(X)\geq\alpha_{ij}\) define a *reverse convex set* (Meyer (1970)) in configuration space. The loss funcion **stress** is neither convex nor concave. We can, however, use basic `smacof`

theory (De Leeuw (1977)) to majorize stress by a convex quadratic and to majorize the constraints by linear or convex quadratic functions. Majorization can be used to define a simple iterative algorithm. In each iteration we majorize stress and the constraints, and then minimize the convex quadratic majorizer over configurations satisfying the majorized constraints using quadratic programming with quadratic constraints or QPQC (De Leeuw (2017c)). Compute a new majorization at the minimizer of the majorized problem, and so on. These are the two steps in the majorization or MM approach to optimization (De Leeuw (1994), Lange (2016)).

A simplified expression for stress, first used in De Leeuw (1993), is \[ \mathbf{stress}(x)=\frac12\sum_{k=1}^K w_k(\delta_k-\sqrt{x'A_kx})^2. \] We have put the elements below the diagonal of \(W\) and \(\Delta\) in a vector of length \(\frac12 n(n-1)\). Also \(x:=\mathbf{vec}(X)\) and the \(A_k\) are the direct sums of \(p\) copies of the corresponding \(A_{ij}\). Now expand the square, and assume without loss of generality that \(\frac12\sum_{k=1}^K w_k\delta_k^2=1\). Then \[ \mathbf{stress}(x)=1-\sum_{k=1}^K w_k\delta_k\sqrt{x'A_kx}+\frac12 x'Vx, \] with \(V:=\sum_{k=1}^K w_kA_k\).

Suppose \(V=K\Lambda^2K'\) is a complete eigen-decomposition of \(V\). Some care is needed to handle the fact that \(V\) is singular, but under the assumption of irreducibility (De Leeuw (1977)) this is easily taken care of. Change variables with \(z=\Lambda^{\frac12}K'x\). Then \[ \mathbf{stress}(z)=1-\sum_{k=1}^K w_k\delta_k\sqrt{z'\tilde A_kz}+\frac12 z'z, \] with \(\tilde A_k=\Lambda^{-\frac12}K'A_kK\Lambda^{-\frac12}\), so that \(\sum_{k=1}^K w_k\tilde A_k=I\).

We have see in the previous section that we can write \[
\mathbf{stress}(x)=1-\rho(x)+\frac12 x'x,
\] where \[
\rho(x):=\sum_{k=1}^K w_k\delta_k d_k(x),
\] with distances \(d_k(x):=\sqrt{x'A_kx}\), and the \(A_k\) positive semi-definite matrices that add up to the identity. By Cauchy-Schwarz, \[
\rho(x)=x'B(x)x\geq x'B(y)y,
\] where \[
B(x):=\sum_{k=1}^K w_k\frac{\delta_k}{d_k(x)}A_k.
\] If we define the *Guttman transform* if \(x\) as \(\Gamma(x):=B(x)x\), then for all \(x\) \[
\mathbf(stress)(x)=1-x'\Gamma(x)+\frac12 x'x=(1-\frac12\Gamma(x)'\Gamma(x))+\frac12(x-\Gamma(x))'(x-\Gamma(x)),
\] and for all \(x,y\) \[
\mathbf(stress)\leq 1-x'\Gamma(y)+\frac12 x'x=(1-\frac12\Gamma(y)'\Gamma(y))+\frac12(x-\Gamma(y)'(x-\Gamma(y)),
\] In this notation a `smacof`

iteration is \(x^{(k+1)}=\Gamma(x^{(k)})\).

Suppose the problem we are interested in is to minimize a real-valued \(f_0\) over all \(x\in\mathbb{R}^n\) that satisfy \(f_j(x)\leq 0\) for all \(j=1,\cdots,m\). We call this problem \(\mathcal{P}\).

Now suppose \(g_j:\mathbb{R}^n\otimes\mathbb{R}^n\rightarrow\mathbb{R}\) majorizes \(f_j\), i.e. \[\begin{align*} f_j(x)&\leq g_j(x,y)\text{ for all }x,y\in\mathbb{R}^n,\\ f_j(y)&=g_j(y,y)\text{ for all }y\in\mathbb{R}^n. \end{align*}\]To simplify matters we suppose all \(f_j\) and \(g_j\) are continuously differentiable, and all \(g_j\) are convex in their first argument. Define problem \(\mathcal{P}_k\) as the problem of minimizing \(g_0(x,x^{(k-1)})\) over the \(x\in\mathbb{R}^n\) that satisfy \(g_j(x,x^{(k-1)})\leq 0\). Define a sequence \(x^{(k)}\) with \(x^{(0)}\) feasible for \(\mathcal{P}\), and with \(x^{(k)}\) for \(k\geq 1\) a solution of \(\mathcal{P}_k\).

**Theorem 1: [Convergence]**

- \(x^{(k)}\) is feasible for \(\mathcal{P}\) for all \(k\geq 1\).
- \(f_0(x^{(k)})\leq f_0(x^{(k-1)})\) for all \(k\geq 1\).

**Proof:** For the first part \(f_j(x^{(k)})\leq g_j(x^{(k)},x^{(k-1)})\) by majorization, and \(g_j(x^{(k)},x^{(k-1)})\leq 0\) by feasibility for \(\mathcal{P}_k\). For the second part \(f_0(x^{(k)})\leq g_0(x^{(k)},x^{(k-1)})\) by majorization. Because \(x^{(k)}\) solves \(\mathcal{P}_k\) we have \(g_0(x^{(k)},x^{(k-1)})\leq g_0(x^{(k-1)},x^{(k-1)})\) if \(x^{(k-1)}\) is feasible for \(\mathcal{P}_k\), i.e.Â if \(g_j(x^{(k-1)},x^{(k-1)})\leq 0\). But \(g_j(x^{(k-1)},x^{(k-1)})=f_j(x^{(k-1)})\leq 0\) by the first part of the theorem. **QED**

By theorem 1 we have a sequence of points with decreasing function values that are all feasible for \(\mathcal{P}\). We can say a bit more about accumulation points of this sequence. Define \(\mathcal{P}(y)\) as the convex problem of minimizing \(g_0(x,y)\) over \(x\in\mathbb{R}^n\) and \(g_j(x,y)\leq 0\).

**Theorem 2: [Necessary]**

If \(x\) solves \(\mathcal{P}(x)\) then \(x\) satisfies the first order necessary conditions for a local minimum of \(\mathcal{P}\).

**Proof:** The necessary and sufficient conditions for \(x\) to be a minimum of \(\mathcal{P}(y)\) are that there exists \(\lambda\geq 0\) such that \[
\mathcal{D}_1g_0(x,y)+\sum_{j=1}^m\lambda_j\mathcal{D}_1g_j(x,y)=0,
\] with in addition \(g_j(x,y)\leq 0\) and \(\lambda_jg_j(x,y)=0\).

The first order necessary conditions for \(x\) to be a local minimum of \(\mathcal{P}\) are that there exists \(\lambda\geq 0\) such that \[ \mathcal{D}f_0(x)+\sum_{j=1}^m\lambda_j\mathcal{D}f_j(x)=0, \] with in addition \(f_j(x)\leq 0\) and \(\lambda_jf_j(x)=0\).

But if the \(f_j\) and \(g_j\) are differentiable, we know from majorization theory (De Leeuw (2016), Lange (2016)) that \(\mathcal{D}f_j(x)=\mathcal{D}_1g_j(x,x)\) for all \(x\).

**QED**

It follows that if we define \(\mathcal{F}(y)\) as the solution of \(\mathcal{P}(y)\) then fixed points of \(\mathcal{F}\) are local minimizers of \(\mathcal{P}\). Thus if \(x^{(k)}\) converges to a fixed point of \(\mathcal{F}\), it converges to a local minimizer of \(\mathcal{P}\), and \(f_0(x^{(k)})\) converges to a local minimum.

and otherwise leave undisturbed. No majorization is required.

The second set of constraints is \(\sqrt{x'A_kx}\geq\alpha_k\) or \(\alpha_k-\sqrt{x'A_kx}\leq 0\). We majorize at \(x^{(k-1)}\), using Cauchy-Schwarz, to \[\begin{equation}\label{E:down} \alpha_k-\frac{1}{d_k(x^{(k-1)})}x'A_kx^{(k-1)}\leq 0 \end{equation}\]Thus the QPQC problem we have to solve in each bounded `smacof`

iteration, a.k.a. problem \(\mathcal{P}_k\), is minimization of the quadratic \(1-x'\Gamma(x^{(k-1)})+\frac12 x'x\) over \(x\) satisfying the quadratic constraints \(\eqref{E:up}\) and the linear constraints \(\eqref{E:down}\).

Minimizing a convex quadratic under convex quadratic constraints has been implemented in the R function `qpqc`

, described in De Leeuw (2017c). It dualizes the problem and then uses the `nnnewtin`

function for Newtonâ€™s method with non-negativity constraints. We apply `qpqc`

to solve our problems \(\mathcal{P}_k\), using the fact that the linear constraints \(\eqref{E:down}\) are just quadratic constraints with a zero quadratic component.

The `smacofUpDown`

function has two arguments without default values, a vector `delta`

of length \(\frac12n(n-1)\) with dissimilarities (lower-diagonal, columnwise, same ordering as a `dist`

object) and an \(n\times p\) matrix `xini`

with an initial configuration.

`args(smacofUpDown)`

```
## function (delta, xini, w = rep(1, length(delta)), bndlw = rep(0,
## length(delta)), bndup = rep(Inf, length(delta)), itmax = 1000,
## eps = 1e-10, verbose = FALSE)
## NULL
```

The arguments `bndlw`

and `bndup`

are vectors of length \(\frac12n(n-1)\), with default values 0 and \(+\infty\). Thus, by default, `smacofUpDown`

computes an unconstrained weighted least squares multidimensional scaling solution, where the weights in `w`

are by default all equal to \(+1\).

There is no default value for `xini`

, and the user must make sure that the distances of the initial configuration are between the bounds (are feasible). This is easy enough to do if there are only lower bounds (take any configuration and make it big enough) or only uppper bounds (make it small enough). If there are both upper and lower bounds then feasibility should be checked, because `smacofUpDown`

will refuse to start if the initial configuration has any infeasible distances.

Our first example has \(n=10\) with all \(\delta_{ij}\) equal. The optimal `smacof`

solution in two dimensions needs 131 iterations to arrive at stress 0.1098799783. We reanalyze the data with the constraint that the distance between the first point and all nine others is at least one. The algorithm incorporating these bounds, implemented in the function `smacofAbove`

, uses 156 iterations and finds stress 0.1340105193. The two configurations are in figure 1. There are five active constraints, i.e.Â distances equal to one, indicated by lines in the plot.

In the second analysis, now using eight objects with equal dissimilarties, we require \(d_{ij}(X)\geq 1\) for all \(i,j\) with \(|i-j|=1\), and \(d_{ij}(X)\leq 2\) for all \(i,j\) with \(|i-j|=2\).

In the unconstrained case we findSo far we have assumed \(\alpha_{ij}<\beta_{ij}\) for all \(i,j\). It is possible, however, to let \(\alpha_{ij}=\beta_{ij}\), which means we require \(d_{ij}(X)=\alpha_{ij}=\beta_{ij}\). Note the complete set of constraints still needs to be consistent, and the initial configuration must still be feasible. In our example we require all seven distances between successive points to be equal to one.

Our algorithm in this case becomes extremely slow. It stops at the upper bound of 2000 iterations, with **stress** 0.145422896. Of course all constraints are now active. The Lagrange multipliers are large, serving as penalty parameters to force the equality constraints.

```
## [1] 607.209037 798.640263 184.410120 470.471412 837.258144 684.253342
## [7] 342.484954 607.726977 799.263569 184.916274 470.878355 838.697106
## [13] 685.017712 342.476621
```

Note, however, that the previous analysis where we merly required \(d_{ij}(X)\geq 1\) for \(|i-j|=1\) produced a solution feasible for the current problem with **stress** equal to 0.1183956861. This makes it interesting see what goes on if we increase the upper bound of the number of iterations even more, say, to 10000.

We now find **stress** 0.1016650356 after 4638 iterations. The Lagrange multipliers for this solution are much smaller (first seven correspond with upper bounds, second seven with lower bounds). What basically happens is that for each \((i,j)\) with \(|i-j|=1\) either the upper or the lower constraint is seen to be violated and gets a zero Lagrange multiplier. The corresponding Lagrange multiplier for the corresponding other constraint is positive, because it is interpreted as being satisfied. This is a consequence of using floating point with limited precision, and the difference between \(\alpha_{ij}=0.99\cdots\) and \(\beta_{ij}=1.00\cdots\).

```
## [1] 0.1769885825 0.0000000000 0.2256740368 0.0000000001 0.0000005800
## [6] 0.0000000012 0.0707712910
```

```
## [1] 0.0000094723 0.2413717195 0.0000020656 0.0233523251 0.1287322429
## [6] 0.1247717131 0.0000000002
```

The successive distances are

```
## [1] 0.9999999397 1.0000000032 1.0000000368 1.0000000221 0.9999999739
## [6] 0.9999999647 0.9999999492
```

As the next illustration we use data from De Gruijter (1967), with average dissimilarity judgments between Dutch political parties in 1967. The data are

```
## KVP PvdA VVD ARP CHU CPN PSP BP
## PvdA 5.63
## VVD 5.27 6.72
## ARP 4.60 5.64 5.46
## CHU 4.80 6.22 4.97 3.20
## CPN 7.54 5.12 8.13 7.84 7.80
## PSP 6.73 4.59 7.55 6.73 7.08 4.08
## BP 7.18 7.22 6.90 7.28 6.96 6.34 6.88
## D66 6.17 5.47 4.67 6.13 6.04 7.42 6.36 7.36
```

First, three different but comparable anlyses are done. The first does not impose restriction, the second is *MDS from below*, which requires \(d_{ij}(X)\leq\delta_{ij}\) for all \(i,j\). And the third is *MDS from above*, for which \(d_{ij}(X)\geq\delta_{ij}\) for all \(i,j\). The configurations are quite similar, except for the position of Dâ€™66, which at the time was a novelty in Dutch politics. The value of **stress** at the solutions is, respectively, 0.044603386, 0.0752702106, and 0.280130691.