--- title: "Shepard Non-metric Multidimensional Scaling" author: "Jan de Leeuw" date: "Version 18, February 01, 2017" output: pdf_document: keep_tex: yes number_sections: yes toc: yes toc_depth: 3 html_document: keep_md: yes number_sections: yes toc: yes fontsize: 12pt graphics: yes bibliography: shepard.bib abstract: We give an algorithm, with R code, to minimize the multidimensional scaling loss function proposed in Shepard's 1962 papers. We show the loss function can be justified by using the classical rearrangement inequality, and we investigated its differentiability. --- {r function_code, echo = FALSE} source("shepard.R")  {r packages, echo = FALSE} options (digits = 10) library (captioner)  {r equation_numbering, echo = FALSE} figure_nums <- captioner (prefix = "Figure") figure_nums (name = "rectangle_conf", caption = "Rectangle Data, Configuration, Original Left, Rounded Right", display = FALSE) figure_nums (name = "rectangle_shep", caption = "Rectangle Data, Shepard Plot, Original Left, Rounded Right", display = FALSE) figure_nums (name = "rectangle_shep_norm", caption = "Rectangle Data, Normalized by RMS", display = FALSE) figure_nums (name = "gruijter_conf", caption = "De Gruijter Data, Configuration", display = FALSE) figure_nums (name = "gruijter_shep", caption = "De Gruijter Data, Shepard Plot", display = FALSE) figure_nums (name = "ekman_conf", caption = "Ekman Data, Configuration", display = FALSE) figure_nums (name = "ekman_shep", caption = "Ekman Data, Shepard Plot", display = FALSE)  Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory [gifi.stat.ucla.edu/shepard](http://gifi.stat.ucla.edu/shepard) has a pdf version, the complete Rmd file with all code chunks, the bib file, and the R source code. #Introduction After @torgerson_58 had more or less finalized the treatment of metric multidimensional scaling, Shepard had been searching for the function relating proximity judgments and distance, in what we now call the Shepard diagram (@deleeuw_mair_C_15). His theory (@shepard_57) suggested a negative exponential relationship, but the experimental results were not convincing. The next major step was to let the data determine the nature of the relationship, and showing that this could be done by using only ordinal proximity information. This lead to the groundbreaking papers @shepard_62a and @shepard_62b. Although Shepard's contributions were certainly not forgotten, they were definitely overshadowed by the corresponding papers @kruskal_64a and @kruskal_64b, published two years later by Shepard's "co-worker" and "mathematical colleague" Joseph Kruskal. In Google Scholar the Shepard papers have 2364 and 1063 citations, the Kruskal papers have 6167 and 4167 (per 01/17/2017). Shepard was well aware of the importance of his 1962 papers, which were the cumulation of a long list of his earlier contributions to mapping measures of proximity. > I have always regarded the development reported in this paper as one of my most original and significant accomplishments. In the 1962 papers he developed an iterative technique, and a corresponding computer program, that maintains monotonicity between distances and dissimilarities while concentrating the variance of the configuration as much as possible in a small number of dimensions. From @shepard_79: > The idea of representing objects (such as colors, sounds, shapes, faces, word meanings, etc.) as points in space, in such a way that the distances between the points represented the perceived similarities between the objects, had occurred to me while I was an undergraduate student in 1951. But it was not until ten years later, when I gained access to the powerful computing facilities at the Bell Telephone Laboratories, that I conceived of an iterative process for reconstructing the implied spatial configuration even when the form of the monotone function relating similarity and distance was completely unknown. > After a period of trial-and-error adjustment of the parameters of the iterative process, success came with dramatic suddenness on March 17, 1961. According to the computer log, it was at precisely 2.33 p.m. EST on that day that the iterative process first converged to a stationary configuration, revealing a remarkably exact recovery of an underlying test configuration. The excitement of that moment was rivaled only by the birth of my daughter on the very next day. Since then my daughter has developed into a fine young woman; and, thanks in part to the subsequent contributions of my mathematical colleague Joseph Kruskal, nonmetric multidimensional scaling is now finding wide application throughout the cognitive, behavioral, and biomedical sciences. Kruskal was politely critical of Shepard's work. From @kruskal_64a, page 2: > However, Shepard's technique still lacks a solid logical foundation. Most notably, and in common with most other authors, he does not give a mathematically explicit definition of what constitutes a solution. He places the monotone relationship as the central feature, but points out (@shepard_62a, p. 128) that a low-dimensional solution cannot be expected to satisfy this criterion perfectly. He introduces a measure of departure $\delta$ from this condition [@shepard_62a, pp. 136-137] but gives it only secondary importance as a criterion for deciding when to terminate his iterative process. His iterative process itself implies still another way of measuring the departure from monotonicity. And again, on page 6-7, there is an implicit criticism of Shepard's use of the original proximity measures to define fit. > Should we measure deviations between the curve and stars along the distance axis or along the dissimilarity axis? The answer is "along the distance axis." For if we measure them along the dissimilarity axis, we shall find ourselves doing arithmetic with dissimilarities. This we must not do, because we are committed to using only the rank ordering of the dissimilarities. From subsequent publications, and from some personal communications as well, I get the impression that Shepard interpreted Kruskal's contribution as a relatively minor technical improvement of a procedure he had invented and published. I remember the consternation when one of the versions of the KYST program for MDS stated that KYST was short for Kruskal-Young-Seery-Torgerson. While Judith B. Seery, a programmer at Bell Labs, certainly worked on the program, later versions made the important correction that KYST stood for Kruskal-Young-Shepard-Torgerson. There seems to be some subtle acrimony in Shepard's retelling of the history. @shepard_74, p 376-377: > Although my original method generally yielded spatial configurations that appeared indistinguishable from those furnished by subsequent methods, my mathematical colleague Joseph Kruskal (@kruskal_64a) soon noted that the precise measure of departure from monotonicity that was being minimized by the method was neither explicitly defined nor even known to exist in an explicitly definable form. Thus, despite the intuitive plausibility and practical success of the method, it lacked the conceptual advantage of a strict mathematical specification of exactly what problem was being solved. Moreover, in the absence of an explicitly defined loss function, general techniques for the minimization of such functions (notably, gradient methods) were not apparently applicable. The iterative method that was used consequently appeared somewhat ad hoc. > Now however, some twelve years following my initial report of that method, my two former associates at the Bell Laboratories, Joe Kruskal (who succeeds me as President of this Society) and Doug Carroll (who has just been elected to succeed him), have jointly discovered that, unbeknown to all of us, my original method was in fact equivalent to a gradient method. Moreover they have even determined that the explicit form of the measure of departure from monotonicity that was (implicitly) minimized is one that appears to be quite reasonable (Kruskal and Carroll, personal communication; see Kruskal [in press]). Indeed, the method of optimization that was used apparently turns out, with a minor alteration in a normalizing factor, to belong to a class of "interchange methods" that Jan de Leeuw (personal communication) has recently shown to offer some advantages in the avoidance of merely local minima. However, these recently discovered aspects of my original method are mentioned, here, for their inherent or historical interest only; they do not form the basis for any of the recommendations that I shall offer in what follows. I am not sure what I communicated in 1973 to Shepard. One of the purposes of this current paper is to try to reconstruct what I could possibly have meant by "interchange methods". There is a major clue in the references of @kruskal_77. Kruskal refers to a paper of mine from 1975, supposedly "in press", titled "Nonmetric Scaling with Rearrangement Methods". I vaguely remember an unpublished manuscript with a title like that, written in 1973-1974 while I was visiting Bell Labs. It's irretrievably lost. We'll just pretend it still exists (@deleeuw_R_73j). I will try to reconstruct some if it. The claim that Shepard's original method was "equivalent to a gradient method" that minimized a well defined "measure of departure from monotonicity" still deserves some attention. I agree that Kruskal's improvement of Shepard's technique was "mostly technical", but it definitely was not "merely technical". The use of monotone regression, the emphasis on explicit loss functions, and the use of gradient methods, created a whole new area of research in psychometrics and multivariate analysis. Meulman's contribution on page 1194-1197 of @heiser_hubert_kiers_koehn_lewis_meulman_mcardle_sijtsma_takane_16 has some additional remarks on the impact of Kruskal's work. #Shepard's Method A complete reconstruction of Shepard's method would only be possible if we had the original FORTRAN program, but that has disappeared into the folds of time. We do have, however, @kruskal_carroll_77, which discusses the gradient interpretation and the rearrangement basis of the loss function. The method as originally conceived had two somewhat contradictory objectives. * It was looking for a configuration in which distances were as monotonic as possible with the given measures of proximity. * It was looking for a configuration which had as much variance as possible concentrated in the first few principal components. In other words, it was a form of full-dimensional scaling in which configurations vary in $n-1$ dimensions. After convergence the first $p$ principal components of the result are reported as the $p$-dimensional solution. Because of the two different objectives the displacement of points in an iteration was a compromise of two different displacements: one to improve monotonicity and one to improve low-dimensionality. In this paper we will only look at the deviations from monotonicity and, unlike Shepard, we will only look at configurations of a fixed dimension $p$. In this, of course, we follow Kruskal. Versions of Shepard's full-dimensional method will be treated in another paper. We change the notation somewhat, and we shift from working with *measures of proximity* to working with *dissimilarities*. To measure departure from monotonicity we compare a vector of dissimilarities $\delta$ and a vector of distances $d(x)$. Distances are defined as $d_k(x)=\sqrt{x'A_kx}$, where the $A_k$ are matrices of the form $(e_i-e_j)(e_i-e_j)'$, or direct sums of $p$ copies of such matrices (@deleeuw_R_93c). The loss function *implicitly* used by Shepard to measure departure from monotonicity is $$\label{E:loss1} \sigma^\star(x):=\sum_{k=1}^K(\hat\delta_k-\delta_k)d_k(x).$$ Here the $\hat\delta_k$ are dissimilarities permuted in such a way that they are monotonic with the distances. Thus $\hat{\delta}_k$ is a function of both the numerical values of the dissimilarities and the rank order of the distances. We clearly perform "arithmetic with dissimilarities", although I am not convinced that is necessarily a bad thing. Loss function $\eqref{E:loss1}$ does not occur in @shepard_62a, but we infer it, basically by looking at his formula for displacements to optimize monotonicity. So far, we have not shown that $\sigma_M$ is continuous, let alone differentiable. Using the classical rearrangement inequality (for example, @hardy_littlewood_polya_52, section 10.2, theorem 368) we can write $$\label{E:loss2} \sigma^\star(x)=\max_{\pi}\sum_{k=1}^K d_k(x)\delta_{\pi(k)}-\sum_{k=1}^K d_k(x)\delta_k,$$ where $\pi$ varies over the set of permutations of $\{1,\cdots,K\}$. An alternative notation, which is sometimes useful, is $$\label{E:loss3} \sigma^\star(x)=\max_{P}\ \delta'Pd(x)-\delta'd(x),$$ with $P$ varying over all permutation matrices. The representation using the rearrangement inequality shows that $\sigma^\star$ is non-negative, and $\sigma^\star(x)=0$ if and only if there are no violations of monotonicity. Thus it is a proper loss function for non-metric scaling. The representation $\eqref{E:loss2}$ shows that $\sigma^\star$ is the difference of two convex functions, which implies it is continuous and differentiable almost everywhere. We assume throughout that $d_k(x)>0$ for all $k$. If there are no ties in $\delta$ or $d(x)$ then the maximizing permutation $\hat\pi$ is unique, and, by Danskin's theorem (@danskin_66), $\sigma_M$ is differentiable with $$\label{E:dsm} \mathcal{D}\sigma^\star(x)=\sum_{k=1}^K\frac{\delta_{\hat\pi(k)}-\delta_k}{d_k(x)}A_kx$$ Equation $\eqref{E:dsm}$ is basically the same as formula (4) on page 134 of @shepard_62a. Of course directly minimizing $\sigma_M$ over $x$ is not a useful technique, because if we set $x=0$ then we trivially find $\sigma_M(x)=0$. Thus we need some form of *normalization*. @shepard_62a, page 135, says > Finally, at the end of each iteration the adjusted coordinates undergo a "similarity" transformation to re-center the centroid at the origin and to re-scale all distances so that the mean interpoint separation is maintained at unity. (p 135) which seems to imply normalization by $\sum_{k=1}^K d_k(x)=1$. Thus we set $$\label{E:final} \sigma(x)=\frac{\sum_{k=1}^K(\hat\delta_k-\delta_k)d_k(x)}{\sum_{k=1}^Kd_k(x)}$$ which makes $$\label{E:grad} \mathcal{D}\sigma(x)=\frac{1}{\sum_{k=1}^Kd_k(x)}\sum_{k=1}^K\frac{(\delta_{\hat\pi(k)}-\delta_k)-\sigma(x)}{d_k(x)}A_kx$$ The usual gradient iterations are $$\label{E:prgr} x^{(k+1)}=x^{(k)}-\lambda^{(k)}\mathcal{D}\sigma(x^{(k)}),$$ with $\lambda^{(k)}$ some appropriate step-size. ##Kruskal and Carroll's Reconstruction @kruskal_carroll_77 in @kruskal_77 did very much the same thing as I am doing here, and as I presumably did in @deleeuw_R_73j. They also take the second component of Shepard's procedure, the one that aims at low-dimensionality. They reconstruct the fit function as $$\label{E:KC} \sigma_{KC}(x):=\frac{\sum_{k=1}^K(\hat\delta_k-\delta_k)d_k(x)+\alpha\sum_{k=1}^K(\hat\delta_k-\overline{\delta})d_k(x)}{\sqrt{\sum_{k=1}^K d_k^2(x)}}$$ Here $\overline{\delta}$ is the average dissimilarity and $\alpha$ is a trade-off factor that weighs the two components of the fit function. We have modified their treatment somewhat by formulating it in terms of dissimilarties. Note that $\eqref{E:KC}$ is normalized using the root-mean square of the distances, while $\eqref{E:final}$ uses the mean. This difference may again be the "minor alteration in a normalizing factor" that Shepard refers to. @kruskal_carroll_77 note that minimizing $\eqref{E:KC}$ is equivalent to minimizing the numerator over the quadratic surface $\{x\mid \sum_{k=1}^K d_k^2(x)=1\}$, which can be done by gradient projection. Take a step along the negative gradient and then project on the surface. This still leaves us with the problem of choosing the step-size and the trade-off factor $\alpha$, which Shepard solved basically by numerical experimentation. @kruskal_carroll_77 also mention that when working in low-dimensional space Shepard sets $\alpha=0$, and thus uses $\eqref{E:final}$, except again for the different normalizing factor. $$\label{E:gradnorm} \mathcal{D}\sigma_{KC}(x)=\frac{1}{\sqrt{\sum_{k=1}^Kd_k^2(x)}}\left\{\sum_{k=1}^K\frac{\hat\delta_k-\delta_k}{d_k(x)}-\frac{\sigma_{KC}(x)}{\sqrt{\sum_{k=1}^Kd_k^2(x)}}\right\}A_kx$$ In MDS problems we can also choose, by linear transformation of the coordinates, the $A_k$ so that they add up to the identity. Then #More on Differentiability If there are ties in either $\delta$ or $d(x)$ then the optimizing permutation will not be unique, and $\sigma^\star$ may only be directionally differentiable. Again by Danskin's theorem, the derivative in the direction $y$ is $$d\sigma^\star(x,y)=\max_{\pi\in\Pi(x)}\sum_{k=1}^K\frac{\delta_{\pi(k)}-\delta_k}{d_k(x)}y'A_kx$$ where the maximum is over all permutations of the dissimilarities $\hat\pi$ for which $\sum_{k=1}^K (\delta_{\hat\pi(k)}-\delta_k)d_k(x)=\max_\pi\sum_{k=1}^K (\delta_{\pi(k)}-\delta_k)d_k(x)$. The subdifferential is $$\partial\sigma^\star(x)=\mathop{\mathbf{conv}}_{\pi\in\Pi(x)}\left\{\frac{\delta_{\pi(k)}-\delta_k}{d_k(x)}A_kx\right\},$$ with $\mathbf{conv}$ the convex hull. Ties in the dissimilarities do not in themselves lead to non-differentiability. Suppose there are only $\ell$ different dissimilarity values, collect them in the vector $\eta$. The corresponding $\ell$ marginal frequencies are in $f$. Thus $\delta=G\eta$ for some $k\times\ell$ *indicator matrix* (a binary matrix with row sums equal to one). Then $$\sigma^\star(x)=\max_{G\in\mathcal{G}} \eta'Gd(x) -\delta'd(x),$$ where $\mathcal{G}$ is the set of all $\frac{n!}{f_1!\cdots f_\ell!}$ indicator matrices with marginal frequencies $f$. If the elements of $d(x)$ are different, then the maximizer is unique and thus $\sigma^\star$ is differentiable. It is of some interest to see what happens if $n_0$ dissimilarities are zero and $n_1$ dissimilarities are one. In that case we have $\sigma^\star(x)=0$ if and only if the $n_1$ distances corresponding with $\delta_k=1$ are the largest distances. If the $d_{[k]}(x)$ are the ordered distances, we have zero stress if $d_{[k]}\geq d_{[n_0]}$ for all $k>n_0$. We have differentiability if merely d_{[n_0]}(x) {r rectangle_conf_plot, fig.align = "center", echo = FALSE} par(mfrow = c(1,2), pty = "s") plot(h1x, type = "n", xlab = "dim 1", ylab = "dim 2") text (h1$x, as.character (1:16), col = "RED") plot(h2$x, type = "n", xlab = "dim 1", ylab = "dim 2") text (h2x, as.character (1:16), col = "RED")  r figure_nums("rectangle_conf") {r rectangle_shepard_plot, fig.align = "center", echo = FALSE} par(mfrow = c(1,2), pty = "s") plot(h1delta, h1$dist, xlab = "dissimilarities", ylab = "distances", col = "RED") plot(h2$delta, h2$dist, xlab = "dissimilarities", ylab = "distances", col = "RED")  r figure_nums("rectangle_shep") {r rectangle_analysis_norm, echo = FALSE} h3 <- smacofShepard62(delta, norm = 2)  We also compute the solution using loss function$\eqref{E:KC}$with$\alpha=0$instead of$\eqref{E:final}$. After r h3$itel iterations we find loss r h3$stress. The gradient at the solution is {r rectangle_grad_3, echo = FALSE} mprint (h3$grad, f = "+")  The solution in figure r figure_nums("rectangle_shep_norm", display = "n") is virtually indistinguishable from the one in figures r figure_nums("rectangle_conf", display = "n") and r figure_nums("rectangle_shep", display = "n").
{r rectangle_shepard_plot_norm, fig.align = "center", echo = FALSE} par(mfrow = c(1,2), pty = "s") plot(h3$x, type = "n", xlab = "dim 1", ylab = "dim 2") text (h3$x, as.character (1:16), col = "RED") plot(h3$delta, h3$dist, xlab = "dissimilarities", ylab = "distances", col = "RED") 
r figure_nums("rectangle_shep_norm")

##Dutch Political Parties 1967 As the next illustration we use data from @degruijter_67, with average dissimilarity judgments between Dutch political parties in 1967. The data are {r poldist-data, echo = FALSE} poldist <- structure( c( 5.63, 5.27, 4.6, 4.8, 7.54, 6.73, 7.18, 6.17, 6.72, 5.64, 6.22, 5.12, 4.59, 7.22, 5.47, 5.46, 4.97, 8.13, 7.55, 6.9, 4.67, 3.2, 7.84, 6.73, 7.28, 6.13, 7.8, 7.08, 6.96, 6.04, 4.08, 6.34, 7.42, 6.88, 6.36, 7.36 ), Labels = c("KVP", "PvdA", "VVD", "ARP", "CHU", "CPN", "PSP", "BP", "D66"), Size = 9L, call = quote(as.dist.default(m = polpar)), class = "dist", Diag = FALSE, Upper = FALSE ) print (poldist) poldata <- as.vector (poldist) polnames <- attr (poldist, "Labels")  {r pol_analysis, echo = FALSE} h <- smacofShepard62 (poldata, verbose = FALSE, eps = 1e-15, itmax=1000)  The process converges in r h$itel iterations to loss function value r h$stress. The gradient at the solution is {r pol_grad, echo = FALSE} mprint (hgrad, f = "+")  {r pol_plot_conf, fig.align = "center", echo = FALSE} par (pty = "s") plot (hx, type = "n", xlab = "dim 1", ylab = "dim 2", xlim = c(-5, 5), ylim = c(-5, 3)) text (hx, polnames, col = "RED")  r figure_nums("gruijter_conf") {r pol_plot_shep, fig.align = "center", echo = FALSE} par (pty = "s") plot (hdelta, h$dist, xlab = "dissimilarities", ylab = "distances", col = "RED")  r figure_nums("gruijter_shep") ##Ekman Color Data The final example are the color data from @ekman_54. {r ekman_data, echo = FALSE} data (ekman, package = "smacof") delta <- as.vector ((1 - ekman) ^ 3) colnames <- as.character (attributes(ekman)$Labels)  {r ekman_analysis, echo = FALSE} h <- smacofShepard62 (delta, verbose = FALSE, eps = 1e-15, itmax=1000)  The process converges in r h$itel iterations to loss function value r as.character(h$stress). The gradient at the solution is {r ekman_grad, echo = FALSE} mprint (hgrad, f = "+")  {r ekman_plot_conf, fig.align = "center", echo = FALSE} par (pty = "s") plot (hx, type = "n", xlab = "dim 1", ylab = "dim 2", xlim = c(-.5,.5), ylim = c(-.5,.5)) text (hx, colnames, col = "RED")  r figure_nums("ekman_conf") {r ekman_plot_shep, fig.align = "center", echo = FALSE} par (pty = "s") plot (hdelta, h$dist, xlab = "dissimilarities", ylab = "distances", col = "RED")  r figure_nums("ekman_shep") #Discussion As an aside the Shepard loss function is similar to the loss function used in the Guttman-Lingoes programs (@guttman_68a, @lingoes_roskam_73). $$\sigma_{GL}^\star(x):=\frac12\sum_{k=1}^K(d_k(x)-d_k^\star(x))^2=\sum_{k=1}^K(d_k(x)-d_k^\star(x))d_k(x),$$ where the$d_k^\star(x)$are the *rank-images*, i.e. the$d_k(x)\$ permuted in such a way that they are monotonic with the dissimilarities. Thus instead of sort(delta)[rank(dist)] we use sort(dist)[rank(delta)]. There is no arithmetic done with dissimilarities, only their rank order is used. But the rank images do not have any clear optimality properties. As a consequence they are considerably less smooth than loss based on monotone regression or on Shepard's rearrangements. The report @deleeuw_R_73g is of interest in this context, in part because the pdf includes a copy of the report agressively annotated by Joseph Kruskal, as well as some obfuscating correspondence with Jim Lingoes. Also see Appendix 2 of @kruskal_77. If one wants to use rank images, it would make more sense to use $$\sigma^\star_{LG}:=\sum_{k=1}^K(d_k^\star-d_k)\delta_k,$$ because, again, by the rearrangement inequality this is a directionally differentiable difference of two convex functions similar to the one used by Shepard. Shepard himself *evaluates* the departure from monotonicity, after convergence, using $$\frac12\sum_{k=1}^K(\delta_k-\hat\delta_k)^2=\sum_{k=1}^K(\delta_k-\hat\delta_k)\delta_k=\sum_{k=1}^K(\hat\delta_k-\delta_k)\hat\delta_k,$$ which is again hopelessly non-smooth. #Appendix: Code ##shepard.R {r file_auxilary, code = readLines("shepard.R")}  ##auxilary.R {r file_auxilary, code = readLines("auxilary.R")}  ##mdsUtils.R {r file_auxilary, code = readLines("mdsUtils.R")}  ##smacofShepard62.R {r file_auxilary, code = readLines("smacofShepard62.R")}  #References