```{r, echo = FALSE}
hh <- list(h1, h2, h3, h4, h5, h6)
for (i in 1:6) {
  cat("eps:", formatC(10 ^ (-i), digits = 1, format = "e"),
      "itel:", formatC(hh[[i]]$itel, width = 2, format = "d"),
      "phi_e:", formatC(hh[[i]]$p, digits = 6, format = "f"),
      "phi:", formatC(hh[[i]]$f, digits = 6, format = "f"),
      "x:", formatC(hh[[i]]$x, digits = 6, format = "f"), "\n")
}
```

For this example, in which there are only a few ties, we do not expect the secondary approach to differ much from the primary one.

```{r neumann_seco}
ss[ss == 0] <- 1
diag(ss) <- 0
```

```{r neumann_seco_run, echo = FALSE, cache = TRUE}
hns <- linpom(f, ss, aeps = 1e-6, eps = 1e-10, itmax = 100, verbose = FALSE)
```

And indeed, the results are virtually the same, although the fit function is slightly different.

```{r neumann_seco_out, echo = FALSE}
cat("eps:", formatC(1e-6, digits = 1, format = "e"),
    "itel:", formatC(hns$itel, width = 2, format = "d"),
    "phi_e:", formatC(hns$p, digits = 6, format = "f"),
    "phi:", formatC(hns$f, digits = 6, format = "f"),
    "x:", formatC(hns$x, digits = 6, format = "f"))
```
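The secondary approach is implemented above by modifying the sign matrix in place: tied pairs, which have $\sigma_{ij}=0$ under the primary approach, get $\sigma_{ij}=+1$ in both directions, which forces the corresponding predicted values to be equal. A minimal illustration on a hypothetical four-element vector with one tie (variable names are chosen so as not to clash with the analysis above):

```{r neumann_seco_sketch}
z_ex <- c(1, 2, 2, 3)                # hypothetical data with one tie
s_ex <- sign(outer(z_ex, z_ex, "-")) # primary approach: the tied pair gets zeros
s_ex[s_ex == 0] <- 1                 # secondary approach: +1 in both directions
diag(s_ex) <- 0                      # an element is never compared with itself
s_ex
```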

We can also eliminate redundant inequalities, and only have $\sigma_{ij}=+1$ when $\zeta_j$ is the largest of all data elements strictly less than $\zeta_i$, i.e. the next element in the rank order. If there are ties we have $\sigma_{ij}=+1$ for more than one element, and within tie blocks we have $\sigma_{ij}=0$.

```{r neumann_red}
n <- length(g)
st <- matrix(0, n, n)
for (i in 1:n) {
  ind <- which(g < g[i])
  if (length(ind) == 0) next
  mnd <- max(g[ind])
  jnd <- which(g[ind] == mnd)
  st[i, ind[jnd]] <- 1
}
```

The `linpom()` analysis gives results on a quite different scale.

```{r neumann_red_run}
dim(st)
dim(f)
hnr <- hns
#hnr <- linpom(f, st, aeps = 1e-6, xold = NULL, eps = 1e-10, itmax = 100, verbose = FALSE)
```

```{r neumann_red_out, echo = FALSE}
cat("eps:", formatC(1e-6, digits = 1, format = "e"),
    "itel:", formatC(hnr$itel, width = 2, format = "d"),
    "phi_e:", formatC(hnr$p, digits = 6, format = "f"),
    "phi:", formatC(hnr$f, digits = 6, format = "f"),
    "x:", formatC(hnr$x, digits = 6, format = "f"))
```
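The reduction loop is worth a small check. On a hypothetical vector it produces ones in each row only for the next-smaller element(s), with several entries per row when those are tied (names again chosen to avoid the analysis variables):

```{r neumann_red_sketch}
z_red <- c(3, 1, 2, 2, 5)            # hypothetical data, with a tie at 2
m_red <- length(z_red)
st_red <- matrix(0, m_red, m_red)
for (i in 1:m_red) {
  ind_r <- which(z_red < z_red[i])   # everything strictly smaller than z_red[i]
  if (length(ind_r) == 0) next       # the minimum has no predecessor
  mx_r <- max(z_red[ind_r])          # the next element down in the rank order
  st_red[i, ind_r[which(z_red[ind_r] == mx_r)]] <- 1
}
st_red
```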

Although the coefficients and fit function values are very different from the ones we had before, the plots of "observed" vs "predicted" are very similar.

```{r figs, fig.align = "center", echo = FALSE}
plot(g, f %*% h6$x, xlab = "Observed", ylab = "Predicted", col = "RED", main = "Redundant")
plot(g, f %*% hnr$x, xlab = "Observed", ylab = "Predicted", col = "RED", main = "Reduced")
```

The correlation between observed and predicted is `r cor(g, f %*% h6$x)` for the "redundant" solution and `r cor(g, f %*% hnr$x)` for the "reduced" solution. Kendall's tau is `r Kendall(g, f %*% h6$x)$tau` and `r Kendall(g, f %*% hnr$x)$tau`, respectively.

# Binary Regression

If the data are binary (success/failure, yes/no, alive/dead, trump/clinton) then obviously the treatment of ties becomes rather essential. The primary approach seems more natural. It tries to find a separating hyperplane, and is thus similar to logistic regression, support vector machines, and discriminant analysis. As an example we use the breast cancer data from @newman_hettich_blake_merz_98, included in the R package mlbench (@leisch_dimitriadou_10).

```{r cancer_get}
data(BreastCancer)
bccom <- BreastCancer[complete.cases(BreastCancer), ]
g <- as.numeric(bccom$Class)
g <- 2 * g - 3
h <- as.matrix(bccom[, c(-1, -11)])
f <- array(0, dim(h))
for (j in 1:9) f[, j] <- as.numeric(h[, j])
s <- sign(outer(g, g, "-"))
```

```{r cancer_p, cache = TRUE}
hbin <- linpom(f, s, aeps = 1e-6, eps = 1e-10, itmax = 1000, verbose = FALSE)
```

```{r cancer_p_out, echo = FALSE}
cat("eps:", formatC(1e-6, digits = 1, format = "e"),
    "itel:", formatC(hbin$itel, width = 2, format = "d"),
    "phi_e:", formatC(hbin$p, digits = 6, format = "f"),
    "phi:", formatC(hbin$f, digits = 6, format = "f"),
    "x:", formatC(hbin$x, digits = 6, format = "f"))
```
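For a binary response coded $-1/+1$ the sign matrix used above has a simple block structure, which a tiny hypothetical example makes explicit:

```{r cancer_sign_sketch}
g_ex <- c(-1, -1, 1, 1)               # two negative and two positive instances
s_sgn <- sign(outer(g_ex, g_ex, "-"))
s_sgn
# within a class every comparison is a tie (zero); s_sgn[i, j] is +1
# exactly when instance i is positive and instance j is negative
```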

The secondary approach is conceptually not very appealing. We require the predicted values for all negative instances to be equal, and similarly for the predicted values of the positive instances.

```{r cancer_s, cache = TRUE}
s[s == 0] <- 1
diag(s) <- 0
hus <- linpom(f, s, aeps = 1e-6, eps = 1e-10, itmax = 1000, verbose = FALSE)
```

```{r cancer_s_out, echo = FALSE}
cat("eps:", formatC(1e-6, digits = 1, format = "e"),
    "itel:", formatC(hus$itel, width = 2, format = "d"),
    "phi_e:", formatC(hus$p, digits = 6, format = "f"),
    "phi:", formatC(hus$f, digits = 6, format = "f"),
    "x:", formatC(hus$x, digits = 6, format = "f"))
```

Both the primary and the secondary approach have the disadvantage that the matrix $\Sigma=\{\sigma_{ij}\}$ is quite large. To make the computations more efficient, and more like the usual computations in logistic regression and the like, we define our positive orthant inequalities as
$$
\sigma_{i}f_i'x\geq 0,
$$
where $\sigma_i=+1$ if instance $i$ is a positive response and $\sigma_i=-1$ if it is negative. We also add a column of ones to $F$ to estimate a cutoff value. The fit function becomes
$$
\phi(x)=\frac{\sum_{i=1}^n w_i\sigma_if_i'x}{\sum_{i=1}^n w_i|f_i'x|},
$$
where we assume, without any real loss of generality, that all $w_i$ are positive and all $\sigma_i$ are non-zero. We can now apply exactly the same results as before to find a majorization algorithm, which we have implemented in the R function `binpom()`. The inequality $\eqref{E:beps}$ is still valid, but instead of $\eqref{E:bbeps}$ we now have
\begin{equation}\label{E:bbin}
B_\epsilon(y)\mathop{=}\limits^\Delta\sum_{i=1}^n\frac{w_i}{\sqrt{(f_i'y)^2+\epsilon}}f_i^{\ }f_i'.
\end{equation}

```{r binpom, cache = TRUE}
hbin <- binpom(f, g, aeps = 1e-6, eps = 1e-10, itmax = 500, verbose = FALSE)
```

```{r binpom_out, echo = FALSE}
cat("eps:", formatC(1e-6, digits = 1, format = "e"),
    "itel:", formatC(hbin$itel, width = 2, format = "d"),
    "phi_e:", formatC(hbin$p, digits = 6, format = "f"),
    "phi:", formatC(hbin$f, digits = 6, format = "f"),
    "x:", formatC(hbin$x, digits = 6, format = "f"))
```
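The update implied by $\eqref{E:bbin}$ can be sketched in a few lines of R. This is a reconstruction under stated assumptions (unit weights $w_i=1$, the smoothed denominator $\sqrt{(f_i'y)^2+\epsilon}$, and normalization of $x$ for numerical stability only, since $\phi$ is scale free), not the actual `binpom()` source; the name `binpom_sketch()` is hypothetical:

```{r binpom_sketch}
# reconstruction of a majorization step for the binary fit function, assuming
# unit weights; because the numerator of phi is linear in x, and the smoothed
# denominator is majorized by the quadratic form x' B_eps(y) x, each iteration
# solves a single linear system
binpom_sketch <- function(f, g, aeps = 1e-6, eps = 1e-10, itmax = 500) {
  a <- drop(crossprod(f, g))                  # gradient of the linear numerator
  x <- rep(1, ncol(f))
  fold <- -Inf
  for (itel in 1:itmax) {
    t <- drop(f %*% x)
    b <- crossprod(f, f / sqrt(t ^ 2 + aeps)) # B_eps(x) from the text
    x <- drop(solve(b, a))                    # majorization update
    x <- x / sqrt(sum(x ^ 2))                 # phi is scale free; normalize
    fnew <- sum(g * (f %*% x)) / sum(abs(f %*% x))
    if (fnew - fold < eps) break
    fold <- fnew
  }
  list(itel = itel, f = fnew, x = x)
}
```

On a perfectly separable toy problem, say `f <- cbind(1, c(-2, -1, 1, 2))` with `g <- c(-1, -1, 1, 1)`, the sketch drives the fit to one in a couple of iterations.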

# Paired Comparisons

In a typical paired comparison experiment each of $m$ subjects generates a matrix $\Sigma_k$ with elements $-1$ and $+1$. If we have unit weights, the vector $\rho_k$ has elements
$$
\rho_{ik}=\sum_{j=1}^n\ (\sigma_{ijk}-\sigma_{jik}).
$$
The results of paired comparison experiments are often presented in aggregated form, summed over subjects or occasions. As an example, consider the vegetable data of Guilford, from the psych package (@revelle_18).

```{r}
library(psych)
data(vegetables)
veg
```

In the aggregated table $N$, cell $n_{ij}$ indicates the number of individuals who prefer $i$ to $j$, i.e. the number (or proportion) of individuals for which $\sigma_{ijk}=+1$. Thus
$$
\sum_{k=1}^m\sigma_{ijk}=n_{ij}-n_{ji},
$$
and
$$
\sum_{k=1}^m\rho_{ik}=2\sum_{j=1}^n (n_{ij}-n_{ji}).
$$

```{r}
s <- as.matrix(veg)
s <- s - t(s)
hpc <- pcpom(s, aeps = 1e-6, eps = 1e-10, itmax = 1000, verbose = FALSE)
```

```{r pcpom_out, echo = FALSE}
cat("eps:", formatC(1e-6, digits = 1, format = "e"),
    "itel:", formatC(hpc$itel, width = 2, format = "d"),
    "phi_e:", formatC(hpc$p, digits = 6, format = "f"),
    "phi:", formatC(hpc$f, digits = 6, format = "f"),
    "x:", formatC(hpc$x / max(hpc$x), digits = 6, format = "f"))
```

In this case we find a scale with a single positive element, corresponding to the largest element of $\rho$, while the other $n-1$ elements are all negative and equal. For such solutions the fit is $\max(\rho)/(n-1)$, or in our case `r max(rowSums(s)/8)`. Clearly such a solution is disappointing, because we do not need elaborate majorization iterations to invariably come up with the largest element of a vector.

# Multidimensional Scaling

Multidimensional scaling of squared distances means $f_i(x)=x'A_ix$. Thus globally maximizing $\phi$ means maximizing a non-convex quadratic form over a region defined by $n!$ non-convex quadratic inequality constraints. This seems a pretty hopeless task. A mental picture of the problem also suggests how serious the local maximum problem is bound to be. Scaling of distances has $f_i(x)=\sqrt{x'A_ix}$, which is even more complicated than the squared distance case. So the best we can do is start an infinite iterative procedure with an initial solution that is as close as possible, using some form of classical scaling, and then improve the solution with gradient or majorization steps until we find a local maximum. Again, this is a topic for further research.

# History

The history of positive orthant type loss functions is actually more complicated than our previous discussion suggests. I review that history here, which will give me the opportunity to reconstruct my thoughts on the matter of fifty years ago.

The idea of measuring fit by looking at differences between model values was inspired by early work of @kendall_38 and @daniels_44. Just to refresh your memory: suppose $x$ and $y$ are two vectors of length $n$ and we want to measure some form of correlation between the two. Here is a nice classical quotation.

"Let us assign to each pair $(x_i,x_j)$ what for convenience will be called a score $a_{ij}$ and to each $(y_i,y_j)$ a score $b_{ij}$, where
$$
a_{ij}=-a_{ji},\hspace{15 pt}b_{ij}=-b_{ji}.
$$
Denote by $\Gamma$ the number
$$
\Gamma=\frac{\sum a_{ij}b_{ij}}{\sqrt{\sum a_{ij}^2\sum b_{ij}^2}},
$$
the summation extending over all $i$ and $j$ from 1 to $n$." (Daniels, 1944, p. 129).

Daniels goes on to point out that $a_{ij}=x_i-x_j$ gives the Pearson correlation, $a_{ij}=\mathbf{sign}(x_i-x_j)$ gives Kendall's tau, and $a_{ij}=\mathbf{rank}(x)_i-\mathbf{rank}(x)_j$ gives Spearman's rho. I am quite sure that this idea inspired my early work. In @deleeuw_R_68e fit measures for a model $F=\{f_i\}$ are introduced in the form
$$
\sum_{i=1}^n\sum_{j=1}^n \sigma_{ij}(f_i(x)-f_j(x)),
$$
where the anti-symmetric weights $\sigma_{ij}$ are derived from the data $\zeta$ and are either Kendall, Spearman, or Pearson weights. There was another very important influence, which we will discuss next.

Guttman's work on scale construction, while he was working for the army during WW II, still looms over much of psychometrics. It was elaborated on and tinkered with, but it was never improved, because it was perfect. For our purposes the key publication is @guttman_46, which curiously enough appeared in the Annals of Mathematical Statistics.

"The new criterion adopted is that the numerical values be determined so as best to distinguish between those things judged higher and those judged lower *for each individual*." (@guttman_46, p. 144)

Guttman's notation for paired comparisons is somewhat different from ours. He codes paired comparisons using zero and one, instead of plus and minus one. Thus for subject or replication $k$ define
$$
\epsilon_{ijk}\mathop{=}\limits^\Delta\frac12(\sigma_{ijk}+1).
$$
Now suppose $x$ is the scale we are looking for, in deviations from the mean.
The average value of the objects that subject $k$ judges higher or larger or better than something else is
$$
t_k=\frac{\sum_{j=1}^n\sum_{i=1}^n\epsilon_{ijk}x_j}{\sum_{j=1}^n\sum_{i=1}^n\epsilon_{ijk}},
$$
and the average of the objects judged smaller is
$$
u_k=\frac{\sum_{j=1}^n\sum_{i=1}^n\epsilon_{jik}x_j}{\sum_{j=1}^n\sum_{i=1}^n\epsilon_{jik}}.
$$
We now want to maximize
$$
\sum_{k=1}^m(t_k-u_k)^2
$$
over $x$ normalized by $x'x=1$. In a complete paired comparison experiment, in which $\epsilon_{ijk}+\epsilon_{jik}=1$ for all $i\not= j$, we have $t_k=-u_k$. We also have
$$
t_k=\frac{\frac12\sum_{j=1}^n x_j\sum_{i=1}^n\sigma_{ijk}}{\frac12 n(n-1)}=\frac{2\ x'\rho_k}{n(n-1)},
$$
where the $\rho_k$ are the ranks corresponding to the $\sigma_{ijk}$. Thus we can maximize
$$
\sum_{k=1}^m t_k^2=x'\left\{\sum_{k=1}^m\rho_k^{\ }\rho_k'\right\}x
$$
over $x'x=1$, which means we find the largest eigenvalue and corresponding eigenvector of the cross-product matrix of the centered rank numbers. This means that Guttman's signature based method is equivalent to the method of @slater_60 or @carroll_chang_64, which would later come to be identified with the MDPREF program (@carroll_chang_68).

We start with @deleeuw_R_68d on nonmetric discriminant analysis. This proposes the loss function
$$
\tau(x)\mathop{=}\limits^\Delta\frac{\sum_{i=1}^n(|t_i|-t_i)^2}{4\sum_{i=1}^n t_i^2},
$$
where $t_i=\sigma_{i}f_i'x$. The partial derivatives are given, and a gradient method is suggested. The actual algorithm in the report, implemented in PL/I for the IBM 360/50, minimizes
$$
\kappa(x,y)\mathop{=}\limits^\Delta (Fx-y)'(Fx-y)
$$
over $x$ and $y\geq 0$ by alternating least squares (in fact, this is possibly the first use of that term). Of course
$$
\min_{y\geq 0}\kappa(x,y)=\frac14\sum_{i=1}^n(|t_i|-t_i)^2,
$$
which is the numerator of $\tau$. Although this is not mentioned in the report, this result also shows that $\tau$ is a differentiable function of $f$, despite the absolute value signs.
After each update of $x$ it is normalized by $x'F'Fx=1$. The ALS process converges to a local minimum of both $\kappa$ and $\tau$. A note in the back of the report says: "Although there is no program for NDA in the G-L-series as yet, professor Louis Guttman informs me that he would favor the absolute value principle. From the (up to now) rather sketchy accounts of this principle I gather it is quite similar to my positive orthant method."

@deleeuw_R_68g defines the positive orthant method, in the context of nonmetric multidimensional scaling, as the minimization of one of a family of loss functions. As in $\eqref{E:ineq}$ we define
$$
t_{ij}(x)\mathop{=}\limits^\Delta \sigma_{ij}(f_i(x)-f_j(x)),
$$
and then, for some $q>0$,
\begin{equation}\label{E:tau}
\tau_q(x)\mathop{=}\limits^\Delta\frac{\sum_{i=1}^n\sum_{j=1}^nw_{ij}(|t_{ij}(x)|-t_{ij}(x))^q}{2^q\sum_{i=1}^n\sum_{j=1}^nw_{ij}|t_{ij}(x)|^q}.
\end{equation}
The matrix $W=\{w_{ij}\}$ has non-negative weights. We assume, without loss of generality, that $w_{ij}|\sigma_{ij}|=w_{ij}$, i.e. if $\sigma_{ij}=0$ then also $w_{ij}=0$. Clearly $0\leq\tau_q(x)\leq 1$, with $\tau_q(x)=0$ if and only if $t_{ij}(x)\geq 0$ for all $i$ and $j$. Using $w_{ij}|\sigma_{ij}|=w_{ij}$ and $\mathbf{sign}(\sigma_{ij})=\sigma_{ij}$ we can show that $\tau_q(x)=\frac12(1-\phi_q(x))$, where
$$
\phi_q(x)=\frac{\sum_{i=1}^n\sum_{j=1}^nw_{ij}\sigma_{ij}(f_i(x)-f_j(x))|f_i(x)-f_j(x)|^{q-1}}{\sum_{i=1}^n\sum_{j=1}^nw_{ij}|f_i(x)-f_j(x)|^q}.
$$
Thus minimizing $\tau_q$ is equivalent to maximizing $\phi_q$. See also @young_75 or @hartmann_79 for additional accounts of this method, and some historical context. @deleeuw_R_68g and @deleeuw_R_70a discussed the general case $q>0$ and the limiting cases with $q\rightarrow 0$ and $q\rightarrow +\infty$, but the reports are computationally rather naive, because they just suggest applying the gradient method, ignoring the fact that the absolute value function is not differentiable.
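The members of the family $\eqref{E:tau}$, and the corresponding $\phi_q$, are easy to evaluate directly for a given configuration of model values. A minimal sketch, assuming unit weights $w_{ij}=|\sigma_{ij}|$ (the function names are illustrative); it can also be used to verify the identity $\tau_q(x)=\frac12(1-\phi_q(x))$ numerically:

```{r tau_q_sketch}
# direct evaluation of tau_q and phi_q, with unit weights w_ij = |sigma_ij|
tau_q <- function(fx, sigma, q = 1) {
  w <- abs(sigma)
  t <- sigma * outer(fx, fx, "-")      # t_ij = sigma_ij (f_i(x) - f_j(x))
  sum(w * (abs(t) - t) ^ q) / (2 ^ q * sum(w * abs(t) ^ q))
}
phi_q <- function(fx, sigma, q = 1) {
  d <- outer(fx, fx, "-")              # f_i(x) - f_j(x)
  sum(abs(sigma) * sigma * d * abs(d) ^ (q - 1)) / sum(abs(sigma) * abs(d) ^ q)
}
```

For a configuration that reproduces the data order exactly, $\tau_q$ is zero.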
From a computational point of view the most interesting members of the family $\eqref{E:tau}$ are $\tau_1$ and $\tau_2$. The case $q=2$ has been discussed in @guttman_69, as the "absolute value principle", and in @johnson_73, as the "pairwise method", again just suggesting the use of gradient methods. For these two cases
$$
\phi_2(x)=\frac{\sum_{i=1}^n\sum_{j=1}^nw_{ij}\sigma_{ij}\mathbf{sign}(f_i(x)-f_j(x))(f_i(x)-f_j(x))^2}{\sum_{i=1}^n\sum_{j=1}^nw_{ij}(f_i(x)-f_j(x))^2},
$$
and
$$
\phi_1(x)=\frac{\sum_{i=1}^n\sum_{j=1}^nw_{ij}\sigma_{ij}(f_i(x)-f_j(x))}{\sum_{i=1}^n\sum_{j=1}^nw_{ij}|f_i(x)-f_j(x)|}.
$$

The positive orthant method is introduced in @deleeuw_R_68g. Chapter 3 in that report introduces the loss function
$$
\tau_q(x)\mathop{=}\limits^\Delta\frac{\sum_{i=1}^nw_i(|t_i|-t_i)^q}{2^q\sum_{i=1}^nw_i|t_i|^q}.
$$
The PL/I program NMSPOM for minimizing $\tau_q$ in the case of multidimensional scaling, using various gradient methods, is described, and some examples are given. The next report in the sequence is @deleeuw_R_69c. In our notation it suggests maximizing
$$
\psi(x)=\frac{\sum_{i=1}^n\sum_{j=1}^n\sigma_{ij}(f_i'x-f_j'x)}{\sqrt{\sum_{i=1}^n (f_i'x)^2}}.
$$

# Code

## linpom.R

```{r file_linpom, code = readLines("linpom.R")}
```

## binpom.R

```{r file_binpom, code = readLines("binpom.R")}
```

## pcpom.R

```{r file_pcpom, code = readLines("pcpom.R")}
```

# References