Note

The directory gifi.stat.ucla.edu/homals has a pdf copy of this article, the complete Rmd file with all code chunks, and R and C files with the code.

Loss Function

In the multivariate analysis techniques presented in this paper the data are measurements or categorizations of \(n\) objects by \(p\) variables. Variables are partitioned into \(m\) sets of variables, with set \(j\) containing \(p_j\) variables. Thus \(\sum_{j=1}^m p_j=p\). The number of variables in a set is also called the rank of the set.

Homogeneity Analysis, as defined in this paper, is the minimization of the loss function \[ \sigma(X,H,A)=\frac{1}{m}\sum_{j=1}^m\mathbf{SSQ}\ (X-H_jA_j),\tag{1} \] where \(\mathbf{SSQ}()\) is the sum of squares.

The loss function depends on three sets of parameters \(X, H\) and \(A\). One of them, the matrix \(X\), is common to all sets. The data are in the matrix \(H_j\), which codes observations on the \(p_j\) variables in set \(j\). Homogeneity analysis does not operate on the variables directly, but on admissible transformations of the variables. Matrix \(A_j\) has coefficients used to make linear combinations of the variables in set \(j\).

  1. The \(n\times r\) matrix \(X\) of object scores is required to be column-centered and normalized by \(X'X=I\). The dimensionality \(r\) is chosen by the user.
  2. The \(n\times p_j\) matrices of transformed variables \(H_j\) have columns \(h_{j\ell}\), with \(\ell=1,\cdots,p_j\). We require \(h_{j\ell}\in\mathcal{K}_{j\ell}\cap\mathcal{S}\), where \(\mathcal{K}_{j\ell}\) is a cone of admissible transformations and \(\mathcal{S}\) is the unit sphere in \(\mathbb{R}^n\). For example, \(\mathcal{K}_{j\ell}\) can be the cone of all centered vectors in \(\mathbb{R}^n\) that are monotone with the data.
  3. The \(p_j\times r\) matrices \(A_j\) of component loadings are unconstrained.

Thus we make linear combinations of the transformed variables in such a way that the set scores \(H_jA_j\) are as close as possible to the object scores \(X\), and consequently as close as possible to each other. This explains the name of the technique: we want to make the \(m\) set scores as homogeneous as possible. Note that homogeneity analysis is both linear and non-linear, in the sense that it makes linear combinations of non-linear transformations of the variables.
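To make the loss concrete, here is a minimal R sketch that evaluates (1) for given parameters. The function name gifiLoss is ours, purely for illustration, and is not part of any released package.

```r
# Hedged sketch: evaluate loss (1) for an n x r matrix X and lists
# H = list(H_1, ..., H_m) and A = list(A_1, ..., A_m). Illustrative only.
gifiLoss <- function(X, H, A) {
  m <- length(H)
  sum(sapply(seq_len(m), function(j) sum((X - H[[j]] %*% A[[j]])^2))) / m
}
```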

The history of loss function (1) is complicated. Least squares and eigenvalue methods for quantifying multivariate qualitative data were introduced by Guttman (1941). They were taken up by, among others, De Leeuw (1968) and by Benzécri and his students (Cordier 1965). In this earlier work the emphasis was often on optimizing quadratic forms, or ratios of quadratic forms, and not so much on least squares, distance geometry, and so-called biplots (Gower and Hand 1996).

In De Leeuw (1974) a first attempt was made to unify most classical descriptive multivariate techniques using a single least squares loss function and a corresponding alternating least squares (ALS) optimization method. Guttman’s quantification method, which later became known as multiple correspondence analysis, was merged with linear and nonlinear principal component analysis in the HOMALS/PRINCALS techniques and programs (De Leeuw and Van Rijckevorsel 1980). The homogeneity analysis loss function that was chosen ultimately, for example in the work of Van der Burg, De Leeuw, and Verdegaal (1988), had been used earlier by Carroll (1968) in multi-set canonical correlation analysis.

In the Gifi system (Gifi 1990, Michailidis and De Leeuw (1998)) a slightly different parametrization, and a correspondingly different ALS algorithm, were used. The loss function used by Gifi is \[ \sigma(X,Y)=\frac{1}{m}\sum_{j=1}^m\mathbf{SSQ}\ (X-\sum_{\ell=1}^{p_j}G_{j\ell}Y_{j\ell}),\tag{2} \] where the \(G_{j\ell}\) are known spanning matrices for the cones of admissible transformations, and the \(Y_{j\ell}\) are \(k_{j\ell}\times r\) matrices of category quantifications. By using the full rank decompositions \(Y_{j\ell}=Z_{j\ell}A_{j\ell}\) we can show that (1) and (2) are essentially the same. We feel that the new parametrization in terms of \(H_j\) and \(A_j\) has some conceptual and computational advantages.

Algorithm

The standard way to minimize loss function (2) is implemented in the OVERALS program (Van der Burg, De Leeuw, and Verdegaal 1988, Meulman and Heiser (2011)). It is also the one used in the homals package (De Leeuw and Mair 2009).

In this paper the algorithm is different because we use the loss function (1). We still use ALS, which means in this case that we cycle through three substeps in each iteration. We update \(A\) for given \(X\) and \(H\), we then update \(X\) for given \(H\) and \(A\), and finally we update \(H\) for given \(X\) and \(A\). Algorithm A goes as follows.

  1. Set \(k=0\) and start with some \(X^{(0)},H^{(0)},A^{(0)}\).
  2. \(X^{(k+1)}=\mathbf{ortho}(\mathbf{center}(\sum_{j=1}^m H_j^{(k)}A_j^{(k)}))\).
  3. For \(j=1,\cdots,m\) compute \(A_j^{(k+1)}=\{H_j^{(k)}\}^+X^{(k+1)}\).
  4. For \(j=1,\cdots,m\) and \(s=1,\cdots,p_j\) compute \(h_{js}^{(k+1)}=\mathbf{proj}_{\mathcal{K}_{js}\cap\mathcal{S}}((X^{(k+1)}-\sum_{t<s}h_{jt}^{(k+1)}\{a_{jt}^{(k+1)}\}'-\sum_{t>s}h_{jt}^{(k)}\{a_{jt}^{(k+1)}\}')a_{js}^{(k+1)})\).
  5. If converged stop. Else \(k\leftarrow k+1\) and go to step 2.

In step 3 we use the superscript \(+\) for the Moore-Penrose inverse. In step 2 the center operator does column centering, and the ortho operator finds an orthonormal basis for the column space of its argument.
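As a minimal sketch, and assuming the QR-based choice discussed under the implementation details below, the two operators of step 2 could look as follows. These are illustrative helpers, not the package code.

```r
# Hedged sketches of the step 2 operators: column centering and an
# orthonormal basis for the column space, here computed from a QR decomposition.
center <- function(Z) {
  scale(Z, center = TRUE, scale = FALSE)
}
ortho <- function(Z) {
  qr.Q(qr(Z))
}
```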

The complicated part is step 4, the optimal scaling, i.e. the updating of \(H_j\) for given \(X\) and \(A_j\). We cycle through the variables in the set, each time projecting a single column on the cone of admissible transformations of the variable, and then normalizing the projection to length one. The target, i.e. the vector we are projecting, is complicated, because the other variables in the same set must be taken into account.

In order to simplify the optimal scaling computations within an iteration we can use majorization (De Leeuw 1994, Heiser (1995), Lange, Hunter, and Yang (2000), De Leeuw (2015a)). This has the additional benefit that the optimal scaling step becomes embarrassingly parallel. We expand the loss for set \(j\) around a previous solution \(\tilde H_j\). \[ \mathbf{SSQ}(X-H_jA_j)= \mathbf{SSQ}(X-\tilde H_jA_j)-2\mathbf{tr}\ (H_j-\tilde H_j)'(X-\tilde H_jA_j)A_j' +\mathbf{tr}\ A_j'(H_j-\tilde H_j)'(H_j-\tilde H_j)A_j. \] Now \[ \mathbf{tr}\ (H_j-\tilde H_j)A_jA_j'(H_j-\tilde H_j)'\leq\kappa_j\ \mathbf{tr}\ (H_j-\tilde H_j)'(H_j-\tilde H_j), \] where \(\kappa_j\) is the largest eigenvalue of \(A_j'A_j\). Thus \[ \mathbf{SSQ}(X-H_jA_j)\leq\mathbf{SSQ}(X-\tilde H_jA_j)+\kappa_j\ \mathbf{SSQ}(H_j-U_j)-\frac{1}{\kappa_j}\ \mathbf{SSQ}((X-\tilde H_jA_j)A_j'), \] where \(U_j\) is the target \[ U_j=\tilde H_j+\frac{1}{\kappa_j}(X-\tilde H_jA_j)A_j'.\tag{3} \] It follows that we can update the optimal scaling of the variables by projecting the columns of \(U_j\) on their respective cones and then normalizing. See De Leeuw (1975) for results on normalized cone regression. This can be done for all variables in the set separately, without taking any of the other variables in the set (or in any of the other sets) into account. Thus the optimal scaling is easy to parallelize. The resulting algorithm B is as follows.
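A minimal R sketch of the majorization target (3) for a single set, assuming the current \(\tilde H_j\) and \(A_j\) are passed in; the function name is ours and illustrative only.

```r
# Hedged sketch: majorization target U_j of (3), with kappa_j the largest
# eigenvalue of A_j'A_j. Htilde is the current H_j, A the current A_j.
majorizeTarget <- function(X, Htilde, A) {
  kappa <- max(eigen(crossprod(A), symmetric = TRUE, only.values = TRUE)$values)
  Htilde + (X - Htilde %*% A) %*% t(A) / kappa
}
```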

  1. Set \(k=0\) and start with some \(X^{(0)},H^{(0)},A^{(0)}\).
  2. \(X^{(k+1)}=\mathbf{ortho}(\mathbf{center}(\sum_{j=1}^m H_j^{(k)}A_j^{(k)}))\).
  3. For \(j=1,\cdots,m\) compute \(A_j^{(k+1)}=\{H_j^{(k)}\}^+X^{(k+1)}\).
  4. For \(j=1,\cdots,m\) compute \(U_j^{(k+1)}=H_j^{(k)}+\frac{1}{\kappa_j}(X^{(k+1)}-H_j^{(k)}A_j^{(k+1)})\{A_j^{(k+1)}\}'\) and for \(s=1,\cdots,p_j\) compute \(h_{js}^{(k+1)}=\mathbf{proj}_{\mathcal{K}_{js}\cap\mathcal{S}}(u_{js}^{(k+1)})\).
  5. If converged stop. Else \(k\leftarrow k+1\) and go to step 2.
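Putting the pieces together, the sketch below implements one iteration of algorithm B, reusing the center(), ortho(), and majorizeTarget() sketches above and the MASS package for the Moore-Penrose inverse. The argument coneProject() is a placeholder for the normalized cone projection of step 4 (for simplicity the same projection is used for every variable); none of these names correspond to the actual package code.

```r
# Hedged sketch of one iteration of algorithm B. coneProject(u) must return
# the projection of u on the cone of admissible transformations intersected
# with the unit sphere; it is a placeholder supplied by the user.
alsStep <- function(X, H, A, coneProject) {
  m <- length(H)
  X <- ortho(center(Reduce(`+`, Map(`%*%`, H, A))))                # step 2
  A <- lapply(seq_len(m), function(j) MASS::ginv(H[[j]]) %*% X)    # step 3
  H <- lapply(seq_len(m), function(j) {                            # step 4
    U <- majorizeTarget(X, H[[j]], A[[j]])                         # target (3)
    apply(U, 2, coneProject)                                       # per column
  })
  list(X = X, H = H, A = A)
}
```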

Implementation Details

If we follow the ALS strategy strictly, the \(\mathbf{ortho}()\) operator should be implemented using a Procrustes rotation (Gibson 1962). Thus if the singular value decomposition of \(Z\) is \(Z=K\Lambda L'\), then \(\mathbf{ortho}(Z)=KL'\). Note, however, that any other basis for the column space of \(Z\) merely differs from the Procrustes basis by a rotation. This rotation matrix carries over unmodified into the update of \(A_j\) in step 3 of the algorithm, and thus after steps 2 and 3 the loss is the same, no matter which rotation we select. In our algorithm we use the QR decomposition to find the basis, using the Gram-Schmidt code from De Leeuw (2015b).
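For comparison, a sketch of the Procrustes version; it spans the same column space as the QR-based ortho() sketched earlier, differing only by a rotation.

```r
# Hedged sketch: Procrustes orthonormalization via the singular value
# decomposition Z = K Lambda L', returning K L'.
orthoProcrustes <- function(Z) {
  s <- svd(Z)
  s$u %*% t(s$v)
}
```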

We implement the cone restrictions by the constraints \(h_{js}=G_{js}z_{js}\) in combination with \(T_{js}h_{js}\geq 0\). Thus each transformed variable must be in the intersection of the subspace spanned by the columns of the transformation basis \(G_{js}\) and the polyhedral convex cone of all vectors \(h\) such that \(T_{js}h\geq 0\). We suppose that all columns of the \(G_{js}\) add up to zero, and we require, in addition, the normalization \(\mathbf{SSQ}(h_{js})=1\).

In earlier homogeneity analysis work, summarized for example in Gifi (1990) or Michailidis and De Leeuw (1998), the transformation basis matrices \(G_{js}\) were binary zero-one matrices, indicating category membership. The same is true for the software in IBM SPSS Categories (Meulman and Heiser 2011) or in the R package homals (De Leeuw and Mair 2009). In this paper we extend the current homogeneity analysis software using B-spline bases, which provide a form of fuzzy non-binary coding suitable for both categorical and numerical variables (Van Rijckevorsel and De Leeuw 1988). These generalizations were already discussed in De Leeuw, Van Rijckevorsel, and Van der Wouden (1981) and Gifi (1990), but corresponding easily accessible software was never released.

We use the code described in De Leeuw (2015c) to generate B-spline bases. Note that for coding purposes binary indicators are B-splines of degree zero, while polynomials are B-splines without interior knots. Also note that binary indicators can be created for qualitative non-numerical variables, for which B-splines are not defined. We have added the option of using degree -1 to bypass the B-spline code and generate an indicator matrix. Throughout we first orthonormalize the basis matrices \(G_{js}\), using the Gram-Schmidt code from De Leeuw (2015b).
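The actual bases come from the code in De Leeuw (2015c); purely as an illustration of the coding, the sketch below builds a centered degree-zero basis (interval indicators with steps at the knots) and the degree -1 indicator matrix with base R. Both helper names are ours.

```r
# Hedged sketches, not the actual basis generators. A degree-zero B-spline
# basis indicates the interval between consecutive knots; degree -1 bypasses
# the spline code and indicates category membership. Columns are centered.
degreeZeroBasis <- function(x, knots) {
  f <- cut(x, breaks = c(-Inf, knots, Inf), right = FALSE)
  scale(model.matrix(~ f - 1), center = TRUE, scale = FALSE)
}
indicatorBasis <- function(x) {
  f <- factor(x)
  scale(model.matrix(~ f - 1), center = TRUE, scale = FALSE)
}
```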

The matrices \(T_{js}\) in the homogeneous linear inequality restrictions that define the cones \(\mathcal{K}_{js}\) can be used to impose monotonicity or convexity on the resulting transformations. In the current implementation we only allow for monotonicity, which means the \(T_{js}\) do not have to be stored. The transformations for each variable can be restricted to be increasing, or they can be unrestricted. By using splines without interior knots we also allow for polynomial transformations, which again can be restricted to be either monotone or not. This covers the previous Gifi types nominal, ordinal, and numerical, which were of course designed for categorical variables with a small number of categories. Note that it is somewhat misleading to say we are fitting monotone splines or polynomials; we are merely requiring monotonicity at the data points.
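A sketch of the normalized cone projection for the increasing case, ignoring the spline basis restriction for simplicity (and not treating ties in the data specially), relying on base R's isoreg() for the monotone regression; after the projection the result is centered and normalized to unit length, in the spirit of the normalized cone regression results of De Leeuw (1975).

```r
# Hedged sketch, ignoring the spline basis: project a target u on the cone of
# vectors that are increasing in the order of the data x, then center and
# normalize to unit length.
monotoneConeProject <- function(u, x) {
  o <- order(x)
  h <- numeric(length(u))
  h[o] <- isoreg(u[o])$yf      # monotone regression against the order of x
  h <- h - mean(h)
  h / sqrt(sum(h^2))
}
```

To use this as the coneProject() placeholder in the algorithm sketch above, one would fix the data vector in advance, for example function(u) monotoneConeProject(u, x).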

Missing data are incorporated in the definition of the cones of transformations by using a \(G_{js}\) which is the direct sum of a spline basis for the non-missing and an identity matrix for the missing data. This is called missing data multiple in Gifi (1990). There are no linear inequality restrictions on the quantifications of the missing data.
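A sketch of this coding, assuming a basis G has already been computed for the non-missing observations; the helper name is ours.

```r
# Hedged sketch of "missing data multiple": the direct sum of a basis for the
# non-missing observations and an identity block (one column per missing
# observation) for the missing ones.
missingMultipleBasis <- function(G, miss) {
  n <- length(miss)                          # miss is a logical vector
  B <- matrix(0, n, ncol(G) + sum(miss))
  B[!miss, seq_len(ncol(G))] <- G
  B[miss, ncol(G) + seq_len(sum(miss))] <- diag(sum(miss))
  B
}
```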

Associated Eigenvalue Problems

Associated with the problem of minimizing loss function (1) are some eigen and singular value problems defined by the matrices \(H_j\). This has been discussed extensively in Gifi (1990), and there are some more recent discussions in Tenenhaus and Tenenhaus (2011) and Van der Velden and Takane (2012).

Suppose \(H_j^+\) is the Moore-Penrose inverse of \(H_j\) and \(P_j=H_jH_j^+\) is the orthogonal projector associated with the column space of \(H_j\). Then \[ \min_{A_j}\sigma(X,H,A)=\frac{1}{m}\sum_{j=1}^m(r-\mathbf{tr}\ X'P_jX)=r-\mathbf{tr}\ X'P_\star X, \] with \(P_\star\) the average of the projectors \(P_j\). Thus \[ \min_{X'X=I}\min_{A_j}\sigma(X,H,A)=r-\sum_{s=1}^r\lambda_s(P_\star), \] where \(\lambda_s\) are the ordered eigenvalues of \(P_\star\) (from large to small). Thus we see that homogeneity analysis chooses the \(H_j\), i.e. transforms or quantifies the variables, in such a way that the sum of the \(r\) largest eigenvalues of \(P_\star\) is maximized.
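The average projector and the corresponding minimum of the loss are easy to compute for given transformations. A hedged sketch follows, forming the \(n\times n\) projectors explicitly (fine for illustration, wasteful for large \(n\)) and using MASS::ginv for the Moore-Penrose inverse; the function names are ours.

```r
# Hedged sketch: average projector P_star = (1/m) sum_j H_j H_j^+ and the
# minimum of the loss over the A_j and over X'X = I, for given H_j.
averageProjector <- function(H) {
  n <- nrow(H[[1]])
  P <- matrix(0, n, n)
  for (Hj in H) P <- P + Hj %*% MASS::ginv(Hj)
  P / length(H)
}
minLoss <- function(H, r) {
  lambda <- eigen(averageProjector(H), symmetric = TRUE, only.values = TRUE)$values
  r - sum(lambda[seq_len(r)])
}
```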

Now consider alternative restrictions where we do not normalize \(X\), but instead normalize the loadings \(A\) by requiring that \[ \frac{1}{m}\sum_{j=1}^m A_j'D_jA_j=I, \] where \(D_j=H_j'H_j\). Also define \(C_{jv}=H_j'H_v.\) We can collect the matrices \(C_{jv}\) in a \(p\times p\) super-matrix \(C\), which we will call the Burt matrix. Note that if \(H=\begin{bmatrix}H_1&H_2&\cdots&H_m\end{bmatrix}\), then \(C=H'H\). The matrix \(D\) is defined as the direct sum of the \(D_j\), i.e. it consists of the diagonal submatrices of \(C\).

The minimum of the loss over unrestricted \(X\) for fixed \(A\) and \(H\) is attained at the average of the set scores \(X=\frac{1}{m}\sum_{j=1}^m H_jA_j,\) and thus \[ \min_{X}\sigma(X,H,A)=r-\frac{1}{m^2}\mathbf{tr}\ A'CA, \] and \[ \min_{A'DA=mI}\min_{X}\sigma(X,H,A)=r-\sum_{s=1}^r\lambda_s(C,mD), \] where the \(\lambda_s\) are now the generalized eigenvalues of the pair \(C\) and \(mD\).

Now define \(K\) by \[ K=m^{-\frac12}\begin{bmatrix}H_1(H_1'H_1)^{-\frac12}&\cdots&H_m(H_m'H_m)^{-\frac12}\end{bmatrix}, \] with \((H_j'H_j)^{-\frac12}\) the Moore-Penrose inverse of the symmetric square root. Then \(P_\star=KK'\), and the non-zero eigenvalues of \(P_\star\) are the same as those of \(K'K\), which in turn are equal to the generalized eigenvalues of the pair \((C,mD)\). Thus homogeneity analysis can also be interpreted as transforming the variables in such a way that the sum of the \(r\) largest generalized eigenvalues of \((C,mD)\) is maximized.
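This equivalence is easy to check numerically. The sketch below forms the Burt matrix and \(K\), with invSqrt() an illustrative Moore-Penrose inverse of the symmetric square root based on the eigendecomposition; the nonzero eigenvalues of \(K'K\) should match those of \(P_\star\) from the previous sketch. All names are ours.

```r
# Hedged sketch: Burt matrix C = H'H and the matrix K, whose crossproduct has
# the generalized eigenvalues of the pair (C, mD) as its eigenvalues.
invSqrt <- function(S, eps = 1e-10) {
  e <- eigen(S, symmetric = TRUE)
  keep <- e$values > eps
  V <- e$vectors[, keep, drop = FALSE]
  V %*% diag(1 / sqrt(e$values[keep]), sum(keep)) %*% t(V)
}
burtAndK <- function(H) {
  m <- length(H)
  C <- crossprod(do.call(cbind, H))
  K <- do.call(cbind, lapply(H, function(Hj) Hj %*% invSqrt(crossprod(Hj)))) / sqrt(m)
  list(C = C, K = K,
       eigenvalues = eigen(crossprod(K), symmetric = TRUE, only.values = TRUE)$values)
}
```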

Special Cases

  1. If there are only two sets the generalized eigenvalue problem for the Burt matrix becomes \[ \begin{bmatrix} D_1&C_{12}\\C_{21}&D_2 \end{bmatrix} \begin{bmatrix} a_1\\a_2 \end{bmatrix}=2\lambda\begin{bmatrix}D_1&0\\0&D_2\end{bmatrix}\begin{bmatrix} a_1\\a_2 \end{bmatrix}, \] which we can rewrite as \[ \begin{split} C_{12}a_2&=(2\lambda-1)D_1a_1,\\ C_{21}a_1&=(2\lambda-1)D_2a_2, \end{split} \] from which we see that homogeneity analysis maximizes the sum of the \(r\) largest canonical correlations between \(H_1\) and \(H_2\). See also Van der Velden (2012).

  2. Suppose all \(m\) sets each contain only a single variable. Then the Burt matrix is the correlation matrix of the \(H_j\), which are all \(n\times 1\) matrices in this case. It follows that homogeneity analysis maximizes the sum of the \(r\) largest eigenvalues of the correlation matrix over transformations, i.e. homogeneity analysis is nonlinear principal component analysis (De Leeuw 2014). A small numerical check of this case is given after this list.

  3. Suppose all basis matrices \(G_{j\ell}\) in set \(j\) are the same, say equal to \(G_j\). Then the set scores \(H_jA_j\) are equal to \(G_jZ_jA_j\), which we can write simply as \(G_jY_j\). Thus loss must be minimized over \(X\) and the \(Y_j\). If all \(G_j\) are binary indicators of categorical variables, and the \(m\) sets are all of rank one, then homogeneity analysis is multiple correspondence analysis (MCA). The set scores \(G_jY_j\) are \(k_j\) different points, with \(k_j\) the number of categories of the variable, usually much less than \(n\). The plot connecting the set scores to the object scores is called the star plot of the variable.

  4. More generally, we can include an arbitrary number of copies of a variable in a set by using the same basis matrix \(G_j\) a number of times. As soon as we have decided how many copies to include, the algorithm can forget all about the fact that some variables are copies and just treat them like any other variable. The notion of copies replaces the notion of the rank of a quantification used in traditional Gifi, which in turn generalizes the distinction between single and multiple quantifications.

  5. If the second set contains only a single copy of a single variable, then we choose transformations that maximize the multiple correlation between that variable and the variables in the first set.

  6. If the second set contains more than one copy of a single variable and we use binary indicator coding for that variable, then we optimize the eigenvalue (between/within ratio) sums for a canonical discriminant analysis.
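As the promised numerical check of special case 2, take a few standardized numeric variables as single-variable sets: the Burt matrix is then the correlation matrix, and by the argument above the nonzero eigenvalues of \(P_\star\) (computed with the averageProjector() sketch) are the eigenvalues of the correlation matrix divided by \(m\). The data set mtcars is used merely as convenient numeric data; the code is illustrative only.

```r
# Hedged check of special case 2 with m = 5 single-variable sets.
Z <- scale(as.matrix(mtcars[, 1:5]))
H <- lapply(seq_len(ncol(Z)), function(j) Z[, j, drop = FALSE] / sqrt(sum(Z[, j]^2)))
R <- crossprod(do.call(cbind, H))            # the correlation matrix of the variables
ev_Pstar <- eigen(averageProjector(H), symmetric = TRUE, only.values = TRUE)$values
ev_R <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
all.equal(ev_Pstar[1:5], ev_R / length(H))   # should be TRUE (up to tolerance)
```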

MCA Example: Thirteen Personality Scales

Our first example is a small data set from the psych package (Revelle 2015) of five scales from the Eysenck Personality Inventory, five from a Big Five inventory, a Beck Depression Inventory, and State and Trait Anxiety measures.

data(epi.bfi, package = "psych")
epi <- epi.bfi
epi_knots <- lapply (epi, function (x) fivenum (x)[2:4])  # inner knots at the three quartiles
epi_degrees <- rep (0, 13)                                # degree zero (step functions)
epi_ordinal <- rep (FALSE, 13)                            # no monotonicity restrictions
epi_copies <- rep (2, 13)                                 # two copies of each variable (multiple nominal)
epi_sets <- 1:13                                          # each variable is its own set

We perform a two-dimensional MCA, using degree zero and inner knots at the three quartiles for all 13 variables.

h <- homals(epi, epi_knots, epi_degrees, epi_ordinal, epi_sets, epi_copies, verbose = FALSE)
We have convergence in 260 iterations to loss 0.7478043. The object scores are in figure 8.
Figure 8: Personality Scales, Object Scores, Multiple Nominal, Degree Zero

Figure 9 has the \(G_jY_j\) for each of the thirteen variables, with the first dimension in red, and the second dimension in blue. Because the degree of the splines is zero, these transformation plots show step functions, with the steps at the knots, which are represented by vertical lines.