```{r points, echo = FALSE, fig.align = "center", fig.dim = c(10, 10)}
par(pty = "s")
layout(matrix(c(1, 0, 2, 4, 3, 0), 2, 3))
hGraphPlot(small, x, y)
hGraphAll(small, x, y)
```

For this arbitrary initial configuration the loss, i.e. the sum of squares of the line lengths, or the sum of the squared distances between objects and the categories they fall in, is equal to `r hLoss(small, x, y)`.

# Reciprocal Averaging

In Homogeneity Analysis we minimize loss by what is known as *reciprocal averaging* or *alternating least squares*. We alternate two substeps. The first substep improves the category quantifications for a given set of object scores; the second substep improves and normalizes the object scores for a given set of category quantifications, namely those we have just computed in the first substep. Taken together these two substeps are an *iteration*: each iteration starts with object scores and category quantifications and uses its two substeps to improve both. Each of the two substeps decreases the loss function, i.e. the total squared length of the lines in the graph plots. The two substeps are both very simple. Let's look at the first one. We compute optimal category quantifications for given object scores by taking the averages (or *centroids*) of the object scores in each of the categories. The corresponding graph plots are

```{r updateY, echo = FALSE, fig.align = "center", fig.dim = c(10, 10)}
y <- hUpdateY(small, x)
par(pty = "s")
layout(matrix(1:3, 1, 3))
hGraphPlot(small, x, y)
```

and the loss has decreased to `r hLoss(small, x, y)`. Note that we have not improved the object scores yet, so they are still in their initial configuration, equally spaced on a circle. Also note, in variable 2 for instance, that a category quantification coincides with an object score, and thus contributes zero to the loss, if that object is the only observation in the category. In addition, because category quantifications are averages of object points, they are in the convex hull of the object points, which in this figure means that they are within the circle. Averaging object points makes the category quantifications move closer to the origin. The second substep improves the object scores, while keeping the category quantifications in the locations we have just computed in the first substep. The second substep itself has two substeps, say $2A$ and $2B$. In substep $2A$ the score of an object for given category quantifications is computed as the average or centroid of the quantifications of the $m$ categories the object is in.
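The averaging substeps, and the normalization of substep $2B$, can be written down in a few lines. The following is a minimal self-contained sketch on synthetic data, assuming the variables are coded as a list of indicator matrices `G`; the function names are illustrative, not the `hUpdateY`, `hUpdateX`, and `hLoss` used in this document.

```r
# One reciprocal averaging iteration on synthetic data (illustrative names).
set.seed(1)
n <- 8
v1 <- sample(3, n, replace = TRUE)
v2 <- sample(2, n, replace = TRUE)
# indicator matrix per variable: one column per observed category
G <- lapply(list(v1, v2), function(v) outer(v, sort(unique(v)), "==") * 1)

# loss: total squared length of the lines in the graph plots
hLossSketch <- function(x, G, Y)
  sum(mapply(function(g, y) sum((x - g %*% y)^2), G, Y))

# substep 1: category quantifications are centroids of the object scores
updateY <- function(x, G)
  lapply(G, function(g) crossprod(g, x) / colSums(g))

# substep 2A: object scores are averages of their category quantifications
updateX <- function(G, Y)
  Reduce(`+`, mapply(function(g, y) g %*% y, G, Y, SIMPLIFY = FALSE)) / length(G)

# substep 2B: center the scores, then make them standardized and uncorrelated
normalizeX <- function(z) {
  z <- sweep(z, 2, colMeans(z))
  qr.Q(qr(z)) * sqrt(nrow(z))   # crossprod(result) = n * I
}

x0 <- normalizeX(matrix(rnorm(2 * n), n, 2))  # arbitrary initial object scores
Y  <- updateY(x0, G)                          # substep 1
z  <- updateX(G, Y)                           # substep 2A: loss cannot increase
x1 <- normalizeX(z)                           # substep 2B: constraints restored
```

Substep $2A$ is the unconstrained minimizer of the loss over the object scores for fixed `Y`, which is why the sketch's loss can only go down before renormalization.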

```{r updateXa, echo = FALSE, fig.align = "center", fig.dim = c(10, 10)}
z <- hUpdateX(small, y)$z
par(pty = "s")
layout(matrix(1:3, 1, 3))
hGraphPlot(small, z, y)
```

The loss function is down all the way to `r hLoss(small, z, y)`. This is not a proper loss value, however, because the object scores are no longer centered, standardized, and uncorrelated, which was a Homogeneity Analysis requirement. Substep 1 shrinks the category quantifications towards the origin by averaging object scores; substep $2A$ takes the resulting category quantifications and, by even more averaging, shrinks the object scores further. Thus in substep $2B$ we have to renormalize the object scores so that they are centered, standardized, and uncorrelated. This gives

```{r updateXb, echo = FALSE, fig.align = "center", fig.dim = c(10, 10)}
x <- hUpdateX(small, y)$x
par(pty = "s")
layout(matrix(1:3, 1, 3))
hGraphPlot(small, x, y)
```

Loss, which is now the proper loss for a normalized configuration, has decreased to `r hLoss(small, x, y)`. Now that we have new category quantifications and new suitably normalized object scores we can start the next iteration, and again improve both in two substeps. Ultimately, after a certain number of iterations, there is no change any more from one iteration to the next, and we have reached the optimal solution. In other words, there is convergence, and our Homogeneity Analysis is finished. Note that the renormalization in substep $2B$ is necessary, because without it both object scores and category quantifications would become smaller and smaller, and converge to the origin. Of course the origin does have loss zero, but it is never a proper description of the data. The optimal graph plots, after the iterations have converged, are

```{r optimum, echo = FALSE, fig.align = "center", fig.dim = c(10, 10)}
h <- hOptimum(small)
par(pty = "s")
layout(matrix(c(1, 0, 2, 4, 3, 0), 2, 3))
hGraphPlot(small, h$x, h$y)
hGraphAll(small, h$x, h$y)
```

The minimum loss for these data is `r hLoss(small, h$x, h$y)`. Because the object scores are in deviations from the mean, and the category quantifications are weighted means of object scores, it follows that the category quantifications are in deviations from their weighted mean, with weights equal to the marginals of the variable. Thus the category quantifications for each variable are distributed around the origin. In Homogeneity Analysis the graph plots for individual variables are often called *star plots*, because the optimal category quantification is the centroid of the scores of the objects in the category. Thus it is somewhere in the middle of a bunch of objects, which are connected to it by lines. The subgraph for each category is therefore a star graph, and the corresponding plot for the variable, with all its stars, is a star plot. The top three plots in figure `r figure_nums("graph_optim", display = "n")` are examples of such star plots. One could formulate the objective of Homogeneity Analysis as finding normalized object scores in such a way that the stars (over all categories of all variables) are as small as possible. Or, in yet another formulation, we want to maximize the between-category variation and minimize the within-category variation. Variables whose category quantifications all lie close to the origin have poor discrimination power: the average object score is about the same in each category. In general, categories with a large number of observations will have an average close to the average of all observations, and thus they will be close to the origin. Conversely, categories with a small number of observations will tend to be relatively far from the origin.

# Specifics

## Equality Restrictions

Now, going back to the Durant Bend data, we do not have the values of all 5888 sherds on the three variables. The sherds are aggregated over various site/depth combinations and the original data cannot be recovered from the aggregated table.
But the framework of Homogeneity Analysis can still be applied by using *equality restrictions* (@vanbuuren_deleeuw_A_92). The only thing added is the requirement that sherds in the same site/depth combination get the same object score. Or, geometrically, all sherds in the same site/depth are mapped into the same point in the joint plot. The Homogeneity Analysis loss function is still minimized in two steps. The first step, updating the category quantifications, is the same as in Homogeneity Analysis without equality restrictions. The second step, which updates the object scores, now has three substeps instead of two. In substep $2A$ we compute the average of the category quantifications of the categories the sherd is in. In substep $2B$ we replace these tentative object scores for the sherds by the site/depth averages, and in substep $2C$ we normalize the object scores, making them centered, standardized, and uncorrelated. The graph plots of unconstrained Homogeneity Analysis must now be replaced by plots of *valued graphs*. For any variable each site/depth point is now connected to all category points, and the edge connecting an object point and a category point has a value equal to the number of sherds in the category.

As an example we use the GALO data, which have been used innumerable times before as a Homogeneity Analysis example, and which are in the `Gifi` package in R (@mair_deleeuw_17).

```{r galo, fig.align = "center", cache = TRUE, echo = FALSE}
data(galo)
galo[, 2] <- as.factor(galo[, 2])
k <- as.matrix(galo[, 1:4])
h <- homals(k)
x <- h$objectscores
```

```
galo                         Gifi                        R Documentation

GALO dataset

Description

The objects (individuals) are 1290 school children in the sixth grade of
elementary school in the city of Groningen (Netherlands) in 1959.

Usage

galo

Format

Data frame with the five variables Gender, IQ, Advice, SES and School.
IQ (original range 60 to 144) has been categorized into 9 ordered
categories and the schools are enumerated from 1 to 37.
SES: LoWC = Lower white collar; MidWC = Middle white collar; Prof =
Professional, Managers; Shop = Shopkeepers; Skil = Schooled labor;
Unsk = Unskilled labor.

Advice: Agr = Agricultural; Ext = Extended primary education; Gen =
General; Grls = Secondary school for girls; Man = Manual, including
housekeeping; None = No further education; Uni = Pre-University.

References

Peschar, J.L. (1975). School, Milieu, Beroep. Groningen: Tjeek Willink.
```

We first ignore the school variable, and analyze the four variables Gender, IQ, Advice, and SES. Separate joint plots for the four variables, with both object scores and category quantifications, are in figure `r figure_nums("joint_galo", display = "n")`. We do not make star plots (by drawing the lines from the object points to the category points they are in) in this case, because 1290 lines in a plot just create one big black blob. The joint plots show a curved one-dimensional solution with good students on the left and poor students on the right. Both IQ and Advice differentiate students well (because teachers used IQ in their secondary education advice), which means they will have the smallest stars. Girls are better students than boys, and SES mainly contrasts the two extreme categories Prof and Unsk.
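The between/within formulation mentioned earlier can be checked numerically. The sketch below uses synthetic data (not the GALO scores): for centered object scores, the total sum of squares of one variable splits into a between-category part (the spread of the category centroids, weighted by category size) and a within-category part (the total star size, which is exactly that variable's contribution to the loss).

```r
# Between/within decomposition for one variable (synthetic, illustrative data).
set.seed(3)
n <- 30
x <- matrix(rnorm(2 * n), n, 2)
x <- sweep(x, 2, colMeans(x))             # centered object scores
v <- sample(3, n, replace = TRUE)         # a categorical variable
g <- outer(v, sort(unique(v)), "==") * 1  # indicator matrix
y <- crossprod(g, x) / colSums(g)         # category centroids
between <- sum(colSums(g) * y^2)          # weighted spread of the centroids
within  <- sum((x - g %*% y)^2)           # total star size: the loss term
total   <- sum(x^2)                       # equals between + within
```

A variable with good discrimination has most of its total in `between` and little in `within`, which is what "small stars" means.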

```{r galo_sex, fig.align = "center", echo = FALSE, fig.dim = c(15, 15), cache = TRUE}
par(mfrow = c(2, 2), pty = "s")
g <- ifelse(outer(galo[, 1], levels(galo[, 1]), "=="), 1, 0)
y <- crossprod(g, x) / colSums(g)
plot(x, col = "RED", main = "Gender")
text(y, levels(galo[, 1]), col = "BLUE", cex = 3)
g <- ifelse(outer(galo[, 2], levels(galo[, 2]), "=="), 1, 0)
y <- crossprod(g, x) / colSums(g)
plot(x, col = "RED", main = "IQ")
text(y, levels(galo[, 2]), col = "BLUE", cex = 3)
g <- ifelse(outer(galo[, 3], levels(galo[, 3]), "=="), 1, 0)
y <- crossprod(g, x) / colSums(g)
plot(x, col = "RED", main = "Advice")
text(y, levels(galo[, 3]), col = "BLUE", cex = 3)
g <- ifelse(outer(galo[, 4], levels(galo[, 4]), "=="), 1, 0)
y <- crossprod(g, x) / colSums(g)
plot(x, col = "RED", main = "SES")
text(y, levels(galo[, 4]), col = "BLUE", cex = 3)
```

The analysis does not use the school variable at all. In the terminology of Gifi "school" is a *passive variable*. We now repeat the analysis, requiring that students in the same school get the same object score. Computationally this is easiest to do using the R package `anacor` (@deleeuw_mair_A_09b). In this analysis each school gets an object score, and these scores are plotted in figure `r figure_nums("school_galo", display = "n")`.

```{r aggregate, fig.align = "center", echo = FALSE, fig.dim = c(10, 10), cache = TRUE}
galo[, 5] <- as.factor(galo[, 5])
h <- ifelse(outer(galo[, 5], levels(galo[, 5]), "=="), 1, 0)
g1 <- ifelse(outer(galo[, 1], levels(galo[, 1]), "=="), 1, 0)
g2 <- ifelse(outer(galo[, 2], levels(galo[, 2]), "=="), 1, 0)
g3 <- ifelse(outer(galo[, 3], levels(galo[, 3]), "=="), 1, 0)
g4 <- ifelse(outer(galo[, 4], levels(galo[, 4]), "=="), 1, 0)
f <- crossprod(h, cbind(g1, g2, g3, g4))
m <- anacor(f, ndim = 3)
xa <- m$row.scores
d <- rowSums(f)
e <- diag(crossprod(xa, d * xa))
xa <- xa %*% diag(1 / sqrt(e))
plot(xa, type = "n", xlab = "", ylab = "", axes = FALSE)
box()
axis(1, labels = FALSE)
axis(2, labels = FALSE)
text(xa, as.character(1:nrow(xa)), col = "RED", cex = 2)
```

School 25 is an outlier, but otherwise schools are dispersed in ways that can possibly be interpreted.

```{r outlier, fig.align = "center", echo = FALSE, fig.dim = c(10, 10)}
plot(xa[-25, ], type = "n", xlab = "", ylab = "", axes = FALSE)
box()
axis(1, labels = FALSE)
axis(2, labels = FALSE)
text(xa[-25, ], as.character((1:nrow(xa))[-25]), col = "RED", cex = 2)
```
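The extra averaging used for the equality restrictions (the substep that replaces tentative object scores by their group averages, before renormalization) is easy to isolate. The following is a sketch, not the `homals`/`anacor` code; `h` is assumed to be an indicator matrix of group memberships (schools here, site/depth combinations for the sherd data), and the function name is illustrative.

```r
# Group-averaging substep under equality restrictions (a sketch):
# replace each tentative object score by the average score of its group,
# so all objects in a group end up in the same point.
groupAverage <- function(z, h) {
  means <- crossprod(h, z) / colSums(h)  # group centroids
  h %*% means                            # every object gets its group centroid
}

# tiny example: 9 objects in 3 groups
grp <- rep(1:3, c(2, 3, 4))
h1 <- outer(grp, 1:3, "==") * 1
z  <- matrix(rnorm(18), 9, 2)
zr <- groupAverage(z, h1)
```

After this substep the scores of objects in the same group coincide, and the group means themselves are unchanged.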

There is another, less radical way to incorporate "school" into the analysis, by using it as what is commonly known as a *supplementary variable*. Such a variable does not enter into the analysis, but we can still compute its category quantifications as centroids of the object scores in each category. Thus we can make star plots for passive variables that have not been used in the analysis.

```{r supple, fig.align = "center", echo = FALSE, fig.dim = c(10, 10)}
par(pty = "s")
y <- crossprod(h, x) / colSums(h)
plot(x, col = "RED", xlab = "", ylab = "", axes = FALSE)
box()
axis(1, labels = FALSE)
axis(2, labels = FALSE)
text(y, as.character(1:nrow(y)), col = "BLUE", cex = 2)
```

## Binary Variables

We have seen that the category quantifications are in deviations from their weighted mean, with the weights equal to the marginal frequencies of the variable. If a variable has only two categories, and our Homogeneity Analysis has two dimensions, this means that the two category quantifications are on a line through the origin. The direction of the line is determined by the marginals of the variable.
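The collinearity is easy to verify numerically: with centered object scores and a binary variable with marginals $n_1$ and $n_2$, the two centroids satisfy $n_1 y_1 + n_2 y_2 = 0$, so $y_2 = -(n_1/n_2)\,y_1$. A small sketch with synthetic scores:

```r
# Check (synthetic data): the two category centroids of a binary variable
# are collinear with the origin, with y2 = -(n1 / n2) * y1.
set.seed(2)
n <- 10
x <- matrix(rnorm(2 * n), n, 2)
x <- sweep(x, 2, colMeans(x))      # centered object scores
v <- rep(1:2, c(4, 6))             # binary variable, marginals 4 and 6
g <- outer(v, 1:2, "==") * 1
y <- crossprod(g, x) / colSums(g)  # the two category centroids
# the weighted mean of the centroids is the overall mean, which is zero,
# so the two centroid points lie on a line through the origin
```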
What the Homogeneity Analysis gives us is how far from the origin the category quantifications are placed to get optimal discrimination.

# Analysis of the Durant Bend Data

```{r anacor, fig.align = "center", echo = FALSE, fig.dim = c(10, 10)}
par(pty = "s")
h <- anacor(sherds)
x <- h$row.scores
d <- rowSums(sherds)
e <- diag(crossprod(x, d * x))
x <- x %*% diag(1 / sqrt(e))
y1 <- crossprod(sherds[, 1:2], x) / colSums(sherds[, 1:2])
y2 <- crossprod(sherds[, 3:4], x) / colSums(sherds[, 3:4])
y3 <- crossprod(sherds[, 5:6], x) / colSums(sherds[, 5:6])
mA <- 1.1 * max(x)
mI <- 1.1 * min(x)
plot(0, xlim = c(mI, mA), ylim = c(mI, mA), type = "n", xlab = "", ylab = "", axes = FALSE)
box()
axis(1, labels = FALSE)
axis(2, labels = FALSE)
text(x[1:5, ], row.names(sherds)[1:5], col = "RED", cex = 2)
text(x[10:13, ], row.names(sherds)[10:13], col = "BLUE", cex = 2)
text(x[20:24, ], row.names(sherds)[20:24], col = "GREEN", cex = 2)
text(x[14:16, ], row.names(sherds)[14:16], col = "ORANGE", cex = 2)
text(x[6:9, ], row.names(sherds)[6:9], col = "MAGENTA", cex = 2)
text(x[17:19, ], row.names(sherds)[17:19], col = "BLACK", cex = 2)
text(x[25:27, ], row.names(sherds)[25:27], col = "PURPLE", cex = 2)
lines(y1, lwd = 3)
text(y1, row.names(y1), cex = 2)
lines(y2, lwd = 3)
text(y2, row.names(y2), cex = 2)
lines(y3, lwd = 3)
text(y3, row.names(y3), cex = 2)
```

```{r yplot, fig.align = "center", echo = FALSE, fig.dim = c(10, 10)}
par(pty = "s")
y <- rbind(y1, y2, y3)
mA <- 1.1 * max(y)
mI <- 1.1 * min(y)
plot(0, xlim = c(mI, mA), ylim = c(mI, mA), type = "n", xlab = "", ylab = "")
lines(y1, lwd = 3)
text(y1, row.names(y1), cex = 2)
lines(y2, lwd = 3)
text(y2, row.names(y2), cex = 2)
lines(y3, lwd = 3)
text(y3, row.names(y3), cex = 2)
```

# References