Saturday, May 30, 2015

Non-hierarchical clustering of mixed data in R

I had wondered for some time how one could do an analysis similar to STRUCTURE on morphological data. Of course not quite like STRUCTURE, as in using a model of population genetics, but in the sense of having a method that gives you the best phenological clusters of your specimens for a given number k of clusters.

The problem was always that I didn't know where to get a method that can handle all data types. You may have numerical variables such as leaf length, where it appears obvious that 5 cm is twice as far away from 3 1 cm as from 1 3 cm (ye gods, did I really get that wrong?); but in the same dataset you may also have categorical variables such as hairy or not hairy. These could be coded as 0 and 1 for Euclidean distances, but what if they are multi-state? What is more, you can have categorical variables where there is a clear ordered relationship, e.g. subglobose is somewhere between globose and hemispherical, and others where there is no clear relationships, e.g. white, yellow and purple flower colour.

The R package homals can do a homogeneity analysis, a generalised form of PCA-type ordination analysis, with such mixed datasets. But that is it, it will give you a 2d or 3d plot. It will not do clustering.

This week some concerted googling finally produced the solution: The R package cluster (unsurprising, I know) and its functions daisy and pam (of which I had not heard before). The name doesn't have anything to do with the Asteraceae family, the distance functions in cluster rather appear to be named after female names. When presented with mixed data, daisy defaults to the Bower metric as a distance measure, which is exactly what I wanted.

Because others might be interested in doing non-hierarchical clustering with morphological data, and because getting this right was quite a struggle (as so often with R), I thought I would post a how-to.

First, you should have a text file with your data table, perhaps something like this:
Specimen,Max_leaf_length,Leaf_indument,Flower_colour,...
Sample1,2.3,woolly,white,...
Sample2,4.5,woolly,yellow,...
Sample3,2.9,glabrous,purple,...
...
In this case it is a comma separated values file, but you can just as well use tab separated; in fact that is what I usually have. You also obviously need to have the package cluster installed in R.

In your R script you now want to have the following. Set your working directory with setwd("your/working/directory"). Load the package with library("cluster"). Now you want to import your data, but first you should be able to tell R how each variable in your dataset should be treated - is it numerical or categorical? I make a variable chartreatment like this: chartreatment <- c("numeric","factor","factor",...).

Now we are ready to import the data with myData <- read.csv("myfilename", colClasses = chartreatment, sep=",", header=TRUE, row.names=1, na.strings=c("")). Here, the sep argument tells R to expect comma separated values; use "\t" for tab separated. The header argument specifies that the file contains variable names, the row.names argument does the same for row names, and finally the na.strings argument specifies what indicates missing data in your file. In this case it is simply an empty field. And yes, one of the awesome aspects of the daisy function is that it can deal with missing data.

Now that the data are in, you may want to turn some of the categorical variables into ordered states. Assume that you have the one I mentioned above, and that it is called Glomerule_shape. You would then write myData$Glomerule_shape <- ordered(myData$Glomerule_shape, levels = c("hemispherical", "subglobose", "globose")) . Now R will know that subglobose is between hemispherical and globose, but you would for example leave the Flower_colour variable as it is, so that the three colours all have the same distance from each other.

Now for the clustering analysis. It proceeds in two steps: first you make a distance matrix with mydm <- daisy(myData), then you hand the distance matrix to the clustering function with myresult <- pam(mydm, k=2). That is it. To see how the specimens were assigned to clusters, simply look into myresult$clustering. To save the results, use write.table(myresult$clustering, file="outfilename", sep="\t").

However, this is only for k = 2 clusters. What you really want to do is run the analysis across a large number of k values and see which of them is the best. That is the point of non-hierarchical clustering, after all; often you may want to know how many species you have among your specimens, and merely deciding that it must be two isn't very objective.

The best thing to do is to create a whole array of results, and we found it useful to use the apply function for this. First, declare the starting and ending k with fromk <- 1 and tok <- 35 (for example). I prefer to start at 1 because it will mean that the index of the results is always the same as k. Now make an array of those values with myarray <- array(c(fromk:tok)).

The apply function is very confusing but it saves a lot of scripting: myresult <- apply(myarray, 1, pam, x=mydm). This means that pam will be applied to all values in the array, using mydm as the distance matrix. You now have a whole array of results; confusingly, you have to access the contents using double angular brackets, see below.

The next thing to do is to get out the scores for the various clustering solutions. Because the score is called "average width", I will call the relevant variable widthes. widthes <- matrix(data=NA, nrow=tok, ncol=2)  creates an empty matrix. colnames(widthes) <- c("k","width")  assigns column names to it. Now you need a loop to get the scores out:
for (i in fromk:tok)  {
widthes[i,1] <- i
if (i>1) { widthes[i,2] <- myresult[[i]]$silinfo$avg.width }
}
You can either look at the table with widthes or you can look at a graph display of them with plot(widthes[,2] ~ widthes[,1]). The scores should show an increase with increasing k, although admittedly given the name I would have expected them to go down. What you want is the first value of k where the width scores start plateauing. For your desired k, save the results with write.table(myresult[[k]]$clustering, file="outfilename", sep="\t").

To summarise:
setwd("your/working/directory")
library("cluster")
# import and prepare data
chartreatment <- c("numeric","factor","factor", ...)
myData <- read.csv("yourfilename", colClasses=chartreatment, sep=",", header=TRUE, row.names=1, na.strings=c(""))
myData$Glomerule_shape <- ordered(myData$Glomerule_shape, levels = c("hemispherical", "subglobose", "globose"))
# make distance matrix
mydm <- daisy(myData)
# do a whole range of clustering analyses
fromk <- 1
tok <- 35
myresult <- data.frame()
myarray <- array(c(fromk:tok))
myresult <- apply(myarray, 1, pam, x=mydm)
# collect and plot scores
widthes <- matrix(data=NA, nrow=tok, ncol=2)
colnames(widthes) <- c("k","width")
for (i in fromk:tok)
{
    widthes[i,1] <- i
    if (i>1)
    {
      widthes[i,2] <- myresult[[i]]$silinfo$avg.width
    }
}
plot(widthes[,2] ~ widthes[,1])
widthes
# write results, in this case of k = 5
write.table(unlist(myresult[5])$clustering, file="outfilename", sep="\t")
I thank Nunzio Knerr for help with my analysis, in particular with the apply function.

No comments:

Post a Comment