Tuesday, February 17, 2015

Dated phylogenies: My experience using r8s

In the spirit of my previous posts on species tree software and fastStructure, this post summarises my experience with the software r8s, so that somebody who tries to use it for the first time may find these remarks and avoid some of the frustrations I had.

First, what is this about? It can do more, but for my present purposes Mike Sanderson's r8s (~rates) is a program that takes a phylogenetic tree and at least one fossil calibration point provided by the user and then dates the other nodes of the tree.

Imagine you have a phylogeny of a group of plants - a tree of their evolutionary relationships, produced with molecular data and using one of the standard phylogenetic software tools, TNT, RAxML or MrBayes perhaps - and you want to know when some of the subgroups evolved, for example because you want to know whether a given climatic or geological event is associated with the diversification of one of the subgroups.

You need some kind of information that allows you to calibrate the phylogeny somewhere; either you know mutation rates in the molecular data you are using, or you have fossils that you can use to assign minimum ages to the groups they belong to (the group can be older but not younger than the fossil), or, weakest of all perhaps, you believe other people's dated phylogenies and use some of their results as calibration points. You also need to assume that branch lengths in your phylogenetic tree - mutations along the branches - have at least some kind of rough relationship with the age of that branch.

Clearly there are a lot of assumptions entering into this kind of analysis, and there are scientists who are highly sceptical of these kinds of methods. Still, the assumptions that a group is at least as old as its oldest fossil and that groups accumulate more differences the longer they are apart are surely reasonable, and so as long as we take precise ages in the results with a bucket of salt we can at least use the broad strokes to address some questions. Conversely, if we get an age of more than a billion years for a group of flowering plants we know that something must be amiss.

r8s is one of the two principal tools for doing dated analyses; the other one is the Bayesian software package BEAST. Many people, especially religious Bayesians, would probably say that r8s has been made redundant by BEAST. But as I have written here before, all of these methods have their own advantages and disadvantages. One of the major disadvantages of Bayesian phylogenetics is that it rests on an even greater number of assumptions and, specifically, priors than simpler methods. Add to that the often ridiculously long computing time especially for larger datasets or the problems BEAST often has with missing data and it should become clear that there will always be a comfortable niche for other approaches.

With this, we finally arrive at the program r8s itself, which I have tried out over the past few days. The manual does a good job of explaining its functionality and how to set up an analysis, so I will not deal with that here. Rather, I want to focus on the practical details that one usually has to find out the hard way:

On the program's website, the author makes available a Mac executable and the source code. This means that if you are a Mac user, you are in luck; if you are a Linux user, you will have to compile it; and if you are a Windows user, just forget about it and ask a colleague who uses one of the other two operating systems to install it for you on their machine. Well, maybe you are made of sterner stuff than I am, but I gave up.

Luckily, I am also an Ubuntu user. On Linux, you need to install the following three packages with sudo apt-get install (if they aren't installed already): make, gfortran and gcc. Once you have them, you can try to compile by running "make" in the unpacked r8s folder. If you are like me, this will produce an error message about "errno.h". In that case open the makefile and change one line as follows (the commented-out line is the original):

# memory.o: /usr/include/errno.h /usr/include/sys/errno.h
memory.o: /usr/include/errno.h

Try to compile again; this time it worked for me. Tip of the hat goes to the Linux Forum, and to Google for finding the former for me.
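
For reference, the Linux steps above can be sketched as a short shell session. This is only a sketch under assumptions: the function name patch_r8s_makefile is made up for illustration, the folder name in the comments is hypothetical, and the sed expression assumes the dependency line looks exactly like the one quoted above.

```shell
# Patch the memory.o dependency line in an r8s makefile (sketch).
# Usage: patch_r8s_makefile path/to/makefile
# patch_r8s_makefile is a hypothetical helper, not part of r8s.
patch_r8s_makefile() {
    makefile="$1"
    # Keep a backup copy, then drop the /usr/include/sys/errno.h dependency.
    cp "$makefile" "$makefile.orig" &&
    sed 's|^\(memory\.o:[[:space:]]*/usr/include/errno\.h\)[[:space:]]*/usr/include/sys/errno\.h[[:space:]]*$|\1|' \
        "$makefile.orig" > "$makefile"
}

# Typical session on Ubuntu (left as comments because it needs root access):
#   sudo apt-get install make gfortran gcc
#   cd r8s_unpacked            # hypothetical folder name
#   make                       # fails with the errno.h error
#   patch_r8s_makefile makefile
#   make                       # try again
```

The backup copy (makefile.orig) means the original line can always be restored if the change does not help on your system.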

Now that you can use r8s, turn to the manual to set up your analysis. The program implements three methods: Langley-Fitch (LF), Non-Parametric Rate Smoothing (NPRS) and Penalised Likelihood (PL). Langley-Fitch is a strict clock. The analysis is very fast and very simple but a strict clock is nearly always a wrong assumption, so let's ignore it.

It is less clear to me which of the other two is more defensible. They are both relaxed clock methods in that they allow different rates across the tree. NPRS appears to have the advantage of being simple from an end user perspective in that you just run it and get a result (with the caveats discussed in the r8s manual). PL on the other hand requires the user to try out different smoothing factors and select the best one. In my case the results of the two were very similar except in one subclade, and our dataset is certainly not simple.

I still do not quite understand what determines computing time. The last run I did seemed to take a few hours for NPRS but very little time indeed for PL, but when I tried to run PL cross-validation across seven smoothing values, PL ran for a day and a half before the program spontaneously shut down. The shutdown may have been the fault of the computing environment rather than of r8s, though.
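
To get a feeling for what such a cross-validation run has to work through, it can help to list the candidate smoothing values beforehand. The sketch below assumes, as I read the r8s manual, that cross-validation is controlled by the options cvstart, cvinc and cvnum and that the candidate values are spaced on a log10 scale. The helper name smoothing_grid and the numbers are made up for illustration, so please check the details against the manual.

```shell
# List candidate smoothing values for PL cross-validation (sketch).
# Assumption: values are 10^(cvstart + i*cvinc) for i = 0 .. cvnum-1.
# smoothing_grid is a hypothetical helper, not an r8s command.
smoothing_grid() {
    awk -v start="$1" -v inc="$2" -v num="$3" 'BEGIN {
        for (i = 0; i < num; i++)
            printf "%g\n", 10 ^ (start + i * inc)
    }'
}

# Seven values starting at 10^0 in steps of one log10 unit (illustrative numbers):
smoothing_grid 0 1 7
```

As far as I understand it, every value in this list triggers its own round of optimisations, which would at least partly explain why cross-validation takes so much longer than a single PL run.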

Finally, I got some really wonky results with my first attempts. What I learned today from ANU professor Mike Crisp, who kindly provided some advice on r8s, is that this was due to the way I set the constraints. His advice cleared it up, and now the results are starting to make sense: (1) always have some realistic but conservative maximum age for the root of the entire tree, otherwise it will be pushed way too far into the distant past, and (2) make all constraints inside the tree minimum only, avoid maxima.
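
The two rules above might translate into an r8s block roughly like the one printed by the following sketch. All taxon names, node names, ages, the number of sites and the smoothing value are hypothetical placeholders, and the command syntax is written from my reading of the manual, so treat it as an assumption to verify rather than a tested setup.

```shell
# Print an r8s block that follows the two rules above (sketch).
# Rule 1: the only maximum age is a conservative one on the root (treeroot).
# Rule 2: calibrations inside the tree (cladeA) are minimum ages only.
# All names and numbers are hypothetical placeholders.
write_r8s_block() {
    cat <<'EOF'
begin r8s;
blformat lengths=persite nsites=1000 ultrametric=no;
mrca treeroot outgroup_sp ingroup_sp1;
mrca cladeA ingroup_sp1 ingroup_sp2;
constrain taxon=treeroot max_age=120;
constrain taxon=cladeA min_age=35;
set smoothing=100;
divtime method=pl algorithm=tn;
showage;
end;
EOF
}

write_r8s_block
```

The printed block would go at the end of the NEXUS file that holds the tree, after the trees block.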
