PhyloBotanist: *BEAST, species trees in general, and Bayesian versus parsimony

Thursday, July 18, 2013

*BEAST, species trees in general, and Bayesian versus parsimony

Alexei Drummond is visiting Canberra this week, and today he gave a BEAST workshop hosted by our ANU/CSIRO Centre for Biodiversity Analysis. I took part, and as one might expect I am most interested in the species tree analyses one can do with *BEAST. It was very rewarding and great fun. Here are a few things I am taking away from it, mostly as a kind of supplement to my species tree post from April, along with a few thoughts:

The official position is that one can run an analysis with only one sample per species but it is not advisable because at least two samples are needed to estimate population sizes. From what people who have run their own analyses tell me, two samples are generally not enough for good results either.

That has an interesting consequence: because one should preferably have several samples per species but computing time explodes with larger sample numbers, these analyses are then only realistic for limited numbers of species. Unless, that is, you are prepared to assemble a ridiculously large dataset even for small studies and run your analysis on a supercomputer for a few weeks.

There are methodological alternatives (see my post linked above) but of course they have their own weaknesses. And as one of the course instructors pointed out, the parsimony-based methods I like so much may come without such warnings and happily give me a species tree for every data set I throw at them, but in doing so they may give me a false sense of security where *BEAST would honestly show me how uncertain the result is.

There is a way around the problem of *BEAST not accepting missing data (i.e. one locus missing for a species): one can make a dummy sample of the species for that locus and fill the whole sequence with Ns. The analysis will then run without an error but it may take longer to mix.
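To make the workaround concrete, here is a minimal sketch in Python of how one might build such a dummy sample before loading the data. Everything in it is my own assumption for illustration: the function name, the sample names and the toy sequences are hypothetical, and it is not part of BEAST or any of its companion programs. It only shows the idea of adding an all-N row of the right length for every sample that lacks a given locus.

```python
def pad_missing_with_n(alignment, all_samples, missing_char="N"):
    """Return a copy of one locus alignment in which every sample listed in
    all_samples is present; samples without data get a row of missing_char.

    alignment   -- dict mapping sample name to aligned sequence (hypothetical format)
    all_samples -- every sample name that should appear in each locus
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        # Also triggers for an empty alignment, where no length can be inferred.
        raise ValueError("a locus needs at least one sequence, all of the same length")
    (length,) = lengths

    padded = dict(alignment)  # shallow copy, the input stays untouched
    for name in all_samples:
        # Existing sequences are kept; absent samples get an all-N dummy row.
        padded.setdefault(name, missing_char * length)
    return padded


# Toy example: the second locus has no sequence for the sample of species C.
locus2 = {"sp_A_1": "ACGTAC", "sp_B_1": "ACGTTC"}
samples = ["sp_A_1", "sp_B_1", "sp_C_1"]
padded_locus2 = pad_missing_with_n(locus2, samples)
```

The padded dictionary would still have to be written out in whatever alignment format one normally imports, and as noted above, the run may then take longer to mix.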

It was confirmed that *BEAST will run with only one locus. This and the previous point are really important because it means that the program is even more flexible than I had assumed so far, apart from being reasonably fast and user friendly for a Bayesian phylogenetics tool.

---

It is funny to observe how different scientists see different methodological issues, and how everybody is convinced that the one they have experienced is the most important one. The same instructor mentioned above was really concerned about the problem of non-Bayesian methods giving me one best tree even if there might be two more or less equally good tree islands in the landscape. A parsimony method, for example, would pick one of them but a properly done (!) Bayesian analysis would sample over both and then show the uncertainty.

Apart from the fact that one can implement measures of certainty also for non-Bayesian methods (and of course that has been done), Bayesian approaches come with their own host of issues. They are slow, there are all the controversies around prior selection, and they demand an enormous up-front investment on the part of the end user. What is gamma? What is theta? What chain heat should I choose? How do I know what is a realistic prior for any of these dozens of items? What substitution model to choose? How to evaluate whether the run has been sufficiently long? Why do I have to learn how to use at least four different programs for one measly analysis?

Of course, some would argue that one should not do an analysis if one is not willing to do it right, but that steep learning curve is definitely also reducing the accessibility of science and sometimes borders on Herrschaftswissen (a term for which there might not be a good English translation). Not everybody interested in the phylogenetic relationships of one genus can be expected to become one of the world experts in Bayesian phylogenetics.

And again, sometimes I find it good to know where the computer has its hands... Bayesian analyses come with a huge and ever increasing number of variables and, importantly, assumptions that one has to accept. Parsimony analyses, on the other hand, have one simple assumption: of two possible explanations, the simpler one is to be preferred. It is very clear what actually happens inside the computer when you use them. So, sometimes you need a Soyuz to get somewhere, but in other situations you would be better served just taking your bicycle.

Still, if you have the right data and can defend your assumptions, then BEAST is the most flexible and most sophisticated tool one may find.

---

Finally, do not try to conduct any significant MCMC run on an ASUS Eee PC Seashell series. Good for conferences and field trips, not so good for doing science.
