Friday, April 13, 2018

Monophyletic species, kind of

A paper by bryologist Brent Mishler and philosopher of biology John Wilkins has just come out, with the title The Hunting of the SNaRC: A Snarky Solution to the Species Problem. It is open access in the journal Philosophy, Theory, and Practice in Biology, so anybody with internet access can check it out.

Many bloggers have issues that they return to again and again even if they are not necessarily the nominal topics of their blogs - for example, Jerry Coyne frequently posts about Free Will and about students trying to shut down talks by speakers they don't like, and Larry Moran regularly takes apart papers claiming that junk DNA has been disproved. This much less widely known blogger can reliably be coaxed out from behind the oven by at least two such recurring issues: bad arguments for the acceptance of paraphyletic taxa, and the in my eyes incoherent concept of "monophyletic species".

As the title indicates, Mishler & Wilkins present a solution for the species problem, i.e. the perennial question in biology of what 'a species' even is. Especially as the paper is freely accessible it would serve no purpose to summarise its introduction, so I will move immediately to what I find most interesting: their views on how to treat species, and some pointers on how to do classification at the lowest levels in practice.

Note that I say "their views", plural, deliberately, because this is one aspect of the paper that I have not quite understood yet:

Wilkins has argued in the past that the popular approach of developing a theoretical species concept and then applying it to a potentially recalcitrant reality is a dead end. What biologists should do is the opposite, i.e. consider species as empirical phenomena in need of individual explanations. And here in this paper, Wilkins' argument is reiterated concisely in section 3, A Way Forward: Species Are at Least Initially Phenomena.

What I like about this flip in perspective is that it allows much more flexibility; obviously the empirical phenomena that we generally identify as species, be it popularly or as biologists - generally gaps in morphological or genetic variation - need a different scientific explanation for example in asexual than in sexual species, making one-size-fits-all species concepts difficult to apply.

Mishler, in turn, has argued in the past that species are not a special biological category different from e.g. monophyletic genera and families. The species category is arbitrary, and we should just classify all organisms into nested monophyletic groups, AKA clades, all the way down to the individual specimens. And here in this paper, Mishler's argument is reiterated in sections 4, Rankless Taxonomy, 5, Capturing the SNaRC, and 6, Using SNaRCs in Systematic, Evolutionary, and Ecological Studies.

The thing is, while there is perhaps technically no direct contradiction between those two arguments to the degree that there is a contradiction between "all taxa should be monophyletic" and "taxa should be allowed to be paraphyletic", they appear to be two rather different prescriptions. If I understand correctly, the first says,
  • We should treat species as empirical phenomena in need of explanation instead of indiscriminately applying a given theoretical concept to them.
The second says,
  • It makes no sense to even talk of species, we should stop doing so, and here is a single theoretical concept (everything is clades) that we should indiscriminately apply to all specimens.
In fact I am currently unable to see how sections 4-6 and the conclusions of this paper would have to change if section 3 were to be deleted in its entirety. What am I missing?

What I found most useful about this paper was that it has some thoughts on how to do classification into nested clades all the way down to the individual specimens in practice, because that was completely unclear to me in all past instances when this approach was suggested. There are some apparent problems with it, particularly that we need items forming a tree structure to even have clades. It is sometimes difficult to illustrate the issue, but it can perhaps be presented as follows:
  1. The prescription is, as mentioned above, that a classification should be clades (= monophyletic groups) all the way down to individual specimens.
  2. A clade is a complete branch in a tree structure, and usually understood to be specifically a complete branch of a species phylogeny.
  3. In other words, the way the term clade is defined, it applies only in a tree-structure but is inapplicable in a net-like structure.
  4. Sexually reproducing species are systems consisting of individual specimens that have net-like relationships with each other, because they share numerous ancestors instead of one ancestor in each sufficiently earlier generation.
  5. It follows necessarily from the previous two points that the term clade cannot be applied to describe the relationship between specimens if what we are looking at includes multiple specimens from the same sexually reproducing species.
  6. It follows then that it is logically impossible to classify into clades all the way down to these specimens, unless the meaning of the word clade is changed to a degree that the whole purpose of having that word is defeated.
To my understanding this is why Hennig spent so much time discussing the different ways that specimens (or snapshots of them, which he called semaphoronts) can be related to each other. The relationship between four (non-hybridogenic) species is tree-like, so they can, and should, be classified into clades. But relationships between individuals within a sexually reproducing species are net-like, so they cannot possibly be classified into clades, as the word does not even have a meaning in that structure.
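Points 3 to 5 above can be illustrated with a minimal sketch. The two small graphs below are hypothetical toy data, not taken from the paper, and the function names are made up for this illustration. In the species tree every lineage has exactly one parent lineage, so two sister species share a single ancestor one step up. In the pedigree every individual has two parents, so two cousins already share more than one ancestor two generations back.

```python
# Hypothetical toy data: each key maps a child to its list of parents.
# Four species related by a tree: every lineage has one parent lineage.
species_tree = {
    "sp1": ["anc12"], "sp2": ["anc12"],
    "sp3": ["anc34"], "sp4": ["anc34"],
    "anc12": ["root"], "anc34": ["root"],
}

# A fragment of a sexual pedigree: every listed individual has two parents.
pedigree = {
    "me": ["mum", "dad"],
    "cousin": ["aunt", "uncle"],
    "mum": ["gran", "grandad"],
    "aunt": ["gran", "grandad"],
}


def is_tree_like(graph):
    """True if no node has more than one parent."""
    return all(len(parents) <= 1 for parents in graph.values())


def ancestors_at(graph, node, generations):
    """Set of ancestors exactly `generations` steps above `node`."""
    current = {node}
    for _ in range(generations):
        current = {p for n in current for p in graph.get(n, [])}
    return current


def shared_ancestors(graph, a, b, generations):
    """Ancestors that `a` and `b` have in common at the given depth."""
    return ancestors_at(graph, a, generations) & ancestors_at(graph, b, generations)


print(is_tree_like(species_tree), shared_ancestors(species_tree, "sp1", "sp2", 1))
print(is_tree_like(pedigree), shared_ancestors(pedigree, "me", "cousin", 2))
```

The first line reports a tree with one shared ancestor, the second a non-tree with two. Go back a few more generations in a real pedigree and the shared set grows rapidly, which is the sense in which the relationships are net-like.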

The point at which approaches to classification change is approximately at the species level. Phylogenetic systematics applies only above it, and it uses species as the units that it groups into clades, because if it used any smaller units there would not be clades. This is also why in my opinion one cannot coherently reject the reality of species and be a phylogenetic systematist and, conversely, coherently accept the reality of species and promote paraphyletic taxa, because clades are species that have diversified. Many others, of course, disagree.

Now, what is the practical approach suggested by the present paper? It argues that the terminal units of classification should be "the finest-scale clades that can be convincingly demonstrated with current data", here called Smallest Named and Registered Clades (SNaRCs). Obviously such a 'clade' cannot be based on information from a single gene, as it may show a different history than other genes, for example because of introgression or incomplete lineage sorting. The solution is to use as evidence for monophyly "the preponderance of gene lineages making up a clade", or in other words "congruence among the majority of gene trees and other types of phylogenetic characters available".

On the plus side, this is a very empirical and testable prescription. But consider two thought experiments. First, take three samples A, B and C, look at, say, 100 gene trees, and if 51 of them show ((A,B),C) then A and B form a 'clade', even if all three of them are members of the same sexually reproducing species. Again, that is doable, empirical and testable, and we get a clear answer.

Nonetheless this approach does not convince me at the moment, nor will it even if we assume a scenario of 100 gene trees supporting (A,B), simply because no matter what the gene trees say, in reality there is no tree-structure inside the species. Yes, we can easily sequence for example the DNA of three siblings and run an analysis that will produce a phylogenetic tree for each gene, but in reality these three people just don't have a tree-relationship with each other, so it does not make sense to me to use terminology or a classification that implies there is one.

For the second thought experiment, take three samples D, E, and F, and if 33 gene trees say ((D,E),F), 33 say (D,(E,F)), and 34 say (E,(D,F)), we are inside a SNaRC and should not delimit any more narrowly, even if D is a specimen from an arid zone ephemeral, E from an alpine perennial, and F from a narrow endemic of the northwestern Blue Mountains that only occurs on ironstone-sandstone outcrops, and all three of them are geographically isolated from each other.

This hypothetical case has three very distinct entities that show a lot of gene tree discordance for the genes we used for our analysis. This is a much weaker problem than the previous one because Mishler & Wilkins argue that SNaRCs are, as all scientific hypotheses, tentative and await revision after the examination of more data. Maybe the next 100 gene trees will clinch it for (D,(E,F)), and then at least we could separate out D; more realistically, sampling more individuals of all three species will presumably resolve the three species as three SNaRCs, even if we cannot figure out the relationship of those three SNaRCs with each other (they may even form a true polytomy, and that's fine).

Still it bothers me that in a situation where we unfortunately have only one sample per species available for analysis the approach promoted in the present paper might lead to the tentative lumping of clearly distinct entities. And unless something is added to the approach, or unless I am missing something, it would have to, because it does not seem to include a way of recognising single-specimen SNaRCs except in the case of one being left alone as sister to another SNaRC that, in turn, would still consist of two potentially vastly different specimens. But maybe I am taking this too literally.
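The two thought experiments above can be sketched as a simple tally. This is a deliberately literal reading of "the majority of gene trees", not the authors' actual procedure and not how multi-locus software works. The function name is hypothetical, each gene tree is reduced to the pair of samples it resolves as sisters, and the 25/24 split of the 49 dissenting trees in the first experiment is an assumption made only to fill out the example.

```python
from collections import Counter


def majority_sister_pair(gene_trees):
    """Return the sister pair found in more than half of the gene trees.

    `gene_trees` is a list of frozensets, each naming the two samples that
    one rooted three-sample gene tree groups together. Returns None when no
    pair reaches a strict majority.
    """
    counts = Counter(gene_trees)
    if not counts:
        return None
    pair, support = counts.most_common(1)[0]
    return pair if support > sum(counts.values()) / 2 else None


# First thought experiment: 51 of 100 gene trees show ((A,B),C).
first = [frozenset("AB")] * 51 + [frozenset("AC")] * 25 + [frozenset("BC")] * 24

# Second thought experiment: 33 ((D,E),F), 33 (D,(E,F)), 34 (E,(D,F)).
second = [frozenset("DE")] * 33 + [frozenset("EF")] * 33 + [frozenset("DF")] * 34

print(majority_sister_pair(first))   # the pair A, B
print(majority_sister_pair(second))  # None: no topology has a majority
```

Under this reading the first data set yields an (A,B) 'clade' inside a single species, while the second leaves three ecologically distinct samples lumped together.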

On top of that there is perhaps another methodological issue, or again maybe just something I don't understand. It seems to me as if "majority vote of the gene trees" is not actually how multi-locus phylogenetic analyses generally work. To the best of my understanding they reconcile gene trees in rather more complex ways, even in the case of such a simple approach as Gene Tree Parsimony, let alone the multi-gene coalescent model. Many of these approaches actually presuppose the existence of species or populations, and for the same reason as I argued above: what happens within a sexually reproducing lineage is rather different from what happens between such lineages.

More than anything what I find uncomfortable about the approach presented here is that it seems to care not so much about the actual patterns of common descent of what it classifies as about character or gene tree distribution. The difference may come across as subtle, admittedly. What I am trying to say is that I believe phylogenetic systematics should be about classifying organisms by relatedness, by exclusivity of common descent.

I do not, for example, care very much about the fact that most of the ancestral chloroplast genome has been moved over into the nucleus of the host cell, because the chloroplasts are directly descended in an unbroken line from the first cyanobacterium that colonised a plant cell, and the plant species we have today are descended in an unbroken line from that plant cell. To me chloroplasts are a subclade of cyanobacteria and plants are a subclade of eucaryotes, all regardless of what happened to the individual genes.

To use an example from within a species, I have mentioned in the past that it is possible, although statistically unlikely, that I have inherited no genetic material whatsoever from my maternal grandfather, if it just so happened that all the chromosomes my mother gave me were those she got from her mother (the Y chromosome is of course always from the paternal grandfather, by necessity). But even if that were the case we would nonetheless consider it to be an important piece of information that I descended from my maternal grandfather, and I would nonetheless not exist without his involvement. So yes, we use the genes to infer common descent, but the point is really the common descent itself, and the genes are just a data source that can potentially mislead us. Sometimes the right answer may be (A,(B,C)) even if most genes say ((A,B),C).
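To put a rough number on "statistically unlikely" in the grandfather example: assuming, as the scenario above does, that each of the 23 chromosomes passed on by the mother is inherited whole, and that each one comes from her mother or her father with equal probability and independently of the others, the chance can be computed directly. This ignores recombination, which shuffles grandparental segments within chromosomes and makes the real probability smaller still.

```python
from fractions import Fraction

# Simplifying assumption: no recombination, so each of the 23 chromosomes
# in the egg is a whole copy from either the maternal grandmother or the
# maternal grandfather, each with probability 1/2.
MATERNAL_CHROMOSOMES = 23

p_none_from_grandfather = Fraction(1, 2) ** MATERNAL_CHROMOSOMES

print(p_none_from_grandfather)         # 1/8388608
print(float(p_none_from_grandfather))  # about 1.19e-07
```

So even under this generous simplification the event is roughly a one-in-eight-million chance, yet the genealogical fact of descent from the grandfather would be exactly the same.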

The "majority vote of the gene trees" approach, however, feels as if its practical concern starts and ends at the pattern shown by the genes, regardless of what the patterns of descent are. To me that feels the wrong way around.

Another way of looking at the issue may be this: If we truly accept the argument made in section 3, that we should look at natural phenomena, consider them to be explananda, and find the most appropriate scientific explanation for each of them, would the logical result not be Hennig's original approach? The phenomenon that a beetle specimen shares more traits with a bee specimen than either shares with a slug specimen has an explanation, and that is that the former two share a much more recent common ancestor from which they inherited the shared traits. We express that reality by grouping the former two into a taxon called 'insects' while leaving the slug out.

The fact that I may easily in some cases share more genetic similarity with somebody born in Italy than with another northern German, however, would most likely be due to the stochastic nature of allele inheritance inside our sexually reproducing species. There is no clade wherein two specimens of humanity - the hypothetical Italian and I - share one and only one most recent common ancestor. Instead, beyond some point in the past we share thousands of ancestral 'specimens' in each generation. Because this is a different biological phenomenon than ((beetle,bee),slug), we need a different approach to classification at that level.
