Friday, April 17, 2015

Wow. It worked

Finally got fastStructure installed a few days ago, so that supersedes my previous post about, well, not being able to install it. Also, it is really, really, really fast, and the clusters it infers seem to make more sense than the last time I used it, which was on a work computer.

So that's good.

Still, I stand by what I wrote: if it needs somebody with my level of computer knowledge to make it work on the fourth attempt and if it only works on Unix/Linux anyway, then the number of end-users will be sharply limited compared to the original STRUCTURE software that comes with Windows executables and a GUI.

And there are a few other issues:

Just like the last time I used it, it infers basically no admixture. All samples but one are assigned to a single population with >99%, and the remaining one is assigned with >95% to one population. And this is a dataset where everything else - morphology, neighbor joining phenogram, old STRUCTURE - screams that one population of our samples is a hybrid swarm. So fastStructure may be of limited use to those studying introgression.

In addition, the output is fairly user-unfriendly. First, to know which of several numbers of clusters (K) to accept, one has to compare the likelihood values for each. The STRUCTURE GUI can display them in a nice table, but when using fastStructure you have to look at each log file individually and perhaps copy-paste the relevant values out into a table of your own making. I wrote a Python script to automate that, but not every user can do that.
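For those who want to automate the log-file part, here is a minimal sketch of what such a script could look like. It is not the original script, and it rests on an assumption that should be checked against your own logs: that each fastStructure log file contains a line of the form "Marginal Likelihood = <number>". The function names and the '*.log' file pattern are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: each log contains a line such as "Marginal Likelihood = -0.9512".
# Adjust the pattern if your log files are worded differently.
LIKELIHOOD_RE = re.compile(
    r"Marginal Likelihood\s*=\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def marginal_likelihood(log_text):
    """Return the last likelihood value reported in a log, or None if absent."""
    matches = LIKELIHOOD_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def likelihood_table(log_dir, pattern="*.log"):
    """Map each matching log file name in log_dir to its likelihood value."""
    table = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        table[path.name] = marginal_likelihood(path.read_text())
    return table
```

Calling likelihood_table on the folder that holds the log files for the different values of K gives one entry per file, which can then be printed or pasted into a spreadsheet for comparison.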

Second, now that you know that K = 6, for example, is your preferred result, you want to see what that population structure looks like. fastStructure comes with a little tool called 'distruct' that helpfully draws one of those typical little STRUCTURE bar plots. Without sample names.

Yes. Without sample names. You can supply it with a little file containing the population names, but how do you know which population should have which name without knowing which samples belong to each? So you could just as well have a toddler draw a random bunch of colourful patches, because that would have the same information content as the distruct output.

So time to go back to the actual output and draw a bar plot with other means. Here we hit another snag. Where STRUCTURE outputs a file like so:
Yoursample1  0.340  0.660  0.000
Yoursample2  0.999  0.001  0.000
Yoursample3  0.000  0.000  1.000
... the equivalent fastStructure output file with the extension meanQ looks like this:
0.340  0.660  0.000
0.999  0.001  0.000
0.000  0.000  1.000
Again, which sample is which? To make sense of the results I really need to know that, don't I? Well, the rows are in the same order as the samples in your input file, but as that is usually going to be a '.str' file with two rows for each sample, it is not a trivial exercise to put the results and the sample names next to each other in a way that you can use for producing the bar plot. Also, the columns are separated by two spaces instead of one tab, making it even harder to copy into an Excel or LibreOffice sheet, but if I remember correctly that was already the case in STRUCTURE.
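The matching-up can be scripted, too. The following is only a rough sketch under two assumptions: that the sample name sits in the first column of the '.str' file, and that each sample occupies exactly two consecutive rows. Both function names are hypothetical. As a side effect it swaps the double spaces for tabs, so the result pastes cleanly into a spreadsheet.

```python
def sample_names(str_lines, rows_per_sample=2):
    """Take the first column of every rows_per_sample-th non-empty line.

    Assumption: the sample name is the first whitespace-separated field.
    """
    rows = [line.split() for line in str_lines if line.strip()]
    return [row[0] for row in rows[::rows_per_sample]]


def label_mean_q(str_lines, meanq_lines, rows_per_sample=2):
    """Prefix each meanQ row with its sample name, tab-separated."""
    names = sample_names(str_lines, rows_per_sample)
    q_rows = [line.split() for line in meanq_lines if line.strip()]
    if len(names) != len(q_rows):
        # A mismatch means the two files do not describe the same samples,
        # or the rows-per-sample assumption is wrong for this input.
        raise ValueError(
            f"{len(names)} sample names but {len(q_rows)} meanQ rows"
        )
    return ["\t".join([name] + q) for name, q in zip(names, q_rows)]
```

Feed it the lines of the '.str' file and the lines of the meanQ file (for example via open(...).readlines()), and it returns rows that look like the old STRUCTURE output, sample names included.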

From a user-friendliness perspective this all looks rather poorly thought through. But again, the program is free and several orders of magnitude faster than STRUCTURE, so one can't complain too much.
