A few weeks ago I added a new data set to my repository. As is now my usual practice, the populations can be found in the .fam file. But I've added more to this release. I have to rewrite my ADMIXTURE tutorial soon, so I thought I would bring up an important issue when interpreting these data sets with clustering methods: one has to understand that conclusions cannot rest on one single result. Rather, one must attempt to ascertain the statistical robustness of the results. If you arrive at an expected result this is obviously not as important a consideration, but if you arrive at a novel and surprising result, then you have to make sure that it isn't simply a fluke.
To do this I have been running my PHYLOCORE data set with cross-validation (regular 5-fold). In theory you should be able to see where the cross-validation error is minimized, and that is your "best" K. But my personal experience with running ADMIXTURE and STRUCTURE is that the inferred plausibility of a given K derived from the statistic can itself be quite volatile. In other words, it is best to run replicates of a data set when attempting to assess robustness. I'm going to run PHYLOCORE 50 times, but I already have 10 runs.
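For those following along at home: ADMIXTURE reports its cross-validation error to standard output in lines of the form `CV error (K=5): 0.52340`. Here is a minimal sketch for pulling those values out of saved log files so they can be compared across replicates (the sample string is illustrative, not a real PHYLOCORE value):

```python
import re

# ADMIXTURE writes one such line per run when invoked with --cv
CV_RE = re.compile(r"CV error \(K=(\d+)\): ([\d.]+)")

def parse_cv_errors(log_text):
    """Return a list of (K, cv_error) pairs found in one ADMIXTURE log."""
    return [(int(k), float(err)) for k, err in CV_RE.findall(log_text)]

sample = "CV error (K=5): 0.52340\n"
print(parse_cv_errors(sample))  # [(5, 0.5234)]
```

Collecting these across log files from each replicate run gives you the points that get plotted below.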
The results are plotted below.
It seems that the best fit to these data is in the 10 to 15 K range. But notice that the results for K < 10 are not very volatile. There are 10 points, but at K = 5, for example, they overlap almost completely. The higher the number of populations the algorithm attempts to infer, the more volatile the cross-validation results become.
Zooming in on the plot you notice that K = 13 not only has the minimum cross-validation error, but also seems to exhibit the least volatility. I suspect that this result will hold, but you never know. The point is not to establish hard and fast rules. It is to be explicit about guidelines for how to interpret results, which can vary considerably depending upon the input parameters you begin with.
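To make the "minimum and least volatile" reading concrete, one way to summarize replicate runs (a sketch of the general idea, not the exact procedure used here; the replicate values are made up for illustration) is to compute the mean and spread of the CV error at each K:

```python
from statistics import mean, stdev

def summarize_runs(cv_by_k):
    """cv_by_k maps K -> list of CV errors from replicate runs.
    Returns {K: (mean_error, std_error)} so both fit and volatility are visible."""
    return {k: (mean(errs), stdev(errs) if len(errs) > 1 else 0.0)
            for k, errs in cv_by_k.items()}

def best_k(cv_by_k):
    """Pick the K with the lowest mean CV error across replicates."""
    summary = summarize_runs(cv_by_k)
    return min(summary, key=lambda k: summary[k][0])

# toy replicate data (invented numbers, for illustration only)
runs = {12: [0.482, 0.485, 0.484],
        13: [0.479, 0.480, 0.479],
        14: [0.481, 0.490, 0.478]}
print(best_k(runs))  # 13
```

With the standard deviations in hand you can also check whether the K that minimizes the mean error is the same one with the tightest spread, as appears to be the case for K = 13 here.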
Addendum: The seed is random, for those who are curious.