A few weeks ago I put a new data set into my repository. As is my usual practice now, the populations can be found in the .fam file. But I’ve added more to this one. I have to rewrite my ADMIXTURE tutorial soon, so I thought I would bring up an important issue when interpreting these data sets using clustering methods: one has to understand that conclusions cannot rest on a single result. Rather, one must attempt to ascertain the statistical robustness of the results. If you arrive at an expected result this is obviously not as important a consideration, but if you arrive at a novel and surprising result, then you have to make sure that it isn’t simply a fluke.
To do this I have been running my PHYLOCORE data set with cross-validation (regular 5-fold). In theory you should be able to see where the cross-validation error is minimized, and that is your “best” K. But my personal experience with running ADMIXTURE and STRUCTURE is that the inferred plausibility of a given K derived from the statistic can itself be quite volatile. In other words, it is best to run replicates of a data set when attempting to assess robustness. I’m going to run PHYLOCORE 50 times, but I already have 10 runs.
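For anyone who wants to do the same bookkeeping, here is a minimal sketch in Python of how replicate runs might be summarized. It assumes each replicate was run with ADMIXTURE's cross-validation option (e.g. `--cv=5`) under a different random seed (`-s`), and that the log contains lines of the form `CV error (K=3): 0.52305`. The function names and the toy log strings are my own illustration, and the numbers are made up; they are not PHYLOCORE results.

```python
import re
from collections import Counter, defaultdict
from statistics import mean, stdev

# Assumed log line format, e.g. "CV error (K=3): 0.52305"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV line found in one replicate's log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def summarize_replicates(replicates):
    """Summarize a list of {K: cv_error} dicts, one dict per replicate.

    Returns (summary, best_counts):
      summary[K]     = (mean CV error, standard deviation) across replicates
      best_counts[K] = number of replicates in which K had the lowest CV error
    """
    by_k = defaultdict(list)
    for rep in replicates:
        for k, err in rep.items():
            by_k[k].append(err)

    summary = {}
    for k in sorted(by_k):
        errs = by_k[k]
        sd = stdev(errs) if len(errs) > 1 else 0.0
        summary[k] = (mean(errs), sd)

    # For each non-empty replicate, find the K that minimizes the CV error.
    best_counts = Counter(min(rep, key=rep.get) for rep in replicates if rep)
    return summary, best_counts


if __name__ == "__main__":
    # Toy, made-up log excerpts standing in for three replicate runs.
    toy_logs = [
        "CV error (K=2): 0.50\nCV error (K=3): 0.48\n",
        "CV error (K=2): 0.49\nCV error (K=3): 0.47\n",
        "CV error (K=2): 0.46\nCV error (K=3): 0.51\n",
    ]
    reps = [parse_cv_errors(text) for text in toy_logs]
    summary, best_counts = summarize_replicates(reps)
    for k, (avg, sd) in summary.items():
        print(f"K={k}: mean CV error {avg:.4f}, sd {sd:.4f}, best in {best_counts[k]} runs")
```

The point of tallying how often each K "wins" is exactly the volatility noted above: if the minimum jumps between values of K from run to run, no single run's "best" K should be taken at face value.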
The results are plotted below.
Last month I noted that a paper making speculative inferences about the phylogenetic origins of Australian Aborigines was hampered in the force of its conclusions by the fact that the authors didn’t release the data to the public (more accurately, to their peers). There are likely political reasons for this with regard to Australian Aborigine data sets, so I don’t begrudge them this (Well, at least not too much. I’d probably accept the result more myself if I could test drive the data set, but I doubt they could control the fact that the data had to be private). This is why when a new paper on a novel phylogenetic inference comes out I immediately control-f to see if they released their data. With regard to genome-wide association studies on medical population panels I can somewhat understand the need for closed data (even though anonymization obviates much of this), but I don’t see this rationale as relevant at all for phylogenetic data (if concerned one can remove particular functional SNPs).
By this time I’m sure you’ve encountered articles about the reconstructed last common ancestor of all placental mammals. Greg Mayer at Why Evolution is True has an excellent review of the implications, along with a link to a moderately skeptical piece by Anne Yoder in Science. Yoder’s piece is titled Fossils vs. Clocks, while the original paper is The Placental Mammal Ancestor and the Post–K-Pg Radiation of Placentals. The results clearly support the “Explosive Model” in the figure to the left for the origination of placentals. That might prompt the thought: “isn’t this what we knew all along?”
The standard story for the last generation in the popular imagination is that a massive asteroid impact was the direct cause of the extinction of all dinosaurs (and of course a host of other groups) except the lineage which we now term birds. And yet it turns out that there is actually some debate about this, though at least in some form it seems likely that the impact is going to be important (see this Brian Switek piece for exploration of this issue, and the general opinion of the scientific literature as of now). The second aspect to focus on is timing. Contrary to the intuition of many, over the past 20 years molecular phylogenetics has inferred a very definite (on the order of tens of millions of years) pre-K-T boundary coalescence for the common ancestors of the distinct mammalian lineages. A plausible explanation for this is that these lineages diversified through allopatry, as the Mesozoic supercontinent fragmented. Morphological diversification of these mammalian lineages also may have occurred after the K-T event.
To understand nature in all its complexity we have to cut the riotous variety down to size. For ease of comprehension we formalize with math, verbalize with analogies, and visualize with representations. These approximations of reality are not reality, but when we look through the glass darkly they give us filaments of essential insight. Dalton’s model of the atom is false in important details (e.g., atoms turn out to be divisible into subatomic particles, and protons and neutrons in turn are made of quarks), but it still has conceptual utility.
Likewise, the phylogenetic trees popularized by L. L. Cavalli-Sforza in The History and Geography of Human Genes are still useful in understanding the shape of the human demographic past. But it seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.
I put up kind of a ridiculous title. But I do hope that at some point in the near future we’ll have some of the same flavor of debates on the macroevolutionary time scale that we have on the human microevolutionary time scale. There’ll be a surfeit of sequence at nearly every node of interest on the tree of life, and computational power galore devoted to analyzing variation and reconstructing any phylogeny we can conceive of. To be fair, one could argue we aren’t there with human phylogenetics either. But it is rather strange that we’re still debating the origin of mammals and the nature of the lineage’s phylogenetic tree at this time. This is the kind of thing that I hope a more robust and assertive molecular phylogenetics can resolve (and paleontology as well, but I’m not up on the latest in computational analysis of morphological characters).