A new paper in Science, Mapping the Origins and Expansion of the Indo-European Language Family, claims to have ascertained the locus of origin of the Indo-Europeans. These are bold claims, and naturally they have triggered a firestorm. No surprise; the same happened when these researchers published the result in 2003 that Proto-Indo-European flourished ~9,000 years ago, in alignment with an “Anatolian hypothesis,” as opposed to a “Steppe/Kurgan hypothesis.” The original 2003 paper took phylogenetic methods which are common within biology and applied them to linguistics. This second paper now incorporates spatial information into the model, to generate an explicit locus of origination, in addition to the dates of the bifurcations at each node.
As for the results, I think that the figure to the left is the most important, because it gives us their inferred dates of separation between various Indo-European language families. Observe that Italic and Celtic did not diverge in prehistory, but in history (i.e., the Sumerians and Egyptians were flourishing at the time). Additionally, the diversification pattern is not a simple “rake”; there is internal structure. They may date the origin of Indo-European languages to the early Holocene, but the diversification seems to have happened in steps and pulses. Though the authors support the Anatolian hypothesis, they also seem quite comfortable acknowledging that the real story is more complex, though you wouldn’t get that from the media.
School girls in Hunza, Pakistan
A few days ago I observed that pseudonymous blogger Dienekes Pontikos seemed intent on throwing as much data and interpretation into the public domain via his Dodecad Ancestry Project as possible. What are the long-term implications of this? I know that Dienekes has been cited in the academic literature, but it seems more plausible that this sort of project will simply distort the nature of academic investigation. “Distort” has negative connotations, but the effect need not be deleterious at all. Academic institutions have legal constraints on what data they can use and how they can use it (see why Genomes Unzipped started). Not so with Dienekes’ project. He began soliciting data ~2 months ago, and Dodecad has already yielded a rich set of results (granted, it would not be possible without academically funded public domain software, such as ADMIXTURE). Even if researchers don’t cite his results (and no doubt some will), he’s reshaping the broader framework. In other words, he’s implicitly updating everyone’s priors. Sometimes it isn’t even a matter of new information, so much as putting a spotlight on information which was already there. Below is a slice of a bar plot from Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. It uses STRUCTURE with K = 7. To the right of the STRUCTURE slice are two plots of individual data on French and French Basque from the same HGDP data set, using ADMIXTURE at K = 10, from Dodecad.