David of the Eurogenes Genetic Ancestry Project has a cautionary post up, “When is a genetic map also a geographic map? Always and never.” In it, he uses one peculiar pattern as a launching point for a broader exploration of the relationship between visualizations of genetic variation and geography. That pattern is that Russians, the easternmost of European peoples, are closer to the Slavs of Central Europe than the Balts are when plotted on the two largest dimensions of variation. I’ve highlighted this pattern on a PCA David extracted from a paper on northeast European genetics. This disjunction between geography and genetics has a pretty straightforward possible explanation: the current distribution of Russian-speaking peoples is a function of a massive demographic expansion to the east by Slavic farmers within the last 2,000 years. We already know that the borderlands between the steppe and the forest were long dominated by North Iranian peoples, from the Scythians to the Sarmatians, while further north the Great Russians absorbed a Finnic substrate (clear because some of the absorption is attested down to the early modern period).
With that duly noted, I think there’s definitely something to be gained from more rigorously estimating the deviations from expectation when one attempts to generate a correspondence between a PCA and a geographic map. What I’m imagining is that you simply enter the positions of various ethnic groups on a real map, then overlay the PCA, with its ethnic labels, on top of that map and shift it until you maximize the correlation between the two sets of positions. When the correlation is maximized, stop, and note where the greatest deviations from expectation are. Taking the example above, a vast swath of eastern Europe would show up as a major deviation. Some of these peculiarities will be due to geography. The chasm between Africans and non-Africans will probably be greater than one would expect as a function of distance, but the intervening Sahara presents itself as a good cause. But when you look at the genetic data, sometimes strange and unexpected correspondences emerge. If one can’t immediately spot a reason, then that bears further investigation.
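For what it’s worth, the fitting step I’m describing is close to what statisticians call a Procrustes analysis. Here is a minimal Python sketch of the idea, not anything David or the paper actually did. It assumes two PCA dimensions per group, treats longitude and latitude as flat x/y coordinates (a crude assumption over large areas), and allows a mirror flip because the sign of a PCA axis is arbitrary. The function names and the toy coordinates are invented purely for illustration.

```python
import math

def centre(points):
    """Subtract the centroid; return the centred points and the centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(x - mx, y - my) for x, y in points], (mx, my)

def fit_similarity(src, dst):
    """Rotate and uniformly scale centred src to best match centred dst
    in the least-squares sense (closed form for two dimensions)."""
    a = sum(x * u + y * v for (x, y), (u, v) in zip(src, dst))
    b = sum(x * v - y * u for (x, y), (u, v) in zip(src, dst))
    norm = sum(x * x + y * y for x, y in src)
    theta = math.atan2(b, a)            # optimal rotation angle
    scale = math.hypot(a, b) / norm     # optimal uniform scale
    c, s = scale * math.cos(theta), scale * math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in src]

def align_pca_to_map(pca, geo):
    """Overlay PCA positions on map positions, trying both mirror images."""
    pca_c, _ = centre(pca)
    geo_c, (gx, gy) = centre(geo)
    candidates = []
    for flip in (1.0, -1.0):            # PCA axis signs are arbitrary
        fitted = fit_similarity([(x, flip * y) for x, y in pca_c], geo_c)
        sse = sum((fx - u) ** 2 + (fy - v) ** 2
                  for (fx, fy), (u, v) in zip(fitted, geo_c))
        candidates.append((sse, fitted))
    _, best = min(candidates, key=lambda cand: cand[0])
    return [(x + gx, y + gy) for x, y in best]

def rank_deviations(labels, pca, geo):
    """Distance between each group's fitted PCA position and its map
    position, largest first."""
    fitted = align_pca_to_map(pca, geo)
    out = [(label, math.hypot(fx - u, fy - v))
           for label, (fx, fy), (u, v) in zip(labels, fitted, geo)]
    return sorted(out, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Made-up coordinates for hypothetical groups, not real data.
    labels = ["group_a", "group_b", "group_c", "group_d", "group_e"]
    geo = [(10.0, 50.0), (20.0, 52.0), (25.0, 57.0), (30.0, 50.0), (38.0, 56.0)]
    pca = [(-0.04, 0.01), (-0.01, 0.00), (0.03, 0.03), (0.01, -0.02), (0.00, 0.00)]
    for label, dev in rank_deviations(labels, pca, geo):
        print(label, round(dev, 2))
```

The groups at the top of the ranked list are the ones whose genetic position sits furthest from where the map says they should be, which is exactly the list of cases I’d want to look into. A serious version would use great-circle distances or a proper map projection rather than raw degrees.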
As I’ve given this some thought, I guess I should admit that I’ve fiddled with R’s mapping functions, and also looked for other applications. But the labor input is such that I’ve put off getting deeper into this topic. I’d be curious if anyone else was interested in this sort of intersection between genetic and geographic data visualization. I think maps are pretty much informational gold.