A few weeks ago I started looking at the 23andMe raw files of some of my friends and integrating them into HGDP and HapMap population data sets. One of the first things I did was remove the African populations from my total data. The reason is that, as you can see to the left, Africans occupy the largest principal component of variation, which sets them apart from Eurasians. Setting that dimension aside, the non-Africans are squeezed into one dimension, and groups like Oceanians and Amerindians show up in the strangest places. But that’s because these groups are non-African, and do not differ as much along the primary west-east axis of genetic variance which shakes out of any such analysis. Africans aren’t the only issue though. As I’ve noted before I’ve been running ADMIXTURE, and isolated groups such as the Kalash can “monopolize” one particular color. This may be due to the Kalash being some distilled essence of an ancestral population, but I suspect that it’s more that genetic drift due to isolation has made these sorts of groups distinctive. So I removed these outliers…though do note that other “outliers” often pop out of the data to take their place.
Below is a slide show with PCA plots of the 1st component of variance against the 2nd, 3rd, and 4th components. From the 5th onward the lower eigenvectors seem to achieve a level of stability in magnitude. Remember that the plots are not scaled: the 1st PC is about an order of magnitude bigger than the 2nd. I’ve also attached ADMIXTURE plots at K = 12, both for the populations and for the individuals who have given me their 23andMe files. I’ve placed those individuals upon the PCA as well. And yes, ID001 and ID002 are my parents.
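To make the point about unscaled axes concrete, here is a minimal Python sketch of one common convention: weight each PC axis by the square root of its eigenvalue, so that distances on the plot reflect how much variance each component carries. The eigenvalues and sample coordinates below are made-up placeholders, not my actual EIGENSOFT output, and the function names are hypothetical.

```python
import math


def variance_share(eigenvalues):
    """Return each eigenvalue as a fraction of the sum of the eigenvalues given.

    Note: this is the share among the listed PCs only, not of total variance.
    """
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def scale_coordinates(rows, eigenvalues):
    """Multiply each sample's coordinate on PC k by sqrt(eigenvalue k)."""
    weights = [math.sqrt(ev) for ev in eigenvalues]
    return [[x * w for x, w in zip(row, weights)] for row in rows]


# Placeholder values for illustration only (assumption, not real results):
# the 1st eigenvalue is an order of magnitude bigger than the 2nd.
eigenvalues = [40.0, 4.0, 2.0, 1.0]
samples = [
    [0.10, -0.20, 0.05, 0.00],   # hypothetical individual A, PC1..PC4
    [-0.10, 0.20, 0.00, 0.05],   # hypothetical individual B, PC1..PC4
]

shares = variance_share(eigenvalues)
scaled = scale_coordinates(samples, eigenvalues)
```

After scaling, a given offset along PC1 counts for more than the same offset along PC2, which is what the unscaled plots hide.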
As you can see, I’ve color coded the population groups. Europeans are red, Middle Easterners are blue, South Asians brown, and East Asians purple. Initially I assumed I’d made an error when I saw that the Russians and northwest Europeans were adjacent to a Middle Eastern cluster, but it goes to show you what happens when you remove African variance. Much of the Middle Eastern distribution in the conventional HGDP PCAs seems to be due to genetic relatedness with Sub-Saharan Africa. That’s not in this plot, so Middle Easterners form a relatively tight cluster. Not only that, but there’s sometimes a weird connection between northern European populations and groups closer to the heart of Eurasia. I think that’s why Orcadians and my friends tend to be shifted in that direction, while Sardinians, Basques, French, and even Tuscans are “more European.” This weird pattern is especially evident in ID004, who is by and large a vanilla white American of Germanic heritage, but always seems to carry a trace but non-trivial element which connects him to Central and South Eurasian groups.
Finally, note that my parents tend to cluster together in all the higher PCs, not just 1 & 2. This stuff isn’t totally arbitrary.
Note: I put the raw PCA results generated by EIGENSOFT here in csv. I would caution that plotting with a conventional desktop spreadsheet might be a touch computationally intensive, but you’re welcome to try.
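If a spreadsheet chokes on the file, a short script is a lighter way to get a first look. The sketch below uses only the Python standard library to average each population's position on the first two PCs. The column names (`id`, `population`, `PC1`, `PC2`) and the file name in the comment are assumptions for illustration; check the header of the actual csv and adjust accordingly.

```python
import csv
import io
from collections import defaultdict


def population_centroids(handle, pcs=("PC1", "PC2")):
    """Average the chosen PC columns within each population label.

    Assumes a header row with a 'population' column plus the PC columns
    named in `pcs` (hypothetical names, adjust to the real file).
    """
    sums = defaultdict(lambda: [0.0] * len(pcs))
    counts = defaultdict(int)
    for row in csv.DictReader(handle):
        pop = row["population"]
        for i, pc in enumerate(pcs):
            sums[pop][i] += float(row[pc])
        counts[pop] += 1
    return {pop: tuple(s / counts[pop] for s in sums[pop]) for pop in sums}


# Tiny in-memory stand-in with made-up numbers, not the real results.
demo = io.StringIO(
    "id,population,PC1,PC2\n"
    "s1,A,0.5,-0.25\n"
    "s2,A,1.5,0.25\n"
    "s3,B,2.0,4.0\n"
)
centroids = population_centroids(demo)

# With a real file (hypothetical name):
# with open("pca_results.csv", newline="") as fh:
#     centroids = population_centroids(fh)
```

Centroids are only a crude summary, since they hide the spread within each group, but they are enough to check that the clusters land roughly where the plots put them.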