To the left is a PCA from The History and Geography of Human Genes. If you click it you will see a two-dimensional plot with population labels. How were these plots generated? In short, they are visual representations of a matrix of genetic distances (generally FST), which L. L. Cavalli-Sforza and colleagues computed from classical autosomal markers. The distances measure how much populations differ from one another genetically. The unwieldy matrix tables can be visualized as a neighbor-joining tree, or as a two-dimensional plot like the one you see here. But that’s not the end of the story.
In the past ten years, with the arrival of high-density SNP-chip arrays, these plots can often illustrate the position of an individual rather than just the relationships between populations (the methods differ, from principal components analysis or principal coordinates analysis to multi-dimensional scaling, but the outcomes are much the same).
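To make the individual-level version concrete, here is a minimal sketch in Python with NumPy. It is not what Eigensoft or 23andMe actually run. It assumes a matrix of 0/1/2 allele counts (individuals by SNPs), scales each SNP by its allele frequency in the way I understand smartpca-style methods to do, and takes the top components from an SVD. The function name `genotype_pca` and the simulated two-population data are made up for illustration.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of individuals from an (n_individuals, n_snps) matrix of 0/1/2 allele counts."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0              # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)              # drop monomorphic SNPs (zero variance)
    g, p = g[:, keep], p[keep]
    # Center each SNP and scale by sqrt(p(1-p)); assumed normalization, see lead-in
    z = (g - 2 * p) / np.sqrt(p * (1 - p))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]   # one row of coordinates per individual

# Toy data (hypothetical): two populations whose allele frequencies have drifted apart
rng = np.random.default_rng(0)
n_snps = 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, n_snps))
pop_b = rng.binomial(2, freq_b, size=(20, n_snps))

coords = genotype_pca(np.vstack([pop_a, pop_b]))
```

Each row of `coords` is one individual's position on the first two components, which is the kind of point that ends up on these plots.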
Longtime readers know that I have a fixation on people not taking PCA too literally as something concrete. Tonight I finally merged the HGDP data set with some of the HapMap ones I’ve been playing with, and tacked my parents onto the sample. I took the ~50 HGDP populations, added the Tuscans, the two Kenyan groups, and the Gujaratis, and merged them. I thinned the marker set to 105,000 SNPs (I had to flip the HGDP strand too). Then I just let Eigensoft do its magic, and two hours later I had produced my own plot. I’m still getting the hang of the labeling issues, but first let’s look at what 23andMe produces (I’m green):
I suspect that the gap between my parents and the main South Asian cluster is just an artifact of the lack of South and East Indians in the sample. Additionally, things would look different if I removed the Africans, since the first principal component would be freed up. More on that later. All in all, still pretty awesome that circa 2011 this sort of thing is just an evening’s concentration.
I have noted a few times that one thing you have to be careful about in two-dimensional plots which show genetic variance is that the dimensions onto which the data are projected are often generated from the data itself. So adding more data can change the spatial relationships of previous data points. Additionally, in 23andMe’s global similarity advanced plot you are projected onto the dimensions generated from the HGDP data set. There are some practical reasons for this. First, it’s computationally intensive to recalculate components of variance every time someone is added to the data set. Second, it isn’t as if the ethnic identity of any given individual is validated. What would you do if an alien sent in a kit and spuriously put “French” as their ancestry?
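The difference between projecting onto fixed dimensions and recomputing them can be sketched in a few lines of Python. This is a toy illustration on random numbers, not 23andMe's pipeline; `fit_axes` and `project` are hypothetical helper names, and the "reference panel" is just Gaussian noise standing in for something like HGDP.

```python
import numpy as np

def fit_axes(reference, n_components=2):
    """Learn the mean and top principal axes from a reference panel only."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(samples, mean, axes):
    """Place samples on axes that were learned elsewhere; the axes do not change."""
    return (samples - mean) @ axes.T

rng = np.random.default_rng(1)
reference = rng.normal(size=(50, 200))          # stand-in for a fixed reference panel
newcomer = rng.normal(loc=0.5, size=(1, 200))   # a new customer, or an alien

# Option 1: fixed axes. The reference points stay exactly where they were.
mean, axes = fit_axes(reference)
ref_before = project(reference, mean, axes)
new_xy = project(newcomer, mean, axes)

# Option 2: recompute the axes with the newcomer included. Everyone can move.
mean2, axes2 = fit_axes(np.vstack([reference, newcomer]))
ref_after = project(reference, mean2, axes2)
```

Comparing `ref_before` with `ref_after` shows the reference individuals shifting once the newcomer is allowed to influence the axes. Because this toy data has no real structure the shift is exaggerated, but the same caveat applies to real plots.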
So, in reply to this comment: “Let me rephrase: is there any difference when you switch to the world-wide plot? I imagine not, or you would’ve mentioned it.” Actually, there is a slight difference. Below on the right you have a “world view,” with my position being marked with green, and on the left a “zoom in” for Central/South Asia in the HGDP data set.
I don’t mean to bring up a tangential point to the post, but why does the field of human genetics use PCA to visualize relationships? When I see plots like those shown here that have a ‘geometric pattern’ to them (the sharp right angles; another common pattern is a Y-shape), that tells me that there are lots of samples with zeros for many of the Y-variables (i.e., alleles that are unique to certain populations). Thus, the spatial arrangement of the points is largely an artifact of an inappropriate method: how does one calculate a correlation matrix when many of the things one is correlating have values of zero?
If one really was keen on using PCA, one could calculate a pairwise distance matrix and then use that instead of the correlation matrix (Principal Coordinates Analysis).
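For readers who have not met it, the commenter's suggestion can be sketched as classical multidimensional scaling: double-center the squared distance matrix and take its top eigenvectors. The Python below is a minimal illustration with a hypothetical function name `pcoa` and four made-up points; a real FST matrix need not be perfectly Euclidean, which is why negative eigenvalues are clipped to zero here.

```python
import numpy as np

def pcoa(dist, n_components=2):
    """Principal coordinates (classical MDS) from a symmetric pairwise distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(b)               # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    vals = np.clip(vals[order], 0, None)         # guard against non-Euclidean distances
    return vecs[:, order] * np.sqrt(vals)

# Toy check (made-up points): Euclidean distances between four points on a plane
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

coords = pcoa(dist)
recovered = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
```

With plain Euclidean distances the recovered coordinates reproduce the original distances, and the result matches PCA on the underlying data up to rotation and reflection. The two approaches only part ways when a different distance measure is chosen.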
Since I know some human geneticists do read this weblog, I thought it was worth throwing the question out there.