Personal genomics in two dimensions

By Razib Khan | February 7, 2011 2:09 am

The Pith: In this post I take a different tack at genetic data visualization. Instead of bar plots, I show how genetic relationships can be explored using two dimensional spaces.

Last week I suggested that in some ways I had hit a wall of sharply reduced diminishing returns when it came to understanding my family’s ancestry. It seems rather clear that we’re basically South Asians with a minority East Asian component of ancestry. That being said, I’m still happy to be involved with the Harappa Ancestry Project. I might have a good sense of my own family’s background, but Bengal remains extremely undersampled. An N = 2 is better than none. And, there is still something to be said for understanding how your own lineage is distinctive, or not, in its local geographic context. More specifically I am still curious as to the exact nature of the genetic origins of the Bengali people, especially those on the eastern “marches” (where my family comes from).

But there are still interesting things you can do with the genetic material which I have access to. I converted the autosomal SNPs of the raw 23andMe files with PLINK, and then ran ADMIXTURE with the HapMap Gujaratis, Tuscans, and American Chinese (so probably disproportionately Cantonese). I included my parents, and, two white American friends of western European heritage. My mother is RM, my father is RF, and I will refer to my friends as Euro 1 and Euro 2 . With K = 3, so three putative ancestral populations, you see that the Chinese and Tuscans are disjoint, while the South Asians form a more mixed picture. Euro 1 shows East Asian, and has more than 1% in 23andMe. I assume that the “South Asian” for Euro 2 is just an artifact of the fact that the categories only roughly corresponded to ancestral groups, and South Asians share enough with Europeans genetically to produce this sort of spurious overlap. If Reconstructing Indian population history is correct, then the “South Asian” component is really an ancient admixture of a European-like population, and a South Eurasian group which has distant, but closer, affinities with East Asians.

But when it comes to personal family history until we get some samples from Burma I think I might be near the end of the line using a program like ADMIXTURE. But how about PCA? EIGENSOFT takes pedigree formats, which I had for ADMIXTURE. Remember, PCA is pulling out the largest independent components of variance in the data sequentially rank ordered by the proportion of variance explained by each dimension. So generally you focus on dimensions 1 and 2, since these are the most significant. Positions in each dimensions can nicely be translated into a two dimensional plot.

First, let’s look at Gujaratis, Tuscans, and American Chinese, along with my parents and my two friends. Keep in mind that the x axis is has a magnitude greater than the y axis by a factor of 3.

Isn’t it interesting how well these PCA’s often correspond to geography? The x axis spans Eurasian west-east, while the y-axis spans it north-south. Euro 1 is shifted further toward the Asians, just as ADMIXTURE (and 23andMe’s ancestry painting) would imply. While Euro 2 is shifted toward South Asians a bit (the results might have been different if I used the Utah white sample, but I think the Tuscans need a bit of love at this point). Notice that my parents are outside of the main South Asian cluster of Gujaratis. Additionally, you can see two clusters of Gujaratis. One tight cluster, and another which is more diffuse and whose linear topology is suggestive of differential admixture with the European-like element.

Now let’s constrain the sample to TSI + South Asians. This is not just a zoom-in on that segment of the chart, rather, the dimensions are re-generated using the narrower data set. Please note that the x axis has a magnitude 6 times greater than the y axis.

The distinctiveness of two Gujarati clusters in the HapMap data set has been found by others, so it’s not some error. I do find it interesting that my mother is in the tighter cluster, #1, while my father is in cluster #2.

Now let’s swap the TSI for the CHD, the American Chinese sample from Denver. The x axis is about an 10 times greater in magnitude.

Remember how I always say not to assume that the clusters on PCA plots have some concrete and invariant reality? You see that here. On the world wide plot my parents’ clearly are shifted toward the Chinese, but you can’t see it here, when Tuscans are removed. What’s going on? First, remember that these are the two largest dimensions. Only a small proportion of the total variance. Additionally, yanking the Tuscans may have shifted the rank order of the dimensions too. So let’s run this again and plot the top three dimensions.

It turns out that dimensions 2 and 3, axes y and z, are very close in magnitude (both 10% of dimension 1). I added labels so you can see where my parents are, with G = Gujarati, and C = Chinese.

And now, three dimensions:

As you can see, the Chinese simply do not vary much, and are a tight cluster. But, there is a somewhat equivalent Gujarati cluster too! The HapMap sample was collected from Gujaratis in Houston. To me, it looks like that Houston population can be divided into two groups: one of the tight cluster, and the rest of the population, which is all over the place. Despite my parents’ obvious East Asian affinities, their South Asian variance seems to put them into two different Gujarati clusters. I don’t put much stock in this. Remember these are two dimensions of variation. What’s more interesting is to try and understanding what’s going on with Houston Gujaratis. Anyone in the audience know?

Note: I’ll go into more detail about how I did this later. But, do know that PCA goes much quicker than ADMIXTURE. I had some issues with getting the plotting functions working, so I just took the output and did everything in Open Office aside from the last plot, which I obviously did in R.

CATEGORIZED UNDER: Anthroplogy, Genetics, Genomics
MORE ABOUT: Personal genomics
ADVERTISEMENT

Comments (6)

  1. Ian

    Wondering…if the Gujaratis form two groups, and your parents (mostly) end up associated with one group or the other…might this be somehow related to the ANI/ASI split? In other words, might the separation of these groups (and your parents) represent on one hand a cluster of linked ANI-derived genes, and on the other a cluster of ASI-derived genes?

  2. I find the idea of genetics and tracking population genome very fascinating. However, my education was not in biology so I find it a bit hard to understand. I am willing to read though 🙂 Might you have pointers to a few primer sites? Maybe you’d consider writing a introduction to understanding population genetics yourself.

  3. Do you have any experience with SFA (http://people.cs.uchicago.edu/~engelhardt/pubs/engelhardt-stephens-plosgen-2010.pdf )?
    I have been curious about it, but have not yet used it. Wondered what your thought were.

  4. Wondering…if the Gujaratis form two groups, and your parents (mostly) end up associated with one group or the other…might this be somehow related to the ANI/ASI split? In other words, might the separation of these groups (and your parents) represent on one hand a cluster of linked ANI-derived genes, and on the other a cluster of ASI-derived genes?

    the ANI/ASI % is not that different between the two. evident in the PCA, but see dienekes’ post. the “south asian” has ANI & ASI, remember. but yeah, it looks like #1 has more ASI proportionally. though it might be because #2 has mixture with later ANI-like groups. to be explicit, i think cluster #1 is a particular caste from gujarat. i think cluster #2 is more variegated, just people who speak gujarati.

NEW ON DISCOVER
OPEN
CITIZEN SCIENCE
ADVERTISEMENT

Discover's Newsletter

Sign up to get the latest science news delivered weekly right to your inbox!

Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm a American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com

ADVERTISEMENT

See More

ADVERTISEMENT

RSS Razib’s Pinboard

Edifying books

Collapse bottom bar
+