Who are those Houston Gujus?

By Razib Khan | February 14, 2011 3:38 pm

The figure to the left is a three dimensional representation of principal components 1, 2, and 3, generated from a sample of Gujaratis from Houston, and Chinese from Denver. When these two populations are pooled together the Chinese form a very homogeneous cluster. They don’t vary much across the three top explanatory dimensions of genetic variance. In contrast, the Gujaratis do vary. This is not surprising. In the supplements of Reconstructing Indian population history it was notable that the Gujaratis did tend to shake out into two distinct clusters in the PCAs. This is a finding you see over and over when you manipulate the HapMap Gujarati data set. In reality, there aren’t two equivalent clusters. Rather, there’s one “tight” cluster, which I will label “Gujarati_B” from now on in my data set, and another cluster, “Gujarati_A,” which really just consists of all the individuals who fall outside the Gujarati_B cluster. Even when compared to other South Asian populations these two distinct categories persist in the HapMap Gujaratis.
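To make the “one tight cluster plus everyone else” idea concrete, here is a minimal sketch of how such a split could be done on PC1/PC2 coordinates. It is not the rule actually used for this post, which isn't spelled out; the function name `assign_clusters`, the cutoff multiplier `k`, and the toy coordinates are all hypothetical. It also assumes the tight cluster holds the majority of individuals, so that the median point lands inside it.

```python
import math
from statistics import median


def assign_clusters(pcs, k=3.0):
    """Label (PC1, PC2) points as 'Gujarati_B' (dense core) or 'Gujarati_A' (the rest).

    Assumption: the tight cluster is the majority, so the coordinate-wise
    median falls inside it. `k` is an arbitrary illustrative cutoff.
    """
    # Robust center: coordinate-wise median of all individuals.
    cx = median(p[0] for p in pcs)
    cy = median(p[1] for p in pcs)
    # Euclidean distance of each individual from that center.
    dists = [math.hypot(x - cx, y - cy) for x, y in pcs]
    # Anything farther than k times the median distance is "outside".
    cutoff = k * median(dists)
    return ["Gujarati_B" if d <= cutoff else "Gujarati_A" for d in dists]


# Toy data (made up, not HapMap): a 3x3 grid standing in for the tight
# cluster, plus five points strung out along a line away from it.
tight = [(x, y) for x in (-0.001, 0.0, 0.001) for y in (-0.001, 0.0, 0.001)]
spread = [(0.01 * i, 0.005 * i) for i in range(1, 6)]

labels = assign_clusters(tight + spread)
print(labels.count("Gujarati_B"), labels.count("Gujarati_A"))
```

With real data you would feed in the PC1/PC2 values from the PCA output and eyeball the result against the plot; the cutoff is a judgment call.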

Zack has already identified a major difference between the two clusters: Gujarati_A has some individuals with much more “West Eurasian” ancestry. To be more formal about this in the future I simply assigned individuals in my merged data set to one of the two Gujarati clusters based on their position in the first two PCs. Last night I ran ADMIXTURE from K = 2 to K = 10, with 75,000 SNPs. I also removed the Native American groups, and added more European and East Asian samples from the HapMap. Below are some populations at K = 4:

Let’s drill down to the level of individuals. Here are the Gujarati individuals, along with Sindhis, and my parents (Bengali). I’ve sorted by the “European” and then “South Asian” components (light blue and green respectively, while purple is modal in Papuans and red in East Asians):

The ADMIXTURE plots are in total alignment with the PCA. In the PCA Gujarati_A exhibit a spectrum of distance from the European cluster, and in the ADMIXTURE you see the same. In contrast, Gujarati_B is relatively uniform. So what’s going on? I will be posting something similar over at Sepia Mutiny soon. But my guess is that Gujarati_B are a subset of Patels. In other words, they’re a genetically distinct jati. I suspect that Gujarati_A are a more diverse bunch from a number of different jatis.

Does this matter? I believe it does. If Gujarati_B are a distinct ethno-social group which is a subset of Gujaratis, then they may not be as good a proxy for South Asian medical genetics as Gujarati_A. More concretely, Gujarati_B may carry otherwise rare disease alleles at relatively high frequency because they’re an inbred clan. In contrast, while Gujarati_A may exhibit all the hallmarks of South Asian endogamy, if they’re drawn from a larger number of different groups, then they’ll have all sorts of different rare alleles. The ones they have in common may be more generally South Asian.


Comments (5)

  1. Ezequiel

    /The figure to the left is a three dimensional representation/

    It is not. It is a two dimensional representation. Basically we are getting PC1+0.8*PC2 and PC3+0.2*PC2. You could have used dot size, colour (either hue, saturation, brightness…), lines to the ground plane (as I have seen done in galaxy location maps) or some other way of simulating the third dimension. I personally think that a double view, say (PC1, PC2) and (PC1, PC3) side by side, would be a nice way to put it, kind of like a floor and elevation architectural drawing, or a technical drawing in multiview orthographic projection (http://en.wikipedia.org/wiki/Multiview_orthographic_projection).

    With some training, probably someone can learn to look at a properly labeled (PC1, PC2) and (PC3, PC4) and visualize it as if it was a 3D movie.

    I believe that there is a lot of work to do in managing the large amount of dimensions in genomic information… The basic world (PC1, PC2) graph looks a lot like it is rotated a few degrees. I have a strong hunch that that means something.

  2. Ezequiel

    Sheesh. Forgot to put the [smartass on]/[smartass off] signs around the first two phrases of the first post… 🙂

    (Maybe it is a three dimensional representation in your computer screen. Do you have one of those fancy new 3D laptops?)

  3. Perahu

    Does the HapMap project have any plans on expanding their sample populations?

  4. #3, page down: http://www.1000genomes.org/about

    i hope/assume much of the data will be released online.

  5. lines to the ground plane (as I have seen done in galaxy location maps)

    yeah. i have done this before….




