Dienekes has a post up where he highlights the fact that the recent paper on South Asian metabolic diseases has a figure which elucidates population structure within the region. Accounting for structure is important for genome-wide association studies, since you might get spurious correlations if trait value or disease frequency is simply tracking cryptic population variation. Dienekes says:
The existence of two clusters is kind of obvious, while their interpretation is not as dots of the same color appear in both clusters: a placement of these individuals in a global context might have been useful here. Things are clearer at the top cluster which shows a clear gradient anchored by Punjabi Sikh and Hindu Tamils on either end.
Also of interest is the group of isolated Muslim/Christian individuals on the left which deviate strongly from the mainstream; these probably represent exogenous elements that don’t resemble the bulk of the Indian population.
The second issue is easily addressed. The Christian outliers both give English as their native language. That suggests to me that they’re Anglo-Indian, a community of mixed South Asian and European origin. South Asian Muslims are overwhelmingly of indigenous origin. But a minority of the Muslim elite are West Asian, or have substantial West Asian ancestry, as is evident from the fact that they look white. Benazir Bhutto’s mother was of Kurdish and Persian ethnic background (her family was from Esfahan in Iran). I’ve reedited the religious & linguistic PC plots to fit onto the screen.
So what’s going on with the cluster which extends along the second principal component? The first component is probably just a European/West Asian-South Asian axis of variation. But I don’t understand where the variation for the second is coming from. Observe that the one South Indian group, Tamil speakers, are not represented in the secondary cluster. The plot reminded me of something I saw last fall.
Below is figure S4 from the supplements of Reconstructing Indian population history. I added some labels. The Indian cluster is tight when the genetic variation includes non-Indian groups. But when you constrain the variation to Europeans and South Asians only, something strange happens:
The Gujarati sample is from Houston, and is from HapMap Phase 3. I have a suspicion that the secondary cluster among the Gujaratis here is of the same class of phenomenon as the secondary cluster in the first plot. The Anglo-Indians and West Asian Muslims serve as rough proxies for Europeans, and you have an expected European-South Asian axis. But you also have this strange orthogonal component. I had assumed that the plot from the Reich et al. paper was an anomaly, but I’m not so sure after seeing the second paper.
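To make the point from the top of the post about spurious correlations concrete, here is a toy simulation. It is only a sketch and has nothing to do with the data in either paper: the two populations, their allele frequencies, the sample sizes, and the trait effect are all invented for illustration. The trait depends only on which population an individual belongs to, never on genotype, yet a SNP whose frequency differs between the populations still correlates with it until the first principal component is regressed out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two populations, 500 individuals each, 200 SNPs.
n_per_pop, n_snps = 500, 200

# Invented allele frequencies; population B has drifted away from A.
freq_a = rng.uniform(0.2, 0.8, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)

# Genotypes are counts of the reference allele (0, 1, or 2).
geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([geno_a, geno_b]).astype(float)
pop = np.repeat([0, 1], n_per_pop)

# The trait is driven by population membership alone (think diet or
# environment). No SNP has any causal effect on it.
trait = 1.5 * pop + rng.normal(0, 1, 2 * n_per_pop)

# PCA by SVD of the column-centered genotype matrix.
centered = genotypes - genotypes.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pc1 = u[:, 0] * s[0]

# Test the SNP with the largest frequency difference between populations.
snp = int(np.argmax(np.abs(freq_a - freq_b)))
g = genotypes[:, snp]

def residualize(y, x):
    """Return y with the least-squares linear effect of x removed."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

# Naive correlation versus correlation after adjusting both for PC1.
naive_r = np.corrcoef(g, trait)[0, 1]
adj_r = np.corrcoef(residualize(g, pc1), residualize(trait, pc1))[0, 1]

print(f"PC1 vs population label: r = {np.corrcoef(pc1, pop)[0, 1]:.2f}")
print(f"SNP vs trait, unadjusted: r = {naive_r:.2f}")
print(f"SNP vs trait, adjusted for PC1: r = {adj_r:.2f}")
```

The catch, and the reason the plots above matter, is that this correction only handles the structure the components you include actually capture. If there is a second, orthogonal axis of variation that nobody understands, leaving it out of the model means it can still confound the association.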