Harappa Ancestry Project @ N ~ 50

By Razib Khan | March 12, 2011 1:11 pm

Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

  • Punjab: 7
  • Iran: 7
  • Tamil: 6
  • Bengal: 5
  • Andhra Pradesh: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Uttar Pradesh: 2
  • Sri Lankan: 2
  • Kerala: 2
  • Iraqi Arab: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Rajasthan: 1
  • Baloch: 1
  • Unknown: 1
  • Egyptian/Iraqi Jew: 1
  • Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh (~200 million) and Bihar (~100 million). Nevertheless, there are already some interesting yields from the project. Below I’ve re-edited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.
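To put a number on that underrepresentation, here is a minimal sketch that tallies the counts listed above and works out what share of the submitted samples comes from Uttar Pradesh and Bihar. The dictionary simply restates the list; the variable names are mine, not anything from Zack's pipeline.

```python
# Sample counts as listed in the post (labels copied from the list above).
counts = {
    "Punjab": 7, "Iran": 7, "Tamil": 6, "Bengal": 5,
    "Andhra Pradesh": 2, "Bihar": 2, "Karnataka": 2, "Caribbean Indian": 2,
    "Kashmir": 2, "Uttar Pradesh": 2, "Sri Lankan": 2, "Kerala": 2,
    "Iraqi Arab": 2, "Anglo-Indian": 1, "Roma": 1, "Goa": 1,
    "Rajasthan": 1, "Baloch": 1, "Unknown": 1, "Egyptian/Iraqi Jew": 1,
    "Maharashtra": 1,
}

total = sum(counts.values())

# Combined share of the two populous states singled out in the text.
up_bihar = counts["Uttar Pradesh"] + counts["Bihar"]
up_bihar_pct = round(100 * up_bihar / total, 1)

print(f"total samples: {total}")
print(f"Uttar Pradesh + Bihar: {up_bihar} samples ({up_bihar_pct}% of total)")
```

The tally only describes the submitted samples. It says nothing about the reference populations from the HGDP, HapMap, or other public data sets.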

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out of his ADMIXTURE runs. I’ve also displayed the participants in the Harappa Ancestry Project so far, with their ethnic labels. Finally, smack in the middle you see the Fst values, standardized by the smallest between-population difference. So the values in the boxes represent the genetic distances between the inferred ancestral populations in the row and column (I also rounded, since I didn’t want to give the impression of excessive precision). This last point is important: these are not between-population distance measures across real populations. Rather, they’re distance measures across the inferred allele frequencies of populations which emerge out of the parameters you constrain ADMIXTURE to, as well as the genetic variation which you throw into the pot for the algorithm in the first place.
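The standardization described above can be sketched in a few lines: divide every pairwise Fst by the smallest off-diagonal value, then round. This is only an illustration of the idea. The population labels and Fst values below are made up, and the function name is hypothetical rather than part of ADMIXTURE or Zack's scripts.

```python
def standardize_fst(fst):
    """Scale a square Fst matrix by its smallest off-diagonal entry and round.

    After scaling, the closest pair of populations has distance 1 and every
    other entry reads as a multiple of that smallest distance.
    """
    n = len(fst)
    off_diagonal = [fst[i][j] for i in range(n) for j in range(n) if i != j]
    smallest = min(off_diagonal)
    return [
        [0 if i == j else round(fst[i][j] / smallest) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical values for three inferred components (not from the actual runs).
labels = ["Component A", "Component B", "Component C"]
fst = [
    [0.000, 0.010, 0.150],
    [0.010, 0.000, 0.160],
    [0.150, 0.160, 0.000],
]

scaled = standardize_fst(fst)
for label, row in zip(labels, scaled):
    print(label, row)
```

Rounding to whole multiples throws away precision on purpose, which matches the intent of not overstating how exact these inferred distances are.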

In the broadest sense the first thing that jumps out at you is the high distance value between “Papuans” and everyone else. This is interesting. In fact, the genetic distance between Papuans and other ancestral populations is greater than the genetic distance between the putative African populations and other non-Africans, except Papuans. This goes to the point that you need to be very careful in making definitive inferences from these sorts of programs. Interestingly, the population to which the Papuans exhibit the least genetic distance is the “South Asians.” What does that mean? I think this has a straightforward explanation. I believe that the South Asian cluster is a hybridized compound, as suggested by Reconstructing Indian History, and that the populations of Oceania represent a relatively “pure” eastern expansion of long-resident southern Asian groups which have generally been submerged by admixture with other groups intrusive to the region. This also explains the fact that Cambodians share some of this Papuan component with various South Asian populations. Finally, I wouldn’t make too much of this, but in some ADMIXTURE runs which I’ve done the genuine Papuan population in the HGDP data set breaks into two ancestral components, of which the southern Asian groups from Pakistan to Cambodia share only one. Remember that Oceania was settled initially ~40-50,000 years ago, and it looks like the people of Melanesia and indigenous Australians date to this initial period. So connections between southern Asians and Papuans are likely very old, and the two groups have been distinctive for a long time.

As for the South Asian individuals surveyed so far, there’s nothing that surprising. The South Asian element tends to increase as one goes south and east. This is what you’d expect. And the Pakistan/Caucasian component which spans much of western and central Asia is what connects the Iranian samples to the South Asian ones. The Iranians have very little of the South Asian component. This makes sense if the South Asian element is simply an outcome of an admixed population, and one of the ancestral groups from which this component derives, “Ancestral South Indians,” was generally not present to the west of Pakistan. The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turkic, while the further east (and to some extent south) the more likely it is to be more southerly in provenance. Most of the other patterns are as you would expect. Finally, I’d like to point out that I suspect that Zack is the first one to post the ancestral fractions of someone from the Nadar caste using SNP-chip markers.

Here are all the details about participation.

CATEGORIZED UNDER: Genetics, Genomics

Comments (5)

  1. Papuans being the most distant is a consequence of excluding the San and Pygmy.

  2. RK

    I think Maharashtra actually has a higher population than Bihar, and it only has one participant so far.

    The representation is even more skewed by caste, though. The Nadar and the Rowther are the only participants who clearly aren’t from one of the “forward classes.”

  3. arvind mishra

    I would appreciate an article over the topic from Zack written for lay people in easily understandable manner.

  4. Diogenes

    Contemporary West Africans (and above all contemporary East Africans) likely received unquantified significant “back to Africa” admixture (R1b, etc). Papuans were largely isolated, and there’s that Denisovan story. I would tend to agree with Zack.
    ADMIXTURE is just a stupid computer program. Silicon is really not the best material available in nature for genuine intelligence (as opposed to solving straightforward logic puzzles).
    ADMIXTURE needs to be managed instead of being relied on producing miraculous “ex-machina” results, and the new version tool has proven quite useful in allowing us to think while it does the boring work.

    I think you were right South Asian populations are mostly Eastern Fertile Crescent+Kurgan (ANI)+some Yellow River+Native independent incipient neolithic (ASI)

  5. Diogenes

    For everyone else:
    ADMIXTURE produces unbiased genuine results I believe. But you need to think what it’s doing because it doesn’t know better.
    WAFR-Pygmies and San=1 less (very far away) pole. Also San and Pygmies likely significant WAFR admixture. Thus it pulls WAFR towards supposedly “unadmixed” Pygmies and San.
    If you draw an “admixed” pop in a MDS without including its (more different) “unadmixed” parent, it becomes the “unadmixed” pole and you effectively change the MDS and corresponding distances.


Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm an American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com
