I got my daughter a netbook, so now my computer is doing Harappa Project work 24×7.
Also, Simranjit was nice enough to offer me the use of a server. For privacy reasons, I am not going to upload any of the participants’ data there but it is much faster than my machine and hence very useful for running Admixture on the reference data (especially with crossvalidation).
As for steps back, I downloaded the current 1000 Genomes data (1,212 samples, 2.4 million SNPs). It’s in VCF format. Using vcftools to convert it to PED format will take about 3 weeks. Yes, you heard that right. BTW, the good stuff from a South Asian point of view will come later this year with 100 Assamese Ahom, 100 Kayadtha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis.
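For readers who have not seen the two formats, here is a minimal Python sketch of what a VCF-to-PED conversion involves. It is not how vcftools does it, and it is far too naive for the full 1000 Genomes files: it holds everything in memory, keeps only biallelic single-base SNPs, and reuses the sample ID as the family ID. The function name `vcf_to_ped` is hypothetical and exists only for illustration.

```python
def vcf_to_ped(vcf_lines):
    """Toy conversion of VCF text lines into PED rows and MAP rows.

    Assumptions (illustrative only): biallelic single-base SNPs, GT is the
    first FORMAT field, sample ID doubles as family ID, sex and phenotype
    are unknown.
    """
    samples = []
    genotypes = {}  # sample ID -> list of "A1 A2" allele pairs, one per SNP
    map_rows = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            genotypes = {s: [] for s in samples}
            continue
        chrom, pos, snp_id, ref, alt = fields[:5]
        if len(ref) != 1 or len(alt) != 1:
            continue  # drop indels and multiallelic sites in this sketch
        alleles = [ref, alt]
        # MAP row: chromosome, SNP ID, genetic distance (0 = unknown), position
        map_rows.append(f"{chrom}\t{snp_id}\t0\t{pos}")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0].replace("|", "/").split("/")
            if len(gt) != 2 or "." in gt:
                pair = "0 0"  # PED convention for a missing genotype
            else:
                pair = " ".join(alleles[int(i)] for i in gt)
            genotypes[sample].append(pair)
    # PED row: family, individual, father, mother, sex, phenotype, genotypes
    ped_rows = [
        " ".join([s, s, "0", "0", "0", "-9"] + genotypes[s]) for s in samples
    ]
    return ped_rows, map_rows
```

The slow part of the real conversion is the transposition: VCF stores one row per SNP, while PED wants one row per individual, so every sample's row grows with every SNP that is read.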
Also, I spent most of Sunday evening and night in the ER and got a diagnosis of ureterolithiasis for my efforts. All I can say is: Three cheers for Percocet!!
When Zack first mooted the idea of the Harappa Ancestry Project I had no idea what was coming down the pipe. I wonder if his daughter and wife are curious as to what’s happened to their computer! Since collecting the first wave of participants he’s been a result-generating machine. Today he produced a fascinating three-dimensional PCA (modifying Doug McDonald’s Javascript) using his “Reference 1” data set. He rescaled the dimensions appropriately so that they reflect how much of the genetic variance they explain. The largest principal component of variance is naturally Africa vs. non-Africa, the second is west to east in Eurasia, and the third is a north to south Eurasian axis.
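To make the rescaling idea concrete, here is a small Python sketch. I don't know exactly which scaling Zack used, so this assumes one common choice: each axis is stretched in proportion to the square root of its eigenvalue, relative to the first component, so that the spread along an axis tracks the standard deviation it accounts for. The function names and the input layout are hypothetical.

```python
import math


def variance_explained(eigenvalues):
    """Share of variance per component, relative to the components given.

    Note: this is a share of the listed eigenvalues only, not of the total
    variance in the full data set.
    """
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def rescale_axes(coords, eigenvalues):
    """Scale each PC axis by sqrt(eigenvalue / largest eigenvalue).

    coords: list of per-sample tuples (pc1, pc2, pc3, ...), assumed to be
    on a comparable unit scale before rescaling.
    eigenvalues: one eigenvalue per axis, largest first.
    """
    top = eigenvalues[0]
    weights = [math.sqrt(ev / top) for ev in eigenvalues]
    return [tuple(c * w for c, w in zip(point, weights)) for point in coords]
```

With this choice the first axis is left untouched and the later axes shrink, so a cluster that looks well separated on the third component in an unscaled plot takes up visibly less room once the axes are weighted.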
I decided to be a thief and take Zack’s Javascript and resize it a bit to fit the width of my blog, blow up the font size, as well as change the background color and aspects of positioning. All to suit my perverse taste. You see the classic “L” shaped distribution familiar from the two-dimensional plots, but observe the “pucker” in the third dimension of South Asian, and to a lesser extent Southeast Asian, populations.
Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David have been doing. There are public data sets, and open source software, so that anyone with a nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending; look at what’s coming out (and has come out) in the 1000 Genomes project!
I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:
Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SVGP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:
Punjab: 7
Iran: 7
Tamil: 6
Bengal: 5
Andhra Pradesh: 2
Bihar: 2
Karnataka: 2
Caribbean Indian: 2
Kashmir: 2
Uttar Pradesh: 2
Sri Lankan: 2
Kerala: 2
Iraqi Arab: 2
Anglo-Indian: 1
Roma: 1
Goa: 1
Rajasthan: 1
Baloch: 1
Unknown: 1
Egyptian/Iraqi Jew: 1
Maharashtra: 1
Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.
Zack has started to improve on static R plots with Google powered charts. Check it out. Alas, I can’t inject script tags into the body of my posts, so that’s not feasible for me. Notice on Zack’s plot that I’m more East Asian than either of my parents. The tendency first cropped up with 23andMe’s ancestry painting, and I have seen it in my own ADMIXTURE runs, so I don’t dismiss it as V2 vs. V3 chip anymore. Though I’ve ordered an upgrade myself, so we’ll see for sure. Also, though both my parents are about the same East Asian, they exhibit a different balance of East Asian subcomponents. I’ve seen this in my own ADMIXTURE runs, and I’m going to check for more fine-grained matches with the HGDP East Asian populations soon to ascertain whether their eastern ancestral mix is different. Good times.
Everyone who is of European, Asian, or North African ancestry and all four of his/her grandparents are from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.
Also, Zack has more than 30 individuals in HAP. The “cow belt” is still way underrepresented. The only Bengalis in the data set are my parents.
As you have begun interpreting the reference results, let me make a friendly warning: you have to keep in mind that most of the reference populations of ethnic groups are extremely limited in sample size (with only between 2 and 25 individuals) and from very obscure sources, and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.
This seems a rather reasonable caution. But I don’t think such a vague piece of advice really adds any value. These sorts of caveats are contingent upon:
- The scope of the question being asked (i.e., how fine a grain is the variation you are attempting to measure going to be)
- The sample size
- The representativeness
- The thickness of the marker set (10 autosomal markers vs. 500,000 SNPs)
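To put a rough number on the sample-size point, here is a small Python sketch of the textbook binomial standard error for an allele frequency estimated from n diploid individuals. It assumes the individuals are a random, unrelated sample and that their 2n chromosomes are independent draws, which real reference panels only approximate. The function name is hypothetical.

```python
import math


def allele_freq_se(p, n_individuals):
    """Binomial standard error of an allele frequency estimate.

    p: true allele frequency, between 0 and 1.
    n_individuals: number of diploid individuals sampled (2n chromosomes).
    Assumes random sampling of unrelated individuals.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return math.sqrt(p * (1.0 - p) / (2 * n_individuals))
```

At a frequency of 0.5, two individuals give a standard error of 0.25, while 25 individuals bring it down to about 0.07. That is per SNP; whether it matters depends on the question, since a coarse continental-scale contrast summed over hundreds of thousands of SNPs tolerates far more per-marker noise than a fine-grained comparison between neighboring groups.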
Zack is going to post the first batch of results from HAP tomorrow. It looks like he’s going to be using mostly the merged HGDP, HapMap, SVGP, and Behar data set, supplemented by a second set which also merges the Xing et al. sample (the intersection of Xing et al. with the other results is a much smaller number of SNPs, but it includes better coverage of various South Asian groups). He’ll initially be posting ADMIXTURE estimates as you’ve seen on Dodecad. I’m especially interested in the Anglo-Indian and Roma individuals who have sent Zack their samples. I don’t know of any genomic investigation of the former community, while the published research on Roma genetics doesn’t include SNP-chip results (usually they’re mtDNA, Y, or only a few autosomal markers). I’d be curious about possible evidence of homozygosity or linkage disequilibrium in the Roma individual due to the population bottlenecks which other studies have detected (I assume that’ll be in the future). The Roma are to a good approximation an admixture of Indian, West Asian, and European (often Balkan) groups, but their history of endogamy, and of small founding groups experiencing rapid demographic expansion, is also critical to remember.
Zack has been posting his data sources, as well as how he filtered and formatted them, all this week. I assume that the first wave of results will be online soon. As of yesterday, this is what he had (I know he got some more today):
Whole swaths of north-central India are missing. I am hopeful that more people will join in after the first wave of results is put out there. But, from what I have discussed with Zack, it looks plausible that the very first wave will have a richer set of results because of the necessity of preliminary steps. So there’s some benefit in getting in early. It’s really ridiculous to have literally 1 sample representing the 300 million people of Uttar Pradesh and Bihar. That’s 25% of South Asians represented by one person. I’ve gotten a commitment from one friend who was born in U.P. to give his data up once it comes in, but there have to be others out there. (The Bengali N should go up to 2 when I swap my parents in for me.)
The public data sources have Gujaratis, Tamils, Pakistanis (Punjabis, Pathans, Sindhis), and some South Indian groups (Tamil and Telugu). This leaves a blank spot on the North Indian plain.
Last week I announced the Harappa Ancestry Project. It now has its own dedicated website, http://www.harappadna.org. Additionally, it has its own Facebook page. For Zack to get his own URL he needs about 10 more “likes,” so please like it! (if you are so disposed) Finally, from what I’ve heard the first wave of the 23andMe holiday sale results are coming online this week. Actually, one of the relatives who I purchased the kit for is in processing currently, so I know that we should have a bunch of new people in the system very, very soon.
Speaking of people, last I heard Zack had gotten about a dozen responses. That’s enough to start an initial round of runs, but obviously he needs more people. More importantly, the goal here is to get better population coverage. One of the things we know intuitively and also from the most current research is the existence of a lot of within-region population variation in South Asia which is structured by community. In other words, a sample of 30 people, where you have 3 from each of 10 different communities exhibiting geographical and caste diversity, is going to be far more useful right now than 300 Jatts from Indian Haryana. Getting 300 Jatts from Haryana would be interesting in that it would give you a window into intra-communal variance, but there are diminishing returns on the inferences you could make about South Asians as a whole.
If you know someone who has done the 23andMe testing and has preponderant ancestry from South Asia, Iran, Burma, or Tibet, please forward the URL for the Harappa Ancestry Project. If you are a 23andMe member, and involved in the forums, it might be useful to post a comment thread on this project, as the people you share genes with would see it.
A few weeks ago I hinted at a South Asian equivalent to Dodecad & Eurogenes BGA. It is now public and in the data collection phase. You can read the whole thing here: