It’s been 10 months since Zack Ajmal first contacted me about the possibility of the Harappa Ancestry Project. I was of two minds. On the one hand I did think there was a major problem with undersampling some regions of South Asia. But, it seemed that the 1000 Genomes would fix that soon enough. As it turns out the 1000 Genomes has been a bit slower than I had anticipated (and I assume that the nixing of the Indian samples was a matter of politics not science). So I’m glad Zack started the project when he did.
At this point he’s hit the zone of diminishing marginal returns when it comes to participants. Looking through his samples he has a little over 100 non-founders of unadmixed South Asian ancestry (I’m not a founder because both my parents are in the database). I decided to prune the individuals down to this selection, and tack on a lot of his reference populations, with a bias toward South Asians, and see what I could find. I used his K = 11 ADMIXTURE run, since this seems maximally informative for South Asians. You can find the file here.
A few days ago I noticed that the Dodecad Ancestry Project had nearly 10,000 individuals! ~500 are participants in the project (like myself, I’m DOD075). But most of the individuals were derived from public or shared data sets. You can see them in the Google spreadsheet with all the results. It’s quite an accomplishment, and I commend Dienekes for it. I also have to enter into the record that Dodecad prompted my own forays into genome blogging, and Dienekes also helped Zack with pointers for Harappa in the early days.
Dienekes Pontikos has just released DIY Dodecad, a DIY admixture analysis program. You can download the files yourself. It runs on both Linux and Windows. Since I already have tools in Linux I decided to try out the Windows version, and it seems to work fine. It is somewhat limited in that you start out with the parameters which Dienekes has set for you, but if you don’t want to write your own scripts and get familiar with all the scientific programs out there, I think this is a very good option. Additionally, it seems to run rather fast, so you won’t spend days experimenting with different parameters.
Dienekes has already run me, but I put my parents’ genotype files through the system. Here are the results:
The DNA ancestry testing industry is more than a decade old, yet details about it remain a mystery: there remain no reliable, empirical data on the number, motivations, and attitudes of customers to date, the number of products available and their characteristics, or the industry customs and standard practices that have emerged in the absence of specific governmental regulations. Here, we provide preliminary data collected in 2009 through indirect and direct participant observation, namely blog post analysis, generalized survey analysis, and targeted survey analysis. The attitudes include the first available data on attitudes of those of individuals who have and have not had their own DNA ancestry tested as well as individuals who are members of DNA ancestry-related social networking groups. In a new and fluid landscape, the results highlight the need for empirical data to guide policy discussions and should be interpreted collectively as an invitation for additional investigation of (1) the opinions of individuals purchasing these tests, individuals obtaining these tests through research participation, and individuals not obtaining these tests; (2) the psychosocial and behavioral reactions of individuals obtaining their DNA ancestry information with attention given both to expectations prior to testing and the sociotechnical architecture of the test used; and (3) the applications of DNA ancestry information in varying contexts.
If anyone wants the paper, email me, I can send you a copy. But really it’s just kind of dated because the information was collected in 2009, before the massive increase in 23andMe’s customer base which began in the spring of 2010. Additionally, “genome blogging” really hadn’t started much at that point.
In terms of the reactions to ancestry analysis, my personal experience after doing analysis on hundreds of people (most in public for AAP, but some in private) is that most are pretty calm about whatever they find out. On occasion you run into a stubborn person who is basically going to fix upon a really implausible explanation for a particular ancestral slice rather than the lowest hanging fruit. But there was one individual who had a freak out when their results were published, because it did not accord with family beliefs. I was kind of confused, and checked their results with their self-reported ethnicity. Weirdly the results were exactly what I would have expected from the self-reported ethnicity, so it was a really strange reaction.
Since I started up the African Ancestry Project one of the primary sources of interest has been from individuals whose families hail from Northeast Africa. More specifically, the Horn of Africa: Ethiopia, Eritrea, and Somalia. The problem seems to be that 23andMe’s “ancestry painting” algorithm uses West African Yoruba as a reference population, and East Africans are often not well modeled as derivative of West Africans. So, for example, the Nubian individual who I’ve analyzed supposedly comes up to be well over 50% “European” in ancestry painting. Then again, I’m 55-60% “European” as well according to that method! So we shouldn’t take these judgments to heart too much. Obviously something was off, and thanks to Genome Bloggers like Dienekes Pontikos we know what the problem was: the populations of the Horn of Africa have almost no distinctive “Bantu” element to connect them with West Africans like the Yoruba. Additionally, a closer inspection shows that the “Eurasian” component present in these populations is very specific as well, almost totally derived from Arabian-like sources. When breaking apart the West Eurasian populations it is no surprise that Northern Europeans and Arabians are among the most distant pairs, even excluding recent Sub-Saharan African admixture. The HapMap Utah European American sample and the Nigerian Yoruba are very suboptimal references for people with an eastern African background. In contrast, African Americans are a mixture of West Africans and Northern Europeans, so the ancestry painting algorithm has nearly perfect reference populations for them. The results for African Americans may not be very detailed and rich, but they’re probably pretty accurate at the level of grain at which they’re offered.
Though I’m happy to give people of Northeast African ancestry more detailed results than 23andMe, one of my motivations for the African Ancestry Project was to obtain a data set which would allow me to explore the genomic variation in the east of Africa myself. This region is a strong candidate for “source” populations for non-Africans within the last 100,000 years, and it seems to have experienced rapid population turnover within the last 2,000-3,000 years. My data set is not particularly adequate to my ambitions, yet. But I do now have 5 unrelated Somalis. To my knowledge there hasn’t been much exploration of Somali genomics using thick-marker SNP chips, so why not? N = 5 is better than N = 0 in these cases of extreme undersampling.
Before I proceed to methods and results, I want to note that I put up most of my files here. It’s a ~25 MB compressed folder with images, spreadsheets, as well as raw output from ADMIXTURE and EIGENSOFT. I hope readers will take this as an invitation to poke around themselves.
Both Eurogenes and Harappa now have map interfaces where you can drop in your location of origin if you’re a participant. If you have submitted your data you should add your information. We’re at a point where data is relatively plentiful, at least before the tsunami of whole genomes, so visualization and representation are of the essence.
Zack pointed me to two new ones, Fennoscandia Biographic Project, and Magnus Ducatus Lituaniae Project – BGA analysis project for the territories of former Grand Duchy of Lithuania. So I guess the circum-Baltic region is getting some thick coverage. The latter is also releasing some format conversion tools which seem to work in Windows, if you want to play with the analytic software yourself.
At least about some things. In Guns, Germs, and Steel he argued that latitudinal diffusion of agricultural toolkits was much easier than longitudinal diffusion. This seems right, but one thing which Diamond did not emphasize enough, I suspect in hindsight, is that demographic diffusion and replacement can follow a similar pattern. I am probably not a “Neolithic population replacement” maximalist to the extent of someone like “Diogenes” or Peter Bellwood, but that is probably mostly a matter of my modest confidence about all of these sorts of issues. But, after running many trials of ADMIXTURE, along with perusing the results generated by Dienekes, David, and Zack, I am more confident in the position that agriculture and agriculture-bearing populations tend to initially follow paths of least ecological resistance. The distance between Lisbon and Damascus is about 4,000 kilometers, while between Helsinki and Damascus it is about 3,000 kilometers, but Lisbon has been much more affected by the migrations from the Middle East than Helsinki. The facilitation of water transportation as well as ecological similarities between Lisbon and Damascus, at least in relation to Helsinki, explain this phenomenon.
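The two distances quoted above can be sanity-checked with a great-circle calculation. What follows is only a minimal sketch using the haversine formula. The city coordinates are approximate values assumed for illustration (they are not from the post), and the function and variable names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the haversine formula treats Earth as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates (assumed for illustration).
CITIES = {
    "Lisbon": (38.72, -9.14),
    "Damascus": (33.51, 36.29),
    "Helsinki": (60.17, 24.94),
}


def distance_between(city_a, city_b):
    """Distance in km between two cities listed in CITIES."""
    return haversine_km(*CITIES[city_a], *CITIES[city_b])


lisbon_damascus = distance_between("Lisbon", "Damascus")
helsinki_damascus = distance_between("Helsinki", "Damascus")
print(f"Lisbon-Damascus:   {lisbon_damascus:.0f} km")
print(f"Helsinki-Damascus: {helsinki_damascus:.0f} km")
```

With these assumed coordinates the two figures should land near the round numbers in the text, with Helsinki the closer of the two to Damascus as the crow flies.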
To illustrate this issue more broadly, let’s look at some ADMIXTURE results. Zack Ajmal at the Harappa Ancestry Project has one of the most cosmopolitan reference sets around, and he’s been posting results from his “reference 3” population, which merges a host of different study groups. Today he posted K = 6. That is, he generated 6 ancestral populations and allowed the program to assign proportions of each to individuals within the reference set. He labeled his putative ancestral populations:
– S Asian
– E Asian
– SW Asian
Zack generated his usual nice bar plots, but I thought there might be another way to look at the relationships between the proportions. A scatter plot where each axis represents a proportion of a putative ancestral group. Below you see “SW Asian” on the y-axis and “European” on the x-axis:
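For readers who want to try the same kind of plot, here is a minimal sketch of how the points might be prepared. It assumes ADMIXTURE-style Q output (one row per individual, K whitespace-separated proportions) and uses a made-up four-column toy matrix rather than Zack’s actual K = 6 results. The column labels, function names, and numbers are all illustrative assumptions, and the plotting helper needs matplotlib installed.

```python
def parse_q_matrix(text, labels):
    """Parse Q-style text into a list of {label: proportion} dicts, one per individual."""
    rows = []
    for line_no, line in enumerate(text.strip().splitlines(), start=1):
        values = [float(v) for v in line.split()]
        if len(values) != len(labels):
            raise ValueError(f"row {line_no}: expected {len(labels)} columns, got {len(values)}")
        rows.append(dict(zip(labels, values)))
    return rows


def scatter_points(rows, x_label, y_label):
    """Return (x, y) pairs: one component's proportion against another's."""
    return [(row[x_label], row[y_label]) for row in rows]


def plot_scatter(points, x_label, y_label, path="scatter.png"):
    """Optional: write the scatter plot to a PNG file (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to file without needing a display
    import matplotlib.pyplot as plt

    xs, ys = zip(*points)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    fig.savefig(path)


# Toy data: invented proportions for three hypothetical individuals.
# The assignment of labels to columns is an assumption; ADMIXTURE itself
# does not name its components.
LABELS = ["S Asian", "E Asian", "SW Asian", "European"]
TOY_Q = """
0.05 0.05 0.20 0.70
0.10 0.00 0.60 0.30
0.80 0.10 0.05 0.05
"""

points = scatter_points(parse_q_matrix(TOY_Q, LABELS), "European", "SW Asian")
print(points)
```

Calling `plot_scatter(points, "European", "SW Asian")` would then write the figure to disk; coloring points by population label is a natural next step.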
Zack Ajmal has been methodically working his way through issues in the public genomic data sets. Often it just involves noting duplicate samples across data sets, which need to be accounted for. But sometimes there seem to be problems within the uploaded data sets, for example relatively closely related individuals. Today he highlights an issue which early on was noticeable in the Behar et al. data set:
Behar as in the Behar et al paper/dataset and not the Indian state of Bihar. The Behar dataset contains 4 samples of Paniya, which apparently is a Dravidian language of some Scheduled Tribes in Kerala.
I had always been suspicious of those four samples since one of them had admixture proportions similar to other South Indians but the other three were like Southeast Asians.
Since the Austroasiatic Paniya samples originated from Behar et al, I guess at some point before the Behar data being submitted to the GEO database the Paniyas got mislabeled.
I pulled down the Behar et al. data set too, and the Paniya just look weird enough that I just avoided them. Ideally this sort of stuff should be caught, but errors happen. Best to get as many eyeballs looking over everything.
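The duplicate-sample check mentioned earlier can be sketched roughly as follows. This is not Zack’s actual procedure, just a minimal illustration under stated assumptions: genotypes are already aligned to the same SNPs in the same order, missing calls are coded as "--", and the 0.99 concordance threshold is an arbitrary choice. Sample IDs and genotypes are invented.

```python
from itertools import combinations


def concordance(genos_a, genos_b, missing="--"):
    """Fraction of SNPs called in both samples whose genotypes are identical."""
    compared = same = 0
    for a, b in zip(genos_a, genos_b):
        if a == missing or b == missing:
            continue  # skip SNPs with a missing call in either sample
        compared += 1
        # Sort alleles so unphased "AG" and "GA" count as the same genotype.
        if sorted(a) == sorted(b):
            same += 1
    return same / compared if compared else 0.0


def find_duplicates(samples, threshold=0.99):
    """Return (id1, id2, concordance) for every pair at or above the threshold."""
    flagged = []
    for (id1, g1), (id2, g2) in combinations(samples.items(), 2):
        c = concordance(g1, g2)
        if c >= threshold:
            flagged.append((id1, id2, c))
    return flagged


# Toy data: hypothetical sample IDs from two data sets, six SNPs each.
samples = {
    "setA_001": ["AA", "AG", "GG", "CT", "--", "TT"],
    "setB_417": ["AA", "GA", "GG", "CT", "CC", "TT"],
    "setA_002": ["AG", "GG", "AG", "CC", "CC", "CT"],
}

duplicates = find_duplicates(samples)
print(duplicates)
```

A real check would run over hundreds of thousands of SNPs, and catching close relatives rather than exact duplicates would need a proper relatedness estimate instead of a simple identity threshold.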