Dienekes Pontikos has just released DIY Dodecad, a DIY admixture analysis program. You can download the files yourself. It runs on both Linux and Windows. Since I already have tools in Linux I decided to try out the Windows version, and it seems to work fine. It is somewhat limited in that you start out with the parameters which Dienekes has set for you, but if you don’t want to write your own scripts and get familiar with all the scientific programs out there, I think this is a very good option. Additionally, it seems to run rather fast, so you won’t spend days experimenting with different parameters.
Dienekes has already run me, but I put my parents’ genotype files through the system. Here are the results:
Since I know plenty of friends are getting, or just got, their V3 results, I thought I’d pass this on. Open-ended submission opportunity for 23andMe data (#2):
Who is eligible
Everyone who is of European, Asian, or North African ancestry and all four of his/her grandparents are from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.
Also, Zack has more than 30 individuals in HAP. The “cow belt” is still way underrepresented. The only Bengalis in the data set are my parents.
Dienekes did another run of his data with K = 64. He posted a huge plot with the two largest dimensions of variation. He also posted an accompanying spreadsheet with the coordinates of where the Dodecad samples were. So I found my own position pretty quickly. Before going to that, I thought I’d repost a comparison between myself, the HapMap Gujaratis, the North Kannadi sample, and the HGDP Uygurs. This is at K = 10 in ADMIXTURE from Dodecad.
OK, with that in mind, here’s the full MDS with the two largest components of genetic variation. I’ve added large labels. Also, click the image for a larger file so you can read the small labels.
I am DOD075. The Southeast Asian component is modal in Malays, while the East Asian component is modal in the North Chinese. Vietnamese and Cambodians are mixed, with the former biased toward East Asian, and the latter Southeast Asian. My own proportions are more balanced, but there might be some noise in there. That being said, from what I have read of Southeast Asia it is highly likely that Burmese ethnicities will be between the Cambodians and Vietnamese in proportions. The Burmans were more shaped by the indigenous Mon-Khmer people than the Vietnamese were, though like the Vietnamese they seem to hail from southern China. My family is traditionally from eastern Bengal, and has been at various points the subjects of the kingdom of Tripura.
Here’s the Dodecad Indians, HapMap Gujaratis, and Behar et al. North Kannadans. The orange is Asian. Can you tell which one I am?
In the open thread someone asked: “Any recent stuff on the genetics of Ethiopians.” That prompted me to look around, because I’m curious too. Poking around Wikipedia I couldn’t find anything recent. A lot of the studies are older uniparental lineage based works (NRY and mtDNA). Ethiopia is interesting because unlike almost all other Sub-Saharan African nations it has a long written history. Culturally and linguistically it has both Sub-Saharan African, and non-Sub-Saharan African, affinities. The languages of highland Ethiopia are clearly Semitic. Those of lowland Ethiopia are Cushitic, a branch of the broader Afro-Asiatic language family concentrated around the Horn of Africa (Somali is a Cushitic language, though most Ethiopian nationals who speak a Cushitic dialect are of the Oromo group).
From a human evolutionary genetic perspective, Ethiopia is also of specific interest. It is likely that the main recent pulse of humans Out of Africa traversed this region. Additionally, there is some evidence of deep time connections between the groups ancestral to Ethiopians and the Khoisan of southern Africa. It may be that Ethiopians and Khoisan are reservoirs of ancient genetic variation in Sub-Saharan Africa which has been overlain by Bantu in most other regions outside of West Africa. Finally, Ethiopians are known to have high altitude adaptations. This could be due to long term residence in the region, or to the assimilation of favorable alleles from the long term residents by later populations.
Fortunately we can get a sense of the genetic affinities of Ethiopians thanks to a paper published last spring, The genome-wide structure of the Jewish people. The focus was clearly on Jews, but they surveyed Amhara & Tigray (Semitic speaking highlanders), Ethiopian Jews (similar ethnically to the Amhara & Tigray, but religiously non-Christian), and Oromo. In the PCA the Oromo and Semitic speaking populations are pretty obviously distinct clusters.
Over at his blog Dienekes Pontikos has taken some public data sets and his own Dodecad samples and generated a massive MDS plot of West Eurasian populations. The MDS is fine as it goes. It illustrates clearly that when you visualize an individual on a plot defined by the two largest dimensions of variation in the total data set, clusters naturally emerge. Some of them are totally expected. For example, the cluster of Ashkenazi Jews. But some of the relationships need to be interpreted with care. The similar position of Sicilians with Ashkenazi Jews does not mean that these two populations are identical. Rather, their ancestral components exhibit similarities in such a manner that in a representation constrained to a few dimensions they shake out similarly. You can view the full thumbnail by clicking it, but I thought that for purposes of intuitive comprehension it would be useful to “cut out” the outlines of the distributions, and label them by geography. I added Ashkenazi Jews because I thought readers would be interested, but omitted the other Jewish groups.
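To make the caution about low-dimensional plots concrete, here is a minimal sketch. The coordinates are made up for illustration and are not taken from Dienekes' MDS. It just shows how two groups can sit almost on top of each other on the first two axes while being well separated on a third axis that the plot never draws.

```python
import math

# Hypothetical coordinates on three axes of variation (made-up numbers).
# Only the first two axes would be drawn on a 2-D plot.
positions = {
    "population_A": (0.10, 0.20, 0.40),
    "population_B": (0.11, 0.21, -0.35),
}

def distance(p, q, dims):
    """Euclidean distance using only the first `dims` coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p[:dims], q[:dims])))

a = positions["population_A"]
b = positions["population_B"]
dist_2d = distance(a, b, 2)  # what the two-axis plot shows
dist_3d = distance(a, b, 3)  # includes the hidden third axis

print(f"distance on the plot (2 axes): {dist_2d:.3f}")
print(f"distance with a third axis:    {dist_3d:.3f}")
```

The two made-up groups are nearly indistinguishable on the plot, yet far apart once the third axis is counted, which is the sense in which overlap on a two-dimensional plot does not imply identity.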
To the left you see a zoom in of a PCA which Dienekes produced for a post, Structure in West Asian Indo-European groups. The focus of the post is the peculiar genetic relationship of Kurds, an Iranian-speaking people, with Iranians proper, as well as Armenians (Indo-European) and Turks (not Indo-European). As you can see, in some ways the Kurds seem to be the outgroup population, and the correspondence between linguistic and genetic affinity is difficult to interpret. For those of you interested in historical population genetics this shouldn’t be that surprising. West Asia is characterized by endogamy, language shift, and a great deal of sub and supra-national communal identity (in fact, national identity is often perceived to be weak here). A paper from the mid-2000s already suggested that western and eastern Iran were genetically very distinctive, perhaps due to the simple fact of geography: central Iran is extremely arid and relatively unpopulated in relation to the peripheries.
But this post isn’t about Kurds; rather, observe the very close relationship between Turks and Armenians on the PCA. The _D denotes Dodecad samples, those which Dienekes himself has collected. This affinity could easily be predicted by the basic parameters of physical geography. Armenians and Anatolian Turks were neighbors for nearly 1,000 years. Below is a map which shows the expanse of the ancient kingdom of Armenia:
School girls in Hunza, Pakistan
A few days ago I observed that pseudonymous blogger Dienekes Pontikos seemed intent on throwing as much data and interpretation into the public domain via his Dodecad Ancestry Project as possible. What are the long term implications of this? I know that Dienekes has been cited in the academic literature, but it seems more plausible that this sort of project will simply distort the nature of academic investigation. Distort has negative connotations, but it need not be deleterious at all. Academic institutions have legal constraints on what data they can use and how they can use it (see why Genomes Unzipped started). Not so with Dienekes’ project. He began soliciting for data ~2 months ago, and Dodecad has already yielded a rich set of results (granted, it would not be possible without academically funded public domain software, such as ADMIXTURE). Even if researchers don’t cite his results (and no doubt some will), he’s reshaping the broader framework. In other words, he’s implicitly updating everyone’s priors. Sometimes it isn’t even a matter of new information, as much as putting a spotlight on information which was already there. Below is a slice of a bar plot from Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. It uses STRUCTURE with K = 7. To the right of the STRUCTURE slice are two plots of individual data on French and French Basque from the same HGDP data set using ADMIXTURE at K = 10 from Dodecad.
Nature profiles Dodecad, the Pickrell Affair, and the emergence of amateur genomicists in a new piece. Interestingly David of BGA is going to try and get something through peer review. In particular, the relationship of Assyrians and Jews.
So we have Genomes Unzipped, Dodecad, and BGA. What next? Who next? I hope Dienekes doesn’t mind if I divulge the fact that the computational resources needed to utilize ADMIXTURE as he has are within the theoretical capability of everyone reading this post. Rather, the key is getting familiar with PLINK and writing some code to merge data sets. After you do that, to really add value you’d probably want to get raw data from more than what you can find in the HGDP, HapMap and other public resources.
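As a rough idea of what "writing some code to merge data sets" can involve, here is a minimal sketch of one early step: finding the markers two data sets share. It assumes the standard six-column PLINK .bim layout. The marker IDs, the positions, and the helper names `read_bim_ids` and `shared_variants` are made up for illustration, and this is not a description of how Dienekes or David actually do it.

```python
def read_bim_ids(lines):
    """Return {variant_id: (chromosome, bp_position)} from .bim-format lines.

    A PLINK .bim line has six whitespace-separated fields:
    chromosome, variant ID, genetic distance, bp position, allele 1, allele 2.
    """
    variants = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        chrom, var_id, _cm, bp, _a1, _a2 = fields
        variants[var_id] = (chrom, int(bp))
    return variants

def shared_variants(bim_a, bim_b):
    """IDs present in both sets with the same chromosome and position."""
    return sorted(v for v in bim_a if v in bim_b and bim_a[v] == bim_b[v])

# Made-up example lines standing in for two real .bim files.
public_bim = [
    "1 rs0001 0 1000 A G",
    "1 rs0002 0 2000 C T",
    "2 rs0003 0 3000 A C",
]
personal_bim = [
    "1 rs0002 0 2000 C T",
    "2 rs0003 0 3500 A C",  # same ID, different position: excluded
    "2 rs0004 0 4000 G T",
]

keep = shared_variants(read_bim_ids(public_bim), read_bim_ids(personal_bim))
print(keep)
```

A list like `keep` could be written out one ID per line and handed to PLINK's `--extract` before merging with `--bmerge`. Note that this sketch ignores strand flips and allele mismatches, which a real merge would have to deal with.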
But here I make an open offer: if you start a blog or a project which replicates the methods of Dodecad and BGA I’ll link to you and promote you. When Dienekes began Dodecad I actually started to play around with the data sets in ADMIXTURE, but I’ve personally held off until seeing what he and David find. What their pitfalls and successes might be. Here’s to 2011 being more interesting than we can imagine!
Update: Already had a friend with a computational background contact me about doing something on South Asian genomics. So again: if you get a site/blog set up, and start pumping out plots, I will promote you. In particular, if you need 23andMe raw data files of geographical region X it might be useful to try and get the word out via blogs and what not.
I decided to take the Dodecad ADMIXTURE results at K = 10, and redo some of the bar plots, as well as some scatter plots relating the different ancestral components by population. Don’t try to pick out fine-grained details, see what jumps out in a gestalt fashion. I removed most of the non-European populations to focus on Western Europeans, with a few outgroups for reference.
Here’s a table of the correlations (I bolded the ones I thought were interesting):
| W Asian | NW African | S Europe | NE Asian | SW Asian | E Asian | N European | W African | E African | S Asian |
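For anyone who wants to produce a correlation table like this themselves, here is a minimal sketch of the calculation. The population averages are made up for illustration and are not the Dodecad values; only the component names come from the K = 10 run.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up population averages for three components (not real results).
population_means = {
    "pop_1": {"N European": 0.70, "S Europe": 0.20, "W Asian": 0.10},
    "pop_2": {"N European": 0.50, "S Europe": 0.35, "W Asian": 0.15},
    "pop_3": {"N European": 0.30, "S Europe": 0.45, "W Asian": 0.25},
    "pop_4": {"N European": 0.10, "S Europe": 0.50, "W Asian": 0.40},
}

components = ["N European", "S Europe", "W Asian"]
correlations = {}
for i, first in enumerate(components):
    for second in components[i + 1:]:
        xs = [population_means[p][first] for p in population_means]
        ys = [population_means[p][second] for p in population_means]
        correlations[(first, second)] = pearson(xs, ys)

for pair, r in correlations.items():
    print(pair, round(r, 2))
```

One thing to keep in mind when reading such a table: the components for each population sum to one, so some negative correlations are expected by construction, and only the pattern that stands out against that background is interesting.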
After linking to Marnie Dunsmore’s blog on the Neolithic expansion, and reading Peter Bellwood’s First Farmers, I’ve been thinking a bit on how we might integrate some models of the rise and spread of agriculture with the new genomic findings. Bellwood’s thesis basically seems to be that the contemporary world pattern of expansive macro-language families (e.g., Indo-European, Sino-Tibetan, Afro-Asiatic, etc.) are shadows of the rapid demographic expansions in prehistory of farmers. In particular, hoe-farmers rapidly pushing into virgin lands. First Farmers was published in 2005, and so it had access mostly to mtDNA and Y chromosomal studies. Today we have a richer data set, from hundreds of thousands of markers per person, to mtDNA and Y chromosomal results from ancient DNA. I would argue that the new findings tend to reinforce the plausibility of Bellwood’s thesis somewhat.
The primary datum I want to enter into the record in this post, which was news to me, is this: the island of Cyprus seems to have been first settled (at least in anything but trivial numbers) by Neolithic populations from mainland Southwest Asia.* In fact, the first farmers in Cyprus perfectly replicated the physical culture of the nearby mainland in toto. This implies that the genetic heritage of modern Cypriots is probably attributable in the whole to expansions of farmers from Southwest Asia. With this in mind let’s look at Dienekes’ Dodecad results at K = 10 for Eurasian populations (I’ve reedited a bit):
Dienekes Pontikos keeps chugging along, and has cranked out a new bar plot from the ADMIXTURE program with 15 putative ancestral components. He has “69 populations, and 1,189 individuals in total.” Most of these were assembled from public data, but some of them are particular to the Dodecad Ancestry Project. He contends:
In comparison to the K=10 analysis, the increased resolution allows us to:
– South Asians belonged primarily to the South Asian and West Asian components; this South Asian component spilt over to Iran and Central Asia. Now, a new Central-South Asian component, corresponding to the Ancestral North Indian of a recent study is inferred, and a corresponding South Indian component.
– HGDP Bedouins and Behar et al. (2010) Saudis take up their own component which I labeled Arabian. This appears to be a subset of the Southwest Asian component of the K=10 analysis
– There are several components in Siberian and Central Asian populations, already discovered in my regional analysis. These are Central Siberian, Nganasan, Koryak, Chukchi, and Altaic which replace the K=10 Northeast Asian component
Not only has he generated a bar plot, but there is a PCA showing the relationship between the 15 ancestral groups, as well as a hierarchical tree. Since he refers to the ANI and ASI of Reich et al., I thought I would note that the South Indian element from Dienekes’ K = 15 is still found in appreciable portions in the Turkic groups which earlier exhibited the South Asian component. And, on the PCA and phylogenetic tree it still clusters with West Eurasians more than East Eurasians, which is not the case with ASI (or the various Indian mtDNA lineages which coalesce back to a more recent common ancestor with East Eurasians).
The bar plot is below. Of interest are the most “pure” European groups, the Sardinians and Lithuanians. Also, compare Scandinavians and Finns.
The figure to the left is a composite merged from two different papers. One analyzes the patterns of genetic variation within African Americans, and the other the patterns within the East Turkic ethnic group, the Uyghurs. The bar plots show, for each group, the ancestral elements attributable to two parent populations: one resembling Europeans, and the other resembling Africans (for African Americans) or East Asians (for Uyghurs). Looking at total aggregate ancestral quanta we infer that African Americans are on the order of 15-25% European in ancestry, and 75-85% African. Uyghurs seem to be a composite in even measure of a European-like group, and an East Asian-like group. This makes total sense phenotypically; most African Americans look more African, while Uyghurs seem to exhibit a phenotype on average which spans the middle-range between West and East Eurasians.
But we’re clearly missing something when we focus purely on a population level statistic. Each “slice” of the bar plot actually represents an individual. Note the contrast between African Americans and Uyghurs. There is relatively little inter-individual variation among Uyghurs, while there is a great deal of such variation among African Americans. Why? Population geneticists have looked at linkage disequilibrium in both African Americans and Uyghurs, and inferred that the former went through an admixture phase much more recently than the latter. Though you don’t really have to be a population geneticist to have known that about African Americans. The ethnogenesis of the group African Americans as a cultural entity occurred in the period between 1650 and 1850. Genetically they are a compound of African, European, and to some extent Native American, ancestry. For the Uyghurs we have thinner textual evidence, but the visual and genetic data point to a “western” Indo-European speaking population in the Tarim basin before the arrival of the Turks sometime in the second half of the first millennium A.D. The assumption is that after the initial admixture event and the absorption of the pre-Turkic substrate there was no population substructure. Over time the two components distributed themselves evenly across the population over a period of 1,000-1,500 years.
From this we can infer that patterns of individual variation within populations, as well as between closely related populations, can tell us a great deal. Today the Dodecad Ancestry Project posted a file with the population ancestries broken down by individuals. Looking at this sort of fine-grained data, patterns can jump out based on what you already know. Below is a slide show I created which highlights some patterns of interest.
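The contrast between a population average and the spread among individuals can be put in numbers. Below is a minimal sketch with made-up individual ancestry fractions; these are not values from either paper, and the two lists are only meant to mimic a recently admixed group and one where admixture is old and has evened out.

```python
import statistics

# Made-up fractions of one ancestral component per individual.
# "recent": admixture is recent, so individuals differ a lot.
# "older": admixture is old, so individuals have converged.
recent_admixture = [0.05, 0.10, 0.20, 0.35, 0.15, 0.25]
older_admixture = [0.48, 0.52, 0.50, 0.47, 0.53, 0.50]

def summarize(fractions):
    """Return (population mean, standard deviation across individuals)."""
    return statistics.mean(fractions), statistics.pstdev(fractions)

recent_mean, recent_sd = summarize(recent_admixture)
old_mean, old_sd = summarize(older_admixture)

print(f"recent: mean={recent_mean:.2f}, sd={recent_sd:.3f}")
print(f"older:  mean={old_mean:.2f}, sd={old_sd:.3f}")
```

The mean is the population level statistic; the standard deviation is what the individual "slices" add. In this toy example the recently admixed group has by far the larger spread, which is the pattern described above for African Americans as against Uyghurs.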
Dienekes is now allowing people to “out” themselves in terms of their ancestry on a comment thread over at the Dodecad Ancestry Project. One of the major purposes of the project has been to survey variation in under-sampled groups which could give us insights into human genetic history. Yesterday I pointed to an analysis of Europeans from the British Isles to Russia. Basically Northern Europeans. There wasn’t anything too revolutionary about the nature of the results; rather, it confirmed some patterns we’d seen. Additionally it obviously didn’t resolve issues of timing, though it clarified hypotheses on the margin.
The main benefit of the ADMIXTURE bar plots is that they give you a gestalt sense of relationships in a quantitative fashion. This is especially important for groups in the Eurasian Heartland, who are in some ways at the center of both genetic and cultural exchange. In the comments above some information was divulged as to the provenance of two clusters of samples, Finns and Assyrians. The Assyrians here presumably represent the remnants of Mesopotamia’s Christian majority at the time of the Arab conquests in the 7th century. Prior to the Arab conquests Mesopotamia had been under the rule of the Sassanid Persian dynasty for nearly four centuries, but by the early 7th century the Syriac speaking majority by and large adhered to a range of Christian sects (the balance seem to have been heterodox non-Christian Gnostics and Jews), with the ancient Church of the East dominant. Because of the social constraints which Christians were placed under within the Muslim Middle East prior to the modern era these communities may be particularly informative as to the demographic impact of the Arab conquests, and the cosmopolitan and international nature of the Muslim polities and how they reshaped the genetics of the Middle East. A good approximation is that the Christian minorities are the dominant parent population of the Muslim majority, but that because of their tendency to withdraw into more isolated regions and their enforced economic marginality they would not have intermixed so much with the influx of slaves, northern (Turk and Slav), Indian, and African, which characterized much of Mesopotamia over the past 1,400 years.
Below the fold is a slide show. I’ve reedited just a touch (removed a few populations, put the labels in larger fonts, etc.). First the total population set. Then I’ve dropped the Finns and Assyrians, respectively, into the global population set (obscuring some which are less relevant).
A few days ago Dienekes opened up the Dodecad project to a wider range of Eurasians. I decided to send my 23andMe sample to Dienekes ASAP, and the results came back today. I’m DOD075. Dienekes also just put up an explanation of the 10 ancestral components he’s generating from ADMIXTURE (along with tree-like representations of their distances). Below I’ve placed myself in the more local context of populations to which I’m close:
Dienekes Pontikos, Introducing the Dodecad ancestry project:
1) Project goals
The Dodecad ancestry project has two goals:
– To provide detailed ancestry analysis to individuals who have tested with 23andMe; other testing companies may be included in the future.
– To build samples of individuals for regions of the world (e.g. Greeks, Finns, Albanians, Southern Italians, etc.) currently under-represented in publicly available datasets.
I neither endorse nor am I affiliated with any genetic testing company. I have chosen to base the project on 23andMe results, because (i) I perceive that quite a few people have used the service, (ii) the Illumina genotyping platform it uses has substantial overlap with the publicly available datasets on which my analysis depends.
Basically some of you need to send him your 23andMe raw data files. The potential sample space of this group is going to be in the tens of thousands from what 23andMe representatives have stated about how many of the Complete Edition kits they’ve sold. Naturally due to labor and computational constraints he only wants people from particular populations. I think that’s fine. I’m a little taken aback by how demanding and critical Dienekes’ readers have been about the choice of populations he analyzes. You can install ADMIXTURE yourself, get data sets, and manipulate them in PLINK, etc. I hope many people will participate in this project. I would have given my sample, but I’m not of an appropriate population, and even if he wanted South Asians I’m pretty sure I’m not very representative of South Asians (I have very few runs-of-homozygosity and seem to have recent admixture from other world population groups).
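If you want to look inside a raw data file before sending it anywhere or feeding it to other tools, here is a minimal sketch of a reader. It assumes the tab-separated layout of 23andMe raw data exports (comment lines starting with "#", then rsid, chromosome, position, genotype, with "--" for a no-call). The marker IDs and the helper names `parse_raw_genotypes` and `call_rate` are made up for illustration.

```python
def parse_raw_genotypes(lines):
    """Parse lines in the assumed tab-separated raw data layout.

    Comment lines start with '#'; data lines hold
    rsid, chromosome, position, genotype.
    """
    genotypes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and header comments
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip malformed lines
        rsid, chrom, pos, genotype = fields
        genotypes[rsid] = (chrom, int(pos), genotype)
    return genotypes

def call_rate(genotypes):
    """Fraction of markers with a called genotype ('--' is a no-call)."""
    if not genotypes:
        return 0.0
    called = sum(1 for _, _, g in genotypes.values() if g != "--")
    return called / len(genotypes)

# Made-up lines standing in for a real raw data file.
sample_lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0001\t1\t1000\tAG",
    "rs0002\t1\t2000\t--",
    "rs0003\t2\t3000\tCC",
    "rs0004\tX\t4000\tT",
]

parsed = parse_raw_genotypes(sample_lines)
print(len(parsed), call_rate(parsed))
```

With a real file you would pass the open file object in place of `sample_lines`. Getting from there to something PLINK and ADMIXTURE can use takes more work, but a quick check of marker count and call rate is a reasonable first sanity test.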