Some have asked what the point is in poking around African population structure when Tishkoff et al. and Henn et al. have done such a good job in terms of coverage. First, it is nice to run your own analyses so you can slice & dice to your preference, and not rely on the constrained menu provided by others. There’s value in home cooking; you can flavor to your taste. Second, you never know what data people might leave on your doorstep. I’ve received the genotypes of three Somalis. Nothing too surprising, a touch more Cushitic than the Ethiopians in Behar et al., but interesting nonetheless.
Also, you can see how ADMIXTURE tends to come to weird conclusions in certain circumstances. Below is a K = 12 run ~50,000 SNPs. I’ve included in a few Behar et al. and HGDP populations to the Henn et al. set, as well as pruned a lot of the African groups which seem redundant in terms of information. I’ve added a few geographically informative labels as well.
Observe below that there is a Fulani cluster. I think this is pretty much an artifact. At K = 7 the Fulani have a majority component which is modal in West Africa & Bantu speakers, and a minority component which is identical to the one modal in Mozabite Berbers from Algeria. The Mozabites reside in the far northern Sahara, and their modal component drops off as one goes east toward western Asia and the eastern Mediterranean. I suspect that what is showing up in ADMIXTURE is the ancient hybridization of the Fulani, and perhaps their demographic expansion from this core group. We have some glimmers of the prehistory of the Fulani, and no expectation for them to be such a distinctive cluster, so I naturally jump to these inferences. But it does make me reconsider the nature of the “Sandawe,” “Mbuti” or “San” clusters in ADMIXTURE. These populations are culturally distinctive in deep ways from their neighbors, so a reflexive inference one might make is that they’re “pure” ancient substrate groups which have been overlain and marginalized by their Bantu neighbors. But their prehistory is far murkier than the Fulani because of their geographical isolation, so there is far less to go on. These “ancient” isolated groups themselves may have gone through the same sort of distinctive recent ethnogenesis processes which we presume occurred with the Fulani (also, in the plot below the Biaka are pure; but in most of the bar plots they have a minor element which they share with their neighbors, probably due to greater admixture and interaction between western Pygmies and their Bantu neighbors than among the easter ones).
The pith: In this post I examine the most recent results from 23andMe for my family in the context of familial and regional (Bengal) history. I also use these results to offer up a framework for the ethnognesis of the eastern Bengali people within the last 1,000 years, and their relationship to other South Asian and Southeast Asian populations.
Since I received my 23andMe results last May I’ve been blogging about it a fair amount. In a recent post I inferred that perhaps I had a recent ancestor who was an ethnic Burman or some related group. My reasoning was that this explained a pattern of elevated matches on chromosomal segments with populations from southwest China in the HGDP data set. But now we have more than my genome to go on. This week I got the first V3 chip results from a sibling. And finally, yesterday the results from my parents came in. One thing that I immediately found interesting was my father’s mtDNA haplogroup assignment, G1a2. This came from his maternal grandmother, and as you can see it has a distribution which is mostly outside of South Asia. In case you care, I asked my father her background, and like my patrilineage she was a “Khan,” though an unrelated one (“Khan” is just an honorific). I received these results before the total genome assessment, and so initially assumed this confirmed my hunch that my father had some unknown recent ancestry of “eastern” provenance. But it turns out my hunch is probably wrong. In fact, my parents have about the same “eastern” proportion, with my mother slightly more! My expectation was that perhaps my mother would be around 25-30% “Asian,” and my father above 50%. The reality turns out that my father is 38%, and my mother 40%.
Image credit: f_mafra
Below are the “Ancestry Paintings” generated by 23andMe for my family (so far). What you see are the 22 non-sex chromosomes, which have two copies each, and assignments to “Asian,” “European,” and “African,” ancestry groups. The reference populations to generate these assignments come from the HapMap, the northern European sample of white Americans from Utah, Chinese from Beijing, Japanese from Tokyo, and ethnic Yoruba from Nigeria. What the assignment to one of these classes denotes is that that region of the genome is closest to that category in identity. It does not imply that your recent ancestry is European or Asian (African is probably a different matter, but there are many complaints about the results for African Americans and East Africans in the 23andMe forums). This caveat is especially important for South Asians, because we generally find that we’re ~75% European and ~25% Asian. All that means is that though most of our genetic affinity is with Europeans, a smaller fraction seems to resemble Asians more. Via “gene sharing” on 23andMe I can see that the Asian fraction varies from ~35% in South India and Sri Lanka, to ~10% in Pakistan and Punjab. This is not because South Indians have more East Asian ancestry than Punjabis. Rather, to a great extent the South Asian genome can be decomposed into two ancestral elements, one with a distant, but closer, affinity to populations of eastern Eurasia, and one with a close affinity to populations of western Eurasia. What some have termed “Ancient South Indians” (ASI) and “Ancient North Indians” (ANI). ASI ancestry, which is probably just a touch under 50% in South Asians overall, seems to shake out then as somewhat more Asian than European.* The fraction of ASI increases as one moves south and east in South Asia (and as one moves down the caste status ladder).
In my post below I quoted my interview L. L. Cavalli-Sforza because I think it gets to the heart of some confusions which have emerged since the finding that most variation on any given locus is found within populations, rather than between them. The standard figure is that 85% of genetic variance is within continental races, and 15% is between them. You can see some Fst values on Wikipedia to get an intuition. Concretely, at a given locus X in population 1 the frequency of allele A may be 40%, while in population 2 it may be 45%. Obviously the populations differ, but the small difference is not going to be very informative of population substructure when most of the difference is within populations.
But there are loci which are much more informative. Interestingly, one controls variation on a trait which you are familiar with, skin color (unless you happen to lack vision). A large fraction (on the order of 25-40%) of the between population variance in the complexion of Africans and Europeans can be predicted by substitution on one SNP in the gene SLC24A5. The substitution has a major phenotypic effect, and, exhibits a great deal of between population variation. One variant is nearly fixed in Europeans, and another is nearly fixed in Africans. In other words the component of genetic variance on this trait that is between population is nearly 100%, not 15%. This illustrates that the 15% value was an average across the genome, and in fact there are significant differences on the genetic level which can be ancestrally informative. You can take this to the next level: increase the number of ancestrally informative markers to obtain a fine-grained picture of population structure. In the illustration above the top panel shows the frequencies at the SNP mentioned earlier on SLC24A5. The second panel shows variation at another SNP controlling skin color, SLC45A2. This second SNP is useful in separating South and Central Asians from Europeans and Middle Easterners, if not perfectly so. In other words, the more markers you have, the better your resolution of inter-population difference. This is why I found the following comment very interesting: