One of the reasons that the HGDP populations are weighted toward indigenous groups is the understanding that these populations may not be long for the world in their current form. But the Taino genome reconstruction illustrates that even if populations are no longer with us…they are still within us. With that in mind I decided to do some quick “back-of-the-envelope” calculations in relation to the Khoisan people of southern Africa. These are the descendants of the populations which were presumably there before the Bantu, and the basal relationship of the Bushmen to other human lineages is probably a partial testament to their long-term residence in this region of Africa.
There are about 300,000 speakers of Khoisan languages left (mostly in South Africa and Namibia). These individuals are not all unmixed in their ancestry. If you look at some of the public genotypes available you can find Bantu African and European ancestry in Bushmen (the European may have come from Griqua). There are about 4 million Cape Coloureds and 8 million Xhosa. Both of these groups have some Khoisan ancestry. Let’s assume that the Cape Coloured are 20% Khoisan, and the Xhosa are 10% Khoisan. This is probably a moderately conservative estimate, but I think it’s close from what I’ve seen. Multiplying that out you get the equivalent of 1.6 million Khoisan represented by the Cape Coloured and the Xhosa (0.8 million from each group). Set against the 300,000 Khoisan speakers, that’s a ratio of over 5:1 of Khoisan ancestry in groups which don’t identify as Khoisan relative to the group which does. This is probably a major underestimate, as other Bantu populations besides the Xhosa likely have some Khoisan ancestry, though less.
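For anyone who wants to rerun the back-of-the-envelope arithmetic with different guesses, here is a minimal sketch. The population sizes and ancestry fractions are the rough assumptions stated above, not measured values, and the variable names are my own.

```python
# Back-of-the-envelope tally of Khoisan ancestry outside Khoisan-speaking groups.
# All inputs are the rough assumptions from the text, not measured estimates.

khoisan_speakers = 300_000

# group name -> (population size, assumed Khoisan ancestry fraction)
groups = {
    "Cape Coloured": (4_000_000, 0.20),
    "Xhosa": (8_000_000, 0.10),
}

# "Khoisan equivalents": population size weighted by ancestry fraction.
equivalents = {name: size * fraction for name, (size, fraction) in groups.items()}
total_equivalents = sum(equivalents.values())

# Ratio of Khoisan ancestry outside Khoisan speakers to the speakers themselves.
# This treats every speaker as fully Khoisan, which overstates the denominator.
ratio = total_equivalents / khoisan_speakers

for name, value in equivalents.items():
    print(f"{name}: {value:,.0f} Khoisan equivalents")
print(f"Total: {total_equivalents:,.0f}")
print(f"Ratio to Khoisan speakers: {ratio:.2f}:1")
```

Changing the fractions in `groups`, or adding other Bantu populations with smaller fractions, shows how quickly the ratio moves.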
Dienekes mentioned today a new paper, “Signatures of the pre-agricultural peopling processes in sub-Saharan Africa as revealed by the phylogeography of early Y chromosome lineages.” Because of the recent comments in this space on the genetic history of Africa, I was curious, but after reading it I have to say I can’t make much sense of the alphabet soup of haplogroups. Remember, there are different ways to capture and analyze the variation in one’s genes. A common activity is to sweep over the whole genome and focus on single nucleotide polymorphisms, variation at the base pair level. So my own analyses using ADMIXTURE focus on tens or hundreds of thousands of such markers. But there are other types of genomic variation, such as copy number, microsatellites, and minisatellites.
Additionally, much of the older human phylogeographic literature focused on mtDNA and Y chromosomal variance. For mtDNA it was partly a function of how easy it was to extract the genetic material (it’s copious on the cellular level). But perhaps more importantly these two types of variance aren’t subject to recombination. This means they are defined by clean phylogenetic trees which do not exhibit reticulation (recombination chops apart correlated markers and mixes & matches them) and presumably are not subject to natural selection, and so perfect for coalescent theory. So you can posit lineages related to each other by steps of sets of mutations, and also easily calculate the time until the last common ancestor for two different branches of the tree using a “molecular clock” model.
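To make the “molecular clock” idea concrete, here is a minimal sketch of the simplest version: two non-recombining lineages each accumulate mutations at a constant rate since their common ancestor, so the per-site divergence is twice the rate times the time. The function name and the numbers in the usage lines are purely illustrative assumptions, not real mtDNA or Y chromosome rates, and real analyses use more elaborate coalescent models.

```python
def tmrca_years(n_differences, n_sites, mu_per_site_per_year):
    """Strict-clock estimate of time to the most recent common ancestor.

    Both lineages accumulate mutations independently since the split,
    so expected divergence d = 2 * mu * T, which gives T = d / (2 * mu).
    """
    if n_sites <= 0 or mu_per_site_per_year <= 0:
        raise ValueError("n_sites and mu_per_site_per_year must be positive")
    divergence = n_differences / n_sites  # per-site differences between lineages
    return divergence / (2 * mu_per_site_per_year)


# Hypothetical inputs: 20 differences across 10,000 compared sites,
# with an assumed rate of 1e-8 mutations per site per year.
estimate = tmrca_years(20, 10_000, 1e-8)
print(f"Estimated TMRCA: {estimate:,.0f} years")
```

The estimate scales inversely with the assumed mutation rate, which is why disagreements over the rate translate directly into disagreements over dates.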
Here’s the abstract:
Since we’ve been talking about Fst a fair amount, I thought it might be nice to put it in some concrete graphical perspective. First, to review: Fst in the genetic context measures the proportion of genetic variation which can be attributed to between-population differences. To give a “toy” example, if you randomly divided the population of a large Swedish village into two groups and calculated their Fst, it should be ~0, because if you randomly select from an unstructured population by definition there shouldn’t be noticeable between-population differences. In contrast, if you compare a Swedish village to a Japanese village, a large fraction of the genetic variation is going to be distinct to each population. Around 10% of the genetic variation in fact will be between the two groups. Many of the genes will be extremely informative, so that if you know the allelic state from a given individual you can predict with a high degree of certitude which population that individual was from (e.g., SLC24A5 and EDAR). A small set of ancestrally informative alleles would produce a sequence of conditional probabilities of extremely high certitude (on the order of 10 genes for these two populations should suffice, perhaps three for “government work”).
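The toy example above can be sketched in a few lines. This uses the textbook heterozygosity form of Fst for a single biallelic locus and two equally sized populations, Fst = (Ht - Hs) / Ht; published estimates use more careful estimators averaged across many loci. The function name and the allele frequencies are made up for illustration.

```python
def fst_two_populations(p1, p2):
    """Fst at one biallelic locus for two equally weighted populations.

    p1, p2: frequency of the same allele in each population (0 to 1).
    Hs is the mean within-population expected heterozygosity;
    Ht is the expected heterozygosity of the pooled population.
    """
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)
    if ht == 0:
        # Both populations fixed for the same allele: no variation to partition.
        return 0.0
    return (ht - hs) / ht


# Random split of one unstructured village: same frequency in both halves.
print(fst_two_populations(0.5, 0.5))
# Modestly different frequencies: a small share of variation is between groups.
print(fst_two_populations(0.4, 0.6))
# Alternate alleles fixed in each population: all variation is between groups.
print(fst_two_populations(0.0, 1.0))
```

A locus like the last case is what makes an allele highly ancestry-informative, while most loci in a real comparison look more like the middle case.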
But to put this in perspective, and show how genetic variation differs from locale to locale, I thought I would compare continental-scale Fst values with those in a small region, southern Africa. The Fst values for the first I obtained from “Investigation of the fine structure of European populations with applications to disease association studies,” and the second from “Complete Khoisan and Bantu genomes from southern Africa.” The Bantu in this case is Desmond Tutu, who is from the Xhosa tribe, and has substantial admixture from the non-Bantu populations which were resident in South Africa prior to the arrival of the Bantus.
First, in tabular format: