I mentioned this in passing on my post on ASHG 2012, but it seems useful to make explicit. For the past few years there has been word of research pointing to connections between the Khoisan and the Cushitic people of Ethiopia. To a great extent in the paper which is forthcoming there is the likely answer to the question of who lived in East Africa before the Bantu, and before the most recent back-migration of West Eurasians. On one level I’m confused as to why this has to be something of a mystery, because the most recent genetic evidence suggests a admixture on the order of 2-3,000 years before the past.* If the admixture was so recent we should find many of the “first people,” no? As it is, we don’t. I think these groups, and perhaps the Sandawe, are the closest we’ll get.
Publication is imminent at this point (of this, I was assured), so I’m going to just state the likely candidate population (or at least one of them): the Sanye, who speak a Cushitic language with possible Khoisan influences. There really isn’t that much information on these people, which is why when I first heard about the preliminary results a few years back and looked around for Khoisan-like populations in Kenya I wasn’t sure I’d hit upon the right group. But at ASHG I saw some STRUCTURE plots with the correct populations, and the Sanye were one of them. I would have liked to see something like TreeMix, but the STRUCTURE results were of a quality that I could accept that these populations were not being well modeled by the variation which dominated their data set. Though Cushitic in language the Sanye had far less of the West Eurasian element present among other Cushitic speaking populations of the Horn of Africa. Neither were their African ancestral components quite like that of the Nilotic or Bantu populations. The clustering algorithm was having a “hard time” making sense of them (it seemed to wanted to model them as linear combinations of more familiar groups, but was doing a bad job of it).
Here is an interesting article on these groups: Little known tribe that census forgot. Like the Sandawe this is a population which seems to have been hunter-gatherers very recently, and to some extent still engage in this lifestyle. In this way I think they are fundamentally different from Indian tribal populations, who are often held up to be the “first people” of the subcontinent. More and more it seems that the tribes of India are less the descendants of the original inhabitants of the subcontinent, at least when compared to the typical Indian peasant, and more simply those segments of the Indian population which were marginalized and pushed into less productive territory. Over time they naturally diverged culturally because of their isolation, but the difference was not primal. In contrast, groups like the Sanye and Sandawe may have mixed to a great extent with their neighbors (and lost their language like the Pygmies), but evidence of full featured hunting & gathering lifestyles implies a sort of direct cultural continuity with the landscape of eastern Africa before the arrival of farmers and pastoralists from the west and north.
* I understand some readers refuse to accept the likelihood of these results because of other lines of information. I am just relaying the results of the geneticists. I am not interested in re-litigating prior discussions on this. We’ll probably have a resolution soon enough.
In conclusion, we propose that the ancestral H2′ haplotype arose in eastern or central Africa and spread to southern Africa before the emergence of anatomically modern humans…Approximately 2.3 million years ago, the inversion rearranged to what we now refer as the direct orientation haplotype (H1′). This haplotype spread throughout the Homo ancestral populations in the African continent, virtually replacing the H2′ haplotype and becoming the predominant haplotype. We note that both the Denisova and Neandertal sister groups are predicted to have H1′ haplotypes…These early haplotypes were much simpler in their duplication architecture, similar to the patterns seen in great apes. We find that the more complex duplication architectures are particularly enriched in populations that migrated out of Africa. On the basis of sequence at the duplication loci, we estimate that the H2-specific duplication event occurred approximately 1.3 million years ago. Independent of the H2 duplication, the H1-specific duplication event occurred much more recently, approximately 250,000 years ago. Notably, we did not observe this haplotype in any of the African or Asian populations studied, suggesting that it may have been lost in these groups as a result of genetic drift. The H2D haplotype has risen to frequencies of 10–25% in European populations with virtually no genetic variation, suggesting an extremely recent and rapid expansion of this haplotype. High-coverage sequencing of more individuals along with fecundity data will likely shed further light on whether the high frequency of the haplotype-specific duplication in Europeans is due to selection or the effects of demographic history specific to this locus.
H2D individuals are susceptible to disease. If there is a fitness gain, there is also a loss. Despite his eugenical enthusiasms W. D. Hamilton ultimately gave up on the idea because he admitted it was difficult to predict what was beneficial and what is deleterious. Context matters. The distribution of haplotypes in this region seems to reflect echoes of deep pre-“Out of Africa” history in our species.
After the second Henn et al. paper I did download the data. Unfortunately there are only 62,000 SNPs intersecting with the HGDP. This is somewhat marginal for fine-grained ADMIXTURE analyses, though sufficient for PCA from what I recall. That being said, the intersection with the HapMap data sets runs from ~190,000 SNPs, to the full 250,000 SNPs (this makes sense since the Henn et al. #2 data set has some HapMap populations in it). So I’ve been experimenting a fair amount in the past few days, and I thought I would post on one issue which was clear in the original paper, but which I have replicated.
There is a new paper in PLoS Genetics out which purports to characterize the ancestry of the populations of northern Africa in greater detail. This is important. The HGDP data set does have a North African population, the Mozabites, but it’s not ideal to represent hundreds of millions of people with just one group. The first author on this new paper is Brenna Henn, who was also first author on another paper with a diverse African data set. Importantly the data was posted online. Unfortunately though most of the populations didn’t have too many markers. This isn’t an issue in an of itself, but it becomes a big deal when trying to combine it with other data sets. If you limit the markers to those which intersect across two data sets you start to thin them down a lot, to the point where they’re not useful. Though the the results of the paper are worth talking about, the authors claim that they’ll be putting the data online. This is important because they used a large number of markers, so the intersections will be nice (I can, for example, envisage exploring the relationship between the North Africans and the IBS Iberian sample in the near future).
As for the paper itself, Genomic Ancestry of North Africans Supports Back-to-Africa Migrations:
In my post below, Tutsi probably differ genetically from the Hutu, there were many comments. Some I did not post because they were rude, though they did ask valid questions. I will address those issues, but let me quote one comment:
That’s an interesting possibility, but this admixture run didn’t split the non-hunter-gatherer Africans that well. In one of your previous analyses on East Africa you managed to get a pretty accurate ‘Afro-Asiatic/Cushitic’ and ‘Nilotic’ cluster. Is it possible that you could run this Tutsi sample using the same admixture settings as in the ‘Flavors of Afro-Asiatic’ blog post to see if he carries a significant Nilotic component or is mainly Bantu & Cushitic derived?
So I replicated ADMIXTURE runs for many of the same populations as I did in my post, Flavors of Afro-Asiatic. I also pared down the population set and generated a PCA with EIGENSOFT. Before I get to those results, let me tackle the questions.
1) “Are the Luhya suitable proxies for the Hutus?”
Probably. The reason is that Bantu-speaking populations, from the Congo to South Africa, are surprisingly similar. Not only that, but these populations are very distinctive from groups which are close them geographically, but linguistically different (e.g., Khoe, Sandawe, Masai). The Luhya are not exceptional. I’ve run the Henn et al. data sets enough to be convinced that they’re exactly as they should be. They are pretty much what you’d expect from Kenyan Bantu. A predominant element which ties them back to an East-Central African point of origin, with some admixture with other East African elements (similarly, South African Bantu exhibit Khoisan admixture). The Hutu may be peculiar, but we don’t know, and my null is that they’re mostly Bantu with some admixture, as is the case with most Bantu speaking populations (this one Tutsi seems to be an exception in that context, as they are presumably Bantu speaking). If you think that the the Luhya are not suitable, I invite you to download the HapMap Luhya, and merge them with some of the Henn et al. data sets (or HGDP or Behar data sets). I think that should convince you.
A Cape Coloured family
I’ve mentioned the Cape Coloureds of South Africa on this weblog before. Culturally they’re Afrikaans in language and Dutch Reformed in religion (the possibly related Cape Malay group is Muslim, though also Afrikaans speaking traditionally). But racially they’re a very diverse lot. In this way they can be analogized to black Americans, who are about ~75% West African and ~25% Northern European, with the variance in ancestral proportions being such that ~10% are ~50% or more European in ancestry. The Cape Coloureds though are much more complex. Some of their ancestry is almost certainly Bantu African. This element is related to the West African affinities of black Americans. And, they have a Northern European element, which likely came in via the Dutch, German, and Huguenot settlers (mostly males). But the Cape Coloureds also have other contributions to their genetic heritage. Firstly, they have Khoisan ancestry, whether from Bushmen or Khoi. This is well known in their oral memory. The the hinterlands of the Cape of Good Hope are beyond the ecological range of the Bantu agricultural toolkit, so the region was still dominated by the Khoisan when the Europeans arrived. But there are also other suggestions of ancestry from Asia. The existence of the Cape Malays, whose adherence to Islam derives from the Muslims slaves brought by the Dutch, hints at likely relationships to the populations of maritime Southeast Asia. Finally, there are the Indians. This element is not too well recalled in cultural memory. But the Dutch brought many slaves from India as well as Southeast Asia. The Dutch first governor of the Cape Colony had a maternal grandmother who was an Indian slave, by various accounts Goan or Bengali (the town of Stellensbosch is named for him). No doubt it was far more likely that the usual lot of the descendants of Indian slaves during the Dutch era would be to be absorbed into the melange of the Coloured population than assimilated into what later became the Afrikaners.
Why is this aspect of Cape Coloured ancestry forgotten? I think part of the reason is that there is a large South African Indian community present today, but that community post-dates the Dutch period, and arrived with the British. When South Africans think of Indians they think of these people. Interestingly when the new genetic studies confirming Indian ancestry came on the scene I was “corrected” several times by Indians themselves when reporting this part of the Coloured heritage. They were under the impression I must be mistaken, as no one was familiar with the Cape Coloureds having Indian ancestry. Unfortunately pointing to PCA and STRUCTURE plots did not clear up the confusion.
In any case, thanks to the African Ancestry Project I now have three unrelated Coloured samples (I have more, but they are related). Since AAP is Afrocentric I thought it would be appropriate to run the Coloured samples separate first. So that’s what I did.
The Pith: There has been a long running argument whether Pygmies in Africa are short due to “nurture” or “nature.” It turns out that non-Pygmies with more Pygmy ancestry are shorter and Pygmies with more non-Pygmy ancestry are taller. That points to nature.
In terms of how one conceptualizes the relationship of variation in genes to variation in a trait one can frame it as a spectrum with two extremes. One the one hand you have monogenic traits where the variation is controlled by differences on just one locus. Many recessively expressed diseases fit this patter (e.g., cystic fibrosis). Because you have one gene with only a few variants of note it is easy to capture in one’s mind’s eye the pattern of Mendelian inheritance for these traits in a gestalt fashion. Monogenic traits are highly amenable to a priori logic because their atomic units are so simple and tractable. At the other extreme you have quantitative polygenic traits, where the variation of the trait is controlled by variation on many, many, genes. This may seem a simple formulation, but to try and understand how thousands of genes may act in concert to modulate variation on a trait is often a more difficult task to grokk (yes, you can appeal to the central limit theorem, but that means little to most intuitively). This is probably why heritability is such a knotty issue in terms of public understanding of science, as it concerns the component of variation in quantitative continuous traits which is dispersed across the genome. The traits where there is no “gene for X.” Additionally, quantitative traits are likely to have a substantial environmental component of variation, confounding a simple genotype to phenotype mapping.
Arguably the classic quantitative trait is height. It is clear and distinct (there aren’t arguments about the validity of measurement as occurs in psychometrics), and, it is substantially heritable. In Western societies with a surfeit of nutrition height is ~80-90% heritable. What this means is that ~80-90% of the variance of the trait value within the population is due to variance of the genes within the population. Concretely, there will be a very strong correspondence between the heights of offspring and the average height of the two parents (controlled for sex, so you’re thinking standard deviation units, not absolute units). And yet height is at the heart of the question of the “missing heriability” in genetics. By this, I mean the fact that so few genes have been associated with variation in height, despite the reality that who your parents are is the predominant determination of height in developed societies.
Over the past few days I’ve been trying to read a bit on the Sandawe. Most of the stuff I’ve been able to find is in the domain of linguistics, and is basically unintelligible to me in any substantive manner. The crux of the curiosity here is that the Sandawe, like their Hadza neighbors, have clicks in their language, and so have been classified with the Khoisan. Here’s some background:
The most promising candidate as a relative of Sandawe are the Khoe languages of Botswana and Namibia. Most of the putative cognates Greenberg (1976) gives as evidence for Sandawe being a Khoesan language in fact tie Sandawe to Khoe. Recently Gueldemann and Elderkin have strengthened that connection, with several dozen likely cognates, while casting doubts on other Khoisan connections. Although there are not enough similarities to reconstruct a Proto-Khoe-Sandawe language, there are enough to suggest that the connection is real.
I can’t speak to the validity of this at all, obviously. Some scholars do argue that the clicks in the Sandawe language were only acquired through interaction with peoples such as the Hadza, making an analogy to Xhosa, a Bantu language which has been strongly influenced by Khoi dialects. In any case, after having run ADMIXTURE a bunch of times on African population sets, and checked the genetic distances of the inferred ancestral ones, one thing that is clear is that the Sandawe don’t show a particularly close genetic relationship to the Bushmen, nor do they show a close relationship to the Hadza. In fact, the Hadza, Pygmies, and Bushmen show a closer relationship to each other, distant as it is, than to the Sandawe. The Sandawe themselves are distinctive from their Bantu neighbors, but, their connections seem more clear to the Masai and other peoples to the north.
Some of the anthropological stuff that I did find on the Sandawe not having to do with linguistics considered the issue of their status as hunter-gatherers, and their shift toward a form of agriculture within the past few centuries. Not surprisingly much of this literature consisted of ideologically shrill posturing, denouncing past scholarship for insensitivity and bigotry, while taking their own maximalist position. For example there has been the hypothesis that hunter-gatherer populations tend to be genetically and culturally isolated from agriculturalists, with several African groups used as exemplars. A group of anthropologists argue strenuously that this model may just be a construction of the biases of previous generations of scholars. But they offer little in the way of counterargument, more keen on uncovering the faults in the motives and methods of their predecessors than in building anything anew.
Genetics can help us a little here. Below are the results of ADMIXTURE and PCA I ran for a selection of populations. I pulled in some Behar et al. samples and merged it with the Henn et al. data set. The marker list was pruned down to ~160,000 SNPs. The limited selection of populations was conscious, insofar as I was exploring specific questions about the relationship of East African populations to Eurasian ones. At K = 8 the populations in my data set separated rather well. Do not take this separation as evidence that this K is a reflection of absolute concrete ancestral populations. Here’s the bar plot:
Dienekes mentioned today a new paper, Signatures of the pre-agricultural peopling processes in sub-Saharan Africa as revealed by the phylogeography of early Y chromosome lineages. Because of the recent comments in this space on the genetic history of Africa I was curious, but after reading it I have to say I can’t make much sense of the alphabet soup of haplogroups. Remember, there are different ways to capture and analyze the variation in one’s genes. A common activity is to sweep over the whole genome and focus on single nucleotide polymorphisms, variation at the base pair level. So my own analyses using ADMIXTURE focus on tens or hundreds of thousands of such markers. But there are other types of genomic variation, such as copy number, microsatellites, and minsatellites.
Additionally, much of the older human phylogeographic literature focused on mtDNA and Y chromosomal variance. For mtDNA it was partly a function of how easy it was to extract the genetic material (it’s copious on the cellular level). But perhaps more importantly these two types of variance aren’t subject to recombination. This means they are defined by clean phylogenetic trees which do not exhibit reticulation (recombination chops apart correlated markers and mixes & matches them) and presumably are not subject to natural selection, and so perfect for coalescent theory. So you can posit lineages related to each other by steps of sets of mutations, and also easily calculate the time until the last common ancestor for two different branches of the tree using a “molecular clock” model.
Here’s the abstract:
The Pith: I review a recent paper which argues for a southern African origin of modern humanity. I argue that the statistical inference shouldn’t be trusted as the final word. This paper reinforces previously known facts, but does not add much that both novel and robust.
I have now read the paper which I expressed a touch of skepticism toward yesterday. Do note, I did not dispute the validity of their results. They seem eminently plausible. I was simply skeptical that we could, with any level of robustness, claim that anatomically modern humans arose in southern vs. eastern, or western, Africa. If I had to bet, my rank order would be southern ~ eastern > western. But my confidence in my assessment is very low.
First things first. You should read the whole paper, since someone paid for it to be open access. Second, much props to whoever decided to put their original SNP data online. I’ve already pulled it down, and sent off emails to Zack, David, and Dienekes. There are some northern African populations which allow us to expand beyond the Mozabites, though unfortunately there are only 55,000 SNPs in that case (I haven’t merged the data, so I don’t know how much will remain after combining with HapMap or HGDP data set).
In the open thread someone asked: “Any recent stuff on the genetics of Ethiopians.” That prompted me to look around, because I’m curious too. Poking around Wikipedia I couldn’t find anything recent. A lot of the studies are older uniparental lineage based works (NRY and mtDNA). Ethiopia is interesting because unlike almost all other Sub-Saharan African nations it has a long written history. Culturally and linguistically it has both Sub-Saharan African, and non-Sub-Saharan African, affinities. The languages of highland Ethiopia are clearly Semitic. Those of lowland Ethiopia are Cushitic, a branch of the broader Afro-Asiatic language family concentrated around the Horn of Africa (Somali is a Cushitic language, though most Ethiopian nationals who speak a Cushitic dialect are of the Oromo group).
From a human evolutionary genetic perspective, Ethiopia also has specific interest. It is likely that the main recent pulse of humans Out of Africa traversed this region. Additionally, there is some evidence of deep time connections between the groups ancestral to Ethiopians and the Khoisan of southern Africa. It may be that Ethiopians and Khoisan are reservoirs of ancient genetic variation in Sub-Saharan Africa which as been overlain by Bantu in most other regions outside of West Africa. Finally, Ethiopians are known to have high altitude adaptations. This could be due to long term residence in the region, or, assimilation of favorable alleles from the long term residents by later populations.
Fortunately we can get a sense of the genetic affinities of Ethiopians thanks to a paper published last spring, The genome-wide structure of the Jewish people. The focus was clearly on Jews, but they surveyed Amhara & Tigray (Semitic speaking highlanders), Ethiopian Jews (similar ethnically to the Amhara & Tigray, but religiously non-Christian), and Oromo. In the PCA the Oromo and Semitic speaking populations are pretty obviously distinct clusters.
Last year a paper came out in Science which made a rather large splash, The Genetic Structure and History of Africans and African Americans by Tishkoff et al. Since it’s more than a year old I recommend that those of you curious about the details of the paper and don’t have academic access go through the free registration, as you can then read it in full. Unlike Reich et al. the Science paper didn’t unveil a new method of analysis. It was the standard bread & butter, with PCA’s & STRUCTURE plots & phylogenetic trees. But the coverage of populations within Africa was massive. They had a lot of results and relationships to cover, and ended up with a 100 page supplement.
I commend the whole paper to you. But there are two elements I want to highlight. First, a three dimensional PCA plot. It has the first, second and third principal components of variation. In other words, the three largest independent dimensions in terms of explanatory power of genetic variation. Panel A includes all world populations, and panel B just Africans.
Since we’ve been talking about Fst a fair amount, I thought it might be nice to put it in some concrete graphical perspective. First, to review Fst in the genetic context measures the proportion of genetic variation which can be attributed to between population differences. To give a “toy” example if you randomly divided the population of a large Swedish village into two groups, and calculated their Fst, it should be ~ 0, because if you randomly select from an unstructured population by definition there shouldn’t be noticeable between population differences. In contrast, if you compare a Swedish village to a Japanese village, a large fraction of the genetic variation is going to be distinct to each population. Around ~10% of the genetic variation in fact will be between the two groups. Many of the genes will be extremely informative, so that if you know the allelic state from a given individual you can predict with a high degree of certitude which population that individual was from (e.g., SLC24A5 and EDAR). A small set of ancestrally informative alleles would produce a sequence of conditional probabilities of extremely high certitude (on the order of 10 genes for these two populations should suffice, perhaps three for “government work”).
But to put this in perspective, and show how genetic variation differs from locale to locale, I though I would compare continental-scale Fst values with that in a small region, southern Africa. The Fst values for the first I obtained from Investigation of the fine structure of European populations with applications to disease association studies, and the second, Complete Khoisan and Bantu genomes from southern Africa. The Bantu in this case is Desmond Tutu, who is from the Xhosa tribe, and has substantial admixture from the non-Bantu populations which were resident in South Africa prior to the arrival of the Bantus.
First, in tabular format: