Genetic variation within Africa (and the world)

By Razib Khan | August 22, 2010 1:16 pm

Last year a paper came out in Science which made a rather large splash, The Genetic Structure and History of Africans and African Americans by Tishkoff et al. Since it’s more than a year old, I recommend that those of you who are curious about the details and don’t have academic access go through the free registration, as you can then read it in full. Unlike Reich et al., the Science paper didn’t unveil a new method of analysis. It was the standard bread and butter: PCAs, STRUCTURE plots and phylogenetic trees. But the coverage of populations within Africa was massive. They had a lot of results and relationships to cover, and ended up with a 100-page supplement.

I commend the whole paper to you. But there are two elements I want to highlight. First, a three-dimensional PCA plot. It has the first, second and third principal components of variation. In other words, these are the three independent dimensions that explain the most genetic variation. Panel A includes all world populations, and panel B just Africans.
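For readers who want to see where "percent of variance explained" comes from, here is a minimal sketch in Python. It is not the pipeline used in the paper: the genotype matrix is simulated, the two populations are hypothetical, and the function name `pca_explained_variance` is my own. Published analyses often also scale each marker by its allele frequency, which this sketch skips.

```python
import numpy as np

def pca_explained_variance(genotypes, n_components=3):
    """PCA on a samples x markers matrix via SVD.

    Returns the sample coordinates on the leading components and the
    fraction of total variance each of those components explains.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                      # center each marker
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    variance = s ** 2                           # proportional to variance per PC
    ratio = variance / variance.sum()
    scores = U[:, :n_components] * s[:n_components]
    return scores, ratio[:n_components]

# Toy data (an assumption for illustration): two hypothetical populations,
# 30 individuals each, 200 markers coded as 0/1/2 allele counts.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, 200))
pop_b = rng.binomial(2, freq_b, size=(30, 200))
genotypes = np.vstack([pop_a, pop_b])

scores, ratio = pca_explained_variance(genotypes)
print("variance explained by PC1..PC3:", ratio)
```

The `ratio` values are the same kind of number as the percentages quoted for each PC: each one is that component's share of the total variance in the centered data.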


For panel A, PC1 = 20% of the variance, PC2 = 5%, and PC3 = 3.5%. For panel B the PCs didn’t drop off quite so much: PC1 = 11%, PC2 = 6%, PC3 = 5% and PC4 = 4%. In case you don’t know, the Hadza are Africa’s last obligate hunter-gatherers, and speak a language with clicks in it, just as the Bushmen do. The big division highlighted in this paper is that between the “indigenous” relict populations, the Hadza, Sandawe, Bushmen and Pygmies, and those who belong to the more widespread agriculturalist and pastoralist societies of Africa. Implicit within the paper is the model of a Bantu Expansion of farmers, as well as a possible later Nilotic expansion of herders (which brought the Tutsi and Maasai), in a north-south direction. In the process they assimilated and/or displaced the indigenous populations, of whom the aforementioned peoples are relict islands persisting in ecologically isolated or unfavorable domains.

The map to the left shows the population coverage within this paper of African groups. The pie graphs simply show ancestral quanta as inferred by STRUCTURE. You can read the paper for the blow-by-blow. But ultimately it seems there will be a need for finer-grained coverage to the south of the equator. If the Bantu expansion is as recent as archaeologists and linguists assume, on the order of ~2,000 years ago, then the gradients of genetic signals should persist. From what I can tell it is assumed on both genetic and phenotypic grounds that the Xhosa have a higher load of Khoisan ancestry than the Zulu or Tswana. The Bantu Expansion is recent enough that the semi-legendary Phoenician circumnavigation of Africa would have encountered many Khoisan peoples along the eastern coast.

Below is a selection of figures from the above paper. After selecting an image it is probably best to hit F11 for “Full Screen” if you aren’t on a very big monitor (you can copy image location and view it in a separate window as well).


CATEGORIZED UNDER: Genetics, Genomics

Comments (5)

  1. I don’t mean to bring up a tangential point to the post, but why does the field of human genetics use PCA to visualize relationships? When I see plots like those shown here that have a ‘geometric pattern’ to them (the sharp right angles; another common pattern is a Y-shape), that tells me that there are lots of samples with zeros for many of the Y-variables (i.e., alleles that are unique to certain populations). Thus, the spatial arrangement of the points is largely an artifact of an inappropriate method: how does one calculate a correlation matrix when many of the things one is correlating have values of zero?

    If one really was keen on using PCA, one could calculate a pairwise distance matrix and then use that instead of the correlation matrix (Principal Coordinates Analysis).
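    A minimal sketch of what that suggestion could look like in Python, using classical Principal Coordinates Analysis on a pairwise distance matrix. The presence/absence data, the choice of Manhattan distance, and the function name `pcoa` are all assumptions for illustration, not anything taken from the paper.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical PCoA (metric MDS) from a square pairwise distance matrix."""
    D2 = np.asarray(dist, dtype=float) ** 2
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ D2 @ J                       # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)        # B is symmetric
    order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean distances can give negative eigenvalues; clip them to 0.
    keep = np.clip(eigvals[:n_axes], 0.0, None)
    return eigvecs[:, :n_axes] * np.sqrt(keep)

# Toy data (hypothetical): presence/absence of 6 alleles in 4 samples.
# Many entries are zero, as in the situation described above.
alleles = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
])

# Manhattan distance: the number of alleles at which two samples differ.
dist = np.abs(alleles[:, None, :] - alleles[None, :, :]).sum(axis=2)
coords = pcoa(dist)
print(coords)
```

    The ordination here depends only on the chosen distance between samples, so swapping in a different distance measure changes the picture without touching the rest of the code.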

    Just curious (Really. I’m not implying that TEH DARWINISMZ ARE FALSIFIED! or anything ludicrous like that).

  2. “The big division highlighted in this paper is that between the “indigenous” relict populations, the Hadza, Sandawe, Bushmen and Pygmies.” There’s no doubt that Hadza, Sandawe and San are distinct from other African populations. The authors do include Pygmies in the mix, but this inclusion looks forced and derived from a long-standing belief that Pygmies, because they are short, must have lost their original languages. At some point the authors reveal the following: “Both language and geography explained a significant proportion of the genetic variance, but differences exist between and within the language families (table S5 and fig. S33, A to C) (4). For example, among the Niger-Kordofanian speakers, with or without the Pygmies, more of the genetic variation is explained by linguistic variation (r2 = 0.16 versus 0.11, respectively; P < 0.0001 for both) than by geographic variation (r2 = 0.02 for both; P < 0.0001 for both).” It looks like Pygmy genetic and linguistic affiliations are in sync, which means that either the Niger-Congo family is very old (and the Bantu expansion took place much earlier than we think), or, more likely, that Pygmies have always spoken Niger-Congo (and Bantu) languages and are not a relic population but a foraging "arm" of the Niger-Congo expansion. In mtDNA and Y-DNA terms, Pygmies and Bantu belong to different closely related subclades of the same clades, which is in perfect alignment with their linguistic kinship. In Razib's "Genetic Distance Tree 1" Pygmies and Bantu again cluster next to each other, with San populations being an outgroup for both.

    Tishkoff et al. observe close correspondence between genes and languages in Africa, and this is one good case in which languages and genes tell the same story. This story is different, however, from the common belief that Pygmies are relic African populations with roots in the Upper and Middle Pleistocene.

    Another quote of note: "Thus, modern humans have existed continuously in Africa longer than in any other geographic region and have maintained relatively large effective population sizes, resulting in high levels of within-population genetic diversity (1, 2). Africa contains more than 2000 distinct ethnolinguistic groups representing nearly one-third of the world’s languages (3). Except for a few isolates that show no clear relationship with other languages, these languages have been classified into four major macrofamilies…" The fact that languages and genes in Africa seem to be closely aligned with each other, and that African linguistic diversity falls into a limited number of families, again suggests that African genetic diversity may not be as old as is usually assumed. In contrast, in America between-population diversity is high (see Razib's recent posts on Fst) and languages fall into 140 language families, which may suggest high antiquity.




