You learn from failure

By Razib Khan | April 2, 2011 1:32 am

In yesterday’s post on African genetics I tried to work with a large set of populations, but narrowed the SNPs down to ~40,000. Today I thought I’d go another route and focus on having a thicker marker set, but with fewer populations. So I did a bunch of runs with 400,000 SNPs. Here’s K = 8. Please note, I did some “trial” runs and pulled out people with obvious recent admixture, or who were outliers within their population (e.g., Mozabites with a lot of Sub-Saharan African ancestry, or San who obviously had European ancestry).

Notice that there are three non-Sub-Saharan modal components. South of the Sahara the European one is absent. But here’s the weird thing. Below are MDS representations of genetic distance between the ancestral groups inferred above:

Now without Eurasians + North Africans:

All of these “ancestral” groups are abstractions. More plainly, they’re fake but useful (physicists would say “toy models,” economists “stylized facts”). But the Nilotic one seems kind of crazy here. I told the program to go look for 8 populations. It went and looked, and came back with a weird one. I guess that means I’ll have to do cross-validation from now on, even though that slows everything down.
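For readers who want to try the cross-validation step, here is a minimal sketch of how one might pick K from a set of runs. It assumes each run was made with ADMIXTURE's `--cv` option and that the logs contain one line per run of the form `CV error (K=8): 0.55290`; check that format against your own logs before relying on it. The function name `best_k` and the numbers in `example_log` are invented for illustration and are not results from this post.

```python
import re

# Assumed log line format: "CV error (K=8): 0.55290"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, {K: error}) from concatenated logs."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get), errors


# Made-up values, only to show the shape of the input.
example_log = """\
CV error (K=6): 0.55310
CV error (K=7): 0.55120
CV error (K=8): 0.55290
"""

k, errors = best_k(example_log)
print(k, errors)
```

A lowest CV error is only a guide. A component that is not modal in any population, like the “Nilotic” one here, is worth inspecting by hand whatever the error says.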

CATEGORIZED UNDER: Genetics, Genomics
  • onur

    Unlike the other components, the “Nilotic” component is not modal in any population. So I guess the problem is to do with the number of components. You may have to increase the K for a more plausible result.

  • Eze

    Where did you get the Sandawe 400K data from? Weren’t they from the Henn et al 55K dataset?

  • Dienekes

    Look at the individual results of the Nilotic component. I’ve had the experience of a weird Maasai component before, and it seems to be anchored on a few relatives. If you notice that a few individuals have 100% of this component and the rest have variable %, then that’s a warning sign of the possible inclusion of relatives or relative-like people in the analysis, which screws everything up. Even if only 2 people are closely related, they will most likely get their own component (at 100%) at high enough K, and all the rest of their ethnic group will get spurious variable % in that component.

  • onur

    Looking at the above results, what Dienekes says sounds entirely plausible to me. Such a weird “Nilotic” component indeed most probably results from the inclusion of relatives or relative-like people. I didn’t take into consideration such a reason in my above comment, as, unlike Dienekes, my knowledge of the individual results is rather limited.

  • Dwight E. Howell

    If you are sorting grain for making flour throwing away anything that doesn’t match the standard is the thing to do.

    However, if you are seeking new knowledge it might be best to look real hard at what you are about to discard. It is among the discards that you are most likely to find something novel that leads to truly new knowledge.

  • Zack

    Crossvalidation is very important. But first I would do IBD analysis on the data. I have been removing individuals who have high PI_HAT from my datasets and publishing the results now.

  • Razib Khan

    ok, 2 relatives in there….

  • Razib Khan

    Where did you get the Sandawe 400K data from? Weren’t they from the Henn et al 55K dataset?

    they put 550 K khoisan online too. hadza, sandawe, san

  • Diogenes

    This type of problem is avoided in ADMIXTURE’s supervised approach, I think.
    The problem with the “Nilotic” component is that it, the probable original Northwest African component (Guanches in the Canary Islands already had some L mtDNA) and the “!kung” (IMO original Egyptian) component in my last run are, I’m growing to believe, derived from a continuum inhabiting the Green Sahara, with West African as found in Yoruba being a more distant relative. Lake Chad was a freshwater lake almost the size of France up to 3000 BC, and there were many more rich river valleys in the region; this has to be taken into account…
    Sounds like another rich/unstable environment at the end of the ice age.

    The Saharan desertification may have led to the split only a few thousand years ago, into a NorthWestern branch (found in Fulani and Mozabite and to a lesser extent in other North Africans); an Eastern Nilotic Branch, and the Nile river population (“!kung”?).
    A second rich but unstable “Eden”? West African Neolithic may have taken time to expand, but there are indications it has existed for long in some more incipient form. Also pastoralism generally originates in the periphery of Neolithic regions, and the Fulani and Maasai may be descendants of Green Sahara pastoralists. Funny that Nile villages spring up from nowhere about the time of the drying up of the Eastern desert.

    Clines complicate the whole thing.

  • TGGP

    Off-topic, but you might be interested in the latest move by the “indigenous movement” against science.

  • ohwilleke

    The Nilotic v. West African genetic distance in the MDS certainly is larger than one would expect. Any combination of non-Pygmy, non-San Sub-Saharan Africans should be closer to each other than to Pygmy or San populations, or to European/North African/Middle Eastern populations. Dienekes’ observations seem on point on that score. Notably, the left to right axis seems to be about where it should be, while the top to bottom axis seems messed up. Is it possible to do an analysis replacing the top to bottom axis with a third PC dimension or something?

    San and the two Pygmy populations should be the easiest first divides away from other subsaharan Africans, and North African/Middle Eastern/European should be quite distinct from Subsaharan African. If you have two non-outlier subsaharan components left, what would they be?

    The red line which has been labeled “Nilotic” may actually be “Ancestral East African”, and the Nilotic component may simply be part of the yellow “West African” component at this level of detail. The archaeology, linguistics and uniparentals don’t suggest an ancient divide between Nilotic and West African, but do suggest an ancient divide between West African and East African. If Western Nilotic populations have low levels of red admixture, this theory would be supported.

    If the green components (Mozabite and Middle Eastern) in Horn-of-Africa are comparatively recent, and yellow is a stand in for both Nilotic and West African, the notion of East African Ancestral being a minority across the board makes a certain amount of sense.

  • humayun

    it would be very interesting if you include some haussas and other tchadic speaking populations

  • humayun

    sorry, I did not see the other thread
    it would be very interesting if you include other Chadic speaking populations than Haussas such as Tchadic speaking populations from Chad, Niger and Sudan (folks like the Massas and Dangaleyat)
    Indeed, Chadic is strongly connected with the Kushomotic branch of AA, and according to the free paper below, Afrasan was brought to western Africa by Cushitic-speaking pastoralists, so one could think that the Chadic-speaking populations of Chad, Sudan and Niger have more original Cushite input than the peripheral Haussas of North Nigeria
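The relatedness check that Dienekes and Zack describe in the comments above can be sketched roughly as follows. This is a hedged illustration, not anyone's actual pipeline: it assumes a whitespace-delimited pairwise table with `IID1`, `IID2` and `PI_HAT` columns (the layout of a PLINK `--genome` report), and the function name `related_to_drop`, the 0.2 cutoff and the sample IDs are all hypothetical.

```python
from collections import defaultdict


def related_to_drop(genome_text, threshold=0.2):
    """Greedily choose individuals to remove so that no remaining pair
    has PI_HAT above the threshold (0.2 is an illustrative cutoff)."""
    lines = [ln.split() for ln in genome_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    i1, i2, ip = (header.index(c) for c in ("IID1", "IID2", "PI_HAT"))

    # Record, for each individual, who they are too closely related to.
    partners = defaultdict(set)
    for r in rows:
        if float(r[ip]) > threshold:
            partners[r[i1]].add(r[i2])
            partners[r[i2]].add(r[i1])

    dropped = []
    while partners:
        # Drop whoever is involved in the most related pairs;
        # ties are broken alphabetically so the result is reproducible.
        worst = max(sorted(partners), key=lambda s: len(partners[s]))
        dropped.append(worst)
        for other in partners.pop(worst):
            partners[other].discard(worst)
            if not partners[other]:
                del partners[other]
    return dropped


# Hypothetical pairs: m1 is closely related to both m2 and m4.
example = """\
FID1 IID1 FID2 IID2 PI_HAT
MKK m1 MKK m2 0.50
MKK m1 MKK m4 0.25
MKK m1 MKK m3 0.02
MKK m2 MKK m3 0.01
"""

to_drop = related_to_drop(example)
print(to_drop)
```

Dropping the most-connected individual first tends to remove fewer samples than dropping one member of every flagged pair, which matters for small populations.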



Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm a American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at

