ADMIXTURE, African Ancestry Project, and confirmation bias

By Razib Khan | May 2, 2011 1:29 pm

I’ve been running the African Ancestry Project on the side on Facebook for a while now. But it’s getting unwieldy, so I finally set up the website. The main reason I started it is that there have been complaints for a while now about problems with the 23andMe “ancestry painting” and the like for some African groups. For example, a Nubian might be 70% “European.” One might argue that this is due to Arab admixture, but this is clearly not so if you look at the PCA plot. What’s going on? Probably a problem with the reference populations (only Yoruba for Africa), ascertainment bias in the chip (they’re tuned to European variation), and the fact that African genetic variance can cause some issues. I don’t know. But the problem has been persistent, and since most of the other genome blogging projects exclude Africans because they’re so genetically diverse, I decided to take it on.

Three groups of people have submitted:

– People of the African Diaspora in the New World

– People from Africa, disproportionately Northeast Africans (Horn of Africa + Nubia, etc.)

– People of some suspected or known minor component of African ancestry

I’m at ~70 participants now. As one reference population set I’ve been using a subset of Henn et al. as well as some populations from Behar et al. I call this my “thin” set since there are only ~40,000 SNPs. A “thick” set has on the order of 300-400 thousand markers, but fewer populations. I’ve been putting the AAP members through ADMIXTURE in batches of 10, but I also run them all together sometimes for apples-to-apples comparisons. Yesterday I ran AF001 to AF070 from K = 2 to K = 14, unsupervised, with the thin reference. If you want to see all the results, go here. Doing all this myself over and over has given me some intuition as to the pitfalls in this sort of analysis, especially in the area of confirmation bias.
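
For readers who want to try a sweep like this themselves, here is a minimal sketch of how the K = 2 to K = 14 runs could be scripted. It only builds the command lines and does not execute anything. The file name aap_thin.bed is a made-up placeholder, not the project's actual file, and the --cv flag (which asks ADMIXTURE to report a cross-validation error for each K) is my addition for illustration; nothing above says it was used in these runs.

```python
def admixture_commands(bed_file, k_min=2, k_max=14, cross_validate=True):
    """Build one ADMIXTURE command line per K.

    ADMIXTURE runs unsupervised unless told otherwise, so no extra flag
    is needed for that. Each command is returned as a list of arguments.
    """
    commands = []
    for k in range(k_min, k_max + 1):
        args = ["admixture"]
        if cross_validate:
            args.append("--cv")  # assumption: cross-validation is wanted
        args += [bed_file, str(k)]
        commands.append(args)
    return commands


if __name__ == "__main__":
    # "aap_thin.bed" is a hypothetical PLINK .bed file name.
    for args in admixture_commands("aap_thin.bed"):
        print(" ".join(args))
```

Each printed line can be pasted into a shell, or passed to subprocess.run if you would rather launch the runs from Python.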

This is how it happens. Let’s say you have a lot of individuals from dozens of populations and hundreds of thousands of markers. Obviously you can modulate the parameters a bit: the number of individuals, the weighting of the various populations, and how thick your marker set is going to be. There’s a practical reason to make your marker set thinner: the algorithm runs much faster. But as you reduce the number of markers the outcomes become much noisier. That’s evident when you look at individuals’ results, and not the population-pooled ones. Varying the population set also matters a lot. If you have a sample with 75 Yoruba and 25 Druze vs. 50 Yoruba and 50 Druze, that can produce different results over the same range of K. Finally, obviously reducing the number of individuals causes problems with representativeness. Here the results become “noisy” at the population level, as a regional bias can distort your perception of a given population.
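
The point about thin marker sets being noisier at the individual level can be illustrated with a toy simulation. To be clear, this is not what ADMIXTURE does internally. It is a simplified sketch that assumes two source populations whose allele frequencies are known exactly, one admixed individual with a true ancestry fraction of 0.7, and a plain least-squares estimator. All of the names and numbers are mine, chosen for illustration.

```python
import random
import statistics


def estimate_ancestry(true_q, n_markers, rng):
    """Least-squares estimate of an individual's ancestry fraction from population A.

    Toy model: source allele frequencies are known exactly, and each diploid
    genotype is drawn from the individual's expected allele frequency.
    """
    numerator = 0.0
    denominator = 0.0
    for _ in range(n_markers):
        p_a = rng.uniform(0.05, 0.95)  # allele frequency in source population A
        p_b = rng.uniform(0.05, 0.95)  # allele frequency in source population B
        p_ind = true_q * p_a + (1.0 - true_q) * p_b
        # Diploid genotype: two independent allele draws.
        allele_count = (rng.random() < p_ind) + (rng.random() < p_ind)
        diff = p_a - p_b
        numerator += (allele_count / 2.0 - p_b) * diff
        denominator += diff * diff
    return numerator / denominator


def spread_by_marker_count(true_q=0.7, marker_counts=(100, 1000, 10000),
                           replicates=40, seed=1):
    """Standard deviation of the estimate across replicates, per marker count."""
    rng = random.Random(seed)
    spread = {}
    for m in marker_counts:
        estimates = [estimate_ancestry(true_q, m, rng) for _ in range(replicates)]
        spread[m] = statistics.stdev(estimates)
    return spread


if __name__ == "__main__":
    for m, sd in spread_by_marker_count().items():
        print(f"{m:>6} markers: SD of estimated ancestry = {sd:.4f}")
```

In this toy setup the spread of the individual estimate shrinks as markers are added, roughly in proportion to one over the square root of the marker count. Real data adds linkage between markers and uncertainty in the source frequencies, so treat this as intuition only.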

How does this work with confirmation bias? If you are proactively searching for patterns which align with a particular model or expectation you can often simply modulate the parameters until you obtain “reasonable” results. The exact same issue crops up with multiple regression. And this need not be conscious. In the course of regular science, workers often ignore aberrant results and seek out positive ones. What we’re talking about is a general human bias. Researchers have been known to run experiments and keep tweaking them until the p-value reaches statistical significance. First, this treats the p-value as a “magical” number. That’s really not how it should be viewed, but that’s how it plays out in the course of attempting to get published. Second, the p-value itself is going to vary as well, which is why running an experiment over and over can get you the “right” result. The same general problems can crop up with ADMIXTURE. If you have a dedicated computer you can keep running the algorithm with a range of parameters until you get a “reasonable” result. You may also see bizarre results, and dismiss them out of hand as the program acting wonky. I’ve done it myself. But who knows, perhaps some of the “bizarre” results are stumbling upon a novel insight?
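
The p-value point is easy to check with a small simulation. The sketch below assumes a world where there is no real effect at all: each "experiment" is 20 draws from a standard normal distribution, tested against a mean of zero with a two-sided z-test. A researcher either runs it once, or keeps re-running it up to 10 times and stops at the first p < 0.05. The sample size, the number of attempts, and the choice of test are my assumptions for illustration.

```python
import math
import random


def two_sided_p_value(sample):
    """z-test p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(max_attempts, n_researchers=2000, sample_size=20,
                        alpha=0.05, seed=7):
    """Fraction of researchers who see p < alpha within max_attempts tries,
    even though the null hypothesis is true in every experiment."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_researchers):
        for _ in range(max_attempts):
            sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
            if two_sided_p_value(sample) < alpha:
                hits += 1
                break  # stop at the first "significant" result
    return hits / n_researchers


if __name__ == "__main__":
    print("one attempt:      ", false_positive_rate(1))
    print("up to 10 attempts:", false_positive_rate(10))
```

With independent attempts, the chance of at least one "significant" result in 10 tries is 1 - 0.95^10, or about 0.40, against 0.05 for a single try. The simulated rates should land near those figures. Re-running ADMIXTURE with tweaked inputs is not a series of independent tests, so the numbers do not carry over directly, but the logic of fishing until you get a bite is the same.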

I’m not making a postmodernist pure constructionist argument. These algorithms often give predictable and regular results. And some sought results are harder to attain than others (i.e., you have to keep fishing in the pool longer until you finally get a “bite”). But be very careful of relying on one chart or graph as a clincher in any argument. This includes stuff I’m presenting. Attempts at replication are important, but there’s only so much time. That’s why I’m encouraging readers to play with these programs themselves.

Speaking of confirmation of a model, I thought I’d do a little experiment. Below are the ~70 participants in the AAP at K = 10. I’m not showing you the reference populations. I will tell you that:

1) The largest number of participants are of New World African Diaspora descent

2) Second in number are those of mostly non-African descent who have some reputed or known minority African ancestry

3) A bit more than half a dozen individuals are of Northeast African ancestry, in full or part

4) One individual has recent Japanese ancestry and another has recent Maya ancestry

5) A minority of the New World Africans have origins in the Caribbean.

6) There are only a few individuals of West African national origin in the data set, but they are there

First, look at this image and make your guesses (leave them in the comments, please don’t spoil it for others by identifying who is what by ID after you confirm):

All the plots with reference results are here. Explicit self-identification is here.

CATEGORIZED UNDER: Genetics, Genomics
  • pconroy

    I’d reckon that the colors mean the following:

    1. Orange = West African = Yoruba
    2. Light Green = European = Tuscan
    3. Purple = East Africa
    4. Red = Pygmy
    5. Light Blue = North African

  • http://www.riverellan.blogspot.com Tom Bri

    Well, let’s see. A game. The New Worlders would be likely to show significant Euro signals, as would the NE Africans, and those with some reputed African ancestors. This seems likely to be the predominant signal since the large majority fall into these groups. So:
    Orange is Euro.
    Interesting is the Japanese/Carib/Mayan, whom I suspect all show the same signal. The Carib would be very minor, since they mostly just died in the plagues, but there should be some minor bit. So, The ‘Asian’ component would be quite strong in two and a very minor component in any others. I peg that as the dark blue, but that may be because my older eyes don’t clearly distinguish the two colors of blue.
    West Africa would be the light green.
    East Africa is purple, and shows up in the New Worlders.
    The minor bits I can’t guess.
    Distilled:
    Orange= Euro
    Dark Blue= Asian
    West African= Light Green
    East Africa= Purple

  • Eze

    Cool stuff Razib, your new Nubian sample is interesting, possibly the first time I’ve seen autosomal samples from Nubia (North Sudan). I wonder if Southern Egyptians will be similar to this sample.

  • http://www.riverellan.blogspot.com Tom Bri

    Oops, a typo or something there. Actually a brain-short. I flipped East and West in my mind somehow. West would be what shows up in New Worlders.

  • Al Cibiades

    Extremely pertinent discussion as I am struggling with Atkinson’s linguistic founder effect model of language expansion (Atkinson, Q.D (2011). Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa. Science 332, 346. DOI: 10.1126/science.1199295.)
    Q1: Does Bayesian Information Criterion mitigate confirmation bias?
    Q2: Are there quantitative methods which would clarify a confirmation bias?

Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm an American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com
