Tea leaves and population substructure

By Razib Khan | February 22, 2011 1:26 am

Image credit: Wikimol

Over the past few months I’ve been encouraging people to pull down ADMIXTURE and push the public data sets through it. You can also convert your 23andMe raw file into pedigree format pretty easily and integrate it into the public data sets with PLINK. I’ve been following Zack’s Harappa Ancestry Project pretty closely, but I’ve also been running the software myself, manipulating its parameters, and seeing how things shake out. But the more I do it, the more I wonder if it isn’t like regression analysis: a technique just waiting to be leveraged by human biases. I began thinking about this more deeply after a conversation with a computational biologist who outlined the structural problems with how ad hoc the use of statistics is in the life sciences.

These sorts of qualms are probably why I’m posting my results more on Facebook and passing them around to friends, rather than putting them out there in the public domain. It isn’t that I think the results are going to be abused. I just don’t know what they mean a lot of the time. Or, perhaps more honestly, I am suspicious of my own propensity to see what I suspect: a case of my priors strongly shaping the inferences which I might generate.

So I decided to do an experiment. Below are 8 runs, displayed as bar plots. Each thin sliver represents an individual. The colors again represent putative ancestral populations of which the modern populations are combinations, with their number set by the parameter K (so K = 2 means two ancestral populations, each corresponding to a different color). There are two data sets which I analyzed, group A and group B. I’ve also noted the K’s for each plot. But aside from that, I’ll leave you ignorant of what these populations are or how many there are. Jot down some ideas as to what you can see. How many populations? How do they relate to each other? Can you perceive any real information in the higher K’s? I’ll put the “answers” below the fold. There’s no point in me saying what I think; I already know which populations these are, so I’m tainted.
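For readers who want to look at the numbers behind plots like these rather than the colors, here is a minimal Python sketch. ADMIXTURE's output for a given K includes a table of ancestry fractions (the `.Q` file): one row per individual, one column per putative ancestral population, each row summing to roughly 1. The values and the labels `pop1` to `pop3` below are made up for illustration and are not taken from the runs in this post.

```python
# Hypothetical contents of a .Q file for K = 2 (invented numbers):
# one row per individual, one column per putative ancestral population.
Q_TEXT = """\
0.99 0.01
0.97 0.03
0.80 0.20
0.75 0.25
0.02 0.98
0.01 0.99
"""

# Hypothetical sample labels, one per row of Q_TEXT.
LABELS = ["pop1", "pop1", "pop2", "pop2", "pop3", "pop3"]


def parse_q(text):
    """Parse whitespace-delimited ancestry fractions into a list of rows."""
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    k = len(rows[0])
    for row in rows:
        if len(row) != k:
            raise ValueError("every row must have K columns")
        if abs(sum(row) - 1.0) > 1e-3:
            raise ValueError("ancestry fractions should sum to about 1")
    return rows


def mean_by_group(rows, labels):
    """Average the ancestry fractions of all individuals sharing a label."""
    totals, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in totals:
            totals[label] = [0.0] * len(row)
            counts[label] = 0
        totals[label] = [t + x for t, x in zip(totals[label], row)]
        counts[label] += 1
    return {label: [t / counts[label] for t in totals[label]]
            for label in totals}


rows = parse_q(Q_TEXT)
means = mean_by_group(rows, LABELS)
for label, fractions in means.items():
    print(label, [round(f, 3) for f in fractions])
```

Each bar in a plot is just one of these rows drawn as stacked colors, so the per-group averages are a rough numerical summary of what the eye picks out.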


All the populations were from the HapMap, with 80,000 markers. They were two sets of three:

Group A: Yoruba, African Americans, and Tuscans

Group B: Gujaratis, African Americans, and Mexicans

I selected these two sets because in Group A you have a population, African Americans, which can plausibly be considered a combination of the two other populations. In Group B, this is not so. The Mexican and African American populations share similar European ancestry, but both are admixed. The Gujarati population has no close relationship to any of the other populations, so ADMIXTURE is just trying to do the best it can with what it’s being given.
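The difference between the two groups can be illustrated with a toy calculation. The sketch below assumes the simplest possible picture: an admixed population's allele frequencies are a weighted average of two source populations' frequencies. It then recovers the weight by a least-squares projection, which is a stand-in chosen for brevity and not the likelihood-based estimation that ADMIXTURE itself performs. All frequencies and the names `p_a`, `p_b`, `p_c` are invented for illustration.

```python
# Hypothetical allele frequencies at four markers (invented numbers).
p_a = [0.1, 0.9, 0.5, 0.3]   # source population A
p_b = [0.8, 0.2, 0.4, 0.7]   # source population B
p_c = [0.5, 0.5, 0.9, 0.1]   # a population unrelated to A and B


def mix(src1, src2, q):
    """Expected frequencies for a population with fraction q from src1."""
    return [q * a + (1 - q) * b for a, b in zip(src1, src2)]


def estimate_q(target, src1, src2):
    """Least-squares estimate of the fraction of target drawn from src1."""
    num = sum((t - b) * (a - b) for t, a, b in zip(target, src1, src2))
    den = sum((a - b) ** 2 for a, b in zip(src1, src2))
    return num / den


def residual(target, src1, src2):
    """Squared error left over after the best two-source fit."""
    fitted = mix(src1, src2, estimate_q(target, src1, src2))
    return sum((t - f) ** 2 for t, f in zip(target, fitted))


# Like Group A: the target really is a mixture of the two references.
admixed = mix(p_a, p_b, 0.8)
print(estimate_q(admixed, p_a, p_b), residual(admixed, p_a, p_b))

# Like Group B: the target is forced onto references it does not derive from.
print(estimate_q(p_c, p_a, p_b), residual(p_c, p_a, p_b))
```

In the first case the fraction is recovered and almost nothing is left unexplained. In the second case the procedure still returns a fraction, but a large residual remains, and a bar plot shows you only the fraction.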

How did you fare?

CATEGORIZED UNDER: Genetics, Genomics
  • Peter Ellis

    I find these plots remarkably annoying to dig through because the colours keep changing. In group A, for example, you have red/blue components in the k=2 plot. In the k=4 plot, a component that’s essentially identical to the previous “red” component is now coloured green, while a component essentially identical to the previous “blue” component is now red. Reassigning the colours of the various components would make things a lot clearer.

  • John Emerson

    I think that there’s a warning in the recent collapse of macroeconomics only a few years after people were starting to claim that the basic problems had been solved. This was also a case when personal policy preferences contaminated the scientific process; economists denied that this criticism was valid, since all explicitly expressed aspects of these theories were neutral to policy. It was like a game where you could make your biases into pure science if you could manifest them entirely in scientific, neutral language.

    I was suspicious all along, but since I didn’t have an inside understanding of the science my opinions were worthless. I did guess right though. There was a pattern whereby increasing mathematical virtuosity made the science intelligible to increasingly fewer people. A similar pattern was seen in the business world, where unemployed mathematical physicists were hired by finance to produce increasingly sophisticated financial instruments which no one could understand. It turns out that these instruments were booby-trapped, as we now see. It really happened twice, just recently and in 1998 with Long Term Capital Management, which had two Nobelists on the board of directors.

    But it’s not really that if political preferences were finally excluded econ would be science. Political preferences can’t be excluded, after 75 years of trying, and it’s always going to be political, since econ is an applied science.

  • Markk

    Andrew Gelman had an interesting post a while back about what people take away. Someone recommended an exercise for a statistics class where several graphs are presented with no description on them. The students are asked to write the labels and a description of what the graph told them. This could be a similar example. We all can take away different things.

    In a larger sense isn’t this why we have real statistical tests? One has to =beforehand= decide what they are looking for and come up with some number for a test where they could say “I found it!” or “It isn’t there” or “I can’t tell”. What you are doing by playing is absolutely necessary to get the “some number” I mentioned above. In the end, in most cases, that number and test are really based on playing around.

  • John Roth

    I have to agree with Peter Ellis. I find these plots difficult to compare at higher Ks, partly because the colors keep changing, partly because the assignment randomly flips top to bottom, and partly because of the apparently random occurrence of vertical white lines. The first two have been a problem with just about every series of these plots I’ve seen, and tend to be extremely off-putting.

    As far as the analysis is concerned, in the first series, the pattern of three distinguishable populations is fairly obvious in the first chart and continues in the others. The second group isn’t quite as clear. If the higher Ks are trying to tell me anything, it’s not at all obvious what that should be. On the other hand, I’ve long been an advocate of the viewpoint that staring at anything for too long will show you patterns that simply aren’t there.

  • http://washparkprophet.blogspot.com ohwilleke

    It is a matter of using the right tools for the right purposes. Admixture comes into the analysis with a preconceived bias based on an arbitrarily set K number and an admixture of multiple ancestral population eigenvector model.

    When you know how many ancestral populations there are with some accuracy, the program fits the data to the model without computational agony for the user. But, this program is not designed to figure out how many clusters there are in a set.

    There are other statistical tools that simply look for clusters. PCA analysis and eyeballing the data is pretty good, although the problem it has that computer programs can solve, is that people have a very hard time seeing more than three or four dimensions at once (motion and colors and 3D can get you to five), while computers can see in many dimensions at once. When you have good reason to believe that a tree-like model is appropriate, there are some very good statistical computer programs that create tree-like clusters of data in a phylogenetic relationship.

    Neither admixture nor cluster analysis tells you anything about how closely related the clusters are to each other in an absolute sense as opposed to relative to each other. Tools like Fst measure that aspect.

    There is also nothing wrong with going into statistical analysis with strong Bayesian priors about how you expect the data to come out IF YOUR PRIORS ARE ACCURATE. A lot of the time in anthropology, your priors may actually be more accurate than your main data set. You may know exactly how many ancestral populations there are and when they came along, but not what they looked like genetically.

    Indeed, statistics are at their most powerful when you ask them simple questions. For example, the statistics of hypothesis testing, where one compares a small number of possibilities for likelihood (e.g. did modern European populations derive predominantly from hunter-gatherer populations, predominantly from LBK agriculturalists or predominantly from some other source) can have much more power at resolving a question in a way that supersedes your biases about the choices than when you ask them open-ended questions without clear choices.

    One problem with the statistics of dating divergence dates from genetic mutation rates is that the priors that are used to calibrate the dating aren’t very good themselves.
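Several commenters above point out that the colors change from one K to the next. One way to keep them stable, sketched here in Python under simple assumptions, is to match each component at the higher K to the most similar component at the lower K and reuse its color. The matching below is a greedy nearest-column heuristic of my own choosing for illustration, not something ADMIXTURE does, and the two small tables of ancestry fractions are invented.

```python
def column(q, j):
    """Return column j of a table of ancestry fractions."""
    return [row[j] for row in q]


def distance(u, v):
    """Sum of squared differences between two components."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def align_colors(q_prev, colors_prev, q_next, spare_colors):
    """Give each column of q_next the color of its closest q_prev column.

    Pairs are taken greedily from most to least similar; columns of
    q_next left unmatched receive colors from spare_colors in order.
    """
    k_prev, k_next = len(q_prev[0]), len(q_next[0])
    pairs = sorted(
        (distance(column(q_prev, i), column(q_next, j)), i, j)
        for i in range(k_prev) for j in range(k_next)
    )
    assigned, used_prev = {}, set()
    for _, i, j in pairs:
        if i not in used_prev and j not in assigned:
            assigned[j] = colors_prev[i]
            used_prev.add(i)
    spare = iter(spare_colors)
    return [assigned[j] if j in assigned else next(spare)
            for j in range(k_next)]


# Invented fractions for three individuals at K = 2 and K = 3.
q2 = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
q3 = [[0.05, 0.05, 0.9], [0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]

colors3 = align_colors(q2, ["red", "blue"], q3, ["green"])
print(colors3)  # ['blue', 'green', 'red']
```

Here the third component at K = 3 tracks the old "red" component and the first tracks the old "blue" one, so they keep those colors, and only the genuinely new component gets a new color.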





