D.I.Y. population structure inference, part 1 of many

By Razib Khan | February 13, 2011 3:37 pm

If you’ve been reading this weblog for a while you’ve seen many images like the one above. It comes from the 2008 paper Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. The data set is from the Human Genome Diversity Project. It consists of 52 groups from around the world, curated for representativeness, but also ethnic distinctiveness. They utilized the FRAPPE program, which like STRUCTURE and ADMIXTURE estimates the ancestry of individuals (and, in the aggregate, populations) from a combination of components, the number of which you specify with the parameter K. In other words, this is model based. It works out really well when you have an intuition of the model you’re looking for. Imagine African Americans, who you can presume are a two-way admixture between two distinct ancestral populations. It works less well in other cases. For example, South Asians are modeled by 23andMe as a two-way admixture between Europeans and East Asians. Why this occurs is totally comprehensible; they have three (Chinese + Japanese = one) reference populations which are very different from South Asians. So the computer, being dumb but fast, simply slaps together the best inference possible from the weird constraints placed upon it. Garbage in, garbage out.
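To make the "dumb but fast" point concrete, here is a minimal Python sketch of the kind of likelihood this family of programs maximizes. It is a simplification, not how FRAPPE or ADMIXTURE are actually implemented: the real programs estimate the ancestral allele frequencies along with the ancestry fractions, whereas here the frequencies are fixed, K is fixed at 2, and the best fraction is found by brute-force grid search. The function names and all of the numbers are made up for illustration.

```python
import math

def expected_freq(q, marker_freqs):
    """Expected allele frequency at one marker: sum over k of q[k] * f[k]."""
    return sum(qk * fk for qk, fk in zip(q, marker_freqs))

def log_likelihood(genotypes, q, freqs):
    """Binomial log-likelihood of one individual's genotypes given fractions q.

    genotypes[j] is the allele count (0, 1 or 2) at marker j.
    freqs[k][j] is the allele frequency of marker j in ancestral population k.
    The binomial coefficient is dropped because it does not depend on q.
    """
    total = 0.0
    for j, g in enumerate(genotypes):
        p = expected_freq(q, [freqs[k][j] for k in range(len(q))])
        p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() away from 0 and 1
        total += g * math.log(p) + (2 - g) * math.log(1 - p)
    return total

def best_two_way_fit(genotypes, freqs, steps=100):
    """Grid-search the fraction from population 0 in a forced two-way model."""
    best = max(
        range(steps + 1),
        key=lambda s: log_likelihood(genotypes, (s / steps, 1 - s / steps), freqs),
    )
    return best / steps

# Toy reference frequencies for two hypothetical populations (invented values).
pop_a = [0.9, 0.1, 0.8, 0.2]
pop_b = [0.1, 0.9, 0.2, 0.8]

# An individual who is heterozygous at every marker.
print(best_two_way_fit([1, 1, 1, 1], [pop_a, pop_b]))  # 0.5
```

The toy individual comes out as a 50/50 mix whether or not they actually descend from either reference group. The model has no way to say "none of the above," which is the South Asian problem in miniature.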

But along with PCA, these sorts of algorithms, which allow one to visualize variance across hundreds of thousands of markers and hundreds of individuals, are very useful (though perhaps there are mentats amongst us who have no need of such techniques). You just have to use them with caution. Information may be free, but it can be misinterpreted!

Over the years of my blogging people have regularly asked questions of the form “are East Africans more closely related to West or South Africans?” They are easily answered: I would just look in the literature. But it did take time, and I’d have to pick the right figure, look for Fst, and so forth. That is changing.

The Nature piece “The rise of the genome bloggers” covered the change. Since last fall BGA and Dodecad have been dumping lots of bar plots and PCAs on the web. Instead of looking for a paper, I have now begun to use those sites as my resource of first choice (since they’re well indexed by Google). Now with HAP you have another source of information. It’s gotten to the point that technically capable commenters are now submitting their own results!

We’ve come a long way. Academics are not miserly with information, and some of my best friends have been the gatekeepers of the data and results. But now you can find the data on the web easily. You can reprocess the data by yourself. And, you can do the analysis yourself.

I’ve been sitting back for a while, letting Dienekes, Zack, etc., do their thing. There are so many technically fluent people out there, I’ve enjoyed just consuming the raw information yield. But that ends today. Over the past week I’ve been slapping together some R functions to make it easier for me to generate bar plots at various K’s, as well as PCAs. My goal is this: a reader asks a question, and I quickly constrain my data set appropriately and do the analyses, take the screenshots, and upload them to the servers here, and point them to the images in the comments. The main constraint should be the computational resources (ADMIXTURE can take hours). Yes, that’s where we’re at.
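The "various K’s" part of that workflow can be sketched roughly as follows. This is Python rather than the R functions mentioned above, and it is only an illustration: the file name hgdp_subset.bed and the helper admixture_commands are hypothetical. It assumes the usual ADMIXTURE command line (admixture, then the input .bed file, then K), which writes the ancestry fractions to a file named prefix.K.Q in the working directory.

```python
from pathlib import Path

def admixture_commands(bed_file, k_values):
    """Return (command, expected_q_file) pairs, one per value of K.

    bed_file is a PLINK .bed file; the name used below is a placeholder.
    """
    prefix = Path(bed_file).stem
    jobs = []
    for k in k_values:
        cmd = ["admixture", str(bed_file), str(k)]
        q_file = f"{prefix}.{k}.Q"  # ancestry fractions, one row per individual
        jobs.append((cmd, q_file))
    return jobs

jobs = admixture_commands("hgdp_subset.bed", range(2, 17))
for cmd, q_file in jobs:
    # Print rather than execute; to really run a job, pass cmd to
    # subprocess.run(cmd, check=True) and expect to wait a while.
    print(" ".join(cmd), "->", q_file)
```

Building the command list separately from running it makes it easy to check what will be launched before committing hours of compute to it.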

Every now and then I’m going to put up a post of ADMIXTURE bar plots or MDS/PCAs. First, it will be useful for my later reference. Second, I think the slide show display view is probably pretty useful to get a gestalt sense of what’s going on. That’s what we’re going for: human comprehension. Below is my first slide show, from K = 2 to K = 16. That is, the models assumed two to sixteen ancestral populations. I also excluded Sub-Saharan Africans from the data set since they’re so varied. Here are the details:


– ~55,000 markers
– All non-African HGDP populations
– HapMap Tuscans + Gujaratis (as well as some white Americans from 23andMe)
– Bengali = my parents, N = 2

I removed some bar plots because they seemed redundant:

-Makrani
-Melanesian
-White American (these are half a dozen friends whose data I received from 23andMe)
-North Italian
-Colombians
-Karitiana

Note, these populations are simply not displayed. Their variance still was used to generate the results!
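The distinction between "used in the run" and "shown in the plot" can be illustrated with a small Python sketch. It assumes the .Q layout that ADMIXTURE writes (one whitespace-separated row of K fractions per individual) and a separate list of population labels in the same order. The helper names, the text-mode bars, and the numbers are all invented for illustration; these are not my actual R functions or real results.

```python
from collections import defaultdict

def population_means(q_text, labels):
    """Average the ancestry fractions of individuals within each population."""
    rows = [[float(x) for x in line.split()]
            for line in q_text.strip().splitlines()]
    if len(rows) != len(labels):
        raise ValueError("need exactly one label per row of the Q matrix")
    groups = defaultdict(list)
    for label, row in zip(labels, rows):
        groups[label].append(row)
    return {pop: [sum(col) / len(col) for col in zip(*members)]
            for pop, members in groups.items()}

def text_bars(means, hidden=(), width=40):
    """Render one text bar per population, skipping the hidden ones."""
    symbols = "#=+*o.%@"
    lines = []
    for pop, fracs in means.items():
        if pop in hidden:
            continue  # dropped from the display only, after the fit
        bar = "".join(symbols[k % len(symbols)] * round(f * width)
                      for k, f in enumerate(fracs))
        lines.append(f"{pop:<12}{bar}")
    return lines

# Made-up K = 2 output for four individuals.
q_text = """
0.90 0.10
0.80 0.20
0.30 0.70
0.10 0.90
"""
labels = ["Tuscan", "Tuscan", "Bengali", "Makrani"]

means = population_means(q_text, labels)
for line in text_bars(means, hidden={"Makrani"}):
    print(line)
```

The hidden population is still present in means, and in a real run its individuals would have shaped the inferred components; it is simply filtered out at the last step.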

With regard to the bar plot, I did not output the legend. There’s plenty of labeling of the ancestral fractions elsewhere, and it’s useful, but I think it is also important for people to take the colors in without any bias of what they mean. I have added text to some of the slides though, which you can see at the bottom if you are so inclined. I apologize for the garishness of some of the colors…I have some element of colorblindness in the purple-violet range for what it’s worth.

[Slide show: ADMIXTURE bar plots, K = 2 through K = 16]

CATEGORIZED UNDER: Data Analysis, Genetics, Genomics
  • http://tibettalk.wordpress.com Otto Kerner

    It looks like the main thing that happens in K = 8 is that the Maya record differentiates itself out from the Pima, but the distinctively Maya component (hot pink) contributes less to the Maya total than does the component that they share with the Pima (turquoise). Could this imply that a Central American substrate population was marginalized by people from the North in the distant past? Could the light blue/dark blue split in K = 7 imply the same thing for the Middle East?

  • Bob

    This is way cool, Razib. I’m wondering if I could render these as a sequence of little pie charts or something on a map. There’re several names up there I don’t even recognize, much less place on a world map….

    Cheers,
    –Bob

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    #2, yeah, i will get more familiar with R’s mapping functions at some point. ultimately, i want to do gradients.

  • Justin Giancola

    I think this might trigger epilepsy! …seriously whoa… I do think pie charts would be easier to digest! … … :/

  • Bob

    Hi Justin,

    Well, not really for me: I think these bars are probably better than a pie chart would be. I can scroll through the sequence and see interesting structure there, even as a beginner at this.

    I was mostly complaining that, because of my ignorance of peoples and geography, placing those on a map would put that structure into context. I’m just not sure that bars would fit on a map very well, while a little round pie chart would.

    Cheers,
    –Bob

  • french reader

    Is there some sort of norm for the colorisation or is it completely random ?

    The element that is modal in Papuans is first blue then red, then yellow, then green, then pink, then some sort of greenish yellow. I picked this example because it’s the easiest but for other populations i find it difficult to keep track of what is what.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    Is there some sort of norm for the colorisation or is it completely random ?
    The element that is modal in Papuans is first blue then red, then yellow, then green, then pink, then some sort of greenish yellow. I picked this example because it’s the easiest but for other populations i find it difficult to keep track of what is what.

    yes. but that’s kind of how it should be i think. what you want is to see the relationships between populations. don’t get caught up in shifts over K’s. each time one runs a new set of results on a different K it’s starting afresh, so there’s no guarantee that you’re just extending the previous K. notice how east asians separate into two components, and then collapse back again.

  • RK

    Thanks for the shout-out — that’s the first time anyone’s called me technically capable. :-)

    I feel like what we really need now are tools for model selection. Time to learn about these AIC() and BIC() functions in R…

  • french reader

    yes. but that’s kind of how it should be i think. what you want is to see the relationships between populations. don’t get caught up in shifts over K’s. each time one runs a new set of results on a different K it’s starting afresh, so there’s no guarantee that you’re just extending the previous K.

    ah ok, thanks for explaining

    notice how east asians separate into two K’s, and then collapse back again.

    i had not seen that.

    i still think you have a problem with the colors, perhaps use shades of the same color for each K,
    example:
    2k = 2 shades of blue.
    3k = 3 shades of green.
    4k = 4 shades of red.
    5k = 5 shades of blue.
    etc…

    images:
    http://img687.imageshack.us/img687/4250/newk2.png
    http://img27.imageshack.us/img27/4996/newk3.png
    http://img222.imageshack.us/img222/4526/newk4.png
    http://img51.imageshack.us/img51/6223/newk6.png
    http://img64.imageshack.us/img64/1696/newk7.png
    http://img267.imageshack.us/img267/7061/newk8.png
    http://img440.imageshack.us/img440/6752/newk9.png
    http://img543.imageshack.us/img543/5263/newk10.png
    i don’t know it’s just an idea.

  • http://abugblog.blogspot.com Blackbird

    I am absolutely not bothered with the shades – sorry French reader – , but something these programs could do is order the populations according to one of the components, that way the most similar populations would tend to cluster together and it would be more intuitive to read.


Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm an American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com
