Population structure using haplotype data

By Razib Khan | January 28, 2012 2:44 pm

The Pith: New software which gives you a more fine-grained understanding of relationships between populations and individuals.

According to the reader survey, more than 50 percent of you don’t know how to interpret PCA or model-based (e.g., ADMIXTURE) genetic plots, so I am a little hesitant to point to this new paper in PLoS Genetics, Inference of Population Structure using Dense Haplotype Data, as it extends the results of those earlier methods. But it’s an important paper, and at some point I’ll start using their software. The “big picture” is that earlier methods left “some information on the table.” That’s partly due to the fact that they were developed (or in the case of PCA leveraged, as it’s a very general technique) in an era when very dense marker data sets were not available (today we’re shifting to full genome sequences in many cases!). The information left on the table is haplotype structure. Concretely, genetic variation manifests as sequences of markers along a chromosome, many of them physically linked. The correlated runs of nearby variant markers are haplotypes, and they are of great interest because they are excellent clues to admixture or divergence events across populations. In contrast, the older methods looked at variation marker by marker, treating each one independently, which collapses some of the important genomic structure that we can now inspect (in fact, linkage disequilibrium due to these correlations can distort some of the results in the older methods, so you want to “thin” your marker set before running them).
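To make the “thinning” idea concrete, here is a minimal Python sketch of greedy pruning on linkage disequilibrium. It is an illustration only, not the procedure of any particular tool: the function names, the toy haplotypes, and the 0.5 threshold are all assumptions. Each haplotype is a list of 0/1 alleles, and r² is the usual squared correlation between two markers.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between markers i and j across haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic marker carries no LD information
    d = p_ij - p_i * p_j  # the classic disequilibrium coefficient D
    return d * d / denom


def thin_markers(haplotypes, threshold=0.5):
    """Greedily keep a marker only if it is weakly correlated with every kept marker."""
    kept = []
    for marker in range(len(haplotypes[0])):
        if all(r_squared(haplotypes, marker, k) < threshold for k in kept):
            kept.append(marker)
    return kept


# Toy data (hypothetical): markers 0 and 1 always travel together,
# markers 2 and 3 are uncorrelated with everything else.
toy_haplotypes = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
kept_markers = thin_markers(toy_haplotypes)
```

Marker 1 is dropped because it is a perfect copy of marker 0. That redundancy is exactly what thinning throws away, and exactly the kind of signal a haplotype-based method tries to use instead.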

Let me make this concrete for you. On 23andMe you can see where your friends shake out on a PCA plot using the HGDP data set as a reference. What this means is that the HGDP data set is used to generate independent dimensions of genetic variation. As is usual in these analyses, the largest dimension separates Africans from everyone else, and the second largest dimension separates Asians from Europeans and Africans. 23andMe customers are then projected onto this variation, so you can get a sense of where you are positioned in the clusters. To the left is a zoom in on the section for Central/South Asians. You can see that one of my friends, highlighted with a green color, falls almost perfectly in the Uygur cluster. According to ancestry estimates my friend is 50 percent Asian and 50 percent European. The “representative” Uygur in the 23andMe chromosome painting gives about the same results. But these are total genome estimates. The historical nature of my friend’s admixture and that of the Uygur woman is very different, as one can see in the figure below.
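The “build the axes from a reference, then project new people onto them” step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not 23andMe’s pipeline: the genotypes (0, 1, or 2 copies of an allele at four markers) are invented, the function names are hypothetical, and only the first principal axis is computed, using plain power iteration.

```python
import random


def column_means(rows):
    """Mean genotype at each marker across the reference individuals."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def top_axis(reference, means, iterations=200, seed=1):
    """Leading principal axis of the mean-centered reference, by power iteration."""
    centered = [[g - m for g, m in zip(row, means)] for row in reference]
    rng = random.Random(seed)
    axis = [rng.gauss(0, 1) for _ in means]
    for _ in range(iterations):
        # Multiply by X, then by X transposed, then renormalize.
        scores = [sum(x * w for x, w in zip(row, axis)) for row in centered]
        axis = [sum(s * row[j] for s, row in zip(scores, centered))
                for j in range(len(means))]
        norm = sum(w * w for w in axis) ** 0.5
        axis = [w / norm for w in axis]
    return axis


def project(genotype, means, axis):
    """Place one individual on an axis that was learned from the reference only."""
    return sum((g - m) * w for g, m, w in zip(genotype, means, axis))


# Invented reference panel: two clearly separated groups.
group_a = [[2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1]]
group_b = [[0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 1, 2]]
reference = group_a + group_b

means = column_means(reference)
axis = top_axis(reference, means)

mean_a = sum(project(g, means, axis) for g in group_a) / len(group_a)
mean_b = sum(project(g, means, axis) for g in group_b) / len(group_b)

# A new individual, heterozygous everywhere, who was not used to build the axis.
admixed_score = project([1, 1, 1, 1], means, axis)
```

The projected individual lands between the two reference groups. The position is the same whether the mixing happened one generation ago or sixty, which is why a PCA plot alone cannot tell those histories apart.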

 

My friend is to the right, and the Uygur woman is to the left. Why the big difference? My friend has an East Asian parent and a European parent. The Uygur woman is the product of a marriage between Uygurs, a population which arose from admixture between East Asians and Europeans one to two thousand years ago. Recombination has broken apart the perfect linkage between European and East Asian regions among the Uygurs. Obviously this isn’t the case with my friend, as recombination has had no time to generate alternative sequences of ancestry. This is critical information which genome-wide estimates displayed on PCA or ADMIXTURE will miss out on.
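A rough simulation shows why the two paintings look so different. The sketch below is a deliberately simplified model, not the method of the paper or of 23andMe: it assumes ancestry breakpoints fall along a chromosome as a Poisson process with a rate of about one per Morgan per generation since admixture, a 50/50 mix, and a hypothetical figure of 60 generations for the older admixture (roughly 1,500 years at 25 years per generation). All names are made up for illustration.

```python
import random


def simulate_ancestry_tracts(generations, length_morgans=1.0,
                             asian_fraction=0.5, seed=0):
    """Return (start, end, ancestry) tracts along one chromosome copy."""
    rng = random.Random(seed)

    def pick():
        return "EastAsian" if rng.random() < asian_fraction else "European"

    tracts = []
    position = 0.0
    while position < length_morgans:
        ancestry = pick()
        if generations == 0:
            # No recombination between ancestries yet: one unbroken tract.
            end = length_morgans
        else:
            # Distance to the next breakpoint, about `generations` per Morgan.
            end = min(length_morgans, position + rng.expovariate(generations))
        if tracts and tracts[-1][2] == ancestry:
            # Same ancestry on both sides of the breakpoint: extend the tract.
            tracts[-1] = (tracts[-1][0], end, ancestry)
        else:
            tracts.append((position, end, ancestry))
        position = end
    return tracts


# A chromosome inherited whole from one parent of a single ancestry.
first_generation = simulate_ancestry_tracts(generations=0)

# A chromosome from a population that mixed about 60 generations ago.
old_admixture = simulate_ancestry_tracts(generations=60)
```

Both chromosomes come from a 50/50 process, but the first is a single unbroken block while the second is chopped into many short tracts. Tract length is therefore a clock for when the admixture happened, and it is invisible to a genome-wide percentage.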

As for this particular paper and method, I want to point you to figure 5. The darker/bluish colors indicate higher coancestry estimates, and yellower colors lower ones. Red is in the middle. The diagonal tends to be blue/red because that represents populations’ correlations with themselves, which one would expect to be high. You can’t really read the labels, but I wanted to highlight the Italian and Sardinian blocks. Explanation below.

You can see an ADMIXTURE plot underneath the heat-map. What’s going on? Sardinians exhibit the hallmarks of an isolated population with a smaller effective population size, which has undergone more genetic drift than Italians over the same amount of time. This is naturally one reason that they “break out” rather quickly in ADMIXTURE and PCA. You see this in South Asia with the Kalash, who often emerge as their own cluster rather quickly, and separate out in a PCA as well. This is simply a function of their isolation and lower effective population size. Most of the people who use ADMIXTURE and PCA know this, but many of those reading these plots do not. Without that knowledge one can make incorrect inferences. The methods outlined in the paper allow one to see these trends immediately, while keeping broader world-wide correlations across populations in view. This is a big step forward not only in data analysis, but in result visualization.
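The link between small effective population size and drift can be illustrated with a basic Wright-Fisher simulation. This is a textbook toy model with invented parameters (population sizes of 50 and 500, 50 generations, 40 independent loci all starting at frequency 0.5), not an estimate for Sardinians, Italians, or the Kalash, and the function name is hypothetical.

```python
import random


def drift_variance(pop_size, generations=50, loci=40, start_freq=0.5, seed=0):
    """Mean squared allele-frequency change under pure Wright-Fisher drift."""
    rng = random.Random(seed)
    copies = 2 * pop_size  # diploid population: 2N gene copies per locus
    total = 0.0
    for _ in range(loci):
        freq = start_freq
        for _ in range(generations):
            # Each gene copy in the next generation is drawn at random
            # from the current allele frequency (binomial sampling).
            count = sum(1 for _ in range(copies) if rng.random() < freq)
            freq = count / copies
        total += (freq - start_freq) ** 2
    return total / loci


isolated = drift_variance(pop_size=50)   # small, isolated population
large = drift_variance(pop_size=500)     # larger population, same time span
```

Over the same number of generations, allele frequencies in the small population wander much further from where they started. A clustering method sees that accumulated difference as distinctiveness, which is why a drifted isolate can split off as its own cluster early even without a deep separation in time.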

If you are more interested in this topic, the first author has a comparison of the various tools up. Both Dienekes and Eurogenes are using the new software. Get the software at PaintMyChromosomes.com!

Citation: Lawson DJ, Hellenthal G, Myers S, Falush D (2012) Inference of Population Structure using Dense Haplotype Data. PLoS Genet 8(1): e1002453. doi:10.1371/journal.pgen.1002453

  • Grey

    “so I am a little hesitant to point to this new paper”

    I only get 10% from stuff like this but it’s an interesting 10%.

  • Grandma Shirley

    Thank you for this explanation. I understood a lot more and I’m looking forward to learning more

  • jb

    I wouldn’t feel all that confident interpreting a PCA chart unassisted. (It has something to do with igon vectors, right? I used to know all about igon vectors! :-) ) But when you show me a chart (PCA or otherwise) and talk about what it means and draw a conclusion from it, then your words help me understand the chart, and the chart — even if I don’t fully understand it — helps me understand what you are saying. Win-win (or something)!

  • Razib Khan

    #3, well, most people don’t need to know anything about factor analysis to understand what the chart means anyhow. part of the clarity is probably good labeling of the clusters….

  • Antonio

    Well, I belong to another group of readers: the math/stats/programming stuff is typically straightforward to me, but I can easily get lost in the biology.

Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm an American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com
