Re-imagining genetic variation

By Razib Khan | September 26, 2012 12:39 am

To the left is a PCA from The History and Geography of Human Genes. If you click it you will see a two dimensional plot with population labels. How were these plots generated? In short, they are visual representations of a matrix of genetic distances (those distances generally being FST), which L. L. Cavalli-Sforza and colleagues computed from classical autosomal markers. Basically, the distances measure how much populations differ from one another genetically. The unwieldy matrix tables can be visualized as a neighbor-joining tree, or as a two dimensional plot as you see here. But that’s not the end of the story.
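As a rough illustration of going from a distance matrix to a two dimensional plot, here is a minimal sketch in R. The four populations and their pairwise distances are made up for the example (they are not taken from Cavalli-Sforza's tables), and classical MDS via base R's cmdscale is only one of several ways to do the projection.

```r
# Hypothetical FST-like distances among four made-up populations.
pops <- c("PopA", "PopB", "PopC", "PopD")
d <- matrix(c(0.00, 0.02, 0.15, 0.16,
              0.02, 0.00, 0.14, 0.15,
              0.15, 0.14, 0.00, 0.03,
              0.16, 0.15, 0.03, 0.00),
            nrow = 4, dimnames = list(pops, pops))

# Classical multidimensional scaling down to two coordinates.
coords <- cmdscale(as.dist(d), k = 2)

# Empty plot for the axes, then one text label per population.
plot(coords, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(coords, labels = rownames(coords))
```

Populations that are close in the matrix (PopA and PopB here) end up close on the plot, which is all the picture is really encoding.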

In the past ten years, with high density SNP-chip arrays, these plots can often illustrate the position of an individual instead of just representing the relationship of populations (the methods differ, from principal components analysis or principal coordinates analysis to multi-dimensional scaling, but the outcomes are much the same).

 

For example, the famous genetic map of Europe. Here you see the colors representing nationalities, and centroid positions of the populations as well as individuals. In this manner you can take in population genetic variation in a gestalt fashion. Nevertheless, these still leave something to be desired. They are precise and powerful, but they lack a certain elegance due to their scatter. When you have over a dozen color schemes, and overlapping populations, these are not minor matters. Additionally, the human eye is often not well tuned to note the finer gradients of density difference.

This is clear when you move from a manageable number of populations (e.g., Europeans) to the world. In these cases you have to color in specific regions, else you’d get lost rather quickly. I can illustrate this easily enough. I’ve a data set I’m running right now with ~3,000 individuals and 250,000 SNPs. It’s a merge of HGDP, Behar et al., HapMap, etc. I decided to use PLINK to generate an MDS plot.

 

Here you see the unadorned scatter. To the top of the plot are Asian populations, and to the right African ones. Europeans are at the vertex to the bottom left. This should be familiar to you, though you may have to rotate it. One way to extract some clarity out of this picture is to color code the regions, and give different symbols to the lowest level category. Yes, this helps, but there are still limitations (and to be frank I often have a hard time making out triangles on these plots). First and foremost, I think we need to be able to ascertain the variation in density of the scatter. A further plot will illustrate this (click to enlarge):

Most of the text is basically illegible. This is where a centroid method would do well; in lieu of a scatter of individuals you just label a population. Or, you could let points in various colors represent populations, but put the labels at centroids only. This still runs into the problem that populations are not equidistant, so you can still get crowding.
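The "colored points, labels at centroids only" idea can be sketched in a few lines of base R. This is only an illustration under assumptions: the coordinates are simulated, the group names are invented, and the column names C1, C2 and Group are chosen to mimic an MDS table with a population column attached.

```r
# Simulated MDS-like coordinates for three hypothetical groups (not real data).
set.seed(1)
mds <- data.frame(
  C1 = c(rnorm(30, 0, 0.5), rnorm(30, 4, 0.5), rnorm(30, 2, 0.5)),
  C2 = c(rnorm(30, 0, 0.5), rnorm(30, 0, 0.5), rnorm(30, 3, 0.5)),
  Group = rep(c("GroupA", "GroupB", "GroupC"), each = 30)
)

# One centroid (mean C1, mean C2) per group.
centroids <- aggregate(cbind(C1, C2) ~ Group, data = mds, FUN = mean)

# Individuals as small colored points, text labels only at the centroids.
cols <- rainbow(nrow(centroids))
plot(mds$C1, mds$C2, col = cols[as.integer(factor(mds$Group))],
     pch = 16, cex = 0.5, xlab = "Coordinate 1", ylab = "Coordinate 2")
text(centroids$C1, centroids$C2, labels = centroids$Group, font = 2)
```

With three well separated groups the labels are easy to read; with dozens of populations whose centroids sit close together, the labels would collide just as described above.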

Recently to address these issues I decided to use a ‘utilization distribution’ method which I saw in one of the ‘genetic map of Europe’ papers. The logic here is simple.

1) First, take the density distribution of the points on the plot by category and ‘smooth’ it. Basically this creates a continuous distribution where there was a discontinuous one.

2) Then demarcate the central ~90% area as the bounds of the population distribution. Color these bounding lines differently.

Below you see the results:

Obviously there are some kinks to be worked out. But you see two things. First, some groups are clearly subsets of other groups in their distribution. This is very hard to discern in the other visualization methods above. Second, these plots are taking density into account, so you aren’t distracted by outliers (which may be mislabeling by the analyst or the original collector of the samples).
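To make the logic of the two steps concrete (smooth the points, then draw the boundary that holds roughly 90% of the smoothed mass), here is a minimal sketch that does not depend on adehabitat. It swaps in kde2d from the MASS package for the smoothing, which is a different estimator from the one adehabitat uses, and finds the contour level on a grid, so treat it as an approximation. The data are simulated, and hr_level is a helper name invented for this example.

```r
library(MASS)  # provides kde2d()

# Density level whose contour encloses roughly `prob` of the smoothed mass.
# Works on the regular grid returned by kde2d, so it is approximate.
hr_level <- function(dens, prob = 0.9) {
  z <- sort(as.vector(dens$z), decreasing = TRUE)
  cum <- cumsum(z) / sum(z)
  z[which(cum >= prob)[1]]
}

# Simulated points: one diffuse group and one tight group (not real data).
set.seed(42)
pts <- data.frame(
  X = c(rnorm(200, 0, 1), rnorm(200, 3, 0.3)),
  Y = c(rnorm(200, 0, 1), rnorm(200, 3, 0.3)),
  Group = rep(c("Diffuse", "Tight"), each = 200)
)

plot(pts$X, pts$Y, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
groups <- unique(pts$Group)
cols <- rainbow(length(groups))
# Common grid limits, padded so the contours are not clipped at the edges.
lims <- c(range(pts$X) + c(-1, 1), range(pts$Y) + c(-1, 1))
for (i in seq_along(groups)) {
  g <- pts[pts$Group == groups[i], ]
  dens <- kde2d(g$X, g$Y, n = 100, lims = lims)  # step 1: smooth
  contour(dens, levels = hr_level(dens, 0.9),    # step 2: ~90% boundary
          add = TRUE, drawlabels = FALSE, col = cols[i], lwd = 2)
}
legend("topleft", legend = groups, col = cols, lty = 1, lwd = 2)
```

Because the boundary follows density rather than the extreme points, a handful of stray individuals barely moves it, which is the outlier-resistance mentioned above.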

My ultimate aim is to develop a script which will place the text near the suitable distribution zone, without crowding out other text. I have some ideas of how to do this “on the fly,” but it will take time to implement. Until then some of you may want to know a bit about the packages used for the above.

First, install the adehabitat package for R. Actually, you may want to install various Tcl development packages first, because the former won’t install without the latter. Once you have that you need data. I assume you can generate the MDS results from PLINK above. From those results you need three columns:

1) x

2) y

3) the identification

Here’s some R that might help:

#MDSData is the data frame with MDS data (columns C1, C2, Group)
attach(MDSData)
library(adehabitat)
#cex of 0 draws no points; this plot only sets up the axes for the contours
cexValue=0
par(mar=c(0,0,0,0))
plot(C1,C2,cex=cexValue,xlab="Coordinate 1",ylab="Coordinate 2")

# process the data, keep only groups with at least 5 individuals
loc=subset(MDSData,Group %in% names(which(table(Group) >= 5)))
loc$X = loc$C1
#use C2 so the contours match the C1 vs C2 plot above
loc$Y = loc$C2
#load ids
id = factor(loc$Group)
#create first parameter, two columns
loc=subset(loc,select=c(X,Y))

#estimate the kernel utilization distribution for each group
vud=kernelUD(loc,id)
#90% utilization (the level is given as a percentage)
kVert=getverticeshr(vud, 90)
#I'm removing one of the populations
kVert[21]=NULL
kVertLength=length(attr(kVert,"names"))
plot(kVert, add=TRUE, lwd=2,colpol=NA,colborder=rainbow(kVertLength) )
groups=attr(kVert,"names")
legend('topright',groups,cex=.55,lty=1,lwd=3,col=rainbow(kVertLength) )
CATEGORIZED UNDER: Genetics, Genomics
MORE ABOUT: PCA
  • Eurologist

    I like that. You could fit exact ellipses (5 parameters best fit: position, orientation, eccentricity, and scale). Generalizing, you could fit ellipsoids in 3-D PC plots with 6 or 7 degrees of freedom.

  • http://moebio.com Santiago

    nice ideas, intriguing! Could you please share the data?

  • https://plus.google.com/109962494182694679780/posts Razib Khan

    merge these files.

    https://www.dropbox.com/s/ar7d4u6izbexq3v/plinkmds.csv

    https://www.dropbox.com/s/eqvuy11h2kgy9bm/hgdphapmapbehar.csv

    i assume most of you know R, but useful:

    attr(kVert,"names")

    inspect them and remove populations you don’t want to plot (e.g., kVert[22]=NULL). “friends” you should remove, they’re a bunch of random people (my readers & friends) who create a very large circle because they’re not a real population.

  • https://plus.google.com/109962494182694679780/posts Razib Khan

    3-d plots are when i have a 3-d printer ;-)

  • http://moebio.com Santiago

    Razib, thanks so much! This is exciting data that I actually always wanted (I’m a fan of Cavalli-Sforza’s work).

    What I want to do here is to visualize the scatter while allowing positions to shift to geographic coordinates (by interpolating). I will do this using an interactive scatter that allows the user to focus on regions, similar to what I’ve done here: http://moebio.com/research/wikipediagender yet allowing zoom on any chosen coordinate.

    Another interesting approach is to draw inverse geographic distances with lines (geo closeness network)… and the opposite, draw the genetic closeness network when points are placed on geographic positions.

    Finally it would be fantastic to use PC1/PC2 coordinates to place populations and then perform a continual deformation of the geographic coordinates space in order to make the regions cover their population spots. That would be a new genetics-driven geographical projection. I love space deformations a la D’Arcy Thompson or a la Einstein/Poincaré and I’ve already worked on some deformed maps, look: http://moebio.com

    What do you think of these ideas?

  • https://plus.google.com/109962494182694679780/posts Razib Khan
  • http://moebio.com Santiago

    Yes, procrustes analysis would be the first step: finding the (linear + translation) transformation that minimizes deviation. What I want to do is more extreme. The second step would be non-linear isomorphic transformation. As your post shows and the article mentions, Europe will be less deformed. Geographic disruptions such as the Himalayas or the Great Rift Valley will be dramatically seen in the map. If I try to do this, would you help me?

  • https://plus.google.com/109962494182694679780/posts Razib Khan

    #7, check the email you provided on your comment.

  • http://neuroecology.wordpress.com neuroecology

    Out of curiosity, what % of the variance is accounted for by the first two components in your sample? How does it look when presented as a 3 dimensional pca plot?

  • https://plus.google.com/109962494182694679780/posts Razib Khan

    #9, i’ll check when i get home. most of the variance was in the first 2 components from what i recall. not much in the 3rd (did 3 by 1 plot too).

  • petrelharp

    What do you think it means that “some groups are clearly subsets of other groups in their distribution”? It probably does not mean that the genetic variation in one is a subset of the other — quite different groups could get squished into the same area by the projection down onto two dimensions, no? One way to interpret the plot might be that PCA is representing everyone as admixed between three idealized populations, the vertices of the triangle; so groups that are quite different but share the same admixture coordinates for those populations (even if they have another important chunk of ancestry) would end up in the same place.

    Thoughts? Investigations?

  • https://plus.google.com/109962494182694679780/posts Razib Khan

    #11, there are several options. some of the subsetting is an artifact of human cultural classification. e.g., ethiopian jews are a subset of ethiopians, because they tend to be sampled from the peoples of the northern ethiopian highlands. some of it is simply that you have some populations which are genetically homogeneous, and so have a ‘small circle,’ and happen to just lie along the same axis as a more diffuse group.

    One way to interpret the plot might be that PCA is representing everyone as admixed between three idealized populations

    this is not really right, but goes in the right direction. don’t confuse this for model-based clustering, which does assume ‘pure’ populations in many cases. the other dimensions are in the results.

  • Eurologist

    3-d plots are when i have a 3-d printer

    Rotating the view and making an animated .gif or similar seems to work quite well in many instances.

  • http://emilkirkegaard.com Emil

    Very cool data indeed! Perhaps I should get more into pop gen., so I can understand more of the technicalities. Unfortunately, I’m too interested in other things, and time is limited. However, I intend to read https://en.wikipedia.org/wiki/The_10,000_Year_Explosion soon(ish).


Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm an American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com
