A genomic map of human variation, where we're at

By Razib Khan | February 1, 2011 2:16 pm

Zack has started exploring the K’s of his merged data set for HAP. A commenter suggests that:

As you have begun interpreting the reference results, let me make a friendly warning: you have to keep in mind that most of the reference populations of ethnic groups are extremely limited in sample size (with only between 2 and 25 individuals) and from very obscure sources, and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.

This seems a rather reasonable caution. But I don’t think such a vague piece of advice really adds any value. These sorts of caveats are contingent upon:

– The scope of the question being asked (i.e., how fine-grained is the variation you are attempting to measure)

– The sample size

– The representativeness

– The thickness of the marker set (10 autosomal markers vs. 500,000 SNPs)

This isn’t a qualitative issue, easily divided into “right” and “wrong.” Sometimes an N = 1 is very insightful. That’s why the whole genome of one Bushman was very useful. In fact, the whole genome of any random Sub-Saharan African, and the whole genome of any random non-African (this means ancestry from before 1500 in those regions), is going to reflect clearly the differences between these two broad population sets in terms of genomic variation. Subsequent addition of individuals to generate a larger sample would be very informative of course, and allow us to answer many more questions. But the point is that even small sample sizes can answer properly framed queries.
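To make the N = 1 point concrete, here is a toy simulation. It is a sketch under invented assumptions, not an analysis of any real data set: the two populations, their allele frequencies (drawn at random, differing by at most 0.2 per SNP), and the function name `simulate_assignment` are all hypothetical. It draws one individual from population A and computes the log-likelihood ratio of A versus B from that single genome.

```python
import math
import random

def simulate_assignment(num_markers, seed=0):
    """Log-likelihood ratio (population A vs. B) for ONE simulated
    individual drawn from A, using num_markers independent SNPs.
    Positive values favor the correct population, A."""
    rng = random.Random(seed)
    llr = 0.0
    for _ in range(num_markers):
        # Hypothetical allele frequencies; B differs from A by up to 0.2.
        p_a = rng.uniform(0.05, 0.95)
        p_b = min(0.99, max(0.01, p_a + rng.uniform(-0.2, 0.2)))
        # Diploid genotype: two independent allele draws from population A.
        for _ in range(2):
            if rng.random() < p_a:
                llr += math.log(p_a / p_b)
            else:
                llr += math.log((1 - p_a) / (1 - p_b))
    return llr

if __name__ == "__main__":
    for m in (10, 500, 5000):
        print(m, round(simulate_assignment(m), 1))
```

With a handful of markers the ratio hovers near zero and can point either way; with thousands of markers a single individual is assigned to the right population decisively. That is the sense in which marker density can compensate for sample size when the question is coarse enough.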

Another issue is representativeness. The HGDP data set was biased at the outset toward more isolated and distinctive groups. There was a belief that many of these groups were going to disappear within a generation, and their genetic uniqueness should be recorded (this seems to have been correct). So apparently the clusters generated from HGDP are “cleaner” in their separation than those from the POPRES sample, which is derived from a more cosmopolitan urban set of populations. We also have the HapMap sample, and the additional data sets Zack has merged with HGDP and HapMap (there are likely other public data sets; Zack was looking for those with South Asians).

After 10 years of results generated from these data sets I think we have some idea of the errors and biases introduced because of skewed representativeness and small sample size (HapMap has a thicker marker set, but HGDP has better population coverage). In other words, we should have some intuition of where to be careful, and where not to be. For example, small tribal groups are likely to exhibit genetic distinctiveness (as well as cultural isolates, like the Roma) due to low long-term effective population size. On the other hand, if you have a set of distinct tribal groups, one presumes that the common patterns would reflect broad macro-regional genetic variation. In Zack’s combined data set he has a South Indian tribe and a Pakistani one (I mean Kalash, I understand Pathans and Baloch are tribal people, but they’re expansive and heterogeneous). Any common element between these two groups in relation to Iranians is presumably not a coincidence. Random genetic drift usually results in different allele frequencies between populations, so genetic commonalities between different isolates probably reflect common ancestry.
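The claim about small effective population size can be illustrated with a minimal Wright-Fisher sketch. Everything here is an assumption made for illustration: the population sizes, the number of generations and loci, and the function names `drift` and `mean_squared_change` are hypothetical, and the model ignores migration, selection, and mutation.

```python
import random

def drift(p0, n_diploid, generations, rng):
    """Wright-Fisher drift: each generation, resample 2N allele copies
    from the previous generation's allele frequency."""
    p = p0
    copies = 2 * n_diploid
    for _ in range(generations):
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
    return p

def mean_squared_change(p0, n_diploid, generations, loci, seed=0):
    """Average squared change in allele frequency across independent loci."""
    rng = random.Random(seed)
    total = sum((drift(p0, n_diploid, generations, rng) - p0) ** 2
                for _ in range(loci))
    return total / loci

if __name__ == "__main__":
    # Same starting frequency, same number of generations, different sizes.
    print("N = 20: ", mean_squared_change(0.5, 20, 20, 100))
    print("N = 200:", mean_squared_change(0.5, 200, 20, 100))
```

The smaller population wanders much further from its starting frequency over the same number of generations. Because each isolate drifts independently, drift alone pushes two isolates apart rather than together, which is why a signal shared by two distinct isolates is better explained by common ancestry.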

The main point I’m trying to make is that we’re beyond the point of generic cautions. Rather, there are specific pitfalls which we need to be cognizant of. So if you know specific ethnographic details, that is useful. If there are statistical tricks and tips, that is also useful (larger sample sizes exhibit diminishing returns in statistical power). Also, one needs to keep in mind ascertainment bias: the current generation of SNP chips is tuned to European polymorphisms, so they might miss the loci where other populations are polymorphic but Europeans are not.
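The diminishing returns mentioned in passing can be put in numbers with the textbook binomial standard error of an allele frequency, sqrt(p(1 - p) / 2N) for N diploid individuals. This is a sketch that assumes unrelated individuals and independently sampled chromosomes; the function name is hypothetical.

```python
import math

def allele_freq_standard_error(p, n_individuals):
    """Binomial standard error of an allele frequency estimated
    from 2N chromosomes (N diploid individuals)."""
    return math.sqrt(p * (1 - p) / (2 * n_individuals))

if __name__ == "__main__":
    # Sample sizes spanning the commenter's range (2 to 25) and beyond.
    for n in (2, 25, 100, 1000):
        print(n, round(allele_freq_standard_error(0.5, n), 4))
```

For a frequency of 0.5 the error is 0.25 with two individuals and about 0.07 with twenty-five. Halving it again takes four times as many samples, so the first couple of dozen individuals buy most of the precision for a question about a single locus.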

By analogy, unsecured credit can be problematic. Yes, I think we knew that. The key is to identify those with the means and ability to use credit responsibly. The tools and data are now available to the masses. A big “BE CAREFUL” sticker is not helpful. What is helpful are concrete and specific pointers.

For what it’s worth, I found Zack’s bar plot hard to read, so here is one I generated with larger labels (K = 6):

Yesterday Zack gave me a personal vector: 66, 1, 4, 10, 14, 0, 4, 0, 0, 3. If you’ve been reading my posts I think you know how to interpret that….

CATEGORIZED UNDER: Genetics, Genomics, Personal Genomics
  • marcel

    If you’ve been reading my posts I think you know how to interpret that….

    or not.

    Would you recommend something to read for someone who is quite literate in statistics and will own up to knowing squat (but not much more) about genetics, so that I can begin to sound out charts like the one above, or vectors like the one above.


  • http://blogs.discovermagazine.com/gnxp Razib Khan

    marcel, i inferred zack was sending me the proportions for ancestral quanta. so:

    66 = “south asian”
    10 = “west asian”
    14 = “east asian”

    those are my guesses.

    also, start here:


  • http://www.zackvision.com/weblog/ Zack

    I am going to fix the barplots tonight.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    as long as you provide raw spread sheets i don’t care.

  • rimon

    razib- I’m not the right background for this project or dodecad, but I would love to have my 23 and me data run through admixture. can you tell me where/who could do this for me? thanks!

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    zack’s site has instructions on how you could run it yourself. the key is you need to convert formats to pedigree. you can make it transposed pedigree pretty easily, and go from there. if zack doesn’t i guess i could write something up myself on the step-by-step.

  • rimon

    thanks for the info. since I have no idea about any of this (pedigree?) I will wait until there are instructions! …unless somebody wants to run an Ashkenazi Jew who 23 and me said was most like a Southern European through admixture…..

  • RK

    rimon, pedigree refers to the file format used by ADMIXTURE and plink, the GWAS analyzer that you’ll need to manipulate the data. It consists of a .ped file and a .map file; see here for a description of the file format: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

    Here’s a pretty klugey perl script I threw together to convert your 23andme raw data file into pedigree format: http://pastebin.com/M4PbHUYb Check the plink manual for information on how to merge the resulting files with your reference datasets. (Convert to binary first!)

    A word of warning: ADMIXTURE takes forever to run, and it needs a lot of memory. Way more than my ancient box has, which is why I’m glad people like Zack are doing it for us.
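The conversion RK describes can be sketched in Python as follows. This is not RK's Perl script or Zack's conversion code. It is a minimal illustration that assumes the usual 23andMe raw-data layout (tab-separated rsid, chromosome, position, genotype, with `#` comment lines) and the plain PLINK .ped/.map text layout. The function name and the default family and individual IDs are hypothetical, and indel calls and strand issues are not handled.

```python
def convert_23andme(raw_lines, family_id="FAM1", individual_id="IND1"):
    """Convert 23andMe raw-data lines into (.map lines, one .ped line)."""
    map_lines = []
    genotypes = []
    for line in raw_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip header comments and blank lines
        rsid, chrom, pos, call = line.split("\t")[:4]
        if len(call) == 1:
            call = call * 2  # hemizygous call (e.g. male X): duplicate it
        if len(call) != 2 or "-" in call:
            a1, a2 = "0", "0"  # PLINK's code for a missing genotype
        else:
            a1, a2 = call[0], call[1]
        # .map columns: chromosome, SNP id, genetic distance (0 = unknown), bp
        map_lines.append(f"{chrom}\t{rsid}\t0\t{pos}")
        genotypes.append(f"{a1} {a2}")
    # .ped columns: family, individual, father, mother,
    # sex (0 = unknown), phenotype (-9 = missing), then the genotypes
    header = [family_id, individual_id, "0", "0", "0", "-9"]
    ped_line = " ".join(header + genotypes)
    return map_lines, ped_line

if __name__ == "__main__":
    sample = [
        "# rsid\tchromosome\tposition\tgenotype",
        "rs1\t1\t100\tAG",
        "rs2\t1\t200\t--",
    ]
    map_lines, ped_line = convert_23andme(sample)
    print("\n".join(map_lines))
    print(ped_line)
```

Writing `map_lines` to a .map file and `ped_line` to a .ped file gives a single-individual pedigree data set. As RK notes, the plink manual covers converting it to binary format and merging it with the reference data.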

  • RK

    Razib, how did you generate that chart? It looks way better than the ones barplot() generates in R.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    Razib, how did you generate that chart? It looks way better than the ones barplot() generates in R

    open office. just took zack’s spreadsheet.

  • http://www.zackvision.com/weblog/ Zack

    If there’s interest, I can put my kludgy conversion scripts online.

    RK: Admixture does take a long time, though it doesn’t use a lot of memory. I think it’s using about half of my 6GB.

    Someone needs to make the plot functions of R and Matlab usable. They are ugly.

  • http://www.zackvision.com/weblog/ Zack

    POPRES data access requires approval from NIH.

  • Zohar

    If anyone’s willing to run Admixture on the 23andme data on this Ashkenazi they’ll get a bracha.
    It took me an hour just to get Eurocad working

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    zack, why not just throw it onto sourceforge? for 99% of ppl kludgey is better than nothing.

  • RK

    3 GB is a lot! At least for me: My total usable RAM is only 1.7 GB, so ADMIXTURE runs out of memory even with relatively small datasets unless I drop into runlevel 3.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    i got 4 GB. i just let it run overnight.


Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm a American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com

