Longtime readers know that I spend a lot of time with Plink, developed by Shaun Purcell. That being said, even with the modest data sets I play with I’ve had to resort to writing shell scripts to perform various Plink manipulations serially and let them run overnight. Well, perhaps no more. Here’s the description for the WDIST genomic analysis toolset:
Using your 23andMe data: exploring with MDS
Using your 23andMe data in Plink
From Reconstructing Indian Population History:
…We hypothesize that founder effects are responsible for an even higher burden of recessive diseases in India than consanguinity. To test this hypothesis, we used our data to estimate the probability that two alleles from a group share a common ancestor more recently than that group’s divergence from other Indians, and compared this to the probability that an individual’s two alleles share an ancestor in the last few generations due to consanguinity…Nine of the 15 Indian groups for which we could make this assessment had a higher probability of recessive disease due to founder events than to consanguinity, including all the Indo-European speaking groups (Table 2). It is important to systematically survey Indian groups to identify those with the strongest founder effects, and to prioritize them for studies to identify recessive diseases and map genes.
South Asian populations exhibit a lot of between-population genetic distance, and not simply as a function of geography. With more markers and an expansive data set Dan MacArthur will be able to assess exactly which South Asian caste his ancestry is from.
But this is an issue where I have fancied myself an outlier. My own background is moderately heterogeneous, and I’ve always explained to people that I’m not inbred like most South Asians, only half in jest (from what I can tell Muslims in the subcontinent have castes too, though they may use somewhat different terminology). I know that my paternal grandmother came from a Brahmin family (clear from the customs preserved in the family even in her generation), while my maternal grandfather was almost certainly from a group with a Kayastha origin (going by surname, and by who my mother actually clusters with). My maternal grandmother had considerable non-Bengali ancestry, which does show up in Middle Eastern signatures in my mother.
But this is talk. Am I truly not as inbred as the average brown? Leveraging methods which I discussed earlier (see posts above) I can very quickly check this.
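For readers who want to see what "checking inbreeding" can look like in practice, here is a minimal sketch of one common approach: the method-of-moments inbreeding coefficient F = (O - E) / (N - E), where O is the observed number of homozygous genotypes, E is the number expected from allele frequencies, and N is the number of non-missing genotypes. This is the general idea behind Plink's `--het` report, but the function name, the 0/1/2 genotype coding, and the input layout below are assumptions made for illustration, and real tools may apply small-sample corrections that this sketch omits. It is not necessarily the method used in the posts referenced above.

```python
from typing import Optional, Sequence


def inbreeding_f(genotypes: Sequence[Optional[int]],
                 allele_freqs: Sequence[float]) -> float:
    """Estimate F = (O - E) / (N - E) for one individual.

    genotypes    -- per-SNP count of a reference allele (0, 1 or 2),
                    or None for a missing call (hypothetical coding)
    allele_freqs -- per-SNP population frequency p of that same allele
    """
    observed_hom = 0      # O: homozygous genotypes actually seen
    expected_hom = 0.0    # E: homozygous genotypes expected under HWE
    n_called = 0          # N: non-missing genotypes

    for genotype, p in zip(genotypes, allele_freqs):
        if genotype is None:
            continue  # skip missing calls entirely
        n_called += 1
        if genotype in (0, 2):
            observed_hom += 1
        # Under Hardy-Weinberg, P(homozygous) = 1 - 2p(1 - p)
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)

    denominator = n_called - expected_hom
    if denominator == 0:
        raise ValueError("no informative (polymorphic, non-missing) SNPs")
    return (observed_hom - expected_hom) / denominator


# Three homozygous calls out of four at p = 0.5: O = 3, E = 2, N = 4
print(inbreeding_f([0, 2, 1, 2], [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

A positive F means more homozygosity than the allele frequencies predict, which is what recent inbreeding or a strong founder effect would produce; a value near zero means the individual looks like a random draw from the reference population. The answer depends heavily on which population the allele frequencies come from.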
With the recent $99 price point for 23andMe many of my friends have purchased kits (finally!). 23andMe’s interpretive results are pretty rich now, but there are still things missing. There are plenty of third party tools you can use, but I know some people might want to do their own data analysis. There are many ways you could go about this, but I want to put up some posts on DIY genomic data analysis to make the learning curve a little less steep, and get people started. Motivation to actually begin going down this road is a big issue, but I think once you get over the hump it gets a lot easier.
First, you need Plink. It is really preferable that you work on a Mac or in Linux to engage in heavy-duty analysis, but in this post I’ll assume you are working on the Windows platform. Again, the point here is to make this accessible. Download Plink if you don’t have it, and extract it wherever you like.
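To give a sense of what getting 23andMe data into Plink involves, here is a rough sketch that turns the lines of a 23andMe raw data file (tab-separated rsid, chromosome, position, genotype, with `#` comment lines) into the contents of a Plink PED/MAP pair for a single person. The function name and the default family and individual IDs are placeholders invented for this example. No-calls, indel codes, and anything else that is not A/C/G/T are written as missing (`0 0`), and single-letter calls (for example on the X in males) are written as homozygous; both are simplifying assumptions, not a description of any official converter.

```python
from typing import Iterable, List, Tuple

VALID_BASES = set("ACGT")


def convert_23andme(lines: Iterable[str],
                    family_id: str = "FAM1",     # placeholder ID
                    individual_id: str = "ME",   # placeholder ID
                    sex: int = 0) -> Tuple[str, List[str]]:
    """Return (ped_row, map_rows) for one 23andMe raw data file."""
    map_rows: List[str] = []
    alleles: List[str] = []

    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        rsid, chrom, pos, genotype = line.split("\t")

        if len(genotype) == 1:
            a1 = a2 = genotype            # hemizygous call, write as homozygous
        elif len(genotype) == 2:
            a1, a2 = genotype[0], genotype[1]
        else:
            a1 = a2 = "0"
        if a1 not in VALID_BASES or a2 not in VALID_BASES:
            a1 = a2 = "0"                 # no-call or indel -> missing

        # MAP columns: chromosome, SNP ID, genetic distance (0 = unknown), position
        map_rows.append(f"{chrom}\t{rsid}\t0\t{pos}")
        alleles.extend((a1, a2))

    # PED columns: family, individual, father, mother, sex, phenotype, genotypes
    ped_row = " ".join([family_id, individual_id, "0", "0", str(sex), "-9"] + alleles)
    return ped_row, map_rows


sample = ["# rsid\tchromosome\tposition\tgenotype",
          "rs1\t1\t100\tAG",
          "rs2\t1\t200\t--",
          "rs3\tX\t300\tC"]
ped, map_rows = convert_23andme(sample)
print(ped)  # FAM1 ME 0 0 0 -9 A G 0 0 C C
```

Writing `ped` to a `.ped` file and the joined `map_rows` to a `.map` file with the same base name gives Plink something it can read with `--file`.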
Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David have been doing. There are public data sets, and open source software, so that anyone with a nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending; look at what’s coming out (and has come out) in the 1000 Genomes project!
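To illustrate the formatting chore, here is a small sketch of the matrix-to-pedigree transposition. It assumes a hypothetical in-memory layout with one row per SNP and one column per sample, where each cell is a two-letter genotype such as `AG`; actual public data sets each have their own quirks, so treat the function name and the layout as assumptions. Anything that is not a two-letter A/C/G/T call is written as missing.

```python
from typing import List, Sequence


def matrix_to_ped(sample_ids: Sequence[str],
                  snp_rows: Sequence[Sequence[str]]) -> List[str]:
    """Transpose a SNP-by-sample genotype matrix into PED lines.

    sample_ids -- one ID per column of the matrix
    snp_rows   -- one inner sequence per SNP, ordered like sample_ids
    """
    ped_lines: List[str] = []
    for column, sample in enumerate(sample_ids):
        alleles: List[str] = []
        for row in snp_rows:
            call = row[column]
            if len(call) == 2 and all(base in "ACGT" for base in call):
                alleles.extend(call)      # "AG" -> "A", "G"
            else:
                alleles.extend("00")      # missing genotype -> "0", "0"
        # Family ID is set equal to the sample ID; parents, sex and
        # phenotype are all written as unknown.
        ped_lines.append(" ".join([sample, sample, "0", "0", "0", "-9"] + alleles))
    return ped_lines


for ped_line in matrix_to_ped(["S1", "S2"], [["AA", "AG"], ["--", "CC"]]):
    print(ped_line)
# S1 S1 0 0 0 -9 A A 0 0
# S2 S2 0 0 0 -9 A G C C
```

The matching MAP file would list the SNPs in the same row order as the matrix.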
I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:
MORE ABOUT: Admixture, ancestry inference, BGA 500K, Harappa Ancestry Project, How to analyze ancestry, how to run ADMIXTURE, Personal genomics