Analyzing ancestry with ADMIXTURE, step by step

By Razib Khan | March 14, 2011 3:55 pm

Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David have been doing. There are public data sets and open source software, so anyone with a nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending; look at what’s coming out (and has come out) of the 1000 Genomes project!

I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:

!Kung Buryats Hausa Mada Punjabi Arain Totonac
Adygei Cambodian Hazara Makrani Pygmy Tu
African Americans Chinese Hema Malayan Romanians Tujia
Algeria Chinese Americans Hezhen Mandenka Russian Tunisia
Altaians Chukchis Hungarians Maya Sahara Occ Turks
Alur Chuvashs Iban Mbuti Sakilli Tuscans
Ap Brahmin Cochin Jews Igbo Melanesian Samaritians Tuvinians
Ap Madiga Colombian Iranian Jews Mexicans Samoan Urkarah
Ap Mala Cypriots Iranians Miao San Utahn Whites
Armenians Dai Iraq Jews Mongola San Nb Uygur
Armenians B Daur Irula Mongolians Sandawe Uzbekistan Jews
Ashkenazy Jews Dogon Italian Moroccans Sardinian Uzbeks
Azerbaijan Jews Dolgans Japanese Morocco Jews Saudis Vietnamese
Balochi Druze Jordanians Morocco N Selkups Greenlanders
Bambaran Greenlanders Kaba Morocco S Sephardic Jews Xhosa
Bamoun Egypt Kalash Mozabite She Xibo
Bantukenya Egyptans Karitiana N European Sindhi Yakut
South Africa Ethiopian Jews Kets Naxi Singapore Chinese Yemen Jews
Basque Ethiopians Khmer Nepalese Singapore Indians Yemenese
Bedouin Evenkis Kongo Nganassans Singapore Malay Yi
Beijing Chinese Fang Koryaks Nguni Slovenian Yoruba
Belorussian French Kurd North Kannadi Sotho/Tswana Yukaghirs
Biaka Fulani Kyrgyzstani Orcadian Spaniards
Bnei Menashe Georgia Jews Lahu Oroqen Stalskoe
Bolivian Georgians Lebanese Palestinian Surui
Brahui Gujaratis Lezgins Paniya Syrians
Brong Gujaratis B Libya Papuan Thai
Bulala Hadza Lithuanians Pathan Tamil Brahmin
Burusho Han Luhya Pedi Tamil Dalit
Buryat Han Nchina Maasai Pima Tongan

The data set has ~4,000 individuals and ~30,000 markers. The binary file is ~25 MB. The download has four files. The .bed, .bim, and .fam files are pedigree formatted. The .csv is a “master list” of the information on each individual (population, region, etc., tied to a specific identification number). This is important because once you have some output files…you need to figure out what they mean, and visualize them, and that’s only informative if you have a master list with more than just family and individual information.

Here is the link to the file to download with all the above populations. I’ve pulled it down and run it, so I know it’s not malware.

So what now? The post will be divided into three portions.

1) Running this data in ADMIXTURE

2) Visualizing it in R

3) Manipulating this data in Plink

#1 is not contingent on #2 and #3, so I’ll do that first. You don’t need to read #2 and #3. In fact some of you might be really good at manipulating spreadsheet formatted data, so you might not need #2 at all. But in the R section I’ll also have an easier spreadsheet output for you, so even if you don’t care for R’s visualization, you’ll at least have an easier-to-manage set of .csvs. #3 matters if you want to constrain your data set, and also add your own 23andMe file to the end of it.

#1 Running the data in ADMIXTURE

First, you need Linux or MacOS. If you are on Windows, the Wubi application allows you to have a dual boot. It runs Ubuntu Linux next to Windows, and you can uninstall it as if it were a Windows application.

I am doing this on Ubuntu Linux, for your information. Assuming you have the right operating system, now you need ADMIXTURE. You can put the folder anywhere.

You need to use the terminal to go to the folder where you have ADMIXTURE. The image to the left shows me doing so. You need to click the terminal application, and enter the “cd” command to get to the appropriate folder. My ADMIXTURE program is on the Desktop, within the “GA” folder, and the “admix2” subfolder. So I typed what you see (something along the lines of “cd Desktop/GA/admix2”). The “cd” command moves you around the folders, up and down. Google it if it confuses you, though even without knowing what it does you should be fine if you just extract ADMIXTURE to the Desktop and type “cd Desktop”. This will clutter up your desktop in the future…but if you need to get some stuff done ASAP without knowing how to navigate in Linux, that should work.

So now you have ADMIXTURE, and the files which ADMIXTURE is going to analyze. What do you do? You need to make sure that ADMIXTURE and your files are in the same folder/location. So if ADMIXTURE is on the Desktop, just extract the files to the Desktop. Now you need to run a command. You see a screenshot of me running ADMIXTURE. You may need to omit the ./ (i.e., “admixture” vs. “./admixture”). You see the file name. The option -j2 is due to the fact that I have two cores. If you don’t know what that means, just omit it. It speeds up the run though. The last number is the K. So this is for K = 4.

Now the program will run. How long depends on the size of the file and on K. I often run the program overnight for larger K’s. If you want to get fancy and do stuff like cross-validation, it will take even longer. Be warned. The screenshot to the left is typical of what you’ll run into as ADMIXTURE does its thing. No worries, the algorithm is running. If you watch long enough you’ll get a sense of what values on the screen point to a high likelihood that it’s almost done, and you can start anticipating the output files from which you can make inferences.
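If you plan to run a whole range of K’s overnight, it can help to generate the commands rather than type them. Here is a minimal Python sketch that only builds the command lines. The function name and its parameters are my own invention for illustration; the -j and --cv switches are ADMIXTURE’s own, and the file name is the one used in this post.

```python
def admixture_commands(bed_file, start_k, end_k, threads=None,
                       cross_validate=False, program="./admixture"):
    """Return one argument list per K, suitable for subprocess.run()."""
    commands = []
    for k in range(start_k, end_k + 1):
        cmd = [program]
        if cross_validate:
            cmd.append("--cv")            # cross-validation: slower runs
        if threads:
            cmd.append(f"-j{threads}")    # e.g. -j2 on a two-core machine
        cmd += [bed_file, str(k)]         # input file first, K last
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in admixture_commands("euraocean.bed", 2, 4, threads=2):
        print(" ".join(cmd))
        # To actually launch each run, one after the other:
        # import subprocess; subprocess.run(cmd, check=True)
```

Printing first and running later is deliberate: you can eyeball the commands before committing your machine for the night.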

Completion! To the right is what you’ll see when ADMIXTURE is done. As noted, there are output files. This is what is really interesting & useful, but even on this screen there’s goodness. The primitive matrix shows you Fst distances between putative ancestral populations. Fst measures the proportion of variance within the data set which can be attributed to between-population variance. The smaller the value, the smaller the difference between two populations. On this screen you see four populations, since I set K = 4. The Fst is generated from ancestral allele frequencies, which are within the output files. Remember, these are distances between abstract populations, not real ones.
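To make the Fst idea concrete, here is a small worked example in Python. It uses the textbook heterozygosity form, Fst = (H_T - H_S) / H_T, for two equally weighted populations. That is an assumption on my part for illustration; ADMIXTURE’s own estimator may differ in its details, and the allele frequencies below are made up.

```python
def fst(freqs_a, freqs_b):
    """Fst between two populations from per-SNP allele frequencies.

    Uses a ratio of sums over SNPs: sum(H_T - H_S) / sum(H_T).
    """
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2                      # pooled frequency
        h_total = 2 * p_bar * (1 - p_bar)            # expected heterozygosity, pooled
        h_within = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    print(fst([0.5], [0.5]))   # identical frequencies: no differentiation
    print(fst([0.2], [0.8]))   # moderately different
    print(fst([0.0], [1.0]))   # fixed for opposite alleles
```

Identical frequencies give 0, populations fixed for opposite alleles give 1, and everything else falls in between, which is why a smaller number in the matrix means two more similar ancestral populations.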

The original files were euraocean.bed, euraocean.bim, and euraocean.fam. So the output files are like so:

euraocean.4.Q
euraocean.4.P

The 4 represents the K. The first file has a list of the proportions for putative ancestral populations for each individual in the data set, the individuals being on separate lines. The second file has all the allele frequencies for the ancestral populations, generated by the parameter K.

What do you do with this? euraocean.4.Q is related to euraocean.fam, which has family and individual IDs line by line. I don’t know how to use spreadsheets in anything but a primitive way, so I assume there are ways to merge the files and get each line to have ancestry proportions as well as more detailed IDs. Generating mean values for populations also seems essential.

But I use R to do this dirty work.
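If R isn’t your thing, the line-by-line pairing of the .fam and .Q files can be sketched in a few lines of Python. This assumes only what is described above: the two files have one line per individual in the same order, the .fam line starts with the family ID and the individual ID, and the .Q line holds K proportions. The function name and the sample IDs are hypothetical.

```python
def pair_fam_with_q(fam_text, q_text):
    """Return a list of (family_id, individual_id, [proportions])."""
    fam_rows = [line.split() for line in fam_text.splitlines() if line.strip()]
    q_rows = [[float(x) for x in line.split()]
              for line in q_text.splitlines() if line.strip()]
    if len(fam_rows) != len(q_rows):
        # The files only line up if they come from the same run.
        raise ValueError("the .fam and .Q files have different numbers of lines")
    return [(fam[0], fam[1], q) for fam, q in zip(fam_rows, q_rows)]


if __name__ == "__main__":
    fam = "FAM1 ID1 0 0 0 -9\nFAM2 ID2 0 0 0 -9\n"   # made-up individuals
    q = "0.90 0.10\n0.25 0.75\n"                      # K = 2
    for family_id, individual_id, proportions in pair_fam_with_q(fam, q):
        print(family_id, individual_id, proportions)
```

In practice you would read the text with open("euraocean.fam").read() and open("euraocean.4.Q").read().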

#2 Visualizing the output with R

If you don’t have R, you need to install it. If you don’t know how to start, control-f “sudo” in the install instructions; on Ubuntu the command is along the lines of “sudo apt-get install r-base”. That should yank it down for you. Once R is installed, make sure to be in the folder where you have ADMIXTURE. Then type “R” (no quotes when you type a command!). Now that you are in R, what do you do? Here are the specifics:

1) Take the Q file, pump it into a data frame

2) Take the master list, pump it into a data frame

3) Take the .fam file, pump it into a data frame

4) Mix & match

5) Calculate mean proportions, output populations, etc.

6) Visualize!
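For a feel of what steps 4 and 5 amount to, here is a minimal Python sketch of the “mix & match” and the mean calculation. It is not the R script discussed in this post. The column names “id” and “population” are assumptions; check the header of your own master list and pass the real names in.

```python
import csv
import io
from collections import defaultdict


def population_means(master_csv_text, individuals,
                     id_column="id", pop_column="population"):
    """Average ancestry proportions per population.

    individuals: iterable of (individual_id, [proportions]).
    """
    reader = csv.DictReader(io.StringIO(master_csv_text))
    population_of = {row[id_column]: row[pop_column] for row in reader}

    sums = {}
    counts = defaultdict(int)
    for individual_id, proportions in individuals:
        population = population_of.get(individual_id)
        if population is None:
            continue  # not in the master list, so it cannot be labeled
        if population not in sums:
            sums[population] = [0.0] * len(proportions)
        sums[population] = [s + x for s, x in zip(sums[population], proportions)]
        counts[population] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}


if __name__ == "__main__":
    master = "id,population\nID1,french\nID2,french\nID3,han\n"   # made-up rows
    people = [("ID1", [1.0, 0.0]), ("ID2", [0.5, 0.5]), ("ID3", [0.0, 1.0])]
    print(population_means(master, people))
```

Individuals missing from the master list are silently skipped here, which mirrors why the master list matters so much: no label, no place on the plot.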

If you needed to know how to install R, you probably don’t know how to do this. When I first started playing around with ADMIXTURE output files I wrote a quick & dirty script. I barely remember what I am doing with this script now, as I don’t care about the details. But it is now at your service. Still, first you need to do one thing: use a master list which is formatted slightly differently from the one that you downloaded. Here is the revised master list.

Put it in the same folder as ADMIXTURE. Then start R, again, by typing “R”. Run the command you see above. This creates an “HGDPMaster” data frame. That’s necessary for the script I’m giving you to run.

The script is here. If it doesn’t download, copy & paste, and create a file “Rstuff.R”, in the same folder as ADMIXTURE. There are a few variables which you have to manipulate. Here is the relevant section:

###############
# change these
###############
### output files
fileName<-"euraocean"
fileType<-"Q"

#### sets the range of K values to loop through
# lowest K
Start_K<-12
# highest K
End_K<-12

You need to change the file name to the one you have output. If you didn’t do any manipulation, it should be ref.2.Q for K = 2, so the name is “ref”. You also need to put in the range of K’s. I often run many at once and have the output files in the morning, so I often start with 2 and end with 12. If you just want to output one, for example 2, change Start_K to 2 and End_K to 2. These are the only variables you need to change. But there is a lot more you could do. R “comments” with #, so there is a section which I commented out where you can limit the output to particular populations to make the bar plot less busy. You’ll see what I mean if you look at the script: just remove all the #’s, and re-edit to your taste. Please note that case matters, so make sure to keep it lower case when possible (if you looked at the master list, you understand). The script does have a string to upper case function, but that’s only for the output. There’s also a small section where you can re-edit the names to your taste.
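A quick sanity check before running anything is to list the files those variables imply. The sketch below is Python rather than R, and the function is hypothetical; it simply follows the name.K.type pattern of the output files described above.

```python
import os


def expected_output_files(file_name, file_type, start_k, end_k):
    """List the ADMIXTURE output files implied by the script's settings."""
    return [f"{file_name}.{k}.{file_type}" for k in range(start_k, end_k + 1)]


if __name__ == "__main__":
    names = expected_output_files("euraocean", "Q", 2, 12)
    missing = [name for name in names if not os.path.exists(name)]
    print("expected:", names)
    print("not found in this folder:", missing)
```

If anything shows up as not found, either that K hasn’t finished running or the file name setting doesn’t match your data.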

To run the script, do like so:

source("Rstuff.R")

It should output bar plots, as well as generate some spreadsheet files. There’s a lot more you can do…but if you can do a lot more, you wouldn’t be reading this post. Let’s move to the next issue. So now you wonder: is there any way I can change the data file, or add myself to it? Read on….

#3 Using Plink to manipulate the data file

Now you need Plink. I usually put it within the same larger folder, as a subfolder parallel with ADMIXTURE. You run the Plink command like so: “./plink” or “plink”. It depends on the environment (remember, the quotes are only for the post!). There are many things you can do with Plink. I will show you how to do two things.

#1 remove individuals from the data set

#2 add yourself (or someone whose 23andMe file you have) to the data set

#1 is important because the plots get busy with too much variance. Additionally, Africans, and genetic isolates which have gone through population bottlenecks, tend to overwhelm ADMIXTURE. You probably want to remove them. To do this you use Plink’s remove option, which drops the individuals you list.

Here’s one option with the file you’ve got:

./plink --bfile ref --remove removelist.txt --make-bed --out refRemoved

What’s going on above? You’re using a binary pedigree file, so you have the --bfile option on. You do the deed with --remove, and then you create a second binary pedigree file, refRemoved. So you’ll have refRemoved.bed, refRemoved.bim, and refRemoved.fam. Obviously removelist.txt has what you want to remove. Each line has the family ID and individual ID, separated by a space, of someone you want to remove. The easiest way is probably to open up the master list. For the one I gave you above the last column is the family ID and the first is the individual ID. Cut & paste the first column after the last, delete the other columns, and save. I usually get rid of quotation marks and tabs, change it to a .txt file, and there you have it.
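If you’d rather not do the cut & paste by hand, the same list can be built with a short Python sketch. It relies on what is stated above (individual ID in the first column, family ID in the last). Which column holds the population label is an assumption, so pop_column and the sample rows are hypothetical; adjust them to your copy of the master list.

```python
import csv
import io


def build_remove_list(master_csv_text, populations_to_drop,
                      pop_column=1, has_header=True):
    """Return removelist.txt content: one 'familyID individualID' per line."""
    drop = {name.lower() for name in populations_to_drop}
    reader = csv.reader(io.StringIO(master_csv_text))  # also strips the quotes
    if has_header:
        next(reader, None)
    lines = []
    for row in reader:
        if not row:
            continue
        if row[pop_column].strip().lower() in drop:
            # last column = family ID, first column = individual ID
            lines.append(f"{row[-1].strip()} {row[0].strip()}")
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    master = ('id,population,region,fam\n'
              '"ID1","yoruba","africa","FAM1"\n'
              '"ID2","french","europe","FAM2"\n')   # made-up rows
    print(build_remove_list(master, ["Yoruba"]), end="")
```

Write the returned text to removelist.txt and hand that file to the --remove option shown above.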

But what about your 23andMe file? You need to convert it to pedigree format. I have created a quick & dirty Perl script to do so. You can find it here. Download or cut & paste it. You need to remove the comments at the top of the 23andMe file. That is, you need to remove everything before the first SNP. Assuming that’s done, do this at the command line within the folder where you put the script (recall that you get to that folder with “cd”):

perl convert.pl "YourFileName" "001" "001"

The script fires, gets the file name from the first parameter, and outputs two files, YourFileName.ped and YourFileName.map. What about the two other parameters? They’re generating your family ID and individual ID. They’d be FAM001 and ID001 in this case. You need to enter these into the master list! Otherwise you won’t come out on the bar plots. Also enter your ethnicity, etc. Or, just your name if you want to be your own slice of the bar plot.
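To show what such a conversion involves, here is a rough Python sketch. It is not the Perl script above and may differ from it in details. It assumes the usual 23andMe layout (comment lines starting with #, then rsid, chromosome, position, and genotype per line), writes unknown sex and missing phenotype, and treats anything that isn’t a plain A/C/G/T call as missing. The function name and sample rows are made up.

```python
VALID_ALLELES = set("ACGT")


def to_ped_and_map(snp_lines, fam_number, ind_number):
    """Return (ped_line, map_lines) for one 23andMe individual."""
    map_lines = []
    alleles = []
    for line in snp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the comment header
        rsid, chromosome, position, genotype = line.split()[:4]
        # .map columns: chromosome, SNP ID, genetic distance, base-pair position
        map_lines.append(f"{chromosome} {rsid} 0 {position}")
        if len(genotype) == 2 and set(genotype) <= VALID_ALLELES:
            alleles += [genotype[0], genotype[1]]
        elif len(genotype) == 1 and genotype in VALID_ALLELES:
            alleles += [genotype, genotype]   # single-letter call, doubled
        else:
            alleles += ["0", "0"]             # no-call or anything unexpected
    # .ped columns: family ID, individual ID, father, mother, sex, phenotype
    header = [f"FAM{fam_number}", f"ID{ind_number}", "0", "0", "0", "-9"]
    return " ".join(header + alleles), map_lines


if __name__ == "__main__":
    rows = ["# a comment line", "rs1\t1\t100\tAG", "rs2\t1\t200\t--", "rs3\tX\t300\tC"]
    ped_line, map_lines = to_ped_and_map(rows, "001", "001")
    print(ped_line)
    print("\n".join(map_lines))
```

The IDs come out as FAM001 and ID001 for the parameters shown, matching the naming described above, so the same entries would go into the master list.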

Note that you have .ped, not .bed, files. These are big. Now you need to convert the text to binary pedigree. Move the YourFileName files to the plink folder. Make binary:

./plink --file YourFileName --make-bed --out YourFileName

Now you have YourFileName.bed, YourFileName.bim, and YourFileName.fam. It is best to limit your SNPs to the same ones as those in the reference data set. So get those from the reference:

./plink --bfile ref --write-snplist --out SNPs

You should have a file, SNPs.snplist. Use them to filter your 23andMe file.

./plink --bfile YourFileName --extract SNPs.snplist --make-bed --out YourFileNameFiltered
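If you’re curious how much of your 23andMe file survives that filter, a small Python sketch can compare the SNP list against your .bim file. It assumes the .snplist file has one SNP ID per line and that the SNP ID is the second column of the .bim file; the function name and sample IDs are hypothetical.

```python
def overlap_report(snplist_text, bim_text):
    """Split the SNP IDs in a .bim file into (kept, dropped) by the SNP list."""
    wanted = {line.strip() for line in snplist_text.splitlines() if line.strip()}
    kept, dropped = [], []
    for line in bim_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        snp_id = fields[1]
        (kept if snp_id in wanted else dropped).append(snp_id)
    return kept, dropped


if __name__ == "__main__":
    snplist = "rs1\nrs3\n"
    bim = "1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tG\tA\n"
    kept, dropped = overlap_report(snplist, bim)
    print(len(kept), "SNPs kept;", len(dropped), "dropped")
```

Expect most of your SNPs to be dropped, since the reference data set has far fewer markers than a 23andMe file.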

Now you want to merge:

./plink --bfile ref --bmerge YourFileNameFiltered.bed YourFileNameFiltered.bim YourFileNameFiltered.fam --make-bed --out ref

You are now appended to the reference data set! If you open up the ref.fam file your family ID and individual ID should be at the end of the list.

If you’ve slogged through this far, I thought it would be nice to end with something which shows what this is all about. Below I’ve filtered the reference data set of most African and New World populations, and run it from K = 2 to K = 12. It took ~10 hours to complete. I’ve also limited the populations to display using the script above so that it isn’t too cluttered. Here are the spreadsheets generated from the runs (they will be in the folder where you run the R script, and have the form “K =2” and such for names).

k10
k11
k12
k2
k3
k4
k5
k6
k7
k8
k9

CATEGORIZED UNDER: Genetics, Genomics, Personal Genomics
  • RK

    Wow, this is great — especially the merged dataset. Thanks for putting it together. I haven’t used the dataset, but shouldn’t you be filtering your 23andMe file for the SNPs found in the merged dataset? After all, the latter has many fewer SNPs than the former. Also, I’m not sure if it’s needed for this dataset, but people should familiarize themselves with how to flip strands when merging datasets.

    Have you ever used Shellfish to do PCA? It lets you create “loadings” for reference populations, so that you can see where samples fall on the dimensions generated by the reference data without changing the dimensions themselves. Basically, it’s like the advanced genetic similarity plot on 23andMe. I’m trying to create a loading for the HGDP populations, but now I may as well add these in.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    yeah, i put the --extract stuff there. also, no issue with flipping when i did a test run. it seemed oriented 23andMe direction. haven’t used shellfish, but i’ll check it out.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    hey, anyone can leave a question here. didn’t want 2 overload the post.

  • http://www.zackvision.com/weblog/ Zack

    RK: I downloaded shellfish just for that reason, but it gives me (Python 2.6.6 on Ubuntu 10.10) errors in the initial conversion step. Have you been able to get it to work? I haven’t had time to look into it.

  • RK

    Zack, yeah, the latest version of gtool, which shellfish depends on, seems to be broken. I downgraded to version 0.6.1 and it works fine: http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool_v0.6.1.html

    I’m using Python 2.6.6 as well, on Debian unstable.

  • http://mygenomix.wordpress.com Moreno

    Thanks Razib for this great tutorial! I’ve always wanted to play with this kind of things, and this post will save me a lot of time.

  • Garvan

    The Rstuff.R link gives me a 403 Forbidden error message. Could you check the link please?

    Thanks,
    Garvan

  • http://blogs.discovermagazine.com/gnxp Razib Khan
  • James

    Thanks a bunch for this….can’t wait to graduate college so I can mess around with this…

  • Garvan

    Thanks, I downloaded the script using a vista computer on another network. I still have no idea why I can not access it here. But I have it now. Garvan

  • Pingback: Eurasia, ADMIXTURE supervised & unpservised | Gene Expression | Discover Magazine

  • iron0037

    Thanks for this useful post. I am an engineer with a casual interest in genetics. So I have a decent math background but understand little about the domain. Up til now, all these admixture analyses have been like black magic to me. Now I at least have a bit better of an understanding. And yet, your post begs more questions…

    1) What does “pedigree formatted” mean? Are these beds, bims, and fams in a binary or ascii format? I assume binary because I wasn’t able to open them in a text editor. Besides ADMIXTURE (which I can’t use because like most people in the world, I use Windows :) ), is there any way to read these files? Could I write a custom binary processor? Is there any ascii version of the data? Is that what the csv is?

    2) What information is actually contained within these files? ADMIXTURE’s page says it’s “multilocus SNP genotype datasets.” What the heck is that? What part of the genome are they examining for single nucleotide polymorphisms? Are these standard points for examination? How many points per individual?

    3) If I understand correctly, ADMIXTURE is doing is a cluster analysis? It’s crunching the data, finding the eigenvectors for the given K, and displaying the vector components of each population mapped onto the eigenvectors? Or to put it another way, it’s creating K fictitious populations that are as genetically distinct as possible, and showing the proportion each modern population has of these fictitious populations?

    4) Is it possible to specify that one of your K populations is completely one of the modern populations? Could you say, as an example, that you want the French and the Moroccans to show up as 100% one of the resultant clusters?

    5) These bar charts that you end up with are rather hard to follow. Do people ever make pie charts for these things? How about a map of the Earth with the ranges of each of the K populations? You can clearly see in your K=12 map, for example, that the pink population is most prevalent in northeastern Europe and dissipates moving outward from there.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    Besides ADMIXTURE (which I can’t use because like most people in the world, I use Windows ), is there any way to read these files? Could I write a custom binary processor? Is there any ascii version of the data? Is that what the csv is?

    no idea about windows. .bed is binary. the text version is .ped.

    What information is actually contained within these files? ADMIXTURE’s page says it’s “multilocus SNP genotype datasets.” What the heck is that? What part of the genome are they examining for single nucleotide polymorphisms? Are these standard points for examination? How many points per individual?

    the data have snps and individuals. also stuff like sex, phenotype, etc., but that’s extraneous for our purpose. in these data 27,000 snps per individual.

    Or to put it another way, it’s creating K fictitious populations that are as genetically distinct as possible, and showing the proportion each modern population has of these fictitious populations?

    yes

    4) Is it possible to specify that one of your K populations is completely one of the modern populations? Could you say, as an example, that you want the French and the Moroccans to show up as 100% one of the resultant clusters?

    yes. see follow up post.

    These bar charts that you end up with are rather hard to follow. Do people ever make pie charts for these things? How about a map of the Earth with the ranges of each of the K populations? You can clearly see in your K=12 map, for example, that the pink population is most prevalent in northeastern Europe and dissipates moving outward from there.

    yes. i have thought of that. need to get fluid with R’s map functions. i don’t want to learn GIS. the thematic heat maps are easy with paid software, but i’m not going to shell out $400 for that :)

  • iron0037

    Hi Razib,
    Thank you for the quick answers. I hadn’t made it to your next post at the time of my comment! 27000 SNPs huh? That’s quite a bit of data. I see that I can open the bim and fam files in a text editor, so there’s hope I can import all of it into MATLAB and play with it. I’ll have to read your posts more carefully and do some investigation as to the meaning of each row of the file. We’ll see if anything comes of it.

    Cheers

  • biologist

    1) What does “pedigree formatted” mean?

    http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

    Are these beds, bims, and fams in a binary or ascii format? I assume binary because I wasn’t able to open them in a text editor.

    bed is binary; bim and fam are plain text

    http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed

    Besides ADMIXTURE (which I can’t use because like most people in the world, I use Windows :) ), is there any way to read these files? Could I write a custom binary processor? Is there any ascii version of the data? Is that what the csv is?

    you can convert bed/bim to ped/map using plink (see links above)

  • biologist

    2) What information is actually contained within these files? ADMIXTURE’s page says it’s “multilocus SNP genotype datasets.” What the heck is that? What part of the genome are they examining for single nucleotide polymorphisms? Are these standard points for examination? How many points per individual?

    Each SNP is a single base-pair in the genome at which two alternative versions are measured. Given two possible alleles (say A and B), there are three possible genotypes (AA, AB, BB). The data are genotypes.

    The chromosome and base-pair coordinate is given in the map file. These coordinates are with respect to a genome build, such as hg18. http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg18

    These SNPs were likely selected because they are otherwise well characterized, so in essence they are standardized.

  • Jean

    Is the dataset listed above complete or might there be other populations that were not included?

    Thanks

  • Antonio

    “Do people ever make pie charts for these things?” what else could be worse than that?

  • Pingback: Input determining output in ADMIXTURE | Gene Expression | Discover Magazine


