Over the past few months I’ve been encouraging people to pull down ADMIXTURE and push the public data sets through it. You can also convert your 23andMe raw file into pedigree format pretty easily and integrate it into the public data sets with PLINK. I’ve been following Zack’s Harappa Ancestry Project pretty closely, and I’ve also been running the software myself, manipulating its parameters and seeing how things shake out. But the more I do it, the more I wonder if it isn’t like regression analysis: a technique just waiting to be leveraged by human biases. I began thinking about this more deeply after a conversation with a computational biologist who outlined the structural problems with how ad hoc the use of statistics is in the life sciences.
These sorts of qualms are probably why I’m posting my results more on Facebook and passing them around among friends, rather than putting them out there in the public domain. It isn’t that I think the results are going to be abused. I just don’t know what they mean a lot of the time. Or, perhaps more honestly, I am suspicious of my own propensity to see what I expect to see: a case of my priors strongly shaping the inferences I generate.
So I decided to do an experiment. Below are 8 runs, displayed as bar plots. Each thin sliver represents an individual. The colors represent putative ancestral populations, of which the modern populations are combinations; the number of ancestral populations is set by the parameter K (so K = 2 means two ancestral populations, each corresponding to a different color). There are two data sets which I analyzed, group A and group B. I’ve also noted the K’s for each plot. But aside from that, I’ll leave you ignorant of what these populations are or how many there are. Jot down some ideas as to what you can see. How many populations? How do they relate to each other? Can you perceive any real information in the higher K’s? I’ll put the “answers” below the fold. There’s no point in me saying what I think; I already know which populations these are, so I’m tainted.
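If you want to check your eyeball impressions against numbers, here is a minimal sketch of one way to do it. It assumes the ancestry fractions are stored the way ADMIXTURE writes its `.Q` output: one row per individual, K whitespace-separated fractions per row, each row summing to about 1. The function name, the labels `P1` to `P3`, and every number below are made up for illustration; they are not the results from the plots in this post.

```python
from collections import defaultdict


def mean_ancestry_by_population(q_lines, labels):
    """Average each ancestry fraction within each labelled population.

    q_lines: rows of a Q matrix, one individual per row, with K
             whitespace-separated fractions per row.
    labels:  one population label per individual, in the same order.
    Returns a dict mapping label -> list of K mean fractions.
    """
    if len(q_lines) != len(labels):
        raise ValueError("need exactly one label per individual")

    sums = {}
    counts = defaultdict(int)
    for line, label in zip(q_lines, labels):
        fractions = [float(x) for x in line.split()]
        # Each individual's fractions should sum to (roughly) 1.
        if abs(sum(fractions) - 1.0) > 1e-3:
            raise ValueError("fractions do not sum to 1: %r" % line)
        if label not in sums:
            sums[label] = [0.0] * len(fractions)
        for k, value in enumerate(fractions):
            sums[label][k] += value
        counts[label] += 1

    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


# Hypothetical K = 2 run over six individuals from three populations.
toy_q = ["0.99 0.01", "0.97 0.03", "0.80 0.20",
         "0.76 0.24", "0.02 0.98", "0.00 1.00"]
toy_labels = ["P1", "P1", "P2", "P2", "P3", "P3"]

for label, means in sorted(mean_ancestry_by_population(toy_q, toy_labels).items()):
    print(label, [round(m, 3) for m in means])
```

Per-population means throw away exactly the individual-level variation that the slivers in a bar plot show, so treat a summary like this as a complement to the plots rather than a replacement.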
All the populations were from the HapMap, with 80,000 markers. They were two sets of three:
Group A: Yoruba, African Americans, and Tuscans
Group B: Gujaratis, African Americans, and Mexicans
I selected these two sets because in Group A you have a population, African Americans, which can plausibly be considered a combination of the two other populations. In Group B, this is not so. The Mexican and African American populations share similar European ancestry, but both are admixed. The Gujarati population has no close relationship to any of the other populations, so ADMIXTURE is just trying to do the best it can with what it’s given.
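To make the contrast between the two groups concrete, here is a toy sketch. To be clear, this is not what ADMIXTURE does internally (it fits a likelihood model to individual genotypes); it is a simplified allele-frequency illustration. It assumes that an admixed population's allele frequencies are a weighted average of two source populations, p_target = q * p_source1 + (1 - q) * p_source2, and estimates q by least squares. All frequencies and names are invented for the example.

```python
def fit_two_way_mixture(p_target, p_source1, p_source2):
    """Least-squares fit of p_target ~ q*p_source1 + (1-q)*p_source2.

    Each argument is a list of allele frequencies at the same markers.
    Returns (q, rms_residual), with q clipped to the range [0, 1].
    """
    num = sum((t - b) * (a - b)
              for t, a, b in zip(p_target, p_source1, p_source2))
    den = sum((a - b) ** 2 for a, b in zip(p_source1, p_source2))
    if den == 0:
        raise ValueError("source populations have identical frequencies")
    q = min(1.0, max(0.0, num / den))
    residuals = [t - (q * a + (1 - q) * b)
                 for t, a, b in zip(p_target, p_source1, p_source2)]
    rms = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return q, rms


# Invented allele frequencies at five markers for two source populations.
source1 = [0.9, 0.1, 0.5, 0.7, 0.3]
source2 = [0.2, 0.6, 0.4, 0.1, 0.8]

# A target that really is an 80/20 mix of the two sources (the Group A case).
mixed = [0.8 * a + 0.2 * b for a, b in zip(source1, source2)]

# A target unrelated to either source (loosely, the Group B case).
unrelated = [0.5, 0.9, 0.1, 0.4, 0.2]

q_mixed, rms_mixed = fit_two_way_mixture(mixed, source1, source2)
q_unrel, rms_unrel = fit_two_way_mixture(unrelated, source1, source2)
print("mixed:     q = %.3f, rms residual = %.3f" % (q_mixed, rms_mixed))
print("unrelated: q = %.3f, rms residual = %.3f" % (q_unrel, rms_unrel))
```

The fit returns a mixing proportion in both cases. Only the size of the residual tells you that in the second case the "mixture" is a forced answer, and a bar plot does not show you a residual.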
How did you fare?