In the near future one of my projects is revising and expanding the “PHYLO” pedigree file which I put up a week ago. Basically I want there to be a public data set which has a modest number of SNPs useful for phylogenetic analysis (100,000 to 200,000) with wide population coverage. Additionally, I am going to do a few things like rename the family ids to populations, and also release it with scripts to help in running Admixture (for example, shell scripts which will automate replication and later analysis of replicates). Finally, I’m planning on running ~50 replicates of K = 2 to K = 20 with 10-fold cross-validation (yes, this will take a while) to get a good sense of the “best” Ks. The reality is that most people are probably only interested in the “most informative” K, +/- 1, so there’s no need for everyone to run K = 2 to K = 20. The time saved should be spent on running replicates, and then on using CLUMPP to merge the results.
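To make the replicate plan concrete, here is a minimal sketch of what that kind of automation might look like. It is not the script I will release. The function names, the directory layout (`K<k>/rep<n>`), and the use of the replicate number as the random seed are all assumptions for illustration. It also assumes Admixture's `--cv` and `--seed` options and the `CV error (K=...)` line it prints to its log, so check those against the version you have installed.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed log format: Admixture prints a line like "CV error (K=3): 0.52345".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def replicate_commands(bed_file, k_values, n_reps, folds=10):
    """Build one (run_dir, argv) pair per (K, replicate).

    Each replicate gets its own directory (hypothetical layout) because
    Admixture writes its output files into the current directory, so
    replicates run in one place would overwrite each other. Pass an
    absolute path for bed_file so it resolves from any run_dir.
    """
    jobs = []
    for k in k_values:
        for rep in range(1, n_reps + 1):
            run_dir = f"K{k}/rep{rep}"
            # The replicate number doubles as the seed (an assumption),
            # so every replicate of a given K starts from a different point.
            argv = ["admixture", f"--cv={folds}", f"--seed={rep}",
                    bed_file, str(k)]
            jobs.append((run_dir, argv))
    return jobs


def parse_cv_errors(log_text):
    """Extract (K, cv_error) pairs from the text of one or more logs."""
    return [(int(k), float(err)) for k, err in CV_LINE.findall(log_text)]


def best_k(cv_pairs):
    """Average the CV error over replicates and return (best K, means)."""
    by_k = defaultdict(list)
    for k, err in cv_pairs:
        by_k[k].append(err)
    means = {k: mean(errs) for k, errs in by_k.items()}
    return min(means, key=means.get), means
```

Calling `replicate_commands("/data/phylo.bed", range(2, 21), 50)` (the path is made up) returns the full list of jobs for 50 replicates of K = 2 to K = 20. Each one can be handed to `subprocess.run(argv, cwd=run_dir)` once the directory exists. Someone who only cares about the most informative K, +/- 1, would pass just those three values instead.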
I would say that this is for ‘amateurs’ only, but I don’t think it’s betraying confidence to observe that several academic researchers at prominent institutions have ended up asking me how to get good public data sets. This sort of information still hasn’t percolated to the general public, including scientists who don’t work on population genomics. After a few trial runs with public data sets, people with academic access could move on to things like the POPRES data set.
But the ultimate point of this post is to ask: do you want to be in this data set? If so, I need the file (23andMe format is fine; otherwise, pedigree files only), your name, and some minimal ethnic information. I’m not going to add everyone. I just want to diversify the public data set a little. But I am going to put names in the sample sheet, so you won’t have anonymity. As you know, I don’t particularly care about this personally, but your mileage may vary. Researchers might need to contact people or check that they are who they say they are.
Email: contactgnxp -at- gmail -dot- com