As you have begun interpreting the reference results, let me make a friendly warning: you have to keep in mind that most of the reference populations of ethnic groups are extremely limited in sample size (with only between 2 and 25 individuals) and from very obscure sources, and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.
This seems a rather reasonable caution. But I don’t think such a vague piece of advice really adds any value. These sorts of caveats are contingent upon:
- The scope of the question being asked (i.e., how fine-grained the variation you are attempting to measure is)
- The sample size
- The representativeness
- The thickness of the marker set (10 autosomal markers vs. 500,000 SNPs)
This isn’t a qualitative issue, easily divided into “right” and “wrong.” Sometimes an N = 1 is very insightful. That’s why the whole genome of one Bushman was very useful. In fact, the whole genome of any random Sub-Saharan African, and the whole genome of any random non-African (this means ancestry from before 1500 in those regions), is going to reflect clearly the differences between these two broad population sets in terms of genomic variation. Subsequent addition of individuals to generate a larger sample would be very informative of course, and allow us to answer many more questions. But the point is that even small sample sizes can answer properly framed queries.
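To make the trade-off between sample size and marker density concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: two hypothetical populations whose allele frequencies differ by 0.1 at every locus, independent loci, and a simple likelihood comparison. It is not how any of the real analyses were run. It asks how often a single individual (N = 1) is assigned to the correct population with a thin marker set versus a thick one.

```python
import math
import random

def make_populations(n_markers, rng, delta=0.1):
    """Toy allele frequencies for two hypothetical populations at independent loci."""
    freqs_a, freqs_b = [], []
    for _ in range(n_markers):
        p = 0.2 + 0.6 * rng.random()            # frequency in population A, in [0.2, 0.8)
        shift = delta if rng.random() < 0.5 else -delta
        freqs_a.append(p)
        freqs_b.append(p + shift)                # stays inside [0.1, 0.9)
    return freqs_a, freqs_b

def sample_genotype(freqs, rng):
    """One diploid individual: copies (0, 1 or 2) of the reference allele per locus."""
    return [(rng.random() < p) + (rng.random() < p) for p in freqs]

def log_likelihood(genotype, freqs):
    """Log-probability of a genotype, up to a constant that is the same for both populations."""
    return sum(g * math.log(p) + (2 - g) * math.log(1 - p)
               for g, p in zip(genotype, freqs))

def assignment_accuracy(n_markers, n_trials=200, seed=1):
    """Fraction of single individuals drawn from A that look more like A than B."""
    rng = random.Random(seed)
    freqs_a, freqs_b = make_populations(n_markers, rng)
    correct = 0
    for _ in range(n_trials):
        genotype = sample_genotype(freqs_a, rng)
        if log_likelihood(genotype, freqs_a) > log_likelihood(genotype, freqs_b):
            correct += 1
    return correct / n_trials

if __name__ == "__main__":
    for n_markers in (10, 100, 2000):
        print(n_markers, "markers:", assignment_accuracy(n_markers))
```

In this toy model a handful of markers leaves a single individual ambiguous, while a few thousand make the assignment close to certain. That is the sense in which an N = 1 can answer a broad question if the marker set is thick enough.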
Another issue is representativeness. The HGDP data set was biased at the outset toward more isolated and distinctive groups. There was a belief that many of these groups were going to disappear within a generation, and their genetic uniqueness should be recorded (this seems to have been correct). So apparently the clusters generated from HGDP are “cleaner” in their separation than those from the POPRES sample, which is derived from a more cosmopolitan urban set of populations. We also have the HapMap sample, and some of the ones Zack has merged into HGDP and HapMap (there are likely other public data sets; Zack was looking for those with South Asians).
After 10 years of results generated from these data sets I think we have some idea of the errors and biases introduced because of skewed representativeness and small sample size (HapMap has a thicker marker set, but HGDP has a better population coverage). In other words, we should have some intuition of where to be careful, and where not to be. For example, small tribal groups are likely to exhibit genetic distinctiveness (as well as cultural isolates, like the Roma) due to low long-term effective population size. On the other hand, if you have a set of distinct tribal groups, one presumes that the common patterns would reflect broad macro-regional genetic variation. In Zack’s combined data set he has a South Indian tribe and a Pakistani one (I mean the Kalash; I understand Pathans and Baloch are tribal people, but they’re expansive and heterogeneous). Any common element between these two groups in relation to Iranians is presumably not a coincidence. Random genetic drift usually results in different allele frequencies between populations, so genetic commonalities between different isolates probably reflect common ancestry.
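The drift argument can be sketched with a minimal Wright-Fisher simulation. This is a toy model under stated assumptions: two hypothetical isolates that split from the same ancestral population, no migration or selection, and made-up values for the population size and number of generations. It is only meant to show the logic, not to model the Kalash or any real group.

```python
import random

def drift(p0, n_chromosomes, generations, rng):
    """Wright-Fisher drift: resample n_chromosomes alleles each generation."""
    p = p0
    for _ in range(generations):
        copies = sum(rng.random() < p for _ in range(n_chromosomes))
        p = copies / n_chromosomes
    return p

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(n_loci=400, n_chromosomes=100, generations=20, seed=7):
    """Return (correlation of frequencies, correlation of drift-induced changes)."""
    rng = random.Random(seed)
    # Shared ancestral frequencies (assumed uniform on [0.2, 0.8) for illustration).
    ancestral = [0.2 + 0.6 * rng.random() for _ in range(n_loci)]
    # Each isolate drifts independently from the same starting point.
    isolate_1 = [drift(p, n_chromosomes, generations, rng) for p in ancestral]
    isolate_2 = [drift(p, n_chromosomes, generations, rng) for p in ancestral]
    change_1 = [f - p for f, p in zip(isolate_1, ancestral)]
    change_2 = [f - p for f, p in zip(isolate_2, ancestral)]
    return correlation(isolate_1, isolate_2), correlation(change_1, change_2)

if __name__ == "__main__":
    freq_corr, change_corr = simulate()
    print("correlation of allele frequencies:", round(freq_corr, 3))
    print("correlation of drift changes:     ", round(change_corr, 3))
```

In this sketch the changes that drift produces in each isolate are essentially uncorrelated, because each isolate wanders in its own random direction. Whatever correlation remains between the two sets of allele frequencies comes from the ancestral frequencies they share, which is the intuition behind reading commonalities between isolates as common ancestry.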
The main point I’m trying to make is that we’re beyond the point of generic cautions. Rather, there are specific pitfalls which we need to be cognizant of. So if you know specific ethnographic details, that is useful. If there are statistical tricks and tips, that is also useful (larger sample sizes exhibit diminishing returns in statistical power). Also, one needs to keep in mind ascertainment bias: the current generation of SNP chips is tuned to European polymorphisms, so they might miss the loci where other populations are polymorphic but Europeans are not.
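The diminishing returns are easy to see with the textbook binomial standard error of a sample allele frequency, sqrt(p(1 - p) / 2N) for N diploid individuals. The sketch below assumes unrelated, randomly sampled individuals, which real reference panels often are not; the function name is just an illustrative choice.

```python
import math

def allele_frequency_se(p, n_individuals):
    """Binomial standard error of an allele frequency estimated from
    n diploid individuals (2n chromosomes), assuming random sampling
    of unrelated individuals."""
    return math.sqrt(p * (1 - p) / (2 * n_individuals))

if __name__ == "__main__":
    # Sample sizes spanning the 2-to-25 range mentioned in the comment, and beyond.
    for n in (2, 8, 25, 100, 400):
        print(f"N = {n:>3}: SE at p = 0.5 is {allele_frequency_se(0.5, n):.3f}")
```

The error shrinks with the square root of N, so quadrupling the sample only halves it. Going from 2 to 25 individuals buys far more precision than adding the next 23.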
By analogy, unsecured credit can be problematic. Yes, I think we knew that. The key is to identify those with the means and ability to use credit responsibly. The tools and data are now available to the masses. A big “BE CAREFUL” sticker is not helpful. What is helpful are concrete and specific pointers.
For what it’s worth, I found Zack’s bar plot hard to read, so here is one I generated with larger labels (K = 6):
Yesterday Zack gave me a personal vector: 66, 1, 4, 10, 14, 0, 4, 0, 0, 3. If you’ve been reading my posts I think you know how to interpret that….