One of the quasi-facts which I often stumble upon is the idea that in 10 percent of cases paternity is misattributed. That is, the presumed father is cuckolded. I often encounter this “fact” in a biological context, where someone with an advanced degree in biology will relate how it turns out that there is a great deal of delicacy in situations of transplant matching because of this fact. When pressed on the provenance of this fact most demur. The reason people demur is that the factual basis of this assertion is very thin. In particular, very high estimates of cuckoldry come from databases of disputed paternity, which are obviously going to be a biased sample. A more thorough survey suggests that there is a wide variation in misattributed paternity across populations.
In the links below I alluded to a controversy over the “Neurodiversity movement”. The basic issue is that people with Asperger syndrome and high functioning autism are being accused of putting their concerns above and beyond those of the large number of mentally disabled autistic individuals (some of whom are non-verbal, and exhibit severe cognitive deficits) in the grab for “rights.” Rights here understood as the rights which black Americans, women, and gays have claimed, to be recognized as equal before the law and endowed with the same value in the eyes of society. As a deep philosophical matter I’m skeptical of Rights in a fundamental sense. As a conservative I’m skeptical of the push for a huge array of rights by a plethora of identity groups. Socially recognized rights are valuable, and are cheapened and debased by dispensing them too liberally.
Since John Hawks already hit it I don’t have much to add about the dog-starch-adaptation-paper in Nature, The genomic signature of dog domestication reveals adaptation to a starch-rich diet. I’m impressed at the yield from the sample sizes that they had, but as John alludes to this area of study has huge possibilities. The authors suggest that agriculture catalyzed domestication. That’s fair enough, and carefully stated I’d say, because the Amerindians seem to have brought domestic dogs to the New World long before agriculture. In other words, the “domestication” event was probably a multi-layered affair. Looking through the supporting information it’s obvious that the domestics were almost all Western breeds. As the search for adaptive variants expands to other lineages we might be in for surprises in terms of the signatures of selection as they vary across the dogs.
The above is a graph which illustrates phylogenetic relationships using the TreeMix package. It is from the paper I alluded to yesterday. The paper, DNA analysis of an early modern human from Tianyuan Cave, China, is open access, so everyone should be able to read it. Its mtDNA analysis shows that the Tianyuan sample, from the region of Beijing and dating to ~40,000 years B.P., is a basal clade in haplogroup B, which is common in eastern Eurasia and the New World. This is a satisfying result insofar as this haplogroup is understood to have diversified ~50,000 years B.P. There is very strong support in these data for the proposition that Tianyuan forms a distinct clade with the populations you see above, as opposed to western Eurasians. This is important because this sample seems to date with relatively good precision to 40,000 years B.P., supporting the archaeological contention that modern humans were already diversifying into western and eastern lineages 40-50,000 years ago. In contrast, statistical genomic inferences tend toward a more recent date for divergence. We can be moderately confident at this point that some aspect of the west-east divergence predates later gene flow events, which might confuse archaeology-blind methods.
Over the past decade or so much of the reconstruction of the human genetic past has occurred through inferences generated from variation of extant human beings. In more plain English the patterns of genetic variation of modern populations have been used to map out the patterns of the past. There are serious difficulties with these sorts of inferences. For example you generate a huge number of potential phylogenetic trees and zero in on the “most probable tree” (or, the distribution of trees). But at the end of the day these inferences are only as good as your assumptions.
The above figure is from a paper which leaves me somewhat befuddled, Genome-wide data substantiate Holocene gene flow from India to Australia. The authors ran several hundred thousand SNPs through TreeMix, and generated the above graph which leads one to the conclusion that there has been significant gene flow from Indian populations to Australia. More precisely, from Dravidian populations to the Aboriginal peoples of Northern Australia. In plain English the authors found the tree which was the best fit to the data, and then they improved it by adding migration across branches which were the poorest fits.
Obviously the whole paper is not going to rest on the above graph. They performed some clustering analysis on the data, which you’ll recognize. PCA and Admixture:
In my earlier posts where I gave a short intro to using Plink I distributed a data set termed PHLYO. One thing I did not mention is that I’ve also been running it on Admixture. But here’s an important point: I ran the data set 10 times from K = 2 to K = 15. Why? Because the algorithm produces somewhat different results on each run (if you use a different seed, which you should), and I wanted to not be biased by one particular result. Additionally, I also turned on cross-validation error, which gives me a better sense of which K’s to trust. But after I select the K which I want to visualize which replicate run will I then use to generate the bar plots? I won’t pick any specific one. Rather, I’ll merge them together with an off-the-shelf algorithm. Additionally, I also want to sort the individuals by their modal population cluster.
This sounds rather convoluted, and it is somewhat. I have a pipeline that I use, but it’s not too user friendly. One of my projects is to clean it up, document it, and publish it online. Though if you have your own pipeline all ready to go, please post it in the comments with a link! The general steps are as follows for me:
1) Convert Admixture Q files into Structure format, transform family identifications to numeric values, and generate a file with family identification and numeral pairs
2) Merge the results across runs using Clumpp
3) Sort the individual results within populations
4) Then use Distruct to produce an output file
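Steps 2 and 3 above can be illustrated with a minimal sketch. This is not Clumpp itself and not the pipeline described here: it is a naive stand-in that tries every relabeling of the clusters (so it is only practical for small K), averages the aligned replicate runs, and then sorts individuals by their modal cluster. All function names and the toy Q values are made up for illustration.

```python
from itertools import permutations

def align_to_reference(ref, run):
    """Relabel the clusters of `run` so they best match `ref`.

    ref, run: lists of rows, one row of K ancestry fractions per individual.
    Brute force over all K! column orders, so only usable for small K.
    """
    k = len(ref[0])
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(k)):
        # Similarity: sum of products between reference and permuted columns.
        score = sum(r[j] * q[perm[j]] for r, q in zip(ref, run) for j in range(k))
        if score > best_score:
            best_perm, best_score = perm, score
    return [[q[best_perm[j]] for j in range(k)] for q in run]

def merge_runs(runs):
    """Average replicate Q matrices after aligning each one to the first run."""
    ref = runs[0]
    aligned = [ref] + [align_to_reference(ref, r) for r in runs[1:]]
    n, k = len(ref), len(ref[0])
    return [[sum(a[i][j] for a in aligned) / len(aligned) for j in range(k)]
            for i in range(n)]

def sort_by_modal_cluster(labels, q):
    """Order individuals by largest cluster, then by descending membership in it."""
    def key(item):
        _, row = item
        modal = max(range(len(row)), key=lambda j: row[j])
        return (modal, -row[modal])
    return sorted(zip(labels, q), key=key)

# Made-up toy data: two replicate runs at K = 2 whose cluster labels are swapped.
run_a = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
run_b = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]
merged = merge_runs([run_a, run_b])
ordered = sort_by_modal_cluster(["ind1", "ind2", "ind3"], merged)
for label, row in ordered:
    print(label, row)
```

The point of the alignment step is that "cluster 1" in one run may be "cluster 2" in another, so averaging the raw files without relabeling would blur real structure.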
Before I show you the resultant bar plot, here are the cross-validation results with standard deviation ticks:
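A cross-validation summary of this kind can be sketched as follows. This assumes each replicate log contains lines of the form "CV error (K=3): 0.52305", which is what ADMIXTURE prints when run with --cv; the log contents below are invented toy values, and the function names are hypothetical.

```python
import re
from statistics import mean, stdev

# Assumed log line format: "CV error (K=3): 0.52305"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def collect_cv_errors(log_texts):
    """Gather CV errors per K from the text of several replicate logs."""
    by_k = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            by_k.setdefault(int(k), []).append(float(err))
    return by_k

def summarize(by_k):
    """Mean and standard deviation of the CV error for each K."""
    return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for k, v in sorted(by_k.items())}

# Invented toy logs standing in for two replicate runs with different seeds.
logs = [
    "CV error (K=2): 0.560\nCV error (K=3): 0.520\nCV error (K=4): 0.540\n",
    "CV error (K=2): 0.580\nCV error (K=3): 0.500\nCV error (K=4): 0.550\n",
]
summary = summarize(collect_cv_errors(logs))
best_k = min(summary, key=lambda k: summary[k][0])
for k, (m, s) in summary.items():
    print(f"K={k}: mean CV error {m:.3f} +/- {s:.3f}")
```

The K with the lowest mean error is only a guide; where the standard deviations of neighboring K values overlap, the choice between them is not well supported.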
Over at Scientific American Christie Wilcox has a post up with the provocative title, People With Brown Eyes Appear More Trustworthy, But That’s Not The Whole Story, which reports on a new PLoS ONE paper, Trustworthy-Looking Face Meets Brown Eyes. Like Christie I would enjoy illustrating this post with my own trustworthy and youthful brown eyed visage, but I worry that my mien is a bit on the sly side! In any case, what of the paper? Wilcox reviews the salient points of the results. In short, the issue here is that brown eyed men seem to have more ‘trustworthy faces’ than blue eyed men. When the eyes were digitally manipulated it turned out that color had no influence on perception. Rather, it was the correlation between eye color and facial proportion which was driving the initial association. Christie finishes:
Given the importance of trust in human interactions, from friendships to business partnerships or even romance, these findings pose some interesting evolutionary questions. Why would certain face shapes seem more dangerous? Why would blue-eyed face shapes persist, even when they are not deemed as trustworthy? Are our behaviors linked to our bodies in ways we have yet to understand? There are no easy answers. Face shape and other morphological traits are partially based in genetics, but also partially to environmental factors like hormone levels in the womb during development. In seeking to understand how we perceive trust, we can learn more about the interplay between physiology and behavior as well as our own evolutionary history.
While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.
Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.
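The contrast can be made concrete with a small sketch, under textbook assumptions that are mine rather than anything spelled out above: random mating, no selection, no mutation, no drift. If offspring are the average of two unrelated parents, the variance of the trait halves every generation. With a particulate locus, one round of random mating returns the same allele frequency, so the variation is not eroded.

```python
def blending_variance(v0, generations):
    """Trait variance left after blending inheritance under random mating.

    Offspring = (mother + father) / 2 with independent parents, so
    Var(offspring) = (V + V) / 4 = V / 2 in each generation.
    """
    return v0 / (2 ** generations)

def next_allele_frequency(p):
    """Frequency of allele A after one round of random mating (particulate)."""
    q = 1.0 - p
    # Hardy-Weinberg genotype frequencies for AA and Aa (aa carries no A).
    freq_AA, freq_Aa = p * p, 2 * p * q
    # Each AA contributes only A; each Aa contributes A half of the time.
    return freq_AA + freq_Aa / 2

for gen in (0, 1, 5, 10):
    print(gen, blending_variance(1.0, gen))

p = 0.3
for _ in range(10):
    p = next_allele_frequency(p)
print(p)  # stays at 0.3 up to floating point rounding
```

After ten generations blending leaves less than a thousandth of the starting variance, which is why it would need an implausible supply of new variation every generation to keep selection going.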
The above image, and the one to the left, are screenshots from my father’s 23andMe profile. Interestingly, his mtDNA haplogroup is not particularly common among ethnic Bengalis, more than 80% of whom are on a branch of M. This reality is clear in the map above, which illustrates the Central Asian distribution of my father’s mtDNA lineage. In contrast, his whole genome is predominantly South Asian, as is evident in the estimate that 23andMe provided via their ancestry composition feature, which utilizes the broader genome. The key takeaway here is that the mtDNA is informative, but it should not be considered to be representative, or anything like the last word on one’s ancestry in this day and age.
A few days ago I was browsing Haldane’s Sieve, when I stumbled upon an amusing discussion which arose on its “About” page. This “inside baseball” banter got me to thinking about my own intellectual evolution. Over the past few years I’ve been delving more deeply into phylogenetics and phylogeography, enabled by the rise of genomics, the proliferation of ‘big data,’ and accessible software packages. This entailed an opportunity cost. I did not spend much time on classical population and evolutionary genetic questions. Strewn about my room are various textbooks and monographs I’ve collected over the years, and which have fed my intellectual growth. But I must admit that it is a rare day now that I browse Hartl and Clark or The Genetical Theory of Natural Selection without specific aim or mercenary intent.
Like a river inexorably coursing over a floodplain, with the turning of the new year it is now time to take a great bend, and double-back to my roots, such as they are. This is one reason that I am now reading The Founders of Evolutionary Genetics. Fisher, Wright, and Haldane are like old friends, faded, but not forgotten, while Muller was always but a passing acquaintance. But ideas 100 years old still have power to drive us to explore deep questions which remain unresolved, but where new methods and techniques may shed greater light. A study of the past does not allow us to make wise choices which can determine the future with any certitude, but it may at least increase the luminosity of the tools which we have to illuminate the depths of the darkness. The shape of nature may become just a bit less opaque through our various endeavors.
There’s an interesting piece in Slate, The Great Schism in the Environmental Movement, which seems to be a distillation of trends which have been bubbling within the modern environmentalist movement for a generation now (I’ve read earlier manifestos in a similar vein). I can’t assess the magnitude of the shift, but here’s the top-line:
But that is a false construct that scientists and scholars have been demolishing the past few decades. Besides, there’s a growing scientific consensus that the contemporary human footprint—our cities, suburban sprawl, dams, agriculture, greenhouse gases, etc.—has so massively transformed the planet as to usher in a new geological epoch. It’s called the Anthropocene.
Modernist greens don’t dispute the ecological tumult associated with the Anthropocene. But this is the world as it is, they say, so we might as well reconcile the needs of people with the needs of nature. To this end, Kareiva advises conservationists to craft “a new vision of a planet in which nature—forests, wetlands, diverse species, and other ancient ecosystems—exists amid a wide variety of modern, human landscapes.”
In the post below I offered up my supposition that Dan MacArthur’s ancestry is unlikely to be Northwest Indian, which precludes a Romani origin for his South Asian ancestry. Indeed this is almost certainly so: Dienekes Pontikos followed up my crude analyses with IBD-sharing calculations (IBD = ‘identity by descent,’ which is basically what you would think it is). The South Asian population which MacArthur has the closest affinity to is from Karnataka, which is one of the Dravidian speaking states of the South. This does not necessarily refute my earlier contention, as aside from Brahmins most Bengalis seem to have broad South Indian affinities, except for the fact that they often have more East Asian ancestry.
The New Republic has a piece up, How Older Parenthood Will Upend American Society, which won’t have surprising data for readers of this weblog. But it’s nice to see this sort of thing go “mainstream.” My daughter was born when her parents were in their mid-30s, so I know all the statistics. They aren’t good bed-time reading (she’s healthy and robust so far!). If I had to do it over again I definitely wouldn’t have waited this long. Becoming a father brought home to me that waiting was one of the worst decisions of my life. Why postpone something this incredible for the far more prosaic pleasures of an extended adolescence? Granted, I’m not sure that I would have been the best father at 25, but I don’t think there’s much I can say in reply to the argument that I should have become a father by 30.
More concretely, we would have had sperm and egg “banked” if we had been smart about delaying parenthood. The article notes that storage of sperm costs $850 up front, and $300 to $500 per year after that, and that many balk at the cost. And how much do you spend on your cell phone every year? The issue here seems to be time preference.
Most people are aware that altitude imposes constraints on individual performance and function. Much of this is flexible; athletes who train at high altitudes may gain a performance edge. But over the long term there are costs, just as there are with computers which are ‘overclocked.’ This is the point where you make the transition from physiology to evolution. Residence at high altitude entails strong selective pressures on populations. Over the past few years there has been a great deal of exploration of the genetics of long resident high altitude groups, the Tibetans, Peruvians, and Ethiopians.
My initial inclination in this post was to discuss a recent ordering snafu which resulted in many of my friends being quite peeved at 23andMe. But browsing through their new ‘ancestry composition’ feature I thought I had to discuss it first, because of some nerd-level intrigue. Though I agree with many of Dienekes’ concerns about this new feature, I have to admit that at least this method doesn’t give out positively misleading results. For example, I had complained earlier that ‘ancestry painting’ gave literally crazy results when they weren’t trivial. It said I was ~60 percent European, which makes some coherent sense in their non-optimal reference population set, but then stated that my daughter was >90 percent European. Since 23andMe did confirm she was 50% identical by descent with me these results didn’t make sense; some readers suggested that there was a strong bias in their algorithms to assign ambiguous genomic segments to ‘European’ heritage (this was a problem for East Africans too).
Here’s my daughter’s new chromosome painting:
One aspect of 23andMe’s new ancestry composition feature is that it is very Eurocentric. But, most of the customers are white, and presumably the reference populations they used (which are from customers) are also white. Though there are plenty of public domain non-white data sets they could have used, I assume they’d prefer to eat their own data dog-food in this case. But that’s really a minor gripe in the grand scheme of things. This is a huge upgrade from what came before. Now, it’s not telling me, as a South Asian, very much. But, it’s not telling me ludicrous things anymore either!
But in regards to omission I am curious to know why this new feature rates my family as only ~3% East Asian, when other analyses put us in the 10-15% range. The problem with very high values is that South Asians often have some residual ‘eastern’ signal, which I suspect is not real admixture, but is an artifact. Nevertheless, northeast Indians, including Bengalis, often have genuine East Asian admixture. On PCA plots my family is shifted considerably toward East Asians. The signal they are picking up probably isn’t noise. Almost every apportionment of East Asian ancestry I’ve seen for my family yields a greater value for my mother, and that holds here. It’s just that the values are implausibly low.
In any case, that’s not the strangest thing I saw. I was clicking around people who I had “shared” genomes with, and I stumbled upon this:
As you can guess from the screenshot this is Daniel MacArthur’s profile. And according to this ~25% of chromosome 10 is South Asian! On first blush this seemed totally nonsensical to me, so I clicked around other profiles of people of similar Northern European background…and I didn’t see anything equivalent.
What to do? It’s going to take more evidence than this to shake my prior assumptions, so I downloaded Dr. MacArthur’s genotype. Then I merged it with three HapMap populations, the Utah whites (CEU), the Gujaratis (GIH), and the Chinese from Denver (CHD). The last was basically a control. I pulled out chromosome 10. I also added Dan’s wife Ilana to the data set, since I believe she got typed with the same Illumina chip, and is of similar ethnic background (i.e., very white). It is important to note that only 28,000 SNPs remained in the data set. But usually 10,000 is more than sufficient on SNP data for model-based clustering with inter-continental scale variation.
I did two things:
1) I ran ADMIXTURE at K = 3, unsupervised
2) I ran an MDS, which visualized the genetic variation in multiple dimensions
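The workflow described above (merge, pull out chromosome 10, ADMIXTURE at K = 3, MDS) might look roughly like the following command sequence. This is a sketch under assumptions: it uses PLINK 1.x binary filesets and the standard ADMIXTURE command line, every file stem is a hypothetical placeholder, and the exact commands actually used are not given in the post. The script only prints the commands; it does not run them.

```python
import shlex

# Hypothetical file stems; none of these names come from the post.
HAPMAP = "hapmap_ceu_gih_chd"      # reference panel (CEU, GIH, CHD)
PERSONAL = "personal_genotypes"    # the downloaded genotypes, already in PLINK format
MERGED = "merged"
CHR10 = "merged_chr10"

commands = [
    # Merge the personal genotypes into the reference panel.
    # In practice strand flips and non-overlapping SNPs often need handling first.
    ["plink", "--bfile", HAPMAP,
     "--bmerge", f"{PERSONAL}.bed", f"{PERSONAL}.bim", f"{PERSONAL}.fam",
     "--make-bed", "--out", MERGED],
    # Keep only chromosome 10.
    ["plink", "--bfile", MERGED, "--chr", "10", "--make-bed", "--out", CHR10],
    # Unsupervised ADMIXTURE at K = 3; writes merged_chr10.3.Q and merged_chr10.3.P.
    ["admixture", f"{CHR10}.bed", "3"],
    # MDS on pairwise IBS distances, keeping two dimensions.
    ["plink", "--bfile", CHR10, "--cluster", "--mds-plot", "2",
     "--out", f"{CHR10}_mds"],
]

for cmd in commands:
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; swapping the print for subprocess.run(cmd, check=True) would execute each step if the tools and files exist.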
Before I go on, I will state what I found: these methods supported the inference from 23andMe. On chromosome 10 Dr. MacArthur seems to have an affinity with South Asians (i.e., this is his ‘curry chromosome’). Here are the average (median) values in tabular format, with MacArthur and his wife presented for comparison.
|ADMIXTURE results for chromosome 10|
|K 1||K 2||K 3|
You probably want a distribution. Out of the non-founder CEU sample none went above 20% South Asian. Though it did surprise me that a few were that high, making it more plausible to me that MacArthur’s results on chromosome 10 were a fluke:
And here’s the MDS with the two largest dimensions:
Again, it’s evident that this chromosome 10 is shifted toward South Asians. If I had more time right now what I’d do is probably get that specific chromosomal segment, phase it, and then compare it to various South Asian populations. But I don’t have time now, so I went and checked out the results from the Interpretome. I cranked up the settings to reduce the noise, and so that it would only spit out the most robust and significant results. As you can see, again chromosome 10 comes up as the one which isn’t quite like the others.
Is there a plausible explanation for this? Perhaps Dr. MacArthur can call up a helpful relative? From what I recall his parents are immigrants from the United Kingdom, and it isn’t unheard of that white Britons do have South Asian ancestry which dates back to the 19th century. Though to be totally honest I’m rather agnostic about all this right now. This genotype has been “out” for years now, so how is it that no one has noticed this peculiarity??? Perhaps the issue is that everyone was looking at the genome wide average, and it just doesn’t rise to the level of notice? What I really want to do is look at the distribution of all chromosomes and see how Daniel MacArthur’s chromosome 10 then stacks up. It might be a random act of nature yet.
Also, I guess I should add that at ~1.5% South Asian that would be consistent with one of MacArthur’s great-great-great-great grandparents being Indian. Assuming 25 year generation times that puts them in the mid-19th century. Of course, at such a low proportion the variance is going to be high, so it is quite possible that you need to push the real date of admixture one generation back, or one generation forward.
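The arithmetic behind that estimate is just repeated halving, and can be sketched as below. The function names are made up, and the 25 year generation time is the assumption stated above. A great-great-great-great grandparent is six generations back (parents are one, grandparents two, and each "great" adds one more).

```python
import math

def expected_fraction(generations_back):
    """Expected genome share inherited from one ancestor n generations back."""
    return 0.5 ** generations_back

def generations_for_fraction(fraction):
    """Nearest whole number of generations matching an observed ancestry share."""
    return round(-math.log2(fraction))

def years_back(generations, generation_time=25):
    """Rough time depth before the descendant's birth, in years."""
    return generations * generation_time

for n in (5, 6, 7):
    print(n, f"{expected_fraction(n):.4%}", years_back(n), "years")
```

Six generations gives 1/64, or about 1.56%, roughly 150 years before the descendant's birth. One generation either way gives about 3.1% or 0.78%, and because recombination passes on segments unevenly, the realized share at this depth can differ a good deal from the expectation.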
In many cases there are questions of a historical and ethnographic nature which are subject to controversy and debate. Scholarly arguments are laid out, and further dispute ensues. For decades progress seems fleeting, as one hypothesis is accepted, only to be subject to later revision. This sort of pattern gives succor to the most cynical and jaded of the ‘Post Modern’ set, especially when the ‘discourse’ in question is in the domain of science.
But thankfully these debates can come to an end in some cases. So it is with the origins of the European Romani, better known as ‘Gypsies’ (though the Roma are the most well known of the Romani, other groups within Europe have different ethnonyms). Obviously many of the basic elements have long been there, but I think the most recent genetic work now establishes a level of closure. Taking a step back, what do we know?
1) The Romani language seems to be Indo-Aryan, with a likely affinity with the northwest group of Indo-Aryan languages
2) The Romani presence in Europe only dates to the past ~1,000 years, with an entry point in the Byzantine Empire
3) They are an admixture between an ancestral Indian element, and local populations
4) Their history of endogamy has resulted in a strong genetic drift effect
The two papers which seem to nail the coffin shut on these questions use somewhat different methodologies. One relies on Y chromosomal STRs (hypervariable repeat regions) to generate a paternal phylogeny. Focusing just on the paternal phylogeny allows for one to make very robust genealogical inferences. Additionally, the authors had a very large data set across India. Their goal was to ascertain the exact region of origin of the Romani before they left India. As noted in bullet #1 there is already some evidence from their language that this must be in northwest India. The second paper uses a SNP-chip; hundreds of thousands of autosomal markers. This has been done to death for other populations, so the method isn’t new. Rather, it is that it is now being applied to the Romani.
First, the Y chromosomal paper. The Phylogeography of Y-Chromosome Haplogroup H1a1a-M82 Reveals the Likely Indian Origin of the European Romani Populations:
Linguistic and genetic studies on Roma populations inhabited in Europe have unequivocally traced these populations to the Indian subcontinent. However, the exact parental population group and time of the out-of-India dispersal have remained disputed. In the absence of archaeological records and with only scanty historical documentation of the Roma, comparative linguistic studies were the first to identify their Indian origin. Recently, molecular studies on the basis of disease-causing mutations and haploid DNA markers (i.e. mtDNA and Y-chromosome) supported the linguistic view. The presence of Indian-specific Y-chromosome haplogroup H1a1a-M82 and mtDNA haplogroups M5a1, M18 and M35b among Roma has corroborated that their South Asian origins and later admixture with Near Eastern and European populations. However, previous studies have left unanswered questions about the exact parental population groups in South Asia. Here we present a detailed phylogeographical study of Y-chromosomal haplogroup H1a1a-M82 in a data set of more than 10,000 global samples to discern a more precise ancestral source of European Romani populations. The phylogeographical patterns and diversity estimates indicate an early origin of this haplogroup in the Indian subcontinent and its further expansion to other regions. Tellingly, the short tandem repeat (STR) based network of H1a1a-M82 lineages displayed the closest connection of Romani haplotypes with the traditional scheduled caste and scheduled tribe population groups of northwestern India.
Two trees illustrate the results succinctly:
The bottom line:
- This particular Y chromosomal lineage which is highly diagnostic of South Asian origin in the Romani shows that the Romani seem to derive from the populations of northwest India
- Additionally, within these populations the Romani Y chromosomal lineages derive from the lower caste elements, the scheduled castes and scheduled tribes
But the above results don’t get directly at genome-wide admixture. The second paper does, using hundreds of thousands of markers to explore the Romani affinity to other populations. Reconstructing the Population History of European Romani from Genome-wide Data:
The Romani, the largest European minority group with approximately 11 million people…constitute a mosaic of languages, religions, and lifestyles while sharing a distinct social heritage. Linguistic…and genetic…studies have located the Romani origins in the Indian subcontinent. However, a genome-wide perspective on Romani origins and population substructure, as well as a detailed reconstruction of their demographic history, has yet to be provided. Our analyses based on genome-wide data from 13 Romani groups collected across Europe suggest that the Romani diaspora constitutes a single initial founder population that originated in north/northwestern India ∼1.5 thousand years ago (kya). Our results further indicate that after a rapid migration with moderate gene flow from the Near or Middle East, the European spread of the Romani people was via the Balkans starting ∼0.9 kya. The strong population substructure and high levels of homozygosity we found in the European Romani are in line with genetic isolation as well as differential gene flow in time and space with non-Romani Europeans. Overall, our genome-wide study sheds new light on the origins and demographic history of European Romani.
The plot to the left illustrates the relationship of the Romani to world-wide populations using multi-dimensional scaling, where genetic variation is decomposed into dimensions, and individuals are plotted on those dimensions. In short, the Romani exhibit a classic admixture cline pattern. That is, they are the products of a two-way admixture between populations which occupy distinct positions along a cline, and Romani individuals and populations are distributed along the cline in proportion to their admixture. One notable aspect is that the Romani are actually two clusters; one which manifests a strong ‘east’-‘west’ distribution, and another which seems located purely within the European cluster. The latter seems to be the Welsh Romani, who in the neighbor-joining tree (see the supplements) fall on the same branch as European populations, as opposed to the other Romani, who form their own clade.
To drill down further you need to ascertain admixture with a model-based clustering algorithm. Ergo, ADMIXTURE. I’ve reedited the figure to illustrate the salient points. In particular, it is clear that the Roma populations except the Welsh have significant South Asian ancestry. The question is how much? To answer this question you need to know the source population in South Asia. A peculiar aspect of this plot is that the Romani have very little of the green ancestral component, which happens to be modal in the Middle East (not shown). This element happens to be highly enriched in many Pakistani populations, but not necessarily northwest Indian ones. Nevertheless, the issue that leaves me suspicious of this particular finding is that many of the European populations, in particular those groups (e.g., Balkans) which may have admixed with the Romani, have this element to an extent not evident in one of their presumed ‘daughter’ populations. I wonder if perhaps the peculiarities of Romani inbreeding have skewed the allele frequency distribution so much that you get strangeness like this. I am not showing higher K’s because those break out with a Romani-cluster. Just like the Kalash-cluster this is to a great extent a feature of the long term endogamy of these communities. With high levels of drift the allele frequency of these groups moves into a very peculiar space in relation to their parental populations, but one must not become confused and assume that the Romani or Kalash are themselves appropriate independent clusters in the same way that Europeans or East Asians are.
Using various forms of admixture analysis the authors seem to conclude that the Balkan Romani are 30-50% South Asian. This seems in line with intuition. But that still leaves open the question of who those South Asians were. As I noted above the most thorough Y chromosomal data point to the lower caste elements of northwest India. What do the autosomes say?
I don’t want to get into the technical details of how they tested the models, but it seems that one of the likely parental populations to the Romani had a close relationship to the Meghwal, a scheduled caste from northwest India. In other words, the autosome results align very well with the Y chromosomal inferences. Additionally, the models tested imply that the Romani likely left South Asia ~1,000 years before the present, which aligns well with what is known from the historical record (though this is a case where I put much more stock in the historical record than inferences from population genetic models; look at the intervals).
Finally, there is the question of inbreeding. One aspect of the Romani genome that jumps out at you is that they have many long “runs-of-homozygosity” (ROH). This is totally expected, as decades of uniparental analyses suggested a great deal of population bottleneck events as the Romani spread throughout Europe. But the ROH patterns also unearth an interesting fact: some of the Balkan Romani clearly have recent European admixture, while the non-Balkan Romani had an initial period of admixture followed by endogamy. The latter scenario seems to resemble Ashkenazi Jews, while the former would suggest that the boundary between Romani and non-Romani in the Balkans is more fluid than is sometimes portrayed.
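For readers unfamiliar with the idea, a run of homozygosity is just a long stretch of consecutive markers at which an individual carries two identical alleles. The toy sketch below is my own illustration, not the method used in the paper: it counts runs by number of SNPs, whereas real ROH callers work with physical length and tolerate occasional heterozygous or missing calls. Genotypes are coded as 0, 1, or 2 copies of the alternate allele, so 1 is heterozygous.

```python
def runs_of_homozygosity(genotypes, min_snps=3):
    """Return (start, end) index pairs of homozygous runs, end exclusive.

    genotypes: sequence of alternate-allele counts (0 or 2 = homozygous,
    1 = heterozygous). Any other value, e.g. None for missing, ends a run.
    """
    runs, start = [], None
    for i, g in enumerate(genotypes):
        if g in (0, 2):
            if start is None:
                start = i          # a new run begins here
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None           # heterozygous or missing call ends the run
    # Close a run that extends to the last marker.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes)))
    return runs

# Made-up genotypes along one chromosome.
toy = [0, 2, 2, 1, 0, 0, 1, 2, 2, 2, 2]
print(runs_of_homozygosity(toy))
```

Many long runs point to parents who share recent common ancestors, which is the signature expected from bottlenecks followed by endogamy.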
So there we have it. The Romani derive from lower caste populations from the northwest Indian subcontinent who seem to have left ~1,000 years ago. Over time they admixed with local populations, and are now 50-70% non-South Asian, with some groups being ~90% European (e.g., Welsh Romani). And, they have a long history as an endogamous group, judging by their inbreeding.
One of the primary concerns/questions I had about Luca Pagani’s paper on the genetic origin of Ethiopians is that he found that their West Eurasian ancestor was closer to Levantine than Arabian. I was confused by this because on model-based clustering (e.g., Admixture) when you push down to a fine level of granularity you always see that the Ethiopians cluster with the Yemenis for their non-African ancestry. More precisely, Yemeni Jews are often ~100% component X, which makes up ~50% of the ancestry of Ethiopians.
From what I recall Pagani et al. used haplotype windows which they assigned to Eurasian or African ancestral components, and they compared these to the populations related to the putative ancestral groups. Because Pagani et al. used blocks of the genome, rather than just specific genotypes, I weight their finding more strongly. But I wanted to double check with TreeMix if the finding in Admixture was peculiar.
So again, I took a ~150,000 SNP set and ran it on TreeMix with migration = 5.
Again, you see that the gene flow to the Ethiopians is coming from a position on the tree rather close to Yemenite Jews. One model which may explain this, and still align with Pagani’s findings, is that Arabians themselves are a synthetic population. A “pure” Yemenite Jew may have ancient admixture of African affinity beneath an intrusive element from the north. The parallelism between Ethiopia and Arabia in this model is clear, with the major difference being magnitude of the source population admixture (greater in Arabia), as well as some differences of the target population.
This again reminds us to be careful about trusting first-blush summaries.
As a follow up to my post from yesterday, I decided to run TreeMix on a data set I happened to have had on hand (see Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data for more on TreeMix). Basically I wanted to display a tree with, and without, gene flow.
The technical details are straightforward. I LD pruned ~550,000 SNPs down to ~150,000. I ran TreeMix without and with migration parameters, with the Bantu Kenya population as the root. Finally, when I did turn on the migration parameter I set it to 5. You can see the results below.
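If you want to reproduce the two runs, a minimal Python sketch for assembling the TreeMix command lines might look like this. The `-i`, `-o`, `-root` and `-m` options are TreeMix’s own as far as I know; the input file name, output stems and the `BantuKenya` population label are placeholders I made up, so substitute whatever your data actually use.

```python
from typing import List


def treemix_command(input_gz: str, out_stem: str, root: str, migrations: int = 0) -> List[str]:
    """Build an argument list for one TreeMix run.

    input_gz:   gzipped TreeMix-format allele count file (hypothetical name).
    out_stem:   prefix for TreeMix's output files (hypothetical name).
    root:       population label to place at the root of the tree.
    migrations: number of migration edges; 0 gives a plain tree.
    """
    cmd = ["treemix", "-i", input_gz, "-root", root, "-o", out_stem]
    if migrations > 0:
        cmd += ["-m", str(migrations)]
    return cmd


# One run without gene flow, one with five migration edges.
no_mig = treemix_command("mydata.treemix.gz", "tree_m0", "BantuKenya")
with_mig = treemix_command("mydata.treemix.gz", "tree_m5", "BantuKenya", migrations=5)

print(" ".join(no_mig))
print(" ".join(with_mig))
```

The printed lines can be pasted into a shell, or the lists handed to Python’s `subprocess.run` if TreeMix is on your path.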
Most of the flows are pretty much as expected. The West Eurasian flow from the Turks to the Uygurs makes sense, because there is a large West Asian component to what the Uygurs have (from East Iranians?). The Chuvash are a Turkic group with a minor, but significant, Turkic component. The HGDP Russian sample does have some East Eurasian ancestry. And the Moroccans also have African ancestry. But your guess is as good as mine with the Bantu flow in. These are, I think, Kenyan, so it might be trying to interpret Nilotic admixture as generalized Eurasian.
A minor note: installing TreeMix and generating the appropriate files from pedigree format is not too difficult. But you might be confused about how to generate the pedigree input file. You do it like so in PLINK:
./plink --noweb --bfile YourFile --freq --within YourGroupNamesFile --out YourOutPutFile
It’s the last of these, the output file, that you want to put into TreeMix’s Python conversion script. The YourGroupNamesFile is basically the .fam file with an extra column, the population names for each individual.
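A short Python sketch of building that group names file from a .fam file follows. One caveat: as far as I know PLINK’s `--within` reads family ID, individual ID and cluster name from the first three columns, so this writes just those three rather than the whole .fam row; check the documentation for your PLINK version. The function name and the `population_of` lookup (individual ID to population label) are hypothetical, and you have to supply the labels yourself.

```python
from typing import Dict, Iterable, List


def fam_to_within(fam_lines: Iterable[str], population_of: Dict[str, str]) -> List[str]:
    """Turn .fam rows into 'FID IID POPULATION' rows for PLINK's --within.

    fam_lines:     lines of a PLINK .fam file (FID IID PAT MAT SEX PHENO).
    population_of: maps each individual ID to its population label
                   (assumed to be provided by you; not part of the .fam).
    """
    rows = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fid, iid = fields[0], fields[1]
        rows.append(f"{fid} {iid} {population_of[iid]}")
    return rows


# Tiny made-up example: two individuals, two populations.
fam = [
    "F1 ind1 0 0 1 -9",
    "F2 ind2 0 0 2 -9",
]
labels = {"ind1": "Yemenite_Jew", "ind2": "Ethiopian"}

for row in fam_to_within(fam, labels):
    print(row)
```

Write the resulting rows to a text file and pass that file name to `--within`.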
I mentioned this in passing on my post on ASHG 2012, but it seems useful to make explicit. For the past few years there has been word of research pointing to connections between the Khoisan and the Cushitic people of Ethiopia. To a great extent the forthcoming paper likely answers the question of who lived in East Africa before the Bantu, and before the most recent back-migration of West Eurasians. On one level I’m confused as to why this has to be something of a mystery, because the most recent genetic evidence suggests an admixture on the order of 2,000-3,000 years before the present.* If the admixture was so recent we should find many of the “first people,” no? As it is, we don’t. I think these groups, and perhaps the Sandawe, are the closest we’ll get.
Publication is imminent at this point (of this, I was assured), so I’m going to just state the likely candidate population (or at least one of them): the Sanye, who speak a Cushitic language with possible Khoisan influences. There really isn’t that much information on these people, which is why when I first heard about the preliminary results a few years back and looked around for Khoisan-like populations in Kenya I wasn’t sure I’d hit upon the right group. But at ASHG I saw some STRUCTURE plots with the correct populations, and the Sanye were one of them. I would have liked to see something like TreeMix, but the STRUCTURE results were of a quality that I could accept that these populations were not being well modeled by the variation which dominated their data set. Though Cushitic in language the Sanye had far less of the West Eurasian element present among other Cushitic speaking populations of the Horn of Africa. Neither were their African ancestral components quite like that of the Nilotic or Bantu populations. The clustering algorithm was having a “hard time” making sense of them (it seemed to want to model them as linear combinations of more familiar groups, but was doing a bad job of it).
Here is an interesting article on these groups: Little known tribe that census forgot. Like the Sandawe this is a population which seems to have been hunter-gatherers very recently, and to some extent still engages in this lifestyle. In this way I think they are fundamentally different from Indian tribal populations, who are often held up to be the “first people” of the subcontinent. More and more it seems that the tribes of India are less the descendants of the original inhabitants of the subcontinent, at least when compared to the typical Indian peasant, and more simply those segments of the Indian population which were marginalized and pushed into less productive territory. Over time they naturally diverged culturally because of their isolation, but the difference was not primal. In contrast, groups like the Sanye and Sandawe may have mixed to a great extent with their neighbors (and lost their language like the Pygmies), but evidence of full-featured hunting & gathering lifestyles implies a sort of direct cultural continuity with the landscape of eastern Africa before the arrival of farmers and pastoralists from the west and north.
* I understand some readers refuse to accept the likelihood of these results because of other lines of information. I am just relaying the results of the geneticists. I am not interested in re-litigating prior discussions on this. We’ll probably have a resolution soon enough.