For the past year or so I’ve been getting queries about what I think about Eran Elhaik’s preprint on the genetic character of European Jews. I found some of the conclusions frankly a little weird, but I assumed that things would be cleaned up for publication. Well, it’s been out for a while now: The Missing Link of Jewish European Ancestry: Contrasting the Rhineland and the Khazarian Hypotheses. But some reporting in The Jewish Daily Forward has brought the author and his detractors a bit into the spotlight. The reason is that, as you can tell from the title, the author takes a position on the Khazarian origin model of Ashkenazi Jews (in favor). Here is a non-genetic take over at GeoCurrents, the thrust of which I basically concur with.
In any case, many of the problems with the paper remain. Really it all begins and ends here:
Because of Angelina Jolie’s revelation, the Myriad Genetics case is in the news again. If you don’t know what I’m talking about, look it up. Because of the patent Myriad can charge thousands of dollars for a test which would otherwise be much cheaper (putting it out of reach of many without health insurance). My question here is simple: if you are a geneticist do you think Myriad’s position has any validity? The reason I ask is that I know many geneticists, and I know many geneticists read me, and I follow many geneticists on Twitter, but I’ve never encountered one who would be willing to defend Myriad’s position as plausible and passing the smell test. If you are one of those geneticists please leave a comment, because I’m honestly curious.
I went to the talks about the Myriad case at ASHG, and I have to say it was all law, and no science. The science was confused and laughable. The panelists themselves rolled their eyes and expressed resignation as to the garbled ratiocinations of the judges who reviewed the case. There is a classic “two cultures” problem.
Standard apologies that I have not had the marginal time to blog much, but I thought it was important that I at least note that Dr. Peter Ralph and Dr. Graham Coop’s paper on identity-by-descent segments and European populations and history is out in its final form in PLoS Biology, The Geography of Recent Genetic Ancestry across Europe. I’ve been familiar with the outlines of these results for about a year now, and to be frank I am still digesting them. The media hype will come and go, with true but to some extent trivial headlines that “all Europeans are related,” but the consequences of these sorts of genetic inquiries into the relatedness of populations are going to be long lasting. At least they should be.
But before I go on about that, if you find the paper itself a bit daunting (though the main body of the text strikes me as eminently readable for a piece of statistical genetics), see Carl Zimmer’s condensation. With this sort of result there is liable to be confusion, so note that Graham Coop has been posting comments on Carl’s blog (and elsewhere, and you can always send him a note on Twitter). Additionally he has a very readable FAQ out. Dr. Coop told me on Twitter that there would even be updates tomorrow as well! In particular one aspect of the paper which I noticed is that most relatively short but detectable segments (~10 cM) shared between any two individuals in many nationalities are not going to be evidence of recent genealogical affinities, but of deeper historical process.
A few years back I was rather fixated on issues of maternal fetal health. In particular I was worried about gestational diabetes in relation to my wife because I come from an ethnic group with an elevated risk for these sorts of problems, and the effect when you are in mixed-race marriages seems to be additive (i.e., unlike some risk factors associated with pregnancies the mother’s ethnicity is not the only relevant variable). This is embedded in the broader suite of metabolic diseases which exhibit ethnic variation. Early work on genome-wide selection in humans yielded the result that there was a strong enrichment for signals of adaptation within regions of the genome associated with metabolism, so this should not be that surprising. Humans are a geographically dispersed species that inhabits a wide range of environments, so natural selection would shape the distribution of phenotypes within populations if evolution is a significant historical process (it is).
A paper in last month’s Trends in Genetics highlights more precisely how natural selection would operate in a life history context in specific cases. Many ways to die, one way to arrive: how selection acts through pregnancy:
When considering selective forces shaping human evolution, the importance of pregnancy to fitness should not be underestimated. Although specific mortality factors may only impact upon a fraction of the population, birth is a funnel through which all individuals must pass. Human pregnancy places exceptional energetic, physical, and immunological demands on the mother to accommodate the needs of the fetus, making the woman more vulnerable during this time-period. Here, we examine how metabolic imbalances, infectious diseases, oxygen deficiency, and nutrient levels in pregnancy can exert selective pressures on women and their unborn offspring. Numerous candidate genes under selection are being revealed by next-generation sequencing, providing the opportunity to study further the relationship between selection and pregnancy. This relationship is important to consider to gain insight into recent human adaptations to unique diets and environments worldwide.
Yesterday I pointed to a paper which was interesting enough, but didn’t pass the smell test in relation to other evidence we have (at least in my opinion!). A primary concern was the fact that uniparental markers (male and female lineages) show a peculiar distribution of variation in comparison to autosomal genetic variation (i.e., the vast majority of the genome) in the case of Europe (genome-wide analyses suggest more of Europe’s variation is partitioned north-south, but Y and mtDNA results often imply an east-west split). But a secondary concern I had was that I felt the models were a bit too stylized. In particular following Cavalli-Sforza and Ammerman the authors concluded that demic diffusion better fits their results of genetic variation in Europe (as opposed to continuity of Paleolithic hunter-gatherers). This is likely correct, but these are not the only two models.
A paper out in Nature Communications, using analysis of the phylogenetics of whole ancient mitochondrial genomes, outlines my primary concern when it comes to the models being tested, Neolithic mitochondrial haplogroup H genomes and the genetic origins of Europeans:
Haplogroup H dominates present-day Western European mitochondrial DNA variability (>40%), yet was less common (~19%) among Early Neolithic farmers (~5450 BC) and virtually absent in Mesolithic hunter-gatherers. Here we investigate this major component of the maternal population history of modern Europeans and sequence 39 complete haplogroup H mitochondrial genomes from ancient human remains. We then compare this ‘real-time’ genetic data with cultural changes taking place between the Early Neolithic (~5450 BC) and Bronze Age (~2200 BC) in Central Europe. Our results reveal that the current diversity and distribution of haplogroup H were largely established by the Mid Neolithic (~4000 BC), but with substantial genetic contributions from subsequent pan-European cultures such as the Bell Beakers expanding out of Iberia in the Late Neolithic (~2800 BC). Dated haplogroup H genomes allow us to reconstruct the recent evolutionary history of haplogroup H and reveal a mutation rate 45% higher than current estimates for human mitochondria.
Every now and then Richard Dawkins stirs controversy by bringing up the topic of eugenics. This is not surprising in terms of Dawkins’ intellectual pedigree. The most influential British evolutionary biologist in the generation before Dawkins, R. A. Fisher, was a eugenicist. Arguably the most eminent evolutionist of Dawkins’ own generation, W. D. Hamilton, clearly had eugenical sympathies, though he was keenly aware how unfashionable that had become.* University College London’s Galton Laboratory still had the word eugenics in its title until 1965. More recently Dawkins has brought up the issue of consanguinity amongst the British Pakistani community. A practice which one might argue is non-eugenical due to the high rate of recessive diseases.
As recently as 10 years ago one could plausibly talk about mtDNA Eve and Y chromosomal Adam. The “Human Story” might then be stylized into a rapid expansion from a small core East African population which flourished ~100,000 years ago, and engaged in a jailbreak sweep out of Africa and across the rest of the World Island, and beyond, to Oceania and the New World. In the process all other human lineages were extirpated, marginalized, and eliminated, their culture and genes consigned to oblivion. No longer. The origin of our species may have been characterized by several admixture events with “other” lineages, both within, and outside of, Africa. Instead of a bifurcating tree, imagine a graph with reticulation. A phylogenetic tree with a light, but noticeable lattice scaffold, tying together disparate branches.
There’s an excellent paper up at Cell right now, Modeling Recent Human Evolution in Mice by Expression of a Selected EDAR Variant. It synthesizes genomics, computational modeling, as well as the effective execution of mouse models to explore non-pathological phenotypic variation in humans. It was likely due to the last element that this paper, which pushes the boundary on human evolutionary genomics, found its way to Cell (and the “impact factor” of course).
The focus here is on EDAR, a locus you may have heard of before. By fiddling with the EDAR locus researchers had earlier created “Asian mice.” More specifically, mice which exhibit a set of phenotypes which are known to distinguish East Asians from other populations, specifically around hair form and skin gland development. More generally EDAR is implicated in development of ectodermal tissues. That’s a very broad purview, so it isn’t surprising that modifying this locus results in a host of phenotypic changes. The figure above illustrates the modern distribution of the mutation which is found in East Asians in HGDP populations.
One thing to note is that the derived East Asian form of EDAR is found in Amerindian populations which certainly diverged from East Asians > 10,000 years before the present (more likely 15-20,000 years before the present). The two populations in West Eurasia where you find the derived East Asian EDAR variant are Hazaras and Uyghurs, both likely the products of recent admixture between East and West Eurasian populations. In Melanesia the EDAR frequency is correlated with Austronesian admixture. Not on the map, but also known, is that the Munda (Austro-Asiatic) tribal populations of South Asia also have low, but non-trivial, frequencies of East Asian EDAR. In this they are exceptional among South Asian groups without recent East Asian admixture. This lends credence to the idea that the Munda are descendants in part of Austro-Asiatic peoples intrusive from Southeast Asia, where most Austro-Asiatic languages are present.
Yesterday I re-ran Plink with a narrower European-biased data set, and generated some MDS plots. I only had a few Asian and African populations, mostly so that I could replicate the standard dimensions 1 and 2, producing the classic “v-shape” which you’ve seen before. But what’s more interesting are lower coordinates. They may not capture as much of the variation in the distance matrix, but illustrate important dynamics. I haven’t used the directlabels package yet, so right now the labels are still imperfect. I’m giving black text as well as colored text. Also, here’s the original data (as in MDS results, not the raw data).
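To make the point about lower coordinates concrete, here is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. It is not Plink's implementation. The function name `classical_mds`, the toy points, and the "share of variation" measure (each positive eigenvalue divided by the sum of positive eigenvalues) are my own illustrative assumptions.

```python
import numpy as np

def classical_mds(dist, n_dims=4):
    """Classical (Torgerson) MDS on a symmetric distance matrix.

    Returns (coords, share): coords has one row per individual and one
    column per dimension; share is the fraction of the summed positive
    eigenvalues carried by each returned dimension.
    """
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    # Double-center the squared distances to get a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)      # ascending order
    order = np.argsort(eigvals)[::-1]            # largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = np.clip(eigvals, 0.0, None)       # drop negative noise
    coords = eigvecs[:, :n_dims] * np.sqrt(positive[:n_dims])
    share = positive[:n_dims] / positive.sum()
    return coords, share

# Toy data (hypothetical): five points spread unevenly along three axes.
points = np.array([[0, 0, 0], [4, 0, 0], [0, 2, 0], [0, 0, 1], [4, 2, 1]],
                  dtype=float)
toy_dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
toy_coords, toy_share = classical_mds(toy_dist, n_dims=3)
```

The later entries of `toy_share` are smaller by construction, which is the sense in which dimensions 3 and up "capture less" while still separating points that dimensions 1 and 2 leave on top of each other.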
A reader points me to a talk given by David Reich at the Center for Human Genetic Research 2013 Retreat. One of the issues Reich brought up is old, but perhaps worth reemphasizing: due to endogamy many South Asians carry a higher load of recessive ailments. This is not due to recent inbreeding (which is barred by custom in many South Asian groups, which enforce kin-level exogamy), but long term genetic isolation. Over time even a moderate sized population can be affected by drift. This was one of the major points in the 2009 paper Reconstructing Indian History, but not one particularly emphasized in the press follow up. A major implication is that a relatively simple public health measure for South Asians would be to marry outside of their jati. The social or genetic distance need not be great. But one generation of outbreeding should “mask” many of the deleterious alleles. If this model is correct one should be able to track decreases in morbidity within the American South Asian population, where there are many inter-caste and inter-regional marriages (yes, this is between people of putative high status, but this doesn’t matter).
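The "masking" argument is just arithmetic on allele frequencies. Below is a small sketch of it, assuming random mating within each group and a fully recessive disease allele. The two frequencies are hypothetical numbers picked for illustration, not estimates from any paper.

```python
def affected_fraction(q_parent_a, q_parent_b):
    """Expected fraction of children homozygous for a recessive allele.

    Each parent passes on the allele with probability equal to its
    frequency in that parent's group (random mating assumed).
    """
    return q_parent_a * q_parent_b

# Hypothetical frequencies: drift has pushed the allele up inside one
# endogamous group, while it stays rare in a second group.
q_inside = 0.05
q_outside = 0.001

within_group = affected_fraction(q_inside, q_inside)    # both parents from group 1
across_groups = affected_fraction(q_inside, q_outside)  # one parent from each group

print(f"within group:  {within_group:.6f}")
print(f"across groups: {across_groups:.6f}")
print(f"fold reduction: {within_group / across_groups:.0f}x")
```

The children of the cross are still carriers, so this is masking rather than removal. The benefit comes from different groups having drifted to high frequency at different loci.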
With all the crazy talk about George Church and an adventurous young woman conspiring to bring back Neandertals, I do think it is important to keep in mind that we can bring back an individual with a predominantly Neandertal genome in a very old fashioned manner: controlled breeding. The most humane and viable manner in which you might do this is simply start a religion in a Bene Gesserit fashion where the prophesied Kwisatz Haderach is a Neandertal. Over the generations by selecting individuals within the population (which could draw in converts) enriched for Neandertal ancestry to mate assortatively one could slowly increase the proportion of that ancestral component. The population would become more and more “Neandertal,” probably to the point of being phenotypically distinctive in a dozen generations (even a minority of non-modern human ancestry is probably significant, just as many individuals who are 3/4 European and 1/4 African still exhibit features of their minority heritage). One could apply the same logic to the Denisovans.
Most people in South Asia speak one of two varieties of language, Indo-Aryan and Dravidian. These two are not particularly closely related. Indo-Aryan is an Indo-European language, as is evident in the plethora of obvious cognates with other Indo-European dialects. I have a minimal fluency in Bengali, the easternmost of the Indo-European languages, and quite a bit more fluency with English, one of the westernmost, and it was evident to me rather early on (e.g., grass vs. gash, man vs. manush, nose vs. nak). In contrast, to me Dravidian languages are peculiar because the accent and cadence are clearly South Asian, but they are utterly impenetrable (though there are many loan words into Indo-Aryan from Dravidian).
In the links below I alluded to a controversy over the “Neurodiversity movement”. The basic issue is that people with Asperger syndrome and high functioning autism are being accused of putting their concerns above and beyond those of the large number of mentally disabled autistic individuals (some of whom are non-verbal, and exhibit severe cognitive deficits) in the grab for “rights.” Rights here understood as the rights which black Americans, women, and gays have claimed, to be recognized as equal before the law and endowed with the same value in the eyes of society. As a deep philosophical matter I’m skeptical of Rights in a fundamental sense. As a conservative I’m skeptical of the push for a huge array of rights by a plethora of identity groups. Socially recognized rights are valuable, and are cheapened and debased by dispensing them too liberally.
The above is a graph which illustrates phylogenetic relationships using the TreeMix package. It is from the paper I alluded to yesterday. The paper, DNA analysis of an early modern human from Tianyuan Cave, China, is open access, so everyone should be able to read it. Its mtDNA analysis shows that the Tianyuan sample, from the region of Beijing and dating to ~40,000 years B.P., is a basal clade in haplogroup B, which is common in eastern Eurasia and the New World. This is a satisfying result insofar as the understanding in relation to this haplogroup is that it diversified ~50,000 years B.P. There is very strong support in these data for the proposition that Tianyuan forms a distinct clade with the populations you see above, as opposed to western Eurasians. This is important because this sample seems to date with relatively good precision to 40,000 years B.P., supporting the archaeological contention that modern humans were already diversifying into western and eastern lineages 40-50,000 years ago. In contrast statistical genomic inferences tend toward a lower date for divergence. We can be moderately confident at this point that some aspect of the west-east divergence predates later gene flow events, which might confuse archaeology-blind methods.
The above figure is from a paper which leaves me somewhat befuddled, Genome-wide data substantiate Holocene gene flow from India to Australia. The authors ran several hundred thousand SNPs through treemix, and generated the above graph which leads one to the conclusion that there has been significant gene flow from Indian populations to Australia. More precisely, from Dravidian populations to the Aboriginal peoples of Northern Australia. In plain English the authors found the tree which was the best fit to the data, and then they improved it by adding migration across branches which were the poorest fits.
Obviously the whole paper is not going to rest on the above graph. They performed some clustering analysis on the data, which you’ll recognize. PCA and Admixture:
In my earlier posts where I gave a short intro to using Plink I distributed a data set termed PHLYO. One thing I did not mention is that I’ve also been running it on Admixture. But here’s an important point: I ran the data set 10 times from K = 2 to K = 15. Why? Because the algorithm produces somewhat different results on each run (if you use a different seed, which you should), and I wanted to not be biased by one particular result. Additionally, I also turned on cross-validation error, which gives me a better sense of which K’s to trust. But after I select the K which I want to visualize, which replicate run will I then use to generate the bar plots? I won’t pick any specific one. Rather, I’ll merge them together with an off-the-shelf algorithm. Additionally, I also want to sort the individuals by their modal population cluster.
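As a rough illustration of the replicate bookkeeping, here is a sketch that pools cross-validation errors across runs and reports a mean and standard deviation for each K. This is not my actual pipeline. It assumes each run's log contains lines of the form `CV error (K=5): 0.52345`, which is how I recall Admixture printing them, so adjust the pattern if your logs differ. The function name and the sample log text are made up for the example.

```python
import re
import statistics
from collections import defaultdict

# Assumed log line format: "CV error (K=5): 0.52345"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def summarize_cv(log_texts):
    """Return {K: (mean CV error, standard deviation)} across replicate logs."""
    errors_by_k = defaultdict(list)
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            errors_by_k[int(k)].append(float(err))
    summary = {}
    for k in sorted(errors_by_k):
        values = errors_by_k[k]
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[k] = (statistics.mean(values), spread)
    return summary

# Made-up log snippets standing in for three replicate runs.
example_logs = [
    "CV error (K=2): 0.50\nCV error (K=3): 0.40\n",
    "CV error (K=2): 0.52\nCV error (K=3): 0.44\n",
    "CV error (K=2): 0.51\nCV error (K=3): 0.42\n",
]
example_summary = summarize_cv(example_logs)
lowest_k = min(example_summary, key=lambda k: example_summary[k][0])
```

Picking the K with the lowest mean error is only a starting point. When the standard deviations of neighboring K's overlap there is no strong reason to prefer one over the other.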
This sounds rather convoluted, and it is somewhat. I have a pipeline that I use, but it’s not too user friendly. One of my projects is to clean it up, document it, and publish it online. Though if you have your own pipeline all ready to go, please post it in the comments with a link! The general steps are as follows for me:
1) Convert Admixture Q files into Structure format, transform family identifications to numeric values, and generate a file with family identification and numeral pairs
2) Merge the results across runs using Clumpp
3) Sort the individual results within populations
4) Then use Distruct to produce an output file
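Steps 2 and 3 can be sketched in a few lines of plain Python. To be clear, this is a toy stand-in and not Clumpp: it lines up the cluster columns of each replicate against the first run by greedy matching on column similarity, averages the aligned runs, and then sorts individuals inside each population by their modal cluster. All function names here are hypothetical, and greedy matching can pick a poor pairing when clusters are similar, which is one reason to use the real tool.

```python
def align_to_reference(reference, replicate):
    """Reorder replicate's columns to best match reference (greedy matching)."""
    k = len(reference[0])
    scored = []
    for i in range(k):
        for j in range(k):
            # Dot product of reference column i with replicate column j.
            score = sum(ref[i] * rep[j] for ref, rep in zip(reference, replicate))
            scored.append((score, i, j))
    mapping, used = {}, set()
    for _, i, j in sorted(scored, reverse=True):
        if i not in mapping and j not in used:
            mapping[i] = j
            used.add(j)
    return [[row[mapping[i]] for i in range(k)] for row in replicate]

def average_runs(runs):
    """Align every run to the first one, then average memberships."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    n, k = len(reference), len(reference[0])
    return [[sum(run[ind][c] for run in aligned) / len(aligned)
             for c in range(k)] for ind in range(n)]

def sort_within_populations(pop_labels, q_rows):
    """Keep populations in first-seen order; within each, sort individuals
    by modal cluster, then by decreasing membership in that cluster."""
    pop_order = {}
    for pop in pop_labels:
        pop_order.setdefault(pop, len(pop_order))

    def sort_key(item):
        pop, row = item
        modal = max(range(len(row)), key=row.__getitem__)
        return (pop_order[pop], modal, -row[modal])

    return sorted(zip(pop_labels, q_rows), key=sort_key)

# Two toy runs at K=2 whose cluster labels are swapped relative to each other.
run_a = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]
run_b = [[0.1, 0.9], [0.3, 0.7], [0.7, 0.3]]
merged = average_runs([run_a, run_b])
ordered = sort_within_populations(["A", "A", "B"], merged)
```

The label swap between `run_a` and `run_b` is the whole reason a merging step exists: averaging the raw files without aligning them first would smear every individual toward 50/50.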
Before I show you the resultant bar plot, here are the cross-validation results with standard deviation ticks:
While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.
Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.
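The erosion of variation under blending can be written down in a couple of lines. What follows is my own compact restatement and not a quotation from Fisher. It assumes random mating, no new variation entering the population, and, for the particulate case, a single locus with additive effect $a$ and allele frequencies $p$ and $q = 1 - p$.

```latex
% Blending: an offspring z is the average of two independently drawn
% parents x and y, each with variance V_t.
\begin{align*}
z &= \tfrac{1}{2}(x + y), \\
V_{t+1} &= \operatorname{Var}(z) = \tfrac{1}{4}\left(V_t + V_t\right) = \tfrac{1}{2} V_t, \\
V_t &= V_0 \, 2^{-t}.
\end{align*}

% Particulate: with no selection, mutation, or drift the allele frequency
% does not change, so the additive variance does not change either.
\begin{align*}
p_{t+1} &= p_t, \\
V_{t+1} &= 2 p_{t+1} q_{t+1} a^2 = 2 p_t q_t a^2 = V_t.
\end{align*}
```

Under blending the variance is halved every generation, so after ten generations less than a thousandth of it is left. Under particulate inheritance it simply persists.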
The above image, and the one to the left, are screenshots from my father’s 23andMe profile. Interestingly, his mtDNA haplogroup is not particularly common among ethnic Bengalis, who are more than ~80% on a branch of M. This reality is clear in the map above which illustrates the Central Asian distribution of my father’s mtDNA lineage. In contrast, his whole genome is predominantly South Asian, as is evident in the estimate that 23andMe provided via their ancestry composition feature, which utilizes the broader genome. The key takeaway here is that the mtDNA is informative, but it should not be considered to be representative, or anything like the last word on one’s ancestry in this day and age.
The above map shows the population coverage for the Geno 2.0 SNP-chip, put out by the Genographic Project. Their paper outlining the utility and rationale of the chip is now out on arXiv. I saw this map last summer, when Spencer Wells hosted a webinar on the launch of Geno 2.0, and it was the aspect which really jumped out at me. The number of markers that they have on this chip is modest, a bit more than 100,000 on the autosome, with a few tens of thousands more on the X, Y, and mtDNA. In contrast, the Axiom® Genome-Wide Human Origins 1 Array Plate being used by Patterson et al. has ~600,000 SNPs. But as is clear from the map above Geno 2.0 is ascertained in many more populations than the other comparable chips (Human Origins 1 Array uses 12 populations). It’s obvious that if you are only catching variation in a few populations, all the extra million markers may not give you much bang for the buck (not to mention the biases that that may introduce in your population genetic and phylogenetic inferences).
To understand nature in all its complexity we have to cut the riotous variety down to size. For ease of comprehension we formalize with math, verbalize with analogies, and visualize with representations. These approximations of reality are not reality, but when we look through the glass darkly they give us filaments of essential insight. Dalton’s model of the atom is false in important details (e.g., fundamental particles turn out to be divisible into quarks), but it still has conceptual utility.
Likewise, the phylogenetic trees popularized by L. L. Cavalli-Sforza in The History and Geography of Human Genes are still useful in understanding the shape of the human demographic past. But it seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.