Kevin Mitchell of Wiring the Brain has a very long post up inveighing against the specter of eugenics. I don’t have a great deal of time to engage Kevin right now.* But in addition to Kevin’s post I highly recommend this episode of WBUR’s On Point. It has Steve Hsu on, and he articulates many of the positions that I myself hold. Steve’s work with BGI has triggered the latest discussion of eugenics thanks to Vice’s sensational representation of the research project and its aims. But it’s a useful discussion to engage in, even if the starting point is a little unfortunate.
I will state, though, that Kevin’s argument seems to be predicated on the implicit assumption that his interlocutors hold to some sort of Platonic ideal of the most-perfect-human. There’s no such thing obviously, and even those who sympathized with eugenic policies such as W. D. Hamilton rejected this notion at the end of the day. Rather, human traits are evaluated in terms of how they serve the flourishing of individuals and society according to understood values. Intelligence is generally assumed to benefit individuals, and I believe that it benefits society as well through innovation. Innovation drives the productivity growth which is the foundation of our post-Malthusian age.
A paper on the genetics of the Roma (“Gypsies”), Reconstructing Roma History from Genome-Wide Data, has finally come out in a journal. It’s been on arXiv for a while, so nothing too surprising. But, reading through the paper I have to note one rather clear aspect for me: there is a crispness and detail to the way they outlined and integrated their methods into the results section. Unfortunately there is an obvious tendency in the pressure to publish for people to use methods and tools (which usually consist of software written by others which you use in a blackbox fashion) in a slapdash manner with an aim toward arriving at a publishable unit. Because of the specialization within science it seems one can entirely make it through peer review by using methods which signal that one does not really know what one is talking about. To give a concrete example, a year ago I was told about a phylogenetic package in moderate usage which seems to basically be a “random number generator.” The fact that this package is used is a testament to the fact that many researchers who are not phylogeneticists simply reach for the nearest method at hand, and trust the results if they make some intuitive sense (presumably in this case they would simply report the results which were intelligible).
The ultimate future, I’m hoping, is for open data, open code, and open methods. When a shady or sketchy paper makes it through peer review there is now visible public anger which bubbles out of the scientific community, but the process of reproducing the results can still be tedious (see Arsenic life). This is less true in cases where the means are more computational. The only things stopping the process of science from operating more efficiently are human barriers (e.g., cultural norms, institutional barriers toward data release).
Bears are a big deal today. I’ve talked about this before, so I won’t belabor the point in this post. Rather, I want to persuade you that there’s a really interesting paper out in PLOS Genetics right now, Genomic Evidence for Island Population Conversion Resolves Conflicting Theories of Polar Bear Evolution. I know that seems like a mouthful, and despite the fact that I nodded to the reality that this is highly relevant in part because of policy concerns, the paper itself makes salient the reality that oftentimes we are confronted with the juxtaposition between useful abstractions and the empirical shape of the world. In this case the abstraction is that of species, the one taxonomic category which many people find to be a natural kind, so to speak. These sorts of confusions of our expectations are often highly informative. They illustrate the limits of our abstractions, and drive us toward more complex and/or elegant formalisms which are capable of modeling nature as it is, rather than as we wish it would be.
When I read Genome-Wide Diversity in the Levant Reveals Recent Structuring by Culture in PLoS Genetics last week, one of my thoughts was “where is the tree”? Thankfully all the data is online, so I simply ran TreeMix on it. After a number of runs I now understand perhaps why there is no figure emphasizing a tree. There just isn’t that much informative yield from what I can tell, though the basic inference from the paper is recapitulated. You can see the results in the figure above, from one of my TreeMix runs. Overall, what this paper reinforces is that there are sharp genetic distinctions across ethno-religious boundaries within the modern Middle East which confound attempts to use geography to predict variation.
Last month I noted that a paper on speculative inferences as to the phylogenetic origins of Australian Aborigines was hampered in its force of conclusions by the fact that the authors didn’t release the data to the public (more accurately, peers). There are likely political reasons for this in regards to Australian Aborigine data sets, so I don’t begrudge them this (Well, at least too much. I’d probably accept the result more myself if I could test drive the data set, but I doubt they could control the fact that the data had to be private). This is why when a new paper on a novel phylogenetic inference comes out I immediately control-f to see if they released their data. In regards to genome-wide association studies on medical population panels I can somewhat understand the need for closed data (even though anonymization obviates much of this), but I don’t see this rationale as relevant at all for phylogenetic data (if concerned one can remove particular functional SNPs).
This is a follow up to my post from yesterday. In case you care about the technical details (after I clean this stuff up I will put it on GitHub) I’m using R’s adehabitat package to create a 95% distribution curve after smoothing with kernel density. The goal is to give you a better intuition about where the populations are dispersed across two dimensional visualizations of genetic variation.
Thinking about how to plot text, I came up with a quick hack, which just used the initial data and found the median x and y position. That explains why some of the labels are shifted so: in populations with a huge range the label position is going to be sensitive to not being smoothed (if you know how to pull the centroid out of the kver, tell!). I’ve given them colors and also used black. The latter actually seems to be clearer!
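To make the idea concrete, here is a minimal pure-Python sketch of the same two steps: a kernel density estimate with a 95% contour level, and two candidate label positions. It is not the adehabitat workflow used for the plots. The function names, the fixed `bandwidth` argument, and the use of the highest-density sample point as a stand-in for a "centroid" are all assumptions made for illustration.

```python
import math
from statistics import median


def kde_at(point, sample, bandwidth):
    """Gaussian kernel density estimate at one 2D point."""
    px, py = point
    norm = 2 * math.pi * bandwidth ** 2 * len(sample)
    total = 0.0
    for sx, sy in sample:
        d2 = (px - sx) ** 2 + (py - sy) ** 2
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    return total / norm


def contour_level(sample, bandwidth, coverage=0.95):
    """Density level whose contour encloses roughly `coverage` of the sample.

    Assumption: the level is taken as an empirical quantile of the
    densities evaluated at the sample points themselves.
    """
    densities = sorted(kde_at(p, sample, bandwidth) for p in sample)
    cut = int(math.floor((1 - coverage) * len(densities)))
    return densities[cut]


def label_positions(sample, bandwidth):
    """Return (median position, highest-density sample point)."""
    med = (median(x for x, _ in sample), median(y for _, y in sample))
    mode = max(sample, key=lambda p: kde_at(p, sample, bandwidth))
    return med, mode
```

The median is computed per axis and ignores the smoothing, which is why it can land away from the densest part of a widely scattered population. The highest-density point follows the smoothed surface instead, at the cost of depending on the bandwidth.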
Note: This is not just for fun, as I plan to start rolling out results and methods from some of the data sets I have more regularly in the near future.
I’ve been thinking about how best to visualize PCA/MDS type of results, which allow for the two dimensional representation of genetic variation. Below are a few of my efforts with a data set I have. You can see the individuals in gray, but also ellipses which cover ~95% of the distribution of a given population.
Please click the images for a larger version. They represent coordinate 1 on the y axis and 2 on the z axis derived from a multidimensional scaling representing identity by state across individuals.
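For readers who want to draw ellipses of this kind, one common way to get a ~95% ellipse is from the sample covariance of each population's coordinates. The sketch below assumes the points are roughly bivariate normal, which may not be how the ellipses in these plots were produced; `confidence_ellipse` is a hypothetical helper name.

```python
import math


def confidence_ellipse(points, coverage=0.95):
    """Center, semi-axes and rotation of a covariance ellipse for 2D points.

    Assumes a bivariate normal distribution, for which the squared
    Mahalanobis distance follows a chi-square distribution with 2 degrees
    of freedom; its quantile is -2 * ln(1 - coverage).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    # Eigenvalues of the 2x2 symmetric covariance matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)

    # Orientation of the major axis, in radians from the first coordinate.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)

    scale = -2 * math.log(1 - coverage)
    return (mx, my), math.sqrt(scale * lam_major), math.sqrt(scale * lam_minor), angle
```

The returned center, semi-axis lengths, and angle can be handed to whatever plotting layer is in use. For strongly non-normal clusters a kernel density contour is the safer choice.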
A reader points me to a talk given by David Reich at the Center for Human Genetic Research 2013 Retreat. One of the issues Reich brought up is old, but perhaps worth reemphasizing: due to endogamy many South Asians carry a higher load of recessive ailments. This is not due to recent inbreeding (which is barred by custom in many South Asian groups, which enforce kin-level exogamy), but long term genetic isolation. Over time even a moderate sized population can be affected by drift. This was one of the major points in the 2009 paper Reconstructing Indian History, but not one particularly emphasized in the press follow up. A major implication is that a relatively simple public health measure for South Asians would be to marry outside of their jati. The social or genetic distance need not be great. But one generation of outbreeding should “mask” many of the deleterious alleles. If this model is correct one should be able to track decreases in morbidity within the American South Asian population, where there are many inter-caste and inter-regional marriages (yes, this is between people of putative high status, but this doesn’t matter).
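The "masking" arithmetic can be sketched in a few lines. The allele frequencies below are hypothetical, chosen only to show the shape of the effect; the model assumes one allele is drawn at random from each parental group and ignores everything else about real disease genetics.

```python
def affected_fraction(q_group_a, q_group_b):
    """Chance a child is homozygous for a recessive allele when one allele
    is drawn at random from each parental group."""
    return q_group_a * q_group_b


def expected_recessive_hits(loci):
    """Sum the homozygosity chance over independent loci.

    `loci` is a list of (frequency in group A, frequency in group B) pairs.
    """
    return sum(affected_fraction(qa, qb) for qa, qb in loci)


# Hypothetical numbers: drift has pushed an allele to 5% within one group,
# while it sits at 0.5% in the group being married into.
within_group = affected_fraction(0.05, 0.05)
between_groups = affected_fraction(0.05, 0.005)
```

With these made-up frequencies, marriage within the group gives a homozygosity chance of about 0.25% at that locus, against about 0.025% for the between-group marriage. Because different groups have drifted at different loci, the product is small almost everywhere.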
In my earlier posts where I gave a short intro to using Plink I distributed a data set termed PHLYO. One thing I did not mention is that I’ve also been running it on Admixture. But here’s an important point: I ran the data set 10 times from K = 2 to K = 15. Why? Because the algorithm produces somewhat different results on each run (if you use a different seed, which you should), and I wanted to not be biased by one particular result. Additionally, I also turned on cross-validation error, which gives me a better sense of which K’s to trust. But after I select the K which I want to visualize which replicate run will I then use to generate the bar plots? I won’t pick any specific one. Rather, I’ll merge them together with an off-the-shelf algorithm. Additionally, I also want to sort the individuals by their modal population cluster.
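A rough sketch of the replicate setup, assuming the `admixture` binary is on the path and that the input is a Plink `.bed` file; the file name `PHLYO.bed` in the usage line and the seed scheme are placeholders, not the actual pipeline. The script only builds command strings and summarizes cross-validation errors from saved log text, it does not launch anything.

```python
import re
from statistics import mean, stdev


def admixture_commands(bed_file, k_values, replicates):
    """One command per (K, replicate), each with its own seed.

    Uses ADMIXTURE's --cv flag (cross-validation) and -s (random seed).
    Assumption: each command is later run in its own directory, since
    output files are named after the input prefix and K.
    """
    commands = []
    for k in k_values:
        for rep in range(1, replicates + 1):
            seed = 1000 * k + rep  # arbitrary, but distinct per run
            commands.append(f"admixture --cv -s {seed} {bed_file} {k}")
    return commands


# ADMIXTURE reports lines of the form "CV error (K=3): 0.52345" in its log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def summarize_cv(log_texts):
    """Mean and standard deviation of the CV error for each K."""
    by_k = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            by_k.setdefault(int(k), []).append(float(err))
    return {
        k: (mean(errs), stdev(errs) if len(errs) > 1 else 0.0)
        for k, errs in sorted(by_k.items())
    }
```

For example, `admixture_commands("PHLYO.bed", range(2, 16), 10)` yields 140 commands, one per K and replicate, and `summarize_cv` gives the per-K mean and spread that a cross-validation plot with standard deviation ticks would show.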
This sounds rather convoluted, and it is somewhat. I have a pipeline that I use, but it’s not too user friendly. One of my projects is to clean it up, document it, and publish it online. Though if you have your own pipeline all ready to go, please post it in the comments with a link! The general steps are as follows for me:
1) Convert Admixture Q files into Structure format, transform family identifications to numeric values, and generate a file with family identification and numeral pairs
2) Merge the results across runs using Clumpp
3) Sort the individual results within populations
4) Then use Distruct to produce an output file
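Steps 2 and 3 can be sketched without the external tools. The code below is not Clumpp: it is a brute-force stand-in that relabels each run's clusters to match the first run and then averages, which only works for small K because it tries every permutation. The function names are hypothetical, and each Q matrix is assumed to be a list of rows (one per individual) with one column per cluster.

```python
from itertools import permutations


def align_to_reference(reference, run):
    """Reorder the columns of `run` to best match `reference`.

    Tries every column permutation and keeps the one with the smallest
    squared difference, so the cost grows as K factorial.
    """
    k = len(reference[0])

    def cost(perm):
        return sum(
            (ref_row[j] - row[perm[j]]) ** 2
            for ref_row, row in zip(reference, run)
            for j in range(k)
        )

    best = min(permutations(range(k)), key=cost)
    return [[row[best[j]] for j in range(k)] for row in run]


def merge_runs(runs):
    """Average Q matrices after aligning every run to the first one."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    n, k = len(reference), len(reference[0])
    return [
        [sum(a[i][j] for a in aligned) / len(aligned) for j in range(k)]
        for i in range(n)
    ]


def sort_by_modal_cluster(labels, q_rows):
    """Sort one population's individuals by their largest cluster, then by
    descending membership in that cluster."""
    def key(item):
        _, row = item
        modal = max(range(len(row)), key=row.__getitem__)
        return (modal, -row[modal])

    return sorted(zip(labels, q_rows), key=key)
```

The alignment step matters because the cluster labels are arbitrary from run to run: what one run calls cluster 1 another may call cluster 2. Averaging without relabeling would smear distinct components together.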
Before I show you the resultant bar plot, here are the cross-validation results with standard deviation ticks:
While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.
Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.
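The force of the argument is easy to see numerically. Under blending with random mating, an offspring value is the average of two independent parental values, so the variance halves every generation. Under particulate inheritance the additive variance at a locus depends only on the allele frequency, which stays put in the absence of selection, drift, and mutation. The sketch below is a simplified illustration of that contrast (one biallelic locus, additive effects), not a reproduction of Fisher's own derivation.

```python
def blending_variance(initial_variance, generations):
    """Trait variance after `generations` of blending under random mating.

    Var((x + y) / 2) = Var(x) / 2 for independent parents x and y.
    """
    return initial_variance / 2 ** generations


def particulate_additive_variance(p, effect):
    """Additive variance 2 * p * (1 - p) * effect**2 at one biallelic locus
    in Hardy-Weinberg proportions, with `effect` per allele copy."""
    return 2 * p * (1 - p) * effect ** 2


# After ten generations blending leaves under 0.1% of the starting variance.
remaining = blending_variance(1.0, 10)
```

With particulate inheritance the same ten generations leave `particulate_additive_variance(p, effect)` unchanged, because `p` itself does not move.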
There’s an interesting piece in Slate, The Great Schism in the Environmental Movement, which seems to be a distillation of trends which have been bubbling within the modern environmentalist movement for a generation now (I’ve read earlier manifestos in a similar vein). I can’t assess the magnitude of the shift, but here’s the top-line:
But that is a false construct that scientists and scholars have been demolishing the past few decades. Besides, there’s a growing scientific consensus that the contemporary human footprint—our cities, suburban sprawl, dams, agriculture, greenhouse gases, etc.—has so massively transformed the planet as to usher in a new geological epoch. It’s called the Anthropocene.
Modernist greens don’t dispute the ecological tumult associated with the Anthropocene. But this is the world as it is, they say, so we might as well reconcile the needs of people with the needs of nature. To this end, Kareiva advises conservationists to craft “a new vision of a planet in which nature—forests, wetlands, diverse species, and other ancient ecosystems—exists amid a wide variety of modern, human landscapes.”
In the post below I offered up my supposition that Dan MacArthur’s ancestry is unlikely to be Northwest Indian, which precludes a Romani origin for his South Asian ancestry. Indeed this is almost certainly so; Dienekes Pontikos followed up my crude analyses with IBD-sharing calculations (IBD = ‘identity by descent,’ which is basically what you would think it is). The South Asian population which MacArthur has the closest affinity to is from Karnataka, which is one of the Dravidian speaking states of the South. This does not necessarily refute my earlier contention, as aside from Brahmins most Bengalis seem to have broad South Indian affinities, except for the fact that they often have more East Asian ancestry.
Most people are aware that altitude imposes constraints on individual performance and function. Much of this is flexible; athletes who train at high altitudes may gain a performance edge. But over the long term there are costs, just as there are with computers which are ‘overclocked.’ This is the point where you make the transition from physiology to evolution. Residence at high altitude entails strong selective pressures on populations. Over the past few years there has been a great deal of exploration of the genetics of long resident high altitude groups, the Tibetans, Peruvians, and Ethiopians.
In many cases there are questions of a historical and ethnographic nature which are subject to controversy and debate. Scholarly arguments are laid out, and further dispute ensues. For decades progress seems fleeting, as one hypothesis is accepted, only to be subject to later revision. This sort of pattern gives succor to the most cynical and jaded of the ‘Post Modern’ set, especially when the ‘discourse’ in question is in the domain of science.
But thankfully these debates can come to an end in some cases. So it is with the origins of the European Romani, better known as ‘Gypsies’ (though the Roma are the most well known of the Romani, other groups within Europe have different ethnonyms). Obviously many of the basic elements have long been there, but I think the most recent genetic work now establishes a level of closure. Taking a step back, what do we know?
1) The Romani language seems to be Indo-Aryan, with a likely affinity with the northwest group of Indo-Aryan languages
2) The Romani presence in Europe only dates to the past ~1,000 years, with an entry point in the Byzantine Empire
3) They are an admixture between an ancestral Indian element, and local populations
4) Their history of endogamy has resulted in a strong genetic drift effect
The two papers which seem to nail the coffin shut on these questions use somewhat different methodologies. One relies on Y chromosomal STRs (hypervariable repeat regions) to generate a paternal phylogeny. Focusing just on the paternal phylogeny allows for one to make very robust genealogical inferences. Additionally, the authors had a very large data set across India. Their goal was to ascertain the exact region of origin of the Romani before they left India. As noted in bullet #1 there is already some evidence from their language that this must be in northwest India. The second paper uses a SNP-chip; hundreds of thousands of autosomal markers. This has been done to death for other populations, so the method isn’t new. Rather, it is that it is now being applied to the Romani.
First, the Y chromosomal paper. The Phylogeography of Y-Chromosome Haplogroup H1a1a-M82 Reveals the Likely Indian Origin of the European Romani Populations:
Linguistic and genetic studies on Roma populations inhabited in Europe have unequivocally traced these populations to the Indian subcontinent. However, the exact parental population group and time of the out-of-India dispersal have remained disputed. In the absence of archaeological records and with only scanty historical documentation of the Roma, comparative linguistic studies were the first to identify their Indian origin. Recently, molecular studies on the basis of disease-causing mutations and haploid DNA markers (i.e. mtDNA and Y-chromosome) supported the linguistic view. The presence of Indian-specific Y-chromosome haplogroup H1a1a-M82 and mtDNA haplogroups M5a1, M18 and M35b among Roma has corroborated that their South Asian origins and later admixture with Near Eastern and European populations. However, previous studies have left unanswered questions about the exact parental population groups in South Asia. Here we present a detailed phylogeographical study of Y-chromosomal haplogroup H1a1a-M82 in a data set of more than 10,000 global samples to discern a more precise ancestral source of European Romani populations. The phylogeographical patterns and diversity estimates indicate an early origin of this haplogroup in the Indian subcontinent and its further expansion to other regions. Tellingly, the short tandem repeat (STR) based network of H1a1a-M82 lineages displayed the closest connection of Romani haplotypes with the traditional scheduled caste and scheduled tribe population groups of northwestern India.
Two trees illustrate the results succinctly:
The bottom line:
- This particular Y chromosomal lineage which is highly diagnostic of South Asian origin in the Romani shows that the Romani seem to derive from the populations of northwest India
- Additionally, within these populations the Romani Y chromosomal lineages derive from the lower caste elements, the scheduled castes and scheduled tribes
But the above results don’t get directly at genome-wide admixture. The second paper does, using hundreds of thousands of markers to explore the Romani affinity to other populations. Reconstructing the Population History of European Romani from Genome-wide Data:
The Romani, the largest European minority group with approximately 11 million people…constitute a mosaic of languages, religions, and lifestyles while sharing a distinct social heritage. Linguistic…and genetic…studies have located the Romani origins in the Indian subcontinent. However, a genome-wide perspective on Romani origins and population substructure, as well as a detailed reconstruction of their demographic history, has yet to be provided. Our analyses based on genome-wide data from 13 Romani groups collected across Europe suggest that the Romani diaspora constitutes a single initial founder population that originated in north/northwestern India ∼1.5 thousand years ago (kya). Our results further indicate that after a rapid migration with moderate gene flow from the Near or Middle East, the European spread of the Romani people was via the Balkans starting ∼0.9 kya. The strong population substructure and high levels of homozygosity we found in the European Romani are in line with genetic isolation as well as differential gene flow in time and space with non-Romani Europeans. Overall, our genome-wide study sheds new light on the origins and demographic history of European Romani.
The plot to the left illustrates the relationship of the Romani to world-wide populations using multi-dimensional scaling, where genetic variation is decomposed into dimensions, and individuals are plotted on those dimensions. In short, the Romani exhibit a classic admixture cline pattern. That is, they are the products of a two-way admixture between populations which occupy distinct positions along a cline, and Romani individuals and populations are distributed along the cline in proportion to their admixture. One notable aspect is that the Romani are actually two clusters; one which manifests a strong ‘east’-‘west’ distribution, and another which seems located purely within the European cluster. The latter seems to be the Welsh Romani, who in the neighbor-joining tree (see the supplements) fall on the same branch as European populations, as opposed to the other Romani, who form their own clade.
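As a crude illustration of what "distributed along the cline in proportion to their admixture" means geometrically, one can project an individual's coordinates onto the line joining the two source-population centroids. This is a back-of-the-envelope sketch with hypothetical inputs, not the method used in the paper, and drift in an endogamous group will pull individuals off the line.

```python
def cline_fraction(individual, source_a, source_b):
    """Fraction of the way from source_a to source_b along their joining
    line, by orthogonal projection, clamped to the range [0, 1]."""
    ax, ay = source_a
    bx, by = source_b
    ix, iy = individual
    dx, dy = bx - ax, by - ay
    t = ((ix - ax) * dx + (iy - ay) * dy) / (dx * dx + dy * dy)
    return min(1.0, max(0.0, t))
```

An individual sitting 30% of the way from the first centroid toward the second gets a value of 0.3 under this reading. Model-based estimates are the better tool for anything beyond a quick look.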
To drill down further you need to ascertain admixture with a model-based clustering algorithm. Ergo, ADMIXTURE. I’ve reedited the figure to illustrate the salient points. In particular, it is clear that the Roma populations except the Welsh have significant South Asian ancestry. The question is how much? To answer this question you need to know the source population in South Asia. A peculiar aspect of this plot is that the Romani have very little of the green ancestral component, which happens to be modal in the Middle East (not shown). This element happens to be highly enriched in many Pakistani populations, but not necessarily northwest Indian ones. Nevertheless, the issue that leaves me suspicious of this particular finding is that many of the European populations, in particular those groups (e.g., Balkans) which may have admixed with the Romani, have this element to an extent not evident in one of their presumed ‘daughter’ populations. I wonder if perhaps the peculiarities of Romani inbreeding have skewed the allele frequency distribution so much that you get strangeness like this. I am not showing higher K’s because those break out with a Romani-cluster. Just like the Kalash-cluster this is to a great extent a feature of the long term endogamy of these communities. With high levels of drift the allele frequency of these groups moves into a very peculiar space in relation to their parental populations, but one must not become confused and assume that the Romani or Kalash are themselves appropriate independent clusters in the same way that Europeans or East Asians are.
Using various forms of admixture analysis the authors seem to conclude that the Balkan Romani are 30-50% South Asian. This seems in line with intuition. But that still leaves open the question of who those South Asians were. As I noted above the most thorough Y chromosomal data point to the lower caste elements of northwest India. What do the autosomes say?
I don’t want to get into the technical details of how they tested the models, but it seems that one of the likely parental populations to the Romani had a close relationship to the Meghwal, a scheduled caste from northwest India. In other words, the autosome results align very well with the Y chromosomal inferences. Additionally, the models tested imply that the Romani likely left South Asia ~1,000 years before the present, which aligns well with what is known from the historical record (though this is a case where I put much more stock in the historical record than inferences from population genetic models; look at the intervals).
Finally, there is the question of inbreeding. One aspect of the Romani genome that jumps out at you is that they have many long “runs-of-homozygosity” (ROH). This is totally expected, as decades of uniparental analyses suggested a great deal of population bottleneck events as the Romani spread throughout Europe. But the ROH patterns also unearth an interesting fact: some of the Balkan Romani clearly have recent European admixture, while the non-Balkan Romani had an initial period of admixture followed by endogamy. The latter scenario seems to resemble Ashkenazi Jews, while the former would suggest that the boundary between Romani and non-Romani in the Balkans is more fluid than is sometimes portrayed.
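For anyone unfamiliar with what a run of homozygosity is operationally, here is a bare-bones sketch: scan one individual's genotypes in map order and report stretches with no heterozygous calls that pass a SNP-count and length threshold. The thresholds and the 0/1/2 coding are assumptions for illustration. Production tools (for example Plink's `--homozyg`) also tolerate occasional heterozygous or missing calls, which this sketch does not.

```python
def runs_of_homozygosity(genotypes, positions, min_snps=50, min_length_bp=1_000_000):
    """Find uninterrupted homozygous stretches for one individual.

    genotypes: alternate-allele counts (0, 1 or 2) in map order, where 1
               means heterozygous.
    positions: base-pair position of each SNP, same order.
    Returns a list of (start_bp, end_bp, n_snps) tuples.
    """
    runs = []
    start = None
    # A trailing heterozygous sentinel closes a run that reaches the end.
    for i, g in enumerate(list(genotypes) + [1]):
        if g != 1 and start is None:
            start = i
        elif g == 1 and start is not None:
            end = i - 1
            n_snps = end - start + 1
            length = positions[end] - positions[start]
            if n_snps >= min_snps and length >= min_length_bp:
                runs.append((positions[start], positions[end], n_snps))
            start = None
    return runs
```

Many long runs point to recent shared ancestry between an individual's parents, while a population-wide excess of shorter runs is more in line with older bottlenecks.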
So there we have it. The Romani derive from lower caste populations from the northwest Indian subcontinent who seem to have left ~1,000 years ago. Over time they admixed with local populations, and are now 50-70% non-South Asian, with some groups being ~90% European (e.g., Welsh Romani). And, they have a long history as an endogamous group, judging by their inbreeding.
A new press release is circulating on the paper which I blogged a few months ago, Ancient Admixture in Human History. Unlike the paper, the title of the press release is misleading, and unfortunately I notice that people are circulating it, and probably misunderstanding what is going on. Here’s the title and first paragraph:
Native Americans and Northern Europeans More Closely Related Than Previously Thought
Released: 11/30/2012 2:00 PM EST
Source: Genetics Society of America
Newswise — BETHESDA, MD – November 30, 2012 — Using genetic analyses, scientists have discovered that Northern European populations—including British, Scandinavians, French, and some Eastern Europeans—descend from a mixture of two very different ancestral populations, and one of these populations is related to Native Americans. This discovery helps fill gaps in scientific understanding of both Native American and Northern European ancestry, while providing an explanation for some genetic similarities among what would otherwise seem to be very divergent groups. This research was published in the November 2012 issue of the Genetics Society of America’s journal GENETICS
The reality is that Native Americans and Northern Europeans are not more “closely related” genetically than they were before this paper. There has been no great change to standard genetic distance measures or phylogeographic understanding of human genetic variation. A measure of relatedness is to a great extent a summary of historical and genealogical processes, and as such it collapses a great deal of disparate elements together into one description. What the paper in Genetics outlined was the excavation of specific historically contingent processes which result in the summaries of relatedness which we are presented with, whether they be principal component analysis, Fst, or model-based clustering.
What I’m getting at can be easily illustrated by a concrete example. To the left is a 23andMe chromosome 1 “ancestry painting” of two individuals. On the left is me, and the right is a friend. The orange represents “Asian ancestry,” and the blue represents “European” ancestry. We are both ~50% of both ancestral components. This is a correct summary of our ancestry, as far as it goes. But you need some more information. My friend has a Chinese father and a European mother. In contrast, I am South Asian, and the end product of an ancient admixture event. You can’t tell that from a simple recitation of ancestral quanta. But it is clear when you look at the distribution of ancestry on the chromosomes. My components have been mixed and matched by recombination, because there have been many generations between the original admixture and myself. In contrast, my friend has not had any recombination events between his ancestral components, because he is the first generation of that combination.
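The contrast can be put in toy form. Below, each haplotype is a string of ancestry labels over equal-sized windows ("A" for Asian, "E" for European); the strings are invented, not real 23andMe output. Both individuals come out at 50% of each component, but only the switch count separates a first-generation mix from an old one.

```python
def ancestry_switches(painting):
    """Count ancestry changes between adjacent windows on one haplotype."""
    return sum(1 for left, right in zip(painting, painting[1:]) if left != right)


def ancestry_fraction(haplotypes, label):
    """Share of all windows, across both haplotypes, carrying `label`."""
    total = sum(len(h) for h in haplotypes)
    return sum(h.count(label) for h in haplotypes) / total


# Hypothetical paintings of one chromosome pair.
first_generation = ["AAAAAAAA", "EEEEEEEE"]  # one intact chromosome from each parent
ancient_mixture = ["AAEEAEEA", "EAEEAAAE"]   # ancestry chopped up by recombination
```

Here `ancestry_fraction` returns 0.5 for "A" in both cases, while `ancestry_switches` is zero on each first-generation haplotype and positive on the ancient ones. The more generations since the admixture, the more switches and the shorter the segments.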
So what the paper publicized in the press release does is present methods to reconstruct exactly how patterns of relatedness came to be, rather than reiterating well understood patterns of relatedness. With the rise of whole-genome sequencing and more powerful computational resources to reconstruct genealogies we’ll be seeing much more of this to come in the future, so it is important that people are not misled as to the details of the implications.
- Life Technologies/Ion Torrent apparently hires d-bag bros to represent them at conferences. The poster people were fine, but the guys manning the Ion Torrent Bus were total jackasses if they thought it would be funny/amusing/etc. Human resources acumen is not always a reflection of technological chops, but I sure don’t expect organizational competence if they (HR) thought it was smart to hire guys who thought (the d-bags) it would be amusing to alienate a selection of conference goers at ASHG. Go Affy & Illumina!
- Speaking of sequencing, there were some young companies trying to pitch technologies which will solve the problem of lack of long reads. I’m hopeful, but after the Pacific Biosciences fiasco of the late 2000s, I don’t think there’s a point in putting hopes on any given firm.
- I walked the poster hall, read the titles, and at least skimmed all 3,000+ posters’ abstracts. No surprise that genomics was all over the place. But perhaps a moderate surprise was how big exomes are getting for medically oriented people.
- Speaking of medical/clinical people, I noticed that in their presentations they used the word ‘Caucasian’ a lot. This was not evident in the pop-gen folks. It shows the influence of bureaucratic nomenclature in modern medicine, as they have taken to using somewhat nonsensical US Census Bureau categories.
- Twitter was a pretty big deal. There were so many interesting sessions that I found myself checking my feed constantly for the #ASHG2012 hashtag. It was also an easy way to figure out who else was at the same session (e.g., in my case, very often Luke Jostins).
- If you could track the patterns of movements of smartphones at the conference it would be interesting to see a network of clustering of individuals. For example, the evolutionary and population genomics posters were bounded by more straight-up informatics (e.g., software to clean your raw sequence data), from which there was bleed over. But right next to the evolution and population genomics sections (and I say genomics rather than genetics, because the latter has been totally subsumed by the former) you had some type of pediatric disease genetics aisles. I wasn’t the only one to have a freak out when I mistakenly kept on moving (i.e., you go from abstruse discussions of the population structure of Ethiopia, to concrete ones about the likely probability of death of a newborn with an autosomal dominant disorder, with photos of said newborn!).
Last week Luke Jostins (soon to be Dr. Luke Jostins) published an interesting paper in Nature. To be fair, this paper has an extensive author list, but from what I am to understand this is the fruit of the first author’s Ph.D. project. In any case, you may know Luke because I have used his loess curve on hominin encephalization for years. His bread & butter is statistical genetics, and it shows in this Nature paper. God knows how he managed to cram so much density into ~5.5 pages of plain text. Luke is also a contributor to Genomes Unzipped, and has put up a post over there on one implication of the paper, Dozens of new IBD genes, but can they predict disease? The short answer is that for individual prediction complex traits are going to be a hard haul over the long term.*
They are subject to what Jim Manzi would term “high causal density.” A simple way to state this is that outcome X is dependent on a host of variables, and if you capture only a small number of variables, you aren’t going to be explaining much in a general fashion. This is obvious from the text of Luke’s paper. Let’s look at the abstract, Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease:
The Pith: Natural selection comes in different flavors in its genetic constituents. Some of those constituents are more elusive than others. That makes “reading the label” a non-trivial activity.
As you may know when you look at patterns of variation in the genome of a given organism you can make various inferences from the nature of these patterns. But the power of those inferences is conditional on the details of the real demographic and evolutionary histories, as well as the assumptions made about the models which one is testing. When delving into the domain of population genomics some of the concepts and models may seem abstruse, but the reality is that such details are the stuff of which evolution is built. A new paper in PLoS Genetics may seem excessively esoteric and theoretical, but it speaks to very important processes which shape the evolutionary trajectory of a given population. The paper is titled Distinguishing between Selective Sweeps from Standing Variation and from a De Novo Mutation. Here’s the author summary:
Considerable effort has been devoted to detecting genes that are under natural selection, and hundreds of such genes have been identified in previous studies. Here, we present a method for extending these studies by inferring parameters, such as selection coefficients and the time when a selected variant arose. Of particular interest is the question whether the selective pressure was already present when the selected variant was first introduced into a population. In this case, the variant would be selected right after it originated in the population, a process we call selection from a de novo mutation. We contrast this with selection from standing variation, where the selected variant predates the selective pressure. We present a method to distinguish these two scenarios, test its accuracy, and apply it to seven human genes. We find three genes, ADH1B, EDAR, and LCT, that were presumably selected from a de novo mutation and two other genes, ASPM and PSCA, which we infer to be under selection from standing variation.
The dynamic which they refer to seems to be a reframing of the conundrum of detecting hard sweeps vs. soft sweeps. In the former case you have a new mutation, so its frequency is ~1/(2N). It is quickly subject to natural selection (though stochastic processes dominate at low frequencies, so probability of extinction is high), and adaptation drives the allele to fixation (or nearly to fixation). In the latter scenario you have a great deal of extant genetic variation, present in numerous different allelic variants. A novel selection pressure reshapes the frequency landscape, but you can not ascribe the genetic shift to only one allele. It is no surprise that the former is easier to model and detect than the latter. Much of the evolutionary genomics of the 2000s focused on hard sweeps from de novo mutations because they were low hanging fruit. The methods had reasonable power to detect them (as well as many false positives!). But of late many are suspecting that hard sweeps are not the full story, and that much of evolutionary genetic process may be characterized by a combination of hard sweeps, soft sweeps (from standing variation), various forms of negative selection, not to mention the plethora of possibilities which abound in the domain of balancing selection.
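To put numbers on "probability of extinction is high," here is a small sketch using Kimura's diffusion approximation for the fixation probability of a single new copy. It assumes the effective size equals the census size N and additive (genic) selection with advantage `s` per allele copy. These are textbook simplifications and not parameters taken from the paper.

```python
import math


def initial_frequency(n_diploid):
    """Frequency of a brand-new mutation in a diploid population of size N."""
    return 1 / (2 * n_diploid)


def fixation_probability(s, n_diploid):
    """Kimura's approximation for a single new copy, with Ne = N.

    u = (1 - exp(-2s)) / (1 - exp(-4Ns)); the neutral limit is 1 / (2N).
    Intended for s >= 0: strongly negative s in a large population would
    overflow the exponential.
    """
    if s == 0:
        return initial_frequency(n_diploid)
    return math.expm1(-2 * s) / math.expm1(-4 * n_diploid * s)
```

For a large population and a modest advantage the result is close to 2s, so a new allele with a 1% advantage is still lost the vast majority of the time. That is part of why sweeps from standing variation, where the favored allele already exists in many copies, are a plausible alternative.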
Many of the details of the paper may seem overly technical and opaque (and to be fair, I will say here that the figures are somewhat difficult to decrypt, though the subject is not one that lends itself to general clarity), but the major finding is straightforward, and illustrated in figure 4 (I’ve added labels):
Rice is a pretty big deal. There’s really no need to justify research on this crop. It feeds literally billions, so the funding will always flow. Would that we knew rice as well as we know C. elegans. After yesterday’s travesty of a paper on barley I thought that readers might find a new paper in Nature, A map of rice genome variation reveals the origin of cultivated rice, more interesting and illuminating. The authors used genomic sequencing, of varied coverage (i.e., very deep, repeated, and therefore accurate coverage vs. a single pass which is a very rough draft), to assess the relationship between Asian wild rice and two of the dominant domestic cultivars, indica (long-grain paddy rice) and japonica (short-grain dry cultivation rice). Presumably the two cultivars derive from a wild ancestor, but the details are still being hashed out.