There has been a lot of attention to Erika Check Hayden’s piece Ethics: Taboo genetics, at least judging by the comments on my Facebook feed. In some ways this is not an incredibly empirically grounded argument, because the biological basis of complex traits is going to be rather difficult to untangle on a gene-by-gene basis. In other words, this isn’t a clear and present “concern.” The heritability of many behavioral traits has long been known. This is not revolutionary, though for cultural reasons many well-educated people are totally surprised when confronted with data that many traits, such as intelligence and personality, have robust heritabilities* (the proportion of trait variation explained by variation in genes across the population). The literature reviewed in The Nurture Assumption makes clear that a surprising proportion of the contribution parents make to their offspring is through their genetic composition, and not their modeled example. You wouldn’t know this if you read someone like Brian Palmer of Slate, who seems to be getting paid to reaffirm the biases of the current age among the smart set (pretty much every one of his pieces touching upon genetics is larded with phrases which could have been written by a software program designed to soothe the concerns of the cultural Zeitgeist). But the new genomics is confirming the broad outlines of the findings from behavior genetics. There’s nothing really to see there. The bigger issue of interest is normative: the values we hold dear as a culture.
Getting a paper published with a newly sequenced genome is considered somewhat passé and so aughts at this point, but there are cases which are exceptions to this rule. Tigers are a charismatic and rare (<10,000 in the wild) super-predator, so when you see that they, along with a few other Panthera species, have been sequenced you take some note. The paper in question is open access, so you can read it yourself: The tiger genome and comparative analysis with lion and snow leopard genomes (not to spoil it, but there’s a Venn diagram!).
Before today only Felis silvestris catus had a reference sequence within the mammalian family Felidae. This fact should make you reconsider the idea that a new genome sequence is always boring and not noteworthy, as most lineages of mammals are represented by only one representative individual from one representative species. In ~5 years it is true that we’ll be beyond this stage of data scarcity in the sense of phylogenetic coverage, but we’re not there yet.
Registration is free. It will be hosted by the Nielsen Group in Berkeley on October 5th. As I am not going to be at the methods-orgy (OK, my own peculiar perspective) that is going to be ASHG 2013 I am definitely going to BAPG to get a preview of anything that might be unveiled in Cambridge a few months later (and to be frank I registered last June when it was announced).
For many the image of evolutionary processes brings to mind something on a macro scale. Perhaps that of the changing nature of protean life on earth writ large, depicted on a broad canvas such as in David Attenborough’s majestic documentaries over millions of years and across geological scales. But one can also reduce the phenomenon to a finer-grain on a concrete level, as in specific DNA molecules. Or, transform it into a more abstract rendering manipulable by algebra, such as trajectories of allele frequencies over generations. Both of these reductions emphasize the genetic aspect of natural history.
Obviously evolutionary processes are not just fundamentally the flux of genetic elements, but genes are crucial to the phenomena in a biological sense. It therefore stands to reason that if we look at patterns of variation within the genome we will be able to infer in some deep fashion the manner in which life on earth has evolved, and conclude something more general about the nature of biological evolution. These are not trivial affairs; it is not surprising that philosophy-of-biology is often caricatured as philosophy-of-evolution. One might dispute the characterization, but it cannot be denied that some would contend that evolutionary processes in some way allow us to understand the nature of Being, rather than just how we came into being (Creationists depict evolution as a religion-like cult, which imparts the general flavor of some of the meta-science and philosophy which serves as intellectual subtext).
There is the fact of evolution. And then there is the long-standing debate of how it proceeds. The former is a settled question with little intellectual juice left. The latter is the focus of evolutionary genetics, and evolutionary biology more broadly. The debate is an old one, and goes as far back as the 19th century, when you had arch-selectionists such as Alfred Russel Wallace (see A Reason For Everything) square off against pretty much the whole of the scholarly world (e.g., Thomas Henry Huxley, “Darwin’s Bulldog,” was less than convinced of the power of natural selection as the driving force of evolutionary change). This old disagreement planted the seeds for much more vociferous disputations in the wake of the fusion of evolutionary biology and genetics in the early 20th century. They range from the Wright-Fisher controversies of the early years of evolutionary genetics, to the neutralist vs. selectionist debate of the 1970s (which left bad feelings in some cases). A cartoon-view of the implication of the debates in regards to the power of selection as opposed to stochastic contingency can be found in the works of Stephen Jay Gould (see The Structure of Evolutionary Theory) and Richard Dawkins (see The Ancestor’s Tale): does evolution result in an infinitely creative assortment due to chance events, or does it drive toward a finite set of idealized forms which populate the possible parameter space?*
My friend Zack Ajmal has been running the Harappa Ancestry Project for several years now. This is a non-institutional complement to the genomic research which occurs in the academy. His motivation was in large part to fill in the gaps of population coverage within South Asia which one sees in the academic literature. Much of this is due to politics, as the government of India has traditionally been reluctant to allow sample collection (ergo, the HGDP data uses Pakistanis as their South Asian reference, while the HapMap collected DNA from Indian Americans in Houston). Of course this sort of project is not without its own blind spots. Zack must rely on public data sets to get a better picture of groups like tribal populations and Dalits, because they are so underrepresented in the Diaspora from which he draws many of the project participants.
Once Zack has the genotype, one of the primary things he does is add it to his broader data set (which includes many public samples) and analyze it with the Admixture model-based clustering package. What Admixture does is take a specified number of populations (e.g. K = 12) and assign each individual proportions of ancestry from those populations. So, for example, individual A might be assigned 40% population 1 and 60% population 2 for K = 2. Individual B might be 45% population 1 and 55% population 2. These are not necessarily ‘real’ populations. Rather, the populations and their proportions are there to allow you to discern patterns of relationships across individuals.
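To make the idea concrete, here is a minimal sketch of the kind of model that underlies this sort of estimation. It is an illustration only, not the actual Admixture algorithm (which fits all individuals and allele frequencies jointly with a fast block-relaxation method); the allele frequencies and the grid-search maximum likelihood below are made up for the toy example.

```python
import math
import random

random.seed(1)

L = 500  # number of loci in the toy example
# Hypothetical allele frequencies for two source populations.
p1 = [random.uniform(0.05, 0.95) for _ in range(L)]
p2 = [random.uniform(0.05, 0.95) for _ in range(L)]

def simulate_genotypes(q):
    """Simulate a diploid individual with ancestry fraction q from population 1."""
    genos = []
    for j in range(L):
        f = q * p1[j] + (1 - q) * p2[j]  # expected allele frequency given ancestry
        # genotype = number of copies of the allele: 0, 1, or 2
        genos.append((random.random() < f) + (random.random() < f))
    return genos

def log_likelihood(q, genos):
    ll = 0.0
    for j, g in enumerate(genos):
        f = q * p1[j] + (1 - q) * p2[j]
        # binomial(2, f) probability of the observed genotype
        ll += math.log([(1 - f) ** 2, 2 * f * (1 - f), f ** 2][g])
    return ll

def estimate_q(genos):
    # crude grid-search maximum likelihood over the admixture proportion
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda q: log_likelihood(q, genos))

genos = simulate_genotypes(0.40)  # truth: 40% from population 1
print(estimate_q(genos))  # should land near 0.40
```

With a few hundred loci the estimate recovers the simulated proportion quite closely, which is why even modest genotype panels suffice for this kind of clustering.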
Since Zack has put his results online, I thought it would be useful to review what patterns have emerged over the past two years, as his sample sizes for some regions are now moderately large. Though he has K = 16 populations, not all of them will concern us, because South Asians do not tend to exhibit many of the components. I will focus on seven: S Indian, Baloch, Caucasian, NE Euro, SE Asian, Siberian and NE Asian. These are not real populations, but the labels tell you in which region these components are modal. So, for example, the “S Indian” component peaks in southern India. The “Baloch” peaks among the Baloch people of southeastern Iran and southwestern Pakistan. The “NE Euro” peaks among the eastern Baltic peoples. The last three are Asian components, running the latitude from south to north to center. These concern only the first population of interest, the Bengalis. I will combine these last three together as “Asian.”
Below is a table, mostly individuals from Zack’s results (though there are some aggregate results from public data sets). Comments below.
One of the elementary aspects of understanding genetics on a biophysical scale is to characterize the set of processes which span the chasm between the raw sequence information of base pairs (e.g. AGCGGTCGCAAG….) and the assorted macromolecules which are woven together to create the collection of tissues, and enable the physiological processes, which result in the organism. This suite of phenomena is encapsulated most succinctly in the often maligned Central Dogma of Molecular Biology. In short, the information of the DNA sequence is transcribed and translated into proteins. Though for greater accuracy and precision one must always add the caveats of phenomena such as splicing. The range of processes is so baroque that molecular genetics has become a massive enterprise, to a great extent superseding classical Mendelian genetics.
One critical structural detail from an evolutionary perspective is that the amino acids which are the building blocks of proteins are generally encoded by multiple nucleotide triplets, or codons. For example the amino acid Glycine is “four-fold degenerate”: GGA, GGG, GGC and GGU (in RNA Uracil, U, substitutes for the Thymine, T, of DNA) all encode it. Notice that the variation is confined to the third position in the codon. Altering the first or second position would transform the amino acid end product, and possibly perturb the function of the final protein (or perhaps disrupt transcription altogether in some cases). These are synonymous substitutions because they don’t change the functional import of the sequence, as opposed to the nonsynonymous positions (which may abolish or change function). In an evolutionary context one may presume that these synonymous substitutions are “silent.” Because natural selection operates upon heritable variation of a phenotype, and synonymous substitutions presumably do not change phenotype, it is often assumed that evolutionary change on these bases is selectively neutral. In contrast, nonsynonymous changes may be deleterious or beneficial (far more likely the former than the latter, because breaking contingent complexity is easier than creating new contingent complexity). Therefore the ratio of genetic change at nonsynonymous and synonymous bases across lineages has been a common measure of possible selection on a gene.
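The glycine example above can be checked mechanically. The sketch below enumerates every single-base mutation of GGA and classifies it as synonymous or nonsynonymous; the lookup table covers only the codons reachable from GGA by one substitution (a full standard table would have 64 entries).

```python
# Partial RNA codon table: only the codons one substitution away from GGA.
CODON = {
    "GGA": "Gly", "GGG": "Gly", "GGC": "Gly", "GGU": "Gly",
    "AGA": "Arg", "CGA": "Arg", "UGA": "Stop",
    "GAA": "Glu", "GCA": "Ala", "GUA": "Val",
}

def classify_mutations(codon):
    """Classify every single-base mutation of `codon` as (non)synonymous."""
    results = {}
    for pos in range(3):
        for base in "ACGU":
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            kind = ("synonymous" if CODON[mutant] == CODON[codon]
                    else "nonsynonymous")
            results[mutant] = (pos + 1, CODON[mutant], kind)
    return results

for mut, (pos, aa, kind) in sorted(classify_mutations("GGA").items()):
    print(mut, "position", pos, "->", aa, kind)
```

Running it shows exactly the pattern described: all three third-position changes are synonymous, while every first- or second-position change swaps the amino acid (or, in the case of UGA, introduces a stop).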
It is generally understood that inbreeding has some negative biological consequences for complex animals. Recessive diseases are the most straightforward. The rarer a recessive disease is, the higher the fraction of sufferers who will be products of pairings between relatives (the reason for this is straightforward: extremely rare alleles which express in a deleterious fashion in homozygotes will be unlikely to come together in unrelated individuals). But when it comes to traits associated with inbred individuals, recessive diseases are not what comes to mind for most; the boy from the film Deliverance is usually the more gripping image (contrary to what some of the actors claimed, the young boy did not have any condition).
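The arithmetic behind the first point is worth making explicit. For a recessive allele at frequency q, the chance that an offspring is an affected homozygote is q² + F·q(1 − q), where F is the inbreeding coefficient (1/16 for offspring of first cousins, 0 for unrelated parents). The allele frequencies below are illustrative, not taken from any particular disease.

```python
def p_affected(q, F):
    """Probability of an affected recessive homozygote given allele
    frequency q and inbreeding coefficient F."""
    return q * q + F * q * (1 - q)

for q in (0.01, 0.001, 0.0001):
    random_mating = p_affected(q, 0.0)       # F = 0: unrelated parents
    cousins = p_affected(q, 1 / 16)          # F = 1/16: first cousins
    print(f"q={q}: risk ratio cousins/random = {cousins / random_mating:.1f}")
```

The relative risk scales roughly as 1 + F/q, which is exactly why the rarer the allele, the more its affected homozygotes are concentrated among the offspring of relatives.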
Some are curious about the consequences of inbreeding for a trait such as intelligence. The scientific literature here is somewhat muddled. But it seems likely that, all things being equal, if two people of average intelligence who are first cousins pair up, the I.Q. of their offspring will be expected to be 0-5 points lower than would otherwise be the case. By this I mean that the studies you can find in the literature suggest, when correcting for other variables, that the inbreeding depression on the phenotypic level is greater than 0 (there is an effect) but less than 5 points (it is not that large, less than 1/3 of a standard deviation of the trait value). Presumably for higher levels of inbreeding the consequences are going to be more dire.
Modern evolutionary genetics owes its origins to a series of intellectual debates around the turn of the 20th century. Much of this is outlined in Will Provine’s The Origins of Theoretical Population Genetics, though a biography of Francis Galton will do just as well. In short what happened is that during this period there were conflicts between the heirs of Charles Darwin as to the nature of inheritance (an issue Darwin left muddled from what I can tell). On the one side you had a young coterie around William Bateson, the champion of Gregor Mendel’s ideas about discrete and particulate inheritance via the abstraction of genes. Arrayed against them were the acolytes of Charles Darwin’s cousin Francis Galton, led by the mathematician Karl Pearson, and the biologist Walter Weldon. This school of “biometricians” focused on continuous characteristics and Darwinian gradualism, and are arguably the forerunners of quantitative genetics. There is some irony in their espousal of a “Galtonian” view, because Galton was himself not without sympathy for a discrete model of inheritance!
In the end science and truth won out. Young scholars trained in the biometric tradition repeatedly defected to the Mendelian camp (e.g. Charles Davenport). Eventually, R. A. Fisher, one of the founders of modern statistics and evolutionary biology, merged both traditions in his seminal paper The Correlation between Relatives on the Supposition of Mendelian Inheritance. The intuition for why Mendelism does not undermine classical Darwinian theory is simple (granted, some of the original Mendelians did seem to believe that it was a violation!). Many discrete genes of moderate to small effect upon a trait can produce a continuous distribution via the central limit theorem. In fact classical genetic methods often had difficulty perceiving traits with more than a half dozen significant loci as anything but quantitative and continuous (consider pigmentation, which we know through genomic methods to vary across populations mostly due to half a dozen segregating genes or so).
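The central limit theorem argument is easy to see in simulation. The sketch below (a toy model, with made-up locus counts and effect sizes) sums 100 Mendelian loci of equal small effect: each locus is discrete, yet the population trait distribution comes out smooth and bell-shaped, with mean n·2p and variance n·2p(1 − p).

```python
import random

random.seed(42)

def trait_value(n_loci, allele_freq=0.5, effect=1.0):
    """Additive trait: each diploid locus contributes 0, 1, or 2 copies
    of a '+' allele, each copy adding `effect` to the trait."""
    return sum(effect * ((random.random() < allele_freq) +
                         (random.random() < allele_freq))
               for _ in range(n_loci))

population = [trait_value(100) for _ in range(10000)]
mean = sum(population) / len(population)
var = sum((x - mean) ** 2 for x in population) / len(population)
# Theory: mean = 100 * 2 * 0.5 = 100, variance = 100 * 2 * 0.5 * 0.5 = 50.
print(round(mean, 1), round(var, 1))
```

This is Fisher’s reconciliation in miniature: nothing about the underlying discreteness survives at the level of the phenotype distribution once enough loci contribute.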
Anyone who reads the genomic posts on this weblog with any interest must read Daniel Lawson’s fine review of the topic which he has posted on arXiv, Populations in statistical genetic modelling and inference (via Haldane’s Sieve). Even if you don’t have a population genetic and genomic background the gist is entirely accessible. If you do have a population genetic and genomic background and haven’t used various packages such as STRUCTURE or EIGENSOFT yourself, I would recommend reading Lawson’s characterizations, as they are all spot on.
Also, if you have not, I recommend Lawson’s website for ChromoPainter and fineSTRUCTURE. The utility of these methods is outlined in the paper Inference of Population Structure using Dense Haplotype Data.
Kevin Mitchell of Wiring the Brain has a very long post up inveighing against the specter of eugenics. I don’t have a great deal of time to engage Kevin right now.* But in addition to Kevin’s post I highly recommend this episode of WBUR’s On Point. It has Steve Hsu on, and he articulates many of the positions that I myself hold. Steve’s work with BGI has triggered the latest discussion of eugenics thanks to Vice‘s sensational representation of the research project and its aims. But it’s a useful discussion to engage in, even if the starting point is a little unfortunate.
I will state, though, that Kevin’s argument seems to be predicated on the implicit assumption that his interlocutors hold to some sort of Platonic ideal of the most-perfect-human. There’s no such thing, obviously, and even those who sympathized with eugenic policies such as W. D. Hamilton rejected this notion at the end of the day. Rather, human traits are evaluated in terms of how they serve the flourishing of individuals and society according to understood values. Intelligence is generally assumed to benefit individuals, and, I believe, it benefits society as well through innovation. Innovation drives the productivity growth which is the foundation of our post-Malthusian age.
A paper on the genetics of the Roma (“Gypsies”), Reconstructing Roma History from Genome-Wide Data, has finally come out in a journal. It’s been on arXiv for a while, so nothing too surprising. But, reading through the paper I have to note one rather clear aspect for me: there is a crispness and detail to the way they outlined and integrated their methods into the results section. Unfortunately there is an obvious tendency in the pressure to publish for people to use methods and tools (which usually consist of software written by others which you use in a blackbox fashion) in a slapdash manner with an aim toward arriving at a publishable unit. Because of the specialization within science it seems one can entirely make it through peer review by using methods which signal that one does not really know what one is talking about. To give a concrete example, a year ago I was told about a phylogenetic package in moderate use which seems to basically be a “random number generator.” The fact that this package is used is a testament to the fact that many researchers who are not phylogeneticists simply reach for the nearest method at hand, and trust the results if they make some intuitive sense (presumably in this case they would simply report the results which were intelligible).
The ultimate future, I’m hoping, is for open data, open code, and open methods. When a shady or sketchy paper makes it through peer review there is now visible public anger which bubbles out of the scientific community, but the process of reproducing the results can still be tedious (see Arsenic life). This is less true in cases where the means are more computational. The only things stopping the process of science from operating more efficiently are human barriers (e.g., cultural norms, institutional barriers toward data release).
Bears are a big deal today. I’ve talked about this before, so I won’t belabor the point in this post. Rather, I want to persuade you that there’s a really interesting paper out in PLOS Genetics right now, Genomic Evidence for Island Population Conversion Resolves Conflicting Theories of Polar Bear Evolution. I know that seems like a mouthful, and despite the fact that I nodded to the reality that this is highly relevant in part because of policy concerns, the paper itself makes salient the reality that oftentimes we are confronted with the juxtaposition between useful abstractions and the empirical shape of the world. In this case the abstraction is that of species, the one taxonomic category which many people find to be a natural kind, so to speak. These sorts of confusions of our expectations are often highly informative. They illustrate the limits of our abstractions, and drive us toward more complex and/or elegant formalisms which are capable of modeling nature as it is, rather than as we wish it would be.
When I read Genome-Wide Diversity in the Levant Reveals Recent Structuring by Culture in PLoS Genetics last week, one of my thoughts was “where is the tree”? Thankfully all the data is online, so I simply ran TreeMix on it. After a number of runs I now understand perhaps why there is no figure emphasizing a tree. There just isn’t that much informative yield from what I can tell, though the basic inference from the paper is recapitulated. You can see the results in the figure above, from one of my TreeMix runs. Overall, what this paper reinforces is that there are sharp genetic distinctions across ethno-religious boundaries within the modern Middle East which confound attempts to use geography to predict variation.
Last month I noted that a paper on speculative inferences as to the phylogenetic origins of Australian Aborigines was hampered in its force of conclusions by the fact that the authors didn’t release the data to the public (more accurately, peers). There are likely political reasons for this in regards to Australian Aborigine data sets, so I don’t begrudge them this (Well, at least too much. I’d probably accept the result more myself if I could test drive the data set, but I doubt they could control the fact that the data had to be private). This is why when a new paper on a novel phylogenetic inference comes out I immediately control-f to see if they released their data. In regards to genome-wide association studies on medical population panels I can somewhat understand the need for closed data (even though anonymization obviates much of this), but I don’t see this rationale as relevant at all for phylogenetic data (if concerned one can remove particular functional SNPs).
This is a follow up to my post from yesterday. In case you care about the technical details (after I clean this stuff up I will put it on GitHub), I’m using R’s adehabitat package to create a 95% distribution contour after kernel density smoothing. The goal is to give you a better intuition about where the populations are dispersed across two dimensional visualizations of genetic variation.
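For readers who don’t use R, here is a stdlib-only Python sketch of the same idea (not the adehabitat code itself): estimate a 2-D Gaussian kernel density at each sample point, then keep the points whose density exceeds the 5th percentile, which traces out a ~95% highest-density region. The point cloud and bandwidth are made up for the example.

```python
import math
import random

random.seed(3)

# Hypothetical "population" cloud in a 2-D PCA/MDS space.
points = [(random.gauss(0, 1), random.gauss(0, 0.5)) for _ in range(500)]

def kde(p, sample, bw=0.3):
    """Gaussian kernel density estimate at point p over the sample."""
    x, y = p
    total = 0.0
    for sx, sy in sample:
        d2 = ((x - sx) ** 2 + (y - sy) ** 2) / (2 * bw * bw)
        total += math.exp(-d2)
    return total / (len(sample) * 2 * math.pi * bw * bw)

densities = sorted(kde(p, points) for p in points)
threshold = densities[int(0.05 * len(densities))]  # 5th-percentile density
inside = [p for p in points if kde(p, points) >= threshold]
print(len(inside) / len(points))  # ~0.95 of points fall inside the region
```

The contour adehabitat draws is just the level set of the density at that threshold; the fraction of points enclosed is what makes it a “95%” region.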
Thinking about how to plot text, I came up with a quick hack, which just used the initial data and found the median x and y position. That explains why some of the labels are shifted so: in populations with a huge range the label position is going to be sensitive to not being smoothed (if you know how to pull the centroid out of the kver object, tell!). I’ve given them colors and also used black. The latter actually seems to be clearer!
Note: This is not just for fun, as I plan to start rolling out results and methods from some of the data sets I have more regularly in the near future.
I’ve been thinking about how best to visualize PCA/MDS type of results, which allow for the two dimensional representation of genetic variation. Below are a few of my efforts with a data set I have. You can see the individuals in gray, but also ellipses which cover ~95% of the distribution of a given population.
Please click the images for a larger version. They represent coordinate 1 on the x axis and coordinate 2 on the y axis, derived from a multidimensional scaling of identity by state across individuals.
A reader points me to a talk given by David Reich at the Center for Human Genetic Research 2013 Retreat. One of the issues Reich brought up is old, but perhaps worth reemphasizing: due to endogamy many South Asians carry a higher load of recessive ailments. This is not due to recent inbreeding (which is barred by custom in many South Asian groups, which enforce kin-level exogamy), but long term genetic isolation. Over time even a moderate sized population can be affected by drift. This was one of the major points in the 2009 paper Reconstructing Indian History, but not one particularly emphasized in the press follow up. A major implication is that a relatively simple public health measure for South Asians would be to marry outside of their jati. The social or genetic distance need not be great. But one generation of outbreeding should “mask” many of the deleterious alleles. If this model is correct one should be able to track decreases in morbidity within the American South Asian population, where there are many inter-caste and inter-regional marriages (yes, this is between people of putative high status, but this doesn’t matter).
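The drift point can be put in rough numbers. In a closed population of effective size N, expected inbreeding relative to the founding generation accumulates as F_t = 1 − (1 − 1/(2N))^t. The population sizes and generation counts below are purely illustrative, not estimates for any particular jati.

```python
def inbreeding_after(N, generations):
    """Expected inbreeding coefficient after t generations of random
    mating within a closed population of effective size N."""
    return 1 - (1 - 1 / (2 * N)) ** generations

for N in (500, 2000, 10000):
    # ~80 generations is on the order of 2,000 years at ~25 years/generation
    print(f"N={N}: F after 80 generations = {inbreeding_after(N, 80):.3f}")
```

Even a moderately sized but strictly endogamous group accumulates non-trivial homozygosity this way, which is why one generation of out-marriage, by pairing up different sets of drifted recessives, is expected to mask much of the load.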
In my earlier posts where I gave a short intro to using Plink I distributed a data set termed PHLYO. One thing I did not mention is that I’ve also been running it on Admixture. But here’s an important point: I ran the data set 10 times from K = 2 to K = 15. Why? Because the algorithm produces somewhat different results on each run (if you use a different seed, which you should), and I wanted to not be biased by one particular result. Additionally, I also turned on cross-validation error, which gives me a better sense of which K’s to trust. But after I select the K which I want to visualize which replicate run will I then use to generate the bar plots? I won’t pick any specific one. Rather, I’ll merge them together with an off-the-shelf algorithm. Additionally, I also want to sort the individuals by their modal population cluster.
This sounds rather convoluted, and it is somewhat. I have a pipeline that I use, but it’s not too user friendly. One of my projects is to clean it up, document it, and publish it online. Though if you have your own pipeline all ready to go, please post it in the comments with a link! The general steps are as follows for me:
1) Convert Admixture Q files into Structure format, transform family identifications to numeric values, and generate a file with family identification and numeral pairs
2) Merge the results across runs using Clumpp
3) Sort the individual results within populations
4) Then use Distruct to produce an output file
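The merging in step 2 is the subtle part, so here is a stdlib-only sketch of what CLUMPP is doing under the hood: Admixture’s cluster labels are arbitrary per run, so before averaging the Q matrices we permute each run’s columns to best match a reference run. The tiny matrices below are made up for illustration; real Q files have one row per sample and K columns, and CLUMPP itself uses more sophisticated search strategies for large K.

```python
from itertools import permutations

def align(reference, run):
    """Return `run` with its columns permuted to best match `reference`
    (exhaustive search over permutations; fine for small K)."""
    K = len(reference[0])
    def score(perm):
        # similarity = negative sum of squared differences
        return -sum((reference[i][k] - run[i][perm[k]]) ** 2
                    for i in range(len(reference)) for k in range(K))
    best = max(permutations(range(K)), key=score)
    return [[row[best[k]] for k in range(K)] for row in run]

def merge_runs(runs):
    """Align every run to the first one, then average the Q matrices."""
    aligned = [runs[0]] + [align(runs[0], r) for r in runs[1:]]
    n, K = len(runs[0]), len(runs[0][0])
    return [[sum(a[i][k] for a in aligned) / len(aligned) for k in range(K)]
            for i in range(n)]

# Two runs on three individuals at K = 2; labels are swapped in run 2.
run1 = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
run2 = [[0.1, 0.9], [0.7, 0.3], [0.8, 0.2]]
print(merge_runs([run1, run2]))
```

Without the alignment step, naive averaging of label-switched runs would smear every individual toward 50/50, which is exactly the artifact merging across replicates is meant to avoid.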
Before I show you the resultant bar plot, here are the cross-validation results with standard deviation ticks: