Last week Luke Jostins (soon to be Dr. Luke Jostins) published an interesting paper in Nature. To be fair, this paper has an extensive author list, but from what I understand it is the fruit of the first author’s Ph.D. project. In any case, you may know Luke because I have used his loess curve on hominin encephalization for years. His bread & butter is statistical genetics, and it shows in this Nature paper. God knows how he managed to cram so much density into ~5.5 pages of plain text. Luke is also a contributor to Genomes Unzipped, and has put up a post over there on one implication of the paper, Dozens of new IBD genes, but can they predict disease? The short answer is that for individual prediction complex traits are going to be a hard haul over the long term.*
They are subject to what Jim Manzi would term “high causal density.” A simple way to state this is that outcome X is dependent on a host of variables, and if you capture only a small number of variables, you aren’t going to be explaining much in a general fashion. This is obvious from the text of Luke’s paper. Let’s look at the abstract, Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease:
In science, like most things, one prefers simple over complex whenever possible. You keep adding variables until the explanatory juice starts hitting diminishing marginal returns. So cystic fibrosis is due to a mutation at one gene, and the disease expresses recessively at that locus. The reality is that one mutation accounts for ~65-70% of cystic fibrosis cases around the world, and there are nearly 1,400 known mutations on the CFTR locus. How about skin color? Mutations on a dozen genes can probably explain ~90% of the variance in the trait value across the world between populations. In fact, one single mutation on one base pair can explain ~30-40% of the trait value difference between Europeans and Africans. This is a more complex story than cystic fibrosis; you have not just many mutations, but many mutations across many genes. But the number of genes and mutations is manageable. You can keep track of most of them in your head (e.g., I can tell you that SLC24A5, SLC45A2, KITLG, and HERC2 can explain most of the trait value difference between Africans and Europeans without looking it up).
Now think about something like height. The only gene I can think of off the top of my head is HMGA2. With obesity I know FTO. The reason is that there’s a veritable alphabet soup of genes which pop out of the numerous studies focusing on these traits. But the reality is that it seems possible that there are many genes which harbor variants of small effect size which in totality account for the range of the trait value. Abstractly this isn’t really that much more complex than the models above. You can imagine it as a concrete instantiation of the central limit theorem. But in practice it does change things when you can’t focus on one gene, or a few genes, but have to understand that there exists a huge class of genetic causes which modulate the expression of the phenotype.
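To make the central limit theorem point concrete, here is a minimal sketch in Python. It is a toy model, not anything from the papers discussed: the function name `simulate_trait`, the 200 loci, the allele frequency of 0.5, the equal effect sizes, and the absence of any environmental noise are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_trait(n_people=2000, n_loci=200, freq=0.5, effect=1.0, seed=0):
    """Toy additive model: each person's trait is the sum of many small effects.

    Assumptions (illustrative only): loci are independent, every locus has the
    same allele frequency and the same effect size, and there is no
    environmental contribution.
    """
    rng = random.Random(seed)
    traits = []
    for _ in range(n_people):
        total = 0.0
        for _ in range(n_loci):
            # Each person carries 0, 1, or 2 copies of the trait-raising allele.
            copies = (rng.random() < freq) + (rng.random() < freq)
            total += effect * copies
        traits.append(total)
    return traits


traits = simulate_trait()
mean = statistics.mean(traits)
sd = statistics.pstdev(traits)

# For a bell-shaped distribution, roughly two thirds of people fall within
# one standard deviation of the mean.
within_one_sd = sum(abs(t - mean) <= sd for t in traits) / len(traits)

print(f"mean={mean:.1f} sd={sd:.1f} within one sd={within_one_sd:.2f}")
```

Under these assumptions each locus contributes the same sliver of variance, so any single gene accounts for only 1/200 of the total. That is the practical problem in miniature: the trait looks smooth and well behaved in aggregate, yet no individual locus tells you much.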
We’ve reached a stage where the mapping from genotype to phenotype is getting a bit on the baroque side. We have come to confront and wrestle with ‘genetic architecture.’ Here’s what Wikipedia says about this term:
PLoS Biology has four items of great interest out today:
- Synthetic Associations Created by Rare Variants Do Not Explain Most GWAS Results
- Synthetic Associations Are Unlikely to Account for Many Common Disease Genome-Wide Association Signals
- The Importance of Synthetic Associations Will Only Be Resolved Empirically
- Common Disease: Are Causative Alleles Common or Rare?
These are a response to last year’s paper on synthetic associations from the Goldstein lab. Here’s a critique of that paper. I plan on reviewing the first in the list above soon. #3 is a response to #1 and #2 from David Goldstein, while #4 is a summation more aimed at the general audience.
I recall projections in the early 2000s that 25% of the American population would be employed as systems administrators circa 2020 if rates of employment growth at that time were extrapolated. Obviously the projections weren’t taken too seriously, and the pieces were generally making fun of the idea that IT would reduce labor inputs and increase productivity. I thought back to those earlier articles when I saw a new letter in Nature in my RSS feed this morning, Hundreds of variants clustered in genomic loci and biological pathways affect human height:
Most common human traits and diseases have a polygenic pattern of inheritance: DNA sequence variants at many genetic loci influence the phenotype. Genome-wide association (GWA) studies have identified more than 600 variants associated with human traits [1], but these typically explain small fractions of phenotypic variation, raising questions about the use of further studies. Here, using 183,727 individuals, we show that hundreds of genetic variants, in at least 180 loci, influence adult height, a highly heritable and classic polygenic trait [2, 3]. The large number of loci reveals patterns with important implications for genetic studies of common human diseases and traits. First, the 180 loci are not random, but instead are enriched for genes that are connected in biological pathways (P = 0.016) and that underlie skeletal growth defects (P < 0.001). Second, the likely causal gene is often located near the most strongly associated variant: in 13 of 21 loci containing a known skeletal growth gene, that gene was closest to the associated variant. Third, at least 19 loci have multiple independently associated variants, suggesting that allelic heterogeneity is a frequent feature of polygenic traits, that comprehensive explorations of already-discovered loci should discover additional variants and that an appreciable fraction of associated loci may have been identified. Fourth, associated variants are enriched for likely functional effects on genes, being over-represented among variants that alter amino-acid structure of proteins and expression levels of nearby genes. Our data explain approximately 10% of the phenotypic variation in height, and we estimate that unidentified common variants of similar effect sizes would increase this figure to approximately 16% of phenotypic variation (approximately 20% of heritable variation).
Although additional approaches are needed to dissect the genetic architecture of polygenic human traits fully, our findings indicate that GWA studies can identify large numbers of loci that implicate biologically relevant genes and pathways.
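The percentages at the end of the abstract are easier to follow with the arithmetic spelled out. The sketch below uses only the three figures quoted there (10%, 16%, and 20%). The heritability it backs out is my inference from those rounded numbers, not a value stated in the abstract, and the function name `implied_heritability` is made up for illustration.

```python
def implied_heritability(phenotypic_fraction, heritable_fraction):
    """Back out heritability from the same quantity expressed two ways.

    If variants explain `phenotypic_fraction` of total phenotypic variation
    and that equals `heritable_fraction` of the heritable variation, then
    heritability = phenotypic_fraction / heritable_fraction.
    """
    return phenotypic_fraction / heritable_fraction


# Figures quoted in the abstract (all approximate).
explained_now = 0.10        # phenotypic variation explained by the data
explained_projected = 0.16  # with unidentified variants of similar effect size
projected_heritable = 0.20  # the same projection, as a share of heritable variation

# Inferred, not stated in the abstract: comes out to about 0.8.
h2 = implied_heritability(explained_projected, projected_heritable)

# Share of heritable variation explained so far: about one eighth.
found_share = explained_now / h2

print(f"implied heritability ~ {h2:.2f}")
print(f"heritable variation explained so far ~ {found_share:.3f}")
```

In other words, if those rounded figures are taken at face value, roughly seven eighths of the heritable variation in height is still unaccounted for by the variants in hand.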
The supplements run to nearly 100 pages, and the author list is enormous. But at least the supplements are free to all, so you should check them out. There are a few sections of the paper proper that are worth passing on though if you can’t get beyond the paywall.
It looks like Genomes Unzipped has their own Mortimer Adler, with an excellent posting, How to read a genome-wide association study. For those outside the biz I suspect that #4, replication, is going to be the easiest. In the early 2000s a biologist who’d been in the business for a while cautioned about reading too much into early association results which were sexy, as the same had occurred when linkage studies were all the vogue, but replication was not to be. Goes to show that history of science can be useful on a very pragmatic level. It can give you a sense of perspective on the evanescent impact of some techniques over the long run.