Back in 2001, the Human Genome Project gave us a nigh-complete readout of our DNA. Somehow, those As, Gs, Cs, and Ts contained the full instructions for making one of us, but they were hardly a simple blueprint or recipe book. The genome was there, but we had little idea about how it was used, controlled or organised, much less how it led to a living, breathing human.
That gap has just got a little smaller. A massive international project called ENCODE – the Encyclopedia Of DNA Elements – has moved us from “Here’s the genome” towards “Here’s what the genome does”. Over the last 10 years, an international team of 442 scientists have assailed 147 different types of cells with 24 types of experiments. Their goal: catalogue every letter (nucleotide) within the genome that does something. The results are published today in 30 papers across three different journals, and more.
For years, we’ve known that only 1.5 percent of the genome actually contains instructions for making proteins, the molecular workhorses of our cells. But ENCODE has shown that the rest of the genome – the non-coding majority – is still rife with “functional elements”. That is, it’s doing something.
It contains docking sites where proteins can stick and switch genes on or off. Or it is read and ‘transcribed’ into molecules of RNA. Or it controls whether nearby genes are transcribed (promoters; more than 70,000 of these). Or it influences the activity of other genes, sometimes across great distances (enhancers; more than 400,000 of these). Or it affects how DNA is folded and packaged. Something.
According to ENCODE’s analysis, 80 percent of the genome has a “biochemical function”. More on exactly what this means later, but the key point is: It’s not “junk”. Scientists have long recognised that some non-coding DNA has a function, and more and more solid examples have come to light [edited for clarity – Ed]. But many maintained that much of this DNA was, indeed, junk. ENCODE says otherwise. “Almost every nucleotide is associated with a function of some sort or another, and we now know where they are, what binds to them, what their associations are, and more,” says Tom Gingeras, one of the study’s many senior scientists.
And what’s in the remaining 20 percent? Possibly not junk either, according to Ewan Birney, the project’s Lead Analysis Coordinator and self-described “cat-herder-in-chief”. He explains that ENCODE only (!) looked at 147 types of cells, and the human body has a few thousand. A given part of the genome might control a gene in one cell type, but not others. If every cell is included, functions may emerge for the phantom proportion. “It’s likely that 80 percent will go to 100 percent,” says Birney. “We don’t really have any large chunks of redundant DNA. This metaphor of junk isn’t that useful.”
That the genome is complex will come as no surprise to scientists, but ENCODE does two fresh things: it catalogues the DNA elements for scientists to pore over; and it reveals just how many there are. “The genome is no longer an empty vastness – it is densely packed with peaks and wiggles of biochemical activity,” says Shyam Prabhakar from the Genome Institute of Singapore. “There are nuggets for everyone here. No matter which piece of the genome we happen to be studying in any particular project, we will benefit from looking up the corresponding ENCODE tracks.”
There are many implications, from redefining what a “gene” is, to providing new clues about diseases, to piecing together how the genome works in three dimensions. “It has fundamentally changed my view of our genome. It’s like a jungle in there. It’s full of things doing stuff,” says Birney. “You look at it and go: ‘What is going on?’ Does one really need to make all these pieces of RNA? It feels verdant with activity but one struggles to find the logic for it.”
Think of the human genome as a city. The basic layout, tallest buildings and most famous sights are visible from a distance. That’s where we got to in 2001. Now, we’ve zoomed in. We can see the players that make the city tick: the cleaners and security guards who maintain the buildings, the sewers and power lines connecting distant parts, the police and politicians who oversee the rest. That’s where we are now: a comprehensive 3-D portrait of a dynamic, changing entity, rather than a static, 2-D map.
And just as London is not New York, different types of cells rely on different DNA elements. For example, of the roughly 3 million locations where proteins stick to DNA, just 3,700 are commonly used in every cell examined. Liver cells, skin cells, neurons, embryonic stem cells… all of them use different suites of switches to control their lives. Again, we knew this would be so. Again, it’s the scale and the comprehensiveness that matter.
“This is an important milestone,” says George Church, a geneticist at the Harvard Medical School. His only gripe is that ENCODE’s cell lines came from different people, so it’s hard to say if differences between cells are consistent differences, or simply reflect the genetics of their owners. Birney explains that in other studies, the differences between cells were greater than the differences between people, but Church still wants to see ENCODE’s analyses repeated with several types of cell from a small group of people, healthy and diseased. That should be possible since “the cost of some of these [tests] has dropped a million-fold,” he says.
The next phase is to find out how these players interact with one another. What does the 80 percent do (if, genuinely, anything)? If it does something, does it do something important? Does it change something tangible, like a part of our body, or our risk of disease? If it changes, does evolution care?
[Update 07/09 23:00 Indeed, to many scientists, these are the questions that matter, and ones that ENCODE has dodged through a liberal definition of “functional”. That, say the critics, critically weakens its claims of having found a genome rife with activity. Most of ENCODE’s “functional elements” are little more than sequences being transcribed to RNA, with little heed to their physiological or evolutionary importance. These include repetitive remains of genetic parasites that have copied themselves ad infinitum, the corpses of dead and once-useful genes, and more.
To include all such sequences within the bracket of “functional” sets a very low bar. Michael Eisen from the Howard Hughes Medical Institute described ENCODE’s definition as a “meaningless measure of functional significance” and Leonid Kruglyak from Princeton University noted that it’s “barely more interesting” than saying that a sequence gets copied (which all of them are). To put it more simply: our genomic city’s got lots of new players in it, but they may largely be bums.
This debate is unlikely to quieten any time soon, although some of the heaviest critics of ENCODE’s “junk” DNA conclusions have still praised its nature as a genomic parts list. For example, T. Ryan Gregory from Guelph University contrasts their discussions on junk DNA to a classic paper from 1972, and concludes that they are “far less sophisticated than what was found in the literature decades ago.” But he also says that ENCODE provides “the most detailed overview of genome elements we’ve ever seen and will surely lead to a flood of interesting research for many years to come.” And Michael White from Washington University in St. Louis said that the project had achieved “an impressive level of consistency and quality for such a large consortium.” He added, “Whatever else you might want to say about the idea of ENCODE, you cannot say that ENCODE was poorly executed.” ]
Where will it lead us? It’s easy to get carried away, and ENCODE’s scientists seem wary of the hype-and-backlash cycle that befell the Human Genome Project. Much was promised at its unveiling, by both the media and the scientists involved, including medical breakthroughs and a clearer understanding of our humanity. The ENCODE team is being more cautious. “This idea that it will lead to new treatments for cancer or provide answers that were previously unknown is at least partially true,” says Gingeras, “but the degree to which it will successfully address those issues is unknown.”
“We are the most complex things we know about. It’s not surprising that the manual is huge,” says Birney. “I think it’s going to take this century to fill in all the details. That full reconciliation is going to be this century’s science.”
Find out more about ENCODE:
So, that 80 percent figure… Let’s build up to it.
We know that 1.5 percent of the genome codes for proteins. That much is clearly functional and we’ve known that for a while. ENCODE also looked for places in the genome where proteins stick to DNA – sites where, most likely, the proteins are switching a gene on or off. They found 4 million such switches, which together account for 8.5 percent of the genome.* (Birney: “You can’t move for switches.”) That’s already higher than anyone was expecting, and it sets a pretty conservative lower bound for the part of the genome that definitively does something.
In fact, because ENCODE hasn’t looked at every possible type of cell or every possible protein that sticks to DNA, this figure is almost certainly too low. Birney’s estimate is that it’s out by half. This means that the total proportion of the genome that either creates a protein or sticks to one is around 20 percent.
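The arithmetic behind that 20 percent can be written out as a quick sketch. The two input figures are the ones quoted above; the variable names are purely illustrative, and reading “out by half” as “the measured figure is half the true one” is an assumption.

```python
# Figures quoted in the text, as percentages of the genome.
protein_coding = 1.5    # codes for proteins
protein_binding = 8.5   # places where proteins stick to DNA

# What has actually been measured so far.
measured = protein_coding + protein_binding

# Assumption: "out by half" means the measured figure is half the true one.
estimated = measured * 2

print(f"measured: {measured} percent, estimated: {estimated} percent")
```

Under that reading, the measured 10 percent doubles to the 20 percent that Birney quotes.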
To get from 20 to 80 percent, we include all the other elements that ENCODE looked for – not just the sequences that have proteins latched onto them, but those that affect how DNA is packaged and those that are transcribed at all. Birney says, “[That figure] best conveys the difference between a genome made mostly of dead wood and one that is alive with activity.” [Update 5/9/12 23:00: For Birney’s own, very measured, take on this, check out his post. ]
That 80 percent covers many classes of sequence that were thought to be essentially functionless. These include introns – the parts of a gene that are cut out at the RNA stage, and don’t contribute to a protein’s manufacture. “The idea that introns are definitely deadweight isn’t true,” says Birney. The same could be said for our many repetitive sequences: small chunks of DNA that have the ability to copy themselves, and are found in large, recurring chains. These are typically viewed as parasites, which duplicate themselves at the expense of the rest of the genome. Or are they?
The youngest of these sequences – those that have copied themselves only recently in our history – still pose a problem for ENCODE. But many of the older ones, the genomic veterans, fall within the “functional” category. Some contain sequences where proteins can bind, and influence the activity of nearby genes. Perhaps their spread across the genome represents not the invasion of a parasite, but a way of spreading control. “These parasites can be subverted sometimes,” says Birney.
He expects that many skeptics will argue about the 80 percent figure, and the definition of “functional”. But he says, “No matter how you cut it, we’ve got to get used to the fact that there’s a lot more going on with the genome than we knew.”
[Update 07/09 23:00 Birney was right about the scepticism. Gregory says, “80 percent is the figure only if your definition is so loose as to be all but meaningless.” Larry Moran from the University of Toronto adds, “‘Functional’ simply means a little bit of DNA that’s been identified in an assay of some sort or another. That’s a remarkably silly definition of function and if you’re using it to discount junk DNA it’s downright disingenuous.”
This is the main criticism of ENCODE thus far, repeated across many blogs and touched on in the opening section of this post. There are other concerns. For example, White notes that many DNA-binding proteins recognise short sequences that crop up all over the genome just by chance. The upshot is that you’d expect many of the elements that ENCODE identified if you just wrote out a random string of As, Gs, Cs, and Ts. “I’ve spent the summer testing a lot of random DNA,” he tweeted. “It’s not hard to make it do something biochemically interesting.”
Gregory asks why, if ENCODE is right and our genome is full of functional elements, does an onion have around five times as much non-coding DNA as we do? Or why pufferfishes can get by with just a tenth as much? Birney says the onion test is silly. While many genomes have a tight grip upon their repetitive jumping DNA, many plants seem to have relaxed that control. Consequently, their genomes have bloated in size (bolstered by the occasional mass doubling). “It’s almost as if the genome throws in the towel and goes: Oh sod it, just replicate everywhere.” Conversely, the pufferfish has maintained an incredibly tight rein upon its jumping sequences. “Its genome management is pretty much perfect,” says Birney. Hence: the smaller genome.
But Gregory thinks that these answers are a dodge. “I would still like Birney to answer the question. How is it that humans ‘need’ 100% of their non-coding DNA, but a pufferfish does fine with 1/10 as much [and] a salamander has at least 4 times as much?” [I think Birney is writing a post on this, so expect more updates as they happen, and this post to balloon to onion proportions].]
[Update 07/09/12 11:00: The ENCODE reactions have come thick and fast, and Brendan Maher has written the best summary of them. I’m not going to duplicate his sterling efforts. Head over to Nature’s blog for more.]
* (A cool aside: John Stamatoyannopoulos from the University of Washington mapped these protein-DNA contacts by looking for “footprints” where the presence of a protein shields the underlying DNA from a “DNase” enzyme that would otherwise slice through it. The resolution is incredible! Stamatoyannopoulos could “see” every nucleotide that’s touched by a protein – not just a footprint, but each of its toes too. Joe Ecker from the Salk Institute thinks we should eventually be able to “dynamically footprint a cellular response”. That is, expose a cell to something – maybe a hormone or a toxin – and check its footprints over time. You can cross-reference those sites to the ENCODE database, and reconstruct what’s going on in the cell just by “watching” the shadows of proteins as they descend and lift off.)
The simplistic view of a gene is that it’s a stretch of DNA that is transcribed to make a protein. But each gene can be transcribed in different ways, and the transcripts overlap with one another. They’re like choose-your-own-adventure books: you can read them in different orders, start and finish at different points, and leave out chunks altogether.
Fair enough: We can say that the “gene” starts at the start of the first transcript, and ends at the end of the final transcript. But ENCODE’s data complicates this definition. There are a lot of transcripts, probably more than anyone had realised, and some connect two previously unconnected genes. The boundaries for those genes widen, and the gaps between them shrink or disappear.
Gingeras says that this “intergenic” space has shrunk by a factor of four. “A region that was once called Gene X is now melded to Gene Y.” Imagine discovering that every book in the library has a secret appendix that’s also the foreword of the book next to it.
These bleeding boundaries seem familiar. Bacteria have them: Their genes are cramped together in a miracle of effective organisation, packing in as much information as possible into a tiny genome. Viruses epitomise such genetic economy even better. I suggested that comparison to Gingeras. “Exactly!” he said. “Nature never relinquished that strategy.”
Bacteria and viruses can get away with smooshing their protein-encoding genes together. But we not only have more proteins, we also need a vast array of sequences to control when, where and how they are deployed. Those elements need space too. Ignore them, and it looks like we have a flabby genome with sequence to spare. Understand them, and our own brand of economical packaging becomes clear. (However, Birney adds, “In bacteria and viruses, it’s all elegant and efficient. At the moment, our genome just seems really, really messy. There’s this much higher density of stuff, but for me, emotionally it doesn’t have that elegance that we see in a bacterial genome.”)
Given these blurred boundaries, Gingeras thinks that it no longer makes sense to think of a gene as a specific point in the genome, or as its basic unit. Instead, that honour falls to the transcript, made of RNA rather than DNA. “The atom of the genome is the transcript,” says Gingeras. “They are the basic unit that’s affected by mutation and selection.” A “gene” then becomes a collection of transcripts, united by some common factor.
There’s something poetic about this. Our view of the genome has long been focused on DNA. It’s the thing the genome project was deciphering. It is converted into RNA, giving it a more fundamental flavour. But out of those two molecules, RNA arrived on the planet first. It was copying itself and evolving long before DNA came on the scene. “These studies are pointing us back in that direction,” says Gingeras. They recognise RNA’s role, not as simply an intermediary between DNA and proteins, but something more primary.
For the last decade, geneticists have run a seemingly endless stream of “genome-wide association studies” (GWAS), attempting to understand the genetic basis of disease. They have thrown up a long list of SNPs – variants at specific DNA letters – that correlate with the risk of different conditions.
The ENCODE team have mapped all of these to their data. They found that just 12 percent of the SNPs lie within protein-coding areas. They also showed that compared to random SNPs, the disease-associated ones are 60 percent more likely to lie within functional, non-coding regions, especially in promoters and enhancers. This suggests that many of these variants are controlling the activity of different genes, and provides many fresh leads for understanding how they affect our risk of disease. “It was one of those too good to be true moments,” says Birney. “Literally, I was in the room [when they got the result] and I went: Yes!”
Imagine a massive table. Down the left side are all the diseases that people have done GWAS studies for. Across the top are all the possible cell types and transcription factors (proteins that control how genes are activated) in the ENCODE study. Are there hotspots? Are there SNPs that correspond to both? Yes. Lots, and many of them are new.
Take Crohn’s disease, a type of bowel disorder. The team found five SNPs that increase the risk of Crohn’s, and that are recognised by a group of transcription factors called GATA2. “That wasn’t something that the Crohn’s disease biologists had on their radar,” says Birney. “Suddenly we’ve made an unbiased association between a disease and a piece of basic biology.” In other words, it’s a new lead to follow up on.
“We’re now working with lots of different disease biologists looking at their data sets,” says Birney. “In some sense, ENCODE is working from the genome out, while GWAS studies are working from disease in.” Where they meet, there is interest. So far, the team have identified 400 such hotspots that are worth looking into. Of these, between 50 and 100 were predictable. Some of the rest make intuitive sense. Others are head-scratchers.
Writing the genome out as a string of letters invites a common fallacy: that it’s a two-dimensional, linear entity. It’s anything but. DNA is wrapped around proteins called histones like beads on a string. These are then twisted, folded and looped in an intricate three-dimensional way. The upshot is that parts of the genome that look distant when you write the sequences out can actually be physical neighbours. And this means that some switches can affect the activity of faraway genes.
Job Dekker from the University of Massachusetts Medical School has now used ENCODE data to map these long-range interactions across just 1 percent of the genome in three different types of cell. He discovered more than 1,000 of them, where switches in one part of the genome were physically reaching over and controlling the activity of a distant gene. “I like to say that nothing in the genome makes sense, except in 3D,” says Dekker. “It’s really a teaser for the future of genome science.”
Gingeras agrees. He thinks that understanding these 3-D interactions will add another layer of complexity to modern genetics, and extending this work to the rest of the genome, and other cell types, is a “next clear logical step”.
ENCODE is vast. The results of this second phase have been published in 30 central papers in Nature, Genome Biology and Genome Research, along with a slew of secondary articles in Science, Cell and others. And all of it is freely available to the public.
The pages of printed journals are a poor repository for such a vast trove of data, so the ENCODE team have devised a new publishing model. In the ENCODE portal site, readers can pick one of 13 topics of interest, and follow them in special “threads” that link all the papers. Say you want to know about enhancer sequences. The enhancer thread pulls out all the relevant paragraphs from the 30 papers across the three journals. “Rather than people having to skim read all 30 papers, and working out which ones they want to read, we pull out that thread for you,” says Birney.
And yes, there’s an app for that.
Transparency is a big issue too. “With these really intensive science projects, there has to be a huge amount of trust that data analysts have done things correctly,” says Birney. But you don’t have to trust. At least half the ENCODE figures are interactive, and the data behind them can be downloaded. The team have also built a “Virtual Machine” – a downloadable package of the almost-raw data and all the code in the ENCODE analyses. Think of it as the most complete Methods section ever. With the virtual machine, “you can absolutely replay step by step what we did to get to the figure,” says Birney. “I think it should be the standard for the future.”
Find out more about ENCODE:
Compilation of other ENCODE coverage
Note: Serious concerns have been raised about the conclusions of this study. I’ve written a summary of the backlash in a separate post.
Arsenic isn’t exactly something you want to eat. It has a deserved reputation as a powerful poison. It has been used as a murder weapon and it contaminates the drinking water of millions of people. It’s about as antagonistic to life as a chemical can get. But in California’s Mono Lake, Felisa Wolfe-Simon has discovered bacteria that not only shrug off arsenic’s toxic effects, but positively thrive on it. They can even incorporate the poisonous element into their proteins and DNA, using it in place of phosphorus.
Out of the hundred-plus elements in existence, life is mostly made up of just six: carbon, hydrogen, oxygen, nitrogen, sulphur and phosphorus. This elite clique is meant to be irreplaceable. But the Mono Lake bacteria may have broken their dependence on one of the group – phosphorus – by swapping it for arsenic. If that’s right, they would be the only known living things to do this.
Even extinction and the passing of millennia are no barriers to clever geneticists. In the past few years, scientists have managed to sequence the complete genome of a prehistoric human and produced “first drafts” of the mammoth and Neanderthal genomes. More controversially, some groups have even recovered DNA from dinosaurs. Now, a variety of extinct birds join the ancient DNA club including the largest that ever lived – Aepyornis, the elephant bird.
In a first for palaeontology, Charlotte Oskam from Murdoch University, Perth, extracted DNA from 18 fossil eggshells, either directly excavated or taken from museum collections. Some came from long-deceased members of living species including the emu, an owl and a duck. Others belonged to extinct species including Madagascar’s 3-metre tall elephant bird and the giant moas of New Zealand. A few of these specimens are just a few centuries old, but the oldest came from an emu that lived 19,000 years ago.
It turns out that bird eggshells are an excellent source of ancient DNA. They’re made of a protein matrix that is loaded with DNA and surrounded by crystals of calcium carbonate. The structure shelters the DNA and acts as a barrier to oxygen and water, two of the major contributors to DNA damage. Eggshells also stop microbes from growing and it seems that ancient ones still do the same. Oskam found that the fossil shells had around 125 times less bacterial DNA than bones of the same species did.
This is important – bacteria are a major problem for attempts to extract ancient DNA and they force scientists to search for uncontaminated sources, like frozen hair. Eggshells, it seems, provide similarly bacteria-free samples. Still, Oskam’s team took every precaution to prevent contamination. They used clean rooms and many control samples. Many of their sequences, like those of Aepyornis, were checked by two independent laboratories.
The Aepyornis sequences are particularly encouraging because many scientists have previously tried to extract DNA from the bones of this giant and failed. Eggshells seem like a more promising source and it certainly helps that the eggs of many of these giant species were massive and thick. But Oskam did also recover DNA from a fossil duck egg, which suggests that it should be possible to sequence the genes of even small extinct birds, like the dodo.
All of our cells are staffed by armies of executioners. They are usually restrained but when unleashed, they can set off a fatal chain reaction that kills the cell. This suicide squad does away with billions of cells every day. It helps to balance the production of new cells with the loss of old ones, to sculpt growing tissues and to destroy potential cancer cells.
But a new study suggests that the executioners aren’t always lethal. In fact, they’re essential for life. Through the unorthodox method of damaging our DNA, they can actually activate important genes. This technique for switching genes on is new to science but it’s apparently vital for allowing some types of stem cell to produce new types of tissue.
Stem cells are bundles of untapped potential, with the ability to produce hundreds of specialist cells across the body. This process is called differentiation. Its details vary depending on which type of cell is being produced, but scientists have recently found that some aspects are apparently common to all tissues, be they muscle, blood or bone. Surprisingly, one of these is the recruitment of executioner proteins – caspases.
Caspases cut up other proteins and in doing so, some of them produce yet more caspases. The result is a growing army of death, hacking and slashing its way through the cell. But one of these killers – caspase-3 – is a necessary part of differentiation. Get rid of it and, suddenly, stem cells can’t produce their specialised daughters. Now, thanks to Brian Larsen from the Sprott Centre for Stem Cell Research, we know why.
Meet “Inuk”. He is the ninth human to have their entire genome sequenced but unlike the previous eight, he has been dead for some 4,000 years. Even so, DNA samples from a tuft of his frozen hair have revealed much about his appearance and his ancestry.
Inuk had brown eyes and brown skin. His blood type was A+. His hair was thick and dark but had he lived, he might not have kept it – his genes reveal a high risk of baldness. Inuk may well have died quite young. Like many Asians and Native Americans, his front teeth were “shovel-graded”, meaning that their back faces had ridged sides and concave middles. We even know about his earwax – it was dry, again like many Asians and Native Americans, rather than the wet wax that dominates other ethnic groups.
Inuk is the singular of Inuit and it means “man”. He was one of the Saqqaq people, one of the first cultures to settle in the frozen north of the New World. Few of their remains have been found – all we have are four small tufts of hair and four small pieces of bone. So Inuk’s genome is a treasure trove of knowledge about this extinct Eskimo culture. His remains were discovered in Greenland in the 1980s and his genome has just been sequenced by a large team of scientists from 8 countries, led by Morten Rasmussen, Yingrui Li and Stinus Lindgreen.
This isn’t the first time that scientists have tried to sequence the genes of an ancient human (or related species). So far, the most successful result was a first draft of the Neanderthal genome based on bone and tooth samples. It comprises just 63% of the total genome, but even getting this much was a struggle. Ancient genomes aren’t easy to decipher. Even if enough tissue is preserved, it is often riddled with the DNA of fungi and bacteria. The very act of extracting the tissues often adds human DNA to the list of contaminants.
Scientists have developed ingenious workarounds to this problem, but Rasmussen’s team solved it by working with a well-frozen specimen and focusing on his hair. Hair is a rich source of DNA and it protects genomes from both damaging elements and contaminating microbes. It allowed scientists to sequence the genome of the woolly mammoth and it has now done the same for Inuk. Around 80% of the DNA recovered from a tuft of Inuk’s hair was human, with no evidence of modern contamination. After all, all the scientists who handled the samples were European and there weren’t any traces of European sequences in the deciphered genome.
Rasmussen’s group used next-generation sequencing technology to analyse the recovered DNA. These powerful techniques allowed them to sequence around 80% of the genome around 20 times. With such extensive coverage, they could be incredibly confident about exactly which sequences lay in each location. Eske Willerslev, who headed the group, says, “It’s comparable to a modern human genome in terms of quality.” For comparison, the Human Genome Project’s gold standard required that the entire genome should be sequenced just 10 times.
If you’ve ever put a pair of headphones in your pocket, you’ll know how difficult it is to keep a long cord in a bundle without getting it hopelessly tangled and knotted. You’ll also start to appreciate the monumental challenge that our cells face when packaging our DNA. At 2 metres in length, the human genome is longer than the average human. But in every one of our cells, the genome needs to fit inside the nucleus, a tiny compartment just 6 millionths of a metre long. How does it do it?
One of the secrets behind this monumental feat of folding has just been revealed by research that shows the human genome’s three-dimensional structure. A team of scientists led by Erez Lieberman-Aiden and Nynke van Berkum found that our genome folds into a shape called a “fractal globule”, where the long strands of DNA are densely packed but without a single knot. It’s an awe-inspiring feat of space-saving and keeps DNA accessible. When a particular gene is needed, the DNA it sits on can be easily unpacked.
Lieberman-Aiden explains, “The best way to think about it is that it looks like a pack of ramen noodles when you just start cooking them: really dense, but totally unentangled, so you can pull out a noodle or a bunch of noodles without disrupting the rest.” Previously, scientists suggested that the genome folds into a more tangled structure called the “equilibrium globule”, which is more like ramen noodles post-cooking – a massive knotted mess from which single noodles are difficult to extract.
Until now, the fractal globule was a theoretical shape that existed only in the minds of mathematicians. This is the first time that it has been observed in reality. The shape was first described by the mathematician Giuseppe Peano in 1890 and in 1988, Alexander Grosberg proposed that a long molecule might spontaneously fold into such a shape under the right conditions. Still, it took till this week for anyone to observe a fractal globule in reality. “[Peano] had no idea that it described any actual object in the universe,” says Lieberman-Aiden, “but it turns out it describes the genome!”
Some of the other tricks that cells use to fold the genome are well documented. At the most basic level, DNA is wrapped around proteins called histones, like a series of beads on a string. These are then twisted around each other to form a wider filament, like the individual strands of a piece of rope. Beyond that, things become less clear but this new study shows what happens at these higher levels.
Imagine a series of beads on a string. You gather clumps of beads and crumple them together into a globule, carefully avoiding any knots or crossovers. Every row of, say, five beads gets crumpled into a globule, every row of five globules gets crumpled together, and so on and so forth. The final result is a single ball – a “globule-of-globules-of-globules”.
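For readers who like to see the idea in code, here is a rough two-dimensional analogy for that crumpling recipe. It is not the model from the study: it uses a Hilbert curve, a close relative of Peano's curve, as a stand-in, and the function names (`hilbert_point`, `hilbert_path`, `is_compact_block`) are invented for this illustration. The string of beads becomes a path through a square grid that never crosses itself, and every aligned run of 4 beads, then 16 beads, and so on, stays bunched inside its own small square.

```python
def hilbert_point(order, d):
    """Map position d along the curve to (x, y) on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the sub-square so consecutive pieces join up.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_path(order):
    """Return every grid cell in the order the curve visits it."""
    return [hilbert_point(order, d) for d in range(4 ** order)]


def is_compact_block(points):
    """True if the points exactly fill a square (a 2-D 'globule')."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    side = max(xs) - min(xs) + 1
    return side == max(ys) - min(ys) + 1 and len(set(points)) == side * side


if __name__ == "__main__":
    path = hilbert_path(3)  # 64 "beads" on an 8 x 8 grid
    print("cells visited once:", len(set(path)) == len(path))
    print("runs of 4 are compact:",
          all(is_compact_block(path[i:i + 4]) for i in range(0, 64, 4)))
    print("runs of 16 are compact:",
          all(is_compact_block(path[i:i + 16]) for i in range(0, 64, 16)))
```

The real genome folds in three dimensions and far less tidily, so treat this as a cartoon of the "globule-of-globules" idea and nothing more.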
Lieberman-Aiden developed a technique called Hi-C that simultaneously analyses adjacent DNA across the entire genome, in order to reveal its 3-D shape. It relies on formaldehyde to immobilise pieces of DNA that sit next to each other, effectively freezing the genome and forming cross-links between adjacent strands. The DNA is then shredded and the cross-linked fragments are isolated, sequenced and mapped onto the reference copy of the human genome. The result is a library of all the DNA strands that were neighbours in the nucleus, which can be analysed to understand how the genome must be folded.
The technique confirmed that parts of the genome that would sit far apart if it were fully stretched out are actually very close to each other in space. Because of the complicated molecular origami that goes on inside the nucleus, around three quarters of the close-contact sequences identified by the Hi-C method lie far apart along the linear sequence.
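The bookkeeping at the end of that process can be sketched in a few lines. This is a toy version under loud assumptions, not the team's actual pipeline: the read pairs are invented coordinates on a single imaginary chromosome, and `contact_map`, `long_range_fraction`, the bin size and the distance cut-off are all hypothetical names and values chosen for illustration. Each pair stands for two fragments that were cross-linked together, already mapped back to positions on the reference genome.

```python
from collections import Counter


def contact_map(pairs, bin_size):
    """Count cross-linked pairs falling into each pair of genomic bins."""
    counts = Counter()
    for pos_a, pos_b in pairs:
        # Sort so that (bin 4, bin 0) and (bin 0, bin 4) count as the same contact.
        key = tuple(sorted((pos_a // bin_size, pos_b // bin_size)))
        counts[key] += 1
    return counts


def long_range_fraction(pairs, min_separation):
    """Fraction of pairs whose two ends lie far apart along the sequence."""
    if not pairs:
        raise ValueError("need at least one read pair")
    distant = sum(1 for a, b in pairs if abs(a - b) >= min_separation)
    return distant / len(pairs)


# Invented toy data: (position of one end, position of the other end).
TOY_PAIRS = [(1_200, 1_900), (3_000, 95_000), (40_500, 2_100), (88_000, 12_000)]

if __name__ == "__main__":
    for bins, n in sorted(contact_map(TOY_PAIRS, bin_size=10_000).items()):
        print(bins, n)
    print(long_range_fraction(TOY_PAIRS, min_separation=10_000))
```

Bins that share many contacts were probably neighbours in the nucleus, whatever their distance along the chromosome, and that table of counts is the raw material for working out the fold.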
The research also confirmed that the nucleus is divided into two territories – an “ON” compartment where DNA is rich in genes, highly active and loosely packed, and an “OFF” compartment where DNA is gene-poor, largely inactive and densely packed for storage. Individual chromosomes snake in and out of these two compartments and when a given gene is activated, it moves from one to the other. It’s not clear what defines the boundaries between these two compartments, but Lieberman-Aiden suspects that these boundaries are very sharp.
As an example, Lieberman-Aiden used glow-in-the-dark molecules to tag four stretches of DNA called L1, L2, L3 and L4. They lie one after the other on chromosome 14, but in the nucleus, they pair up differently. L1 and L3 are typically found in the “ON” compartment and are always closer to each other than to L2. Meanwhile, L2 and L4 are closer to each other than to L3, and are usually found in OFF territory.
“A huge question in biology is how all the different cells in the body perform totally different functions when all of them have the same genome,” says Lieberman-Aiden. “This work suggests that the spatial arrangement of the genome in a particular nucleus is a big part of why different cells do different things.”
PPS: You may remember Erez from the irregular verbs paper that I recently reposted. Many thanks to Erez for the heads-up about the paper and the awesome ramen noodle analogy.
Reference: Science 10.1126/science.1181369
DNA is most famous as a store of genetic information, but Shawn Douglas from the Dana-Farber Cancer Institute has found a way to turn this all-important molecule into the equivalent of sculptor’s clay. Using a set of specially constructed DNA strands, his team has fashioned a series of minuscule sculptures, each just 20-40 nanometres in size. He has even sculpted works that assemble from smaller pieces, including a stunning icosahedron – a 20-sided three-dimensional cage, built from three merged parts.
Douglas’s method has more in common with block-sculpting than a mere metaphor. Sculptors will often start with a single, crystalline block that they hack away to reveal the shape of an underlying figure. Douglas does the same, at least on a computer. His starting block is a series of parallel tubes, each one representing a single DNA helix, arranged in a honeycomb lattice. By using a programme to remove sections of the block, he arrives at his design of choice.
With the basic structure set down, Douglas begins shaping his molecular clay. He builds a scaffold out of a single, long strand of DNA. For historical reasons, he uses the genome of the M13 virus. This scaffold strand is ‘threaded’ through all the tubes in the design with crossovers at specific points to give the structure some solidity. The twists and turns of the scaffold are then fixed in place by hundreds of shorter ‘staple’ strands, which prevent the scaffold from unfolding.
The sequences of both the scaffold and staple strands are tweaked so that the collection of DNA molecules will stick together in just the right way. Once all the strands are created, they’re baked together in one hotpot and slowly cooled over a week or so. During this time, the staples stick to predetermined parts of the scaffold and fold it into the right shape. The slow cooling process allows them to do this in the right way; faster drops in temperature produce more misshapen forms.
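The reason a staple sticks to a predetermined part of the scaffold is ordinary base pairing: A pairs with T, C pairs with G, and the two strands run in opposite directions. So the piece of a staple that grips a given stretch of scaffold is that stretch's reverse complement. Here is a minimal sketch of that one idea, with heavy caveats: the scaffold below is an invented 16-letter sequence and not the M13 genome, `reverse_complement` and `staple_for` are hypothetical helper names, and real design software also has to worry about crossover positions and helix geometry, which this ignores entirely.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the strand that pairs with seq, read in its own 5'-to-3' direction."""
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G and T")
    return seq.translate(COMPLEMENT)[::-1]


def staple_for(scaffold, regions):
    """Build one staple that grips several scaffold regions.

    regions is a list of (start, end) index pairs, end exclusive. The staple
    is the reverse complement of each region, joined in the order given.
    """
    return "".join(reverse_complement(scaffold[start:end]) for start, end in regions)


if __name__ == "__main__":
    toy_scaffold = "AAAACCCCGGGGTTTT"  # invented sequence, for illustration only
    # One staple bridging the first four and the third four letters,
    # pulling two separate parts of the scaffold together.
    print(staple_for(toy_scaffold, [(0, 4), (8, 12)]))
```

Because a single staple can pair with two or more separate stretches of the scaffold, it pins those stretches side by side, which is how hundreds of them can hold the folded shape together.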
The result is a series of six structures, which Douglas viewed under an electron microscope: a monolith, a square nut, a railed bridge, a slotted cross, a stacked cross and a genie bottle. These basic shapes illustrate the versatility of the nano-origami approach, and they can also be linked together to form larger structures. Using staples that bridge separate scaffolds, Douglas created a long chain of the stacked cross units. Most impressively of all, he made an icosahedron by fusing three distinct subunits.
Nocturnal animals face an obvious challenge: collecting enough light to see clearly in the dark. We know about many of their tricks. They have bigger eyes and wider pupils. They have a reflective layer behind their retina called the tapetum, which reflects any light that passes through back onto it. Their retinas are loaded with rod cells, which are more light-sensitive than the cone cells that allow for colour vision.
But they also have another, far less obvious adaptation – their rod cells pack their DNA in a special way that turns the nucleus of each cell into a light-collecting lens. This unconventional arrangement is shared by the rods of nocturnal mammals from mice to cats. But it’s completely opposite to the usual genome packaging in the rods of day-living animals like primates, pigs and squirrels, and indeed, in almost all other eukaryotic cells.
In our cells, massive lengths of DNA are packaged into small spaces by wrapping them around proteins. These DNA-protein unions are known as chromatin, and they come in two different forms. Euchromatin is lightly packed and resembles a string of beads. Wrapping DNA in this way puts it within easy reach of other proteins and allows its genes to be actively transcribed. But imagine scrunching up that string of beads and you get heterochromatin – a tight, condensed ball of repressed genes that proteins cannot reach.
The two forms of chromatin are found in different areas, with euchromatin spread throughout the nucleus and heterochromatin concentrated at its edges. That pattern is nigh-universal and it applies from amoebae to plants to animals. There are only a few exceptions to this rule, including a minority of single-celled species and, surprisingly, the rod cells in the eyes of nocturnal mammals. Now, Irina Solovei from the Ludwig-Maximilians University in Munich has found that this inverted distribution helps these species to see in the dark.