Will we ever… decipher everything about a life form based just on its DNA?

By Ed Yong | November 6, 2012 9:00 am

Here’s the 12th piece from my BBC column

In 2001, the Human Genome Project gave us an almost complete draft of the 3 billion letters in our DNA. We joined an elite club of species with their genome sequences, one that is growing with every passing month.

These genomes contain the information necessary for building their respective owners, but it’s information that we still struggle to parse. To date, no one can take the code from an organism’s genes and predict all the details of its shape, behaviour, development and physiology, the collection of traits known as its phenotype. And yet, the basis of those details is there, all captured in stretches of As, Cs, Gs and Ts. “Cells know pretty reliably how to do this,” says Leonid Kruglyak from Princeton University. “Every time you start with a chicken genome, you get a chicken, and every time you start with an elephant genome, you get an elephant.”

As our technologies and understanding advance, will we eventually be able to look at a pile of raw DNA sequence and glean all the workings of the organism it belongs to? Just as physicists can use the laws of mechanics to predict the motion of an object, can biologists use fundamental ideas in genetics and molecular biology to predict the traits and flaws of a body based solely on its genes? Could we pop a genome into a black box, and print out the image of a human? Or a fly? Or a mouse?

Not easily. In complex organisms, some traits can be traced back to specific genes. If, for instance, you’re looking at a specific variant of the MC1R gene, chances are you’ve got a mammal in front of you, and it has red hair. Indeed, people have predicted that some Neanderthals were red-heads for precisely this reason. “But beyond that, predicting [if something is] a mouse or a whale or an armadillo, we still wouldn’t do well,” says Kruglyak.

Bernhard Palsson from the University of California, San Diego agrees. “Sequencing a woolly mammoth will not predict its properties,” he says. “But you might be able to do a lot better with bacteria.” Their simpler and smaller genomes should, in theory, make it easier to predict the basic features of their metabolism, or whether they grow using oxygen or not. But even though we can sequence a bacterial genome in under a day, and for just £50, we would still struggle to determine important traits, like how good a disease-causing microbe is at infecting its host.

Even finding all the genes in a small genome is hard. Earlier this year, scientists discovered a new gene in a flu virus whose genome consists of just 14,000 letters (small enough to fit into 100 tweets), and had been sequenced again and again. It should be unsurprising that our own genome, with 3 billion letters, is full of errors and gaps, despite ostensibly being “complete”. In May, another group showed that the reference human genome is missing a gene that may have shaped the evolution of our large brains. “There’s no genome that is completely understood even in terms of the genes within it,” says Markus Covert from Stanford University. “Typically, no function is known for a fourth to a fifth of the genes.”

Genes encode the instructions for assembling proteins, molecular machines that perform vital jobs in our cells. A protein is a long chain of amino acids, and we can predict that chain with perfect precision. But the chain also folds, origami-like, into a complex three-dimensional shape, and the shape dictates everything that the protein does, from the chemical reactions it speeds up to the other molecules it sticks to. Discerning those shapes is laborious work, involving growing pure crystals of the proteins, and bombarding them with X-rays. Despite having hundreds of such structures, even the most powerful computers struggle to accurately compute a protein’s shape from the DNA sequence that produces it. “I see that challenge as the stifling one,” says Palsson.
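To see why the amino-acid chain is the easy part, here is a minimal sketch in Python of reading a gene three letters at a time and looking each triplet up in the standard genetic code. The function name `translate` and the short example sequence are made up for illustration, and the sketch assumes a clean coding sequence read from its first letter, with none of the complications of real genes. It says nothing about how the resulting chain folds.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Hypothetical helper: reads from the first letter and stops at the
    first stop codon. Letters other than A, C, G and T raise a KeyError.
    """
    seq = coding_dna.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up 12-letter example: ATG GCC AAA TGA
print(translate("ATGGCCAAATGA"))  # prints MAK
```

A lookup table is all it takes because each triplet of DNA letters maps to exactly one amino acid. No comparable table exists for turning the chain into a three-dimensional shape.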

Protein-coding genes make up just 1.5 percent of our genome. The rest includes a lot of what is thought to be useless junk with no discernible function. But it also contains regulatory sequences that control when, where and how our genes are used. We need to identify these if we’re ever to predict how a genome leads to a living, breathing organism. The technology for doing that is being developed, and the ENCODE project – the Encyclopaedia of DNA elements – has put it to good use, compiling a catalogue of the various regulatory sequences in our own genome. But ENCODE involved 442 scientists intensely running experiments for a decade, and even its unprecedented catalogue is incomplete.

And even if we have all this information—every gene, protein structure, and regulatory sequence—we’d still need to figure out how it all works together, and how it interacts with its environment. We would need patterns: when and where different genes are activated as an organism develops. We need timings: how quickly chemical reactions take place in a cell, and how proteins speed up that process.

Here, our metaphors let us down. Science writers like to compare the genome to a textbook or a blueprint. That conveys the fact that it stores information, but glosses over its buzzing, dynamic nature—proteins docking on and off to control the activity of genes, huge stretches of DNA that fold and unfold to reveal or hide their sequences, parasitic jumping genes that copy themselves and hop throughout the genome… None of our information stores – not sheet music, not recipe books – are this intricate.

That hasn’t stopped some scientists from trying to simulate this intricacy. In July, Covert announced that he had created a rough simulation of an entire organism – a single-celled microbe called Mycoplasma genitalium. Covert’s model simulates how all of the bacterium’s 525 genes are used, the proteins they produce, how quickly the proteins act, how they interact, and more. It is not completely accurate, but it captures much of M. genitalium’s lifestyle. Two colleagues wrote that the project “should be commended for its audacity alone”.

Still, the simulation was hard-won. At 525 genes, M. genitalium has the smallest genome outside of viruses (humans have 20,000 to 25,000 genes, by comparison), pared down to extreme minimalism by its life as a parasite. It may be one of the simplest living things we can imagine, but modelling this microbe still took around 1,900 experiments and a lot of borrowed knowledge. “Around half of our model comes from experiments that were done in other bacteria,” says Covert. “There’s no way [the genome] would have been predictive by itself.”

Covert also needed to factor in M. genitalium’s environment. It lives only in the stable environment of our urethra, with no light, and steady temperature. “But even then, it occasionally sees the immune system coming after it and there’s no way of modelling that,” says Covert.

The influence of the environment becomes even more crucial for more complex free-ranging organisms. Temperature and acidity affect how proteins behave. The food that an organism consumes, the infections that plague it, and the competitors it interacts with, all affect how it develops, and how its genes are used. Many of these factors leave marks on the genome itself – “epigenetic” tags that dictate the deployment of genes, and can be passed on to the next generation. The environment clearly matters. When making predictions from a genome, the elephant in the room is the room.

Still, Covert’s approach shows one way forward – the dawn of virtual biology. You could sequence a genome, construct a model or simulation, compare that to the real organism, work out the flaws in the model, and rectify those flaws with further experiments. Rinse and repeat. Eventually, you would have a zoo of models. If you have a new genome, start by comparing it to one of the existing simulations and work from there. It’s not quite the black box we envisaged, but it’s something.

If scientists are trying to find fungi or bacteria that can perform a specific job, say, to clean up hazardous waste or to produce certain nutrients, it would be valuable to identify such organisms from their genomes alone. “We can use the sequencing to look for phenotypes that are relevant for our objective,” says Jens Nielsen from the Chalmers University of Technology in Sweden. And if that objective is to artificially design new life-forms, as folks like Craig Venter are trying to do, then prediction becomes essential, rather than wishful. “You’d worry about side effects and you’d want a computational tool that can avoid them,” says Covert. “When we talk about rationally designing a new organism, you’d want to predict a phenotype.”

“I doubt we’d ever get to 100% prediction because biology is so variable,” says Nielsen. But Kruglyak adds, “I don’t think that in principle, there are any showstoppers that would make it impossible. It would just take a whole lot more work and continued technological development beyond what we can do today.”


Comments (13)

  1. gc

    If the cost of decoding human DNA was cheap, what level of family relationship is possible to determine? Could science produce a complete family tree if they had a database of everyone’s DNA?

  2. dcwarrior

    Ed, weren’t there reports of heritable factors in flowers that came not from the DNA in the nucleus, but from what organelles came with the particular fertilized seed? If so, even in animals, wouldn’t the egg (is it part of the “environment” for your purposes?) and the mitochondria and other support structures, affect the organism’s eventual size and other characteristics, in ways that might be inherited, but outside the genome?

  3. You mention that physicists can use the laws of mechanics to predict the motion of an object. Theoretically they can predict the movements of molecules (down to the limits of Heisenberg’s uncertainty principle). But they can’t predict or model the formation of a hurricane. The complexity and chaotic nature make it computationally intractable for the foreseeable future.

    The same is probably true of complex genomes. We struggle to predict the structure of a single protein. Yet they want to predict the simultaneous production and interaction of thousands of proteins, not to mention ribozymes and all the other types of molecules involved, the methylation used to turn genes on and off. Even ignoring environmental interactions that affect organism development, it’s an intractable problem.

  4. Nicely done sir!
    Your piece reminded me of a classic paper from Phil Anderson. Worth a read, still.


  5. Tim

    Expanding on Hunter’s statements: If you take the view of computational biology, that the system of proteins, DNA and so on represents code and the processors which run that code, then you have a highly complex computer which not only reproduces itself (so that all life is, essentially, a Quine program), but actively reprograms itself. We already know from experience that self-modifying code is a debugging nightmare, and we know from theory and practice that many rather simple questions (presented in the form of algorithms) have uncomputable answers!

    Whether life, or the weather, fits into this class is arguable I suppose — one could say that, since nature has obviously solved the equations, or ran the algorithms, up to this point in time, then an equivalent process can be computed in a purely abstract (mathematical) environment, also as a function of time. On the other hand, one could argue that the system, as a whole (for all time), is uncomputable, and life is merely the chaotic expression of an iterative algorithm flailing about, trying to converge on a solution that doesn’t exist. The practical question then becomes, “well I don’t care where life ultimately is going, I just want to know, what does this thing turn into after ten years?” And the answer is most likely: small increments are computable, so a finite time in the future is, in principle, computable; but even with really good computers, you may be waiting a very long time to get those results. (In other words… “an intractable problem.”)

  6. DavidB

    Not sure I’d call the urethra a ‘stable environment’. Every so often…

  7. On a lighter note, I am reminded of Dr. Margo Green’s Genetic Sequence Extrapolator “a computer program designed to describe the characteristics of a given species from a reading of its DNA … you can use this program to tell the species and sex of the animal, whether it was nocturnal, what it ate, how it hunted, how big it was…”

    source: http://www.dailyscript.com/scripts/the-relic_early.html

  8. Brian Too

    In principle, yes.
    In practice, this could take a couple of centuries.
    Patience, Grasshopper.

  9. SteveW

    The problem with the physics analogy is that physics works best for predicting properties of a single entity: the location of a ball moving at a certain velocity, say, is governed by a single equation. If you introduce another object into the system, you can’t do much more than predict when or if they’ll collide. After that it’s all simulation, computing successive states of the system. So it is with DNA and cell reproduction. There is no single equation to predict the shape an organism will take after it divides more than once. The organism’s future states can be predicted to some extent using fractals and chaos theory, technologies that are computationally expensive and grow increasingly imprecise in proportion to the number of states.

  10. Nathan Myers

    So far as we have any experience, chicken DNA only results in a chicken if it’s grown in chicken cells. Similarly, elephants have only arisen from DNA that was contained entirely within elephant cells. Grow an elephant from DNA substituted into a chicken egg, and I will be the first to admit being impressed.

    But let’s not jump the gun. Sure, an elephant cell can be hard to tell apart from a chicken cell, from the outside. It seems fair to guess that none of us have experienced the inside of a chicken cell, or an elephant cell either. They could be as different as a Honda Civic and a Mack truck, from that vantage point. The Mack truck is not going to be easy to find a parking place for downtown, and your “mobile” home is going to remain pretty stationary as long as it’s attached to your Civic, whatever its highway MPG rating.

    What I’m really saying here is that chicken DNA has millions of years of practice in telling chicken cells what to do, but orchestrating the operations of an elephant cell is likely to stretch its skills beyond their limits. Sure, chicken and elephant cells both respire and metabolize. So do chickens and elephants, but that doesn’t mean a chicken can eat a whole tree, nor an elephant (Horton notwithstanding) hatch an egg. Can a chicken mitochondrion even use a protein transcribed from the sequences found in elephant DNA? I doubt it, and that’s just the beginning.

    Nothing in nature makes a membrane from scratch; each bit of membrane that exists today started as part of another bit, going back to that first cell. Today’s cells can extend and pinch off new bits of membrane at a ferocious — indeed, given the resources, an exponential — rate, but if DNA can’t direct construction of a membrane from scratch, is DNA really so central to life? Really, DNA is just a tool to help the world-spanning membranium make more membrane. If a better way comes along, DNA will get the old heave-ho before you can say icosacatecholamine, and with no severance pay.

    If you slurped out the whole innards of a chicken cell and, separately, an elephant cell, and swapped them, could you grow an elephant with chicken membranes or a chicken with elephant membranes? I’m far from sure that you could do even that. Would you grow the chicken in a culture medium tailored for chicken cells, or for elephant cells? I doubt they can be the same, but the cholesterol islands floating around in the envelope determine what gets in and out, at least at first.

    Does a chicken Golgi apparatus process RNA precisely the same way as an elephant’s, or do they have slightly (or radically) different signaling and regulatory mechanisms? 300 million years is time for quite a lot of divergence. You might need to bring along a lot more than just the DNA to end up with a functioning cell. You might just need to bring along pretty much everything.

  11. I was going to say something similar to Nathan Myers… Let’s say that practicality is an issue. Could you even in principle predict an animal from the DNA alone? Quite possibly not, because you have a literal chicken-and-egg problem: Even if you could simulate everything perfectly, you don’t know how the egg is going to develop unless you can simulate it being in a chicken. And if the only information you are starting with is the DNA, then you can’t simulate the environment of a chicken’s womb until you’ve already properly simulated the development of an egg.

    MAYBE, with unlimited resources, you could reach some kind of convergence by performing countless simulations… egg in “generic” womb, then see what kind of womb that produces in the adult animal, then resimulate, lather-rinse-repeat until you start getting the same adult generation after generation. But even ignoring practical concerns, there’s no guarantee that the local minimum you reach would be a “genuine” chicken!

    tl;dr: I don’t think the DNA alone is sufficient information even in principle — never mind that, even if you could solve the “unknown womb” problem, the problem is so complex that it is likely to be intractable in practice.

  12. What about the epigenome, which is basically the master of the genome? Cells literally do not know what to do with DNA without histones (which come in probably thousands of post-translationally modified forms that code for different DNA functions) and DNA methylation. The cellular reprogramming that gives us iPS cells is the best example of how the same DNA can manifest completely differently depending on the epigenetic context. So, a pile of raw DNA tells only a tiny bit of the story of any given organism.


  13. amphiox

    The answer is no. Because not all the information needed to define a lifeform is contained within its genome. The initial conditions in which the genome finds itself, the intracellular environment, also matter, and not all of that is set by the genome. Some of it is inherited as cells reproduce, soma to soma, in a continuous line all the way back to the very first cell.

