Within your body, a huge amount of information is copied over and over again, reliably and predictably. Your life depends on it. Typos occur, but they are quickly corrected. Edits are made, but sparingly. Or, at least, that’s what we thought.
It starts with DNA. This famous molecule is a chain of four ‘bases’, denoted by the letters A, C, G and T. These four letters, in various combinations, contain instructions for building thousands of proteins, a workforce of molecular machines that keep you alive and well. But first, DNA has to be copied (or “transcribed”) into a related molecule called RNA. It too is made of four bases: A, C and G reprise their roles, but U stands in for T. Each triplet of letters in RNA denotes one of 20 amino acids, the building blocks of proteins (several triplets can stand for the same amino acid, and a few mean “stop”). Small factories read along the RNA like a piece of tickertape, using it to string together amino acids in the right sequence.
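For readers who like to see the bookkeeping, here is a minimal Python sketch of that copying and reading process. It makes some simplifying assumptions that are not from the paper: the DNA string is treated as the coding strand, so transcription is just a matter of swapping T for U, and the function names `transcribe` and `translate` are invented for illustration. The lookup table is the standard genetic code, with `*` marking a stop signal.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Copy a DNA coding strand into RNA: same letters, but U replaces T."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Read RNA three letters at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the protein ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

rna = transcribe("ATGGCCTAA")
print(rna)             # AUGGCCUAA
print(translate(rna))  # MA
```

If the flow of information were perfectly faithful, knowing the DNA string would be enough to predict both outputs.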
So DNA leads to RNA leads to proteins – this is the grandiosely-named “central dogma of life”.
People often assume that this flow of information happens with exacting precision. Every stretch of RNA should be a perfect match for the piece of DNA it is copied from. Take a piece of DNA, and you could predict the exact string of letters in its corresponding piece of RNA, and the amino acids of the resulting protein.
But that’s not always the case.
Typos creep into the transcripts. Some of these are genuine errors where the wrong letter is put in place – proofreading proteins usually fix these mistakes. Other typos are deliberate edits – for example, proteins called deaminases will often convert some As into Gs, and (more rarely) some Cs into Us.
Now, Mingyao Li and Isabel Wang from the University of Pennsylvania School of Medicine have found that these typos go far beyond the edits that we knew about.
Li and Wang studied white blood cells from 27 unrelated people, and looked at both their DNA and RNA sequences. They found over 10,000 places across the genome where the molecules didn’t match up, spread over a third of all our genes. Some of these looked like the type of RNA edits that scientists already knew about, but around half of them were clearly something new. Li and Wang called these changes ‘RDDs’, short for RNA-DNA differences.
The duo took great care to make sure that the RDDs weren’t just the result of errors in their own sequencing methods. They asked different laboratories to prepare and sequence the samples. They focused on parts of the genome that they had scanned several times over, and where the DNA letter is the same from person to person. And they used cells from people whose DNA had already been sequenced as part of two big genetics initiatives – the International HapMap Project and the 1000 Genomes Project. These existing sequences matched those that Li and Wang produced afresh.
The RDDs weren’t just random errors. Every one of them showed up in at least two people and 80% showed up in half of the sample. They were there in infants and adults. They were there in people outside the original group of 27. They were there in other types of cells – neurons, skin cells, embryonic stem cells, cancer cells. And they were always the same at any given site, even in different people – if a T in DNA becomes a G in RNA, then it always becomes a G rather than an A or C. There must be some sort of guide that determines which DNA letters are edited and what they’re edited to.
These typo-ridden molecules co-exist with those that more accurately reflect the DNA they were copied from. At any given RDD, around 20% of the RNA sequences differ from their corresponding DNA, while the rest are accurate matches. But that’s an average figure – at some sites, Li and Wang found RDDs in nearly every RNA sequence they examined.
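To make that 20% figure concrete, here is a toy Python sketch of how one might measure, at each position, the share of RNA sequences that disagree with the DNA. The data are invented, the function name `mismatch_fractions` is hypothetical, and the RNA reads are assumed to be already lined up with the DNA, letter for letter. This is not the pipeline Li and Wang used, which had to deal with alignment and sequencing error.

```python
def mismatch_fractions(dna, rna_reads):
    """For each position, return the fraction of RNA reads that differ from the DNA.

    Assumes every read is the same length as the DNA and already aligned to it.
    A U in RNA counts as a match for a T in DNA.
    """
    expected = dna.upper().replace("T", "U")
    fractions = []
    for position, base in enumerate(expected):
        differing = sum(1 for read in rna_reads if read[position] != base)
        fractions.append(differing / len(rna_reads))
    return fractions

# Invented example: five RNA copies of the DNA snippet "ATGC".
# One copy carries a G where the DNA has a T.
reads = ["AUGC", "AUGC", "AGGC", "AUGC", "AUGC"]
print(mismatch_fractions("ATGC", reads))  # [0.0, 0.2, 0.0, 0.0]
```

In this made-up case, the second position behaves like an average RDD site: one read in five carries the altered letter, and the rest match the DNA.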
These typos carry over into proteins. Li and Wang found several proteins whose amino acids correspond to the altered RNA sequence rather than the underlying DNA one. Around a third of the RDDs lead to a different amino acid, but about one in a hundred change the size of the protein altogether. For example, one RDD in the gene RPL28 lengthens the resulting protein by 55 amino acids.
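Why would one changed letter sometimes do nothing, sometimes swap an amino acid, and occasionally resize the whole protein? The Python sketch below sorts a single-letter change into those categories using the standard genetic code. The function name `classify_change` and the example codons are made up for illustration, and the sketch does not claim to describe the mechanism behind any particular RDD in the paper. It only shows one way that a changed letter could have each kind of effect.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_change(dna_codon, position, rna_base):
    """Describe what happens when one letter of a codon differs in the RNA.

    dna_codon: three DNA letters (coding strand assumed)
    position:  0, 1 or 2, the letter that differs
    rna_base:  the letter actually seen in the RNA at that position
    """
    original = dna_codon.upper().replace("T", "U")
    edited = original[:position] + rna_base + original[position + 1:]
    before, after = CODON_TABLE[original], CODON_TABLE[edited]
    if before == after:
        return "same amino acid"
    if before == "*":
        return "stop lost: protein lengthened"
    if after == "*":
        return "stop gained: protein shortened"
    return "different amino acid"

print(classify_change("CTT", 2, "C"))  # same amino acid
print(classify_change("ATG", 2, "A"))  # different amino acid
print(classify_change("TAA", 0, "C"))  # stop lost: protein lengthened
print(classify_change("TGG", 2, "A"))  # stop gained: protein shortened
```

When a stop signal is lost, the protein-building machinery keeps reading until it hits the next stop triplet, so the number of extra amino acids depends on what happens to lie downstream.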
For now, Li and Wang don’t know how the RDDs are produced. Are the different letters slipped in as the RNA strand is assembled, or is the strand edited afterwards? What determines which letter is substituted at a given site? And perhaps most importantly, what do they do? Do they affect our behaviour, our development, our physical features, or our risk of disease?
To answer these questions, Li and Wang argue that as well as studying the genome – the sum of our DNA – we need to pay equal attention to the transcriptome – the sum of our RNA. So far, DNA has hogged the limelight; for example, we have poured millions of dollars into scouring our genome for DNA variants that affect our risk of disease. But DNA is the tip of the iceberg. Identical pieces of DNA can be transcribed and edited into subtly different strands of RNA, which can produce very different proteins. These other layers of diversity are now being uncovered.
The wave of next-generation sequencing technology has certainly helped, according to George Church, a pioneer of genomic sequencing. As our tools have become more powerful, our knowledge has grown deeper. “We are seeing a huge uptick in observations of modified bases,” says Church. “These are very exciting times to be studying -omes.”
Reference: Li, Wang, Li, Bruzel, Richards, Toung & Cheung. 2011. Widespread RNA and DNA Sequence Differences in the Human Transcriptome. http://dx.doi.org/10.1126/science.1207018