By 2020, the volume of data that humanity generates may reach 44 trillion gigabytes, according to information technology analyst firm International Data Corporation in Framingham, Massachusetts. That’s equivalent to more than six towers of 128-gigabyte iPad Airs, each reaching from Earth to the moon.
To be of any use, all this data needs to be stored somewhere, and DNA may be up to the task.
Now, using a new strategy called DNA Fountain, scientists have nearly reached DNA’s theoretical storage capacity, and still recovered their data with zero errors. The secret of the new technique is that it essentially encodes files in DNA as very simple Sudoku puzzles, says study lead author Yaniv Erlich, a computational biologist at Columbia University in New York.
Data to DNA
DNA is made of strands of molecules known as nucleotides: adenine, thymine, cytosine and guanine, abbreviated A, T, C and G. Just as patterns of ink can represent letters of the alphabet, sequences of nucleotides can be used to encode data.
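To make the idea concrete, here is a minimal sketch in Python of how bytes could be written as nucleotides. With four possible letters, each nucleotide can carry at most two bits. The particular assignment used here (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions chosen for illustration, not details taken from the study.

```python
# Assumed two-bit mapping, for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Write each byte as 8 bits, then map every pair of bits to one nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(sequence: str) -> bytes:
    """Reverse the mapping: nucleotides back to bits, bits back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"Hi"
strand = bytes_to_dna(message)
recovered = dna_to_bytes(strand)
```

Each byte becomes four nucleotides under this scheme, so a two-byte message turns into a strand of eight letters that decodes back to the original bytes.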
As genetic analyses of woolly mammoth and Neanderthal fossils have revealed, DNA can remain stable for millennia, unlike, say, magnetic tape, which can degrade within a decade. DNA is also compact and does not require any power for storage, so keeping and shipping it could prove relatively easy.
Previous attempts at encoding data in strands of DNA only reached about half of DNA storage’s theoretical maximum capacity. In addition, prior work often experienced small gaps in retrieved data because of errors introduced during DNA synthesis. But Erlich took a cue from the entertainment section of the newspaper in developing DNA Fountain.
In Sudoku, players are given mostly empty grids, and the few numbers provided within the grids serve as hints as to how the rest of the grids should get filled out. In much the same way, DNA Fountain generates many ‘hints’ about the contents of files. All this data gets encoded into DNA, and when it comes to retrieving data from these molecules, even if a few ‘hints’ and fragments of the files are lost, the other hints can help reveal what data was lost, Erlich says.
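The Sudoku analogy can be sketched in a few lines of Python. This is a toy illustration, not the DNA Fountain algorithm itself: the file is cut into chunks, each "hint" is the XOR of a chosen subset of chunks, and the decoder repeatedly solves any hint that has exactly one unknown chunk left. The names `make_hint` and `peel_decode`, the four-byte chunks and the hand-picked subsets are all assumptions made for the example. Real fountain codes typically pick the subsets pseudo-randomly.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_hint(chunks, indices):
    """A hint is the XOR of the chosen chunks, tagged with which chunks were used."""
    payload = reduce(xor_bytes, (chunks[i] for i in indices))
    return (frozenset(indices), payload)


def peel_decode(hints, n_chunks):
    """Solve hints with exactly one unknown chunk until nothing more can be solved."""
    known = {}
    pending = list(hints)
    progress = True
    while progress and len(known) < n_chunks:
        progress = False
        remaining = []
        for indices, payload in pending:
            unknown = [i for i in indices if i not in known]
            if not unknown:
                continue  # this hint tells us nothing new
            if len(unknown) == 1:
                # Cancel out the chunks we already know to isolate the missing one.
                for i in indices:
                    if i in known:
                        payload = xor_bytes(payload, known[i])
                known[unknown[0]] = payload
                progress = True
            else:
                remaining.append((indices, payload))
        pending = remaining
    if len(known) < n_chunks:
        return None  # not enough hints survived
    return [known[i] for i in range(n_chunks)]


chunks = [b"DNA ", b"Foun", b"tain", b"demo"]
index_sets = [[0], [1], [2], [3], [0, 1], [1, 2], [2, 3], [0, 3]]
hints = [make_hint(chunks, s) for s in index_sets]

# Simulate loss: the hints that carried chunks 1 and 2 on their own are gone.
survivors = [h for k, h in enumerate(hints) if k not in (1, 2)]
decoded = peel_decode(survivors, len(chunks))
```

In this example the two lost chunks are rebuilt from the surviving combined hints, which is the sense in which the remaining hints "reveal what data was lost." If too few hints survive, `peel_decode` returns `None` rather than guessing.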
Erlich and his team used the new technique to encode six files into DNA:
- A complete computer operating system known as Kolibri.
- A kind of computer virus known as a zip bomb.
- The 1895 French film “Arrival of a train at La Ciotat,” which according to urban legends terrified audiences with the moving image of a life-sized train.
- A Pioneer plaque, a copy of the metal plates placed onboard the Pioneer spacecraft meant to deliver a message to any extraterrestrial intelligences that might pick them up.
- The 1948 study “A Mathematical Theory of Communication” by information theory founder Claude Shannon, which helped shape virtually all systems that store, process or transmit digital information.
- A $50 Amazon gift card.
The researchers included the operating system, computer virus and film because “these files are highly sensitive to errors, and we wanted to show that it is possible to perfectly retrieve them from our data,” Erlich says. In addition, “we selected the Shannon manuscript because of its importance to our work, and the Pioneer plaque because of its importance for humanity.”
The scientists added the Amazon gift card to encourage others to reproduce the research team’s work.
“We shared the DNA sequencing data with a Twitter follower that was interested in the study,” Erlich says. “I told him that he could get the card if he could decode the data, which he gladly did, and bought a nice book.”
Early, Early Technology
The researchers incorporated the six files into a single compressed file a little over 2.1 megabytes in size. They next used DNA Fountain to encode it into 72,000 strands of DNA, which took two weeks to synthesize.
To read the files, the scientists used DNA sequencing technology, followed by software that translated the DNA sequences into binary data. They recovered their files with zero errors.
All in all, this new coding strategy could pack up to nearly 215 petabytes of data (that is, nearly 215 million billion bytes) in a single gram of DNA. For comparison, the brain’s memory storage capacity is estimated at about 2.5 petabytes.
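A quick back-of-the-envelope calculation in Python puts these figures side by side. It uses only numbers quoted in this article and assumes decimal units (1 gigabyte = 10^9 bytes, 1 petabyte = 10^15 bytes). It treats the 215-petabyte figure as an upper bound and ignores any practical overhead, so the result is a rough scale estimate rather than a prediction.

```python
BYTES_PER_GB = 10**9    # assumption: decimal gigabytes
BYTES_PER_PB = 10**15   # assumption: decimal petabytes

density_bytes_per_gram = 215 * BYTES_PER_PB        # "nearly 215 petabytes" per gram
global_data_bytes = 44 * 10**12 * BYTES_PER_GB     # 44 trillion gigabytes
brain_bytes = 2.5 * BYTES_PER_PB                   # estimated brain capacity

# How much DNA would hold all the data projected for 2020?
grams_needed = global_data_bytes / density_bytes_per_gram
kilograms_needed = grams_needed / 1000

# How many "brains" of data fit in one gram?
brains_per_gram = density_bytes_per_gram / brain_bytes

print(f"DNA needed: about {kilograms_needed:.0f} kg")
print(f"One gram holds about {brains_per_gram:.0f} brains' worth of data")
```

Under these assumptions, roughly 205 kilograms of DNA would cover the 44 trillion gigabytes, and one gram would hold about 86 times the brain's estimated capacity.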
DNA Fountain reached nearly 90 percent of the theoretical maximum capacity of DNA storage, packing nearly 10 times more data per gram than the previous best DNA storage method. This may be the highest-density data-storage technique developed yet, Erlich says.
In addition, the researchers showed they could easily copy DNA-encoded files using polymerase chain reaction (PCR), a technique now commonplace in genetics labs. The data in these copies, and even copies of the copies, and so on, were also recovered error-free.
“I don’t want people to think that we claim that they can get DNA hard drives in Best Buy in five years,” Erlich cautions.
Instead, the researchers think the best application for DNA storage is for online archiving services such as Amazon Glacier, which are designed for long-term storage of data that are only accessed infrequently and where waits of several hours to retrieve files are acceptable.
“Even such a service is still probably a decade away from us,” Erlich says.
The greatest barrier to practical DNA storage is likely cost. For instance, the researchers spent $7,000 to synthesize the DNA they used to record their data and another $2,000 to read it. Still, “these are the early days of DNA storage,” Erlich says. While magnetic data storage is currently relatively cheap, “we have spent billions in R&D in the last 50 years to get to this stage; only a fraction of that was invested in cheap DNA synthesis,” he says.
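To put those costs in perspective, a short Python calculation gives the price per megabyte. The inputs are the dollar amounts above and the roughly 2.1-megabyte file size reported for the experiment. Since the file was "a little over" 2.1 megabytes, the results are approximate.

```python
synthesis_cost_usd = 7000   # cost to write the DNA
reading_cost_usd = 2000     # cost to sequence it
file_size_mb = 2.1          # approximate size of the encoded archive

synthesis_per_mb = synthesis_cost_usd / file_size_mb
reading_per_mb = reading_cost_usd / file_size_mb
total_per_mb = (synthesis_cost_usd + reading_cost_usd) / file_size_mb

print(f"Writing: about ${synthesis_per_mb:,.0f} per megabyte")
print(f"Reading: about ${reading_per_mb:,.0f} per megabyte")
print(f"Total:   about ${total_per_mb:,.0f} per megabyte")
```

That works out to roughly $3,300 per megabyte to write and $950 per megabyte to read, or about $4,300 per megabyte for the round trip.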
One way to slash costs is to go for “quick and dirty” DNA synthesis approaches that are more error-ridden, Erlich says. The way in which the new technique can overcome errors “suggests that we could use much lower-quality synthesis and still perfectly decode a file,” he says.
Erlich and his colleague Dina Zielinski at the New York Genome Center detailed their findings in the March 3 issue of the journal Science.