Data storage in DNA gains from a larger ‘alphabet’

Cost-effective method turns DNA synthesis drawback to an advantage

Original source: Physics World

A technique for storing data in DNA overcomes the information redundancy that hampered previous implementations. Scientists from Israel recently demonstrated how the process could become more cost-effective and efficient by introducing more “letters” to the DNA “alphabet”.

Storing data in DNA is an attractive prospect because the same amount of information can be stored in a much smaller physical volume than is possible with current storage media. Because the DNA molecule is so stable, it is also well suited to long-term archives. The inspiration for using DNA in this way comes naturally, as DNA’s main function is to store the genetic information for all living organisms.

Encoded in letters

DNA strands are polynucleotides and combine four different nucleobases – adenine (A), cytosine (C), guanine (G) and thymine (T). It is the sequence of these bases that determines the information stored. In this way A, C, G and T represent the letters of the DNA alphabet, by which data is encoded. In 2017 a group of US scientists demonstrated a DNA-based storage system capable of storing 215 PB per gram. (A petabyte is a million gigabytes.) This equates to six orders of magnitude more data stored per unit volume than achievable with current storage devices.

To encode a message in DNA, each pair of bits of binary data is mapped to one of the four DNA letters, producing a full sequence, which is then synthesized. One issue with this method, however, is that current DNA synthesis technology produces a large number of molecules with the same sequence, making a lot of the stored information redundant. Now scientists from the Technion – Israel Institute of Technology and the Herzliya Interdisciplinary Centre have exploited this quirk of the synthesis process to make DNA data storage more efficient. The research team accomplishes this by using the concept of “composite letters”.
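The two-bits-per-letter scheme described above can be sketched in a few lines of Python. This is a minimal illustration only: the article does not say which bit pair corresponds to which base, so the mapping in `BITS_TO_BASE` and the function names `encode` and `decode` are assumptions made for the example.

```python
# Hypothetical bit-pair-to-base mapping; the real assignment is not given in the article.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a DNA sequence, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(sequence: str) -> bytes:
    """Invert encode(): read two bits per base and regroup them into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

Each byte becomes four bases, so a message of n bytes needs a sequence of 4n letters under this scheme.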

Increasing the alphabet

The scientists defined a “composite letter” as a combination of the letters A, C, G and T, in which each one appears a certain number of times. In any given position along the synthesized DNA strands, the four nucleobases appear with probabilities that reflect how often each is represented in the “composite letter”. This way the large number of synthesized DNA strands becomes an advantage – the “composite letter”, which can encode multiple bits of information, is identified from the distribution of the four bases across all the synthesized strands in a certain position. As a result, the same message may be recorded on a shorter DNA molecule and more data is stored for the same number of synthesis cycles.
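The idea of reading a “composite letter” from the distribution of bases across many strands can be simulated as a rough sketch. Everything below is an assumption made for illustration, not the researchers' actual pipeline: a letter is modelled as a tuple of whole-number ratios for (A, C, G, T) that sum to a hypothetical resolution `k`, synthesis is modelled as independent random draws, and decoding picks the letter whose ratios lie closest to the observed frequencies.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"


def alphabet(k):
    """All composite letters: ratios of (A, C, G, T) that sum to k."""
    return [r for r in product(range(k + 1), repeat=4) if sum(r) == k]


def synthesize(message, copies, rng):
    """Simulate synthesis: every strand draws one base per position,
    weighted by the ratios of the composite letter at that position."""
    return [
        "".join(rng.choices(BASES, weights=letter)[0] for letter in message)
        for _ in range(copies)
    ]


def decode(strands, k):
    """Recover each composite letter from the base frequencies in its column."""
    letters = alphabet(k)
    decoded = []
    for column in zip(*strands):
        counts = Counter(column)
        freqs = [counts[base] / len(column) for base in BASES]
        # Choose the letter whose ideal frequencies are nearest (least squares).
        decoded.append(min(
            letters,
            key=lambda l: sum((f - x / k) ** 2 for f, x in zip(freqs, l)),
        ))
    return decoded


rng = random.Random(0)
message = [(2, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1)]  # pure A, half A/half C, half G/half T
strands = synthesize(message, copies=1000, rng=rng)
print(decode(strands, k=2) == message)
```

With `k = 2` this toy alphabet already has 10 letters instead of 4, and decoding only works because many strands are read per position, which is where the extra sequencing effort comes from.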

The “composite letter” method can thus reduce the costs of storing data in DNA. The scientists observed a trade-off: synthesis becomes cheaper, but sequencing the molecules becomes more expensive. However, they confirmed that this method provides a net gain and identified that an alphabet of 56 letters reduces the DNA data storage cost by an optimal 56%. Furthermore, with this alphabet the researchers observed a three-fold increase in the bits encoded per synthesis cycle compared with previous implementations of DNA data storage.
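A back-of-the-envelope check suggests where figures of this kind can come from. The sketch below rests on two assumptions that the article does not state: that the alphabet contains every way of splitting a fixed number of units `k` among the four bases, and that each letter carries at most log2(alphabet size) bits, ignoring error correction and sequencing noise. The function names are hypothetical.

```python
from math import comb, log2


def alphabet_size(k: int) -> int:
    """Number of ways to split k units among four bases (stars and bars)."""
    return comb(k + 3, 3)


def bits_per_letter(size: int) -> float:
    """Upper bound on the information carried by one letter of the alphabet."""
    return log2(size)


size = alphabet_size(5)                             # 56 letters when k = 5
gain = bits_per_letter(size) / bits_per_letter(4)   # compared with plain A, C, G, T
print(size, round(bits_per_letter(size), 2), round(gain, 2))  # 56 5.81 2.9
```

Under these assumptions a 56-letter alphabet carries about 5.8 bits per position instead of 2, a ratio of roughly 2.9, which is at least consistent with the reported three-fold increase per synthesis cycle.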

While DNA synthesis hardware is not yet ready for large-scale realization of this method, research towards that goal continues. Future work may overcome more of the challenges that DNA data storage systems currently face.

You can read the full paper in Nature Biotechnology.