
The DNA in your body can store the data of the entire universe

  • joy
  • 2022-08-04 15:00:56

More than 60 million years after the dinosaurs went extinct, scientists obtain a piece of amber containing a prehistoric mosquito, extract dinosaur genes from the blood inside it, and bring the long-vanished creatures back to life. "Jurassic Park", which tells this story, still ranks in the top ten at the global box office. The premise of the series is simple: DNA stores the biological information of dinosaurs, and technology allows it to be expressed again.

Now imagine another story about DNA: somewhere in the universe, the "Anthropocene" has also come to an end. Another intelligent species has appeared and sets out to explore the ancient "human civilization". What will carry the memory of that civilization? Temperatures have changed, and the huge data centers on Earth have left only ruins.

But in the frozen soil there is a sample of DNA. It is light, only 1 kilogram, and looks like white powder sealed in a capsule. Once read, it turns out to record a huge amount of information that once existed on Earth: videos, texts, and code that show countless inventions and literary works from the course of human history. And so the traces of that distant civilization unfold again in the universe.

This is another sci-fi setting, but the technology behind it is a cutting-edge field that is now attracting attention: storing information in DNA. In nature, DNA is responsible for storing genetic information. A human cell is only 5 to 200 microns in diameter, yet the DNA inside it holds a person's entire genetic information: 3 billion base pairs.

So why can't bases be used to store other information? This sci-fi idea is moving out of the lab and is being treated as a future solution for information storage.

01 Too much genomic data: what to do?

At first, the biologists were only trying to solve a problem in their own field.

Eleven years ago, a group of bioinformaticians discussed "the problem of data storage" in a hotel in Germany. Nick Goldman was among them, in his second year as a senior scientist at the European Bioinformatics Institute (EBI).

Large-scale genome sequencing is underway, and the resulting data is growing rapidly. Storing and compressing this data is a hassle, and existing technical solutions do not seem to be up to the task. It is estimated that human genome data will require 2 to 40 EB of storage capacity. That is probably more than the cloud storage of a world-class tech company: the total amount of data that Apple users around the world store on Google Cloud is roughly 8 EB, and the monthly storage fee for those 8 EB is $218 million. (1 EB = 1024^3 GB)

Biologists fell into depression.

Nick Goldman holds the DNA that stores all of Shakespeare's sonnets, a photo, and a clip of Martin Luther King Jr.'s "I Have a Dream" speech | Source: EBI

Someone had an epiphany: What's stopping us from using DNA to store data?

It seemed like a joke, but the biologists realized it wasn't just a joke. They took a napkin and carefully calculated the feasibility with a ballpoint pen.

The principle by which DNA stores genetic information is not complicated. DNA is built from four nucleotides, A, T, G, and C, which pair with each other to form a double helix. The sequence of the nucleotides records the genetic information.

In the digital world, all information is essentially a string of 0s and 1s. To store digital information in DNA, the simplest way to think about it is to convert the sequence of 0s and 1s into a sequence of nucleotides. The advantage of DNA storage is its high density: 1 cubic millimeter of DNA, about the size of a comma on this page, can hold 9 TB (1 TB = 1024 GB) of information.

Using DNA to store data is not a completely new idea. Scientists had tried it before, but those early attempts were pioneering experiments at the border between science and art.

In 1988, artist Joe Davis and Harvard researchers stored a pattern called "Microvenus" in a short strand of DNA.

Microvenus stored in DNA | Source: related paper

The encoding of this pattern is simple: the white areas are marked 0 and the black lines are marked 1. The file is only 35 bits, and it is stored in a DNA strand 28 nucleotides long.

Two years after that hotel discussion, in 2013, Goldman's team published its research. This time they stored five files in different formats, totaling 0.75 MB. To ensure that the information could be read back without error, each piece of information was stored with fourfold redundancy.

The five files are:

All 154 of Shakespeare's sonnets (ASCII text)

The paper that proposed the double-helix structure of DNA (PDF)

A photo (JPEG)

A 26-second clip of Martin Luther King Jr.'s "I Have a Dream" speech (MP3)

A file containing a Huffman code

In recent years, the record for DNA storage capacity has been broken again and again. In 2019, Catalog, a US startup, stored 16 GB of Wikipedia in DNA. The company says it is building the world's first DNA-based platform for large-scale digital data storage and computing.

02 Encoding and decoding, a lot to deal with

In the opinion of some biologists, using DNA for storage is a very natural fit. "Nature's coding language is very similar to the binary language we use in computing. On hard drives we use 0s and 1s to represent data, whereas in DNA we have 4 forms of nucleotides, A, C, T and G," says biologist Robert Grass of the Swiss Federal Institute of Technology.

One of the keys to DNA storage is mapping the digits 0 and 1 onto the four nucleotides. The scheme can be very simple. For example: A corresponds to 00, C to 01, G to 10, and T to 11. The nucleotides are then strung together in the required order like beads on a string. (This is DNA synthesis.) When the information needs to be read, gene sequencing technology reads out the nucleotide sequence, which is then translated back into a string of 0s and 1s. The whole process is: encoding, DNA synthesis, sequencing, decoding.
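The two-bit mapping just described can be sketched in a few lines of Python. This is only an illustration of that simple scheme, not the encoding used by any of the teams mentioned in this article; the function names `encode` and `decode` are made up for the example.

```python
# Two-bit mapping from the text: A=00, C=01, G=10, T=11.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a nucleotide string back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

Each byte becomes four bases, so a file of N bytes needs 4N nucleotides under this mapping, before any redundancy or error correction is added.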

This sounds as simple as the old joke about the steps for "putting an elephant in the refrigerator", but in practice there are still many issues to consider. Otherwise scientists would not have to keep working on new coding schemes.

In natural DNA, A pairs with T and C pairs with G. Within a DNA molecule the shares of C-G and A-T are roughly even, about 50% each. If the content of C and G is too high, the strand may fold into complex physical structures, which complicates DNA sequencing (decoding).
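A coding scheme can check this balance before a strand is sent for synthesis. The sketch below computes the share of G and C in a strand and flags strands that stray too far from 50%. The 40% to 60% window and the function names are assumptions chosen for illustration, not figures from any published scheme.

```python
def gc_content(strand: str) -> float:
    """Fraction of bases in the strand that are G or C."""
    if not strand:
        raise ValueError("strand must not be empty")
    return sum(base in "GC" for base in strand) / len(strand)


def is_balanced(strand: str, low: float = 0.4, high: float = 0.6) -> bool:
    """True if GC content lies inside the (assumed) acceptable window."""
    return low <= gc_content(strand) <= high


print(gc_content("ACGT"), is_balanced("ACGT"))          # 0.5 True
print(gc_content("CAGACGGC"), is_balanced("CAGACGGC"))  # 0.75 False
```

A strand that fails the check would have to be re-encoded, which is one reason a plain two-bit mapping is rarely used on its own.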

Steps to DNA Storage | Source: DNA Data Storage Alliance

Errors are also inevitable in the process of "stringing beads" (that is, synthesizing DNA strands). At present there is about one error for every 100 bases synthesized, a bottleneck imposed by current chemical synthesis technology. Each base is added with a correct rate of more than 99%, but as the strand gets longer the roughly 1% chance of error per base compounds, and errors become unavoidable. For this reason a single synthetic DNA strand is generally no longer than 100 bases, with a limit of about 300 bases. In nature, DNA molecules often run to several thousand base pairs.
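To see why length is the problem, here is a back-of-the-envelope calculation. It assumes a per-base error rate of 1% (the "one error per 100 bases" figure above) and assumes that errors at different positions are independent, which is a simplification.

```python
def error_free_probability(length: int, per_base_error: float) -> float:
    """Chance that every base in a strand of the given length is correct,
    assuming independent errors with the same rate at each position."""
    return (1.0 - per_base_error) ** length


for n in (100, 300, 3000):
    p = error_free_probability(n, 0.01)
    print(f"{n:>5} bases: {p:.2e} chance of a flawless strand")
```

Under these assumptions only about 37% of 100-base strands come out flawless, about 5% of 300-base strands, and practically none at several thousand bases.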

That is, although DNA has a great storage capacity, it has to exist as many short strands. When a lot of information is stored, these short strands are like an unbound book: it can hold a great deal, but only as loose sheets of paper with page numbers. Short strands can of course be spliced into long ones, but that means an additional step, and the long strands have to be broken back into short ones for sequencing, because current technology cannot read a long strand in one pass.
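The "loose sheets with page numbers" idea can be sketched as follows. A long payload is cut into fixed-size pieces, each tagged with its position, so the pieces can be put back in order no matter how they come out of the tube. In this sketch the page number is kept as a plain integer for readability; in real DNA it would itself have to be written as bases. All names are hypothetical.

```python
def split_into_pages(payload: str, page_size: int) -> list[tuple[int, str]]:
    """Cut a payload into (page_number, chunk) pairs."""
    return [
        (start // page_size, payload[start:start + page_size])
        for start in range(0, len(payload), page_size)
    ]


def reassemble(pages) -> str:
    """Sort the pages by page number and join the chunks."""
    return "".join(chunk for _, chunk in sorted(pages))


pages = split_into_pages("ACGTACGTAC", 4)
print(pages)                        # [(0, 'ACGT'), (1, 'ACGT'), (2, 'AC')]
print(reassemble(reversed(pages)))  # ACGTACGTAC
```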

Sequencing has an error rate as well. Although the current error rate is as low as the order of 10^-3, it is still at least 9 orders of magnitude worse than the read and write error rates of commercial hard drives.

Accuracy is limited by both technologies, synthesis and sequencing, so scientists have turned to the encoding scheme to get around the problem: they build error correction mechanisms into the encoding. In this way, even if errors occur during base synthesis and sequencing, the content stored in the DNA can still be read out correctly.
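The simplest form of error correction is plain redundancy: keep several copies and let them vote. The sketch below takes several reads of the same strand and picks the most common base at each position. It is a toy that only handles substitution errors in reads of equal length, and it is not the scheme used by any team named in this article; practical systems rely on stronger error-correcting codes.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Majority vote at each position across reads of equal length."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("need at least one read, all of the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in column
        for column in zip(*reads)
    )


reads = ["ACGTAC", "ACGTAC", "ACCTAC", "TCGTAC"]  # two reads carry one error each
print(consensus(reads))  # ACGTAC
```

The price of this safety is capacity: four copies of everything means four times as much DNA to synthesize and sequence.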

03 Out of the lab, where speed and cost also matter

DNA storage is also trying to move out of the lab.

In October 2020, Microsoft, Western Digital, the gene sequencing giant Illumina, the DNA synthesis startup Twist Bioscience, and others jointly established the DNA Data Storage Alliance.

This is the world's first alliance in this field to span academia and the industrial chain. It hopes to develop technical and format standards that will eventually lead to a common commercial system.

Microsoft Research launched the DNA storage project in 2015 and hired Karin Strauss, an associate professor at the University of Washington's School of Computer Science and Engineering, as Senior Principal Research Manager.

In 2013, she and her colleagues visited EBI in the United Kingdom, learned about the DNA storage research by Goldman and his colleagues, and became very interested in this direction. "We're excited about the density, stability and maturity of DNA," Strauss said.

In their research, they wanted to add another feature: random access. With common DNA sequencing technology, all the strands must be read at once to obtain the information: either read everything or read nothing. If you only want a small fragment of the data, that is very cumbersome.

In 2016, they published a study that could search for a given image in the information already stored in DNA, locate it, use an enzyme to copy the desired piece of DNA, and then simply read that small segment.
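As a rough software analogy for this kind of selective reading, imagine that every strand in the pool begins with a short address sequence that identifies the file it belongs to. That layout is an assumption made for this sketch; the article only says that the desired piece is located and copied with an enzyme. The function name and the sample addresses are likewise made up.

```python
def select_by_address(pool: list[str], address: str) -> list[str]:
    """Return the payloads of strands that start with the given address,
    with the address itself stripped off."""
    return [
        strand[len(address):]
        for strand in pool
        if strand.startswith(address)
    ]


pool = [
    "ACGT" + "AAAA",  # belongs to the file addressed ACGT
    "TTGG" + "GGGG",  # belongs to a different file
    "ACGT" + "CCCC",  # belongs to the file addressed ACGT
]
print(select_by_address(pool, "ACGT"))  # ['AAAA', 'CCCC']
```

In the lab the selection happens chemically on the whole pool at once, so only the strands of interest need to be sequenced.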

Karin Strauss (right) and two research collaborators | Source: csenews

Synthesis speed and cost also need to be addressed to bring DNA storage one step closer to commercial use. Today DNA synthesis writes data at rates on the order of kilobytes (KB) per second, while mature cloud storage solutions already exceed gigabytes (GB) per second.

This means that the speed of writing DNA needs to improve by another six orders of magnitude. How can throughput be increased? Just as parallel computing speeds up data processing, scientists hope to synthesize many DNA strands in parallel, at the same time.
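The size of that gap is easy to check. The sketch below uses round figures of exactly 1 KB/s and 1 GB/s with decimal units, which are simplifying assumptions, and asks how long it would take to write 1 TB at each speed.

```python
import math

KB, GB, TB = 10**3, 10**9, 10**12  # decimal units, in bytes
SECONDS_PER_YEAR = 365 * 24 * 3600


def orders_of_magnitude(slow_rate: float, fast_rate: float) -> float:
    """How many powers of ten separate two rates."""
    return math.log10(fast_rate / slow_rate)


def seconds_to_write(size_bytes: float, bytes_per_second: float) -> float:
    """Time needed to write the given amount of data at a constant rate."""
    return size_bytes / bytes_per_second


print(orders_of_magnitude(KB, GB))                   # 6.0
print(seconds_to_write(TB, KB) / SECONDS_PER_YEAR)   # about 31.7 years
print(seconds_to_write(TB, GB) / 60)                 # about 16.7 minutes
```

At kilobyte speeds a single terabyte would take decades to write, which is why parallel synthesis matters so much.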

In 2021, Microsoft developed the first nanoscale DNA memory, which can synthesize 25 × 10^6 (25 million) base sequences simultaneously on each square centimeter. The new technology raised the number of base sequences that can be synthesized at once from single digits to thousands. This throughput brings DNA synthesis to the order of megabytes (MB) per second.

New method greatly increases the number of arrays for DNA synthesis | Source: Microsoft Research

Greater throughput means lower cost. DNA storage today costs $800 million per terabyte (TB), while the cost of tape storage has fallen below $16 per terabyte, so DNA looks uncompetitive by comparison. However, large real-world data centers are extremely expensive to maintain, and their hardware has to be replaced regularly. Against that, the density, small size, and long-term stability of DNA storage become an overwhelming advantage.

Therefore, "cold data", which is large in volume and rarely read, is considered the nearest-term application scenario for DNA storage. Twist Bioscience highlighted in a recent market report that the technology could help tech companies deploy more efficiently at "large scale, low power."

Other, more optimistic scientists put their faith in technological progress.

Since the completion of the Human Genome Project in 2003, the cost of sequencing has fallen by a factor of 2 million. In 2016, when write speeds were still measured in kilobytes per second, Goldman said, "[Read and write speeds] 6 orders of magnitude is no big deal for genomics. You just have to wait a little longer."

How long is "a little longer"? The field seems to be on the verge of a breakthrough.

