When the Human Genome Project published its draft sequence over a decade ago, in 2001, many biologists were surprised that the human genome contains only about 20000 to 25000 genes, which occupy only about 1% of the total DNA sequence. How could such a complex system as a human being be coded for by only 25000 instructions, when a tiny worm (C. elegans) has 21000, and a bacterium (E. coli) has 5000? And what was the point of carrying all this other DNA around? The term “junk DNA”, which had been coined decades earlier, came to be applied to these non-protein-coding sequences. Are they really junk (relics of past evolution, gene duplication, viral infection, and other hypothesized mechanisms of sequence accumulation), or do they have a function that we simply haven’t found out about yet?
The ENCODE (Encyclopedia of DNA Elements) project has recently released its first results in a series of publications in Nature and other scientific journals. There was plenty of news surrounding it, which is not surprising given the $180 million that has already been spent and the more than 400 scientists working on it. Articles appeared in major newspapers and news outlets, but then I got an email from a college friend, who’s not a biologist, asking me what all this fuss was about. He’d read the newspaper articles, which were full of quotes from scientists lavishing praise on the project and its promise, but couldn’t quite figure out exactly what it was about. The generic-sounding name of the project doesn’t give many clues, either. And so, even though I’m supposed to be a microbiologist and not a human geneticist, I thought I’d take a shot at explaining the significance of the ENCODE results.
The idea of the “gene” has had a long and problematic history. It is a peculiar history for a biological concept, because the existence of genes was predicted by theory (the experiments of Mendel and his successors) long before we had any clue what the physical nature of genes was. Eventually biologists figured out that DNA was the hereditary material, and the Central Dogma took shape: that genetic information stored on DNA was first transcribed to RNA, which was then translated to proteins, and that proteins were the main functioning parts of the cell: the scaffolding, motors, and carriers that performed the business of life. A gene was just the information needed to make a protein, and if asked to define the physical manifestation of a gene, one would have said that it was the stretch of DNA that held the instructions for making that protein.
As we found out more about genes and how they were regulated, however, the story began to get complicated. It wasn’t necessarily true that one gene = one protein: some genes could be spliced in different ways to give different products. Nor was it true that proteins were the only functional pieces of the cell: sure, we’ve known about ribosomal RNAs (rRNA) and transfer RNAs (tRNA) for a long time, but other small RNAs, some with catalytic functions, have also been discovered. Gradually, we’ve also come to appreciate that the complexity of life is not just the result of proteins interacting with other proteins, with DNA sitting passively by; many proteins also interact with DNA, determining which genes are going to be expressed, and which should stay silent. In the classical textbook examples, regulatory elements lie close to the genes which they administer, but it is now known that this isn’t necessarily so. New mechanisms of silencing genes have been discovered, some of which work by chemically modifying DNA (methylation), and others which involve a dizzying dance of RNA and protein molecules (RNA interference). Is it time to revise the concept of the “gene”, or should we simply acknowledge that protein-coding genes are not the only significant pieces of a genome?
The ENCODE project has shown very clearly that protein-coding genes are not the only pieces of a genome that matter. Far from being “junk”, the non-protein-coding sequences in a genome are actually doing something and not just sitting there passively. According to the authors, at least 80% of the genome can be experimentally shown to have some kind of function (recall that only about 1% of the genome codes for proteins), or as they put it: a “demonstrable biochemical function”. These sequences may be regulatory elements such as promoters, i.e. sequences which control the expression of genes by binding to proteins involved in the machinery of transcription. They could be regions that are controlled by DNA methylation, which silences expression. Or they could be regions which are exposed to transcription factors, instead of being tightly wound around histones.
This was a large-scale project with many participants, and their approach was one of systematic cataloging. They used almost 150 different human cell lines in culture (including the famous HeLa cells) and performed different experiments to spot different functions. For example, to see what portion of the DNA was actually being transcribed, they extracted and sequenced RNA en masse, an approach made possible by new methods of nucleic acid sequencing which can sequence huge numbers of short fragments very quickly. To see what sequences bind to known proteins, they used a method called ChIP, chromatin immunoprecipitation. For a protein which is thought to bind to DNA, antibodies are raised against it. Inside the cell, the protein is already bound to the specific sequences that it interacts with, so a chemical is added to cross-link the protein to the DNA where it sits. The DNA is then broken into small fragments, and the protein-DNA complexes are fished out using the antibodies developed earlier. The crosslinks are removed, and the DNA is sequenced, to find out what region of the genome this protein interacts with. Other types of experiments were performed to find out which parts of the genome are methylated, are accessible to transcription factors, and so on. They’ve also analyzed how the regulatory elements in a genome interact with each other, and how the three-dimensional folding of the DNA itself affects the interactions between different parts of the genome.
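To give a feel for what happens after the sequencing step of a ChIP experiment, here is a toy sketch in Python. It is not the ENCODE consortium's actual analysis, just an illustration under stated assumptions: each sequenced fragment has already been matched to a position in the genome, there is a second "control" sample prepared without the antibody, and a region counts as a candidate binding site if it collects several times more fragments than the control does. The function names, the bin size, and the threshold are all made up for this example.

```python
from collections import Counter


def bin_counts(read_positions, bin_size):
    """Count how many reads start in each fixed-width bin of the genome."""
    return Counter(pos // bin_size for pos in read_positions)


def enriched_bins(chip_reads, control_reads, bin_size=100,
                  min_fold=3.0, pseudocount=1.0):
    """Return (start, end, fold) for bins where ChIP reads pile up.

    Hypothetical helper: the fold change is ChIP count over control count,
    with a pseudocount added to both so empty control bins do not divide
    by zero. Real peak callers use more careful statistics.
    """
    chip = bin_counts(chip_reads, bin_size)
    control = bin_counts(control_reads, bin_size)
    hits = []
    for b in sorted(chip):
        fold = (chip[b] + pseudocount) / (control.get(b, 0) + pseudocount)
        if fold >= min_fold:
            hits.append((b * bin_size, (b + 1) * bin_size, fold))
    return hits


# Invented read positions: five ChIP reads cluster between 1000 and 1100.
chip_reads = [1010, 1020, 1035, 1050, 1090, 5000, 7300]
control_reads = [1040, 5010, 7310, 9000]
candidate_sites = enriched_bins(chip_reads, control_reads)
```

With these invented numbers, only the bin from 1000 to 1100 stands out from the control, which is the kind of signal that suggests the protein was sitting on that stretch of DNA.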
One of the surprising findings was that much of the genome (75%) is actually transcribed to RNA at some point or another, even though most of these transcripts don’t end up being translated to proteins. Textbooks usually mention the three classical types of RNA: messenger RNA (mRNA), rRNA, and tRNA (described above), but aside from these workhorses, RNA was usually thought of as a sort of “relic” molecule, doing only these few menial jobs. But recent research is accumulating evidence for the importance of various kinds of small RNAs in the eukaryotic cell, and the ENCODE project could help with the task of cataloging them all.
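A figure like "75% of the genome is transcribed" boils down to simple bookkeeping: collect every stretch of the genome that shows up in some RNA, merge the stretches that overlap so nothing is counted twice, and divide the total by the length of the genome. The Python sketch below shows that arithmetic on a made-up genome of 1000 bases; the function names and the intervals are assumptions for illustration, not anything taken from the ENCODE data.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals.

    Intervals are half-open: start is included, end is not.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval, so extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Invented example: three transcribed stretches, two of which overlap.
transcribed = [(0, 300), (200, 500), (700, 950)]
fraction = covered_fraction(transcribed, genome_length=1000)
```

Here the first two stretches merge into one running from 0 to 500, the third adds 250 more bases, and the covered fraction comes out to 0.75.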
There are a whole bunch of other sub-projects (or “threads”, as they call them on the ENCODE website) that have been carried out by the ENCODE consortium. As a way of doing science, I think it points the way to the future: big consortia collecting big data and crunching big numbers. For biology, the new sequencing technologies (collectively called “next-generation sequencing”, or NGS) are a tremendous advance over traditional Sanger-type sequencing, which was developed decades ago and has changed little in principle since. NGS came onto the market just in time for the ENCODE project, allowing them to sequence several times as much DNA for the same cost. We are now reaching the point where you could have your own personal genome sequence for a thousand dollars or even less. The limit on what we can do is imposed instead by our ability to store, transmit, and compute such massive quantities of data.
The first batch of papers published by the ENCODE consortium is available via the website of Nature. They’ve also got some snazzy interactive graphics.