The Biology Refugia: Human DNA contaminates genome databases

Friday, February 18, 2011

Human DNA contaminates genome databases

Scientists rely extensively on public databases of genomic sequences, such as GenBank, based in the USA, and the European EMBL. They might use these databases for detailed and exacting statistical studies of molecular evolution, or for a quick search to check whether a sequence they've just amplified is indeed what they think it is. Most biology students today have done a BLAST search at some point, underscoring the ubiquity and usefulness of these bioinformatics tools.

One big problem that haunts these sequence databases, however, is contamination and mis-annotation. Sequences are not always what they say they are, and scientists who don't carefully check the provenance of their data, or who trust that "because it's in GenBank it should be ok", may find their painstaking analysis to be in vain if they used the wrong data to start with. While this problem is widely acknowledged, it perhaps comes as a surprise to hear that nearly 20% of non-primate genomes in public databases have some degree of contamination from human sequences, probably DNA shed from the skin of the very scientists and technicians who prepared the material for sequencing.

Why non-primate genomes in particular? The research team from the University of Connecticut decided to use a particular short (fewer than 300 bases) sequence in the human genome, called AluY, which is repeated in great numbers and is highly conserved among humans and other primates. Alu elements are retrotransposons and an example of SINEs (short interspersed elements) that make up much of the genome. The AluY subfamily is derived from an insertion event that happened in the common ancestor of primates, and its specificity to primates is a marker of our common ancestry. Because AluY should not occur outside primates, finding it in a non-primate genome points to contamination rather than to the organism's own DNA.
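To make the idea of screening with a lineage-specific marker concrete, here is a minimal Python sketch. It is not the method the research team used (the post does not describe their search procedure), and in practice an alignment tool such as BLAST would be the usual choice. The marker below is a made-up placeholder, not the real AluY sequence, and the function names, the k-mer length and the threshold are all assumptions chosen for illustration.

```python
# Toy marker screen: flag records that share many k-mers with a marker.
# TOY_MARKER is a made-up placeholder, NOT the real AluY sequence.
TOY_MARKER = "ACGTTGCAAGGCTTACCGATAGGCTAACGT"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement; unknown bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def marker_fraction(record: str, marker: str, k: int = 12) -> float:
    """Fraction of the marker's k-mers found in the record (either strand)."""
    marker_kmers = kmers(marker, k)
    if not marker_kmers:
        return 0.0
    record_kmers = kmers(record, k) | kmers(reverse_complement(record), k)
    return len(marker_kmers & record_kmers) / len(marker_kmers)


def flag_contaminated(records: dict, marker: str, k: int = 12,
                      threshold: float = 0.5) -> list:
    """Names of records whose marker fraction meets the (assumed) threshold."""
    return [name for name, seq in records.items()
            if marker_fraction(seq, marker, k) >= threshold]


# Hypothetical records standing in for non-primate database entries.
records = {
    "clean": "T" * 20 + "G" * 20,
    "contaminated": "GATTACA" + TOY_MARKER + "TGCATGCA",
    "contaminated_rc": "CCCC" + reverse_complement(TOY_MARKER) + "AAAA",
}

print(flag_contaminated(records, TOY_MARKER))
```

With these toy records the screen flags the two entries that carry the marker, on the forward and the reverse strand respectively, and leaves the clean one alone. Exact k-mer matching is far less tolerant of mutations than a real alignment search, so treat this only as an illustration of the logic.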

This is a creative use of past evolutionary events for tackling what is effectively an applied, technical question. It really helps to know some history!
