noncoding DNA

Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K & Lander ES 2007 Distinguishing protein-coding and noncoding genes in the human genome. PNAS 104:19428-19433.

  • current catalogs list a total of ≈24,500 putative protein-coding genes
  • a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts
  • there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation
  • the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages
  • we reject this hypothesis by carefully analyzing the nonconserved ORFs—specifically, their properties in other primates
  • the vast majority of these ORFs are random occurrences
  • the analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to ≈20,500
  • nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein
  • current catalogs of protein-coding genes vary widely among mammals, with a recent analysis of the dog genome (8) reporting ≈19,000 genes and a recent article on the mouse genome (2) reporting at least 33,000 genes
  • the analysis implies that the mammalian protein-coding genes have been largely stable, with relatively little invention of truly novel genes
  • the mouse and dog genomes were used for the comparison because high-quality genomic sequence is available and the extent of sequence divergence is well suited for gene identification
  • the nucleotide substitution rate relative to human is ≈0.50 per base for mouse and ≈0.35 for dog, with insertion and deletion (indel) events occurring at a frequency that is ≈10-fold lower
  • the nonconserved ORFs studied here were typically included in current gene catalogs because they have the potential to encode at least 100 amino acids
  • we thus do not know whether our conclusions would apply to much shorter ORFs
  • there may be a few hundred additional protein-coding genes to be found, but the final total is likely to remain under ≈21,000
  • truly novel protein-coding genes (encoding at least 100 amino acids) arise only rarely in mammalian lineages
  • there are only 168 "human-specific" genes
  • these human-specific genes belong to small paralogous families within the human genome (2 to 9 members) or contain Pfam domains homologous to other proteins
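
The figures quoted in these notes can be cross-checked with a little arithmetic. The sketch below is only an illustration, not part of the paper's analysis: the function and constant names are made up here, the approximate values are copied from the bullets above, and two assumptions are introduced for the sake of the calculation (the indel frequency is taken as exactly one tenth of the substitution rate, and the revised catalog size of ≈20,500 is used as the denominator for the human-specific share).

```python
# Approximate figures copied from the notes above.
CATALOG_BEFORE = 24_500   # putative protein-coding genes in current catalogs
CATALOG_AFTER = 20_500    # revised count proposed by the analysis
HUMAN_SPECIFIC = 168      # "human-specific" genes
SUBSTITUTION_RATE = {"mouse": 0.50, "dog": 0.35}  # per base, relative to human
INDEL_FOLD_LOWER = 10     # assumption: "≈10-fold lower" treated as exactly 10


def catalog_reduction(before: int, after: int) -> tuple[int, float]:
    """Return (entries removed, fraction of the original catalog removed)."""
    removed = before - after
    return removed, removed / before


def indel_rates(substitution: dict[str, float], fold_lower: float) -> dict[str, float]:
    """Estimate indel frequency per base as substitution rate / fold_lower."""
    return {species: rate / fold_lower for species, rate in substitution.items()}


if __name__ == "__main__":
    removed, frac = catalog_reduction(CATALOG_BEFORE, CATALOG_AFTER)
    print(f"entries removed: {removed} ({frac:.1%} of the catalog)")
    # Assumption: the share is computed against the revised catalog size.
    print(f"human-specific share: {HUMAN_SPECIFIC / CATALOG_AFTER:.2%}")
    for species, rate in indel_rates(SUBSTITUTION_RATE, INDEL_FOLD_LOWER).items():
        print(f"{species}: ~{rate:.3f} indels per base")
```

Under these assumptions the revision removes about 4,000 entries (roughly one sixth of the catalog), the 168 human-specific genes come to under 1% of the revised total, and the implied indel frequencies are about 0.05 per base for mouse and 0.035 for dog.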