noncoding DNA
Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K & Lander ES 2007 Distinguishing protein-coding and noncoding genes in the human genome. PNAS 104:19428-19433.
- current catalogs list a total of ≈24,500 putative protein-coding genes
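- it has been suggested that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog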
- there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation
- the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages
- we reject this hypothesis by carefully analyzing the nonconserved ORFs—specifically, their properties in other primates
- the vast majority of these ORFs are random occurrences
- the analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to ≈20,500
- nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein
- current catalogs of protein-coding genes vary widely among mammals, with a recent analysis of the dog genome (8) reporting ≈19,000 genes and a recent article on the mouse genome (2) reporting at least 33,000 genes
- the analysis implies that the mammalian protein-coding genes have been largely stable, with relatively little invention of truly novel genes
- the mouse and dog genomes were used for comparison for two reasons: high-quality genomic sequence is available, and the extent of sequence divergence is well suited for gene identification
- the nucleotide substitution rate relative to human is ≈0.50 per base for mouse and ≈0.35 for dog, with insertion and deletion (indel) events occurring at a frequency that is ≈10-fold lower
- the nonconserved ORFs studied here were typically included in current gene catalogs because they have the potential to encode at least 100 amino acids
- we thus do not know whether our conclusions would apply to much shorter ORFs
- there may be a few hundred additional protein-coding genes to be found, but the final total is likely to remain under ≈21,000
- truly novel protein-coding genes (encoding at least 100 amino acids) arise only rarely in mammalian lineages
- there are only 168 "human-specific" genes
- they belong to small paralogous families within the human genome (2 to 9 members) or contain Pfam domains homologous to other proteins
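
The notes above lean on two ideas: ORFs can be "present by chance", and catalogs typically admit ORFs with the potential to encode at least 100 amino acids. The sketch below illustrates both with a back-of-envelope model that is not taken from the paper. It assumes (hypothetically) that all 64 codons are equally likely, so 3 of 64 are stops, and it scans only the forward strand for ATG-to-stop ORFs. The function names and the uniform-codon assumption are illustrative only.

```python
# Illustrative sketch only; the uniform-codon model and all names here are
# assumptions for this note, not the method used by Clamp et al.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stops in the standard genetic code


def chance_open_probability(n_codons: int, stop_fraction: float = 3 / 64) -> float:
    """Probability that n_codons consecutive random codons contain no stop.

    Assumes codons are independent and a fraction stop_fraction of them are
    stops (3/64 if all 64 codons are equally likely).
    """
    return (1.0 - stop_fraction) ** n_codons


def longest_orf_codons(seq: str) -> int:
    """Length in codons of the longest ATG-to-stop ORF on the forward strand.

    The count includes the ATG and excludes the stop codon. ORFs that run off
    the end of the sequence without a stop are not counted.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


if __name__ == "__main__":
    # Chance that a random frame stays open for 100 codons under this model.
    print(chance_open_probability(100))
    # A toy sequence: ATG AAA TGA, i.e. a 2-codon ORF followed by a stop.
    print(longest_orf_codons("ATGAAATGA"))
```

Under this simple model, a given stretch of 100 random codons stays open a bit less than 1% of the time, which is small for any one position but not negligible across the many long transcripts in a genome. Real sequence is not uniform in base composition, so the figure is only an order-of-magnitude guide.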