genome assembly

Schatz MC, Delcher AL & Salzberg SL 2010 Assembly of large genomes using second-generation sequencing. Genome Res 20:1165-1173.

  • the scaffolding phase of assembly focuses on resolving repeats by linking the initial contigs into scaffolds, guided by mate-pair data
  • mate pairs constrain the separation distance and the orientation of contigs containing mated reads
  • a scaffold is a collection of contigs linked by mate pairs
  • the gaps between contigs in a scaffold may represent either repeats, in which case the gap can in theory be filled with one or more copies of the repeat, or true gaps, in which the original sequencing project did not capture the sequence needed to fill the gap
  • if the mate pair distances are long enough, they permit the assembler to link contigs across almost all repeats
  • if a contig contains too many reads (i.e., its depth of coverage is well above what the rest of the assembly predicts), then it is flagged as a repeat
  • after flagging repeats, an assembler can build scaffolds by connecting unique contigs using mate-pair links
  • if the contigs in a scaffold overlap, the assembler can merge them at this point
  • a potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of computer memory (random access memory, or RAM)
  • unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized
  • as a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes
  • for a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers
  • only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome
  • ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total)
  • SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010)
  • the lowest cost 454 (GS FLX) method was ~22 times more expensive, per megabase, than Illumina
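
The repeat-flagging and scaffolding notes above (flag contigs with too many reads, then connect the unique contigs using mate-pair links) can be sketched roughly as follows. This is a minimal illustration and not the method of any particular assembler: the function names, the `max_ratio` depth threshold, and the `min_support` link threshold are all assumptions made for the example, and a plain depth ratio stands in for whatever statistic a real assembler uses.

```python
from collections import Counter


def flag_repeats(contigs, read_len, expected_coverage, max_ratio=1.5):
    """Return the names of contigs whose read depth looks too high.

    contigs: dict mapping contig name -> (length_bp, n_reads).
    max_ratio is a hypothetical cutoff: a contig is flagged when its
    depth exceeds max_ratio * expected_coverage.
    """
    repeats = set()
    for name, (length_bp, n_reads) in contigs.items():
        depth = n_reads * read_len / length_bp
        if depth > max_ratio * expected_coverage:
            repeats.add(name)
    return repeats


def scaffold_links(mate_links, repeats, min_support=2):
    """Count mate-pair links between unique contigs.

    mate_links: iterable of (contig_a, contig_b), one entry per mate pair
    whose two reads landed in different contigs.
    Links touching a flagged repeat are ignored; a contig pair is kept only
    if at least min_support mate pairs agree (hypothetical threshold).
    """
    support = Counter()
    for a, b in mate_links:
        if a == b or a in repeats or b in repeats:
            continue
        support[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Toy data: two unique contigs at ~30x depth and one short, over-collapsed repeat.
contigs = {"c1": (10_000, 3_000), "c2": (8_000, 2_400), "r1": (500, 600)}
repeats = flag_repeats(contigs, read_len=100, expected_coverage=30)
links = scaffold_links(
    [("c1", "c2"), ("c2", "c1"), ("c1", "r1"), ("r1", "c2"), ("c1", "c2")],
    repeats,
)
print(repeats, links)
```

The sketch ignores the distance and orientation constraints that mate pairs also provide; a fuller version would check each link against the expected insert size before counting it.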