genome assembly
Schatz MC, Delcher AL & Salzberg SL 2010 Assembly of large genomes using second-generation sequencing. Genome Res 20:1165-1173.
- the scaffolding phase of assembly focuses on resolving repeats by linking the initial contigs into scaffolds, guided by mate-pair data
- mate pairs constrain the separation distance and the orientation of contigs containing mated reads
- a scaffold is a collection of contigs linked by mate pairs, in which the gaps between contigs may represent either repeats (in which case the gap can in theory be filled with one or more copies of the repeat) or true gaps, where the original sequencing project did not capture the sequence needed to fill the gap
- if the mate pair distances are long enough, they permit the assembler to link contigs across almost all repeats
- if a contig contains too many reads for its length (i.e. its depth of coverage is well above what is typical for the assembly), then it is flagged as a repeat
- after flagging repeats, an assembler can build scaffolds by connecting unique contigs using mate-pair links
- if the contigs in a scaffold overlap, the assembler can merge them at this point
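The scaffolding steps above (flag repeats by read depth, then join unique contigs through mate-pair links) can be sketched in a few lines. This is only a toy illustration, not the method of any particular assembler: the function names, the `max_ratio` and `min_links` thresholds, and the example data are all assumptions, and the sketch ignores the distance and orientation constraints that real mate pairs provide.

```python
from collections import Counter
from statistics import median


def flag_repeats(contigs, max_ratio=1.5):
    """Return contigs whose reads-per-base exceeds max_ratio x the median.

    contigs maps a contig name to (length_in_bases, read_count).
    max_ratio is an illustrative threshold, not a published value.
    """
    density = {name: reads / length for name, (length, reads) in contigs.items()}
    typical = median(density.values())
    return {name for name, d in density.items() if d > max_ratio * typical}


def build_scaffolds(contigs, mate_links, min_links=2, max_ratio=1.5):
    """Group unique contigs that share at least min_links mate pairs."""
    repeats = flag_repeats(contigs, max_ratio)
    unique = [name for name in contigs if name not in repeats]

    # Count mate pairs per unordered contig pair, skipping flagged repeats.
    support = Counter(
        tuple(sorted((a, b)))
        for a, b in mate_links
        if a != b and a not in repeats and b not in repeats
    )

    # Union-find over unique contigs: each connected component is a scaffold.
    parent = {name: name for name in unique}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), count in support.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for name in unique:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values()), repeats


# Hypothetical data: c3 is short but collects far more reads per base.
contigs = {
    "c1": (10000, 300),
    "c2": (8000, 240),
    "c3": (500, 60),
    "c4": (12000, 360),
    "c5": (9000, 270),
}
# Each tuple is one mate pair whose two reads landed in the named contigs.
mate_links = [
    ("c1", "c2"), ("c1", "c2"),
    ("c2", "c3"), ("c3", "c4"),
    ("c2", "c4"), ("c4", "c2"),
    ("c4", "c5"),
]

scaffolds, repeats = build_scaffolds(contigs, mate_links)
print(scaffolds, repeats)
```

With this data, `c3` is flagged as a repeat, and `c2` and `c4` are still joined because two mate pairs span the repeat between them; `c5` stays on its own because a single link falls below the assumed `min_links` threshold.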
- a potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of memory (random access memory, or RAM)
- unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized
- as a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes
- for a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers
- as of this 2010 review, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome
- ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total)
- SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010)
- the lowest cost 454 (GS FLX) method was ~22 times more expensive, per megabase, than Illumina
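To make the memory concern noted above for de Bruijn assemblers more concrete, here is a minimal sketch of building a de Bruijn graph from reads. It is a toy under stated assumptions: the function name `de_bruijn`, the choice of `k=4`, and the three reads are hypothetical, and production assemblers such as Velvet, ABySS, or SOAPdenovo use far more compact representations than a Python dictionary.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to.

    Every distinct k-mer in the reads contributes one edge, from its
    first k-1 bases to its last k-1 bases.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads; real inputs would be millions of reads.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph = de_bruijn(reads, k=4)

nodes = len(graph)
edges = sum(len(targets) for targets in graph.values())
print(nodes, edges)
```

The nine k-mer occurrences in these reads collapse into four edges, so the graph grows with the number of distinct k-mers rather than with the number of reads. In this sketch the whole dictionary must be resident in memory while it is built and traversed, which illustrates why a large genome demands so much RAM.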