Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Exploiting sparseness in de novo genome assembly

Identifieur interne : 000942 ( Pmc/Corpus ); précédent : 000941; suivant : 000943

Exploiting sparseness in de novo genome assembly

Auteurs : Chengxi Ye ; Zhanshan Sam Ma ; Charles H. Cannon ; Mihai Pop ; Douglas W. Yu

Source :

RBID : PMC:3369186

Abstract

Background

The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.

Methods

In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.

Results

We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.


Url:
DOI: 10.1186/1471-2105-13-S6-S1
PubMed: 22537038
PubMed Central: 3369186

Links to Exploration step

PMC:3369186

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Exploiting sparseness in
<italic>de novo </italic>
genome assembly</title>
<author>
<name sortKey="Ye, Chengxi" sort="Ye, Chengxi" uniqKey="Ye C" first="Chengxi" last="Ye">Chengxi Ye</name>
<affiliation>
<nlm:aff id="I1">Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Ecology, Conservation, and Environment Center; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Department of Computer Science and Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ma, Zhanshan Sam" sort="Ma, Zhanshan Sam" uniqKey="Ma Z" first="Zhanshan Sam" last="Ma">Zhanshan Sam Ma</name>
<affiliation>
<nlm:aff id="I4">Computational Biology and Medical Ecology Lab; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Cannon, Charles H" sort="Cannon, Charles H" uniqKey="Cannon C" first="Charles H" last="Cannon">Charles H. Cannon</name>
<affiliation>
<nlm:aff id="I5">Ecological Evolution Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I6">Department of Biological Sciences, Texas Tech University, Lubbock, TX 79410 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pop, Mihai" sort="Pop, Mihai" uniqKey="Pop M" first="Mihai" last="Pop">Mihai Pop</name>
<affiliation>
<nlm:aff id="I3">Department of Computer Science and Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yu, Douglas W" sort="Yu, Douglas W" uniqKey="Yu D" first="Douglas W" last="Yu">Douglas W. Yu</name>
<affiliation>
<nlm:aff id="I2">Ecology, Conservation, and Environment Center; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I7">School of Biological Sciences, University of East Anglia, Norwich, Norfolk NR47TJ UK</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22537038</idno>
<idno type="pmc">3369186</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3369186</idno>
<idno type="RBID">PMC:3369186</idno>
<idno type="doi">10.1186/1471-2105-13-S6-S1</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000942</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000942</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Exploiting sparseness in
<italic>de novo </italic>
genome assembly</title>
<author>
<name sortKey="Ye, Chengxi" sort="Ye, Chengxi" uniqKey="Ye C" first="Chengxi" last="Ye">Chengxi Ye</name>
<affiliation>
<nlm:aff id="I1">Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Ecology, Conservation, and Environment Center; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Department of Computer Science and Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ma, Zhanshan Sam" sort="Ma, Zhanshan Sam" uniqKey="Ma Z" first="Zhanshan Sam" last="Ma">Zhanshan Sam Ma</name>
<affiliation>
<nlm:aff id="I4">Computational Biology and Medical Ecology Lab; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Cannon, Charles H" sort="Cannon, Charles H" uniqKey="Cannon C" first="Charles H" last="Cannon">Charles H. Cannon</name>
<affiliation>
<nlm:aff id="I5">Ecological Evolution Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I6">Department of Biological Sciences, Texas Tech University, Lubbock, TX 79410 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pop, Mihai" sort="Pop, Mihai" uniqKey="Pop M" first="Mihai" last="Pop">Mihai Pop</name>
<affiliation>
<nlm:aff id="I3">Department of Computer Science and Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yu, Douglas W" sort="Yu, Douglas W" uniqKey="Yu D" first="Douglas W" last="Yu">Douglas W. Yu</name>
<affiliation>
<nlm:aff id="I2">Ecology, Conservation, and Environment Center; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I7">School of Biological Sciences, University of East Anglia, Norwich, Norfolk NR47TJ UK</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The very large memory requirements for the construction of assembly graphs for
<italic>de novo </italic>
genome assembly limit current algorithms to super-computing environments.</p>
</sec>
<sec>
<title>Methods</title>
<p>In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed
<italic>k-</italic>
mers as nodes and the links between these nodes allows the
<italic>de novo </italic>
assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.</p>
</sec>
<sec>
<title>Results</title>
<p>We implement this sparse graph concept in a proof-of-principle software package,
<italic>SparseAssembler</italic>
, utilizing a new sparse
<italic>k-</italic>
mer graph structure evolved from the
<italic>de Bruijn </italic>
graph. We test our
<italic>SparseAssembler </italic>
with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing
<italic>de novo </italic>
assemblers.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Dew, Im" uniqKey="Dew I">IM Dew</name>
</author>
<author>
<name sortKey="Fasulo, Dp" uniqKey="Fasulo D">DP Fasulo</name>
</author>
<author>
<name sortKey="Flanigan, Mj" uniqKey="Flanigan M">MJ Flanigan</name>
</author>
<author>
<name sortKey="Kravitz, Sa" uniqKey="Kravitz S">SA Kravitz</name>
</author>
<author>
<name sortKey="Mobarry, Cm" uniqKey="Mobarry C">CM Mobarry</name>
</author>
<author>
<name sortKey="Reinert, Khj" uniqKey="Reinert K">KHJ Reinert</name>
</author>
<author>
<name sortKey="Remington, Ka" uniqKey="Remington K">KA Remington</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
<author>
<name sortKey="Jaffe, Db" uniqKey="Jaffe D">DB Jaffe</name>
</author>
<author>
<name sortKey="Stanley, K" uniqKey="Stanley K">K Stanley</name>
</author>
<author>
<name sortKey="Butler, J" uniqKey="Butler J">J Butler</name>
</author>
<author>
<name sortKey="Gnerre, S" uniqKey="Gnerre S">S Gnerre</name>
</author>
<author>
<name sortKey="Mauceli, E" uniqKey="Mauceli E">E Mauceli</name>
</author>
<author>
<name sortKey="Berger, B" uniqKey="Berger B">B Berger</name>
</author>
<author>
<name sortKey="Mesirov, Jp" uniqKey="Mesirov J">JP Mesirov</name>
</author>
<author>
<name sortKey="Lander, Es" uniqKey="Lander E">ES Lander</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mullikin, Jc" uniqKey="Mullikin J">JC Mullikin</name>
</author>
<author>
<name sortKey="Ning, Zm" uniqKey="Ning Z">ZM Ning</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Havlak, P" uniqKey="Havlak P">P Havlak</name>
</author>
<author>
<name sortKey="Chen, R" uniqKey="Chen R">R Chen</name>
</author>
<author>
<name sortKey="Durbin, Kj" uniqKey="Durbin K">KJ Durbin</name>
</author>
<author>
<name sortKey="Egan, A" uniqKey="Egan A">A Egan</name>
</author>
<author>
<name sortKey="Ren, Yr" uniqKey="Ren Y">YR Ren</name>
</author>
<author>
<name sortKey="Song, Xz" uniqKey="Song X">XZ Song</name>
</author>
<author>
<name sortKey="Weinstock, Gm" uniqKey="Weinstock G">GM Weinstock</name>
</author>
<author>
<name sortKey="Gibbs, Ra" uniqKey="Gibbs R">RA Gibbs</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Wong, K" uniqKey="Wong K">K Wong</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
<author>
<name sortKey="Jones, Sjm" uniqKey="Jones S">SJM Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, M" uniqKey="Chaisson M">M Chaisson</name>
</author>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P Pevzner</name>
</author>
<author>
<name sortKey="Tang, Hx" uniqKey="Tang H">HX Tang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gnerre, S" uniqKey="Gnerre S">S Gnerre</name>
</author>
<author>
<name sortKey="Maccallum, I" uniqKey="Maccallum I">I MacCallum</name>
</author>
<author>
<name sortKey="Przybylski, D" uniqKey="Przybylski D">D Przybylski</name>
</author>
<author>
<name sortKey="Ribeiro, Fj" uniqKey="Ribeiro F">FJ Ribeiro</name>
</author>
<author>
<name sortKey="Burton, Jn" uniqKey="Burton J">JN Burton</name>
</author>
<author>
<name sortKey="Walker, Bj" uniqKey="Walker B">BJ Walker</name>
</author>
<author>
<name sortKey="Sharpe, T" uniqKey="Sharpe T">T Sharpe</name>
</author>
<author>
<name sortKey="Hall, G" uniqKey="Hall G">G Hall</name>
</author>
<author>
<name sortKey="Shea, Tp" uniqKey="Shea T">TP Shea</name>
</author>
<author>
<name sortKey="Sykes, S" uniqKey="Sykes S">S Sykes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Himmelbauer, H" uniqKey="Himmelbauer H">H Himmelbauer</name>
</author>
<author>
<name sortKey="Dohm, Jc" uniqKey="Dohm J">JC Dohm</name>
</author>
<author>
<name sortKey="Lottaz, C" uniqKey="Lottaz C">C Lottaz</name>
</author>
<author>
<name sortKey="Borodina, T" uniqKey="Borodina T">T Borodina</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Rq" uniqKey="Li R">RQ Li</name>
</author>
<author>
<name sortKey="Zhu, Hm" uniqKey="Zhu H">HM Zhu</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Qian, Wb" uniqKey="Qian W">WB Qian</name>
</author>
<author>
<name sortKey="Fang, Xd" uniqKey="Fang X">XD Fang</name>
</author>
<author>
<name sortKey="Shi, Zb" uniqKey="Shi Z">ZB Shi</name>
</author>
<author>
<name sortKey="Li, Yr" uniqKey="Li Y">YR Li</name>
</author>
<author>
<name sortKey="Li, St" uniqKey="Li S">ST Li</name>
</author>
<author>
<name sortKey="Shan, G" uniqKey="Shan G">G Shan</name>
</author>
<author>
<name sortKey="Kristiansen, K" uniqKey="Kristiansen K">K Kristiansen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tang, Hx" uniqKey="Tang H">HX Tang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sundquist, A" uniqKey="Sundquist A">A Sundquist</name>
</author>
<author>
<name sortKey="Ronaghi, M" uniqKey="Ronaghi M">M Ronaghi</name>
</author>
<author>
<name sortKey="Tang, Hx" uniqKey="Tang H">HX Tang</name>
</author>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P Pevzner</name>
</author>
<author>
<name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warren, Rl" uniqKey="Warren R">RL Warren</name>
</author>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
<author>
<name sortKey="Jones, Sjm" uniqKey="Jones S">SJM Jones</name>
</author>
<author>
<name sortKey="Holt, Ra" uniqKey="Holt R">RA Holt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Conway, Tc" uniqKey="Conway T">TC Conway</name>
</author>
<author>
<name sortKey="Bromage, Aj" uniqKey="Bromage A">AJ Bromage</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Melsted, P" uniqKey="Melsted P">P Melsted</name>
</author>
<author>
<name sortKey="Pritchard, Jk" uniqKey="Pritchard J">JK Pritchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Hayes, W" uniqKey="Hayes W">W Hayes</name>
</author>
<author>
<name sortKey="Hunt, Br" uniqKey="Hunt B">BR Hunt</name>
</author>
<author>
<name sortKey="Mount, Sm" uniqKey="Mount S">SM Mount</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deng, Hw" uniqKey="Deng H">HW Deng</name>
</author>
<author>
<name sortKey="Lin, Y" uniqKey="Lin Y">Y Lin</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="Shen, H" uniqKey="Shen H">H Shen</name>
</author>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L Zhang</name>
</author>
<author>
<name sortKey="Papasian, Cj" uniqKey="Papasian C">CJ Papasian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author>
<name sortKey="Zimin, Av" uniqKey="Zimin A">AV Zimin</name>
</author>
<author>
<name sortKey="Puiu, D" uniqKey="Puiu D">D Puiu</name>
</author>
<author>
<name sortKey="Magoc, T" uniqKey="Magoc T">T Magoc</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Treangen, T" uniqKey="Treangen T">T Treangen</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Wy" uniqKey="Zhang W">WY Zhang</name>
</author>
<author>
<name sortKey="Chen, Jj" uniqKey="Chen J">JJ Chen</name>
</author>
<author>
<name sortKey="Yang, Y" uniqKey="Yang Y">Y Yang</name>
</author>
<author>
<name sortKey="Tang, Yf" uniqKey="Tang Y">YF Tang</name>
</author>
<author>
<name sortKey="Shang, J" uniqKey="Shang J">J Shang</name>
</author>
<author>
<name sortKey="Shen, Br" uniqKey="Shen B">BR Shen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mago, T" uniqKey="Mago T">T Magoč</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Phillippy, A" uniqKey="Phillippy A">A Phillippy</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Smoot, M" uniqKey="Smoot M">M Smoot</name>
</author>
<author>
<name sortKey="Shumway, M" uniqKey="Shumway M">M Shumway</name>
</author>
<author>
<name sortKey="Antonescu, C" uniqKey="Antonescu C">C Antonescu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22537038</article-id>
<article-id pub-id-type="pmc">3369186</article-id>
<article-id pub-id-type="publisher-id">1471-2105-13-S6-S1</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-13-S6-S1</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Proceedings</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Exploiting sparseness in
<italic>de novo </italic>
genome assembly</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Ye</surname>
<given-names>Chengxi</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>cxy@umd.edu</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Ma</surname>
<given-names>Zhanshan Sam</given-names>
</name>
<xref ref-type="aff" rid="I4">4</xref>
<email>ma@vandals.uidaho.edu</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Cannon</surname>
<given-names>Charles H</given-names>
</name>
<xref ref-type="aff" rid="I5">5</xref>
<xref ref-type="aff" rid="I6">6</xref>
<email>chuck@xtbg.ac.cn</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A4">
<name>
<surname>Pop</surname>
<given-names>Mihai</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>mpop@umiacs.umd.edu</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A5">
<name>
<surname>Yu</surname>
<given-names>Douglas W</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I7">7</xref>
<email>douglas.yu@uea.ac.uk</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China</aff>
<aff id="I2">
<label>2</label>
Ecology, Conservation, and Environment Center; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</aff>
<aff id="I3">
<label>3</label>
Department of Computer Science and Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA</aff>
<aff id="I4">
<label>4</label>
Computational Biology and Medical Ecology Lab; State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China</aff>
<aff id="I5">
<label>5</label>
Ecological Evolution Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China</aff>
<aff id="I6">
<label>6</label>
Department of Biological Sciences, Texas Tech University, Lubbock, TX 79410 USA</aff>
<aff id="I7">
<label>7</label>
School of Biological Sciences, University of East Anglia, Norwich, Norfolk NR47TJ UK</aff>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>19</day>
<month>4</month>
<year>2012</year>
</pub-date>
<volume>13</volume>
<issue>Suppl 6</issue>
<supplement>
<named-content content-type="supplement-title">Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2012)</named-content>
<named-content content-type="supplement-editor">Paul Medvedev and Eleazar Eskin</named-content>
</supplement>
<fpage>S1</fpage>
<lpage>S1</lpage>
<permissions>
<copyright-statement>Copyright ©2012 Ye et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Ye et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/13/S6/S1"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>The very large memory requirements for the construction of assembly graphs for
<italic>de novo </italic>
genome assembly limit current algorithms to super-computing environments.</p>
</sec>
<sec>
<title>Methods</title>
<p>In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed
<italic>k-</italic>
mers as nodes and the links between these nodes allows the
<italic>de novo </italic>
assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.</p>
</sec>
<sec>
<title>Results</title>
<p>We implement this sparse graph concept in a proof-of-principle software package,
<italic>SparseAssembler</italic>
, utilizing a new sparse
<italic>k-</italic>
mer graph structure evolved from the
<italic>de Bruijn </italic>
graph. We test our
<italic>SparseAssembler </italic>
with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing
<italic>de novo </italic>
assemblers.</p>
</sec>
</abstract>
<conference>
<conf-date>19-20 April 2012</conf-date>
<conf-name>Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing</conf-name>
<conf-loc>Barcelona, Spain</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>In contrast with traditional Sanger methods, second-generation sequencing technologies, such as Roche/454 and Illumina/Solexa, produce millions of genome fragments as short DNA sequence reads ( < ~150 bp for Illumina, and < ~500 bp in length for 454, currently). Entire genomes are reconstructed from such fragmented data through a computational process called genome assembly [
<xref ref-type="bibr" rid="B1">1</xref>
]. The most common approaches for solving this problem (Overlap-Layout-Consensus, and the
<italic>de Bruijn </italic>
graph) first construct a graph encoding the relationships between the sequencing reads generated during the shotgun sequencing process. For the Overlap-Layout-Consensus [
<xref ref-type="bibr" rid="B2">2</xref>
-
<xref ref-type="bibr" rid="B5">5</xref>
], and the related string graph approach [
<xref ref-type="bibr" rid="B6">6</xref>
,
<xref ref-type="bibr" rid="B7">7</xref>
], each node of the graph represents a sequencing read in the input and an edge connects two nodes if the corresponding sequences 'overlap' (the prefix of one sequence matches the suffix of the other with sufficient similarity). In the
<italic>de Bruijn </italic>
graph approach [
<xref ref-type="bibr" rid="B8">8</xref>
-
<xref ref-type="bibr" rid="B16">16</xref>
], the nodes of the graph are sub-strings of length
<italic>k </italic>
(
<italic>k-</italic>
mers) and the edges link together
<italic>k-</italic>
mers that overlap by exactly
<italic>k </italic>
- 1 bp only if the
<italic>k </italic>
+ 1 bp sequence obtained by joining the adjacent nodes is present in at least one of the sequences in the input.</p>
<p>As we will describe in more detail below, irrespective of the approach, computational representations of the resulting graphs require large amounts of memory, thereby requiring substantial computational resources (both memory and run time) to assemble large genomes (such as human). Typical memory requirements for modern assemblers range in the hundreds of giga-bytes (GB) for human genome assembly. Recently, several methods were aimed at reducing the memory requirement of
<italic>de novo </italic>
genome assembly. In [
<xref ref-type="bibr" rid="B17">17</xref>
], the authors proposed a highly-compressed bitmap representation of a
<italic>de Bruijn </italic>
graph that can be queried for the existence of individual edges. With this succinct data structure, they were able to reduce the memory consumption by a factor of ~10 compared to common
<italic>de Bruijn </italic>
graph structure. In [
<xref ref-type="bibr" rid="B7">7</xref>
] the authors relied on read compressed text data structures (the FM index) to construct, on the fly, an assembly string graph. In this paper, we propose an alternative approach to reduce memory usage which exploits the idea of sparseness in genome assembly. Specifically, instead of storing every single
<italic>k-</italic>
mer (in a
<italic>de Bruijn </italic>
graph) or read (in an overlap graph) as nodes, we store a sparse subset of these nodes while still ensuring the assembly can be performed. Here, we demonstrate that this approach greatly reduces computational memory demands without sacrificing the accuracy of assembly.</p>
<sec>
<title>Memory usage of graph-based assembly paradigms</title>
<p>To introduce concepts central to the approach implemented in
<italic>SparseAssembler</italic>
, we will briefly discuss the main assembly paradigms and their corresponding memory usage.</p>
</sec>
<sec>
<title>Overlap-Layout-Consensus and string graphs</title>
<p>As briefly outlined above, in the OLC paradigm, the graph contains the reads as nodes, and the edges indicate that the reads overlap. For a given genome coverage
<italic>c </italic>
(the average number of reads covering a particular base in the genome) for every read, this approach, thus, requires storing approximately
<italic>c </italic>
overlaps, each of which requires storing a 4-8 byte pointer, as well as at least another 2 bytes of additional information about the overlap (coordinates within the reads, level of similarity, etc.) If we take into account that each read must also record its sequence and possible quality value, we also require an additional 2-8 bits of information per base-pair per read. Storing this graph for a typical human genome sequenced with reads of length 100 at a coverage of 50, requires between ~300-900 GB of memory. Note that in this analysis we omit repeats and errors, both of which further increase the memory requirement.</p>
<p>The string graph approach dramatically reduces the memory requirement by a factor roughly proportional to the depth of coverage. A string graph is an overlap graph where transitive edges have been removed, specifically if read
<italic>A </italic>
overlaps reads
<italic>B </italic>
and
<italic>C</italic>
, and
<italic>B </italic>
also overlaps
<italic>C</italic>
, (Figure
<xref ref-type="fig" rid="F1">1a</xref>
) the overlap (
<italic>A, C</italic>
) is removed from the graph as it can be inferred from the overlaps between (
<italic>A, B</italic>
), and (
<italic>B, C</italic>
) (Figure
<xref ref-type="fig" rid="F1">1b</xref>
). As a result, each read only needs to store roughly one overlap (multiple overlaps may need to be recorded due to sequencing errors and repeats), reducing the theoretical memory requirement to roughly 6-18 GB of memory. On real data, a recent assembler relying on the string graph approach was reported to use 54 GB memory for human genome assembly [
<xref ref-type="bibr" rid="B7">7</xref>
].</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>From overlap graph to a string graph</bold>
. (a) an overlap graph, in which all the overlaps are recorded. (b) the string graph, transitive overlap (
<italic>a, c</italic>
) is removed.</p>
</caption>
<graphic xlink:href="1471-2105-13-S6-S1-1"></graphic>
</fig>
</sec>
<sec>
<title>
<italic>de Bruijn </italic>
graph based assembly</title>
<p>In the
<italic>de Bruijn </italic>
graph, edges can be implicitly represented by saving only the presence of the neighbouring nucleotides (at most 4 for each
<italic>k-</italic>
mer). A common first stage of
<italic>de Bruijn </italic>
graph-based
<italic>de novo </italic>
assemblers is to build the graph by storing all the
<italic>k-</italic>
mers and their neighbouring nucleotide(s). A
<italic>k-</italic>
mer is considered being different only in orientation with its reverse complement, and only one of the two (chosen by lexical-order) is saved. Let all
<italic>k-</italic>
mers be encoded in bits: 00, 01, 10, 11, respectively, for A, C, G, T, and let 4 bits be used to indicate the presence/absence of the 4 possible edges/nucleotides on every side (Figure
<xref ref-type="fig" rid="F2">2a, b</xref>
). Thus, each
<italic>k-</italic>
mer uses 2 ×
<italic>k </italic>
+ 4 × 2 bits of memory, and the minimum space requirement S
<sub>1 </sub>
for a genome of size
<italic>g </italic>
is approximately S
<sub>1 </sub>
= G × (2 ×
<italic>k </italic>
+ 4 × 2), assuming no additional information needs to be saved. Note that this number does not, in theory, depend on depth of coverage (a
<italic>k-</italic>
mer is only stored once irrespective of how many reads contain it). However sequencing errors add a huge number of false
<italic>k-</italic>
mers, thus extra space has to be used to reach successful assembly.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>A node with branches in the
<italic>de Bruijn </italic>
graph and the sparse
<italic>k-</italic>
mer graph</bold>
. (a) A node with branches in a
<italic>de Bruijn </italic>
graph. (b) The binary implementation of (a). (c) A node with branches in a sparse
<italic>k-</italic>
mer graph. (d) The binary implementation of (c). The
<italic>k-</italic>
mers which are nodes in the graph are squared in the blocks. Neighbouring nucleotides indicating the edges of the graph are circled.</p>
</caption>
<graphic xlink:href="1471-2105-13-S6-S1-2"></graphic>
</fig>
<p>Typically,
<italic>k-</italic>
mer sizes of 21~51 bp are used because shorter
<italic>k-</italic>
mers result in branching, and therefore, in ambiguity in the assembly. As a consequence, the memory space required for saving all
<italic>k-</italic>
mers can be huge. Using traditionally techniques, over 300 GB memory can be used even with a small
<italic>k-</italic>
mer size of 20 [
<xref ref-type="bibr" rid="B9">9</xref>
], and it is common to use over 100 GB memory even with error-corrected reads with few false
<italic>k-</italic>
mers [
<xref ref-type="bibr" rid="B13">13</xref>
]. Recent advances in
<italic>k-</italic>
mer counting (e.g.,
<italic>Jellyfish </italic>
[
<xref ref-type="bibr" rid="B18">18</xref>
] and
<italic>BFcounter </italic>
[
<xref ref-type="bibr" rid="B19">19</xref>
]) can help improve the memory requirements of
<italic>de Bruijn </italic>
graph construction. The approach we describe below targets the actual information stored in the graph, thus allowing further memory reductions beyond those achieved by the aforementioned tools.</p>
</sec>
<sec>
<title>Sparse assembly graph</title>
<p>The approach we propose here involves skipping some fraction of the
<italic>k-</italic>
mers or reads, thus reducing the size of the overall assembly graph necessary to capture the information. Using the example above from the OLC graph, instead of storing overlaps
<italic>(A, B</italic>
) and (
<italic>B, C</italic>
), we could simply store overlap (
<italic>A, C</italic>
) and eliminate read
<italic>B </italic>
from the graph. In the
<italic>de Bruijn </italic>
graph, we simply store only one out of every
<italic>g </italic>
(
<italic>g </italic>
<
<italic>k</italic>
)
<italic>k-</italic>
mers, attempting to subsample as evenly across the original graph as possible. As a result, the size of the
<italic>de Bruijn </italic>
graph is reduced by a factor of approximately
<italic>g </italic>
(see full details in materials and methods).</p>
<p>We would like to note that our approach is similar in spirit to the
<italic>minimizer </italic>
idea introduced by Roberts
<italic>et al. </italic>
[
<xref ref-type="bibr" rid="B20">20</xref>
]. Their approach is targeted at the task of detecting
<italic>k-</italic>
mers shared between different reads. The 'traditional' indexing approach requires storing all
<italic>k-</italic>
mers within a read. Roberts
<italic>et al. </italic>
[
<xref ref-type="bibr" rid="B20">20</xref>
] propose only storing the lexicographically smallest
<italic>k-</italic>
mer (
<italic>minimizer</italic>
) in a size
<italic>w </italic>
detection window. They noted that the number of
<italic>minimizers </italic>
is generally smaller than the number of all
<italic>k-</italic>
mers (consecutive
<italic>k-</italic>
mers often share a same
<italic>minimizer</italic>
), and the memory requirement is further reduced by storing the smaller-sized
<italic>minimizers </italic>
and by using a large detection window. Our approach for simplifying the
<italic>de Bruijn </italic>
graph is similar in spirit with the
<italic>minimizer </italic>
approach, as we only store a sparse sub-sample of the
<italic>k-</italic>
mers found in the reads. Our choice of the
<italic>k-</italic>
mers stored in the graph attempts to approximate a uniform
<italic>k-</italic>
mer sampling of the genome, rather than based on lexicographic ordering. In future research we plan to explore the relative benefits of the two approaches, by including lexicographic information within the sparse
<italic>k-</italic>
mer selection process.</p>
</sec>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Moving to the sparse
<italic>k-</italic>
mer graph</title>
<p>In the sparse
<italic>k-</italic>
mer graph structure, the nodes in the graph represent a 1/
<italic>g </italic>
subsample of the
<italic>k-</italic>
mer diversity in the entire genome. The resulting graph differs from the typical
<italic>de Bruijn </italic>
graph by having longer links, i.e. more nucleotides per branch (see Figure
<xref ref-type="fig" rid="F2">2c</xref>
). With the sparsely spaced nodes, the memory requirement for constructing the sparse
<italic>k-</italic>
mer graph can be considerably less than that for building
<italic>de Bruijn </italic>
graphs.</p>
<p>Our subsampling procedures proceeds as follows: let
<italic>g </italic>
equal the number of base pairs skipped between
<italic>k-</italic>
mers that are stored from each sequencing read. In the ideal case, with no branches and assuming that the
<italic>k-</italic>
mers are staggered by
<italic>g </italic>
= 5 bases, we can store ≤ 5 neighbouring bases on each side of the
<italic>k-</italic>
mer. Although we store more information for each
<italic>k-</italic>
mer by also extending its links, we store many fewer
<italic>k-</italic>
mers than currently implemented approaches. More precisely, the total memory space requirement can be calculated as,
<inline-formula>
<mml:math id="M1" name="1471-2105-13-S6-S1-i1" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel"></mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">ptr_sz</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">ptr_sz</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula>
where
<italic>ptr_sz </italic>
is the extra space required by the pointer structures for the edge links. Compared with the
<italic>de Bruijn </italic>
graph approach, we reduce
<italic>k-</italic>
mer storage to 1/
<italic>g</italic>
, and the portion for storing edges to a half, but add a new space requirement for storing edge links, which requires 2 × g bits for each side of the
<italic>k-</italic>
mer. In our experiments, we found that using
<italic>g </italic>
= 10 - 25 was effective.</p>
<p>Interestingly, the lower bound of memory space usage for the sparse
<italic>k-</italic>
mer graph will decrease as sequencing read length increases and more information is stored in the links. Let reads be of length
<italic>r</italic>
; the sparse
<italic>k-</italic>
mers scheme becomes more efficient when
<italic>r-k </italic>
is large. In this case we can use large
<italic>g</italic>
's and still retain sufficient information, which will become more common with the future improvements in sequencing technology. Following this trend of technology advances, we can also increase
<italic>k </italic>
to larger values than those used currently while still keeping memory usage low.</p>
<p>As a detailed example, we explain how a sparse
<italic>k-</italic>
mer graph can be constructed. We divide the process into two rounds. In the first round, we select the
<italic>k-</italic>
mers that will be used as nodes. For each sequencing read, we first query whether any of the subsequent
<italic>g k-</italic>
mers has already been used as a node. If so, we begin our sub-sample of the read from that node. Otherwise, we select the first
<italic>k-</italic>
mer as a new node. After the first scan, the nodes are selected, and they are expected to be nearly
<italic>g-</italic>
gapped if there are no sequencing errors. In real data, we filter out the low-coverage nodes before we move to the second round. Low-coverage nodes are regarded as nodes in spurious branches such as tips or bubbles or real
<italic>k-</italic>
mer nodes connected to a spurious branch. In the second round, the links between the remaining nodes are built. The accurate coverage of the
<italic>k-</italic>
mer nodes is recalculated in this round. After two rounds of processing, the
<italic>k-</italic>
mers picked as nodes are approximately, though not strictly
<italic>g-</italic>
gapped, which results in some redundancy in space.</p>
<p>To show the effect of
<italic>g </italic>
on memory storage during assembly, we conducted two simple comparisons using simulated 30× error free 100 bp reads from the fifth chromosome of the
<italic>Saccharomyces cervisiae </italic>
genome (NC_001137.2), which is around 600 kbp long. First, without any sequencing error, the number of selected nodes, with
<italic>k </italic>
= 31,
<italic>g </italic>
= 16, is 36,932. If we set
<italic>g </italic>
= 1, meaning no skipping, we observe 566,045 nodes in the graph, indicating a 15.3 fold reduction in the size of the graph, which does approximate 1/
<italic>g</italic>
. When we introduce a uniform 1% error rate into our sequence data, we now obtain 830,309 nodes, with
<italic>k </italic>
= 31 and
<italic>g </italic>
= 16, but the full graph also gets larger with the error rate, now containing 12,494,172 nodes, with an effective reduction by 15.0 for
<italic>g = 16</italic>
. Assembly time and results are also comparable to existing
<italic>de novo </italic>
assemblers, but a sparse approach greatly reduces the memory requirements and
<italic>g </italic>
can be adjusted according to the local computing environment. Like in the
<italic>de Bruijn </italic>
graph the complexity of the graph depends on the value chosen for
<italic>k</italic>
. The larger
<italic>k </italic>
is, the less complex the graph. Since the sparse graph encodes the same branches as the original
<italic>de Bruijn </italic>
graph, the conversion to a sparse graph reduces the memory requirement but does not increase the overall complexity of the resulting structure.</p>
<p>In our experiments, we found that different values for
<italic>g </italic>
lead to only a slight difference in assembly results, thus, setting
<italic>g </italic>
to 10~15 provides substantial memory saving without sacrifice in quality. Although increasing
<italic>g </italic>
should cause us to miss some of the true links due to sequencing error, bias, and low-coverage, we found the assembly quality, measured by corrected NG50 and contig mean size, is usually slightly improved with increasing
<italic>g</italic>
. Also usually the assembly is better with a larger
<italic>g </italic>
because many of the short repeating
<italic>k-</italic>
mers are not saved as nodes, and are only implied by the edge links. The larger the
<italic>g</italic>
, the fewer repeats are saved.</p>
<p>Last, if two reads overlap by
<italic>k + g </italic>
bases, an overlap between these 2 reads is also found with the sparse
<italic>k-</italic>
mer graph and there will be at least one
<italic>k-</italic>
mer selected in each read. And if two reads share one specific
<italic>k-</italic>
mer that has been saved in round 1, then the overlap can also be found. Thus, the ability of encoding overlaps between reads within the sparse
<italic>k-</italic>
mer graph is between using that of
<italic>de Bruijn </italic>
graphs constructed with
<italic>k-</italic>
mer sizes between
<italic>k </italic>
and (
<italic>k+g)</italic>
. Practically we found using
<italic>g = </italic>
10~15 provides a good balance point on real datasets.</p>
<p>Note that the graph construction procedure outlined above is order-dependent, i.e., different read orderings will result in a (slightly) different choice of the sparse
<italic>k-</italic>
mers selected within the graph. In our tests, this non-deterministic behaviour does not seem to affect the performance of the algorithm (either in terms of memory requirement or accuracy), however we plan to explore in the future the use of the
<italic>minimizer </italic>
concept (described in the Background) to ensure determinism in graph construction.</p>
</sec>
<sec>
<title>Circumventing sequencing errors and graph simplification</title>
<p>Sequencing errors and polymorphisms can result in tips or bubbles [
<xref ref-type="bibr" rid="B8">8</xref>
] in the assembly graphs, irrespective of the underlying paradigm (
<italic>de Bruijn</italic>
, sparse
<italic>k-</italic>
mer, or overlap). To remove these unwanted structures, we first remove the low-coverage nodes and edges. After that, like in
<italic>Velvet </italic>
[
<xref ref-type="bibr" rid="B8">8</xref>
], we developed a Dijkstra-like breadth-first search algorithm to detect bubbles and tips, using the distance in bp to traverse the branches from near to far. The search backtracks to the last branching node upon reaching a visited node or a tip end. Upon a bubble we choose the higher-coverage branch and remove the weaker branch. Tips are directly removed. After this step, spurious paths and redundant structures like tiny loops and bubbles are removed (Figure
<xref ref-type="fig" rid="F3">3</xref>
).</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Breadth-first search bubble removal in the sparse
<italic>k-</italic>
mer graph</bold>
. Removing unwanted structures in the sparse
<italic>de Bruijn </italic>
graph. (a) Before removal. (b) After removal.</p>
</caption>
<graphic xlink:href="1471-2105-13-S6-S1-3"></graphic>
</fig>
</sec>
<sec>
<title>Genome assembly</title>
<p>The full assembly process consists of (
<italic>i</italic>
) building the sparse assembly graph as described above and (
<italic>ii</italic>
) graph simplification and traversal. The procedure for reconstructing the genome is similar to that used in the
<italic>de Bruijn </italic>
and string graph algorithms. A new traversal begins at a node not visited in previous traversals, and breaks when branches are detected; the separate traversals form the individual contigs reported by the assembler.</p>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<p>Recent comparisons of assembly software [
<xref ref-type="bibr" rid="B21">21</xref>
-
<xref ref-type="bibr" rid="B23">23</xref>
], including
<italic>SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS, SOAPdenovo</italic>
, and
<italic>ALLPATHS </italic>
failed to discover significant differences in the magnitude of memory usage, which were all large: they all require > 100 GB memory on human genome assembly even with error free data. We therefore compare our results with only three major state-of-the-art and purely
<italic>de Bruijn </italic>
graph based assemblers:
<italic>ABySS, Velvet</italic>
, and
<italic>SOAPdenovo</italic>
.</p>
<p>To test the sparse assembly idea we implemented a single threaded program
<italic>SparseAssembler</italic>
, based on the sparse
<italic>k-</italic>
mer graph and assembly process described above. In all tests, we set assemblers to single end single threaded mode.</p>
<p>In simulated comparisons (Tables
<xref ref-type="table" rid="T1">1</xref>
,
<xref ref-type="table" rid="T2">2</xref>
), we uniformly sampled 30× 100 bp reads from the fruit fly (X, NC_004354.3; IIL, NT_033779.4; IIR, NT_033778.3; IIIL, NT_037436.3; IIIR, NT_033777.2; IV, NC_004353.3) and rice genomes
<ext-link ext-link-type="uri" xlink:href="http://rgp.dna.affrc.go.jp/J/IRGSP/Build3/build3.html">http://rgp.dna.affrc.go.jp/J/IRGSP/Build3/build3.html</ext-link>
and introduced uniformly distributed errors at 0.5% error rate and assembled using
<italic>k </italic>
= 31 for all assemblers and used
<italic>g </italic>
= 15 for
<italic>SparseAssembler</italic>
. We chose a somewhat lower error rate than commonly encountered in practice (although after quality trimming and error correction real datasets can achieve such low error rates) in order to allow us to execute all the assemblers being compared. High levels of error lead to increased memory requirements for the majority of existing genome assemblers. We also simulated 50× 200 bp reads (error free as well as 1% error rate) to test the performance using various
<italic>k-</italic>
mer sizes on a human genome (NCBI build 39, Table
<xref ref-type="table" rid="T3">3</xref>
). We used
<italic>k </italic>
= 31, 63, 127, and fixed
<italic>g </italic>
= 25, corresponding to the skipped intermediate
<italic>k-</italic>
mers. Our simulations on the human genome with varying
<italic>k </italic>
highlights several interesting phenomena. Limited by memory and read length, current assemblers usually use small
<italic>k-</italic>
mer sizes (21~64) to assemble human genomes, our simulations suggest longer read lengths could lead to drastic improvements in human genome assembly (Table
<xref ref-type="table" rid="T3">3</xref>
). Longer reads can be obtained with current technology, e.g., through the use of overlapping paired-end reads, currently available [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B24">24</xref>
].</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Assembly performance comparison on the fruit fly genome</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">(
<italic>k </italic>
= 31 )</th>
<th align="left">
<italic>ABySS</italic>
</th>
<th align="left">
<italic>Velvet</italic>
</th>
<th align="left">
<italic>SOAPdenovo</italic>
</th>
<th align="left">
<italic>SparseAssembler</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Time (hr)</td>
<td align="left">5.5</td>
<td align="left">3</td>
<td align="left">3</td>
<td align="left">1</td>
</tr>
<tr>
<td align="left">Memory peak (GB)</td>
<td align="left">46</td>
<td align="left">31</td>
<td align="left">14</td>
<td align="left">2</td>
</tr>
<tr>
<td align="left">> 100 bp (# contigs)</td>
<td align="left">23,992</td>
<td align="left">23,104</td>
<td align="left">20,580</td>
<td align="left">20,429</td>
</tr>
<tr>
<td align="left"> Sum (kbp)</td>
<td align="left">113,580</td>
<td align="left">113,574</td>
<td align="left">112,395</td>
<td align="left">113,650</td>
</tr>
<tr>
<td align="left">Mean size (bp)</td>
<td align="left">4,734</td>
<td align="left">4,916</td>
<td align="left">5,461</td>
<td align="left">5,563</td>
</tr>
<tr>
<td align="left">N50 (bp)</td>
<td align="left">18,317</td>
<td align="left">19,576</td>
<td align="left">25,461</td>
<td align="left">28,355</td>
</tr>
<tr>
<td align="left">N95 (bp)</td>
<td align="left">66</td>
<td align="left">61</td>
<td align="left">67</td>
<td align="left">74</td>
</tr>
<tr>
<td align="left">Corr NG50 (bp)</td>
<td align="left">18,317</td>
<td align="left">19,576</td>
<td align="left">25,461</td>
<td align="left">28,355</td>
</tr>
<tr>
<td align="left">Corr NG95 (bp)</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
</tr>
<tr>
<td align="left">Longest contig (bp)</td>
<td align="left">162,263</td>
<td align="left">190,104</td>
<td align="left">195,709</td>
<td align="left">273,977</td>
</tr>
<tr>
<td align="left">Coverage (%)</td>
<td align="left">96.24</td>
<td align="left">96.82</td>
<td align="left">95.53</td>
<td align="left">97.83</td>
</tr>
<tr>
<td align="left">Misjoins</td>
<td align="left">0</td>
<td align="left">6</td>
<td align="left">0</td>
<td align="left">0</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The performance on the fruit fly genome dataset, genome size: 120,291 kbp. Programs are run using default settings.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Assembly performance comparison on the rice genome</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">(
<italic>k </italic>
= 31 )</th>
<th align="left">
<italic>ABySS</italic>
</th>
<th align="left">
<italic>Velvet</italic>
</th>
<th align="left">
<italic>SOAPdenovo</italic>
</th>
<th align="left">
<italic>SparseAssembler</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Time (hr)</td>
<td align="left">13</td>
<td align="left">7</td>
<td align="left">16</td>
<td align="left">5</td>
</tr>
<tr>
<td align="left">Memory peak (GB)</td>
<td align="left">69</td>
<td align="left">51</td>
<td align="left">29</td>
<td align="left">4</td>
</tr>
<tr>
<td align="left">> 100 bp (# contigs)</td>
<td align="left">458,456</td>
<td align="left">397,252</td>
<td align="left">444,545</td>
<td align="left">386,604</td>
</tr>
<tr>
<td align="left"> Sum (kbp)</td>
<td align="left">253,708</td>
<td align="left">225,618</td>
<td align="left">258,106</td>
<td align="left">262,988</td>
</tr>
<tr>
<td align="left">Mean size (bp)</td>
<td align="left">553</td>
<td align="left">568</td>
<td align="left">581</td>
<td align="left">680</td>
</tr>
<tr>
<td align="left">N50 (bp)</td>
<td align="left">538</td>
<td align="left">310</td>
<td align="left">655</td>
<td align="left">734</td>
</tr>
<tr>
<td align="left">N95 (bp)</td>
<td align="left">38</td>
<td align="left">0</td>
<td align="left">40</td>
<td align="left">31</td>
</tr>
<tr>
<td align="left">Corr NG50 (bp)</td>
<td align="left">538</td>
<td align="left">310</td>
<td align="left">655</td>
<td align="left">733</td>
</tr>
<tr>
<td align="left">Corr NG95 (bp)</td>
<td align="left">38</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
</tr>
<tr>
<td align="left">Longest contig (bp)</td>
<td align="left">23,220</td>
<td align="left">23,939</td>
<td align="left">26,869</td>
<td align="left">26,890</td>
</tr>
<tr>
<td align="left">Coverage (%)</td>
<td align="left">69.2</td>
<td align="left">62.3</td>
<td align="left">71.3</td>
<td align="left">71.5</td>
</tr>
<tr>
<td align="left">Misjoins</td>
<td align="left">1</td>
<td align="left">34</td>
<td align="left">10</td>
<td align="left">9</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The performance on the rice genome dataset, genome size: 370,733 kbp. Programs are run using default settings.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Assembly performance on the human genome</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="right">
<italic>k </italic>
= 31</th>
<th align="right">
<italic>k </italic>
= 63</th>
<th align="right">
<italic>k </italic>
= 127</th>
<th align="right">
<italic>k </italic>
= 31</th>
<th align="right">
<italic>k </italic>
= 63</th>
<th align="right">
<italic>k </italic>
= 127</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td align="center" colspan="3">
<bold>Error free data</bold>
</td>
<td align="center" colspan="3">
<bold>1% error rate</bold>
</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Memory peak (GB)</td>
<td align="center">14</td>
<td align="center">16</td>
<td align="center">19</td>
<td align="center">30</td>
<td align="center">49</td>
<td align="center">51</td>
</tr>
<tr>
<td align="left">> 100 bp (#
<italic>k </italic>
contigs )</td>
<td align="center">3,195</td>
<td align="center">1,984</td>
<td align="center">714</td>
<td align="center">2,727</td>
<td align="center">1,554</td>
<td align="center">1,359</td>
</tr>
<tr>
<td align="left"> Sum (G bp)</td>
<td align="center">2.37</td>
<td align="center">2.79</td>
<td align="center">2.83</td>
<td align="center">2.29</td>
<td align="center">2.72</td>
<td align="center">2.88</td>
</tr>
<tr>
<td align="left">Mean size (bp)</td>
<td align="center">743</td>
<td align="center">1,406</td>
<td align="center">3,961</td>
<td align="center">839</td>
<td align="center">1,751</td>
<td align="center">2,121</td>
</tr>
<tr>
<td align="left">N50 (bp)</td>
<td align="center">2,130</td>
<td align="center">6,479</td>
<td align="center">79,906</td>
<td align="center">2,121</td>
<td align="center">6,319</td>
<td align="center">49,572</td>
</tr>
<tr>
<td align="left">N90 (bp)</td>
<td align="center">244</td>
<td align="center">631</td>
<td align="center">10,441</td>
<td align="center">304</td>
<td align="center">872</td>
<td align="center">1,021</td>
</tr>
<tr>
<td align="left">Longest contig (bp)</td>
<td align="center">50,800</td>
<td align="center">124,293</td>
<td align="center">801,692</td>
<td align="center">47164</td>
<td align="center">124,292</td>
<td align="center">537,017</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To test performance on real data, we compared our approach on 100-bp-read whole-genome shotgun sequence data generated on the Illumina platform for
<italic>Escherichia coli K-</italic>
12 MG1655 (NCBI SRA accession ERR022075), human chromosome 14, and a whole human genome (NA12878). The performance on some other real datasets (including single cell reads and
<italic>Ion Torrent PGM </italic>
reads) can be found on our website. For the
<italic>E. coli </italic>
dataset,
<italic>k </italic>
= 51 was used for all assemblers (
<italic>SOAPdenovo </italic>
did not output reasonable results on this dataset, so we did not include the result, with N50 ~ 200), and we set
<italic>g </italic>
= 15 for
<italic>SparseAssembler </italic>
(Table
<xref ref-type="table" rid="T4">4</xref>
), for the human chromosome 14,
<italic>k </italic>
= 53 was used for all assemblers, and we set
<italic>g </italic>
= 15 for
<italic>SparseAssembler </italic>
(Table
<xref ref-type="table" rid="T5">5</xref>
), and
<italic>k </italic>
= 31-51,
<italic>g </italic>
= 25 for the whole human genome (Table
<xref ref-type="table" rid="T6">6</xref>
). On these real datasets from the Illumina platform, our approach used around 1/10 memory compared with other assemblers and produced comparable results (Tables
<xref ref-type="table" rid="T4">4</xref>
,
<xref ref-type="table" rid="T5">5</xref>
). Because the bubble merging strategy in
<italic>SparseAssembler </italic>
is simple, the results on real data can include more misjoins than other assemblers but these misjoins appear to occur within the shorter contigs, thus achieving a corrected NG50 not much different from the original NG50 size (Tables
<xref ref-type="table" rid="T4">4</xref>
,
<xref ref-type="table" rid="T5">5</xref>
). The corrected assembly statistics are obtained by fragmenting the assembly wherever errors are encountered in the data. The runtime of
<italic>SparseAssember </italic>
was smaller than that for other assemblers. Raw Illumina reads (80X, length 100) for a member of CEU HapMap population (identifier NA12878) sequenced by the Broad Institute were downloaded from
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20101201_cg_NA12878/NA12878.hiseq.wgs.bwa.raw.bam">ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20101201_cg_NA12878/NA12878.hiseq.wgs.bwa.raw.bam</ext-link>
in the last test. This dataset was also used in [
<xref ref-type="bibr" rid="B7">7</xref>
] and 54 GB memory was consumed to first clean the reads before assembly, using just half of the reads in the dataset. For testing purpose, we also used 40× reads (the first end of the paired library). The most expensive assembly with uncleaned reads took 29 GB memory and roughly 1 day resulting in an N50 size of 2,915 and assembled length of 2.70 G. All runs were single-threaded. Though our quality is lower than some assemblers, using corrected reads and mate-pair information in the future is expected to further improve the assembly result. Most other assemblers take hundreds of GBs of memory which is beyond our computer's reach, but the detailed consumptions on similar datasets can be found in related references.</p>
<table-wrap id="T4" position="float">
<label>Table 4</label>
<caption>
<p>Assembly performance on the
<italic>E.coli </italic>
genome (ERR022075)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">(
<italic>k </italic>
= 51)</th>
<th align="right">
<italic>ABySS</italic>
</th>
<th align="right">
<italic>Velvet</italic>
</th>
<th align="left">
<italic>SparseAssembler</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Time (hr)</td>
<td align="right">2</td>
<td align="right">1</td>
<td align="left">0.7</td>
</tr>
<tr>
<td align="left">Memory peak (GB)</td>
<td align="right">3.5</td>
<td align="right">9.1</td>
<td align="left">0.7</td>
</tr>
<tr>
<td align="left">> 100 bp (# contigs)</td>
<td align="right">430</td>
<td align="right">632</td>
<td align="left">485</td>
</tr>
<tr>
<td align="left">Sum (bp)</td>
<td align="right">4,556,772</td>
<td align="right">4,413,080</td>
<td align="left">4,577,604</td>
</tr>
<tr>
<td align="left">Mean size (bp)</td>
<td align="right">10,597</td>
<td align="right">6,983</td>
<td align="left">9,438</td>
</tr>
<tr>
<td align="left">N50 (bp)</td>
<td align="right">57,655</td>
<td align="right">19,067</td>
<td align="left">57,830</td>
</tr>
<tr>
<td align="left">N95 (bp)</td>
<td align="right">5,629</td>
<td align="right">128</td>
<td align="left">5,906</td>
</tr>
<tr>
<td align="left">Corr NG50 (bp)</td>
<td align="right">57,655</td>
<td align="right">19,067</td>
<td align="left">57,828</td>
</tr>
<tr>
<td align="left">Corr NG95 (bp)</td>
<td align="right">5,629</td>
<td align="right">125</td>
<td align="left">5,676</td>
</tr>
<tr>
<td align="left">Longest contig (bp)</td>
<td align="right">166,107</td>
<td align="right">120,922</td>
<td align="left">173,976</td>
</tr>
<tr>
<td align="left">Coverage (%)</td>
<td align="right">99.90</td>
<td align="right">96.53</td>
<td align="left">99.94</td>
</tr>
<tr>
<td align="left">Misjoins</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="left">2</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T5" position="float">
<label>Table 5</label>
<caption>
<p>Assembly performance on the human chromosome 14</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">(
<italic>k </italic>
= 53)</th>
<th align="left">
<italic>ABySS</italic>
</th>
<th align="right">
<italic>Velvet</italic>
</th>
<th align="left">
<italic>SOAPdenovo</italic>
</th>
<th align="left">
<italic>SparseAssembler</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Time (hr)</td>
<td align="left">6</td>
<td align="right">2.5</td>
<td align="left">6.</td>
<td align="left">1.9</td>
</tr>
<tr>
<td align="left">Memory peak (GB)</td>
<td align="left">49</td>
<td align="right">37</td>
<td align="left">30</td>
<td align="left">3</td>
</tr>
<tr>
<td align="left">> 100 bp (# contigs)</td>
<td align="left">85,181</td>
<td align="right">129,046</td>
<td align="left">84,719</td>
<td align="left">55,024</td>
</tr>
<tr>
<td align="left"> Sum (kbp)</td>
<td align="left">88,663</td>
<td align="right">89,854</td>
<td align="left">87,908</td>
<td align="left">86,296</td>
</tr>
<tr>
<td align="left">Mean size (bp)</td>
<td align="left">1,041</td>
<td align="right">696</td>
<td align="left">1,038</td>
<td align="left">1,568</td>
</tr>
<tr>
<td align="left">N50 (bp)</td>
<td align="left">3,568</td>
<td align="right">1,499</td>
<td align="left">3,117</td>
<td align="left">3,890</td>
</tr>
<tr>
<td align="left">N95 (bp)</td>
<td align="left">179</td>
<td align="right">184</td>
<td align="left">197</td>
<td align="left">202</td>
</tr>
<tr>
<td align="left">Corr NG50 (bp)</td>
<td align="left">3,475</td>
<td align="right">1,487</td>
<td align="left">3,065</td>
<td align="left">3,760</td>
</tr>
<tr>
<td align="left">Corr NG95 (bp)</td>
<td align="left">175</td>
<td align="right">178</td>
<td align="left">192</td>
<td align="left">198</td>
</tr>
<tr>
<td align="left">Coverage (%)</td>
<td align="left">98.54</td>
<td align="right">98.86</td>
<td align="left">98.42</td>
<td align="left">97.56</td>
</tr>
<tr>
<td align="left">Longest contig (bp)</td>
<td align="left">61,018</td>
<td align="right">16,043</td>
<td align="left">49,584</td>
<td align="left">60,797</td>
</tr>
<tr>
<td align="left">Misjoins</td>
<td align="left">24</td>
<td align="right">62</td>
<td align="left">47</td>
<td align="left">61</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>This dataset was downloaded from
<ext-link ext-link-type="uri" xlink:href="http://gage.cbcb.umd.edu/data/index.html">http://gage.cbcb.umd.edu/data/index.html</ext-link>
, genome size: 88,289,540.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T6" position="float">
<label>Table 6</label>
<caption>
<p>Assembly performance on the NA12878 human genome</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="right">
<italic>k </italic>
= 31</th>
<th align="right">
<italic>k </italic>
= 41</th>
<th align="right">
<italic>k </italic>
= 51</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Memory peak (GB)</td>
<td align="right">26</td>
<td align="right">29</td>
<td align="right">29</td>
</tr>
<tr>
<td align="left">> 100 bp (#
<italic>k </italic>
contigs )</td>
<td align="right">2,740</td>
<td align="right">2,800</td>
<td align="right">2,744</td>
</tr>
<tr>
<td align="left"> Sum (G bp)</td>
<td align="right">2.33</td>
<td align="right">2.57</td>
<td align="right">2.70</td>
</tr>
<tr>
<td align="left">Mean size (bp)</td>
<td align="right">743</td>
<td align="right">919</td>
<td align="right">3,961</td>
</tr>
<tr>
<td align="left">N50 (bp)</td>
<td align="right">2,054</td>
<td align="right">2,647</td>
<td align="right">2,915</td>
</tr>
<tr>
<td align="left">N95 (bp)</td>
<td align="right">318</td>
<td align="right">335</td>
<td align="right">380</td>
</tr>
<tr>
<td align="left">Longest contig (bp)</td>
<td align="right">36,460</td>
<td align="right">38,864</td>
<td align="right">50,441</td>
</tr>
<tr>
<td align="left">Corr NG50 (bp)*</td>
<td align="right">1,502</td>
<td align="right">2,213</td>
<td align="right">2,610</td>
</tr>
<tr>
<td align="left">Corr NG95 (bp)*</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">114</td>
</tr>
<tr>
<td align="left">Misjoins*</td>
<td align="right">21</td>
<td align="right">19</td>
<td align="right">17</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>* The corrected statistics are calculated by mapping back to human chromosome 14.</p>
</table-wrap-foot>
</table-wrap>
<p>In all comparisons, our sparse
<italic>k-</italic>
mers based
<italic>SparseAssembler </italic>
uses substantially less computational memory and completed the assembly in a comparable period of time and with comparable quality with several state-of-the art assemblers. In tests with known reference genomes (Tables
<xref ref-type="table" rid="T1">1</xref>
,
<xref ref-type="table" rid="T2">2</xref>
,
<xref ref-type="table" rid="T4">4</xref>
,
<xref ref-type="table" rid="T5">5</xref>
), the assembled results were mapped back to the known reference genome using
<italic>MUMmer3 </italic>
[
<xref ref-type="bibr" rid="B22">22</xref>
,
<xref ref-type="bibr" rid="B25">25</xref>
,
<xref ref-type="bibr" rid="B26">26</xref>
] to count the number of misjoins. Contigs that contain false joins were broken into smaller but accurate contigs. The corrected NG50s were calculated based on the size of these smaller contigs. This approach is similar to that used in the GAGE assembly evaluation [
<xref ref-type="bibr" rid="B22">22</xref>
]. We did not map back the assembled whole human genomes because of hardware limitations, instead in Table
<xref ref-type="table" rid="T6">6</xref>
, we map the contigs back to a smaller region, the chromosome 14.</p>
</sec>
<sec>
<title>Discussion and Conclusions</title>
<p>This new sparse graph approach to
<italic>de novo </italic>
genome assembly, as implemented here in
<italic>SparseAssembler</italic>
, consistently produces comparable results to the current state-of-the-art
<italic>de Bruijn </italic>
graph-based assemblers, demands considerably smaller amounts of computer memory, using both simulated and real data. This approach can be extended for a sparse string graph as well, by selecting a sparse subset of the reads when constructing the overlap graph. Future improvements, such as incorporating more efficient data structures, promise to reduce memory demand further. Also the assembly approach used in our paper is a simple implementation resembling the
<italic>de Bruijn </italic>
graph approach, meant to illustrate the power if this approach, and we expect much better assembly results can be obtained by incorporating our ideas within existing genome assemblers.</p>
<p>In the
<italic>k-</italic>
mer graph framework, the memory savings achieved by
<italic>SparseAssembler </italic>
is similar to that achieved with Conway & Bromage's succinct data structure but is simpler in idea and implementation. Moreover, the savings of our assemblers are scalable with the length of
<italic>g</italic>
. Thus, as read lengths improve, the links between
<italic>k-</italic>
mers can be extended and the graph can become even sparser, reducing memory demands as sequencing technology develops. Finally, the sparse
<italic>k-</italic>
mer graph shares all advantages of the
<italic>de Bruijn </italic>
graph model.</p>
<p>Therefore, the results reported here strongly support our idea that a sparse assembly graph retains sufficient information for accurate and fast
<italic>de novo </italic>
genome assembly of moderate-size genomes in a cheap, desktop PC computing environment, which is usually only equipped with several gigabytes memory. Future improvements to
<italic>SparseAssembler </italic>
will focus on extending this approach to a sparse string graph and the exploitation of paired-end reads.</p>
</sec>
<sec>
<title>Availability</title>
<p>Related programs and code are available at:</p>
<p>
<ext-link ext-link-type="uri" xlink:href="http://sites.google.com/site/sparseassembler/">http://sites.google.com/site/sparseassembler/</ext-link>
.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>We declare that we have no significant competing financial, professional or personal interests that might have influenced the performance or presentation of the work described in this manuscript.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>CY developed the algorithms, collected results, and wrote the software. ZM, CC, MP, DY contributed discussions on algorithms. CY, ZM, CC, MP, and DY wrote the manuscript. All authors read and approved the final manuscript.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We thank Prof. Jin Chen of the Xishuangbanna Tropical Botanical garden, Dr. Jue Ruan in BIG, Prof. Steven Salzberg and his group, in Johns Hopkins University, Prof. James Yorke and his group in Univ. of Maryland, Yingrui Li, Yinglong Xie, Hao Tan in BGI and Ruiqiang Li in Novogene for long-term support and fruitful discussions. We thank Adam Phillippy and Sergey Koren for providing scripts for generating
<italic>MUMmer </italic>
corrected results. This work was supported in part by Yunnan Province, China [20080A001], and the Chinese Academy of Sciences [0902281081, KSCX2-YW-Z-1027, Y002731079], and also by the US National Science Foundation grant IIS-0812111.</p>
<p>This article has been published as part of
<italic>BMC Bioinformatics </italic>
Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2012). The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S6">http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S6</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>Bioinformatics challenges of new sequencing technology</article-title>
<source>Trends Genet</source>
<year>2008</year>
<volume>24</volume>
<issue>3</issue>
<fpage>142</fpage>
<lpage>149</lpage>
<pub-id pub-id-type="doi">10.1016/j.tig.2007.12.006</pub-id>
<pub-id pub-id-type="pmid">18262676</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Dew</surname>
<given-names>IM</given-names>
</name>
<name>
<surname>Fasulo</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Flanigan</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Kravitz</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Mobarry</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>KHJ</given-names>
</name>
<name>
<surname>Remington</surname>
<given-names>KA</given-names>
</name>
<etal></etal>
<article-title>A whole-genome assembly of Drosophila</article-title>
<source>Science</source>
<year>2000</year>
<volume>287</volume>
<issue>5461</issue>
<fpage>2196</fpage>
<lpage>2204</lpage>
<pub-id pub-id-type="doi">10.1126/science.287.5461.2196</pub-id>
<pub-id pub-id-type="pmid">10731133</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jaffe</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Stanley</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Butler</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gnerre</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mauceli</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Berger</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Mesirov</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Lander</surname>
<given-names>ES</given-names>
</name>
<article-title>ARACHNE: A whole-genome shotgun assembler</article-title>
<source>Genome Res</source>
<year>2002</year>
<volume>12</volume>
<issue>1</issue>
<fpage>177</fpage>
<lpage>189</lpage>
<pub-id pub-id-type="doi">10.1101/gr.208902</pub-id>
<pub-id pub-id-type="pmid">11779843</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Mullikin</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Ning</surname>
<given-names>ZM</given-names>
</name>
<article-title>The phusion assembler</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<issue>1</issue>
<fpage>81</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="doi">10.1101/gr.731003</pub-id>
<pub-id pub-id-type="pmid">12529309</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Havlak</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Egan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>YR</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>XZ</given-names>
</name>
<name>
<surname>Weinstock</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>Gibbs</surname>
<given-names>RA</given-names>
</name>
<article-title>The atlas genome assembly system</article-title>
<source>Genome Res</source>
<year>2004</year>
<volume>14</volume>
<issue>4</issue>
<fpage>721</fpage>
<lpage>732</lpage>
<pub-id pub-id-type="doi">10.1101/gr.2264004</pub-id>
<pub-id pub-id-type="pmid">15060016</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<article-title>The fragment assembly string graph</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>79</fpage>
<lpage>85</lpage>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="other">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<article-title>Efficient de novo assembly of large genomes using compressed data structures</article-title>
<source>Genome Res</source>
<year>2011</year>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<article-title>Velvet: Algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<issue>5</issue>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJM</given-names>
</name>
<article-title>ABySS: A parallel assembler for short read sequence data</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<issue>6</issue>
<fpage>1117</fpage>
<lpage>1123</lpage>
<pub-id pub-id-type="doi">10.1101/gr.089532.108</pub-id>
<pub-id pub-id-type="pmid">19251739</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Chaisson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>HX</given-names>
</name>
<article-title>Fragment assembly with short reads</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<issue>13</issue>
<fpage>2067</fpage>
<lpage>2074</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth205</pub-id>
<pub-id pub-id-type="pmid">15059830</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Gnerre</surname>
<given-names>S</given-names>
</name>
<name>
<surname>MacCallum</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Przybylski</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ribeiro</surname>
<given-names>FJ</given-names>
</name>
<name>
<surname>Burton</surname>
<given-names>JN</given-names>
</name>
<name>
<surname>Walker</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Sharpe</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Shea</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Sykes</surname>
<given-names>S</given-names>
</name>
<etal></etal>
<article-title>High-quality draft assemblies of mammalian genomes from massively parallel sequence data</article-title>
<source>P Natl Acad Sci USA</source>
<year>2011</year>
<volume>108</volume>
<issue>4</issue>
<fpage>1513</fpage>
<lpage>1518</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1017351108</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Himmelbauer</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Dohm</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Lottaz</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Borodina</surname>
<given-names>T</given-names>
</name>
<article-title>SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing</article-title>
<source>Genome Res</source>
<year>2007</year>
<volume>17</volume>
<issue>11</issue>
<fpage>1697</fpage>
<lpage>1706</lpage>
<pub-id pub-id-type="doi">10.1101/gr.6435207</pub-id>
<pub-id pub-id-type="pmid">17908823</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>RQ</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>HM</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>WB</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>XD</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>ZB</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>YR</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>ST</given-names>
</name>
<name>
<surname>Shan</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kristiansen</surname>
<given-names>K</given-names>
</name>
<etal></etal>
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<issue>2</issue>
<fpage>265</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="doi">10.1101/gr.097261.109</pub-id>
<pub-id pub-id-type="pmid">20019144</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>HX</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
<source>P Natl Acad Sci USA</source>
<year>2001</year>
<volume>98</volume>
<issue>17</issue>
<fpage>9748</fpage>
<lpage>9753</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.171285098</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Sundquist</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ronaghi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>HX</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
<article-title>Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies</article-title>
<source>PLoS ONE</source>
<year>2007</year>
<volume>2</volume>
<issue>5</issue>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Warren</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJM</given-names>
</name>
<name>
<surname>Holt</surname>
<given-names>RA</given-names>
</name>
<article-title>Assembling millions of short DNA sequences using SSAKE</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>4</issue>
<fpage>500</fpage>
<lpage>501</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl629</pub-id>
<pub-id pub-id-type="pmid">17158514</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Conway</surname>
<given-names>TC</given-names>
</name>
<name>
<surname>Bromage</surname>
<given-names>AJ</given-names>
</name>
<article-title>Succinct data structures for assembling large genomes</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>4</issue>
<fpage>479</fpage>
<lpage>486</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq697</pub-id>
<pub-id pub-id-type="pmid">21245053</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>770</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Melsted</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pritchard</surname>
<given-names>JK</given-names>
</name>
<article-title>Efficient counting of k-mers in DNA sequences using a bloom filter</article-title>
<source>Bmc Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hayes</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Hunt</surname>
<given-names>BR</given-names>
</name>
<name>
<surname>Mount</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Yorke</surname>
<given-names>JA</given-names>
</name>
<article-title>Reducing storage requirements for biological sequence comparison</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<issue>18</issue>
<fpage>3363</fpage>
<lpage>3369</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth408</pub-id>
<pub-id pub-id-type="pmid">15256412</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Deng</surname>
<given-names>HW</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Papasian</surname>
<given-names>CJ</given-names>
</name>
<article-title>Comparative studies of de novo assembly tools for next-generation sequencing technologies</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>15</issue>
<fpage>2031</fpage>
<lpage>2037</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr319</pub-id>
<pub-id pub-id-type="pmid">21636596</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="other">
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Zimin</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Puiu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Magoc</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Treangen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
<etal></etal>
<article-title>GAGE: A critical evaluation of genome assemblies and assembly algorithms</article-title>
<source>Genome Res</source>
<year>2011</year>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>WY</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>YF</given-names>
</name>
<name>
<surname>Shang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>BR</given-names>
</name>
<article-title>A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies</article-title>
<source>PLoS ONE</source>
<year>2011</year>
<volume>6</volume>
<issue>3</issue>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="other">
<name>
<surname>Magoč</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies</article-title>
<source>Bioinformatics</source>
<year>2011</year>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Smoot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shumway</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonescu</surname>
<given-names>C</given-names>
</name>
<article-title>Versatile and open software for comparing large genomes</article-title>
<source>Genome Biol</source>
<year>2004</year>
<volume>5</volume>
<issue>2</issue>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<article-title>Genome assembly forensics: finding the elusive mis-assembly</article-title>
<source>Genome Biol</source>
<year>2008</year>
<volume>9</volume>
<issue>3</issue>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000942 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000942 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3369186
   |texte=   Exploiting sparseness in de novo genome assembly
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:22537038" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021