Serveur d'exploration MERS


<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">PERGA: A Paired-End Read Guided
<italic>De Novo</italic>
Assembler for Extending Contigs Using SVM and Look Ahead Approach</title>
<author>
<name sortKey="Zhu, Xiao" sort="Zhu, Xiao" uniqKey="Zhu X" first="Xiao" last="Zhu">Xiao Zhu</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leung, Henry C M" sort="Leung, Henry C M" uniqKey="Leung H" first="Henry C. M." last="Leung">Henry C. M. Leung</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Computer Science, University of Hong Kong, Hong Kong</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chin, Francis Y L" sort="Chin, Francis Y L" uniqKey="Chin F" first="Francis Y. L." last="Chin">Francis Y. L. Chin</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Computer Science, University of Hong Kong, Hong Kong</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yiu, Siu Ming" sort="Yiu, Siu Ming" uniqKey="Yiu S" first="Siu Ming" last="Yiu">Siu Ming Yiu</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Computer Science, University of Hong Kong, Hong Kong</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Quan, Guangri" sort="Quan, Guangri" uniqKey="Quan G" first="Guangri" last="Quan">Guangri Quan</name>
<affiliation>
<nlm:aff id="aff3">
<addr-line>National Pilot School of Software, Harbin Institute of Technology, Weihai, Shandong, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Bo" sort="Liu, Bo" uniqKey="Liu B" first="Bo" last="Liu">Bo Liu</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wang, Yadong" sort="Wang, Yadong" uniqKey="Wang Y" first="Yadong" last="Wang">Yadong Wang</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25461763</idno>
<idno type="pmc">4252104</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4252104</idno>
<idno type="RBID">PMC:4252104</idno>
<idno type="doi">10.1371/journal.pone.0114253</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">001093</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001093</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">PERGA: A Paired-End Read Guided
<italic>De Novo</italic>
Assembler for Extending Contigs Using SVM and Look Ahead Approach</title>
<author>
<name sortKey="Zhu, Xiao" sort="Zhu, Xiao" uniqKey="Zhu X" first="Xiao" last="Zhu">Xiao Zhu</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leung, Henry C M" sort="Leung, Henry C M" uniqKey="Leung H" first="Henry C. M." last="Leung">Henry C. M. Leung</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Computer Science, University of Hong Kong, Hong Kong</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chin, Francis Y L" sort="Chin, Francis Y L" uniqKey="Chin F" first="Francis Y. L." last="Chin">Francis Y. L. Chin</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Computer Science, University of Hong Kong, Hong Kong</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yiu, Siu Ming" sort="Yiu, Siu Ming" uniqKey="Yiu S" first="Siu Ming" last="Yiu">Siu Ming Yiu</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Computer Science, University of Hong Kong, Hong Kong</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Quan, Guangri" sort="Quan, Guangri" uniqKey="Quan G" first="Guangri" last="Quan">Guangri Quan</name>
<affiliation>
<nlm:aff id="aff3">
<addr-line>National Pilot School of Software, Harbin Institute of Technology, Weihai, Shandong, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Bo" sort="Liu, Bo" uniqKey="Liu B" first="Bo" last="Liu">Bo Liu</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wang, Yadong" sort="Wang, Yadong" uniqKey="Wang Y" first="Yadong" last="Wang">Yadong Wang</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Because the reads produced by high-throughput sequencing (HTS) technologies are short,
<italic>de novo</italic>
assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on either the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on
<italic>k</italic>
-mers or read overlaps, do not fully utilize the information in paired-end and single-end reads when resolving branches. Because they treat all single-end reads whose overlap length exceeds a fixed threshold equally, they fail to give priority to the more reliable reads with long overlaps and mix them with reads that have relatively short overlaps. Moreover, these approaches have not been specially designed to handle tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided
<italic>de novo</italic>
assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds, using paired-end reads and read overlap sizes ranging from
<italic>O</italic>
<sub>max</sub>
to
<italic>O</italic>
<sub>min</sub>
to resolve gaps and branches. By constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct one using a look-ahead approach. Many branches that are difficult to resolve are due to tandem repeats that lie close together in the genome. PERGA detects the different copies of such repeats to resolve the branches, making the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets, ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at
<ext-link ext-link-type="uri" xlink:href="https://github.com/hitbio/PERGA">https://github.com/hitbio/PERGA</ext-link>
.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Shendure, J" uniqKey="Shendure J">J Shendure</name>
</author>
<author>
<name sortKey="Porreca, Gj" uniqKey="Porreca G">GJ Porreca</name>
</author>
<author>
<name sortKey="Reppas, Nb" uniqKey="Reppas N">NB Reppas</name>
</author>
<author>
<name sortKey="Lin, Xx" uniqKey="Lin X">XX Lin</name>
</author>
<author>
<name sortKey="Mccutcheon, Jp" uniqKey="Mccutcheon J">JP McCutcheon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Margulies, M" uniqKey="Margulies M">M Margulies</name>
</author>
<author>
<name sortKey="Egholm, M" uniqKey="Egholm M">M Egholm</name>
</author>
<author>
<name sortKey="Altman, We" uniqKey="Altman W">WE Altman</name>
</author>
<author>
<name sortKey="Attiya, S" uniqKey="Attiya S">S Attiya</name>
</author>
<author>
<name sortKey="Bader, Js" uniqKey="Bader J">JS Bader</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Rq" uniqKey="Li R">RQ Li</name>
</author>
<author>
<name sortKey="Fan, W" uniqKey="Fan W">W Fan</name>
</author>
<author>
<name sortKey="Tian, G" uniqKey="Tian G">G Tian</name>
</author>
<author>
<name sortKey="Zhu, Hm" uniqKey="Zhu H">HM Zhu</name>
</author>
<author>
<name sortKey="He, L" uniqKey="He L">L He</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bentley, Dr" uniqKey="Bentley D">DR Bentley</name>
</author>
<author>
<name sortKey="Balasubramanian, S" uniqKey="Balasubramanian S">S Balasubramanian</name>
</author>
<author>
<name sortKey="Swerdlow, Hp" uniqKey="Swerdlow H">HP Swerdlow</name>
</author>
<author>
<name sortKey="Smith, Gp" uniqKey="Smith G">GP Smith</name>
</author>
<author>
<name sortKey="Milton, J" uniqKey="Milton J">J Milton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blanca, Jm" uniqKey="Blanca J">JM Blanca</name>
</author>
<author>
<name sortKey="Pascual, L" uniqKey="Pascual L">L Pascual</name>
</author>
<author>
<name sortKey="Ziarsolo, P" uniqKey="Ziarsolo P">P Ziarsolo</name>
</author>
<author>
<name sortKey="Nuez, F" uniqKey="Nuez F">F Nuez</name>
</author>
<author>
<name sortKey="Canizares, J" uniqKey="Canizares J">J Canizares</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Surget Groba, Y" uniqKey="Surget Groba Y">Y Surget-Groba</name>
</author>
<author>
<name sortKey="Montoya Burgos, Ji" uniqKey="Montoya Burgos J">JI Montoya-Burgos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Treangen, Tj" uniqKey="Treangen T">TJ Treangen</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flicek, P" uniqKey="Flicek P">P Flicek</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shendure, J" uniqKey="Shendure J">J Shendure</name>
</author>
<author>
<name sortKey="Ji, H" uniqKey="Ji H">H Ji</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warren, Rl" uniqKey="Warren R">RL Warren</name>
</author>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
<author>
<name sortKey="Jones, Sj" uniqKey="Jones S">SJ Jones</name>
</author>
<author>
<name sortKey="Holt, Ra" uniqKey="Holt R">RA Holt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jeck, Wr" uniqKey="Jeck W">WR Jeck</name>
</author>
<author>
<name sortKey="Reinhardt, Ja" uniqKey="Reinhardt J">JA Reinhardt</name>
</author>
<author>
<name sortKey="Baltrus, Da" uniqKey="Baltrus D">DA Baltrus</name>
</author>
<author>
<name sortKey="Hickenbotham, Mt" uniqKey="Hickenbotham M">MT Hickenbotham</name>
</author>
<author>
<name sortKey="Magrini, V" uniqKey="Magrini V">V Magrini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dohm, Jc" uniqKey="Dohm J">JC Dohm</name>
</author>
<author>
<name sortKey="Lottaz, C" uniqKey="Lottaz C">C Lottaz</name>
</author>
<author>
<name sortKey="Borodina, T" uniqKey="Borodina T">T Borodina</name>
</author>
<author>
<name sortKey="Himmelbauer, H" uniqKey="Himmelbauer H">H Himmelbauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hernandez, D" uniqKey="Hernandez D">D Hernandez</name>
</author>
<author>
<name sortKey="Francois, P" uniqKey="Francois P">P Francois</name>
</author>
<author>
<name sortKey="Farinelli, L" uniqKey="Farinelli L">L Farinelli</name>
</author>
<author>
<name sortKey="Osteras, M" uniqKey="Osteras M">M Osteras</name>
</author>
<author>
<name sortKey="Schrenzel, J" uniqKey="Schrenzel J">J Schrenzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Jr" uniqKey="Miller J">JR Miller</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Venter, E" uniqKey="Venter E">E Venter</name>
</author>
<author>
<name sortKey="Walenz, Bp" uniqKey="Walenz B">BP Walenz</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Dew, Im" uniqKey="Dew I">IM Dew</name>
</author>
<author>
<name sortKey="Fasulo, Dp" uniqKey="Fasulo D">DP Fasulo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zimin, Av" uniqKey="Zimin A">AV Zimin</name>
</author>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Puiu, D" uniqKey="Puiu D">D Puiu</name>
</author>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, Mj" uniqKey="Chaisson M">MJ Chaisson</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Butler, J" uniqKey="Butler J">J Butler</name>
</author>
<author>
<name sortKey="Maccallum, I" uniqKey="Maccallum I">I MacCallum</name>
</author>
<author>
<name sortKey="Kleber, M" uniqKey="Kleber M">M Kleber</name>
</author>
<author>
<name sortKey="Shlyakhter, Ia" uniqKey="Shlyakhter I">IA Shlyakhter</name>
</author>
<author>
<name sortKey="Belmonte, Mk" uniqKey="Belmonte M">MK Belmonte</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Wong, K" uniqKey="Wong K">K Wong</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
<author>
<name sortKey="Jones, Sj" uniqKey="Jones S">SJ Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Leung, Hc" uniqKey="Leung H">HC Leung</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Chin, Fy" uniqKey="Chin F">FY Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Zhu, H" uniqKey="Zhu H">H Zhu</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Qian, W" uniqKey="Qian W">W Qian</name>
</author>
<author>
<name sortKey="Fang, X" uniqKey="Fang X">X Fang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mcelroy, Ke" uniqKey="Mcelroy K">KE McElroy</name>
</author>
<author>
<name sortKey="Luciani, F" uniqKey="Luciani F">F Luciani</name>
</author>
<author>
<name sortKey="Thomas, T" uniqKey="Thomas T">T Thomas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author>
<name sortKey="Zimin, A" uniqKey="Zimin A">A Zimin</name>
</author>
<author>
<name sortKey="Puiu, D" uniqKey="Puiu D">D Puiu</name>
</author>
<author>
<name sortKey="Magoc, T" uniqKey="Magoc T">T Magoc</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25461763</article-id>
<article-id pub-id-type="pmc">4252104</article-id>
<article-id pub-id-type="publisher-id">PONE-D-14-36457</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0114253</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Computational Biology</subject>
<subj-group>
<subject>Genome Analysis</subject>
<subj-group>
<subject>Sequence Assembly Tools</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>PERGA: A Paired-End Read Guided
<italic>De Novo</italic>
Assembler for Extending Contigs Using SVM and Look Ahead Approach</article-title>
<alt-title alt-title-type="running-head">PERGA: A Paired-End Read Guided
<italic>De Novo</italic>
Assembler</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Zhu</surname>
<given-names>Xiao</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Leung</surname>
<given-names>Henry C. M.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chin</surname>
<given-names>Francis Y. L.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Yiu</surname>
<given-names>Siu Ming</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Quan</surname>
<given-names>Guangri</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Bo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Yadong</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Department of Computer Science, University of Hong Kong, Hong Kong</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>National Pilot School of Software, Harbin Institute of Technology, Weihai, Shandong, China</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Zhang</surname>
<given-names>Yan</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>Harbin Medical University, China</addr-line>
</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>ydwang@hit.edu.cn</email>
</corresp>
<fn fn-type="COI-statement">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con">
<p>Conceived and designed the experiments: YW. Performed the experiments: XZ HCML. Analyzed the data: XZ HCML FYLC SMY BL. Wrote the paper: XZ HCML SMY. Guided the development in early studies: GQ.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>2</day>
<month>12</month>
<year>2014</year>
</pub-date>
<volume>9</volume>
<issue>12</issue>
<elocation-id>e114253</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>8</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>5</day>
<month>11</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>© 2014 Zhu et al</copyright-statement>
<copyright-year>2014</copyright-year>
<copyright-holder>Zhu et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<abstract>
<p>Because the reads produced by high-throughput sequencing (HTS) technologies are short,
<italic>de novo</italic>
assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on either the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on
<italic>k</italic>
-mers or read overlaps, do not fully utilize the information in paired-end and single-end reads when resolving branches. Because they treat all single-end reads whose overlap length exceeds a fixed threshold equally, they fail to give priority to the more reliable reads with long overlaps and mix them with reads that have relatively short overlaps. Moreover, these approaches have not been specially designed to handle tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided
<italic>de novo</italic>
assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds, using paired-end reads and read overlap sizes ranging from
<italic>O</italic>
<sub>max</sub>
to
<italic>O</italic>
<sub>min</sub>
to resolve gaps and branches. By constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct one using a look-ahead approach. Many branches that are difficult to resolve are due to tandem repeats that lie close together in the genome. PERGA detects the different copies of such repeats to resolve the branches, making the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets, ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at
<ext-link ext-link-type="uri" xlink:href="https://github.com/hitbio/PERGA">https://github.com/hitbio/PERGA</ext-link>
.</p>
</abstract>
<funding-group>
<funding-statement>This work was partially supported by the National Natural Science Foundation of China (61173085, 61102149 and 11171086), the National High-Tech Research and Development Program (863) of China (2012AA020404, 2012AA02A602 and 2012AA02A604), the Hong Kong GRF (HKU 7111/12E, HKU 719709E and 719611E), the Shenzhen Basic Research Project (NO.JCYJ20120618143038947), and the Outstanding Researcher Award (102009124). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<page-count count="27"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>The authors confirm that all data underlying the findings are fully available without restriction. The simulated reads data are available from
<ext-link ext-link-type="uri" xlink:href="https://github.com/hitbio/PERGA">https://github.com/hitbio/PERGA</ext-link>
. The E.coli real short reads data can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://bix.ucsd.edu/projects/singlecell/nbt_data.html">http://bix.ucsd.edu/projects/singlecell/nbt_data.html</ext-link>
. The S.pombe real short reads data are available from NCBI website
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/sra/?term=ERX174934">http://www.ncbi.nlm.nih.gov/sra/?term=ERX174934</ext-link>
. The human chromosome 14 real data are available from GAGE project
<ext-link ext-link-type="uri" xlink:href="http://gage.cbcb.umd.edu/data">http://gage.cbcb.umd.edu/data</ext-link>
.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>The authors confirm that all data underlying the findings are fully available without restriction. The simulated reads data are available from
<ext-link ext-link-type="uri" xlink:href="https://github.com/hitbio/PERGA">https://github.com/hitbio/PERGA</ext-link>
. The E.coli real short reads data can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://bix.ucsd.edu/projects/singlecell/nbt_data.html">http://bix.ucsd.edu/projects/singlecell/nbt_data.html</ext-link>
. The S.pombe real short reads data are available from NCBI website
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/sra/?term=ERX174934">http://www.ncbi.nlm.nih.gov/sra/?term=ERX174934</ext-link>
. The human chromosome 14 real data are available from GAGE project
<ext-link ext-link-type="uri" xlink:href="http://gage.cbcb.umd.edu/data">http://gage.cbcb.umd.edu/data</ext-link>
.</p>
</notes>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>High-throughput sequencing (HTS) technologies emerged several years ago
<xref rid="pone.0114253-Shendure1" ref-type="bibr">[1]</xref>
,
<xref rid="pone.0114253-Margulies1" ref-type="bibr">[2]</xref>
and are widely used in many biomedical applications, such as large-scale DNA sequencing
<xref rid="pone.0114253-Li1" ref-type="bibr">[3]</xref>
, re-sequencing
<xref rid="pone.0114253-Bentley1" ref-type="bibr">[4]</xref>
and SNP discovery
<xref rid="pone.0114253-Li2" ref-type="bibr">[5]</xref>
,
<xref rid="pone.0114253-Blanca1" ref-type="bibr">[6]</xref>
. However, since the reads generated by HTS technologies (typically 50–150 base pairs
<xref rid="pone.0114253-Schatz1" ref-type="bibr">[7]</xref>
–
<xref rid="pone.0114253-Treangen1" ref-type="bibr">[9]</xref>
) are much shorter than those of traditional Sanger sequencing (typically about 800 base pairs
<xref rid="pone.0114253-Flicek1" ref-type="bibr">[10]</xref>
), and the per-base sequencing error rate is high
<xref rid="pone.0114253-Shendure2" ref-type="bibr">[11]</xref>
, short-read assembly is still a great challenge for genome sequencing.</p>
<p>The overlap-layout strategy and the de Bruijn graph strategy are the two major approaches to assembly. Overlap-layout-based approaches first compute the overlaps among reads and then assemble the reads according to those overlaps; they fall into two subcategories, the greedy extension strategy and the overlap graph strategy.</p>
<p>The greedy extension approach was applied by the first several
<italic>de novo</italic>
assemblers for HTS data, such as SSAKE
<xref rid="pone.0114253-Warren1" ref-type="bibr">[12]</xref>
, VCAKE
<xref rid="pone.0114253-Jeck1" ref-type="bibr">[13]</xref>
, SHARCGS
<xref rid="pone.0114253-Dohm1" ref-type="bibr">[14]</xref>
. In these assemblers, reads are stored in a prefix/suffix tree to record overlaps, and assembly proceeds by base-by-base 3′ extension following simple greedy heuristics: selecting the base with the maximum overlap or the most commonly represented base. To prevent mis-assembly, the extension stops when there is more than one feasible extension due to sequencing errors or similar regions in the genome. As a result, only short contigs are produced and the genome sequence cannot be reconstructed completely. In many situations, the erroneous extensions among multiple feasible extensions can be detected if the assembler tries to extend for a few more bases. For example, erroneous extensions due to a sequencing error at the end of a read usually cannot be extended in later steps (dead ends
<xref rid="pone.0114253-Hernandez1" ref-type="bibr">[15]</xref>
) and multiple extensions due to a sequencing error in the middle of a read should converge to the same nucleotide in later steps (bubbles
<xref rid="pone.0114253-Hernandez1" ref-type="bibr">[15]</xref>
). Besides constructing short contigs, these assemblers store the reads and their reverse complements inefficiently, so their memory consumption is usually very large (especially when high sequencing depth yields a huge number of erroneous reads), which limits their application to large HTS datasets.</p>
<p>To avoid the disadvantage of the greedy extension strategy, Edena
<xref rid="pone.0114253-Hernandez1" ref-type="bibr">[15]</xref>
and CABOG
<xref rid="pone.0114253-Miller1" ref-type="bibr">[16]</xref>
adopt the overlap graph strategy. This approach constructs an overlap graph in which a vertex represents a unique read and an edge connects vertices
<italic>u</italic>
and
<italic>v</italic>
if and only if
<italic>u</italic>
and
<italic>v</italic>
overlap each other sufficiently. Assembly is performed by simplifying the graph based on topological features such as transitive edges, dead ends and bubbles. Each simple path in the simplified graph represents a contig. This approach is also not well suited to HTS data because it requires enormous computation to detect overlaps among a very large number of reads. Recently, new methods based on read overlaps using the Burrows-Wheeler Transform
<xref rid="pone.0114253-Burrows1" ref-type="bibr">[17]</xref>
, such as SGA
<xref rid="pone.0114253-Simpson1" ref-type="bibr">[18]</xref>
and fermi
<xref rid="pone.0114253-Li2" ref-type="bibr">[5]</xref>
, can assemble larger amounts of HTS data. However, they require much more computation to construct an FM-index
<xref rid="pone.0114253-Ferragina1" ref-type="bibr">[19]</xref>
. MaSuRCA
<xref rid="pone.0114253-Zimin1" ref-type="bibr">[21]</xref>
, which is based on the Celera Assembler
<xref rid="pone.0114253-Myers1" ref-type="bibr">[20]</xref>
, transforms high-coverage data into low-coverage but long super-reads, which dramatically reduces the overlap computations and has made it more popular.</p>
<p>The de Bruijn graph strategy, which was first introduced in EULER
<xref rid="pone.0114253-Pevzner1" ref-type="bibr">[22]</xref>
, is particularly suitable for the short reads of HTS technologies. This strategy avoids the large amount of computation that the overlap-graph approach spends on read overlaps or on FM-index construction, and Velvet
<xref rid="pone.0114253-Zerbino1" ref-type="bibr">[23]</xref>
, EULER-SR
<xref rid="pone.0114253-Chaisson1" ref-type="bibr">[24]</xref>
, ALLPATHS
<xref rid="pone.0114253-Butler1" ref-type="bibr">[25]</xref>
, ABySS
<xref rid="pone.0114253-Simpson2" ref-type="bibr">[26]</xref>
, IDBA
<xref rid="pone.0114253-Peng1" ref-type="bibr">[27]</xref>
, IDBA-UD
<xref rid="pone.0114253-Peng2" ref-type="bibr">[28]</xref>
, SOAPdenovo
<xref rid="pone.0114253-Li3" ref-type="bibr">[29]</xref>
all adopt this strategy. This approach breaks up each read into a collection of overlapping
<italic>k</italic>
-substrings, called
<italic>k</italic>
-mers, to construct a de Bruijn graph. In the graph, a vertex represents a unique
<italic>k</italic>
-mer and an edge connects vertices
<italic>u</italic>
and
<italic>v</italic>
if and only if
<italic>u</italic>
and
<italic>v</italic>
overlap by
<italic>k</italic>
–1 nucleotides and appear consecutively in a read. The graph is then simplified by removing dead ends and merging bubbles, and each simple path in the simplified graph represents a contig. As the
<italic>k</italic>
-mers have fixed length and erroneous
<italic>k</italic>
-mers can be detected from their low sampling rates, the de Bruijn graph consumes much less memory than the overlap graph. However, most of these assemblers use only a fixed
<italic>k</italic>
-mer size; IDBA and IDBA-UD are the exceptions. Small
<italic>k</italic>
values will lead to better connectivity but many more branches due to repeat segments longer than
<italic>k</italic>
, whereas large
<italic>k</italic>
will result in worse connectivity with more gaps due to missing
<italic>k</italic>
-mers
<xref rid="pone.0114253-Peng2" ref-type="bibr">[28]</xref>
. Most of these assemblers just pick an intermediate
<italic>k</italic>
as a compromise between these two problems. IDBA
<xref rid="pone.0114253-Peng1" ref-type="bibr">[27]</xref>
and IDBA-UD
<xref rid="pone.0114253-Peng2" ref-type="bibr">[28]</xref>
give better results by iterating the
<italic>k</italic>
-mer sizes from
<italic>k</italic>
<sub>min</sub>
to
<italic>k</italic>
<sub>max</sub>
, using small
<italic>k</italic>
to resolve gaps and large
<italic>k</italic>
to resolve branches.</p>
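<p>The construction described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited assemblers; the function names build_de_bruijn and branching_kmers are hypothetical, reverse complements and error removal are ignored, and the two toy reads are invented for illustration.</p>
<preformat>
```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return {k-mer: set of k-mers that directly follow it in some read}.

    Consecutive k-mers of a read overlap by k-1 nucleotides by construction.
    """
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for u, v in zip(kmers, kmers[1:]):
            graph[u].add(v)
    return dict(graph)


def branching_kmers(graph):
    """k-mers with more than one outgoing edge, i.e. branch points."""
    return sorted(u for u, successors in graph.items() if len(successors) > 1)


# Toy input (hypothetical): the two reads agree up to "GTA" and then diverge.
reads = ["ACGTAC", "CGTAG"]
graph = build_de_bruijn(reads, k=3)
branches = branching_kmers(graph)
```
</preformat>
<p>In this toy input the only branch point is the 3-mer GTA, which is followed by TAC in one read and by TAG in the other; a real assembler would then have to decide whether this is a sequencing error or a repeat.</p>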
<p>There are two common problems with the above assemblers.</p>
<sec id="s1a">
<title>1) Underutilized information from paired-end and single-end reads</title>
<p>Paired-end read information is usually used to assemble contigs into scaffolds; however, the different overlap lengths of paired-end and single-end reads are usually not considered when assembling reads into contigs. Thus, some branches that could be resolved using this information remain unresolved. IDBA-UD uses paired-end reads aligned to the same contig to extend the contig (local assembly). However, it gives paired-end and single-end information equal weight, and it considers neither the number of reads supporting each branch nor the overlap length of each supporting read.</p>
<p>In fact, paired-end reads should be given the highest priority in resolving branches. Consider a branch with two possible extensions (or outgoing edges): if one extension is well supported by paired-end reads, whereas the other has more single-end reads but lacks good paired-end support, the assembler should extend the contig along the extension with paired-end support and treat the other as incorrect. If no paired-end reads are available, single-end reads should be used to determine the correct extension.</p>
<p>Assemblers stop when there is more than one choice for extension, without considering the different overlap lengths supporting each extension. Instead, they usually treat all single-end reads with overlaps larger than a threshold (or a rate) equally. Given a branch with two possible extensions (or outgoing edges), the assembler should extend the contig along the one with more supporting reads and treat the other as incorrect. Even when both extensions are supported by the same number of reads, the assembler should choose the extension with much longer overlaps, because reads with short overlaps may be due to sequencing errors or short repeats.</p>
</sec>
<sec id="s1b">
<title>2) Tandem repeats</title>
<p>Because of errors in recombination or genome duplication, many repeats are short tandem repeats (e.g., shorter than 100 bp) whose occurrences lie close together in the genome (e.g., adjacent occurrences less than 100 bp apart) (
<xref ref-type="supplementary-material" rid="pone.0114253.s001">File S1</xref>
<xref ref-type="supplementary-material" rid="pone.0114253.s002">S2</xref>
). These tandem repeats introduce branches that are difficult for existing assemblers to resolve. Assemblers based on the overlap-layout strategy stop when there is more than one choice for extension. In a de Bruijn graph, these tandem repeats introduce complicated branches, and existing assemblers cannot correctly separate the different copies of a repeat and place them at their correct positions during assembly. At such branches, existing assemblers usually stop to avoid introducing assembly errors, which results in a small assembly size.</p>
<p>To fully utilize the information in reads for resolving branches during assembly, we introduce PERGA (Paired-End Reads Guided Assembler), a novel
<italic>de novo</italic>
sequence read assembler that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. The main contributions of PERGA are as follows.</p>
<list list-type="order">
<list-item>
<p>
<bold>Utilizing information in paired-end and single-end reads.</bold>
Instead of using single-end reads to construct contigs, PERGA uses paired-end reads and different read overlap size thresholds ranging from
<italic>O</italic>
<sub>max</sub>
to
<italic>O</italic>
<sub>min</sub>
to resolve gaps and branches. In PERGA, contigs are extended base by base. Paired-end reads are aligned to contigs to determine the possible extensions. When there are few paired-end reads in some genome regions, single-end reads with variable overlap sizes from the larger threshold
<italic>O</italic>
<sub>max</sub>
to smaller threshold
<italic>O</italic>
<sub>min</sub>
are applied to handle branches and gaps. A large overlap size
<italic>O</italic>
<bold> = </bold>
<italic>O</italic>
<sub>max</sub>
is used first to extend contigs and resolve branches; if overlaps are missing for a larger
<italic>O</italic>
, then a progressively smaller
<italic>O</italic>
will be used to obtain better connectivity and resolve gaps, until a read overlap is found or
<italic>O</italic>
<bold> = </bold>
<italic>O</italic>
<sub>min</sub>
.</p>
</list-item>
<list-item>
<p>
<bold>SVM navigation model for resolving branches.</bold>
When there are multiple possible extensions (due to sequencing error or repeats), PERGA will determine which extension is more likely based on branch features, i.e. read coverage levels at the branch site and locally, path weight, gap size (see
<xref ref-type="sec" rid="s2">Materials and Methods</xref>
). By constructing decision models with a machine learning approach (SVM) based on these features, PERGA can determine the correct extension in 99.7% of cases. Note that PERGA also detects the case in which both extensions are likely to be correct and then stops extending the contig.</p>
</list-item>
<list-item>
<p>
<bold>Look ahead approach.</bold>
Because some mis-predictions remain (about 0.1–0.3%), when the confidence of a prediction is low PERGA checks all the candidate extensions and determines whether they are due to sequencing errors or repeats. If the multiple extensions are due to sequencing errors (the extensions are similar and converge to the same nucleotide within a short distance), PERGA merges them to form a single contig. If the branches are introduced by short tandem repeats (the extensions differ, do not converge, and are supported by paired-end reads with varying insert distances), PERGA detects their overlap and separates the different copies of the repeat to resolve the branches.</p>
</list-item>
</list>
<p>In summary, PERGA combines principles of traditional overlap-graph based approaches with novel heuristics for extending a path and resolving branches. More specifically, it employs four heuristics, from the most conservative to the most relaxed: i) at each point, use compatible paired-end reads to extend the path; ii) if no paired-end reads are available, extend with single-end reads, starting from those with the maximum overlap; iii) for multiple feasible extensions, use a machine learning method (SVM) to select one path, taking into account the read coverage levels at the branch site and locally, the path weight and the gap size; iv) if the paths are indistinguishable, employ the look-ahead approach to search for short stretches of sequencing errors that can be bridged and short tandem repeats whose copies can be separated, before terminating the extension at the branch.</p>
<p>In our experiments, PERGA performed better than other assemblers (Velvet, ABySS, IDBA-UD, CABOG and MaSuRCA), producing longer and more accurate contigs and scaffolds with moderate memory usage. We attribute this to its greedy-like prediction model, look-ahead approach and variable overlap size approach, which make fuller use of paired-end and single-end read information to resolve branches more accurately during assembly.</p>
</sec>
</sec>
<sec sec-type="materials|methods" id="s2">
<title>Materials and Methods</title>
<p>
<xref ref-type="fig" rid="pone-0114253-g001">Figure 1</xref>
shows an overview of PERGA, which consists of two phases: assembly of reads and assembly of contigs (i.e., scaffolding). In phase 1, a
<italic>k-mer hash table</italic>
, which is used to represent read overlaps by the consecutive
<italic>k</italic>
-mers, is constructed from the set of input reads, and then PERGA uses paired-ends and variable read overlap sizes thresholds from
<italic>O</italic>
<sub>max</sub>
to
<italic>O</italic>
<sub>min</sub>
to extend contigs. In a greedy-like prediction strategy, a
<italic>k</italic>
-mer is chosen as the start of a contig, which is extended iteratively in the 3′ direction one base at a time until either no overlapping reads remain or a repeat is found; the contig is then extended at the 5′ end in the same way. For each base extension, PERGA prefers the reads having more overlap with the contig and extends with the most represented base among those reads. When extending a base, PERGA first uses paired-end reads, with the highest priority, to navigate the contig extension, because they can resolve branches caused by repeats shorter than the insert size with much more confidence than single-end reads can. However, as there may be genome regions with low sequencing depth and insufficient paired-end reads, PERGA uses single-end reads to extend contigs in such regions, applying a variable overlap size ranging from the larger
<italic>O</italic>
<sub>max</sub>
to the smaller
<italic>O</italic>
<sub>min</sub>
to resolve repeats of sizes smaller than
<italic>O</italic>
<sub>max</sub>
and to resolve gaps due to the missing large overlaps.</p>
<fig id="pone-0114253-g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g001</object-id>
<label>Figure 1</label>
<caption>
<title>Workflow of PERGA.</title>
<p>PERGA has two phases: assembly of reads and assembly of contigs.
(A) Phase 1, assembly of reads. A
<italic>k</italic>
-mer hash table is first constructed using paired-end reads for
<italic>k</italic>
 = 
<italic>O</italic>
<sub>min</sub>
, then contigs are extended iteratively one base at a time (left feedback loop) at the 3′ end, using paired-end reads with high priority and variable overlap size thresholds ranging from
<italic>O</italic>
<sub>max</sub>
to
<italic>O</italic>
<sub>min</sub>
(right feedback loop) if no paired-end reads are available. (B) Phase 2, scaffolding. Paired-end reads are used to order and orient contigs and to fill intra-scaffold gaps, generating larger scaffolds.</p>
</caption>
<graphic xlink:href="pone.0114253.g001"></graphic>
</fig>
<p>Because of repeats in the genome or sequencing errors in reads, contig extension may encounter branches that have more than one feasible extension, each with a different number of supporting reads. Instead of stopping the extension, PERGA records the branch information and uses the Support Vector Machine (SVM) method to generate separate hyperplanes for paired-end and single-end reads. These two SVM models are then used to decide whether to extend or stop at branches during assembly, and in most cases branches can be resolved correctly.</p>
<p>However, there are also a few exceptions (branches that are incorrectly stopped or extended) when using the SVM models to decide the navigation. These situations can be resolved by looking ahead to find the feasible paths, which corrects the incorrect stops and incorrect extensions. This look-ahead approach can make contigs much longer with fewer mis-assemblies. Note that PERGA determines whether a branch is due to sequencing errors or repeats based on the properties of the extended paths, and resolves the two kinds of branches using different methods.</p>
<p>Besides the look-ahead approach, PERGA also handles erroneous bases in reads using topological structures, similar to the removal of dead ends and bubbles in de Bruijn graph based approaches. During extension, errors at the ends of reads lead to dead ends, and errors in the inner part of reads cause bubbles. PERGA deals with dead ends shorter than the read length and tolerates bubbles with sizes no more than
<italic>O</italic>
<sub>min</sub>
. In PERGA, the dead ends containing erroneous
<italic>k</italic>
-mers will be excluded from assembly by other correct reads during extension, and the bubbles in reads are treated as valid substitutions.</p>
<p>In phase 2, paired-end reads are aligned onto contigs and used to order and orient the contigs to form scaffolds (i.e., ordered sets of contigs with gaps in between). Then, the overlap sizes and gap sizes of linked neighboring contigs are computed: overlapping neighbors are merged to form longer contiguous sequences, and neighbors separated by a gap are processed with a local assembly approach that closes the intra-scaffold gap to generate longer contiguous sequences. Unlike SOAPdenovo
<xref rid="pone.0114253-Li3" ref-type="bibr">[29]</xref>
which trims
<italic>k</italic>
bases to exclude erroneous bases at contig ends when scaffolding, PERGA corrects such erroneous bases by pair-wise alignment of the overlapped ends of the neighboring contigs. Finally, the scaffold sequences are generated to form the resultant assembly according to the overlaps and the gap sizes of the contigs in scaffolds.</p>
<p>The details of each step will be described one by one in the following sections.</p>
<sec id="s2a">
<title>Assembly of reads to contigs</title>
<p>The first phase of PERGA assembles reads into contigs using a greedy-like prediction method based on paired-end read information (when available) and then on single-end reads. The algorithm starts with a
<italic>k</italic>
-mer at the end of an unused read and treats it as a contig. PERGA iteratively aligns paired-end reads to the contig and tries to extend it at both ends. To choose among the possible extensions (A, C, G or T), an SVM model uses the properties of the aligned reads to decide whether PERGA should extend the contig with the nucleotide that has the most support from aligned paired-end reads (rather than extending only when all aligned reads support the same extension, as other greedy algorithms do). Moreover, even when the SVM model cannot decide whether to extend the contig, PERGA tries to extend it with all possible nucleotides and lets the later steps (the look-ahead approach) determine which nucleotide to use. After extension, errors in aligned reads can be identified and corrected for later extension. The details of the assembly step are described below.</p>
<sec id="s2a1">
<title>Construct
<italic>k</italic>
-mer hash table</title>
<p>PERGA applies a
<italic>k</italic>
-mer based, cost-effective approach to perform read alignments. The overlap of two reads can be represented by their consecutive common
<italic>k</italic>
-mers; for example, two reads that overlap by
<italic>w</italic>
nucleotides should share
<italic>w</italic>
−
<italic>k</italic>
+1 consecutive
<italic>k</italic>
-mers. Thus, PERGA uses a hash table to store occurrences of
<italic>k</italic>
-mers in reads. We refer to the
<italic>occurrence</italic>
of a
<italic>k</italic>
-mer as the positions in reads at which it appears. Note that a
<italic>k</italic>
-mer may occur in multiple reads and a
<italic>k</italic>
-mer and its reverse complement are stored at the same entry; and the occurrences of each
<italic>k</italic>
-mer are stored in ascending order of their reads, so that reads can be aligned onto contigs quickly. Moreover, to reduce memory consumption, PERGA samples only ten percent of all the
<italic>k</italic>
-mers of a read, taken at both of its ends, and these
<italic>k</italic>
-mers at read ends are used to align reads to the contig efficiently during assembly.</p>
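<p>A minimal sketch of such a table is given below, under simplifying assumptions that are ours rather than the paper's: every k-mer is stored (the ten-percent sampling at read ends is not modelled), strand is not recorded, and the names canonical, build_kmer_table and kmers_in_overlap are hypothetical.</p>
<preformat>
```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """A k-mer and its reverse complement map to the same table entry."""
    return min(kmer, reverse_complement(kmer))


def build_kmer_table(reads, k):
    """Map each canonical k-mer to its occurrences as (read_id, position).

    Reads are scanned in order, so each occurrence list is already sorted
    in ascending order of read id.
    """
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            table[canonical(read[pos:pos + k])].append((read_id, pos))
    return dict(table)


def kmers_in_overlap(w, k):
    """Number of consecutive k-mers shared by two reads overlapping by w bases."""
    return max(w - k + 1, 0)


# Toy reads (hypothetical), k = 3.
table = build_kmer_table(["ACGTT", "AACGT"], k=3)
```
</preformat>
<p>Storing a k-mer and its reverse complement under one key halves the number of entries, which is one reason the table can stay compact.</p>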
</sec>
<sec id="s2a2">
<title>Align reads to contig</title>
<p>Paired-end read information is used to extend a contig before single-end read information because it can resolve longer repeats, i.e., up to the insert size of the paired-end reads. PERGA automatically infers the mean and standard deviation of the insert size from the paired-end reads that have already been assembled onto contigs. Only pairs whose two ends are aligned in the correct directions, i.e., pointing toward each other on different strands, are used to extend contigs.</p>
<p>As assembly proceeds, PERGA extracts the paths that will be assembled in the nearby genome region, and reads having a large portion of their bases (e.g., more than 90%) aligned with these paths are considered correctly aligned. Reads from other genome regions that have only a few aligned bases at their ends are thereby kept out of the assembly, which reduces the adverse impact of short repeats at read ends and increases the weight of the correctly aligned reads; the excluded reads can later be placed at their correct positions in other genome regions (
<xref ref-type="fig" rid="pone-0114253-g002">Figure 2A</xref>
). As shown in
<xref ref-type="fig" rid="pone-0114253-g002">Figure 2B</xref>
, PERGA aligns paired-end reads onto a single contig. Read pairs with one end fully aligned to the contig and the other end partially aligned to it are used to determine the extension of the contig.</p>
<fig id="pone-0114253-g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Align reads to contig for extension.</title>
<p>(A) Align reads to contig. The path of the nearby genome region is extracted according to the reads that are partially aligned onto the contig; new reads having more than 90% of their bases aligned with the path are then considered correctly aligned, otherwise they should be aligned to other genome regions rather than at that position. (B) Extension using paired-end reads. The contig is extended at the 3′ end according to the reads in Pool 1 whose mates are in Pool 2. There are two candidate bases, ‘C’ and ‘T’; ‘C’ is well supported by the mates in the two pools, whereas ‘T’ has no paired-end read support, so ‘C’ is chosen and appended to the contig. (C) Extension using single-end reads. When assembling the grey region, which cannot be assembled with paired-end reads and in which the reads in Pool 1 have no mates in Pool 2, the reads in Pool 1 are used as single-end reads to extend the contig.</p>
</caption>
<graphic xlink:href="pone.0114253.g002"></graphic>
</fig>
<p>When the extension of a contig starts, the
<italic>k</italic>
-mer at the end of a read is selected as the initial contig, and the contig is extended iteratively with single-end reads, using the SVM model to eliminate sequencing errors and the variable overlap size approach to avoid the impact of short repeats. Once the contig is long enough for paired-end reads to be used, extension proceeds using paired-end reads with the highest priority to avoid repeats shorter than the insert size. Moreover, when the start
<italic>k</italic>
-mer contains sequencing errors, it typically has low frequency in
<italic>k</italic>
-mer hash table, and such
<italic>k</italic>
-mers are not used to start a contig. When PERGA cannot determine the extension from the aligned paired-end reads, single-end read information, including paired-end reads with one end aligned to the end of a contig and the other end unaligned, is used to determine the extension of the contig (
<xref ref-type="fig" rid="pone-0114253-g002">Figure 2C</xref>
).</p>
</sec>
<sec id="s2a3">
<title>Utilizing information in paired-end and single-end reads</title>
<p>Since the alignment of single-end reads is less reliable than that of paired-end reads, especially when the length of the aligned region
<italic>O</italic>
is short, single-end read information is used carefully, proceeding from reads with large
<italic>O</italic>
to reads with small
<italic>O</italic>
. PERGA determines the possible extension using reads with
<italic>O</italic>
no smaller than the larger threshold
<italic>O</italic>
<sub>max</sub>
then to smaller threshold
<italic>O</italic>
<sub>min</sub>
iteratively. Thus, if PERGA can determine the extension using reads with large
<italic>O</italic>
confidently, it will not consider those reads with small
<italic>O</italic>
. In
<xref ref-type="fig" rid="pone-0114253-g003">Figure 3</xref>
, the contig is extended by reads 1 and 2 (
<italic>O</italic>
≥6), which resolves the repeat AAT that also occurs in reads 5 and 6 from other genome regions; if there are no reads having
<italic>O</italic>
≥6, then a smaller
<italic>O</italic>
≥5 is applied in the same way. Moreover, the read overlap approach can resolve repeats carried by reads that do not overlap each other, e.g., GCA in reads 1, 2, 5 and 6.</p>
<fig id="pone-0114253-g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g003</object-id>
<label>Figure 3</label>
<caption>
<title>Example of the variable overlap size approach for contig extension.</title>
<p>Suppose
<italic>k</italic>
 = 
<italic>O</italic>
<sub>min</sub>
 = 3 and
<italic>O</italic>
<sub>max</sub>
 = 6. There are 6 reads, reads 1–4 are the reads that can be assembled onto the contig, while reads 5–6 are the reads that should be assembled onto other regions. The contig is extended using
<italic>O</italic>
≥6 by reads 1–2, and if there are no reads having
<italic>O</italic>
≥6, then smaller
<italic>O</italic>
≥5 will be used in the same way until the contig can be extended successfully. AAT is a repeat that can be resolved by
<italic>O</italic>
≥4, and GCA is another repeat, which is resolved because the reads carrying it do not overlap each other.</p>
</caption>
<graphic xlink:href="pone.0114253.g003"></graphic>
</fig>
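<p>The variable overlap size approach can be sketched as follows. This is an illustrative simplification, not PERGA's implementation: overlaps are found by exact string matching instead of the k-mer hash table, the SVM decision at branches is omitted, and the names longest_overlap and extend_one_base as well as the toy contig and reads are hypothetical.</p>
<preformat>
```python
from collections import Counter


def longest_overlap(contig, read):
    """Length of the longest read prefix that matches the contig suffix.

    At least one read base must remain beyond the overlap, so that the
    read can propose the next base.
    """
    for ov in range(min(len(contig), len(read) - 1), 0, -1):
        if contig.endswith(read[:ov]):
            return ov
    return 0


def extend_one_base(contig, reads, o_max, o_min):
    """Try thresholds from o_max down to o_min; return (base, threshold).

    At each threshold only reads whose overlap is at least the threshold
    vote, and the most represented base wins. Returns (None, None) if no
    read overlaps by at least o_min.
    """
    overlaps = [(longest_overlap(contig, r), r) for r in reads]
    for threshold in range(o_max, o_min - 1, -1):
        votes = Counter(r[ov] for ov, r in overlaps if ov >= threshold)
        if votes:
            base, _ = votes.most_common(1)[0]
            return base, threshold
    return None, None


# Toy data (hypothetical): the last two reads share only the short repeat AAT.
contig = "ACGGTCAAT"
reads = ["GTCAATG", "TCAATGC", "AATCC", "AATCA"]
base, used_threshold = extend_one_base(contig, reads, o_max=6, o_min=3)
```
</preformat>
<p>With the long-overlap reads present, the extension is decided at the largest threshold and the reads that match only the short repeat never vote; if the long-overlap reads were absent, the threshold would fall step by step until the short-overlap reads were allowed to extend the contig.</p>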
</sec>
<sec id="s2a4">
<title>SVM navigation model</title>
<p>When extending contigs, there may be more than one feasible extension with various supporting reads that are mainly due to repeats or sequencing errors, i.e. there is a branch (
<xref ref-type="fig" rid="pone-0114253-g004">Figure 4</xref>
). When determining the correct extension at a branch, PERGA records the branch information as features (
<italic>maxOcc</italic>
,
<italic>secOcc</italic>
,
<italic>covRatio</italic>
,
<italic>gapLen</italic>
), where
<italic>maxOcc</italic>
is the number of reads supporting the majority nucleotide,
<italic>secOcc</italic>
is the number of reads supporting the second majority nucleotide,
<italic>covRatio</italic>
is the ratio of the average number of aligned reads (per nucleotide) at the extending ends (within two read lengths) to the average number of aligned reads for the contig,
<italic>gapLen</italic>
is the distance from the most recent completely aligned read to the contig end. The idea is that if the maxOcc and secOcc of a branch differ greatly (e.g., secOcc/maxOcc below 0.7), the extension corresponding to maxOcc is usually correct; otherwise, that extension might be incorrect and extension should stop for further checking. A branch with a low gapLen suggests that many reads are aligned to the end of the contig and the maxOcc extension should be correct, whereas a branch with covRatio larger than one suggests that there is a repeat nearby and PERGA should extend more carefully.</p>
<fig id="pone-0114253-g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g004</object-id>
<label>Figure 4</label>
<caption>
<title>The SVM model to resolve branches.</title>
<p>The maxOcc and secOcc should differ as much as possible, the covRatio should be near 1.0 and the gapLen should be as short as possible. Otherwise, the contig should be extended more carefully.</p>
</caption>
<graphic xlink:href="pone.0114253.g004"></graphic>
</fig>
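<p>As a rough illustration of the feature vector, the sketch below derives maxOcc and secOcc from the bases proposed by the aligned reads and packs them with covRatio and gapLen. The names BranchFeatures, branch_features and clearly_dominant are hypothetical, the coverage values and gap length are assumed to be supplied by the caller, and the 0.7 ratio test is only the rule of thumb quoted in the text, not the trained SVM decision.</p>
<preformat>
```python
from collections import Counter
from typing import NamedTuple


class BranchFeatures(NamedTuple):
    max_occ: int      # reads supporting the majority nucleotide
    sec_occ: int      # reads supporting the second majority nucleotide
    cov_ratio: float  # local coverage at the contig end / contig coverage
    gap_len: int      # distance from last fully aligned read to contig end


def branch_features(candidate_bases, local_cov, contig_cov, gap_len):
    """Build the four-feature vector for one branch.

    candidate_bases holds one proposed next base per aligned read.
    """
    counts = sorted(Counter(candidate_bases).values(), reverse=True) + [0, 0]
    return BranchFeatures(counts[0], counts[1], local_cov / contig_cov, gap_len)


def clearly_dominant(features, ratio=0.7):
    """Rule of thumb: secOcc / maxOcc below `ratio` (not the SVM itself)."""
    return features.max_occ > 0 and ratio * features.max_occ > features.sec_occ


# Toy branch (hypothetical): eight reads propose C, two propose T.
f = branch_features("CCCCCCCCTT", local_cov=30.0, contig_cov=30.0, gap_len=0)
dominant = clearly_dominant(f)
```
</preformat>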
<p>To train the SVM prediction models, we recorded the four features of each branch encountered during assembly and treated each branch as a point in a four-dimensional space; a machine learning approach then fits a hyperplane to these points that separates the branches that should be continued from those that should be stopped. By comparison with the reference during assembly, each branch can be classified as a correct extension, wrong extension, correct stop or wrong stop. We used these points as the training dataset for the SVM model and labelled each branch CONTINUE if it was a correct extension or a wrong stop (i.e., it should be continued), and STOP otherwise.</p>
<p>Based on the training dataset of branches, PERGA can determine whether a contig should be extended using the majority nucleotide. A support vector machine with the polynomial kernel function
<italic>K</italic>
(
<italic>x</italic>
,
<italic>y</italic>
) = ⟨
<italic>x</italic>
,
<italic>y</italic>
⟩ × (1 + ⟨
<italic>x</italic>
,
<italic>y</italic>
⟩)
<sup>2</sup>
, where
<italic>x</italic>
and
<italic>y</italic>
are vectors containing branch information and ⟨
<italic>x</italic>
,
<italic>y</italic>
⟩ is their dot product, is constructed from the four features and used to determine whether a contig should be extended. Note that PERGA first tries to determine the correct extension using features calculated from aligned paired-end reads. If PERGA fails to decide whether to extend the contig, it will recalculate the features using aligned single-end reads from
<italic>O</italic>
 = 
<italic>O</italic>
<sub>max</sub>
to
<italic>O</italic>
<sub>min</sub>
and will determine when to extend the contig.</p>
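<p>The kernel stated above can be evaluated directly, as in the short sketch below. Writing s for the dot product, the kernel expands to s + 2s² + s³, i.e. a sum of polynomial terms with positive coefficients. The function names dot and branch_kernel and the example vectors are hypothetical; feature scaling and SVM training are not shown.</p>
<preformat>
```python
def dot(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def branch_kernel(x, y):
    """K(x, y) = dot(x, y) * (1 + dot(x, y)) ** 2, as stated in the text."""
    s = dot(x, y)
    return s * (1 + s) ** 2


def branch_kernel_expanded(x, y):
    """The same kernel written as s + 2*s**2 + s**3."""
    s = dot(x, y)
    return s + 2 * s ** 2 + s ** 3


# Example vectors (hypothetical, two-dimensional for brevity).
value = branch_kernel((1, 2), (3, 4))
```
</preformat>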
</sec>
<sec id="s2a5">
<title>Look-ahead approach</title>
<p>Since the SVM is not perfect for all branches, there are a few low-confidence cases when using the SVM models to decide whether to extend a contig. For these cases, PERGA looks ahead to find all feasible extensions. Starting from the branch, PERGA uses the reads to extract all paths that could be appended to the contig and compares the sequences of these paths. The underlying assumption is that a highly consistent consensus sequence is harder to obtain when the multiple extensions are due to repeats than when they are due to sequencing errors. PERGA therefore calculates the ratio of the majority nucleotide at each position and treats the majority nucleotide as incorrect if its ratio is less than 0.9. If there are fewer than 3 incorrect nucleotides, the extension continues using the majority nucleotide; otherwise, PERGA determines whether the branch is due to short tandem repeats.</p>
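<p>The consensus test described above can be sketched as follows, assuming the candidate paths have already been extracted and are compared position by position without gaps (an assumption of this sketch; how PERGA aligns the paths is not specified here). The names low_agreement_positions and looks_like_sequencing_error and the toy paths are hypothetical; the thresholds 0.9 and 3 are those quoted in the text.</p>
<preformat>
```python
from collections import Counter


def low_agreement_positions(paths, min_ratio=0.9):
    """Positions where the majority base covers less than min_ratio of paths.

    Paths are compared column by column over their common length; at least
    one path is expected.
    """
    length = min(len(p) for p in paths)
    bad = []
    for i in range(length):
        column = Counter(p[i] for p in paths)
        majority = column.most_common(1)[0][1]
        if min_ratio > majority / len(paths):
            bad.append(i)
    return bad


def looks_like_sequencing_error(paths, min_ratio=0.9, max_bad=3):
    """True if fewer than max_bad positions lack a strong consensus."""
    return max_bad > len(low_agreement_positions(paths, min_ratio))


# Toy paths (hypothetical): one path differs from the others at one position.
error_like = ["ACGTACGT"] * 4 + ["ACGAACGT"]
# Toy paths (hypothetical): two groups that disagree at many positions.
repeat_like = ["ACGTACGT", "ACGTACGT", "ATTAGCAA", "ATTAGCAA"]
```
</preformat>
<p>For the first toy set only one position falls below the 0.9 ratio, so extension would continue with the majority nucleotide; for the second set six positions fall below it, so the branch would be passed on to the tandem repeat check.</p>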
</sec>
<sec id="s2a6">
<title>Deal with short tandem repeats</title>
<p>According to our observations, there are some short repeats with lengths less than the read length, and some of their copies lie a short distance apart, e.g., less than a read length (
<xref ref-type="supplementary-material" rid="pone.0114253.s001">File S1</xref>
<xref ref-type="supplementary-material" rid="pone.0114253.s002">S2</xref>
). Such repeats may cause ambiguities during assembly and should therefore be resolved for better accuracy. If the path overlap is significant (e.g., more than 10 bp; see the overlap of paths P1 and P2 in
<xref ref-type="fig" rid="pone-0114253-g005">Figure 5</xref>
), this usually means that there is another copy of the repeat in a nearby genome region that will be assembled soon; the two paths can then be merged, the aligned reads on path P2 are adjusted according to the overlap, and the extension continues along the path (e.g., P1) at the branch. Otherwise, it usually means that the repeat copies come from different genome regions a large distance apart, and the extension should stop to prevent assembly errors.</p>
<fig id="pone-0114253-g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g005</object-id>
<label>Figure 5</label>
<caption>
<title>The approach to separate repeat R into its two copies R1 and R2.</title>
<p>(A) The branch is caused by different copies of repeat R (bold red). (B) The two copies of repeat R are resolved by two paths P1 and P2. P1 and P2 are extracted according to the reads at that branch. Because the suffix of P1 overlaps the prefix of P2, the reads of P2 are adjusted to their correct positions.</p>
</caption>
<graphic xlink:href="pone.0114253.g005"></graphic>
</fig>
<p>When more than one repeat remains after merging overlapping paths, PERGA first counts the mismatched bases of each path relative to the contig, and then computes, for each path, the distance between paired ends that have one end aligned on the path and the other end aligned on the contig. If a path's mismatched base count is significant (e.g., >2) and its paired-end distance differs markedly from the insert size (e.g., by more than 2 standard deviations of the insert size), the repeat copy (i.e., the path) most likely comes from another genome region rather than from the branch. Such paths are treated as invalid and removed together with their aligned reads, and the removed reads remain available for the later assembly of other genome regions.</p>
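<p>The path filter described above can be illustrated with a short sketch. The function name is_foreign_repeat_path and its arguments are hypothetical, and the thresholds are the example values quoted in the text (more than 2 mismatches, more than 2 standard deviations); the actual PERGA code may differ.</p>
<preformat>
```python
def is_foreign_repeat_path(mismatches, pair_distance, insert_mean, insert_sd,
                           max_mismatches=2, sd_factor=2.0):
    """Return True when a path probably belongs to another genome region.

    mismatches: mismatched bases between the path and the contig.
    pair_distance: distance implied by read pairs with one end on the path
    and the other end on the contig.
    """
    too_many_mismatches = mismatches > max_mismatches
    distance_off = abs(pair_distance - insert_mean) > sd_factor * insert_sd
    # Both conditions must hold before the path and its reads are removed.
    return too_many_mismatches and distance_off
```
</preformat>
<p>Requiring both conditions keeps a path that has a few mismatches but a plausible paired-end distance, which matches the conjunction ("and") used in the description.</p>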
</sec>
<sec id="s2a7">
<title>Handling erroneous bases</title>
<p>Erroneous bases in reads from HTS data make the assembly problem much more complex and error-prone, and they cannot easily be resolved by paired-end reads and the variable overlap size approach. To resolve ambiguities arising from sequencing errors, PERGA applies a method similar to the approach based on topological structures
<xref rid="pone.0114253-Hernandez1" ref-type="bibr">[15]</xref>
,
<xref rid="pone.0114253-Zerbino1" ref-type="bibr">[23]</xref>
,
<xref rid="pone.0114253-Simpson2" ref-type="bibr">[26]</xref>
. Errors at the ends of reads usually lead to short dead ends, which are likely to terminate prematurely, and errors in the inner part of reads cause small bubbles in which the two paths have similar bases and share the same starting and ending reads (
<xref ref-type="fig" rid="pone-0114253-g006">Figure 6</xref>
). Note that such dead ends and bubbles are checked in reads rather than in contigs; PERGA detects these errors and corrects them for later extensions. It checks the similarity between a read and the contig according to topological structures. If the similarity is high, say ≧95%, the read is assembled onto the contig; otherwise, the read is not assembled onto the contig and may instead be assembled into another genome region with higher similarity.</p>
<fig id="pone-0114253-g006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g006</object-id>
<label>Figure 6</label>
<caption>
<title>Scheme for removing erroneous bases.</title>
<p>Erroneous bases in reads cause dead ends and bubbles, which can be resolved implicitly because the errors are masked by the correct reads. Reads with low similarity can probably be assembled onto other contigs with higher similarity.</p>
</caption>
<graphic xlink:href="pone.0114253.g006"></graphic>
</fig>
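<p>As a simplified illustration of the similarity test, the sketch below uses plain per-base identity between a read and the contig segment it is placed on. This is an assumption made for the example: PERGA evaluates similarity according to topological structures, which is not reproduced here, and the names read_similarity and should_assemble are hypothetical.</p>
<preformat>
```python
def read_similarity(read, contig_segment):
    """Fraction of identical bases between a read and an equally long contig segment."""
    if not read or len(read) != len(contig_segment):
        raise ValueError("read and contig segment must be non-empty and equally long")
    matches = sum(1 for r, c in zip(read, contig_segment) if r == c)
    return matches / len(read)


def should_assemble(read, contig_segment, min_similarity=0.95):
    """Assemble the read onto the contig only if similarity reaches the threshold."""
    return read_similarity(read, contig_segment) >= min_similarity
```
</preformat>
<p>A read that fails the test is left unassembled at this position, so it stays available for other genome regions where it may match better.</p>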
</sec>
</sec>
<sec id="s2b">
<title>Assembly of contigs</title>
<p>PERGA assembles the generated contigs into larger scaffolds using paired-end information, in a manner similar to existing assemblers. In this procedure, reads are aligned onto contig ends to order and orient the contigs and generate scaffolds. After constructing scaffolds, PERGA merges overlapping neighboring contigs, fills intra-scaffold gaps, and generates consensus sequences to give the final assembly (
<xref ref-type="fig" rid="pone-0114253-g001">Figure 1B</xref>
). The scaffolding method is described in detail in the following subsections.</p>
<sec id="s2b1">
<title>Read alignment</title>
<p>If one end of a paired-end read aligns uniquely to one contig and the other end aligns uniquely to another contig, the two contigs should be adjacent in the genome. Reads that align to multiple contigs are not considered. Because reads with both ends aligned to the same contig provide no extra information for constructing scaffolds, PERGA aligns reads to the ends of contigs, each called a
<italic>linking region</italic>
, which is 2 kbp long by default.</p>
</sec>
<sec id="s2b2">
<title>Linking contigs to scaffolds</title>
<p>Because reads are sampled randomly from the positive or the negative strand, and it is unknown whether a contig sequence represents the positive or the negative strand, there are four valid
<italic>placements</italic>
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e001.jpg"></inline-graphic>
</inline-formula>
for two adjacent contigs (A, B) as shown in
<xref ref-type="fig" rid="pone-0114253-g007">Figure 7</xref>
. The relative positions and directions of the contigs can be determined from the aligned paired-end reads. However, because of sequencing errors and misalignment, different paired-end reads may imply different relative directions and positions for the same two contigs. To determine the correct relative direction and position of contigs, a
<italic>scaffold graph</italic>
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e002.jpg"></inline-graphic>
</inline-formula>
is constructed over the set of linking regions
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e003.jpg"></inline-graphic>
</inline-formula>
to capture all placements of adjacent contigs by the set of edges
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e004.jpg"></inline-graphic>
</inline-formula>
. In the graph,
<italic>placement weight</italic>
is defined as the number of paired-end reads that support each placement of two linking regions
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e005.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e006.jpg"></inline-graphic>
</inline-formula>
in distinct contigs, and each edge
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e007.jpg"></inline-graphic>
</inline-formula>
is associated with a quadruple
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e008.jpg"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e009.jpg"></inline-graphic>
</inline-formula>
is the weight for placement
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e010.jpg"></inline-graphic>
</inline-formula>
. Only uniquely aligned reads are used to construct the graph, so that repeats do not introduce errors.</p>
<fig id="pone-0114253-g007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g007</object-id>
<label>Figure 7</label>
<caption>
<title>Four placements for two adjacent contigs.</title>
<p>Placements depicted at the bottom correspond to those in the top table. Adjacent contigs (bold arrows) are placed based on their aligned read pairs. Grey arrows indicate reverse complements of contigs. Contig orientation (‘+’/‘−’) in the top table is the contig orientation in scaffolds.</p>
</caption>
<graphic xlink:href="pone.0114253.g007"></graphic>
</fig>
<p>Contigs are linked using a greedy approach. A contig longer than the linking region size is randomly selected as the initial scaffold to be extended. The extension proceeds iteratively by adding neighboring contigs to the right; once a contig is included in a scaffold, its orientation is assigned according to the placement. The extension terminates when there are no neighboring contigs or when there are multiple candidates and the correct one cannot be determined. When the extension from the 3′ end has terminated, the 5′ end is extended in a similar fashion (
<xref ref-type="fig" rid="pone-0114253-g008">Figure 8</xref>
). Once contigs are linked, their order and orientation in the scaffolds are determined.</p>
<fig id="pone-0114253-g008" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.g008</object-id>
<label>Figure 8</label>
<caption>
<title>Scheme for contig linking.</title>
<p>The first link round (right) extends contigs by paired-end reads from the starting contig A to the right until no further extension is possible; the second link round (left) is then carried out from A to the left in the same way. A scaffold is a linear structure of linked contigs (bottom) that have been ordered and oriented.</p>
</caption>
<graphic xlink:href="pone.0114253.g008"></graphic>
</fig>
</sec>
<sec id="s2b3">
<title>Overlap between contigs</title>
<p>To generate the final scaffolds, it is necessary to compute the distance between each pair of adjacent contigs in a scaffold, which may overlap or be separated by a gap. For overlapping contigs, the overlapping region is detected and the two contigs are merged into a single contig. For contigs separated by a gap, the gap size is computed from the paired-end reads that link the two contigs.</p>
<p>PERGA first estimates the gap size between adjacent contigs in scaffolds. Given a paired-end read with ends
<italic>a</italic>
and
<italic>b</italic>
aligned to different contigs A and B, respectively, the gap size
<italic>g</italic>
can be estimated by
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e011.jpg"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e012.jpg"></inline-graphic>
</inline-formula>
is the mean insert size,
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e013.jpg"></inline-graphic>
</inline-formula>
is the distance from 5′ end of read
<italic>a</italic>
to the gap margin of contig A,
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e014.jpg"></inline-graphic>
</inline-formula>
is the distance from 5′ end of read
<italic>b</italic>
to the gap margin of contig B. In practice, multiple such read pairs can align onto two adjacent contigs, so the final gap size
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e015.jpg"></inline-graphic>
</inline-formula>
can be inferred by
<disp-formula id="pone.0114253.e016">
<graphic xlink:href="pone.0114253.e016.jpg" position="anchor" orientation="portrait"></graphic>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pone.0114253.e017.jpg"></inline-graphic>
</inline-formula>
is the number of read pairs between contigs A and B.</p>
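<p>The gap-size estimate can be sketched as follows. The formulas themselves are not reproduced here, so this example assumes the common form in which each read pair contributes the mean insert size minus the two 5′-end-to-margin distances, and the final estimate is the average over all linking read pairs. The function name estimate_gap_size and the tuple layout of its input are hypothetical.</p>
<preformat>
```python
def estimate_gap_size(mean_insert, pair_offsets):
    """Average gap estimate over read pairs that link two adjacent contigs.

    pair_offsets: iterable of (d_a, d_b) tuples, where d_a is the distance from
    the 5' end of read a to the gap margin of contig A, and d_b is the same
    quantity for read b on contig B.
    Assumed per-pair estimate: mean_insert - d_a - d_b.
    """
    gaps = [mean_insert - d_a - d_b for d_a, d_b in pair_offsets]
    if not gaps:
        raise ValueError("at least one linking read pair is required")
    # A negative result suggests that the two contigs overlap.
    return sum(gaps) / len(gaps)
```
</preformat>
<p>Under this assumption, a clearly positive value indicates a gap and a clearly negative value indicates a probable overlap between the two contigs.</p>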
<p>PERGA then checks the inferred gap size. If it is a large positive number, there is probably a gap of the estimated size between the contigs; if it is a large negative number, the contigs probably overlap. If the gap size is not significant, a further check is needed that compares the suffix and prefix of the two contigs.</p>
<p>When the gap size is a large negative number or insignificant, the two contigs may partially overlap. Because of sequencing errors and assembly mistakes, the overlapping sequences may not be identical. PERGA therefore performs pairwise sequence alignment to detect overlaps. If an overlap is larger than 3 and agrees with the estimated gap size, the pair of contigs is recorded as overlapping and merged into a single contig; otherwise, a gap is kept between them.</p>
</sec>
<sec id="s2b4">
<title>Gap filling</title>
<p>After estimating the gap size between adjacent contigs, PERGA fills the gap regions to improve continuity, using local assembly of paired-end reads that have one end aligned onto a contig and the other end in the gap region. Most sequences in gap regions are repetitive, so gap filling can be used to resolve such repeats. Because the sequences adjacent to gap regions are already known, repetitive sequences in gap regions can be readily reconstructed by local assembly, which is based on PERGA's paired-end read assembly algorithm.</p>
<p>Consensus sequences are generated from contigs in scaffolds considering their overlaps and gaps. If adjacent contigs are overlapped, then they will be merged; and if contigs are gapped, the gap region between these contigs will be filled with ambiguous bases (‘N’).</p>
</sec>
</sec>
<sec id="s2c">
<title>Datasets</title>
<p>We evaluated the performance of PERGA on both simulated and real datasets of Escherichia coli, Schizosaccharomyces pombe and human chromosome 14 (details are shown in
<xref ref-type="table" rid="pone-0114253-t001">Table 1</xref>
) with reference sizes ranging from 4.6 Mbp to 88.3 Mbp (million base pairs). The simulated Illumina paired-ends datasets were generated using GemSIM
<xref rid="pone.0114253-McElroy1" ref-type="bibr">[30]</xref>
with various coverages 50x, 60x, 100x for
<italic>E.coli</italic>
(can be downloaded from
<ext-link ext-link-type="uri" xlink:href="https://github.com/hitbio/PERGA">https://github.com/hitbio/PERGA</ext-link>
), 50x for
<italic>S.pombe</italic>
, 50x for human chromosome 14; and the real Illumina
<italic>E.coli</italic>
paired-end reads data were downloaded from
<ext-link ext-link-type="uri" xlink:href="http://bix.ucsd.edu/projects/singlecell/nbt_data.html">http://bix.ucsd.edu/projects/singlecell/nbt_data.html</ext-link>
(standard genomic DNA prepared from culture, with coverage around 600x); the real
<italic>S.pombe</italic>
data were downloaded from NCBI (SRA accession: ERX174934); and the real human chromosome 14 data were downloaded from
<ext-link ext-link-type="uri" xlink:href="http://gage.cbcb.umd.edu/data">http://gage.cbcb.umd.edu/data</ext-link>
. This dataset had already been error-corrected using Quake
<xref rid="pone.0114253-Kelley1" ref-type="bibr">[31]</xref>
by Salzberg et al.
<xref rid="pone.0114253-Salzberg1" ref-type="bibr">[32]</xref>
. We evaluated the performance of PERGA in resolving branches using the SVM prediction model and the look-ahead approach. We also compared the assembly performance of PERGA with that of other state-of-the-art assemblers, including IDBA-UD (v1.0.9)
<xref rid="pone.0114253-Peng2" ref-type="bibr">[28]</xref>
, ABySS (v1.3.2)
<xref rid="pone.0114253-Simpson2" ref-type="bibr">[26]</xref>
, Velvet (v1.2.01)
<xref rid="pone.0114253-Zerbino1" ref-type="bibr">[23]</xref>
, and the overlap-based assemblers SGA (v0.9.20)
<xref rid="pone.0114253-Simpson1" ref-type="bibr">[18]</xref>
, CABOG (v7.0)
<xref rid="pone.0114253-Miller1" ref-type="bibr">[16]</xref>
and MaSuRCA (v2.2.1)
<xref rid="pone.0114253-Zimin1" ref-type="bibr">[21]</xref>
.</p>
<table-wrap id="pone-0114253-t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t001</object-id>
<label>Table 1</label>
<caption>
<title>Datasets D1∼D8 for assemblies.</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t001-1" xlink:href="pone.0114253.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Datasets</td>
<td align="left" rowspan="1" colspan="1">D1</td>
<td align="left" rowspan="1" colspan="1">D2</td>
<td align="left" rowspan="1" colspan="1">D3</td>
<td align="left" rowspan="1" colspan="1">D4</td>
<td align="left" rowspan="1" colspan="1">D5</td>
<td align="left" rowspan="1" colspan="1">D6</td>
<td align="left" rowspan="1" colspan="1">D7</td>
<td align="left" rowspan="1" colspan="1">D8</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Organism</td>
<td colspan="4" align="left" rowspan="1">
<italic>E.coli</italic>
K12 MG1655</td>
<td colspan="2" align="left" rowspan="1">
<italic>S.pombe</italic>
972 h-</td>
<td colspan="2" align="left" rowspan="1">Human chr14</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Ref. size</td>
<td colspan="4" align="left" rowspan="1">4.64 Mbp</td>
<td colspan="2" align="left" rowspan="1">12.59 Mbp</td>
<td colspan="2" align="left" rowspan="1">88.29 Mbp</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Data type</td>
<td align="left" rowspan="1" colspan="1">simulated</td>
<td align="left" rowspan="1" colspan="1">simulated</td>
<td align="left" rowspan="1" colspan="1">simulated</td>
<td align="left" rowspan="1" colspan="1">real</td>
<td align="left" rowspan="1" colspan="1">simulated</td>
<td align="left" rowspan="1" colspan="1">real</td>
<td align="left" rowspan="1" colspan="1">simulated</td>
<td align="left" rowspan="1" colspan="1">real</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Read length</td>
<td align="left" rowspan="1" colspan="1">100 bp</td>
<td align="left" rowspan="1" colspan="1">100 bp</td>
<td align="left" rowspan="1" colspan="1">100 bp</td>
<td align="left" rowspan="1" colspan="1">100 bp</td>
<td align="left" rowspan="1" colspan="1">100 bp</td>
<td align="left" rowspan="1" colspan="1">100 bp</td>
<td align="left" rowspan="1" colspan="1">100 bp</td>
<td align="left" rowspan="1" colspan="1">101 bp</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">#Reads (million)</td>
<td align="left" rowspan="1" colspan="1">2×1.16</td>
<td align="left" rowspan="1" colspan="1">2×1.4</td>
<td align="left" rowspan="1" colspan="1">2×2.3</td>
<td align="left" rowspan="1" colspan="1">2×14.2</td>
<td align="left" rowspan="1" colspan="1">2×3.1</td>
<td align="left" rowspan="1" colspan="1">2×3.3</td>
<td align="left" rowspan="1" colspan="1">2×22.1</td>
<td align="left" rowspan="1" colspan="1">2×16.3</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Cov. depth</td>
<td align="left" rowspan="1" colspan="1">50×</td>
<td align="left" rowspan="1" colspan="1">60×</td>
<td align="left" rowspan="1" colspan="1">100×</td>
<td align="left" rowspan="1" colspan="1">600×</td>
<td align="left" rowspan="1" colspan="1">50×</td>
<td align="left" rowspan="1" colspan="1">52×</td>
<td align="left" rowspan="1" colspan="1">50×</td>
<td align="left" rowspan="1" colspan="1">40×</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Insert size (bp)</td>
<td align="left" rowspan="1" colspan="1">370±56</td>
<td align="left" rowspan="1" colspan="1">370±58</td>
<td align="left" rowspan="1" colspan="1">366±59</td>
<td align="left" rowspan="1" colspan="1">215±11</td>
<td align="left" rowspan="1" colspan="1">370±57</td>
<td align="left" rowspan="1" colspan="1">380±82</td>
<td align="left" rowspan="1" colspan="1">366±49</td>
<td align="left" rowspan="1" colspan="1">158±17</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt101">
<label></label>
<p>The RefSeq for
<italic>E.coli</italic>
K12 MG 1655 is NC_000913.2; the RefSeq for
<italic>S.pombe</italic>
972 h- are NC_003424.3, NC_003423.3, NC_003421.2, NC_001326.1; the refSeq for human chromosome 14 is NT_026437.12.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>To evaluate each assembler, we used N50 as the length metric, and we used BLASTN (v2.2.25+)
<xref rid="pone.0114253-Altschul1" ref-type="bibr">[33]</xref>
to align the contigs and scaffolds to the reference and assess their accuracy in terms of the reference coverage ratio and the number and lengths of mis-assemblies. If a contig (or scaffold) matches the reference in its entirety with similarity below 95%, it is considered a mis-assembled contig. Because Velvet is a scaffold-only assembler for paired-end data, we split its scaffolds at poly-N positions to obtain contigs for comparison. Note that repeats from different genomic regions are collapsed into a single copy, which BLASTN can align to more than one location or to disjoint locations; we deem all of those genomic locations to be covered by such repeats.</p>
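<p>For reference, the two length-related steps of this evaluation can be sketched as below: computing N50 from a list of sequence lengths, and splitting a scaffold at runs of 'N' to recover contigs. The function names n50 and split_scaffold are hypothetical helpers written for this illustration, not part of PERGA or Velvet, and N50 is computed here with its usual definition (the length of the shortest sequence in the smallest set of longest sequences that covers at least half of the total length).</p>
<preformat>
```python
import re


def n50(lengths):
    """Length L such that sequences of length L or more hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def split_scaffold(scaffold):
    """Split a scaffold sequence at runs of N (either case) and drop empty pieces."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]
```
</preformat>
<p>For example, contig lengths of 2, 3, 4, 5 and 6 kbp total 20 kbp, and the two longest contigs (6 and 5 kbp) already cover more than half of that, so N50 is 5 kbp.</p>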
<p>We first tested the performance of the SVM approach and the look-ahead approach. The results are shown in
<xref ref-type="table" rid="pone-0114253-t002">Tables 2</xref>
<xref ref-type="table" rid="pone-0114253-t003">3</xref>
. We then compared the performance of PERGA with that of other state-of-the-art assemblers; the main results are listed in
<xref ref-type="table" rid="pone-0114253-t004">Tables 4</xref>
<xref ref-type="table" rid="pone-0114253-t011">11</xref>
with details in Tables S1–S8 in
<xref ref-type="supplementary-material" rid="pone.0114253.s001">File S1</xref>
.</p>
<table-wrap id="pone-0114253-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t002</object-id>
<label>Table 2</label>
<caption>
<title>Statistical results for greedy-like prediction model.</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t002-2" xlink:href="pone.0114253.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Datasets</td>
<td align="left" rowspan="1" colspan="1">Correct extensions</td>
<td align="left" rowspan="1" colspan="1">Incorrect extensions</td>
<td align="left" rowspan="1" colspan="1">Correct stops</td>
<td align="left" rowspan="1" colspan="1">Incorrect stops</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">D1</td>
<td align="left" rowspan="1" colspan="1">70299 (99.70%)</td>
<td align="left" rowspan="1" colspan="1">60 (0.09%)</td>
<td align="left" rowspan="1" colspan="1">123 (0.17%)</td>
<td align="left" rowspan="1" colspan="1">26 (0.04%)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">D2</td>
<td align="left" rowspan="1" colspan="1">84829 (99.74%)</td>
<td align="left" rowspan="1" colspan="1">46 (0.05%)</td>
<td align="left" rowspan="1" colspan="1">148 (0.18%)</td>
<td align="left" rowspan="1" colspan="1">25 (0.03%)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">D3</td>
<td align="left" rowspan="1" colspan="1">136309 (99.82%)</td>
<td align="left" rowspan="1" colspan="1">48 (0.04%)</td>
<td align="left" rowspan="1" colspan="1">169 (0.12%)</td>
<td align="left" rowspan="1" colspan="1">27 (0.02%)</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t003</object-id>
<label>Table 3</label>
<caption>
<title>Statistical results for look-ahead approach.</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t003-3" xlink:href="pone.0114253.t003"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">checking sequencing errors</td>
<td colspan="3" align="left" rowspan="1">checking short repeats</td>
<td align="left" rowspan="1" colspan="1">Overall</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Datasets</td>
<td align="left" rowspan="1" colspan="1">Correct navi.</td>
<td align="left" rowspan="1" colspan="1">Incorrect navi.</td>
<td align="left" rowspan="1" colspan="1">Sum</td>
<td align="left" rowspan="1" colspan="1">Correct navi.</td>
<td align="left" rowspan="1" colspan="1">Incorrect navi.</td>
<td align="left" rowspan="1" colspan="1">Sum</td>
<td align="left" rowspan="1" colspan="1">Correct navi.</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">D1</td>
<td align="left" rowspan="1" colspan="1">448 (97.60%)</td>
<td align="left" rowspan="1" colspan="1">11 (2.40%)</td>
<td align="left" rowspan="1" colspan="1">459</td>
<td align="left" rowspan="1" colspan="1">174 (98.31%)</td>
<td align="left" rowspan="1" colspan="1">3 (1.69%)</td>
<td align="left" rowspan="1" colspan="1">177</td>
<td align="left" rowspan="1" colspan="1">445 (99.3%)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">D2</td>
<td align="left" rowspan="1" colspan="1">443 (98.44%)</td>
<td align="left" rowspan="1" colspan="1">7 (1.56%)</td>
<td align="left" rowspan="1" colspan="1">450</td>
<td align="left" rowspan="1" colspan="1">174 (99.43%)</td>
<td align="left" rowspan="1" colspan="1">1 (0.57%)</td>
<td align="left" rowspan="1" colspan="1">175</td>
<td align="left" rowspan="1" colspan="1">442 (99.8%)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">D3</td>
<td align="left" rowspan="1" colspan="1">485 (98.98%)</td>
<td align="left" rowspan="1" colspan="1">5 (1.02%)</td>
<td align="left" rowspan="1" colspan="1">490</td>
<td align="left" rowspan="1" colspan="1">223 (98.24%)</td>
<td align="left" rowspan="1" colspan="1">4 (1.76%)</td>
<td align="left" rowspan="1" colspan="1">227</td>
<td align="left" rowspan="1" colspan="1">481 (99.2%)</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t004</object-id>
<label>Table 4</label>
<caption>
<title>Evaluation for
<italic>E.coli</italic>
simulated short reads data (D1, 50×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t004-4" xlink:href="pone.0114253.t004"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">
<bold>174.7</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>100.0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>174.7</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>100.0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>3</bold>
</td>
<td align="left" rowspan="1" colspan="1">0.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">112.6</td>
<td align="left" rowspan="1" colspan="1">99.98</td>
<td align="left" rowspan="1" colspan="1">2/559</td>
<td align="left" rowspan="1" colspan="1">148.5</td>
<td align="left" rowspan="1" colspan="1">99.98</td>
<td align="left" rowspan="1" colspan="1">1/321</td>
<td align="left" rowspan="1" colspan="1">11</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.6</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">119.2</td>
<td align="left" rowspan="1" colspan="1">99.90</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">119.2</td>
<td align="left" rowspan="1" colspan="1">99.42</td>
<td align="left" rowspan="1" colspan="1">1/3617</td>
<td align="left" rowspan="1" colspan="1">9</td>
<td align="left" rowspan="1" colspan="1">1.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">108.1</td>
<td align="left" rowspan="1" colspan="1">99.76</td>
<td align="left" rowspan="1" colspan="1">7/6658</td>
<td align="left" rowspan="1" colspan="1">148.3</td>
<td align="left" rowspan="1" colspan="1">99.89</td>
<td align="left" rowspan="1" colspan="1">1/1596</td>
<td align="left" rowspan="1" colspan="1">
<bold>3</bold>
</td>
<td align="left" rowspan="1" colspan="1">0.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">24.1</td>
<td align="left" rowspan="1" colspan="1">98.57</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">95.5</td>
<td align="left" rowspan="1" colspan="1">98.59</td>
<td align="left" rowspan="1" colspan="1">1/4120</td>
<td align="left" rowspan="1" colspan="1">43</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.6</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">83.1</td>
<td align="left" rowspan="1" colspan="1">99.03</td>
<td align="left" rowspan="1" colspan="1">1/2638</td>
<td align="left" rowspan="1" colspan="1">88.5</td>
<td align="left" rowspan="1" colspan="1">99.03</td>
<td align="left" rowspan="1" colspan="1">1/2638</td>
<td align="left" rowspan="1" colspan="1">77</td>
<td align="left" rowspan="1" colspan="1">2.6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">172.8</td>
<td align="left" rowspan="1" colspan="1">87.98</td>
<td align="left" rowspan="1" colspan="1">4/560k</td>
<td align="left" rowspan="1" colspan="1">172.8</td>
<td align="left" rowspan="1" colspan="1">87.98</td>
<td align="left" rowspan="1" colspan="1">4/560k</td>
<td align="left" rowspan="1" colspan="1">16</td>
<td align="left" rowspan="1" colspan="1">2.2</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t005</object-id>
<label>Table 5</label>
<caption>
<title>Evaluation for
<italic>E.coli</italic>
simulated short reads data (D2, 60×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t005-5" xlink:href="pone.0114253.t005"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">
<bold>173.9</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>173.9</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>3</bold>
</td>
<td align="left" rowspan="1" colspan="1">1.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">124.6</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>173.9</bold>
</td>
<td align="left" rowspan="1" colspan="1">99.97</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">13</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.6</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">119.2</td>
<td align="left" rowspan="1" colspan="1">99.92</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">135.0</td>
<td align="left" rowspan="1" colspan="1">99.56</td>
<td align="left" rowspan="1" colspan="1">1/25k</td>
<td align="left" rowspan="1" colspan="1">10</td>
<td align="left" rowspan="1" colspan="1">1.1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">125.2</td>
<td align="left" rowspan="1" colspan="1">99.79</td>
<td align="left" rowspan="1" colspan="1">6/4451</td>
<td align="left" rowspan="1" colspan="1">148.5</td>
<td align="left" rowspan="1" colspan="1">99.87</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">1.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">23.5</td>
<td align="left" rowspan="1" colspan="1">98.35</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">95.4</td>
<td align="left" rowspan="1" colspan="1">98.48</td>
<td align="left" rowspan="1" colspan="1">1/492</td>
<td align="left" rowspan="1" colspan="1">50</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.6</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">68.4</td>
<td align="left" rowspan="1" colspan="1">98.72</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">77.1</td>
<td align="left" rowspan="1" colspan="1">98.64</td>
<td align="left" rowspan="1" colspan="1">1/4996</td>
<td align="left" rowspan="1" colspan="1">98</td>
<td align="left" rowspan="1" colspan="1">2.6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">156.4</td>
<td align="left" rowspan="1" colspan="1">94.25</td>
<td align="left" rowspan="1" colspan="1">2/257k</td>
<td align="left" rowspan="1" colspan="1">156.4</td>
<td align="left" rowspan="1" colspan="1">94.25</td>
<td align="left" rowspan="1" colspan="1">2/257k</td>
<td align="left" rowspan="1" colspan="1">19</td>
<td align="left" rowspan="1" colspan="1">2.2</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t006</object-id>
<label>Table 6</label>
<caption>
<title>Evaluation for
<italic>E. coli</italic>
simulated short reads data (D3, 100×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t006-6" xlink:href="pone.0114253.t006"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">
<bold>174.7</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>174.7</bold>
</td>
<td align="left" rowspan="1" colspan="1">99.99</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>5</bold>
</td>
<td align="left" rowspan="1" colspan="1">1.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">124.6</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">148.6</td>
<td align="left" rowspan="1" colspan="1">99.96</td>
<td align="left" rowspan="1" colspan="1">2/1723</td>
<td align="left" rowspan="1" colspan="1">21</td>
<td align="left" rowspan="1" colspan="1">0.7</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">126.2</td>
<td align="left" rowspan="1" colspan="1">99.90</td>
<td align="left" rowspan="1" colspan="1">1/524</td>
<td align="left" rowspan="1" colspan="1">135.0</td>
<td align="left" rowspan="1" colspan="1">90.89</td>
<td align="left" rowspan="1" colspan="1">4/206k</td>
<td align="left" rowspan="1" colspan="1">16</td>
<td align="left" rowspan="1" colspan="1">1.7</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">117.5</td>
<td align="left" rowspan="1" colspan="1">99.75</td>
<td align="left" rowspan="1" colspan="1">9/7347</td>
<td align="left" rowspan="1" colspan="1">148.5</td>
<td align="left" rowspan="1" colspan="1">
<bold>100.0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">7</td>
<td align="left" rowspan="1" colspan="1">1.4</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">21.7</td>
<td align="left" rowspan="1" colspan="1">98.16</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">105.6</td>
<td align="left" rowspan="1" colspan="1">98.46</td>
<td align="left" rowspan="1" colspan="1">2/1024</td>
<td align="left" rowspan="1" colspan="1">103</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.6</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">37.3</td>
<td align="left" rowspan="1" colspan="1">93.63</td>
<td align="left" rowspan="1" colspan="1">1/61k</td>
<td align="left" rowspan="1" colspan="1">56.7</td>
<td align="left" rowspan="1" colspan="1">93.56</td>
<td align="left" rowspan="1" colspan="1">1/65k</td>
<td align="left" rowspan="1" colspan="1">209</td>
<td align="left" rowspan="1" colspan="1">2.6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">148.8</td>
<td align="left" rowspan="1" colspan="1">98.89</td>
<td align="left" rowspan="1" colspan="1">2/54k</td>
<td align="left" rowspan="1" colspan="1">172.2</td>
<td align="left" rowspan="1" colspan="1">92.15</td>
<td align="left" rowspan="1" colspan="1">4/371k</td>
<td align="left" rowspan="1" colspan="1">29</td>
<td align="left" rowspan="1" colspan="1">2.4</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t007</object-id>
<label>Table 7</label>
<caption>
<title>Evaluation for
<italic>E. coli</italic>
real short reads data (D4, 600×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t007-7" xlink:href="pone.0114253.t007"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">
<bold>133.5</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">1/207</td>
<td align="left" rowspan="1" colspan="1">
<bold>154.8</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">1/207</td>
<td align="left" rowspan="1" colspan="1">
<bold>21</bold>
</td>
<td align="left" rowspan="1" colspan="1">3.8</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">106.8</td>
<td align="left" rowspan="1" colspan="1">99.93</td>
<td align="left" rowspan="1" colspan="1">1/2105</td>
<td align="left" rowspan="1" colspan="1">148.5</td>
<td align="left" rowspan="1" colspan="1">99.98</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">31</td>
<td align="left" rowspan="1" colspan="1">2.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">96.0</td>
<td align="left" rowspan="1" colspan="1">93.61</td>
<td align="left" rowspan="1" colspan="1">4/293k</td>
<td align="left" rowspan="1" colspan="1">113.4</td>
<td align="left" rowspan="1" colspan="1">91.45</td>
<td align="left" rowspan="1" colspan="1">5/372k</td>
<td align="left" rowspan="1" colspan="1">64</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.3</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">82.8</td>
<td align="left" rowspan="1" colspan="1">95.25</td>
<td align="left" rowspan="1" colspan="1">11/212k</td>
<td align="left" rowspan="1" colspan="1">95.5</td>
<td align="left" rowspan="1" colspan="1">86.21</td>
<td align="left" rowspan="1" colspan="1">5/633k</td>
<td align="left" rowspan="1" colspan="1">33</td>
<td align="left" rowspan="1" colspan="1">5.1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">19.3</td>
<td align="left" rowspan="1" colspan="1">98.06</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">21.3</td>
<td align="left" rowspan="1" colspan="1">98.15</td>
<td align="left" rowspan="1" colspan="1">1/411</td>
<td align="left" rowspan="1" colspan="1">357</td>
<td align="left" rowspan="1" colspan="1">5.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td colspan="9" align="left" rowspan="1">Could not be run correctly: it required more disk space than our machine had available</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">72.3</td>
<td align="left" rowspan="1" colspan="1">97.33</td>
<td align="left" rowspan="1" colspan="1">6/126k</td>
<td align="left" rowspan="1" colspan="1">77.6</td>
<td align="left" rowspan="1" colspan="1">97.29</td>
<td align="left" rowspan="1" colspan="1">7/129k</td>
<td align="left" rowspan="1" colspan="1">118</td>
<td align="left" rowspan="1" colspan="1">2.5</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t008" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t008</object-id>
<label>Table 8</label>
<caption>
<title>Evaluation for
<italic>S. pombe</italic>
simulated short reads data (D5, 50×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t008-8" xlink:href="pone.0114253.t008"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">255.4</td>
<td align="left" rowspan="1" colspan="1">99.91</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">386.7</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.90</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>8</bold>
</td>
<td align="left" rowspan="1" colspan="1">1.7</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">137.7</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.99</bold>
</td>
<td align="left" rowspan="1" colspan="1">3/966</td>
<td align="left" rowspan="1" colspan="1">254.7</td>
<td align="left" rowspan="1" colspan="1">99.08</td>
<td align="left" rowspan="1" colspan="1">6/100k</td>
<td align="left" rowspan="1" colspan="1">31</td>
<td align="left" rowspan="1" colspan="1">
<bold>1.3</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">181.8</td>
<td align="left" rowspan="1" colspan="1">99.78</td>
<td align="left" rowspan="1" colspan="1">12/9k</td>
<td align="left" rowspan="1" colspan="1">211.0</td>
<td align="left" rowspan="1" colspan="1">76.72</td>
<td align="left" rowspan="1" colspan="1">21/2.7M</td>
<td align="left" rowspan="1" colspan="1">25</td>
<td align="left" rowspan="1" colspan="1">1.6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">158.6</td>
<td align="left" rowspan="1" colspan="1">99.74</td>
<td align="left" rowspan="1" colspan="1">15/6k</td>
<td align="left" rowspan="1" colspan="1">293.2</td>
<td align="left" rowspan="1" colspan="1">99.74</td>
<td align="left" rowspan="1" colspan="1">8/8k</td>
<td align="left" rowspan="1" colspan="1">11</td>
<td align="left" rowspan="1" colspan="1">1.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">43.0</td>
<td align="left" rowspan="1" colspan="1">98.12</td>
<td align="left" rowspan="1" colspan="1">1/214</td>
<td align="left" rowspan="1" colspan="1">155.1</td>
<td align="left" rowspan="1" colspan="1">98.95</td>
<td align="left" rowspan="1" colspan="1">4/27k</td>
<td align="left" rowspan="1" colspan="1">103</td>
<td align="left" rowspan="1" colspan="1">2.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">139.6</td>
<td align="left" rowspan="1" colspan="1">95.24</td>
<td align="left" rowspan="1" colspan="1">3/218k</td>
<td align="left" rowspan="1" colspan="1">157.1</td>
<td align="left" rowspan="1" colspan="1">90.88</td>
<td align="left" rowspan="1" colspan="1">6/778k</td>
<td align="left" rowspan="1" colspan="1">243</td>
<td align="left" rowspan="1" colspan="1">2.5</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">
<bold>417.9</bold>
</td>
<td align="left" rowspan="1" colspan="1">90.76</td>
<td align="left" rowspan="1" colspan="1">6/1.3M</td>
<td align="left" rowspan="1" colspan="1">
<bold>417.9</bold>
</td>
<td align="left" rowspan="1" colspan="1">90.76</td>
<td align="left" rowspan="1" colspan="1">6/1.3M</td>
<td align="left" rowspan="1" colspan="1">43</td>
<td align="left" rowspan="1" colspan="1">2.7</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t009" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t009</object-id>
<label>Table 9</label>
<caption>
<title>Evaluation for
<italic>S. pombe</italic>
real short reads data (D6, 52×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t009-9" xlink:href="pone.0114253.t009"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">
<bold>37.0</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>98.97</bold>
</td>
<td align="left" rowspan="1" colspan="1">17/71k</td>
<td align="left" rowspan="1" colspan="1">
<bold>70.3</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>98.97</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>17/73k</bold>
</td>
<td align="left" rowspan="1" colspan="1">7</td>
<td align="left" rowspan="1" colspan="1">1.7</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">32.1</td>
<td align="left" rowspan="1" colspan="1">98.54</td>
<td align="left" rowspan="1" colspan="1">28/140k</td>
<td align="left" rowspan="1" colspan="1">54.0</td>
<td align="left" rowspan="1" colspan="1">97.56</td>
<td align="left" rowspan="1" colspan="1">35/247k</td>
<td align="left" rowspan="1" colspan="1">31</td>
<td align="left" rowspan="1" colspan="1">1.3</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">33.3</td>
<td align="left" rowspan="1" colspan="1">98.20</td>
<td align="left" rowspan="1" colspan="1">44/73k</td>
<td align="left" rowspan="1" colspan="1">35.7</td>
<td align="left" rowspan="1" colspan="1">96.29</td>
<td align="left" rowspan="1" colspan="1">48/230k</td>
<td align="left" rowspan="1" colspan="1">21</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.8</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">28.7</td>
<td align="left" rowspan="1" colspan="1">97.36</td>
<td align="left" rowspan="1" colspan="1">26/175k</td>
<td align="left" rowspan="1" colspan="1">42.3</td>
<td align="left" rowspan="1" colspan="1">95.86</td>
<td align="left" rowspan="1" colspan="1">29/391k</td>
<td align="left" rowspan="1" colspan="1">
<bold>6</bold>
</td>
<td align="left" rowspan="1" colspan="1">3.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">21.3</td>
<td align="left" rowspan="1" colspan="1">97.10</td>
<td align="left" rowspan="1" colspan="1">
<bold>16/38k</bold>
</td>
<td align="left" rowspan="1" colspan="1">39.1</td>
<td align="left" rowspan="1" colspan="1">97.32</td>
<td align="left" rowspan="1" colspan="1">21/100k</td>
<td align="left" rowspan="1" colspan="1">114</td>
<td align="left" rowspan="1" colspan="1">2.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">22.3</td>
<td align="left" rowspan="1" colspan="1">95.12</td>
<td align="left" rowspan="1" colspan="1">8/71k</td>
<td align="left" rowspan="1" colspan="1">49.4</td>
<td align="left" rowspan="1" colspan="1">98.95</td>
<td align="left" rowspan="1" colspan="1">11/396k</td>
<td align="left" rowspan="1" colspan="1">705</td>
<td align="left" rowspan="1" colspan="1">7.3</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">36.2</td>
<td align="left" rowspan="1" colspan="1">97.0</td>
<td align="left" rowspan="1" colspan="1">17/210k</td>
<td align="left" rowspan="1" colspan="1">64.7</td>
<td align="left" rowspan="1" colspan="1">93.71</td>
<td align="left" rowspan="1" colspan="1">20/672k</td>
<td align="left" rowspan="1" colspan="1">70</td>
<td align="left" rowspan="1" colspan="1">2.7</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t010" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t010</object-id>
<label>Table 10</label>
<caption>
<title>Evaluation for human chromosome 14 simulated short reads data (D7, 50×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t010-10" xlink:href="pone.0114253.t010"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">
<bold>149.9</bold>
</td>
<td align="left" rowspan="1" colspan="1">99.54</td>
<td align="left" rowspan="1" colspan="1">22/60k</td>
<td align="left" rowspan="1" colspan="1">
<bold>229.5</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.58</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>21/30k</bold>
</td>
<td align="left" rowspan="1" colspan="1">169</td>
<td align="left" rowspan="1" colspan="1">9.3</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">66.7</td>
<td align="left" rowspan="1" colspan="1">
<bold>99.74</bold>
</td>
<td align="left" rowspan="1" colspan="1">44/152k</td>
<td align="left" rowspan="1" colspan="1">174.3</td>
<td align="left" rowspan="1" colspan="1">98.51</td>
<td align="left" rowspan="1" colspan="1">58/1.24M</td>
<td align="left" rowspan="1" colspan="1">
<bold>144</bold>
</td>
<td align="left" rowspan="1" colspan="1">8.7</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">11.4</td>
<td align="left" rowspan="1" colspan="1">94.98</td>
<td align="left" rowspan="1" colspan="1">377/354k</td>
<td align="left" rowspan="1" colspan="1">30.2</td>
<td align="left" rowspan="1" colspan="1">83.40</td>
<td align="left" rowspan="1" colspan="1">1109/11M</td>
<td align="left" rowspan="1" colspan="1">331</td>
<td align="left" rowspan="1" colspan="1">
<bold>6.0</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">8.6</td>
<td align="left" rowspan="1" colspan="1">92.24</td>
<td align="left" rowspan="1" colspan="1">1642/3.4M</td>
<td align="left" rowspan="1" colspan="1">78.6</td>
<td align="left" rowspan="1" colspan="1">24.87</td>
<td align="left" rowspan="1" colspan="1">1655/67M</td>
<td align="left" rowspan="1" colspan="1">147</td>
<td align="left" rowspan="1" colspan="1">13.5</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">2.7</td>
<td align="left" rowspan="1" colspan="1">85.36</td>
<td align="left" rowspan="1" colspan="1">
<bold>146/43k</bold>
</td>
<td align="left" rowspan="1" colspan="1">5.5</td>
<td align="left" rowspan="1" colspan="1">80.10</td>
<td align="left" rowspan="1" colspan="1">1980/5M</td>
<td align="left" rowspan="1" colspan="1">1360</td>
<td align="left" rowspan="1" colspan="1">17</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">69.3</td>
<td align="left" rowspan="1" colspan="1">76.43</td>
<td align="left" rowspan="1" colspan="1">285/19M</td>
<td align="left" rowspan="1" colspan="1">82.8</td>
<td align="left" rowspan="1" colspan="1">67.40</td>
<td align="left" rowspan="1" colspan="1">318/26.6M</td>
<td align="left" rowspan="1" colspan="1">2742</td>
<td align="left" rowspan="1" colspan="1">11</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td colspan="9" align="left" rowspan="1">Could not be run correctly because of an unknown runtime error</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0114253-t011" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114253.t011</object-id>
<label>Table 11</label>
<caption>
<title>Evaluation for human chromosome 14 real short reads data (D8, 40×).</title>
</caption>
<alternatives>
<graphic id="pone-0114253-t011-11" xlink:href="pone.0114253.t011"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">Contigs</td>
<td colspan="3" align="left" rowspan="1">Scaffolds</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
/
<italic>O</italic>
</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">N50 (kbp)</td>
<td align="left" rowspan="1" colspan="1">Cov. (%)</td>
<td align="left" rowspan="1" colspan="1">Misass. (#/sum)</td>
<td align="left" rowspan="1" colspan="1">Time (min)</td>
<td align="left" rowspan="1" colspan="1">Mem. (GB)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">PERGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧25</td>
<td align="left" rowspan="1" colspan="1">11.8</td>
<td align="left" rowspan="1" colspan="1">
<bold>95.86</bold>
</td>
<td align="left" rowspan="1" colspan="1">435/2.8M</td>
<td align="left" rowspan="1" colspan="1">20.2</td>
<td align="left" rowspan="1" colspan="1">91.96</td>
<td align="left" rowspan="1" colspan="1">423/6.1M</td>
<td align="left" rowspan="1" colspan="1">194</td>
<td align="left" rowspan="1" colspan="1">7.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">IDBA-UD</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">
<bold>16.3</bold>
</td>
<td align="left" rowspan="1" colspan="1">94.81</td>
<td align="left" rowspan="1" colspan="1">351/4.1M</td>
<td align="left" rowspan="1" colspan="1">
<bold>21.8</bold>
</td>
<td align="left" rowspan="1" colspan="1">92.94</td>
<td align="left" rowspan="1" colspan="1">335/5.8M</td>
<td align="left" rowspan="1" colspan="1">122</td>
<td align="left" rowspan="1" colspan="1">8.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ABySS</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">3.9</td>
<td align="left" rowspan="1" colspan="1">92.06</td>
<td align="left" rowspan="1" colspan="1">485/574k</td>
<td align="left" rowspan="1" colspan="1">4.1</td>
<td align="left" rowspan="1" colspan="1">91.61</td>
<td align="left" rowspan="1" colspan="1">547/975k</td>
<td align="left" rowspan="1" colspan="1">161</td>
<td align="left" rowspan="1" colspan="1">6.4</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Velvet</td>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
 = 45</td>
<td align="left" rowspan="1" colspan="1">3.8</td>
<td align="left" rowspan="1" colspan="1">89.62</td>
<td align="left" rowspan="1" colspan="1">1199/2.8M</td>
<td align="left" rowspan="1" colspan="1">6.6</td>
<td align="left" rowspan="1" colspan="1">70.44</td>
<td align="left" rowspan="1" colspan="1">4342/21M</td>
<td align="left" rowspan="1" colspan="1">
<bold>68</bold>
</td>
<td align="left" rowspan="1" colspan="1">6.5</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SGA</td>
<td align="left" rowspan="1" colspan="1">
<italic>O</italic>
≧31</td>
<td align="left" rowspan="1" colspan="1">2.4</td>
<td align="left" rowspan="1" colspan="1">84.36</td>
<td align="left" rowspan="1" colspan="1">
<bold>255/95k</bold>
</td>
<td align="left" rowspan="1" colspan="1">2.7</td>
<td align="left" rowspan="1" colspan="1">84.21</td>
<td align="left" rowspan="1" colspan="1">247/1.5M</td>
<td align="left" rowspan="1" colspan="1">826</td>
<td align="left" rowspan="1" colspan="1">16</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CABOG</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">13.0</td>
<td align="left" rowspan="1" colspan="1">87.65</td>
<td align="left" rowspan="1" colspan="1">527/7.3M</td>
<td align="left" rowspan="1" colspan="1">20.6</td>
<td align="left" rowspan="1" colspan="1">82.07</td>
<td align="left" rowspan="1" colspan="1">560/13M</td>
<td align="left" rowspan="1" colspan="1">1757</td>
<td align="left" rowspan="1" colspan="1">10</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MaSuRCA</td>
<td align="left" rowspan="1" colspan="1">default</td>
<td align="left" rowspan="1" colspan="1">6.8</td>
<td align="left" rowspan="1" colspan="1">95.38</td>
<td align="left" rowspan="1" colspan="1">165/616k</td>
<td align="left" rowspan="1" colspan="1">6.9</td>
<td align="left" rowspan="1" colspan="1">
<bold>95.28</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>170/710k</bold>
</td>
<td align="left" rowspan="1" colspan="1">478</td>
<td align="left" rowspan="1" colspan="1">
<bold>2.7</bold>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>Except for CABOG, the experiments on the simulated reads data were carried out on a 64-bit Linux machine with a 2.53-GHz Intel(R) Core-2 CPU and 3 GB of memory. The experiments for CABOG and for the real reads data were carried out on a server with a 2.00-GHz Intel(R) Xeon(R) Core-8 CPU and 24 GB of memory.</p>
</sec>
</sec>
<sec id="s3">
<title>Results and Discussion</title>
<sec id="s3a">
<title>Performance of greedy-like prediction model</title>
<p>The performance of our greedy-like SVM prediction model was assessed by counting the numbers of correctly and incorrectly predicted extensions and stops for all branches during the assembly step. To evaluate the greedy-like prediction model independently, the statistics were calculated with the look-ahead approach disabled. The results are shown in
<xref ref-type="table" rid="pone-0114253-t002">Table 2</xref>
. With decision models built by a machine learning approach, PERGA determines the correct extension in 99.7% of cases for the simulated reads data D1∼D3. PERGA also identifies the stop cases in which both candidate extensions are likely to be correct. It produces only a few incorrect extensions and incorrect stops (less than 0.1%).</p>
</sec>
<sec id="s3b">
<title>Performance of look-ahead approach</title>
<p>Although the SVM prediction model performs well, PERGA still makes some incorrect extensions and stops when its predictions have low confidence. These low-confidence predictions can be resolved by the look-ahead approach.
<xref ref-type="table" rid="pone-0114253-t003">Table 3</xref>
shows the numbers of correct and incorrect navigations for low-confidence branches when this approach is applied to resolve branches caused by sequencing errors and short tandem repeats on datasets D1∼D3. The statistics for sequencing errors were calculated independently, without considering short repeats. Low-confidence branches that did not satisfy the properties of branches caused by sequencing errors were further checked to determine whether they were caused by short tandem repeats.</p>
<p>Most of the branches caused by sequencing errors are resolved correctly (about 98%), with a low error rate (about 2%), which allows PERGA to generate long and accurate contigs. According to
<xref ref-type="table" rid="pone-0114253-t002">Table 2</xref>
and
<xref ref-type="table" rid="pone-0114253-t003">Table 3</xref>
, less than 1% of the branches (D1∼D3) are adjusted by this approach. Because the look-ahead approach is very effective, mis-predicted branches can be handled easily.</p>
<p>The table shows that about one third of the branches do not satisfy the properties of branches caused by sequencing errors. These branches are further checked against the properties of branches caused by short repeats. Among them, there were only a few incorrect navigations (less than 2%). Thus, after dealing with both sequencing errors and short tandem repeats, the look-ahead approach navigated correctly in more than 99% of cases.</p>
<p>In summary, the greedy-like prediction model resolved most (>99%) of the branches, and the look-ahead approach then resolved most (>99%) of the remaining low-confidence branches. Therefore, by combining the greedy-like prediction model with the look-ahead approach, PERGA can produce long and accurate contigs.</p>
</sec>
<sec id="s3c">
<title>Performance on
<italic>E. coli</italic>
genome</title>
<p>
<xref ref-type="table" rid="pone-0114253-t004">Table 4</xref>
shows the performance of PERGA and the other assemblers on the 50x simulated paired-end reads dataset D1. PERGA generated the longest contigs by N50, the highest reference coverage, and the most accurate result, with no mis-assemblies. PERGA and Velvet were the fastest of the seven assemblers, with moderate memory usage, and were about 3 times faster than IDBA-UD and ABySS. SGA uses the FM-index to compute read overlaps and CABOG computes pairwise overlaps between reads, so they took more time to assemble (43 minutes and 77 minutes). MaSuRCA generated 4 mis-assembled contigs (560 kbp), whereas PERGA generated none. This is because PERGA handles branches much more carefully during extension: it uses the greedy-like SVM prediction models, which incorporate extensive branch information, to choose better extensions, and it uses the look-ahead approach to distinguish sequencing errors from repeats at branches and decide the correct extension. Thus, PERGA can provide fewer but longer contigs than the others without producing erroneous contigs.</p>
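<p>Because N50 is the main length measure used in the comparisons that follow, a minimal sketch of how it is commonly computed may help. The sketch assumes the standard definition (the largest length L such that contigs of length at least L together cover at least half of the total assembled length); the evaluation in this article may use a variant, for example one based on the reference genome size. The function name and the example lengths are illustrative and are not taken from PERGA.</p>
<preformat>
```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    # Walk through the contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half of the
        # total assembled length defines the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths in kbp; the total is 290, half is 145.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```
</preformat>
<p>With these illustrative lengths, the two longest contigs (80 and 70) already cover more than half of the total, so the N50 is 70. A few long contigs can therefore dominate the measure, which is why N50 is reported together with coverage and mis-assembly counts in the tables.</p>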
<p>When the number of reads in the dataset increases, the running times of all the other assemblers increase (
<xref ref-type="table" rid="pone-0114253-t005">Table 5</xref>
for dataset D2). However, the running time of PERGA does not increase significantly, and it remains the fastest assembler. The contigs and scaffolds produced by PERGA have the largest N50 and coverage, with no mis-assemblies. Because of the increase in sequencing depth, IDBA-UD and Velvet performed better than on D1, with contig N50 increasing from 112.6 kbp and 108.1 kbp to 124.6 kbp and 125.2 kbp, respectively, and with fewer assembly errors, whereas MaSuRCA had a lower N50 size than on D1. PERGA and IDBA-UD had the largest scaffold sizes (173.9 kbp), whereas the results of the other assemblers were much shorter (only around 140 kbp). MaSuRCA tended to produce more erroneous contigs and scaffolds, which decreased its genome coverage. Compared with the other assemblers, SGA and CABOG produced shorter contigs and scaffolds with longer running times. The coverage of all assemblers on D1 and D2 is much the same and differs little between contig construction and scaffolding.</p>
<p>For the simulated 100x dataset D3 (
<xref ref-type="table" rid="pone-0114253-t006">Table 6</xref>
), since the sequencing depth is high, all assemblers generated similar numbers of contigs with similar coverage in both contigs and scaffolds, except ABySS and MaSuRCA, whose coverage dropped from 99.90% to 90.89% and from 98.89% to 92.15%, respectively, because each produced 4 mis-assembled scaffolds (206 kbp for ABySS and 371 kbp for MaSuRCA). MaSuRCA had the most erroneous scaffolds. CABOG generated 1 mis-assembled contig (61 kbp) and 1 mis-assembled scaffold (65 kbp), so its genome coverage dropped to 93.5%. The experiments on D1∼D3 suggest that CABOG may not be suitable for high-coverage data, since its contig (scaffold) sizes decreased with increasing coverage depth, and more generally that the overlap-based assemblers (SGA and CABOG) are not well suited to high-coverage data. PERGA was again the fastest assembler and produced the most accurate results, while the others all produced several mis-assembled contigs (scaffolds). In all experiments on simulated data, PERGA did not produce any mis-assembled contigs or scaffolds, while the other assemblers mis-assembled some reads in some datasets.</p>
<p>We further used the downloaded
<italic>E. coli</italic>
dataset D4, with coverage ∼600x, to highlight the performance of PERGA on high-coverage data and compared its performance with that of the other assemblers. CABOG could not be run on D4 because it required more disk space than the server provided. Before assembly, the paired-end reads were corrected using Quake
<xref rid="pone.0114253-Kelley1" ref-type="bibr">[31]</xref>
, and the results are shown in
<xref ref-type="table" rid="pone-0114253-t007">Table 7</xref>
.</p>
<p>The overall performance of all the other assemblers dropped because the insert size decreased from 370 bp to 215 bp, so some repeats with lengths in this range could not be resolved. PERGA still had the best performance in N50 size, maximal size, and genome coverage, which suggests that the SVM model used by PERGA can capture the properties of real datasets. PERGA was the fastest assembler and generated the longest contigs (scaffolds) (133.5 kbp and 154.8 kbp) with the highest coverage (99.99%), while the other assemblers had much lower N50 sizes, except for the scaffolds of IDBA-UD (148.5 kbp). Since PERGA generated very long and accurate contigs, its scaffolds had the largest N50 and the highest coverage even though it did not connect many contigs during scaffolding. As the sequencing depth increased from 50x to 600x, the contig (scaffold) N50 size of MaSuRCA decreased from 172 kbp to 77 kbp and its number of contigs (scaffolds) increased from 70 to 240, which may indicate that MaSuRCA is not suitable for high-coverage datasets.</p>
<p>After scaffolding, ABySS and Velvet produced longer scaffolds with lower coverage, while PERGA and IDBA-UD showed no difference in coverage between contigs and scaffolds because they produced accurate assemblies. ABySS and Velvet both had more than 200 kbp of mis-assembled contigs and more than 300 kbp of mis-assembled scaffolds, so their contig coverage dropped dramatically from 96% to 86%. MaSuRCA also generated a few erroneous contigs and scaffolds (>100 kbp). SGA generated accurate contigs and scaffolds, but its N50 sizes were very small (19.3 kbp and 21.3 kbp) and it took more time than the others. This shows that ABySS, Velvet, SGA and MaSuRCA might not be suitable for high-coverage sequencing data: they can perform well on low-coverage data but may not do so on high-coverage data.</p>
</sec>
<sec id="s3d">
<title>Performance on
<italic>S. pombe</italic>
genome</title>
<p>We also tested the performance of PERGA on the
<italic>S. pombe</italic>
50x simulated dataset D5 and 52x real dataset D6. The reads of the real
<italic>S. pombe</italic>
dataset were error-corrected using Quake
<xref rid="pone.0114253-Kelley1" ref-type="bibr">[31]</xref>
prior to assembly. The results are shown in
<xref ref-type="table" rid="pone-0114253-t008">Tables 8</xref>
and
<xref ref-type="table" rid="pone-0114253-t009">9</xref>
.</p>
<p>From
<xref ref-type="table" rid="pone-0114253-t008">Table 8</xref>
, MaSuRCA, PERGA and Velvet were the top three assemblers in scaffold N50 size (417.9 kbp, 386.7 kbp and 293.2 kbp, respectively), whereas the scaffold N50 sizes of the other assemblers were all less than 300 kbp, with a few assembly errors. However, the genome coverage of MaSuRCA was only 90.76% because of its 6 large mis-assembled scaffolds (1.3 Mbp). ABySS generated more mis-assembled scaffolds than the others (2.7 Mbp), so its genome coverage dropped from 99.78% to 76.7%. IDBA-UD generated short contigs with fewer mis-assemblies than ABySS and Velvet, but its scaffolds had more errors than those of Velvet. SGA had the largest number of contigs and scaffolds, and the shortest (N50 size 43.0 kbp and 155.1 kbp), and CABOG generated short scaffolds (N50 size 157.1 kbp) with a few errors (778 kbp). Overall, PERGA outperformed the other assemblers on D5 in assembly length, accuracy, coverage and running time.</p>
<p>The results for the
<italic>S. pombe</italic>
real dataset D6 are listed in
<xref ref-type="table" rid="pone-0114253-t009">Table 9</xref>
. According to the table, all assemblers produced similar results in terms of length, accuracy and coverage. PERGA generated the largest contigs and scaffolds (N50 size 37.0 kbp and 70.3 kbp) with the highest genome coverage (98.97%) and fewer assembly errors (73 kbp). PERGA, MaSuRCA and IDBA-UD were the top three assemblers in scaffold N50 size (>54 kbp), whereas the N50 sizes of the others were less than 50 kbp. PERGA and IDBA-UD generated the largest scaffold sizes (334.7 kbp and 270.3 kbp), and MaSuRCA was more error-prone, producing more erroneous contigs and scaffolds (210 kbp and 672 kbp). Velvet and PERGA were much faster than the other assemblers; however, Velvet produced more errors (175 kbp and 239 kbp) with a high memory cost (3.9 GB). SGA and CABOG needed more time than the others, and CABOG required the most time and memory of all the assemblers, whereas ABySS used the least memory (0.8 GB).</p>
<p>In summary, PERGA generated better results than the other assemblers for both the simulated and real
<italic>S. pombe</italic>
datasets D5∼D6, which indicates that the SVM model and the look-ahead approach are suitable for assembling other genomes and yield good performance.</p>
</sec>
<sec id="s3e">
<title>Performance on human chromosome 14</title>
<p>To further evaluate PERGA, we used the human chromosome 14 simulated 50x dataset D7 and real 40x dataset D8. MaSuRCA could not be run correctly on the simulated dataset because of an unknown runtime error. The results for the assemblers are shown in
<xref ref-type="table" rid="pone-0114253-t010">Tables 10</xref>
and
<xref ref-type="table" rid="pone-0114253-t011">11</xref>
.</p>
<p>From
<xref ref-type="table" rid="pone-0114253-t010">Table 10</xref>
, PERGA generated the fewest contigs, with the largest contigs (N50 size 149.9 kbp, maximal size 1 Mbp, mean size 36 kbp) and the largest scaffolds (N50 size 229.5 kbp, maximal size 1 Mbp, mean size 39 kbp), and with fewer mis-assemblies than the others. PERGA and IDBA-UD were the top two assemblers in scaffold N50 size (230 kbp and 174 kbp), maximal size (about 1 Mbp) and accuracy (mis-assembled scaffolds 30 kbp and 1.2 Mbp), whereas the scaffold N50 size and maximal size of the others were less than 90 kbp and less than 500 kbp, respectively, and they generated more mis-assembled scaffolds (>5 Mbp) than PERGA and IDBA-UD. IDBA-UD was the fastest assembler, and it generated much longer and more accurate results than ABySS and Velvet. Velvet, CABOG and ABySS generated the least accurate results, and their contigs and scaffolds were also short; the genome coverage of Velvet dropped sharply from 92.24% to 24.87% because of its many mis-assembled scaffolds (67 Mbp). SGA generated the largest number of scaffolds (32077) with the shortest length (N50 size 5.5 kbp), and about 10 Mbp of the genome was missing, so its genome coverage was no more than 86%.</p>
<p>From
<xref ref-type="table" rid="pone-0114253-t011">Table 11</xref>
, PERGA generated large contigs and scaffolds (N50 size 11.8 kbp and 20.2 kbp) with high genome coverage (95.86% and 91.96%) and fewer mis-assemblies (2.8 Mbp and 6.1 Mbp), because PERGA tries to extend contigs more accurately. IDBA-UD, CABOG and PERGA were the top three assemblers in terms of N50 size (>10 kbp and >20 kbp for contigs and scaffolds, respectively) and maximal size (>100 kbp and >140 kbp for contigs and scaffolds, respectively); however, CABOG produced more errors (7.3 Mbp and 13 Mbp for contigs and scaffolds, respectively) than IDBA-UD and PERGA. The N50 sizes of the other assemblers were less than 7 kbp for both contigs and scaffolds. MaSuRCA, ABySS and SGA generated the most accurate results; however, their assemblies were short (scaffold N50 size less than 7 kbp). In addition, the total assembly length of SGA was only 85% (75 Mbp) of the reference (88 Mbp); about 10 Mbp (10%) of the genome was missing, which decreased its genome coverage (less than 85%). Velvet generated short (N50 size 3.8 kbp) but accurate contigs (erroneous contigs 2.8 Mbp); however, its scaffolds were also short (6.6 kbp) and contained many more errors (21 Mbp), so its genome coverage dropped dramatically from 89% to 70%.</p>
<p>In the experiments on D1∼D8, PERGA generally ran faster than the other assemblers, for several reasons. First, the
<italic>k</italic>
-mer hash table enabled fast alignment of reads to contigs during assembly. Second, reads from other genome regions that aligned to a contig over only a few bases were excluded from assembly, which reduced the computation spent on spurious overlaps. Third, PERGA adopted the variable overlap size approach and gave paired-end reads the highest priority, which reduced the computation and sped up the algorithm. Together, these factors account for the speed of PERGA.</p>
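<p>The first point can be illustrated with a minimal sketch of a k-mer hash index used to find reads that overlap the end of a contig. This is an assumption about how such a lookup might work, not the implementation in PERGA: the function names, the exact-match overlap test and the toy reads are all hypothetical, and a real assembler would also handle reverse complements, mismatches and paired-end constraints.</p>
<preformat>
```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map every k-mer to the (read_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def reads_extending(contig, reads, index, k, min_overlap):
    """Return (read_id, next_base) for reads that overlap the contig end.

    A read qualifies when its prefix matches the contig suffix exactly,
    the overlap is at least min_overlap bases, and the read continues
    past the contig end so that it proposes a next base.
    """
    tail_kmer = contig[-k:]
    hits = []
    # One hash lookup replaces a scan over all reads.
    for read_id, offset in index.get(tail_kmer, []):
        read = reads[read_id]
        overlap = offset + k  # read bases aligned to the contig end
        if overlap == len(read):
            continue  # the read ends with the contig; nothing to extend
        if overlap >= min_overlap and contig.endswith(read[:overlap]):
            hits.append((read_id, read[overlap]))
    return hits


# Toy data (hypothetical): only the second read extends the contig.
toy_reads = ["ACGTAC", "GTACGG", "TTTTAC"]
toy_index = build_kmer_index(toy_reads, 3)
print(reads_extending("AACGTAC", toy_reads, toy_index, 3, 4))
```
</preformat>
<p>In this sketch, the minimum-overlap test plays the role of the second point above: a read from another region that shares only a few bases with the contig end is rejected before any further work is done on it.</p>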
<p>In summary, PERGA outperformed most of the other assemblers, producing longer and more accurate assemblies, as genome size increased from small bacterial genomes (e.g.
<italic>E. coli</italic>
) to large human chromosomes (e.g. chromosome 14). IDBA-UD was the second-best assembler, with similar performance, whereas other assemblers (e.g. Velvet, SGA, and ABySS) might be more suitable for small bacterial genomes than for large genomes.</p>
<p>The experiments showed that the greedy-like prediction extension strategy performs better than graph-based assemblers because it uses the SVM prediction model to eliminate sequencing errors and the look-ahead approach to deal with sequencing errors and resolve short tandem repeats in the genome, thus yielding longer and more accurate assemblies.</p>
</sec>
</sec>
<sec id="s4">
<title>Conclusions</title>
<p>In this article, we present PERGA, a novel
<italic>de novo</italic>
paired-end read assembler that generates large and accurate assemblies by using a greedy-like prediction strategy to handle branches and errors and choose better extensions. Using the look-ahead approach, PERGA accurately distinguishes sequencing errors from repeats and separates different copies of short repeats, making extensions longer and more accurate. Moreover, instead of using single-end reads to construct contigs, PERGA uses paired-end reads in the first step and gives different priorities to different read overlap thresholds, ranging from
<italic>O</italic>
<sub>max</sub>
to
<italic>O</italic>
<sub>min</sub>
to resolve the gap and branch problems. Experiments showed that, compared with existing methods, PERGA generates very long and accurate contigs and scaffolds with fewer mis-assembly errors on both simulated and real data, at both low and high coverage, and on both small microbial genomes (e.g.,
<italic>E. coli</italic>
and
<italic>S. pombe</italic>
) and large complex genomes (e.g., human chromosome 14).</p>
</sec>
<sec sec-type="supplementary-material" id="s5">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0114253.s001">
<label>File S1</label>
<caption>
<p>
<bold>Example of tandem repeats in human chromosome and detailed assembly results.</bold>
The reference region 58,287,977–58,288,418 (region size 442 bp) of human chromosome 14 consists of three complex repeats, A, B and C: A appears three times, B appears four times and C appears five times; A contains B as a sub-repeat, and B contains C as a sub-repeat. PERGA correctly resolves this repeat region, whereas the other assemblers fail.</p>
<p>(DOC)</p>
</caption>
<media xlink:href="pone.0114253.s001.doc">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0114253.s002">
<label>File S2</label>
<caption>
<p>
<bold>The detailed view of resolving tandem repeats in human chromosome 14 by PERGA.</bold>
</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0114253.s002.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>The authors thank Qinghua Jiang and Yongdong Xu at Harbin Institute of Technology for informative suggestions. We also thank the reviewers for their constructive comments.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0114253-Shendure1">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>Shendure</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Porreca</surname>
<given-names>GJ</given-names>
</name>
,
<name>
<surname>Reppas</surname>
<given-names>NB</given-names>
</name>
,
<name>
<surname>Lin</surname>
<given-names>XX</given-names>
</name>
,
<name>
<surname>McCutcheon</surname>
<given-names>JP</given-names>
</name>
,
<etal>et al</etal>
(
<year>2005</year>
)
<article-title>Accurate multiplex polony sequencing of an evolved bacterial genome</article-title>
.
<source>Science</source>
<volume>309</volume>
:
<fpage>1728</fpage>
<lpage>1732</lpage>
.
<pub-id pub-id-type="pmid">16081699</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Margulies1">
<label>2</label>
<mixed-citation publication-type="journal">
<name>
<surname>Margulies</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Egholm</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Altman</surname>
<given-names>WE</given-names>
</name>
,
<name>
<surname>Attiya</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Bader</surname>
<given-names>JS</given-names>
</name>
,
<etal>et al</etal>
(
<year>2005</year>
)
<article-title>Genome sequencing in microfabricated high-density picolitre reactors</article-title>
.
<source>Nature</source>
<volume>437</volume>
:
<fpage>376</fpage>
<lpage>380</lpage>
.
<pub-id pub-id-type="pmid">16056220</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Li1">
<label>3</label>
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>RQ</given-names>
</name>
,
<name>
<surname>Fan</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Tian</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Zhu</surname>
<given-names>HM</given-names>
</name>
,
<name>
<surname>He</surname>
<given-names>L</given-names>
</name>
,
<etal>et al</etal>
(
<year>2010</year>
)
<article-title>The sequence and de novo assembly of the giant panda genome</article-title>
.
<source>Nature</source>
<volume>463</volume>
:
<fpage>311</fpage>
<lpage>317</lpage>
.
<pub-id pub-id-type="pmid">20010809</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Bentley1">
<label>4</label>
<mixed-citation publication-type="journal">
<name>
<surname>Bentley</surname>
<given-names>DR</given-names>
</name>
,
<name>
<surname>Balasubramanian</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Swerdlow</surname>
<given-names>HP</given-names>
</name>
,
<name>
<surname>Smith</surname>
<given-names>GP</given-names>
</name>
,
<name>
<surname>Milton</surname>
<given-names>J</given-names>
</name>
,
<etal>et al</etal>
(
<year>2008</year>
)
<article-title>Accurate whole human genome sequencing using reversible terminator chemistry</article-title>
.
<source>Nature</source>
<volume>456</volume>
:
<fpage>53</fpage>
<lpage>59</lpage>
.
<pub-id pub-id-type="pmid">18987734</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Li2">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
(
<year>2012</year>
)
<article-title>Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly</article-title>
.
<source>Bioinformatics</source>
<volume>28</volume>
:
<fpage>1838</fpage>
<lpage>1844</lpage>
.
<pub-id pub-id-type="pmid">22569178</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Blanca1">
<label>6</label>
<mixed-citation publication-type="journal">
<name>
<surname>Blanca</surname>
<given-names>JM</given-names>
</name>
,
<name>
<surname>Pascual</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Ziarsolo</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Nuez</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Canizares</surname>
<given-names>J</given-names>
</name>
(
<year>2011</year>
)
<article-title>ngs_backbone: a pipeline for read cleaning, mapping and SNP calling using Next Generation Sequence</article-title>
.
<source>BMC Genomics</source>
<volume>12</volume>
:
<fpage>285</fpage>
.
<pub-id pub-id-type="pmid">21635747</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Schatz1">
<label>7</label>
<mixed-citation publication-type="journal">
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
,
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
,
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
(
<year>2010</year>
)
<article-title>Assembly of large genomes using second-generation sequencing</article-title>
.
<source>Genome Res</source>
<volume>20</volume>
:
<fpage>1165</fpage>
<lpage>1173</lpage>
.
<pub-id pub-id-type="pmid">20508146</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-SurgetGroba1">
<label>8</label>
<mixed-citation publication-type="journal">
<name>
<surname>Surget-Groba</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Montoya-Burgos</surname>
<given-names>JI</given-names>
</name>
(
<year>2010</year>
)
<article-title>Optimization of de novo transcriptome assembly from next-generation sequencing data</article-title>
.
<source>Genome Res</source>
<volume>20</volume>
:
<fpage>1432</fpage>
<lpage>1440</lpage>
.
<pub-id pub-id-type="pmid">20693479</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Treangen1">
<label>9</label>
<mixed-citation publication-type="journal">
<name>
<surname>Treangen</surname>
<given-names>TJ</given-names>
</name>
,
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
(
<year>2012</year>
)
<article-title>Repetitive DNA and next-generation sequencing: computational challenges and solutions</article-title>
.
<source>Nat Rev Genet</source>
<volume>13</volume>
:
<fpage>36</fpage>
<lpage>46</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0114253-Flicek1">
<label>10</label>
<mixed-citation publication-type="journal">
<name>
<surname>Flicek</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
(
<year>2009</year>
)
<article-title>Sense from sequence reads: methods for alignment and assembly</article-title>
.
<source>Nat Methods</source>
<volume>6</volume>
:
<fpage>S6</fpage>
<lpage>S12</lpage>
.
<pub-id pub-id-type="pmid">19844229</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Shendure2">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>Shendure</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Ji</surname>
<given-names>H</given-names>
</name>
(
<year>2008</year>
)
<article-title>Next-generation DNA sequencing</article-title>
.
<source>Nat Biotechnol</source>
<volume>26</volume>
:
<fpage>1135</fpage>
<lpage>1145</lpage>
.
<pub-id pub-id-type="pmid">18846087</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Warren1">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Warren</surname>
<given-names>RL</given-names>
</name>
,
<name>
<surname>Sutton</surname>
<given-names>GG</given-names>
</name>
,
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
,
<name>
<surname>Holt</surname>
<given-names>RA</given-names>
</name>
(
<year>2007</year>
)
<article-title>Assembling millions of short DNA sequences using SSAKE</article-title>
.
<source>Bioinformatics</source>
<volume>23</volume>
:
<fpage>500</fpage>
<lpage>501</lpage>
.
<pub-id pub-id-type="pmid">17158514</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Jeck1">
<label>13</label>
<mixed-citation publication-type="journal">
<name>
<surname>Jeck</surname>
<given-names>WR</given-names>
</name>
,
<name>
<surname>Reinhardt</surname>
<given-names>JA</given-names>
</name>
,
<name>
<surname>Baltrus</surname>
<given-names>DA</given-names>
</name>
,
<name>
<surname>Hickenbotham</surname>
<given-names>MT</given-names>
</name>
,
<name>
<surname>Magrini</surname>
<given-names>V</given-names>
</name>
,
<etal>et al</etal>
(
<year>2007</year>
)
<article-title>Extending assembly of short DNA sequences to handle error</article-title>
.
<source>Bioinformatics</source>
<volume>23</volume>
:
<fpage>2942</fpage>
<lpage>2944</lpage>
.
<pub-id pub-id-type="pmid">17893086</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Dohm1">
<label>14</label>
<mixed-citation publication-type="journal">
<name>
<surname>Dohm</surname>
<given-names>JC</given-names>
</name>
,
<name>
<surname>Lottaz</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Borodina</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Himmelbauer</surname>
<given-names>H</given-names>
</name>
(
<year>2007</year>
)
<article-title>SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing</article-title>
.
<source>Genome Res</source>
<volume>17</volume>
:
<fpage>1697</fpage>
<lpage>1706</lpage>
.
<pub-id pub-id-type="pmid">17908823</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Hernandez1">
<label>15</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hernandez</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Francois</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Farinelli</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Osteras</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Schrenzel</surname>
<given-names>J</given-names>
</name>
(
<year>2008</year>
)
<article-title>De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer</article-title>
.
<source>Genome Res</source>
<volume>18</volume>
:
<fpage>802</fpage>
<lpage>809</lpage>
.
<pub-id pub-id-type="pmid">18332092</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Miller1">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Miller</surname>
<given-names>JR</given-names>
</name>
,
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
,
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Venter</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>Walenz</surname>
<given-names>BP</given-names>
</name>
,
<etal>et al</etal>
(
<year>2008</year>
)
<article-title>Aggressive assembly of pyrosequencing reads with mates</article-title>
.
<source>Bioinformatics</source>
<volume>24</volume>
:
<fpage>2818</fpage>
<lpage>2824</lpage>
.
<pub-id pub-id-type="pmid">18952627</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Burrows1">
<label>17</label>
<mixed-citation publication-type="other">Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Technical Report 124: Palo Alto, CA, Digital Equipment Corporation.</mixed-citation>
</ref>
<ref id="pone.0114253-Simpson1">
<label>18</label>
<mixed-citation publication-type="journal">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
,
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
(
<year>2012</year>
)
<article-title>Efficient de novo assembly of large genomes using compressed data structures</article-title>
.
<source>Genome Res</source>
<volume>22</volume>
:
<fpage>549</fpage>
<lpage>556</lpage>
.
<pub-id pub-id-type="pmid">22156294</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Ferragina1">
<label>19</label>
<mixed-citation publication-type="other">Ferragina P, Manzini G (2000) Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE Computer Society, pp. 390–398.</mixed-citation>
</ref>
<ref id="pone.0114253-Myers1">
<label>20</label>
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
,
<name>
<surname>Sutton</surname>
<given-names>GG</given-names>
</name>
,
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
,
<name>
<surname>Dew</surname>
<given-names>IM</given-names>
</name>
,
<name>
<surname>Fasulo</surname>
<given-names>DP</given-names>
</name>
,
<etal>et al</etal>
(
<year>2000</year>
)
<article-title>A whole-genome assembly of Drosophila</article-title>
.
<source>Science</source>
<volume>287</volume>
:
<fpage>2196</fpage>
<lpage>2204</lpage>
.
<pub-id pub-id-type="pmid">10731133</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Zimin1">
<label>21</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zimin</surname>
<given-names>AV</given-names>
</name>
,
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Puiu</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
,
<etal>et al</etal>
(
<year>2013</year>
)
<article-title>The MaSuRCA genome assembler</article-title>
.
<source>Bioinformatics</source>
<volume>29</volume>
:
<fpage>2669</fpage>
<lpage>2677</lpage>
.
<pub-id pub-id-type="pmid">23990416</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Pevzner1">
<label>22</label>
<mixed-citation publication-type="journal">
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
,
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
(
<year>2001</year>
)
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
.
<source>Proc Natl Acad Sci USA</source>
<volume>98</volume>
:
<fpage>9748</fpage>
<lpage>9753</lpage>
.
<pub-id pub-id-type="pmid">11504945</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Zerbino1">
<label>23</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
,
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
(
<year>2008</year>
)
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
.
<source>Genome Res</source>
<volume>18</volume>
:
<fpage>821</fpage>
<lpage>829</lpage>
.
<pub-id pub-id-type="pmid">18349386</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Chaisson1">
<label>24</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chaisson</surname>
<given-names>MJ</given-names>
</name>
,
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
(
<year>2008</year>
)
<article-title>Short read fragment assembly of bacterial genomes</article-title>
.
<source>Genome Res</source>
<volume>18</volume>
:
<fpage>324</fpage>
<lpage>330</lpage>
.
<pub-id pub-id-type="pmid">18083777</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Butler1">
<label>25</label>
<mixed-citation publication-type="journal">
<name>
<surname>Butler</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>MacCallum</surname>
<given-names>I</given-names>
</name>
,
<name>
<surname>Kleber</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Shlyakhter</surname>
<given-names>IA</given-names>
</name>
,
<name>
<surname>Belmonte</surname>
<given-names>MK</given-names>
</name>
,
<etal>et al</etal>
(
<year>2008</year>
)
<article-title>ALLPATHS: de novo assembly of whole-genome shotgun microreads</article-title>
.
<source>Genome Res</source>
<volume>18</volume>
:
<fpage>810</fpage>
<lpage>820</lpage>
.
<pub-id pub-id-type="pmid">18340039</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Simpson2">
<label>26</label>
<mixed-citation publication-type="journal">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
,
<name>
<surname>Wong</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
,
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
,
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
,
<etal>et al</etal>
(
<year>2009</year>
)
<article-title>ABySS: A parallel assembler for short read sequence data</article-title>
.
<source>Genome Res</source>
<volume>19</volume>
:
<fpage>1117</fpage>
<lpage>1123</lpage>
.
<pub-id pub-id-type="pmid">19251739</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Peng1">
<label>27</label>
<mixed-citation publication-type="journal">
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Leung</surname>
<given-names>HCM</given-names>
</name>
,
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
,
<name>
<surname>Chin</surname>
<given-names>FYL</given-names>
</name>
(
<year>2010</year>
)
<article-title>IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler</article-title>
.
<source>Research in Computational Molecular Biology, Proceedings</source>
<volume>6044</volume>
:
<fpage>426</fpage>
<lpage>440</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0114253-Peng2">
<label>28</label>
<mixed-citation publication-type="journal">
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Leung</surname>
<given-names>HC</given-names>
</name>
,
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
,
<name>
<surname>Chin</surname>
<given-names>FY</given-names>
</name>
(
<year>2012</year>
)
<article-title>IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth</article-title>
.
<source>Bioinformatics</source>
<volume>28</volume>
:
<fpage>1420</fpage>
<lpage>1428</lpage>
.
<pub-id pub-id-type="pmid">22495754</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Li3">
<label>29</label>
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Zhu</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Qian</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Fang</surname>
<given-names>X</given-names>
</name>
,
<etal>et al</etal>
(
<year>2010</year>
)
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
.
<source>Genome Res</source>
<volume>20</volume>
:
<fpage>265</fpage>
<lpage>272</lpage>
.
<pub-id pub-id-type="pmid">20019144</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-McElroy1">
<label>30</label>
<mixed-citation publication-type="journal">
<name>
<surname>McElroy</surname>
<given-names>KE</given-names>
</name>
,
<name>
<surname>Luciani</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Thomas</surname>
<given-names>T</given-names>
</name>
(
<year>2012</year>
)
<article-title>GemSIM: general, error-model based simulator of next-generation sequencing data</article-title>
.
<source>BMC Genomics</source>
<volume>13</volume>
:
<fpage>74</fpage>
.
<pub-id pub-id-type="pmid">22336055</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Kelley1">
<label>31</label>
<mixed-citation publication-type="journal">
<name>
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
,
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
,
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
(
<year>2010</year>
)
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
.
<source>Genome Biol</source>
<volume>11</volume>
:
<fpage>R116</fpage>
.
<pub-id pub-id-type="pmid">21114842</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Salzberg1">
<label>32</label>
<mixed-citation publication-type="journal">
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
,
<name>
<surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
,
<name>
<surname>Zimin</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Puiu</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Magoc</surname>
<given-names>T</given-names>
</name>
,
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>GAGE: A critical evaluation of genome assemblies and assembly algorithms</article-title>
.
<source>Genome Res</source>
<volume>22</volume>
:
<fpage>557</fpage>
<lpage>567</lpage>
.
<pub-id pub-id-type="pmid">22147368</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0114253-Altschul1">
<label>33</label>
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
,
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
,
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
(
<year>1990</year>
)
<article-title>Basic local alignment search tool</article-title>
.
<source>J Mol Biol</source>
<volume>215</volume>
:
<fpage>403</fpage>
<lpage>410</lpage>
.
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001093  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 001093  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021