Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Readjoiner: a fast and memory efficient string graph-based sequence assembler

Identifieur interne : 000943 ( Pmc/Corpus ); précédent : 000942; suivant : 000944

Readjoiner: a fast and memory efficient string graph-based sequence assembler

Auteurs : Giorgio Gonnella ; Stefan Kurtz

Source :

RBID : PMC:3507659

Abstract

Background

Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into k-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads.

Results

Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only.

Conclusions

Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at http://www.zbh.uni-hamburg.de/readjoiner.


Url:
DOI: 10.1186/1471-2105-13-82
PubMed: 22559072
PubMed Central: 3507659

Links to Exploration step

PMC:3507659

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Readjoiner: a fast and memory efficient string graph-based sequence assembler</title>
<author>
<name sortKey="Gonnella, Giorgio" sort="Gonnella, Giorgio" uniqKey="Gonnella G" first="Giorgio" last="Gonnella">Giorgio Gonnella</name>
<affiliation>
<nlm:aff id="I1">Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146, Hamburg, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kurtz, Stefan" sort="Kurtz, Stefan" uniqKey="Kurtz S" first="Stefan" last="Kurtz">Stefan Kurtz</name>
<affiliation>
<nlm:aff id="I1">Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146, Hamburg, Germany</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22559072</idno>
<idno type="pmc">3507659</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3507659</idno>
<idno type="RBID">PMC:3507659</idno>
<idno type="doi">10.1186/1471-2105-13-82</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000943</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000943</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Readjoiner: a fast and memory efficient string graph-based sequence assembler</title>
<author>
<name sortKey="Gonnella, Giorgio" sort="Gonnella, Giorgio" uniqKey="Gonnella G" first="Giorgio" last="Gonnella">Giorgio Gonnella</name>
<affiliation>
<nlm:aff id="I1">Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146, Hamburg, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kurtz, Stefan" sort="Kurtz, Stefan" uniqKey="Kurtz S" first="Stefan" last="Kurtz">Stefan Kurtz</name>
<affiliation>
<nlm:aff id="I1">Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146, Hamburg, Germany</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into
<italic>k</italic>
-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads.</p>
</sec>
<sec>
<title>Results</title>
<p>Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at
<ext-link ext-link-type="uri" xlink:href="http://www.zbh.uni-hamburg.de/readjoiner">http://www.zbh.uni-hamburg.de/readjoiner</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mcpherson, Jd" uniqKey="Mcpherson J">JD McPherson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, Mj" uniqKey="Chaisson M">MJ Chaisson</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Wong, K" uniqKey="Wong K">K Wong</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
<author>
<name sortKey="Jones, Sj" uniqKey="Jones S">SJ Jones</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hernandez, D" uniqKey="Hernandez D">D Hernandez</name>
</author>
<author>
<name sortKey="Francois, P" uniqKey="Francois P">P FranÇois</name>
</author>
<author>
<name sortKey="Farinelli, L" uniqKey="Farinelli L">L Farinelli</name>
</author>
<author>
<name sortKey="Oster S, M" uniqKey="Oster S M">M Osterås</name>
</author>
<author>
<name sortKey="Schrenzel, J" uniqKey="Schrenzel J">J Schrenzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manber, U" uniqKey="Manber U">U Manber</name>
</author>
<author>
<name sortKey="Myers, G" uniqKey="Myers G">G Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dinh, H" uniqKey="Dinh H">H Dinh</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="K Rkk Inen, J" uniqKey="K Rkk Inen J">J Kärkkäinen</name>
</author>
<author>
<name sortKey="Rantala, T" uniqKey="Rantala T">T Rantala</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cormen, T" uniqKey="Cormen T">T Cormen</name>
</author>
<author>
<name sortKey="Leiserson, C" uniqKey="Leiserson C">C Leiserson</name>
</author>
<author>
<name sortKey="Rivest, R" uniqKey="Rivest R">R Rivest</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Steinbiss, S" uniqKey="Steinbiss S">S Steinbiss</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bentley, J" uniqKey="Bentley J">J Bentley</name>
</author>
<author>
<name sortKey="Mcillroy, M" uniqKey="Mcillroy M">M McIllroy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abouelhoda, M" uniqKey="Abouelhoda M">M Abouelhoda</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fredkin, E" uniqKey="Fredkin E">E Fredkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bowden, T" uniqKey="Bowden T">T Bowden</name>
</author>
<author>
<name sortKey="Bauer, B" uniqKey="Bauer B">B Bauer</name>
</author>
<author>
<name sortKey="Nerin, J" uniqKey="Nerin J">J Nerin</name>
</author>
<author>
<name sortKey="Feng, S" uniqKey="Feng S">S Feng</name>
</author>
<author>
<name sortKey="Seibold, S" uniqKey="Seibold S">S Seibold</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barthelson, R" uniqKey="Barthelson R">R Barthelson</name>
</author>
<author>
<name sortKey="Mcfarlin, Aj" uniqKey="Mcfarlin A">AJ McFarlin</name>
</author>
<author>
<name sortKey="Rounsley, Sd" uniqKey="Rounsley S">SD Rounsley</name>
</author>
<author>
<name sortKey="Young, S" uniqKey="Young S">S Young</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Phillippy, A" uniqKey="Phillippy A">A Phillippy</name>
</author>
<author>
<name sortKey="Delcher, A" uniqKey="Delcher A">A Delcher</name>
</author>
<author>
<name sortKey="Smoot, M" uniqKey="Smoot M">M Smoot</name>
</author>
<author>
<name sortKey="Shumway, M" uniqKey="Shumway M">M Shumway</name>
</author>
<author>
<name sortKey="Antonescu, C" uniqKey="Antonescu C">C Antonescu</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Earl, Da" uniqKey="Earl D">DA Earl</name>
</author>
<author>
<name sortKey="Bradnam, K" uniqKey="Bradnam K">K Bradnam</name>
</author>
<author>
<name sortKey="St John, J" uniqKey="St John J">J St John</name>
</author>
<author>
<name sortKey="Darling, A" uniqKey="Darling A">A Darling</name>
</author>
<author>
<name sortKey="Lin, D" uniqKey="Lin D">D Lin</name>
</author>
<author>
<name sortKey="Faas, J" uniqKey="Faas J">J Faas</name>
</author>
<author>
<name sortKey="Yu, Hok" uniqKey="Yu H">HOK Yu</name>
</author>
<author>
<name sortKey="Vince, B" uniqKey="Vince B">B Vince</name>
</author>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Diekhans, M" uniqKey="Diekhans M">M Diekhans</name>
</author>
<author>
<name sortKey="Nguyen, N" uniqKey="Nguyen N">N Nguyen</name>
</author>
<author>
<name sortKey="Nuwantha, P" uniqKey="Nuwantha P">P Nuwantha</name>
</author>
<author>
<name sortKey="Sung, Awk" uniqKey="Sung A">AWK Sung</name>
</author>
<author>
<name sortKey="Ning, Z" uniqKey="Ning Z">Z Ning</name>
</author>
<author>
<name sortKey="Haimel, M" uniqKey="Haimel M">M Haimel</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Fronseca, Na" uniqKey="Fronseca N">NA Fronseca</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Docking, Tr" uniqKey="Docking T">TR Docking</name>
</author>
<author>
<name sortKey="Ho, Iy" uniqKey="Ho I">IY Ho</name>
</author>
<author>
<name sortKey="Rokhsar, Ds" uniqKey="Rokhsar D">DS Rokhsar</name>
</author>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
<author>
<name sortKey="Chapuis, G" uniqKey="Chapuis G">G Chapuis</name>
</author>
<author>
<name sortKey="Naquin, D" uniqKey="Naquin D">D Naquin</name>
</author>
<author>
<name sortKey="Maillet, N" uniqKey="Maillet N">N Maillet</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Kelly, Dr" uniqKey="Kelly D">DR Kelly</name>
</author>
<author>
<name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Yang, Sp" uniqKey="Yang S">SP Yang</name>
</author>
<author>
<name sortKey="Wu, W" uniqKey="Wu W">W Wu</name>
</author>
<author>
<name sortKey="Chou, Wc" uniqKey="Chou W">WC Chou</name>
</author>
<author>
<name sortKey="Srivastava, A" uniqKey="Srivastava A">A Srivastava</name>
</author>
<author>
<name sortKey="Shaw, Ti" uniqKey="Shaw T">TI Shaw</name>
</author>
<author>
<name sortKey="Ruby, Jg" uniqKey="Ruby J">JG Ruby</name>
</author>
<author>
<name sortKey="Skewes Cox, P" uniqKey="Skewes Cox P">P Skewes-Cox</name>
</author>
<author>
<name sortKey="Betegon, M" uniqKey="Betegon M">M Betegon</name>
</author>
<author>
<name sortKey="Dimon, Mt" uniqKey="Dimon M">MT Dimon</name>
</author>
<author>
<name sortKey="Solovyev, V" uniqKey="Solovyev V">V Solovyev</name>
</author>
<author>
<name sortKey="Kosarev, P" uniqKey="Kosarev P">P Kosarev</name>
</author>
<author>
<name sortKey="Vorobyev, D" uniqKey="Vorobyev D">D Vorobyev</name>
</author>
<author>
<name sortKey="Ramirez Gonzalez, R" uniqKey="Ramirez Gonzalez R">R Ramirez-Gonzalez</name>
</author>
<author>
<name sortKey="Leggett, R" uniqKey="Leggett R">R Leggett</name>
</author>
<author>
<name sortKey="Maclean, D" uniqKey="Maclean D">D MacLean</name>
</author>
<author>
<name sortKey="Xia, F" uniqKey="Xia F">F Xia</name>
</author>
<author>
<name sortKey="Luo, R" uniqKey="Luo R">R Luo</name>
</author>
<author>
<name sortKey="L, Z" uniqKey="L Z">Z L</name>
</author>
<author>
<name sortKey="Xie, Y" uniqKey="Xie Y">Y Xie</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Gnerre, S" uniqKey="Gnerre S">S Gnerre</name>
</author>
<author>
<name sortKey="Maccallum, I" uniqKey="Maccallum I">I MacCallum</name>
</author>
<author>
<name sortKey="Przybylski, D" uniqKey="Przybylski D">D Przybylski</name>
</author>
<author>
<name sortKey="Ribeiro, Fj" uniqKey="Ribeiro F">FJ Ribeiro</name>
</author>
<author>
<name sortKey="Yin, S" uniqKey="Yin S">S Yin</name>
</author>
<author>
<name sortKey="Sharpe, T" uniqKey="Sharpe T">T Sharpe</name>
</author>
<author>
<name sortKey="Hall, G" uniqKey="Hall G">G Hall</name>
</author>
<author>
<name sortKey="Kersey, Pj" uniqKey="Kersey P">PJ Kersey</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Chapman, Ja" uniqKey="Chapman J">JA Chapman</name>
</author>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
<author>
<name sortKey="Derisi, Jl" uniqKey="Derisi J">JL DeRisi</name>
</author>
<author>
<name sortKey="Caccamo, M" uniqKey="Caccamo M">M Caccamo</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Jaffe, Db" uniqKey="Jaffe D">DB Jaffe</name>
</author>
<author>
<name sortKey="Green, R" uniqKey="Green R">R Green</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
<author>
<name sortKey="Korf, I" uniqKey="Korf I">I Korf</name>
</author>
<author>
<name sortKey="Paten, B" uniqKey="Paten B">B Paten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gusfield, D" uniqKey="Gusfield D">D Gusfield</name>
</author>
<author>
<name sortKey="Landau, Gm" uniqKey="Landau G">GM Landau</name>
</author>
<author>
<name sortKey="Schieber, B" uniqKey="Schieber B">B Schieber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
<author>
<name sortKey="Gog, S" uniqKey="Gog S">S Gog</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, D" uniqKey="Kelley D">D Kelley</name>
</author>
<author>
<name sortKey="Schatz, M" uniqKey="Schatz M">M Schatz</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Kosack, Ds" uniqKey="Kosack D">DS Kosack</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boetzer, M" uniqKey="Boetzer M">M Boetzer</name>
</author>
<author>
<name sortKey="Henkel, Cv" uniqKey="Henkel C">CV Henkel</name>
</author>
<author>
<name sortKey="Jansen, Hj" uniqKey="Jansen H">HJ Jansen</name>
</author>
<author>
<name sortKey="Butler, D" uniqKey="Butler D">D Butler</name>
</author>
<author>
<name sortKey="Pirovano, W" uniqKey="Pirovano W">W Pirovano</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Donmez, N" uniqKey="Donmez N">N Donmez</name>
</author>
<author>
<name sortKey="Brudno, M" uniqKey="Brudno M">M Brudno</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22559072</article-id>
<article-id pub-id-type="pmc">3507659</article-id>
<article-id pub-id-type="publisher-id">1471-2105-13-82</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-13-82</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Readjoiner: a fast and memory efficient string graph-based sequence assembler</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Gonnella</surname>
<given-names>Giorgio</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>gonnella@zbh.uni-hamburg.de</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A2">
<name>
<surname>Kurtz</surname>
<given-names>Stefan</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>kurtz@zbh.uni-hamburg.de</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146, Hamburg, Germany</aff>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>6</day>
<month>5</month>
<year>2012</year>
</pub-date>
<volume>13</volume>
<fpage>82</fpage>
<lpage>82</lpage>
<history>
<date date-type="received">
<day>16</day>
<month>11</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>2</day>
<month>3</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2012 Gonnella and Kurtz; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Gonnella and Kurtz; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/13/82"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into
<italic>k</italic>
-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads.</p>
</sec>
<sec>
<title>Results</title>
<p>Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at
<ext-link ext-link-type="uri" xlink:href="http://www.zbh.uni-hamburg.de/readjoiner">http://www.zbh.uni-hamburg.de/readjoiner</ext-link>
.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>The
<italic>de novo</italic>
sequence assembly problem is to reconstruct a target sequence from a set of sequence reads. The classical approach to
<italic>de novo</italic>
assembly consists of three phases: overlap, layout and consensus. During the overlap phase, suffix-prefix matches among all pairs of sequence reads are computed, and turned into an overlap graph [
<xref ref-type="bibr" rid="B1">1</xref>
]. In the layout phase the location of the reads with respect to each other is determined. In the consensus phase the target sequence is reconstructed, by selecting a base for each position.</p>
<p>The introduction of the massively parallel next-generation DNA sequencing technologies has led to a considerable increase in the amount of data typically generated by sequencing experiments. For example, as of January 2012, the HiSeq2000 sequencer of Illumina delivers sets of 100 bp reads with a total length of up to 600 Gbp [
<xref ref-type="bibr" rid="B2">2</xref>
]. Sequence analysis software tools developed only a few years ago are often unable to deal with such large amounts of short reads: This has led to a gap between the ability to produce sequence data and the capability to assemble and analyze them [
<xref ref-type="bibr" rid="B3">3</xref>
].</p>
<p>The computation of the overlap graph is the most time and space consuming of the three phases, and was considered a bottleneck in the computation. Therefore, alternative methods were developed avoiding an explicit overlap computation. An approach which proved to be effective is based on the enumeration of all
<italic>k</italic>
-mers of the reads and their representation in a de Bruijn graph, as first proposed by [
<xref ref-type="bibr" rid="B4">4</xref>
]. This concept is applied in several popular short reads assemblers such as Velvet [
<xref ref-type="bibr" rid="B5">5</xref>
], EULER-SR [
<xref ref-type="bibr" rid="B6">6</xref>
] and Abyss [
<xref ref-type="bibr" rid="B7">7</xref>
].</p>
<p>The de Bruijn graph describing the
<italic>k</italic>
-mer spectrum of the read set has interesting properties for the solution of the assembly problem, such as the collapsing of different instances of sequence repeats into common paths of the graph. However, reducing short reads into even shorter units compromises the ability of disambiguation of short repeats. Myers [
<xref ref-type="bibr" rid="B8">8</xref>
] presented an alternative framework, the assembly string graph. Like in the de Bruijn graph, repeats still collapse into common graph elements. There are two main advantages of the string graph, compared to the de Bruijn graph: At first, it does not require to split the reads into
<italic>k</italic>
-mers. Secondly, a string graph always retains read coherence, i.e. each path in the string graph represents a valid assembly of the reads.</p>
<p>Edena [
<xref ref-type="bibr" rid="B9">9</xref>
] was the first available implementation of a string graph-based assembler. It computes suffix-prefix matches using a suffix array [
<xref ref-type="bibr" rid="B10">10</xref>
] representing all suffixes of the reads. From these, the complete overlap graph is constructed before transitive edges are removed using an algorithm described in [
<xref ref-type="bibr" rid="B8">8</xref>
].</p>
<p>A more space-efficient approach to the string graph construction has been presented in [
<xref ref-type="bibr" rid="B11">11</xref>
] and was implemented in the open source String Graph Assembler (SGA) [
<xref ref-type="bibr" rid="B12">12</xref>
]. SGA computes suffix-prefix matches using an algorithm based on the Burrows and Wheeler transform (BWT), allowing to classify suffix-prefix matches as transitive or irreducible, so that the string graph can directly be constructed.</p>
<p>Recently, a compact representation for exact-match overlap graphs has been described in [
<xref ref-type="bibr" rid="B13">13</xref>
], together with a fast construction algorithm, which has been implemented in the string graph-based assembler LEAP.</p>
<p>In this paper, we present new efficient algorithms for the computation of irreducible suffix-prefix matches and the construction of the assembly string graph. These are implemented in a new string graph based sequence assembler
<italic>Readjoiner</italic>
. To validate our approach, we compared
<italic>Readjoiner</italic>
with the current implementations of Edena, LEAP and SGA.
<italic>Readjoiner</italic>
proved to be considerably faster than previous competitors, or uses less memory. In fact,
<italic>Readjoiner</italic>
is able to handle very large datasets using limited resources: for example, a short reads dataset consisting of 115 Gbp could be assembled on a single core in 51 hours using 52 GB RAM.</p>
<p>All string graph-based assemblers aim at constructing the same graph: However, the algorithms and data structures employed in Edena, LEAP, SGA and
<italic>Readjoiner</italic>
differ considerably. LEAP employs a compact representation of the overlap graph, while
<italic>Readjoiner</italic>
circumvents the construction of the full overlap graph. Both Edena and SGA are based on explicit index structures (suffix array and FM-index, respectively) representing all suffixes of all reads in the read set, while
<italic>Readjoiner</italic>
enumerates and sorts only a proper subset of the suffixes of the reads, and efficiently inserts them into buckets, which can be processed independently from each other.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Basic definitions</title>
<p>Let
<italic>w</italic>
be a string of length
<italic>n</italic>
of symbols over an alphabet
<inline-formula>
<mml:math id="M1" name="1471-2105-13-82-i1" overflow="scroll">
<mml:mi>Σ</mml:mi>
</mml:math>
</inline-formula>
.
<italic>w</italic>
[
<italic>i</italic>
] denotes the
<italic>i</italic>
th symbol of
<italic>w</italic>
and
<italic>w</italic>
[
<italic>i</italic>
<italic>j</italic>
] the substring of
<italic>w</italic>
from position
<italic>i</italic>
to
<italic>j</italic>
, 1 ≤ 
<italic>i</italic>
, 
<italic>j</italic>
 ≤ 
<italic>n</italic>
.
<italic>w</italic>
[1…
<italic>i</italic>
] is the
<italic>prefix</italic>
of
<italic>w</italic>
ending at position
<italic>i</italic>
and
<italic>w</italic>
[
<italic>j</italic>
<italic>n</italic>
] is the
<italic>suffix</italic>
of
<italic>w</italic>
starting at position
<italic>j</italic>
. A substring of
<italic>w</italic>
is
<italic>proper</italic>
if it is different from
<italic>w</italic>
. A substring of
<italic>w</italic>
is
<italic>internal</italic>
if it is neither a prefix nor a suffix of
<italic>w</italic>
.</p>
<p>A read
<italic>r</italic>
is a string over the alphabet {
<italic>A</italic>
, 
<italic>C</italic>
, 
<italic>G</italic>
, 
<italic>T</italic>
} which is assumed to be sorted by the alphabetical order < such that
<italic>A</italic>
 < 
<italic>C</italic>
 < 
<italic>G</italic>
 < 
<italic>T</italic>
. ⊴ denotes the lexicographic order of all substrings of the reads induced by the alphabetical order < . Let
<italic>n</italic>
be the length of
<italic>r</italic>
. The reverse complement of
<italic>r</italic>
, denoted by
<inline-formula>
<mml:math id="M2" name="1471-2105-13-82-i2" overflow="scroll">
<mml:mover accent="true">
<mml:mi>r</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
, is the sequence
<inline-formula>
<mml:math id="M3" name="1471-2105-13-82-i3" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo></mml:mo>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
, where
<inline-formula>
<mml:math id="M4" name="1471-2105-13-82-i4" overflow="scroll">
<mml:mover accent="true">
<mml:mi>a</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
indicates the Watson-Crick complement of base
<italic>a</italic>
.</p>
</sec>
<sec>
<title>Computing suffix- and prefix-free read sets</title>
<p>The first step of our approach for assembling a collection of reads is to eliminate reads that are prefixes or suffixes of other reads. Here we describe a method to recognize these reads. Consider an ordered set
<inline-formula>
<mml:math id="M5" name="1471-2105-13-82-i5" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of reads, possibly of variable length, in which some reads may occur more than once (so
<inline-formula>
<mml:math id="M6" name="1471-2105-13-82-i6" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
is indeed a multiset). We assume that, for all
<italic>i</italic>
, 1 ≤ 
<italic>i</italic>
 ≤ 
<italic>m</italic>
, the
<italic>i</italic>
th read
<italic>r</italic>
<sub>
<italic>i</italic>
</sub>
in
<inline-formula>
<mml:math id="M7" name="1471-2105-13-82-i7" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
is virtually padded by a
<italic>sentinel symbol</italic>
$
<sub>
<italic>i</italic>
</sub>
at the right end and that the alphabetical order < is extended such that
<italic>A</italic>
 < 
<italic>C</italic>
 < 
<italic>G</italic>
 < 
<italic>T</italic>
 < $
<sub>1</sub>
 < $
<sub>2</sub>
 < · · · < $
<sub>
<italic>m</italic>
</sub>
.</p>
<p>We define a binary relation ≺ on
<inline-formula>
<mml:math id="M8" name="1471-2105-13-82-i8" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
such that
<italic>r</italic>
<sub>
<italic>i</italic>
</sub>
 ≺ 
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
if and only if
<italic>i</italic>
 < 
<italic>j</italic>
. That is, ≺ reflects the order of the reads in the input.
<inline-formula>
<mml:math id="M9" name="1471-2105-13-82-i9" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
is
<italic>prefix-free</italic>
if for all reads
<italic>r</italic>
in
<inline-formula>
<mml:math id="M10" name="1471-2105-13-82-i10" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
there is no
<italic>r</italic>
′ in
<inline-formula>
<mml:math id="M11" name="1471-2105-13-82-i11" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="true">{</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
such that
<italic>r</italic>
is a prefix of
<italic>r</italic>
′.
<inline-formula>
<mml:math id="M12" name="1471-2105-13-82-i12" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
is
<italic>suffix-free</italic>
if for all
<italic>r</italic>
in
<inline-formula>
<mml:math id="M13" name="1471-2105-13-82-i13" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
there is no read
<italic>r</italic>
′ in
<inline-formula>
<mml:math id="M14" name="1471-2105-13-82-i14" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="true">{</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
such that
<italic>r</italic>
is a suffix of
<italic>r</italic>
′.</p>
<p>To obtain a prefix- and suffix-free set of reads we lexicographically sort all reads using a modified radixsort for strings, as described in [
<xref ref-type="bibr" rid="B14">14</xref>
]. In this algorithm, the strings to be sorted are first inserted into buckets according to their first character. Each bucket is then sorted recursively according to the next character of all reads in the bucket. A bucket always consists of reads which have a common prefix. Once a bucket is smaller than some constant, the remaining suffixes of the reads in the bucket are sorted by insertion sort [
<xref ref-type="bibr" rid="B15">15</xref>
].</p>
<p>During the sorting process, the length of the longest common prefix (lcp) of two lexicographically consecutive reads is calculated as a byproduct. For two lexicographically consecutive reads
<italic>r</italic>
and
<italic>r</italic>
′ with an lcp of length ℓ = |
<italic>r</italic>
|, we can conclude that
<italic>r</italic>
is a prefix of
<italic>r</italic>
′. If ℓ < |
<italic>r</italic>
′|, then
<italic>r</italic>
is a proper prefix of
<italic>r</italic>
′ and we mark
<italic>r</italic>
. If ℓ = |
<italic>r</italic>
′|, then
<italic>r</italic>
and
<italic>r</italic>
′ are identical and we mark the read which is larger according to the binary relation ≺ .</p>
<p>To handle reverse complements and to mark reads which are suffixes of other reads, one simply applies this method to the multiset
<inline-formula>
<mml:math id="M15" name="1471-2105-13-82-i15" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
where
<inline-formula>
<mml:math id="M16" name="1471-2105-13-82-i16" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
for all
<italic>i</italic>
, 1 ≤ 
<italic>i</italic>
 ≤ 
<italic>m</italic>
. As
<inline-formula>
<mml:math id="M17" name="1471-2105-13-82-i17" overflow="scroll">
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
includes the reverse complements of the reads, the method also marks reads which are suffixes of other reads. This is due to the observation that if read
<italic>r</italic>
is a suffix of read
<italic>r</italic>
′, then
<inline-formula>
<mml:math id="M18" name="1471-2105-13-82-i18" overflow="scroll">
<mml:mover accent="true">
<mml:mi>r</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
is a prefix of
<inline-formula>
<mml:math id="M19" name="1471-2105-13-82-i19" overflow="scroll">
<mml:mover accent="true">
<mml:mrow>
<mml:msup>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
.</p>
<p>In a final step of the algorithm one eliminates all reads from
<inline-formula>
<mml:math id="M20" name="1471-2105-13-82-i20" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
which have been marked. The remaining unmarked reads from
<inline-formula>
<mml:math id="M21" name="1471-2105-13-82-i21" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
are processed further. The algorithm to compute a suffix- and prefix-free set of reads runs in
<inline-formula>
<mml:math id="M22" name="1471-2105-13-82-i22" overflow="scroll">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mfrac>
<mml:msub>
<mml:mi>λ</mml:mi>
<mml:mtext>max</mml:mtext>
</mml:msub>
<mml:mi>ω</mml:mi>
</mml:mfrac>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
time, where
<italic>λ</italic>
<sub>max</sub>
is the maximum length of a read and
<italic>ω</italic>
is the machine’s word size. As we consider
<italic>λ</italic>
<sub>max</sub>
to be a constant (which does not imply that the reads are all of the same length), the algorithm runs in
<italic>O</italic>
(
<italic>m</italic>
) time.</p>
</sec>
<sec>
<title>Computing suffix-prefix matches</title>
<p>Suppose that
<inline-formula>
<mml:math id="M23" name="1471-2105-13-82-i23" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
is a suffix- and prefix-free set of
<italic>m</italic>
reads. Let ℓ
<sub>
<italic>min</italic>
</sub>
 > 0 be the minimum length parameter. The set
<inline-formula>
<mml:math id="M24" name="1471-2105-13-82-i24" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>M</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of
<italic>suffix-prefix matches</italic>
(
<italic>SPM</italic>
s, for short) is the smallest set of triples 〈
<italic>r</italic>
, 
<italic>r</italic>
′, ℓ〉 such that
<inline-formula>
<mml:math id="M25" name="1471-2105-13-82-i25" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
and strings
<italic>u</italic>
, 
<italic>v</italic>
, 
<italic>w</italic>
exist such that
<italic>r</italic>
 = 
<italic>uv</italic>
,
<italic>r</italic>
′ = 
<italic>vw</italic>
, and |v| =ℓ ≥ ℓ
<sub>
<italic>min</italic>
</sub>
. ℓ is the
<italic>length</italic>
of a suffix-prefix match 〈
<italic>r</italic>
, 
<italic>r</italic>
′, ℓ〉 . The
<italic>suffix-prefix matching problem</italic>
is to find all suffix-prefix matches. As the reads of length smaller than ℓ
<sub>
<italic>min</italic>
</sub>
cannot, by definition, contribute any
<italic>SPM</italic>
, we can ignore them and thus we assume that
<inline-formula>
<mml:math id="M26" name="1471-2105-13-82-i26" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
only contains reads of length at least ℓ
<sub>
<italic>min</italic>
</sub>
.</p>
<p>The method to solve the
<italic>suffix-prefix matching problem</italic>
presented here consists of two main algorithms. The first algorithm identifies and lexicographically sorts all SPM-relevant suffixes,
<italic>i.e.</italic>
a subset of all suffixes of all reads from which one can compute all suffix-prefix matches. The second algorithm enumerates these matches given the sorted list of all SPM-relevant suffixes.</p>
<p>Consider a suffix-prefix match 〈
<italic>r</italic>
, 
<italic>r</italic>
′, ℓ〉 . By definition, the suffix of length ℓ of
<italic>r</italic>
exactly matches the prefix of length ℓ of
<italic>r</italic>
′. Obviously, the suffix of
<italic>r</italic>
involved in the match starts at some position
<italic>j</italic>
,
<inline-formula>
<mml:math id="M27" name="1471-2105-13-82-i27" overflow="scroll">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo> ≤ </mml:mo>
<mml:mi>j </mml:mi>
<mml:mo>≤ </mml:mo>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi></mml:mi>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
in
<italic>r</italic>
. This implies that
<italic>r</italic>
must be at least of length ℓ
<sub>
<italic>min</italic>
</sub>
 + 1. The suffix cannot start at the first position in
<italic>r</italic>
, as otherwise
<italic>r</italic>
would be a prefix of some other read, contradicting our assumption that
<inline-formula>
<mml:math id="M28" name="1471-2105-13-82-i28" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
is prefix-free.</p>
<p>To enumerate the set of all suffix-prefix matches of length at least ℓ
<sub>
<italic>min</italic>
</sub>
, we preprocess all reads and determine all proper suffixes of the reads which may be involved in a suffix-prefix match. More precisely, for all reads
<italic>r</italic>
we determine all
<italic>matching candidates</italic>
, i.e. all proper suffixes
<italic>s</italic>
of
<italic>r</italic>
such that the length of
<italic>s</italic>
is at least ℓ
<sub>
<italic>min</italic>
</sub>
and there is a read
<italic>r</italic>
′ such that
<italic>s</italic>
and
<italic>r</italic>
′ have a common prefix of length at least
<italic>k</italic>
, where
<italic>k</italic>
is an user-defined parameter satisfying
<inline-formula>
<mml:math id="M29" name="1471-2105-13-82-i29" overflow="scroll">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo> ≤ </mml:mo>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo stretchy="true">{</mml:mo>
<mml:msub>
<mml:mi></mml:mi>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mfrac>
<mml:mi>ω</mml:mi>
<mml:mn>2</mml:mn>
</mml:mfrac>
<mml:mo stretchy="true">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. There are two reasons for imposing this constraint on
<italic>k</italic>
: First, we want to represent a string of length
<italic>k</italic>
over an alphabet of size 4 in one machine word, thus
<inline-formula>
<mml:math id="M30" name="1471-2105-13-82-i30" overflow="scroll">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo> ≤</mml:mo>
<mml:mfrac>
<mml:mi>ω</mml:mi>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
. Second, the suffixes of the reads from which we take the prefixes of length
<italic>k</italic>
have minimum length ℓ
<sub>
<italic>min</italic>
</sub>
, thus we choose
<italic>k</italic>
 ≤ ℓ
<sub>
<italic>min</italic>
</sub>
.</p>
<p>The set of all matching candidates and all reads forms the
<italic>set of all</italic>
(ℓ
<sub>
<italic>min</italic>
</sub>
, 
<italic>k</italic>
)
<italic>-SPM-relevant suffixes</italic>
. For simplicity sake, we use the notion
<italic>SPM-relevant suffixes</italic>
if ℓ
<sub>
<italic>min</italic>
</sub>
and
<italic>k</italic>
are clear from the context. While all
<italic>SPM</italic>
s can be constructed from the SPM-relevant suffixes, not all SPM-relevant suffixes lead to an
<italic>SPM</italic>
.</p>
<sec>
<title>An efficient algorithm for identifying and sorting all SPM-relevant suffixes</title>
<p>The first two phases of our algorithm follow a strategy that is borrowed from the counting sort algorithm [
<xref ref-type="bibr" rid="B15">15</xref>
]. Like this, our algorithm has a counting phase and an insertion phase. However, in our problem, the elements (i.e. SPM-relevant suffixes) to be sorted are only determined during the algorithm. Moreover, the number of keys (i.e. initial
<italic>k</italic>
-mers) whose occurrences are counted is on the order of the number of elements to be sorted. Therefore, in a space efficient solution, it is not trivial to access a counter given a key. We have developed a time and space efficient method to access the counter for a key, exploiting the fact that counting and inserting the SPM-relevant suffixes does not have to be done immediately. Instead, the items to be counted/inserted are first buffered and then sorted. A linear scan then performs the counting or inserting step.</p>
<p>In contrast to counting sort, our algorithm uses an extra sorting step to obtain the final order of elements pre-sorted in the insertion phase. Under the assumption that the maximum read length is a constant (which does not imply that the reads are all of the same length), our algorithm runs in
<italic>O</italic>
(
<italic>n</italic>
) time and space, where
<italic>n</italic>
is the total length of all reads. To the best of our knowledge a method employing a similar strategy has not yet been developed for the suffix-prefix matching problem.</p>
<p>We first give a description of our algorithm using string notation. In a separate section, we explain how to efficiently implement the algorithm. In the following, we only consider the reads in the forward direction. However, it is not difficult to extend our method to also incorporate the reverse complements of the reads and we comment on this issue at the end of the methods section.</p>
<p>The
<italic>initial k</italic>
<italic>mer</italic>
of some sequence
<italic>s</italic>
is the prefix of
<italic>s</italic>
of length
<italic>k</italic>
. To determine the matching candidates efficiently, we first enumerate the initial
<italic>k</italic>
-mers of all reads and store them in a table of size
<italic>m</italic>
. This can be done in
<italic>O</italic>
(
<italic>m</italic>
) time. The notion
<italic>table size</italic>
always refers to the number of entries in the table. The next step lexicographically sorts the
<italic>k</italic>
-mers in the table in ascending order. This string sorting problem can be transformed into an integer sorting problem (see Implementation) which can be solved by radixsort [
<xref ref-type="bibr" rid="B15">15</xref>
] in
<italic>O</italic>
(
<italic>m</italic>
) time and
<italic>O</italic>
(
<italic>m</italic>
) extra working space.</p>
<p>In the next step, a linear scan of the sorted
<italic>k</italic>
-mers removes duplicates from the table and counts how many times each
<italic>k</italic>
-mer occurs in the table. This scan requires
<italic>O</italic>
(
<italic>m</italic>
) time. Let
<italic>d</italic>
 ≤ 
<italic>m</italic>
be the number of different
<italic>k</italic>
-mers. These can be stored in a table
<italic>K</italic>
of size
<italic>d</italic>
.</p>
<p>The counts for the elements in
<italic>K</italic>
require another table
<italic>C</italic>
of size
<italic>d</italic>
. In addition to the duplicate removal and counting, the linear scan of the sorted
<italic>k</italic>
-mers constructs two sets
<italic>P</italic>
and
<italic>Q</italic>
, the size of which depends on two user defined parameters
<italic>k</italic>
′ ≤ 
<italic>k</italic>
and
<italic>k</italic>
 ″≤ 
<italic>k</italic>
.
<italic>P</italic>
is the set of all initial
<italic>k</italic>
′-mers of the reads.
<italic>Q</italic>
is the set of all
<italic>k</italic>
-mers
<inline-formula>
<mml:math id="M31" name="1471-2105-13-82-i31" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
for some
<inline-formula>
<mml:math id="M32" name="1471-2105-13-82-i32" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
. We assume that elements can be added to
<italic>P</italic>
and
<italic>Q</italic>
in constant time and that membership in these sets can be decided in constant time. Thus the linear scan constructs
<italic>P</italic>
and
<italic>Q</italic>
in
<italic>O</italic>
(
<italic>m</italic>
) time. As
<italic>P</italic>
is a subset of a set of size
<inline-formula>
<mml:math id="M33" name="1471-2105-13-82-i33" overflow="scroll">
<mml:msup>
<mml:mn>4</mml:mn>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:msup>
</mml:math>
</inline-formula>
,
<italic>P</italic>
can be stored in
<inline-formula>
<mml:math id="M34" name="1471-2105-13-82-i34" overflow="scroll">
<mml:msup>
<mml:mn>4</mml:mn>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:msup>
</mml:math>
</inline-formula>
bits.
<italic>Q</italic>
requires
<inline-formula>
<mml:math id="M35" name="1471-2105-13-82-i35" overflow="scroll">
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
bits.</p>
<p>Up until now, only the initial
<italic>k</italic>
-mers of the reads were considered, resulting in a sorted table
<italic>K</italic>
of
<italic>d</italic>
non-redundant keys (i.e. initial
<italic>k</italic>
-mers of reads), a table
<italic>C</italic>
of size
<italic>d</italic>
for counting
<italic>k</italic>
-mers and two sets
<italic>P</italic>
and
<italic>Q</italic>
. By construction, each count in
<italic>C</italic>
is at least 1 and the sum of the counts is
<italic>m</italic>
. The next task is to enumerate, for all reads
<italic>r</italic>
, the suffixes of
<italic>r</italic>
at all positions
<italic>j</italic>
,
<inline-formula>
<mml:math id="M36" name="1471-2105-13-82-i36" overflow="scroll">
<mml:mrow>
<mml:mn>2 </mml:mn>
<mml:mo>≤ </mml:mo>
<mml:mi>j </mml:mi>
<mml:mo>≤ </mml:mo>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi></mml:mi>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
.
<italic>r</italic>
has |
<italic>r</italic>
| − ℓ
<sub>
<italic>min</italic>
</sub>
such suffixes. For each such suffix
<italic>s</italic>
(which by construction is of length ≥ ℓ
<sub>
<italic>min</italic>
</sub>
), one extracts two strings
<italic>v</italic>
 = 
<italic>s</italic>
[1…
<italic>k</italic>
′] and
<inline-formula>
<mml:math id="M37" name="1471-2105-13-82-i37" overflow="scroll">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. If
<italic>v</italic>
does not occur in
<italic>P</italic>
, then
<italic>v</italic>
is not a prefix of any read in
<inline-formula>
<mml:math id="M38" name="1471-2105-13-82-i38" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
and thus
<italic>s</italic>
is not a matching candidate and can be discarded. If
<italic>w</italic>
does not occur in
<italic>Q</italic>
,  then 
<inline-formula>
<mml:math id="M39" name="1471-2105-13-82-i39" overflow="scroll">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
for all reads
<inline-formula>
<mml:math id="M40" name="1471-2105-13-82-i40" overflow="scroll">
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
and thus
<italic>s</italic>
is not a matching candidate and can be discarded. Thus
<italic>P</italic>
and
<italic>Q</italic>
serve as filters to efficiently detect suffixes which can be discarded. For read
<italic>r</italic>
the suffixes
<italic>s</italic>
and corresponding strings
<italic>v</italic>
and
<italic>w</italic>
can be enumerated in
<italic>O</italic>
(|
<italic>r</italic>
| − ℓ
<sub>
<italic>min</italic>
</sub>
) time. Checking membership in
<italic>P</italic>
and in
<italic>Q</italic>
requires constant time. Therefore, each read
<italic>r</italic>
is processed in
<italic>O</italic>
(|
<italic>r</italic>
| − ℓ
<sub>
<italic>min</italic>
</sub>
) time. Thus the enumeration and checking requires
<inline-formula>
<mml:math id="M41" name="1471-2105-13-82-i41" overflow="scroll">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo></mml:mo>
<mml:mi>m</mml:mi>
<mml:mspace width="0.33em"></mml:mspace>
<mml:msub>
<mml:mi></mml:mi>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
time altogether.</p>
<p>The next task is to process a suffix, say
<italic>s</italic>
which has passed the
<italic>P</italic>
-filter and the
<italic>Q</italic>
-filter. That is,
<italic>v</italic>
 = 
<italic>s</italic>
[1…
<italic>k</italic>
′] ∈ 
<italic>P</italic>
and
<inline-formula>
<mml:math id="M42" name="1471-2105-13-82-i42" overflow="scroll">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mi>Q</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
holds. One now has to check if
<italic>u</italic>
 = 
<italic>s</italic>
[1…
<italic>k</italic>
] occurs in
<italic>K</italic>
to verify if
<italic>s</italic>
is a matching candidate. If the latter is true, the appropriate counter needs to be incremented. Hence this is the counting phase of our algorithm. The simplest way to check the occurrence of
<italic>u</italic>
in
<italic>K</italic>
, is to perform a binary search, taking
<italic>u</italic>
as the key. However, this would require
<italic>O</italic>
(
<italic>log</italic>
<sub>2</sub>
<italic>d</italic>
) time for each
<italic>k</italic>
-mer passing the filters. This is too slow. Using a hash table turned out to be too slow as well and would require too much extra space, which we do not want to afford.</p>
<p>We propose an efficient method that works as follows: Store each
<italic>k</italic>
-mer
<italic>s</italic>
[1.
<italic>k</italic>
] passing the
<italic>P</italic>
and the
<italic>Q</italic>
-filter in a buffer
<italic>B</italic>
of fixed size
<inline-formula>
<mml:math id="M43" name="1471-2105-13-82-i43" overflow="scroll">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>d</mml:mi>
<mml:mi>γ</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
for some constant
<italic>γ</italic>
 > 1. Once
<italic>B</italic>
is full or all
<italic>k</italic>
-mers have been added to
<italic>B</italic>
, sort the elements in
<italic>B</italic>
in ascending lexicographic order. Then perform a binary search in
<italic>K</italic>
, but only for the first element in
<italic>B</italic>
, say
<italic>x</italic>
. As
<italic>B</italic>
is sorted,
<italic>x</italic>
is the smallest element. The binary search for
<italic>x</italic>
in
<italic>K</italic>
finds the smallest element in
<italic>K</italic>
greater than or equal to
<italic>x</italic>
using
<italic>O</italic>
(
<italic>log</italic>
<sub>2 </sub>
<italic>d</italic>
) time. If such an element occurs in
<italic>K</italic>
, say at index
<italic>e</italic>
, then simultaneously scan
<italic>B</italic>
beginning with the first index and
<italic>K</italic>
beginning at index
<italic>e</italic>
. For any element in
<italic>B</italic>
that is equal to an element in
<italic>K</italic>
, say at index
<italic>i</italic>
in
<italic>K</italic>
, increment the counter in
<italic>C</italic>
at the same index.</p>
<p>This simultaneous linear scan of
<italic>B</italic>
and (a part of)
<italic>K</italic>
takes
<italic>O</italic>
(
<italic>b</italic>
 + 
<italic>d</italic>
) time and finds all
<italic>k</italic>
-mers from
<italic>B</italic>
occurring in
<italic>K</italic>
. Once the scan and the associated increments are done, the buffer is emptied for the next round. Suppose that there are in total
<italic>b</italic>
<sup></sup>
<italic>k</italic>
-mers that have passed
<italic>B</italic>
. Thus there are
<inline-formula>
<mml:math id="M44" name="1471-2105-13-82-i44" overflow="scroll">
<mml:mfenced open="⌈" close="⌉">
<mml:mfrac>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mi>b</mml:mi>
</mml:mfrac>
</mml:mfenced>
</mml:math>
</inline-formula>
rounds filling the buffer. Each round is associated with a sorting step, a binary search and a linear scan. Sorting requires
<italic>O</italic>
(
<italic>b</italic>
) time using radixsort. This gives a running time of
<inline-formula>
<mml:math id="M45" name="1471-2105-13-82-i45" overflow="scroll">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mi>b</mml:mi>
</mml:mfrac>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi>d</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>O</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mi>b</mml:mi>
</mml:mfrac>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>O</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mi>b</mml:mi>
</mml:mfrac>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>b</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>O</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>γ</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>O</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. As
<inline-formula>
<mml:math id="M46" name="1471-2105-13-82-i46" overflow="scroll">
<mml:mrow>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo></mml:mo>
<mml:mi> n</mml:mi>
<mml:mo>:</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, the running time is linear in the total length of the reads.</p>
<p>Once all reads have been processed, for any initial
<italic>k</italic>
-mer
<italic>u</italic>
of any read, the following holds: If
<italic>u</italic>
is the
<italic>i</italic>
th initial
<italic>k</italic>
-mer in
<italic>K</italic>
, then
<italic>C</italic>
[
<italic>i</italic>
] is the number of SPM-relevant suffixes of which
<italic>u</italic>
is a prefix. To prepare for the insertion phase, compute the partial sums of
<italic>C</italic>
in an array
<italic>π</italic>
of size
<italic>d</italic>
 + 1, such that
<italic>π</italic>
[0] = 
<italic>C</italic>
[0],
<inline-formula>
<mml:math id="M47" name="1471-2105-13-82-i47" overflow="scroll">
<mml:mrow>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
for all
<italic>i</italic>
, 1 ≤ 
<italic>i</italic>
 ≤ 
<italic>d</italic>
 − 1, and
<italic>π</italic>
[
<italic>d</italic>
] = 
<italic>π</italic>
[
<italic>d</italic>
 − 1].
<italic>π</italic>
[
<italic>d</italic>
] is the number of all SPM-relevant suffixes. One creates a table
<italic>S</italic>
of size
<italic>g</italic>
: = 
<italic>π</italic>
[
<italic>d</italic>
] to hold pairs of read numbers and read offsets. As in the counting phase, enumerate all suffixes of reads of length at least ℓ
<sub>
<italic>min</italic>
</sub>
passing the
<italic>P</italic>
- and the
<italic>Q</italic>
-filter. Suppose that
<italic>s</italic>
is such a suffix of read number
<italic>p</italic>
and with read offset
<italic>q</italic>
. Let
<italic>u</italic>
be the initial
<italic>k</italic>
-mer of
<italic>s</italic>
. Then we store (
<italic>p</italic>
, 
<italic>q</italic>
, 
<italic>u</italic>
) in a buffer
<italic>B</italic>
′ of fixed size
<inline-formula>
<mml:math id="M48" name="1471-2105-13-82-i48" overflow="scroll">
<mml:mfrac>
<mml:mi>b</mml:mi>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:math>
</inline-formula>
. We choose this buffer size, as the elements in
<italic>B</italic>
′ require twice as much space as the elements in
<italic>B</italic>
. As in the counting phase, sort the buffer in lexicographic order of the
<italic>k</italic>
-mers it stores, and then process the buffer elements using the
<italic>k</italic>
-mer, say
<italic>u</italic>
, as a key to determine if
<italic>u</italic>
matches some element in
<italic>K</italic>
, say at index
<italic>i</italic>
. Then insert (
<italic>p</italic>
, 
<italic>q</italic>
) at index
<italic>π</italic>
[
<italic>i</italic>
] − 1 in
<italic>S</italic>
and decrement
<italic>π</italic>
[
<italic>i</italic>
].</p>
<p>After all
<italic>b</italic>
<sup></sup>
elements passed the buffer and have been processed,
<italic>S</italic>
holds all SPM-relevant suffixes (represented by read numbers and read offsets) in lexicographic order of their initial
<italic>k</italic>
-mers. Let
<italic>u</italic>
be the
<italic>i</italic>
th
<italic>k</italic>
-mer in
<italic>K</italic>
. Then all SPM-relevant suffixes with common prefix
<italic>u</italic>
are stored in
<italic>S</italic>
from index
<italic>π</italic>
[
<italic>i</italic>
] to
<inline-formula>
<mml:math id="M49" name="1471-2105-13-82-i49" overflow="scroll">
<mml:mrow>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
. Thus
<italic>S</italic>
can uniquely be divided into buckets of SPM-relevant suffixes with the same initial
<italic>k</italic>
-mer. Each such bucket can be sorted independently from all other buckets. Moreover, each SPM-relevant suffix not occurring in the
<italic>i</italic>
th bucket, has an initial
<italic>k</italic>
-mer different from
<italic>u</italic>
and thus cannot have a common prefix of length ≥ ℓ
<sub>
<italic>min</italic>
</sub>
with the suffixes in the
<italic>i</italic>
th bucket. As a consequence, all suffix-prefix matches are derivable from pairs of SPM-relevant suffixes occurring in the same bucket. Thus, the suffix-prefix matches problem can be divided into
<italic>d</italic>
subproblems, each consisting of the computation of suffix-prefix matches from a bucket of SPM-relevant suffixes. This problem is considered later.</p>
<p>To sort the
<italic>i</italic>
th bucket one extracts the remaining suffixes relevant for sorting the bucket and stores them in a table. This strategy minimizes the number of slow random accesses to the reads. Consider the
<italic>i</italic>
th bucket and let (
<italic>p</italic>
, 
<italic>q</italic>
) be one of the suffixes in the bucket, referring to the suffix of read
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
at read offset
<italic>q</italic>
. Then extract the suffix of
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
starting at position
<italic>q</italic>
 + 
<italic>k</italic>
. As the maximum read length is considered to be constant, the total size of the remaining suffixes to be stored is
<inline-formula>
<mml:math id="M50" name="1471-2105-13-82-i50" overflow="scroll">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. The remaining suffixes can be sorted using radixsort in
<inline-formula>
<mml:math id="M51" name="1471-2105-13-82-i51" overflow="scroll">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
time. An additional linear time scan over the sorted suffixes of the bucket delivers a table
<italic>L</italic>
of size
<inline-formula>
<mml:math id="M52" name="1471-2105-13-82-i52" overflow="scroll">
<mml:mrow>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, such that
<italic>L</italic>
<sub>
<italic>j</italic>
</sub>
is the length of the longest common prefix of the suffixes
<inline-formula>
<mml:math id="M53" name="1471-2105-13-82-i53" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
and
<italic>S</italic>
[
<italic>π</italic>
[
<italic>i</italic>
] + 
<italic>j</italic>
] for all
<italic>j</italic>
,
<inline-formula>
<mml:math id="M54" name="1471-2105-13-82-i54" overflow="scroll">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo> ≤ </mml:mo>
<mml:mi>j </mml:mi>
<mml:mo>≤ </mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>Sorting all remaining suffixes and computing the lcp-table
<italic>L</italic>
thus requires
<italic>O</italic>
(
<italic>β</italic>
<sub>
<italic>max</italic>
</sub>
) space and
<inline-formula>
<mml:math id="M55" name="1471-2105-13-82-i55" overflow="scroll">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo></mml:mo>
<mml:mi>π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>O</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
time where
<italic>β</italic>
<sub>
<italic>max</italic>
</sub>
is the maximum size of a bucket and
<italic>g</italic>
is the total number of SPM-relevant suffixes. The bucket of sorted SPM-relevant suffixes and the corresponding table
<italic>L</italic>
are processed by Algorithm 2 described after the following implementation section and Algorithm 3 described in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 7.</p>
<p>All in all, our algorithm runs in
<italic>O</italic>
(
<italic>m</italic>
 + 
<italic>n</italic>
 + 
<italic>g</italic>
) = 
<italic>O</italic>
(
<italic>n</italic>
) time and
<italic>O</italic>
(
<italic>m</italic>
 + 4
<sup>
<italic>k</italic>
</sup>
 + 4
<sup>
<italic>k</italic>
′′</sup>
 + 
<italic>β</italic>
<sub>
<italic>max</italic>
</sub>
 + 
<italic>g</italic>
 + 
<italic>n</italic>
) space. As we choose
<italic>k</italic>
′′ ≤ 
<italic>k</italic>
′ ∈ 
<italic>O</italic>
(
<italic>log</italic>
<sub>2 </sub>
<italic>n</italic>
) and
<italic>m</italic>
,
<italic>g</italic>
, and
<italic>β</italic>
<sub>
<italic>max</italic>
</sub>
are all smaller than
<italic>n</italic>
, the space requirement is
<italic>O</italic>
(
<italic>n</italic>
). Thus the algorithm for identifying and sorting all (ℓ
<sub>
<italic>min</italic>
</sub>
, 
<italic>k</italic>
)-SPM-relevant suffixes is optimal.</p>
</sec>
<sec>
<title>Implementation</title>
<p>We will now describe how to efficiently implement the algorithm described above. An essential technique used in our algorithm are integer codes for
<italic>k</italic>
-mers. These are widely used in sequence processing. As we have three different mer-sizes (
<italic>k</italic>
, 
<italic>k</italic>
′, and
<italic>k″</italic>
) and dependencies between the corresponding integer codes, we shortly describe the technique here. In our problem, a
<italic>k</italic>
-mer always refers to a sequence of which it is a prefix. Therefore, we introduce integer codes for strings of length ≥ 
<italic>k</italic>
: For all strings
<italic>s</italic>
of length at least
<italic>k</italic>
define the integer code ϕ
<inline-formula>
<mml:math id="M56" name="1471-2105-13-82-i56" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mo>ϕ</mml:mo>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mspace width="0.12em"></mml:mspace>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
, where
<italic>ϕ</italic>
is the mapping [
<italic>A</italic>
 → 0, 
<italic>C</italic>
 → 1, 
<italic>G</italic>
 → 2, 
<italic>T</italic>
 → 3] uniquely assigning numbers from 0 to 3 to the bases in the alphabetical order of the bases. Note that only the first
<italic>k</italic>
symbols of
<italic>s</italic>
determine
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
), which is an integer in the range [0…4
<sup>
<italic>k</italic>
</sup>
 − 1]. For all strings
<italic>s</italic>
and
<italic>s</italic>
′ of length at least
<italic>k</italic>
,
<italic>s</italic>
 ⊴ 
<italic>s</italic>
′ implies
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
) ≤ 
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
′), where ⊴ denotes the lexicographic order of strings and ≤ denotes the order of integers.</p>
<p>Besides
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
, we use the encodings
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
and ϕ
<inline-formula>
<mml:math id="M57" name="1471-2105-13-82-i57" overflow="scroll">
<mml:msubsup>
<mml:mi>ϕ</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
</mml:math>
</inline-formula>
for some
<italic>k</italic>
′, 
<italic>k</italic>
 ″≤ 
<italic>k</italic>
.
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
encodes the prefix of
<italic>s</italic>
of length
<italic>k</italic>
′ and is defined in analogy to
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(replacing
<italic>k</italic>
by
<italic>k</italic>
′). ϕ
<inline-formula>
<mml:math id="M58" name="1471-2105-13-82-i58" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mi>ϕ</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
encodes the suffix
<inline-formula>
<mml:math id="M59" name="1471-2105-13-82-i59" overflow="scroll">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of
<italic>s</italic>
[1…
<italic>k</italic>
] of length
<italic>k</italic>
″, i.e. ϕ
<inline-formula>
<mml:math id="M60" name="1471-2105-13-82-i60" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mi>ϕ</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mspace width="0.12em"></mml:mspace>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi>ϕ</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
) and ϕ
<inline-formula>
<mml:math id="M61" name="1471-2105-13-82-i61" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mi></mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
can be computed from
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
) according to the following equations:</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M62" name="1471-2105-13-82-i62" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>ϕ</mml:mi>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>ϕ</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>'</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M63" name="1471-2105-13-82-i63" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mi>ϕ</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>ϕ</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mtext mathvariant="italic">mod</mml:mtext>
<mml:mspace width="0.12em"></mml:mspace>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>We implement
<italic>k</italic>
-mers by their integer codes. Each integer code can be computed in constant time by extracting the appropriate sequence of consecutive bit pairs from a 2bit per base encoding of the read set. In our implementation, we use the representation and the appropriate access functions from the
<italic>GtEncseq</italic>
software library [
<xref ref-type="bibr" rid="B16">16</xref>
]. As
<inline-formula>
<mml:math id="M64" name="1471-2105-13-82-i64" overflow="scroll">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo> ≤</mml:mo>
<mml:mfrac>
<mml:mi>ω</mml:mi>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
we can store each integer code in an integer of the machine’s word size. We sort
<italic>m</italic>
integer codes for the initial
<italic>k</italic>
-mers using quicksort, adapting the code from [
<xref ref-type="bibr" rid="B17">17</xref>
]. Our implementation works without recursion and uses an extra stack of size
<italic>O</italic>
(
<italic>log</italic>
<sub>2 </sub>
<italic>m</italic>
) to sort
<italic>m</italic>
integers. This small additional space requirement is the main reason to choose quicksort instead of radixsort, which is usually more than twice as fast, but requires
<italic>O</italic>
(
<italic>m</italic>
) extra working space, which we do not want to afford.</p>
<p>The sets
<italic>P</italic>
and
<italic>Q</italic>
are implemented by bit vectors
<italic>v</italic>
<sub>
<italic>P</italic>
</sub>
and
<italic>v</italic>
<sub>
<italic>Q</italic>
</sub>
of 4
<sup>
<italic>k</italic>
</sup>
and 4
<sup>
<italic>k</italic>
′′</sup>
bits, respectively. Bit
<italic>v</italic>
<sub>
<italic>P</italic>
</sub>
<italic>q</italic>
is set if and only if
<italic>q</italic>
 = 
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>r</italic>
) for some
<inline-formula>
<mml:math id="M65" name="1471-2105-13-82-i65" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
. Bit
<italic>v</italic>
<sub>
<italic>Q</italic>
</sub>
<italic>q</italic>
is set if and only if
<inline-formula>
<mml:math id="M66" name="1471-2105-13-82-i66" overflow="scroll">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi>ϕ</mml:mi>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>′′</mml:mo>
</mml:msup>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
for some read
<inline-formula>
<mml:math id="M67" name="1471-2105-13-82-i67" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
. To obtain the bit index, one computes
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
) and
<inline-formula>
<mml:math id="M68" name="1471-2105-13-82-i68" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mi>ϕ</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
from
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
) using Equations (1) and (2). Equation (1) can be implemented by a bitwise right shift of 2(
<italic>k</italic>
 − 
<italic>k</italic>
′) bits. Equation (2) can be implemented by a bitwise and operation with the integer 2
<sup>2
<italic>k</italic>
′′</sup>
 − 1. Thus, given the integer code for
<italic>s</italic>
, both
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
) and
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
<sup>
<italic>k</italic>
</sup>
(
<italic>s</italic>
) can be computed in constant time. Therefore, the sets
<italic>P</italic>
and
<italic>Q</italic>
can be constructed in
<italic>O</italic>
(
<italic>m</italic>
) time and each access takes constant time.</p>
<p>When determining the
<italic>k</italic>
-mer codes in the counting phase and in the insertion phase, we sweep a window of width
<italic>k</italic>
over the sequence reads and compute the integer code for the sequence in the current window in constant time.</p>
<p>We implement the counts by a byte array of size
<italic>d</italic>
and store counts larger than 255 in an additional hash table. Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 1 gives the details.</p>
<p>The partial sums in table
<italic>π</italic>
are bounded by
<italic>g</italic>
, the number of SPM-relevant suffixes. For large read sets,
<italic>g</italic>
can be larger than 2
<sup>32</sup>
 − 1. However, as the partial sum are strictly increasing, one can implement
<italic>π</italic>
by a 32 bit integer table
<italic>PS</italic>
of size
<italic>d</italic>
 + 1, such that
<italic>PS</italic>
[
<italic>i</italic>
] = 
<italic>π</italic>
[
<italic>i</italic>
] mod 2
<sup>32</sup>
for any
<italic>i</italic>
, 0 ≤ 
<italic>i</italic>
 ≤ 
<italic>d</italic>
and an additional integer table of size
<inline-formula>
<mml:math id="M69" name="1471-2105-13-82-i69" overflow="scroll">
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mo stretchy="true">{</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2 </mml:mn>
</mml:msub>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo></mml:mo>
<mml:mn>32</mml:mn>
<mml:mo stretchy="true">}</mml:mo>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
marking the boundaries of carry bits. Details are given in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 2.</p>
<p>For the insertion phase we need a representation of the read set (2
<italic>n</italic>
bits), table
<italic>K</italic>
(2
<italic>kd</italic>
bits), set
<italic>P</italic>
and
<italic>Q</italic>
(4
<sup>
<italic>k</italic>
</sup>
and 4
<sup>
<italic>k</italic>
′′</sup>
bits, respectively), table
<italic>π</italic>
(32(
<italic>d</italic>
 + 1) bits) and table
<italic>S</italic>
of size
<italic>g</italic>
. As
<italic>S</italic>
holds pairs of read numbers and read offsets, each entry in
<italic>S</italic>
is stored compactly in
<inline-formula>
<mml:math id="M70" name="1471-2105-13-82-i70" overflow="scroll">
<mml:mrow>
<mml:mi>σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi> m</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>+</mml:mo>
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>λ</mml:mi>
<mml:mtext>max</mml:mtext>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi mathvariant="script"></mml:mi>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula>
bits. This would give an integer array of size
<inline-formula>
<mml:math id="M71" name="1471-2105-13-82-i71" overflow="scroll">
<mml:mfenced open="⌈" close="⌉">
<mml:mfrac>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mspace width="0.33em"></mml:mspace>
<mml:mi>σ</mml:mi>
</mml:mrow>
<mml:mi>ω</mml:mi>
</mml:mfrac>
</mml:mfenced>
</mml:math>
</inline-formula>
if we would store
<italic>S</italic>
completely. But we do not, as we employ a partitioning strategy, explained next.</p>
<p>Although the data structures representing tables
<italic>S</italic>
,
<italic>K</italic>
,
<italic>P</italic>
and
<italic>π</italic>
are of different sizes, their access follows the same scheme: Suppose that
<italic>i</italic>
is the smallest index, such that
<inline-formula>
<mml:math id="M72" name="1471-2105-13-82-i72" overflow="scroll">
<mml:mrow>
<mml:mfrac>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:mfrac>
<mml:mo></mml:mo>
<mml:mi> π</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. Roughly half of the suffixes to be inserted in
<italic>S</italic>
are placed in buckets of lower order (with index ≤ 
<italic>i</italic>
) and the other half are placed in buckets of higher order (with index > 
<italic>i</italic>
). The buckets of lower order are associated with the
<italic>k</italic>
-mers in
<italic>K</italic>
up to index
<italic>i</italic>
. Hence, for these, one needs table
<italic>K</italic>
and
<italic>PS</italic>
only up to index
<italic>i</italic>
. Let
<italic>s</italic>
be some suffix of length ≤ ℓ
<sub>
<italic>min</italic>
</sub>
such that
<italic>ϕ</italic>
<sub>
<italic>k</italic>
</sub>
(
<italic>s</italic>
) ≤ 
<italic>K</italic>
[
<italic>i</italic>
]. To apply the
<italic>P</italic>
-filter to
<italic>s</italic>
, one checks
<italic>v</italic>
<sub>
<italic>P</italic>
</sub>
at index
<inline-formula>
<mml:math id="M73" name="1471-2105-13-82-i73" overflow="scroll">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>ϕ</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:mfrac>
<mml:mo></mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
, which is in the first half of vector
<italic>v</italic>
<sub>
<italic>P</italic>
</sub>
. This strategy, dividing tables
<italic>S</italic>
,
<italic>K</italic>
,
<italic>P</italic>
and
<italic>π</italic>
into
<italic>q</italic>
 = 2 parts of roughly the same size, can be generalized to
<italic>q</italic>
 > 2 parts. Each part is defined by a lower and an upper integer code and by corresponding lower and upper boundaries referring to sections of the four mentioned tables. Partitioning
<italic>S</italic>
means to only allocate the maximum space for holding all buckets belonging to a single part.</p>
<p>The four tables that can be split over
<italic>q</italic>
parts require
<italic>h</italic>
(
<italic>g</italic>
, 
<italic>k</italic>
, 
<italic>d</italic>
, 
<italic>k</italic>
′′, 
<italic>σ</italic>
) = 2
<italic>kd</italic>
 + 4
<sup>
<italic>k</italic>
</sup>
 + 32(
<italic>d</italic>
 + 1) + 
<italic>g σ</italic>
bits. Hence, in the insertion phase, our method requires
<inline-formula>
<mml:math id="M74" name="1471-2105-13-82-i74" overflow="scroll">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>n</mml:mi>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>σ</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
<mml:mi>q</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
bits, where 2
<italic>n</italic>
 + 4
<sup>
<italic>k</italic>
′′</sup>
bits are for the representation of the reads and the set
<italic>Q</italic>
(which cannot be split). As
<italic></italic>
dominates all other terms,
<italic>h</italic>
(
<italic>g</italic>
, 
<italic>k</italic>
, 
<italic>d</italic>
, 
<italic>k</italic>
′′, 
<italic>σ</italic>
) is much larger than 2
<italic>n</italic>
 + 4
<sup>
<italic>k</italic>
′′</sup>
so that the space gain of our partitioning strategy is obvious. As the space required for the insertion phase for any number of parts can be precalculated, one can choose a memory limit and calculate the minimal number of parts such that the limit is not exceeded. In particular, choosing the space peak of the counting phase as a memory limit for the insertion phase allows for balancing the space requirement of both phases. More details on the partitioning technique are given in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 3.</p>
<p>An obvious disadvantage of the partitioning strategy (with, say
<italic>q</italic>
, parts) is the requirement of
<italic>q</italic>
scans over the read set. However, the sequential scan over the read set is very fast in practice and only makes up for a small part of the running time of the insertion phase.</p>
<p>The expected size of a bucket to be sorted after the insertion phase is smaller than the average read length. The maximum bucket size (determining the space requirement for this phase) is 1 to 2 orders of magnitude smaller than
<italic>d</italic>
. As we can store
<inline-formula>
<mml:math id="M75" name="1471-2105-13-82-i75" overflow="scroll">
<mml:mfrac>
<mml:mi>ω</mml:mi>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:math>
</inline-formula>
bases in one integer of
<italic>ω</italic>
bits, the remaining suffixes (which form the sort keys) can be stored in
<inline-formula>
<mml:math id="M76" name="1471-2105-13-82-i76" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>β</mml:mi>
<mml:mi mathvariant="italic">max</mml:mi>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>λ</mml:mi>
<mml:mtext>max</mml:mtext>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mi>ω</mml:mi>
</mml:mfrac>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula>
integers, where
<italic>β</italic>
<sub>
<italic>max</italic>
</sub>
is the maximum size of a bucket and
<italic>λ</italic>
<sub>max</sub>
is the maximum length of a read. The additional constant 2 is for the length of the remaining suffix, for the read number and the read offset. The sort keys are thus sequences of integers of different length which have to be compared up to the longest prefix of the strings they encode. We use quicksort in which
<inline-formula>
<mml:math id="M77" name="1471-2105-13-82-i77" overflow="scroll">
<mml:mfrac>
<mml:mi>ω</mml:mi>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:math>
</inline-formula>
bases are compared using a single integer comparison. As a side effect of the comparison of the suffixes, we obtain the longest common prefix of two compared suffixes in constant extra time, and store this in a table
<italic>L</italic>
of the size of the bucket. The suffixes in the bucket and the table
<italic>L</italic>
are passed to Algorithm 2, described next, and to Algorithm 3( Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 7).</p>
</sec>
<sec>
<title>An efficient algorithm for computing suffix-prefix matches from buckets of sorted SPM-relevant suffixes</title>
<p>The input to the algorithm described next is a bucket of sorted SPM-relevant suffixes, with the corresponding table
<italic>L</italic>
, as computed by the algorithm of the previous subsection. Consider the
<italic>i</italic>
th bucket in
<italic>S</italic>
and let
<italic>H</italic>
<sub>
<italic>j</italic>
</sub>
 = 
<italic>S</italic>
[
<italic>π</italic>
[
<italic>i</italic>
] 
<italic>+ j</italic>
] be the
<italic>j</italic>
th suffix in this bucket for all
<italic>j</italic>
, 0 ≤ 
<italic>j</italic>
 ≤ 
<italic>β</italic>
 − 1 where
<italic>β</italic>
 = 
<italic>π</italic>
[
<italic>i</italic>
 + 1] − 
<italic>π</italic>
[
<italic>i</italic>
] is the size of the bucket. By construction, we have
<italic>H</italic>
<sub>
<italic>j</italic>
-1</sub>
 ⊴ 
<italic>Hj</italic>
,
<italic>L</italic>
<sub>
<italic>j</italic>
</sub>
 ≥ 
<italic>k</italic>
, and
<italic>L</italic>
<sub>
<italic>j</italic>
</sub>
is the length of the longest common prefix of
<italic>H</italic>
<sub>
<italic>j</italic>
−1</sub>
and
<italic>H</italic>
<sub>
<italic>j</italic>
</sub>
for
<italic>j</italic>
, 1 ≤ 
<italic>j</italic>
 ≤ 
<italic>β</italic>
 − 1.</p>
<p>Note that the bucket-wise computation does not deliver the lcp-values of pairs of SPM-relevant suffixes on the boundary of the buckets. That is, for all
<italic>i</italic>
 > 0, the length of the longest common prefix of
<italic>S</italic>
[
<italic>π</italic>
[
<italic>i</italic>
] − 1] and
<italic>S</italic>
[
<italic>π</italic>
[
<italic>i</italic>
]] is not computed, because
<italic>S</italic>
[
<italic>π</italic>
[
<italic>i</italic>
] − 1] is the last suffix of the (
<italic>i</italic>
 − 1)th bucket and
<italic>S</italic>
[
<italic>π</italic>
[
<italic>i</italic>
]] is the first suffix of the
<italic>i</italic>
th bucket. However, as both suffixes belong to two different buckets, their longest common prefix is smaller than
<italic>k</italic>
(and thus smaller than ℓ
<sub>
<italic>min</italic>
</sub>
) and therefore not of interest for the suffix-prefix matching problem.</p>
<p>The suffixes occurring in a bucket will be processed in nested intervals, called lcp-intervals, a notion introduced for enhanced suffix arrays by [
<xref ref-type="bibr" rid="B18">18</xref>
]. We generalize this notion to table
<italic>H</italic>
and
<italic>L</italic>
as follows: An interval
<italic>e</italic>
.
<italic>f</italic>
, 0 ≤ 
<italic>e</italic>
 < 
<italic>f</italic>
 ≤ 
<italic>β</italic>
 − 1, is an
<italic>lcp-interval of lcp-value</italic>
ℓ if the following holds:</p>
<p> • 
<disp-formula id="bmcM3">
<label>(3)</label>
<mml:math id="M78" name="1471-2105-13-82-i78" overflow="scroll">
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mtext>or</mml:mtext>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
<mml:mo><</mml:mo>
<mml:mi mathvariant="script">|</mml:mi>
<mml:mtext>,</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p> • 
<disp-formula id="bmcM4">
<label>(4)</label>
<mml:math id="M79" name="1471-2105-13-82-i79" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">|</mml:mi>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mtext>for all</mml:mtext>
<mml:mi> q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>q</mml:mi>
<mml:mo> ≤</mml:mo>
<mml:mi>f</mml:mi>
<mml:mtext>,</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p> • 
<disp-formula id="bmcM5">
<label>(5)</label>
<mml:math id="M80" name="1471-2105-13-82-i80" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="script">|</mml:mi>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mtext>for at least one</mml:mtext>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>q</mml:mi>
<mml:mo> ≤</mml:mo>
<mml:mi>f</mml:mi>
<mml:mtext>,</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p> • 
<disp-formula id="bmcM6">
<label>(6)</label>
<mml:math id="M81" name="1471-2105-13-82-i81" overflow="scroll">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>β</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mtext>or</mml:mtext>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo><</mml:mo>
<mml:mi mathvariant="script">|</mml:mi>
<mml:mtext>.</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>We will also use the notation ℓ − [
<italic>e</italic>
.
<italic>f</italic>
] for an lcp-interval [
<italic>e</italic>
.
<italic>f</italic>
] of lcp-value ℓ. If ℓ − [
<italic>e</italic>
.
<italic>f</italic>
] is an lcp-interval such that
<italic>w</italic>
 = 
<italic>H</italic>
<sub>
<italic>e</italic>
</sub>
[1…ℓ] is the longest common prefix of the suffixes
<italic>H</italic>
<sub>
<italic>e</italic>
</sub>
, 
<italic>H</italic>
<sub>
<italic>e</italic>
+1</sub>
, …, 
<italic>H</italic>
<sub>
<italic>f</italic>
</sub>
, then [
<italic>e</italic>
.
<italic>f</italic>
] is called the
<italic>w</italic>
-interval.</p>
<p>An lcp-interval ℓ′ − [
<italic>e</italic>
′.
<italic>f</italic>
′] is said to be
<italic>embedded</italic>
in an lcp-interval ℓ − [
<italic>e</italic>
.
<italic>f</italic>
] if it is a proper subinterval of ℓ − [
<italic>e</italic>
.
<italic>f</italic>
] (i.e., 
<italic>e</italic>
 ≤ 
<italic>e</italic>
′ < 
<italic>f</italic>
′ ≤ 
<italic>f</italic>
) and ℓ′ > ℓ. The lcp-interval ℓ − [
<italic>e</italic>
.
<italic>f</italic>
] is then said to be
<italic>enclosing</italic>
[
<italic>e</italic>
′.
<italic>f</italic>
′]. If [
<italic>e</italic>
.
<italic>f</italic>
] encloses [
<italic>e</italic>
′.
<italic>f</italic>
′] and there is no interval embedded in [
<italic>e</italic>
.
<italic>f</italic>
] that also encloses [
<italic>e</italic>
′.
<italic>f</italic>
′], then [
<italic>e</italic>
′.
<italic>f</italic>
′] is called a
<italic>child interval</italic>
of [
<italic>e</italic>
.
<italic>f</italic>
] and [
<italic>e</italic>
.
<italic>f</italic>
] is the
<italic>parent interval</italic>
of [
<italic>e</italic>
′.
<italic>f</italic>
′]. We distinguish lcp-intervals from singleton intervals [
<italic>e</italic>
′] for any
<italic>e</italic>
′, 0 ≤ 
<italic>e</italic>
′, ≤ 
<italic>β</italic>
 − 1. [
<italic>e</italic>
′] represents
<italic>H</italic>
<sub>
<italic>e</italic>
</sub>
. The parent interval of [
<italic>e</italic>
′] is the smallest lcp-interval [
<italic>e</italic>
.
<italic>f</italic>
] with
<italic>e</italic>
 ≤ 
<italic>e</italic>
′ ≤ 
<italic>f</italic>
.</p>
<p>This parent–child relationship of lcp-intervals with other lcp-intervals and singleton intervals constitutes a virtual tree which we call the
<italic>lcp-interval tree</italic>
for
<italic>H</italic>
and
<italic>L</italic>
. The root of this tree is the lcp-interval 0 − [0.(
<italic>β</italic>
 − 1)]. The implicit edges to lcp-intervals are called
<italic>branch</italic>
-edges. The implicit edges to singleton-intervals are called
<italic>leaf</italic>
-edges. Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 10 gives a comprehensive example illustrating these notions.</p>
<p>Abouelhoda et al. ([
<xref ref-type="bibr" rid="B18">18</xref>
], Algorithm 4.4) present a linear time algorithm to compute the implicit branch-edges of the lcp-interval tree in bottom-up order. When applied to a bucket of sorted suffixes, the algorithm performs a linear scan of tables
<italic>H</italic>
and
<italic>L</italic>
. In the
<italic>e</italic>
th iteration, 0 ≤ 
<italic>e</italic>
 ≤ 
<italic>β</italic>
 − 2, it accesses the value
<italic>L</italic>
<sub>
<italic>e</italic>
+1</sub>
and
<italic>H</italic>
<sub>
<italic>e</italic>
</sub>
. We have non-trivially extended the algorithm to additionally deliver leaf edges. The pseudocode, with some additions in the lines marked as
<italic>new</italic>
, is given in Algorithm 1 (Figure
<xref ref-type="fig" rid="F1">1</xref>
). We use the following notation and operations:</p>
<p> • A stack stores triples (ℓ, 
<italic>e</italic>
, 
<italic>f</italic>
) representing an lcp-interval ℓ − [
<italic>e</italic>
.
<italic>f</italic>
]. To access the elements of such a triple, say
<italic>sv</italic>
, we use the notation
<italic>sv</italic>
.
<italic>lcp</italic>
(for the lcp-value ℓ),
<italic>sv</italic>
.
<italic>lb</italic>
(for the left boundary
<italic>e</italic>
) and
<italic>sv</italic>
.
<italic>rb</italic>
(for the right boundary
<italic>f</italic>
).</p>
<p> • 
<italic>stack</italic>
.
<italic>push</italic>
(
<italic>e</italic>
) pushes an element
<italic>e</italic>
onto the stack.</p>
<p> • 
<italic>stack</italic>
.
<italic>pop</italic>
pops an element from the stack and returns it.</p>
<p> • 
<italic>stack</italic>
.
<italic>top</italic>
returns a reference to the topmost element of the stack.</p>
<p> • ⊥ stands for an undefined value.</p>
<p> • 
<italic>process</italic>
_
<italic>leafedge</italic>
(
<italic>firstedge</italic>
, 
<italic>itv</italic>
, (
<italic>p</italic>
, 
<italic>q</italic>
)) processes an edge from the lcp-interval
<italic>itv</italic>
to the singleton interval representing the suffix
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
[
<italic>q</italic>
…|
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
|].
<italic>firstedge</italic>
is true if and only if the edge is the first processed edge outgoing from
<italic>itv</italic>
.</p>
<p> • 
<italic>process</italic>
_
<italic>branchedge</italic>
(
<italic>firstedge</italic>
, 
<italic>itv</italic>
, 
<italic>itv’</italic>
) processes an edge from the lcp-interval
<italic>itv</italic>
to the lcp-interval
<italic>itv’</italic>
. The value
<italic>itv’</italic>
.
<italic>rb</italic>
is defined and
<italic>firstedge</italic>
is true if and only if the edge is the first edge outgoing from
<italic>itv</italic>
.</p>
<p> • 
<italic>process</italic>
_
<italic>lcpinterval</italic>
(
<italic>itv</italic>
) processes the lcp-interval
<italic>itv</italic>
. The value
<italic>itv</italic>
.
<italic>rb</italic>
is defined.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Algorithm 1.</bold>
Bottom-up traversal algorithm for arrays of SPM-relevant suffixes. This is an extension of [18, Algorithm 4.4] with the additional lines marked as new.</p>
</caption>
<graphic xlink:href="1471-2105-13-82-1"></graphic>
</fig>
<p>Depending on the application, we use different functions
<italic>process</italic>
_
<italic>leafedge</italic>
,
<italic>process</italic>
_
<italic>branchedge</italic>
, and
<italic>process</italic>
_
<italic>lcpinterval</italic>
.</p>
<p>Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 4, explains why Algorithm 1 also delivers the leaf edges of the lcp-interval tree in the correct bottom-up order.</p>
<p>Consider a path in the lcp-interval tree from the root to a singleton interval [
<italic>e</italic>
′] representing
<italic>H</italic>
<sub>
<italic>e</italic>
</sub>
 = 
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
[
<italic>q</italic>
…|
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
|]. Let ℓ − [
<italic>e</italic>
.
<italic>f</italic>
] be an lcp-interval on this path, and consider the edge on this path outgoing from ℓ − [
<italic>e</italic>
.
<italic>f</italic>
]. If the edge goes to an lcp-interval of, say lcp-value ℓ′, then the edge is implicitly labeled by the non-empty sequence
<inline-formula>
<mml:math id="M82" name="1471-2105-13-82-i82" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. Suppose the edge goes to a singleton interval: Then the edge is implicitly labeled by the non-empty sequence
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
[
<italic>q</italic>
 + ℓ…|
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
| − 1]$
<sub>
<italic>p</italic>
</sub>
. If
<italic>q</italic>
 + ℓ = |
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
|, then
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
[
<italic>q</italic>
 + ℓ…|
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
| − 1] is the empty string, which implies that the edge to the singleton interval is labeled by the sentinel $
<sub>
<italic>p</italic>
</sub>
. Such an edge is a
<italic>terminal edge</italic>
for
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
. If the read offset
<italic>q</italic>
is 0, we call [
<italic>e</italic>
′] a
<italic>whole</italic>
-
<italic>read interval</italic>
for
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
, and the path in the lcp-interval tree from the root to [
<italic>e</italic>
′] a
<italic>whole</italic>
-
<italic>read path</italic>
for
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
.</p>
<p>Consider a suffix-prefix match 〈 
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
, 
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
, ℓ 〉 , such that the suffix
<italic>w</italic>
of
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
of length ℓ has a prefix
<italic>u</italic>
of length
<italic>k</italic>
. Recall that
<italic>u</italic>
is the common prefix of all suffixes in the
<italic>i</italic>
th bucket. Due to the implicit padding of reads at their end, the symbol following
<italic>w</italic>
as a suffix of
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
is $
<sub>
<italic>p</italic>
</sub>
. By definition,
<italic>w</italic>
is also prefix of
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
and the symbol in
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
following this occurrence of
<italic>w</italic>
is different from $
<sub>
<italic>p</italic>
</sub>
. Thus, there is a
<italic>w</italic>
-interval [
<italic>e</italic>
.
<italic>f</italic>
] in the lcp-interval tree for
<italic>H</italic>
and
<italic>L</italic>
. [
<italic>e</italic>
.
<italic>f</italic>
] is on the path from the root-interval to the whole-read leaf interval for
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
. Moreover, there is a terminal edge for
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
outgoing from [
<italic>e</italic>
.
<italic>f</italic>
]. Vice versa, an lcp-interval of lcp-value ℓ on the path to the whole-read leaf interval for
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
and with a terminal edge for
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
identifies the suffix-prefix match 〈 
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
, 
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
, ℓ 〉 . This observation about suffix-prefix matches is exploited in Algorithm 2 (Figure
<xref ref-type="fig" rid="F2">2</xref>
) which performs a bottom-up traversal of the lcp-interval tree for
<italic>H</italic>
and
<italic>L</italic>
, collecting whole-read leaves and terminal edges for lcp-intervals of lcp-value at least ℓ
<sub>
<italic>min</italic>
</sub>
. More precisely, whenever a whole-read leaf for
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
, 1 ≤ 
<italic>p</italic>
 ≤ 
<italic>m</italic>
, is found (line 9),
<italic>p</italic>
is appended to the list
<italic>W</italic>
. With each lcp-interval
<italic>itv</italic>
on the stack used in the bottom-up traversal, an integer
<italic>itv</italic>
.
<italic>firstinW</italic>
is associated. The elements in
<italic>W</italic>
[
<italic>itv</italic>
.
<italic>firstinW</italic>
…|
<italic>W</italic>
|] are exactly the read numbers of whole-read leaves collected for
<italic>itv</italic>
. The value of
<italic>itv</italic>
.
<italic>firstinW</italic>
is set whenever the first edge outgoing from
<italic>itv</italic>
is detected: If the first edge outgoing from
<italic>itv</italic>
is a leaf-edge, no previous whole-read leaf for
<italic>itv</italic>
has been processed: Thus |
<italic>W</italic>
| + 1 is the first index in list
<italic>W</italic>
where the whole read leaf information (if any) for
<italic>itv</italic>
will be stored (see line 8). If the first edge is a branch-edge to lcp-interval
<italic>itv</italic>
′, then the corresponding subset of
<italic>W</italic>
for
<italic>itv</italic>
′ must be inherited to
<italic>itv</italic>
. Technically, this is achieved by inheriting the
<italic>firstinW</italic>
-attribute from
<italic>itv</italic>
′ to
<italic>itv</italic>
, see line 18 of Algorithm 2.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Algorithm 2.</bold>
Bottom-up traversal of lcp-interval tree enumerating suffix-prefix matches.</p>
</caption>
<graphic xlink:href="1471-2105-13-82-2"></graphic>
</fig>
<p>Whenever a terminal edge for read
<italic>r</italic>
<sub>
<italic>p</italic>
</sub>
, outgoing from an interval
<italic>itv</italic>
is processed (line 11),
<italic>p</italic>
is added to the list
<italic>T</italic>
. Suppose that this terminal edge is outgoing from the lcp-interval
<italic>itv</italic>
. The first symbol of the label of the terminal edge is $
<sub>
<italic>p</italic>
</sub>
. Suppose there is a branch-edge outgoing from
<italic>itv</italic>
to some lcp-interval
<italic>itv</italic>
′. Then the first symbol, say
<italic>a</italic>
, of the implicit label of this edge must occur more than once. Thus it cannot be a sentinel, as these are considered different in the lexicographic ordering of the suffixes. Hence the first symbol
<italic>a</italic>
is either
<italic>A</italic>
,
<italic>C</italic>
,
<italic>G</italic>
or
<italic>T</italic>
. As these symbols are, with respect to the lexicographic order, smaller than the sentinels, the branch-edge from
<italic>itv</italic>
to
<italic>itv</italic>
’ appears before the terminal edge from
<italic>itv</italic>
. Hence the terminal edges outgoing from
<italic>itv</italic>
′ have been processed in line 25, and so we only need a single list
<italic>T</italic>
for the entire algorithm.</p>
<p>As soon as all edges outgoing from
<italic>itv</italic>
have been processed, we have collected the terminal edges in
<italic>T</italic>
and the whole-read leaves in
<italic>W</italic>
. If
<italic>itv</italic>
.
<italic>lcp</italic>
exceeds the minimum length, Algorithm 2 computes the cartesian product of
<italic>T</italic>
with the appropriate subset of
<italic>W</italic>
and processes the corresponding suffix-prefix matches of length
<italic>itv</italic>
.
<italic>lcp</italic>
in line 25. At this point suffix-prefix matches may be output or post-processed to check for additional constraints, such as transitivity. Once the cartesian product has been computed, the elements from
<italic>T</italic>
are no longer needed and
<italic>T</italic>
is emptied (line 26). Note that the algorithm empties
<italic>W</italic>
once an lcp-interval of lcp-value smaller than ℓ
<sub>
<italic>min</italic>
</sub>
is processed. After this event, there will only be terminal edges from
<italic>v</italic>
-intervals such that the longest common prefix of
<italic>v</italic>
and the reads in
<italic>W</italic>
is smaller than ℓ
<sub>
<italic>min</italic>
</sub>
. Therefore there will be no suffix-prefix match of the form 〈_, 
<italic>w</italic>
, ℓ〉 such that ℓ ≥ ℓ
<sub>
<italic>min</italic>
</sub>
and
<italic>w</italic>
is a read represented in
<italic>W</italic>
. So the list can safely be emptied.</p>
<p>The lcp-interval tree for
<italic>H</italic>
and
<italic>L</italic>
contains
<italic>β</italic>
leaf-edges. As all lcp-intervals have at least two children, there are at most
<italic>β</italic>
 − 1 branch-edges and
<italic>β</italic>
lcp-intervals. As each of the three functions specified in Algorithm 2 is called once for every corresponding item, the number of functions calls is at most 3
<italic>β</italic>
 − 1. Recall that Algorithm 2 is applied to each bucket and the total size of all buckets is
<italic>g</italic>
. Hence there are at most 3
<italic>g</italic>
 − 
<italic>d</italic>
calls to the three functions.
<italic>process</italic>
_
<italic>leafedge</italic>
and
<italic>process</italic>
_
<italic>branchedge</italic>
run in constant time. The running time of
<italic>process</italic>
_
<italic>lcpinterval</italic>
is determined by the number of
<italic>SPM</italic>
s processed. Assuming that the processing takes constant time, the overall running time of Algorithm 2 for all buckets is
<italic>O</italic>
(
<italic>g</italic>
 + 
<italic>z</italic>
) where
<italic>z</italic>
is the number of processed
<italic>SPM</italic>
s.</p>
</sec>
<sec>
<title>Handling reverse complements of reads</title>
<p>Reads may originate from both strands of a DNA molecule. For this reason, suffix-prefix matches shall also be computed between reads and reverse complements of other reads. Handling the reverse complements of all reads is conceptually easy to integrate into our approach: One just has to process
<inline-formula>
<mml:math id="M83" name="1471-2105-13-82-i83" overflow="scroll">
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
instead of
<inline-formula>
<mml:math id="M84" name="1471-2105-13-82-i84" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
.</p>
<p>The three steps which involve scanning the reads are extended to process both strands of all reads. This does not require doubling the size of the read set representation, as all information for the reverse complemented reads can efficiently be extracted from the forward reads. Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 5, shows how to compute the integer codes for the reversed reads from the integer codes of the forward reads in constant time.</p>
<p>The scan of the reverse complemented reads has a negligible impact on the runtime. Of course, the size of the table
<italic>S</italic>
,
<italic>K</italic>
and
<italic>PS</italic>
roughly doubles when additionally considering reverse complements.</p>
<p>When computing suffix-prefix matches some minor modifications are necessary: Applying Algorithm 2 to
<inline-formula>
<mml:math id="M85" name="1471-2105-13-82-i85" overflow="scroll">
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
finds all
<italic>SPM</italic>
s, including some redundant ones, which we want to omit. This is formalized as follows: an
<italic>SPM</italic>
<inline-formula>
<mml:math id="M86" name="1471-2105-13-82-i86" overflow="scroll">
<mml:mrow>
<mml:mo></mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>M</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
is
<italic>non-redundant</italic>
if and only if one of the following conditions is true:</p>
<p> • 
<disp-formula id="bmcM7">
<label>(7)</label>
<mml:math id="M87" name="1471-2105-13-82-i87" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p> • 
<inline-formula>
<mml:math id="M88" name="1471-2105-13-82-i88" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p> • 
<inline-formula>
<mml:math id="M89" name="1471-2105-13-82-i89" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mtext>.</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>For any
<italic>SPM</italic>
, these conditions can easily be checked in constant time, see Algorithm 3 ( Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 7).</p>
</sec>
</sec>
<sec>
<title>Recognition of transitive and irreducible suffix-prefix matches</title>
<p>For the construction of the string graph, we do not need transitive
<italic>SPM</italic>
s. An
<italic>SPM</italic>
<inline-formula>
<mml:math id="M90" name="1471-2105-13-82-i90" overflow="scroll">
<mml:mfenced open="〈" close="〉">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>
is
<italic>transitive</italic>
if and only if there are two
<italic>SPM</italic>
s 〈
<italic>r</italic>
, 
<italic>s</italic>
, ℓ〉 and 〈
<italic>s</italic>
, 
<italic>t</italic>
, ℓ′〉 such that ℓ + ℓ′ = |
<italic>s</italic>
|+ ℓ″. Figure
<xref ref-type="fig" rid="F3">3</xref>
shows a concrete example of a transitive
<italic>SPM</italic>
. An
<italic>SPM</italic>
which is not transitive is
<italic>irreducible</italic>
.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Example of a transitive suffix-prefix match.</bold>
An example of a transitive SPM. A set of three reads with a transitive SPM 〈
<italic>r</italic>
, 
<italic>t</italic>
, 10〉 derived from 〈
<italic>s</italic>
, 
<italic>t</italic>
, 16〉.</p>
</caption>
<graphic xlink:href="1471-2105-13-82-3"></graphic>
</fig>
<p>The following theorem characterizes an
<italic>SPM</italic>
by a read and a single irreducible
<italic>SPM</italic>
satisfying a length constraint and a match constraint, see Figure
<xref ref-type="fig" rid="F4">4</xref>
for an illustration.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Illustration of transitivity of a suffix-prefix match.</bold>
Schematic illustration of transitivity.
<inline-formula>
<mml:math id="M91" name="1471-2105-13-82-i91" overflow="scroll">
<mml:mfenced open="〈" close="〉">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>
is a transitive SPM derived from the SPM 〈
<italic>s</italic>
, 
<italic>t</italic>
, ℓ’〉. Hence, the prefix v of s is a suffix of
<inline-formula>
<mml:math id="M92" name="1471-2105-13-82-i92" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mtext>1…</mml:mtext>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-82-4"></graphic>
</fig>
<p>
<bold>Theorem 1.</bold>
Let 〈
<italic>r</italic>
, 
<italic>t</italic>
, ℓ″〉 be an
<italic>SPM</italic>
. Then 〈
<italic>r</italic>
, 
<italic>t</italic>
, ℓ″〉 is transitive if and only if there is an
<inline-formula>
<mml:math id="M93" name="1471-2105-13-82-i93" overflow="scroll">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
and an irreducible
<italic>SPM</italic>
<italic>s</italic>
, 
<italic>t</italic>
, ℓ′〉 such that ℓ′ > ℓ″, |
<italic>r</italic>
| − ℓ″ ≥ |
<italic>s</italic>
| − ℓ′ and
<italic>s</italic>
[1…|
<italic>s</italic>
| − ℓ′] = 
<italic>r</italic>
[|
<italic>r</italic>
| − ℓ″ − (|
<italic>s</italic>
| − ℓ′) + 1…|
<italic>r</italic>
| − ℓ″].</p>
<p>The proof of Theorem 1 can be found in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 6.</p>
<p>If the
<italic>SPM</italic>
<italic>r</italic>
, 
<italic>t</italic>
, ℓ″〉 is transitive and 〈
<italic>s</italic>
, 
<italic>t</italic>
, ℓ′〉 is the
<italic>SPM</italic>
satisfying the conditions of Theorem 1, then we say that 〈
<italic>r</italic>
, 
<italic>t</italic>
, ℓ″〉
<italic>is derived from</italic>
<italic>s</italic>
, 
<italic>t</italic>
, ℓ′〉.</p>
<p>Theorem 1 suggests a way to decide the transitivity of an
<italic>SPM</italic>
<italic>r</italic>
, 
<italic>t</italic>
, ℓ〉: Check if there is an irreducible
<italic>SPM</italic>
<italic>s</italic>
, 
<italic>t</italic>
, ℓ′〉 from which it is derived. The check involves comparison of up to |
<italic>s</italic>
| − ℓ′ symbols to verify if
<italic>s</italic>
[1…|
<italic>s</italic>
| − ℓ′] is a suffix of
<inline-formula>
<mml:math id="M94" name="1471-2105-13-82-i94" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. As there may be several irreducible
<italic>SPM</italic>
s from which 〈
<italic>r</italic>
, 
<italic>t</italic>
, ℓ″〉 may be derived, it is necessary to store the corresponding left contexts: For any sequence
<italic>s</italic>
and any ℓ′, 1 ≤ ℓ′ < |
<italic>s</italic>
|, the
<italic>left context LC</italic>
(
<italic>s</italic>
, ℓ′) of
<italic>s</italic>
is the non-empty string
<italic>s</italic>
[1…|
<italic>s</italic>
| − ℓ′].</p>
<p>Due to the bottom-up nature of the traversal in Algorithm 2, the
<italic>SPM</italic>
s involving the different prefixes of a given read are enumerated in order of match length, from the longest to the shortest one. Thus, Algorithm 2 first delivers the irreducible
<italic>SPM</italic>
<italic>s</italic>
, 
<italic>t</italic>
, ℓ′〉 from which
<inline-formula>
<mml:math id="M95" name="1471-2105-13-82-i95" overflow="scroll">
<mml:mfenced open="〈" close="〉">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>
is possibly derived, because
<inline-formula>
<mml:math id="M96" name="1471-2105-13-82-i96" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo>></mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>From Theorem 1 one can conclude that the first
<italic>SPM</italic>
, say 〈
<italic>s</italic>
, 
<italic>t</italic>
, ℓ′〉, found on a whole-read path for
<italic>t</italic>
is always irreducible. Hence, one stores
<italic>LC</italic>
(
<italic>s</italic>
, ℓ′). An
<italic>SPM</italic>
<italic>r</italic>
, 
<italic>t</italic>
, ℓ″〉 detected later while traversing the same whole-read path for
<italic>t</italic>
is classified as transitive if and only if
<italic>LC</italic>
(
<italic>s</italic>
, ℓ′) is a suffix of
<italic>LC</italic>
(
<italic>r</italic>
, ℓ″) (see Figure
<xref ref-type="fig" rid="F5">5</xref>
for an illustration). If 〈
<italic>r</italic>
, 
<italic>t</italic>
, ℓ″〉 is transitive it is discarded. Otherwise,
<italic>LC</italic>
(
<italic>r</italic>
, ℓ″) must be stored as well to check the transitivity of the
<italic>SPM</italic>
s found later for the same whole-read path. So each
<italic>SPM</italic>
is either classified as transitive, or irreducible, in which case a left context is stored. To implement this method, we use a dictionary
<italic>D</italic>
of left contexts, with an operation
<italic>LCsearch</italic>
(
<italic>D</italic>
, 
<italic>s</italic>
), which returns
<italic>true</italic>
if there is some
<italic>t</italic>
 ∈ 
<italic>D</italic>
such that
<italic>t</italic>
is a suffix of
<italic>s</italic>
. Otherwise, it adds
<italic>s</italic>
to
<italic>D</italic>
and returns
<italic>false</italic>
. Such a dictionary can, for example, be implemented by a trie [
<xref ref-type="bibr" rid="B19">19</xref>
] storing the left contexts in reverse order. In our implementation we use a blind-trie [
<xref ref-type="bibr" rid="B20">20</xref>
]. In Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 7 we present a modification of Algorithm 2 to output non-redundant irreducible
<italic>SPM</italic>
s only.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Transitivity and left contexts.</bold>
Transitivity and left contexts. Let the SPM 〈
<italic>t</italic>
, 
<italic>r</italic>
, |
<italic>z</italic>
|〉 be derived from 〈
<italic>s</italic>
, 
<italic>r</italic>
, |
<italic>zz</italic>
’|〉. Hence the left context
<italic>LC</italic>
(
<italic>s</italic>
, |
<italic>zz</italic>
’|) = 
<italic>xy</italic>
is a suffix of the left context
<italic>LC</italic>
(
<italic>t</italic>
, |
<italic>z</italic>
|) = 
<italic>wxy</italic>
. Let 〈
<italic>u</italic>
, 
<italic>r</italic>
, |
<italic>z</italic>
|〉 be an irreducible SPM. Then
<italic>LC</italic>
(
<italic>u</italic>
, |
<italic>yz</italic>
|) = 
<italic>vy</italic>
for some non empty string
<italic>v</italic>
and
<italic>LC</italic>
(
<italic>s</italic>
, |
<italic>zz</italic>
’|) is not a suffix of
<italic>vy</italic>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-82-5"></graphic>
</fig>
</sec>
<sec>
<title>Recognition of internally contained reads</title>
<p>At the beginning of the methods section we have shown how to detect reads which are prefixes or suffixes of other reads. When constructing the string graph we also have to discard internally contained reads, which are contained in other reads without being a suffix or a prefix. More precisely,
<inline-formula>
<mml:math id="M97" name="1471-2105-13-82-i97" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
is
<italic>internally contained</italic>
, if a read
<inline-formula>
<mml:math id="M98" name="1471-2105-13-82-i98" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
exists, such that
<italic>r</italic>
′ = 
<italic>urw</italic>
for some non-empty strings
<italic>u</italic>
and
<italic>v</italic>
. In Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 8, we show how to efficiently detect internally contained reads.</p>
</sec>
<sec>
<title>Construction of the assembly string graph</title>
<p>Consider a read set
<inline-formula>
<mml:math id="M99" name="1471-2105-13-82-i99" overflow="scroll">
<mml:mi mathvariant="script">R</mml:mi>
</mml:math>
</inline-formula>
which is suffix- and prefix-free. The assembly string graph [
<xref ref-type="bibr" rid="B8">8</xref>
] is a graph of the relationships between the reads, constructed from
<inline-formula>
<mml:math id="M100" name="1471-2105-13-82-i100" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:msup>
<mml:mi>M</mml:mi>
<mml:mi mathvariant="italic">nr</mml:mi>
</mml:msup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, the set of all non-redundant irreducible
<italic>SPM</italic>
s from
<inline-formula>
<mml:math id="M101" name="1471-2105-13-82-i101" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>M</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
restricted to reads which are not internally contained.</p>
<p>For each
<inline-formula>
<mml:math id="M102" name="1471-2105-13-82-i102" overflow="scroll">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
the graph contains two vertices denoted by
<italic>r</italic>
.
<italic>B</italic>
and
<italic>r</italic>
.
<italic>E</italic>
, representing the two extremities of the read.
<italic>B</italic>
stands for begin,
<italic>E</italic>
stands for end.</p>
<p>For each non-redundant irreducible
<italic>SPM</italic>
<inline-formula>
<mml:math id="M103" name="1471-2105-13-82-i103" overflow="scroll">
<mml:mrow>
<mml:mo></mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:msup>
<mml:mi>M</mml:mi>
<mml:mi mathvariant="italic">nr</mml:mi>
</mml:msup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
satisfying ℓ ≥ ℓ
<sub>
<italic>min</italic>
</sub>
, the graph contains two directed edges, defined as follows:</p>
<p>1. if
<inline-formula>
<mml:math id="M104" name="1471-2105-13-82-i104" overflow="scroll">
<mml:mrow>
<mml:mo></mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:msup>
<mml:mi>M</mml:mi>
<mml:mi mathvariant="italic">nr</mml:mi>
</mml:msup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
there are two edges:</p>
<p> • 
<italic>r</italic>
.
<italic>E</italic>
 → 
<italic>s</italic>
.
<italic>E</italic>
with edge label
<italic>s</italic>
[ℓ + 1…|
<italic>s</italic>
|]</p>
<p> • 
<italic>s</italic>
.
<italic>B</italic>
 → 
<italic>r</italic>
.
<italic>B</italic>
with edge label
<inline-formula>
<mml:math id="M105" name="1471-2105-13-82-i105" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>2. if
<inline-formula>
<mml:math id="M106" name="1471-2105-13-82-i106" overflow="scroll">
<mml:mrow>
<mml:mo></mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mover accent="true">
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:msup>
<mml:mi>M</mml:mi>
<mml:mi mathvariant="italic">nr</mml:mi>
</mml:msup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
there are two edges:</p>
<p> • 
<italic>r</italic>
.
<italic>E</italic>
 → 
<italic>s</italic>
.
<italic>B</italic>
with edge label
<inline-formula>
<mml:math id="M107" name="1471-2105-13-82-i107" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p> • 
<italic>s</italic>
.
<italic>E</italic>
 → 
<italic>r</italic>
.
<italic>B</italic>
with edge label
<inline-formula>
<mml:math id="M108" name="1471-2105-13-82-i108" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="true">|</mml:mo>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>3. if
<inline-formula>
<mml:math id="M109" name="1471-2105-13-82-i109" overflow="scroll">
<mml:mrow>
<mml:mo></mml:mo>
<mml:mover accent="true">
<mml:mi>s</mml:mi>
<mml:mo stretchy="true"></mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:msup>
<mml:mi>M</mml:mi>
<mml:mi mathvariant="italic">nr</mml:mi>
</mml:msup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
there are two edges:</p>
<p> • 
<italic>r</italic>
.
<italic>B</italic>
 → 
<italic>s</italic>
.
<italic>E</italic>
with edge label
<italic>s</italic>
[ℓ + 1…|
<italic>s</italic>
|]</p>
<p> • 
<italic>s</italic>
.
<italic>B</italic>
 → 
<italic>r</italic>
.
<italic>E</italic>
with edge label
<italic>r</italic>
[ℓ + 1…|
<italic>r</italic>
|]</p>
<p>In our implementation of the string graph, vertices are represented by integers from 0 to 2
<italic>m</italic>
 − 1. To construct the graph from the list of non-redundant irreducible
<italic>SPM</italic>
s, we first calculate the outdegree of each vertex. From the counts we calculate partial sums. In a second scan over the list of
<italic>SPM</italic>
s, we insert the edges in an array of size 2
<italic>ρ</italic>
, where
<inline-formula>
<mml:math id="M110" name="1471-2105-13-82-i110" overflow="scroll">
<mml:mrow>
<mml:mi>ρ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="true">|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:msup>
<mml:mi>M</mml:mi>
<mml:mi mathvariant="italic">nr</mml:mi>
</mml:msup>
<mml:mo stretchy="true">(</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="script">R</mml:mi>
<mml:mo stretchy="true">¯</mml:mo>
</mml:mover>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo stretchy="true">|</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. This strategy allows to allocate exactly the necessary space for the edges and to access the first edge outgoing from a vertex in constant time. The array of edges is stored compactly using
<inline-formula>
<mml:math id="M111" name="1471-2105-13-82-i111" overflow="scroll">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>ρ</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>m</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mo>+</mml:mo>
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>λ</mml:mi>
<mml:mtext>max</mml:mtext>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
bits, where
<italic>λ</italic>
<sub>max</sub>
is the maximum length of a read.
<inline-formula>
<mml:math id="M112" name="1471-2105-13-82-i112" overflow="scroll">
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>m</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>
bits are used for the destination of an edge (the source of the edge is clear from the array index where the edge is stored).
<inline-formula>
<mml:math id="M113" name="1471-2105-13-82-i113" overflow="scroll">
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>λ</mml:mi>
<mml:mtext>max</mml:mtext>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>
bits are used for the length of the edge label.</p>
<p>To output the contigs, we first write references (such as read numbers and edge lengths) to a temporary file. Once this is completed, the memory for the string graph is deallocated, and the read sequences are mapped into memory. Finally, the sequences of the contigs are derived from the references and the contigs are output.</p>
<p>To verify the correctness of our string graph implementation and to allow comparison with other tools, we have implemented the graph cleaning algorithms described in [
<xref ref-type="bibr" rid="B9">9</xref>
] as an experimental feature. More sophisticated techniques, such as the network flow approach described in [
<xref ref-type="bibr" rid="B8">8</xref>
], are left for future work, as the main focus of this paper lies in the efficient computation of the irreducible
<italic>SPM</italic>
s and the construction of the string graph.</p>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<p>The presented methods for constructing the string graph and the subsequent computation of contigs have been implemented in a sequence assembler named
<italic>Readjoiner</italic>
, which is part of the
<italic>GenomeTools</italic>
software suite [
<xref ref-type="bibr" rid="B21">21</xref>
]. The
<italic>Readjoiner</italic>
pipeline consists of three steps:</p>
<p> • 
<italic>Readjoiner prefilter</italic>
takes a read set in form of one or more Fasta-formatted files and removes reads containing ambiguity codes and reads which are prefixes or suffixes of other reads. It outputs the prefiltered reads in the
<italic>GtEncseq</italic>
-format [
<xref ref-type="bibr" rid="B16">16</xref>
].</p>
<p> • 
<italic>Readjoiner overlap</italic>
maps the representation of prefiltered reads in memory, enumerates non-redundant irreducible suffix-prefix matches, and stores them on file. The time/space tradeoff for this step can be adjusted by an option specifying the number of parts in which the sorted array of SPM-relevant suffixes is computed. Alternatively, one can specify a memory limit, according to which the minimum number of parts is determined to not exceed this limit.</p>
<p> • 
<italic>Readjoiner assembly</italic>
builds the string graph and traverses it to output the contigs.</p>
<sec>
<title>Experimental setup</title>
<p>For our benchmarks, the 64bit
<italic>GenomeTools</italic>
binary was compiled by
<italic>gcc</italic>
version 4.3.2 using the provided Makefile with the option “64bit = yes assert = no amalgamation = yes”. These last two options trigger the build-system to not compile assertions and to generate a single C-code file from which an amalgamation object is compiled, thus allowing for a maximum of inlined code, which in turn is executed faster.</p>
<p>All tests were performed on a computer with a 2.40 Ghz Intel Xeon E5620 4-core processor, 64 GB RAM, under a 64bit Linux operating system, using a single core only.</p>
<p>For memory usage measurements, we monitored the VmHWM (“high water mark”) value in the /proc file system [
<xref ref-type="bibr" rid="B22">22</xref>
] associated with the process of each particular program over the time of its execution, including both allocated heap memory and memory made available via the mmap() system call. The running time is the CPU time (sum of user and system time) as measured using
<italic>GNU time</italic>
.</p>
<p>For all runs of
<italic>Readjoiner</italic>
we used
<italic>k</italic>
 = 32,
<inline-formula>
<mml:math id="M114" name="1471-2105-13-82-i114" overflow="scroll">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mo stretchy="true">{</mml:mo>
<mml:mn>8</mml:mn>
<mml:mo>,</mml:mo>
<mml:mfenced open="⌈" close="⌉">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo></mml:mo>
<mml:mn>8</mml:mn>
<mml:mo stretchy="true">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
and
<italic>k″</italic>
 = 
<italic>k</italic>
′ − 1. If not stated otherwise, the number of parts in which
<italic>Readjoiner</italic>
computes the sorted array of SPM-relevant suffixes was 7. All programs were run with
<sub>
<italic>min</italic>
</sub>
 = 45 (which is the default minimum match length in SGA).</p>
</sec>
<sec>
<title>Human genome sequencing simulations</title>
<p>We tested our assembler on simulated error-free sequencing read sets based on human genomic sequences (latest available release of GRCh37). For each human chromosome we prepared a template sequence by deleting ambiguity symbols. Then we simulated reads by pseudo random sampling of the template sequence and its reverse complement, until the desired number of reads was obtained. This was done using the
<italic>GenomeTools simreads</italic>
-tool and is the same procedure as used in [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B13">13</xref>
].</p>
<p>From each of the 24 human chromosome sequences, we generated a separate read set with 20 × coverage and a constant read length of 100 bp. The read set are called
<italic>c</italic>
1, 
<italic>c</italic>
2, …, 
<italic>c</italic>
22, 
<italic>cX</italic>
, 
<italic>cY</italic>
. Furthermore, using the entire human genome as template we generated read sets with 20×, 30× and 40× coverage, referred to by hg20×, hg30× and hg40×, respectively. Additionally, from chromosome 22, a set of reads of variable length was prepared: The results for this dataset are reported in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Section 9.</p>
<p>In a first computational experiment, we determined the time vs. space tradeoff of our partitioning strategy, by applying
<italic>Readjoiner</italic>
to c2 with a varying number of parts ranging from 1 to 9. The results are shown in Figure
<xref ref-type="fig" rid="F6">6</xref>
.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Influence of the partitioning technique on space and time requirement.</bold>
Influence of the partitioning technique on space and time requirement. Running time and space peak of Readjoiner for the index construction of the c2 dataset with a varying number of parts (from 1 to 9, ℓ
<sub>
<italic>min</italic>
</sub>
 = 45).</p>
</caption>
<graphic xlink:href="1471-2105-13-82-6"></graphic>
</fig>
<p>The complete
<italic>Readjoiner</italic>
-pipeline was applied to each of the 24 datasets
<italic>c</italic>
1, 
<italic>c</italic>
2, 
<italic>c</italic>
3, …, 
<italic>c</italic>
22, 
<italic>cX</italic>
, 
<italic>cY</italic>
. We considered the running time as a function of the length of each chromosome from which the dataset was generated and performed a linear regression, which delivered an
<italic>R</italic>
<sup>2</sup>
-value of 0.997. The same was done for the space peak, delivering an
<italic>R</italic>
<sup>2</sup>
-value of 0.998. Figure
<xref ref-type="fig" rid="F7">7</xref>
shows plots of the time vs. length and space peak vs. length functions.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>Running time and space peak for Readjoiner for all 24 read sets derived from human chromosomes.</bold>
Running time (
<bold>A</bold>
) and space peak (
<bold>B</bold>
) of Readjoiner for all 24 read sets c1, c2, …, c22, cX, cY derived from the human chromosomes (ℓ
<sub>
<italic>min</italic>
</sub>
 = 45). Each dot represents a human chromosome placed on the X-axes according to its length and on the Y-axes according to the running time (
<bold>A</bold>
) and the space peak (
<bold>B</bold>
) required by Readjoiner to process it. The line was fitted to the dots using the least square regression command lm from the R-project
<ext-link ext-link-type="uri" xlink:href="http://www.r-project.org/">http://www.r-project.org/</ext-link>
. This delivered R
<sup>2</sup>
 = 0.997 for the running time and R
<sup>2</sup>
 = 0.998 for the space peak.</p>
</caption>
<graphic xlink:href="1471-2105-13-82-7"></graphic>
</fig>
</sec>
<sec>
<title>Comparison with other string graph-based assemblers</title>
<p>The 64bit Linux binaries of Edena [
<xref ref-type="bibr" rid="B9">9</xref>
] were downloaded from [
<xref ref-type="bibr" rid="B23">23</xref>
]. We tested version 2.1.1 and version 3 dev110920. Edena 3 is an untested version and under active development. In our tests, it required slightly more memory and was significantly slower than Edena version 2.1.1, which we therefore selected for the comparative test. The source code of SGA (version 0.9.13) was obtained from its public GitHub repository [
<xref ref-type="bibr" rid="B24">24</xref>
]. The 64-bit Linux binary of LEAP was downloaded from [
<xref ref-type="bibr" rid="B25">25</xref>
].</p>
<p>As Edena is based on the original string graph construction method proposed by [
<xref ref-type="bibr" rid="B8">8</xref>
], a comparison to
<italic>Readjoiner</italic>
allows validating our construction method. Table
<xref ref-type="table" rid="T1">1</xref>
reports the performance of Edena and
<italic>Readjoiner</italic>
for the assembly of datasets c22, c15 and c7. These and additionally c2 were used as benchmark sets in [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B13">13</xref>
]. For all three datasets, Edena and
<italic>Readjoiner</italic>
produce exactly the same list of irreducible
<italic>SPM</italic>
s, which allows concluding that the string graphs are identical. The resulting contig sets are almost identical: Edena was slightly more stringent in the output of the smallest contigs.
<italic>Readjoiner</italic>
was 13 − 14× faster than Edena and used about 11% of the space used by Edena. Due to a segmentation fault, Edena did not complete the overlap phase for the largest chromosome dataset c2.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Comparison of Readjoiner and Edena</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left"> </th>
<th align="left">RJ</th>
<th align="left">Edena</th>
<th align="left">
<inline-formula>
<mml:math id="M115" name="1471-2105-13-82-i115" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">Edena</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">Edena</th>
<th align="left">
<inline-formula>
<mml:math id="M116" name="1471-2105-13-82-i116" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">Edena</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">Edena</th>
<th align="left">
<inline-formula>
<mml:math id="M117" name="1471-2105-13-82-i117" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">Edena</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Read set
<hr></hr>
</td>
<td align="left" valign="bottom">c22
<hr></hr>
</td>
<td align="left" valign="bottom">c22
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">c15
<hr></hr>
</td>
<td align="left" valign="bottom">c15
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">c7
<hr></hr>
</td>
<td align="left" valign="bottom">c7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Genome size (Mbp)
<hr></hr>
</td>
<td align="left" valign="bottom">34.9
<hr></hr>
</td>
<td align="left" valign="bottom">34.9
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">81.7
<hr></hr>
</td>
<td align="left" valign="bottom">81.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">155.4
<hr></hr>
</td>
<td align="left" valign="bottom">155.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of reads (M)
<hr></hr>
</td>
<td align="left" valign="bottom">7.0
<hr></hr>
</td>
<td align="left" valign="bottom">7.0
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">16.3
<hr></hr>
</td>
<td align="left" valign="bottom">16.3
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">31.1
<hr></hr>
</td>
<td align="left" valign="bottom">31.1
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Contained reads (K)
<hr></hr>
</td>
<td align="left" valign="bottom">686.4
<hr></hr>
</td>
<td align="left" valign="bottom">686.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">1665.7
<hr></hr>
</td>
<td align="left" valign="bottom">1665.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3103.0
<hr></hr>
</td>
<td align="left" valign="bottom">3103.0
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Irreducible SPM (M)
<hr></hr>
</td>
<td align="left" valign="bottom">7.2
<hr></hr>
</td>
<td align="left" valign="bottom">7.2
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">17.2
<hr></hr>
</td>
<td align="left" valign="bottom">17.2
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">36.4
<hr></hr>
</td>
<td align="left" valign="bottom">36.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Overall time (s)
<hr></hr>
</td>
<td align="left" valign="bottom">360
<hr></hr>
</td>
<td align="left" valign="bottom">4903
<hr></hr>
</td>
<td align="left" valign="bottom">13.62×
<hr></hr>
</td>
<td align="left" valign="bottom">945
<hr></hr>
</td>
<td align="left" valign="bottom">13609
<hr></hr>
</td>
<td align="left" valign="bottom">14.40×
<hr></hr>
</td>
<td align="left" valign="bottom">2035
<hr></hr>
</td>
<td align="left" valign="bottom">29404
<hr></hr>
</td>
<td align="left" valign="bottom">14.45×
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Overall space (MB)
<hr></hr>
</td>
<td align="left" valign="bottom">294
<hr></hr>
</td>
<td align="left" valign="bottom">2753
<hr></hr>
</td>
<td align="left" valign="bottom">9.35×
<hr></hr>
</td>
<td align="left" valign="bottom">703
<hr></hr>
</td>
<td align="left" valign="bottom">6415
<hr></hr>
</td>
<td align="left" valign="bottom">9.13×
<hr></hr>
</td>
<td align="left" valign="bottom">1331
<hr></hr>
</td>
<td align="left" valign="bottom">12255
<hr></hr>
</td>
<td align="left" valign="bottom">9.21×
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Contigs
<hr></hr>
</td>
<td align="left" valign="bottom">120712
<hr></hr>
</td>
<td align="left" valign="bottom">120462
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">254830
<hr></hr>
</td>
<td align="left" valign="bottom">254111
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">503446
<hr></hr>
</td>
<td align="left" valign="bottom">502706
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Total contigs length (Mbp)
<hr></hr>
</td>
<td align="left" valign="bottom">45.7
<hr></hr>
</td>
<td align="left" valign="bottom">44.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">103.0
<hr></hr>
</td>
<td align="left" valign="bottom">101.1
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">198.8
<hr></hr>
</td>
<td align="left" valign="bottom">195.0
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Assembly N50 (Kbp)
<hr></hr>
</td>
<td align="left" valign="bottom">1.6
<hr></hr>
</td>
<td align="left" valign="bottom">1.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2.4
<hr></hr>
</td>
<td align="left" valign="bottom">2.5
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2.3
<hr></hr>
</td>
<td align="left" valign="bottom">2.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Assembly NG50 (Kbp)
<hr></hr>
</td>
<td align="left" valign="bottom">2.7
<hr></hr>
</td>
<td align="left" valign="bottom">2.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.7
<hr></hr>
</td>
<td align="left" valign="bottom">3.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.9
<hr></hr>
</td>
<td align="left" valign="bottom">3.9
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Longest contig (Kbp)</td>
<td align="left">41.4</td>
<td align="left">41.4</td>
<td align="left">-</td>
<td align="left">54.2</td>
<td align="left">54.2</td>
<td align="left">-</td>
<td align="left">44.9</td>
<td align="left">44.9</td>
<td align="left">-</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Results of applying Readjoiner (RJ) and Edena to the datasets c22, c15, c7 (ℓ
<sub>
<italic>min</italic>
</sub>
 = 45).</p>
</table-wrap-foot>
</table-wrap>
<p>SGA [
<xref ref-type="bibr" rid="B12">12</xref>
] is based on the direct string graph construction methods first introduced in [
<xref ref-type="bibr" rid="B11">11</xref>
]. Currently, to the best of our knowledge, it is the only other open source string graph-based assembler, besides
<italic>Readjoiner</italic>
. The SGA default pipeline consists of the index, overlap and assemble tools: However, memory can be reduced by using sga index, rmdup and fm-merge [
<xref ref-type="bibr" rid="B12">12</xref>
]. The parameter
<italic>d</italic>
of the SGA-index phase allows selecting the number of sequences to be processed at a time: without this parameter, the index phase requires much more memory than other phases. With memory being the most limiting factor in the assembly, we optimized the space peak of SGA by gradually reducing the value of
<italic>d</italic>
until either the space peak of the index phase was less than the space peak of the other phases, or a further decrease of
<italic>d</italic>
did not reduce the space peak. SGA’s index construction and overlap calculation can be threaded. However, for a fair comparison we used a single thread. Table
<xref ref-type="table" rid="T2">2</xref>
reports the results of running SGA and
<italic>Readjoiner</italic>
for the datasets c22, c15, c7 and c2.
<italic>Readjoiner</italic>
was 19× to 20× faster than SGA and used 14% to 23% less space than SGA. Using overlap and assemble instead of fm-merge, SGA became slightly faster but required about 7 times more memory (data not shown). By default, SGA is less stringent than
<italic>Readjoiner</italic>
when computing the contigs from the string graph, thus producing more contigs with a lower N50 value. For each of the four datasets, the longest contig produced by SGA and
<italic>Readjoiner</italic>
is identical, and the NG50 value of the assembly is comparable.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>Comparison of Readjoiner and SGA</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left"> </th>
<th align="left">RJ</th>
<th align="left">SGA</th>
<th align="left">
<inline-formula>
<mml:math id="M118" name="1471-2105-13-82-i118" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">SGA</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">SGA</th>
<th align="left">
<inline-formula>
<mml:math id="M119" name="1471-2105-13-82-i119" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">SGA</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">SGA</th>
<th align="left">
<inline-formula>
<mml:math id="M120" name="1471-2105-13-82-i120" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">SGA</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">SGA</th>
<th align="left">
<inline-formula>
<mml:math id="M121" name="1471-2105-13-82-i121" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">SGA</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Read set
<hr></hr>
</td>
<td align="left" valign="bottom">c22
<hr></hr>
</td>
<td align="left" valign="bottom">c22
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">c15
<hr></hr>
</td>
<td align="left" valign="bottom">c15
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">c7
<hr></hr>
</td>
<td align="left" valign="bottom">c7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">c2
<hr></hr>
</td>
<td align="left" valign="bottom">c2
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Genome size (Mbp)
<hr></hr>
</td>
<td align="left" valign="bottom">34.9
<hr></hr>
</td>
<td align="left" valign="bottom">34.9
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">81.7
<hr></hr>
</td>
<td align="left" valign="bottom">81.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">155.4
<hr></hr>
</td>
<td align="left" valign="bottom">155.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">238.2
<hr></hr>
</td>
<td align="left" valign="bottom">238.2
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of reads (M)
<hr></hr>
</td>
<td align="left" valign="bottom">7.0
<hr></hr>
</td>
<td align="left" valign="bottom">7.0
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">16.3
<hr></hr>
</td>
<td align="left" valign="bottom">16.3
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">31.1
<hr></hr>
</td>
<td align="left" valign="bottom">31.1
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">47.6
<hr></hr>
</td>
<td align="left" valign="bottom">47.6
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Sga index -d (K)
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">300
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">700
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">1350
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2300
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Overall time (s)
<hr></hr>
</td>
<td align="left" valign="bottom">360
<hr></hr>
</td>
<td align="left" valign="bottom">7508
<hr></hr>
</td>
<td align="left" valign="bottom">20.86×
<hr></hr>
</td>
<td align="left" valign="bottom">945
<hr></hr>
</td>
<td align="left" valign="bottom">19334
<hr></hr>
</td>
<td align="left" valign="bottom">20.46×
<hr></hr>
</td>
<td align="left" valign="bottom">2035
<hr></hr>
</td>
<td align="left" valign="bottom">39988
<hr></hr>
</td>
<td align="left" valign="bottom">19.65×
<hr></hr>
</td>
<td align="left" valign="bottom">3185
<hr></hr>
</td>
<td align="left" valign="bottom">65194
<hr></hr>
</td>
<td align="left" valign="bottom">20.47×
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Overall space (MB)
<hr></hr>
</td>
<td align="left" valign="bottom">294
<hr></hr>
</td>
<td align="left" valign="bottom">383
<hr></hr>
</td>
<td align="left" valign="bottom">1.30×
<hr></hr>
</td>
<td align="left" valign="bottom">703
<hr></hr>
</td>
<td align="left" valign="bottom">842
<hr></hr>
</td>
<td align="left" valign="bottom">1.20×
<hr></hr>
</td>
<td align="left" valign="bottom">1331
<hr></hr>
</td>
<td align="left" valign="bottom">1568
<hr></hr>
</td>
<td align="left" valign="bottom">1.18×
<hr></hr>
</td>
<td align="left" valign="bottom">2094
<hr></hr>
</td>
<td align="left" valign="bottom">2436
<hr></hr>
</td>
<td align="left" valign="bottom">1.16×
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Contigs
<hr></hr>
</td>
<td align="left" valign="bottom">120712
<hr></hr>
</td>
<td align="left" valign="bottom">231594
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">254830
<hr></hr>
</td>
<td align="left" valign="bottom">547217
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">503446
<hr></hr>
</td>
<td align="left" valign="bottom">1215816
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">634403
<hr></hr>
</td>
<td align="left" valign="bottom">1702714
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Total contigs length (Mbp)
<hr></hr>
</td>
<td align="left" valign="bottom">45.7
<hr></hr>
</td>
<td align="left" valign="bottom">55.9
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">103.0
<hr></hr>
</td>
<td align="left" valign="bottom">130.5
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">198.8
<hr></hr>
</td>
<td align="left" valign="bottom">266.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">292.2
<hr></hr>
</td>
<td align="left" valign="bottom">396.1
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Assembly N50 (Kbp)
<hr></hr>
</td>
<td align="left" valign="bottom">1.6
<hr></hr>
</td>
<td align="left" valign="bottom">0.8
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2.4
<hr></hr>
</td>
<td align="left" valign="bottom">1.0
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2.3
<hr></hr>
</td>
<td align="left" valign="bottom">0.5
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.2
<hr></hr>
</td>
<td align="left" valign="bottom">1.2
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Assembly NG50 (Kbp)
<hr></hr>
</td>
<td align="left" valign="bottom">2.7
<hr></hr>
</td>
<td align="left" valign="bottom">2.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.7
<hr></hr>
</td>
<td align="left" valign="bottom">3.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.9
<hr></hr>
</td>
<td align="left" valign="bottom">3.9
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">4.5
<hr></hr>
</td>
<td align="left" valign="bottom">4.5
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Longest contig (Kbp)</td>
<td align="left">41.4</td>
<td align="left">41.4</td>
<td align="left">-</td>
<td align="left">54.2</td>
<td align="left">54.2</td>
<td align="left">-</td>
<td align="left">44.9</td>
<td align="left">44.9</td>
<td align="left">-</td>
<td align="left">52.9</td>
<td align="left">52.9</td>
<td align="left">-</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Results of applying Readjoiner (RJ) and SGA to the datasets c22, c15, c7, c2 (ℓ
<sub>
<italic>min</italic>
</sub>
 = 45).</p>
</table-wrap-foot>
</table-wrap>
<p>LEAP implements the methods described in [
<xref ref-type="bibr" rid="B13">13</xref>
] to construct a full overlap graph. The efficiency of LEAP is remarkable, and allowed us to extend the comparison with
<italic>Readjoiner</italic>
to the hg20× dataset. Table
<xref ref-type="table" rid="T3">3</xref>
reports the results when applying
<italic>Readjoiner</italic>
and LEAP to the datasets c22, c2 and hg20×. Additionally, it shows the results for
<italic>Readjoiner</italic>
on hg30× and hg40× (for which LEAP was not able to complete the overlap phase on our test machine with 64 GB RAM).
<italic>Readjoiner</italic>
was faster than LEAP for all datasets with a speedup factor of 1.6 to 1.8. Furthermore, it required less memory: While the reduction in the space peak was at a maximum for the small datasets (c22, 2.99×), it is still significant (1.63 × ) for the hg20× dataset which contains almost two orders of magnitude more reads. In approximately the same time in which LEAP assembles hg20× , and using less memory,
<italic>Readjoiner</italic>
was able to assemble the hg30× dataset.
<italic>Readjoiner</italic>
was also able to assemble the hg40× dataset in 51 hours using 52 GB RAM.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption>
<p>Comparison of Readjoiner and LEAP</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left"> </th>
<th align="left">RJ</th>
<th align="left">LEAP</th>
<th align="left">
<inline-formula>
<mml:math id="M122" name="1471-2105-13-82-i122" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">LEAP</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">LEAP</th>
<th align="left">
<inline-formula>
<mml:math id="M123" name="1471-2105-13-82-i123" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">LEAP</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">LEAP</th>
<th align="left">
<inline-formula>
<mml:math id="M124" name="1471-2105-13-82-i124" overflow="scroll">
<mml:mfrac>
<mml:mtext mathvariant="bold">LEAP</mml:mtext>
<mml:mtext mathvariant="bold">RJ</mml:mtext>
</mml:mfrac>
</mml:math>
</inline-formula>
</th>
<th align="left">RJ</th>
<th align="left">RJ</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Read set
<hr></hr>
</td>
<td align="left" valign="bottom">c22
<hr></hr>
</td>
<td align="left" valign="bottom">c22
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">c2
<hr></hr>
</td>
<td align="left" valign="bottom">c2
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">hg20×
<hr></hr>
</td>
<td align="left" valign="bottom">hg20×
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">hg30×
<hr></hr>
</td>
<td align="left" valign="bottom">hg40×
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Genome size (Mbp)
<hr></hr>
</td>
<td align="left" valign="bottom">34.9
<hr></hr>
</td>
<td align="left" valign="bottom">34.9
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">238.2
<hr></hr>
</td>
<td align="left" valign="bottom">238.2
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2861.3
<hr></hr>
</td>
<td align="left" valign="bottom">2861.3
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2861.3
<hr></hr>
</td>
<td align="left" valign="bottom">2861.3
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Number of reads (M)
<hr></hr>
</td>
<td align="left" valign="bottom">7.0
<hr></hr>
</td>
<td align="left" valign="bottom">7.0
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">47.6
<hr></hr>
</td>
<td align="left" valign="bottom">47.6
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">579.5
<hr></hr>
</td>
<td align="left" valign="bottom">579.5
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">869.2
<hr></hr>
</td>
<td align="left" valign="bottom">1155.3
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Overall time
<hr></hr>
</td>
<td align="left" valign="bottom">6 min
<hr></hr>
</td>
<td align="left" valign="bottom">9 min
<hr></hr>
</td>
<td align="left" valign="bottom">1.60×
<hr></hr>
</td>
<td align="left" valign="bottom">53 min
<hr></hr>
</td>
<td align="left" valign="bottom">1 h 36 min
<hr></hr>
</td>
<td align="left" valign="bottom">1.81×
<hr></hr>
</td>
<td align="left" valign="bottom">20 h 4 min
<hr></hr>
</td>
<td align="left" valign="bottom">35 h 56 min
<hr></hr>
</td>
<td align="left" valign="bottom">1.79×
<hr></hr>
</td>
<td align="left" valign="bottom">34 h 9 min
<hr></hr>
</td>
<td align="left" valign="bottom">51 h 16 min
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Overall space (GB)
<hr></hr>
</td>
<td align="left" valign="bottom">0.3
<hr></hr>
</td>
<td align="left" valign="bottom">0.9
<hr></hr>
</td>
<td align="left" valign="bottom">2.99×
<hr></hr>
</td>
<td align="left" valign="bottom">2.0
<hr></hr>
</td>
<td align="left" valign="bottom">4.0
<hr></hr>
</td>
<td align="left" valign="bottom">1.98×
<hr></hr>
</td>
<td align="left" valign="bottom">27.9
<hr></hr>
</td>
<td align="left" valign="bottom">45.6
<hr></hr>
</td>
<td align="left" valign="bottom">1.63×
<hr></hr>
</td>
<td align="left" valign="bottom">39.8
<hr></hr>
</td>
<td align="left" valign="bottom">52.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Contigs
<hr></hr>
</td>
<td align="left" valign="bottom">120712
<hr></hr>
</td>
<td align="left" valign="bottom">113428
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">634403
<hr></hr>
</td>
<td align="left" valign="bottom">630408
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3239309
<hr></hr>
</td>
<td align="left" valign="bottom">11662607
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">13497497
<hr></hr>
</td>
<td align="left" valign="bottom">16253905
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Total contigs length (Mbp)
<hr></hr>
</td>
<td align="left" valign="bottom">45.7
<hr></hr>
</td>
<td align="left" valign="bottom">43.1
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">292.2
<hr></hr>
</td>
<td align="left" valign="bottom">280.6
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2833.1
<hr></hr>
</td>
<td align="left" valign="bottom">3642.7
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">4003.9
<hr></hr>
</td>
<td align="left" valign="bottom">4281.1
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Assembly N50 (Kbp)
<hr></hr>
</td>
<td align="left" valign="bottom">1.6
<hr></hr>
</td>
<td align="left" valign="bottom">1.6
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.2
<hr></hr>
</td>
<td align="left" valign="bottom">3.0
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.0
<hr></hr>
</td>
<td align="left" valign="bottom">1.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">1.2
<hr></hr>
</td>
<td align="left" valign="bottom">0.9
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Assembly NG50 (Kbp)
<hr></hr>
</td>
<td align="left" valign="bottom">2.7
<hr></hr>
</td>
<td align="left" valign="bottom">2.4
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">4.5
<hr></hr>
</td>
<td align="left" valign="bottom">3.9
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">3.0
<hr></hr>
</td>
<td align="left" valign="bottom">2.5
<hr></hr>
</td>
<td align="left" valign="bottom">-
<hr></hr>
</td>
<td align="left" valign="bottom">2.9
<hr></hr>
</td>
<td align="left" valign="bottom">2.8
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Longest contig (Kbp)</td>
<td align="left">41.4</td>
<td align="left">39.4</td>
<td align="left">-</td>
<td align="left">52.9</td>
<td align="left">48.9</td>
<td align="left">-</td>
<td align="left">63.4</td>
<td align="left">58.6</td>
<td align="left">-</td>
<td align="left">63.4</td>
<td align="left">63.4</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Results of applying Readjoiner (RJ) and LEAP to the datasets c22, c2, hg20× , hg30× , hg40× (ℓ
<sub>
<italic>min</italic>
</sub>
 = 45). LEAP was not able to process hg30× and hg40× on the test machine with 64 GB RAM.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>Evaluation of assemblies</title>
<p>In order to assess the quality of the assemblies delivered by the different programs, we used the script assess_assembly.pl of the Plantagora project [
<xref ref-type="bibr" rid="B26">26</xref>
]. The script aligns the contigs to the template sequence from which the reads were sampled, using the Nucmer alignment tool [
<xref ref-type="bibr" rid="B27">27</xref>
]. Several metrics, including the number of unaligned contigs, misassemblies, SNPs and gaps are reported.</p>
<p>Furthermore, assemblies were evaluated using the basic Assemblathon 1 statistics as defined in [
<xref ref-type="bibr" rid="B28">28</xref>
], including total length of the contigs, length of the longest and shortest contig, N50, L50, NG50 and LG50. Table
<xref ref-type="table" rid="T4">4</xref>
reports the results of the evaluation of the assemblies of dataset c22.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption>
<p>Metrics of assemblies of the c22 dataset</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">Assemblathon metrics</th>
<th align="left">RJ</th>
<th align="left">SGA</th>
<th align="left">Edena</th>
<th align="left">LEAP</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Number of contigs
<hr></hr>
</td>
<td align="left" valign="bottom">120712
<hr></hr>
</td>
<td align="left" valign="bottom">231594
<hr></hr>
</td>
<td align="left" valign="bottom">120462
<hr></hr>
</td>
<td align="left" valign="bottom">113428
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Genome size (bp)
<hr></hr>
</td>
<td align="left" valign="bottom">34894545
<hr></hr>
</td>
<td align="left" valign="bottom">34894545
<hr></hr>
</td>
<td align="left" valign="bottom">34894545
<hr></hr>
</td>
<td align="left" valign="bottom">34894545
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Total contigs length
<hr></hr>
</td>
<td align="left" valign="bottom">45667531
<hr></hr>
</td>
<td align="left" valign="bottom">55880641
<hr></hr>
</td>
<td align="left" valign="bottom">44737441
<hr></hr>
</td>
<td align="left" valign="bottom">43099113
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">- as  % of genome
<hr></hr>
</td>
<td align="left" valign="bottom">130.87
<hr></hr>
</td>
<td align="left" valign="bottom">160.14
<hr></hr>
</td>
<td align="left" valign="bottom">128.21
<hr></hr>
</td>
<td align="left" valign="bottom">123.51
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Mean contig size
<hr></hr>
</td>
<td align="left" valign="bottom">378.32
<hr></hr>
</td>
<td align="left" valign="bottom">241.29
<hr></hr>
</td>
<td align="left" valign="bottom">371.38
<hr></hr>
</td>
<td align="left" valign="bottom">379.97
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Median contig size
<hr></hr>
</td>
<td align="left" valign="bottom">132
<hr></hr>
</td>
<td align="left" valign="bottom">101
<hr></hr>
</td>
<td align="left" valign="bottom">120
<hr></hr>
</td>
<td align="left" valign="bottom">117
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Longest contig
<hr></hr>
</td>
<td align="left" valign="bottom">41352
<hr></hr>
</td>
<td align="left" valign="bottom">41352
<hr></hr>
</td>
<td align="left" valign="bottom">41352
<hr></hr>
</td>
<td align="left" valign="bottom">39379
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Shortest contig
<hr></hr>
</td>
<td align="left" valign="bottom">102
<hr></hr>
</td>
<td align="left" valign="bottom">100
<hr></hr>
</td>
<td align="left" valign="bottom">100
<hr></hr>
</td>
<td align="left" valign="bottom">101
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Contigs > 500 bp
<hr></hr>
</td>
<td align="left" valign="bottom">13467 (11.16%)
<hr></hr>
</td>
<td align="left" valign="bottom">13416 (5.79%)
<hr></hr>
</td>
<td align="left" valign="bottom">13439 (11.16%)
<hr></hr>
</td>
<td align="left" valign="bottom">13430 (11.84%)
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Contigs > 1 Kbp
<hr></hr>
</td>
<td align="left" valign="bottom">8700 (7.21%)
<hr></hr>
</td>
<td align="left" valign="bottom">8684 (3.75%)
<hr></hr>
</td>
<td align="left" valign="bottom">8696 (7.22%)
<hr></hr>
</td>
<td align="left" valign="bottom">8578 (7.56%)
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Contigs > 10 Kbp
<hr></hr>
</td>
<td align="left" valign="bottom">264 (0.22%)
<hr></hr>
</td>
<td align="left" valign="bottom">264 (0.11%)
<hr></hr>
</td>
<td align="left" valign="bottom">264 (0.22%)
<hr></hr>
</td>
<td align="left" valign="bottom">228 (0.20%)
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">N50
<hr></hr>
</td>
<td align="left" valign="bottom">1614
<hr></hr>
</td>
<td align="left" valign="bottom">815
<hr></hr>
</td>
<td align="left" valign="bottom">1699
<hr></hr>
</td>
<td align="left" valign="bottom">1617
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">L50
<hr></hr>
</td>
<td align="left" valign="bottom">5684
<hr></hr>
</td>
<td align="left" valign="bottom">10118
<hr></hr>
</td>
<td align="left" valign="bottom">5416
<hr></hr>
</td>
<td align="left" valign="bottom">5488
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">NG50
<hr></hr>
</td>
<td align="left" valign="bottom">2737
<hr></hr>
</td>
<td align="left" valign="bottom">2739
<hr></hr>
</td>
<td align="left" valign="bottom">2733
<hr></hr>
</td>
<td align="left" valign="bottom">2461
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">LG50
<hr></hr>
</td>
<td align="left" valign="bottom">3120
<hr></hr>
</td>
<td align="left" valign="bottom">3113
<hr></hr>
</td>
<td align="left" valign="bottom">3121
<hr></hr>
</td>
<td align="left" valign="bottom">3429
<hr></hr>
</td>
</tr>
<tr>
<td colspan="5" align="left" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Plantagora metrics</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>RJ</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>SGA</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>Edena</bold>
<hr></hr>
</td>
<td align="left" valign="bottom">
<bold>LEAP</bold>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Covered Bases
<hr></hr>
</td>
<td align="left" valign="bottom">34343945
<hr></hr>
</td>
<td align="left" valign="bottom">34357693
<hr></hr>
</td>
<td align="left" valign="bottom">34300114
<hr></hr>
</td>
<td align="left" valign="bottom">12968118
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Ambiguous Bases
<hr></hr>
</td>
<td align="left" valign="bottom">159997
<hr></hr>
</td>
<td align="left" valign="bottom">154584
<hr></hr>
</td>
<td align="left" valign="bottom">182952
<hr></hr>
</td>
<td align="left" valign="bottom">696334
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Misassemblies
<hr></hr>
</td>
<td align="left" valign="bottom">4
<hr></hr>
</td>
<td align="left" valign="bottom">4
<hr></hr>
</td>
<td align="left" valign="bottom">4
<hr></hr>
</td>
<td align="left" valign="bottom">3693
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Misassembled Contigs
<hr></hr>
</td>
<td align="left" valign="bottom">4
<hr></hr>
</td>
<td align="left" valign="bottom">4
<hr></hr>
</td>
<td align="left" valign="bottom">4
<hr></hr>
</td>
<td align="left" valign="bottom">2344
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Misassembled Contig Bases
<hr></hr>
</td>
<td align="left" valign="bottom">1283
<hr></hr>
</td>
<td align="left" valign="bottom">417
<hr></hr>
</td>
<td align="left" valign="bottom">1245
<hr></hr>
</td>
<td align="left" valign="bottom">2797710
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">SNPs
<hr></hr>
</td>
<td align="left" valign="bottom">104
<hr></hr>
</td>
<td align="left" valign="bottom">125
<hr></hr>
</td>
<td align="left" valign="bottom">120
<hr></hr>
</td>
<td align="left" valign="bottom">46270
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Insertions
<hr></hr>
</td>
<td align="left" valign="bottom">5
<hr></hr>
</td>
<td align="left" valign="bottom">2
<hr></hr>
</td>
<td align="left" valign="bottom">1
<hr></hr>
</td>
<td align="left" valign="bottom">2403
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Deletions
<hr></hr>
</td>
<td align="left" valign="bottom">43
<hr></hr>
</td>
<td align="left" valign="bottom">23
<hr></hr>
</td>
<td align="left" valign="bottom">28
<hr></hr>
</td>
<td align="left" valign="bottom">5187
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Positive Gaps
<hr></hr>
</td>
<td align="left" valign="bottom">2679
<hr></hr>
</td>
<td align="left" valign="bottom">2471
<hr></hr>
</td>
<td align="left" valign="bottom">2925
<hr></hr>
</td>
<td align="left" valign="bottom">26495
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Internal Gaps
<hr></hr>
</td>
<td align="left" valign="bottom">0
<hr></hr>
</td>
<td align="left" valign="bottom">0
<hr></hr>
</td>
<td align="left" valign="bottom">0
<hr></hr>
</td>
<td align="left" valign="bottom">21
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">External Gaps
<hr></hr>
</td>
<td align="left" valign="bottom">2679
<hr></hr>
</td>
<td align="left" valign="bottom">2471
<hr></hr>
</td>
<td align="left" valign="bottom">2925
<hr></hr>
</td>
<td align="left" valign="bottom">26474
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">- total length
<hr></hr>
</td>
<td align="left" valign="bottom">547408
<hr></hr>
</td>
<td align="left" valign="bottom">558921
<hr></hr>
</td>
<td align="left" valign="bottom">589979
<hr></hr>
</td>
<td align="left" valign="bottom">19064103
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">- average length
<hr></hr>
</td>
<td align="left" valign="bottom">204
<hr></hr>
</td>
<td align="left" valign="bottom">226
<hr></hr>
</td>
<td align="left" valign="bottom">202
<hr></hr>
</td>
<td align="left" valign="bottom">720
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Negative Gaps
<hr></hr>
</td>
<td align="left" valign="bottom">110888
<hr></hr>
</td>
<td align="left" valign="bottom">218908
<hr></hr>
</td>
<td align="left" valign="bottom">110811
<hr></hr>
</td>
<td align="left" valign="bottom">18198
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Internal Overlaps
<hr></hr>
</td>
<td align="left" valign="bottom">0
<hr></hr>
</td>
<td align="left" valign="bottom">0
<hr></hr>
</td>
<td align="left" valign="bottom">0
<hr></hr>
</td>
<td align="left" valign="bottom">17
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">External Overlaps
<hr></hr>
</td>
<td align="left" valign="bottom">110888
<hr></hr>
</td>
<td align="left" valign="bottom">218908
<hr></hr>
</td>
<td align="left" valign="bottom">110811
<hr></hr>
</td>
<td align="left" valign="bottom">18181
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">- total length
<hr></hr>
</td>
<td align="left" valign="bottom">−10247647
<hr></hr>
</td>
<td align="left" valign="bottom">−20078971
<hr></hr>
</td>
<td align="left" valign="bottom">−9424823
<hr></hr>
</td>
<td align="left" valign="bottom">−1859835
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">- average length
<hr></hr>
</td>
<td align="left" valign="bottom">−92
<hr></hr>
</td>
<td align="left" valign="bottom">−92
<hr></hr>
</td>
<td align="left" valign="bottom">−85
<hr></hr>
</td>
<td align="left" valign="bottom">−102
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Redundant Contigs
<hr></hr>
</td>
<td align="left" valign="bottom">864
<hr></hr>
</td>
<td align="left" valign="bottom">1158
<hr></hr>
</td>
<td align="left" valign="bottom">607
<hr></hr>
</td>
<td align="left" valign="bottom">6329
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Unaligned Contigs
<hr></hr>
</td>
<td align="left" valign="bottom">3262
<hr></hr>
</td>
<td align="left" valign="bottom">4686
<hr></hr>
</td>
<td align="left" valign="bottom">3221
<hr></hr>
</td>
<td align="left" valign="bottom">60563
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">- partial
<hr></hr>
</td>
<td align="left" valign="bottom">18
<hr></hr>
</td>
<td align="left" valign="bottom">57
<hr></hr>
</td>
<td align="left" valign="bottom">21
<hr></hr>
</td>
<td align="left" valign="bottom">3252
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">- total length
<hr></hr>
</td>
<td align="left" valign="bottom">462668
<hr></hr>
</td>
<td align="left" valign="bottom">599320
<hr></hr>
</td>
<td align="left" valign="bottom">447922
<hr></hr>
</td>
<td align="left" valign="bottom">27666823
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Ambiguous Contigs
<hr></hr>
</td>
<td align="left" valign="bottom">2631
<hr></hr>
</td>
<td align="left" valign="bottom">3876
<hr></hr>
</td>
<td align="left" valign="bottom">2619
<hr></hr>
</td>
<td align="left" valign="bottom">799
<hr></hr>
</td>
</tr>
<tr>
<td align="left">- total length</td>
<td align="left">369284</td>
<td align="left">483895</td>
<td align="left">366418</td>
<td align="left">93102</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Metrics of the assemblies of the dataset c22 as delivered by Readjoiner (RJ), SGA, Edena and LEAP (
<italic></italic>
<sub>
<italic>min</italic>
</sub>
=45). The metrics are explained in [
<xref ref-type="bibr" rid="B26">26</xref>
] and in [
<xref ref-type="bibr" rid="B28">28</xref>
].</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>Effect of sequencing errors</title>
<p>
<italic>Readjoiner</italic>
is based on the computation of exact suffix-prefix matches. Real-world datasets, however, contain a certain amount of errors. To better assess the effect of sequencing errors on the assembly, we sampled two sets of reads from the
<italic>Escherichia coli</italic>
K-12 genome, each consisting of 2 million reads: The first one (
<italic>Ecoli-without-errors</italic>
) is error-free, while in the second read set (
<italic>Ecoli-with-errors</italic>
) sequencing errors were introduced using a 0.75% position-independent substitution rate.</p>
<p>In order to assess the efficiency of
<italic>k</italic>
-mer based error correction (which we plan to implement in
<italic>Readjoiner</italic>
), we pre-processed
<italic>Ecoli-with-errors</italic>
using SGA. This constructs an FM-index from the reads set and errors are corrected using sga correct, delivering the read set termed
<italic>Ecoli-SGA-corrected</italic>
. This is further processed by sga filter, after constructing a new FM-index, to eliminate further error-containing reads. This resulting set
<italic>Ecoli-SGA-corrected+filtered</italic>
was assembled using both
<italic>Readjoiner</italic>
and SGA. Table
<xref ref-type="table" rid="T5">5</xref>
gives the most important statistics of the
<italic>Readjoiner</italic>
and SGA assemblies of the four different read sets defined here.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption>
<p>Assembly of error-containing reads</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left" valign="bottom"> 
<hr></hr>
</th>
<th colspan="2" align="left" valign="bottom">N50 (bp)
<hr></hr>
</th>
<th colspan="2" align="left" valign="bottom">NG50 (bp)
<hr></hr>
</th>
</tr>
<tr>
<th align="left"> </th>
<th align="left">RJ</th>
<th align="left">SGA</th>
<th align="left">RJ</th>
<th align="left">SGA</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">
<italic>Ecoli-without-errors</italic>
<hr></hr>
</td>
<td align="left" valign="bottom">54948
<hr></hr>
</td>
<td align="left" valign="bottom">54936
<hr></hr>
</td>
<td align="left" valign="bottom">57213
<hr></hr>
</td>
<td align="left" valign="bottom">57210
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<italic>Ecoli-with-errors</italic>
<hr></hr>
</td>
<td align="left" valign="bottom">203
<hr></hr>
</td>
<td align="left" valign="bottom">5110
<hr></hr>
</td>
<td align="left" valign="bottom">245
<hr></hr>
</td>
<td align="left" valign="bottom">8645
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<italic>Ecoli-SGA-corrected</italic>
<hr></hr>
</td>
<td align="left" valign="bottom">38178
<hr></hr>
</td>
<td align="left" valign="bottom">40002
<hr></hr>
</td>
<td align="left" valign="bottom">39999
<hr></hr>
</td>
<td align="left" valign="bottom">40824
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<italic>Ecoli-SGA-corrected+filtered</italic>
</td>
<td align="left">41872</td>
<td align="left">41872</td>
<td align="left">41905</td>
<td align="left">41903</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>N50 and NG50 values for the Readjoiner and SGA assemblies of Ecoli-without-errors, Ecoli-with-errors, Ecoli-SGA-corrected and Ecoli-SGA-corrected+filtered.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec>
<title>Discussion and conclusion</title>
<p>In this paper, we presented methods and implementation techniques of a new string graph based assembler, named
<italic>Readjoiner</italic>
, which is significantly faster or more space efficient than the previous software tools Edena [
<xref ref-type="bibr" rid="B9">9</xref>
], SGA [
<xref ref-type="bibr" rid="B11">11</xref>
] and LEAP [
<xref ref-type="bibr" rid="B13">13</xref>
]. In particular,
<italic>Readjoiner</italic>
can handle a set of reads with 40× coverage of the entire human genome (total length of all reads 115 Gbp) on a machine with 64 GB RAM.</p>
<p>Although the different string graph-based assemblers aim at constructing the same graph, they apply different heuristics to compute a layout from the string graph. The quality of assemblies of simulated datasets was compared using metrics from the Plantagora project [
<xref ref-type="bibr" rid="B26">26</xref>
] and the Assemblathon 1 project [
<xref ref-type="bibr" rid="B28">28</xref>
]. In the assemblies of c22 delivered by
<italic>Readjoiner</italic>
, SGA and Edena there are 4 misassembled contigs. In contrast, 53.4% of the contigs of the LEAP assembly could not be aligned to the reference and 4.3% of the aligned contigs were misassembled. The “Negative Gaps” metric computed by Plantagora reflects the amount of overlaps among the contigs. Its high value for all tools can be explained by the fact that branching nodes in the string graph start new contigs in which the read corresponding to the branching node is included. Additionally considering the “Positive Gaps” metrics, one can conclude that most contigs were interrupted due to the presence of repetitive sequences, but not due to low coverage.</p>
<p>Our main development is a new efficient algorithm to compute all irreducible suffix-prefix matches from which the string graph is constructed. While the basic techniques we use (e.g. integer encodings, suffix sorting, integer sorting, binary search, bottom-up traversal of lcp-interval trees) are mostly well-established in sequence processing, their combination is novel for the considered problem. The different techniques were chosen with the overall goal of performing as few as possible random accesses to large data structures to obtain algorithms with excellent data locality which in turn leads to short run times. For most parts of our method, this goal was achieved, mostly due to the partitioning of the set of SPM-relevant suffixes. There are still many random accesses to the representation of the reads, which, however, cannot fully be prevented in an index based approach.</p>
<p>The problem of computing suffix-prefix matches has long been studied in the literature, mostly with the goal of finding, for each pair of reads
<italic>r</italic>
and
<italic>s</italic>
, the longest suffix-prefix match of
<italic>r</italic>
and
<italic>s</italic>
. Gusfield et al. [
<xref ref-type="bibr" rid="B29">29</xref>
] solved this maximum suffix-prefix matching problem in optimal
<italic>O</italic>
(
<italic>n</italic>
 + 
<italic>m</italic>
<sup>2</sup>
) time and optimal
<italic>O</italic>
(
<italic>n</italic>
) space using the suffix tree for all suffixes of
<italic>m</italic>
reads of total length
<italic>n</italic>
.</p>
<p>Ohlebusch and Gog [
<xref ref-type="bibr" rid="B30">30</xref>
] present a solution to the same problem with the same time and space complexity using a linear scan of an enhanced suffix array. We do not know of any solution of the maximum suffix-prefix match problem which appropriately handles the reverse complements of the reads. Applying the algorithms of [
<xref ref-type="bibr" rid="B29">29</xref>
] or of [
<xref ref-type="bibr" rid="B30">30</xref>
] to the set of all reads and their reverse complements would not guarantee the maximality constraint, as the forward and reverse complement of a read are represented in different locations of the employed index structure.</p>
<p>In Edena, suffix-prefix matches are computed using a suffix array. Details of the algorithm or the implementation are not published.</p>
<p>Like Simpson and Durbin [
<xref ref-type="bibr" rid="B11">11</xref>
], we replaced the maximality constraint by a minimum length constraint imposed on each suffix-prefix match. The modified problem allows for an algorithm with two important advantages (compared to the algorithms of [
<xref ref-type="bibr" rid="B29">29</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
]): At first, the algorithm does not require a stack for each of
<italic>m</italic>
reads, and still can employ the space and time efficient bottom-up traversal of an lcp-interval tree as presented in Algorithm 1. Moreover, the algorithm can easily handle reverse complements of the reads and efficient selection of irreducible suffix-prefix matches is possible.</p>
<p>There are two main approaches to the construction of a string graph. The original approach of Myers [
<xref ref-type="bibr" rid="B8">8</xref>
] was to first construct a full overlap graph before transitive edges are removed. The resulting string graph contains all information relevant for the layout of the contigs. As the string graph contains much less edges than the overlap graph (the ratio depends on the coverage of the read set, see [
<xref ref-type="bibr" rid="B8">8</xref>
]), the explicit representation of this usually defines the space peak.</p>
<p>An alternative overlap graph representation for exact suffix-prefix matches was introduced in [
<xref ref-type="bibr" rid="B13">13</xref>
] and implemented in the LEAP software. The basic idea of this approach is to implicitly store many suffix-prefix matches for a set of lexicographically related reads in constant space using an interval representation. This allows for a compact storage of the full overlap graph. The representation does not apply to irreducible
<italic>SPM</italic>
s. In [
<xref ref-type="bibr" rid="B13">13</xref>
] only asymptotic results regarding the space requirement of the compact overlap graph representation are given, and LEAP does not give any clues on the size of the graph it constructs. So it remains unclear if this representation of the overlap graph is smaller than our representation of the string graph. A comparison of the overall space requirement of LEAP and
<italic>Readjoiner</italic>
shows a clear advantage for
<italic>Readjoiner</italic>
, see Table
<xref ref-type="table" rid="T3">3</xref>
for details.</p>
<p>It is worthwhile to note that the contigs output by LEAP contain many differences with respect to the target sequences they were sampled from. It is not clear to us, whether this is an artifact of the method or an implementation issue.</p>
<p>Another efficient way to reduce the space peak for string graph constructions is to recognize transitive
<italic>SPM</italic>
s and prevent their insertion in the graph structure. Simpson and Durbin [
<xref ref-type="bibr" rid="B11">11</xref>
] developed the first method following this approach and implemented it in the SGA software. In this paper, we have described an alternative algorithm, exploiting a property of transitive
<italic>SPM</italic>
s that can easily be checked on a small set of strings.</p>
<p>Our comparative tests (Table
<xref ref-type="table" rid="T2">2</xref>
) indicate that
<italic>Readjoiner</italic>
is more than one order of magnitude faster than the current SGA implementation and uses less space. This may come as a surprise as SGA uses a compact index structure based on the BWT, while
<italic>Readjoiner</italic>
employs techniques known from enhanced suffix arrays, which are usually more space consuming. The space advantage of
<italic>Readjoiner</italic>
is mainly a result of our partitioning approach applied to the array of SPM-relevant suffixes. The partitioning technique leads to a large reduction in the overall memory peak and a small increase in the running time. This can be explained by an improved cache coherence: For a given part, only a small portion of the different tables are accessed. This seems to outweigh the time for the additional passes over the reads.</p>
<p>We see two reasons for the time advantage of
<italic>Readjoiner</italic>
: at first it employs a suffix selection and sorting method which is specifically tailored for the suffix-prefix matching problem and the given minimum match length ℓ
<sub>
<italic>min</italic>
</sub>
. In contrast, the BWT employed by SGA provides a general string indexing technique that is not optimized for computing
<italic>SPM</italic>
s of an arbitrary but fixed minimum length. Secondly,
<italic>Readjoiner</italic>
computes suffix-prefix matches by a linear scan of two integer tables, which is a very fast operation. In contrast, SGA relies on random accesses to the BWT which may take longer for large data sets.</p>
<p>The minimum match length parameter ℓ
<sub>
<italic>min</italic>
</sub>
is used to restrict the search to the exact
<italic>SPM</italic>
s that are considered to be significant. To balance the required computational resources and the quality of the assembly, one has to carefully choose an appropriate value for ℓ
<sub>
<italic>min</italic>
</sub>
 = 45. A larger value of ℓ
<sub>
<italic>min</italic>
</sub>
reduces the number of SPM-relevant suffixes, and in turn speeds up the computation and reduces the space requirement, but may lead to a poor assembly. Interestingly, in our simulations based on reads of length 100 bp, we obtained the best assembly results for a relatively large value of ℓ
<sub>
<italic>min</italic>
</sub>
around 65. However, for a fair benchmarking of the tools and to simplify comparison with previous publications, we have chosen ℓ
<sub>
<italic>min</italic>
</sub>
 = 45.</p>
<p>Among the string graph-based assemblers mentioned here, SGA is the only one that can distribute parts of the computation across multiple threads. Some of the algorithms employed in
<italic>Readjoiner</italic>
are well suited for a multi-threaded implementation. For example, each bucket of SPM-relevant suffixes is sorted independently and the corresponding
<italic>SPM</italic>
s are computed independently of all other buckets. This step only requires random read access to the representation of the reads. A multi-threaded implementation with shared memory access to the reads and buckets which are (with respect to their sizes) evenly distributed over the threads, is expected to provide a considerable speedup within a small amount of additional space.</p>
<p>Another important issue for future development is the improvement of the assembly quality for real world data. Here further preprocessing steps, in particular quality filtering and error detection are required, as well as the handling of paired read information in the assembly phase.</p>
<p>The present manuscript focuses on the algorithmic approach and implementation of methods for the computation of irreducible suffix-prefix matches and the construction of the string graph. We report our results on error-free datasets: This is in analogy to the first papers describing the methods implemented in SGA [
<xref ref-type="bibr" rid="B11">11</xref>
] and LEAP [
<xref ref-type="bibr" rid="B13">13</xref>
].</p>
<p>Several error correction strategies have been applied so far: The classical method was to consider approximative suffix-prefix matches of the reads and to correct the resulting contigs in a consensus phase. With large next-generation datasets, the method of choice consist in
<italic>k</italic>
-mer counting, identification of a subset of trusted
<italic>k</italic>
-mer, which occur at least a given number of times in the read set, and correction of the reads containing untrusted
<italic>k</italic>
-mers [
<xref ref-type="bibr" rid="B12">12</xref>
,
<xref ref-type="bibr" rid="B31">31</xref>
].</p>
<p>Approximative suffix-prefix matching algorithms can be implemented to work on index structures, but the increased search space makes them significantly slower than exact matching algorithms. Among the string graph-based assemblers, only SGA implements an approximate suffix-prefix matching algorithm: Nevertheless, this is not used by default, and the authors recommend using their faster
<italic>k</italic>
-mer based error correction method instead [
<xref ref-type="bibr" rid="B12">12</xref>
].</p>
<p>The fact that
<italic>Readjoiner</italic>
is based on exact suffix-prefix matches makes it sensible to errors. We have demonstrated that using a
<italic>k</italic>
-mer based error correction step delivers read sets for which
<italic>Readjoiner</italic>
delivers assemblies with metrics comparable to SGA. We therefore plan to implement a
<italic>k</italic>
-mer based error correction for
<italic>Readjoiner</italic>
, employing techniques similar to those used for computing suffix-prefix matches.</p>
<p>Paired-end and mate pairs provide short and long range positional information, which is critical for improving the quality of assembling eukaryotic genomes. The classical approach consists in using this information for connecting contigs into scaffolds either in a post-processing phase, which may be integrated in the assembler software, or using a stand-alone tool, such as Bambus [
<xref ref-type="bibr" rid="B32">32</xref>
] or SSPACE [
<xref ref-type="bibr" rid="B33">33</xref>
]. A complementary approach, which we intend to introduce in a future version of our software, is to exploit the pairing information already during the traversal of the string graph, by restricting to paths connecting the mate pairs with a length compatible to the insert size of the library. Details of such an approach are given in [
<xref ref-type="bibr" rid="B34">34</xref>
,
<xref ref-type="bibr" rid="B35">35</xref>
].</p>
</sec>
<sec>
<title>Availability</title>
<p>The
<italic>Readjoiner</italic>
software is available as part of the
<italic>GenomeTools</italic>
genome analysis package [
<xref ref-type="bibr" rid="B21">21</xref>
], a free, open source collection of bioinformatics software. See
<ext-link ext-link-type="uri" xlink:href="http://www.zbh.uni-hamburg.de/readjoiner">http://www.zbh.uni-hamburg.de/readjoiner</ext-link>
for more details.</p>
</sec>
<sec>
<title>Authors’ contributions</title>
<p>GG developed most of the methods, implemented
<italic>Readjoiner</italic>
, and performed the benchmarks. SK conceived the project and developed the suffix selection and sorting methods. Both authors wrote the manuscript and approved its final version.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>
<bold>Supplemental Material.</bold>
This document describes implementation techniques for the methods and algorithms described in the main document. Moreover, it gives a lemma and a theorem (including proofs) characterizing transitive SPMs, and an algorithm to enumerate irreducible and non-redundant suffix-prefix matches. Furthermore, a method to recognize internally contained reads is given, as well as results for a benchmark set with reads of variable length. Finally, an example of SPM-relevant suffixes and their corresponding lcp-interval is presented.</p>
</caption>
<media xlink:href="1471-2105-13-82-S1.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We would like to thank Gordon Gremme and Sascha Steinbiss for developing and maintaining the
<italic>GenomeTools</italic>
, which proved to be a powerful software framework for the implementation of
<italic>Readjoiner</italic>
. Furthermore, we thank Dirk Willrodt and Sascha Steinbiss for proofreading earlier versions of this manuscript. Many thanks to Martin Frith for pointing out an error in the previous definition of transitive SPMs.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<article-title>Toward simplifying and accurately formulating fragment assembly</article-title>
<source>J Comput Biol</source>
<year>1995</year>
<volume>2</volume>
<fpage>275</fpage>
<lpage>290</lpage>
<pub-id pub-id-type="pmid">7497129</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="other">
<article-title>Illumina Sequencing - Performance and Specifications for HiSeq 2000</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.illumina.com/systems/hiseq_2000.ilmn">http://www.illumina.com/systems/hiseq_2000.ilmn</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>McPherson</surname>
<given-names>JD</given-names>
</name>
<article-title>Next-generation gap</article-title>
<source>Nature Methods</source>
<year>2009</year>
<volume>6</volume>
<issue>11 Suppl</issue>
<fpage>S2</fpage>
<lpage>S5</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nmeth.f.268">http://dx.doi.org/10.1038/nmeth.f.268</ext-link>
]</comment>
<pub-id pub-id-type="pmid">19844227</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2001</year>
<volume>98</volume>
<fpage>9748</fpage>
<lpage>9753</lpage>
<pub-id pub-id-type="pmid">11504945</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="pmid">18349386</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Chaisson</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<article-title>Short read fragment assembly of bacterial genomes</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<issue>2</issue>
<fpage>324</fpage>
<lpage>330</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1101/gr.7088808">http://dx.doi.org/10.1101/gr.7088808</ext-link>
]</comment>
<pub-id pub-id-type="pmid">18083777</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<article-title>ABySS: a parallel assembler for short read sequence data</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<fpage>1117</fpage>
<lpage>1123</lpage>
<pub-id pub-id-type="pmid">19251739</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<article-title>The fragment assembly string graph</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<issue>Suppl 2</issue>
<fpage>79</fpage>
<lpage>85</lpage>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Hernandez</surname>
<given-names>D</given-names>
</name>
<name>
<surname>FranÇois</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Farinelli</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Osterås</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schrenzel</surname>
<given-names>J</given-names>
</name>
<article-title>De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<issue>5</issue>
<fpage>802</fpage>
<lpage>809</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1101/gr.072033.107">http://dx.doi.org/10.1101/gr.072033.107</ext-link>
]</comment>
<pub-id pub-id-type="pmid">18332092</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Manber</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>G</given-names>
</name>
<article-title>Suffix Arrays: A New Method for On-Line String Searches</article-title>
<source>SIAM J Comput</source>
<year>1993</year>
<volume>22</volume>
<issue>5</issue>
<fpage>935</fpage>
<lpage>948</lpage>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<article-title>Efficient construction of an assembly string graph using the FM-index</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>12</issue>
<fpage>i367</fpage>
<lpage>373</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/12/i367">http://bioinformatics.oxfordjournals.org/cgi/content/abstrac t/26/12/i367</ext-link>
]</comment>
<pub-id pub-id-type="pmid">20529929</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="other">
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<article-title>Efficient de novo assembly of large genomes using compressed data structures</article-title>
<source>Genome Research</source>
<year>2011</year>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://genome.cshlp.org/content/early/2011/12/07/gr.126953.111.abstract">http://genome.cshlp.org/content/early/2011/12/07/gr.126953.1 11.abstract</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Dinh</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<article-title>A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>14</issue>
<fpage>1901</fpage>
<lpage>1907</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btr321">http://dx.doi.org/10.1093/bioinformatics/btr321</ext-link>
]</comment>
<pub-id pub-id-type="pmid">21636593</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="book">
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rantala</surname>
<given-names>T</given-names>
</name>
<person-group person-group-type="editor">Amir A, Turpin A, Moffat A</person-group>
<article-title>Engineering Radix Sort for Strings</article-title>
<source>String Processing and Information Retrieval, Volume 5280 of Lecture Notes in Computer Science</source>
<year>2009</year>
<publisher-name>Springer Berlin/Heidelberg</publisher-name>
<fpage>3</fpage>
<lpage>14</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1007/978-3-540-89097-3_3">http://dx.doi.org/10.1007/978-3-540-89097-3_3</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="book">
<name>
<surname>Cormen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Leiserson</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Rivest</surname>
<given-names>R</given-names>
</name>
<source>Introduction to Algorithms</source>
<year>1990</year>
<publisher-name>Cambridge, MA: MIT Press</publisher-name>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Steinbiss</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<article-title>A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences</article-title>
<source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>
<year>2012</year>
<volume>9</volume>
<issue>2</issue>
<fpage>345</fpage>
<lpage>357</lpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Bentley</surname>
<given-names>J</given-names>
</name>
<name>
<surname>McIllroy</surname>
<given-names>M</given-names>
</name>
<article-title>Engineering a Sort Function</article-title>
<source>Software Pract Ex</source>
<year>1993</year>
<volume>23</volume>
<issue>11</issue>
<fpage>1249</fpage>
<lpage>1265</lpage>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Abouelhoda</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
<article-title>Replacing Suffix Trees with Enhanced Suffix Arrays</article-title>
<source>J Discrete Algorithm</source>
<year>2004</year>
<volume>2</volume>
<fpage>53</fpage>
<lpage>86</lpage>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Fredkin</surname>
<given-names>E</given-names>
</name>
<article-title>Trie Memory</article-title>
<source>Commun ACM</source>
<year>1960</year>
<volume>3</volume>
<issue>9</issue>
<fpage>490</fpage>
<lpage>499</lpage>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
<article-title>The string B-tree: a new data structure for string search in external memory and its applications</article-title>
<source>J ACM</source>
<year>1999</year>
<volume>46</volume>
<issue>2</issue>
<fpage>236</fpage>
<lpage>280</lpage>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="other">
<article-title>GenomeTools - The versatile open source genome analysis software</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://genometools.org">http://genometools.org</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="other">
<name>
<surname>Bowden</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Nerin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Seibold</surname>
<given-names>S</given-names>
</name>
<article-title>The/proc filesystem</article-title>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://www.kernel.org/doc/Documentation/filesystems/proc.txt2009">http://www.kernel.org/doc/Documentation/filesystems/proc.txt 2009</ext-link>
</comment>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="other">
<article-title>Edena: very short reads assembler</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.genomic.ch/edena.php">http://www.genomic.ch/edena.php</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="other">
<article-title>jts/sga - Github</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="https://github.com/jts/sga">https://github.com/jts/sga</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="other">
<article-title>Software distribution of LEAP</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.engr.uconn.edu/htd06001/assembler/leap.zip">http://www.engr.uconn.edu/htd06001/assembler/leap.zip</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Barthelson</surname>
<given-names>R</given-names>
</name>
<name>
<surname>McFarlin</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Rounsley</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>S</given-names>
</name>
<article-title>Plantagora: Modeling Whole Genome Sequencing and Assembly of Plant Genomes</article-title>
<source>PLoS ONE</source>
<year>2011</year>
<volume>6</volume>
<issue>12</issue>
<fpage>e28436</fpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0028436">http://dx.doi.org/10.1371/journal.pone.0028436</ext-link>
]</comment>
<pub-id pub-id-type="pmid">22174807</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Smoot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shumway</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonescu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
<article-title>Versatile and open software for comparing large genomes</article-title>
<source>Genome Biology</source>
<year>2004</year>
<volume>5</volume>
<issue>2</issue>
<fpage>R12</fpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://genomebiology.com/2004/5/2/R12">http://genomebiology.com/2004/5/2/R12</ext-link>
]</comment>
<pub-id pub-id-type="pmid">14759262</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="other">
<name>
<surname>Earl</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Bradnam</surname>
<given-names>K</given-names>
</name>
<name>
<surname>St John</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Darling</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Faas</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>HOK</given-names>
</name>
<name>
<surname>Vince</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Diekhans</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Nuwantha</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Sung</surname>
<given-names>AWK</given-names>
</name>
<name>
<surname>Ning</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Haimel</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Fronseca</surname>
<given-names>NA</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Docking</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Ho</surname>
<given-names>IY</given-names>
</name>
<name>
<surname>Rokhsar</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lavenier</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Chapuis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Naquin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Maillet</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Kelly</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>SP</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>WC</given-names>
</name>
<name>
<surname>Srivastava</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Shaw</surname>
<given-names>TI</given-names>
</name>
<name>
<surname>Ruby</surname>
<given-names>JG</given-names>
</name>
<name>
<surname>Skewes-Cox</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Betegon</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Dimon</surname>
<given-names>MT</given-names>
</name>
<name>
<surname>Solovyev</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Kosarev</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Vorobyev</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ramirez-Gonzalez</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Leggett</surname>
<given-names>R</given-names>
</name>
<name>
<surname>MacLean</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Xia</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>L</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Gnerre</surname>
<given-names>S</given-names>
</name>
<name>
<surname>MacCallum</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Przybylski</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ribeiro</surname>
<given-names>FJ</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sharpe</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kersey</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Chapman</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>DeRisi</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Caccamo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Jaffe</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Korf</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Paten</surname>
<given-names>B</given-names>
</name>
<article-title>Assemblathon 1: A competitive assessment of de novo short read assembly methods</article-title>
<source>Genome Research</source>
<year>2011</year>
<comment>http://genome.cshlp.org/content/early/2011/09/16/gr.126599.1 11.abstract</comment>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal">
<name>
<surname>Gusfield</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Landau</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>Schieber</surname>
<given-names>B</given-names>
</name>
<article-title>An efficient algorithm for the All Pairs Suffix-Prefix Problem</article-title>
<source>Inf Process Lett</source>
<year>1992</year>
<volume>41</volume>
<issue>4</issue>
<fpage>181</fpage>
<lpage>185</lpage>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Gog</surname>
<given-names>S</given-names>
</name>
<article-title>Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem</article-title>
<source>Inf Process Lett</source>
<year>2010</year>
<volume>110</volume>
<issue>3</issue>
<fpage>123</fpage>
<lpage>128</lpage>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<name>
<surname>Kelley</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biology</source>
<year>2010</year>
<volume>11</volume>
<fpage>1</fpage>
<lpage>13</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1186/gb-2010-11-11-r116].[10.1186/gb-2010-11-11-r116">http://dx.doi.org/10.1186/gb-2010-11-11-r116]. [10.1186/gb-2010-11-11-r116</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kosack</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>Hierarchical Scaffolding With Bambus</article-title>
<source>Genome Research</source>
<year>2004</year>
<volume>14</volume>
<fpage>149</fpage>
<lpage>159</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://genome.cshlp.org/content/14/1/149.abstract">http://genome.cshlp.org/content/14/1/149.abstract</ext-link>
]</comment>
<pub-id pub-id-type="pmid">14707177</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="other">
<name>
<surname>Boetzer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Henkel</surname>
<given-names>CV</given-names>
</name>
<name>
<surname>Jansen</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Butler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pirovano</surname>
<given-names>W</given-names>
</name>
<article-title>Scaffolding pre-assembled contigs using SSPACE</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/early/2010/12/12/bioinformatics.btq683.abstract">http://bioinformatics.oxfordjournals.org/content/early/2010/ 12/12/bioinformatics.btq683.abstract</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="book">
<name>
<surname>Donmez</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Brudno</surname>
<given-names>M</given-names>
</name>
<person-group person-group-type="editor">Bafna V, Sahinalp S</person-group>
<article-title>Hapsembler: An Assembler for Highly Polymorphic Genomes</article-title>
<source>Research in Computational Molecular Biology, Volume 6577 of Lecture Notes in Computer Science</source>
<year>2011</year>
<publisher-name>Springer Berlin/Heidelberg</publisher-name>
<fpage>38</fpage>
<lpage>52</lpage>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="book">
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lavenier</surname>
<given-names>D</given-names>
</name>
<person-group person-group-type="editor">Teresa M, Przytycka TM, Marie-France S</person-group>
<article-title>Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph</article-title>
<source>Algorithms in Bioinformatics - 11th International Workshop, WABI 2011, Saarbrücken, Germany, September 5-7, 2011. Proceedings. Lecture Notes in Computer Science 6833 Springer</source>
<year>2011</year>
<comment>ISBN 978-3-642-23037-0</comment>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000943 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000943 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3507659
   |texte=   Readjoiner: a fast and memory efficient string graph-based sequence assembler
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:22559072" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021