Serveur d'exploration MERS

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information is therefore not validated.

Accurate self-correction of errors in long reads using de Bruijn graphs

Internal identifier: 000B15 (Pmc/Corpus); previous: 000B14; next: 000B16

Accurate self-correction of errors in long reads using de Bruijn graphs

Authors: Leena Salmela; Riku Walve; Eric Rivals; Esko Ukkonen

Source:

RBID : PMC:5351550

Abstract

Motivation

New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads.

Results

We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher.

Availability and Implementation

LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/.


Url:
DOI: 10.1093/bioinformatics/btw321
PubMed: 27273673
PubMed Central: 5351550

Links to Exploration step

PMC:5351550

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Accurate self-correction of errors in long reads using de Bruijn graphs</title>
<author>
<name sortKey="Salmela, Leena" sort="Salmela, Leena" uniqKey="Salmela L" first="Leena" last="Salmela">Leena Salmela</name>
<affiliation>
<nlm:aff id="btw321-aff1">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Walve, Riku" sort="Walve, Riku" uniqKey="Walve R" first="Riku" last="Walve">Riku Walve</name>
<affiliation>
<nlm:aff id="btw321-aff1">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rivals, Eric" sort="Rivals, Eric" uniqKey="Rivals E" first="Eric" last="Rivals">Eric Rivals</name>
<affiliation>
<nlm:aff id="btw321-aff2">LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, Montpellier, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ukkonen, Esko" sort="Ukkonen, Esko" uniqKey="Ukkonen E" first="Esko" last="Ukkonen">Esko Ukkonen</name>
<affiliation>
<nlm:aff id="btw321-aff1">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27273673</idno>
<idno type="pmc">5351550</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5351550</idno>
<idno type="RBID">PMC:5351550</idno>
<idno type="doi">10.1093/bioinformatics/btw321</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000B15</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B15</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Accurate self-correction of errors in long reads using de Bruijn graphs</title>
<author>
<name sortKey="Salmela, Leena" sort="Salmela, Leena" uniqKey="Salmela L" first="Leena" last="Salmela">Leena Salmela</name>
<affiliation>
<nlm:aff id="btw321-aff1">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Walve, Riku" sort="Walve, Riku" uniqKey="Walve R" first="Riku" last="Walve">Riku Walve</name>
<affiliation>
<nlm:aff id="btw321-aff1">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rivals, Eric" sort="Rivals, Eric" uniqKey="Rivals E" first="Eric" last="Rivals">Eric Rivals</name>
<affiliation>
<nlm:aff id="btw321-aff2">LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, Montpellier, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ukkonen, Esko" sort="Ukkonen, Esko" uniqKey="Ukkonen E" first="Esko" last="Ukkonen">Esko Ukkonen</name>
<affiliation>
<nlm:aff id="btw321-aff1">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g.
<italic>de novo</italic>
genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of
<italic>k</italic>
-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher.</p>
</sec>
<sec id="SA3">
<title>Availability and Implementation</title>
<p>LoRMA is freely available at
<ext-link ext-link-type="uri" xlink:href="http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/">http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Au, K F" uniqKey="Au K">K.F. Au</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bankevich, A" uniqKey="Bankevich A">A. Bankevich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Berlin, K" uniqKey="Berlin K">K. Berlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boucher, C" uniqKey="Boucher C">C. Boucher</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cazaux, B" uniqKey="Cazaux B">B. Cazaux</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, M J" uniqKey="Chaisson M">M.J. Chaisson</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G. Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chin, C S" uniqKey="Chin C">C.S. Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Drezen, E" uniqKey="Drezen E">E. Drezen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hackl, T" uniqKey="Hackl T">T. Hackl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S. Koren</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S. Koren</name>
</author>
<author>
<name sortKey="Philippy, A M" uniqKey="Philippy A">A.M. Philippy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Laehnemann, D" uniqKey="Laehnemann D">D. Laehnemann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Laver, T" uniqKey="Laver T">T. Laver</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, C" uniqKey="Lee C">C. Lee</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Madoui, M A" uniqKey="Madoui M">M.A. Madoui</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miclotte, G" uniqKey="Miclotte G">G. Miclotte</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nakamura, K" uniqKey="Nakamura K">K. Nakamura</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ono, Y" uniqKey="Ono Y">Y. Ono</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y. Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L. Salmela</name>
</author>
<author>
<name sortKey="Rivals, E" uniqKey="Rivals E">E. Rivals</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L. Salmela</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J. Schröder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schirmer, M" uniqKey="Schirmer M">M. Schirmer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X. Yang</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27273673</article-id>
<article-id pub-id-type="pmc">5351550</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btw321</article-id>
<article-id pub-id-type="publisher-id">btw321</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Recomb-Seq/Recomb-Cbb 2016</subject>
<subj-group subj-group-type="category-toc-heading">
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Accurate self-correction of errors in long reads using de Bruijn graphs</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Salmela</surname>
<given-names>Leena</given-names>
</name>
<xref ref-type="aff" rid="btw321-aff1">1</xref>
<xref ref-type="corresp" rid="btw321-cor1"></xref>
<pmc-comment>leena.salmela@cs.helsinki.fi</pmc-comment>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Walve</surname>
<given-names>Riku</given-names>
</name>
<xref ref-type="aff" rid="btw321-aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Rivals</surname>
<given-names>Eric</given-names>
</name>
<xref ref-type="aff" rid="btw321-aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ukkonen</surname>
<given-names>Esko</given-names>
</name>
<xref ref-type="aff" rid="btw321-aff1">1</xref>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Sahinalp</surname>
<given-names>Cenk</given-names>
</name>
<role>Associate Editor</role>
</contrib>
</contrib-group>
<aff id="btw321-aff1">
<label>1</label>
Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland</aff>
<aff id="btw321-aff2">
<label>2</label>
LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, Montpellier, France</aff>
<author-notes>
<corresp id="btw321-cor1">To whom correspondence should be addressed. Email:
<email>leena.salmela@cs.helsinki.fi</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>15</day>
<month>3</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub" iso-8601-date="2016-06-06">
<day>06</day>
<month>6</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>06</day>
<month>6</month>
<year>2016</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>33</volume>
<issue>6</issue>
<fpage>799</fpage>
<lpage>806</lpage>
<history>
<date date-type="received">
<day>19</day>
<month>3</month>
<year>2016</year>
</date>
<date date-type="rev-recd">
<day>3</day>
<month>5</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>5</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author 2016. Published by Oxford University Press.</copyright-statement>
<copyright-year>2016</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by-nc/4.0/" license-type="cc-by-nc">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<self-uri xlink:href="btw321.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g.
<italic>de novo</italic>
genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of
<italic>k</italic>
-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher.</p>
</sec>
<sec id="SA3">
<title>Availability and Implementation</title>
<p>LoRMA is freely available at
<ext-link ext-link-type="uri" xlink:href="http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/">http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/</ext-link>
.</p>
</sec>
</abstract>
<funding-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">Academy of Finland</named-content>
</funding-source>
<award-id>267591</award-id>
</award-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">ANR Colib’read</named-content>
</funding-source>
<award-id>ANR-12-BS02-0008</award-id>
</award-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">IBC</named-content>
</funding-source>
<award-id>ANR-11-BINF-0002</award-id>
</award-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">Défi MASTODONS to E.R., and EU FP7 project SYSCOL</named-content>
</funding-source>
<award-id>UE7-SYSCOL-258236 to E.U.</award-id>
</award-group>
</funding-group>
<counts>
<page-count count="8"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>1 Introduction</title>
<p>With diminishing costs, high-throughput DNA sequencing has become a commonplace technology in biological research. Whereas the second generation sequencers produced short but quite accurate reads, new technologies such as Pacific Biosciences and Oxford NanoPore are producing reads up to 50 000 bp long but with an error rate of at least 15%. Although the long reads have proven to be very helpful in applications like genome assembly (
<xref rid="btw321-B11" ref-type="bibr">Koren and Philippy, 2015</xref>
;
<xref rid="btw321-B15" ref-type="bibr">Madoui
<italic>et al.</italic>
, 2015</xref>
), the error rate poses a challenge for the utilization of this data.</p>
<p>Many methods have been developed for correcting short reads (
<xref rid="btw321-B12" ref-type="bibr">Laehnemann
<italic>et al.</italic>
, 2016</xref>
;
<xref rid="btw321-B23" ref-type="bibr">Yang
<italic>et al.</italic>
, 2013</xref>
) but these methods are not directly applicable to the long reads because of their much higher error rate. Moreover, most research of short read error correction has concentrated on mismatches, the dominant error type in Illumina data, whereas in long reads indels are more common. Recently, several methods for error correction of long reads have also been developed. These methods fall into two categories: either the highly erroneous long reads are self-corrected by aligning them against each other, or a hybrid strategy is adopted in which the long reads are corrected using the accurate short reads that are assumed to be available. Most standalone error correction tools like proovread (
<xref rid="btw321-B9" ref-type="bibr">Hackl
<italic>et al.</italic>
, 2014</xref>
), LoRDEC (
<xref rid="btw321-B20" ref-type="bibr">Salmela and Rivals, 2014</xref>
), LSC (
<xref rid="btw321-B1" ref-type="bibr">Au
<italic>et al.</italic>
, 2012</xref>
) and Jabba (
<xref rid="btw321-B16" ref-type="bibr">Miclotte
<italic>et al.</italic>
, 2015</xref>
) are hybrid methods. PBcR (
<xref rid="btw321-B3" ref-type="bibr">Berlin
<italic>et al.</italic>
, 2015</xref>
;
<xref rid="btw321-B10" ref-type="bibr">Koren
<italic>et al.</italic>
, 2012</xref>
) is a tool that can employ either the hybrid or self-correction strategy.</p>
<p>Most hybrid methods like PBcR, LSC and proovread are based on the mapping approach. They first map the short reads on the long reads and then correct the long reads according to a consensus built on the mapped short reads. PBcR extends this strategy to self-correction of PacBio reads by computing overlaps between the long reads using probabilistic locality-sensitive hashing and then correcting the reads according to a consensus built on the overlapping reads. As the mapping of short reads is time and memory consuming, LoRDEC avoids the mapping phase by building a de Bruijn graph (DBG) of the short reads and then threading the long reads through this graph to correct them. Jabba is a recent tool that is also based on building a DBG of short reads. While LoRDEC finds matches of complete
<italic>k</italic>
-mers in the long reads, Jabba searches for maximal exact matches between the
<italic>k</italic>
-mers and the long reads allowing it to use a larger
<italic>k</italic>
in the DBG.</p>
<p>In this paper, we present a self-correction method for long reads that is based on DBGs and multiple alignments. First our method performs initial correction that is similar to LoRDEC, but uses only long reads and performs iterative correction rounds with longer and longer
<italic>k</italic>
-mers. This phase considers only the local context of errors and hence it misses the long-distance dependency information available in the long reads. To capture such dependencies, the second phase of our method uses multiple alignments between carefully selected reads to further improve the error correction.</p>
<p>Our experiments show that our method is currently the most accurate one relying on long reads only. The error rate of the reads after our error correction is less than half of the error rate of reads corrected by PBcR using long reads only. Furthermore, when the coverage of the read set is at least 75×, the size of the corrected read set of our method is at least 20% higher than for PBcR.</p>
</sec>
<sec>
<title>2 Overview of LoRDEC</title>
<p>LoRDEC (
<xref rid="btw321-B20" ref-type="bibr">Salmela and Rivals, 2014</xref>
) is a hybrid method for the error correction of long reads. It presents the short reads in a DBG and then maps the long reads to the graph. The DBG of a read set is a graph whose nodes are all
<italic>k</italic>
-mers occurring in the reads and there is an edge between two nodes if the corresponding
<italic>k</italic>
-mers overlap by
<italic>k</italic>
− 1 bases. LoRDEC classifies the
<italic>k</italic>
-mers of long reads as
<italic>solid</italic>
if they are in the DBG and
<italic>weak</italic>
otherwise. The correction then proceeds by replacing the weak areas of the long reads by solid ones. This is done by searching paths in the DBG between solid
<italic>k</italic>
-mers to bridge the weak areas between them. If several paths are found, the path with the shortest edit distance as compared to the weak region is chosen to be the correct sequence, which replaces the weak region of the long read. The weak heads and tails of the long reads are the extreme regions of the reads that are bordered by just one solid
<italic>k</italic>
-mer in the beginning (resp. end) of the read. LoRDEC attempts to correct these regions by starting a path search from the solid
<italic>k</italic>
-mer and choosing a sequence that is as close as possible to the weak head or tail.</p>
<p>Repetitive regions of the genome can make the DBG tangled. The path search in these areas of the DBG can then become intractable. Therefore, LoRDEC employs a limit on the number of branches it explores during the search. If this limit is exceeded, LoRDEC checks if at least one path within the maximum allowed error rate has been found and then uses the best path found for correction. If no such path has been found, LoRDEC starts a path search similar to the correction of the head and tail of the read, to attempt a partial correction of the weak region.</p>
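To make the path search concrete, here is a small editorial sketch (not LoRDEC's actual code; the function names, branch limit and length cap are illustrative) of classifying k-mers as solid or weak against the DBG node set and bridging a weak region by a bounded search between two solid k-mers:

def edit_distance(a, b):
    # Plain dynamic-programming edit distance, for clarity rather than speed.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def solid_flags(read, dbg_kmers, k):
    # True at every position whose k-mer is a node of the DBG (solid), False otherwise (weak).
    return [read[i:i + k] in dbg_kmers for i in range(len(read) - k + 1)]

def bridge(src, dst, weak_segment, dbg_kmers, k, max_branches=200, max_len=500):
    # Search DBG paths from solid k-mer `src` to solid k-mer `dst` and return the
    # spelled sequence closest (in edit distance) to the original weak segment,
    # here taken to include both bordering solid k-mers. The search is abandoned
    # after `max_branches` expansions, mirroring the branch limit described above.
    best, expansions, stack = None, 0, [src]
    while stack and expansions < max_branches:
        path = stack.pop()
        expansions += 1
        if len(path) > k and path.endswith(dst):
            d = edit_distance(path, weak_segment)
            if best is None or d < best[0]:
                best = (d, path)
            continue
        if len(path) >= max_len:
            continue
        for c in "ACGT":
            if path[-(k - 1):] + c in dbg_kmers:   # follow DBG edges only
                stack.append(path + c)
    return best   # (edit distance, replacement sequence) or None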
<p>Some segments of the long reads remain erroneous after the correction. LoRDEC outputs bases in upper case if at least one of the
<italic>k</italic>
-mers containing that base is solid, i.e. it occurs in the DBG of the short reads, and in lower case otherwise. For most applications, it is preferable to extract only the upper case regions of the sequences as the lower case bases are likely to contain errors.</p>
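In practice, extracting those regions can be as simple as keeping the upper-case runs; a minimal sketch (the length cut-off is an illustrative choice, not prescribed by LoRDEC):

import re

def trusted_regions(corrected_read, min_len=100):
    # Keep only upper-case (solid) stretches of a corrected read; lower-case
    # bases were left unvalidated by the correction and are likely erroneous.
    return [m.group() for m in re.finditer(r"[ACGTN]+", corrected_read)
            if len(m.group()) >= min_len]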
</sec>
<sec>
<title>3 Self-correction of long reads</title>
<p>In this section, we will show how an error correction procedure similar to LoRDEC can be used to iteratively correct long reads without short read data. We will use LoRDEC* to refer to LoRDEC in this long reads only mode. Then, we further describe a polishing method to improve the accuracy of correction.
<xref ref-type="fig" rid="btw321-F1">Figure 1</xref>
shows the workflow of our approach.</p>
<fig id="btw321-F1" orientation="portrait" position="float">
<label>Fig. 1.</label>
<caption>
<p>Workflow of error correction. LoRDEC* is first applied iteratively to the read set, with an increasing
<italic>k</italic>
. The corrected reads are further corrected by LoRMA, which uses multiple alignments to find long-distance dependencies in the reads</p>
</caption>
<graphic xlink:href="btw321f1"></graphic>
</fig>
<sec>
<title>3.1 Iterative correction</title>
<p>To describe how LoRDEC can be adapted for self-correction of read sets, let
<italic>Q</italic>
be a set of long reads to be corrected, and let integer
<italic>h</italic>
be the
<italic>abundancy threshold</italic>
that is used in choosing the
<italic>k</italic>
-mers to the DBG. The correction procedure repeats for an increasing sequence
<inline-formula id="IE1">
<mml:math id="IEQ1">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
the following steps 1–3:
<list list-type="order">
<list-item>
<p>Construct the DBG of set
<italic>Q</italic>
using as the nodes the
<italic>k</italic>
-mers that occur in
<italic>Q</italic>
at least
<italic>h</italic>
times;</p>
</list-item>
<list-item>
<p>Correct
<italic>Q</italic>
using the LoRDEC algorithm with this DBG;</p>
</list-item>
<list-item>
<p>Replace
<italic>Q</italic>
with the corrected
<italic>Q</italic>
.</p>
</list-item>
</list>
</p>
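A minimal skeleton of this loop, assuming a user-supplied single-read corrector for step 2 (the helper names are illustrative; the k progression 19, 40, 61 and h = 4 are the defaults discussed later in the paper):

from collections import Counter

def solid_kmer_set(reads, k, h):
    # Step 1: the DBG nodes are the k-mers occurring at least h times in Q.
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, count in counts.items() if count >= h}

def iterative_self_correction(reads, correct_one, ks=(19, 40, 61), h=4):
    # Steps 1-3 above, repeated for an increasing sequence of k. `correct_one`
    # stands in for the LoRDEC correction of one read against the DBG node set.
    Q = list(reads)
    for k in ks:
        dbg_nodes = solid_kmer_set(Q, k, h)                  # step 1
        Q = [correct_one(read, dbg_nodes, k) for read in Q]  # step 2
        # step 3: the corrected reads replace Q for the next round
    return Q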
<p>After the final round, the regions of the reads identified as correct in the last iteration are extracted for further correction with the multiple alignment technique by LoRMA.</p>
<p>As the initial error level is assumed high, the above iterations have to start with a relatively small
<italic>k </italic>
=
<italic> k</italic>
<sub>1</sub>
. With a suitable abundancy threshold
<italic>h</italic>
, the DBG should then contain most of the correct
<italic>k</italic>
-mers (i.e. the
<italic>k</italic>
-mers of the target genome) and a few erroneous ones. Although path search over long weak regions may not be feasible because of strong branching of the DBG, shorter paths are likely to be found and hence, short weak regions can be corrected. After the first round, the correct regions in the reads have become longer because close-by correct regions have been merged whenever a path between them has been found, and thus, we can increase
<italic>k</italic>
. Then, with increasing
<italic>k</italic>
s, the DBG gets less tangled and the path search over the longer weak regions becomes feasible allowing for the correction of the complete reads. A similar iterative approach has previously been proposed for short read assembly (
<xref rid="btw321-B2" ref-type="bibr">Bankevich
<italic>et al.</italic>
, 2012</xref>
;
<xref rid="btw321-B19" ref-type="bibr">Peng
<italic>et al.</italic>
, 2010</xref>
).</p>
<p>When the path search is abandoned because of excessive branching, the original LoRDEC algorithm still uses the best path found so far to correct the region. Such a greedy strategy improves correction accuracy in a single run, but in the present iterative approach false corrections start to accumulate. Therefore, we make a correction only if it is guaranteed that the correction is the best one available in the DBG, i.e. all branches have been explored.</p>
<p>Abundancy threshold
<italic>h</italic>
controls the quality of the
<italic>k</italic>
-mers that are used for correction. In our experiments, we used a fixed threshold of
<italic>h</italic>
 = 4 in all iterations, meaning that the
<italic>k</italic>
-mers with less than four occurrences in the read set were considered erroneous.</p>
<p>To justify the value of
<italic>h</italic>
, we need to analyse how many times a fixed
<italic>k</italic>
-mer of the genome is expected to occur without any error in the reads. Then an
<italic>h</italic>
that is about one or two standard deviations below the expected value should give a DBG that contains the majority of the correct
<italic>k</italic>
-mers and not too many erroneous ones. We will use an analysis similar to
<xref rid="btw321-B16" ref-type="bibr">Miclotte
<italic>et al.</italic>
(2015)</xref>
.</p>
<p>Let
<inline-formula id="IE2">
<mml:math id="IEQ2">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
denote
<italic>the coverage of a genomic k-mer by exact regions of length at least k</italic>
. Here
<italic>exact region</italic>
refers to a continuous maximal error-free segment of some read in our read set.
<xref ref-type="fig" rid="btw321-F2">Figure 2</xref>
gives an example of exact regions. Let us add a $ character to the end of each read, and then consider the concatenation of all these reads. In this sequence, an exact region (of length 0 or more) ends either at an error or when encountering the $ character. Let
<italic>n</italic>
denote the number of reads,
<italic>N</italic>
the length of the concatenation of all reads and
<italic>p</italic>
the error rate. Then the probability for an exact region to end at a given position of the concatenated sequence is
<inline-formula id="IE3">
<mml:math id="IEQ3">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. As the reads are long and the error rate is high, we have
<inline-formula id="IE4">
<mml:math id="IEQ4">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>≈</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
. The length of the exact regions is distributed according to the geometric distribution
<inline-formula id="IE5">
<mml:math id="IEQ5">
<mml:mrow>
<mml:mtext>Geom</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
,
and therefore, the probability of an exact region to have length
<italic>i</italic>
is
<inline-formula id="IE6">
<mml:math id="IEQ6">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
. The expected number of exact regions is
<italic>Nq</italic>
. An exact region is
<italic>maximal</italic>
if it cannot be extended to the left or right. Let
<italic>R
<sub>i</sub>
</italic>
be the random variable denoting the number of maximal exact regions of length
<italic>i</italic>
. Then
<inline-formula id="IE7">
<mml:math id="IEQ7">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>N</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>N</mml:mi>
<mml:msup>
<mml:mi>q</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mi>i</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<fig id="btw321-F2" orientation="portrait" position="float">
<label>Fig. 2.</label>
<caption>
<p>Division of a read into maximal exact regions, shown as boxed areas. The shaded boxes give the regions that could cover a 4-mer</p>
</caption>
<graphic xlink:href="btw321f2"></graphic>
</fig>
<p>Let
<inline-formula id="IE8">
<mml:math id="IEQ8">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
denote the coverage of a
<italic>k</italic>
-mer in the genome by maximal exact regions of length
<italic>i</italic>
, and let
<italic>r
<sub>i</sub>
</italic>
denote the number of maximal exact regions of length
<italic>i</italic>
. An exact region of length
<italic>i</italic>
,
<inline-formula id="IE9">
<mml:math id="IEQ9">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, covers a fixed genomic
<italic>k</italic>
-mer (i.e. the read with that exact region is read from the genomic segment containing that
<italic>k</italic>
-mer) if the region starts in the genome from the starting location of the
<italic>k</italic>
-mer or from some of the
<italic>i</italic>
<italic>k</italic>
locations before it. Assuming that the reads are randomly sampled from the genome, this happens with probability
<inline-formula id="IE10">
<mml:math id="IEQ10">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, where
<italic>G</italic>
is the length of the genome. Therefore,
<inline-formula id="IE11">
<mml:math id="IEQ11">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is distributed according to the binomial distribution
<inline-formula id="IE12">
<mml:math id="IEQ12">
<mml:mrow>
<mml:mtext>Bin</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>G</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
(independence of locations of exact regions is assumed), and the expected coverage of a genomic
<italic>k</italic>
-mer by maximal exact regions of length
<italic>i</italic>
is
<disp-formula id="E1">
<mml:math id="EQ1">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>∞</mml:mi>
</mml:munderover>
<mml:mi>P</mml:mi>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>G</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow></mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>G</mml:mi>
</mml:mfrac>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>N</mml:mi>
<mml:mi>G</mml:mi>
</mml:mfrac>
<mml:msup>
<mml:mi>q</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mo>·</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>By the linearity of expectation, the expected coverage of a genomic
<italic>k</italic>
-mer by exact regions of length at least
<italic>k</italic>
is
<disp-formula id="E2">
<mml:math id="EQ2">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mi>∞</mml:mi>
</mml:munderover>
<mml:mi>E</mml:mi>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>N</mml:mi>
<mml:mi>G</mml:mi>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mi>∞</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msup>
<mml:mi>q</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mo>·</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
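Summing the geometric series gives a compact closed form for this expectation (an editorial simplification written in LaTeX, not stated explicitly in the text):

\begin{aligned}
E(C_{\ell \ge k})
  &= \frac{N}{G}\sum_{i=k}^{\infty} q^{2}(1-q)^{i}\,(i-k+1)
   = \frac{N}{G}\,q^{2}(1-q)^{k}\sum_{j=0}^{\infty}(j+1)(1-q)^{j}
   = \frac{N}{G}\,(1-q)^{k},
\end{aligned}

since \sum_{j\ge 0}(j+1)x^{j} = 1/(1-x)^{2} with x = 1-q; with original coverage N/G and q \approx p this is roughly (N/G)(1-p)^{k}.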
<p>Because
<inline-formula id="IE13">
<mml:math id="IEQ13">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
is small, we can approximate the binomial distribution of
<inline-formula id="IE14">
<mml:math id="IEQ14">
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
with the Poisson distribution. Therefore,
<inline-formula id="IE15">
<mml:math id="IEQ15">
<mml:mrow>
<mml:msup>
<mml:mi>σ</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>Assuming that the coverages of a genomic
<italic>k</italic>
-mer by maximal exact regions of different lengths are independent, the variance of the coverage by exact regions of length at least
<italic>k</italic>
is
<inline-formula id="IE16">
<mml:math id="IEQ16">
<mml:mrow>
<mml:msup>
<mml:mi>σ</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="false">
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:msup>
<mml:mi>σ</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>
<xref ref-type="fig" rid="btw321-F3">Figure 3</xref>
illustrates
<inline-formula id="IE17">
<mml:math id="IEQ17">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
for various
<italic>k</italic>
and
<inline-formula id="IE18">
<mml:math id="IEQ18">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>≈</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, with 100× original coverage of the target. Note that the original coverage of the target genome by the read set is
<italic>N</italic>
/
<italic>G</italic>
. For the three datasets in our experiments (
<xref rid="btw321-T1" ref-type="table">Table 1</xref>
), with coverages 200×, 208× and 129×, the expected coverage
<inline-formula id="IE19">
<mml:math id="IEQ19">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
<mml:mo>≥</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
has values 9.12, 9.48 and 5.89, respectively, for our initial
<italic>k</italic>
 = 19 and for our assumed error rate
<italic>p</italic>
 = 0.15. Hence, our adopted threshold
<italic>h</italic>
 = 4 is from 0.8 to 1.8 standard deviations below the expected coverage meaning that most of the correct
<italic>k</italic>
-mers should be distinguishable from the erroneous ones.</p>
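These values can be reproduced from the closed form E(C_{ℓ≥k}) = (N/G)(1 − q)^k with q ≈ p; a small editorial sketch:

def expected_kmer_coverage(read_coverage, p, k):
    # E(C_{l>=k}) = (N/G) * (1 - q)^k with q approximated by the error rate p,
    # where read_coverage = N/G is the original coverage of the genome.
    return read_coverage * (1.0 - p) ** k

for cov in (200, 208, 129):   # coverages of the three datasets of Table 1
    print(cov, round(expected_kmer_coverage(cov, 0.15, 19), 2))
# -> 9.12, 9.48 and 5.88, in line with the 9.12, 9.48 and 5.89 quoted above
#    (the last value differs slightly because q is only approximately p and
#     the coverages are rounded).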
<fig id="btw321-F3" orientation="portrait" position="float">
<label>Fig. 3.</label>
<caption>
<p>Expected coverage of a genomic
<italic>k</italic>
-mer by exact regions of length at least
<italic>k</italic>
for a read set with coverage 100× for different error rates
<italic>p</italic>
</p>
</caption>
<graphic xlink:href="btw321f3"></graphic>
</fig>
<p>
<table-wrap id="btw321-T1" orientation="portrait" position="float">
<label>Table 1.</label>
<caption>
<p>Datasets used in the experiments</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">
<italic>E.coli</italic>
(simulated)</th>
<th rowspan="1" colspan="1">
<italic>E.coli</italic>
</th>
<th rowspan="1" colspan="1">Yeast</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">
<italic>Reference organism</italic>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Name</td>
<td rowspan="1" colspan="1">
<italic>Escherichia coli</italic>
</td>
<td rowspan="1" colspan="1">
<italic>Escherichia coli</italic>
</td>
<td rowspan="1" colspan="1">
<italic>Saccharomyces cerevisiae</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Strain</td>
<td rowspan="1" colspan="1">K-12 substr. MG1655</td>
<td rowspan="1" colspan="1">K-12 substr. MG1655</td>
<td rowspan="1" colspan="1">W303</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Reference sequence</td>
<td rowspan="1" colspan="1">NC_000913</td>
<td rowspan="1" colspan="1">NC_000913</td>
<td rowspan="1" colspan="1">CM001806-CM001823</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Genome size</td>
<td rowspan="1" colspan="1">4.6 Mbp</td>
<td rowspan="1" colspan="1">4.6 Mbp</td>
<td rowspan="1" colspan="1">12 Mbp</td>
</tr>
<tr>
<td colspan="4" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<italic>PacBio data</italic>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Number of reads</td>
<td rowspan="1" colspan="1">92 818</td>
<td rowspan="1" colspan="1">89 481</td>
<td rowspan="1" colspan="1">261 964</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Avg. read length</td>
<td rowspan="1" colspan="1">9997</td>
<td rowspan="1" colspan="1">10 779</td>
<td rowspan="1" colspan="1">5891</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Coverage</td>
<td rowspan="1" colspan="1">200×</td>
<td rowspan="1" colspan="1">208×</td>
<td rowspan="1" colspan="1">129×</td>
</tr>
<tr>
<td colspan="4" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<italic>Illumina data</italic>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Accession number</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">ERR022075</td>
<td rowspan="1" colspan="1">SRR567755</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Number of reads</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">2 316 613</td>
<td rowspan="1" colspan="1">4 503 422</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Read length</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Coverage</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">50×</td>
<td rowspan="1" colspan="1">38×</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec>
<title>3.2 Polishing with multiple alignments</title>
<p>The error correction performed by LoRDEC* does not make use of long range information contained in the reads. In particular, approximate repeats of the target are collapsed in the DBG into a path with alternative branches. In practice, such repeat regions are corrected towards a copy of the repeat but not necessarily towards the correct copy. However, the correct copy is more likely uncovered because we choose the path that minimizes the edit distance between the weak region to be corrected and the sequence spelled out by the path. Therefore, if we have several reads from the same location, the majority of them are likely corrected towards the correct copy.</p>
<p>Our multiple alignment error correction exploits the long range similarity of reads by identifying the reads that are likely to originate from the same genomic location. If the reads contain a repeat area, the most abundant copy of the repeat present in the reads is likely the correct one. Then by aligning the reads with each other we can correct them towards this most abundant copy. The approach we use here bears some similarity to the method used in Coral (
<xref rid="btw321-B21" ref-type="bibr">Salmela and Schröder, 2011</xref>
).</p>
<p>As a preprocessing phase for the method, we build a DBG of all the reads using the abundancy threshold
<italic>h</italic>
 = 1 to ensure that all
<italic>k</italic>
-mers present in the reads are indexed. Then we enumerate the simple paths of the DBG and find for each read the unique path that spells it out. Each such path is composed of non-overlapping unitig segments that have no branches. We call such segments the parts of a path. We associate to each path segment (i.e. a unitig path of the DBG) a set of triples describing the reads traversing that segment. Each triple consists of read id, part id and the direction of the read on this path. Hence, the path for a read
<italic>i</italic>
consists of segments that have a triplet with
<italic>i</italic>
as the read id and with part id values 1, 2,…, the path being composed of these segments in the order of the part id value (
<xref ref-type="fig" rid="btw321-F4">Fig. 4</xref>
). Using this information, it is now possible to reconstruct each read from the DBG except that the reads will be prefixed (suffixed) by the complete simple path that starts (ends) the read.</p>
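A toy version of this augmented index (editorial sketch; the segment identifiers and dictionary layout are illustrative, and reverse complements are ignored as in Fig. 4):

from collections import defaultdict

def augment_segments(read_paths):
    # read_paths maps a read id to the ordered list of (segment id, direction)
    # pairs of the unitig segments that spell the read out in the DBG.
    segment_index = defaultdict(list)   # segment id -> [(read id, part id, direction)]
    for read_id, path in read_paths.items():
        for part_id, (seg_id, direction) in enumerate(path, start=1):
            segment_index[seg_id].append((read_id, part_id, direction))
    return segment_index

# Read 2 of Fig. 4 traverses five segments, yielding the labels 2:1 ... 2:5.
index = augment_segments({2: [("a", "+"), ("b", "+"), ("c", "+"), ("d", "+"), ("e", "+")]})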
<fig id="btw321-F4" orientation="portrait" position="float">
<label>Fig. 4.</label>
<caption>
<p>Augmented DBG. For simplicity, reverse complements are not considered. The lower graph only shows the branching nodes of the DBG and the labels on the paths/edges are of the form
<italic>read id</italic>
:
<italic>read part id</italic>
. For example, the path for read 2 consists of segments with labels 2:1, 2:2, 2:3, 2:4 and 2:5</p>
</caption>
<graphic xlink:href="btw321f4"></graphic>
</fig>
<p>In the second phase of our method, we take the reads one by one and use the DBG to select reads that are similar to the current read. We follow the path for the current read and gather the set of reads sharing
<italic>k</italic>
-mers with it, which can be done using the triplets of the augmented DBG. Out of these reads, we then first select each read
<italic>R</italic>
such that the shared
<italic>k</italic>
-mers span at least 80% of the shorter one of the read
<italic>R</italic>
and the current read. Furthermore, out of these reads, we select those that share the most
<italic>k</italic>
-mers with the current read. We call this read set the
<italic>friends</italic>
of the current read. The number of selected reads is a parameter of our method (by default 7).</p>
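Friend selection can be sketched as follows (an editorial simplification that recomputes shared k-mers directly from the sequences instead of walking the augmented DBG; the 80% span requirement and the default of 7 friends are the values given above):

def select_friends(current, candidates, k=19, span_frac=0.8, n_friends=7):
    # candidates: read id -> sequence. A candidate qualifies if its k-mers shared
    # with the current read cover at least span_frac of the shorter of the two
    # reads; the n_friends candidates sharing the most k-mers are returned.
    cur_kmers = {current[i:i + k] for i in range(len(current) - k + 1)}
    scored = []
    for read_id, seq in candidates.items():
        covered = [False] * len(seq)
        n_shared = 0
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] in cur_kmers:
                n_shared += 1
                covered[i:i + k] = [True] * k
        if sum(covered) >= span_frac * min(len(current), len(seq)):
            scored.append((n_shared, read_id))
    return [read_id for _, read_id in sorted(scored, reverse=True)[:n_friends]]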
<p>We then proceed to compute a multiple alignment of the current read and its friends. To keep the running time feasible, we use the same simple method as in Coral (
<xref rid="btw321-B21" ref-type="bibr">Salmela and Schröder, 2011</xref>
). First, the current read is set to be the initial consensus. Then we take each friend of the current read one by one, align them against the current consensus using banded alignment, and finally update the consensus according to the alignment. Finally, we inspect every column of the multiple alignment and correct the current read towards the consensus if the consensus is supported by at least two reads.</p>
<p>We implemented the above procedure in a tool called Long Read Multiple Aligner (LoRMA) using the GATB library (
<xref rid="btw321-B8" ref-type="bibr">Drezen
<italic>et al.</italic>
, 2014</xref>
) for the implementation of the DBG.</p>
</sec>
</sec>
<sec>
<title>4 Experimental results</title>
<p>We ran experiments on three datasets that are detailed in
<xref rid="btw321-T1" ref-type="table">Table 1</xref>
. The simulated
<italic>Escherichia coli</italic>
dataset was generated with PBSIM (
<xref rid="btw321-B18" ref-type="bibr">Ono
<italic>et al.</italic>
, 2013</xref>
) using the following parameters: mean accuracy 85%, average read length 10 000, and minimum read length 1000. The other two datasets are real data. Although our method works solely on the PacBio reads, the table also includes statistics of complementary Illumina reads that were used to compare our method against hybrid methods that also need short reads. All experiments were run on 32 GB RAM machines equipped with 8 cores.</p>
<sec>
<title>4.1 Evaluation of the quality of error correction</title>
<p>In the simulated dataset, the genomic position from which each read derives is known. Therefore, the quality of error correction on the simulated dataset is evaluated by aligning the corrected read against the corresponding correct genomic sequence. We allow free deletions in the flanks of the corrected read because the tools trim regions they are not able to correct. To check whether the corrected reads align to the correct genomic position, we aligned the corrected reads on the reference genome with BLASR (
<xref rid="btw321-B6" ref-type="bibr">Chaisson and Tesler, 2012</xref>
) keeping only a single best alignment for each read. The following statistics were computed:
<list list-type="bullet">
<list-item>
<p>
<bold>Size:</bold>
The relative size of the corrected read set as compared to the original one.</p>
</list-item>
<list-item>
<p>
<bold>Error rate:</bold>
The number of substitutions, insertions and deletions divided by the length of the correct genomic sequence.</p>
</list-item>
<list-item>
<p>
<bold>Correctly aligned:</bold>
The relative number of reads that align to the genomic position from which the read derives.</p>
</list-item>
</list>
</p>
<p>To evaluate the quality of error correction on the real datasets, we used BLASR (
<xref rid="btw321-B6" ref-type="bibr">Chaisson and Tesler, 2012</xref>
) to align the original and corrected reads on the reference genome. For each read, we used only a single best alignment because a correct read should only have one continuous alignment against the reference. Thus, chimeric reads will be only partially aligned. We computed the following statistics:
<list list-type="bullet">
<list-item>
<p>
<bold>Size:</bold>
The relative size of the corrected read set as compared to the original one.</p>
</list-item>
<list-item>
<p>
<bold>Aligned:</bold>
The relative size of the aligned regions as compared to the complete read set.</p>
</list-item>
<list-item>
<p>
<bold>Error rate:</bold>
The number of substitutions, insertions and deletions in the aligned regions divided by the length of the aligned regions in the reference sequence.</p>
</list-item>
<list-item>
<p>
<bold>Genome coverage:</bold>
The proportion of the genome covered by the aligned regions of the reads.</p>
</list-item>
</list>
</p>
<p>Together, these statistics measure three aspects of the quality of error correction. Size measures the throughput of the method. Aligned and error rate together measure the accuracy of correction. Finally, genome coverage estimates whether reads deriving from all regions of the genome are corrected.</p>
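Once the per-read best alignments are available, the statistics are straightforward to compute; a minimal sketch (the tuple layout is an editorial convention, not BLASR's output format):

def correction_stats(original_bases, corrected_bases, alignments, genome_len):
    # alignments: one record per corrected read, simplified here to tuples of
    # (aligned_read_bases, aligned_ref_bases, errors, ref_start, ref_end) taken
    # from the single best alignment kept for each read.
    aligned_read = sum(a[0] for a in alignments)
    aligned_ref = sum(a[1] for a in alignments)
    errors = sum(a[2] for a in alignments)
    covered = set()
    for _, _, _, start, end in alignments:
        covered.update(range(start, end))            # coarse per-base coverage
    return {
        "size": corrected_bases / original_bases,    # throughput of the method
        "aligned": aligned_read / corrected_bases,   # fraction of bases aligned
        "error_rate": errors / aligned_ref,          # errors per aligned ref base
        "genome_coverage": len(covered) / genome_len,
    }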
</sec>
<sec>
<title>4.2 Parameters of our method</title>
<p>We ran experiments on the real
<italic>E.coli</italic>
dataset to test the effect of parameters on the performance of our method. First, we tried several progressions of
<italic>k</italic>
in the first phase where LoRDEC* is run iteratively. We started all iterations with
<italic>k</italic>
 = 19 because given the high error rate of the data
<italic>k</italic>
must be small for correct
<italic>k</italic>
-mers to occur in the read data. The results of these experiments are presented in
<xref rid="btw321-T2" ref-type="table">Table 2</xref>
. With more iterations, the size of the corrected read set and the aligned proportion of reads decrease, but the aligned regions are more accurate. The decrease in the size of the corrected read set may be a result of better correction because PacBio reads have more insertions than deletions. However, the decrease in the aligned proportion of the reads may indicate some accumulation of false corrections. The runtime of the method increases with the number of iterations but later iterations take less time as the reads have already been partially corrected during the previous rounds. To balance out these effects, we chose to use a moderate number of iterations, i.e.
<italic>k</italic>
 = 19, 40, 61, by default, which also optimizes the error rate of the aligned regions.
<table-wrap id="btw321-T2" orientation="portrait" position="float">
<label>Table 2.</label>
<caption>
<p>The progression of
<italic>k</italic>
for the iterations of LoRDEC*</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1">
<italic>k</italic>
progression</th>
<th rowspan="1" colspan="1">Size (%)</th>
<th rowspan="1" colspan="1">Aligned (%)</th>
<th rowspan="1" colspan="1">Error rate (%)</th>
<th rowspan="1" colspan="1">Elapsed time (h)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">19</td>
<td align="char" char="." rowspan="1" colspan="1">64.901</td>
<td align="char" char="." rowspan="1" colspan="1">99.499</td>
<td align="char" char="." rowspan="1" colspan="1">0.294</td>
<td align="char" char="." rowspan="1" colspan="1">4.08</td>
</tr>
<tr>
<td colspan="5" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,22,25,28,31</td>
<td align="char" char="." rowspan="1" colspan="1">66.702</td>
<td align="char" char="." rowspan="1" colspan="1">99.302</td>
<td align="char" char="." rowspan="1" colspan="1">0.276</td>
<td align="char" char="." rowspan="1" colspan="1">12.97</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,22,25,28,31,34,37,40,43,46</td>
<td align="char" char="." rowspan="1" colspan="1">66.630</td>
<td align="char" char="." rowspan="1" colspan="1">99.311</td>
<td align="char" char="." rowspan="1" colspan="1">0.274</td>
<td align="char" char="." rowspan="1" colspan="1">20.65</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,22,25,28,31,34,37,40,43,46, 49,52,55,58,61</td>
<td align="char" char="." rowspan="1" colspan="1">66.546</td>
<td align="char" char="." rowspan="1" colspan="1">99.296</td>
<td align="char" char="." rowspan="1" colspan="1">0.271</td>
<td align="char" char="." rowspan="1" colspan="1">27.53</td>
</tr>
<tr>
<td colspan="5" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,26,33</td>
<td align="char" char="." rowspan="1" colspan="1">66.401</td>
<td align="char" char="." rowspan="1" colspan="1">99.329</td>
<td align="char" char="." rowspan="1" colspan="1">0.274</td>
<td align="char" char="." rowspan="1" colspan="1">9.58</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,26,33,40,47</td>
<td align="char" char="." rowspan="1" colspan="1">66.230</td>
<td align="char" char="." rowspan="1" colspan="1">99.298</td>
<td align="char" char="." rowspan="1" colspan="1">0.271</td>
<td align="char" char="." rowspan="1" colspan="1">13.07</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,26,33,40,47,54,61</td>
<td align="char" char="." rowspan="1" colspan="1">66.144</td>
<td align="char" char="." rowspan="1" colspan="1">99.283</td>
<td align="char" char="." rowspan="1" colspan="1">0.266</td>
<td align="char" char="." rowspan="1" colspan="1">16.08</td>
</tr>
<tr>
<td colspan="5" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,33</td>
<td align="char" char="." rowspan="1" colspan="1">66.705</td>
<td align="char" char="." rowspan="1" colspan="1">99.358</td>
<td align="char" char="." rowspan="1" colspan="1">0.277</td>
<td align="char" char="." rowspan="1" colspan="1">7.68</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,33,47</td>
<td align="char" char="." rowspan="1" colspan="1">66.178</td>
<td align="char" char="." rowspan="1" colspan="1">99.352</td>
<td align="char" char="." rowspan="1" colspan="1">0.268</td>
<td align="char" char="." rowspan="1" colspan="1">10.58</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,33,47,61</td>
<td align="char" char="." rowspan="1" colspan="1">65.991</td>
<td align="char" char="." rowspan="1" colspan="1">99.301</td>
<td align="char" char="." rowspan="1" colspan="1">0.261</td>
<td align="char" char="." rowspan="1" colspan="1">11.92</td>
</tr>
<tr>
<td colspan="5" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,40</td>
<td align="char" char="." rowspan="1" colspan="1">66.619</td>
<td align="char" char="." rowspan="1" colspan="1">99.360</td>
<td align="char" char="." rowspan="1" colspan="1">0.272</td>
<td align="char" char="." rowspan="1" colspan="1">8.32</td>
</tr>
<tr>
<td rowspan="1" colspan="1">19,40,61</td>
<td align="char" char="." rowspan="1" colspan="1">66.223</td>
<td align="char" char="." rowspan="1" colspan="1">99.317</td>
<td align="char" char="." rowspan="1" colspan="1">0.257</td>
<td align="char" char="." rowspan="1" colspan="1">10.30</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
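<p>The iterative first phase can be driven by a simple wrapper such as the sketch below. The run_lordec_star() call is a hypothetical placeholder for the actual LoRDEC* self-correction invocation (its real command-line flags are not reproduced here); the default progression 19, 40, 61 is the one chosen above.</p>
<preformat>
# Sketch of the first phase: run LoRDEC* repeatedly with increasing k, feeding
# each round's output into the next round. The command below is a hypothetical
# placeholder, not the real LoRDEC* command line.
import subprocess

def run_lordec_star(reads_in, reads_out, k):
    # Placeholder invocation; substitute the actual LoRDEC* self-correction call.
    subprocess.run(["lordec-star", "-i", reads_in, "-o", reads_out, "-k", str(k)],
                   check=True)

def iterative_correction(raw_reads, k_progression=(19, 40, 61)):
    current = raw_reads
    for step, k in enumerate(k_progression):
        out = "corrected.round%d.k%d.fasta" % (step, k)
        run_lordec_star(current, out, k)
        current = out   # later rounds start from already partially corrected reads
    return current
</preformat>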
<p>LoRMA also builds a DBG of the reads and thus we need to specify
<italic>k</italic>
. We investigated the effect of the value of
<italic>k</italic>
on the
<italic>E.coli</italic>
dataset.
<xref rid="btw321-T3" ref-type="table">Table 3</xref>
shows the effect of
<italic>k</italic>
on the performance of LoRMA. Because the DBG is only used to detect similar reads in LoRMA, the performance is not greatly affected by the choice of
<italic>k</italic>
. As
<italic>k</italic>
 increases, the throughput of the method decreases slightly and the runtime increases slightly, but both effects are very modest. A simplified sketch of such k-mer-based detection of similar reads is given after Table 3. For the rest of the experiments, we set
<italic>k</italic>
 = 19.
<table-wrap id="btw321-T3" orientation="portrait" position="float">
<label>Table 3.</label>
<caption>
<p>The effect of the
<italic>k</italic>
-mer size in LoRMA. The elapsed time is the runtime of LoRDEC*+LoRMA</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1">k</th>
<th rowspan="1" colspan="1">Size</th>
<th rowspan="1" colspan="1">Aligned</th>
<th rowspan="1" colspan="1">Error rate</th>
<th rowspan="1" colspan="1">Elapsed time</th>
<th rowspan="1" colspan="1">Memory peak</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(h)</th>
<th rowspan="1" colspan="1">(GB)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">19</td>
<td align="char" char="." rowspan="1" colspan="1">66.238</td>
<td align="char" char="." rowspan="1" colspan="1">99.306</td>
<td align="char" char="." rowspan="1" colspan="1">0.256</td>
<td align="char" char="." rowspan="1" colspan="1">10.38</td>
<td align="char" char="." rowspan="1" colspan="1">17.197</td>
</tr>
<tr>
<td rowspan="1" colspan="1">40</td>
<td align="char" char="." rowspan="1" colspan="1">66.170</td>
<td align="char" char="." rowspan="1" colspan="1">99.309</td>
<td align="char" char="." rowspan="1" colspan="1">0.258</td>
<td align="char" char="." rowspan="1" colspan="1">10.53</td>
<td align="char" char="." rowspan="1" colspan="1">16.958</td>
</tr>
<tr>
<td rowspan="1" colspan="1">61</td>
<td align="char" char="." rowspan="1" colspan="1">65.941</td>
<td align="char" char="." rowspan="1" colspan="1">99.313</td>
<td align="char" char="." rowspan="1" colspan="1">0.261</td>
<td align="char" char="." rowspan="1" colspan="1">13.87</td>
<td align="char" char="." rowspan="1" colspan="1">16.908</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
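<p>Because the graph is only used to find reads similar to the one being corrected, the choice of k mainly changes which k-mers are shared between reads. The simplified sketch below illustrates this idea with a plain k-mer index instead of an actual de Bruijn graph; it is not LoRMA's implementation, only an illustration of k-mer-based read similarity.</p>
<preformat>
# Simplified illustration (not LoRMA's actual DBG machinery): rank reads by the
# number of k-mers they share with a target read, using a plain k-mer index.
from collections import defaultdict

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_kmer_index(reads, k):
    index = defaultdict(set)          # k-mer mapped to the set of read ids containing it
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            index[km].add(rid)
    return index

def similar_reads(target_id, reads, index, k):
    shared = defaultdict(int)         # read id mapped to the number of shared k-mers
    for km in kmers(reads[target_id], k):
        for rid in index[km]:
            if rid != target_id:
                shared[rid] += 1
    return sorted(shared, key=shared.get, reverse=True)
</preformat>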
<p>Another parameter of the method is the size of the friends set of the current read (the -friends parameter). We also tested the effect of this parameter on the
<italic>E.coli</italic>
dataset. Because the optimal value of this parameter might depend on the coverage of the dataset, we created several subsets with different coverages to investigate this.
<xref rid="btw321-T4" ref-type="table">Table 4</xref>
 shows the results of these experiments. The accuracy of the correction increases as the size of the friends set increases. However, for the dataset with the lowest coverage, 75×, the coverage of the genome by the corrected reads decreases when the size of the friends set is increased, indicating that low-coverage areas are not well corrected. Increasing the size of the friends set also increases the running time of the method. To keep the running time reasonable, we set the default value of the parameter to a fairly low value, 7. A sketch of this trade-off is given after Table 4.
<table-wrap id="btw321-T4" orientation="portrait" position="float">
<label>Table 4.</label>
<caption>
<p>The effect of the size of the friends set on the quality of the correction. The elapsed time is the runtime of LoRDEC*+LoRMA</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1">Friends</th>
<th rowspan="1" colspan="1">5</th>
<th rowspan="1" colspan="1">7</th>
<th rowspan="1" colspan="1">10</th>
<th rowspan="1" colspan="1">15</th>
<th rowspan="1" colspan="1">20</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">
<italic>Coverage 75×</italic>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Size (%)</td>
<td align="char" char="." rowspan="1" colspan="1">59.173</td>
<td align="char" char="." rowspan="1" colspan="1">59.164</td>
<td align="char" char="." rowspan="1" colspan="1">59.146</td>
<td align="char" char="." rowspan="1" colspan="1">59.109</td>
<td align="char" char="." rowspan="1" colspan="1">59.085</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Aligned (%)</td>
<td align="char" char="." rowspan="1" colspan="1">98.894</td>
<td align="char" char="." rowspan="1" colspan="1">98.983</td>
<td align="char" char="." rowspan="1" colspan="1">99.099</td>
<td align="char" char="." rowspan="1" colspan="1">99.192</td>
<td align="char" char="." rowspan="1" colspan="1">99.226</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Error rate (%)</td>
<td align="char" char="." rowspan="1" colspan="1">0.169</td>
<td align="char" char="." rowspan="1" colspan="1">0.156</td>
<td align="char" char="." rowspan="1" colspan="1">0.148</td>
<td align="char" char="." rowspan="1" colspan="1">0.131</td>
<td align="char" char="." rowspan="1" colspan="1">0.128</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Gen. cov. (%)</td>
<td align="char" char="." rowspan="1" colspan="1">90.918</td>
<td align="char" char="." rowspan="1" colspan="1">90.907</td>
<td align="char" char="." rowspan="1" colspan="1">90.900</td>
<td align="char" char="." rowspan="1" colspan="1">90.888</td>
<td align="char" char="." rowspan="1" colspan="1">90.884</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Elapsed time (h)</td>
<td align="char" char="." rowspan="1" colspan="1">1.13</td>
<td align="char" char="." rowspan="1" colspan="1">1.22</td>
<td align="char" char="." rowspan="1" colspan="1">1.53</td>
<td align="char" char="." rowspan="1" colspan="1">1.88</td>
<td align="char" char="." rowspan="1" colspan="1">2.27</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Memory (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">14.522</td>
<td align="char" char="." rowspan="1" colspan="1">14.518</td>
<td align="char" char="." rowspan="1" colspan="1">14.522</td>
<td align="char" char="." rowspan="1" colspan="1">14.515</td>
<td align="char" char="." rowspan="1" colspan="1">14.525</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Disk (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">1.076</td>
<td align="char" char="." rowspan="1" colspan="1">1.076</td>
<td align="char" char="." rowspan="1" colspan="1">1.076</td>
<td align="char" char="." rowspan="1" colspan="1">1.076</td>
<td align="char" char="." rowspan="1" colspan="1">1.076</td>
</tr>
<tr>
<td colspan="6" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<italic>Coverage 100×</italic>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Size (%)</td>
<td align="char" char="." rowspan="1" colspan="1">65.759</td>
<td align="char" char="." rowspan="1" colspan="1">65.738</td>
<td align="char" char="." rowspan="1" colspan="1">65.723</td>
<td align="char" char="." rowspan="1" colspan="1">65.670</td>
<td align="char" char="." rowspan="1" colspan="1">65.607</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Aligned (%)</td>
<td align="char" char="." rowspan="1" colspan="1">98.091</td>
<td align="char" char="." rowspan="1" colspan="1">98.317</td>
<td align="char" char="." rowspan="1" colspan="1">98.491</td>
<td align="char" char="." rowspan="1" colspan="1">98.556</td>
<td align="char" char="." rowspan="1" colspan="1">98.620</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Error rate(%)</td>
<td align="char" char="." rowspan="1" colspan="1">0.152</td>
<td align="char" char="." rowspan="1" colspan="1">0.140</td>
<td align="char" char="." rowspan="1" colspan="1">0.134</td>
<td align="char" char="." rowspan="1" colspan="1">0.114</td>
<td align="char" char="." rowspan="1" colspan="1">0.110</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Gen. cov. (%)</td>
<td align="char" char="." rowspan="1" colspan="1">99.404</td>
<td align="char" char="." rowspan="1" colspan="1">99.403</td>
<td align="char" char="." rowspan="1" colspan="1">99.405</td>
<td align="char" char="." rowspan="1" colspan="1">99.403</td>
<td align="char" char="." rowspan="1" colspan="1">99.405</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Elapsed time (h)</td>
<td align="char" char="." rowspan="1" colspan="1">2.53</td>
<td align="char" char="." rowspan="1" colspan="1">3.32</td>
<td align="char" char="." rowspan="1" colspan="1">4.32</td>
<td align="char" char="." rowspan="1" colspan="1">5.80</td>
<td align="char" char="." rowspan="1" colspan="1">7.08</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Memory (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">14.720</td>
<td align="char" char="." rowspan="1" colspan="1">14.720</td>
<td align="char" char="." rowspan="1" colspan="1">14.712</td>
<td align="char" char="." rowspan="1" colspan="1">14.723</td>
<td align="char" char="." rowspan="1" colspan="1">14.720</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Disk (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">1.417</td>
<td align="char" char="." rowspan="1" colspan="1">1.416</td>
<td align="char" char="." rowspan="1" colspan="1">1.417</td>
<td align="char" char="." rowspan="1" colspan="1">1.416</td>
<td align="char" char="." rowspan="1" colspan="1">1.416</td>
</tr>
<tr>
<td colspan="6" align="center" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<italic>Coverage 175×</italic>
</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">Size (%)</td>
<td align="char" char="." rowspan="1" colspan="1">66.933</td>
<td align="char" char="." rowspan="1" colspan="1">66.906</td>
<td align="char" char="." rowspan="1" colspan="1">66.905</td>
<td align="char" char="." rowspan="1" colspan="1">66.852</td>
<td align="char" char="." rowspan="1" colspan="1">66.816</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Aligned (%)</td>
<td align="char" char="." rowspan="1" colspan="1">98.927</td>
<td align="char" char="." rowspan="1" colspan="1">98.973</td>
<td align="char" char="." rowspan="1" colspan="1">99.153</td>
<td align="char" char="." rowspan="1" colspan="1">99.011</td>
<td align="char" char="." rowspan="1" colspan="1">99.104</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Error rate(%)</td>
<td align="char" char="." rowspan="1" colspan="1">0.222</td>
<td align="char" char="." rowspan="1" colspan="1">0.194</td>
<td align="char" char="." rowspan="1" colspan="1">0.191</td>
<td align="char" char="." rowspan="1" colspan="1">0.140</td>
<td align="char" char="." rowspan="1" colspan="1">0.133</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Gen. cov. (%)</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Elapsed time (h)</td>
<td align="char" char="." rowspan="1" colspan="1">6.77</td>
<td align="char" char="." rowspan="1" colspan="1">8.35</td>
<td align="char" char="." rowspan="1" colspan="1">10.62</td>
<td align="char" char="." rowspan="1" colspan="1">14.07</td>
<td align="char" char="." rowspan="1" colspan="1">17.22</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Memory (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">16.009</td>
<td align="char" char="." rowspan="1" colspan="1">16.016</td>
<td align="char" char="." rowspan="1" colspan="1">16.003</td>
<td align="char" char="." rowspan="1" colspan="1">16.002</td>
<td align="char" char="." rowspan="1" colspan="1">16.006</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Disk (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">2.361</td>
<td align="char" char="." rowspan="1" colspan="1">2.361</td>
<td align="char" char="." rowspan="1" colspan="1">2.362</td>
<td align="char" char="." rowspan="1" colspan="1">2.362</td>
<td align="char" char="." rowspan="1" colspan="1">2.362</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
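<p>Continuing the k-mer sketch above, the -friends parameter can be read as a bound on how many of the top-ranked similar reads are kept for the multiple alignment of each read, which is consistent with larger values improving accuracy while increasing the running time in Table 4. The helper below is a sketch under that assumption, reusing build_kmer_index() and similar_reads() from the previous block; it is not LoRMA's actual friend selection.</p>
<preformat>
# Sketch of the -friends trade-off: keep only the n_friends best-supported
# candidates for the multiple alignment step of each read.
def friends_set(target_id, reads, index, k, n_friends=7):
    # n_friends mirrors the -friends parameter; 7 is the default chosen in the text.
    return similar_reads(target_id, reads, index, k)[:n_friends]

if __name__ == "__main__":
    reads = ["ACGTACGTGGA", "ACGTACGTGGT", "TTTTGGGGCCC"]
    idx = build_kmer_index(reads, k=4)
    print(friends_set(0, reads, idx, k=4, n_friends=2))   # prints [1]
</preformat>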
</sec>
<sec>
<title>4.3 Comparison against previous methods</title>
<p>We compared our new method against PBcR (
<xref rid="btw321-B3" ref-type="bibr">Berlin
<italic>et al.</italic>
, 2015</xref>
;
<xref rid="btw321-B10" ref-type="bibr">Koren
<italic>et al.</italic>
, 2012</xref>
), which is, to the best of our knowledge, the only previous self-correction method for long reads, and against LoRDEC (
<xref rid="btw321-B20" ref-type="bibr">Salmela and Rivals, 2014</xref>
), proovread (
<xref rid="btw321-B9" ref-type="bibr">Hackl
<italic>et al.</italic>
, 2014</xref>
) and Jabba (
<xref rid="btw321-B16" ref-type="bibr">Miclotte
<italic>et al.</italic>
, 2015</xref>
), which also use complementary short reads.
<xref rid="btw321-T5" ref-type="table">Table 5</xref>
shows the results on the simulated dataset comparing our new method to PBcR using long reads only.
<xref rid="btw321-T6" ref-type="table">Table 6</xref>
 shows the results of the comparison of our new method against previous methods on the real datasets. In the following, LoRDEC refers to the hybrid correction method that also uses short reads, and LoRDEC*+LoRMA refers to our new method, in which LoRDEC* is run in long-read self-correction mode followed by LoRMA.
<table-wrap id="btw321-T5" orientation="portrait" position="float">
<label>Table 5.</label>
<caption>
<p>Comparison of LoRDEC*+LoRMA against PBcR (PacBio only) on the simulated
<italic>E. coli</italic>
dataset</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1">Tool</th>
<th rowspan="1" colspan="1">Size</th>
<th rowspan="1" colspan="1">Error rate</th>
<th rowspan="1" colspan="1">Correctly aligned</th>
<th rowspan="1" colspan="1">Correctly aligned</th>
<th rowspan="1" colspan="1">Elapsed time</th>
<th rowspan="1" colspan="1">Memory peak</th>
<th rowspan="1" colspan="1">Disk peak</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">
<inline-formula id="IE20">
<mml:math id="IEQ20">
<mml:mo></mml:mo>
</mml:math>
</inline-formula>
2000 bp (%)</th>
<th rowspan="1" colspan="1">(h)</th>
<th rowspan="1" colspan="1">(GB)</th>
<th rowspan="1" colspan="1">(GB)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Original</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">13.015</td>
<td align="char" char="." rowspan="1" colspan="1">99.997</td>
<td align="char" char="." rowspan="1" colspan="1">99.997</td>
<td align="char" char="." rowspan="1" colspan="1"></td>
<td align="char" char="." rowspan="1" colspan="1"></td>
<td align="char" char="." rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">PBcR (PacBio only)</td>
<td align="char" char="." rowspan="1" colspan="1">92.457</td>
<td align="char" char="." rowspan="1" colspan="1">0.604</td>
<td align="char" char="." rowspan="1" colspan="1">99.953</td>
<td align="char" char="." rowspan="1" colspan="1">99.984</td>
<td align="char" char="." rowspan="1" colspan="1">2.63</td>
<td align="char" char="." rowspan="1" colspan="1">9.066</td>
<td align="char" char="." rowspan="1" colspan="1">17.823</td>
</tr>
<tr>
<td rowspan="1" colspan="1">LoRDEC
<inline-formula id="IE21">
<mml:math id="IEQ21">
<mml:mrow>
<mml:msup>
<mml:mrow></mml:mrow>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
LoRMA</td>
<td align="char" char="." rowspan="1" colspan="1">94.372</td>
<td align="char" char="." rowspan="1" colspan="1">0.109</td>
<td align="char" char="." rowspan="1" colspan="1">96.866</td>
<td align="char" char="." rowspan="1" colspan="1">99.987</td>
<td align="char" char="." rowspan="1" colspan="1">14.30</td>
<td align="char" char="." rowspan="1" colspan="1">17.338</td>
<td align="char" char="." rowspan="1" colspan="1">3.192</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="btw321-T6" orientation="portrait" position="float">
<label>Table 6.</label>
<caption>
<p>Comparison of both hybrid and self-correction tools on PacBio data</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">Tool</th>
<th rowspan="1" colspan="1">Size</th>
<th rowspan="1" colspan="1">Aligned</th>
<th rowspan="1" colspan="1">Error rate</th>
<th rowspan="1" colspan="1">Genome coverage</th>
<th rowspan="1" colspan="1">Elapsed time</th>
<th rowspan="1" colspan="1">Memory peak</th>
<th rowspan="1" colspan="1">Disk peak</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(%)</th>
<th rowspan="1" colspan="1">(h)</th>
<th rowspan="1" colspan="1">(GB)</th>
<th rowspan="1" colspan="1">(GB)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">
<italic>E. coli</italic>
</td>
<td rowspan="1" colspan="1">Original</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">71.108</td>
<td align="char" char="." rowspan="1" colspan="1">16.9126</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1"></td>
<td align="char" char="." rowspan="1" colspan="1"></td>
<td align="char" char="." rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">LoRDEC</td>
<td align="char" char="." rowspan="1" colspan="1">65.672</td>
<td align="char" char="." rowspan="1" colspan="1">98.944</td>
<td align="char" char="." rowspan="1" colspan="1">0.1143</td>
<td align="char" char="." rowspan="1" colspan="1">99.820</td>
<td align="char" char="." rowspan="1" colspan="1">0.96</td>
<td align="char" char="." rowspan="1" colspan="1">0.368</td>
<td align="char" char="." rowspan="1" colspan="1">1.570</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">proovread</td>
<td align="char" char="." rowspan="1" colspan="1">61.590</td>
<td align="char" char="." rowspan="1" colspan="1">98.603</td>
<td align="char" char="." rowspan="1" colspan="1">0.2789</td>
<td align="char" char="." rowspan="1" colspan="1">99.728</td>
<td align="char" char="." rowspan="1" colspan="1">28.65</td>
<td align="char" char="." rowspan="1" colspan="1">9.522</td>
<td align="char" char="." rowspan="1" colspan="1">7.174</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">PBcR (with Illumina)</td>
<td align="char" char="." rowspan="1" colspan="1">52.103</td>
<td align="char" char="." rowspan="1" colspan="1">98.507</td>
<td align="char" char="." rowspan="1" colspan="1">0.0682</td>
<td align="char" char="." rowspan="1" colspan="1">98.769</td>
<td align="char" char="." rowspan="1" colspan="1">15.13</td>
<td align="char" char="." rowspan="1" colspan="1">17.429</td>
<td align="char" char="." rowspan="1" colspan="1">160.154</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Jabba</td>
<td align="char" char="." rowspan="1" colspan="1">2.873</td>
<td align="char" char="." rowspan="1" colspan="1">99.945</td>
<td align="char" char="." rowspan="1" colspan="1">0.0003</td>
<td align="char" char="." rowspan="1" colspan="1">99.745</td>
<td align="char" char="." rowspan="1" colspan="1">0.02</td>
<td align="char" char="." rowspan="1" colspan="1">0.168</td>
<td align="char" char="." rowspan="1" colspan="1">0.606</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">PBcR (only PacBio)</td>
<td align="char" char="." rowspan="1" colspan="1">51.068</td>
<td align="char" char="." rowspan="1" colspan="1">86.023</td>
<td align="char" char="." rowspan="1" colspan="1">0.6905</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">1.68</td>
<td align="char" char="." rowspan="1" colspan="1">22.00</td>
<td align="char" char="." rowspan="1" colspan="1">16.070</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">LoRDEC*+LoRMA</td>
<td align="char" char="." rowspan="1" colspan="1">66.223</td>
<td align="char" char="." rowspan="1" colspan="1">99.318</td>
<td align="char" char="." rowspan="1" colspan="1">0.2572</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">10.40</td>
<td align="char" char="." rowspan="1" colspan="1">16.984</td>
<td align="char" char="." rowspan="1" colspan="1">2.824</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Yeast</td>
<td rowspan="1" colspan="1">Original</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">89.929</td>
<td align="char" char="." rowspan="1" colspan="1">16.8442</td>
<td align="char" char="." rowspan="1" colspan="1">99.974</td>
<td align="char" char="." rowspan="1" colspan="1"></td>
<td align="char" char="." rowspan="1" colspan="1"></td>
<td align="char" char="." rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">LoRDEC</td>
<td align="char" char="." rowspan="1" colspan="1">75.522</td>
<td align="char" char="." rowspan="1" colspan="1">97.337</td>
<td align="char" char="." rowspan="1" colspan="1">0.9987</td>
<td align="char" char="." rowspan="1" colspan="1">99.833</td>
<td align="char" char="." rowspan="1" colspan="1">3.17</td>
<td align="char" char="." rowspan="1" colspan="1">0.451</td>
<td align="char" char="." rowspan="1" colspan="1">2.776</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">proovread</td>
<td align="char" char="." rowspan="1" colspan="1">0.306</td>
<td align="char" char="." rowspan="1" colspan="1">97.156</td>
<td align="char" char="." rowspan="1" colspan="1">0.8004</td>
<td align="char" char="." rowspan="1" colspan="1">20.346</td>
<td align="char" char="." rowspan="1" colspan="1">11.18</td>
<td align="char" char="." rowspan="1" colspan="1">4.764</td>
<td align="char" char="." rowspan="1" colspan="1">7.162</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">PBcR (with Illumina)</td>
<td align="char" char="." rowspan="1" colspan="1">57.337</td>
<td align="char" char="." rowspan="1" colspan="1">98.100</td>
<td align="char" char="." rowspan="1" colspan="1">0.3342</td>
<td align="char" char="." rowspan="1" colspan="1">99.652</td>
<td align="char" char="." rowspan="1" colspan="1">22.05</td>
<td align="char" char="." rowspan="1" colspan="1">20.085</td>
<td align="char" char="." rowspan="1" colspan="1">157.726</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Jabba</td>
<td align="char" char="." rowspan="1" colspan="1">24.979</td>
<td align="char" char="." rowspan="1" colspan="1">99.484</td>
<td align="char" char="." rowspan="1" colspan="1">0.1279</td>
<td align="char" char="." rowspan="1" colspan="1">99.900</td>
<td align="char" char="." rowspan="1" colspan="1">0.17</td>
<td align="char" char="." rowspan="1" colspan="1">1.031</td>
<td align="char" char="." rowspan="1" colspan="1">0.993</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">PBcR (only PacBio)</td>
<td align="char" char="." rowspan="1" colspan="1">60.065</td>
<td align="char" char="." rowspan="1" colspan="1">95.822</td>
<td align="char" char="." rowspan="1" colspan="1">2.1018</td>
<td align="char" char="." rowspan="1" colspan="1">99.907</td>
<td align="char" char="." rowspan="1" colspan="1">4.42</td>
<td align="char" char="." rowspan="1" colspan="1">9.571</td>
<td align="char" char="." rowspan="1" colspan="1">24.610</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">LoRDEC*+LoRMA</td>
<td align="char" char="." rowspan="1" colspan="1">71.987</td>
<td align="char" char="." rowspan="1" colspan="1">98.088</td>
<td align="char" char="." rowspan="1" colspan="1">0.3644</td>
<td align="char" char="." rowspan="1" colspan="1">99.375</td>
<td align="char" char="." rowspan="1" colspan="1">21.08</td>
<td align="char" char="." rowspan="1" colspan="1">17.968</td>
<td align="char" char="." rowspan="1" colspan="1">4.852</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="tblfn1">
<p>Results for tools that also utilize Illumina data are shown on a grey background</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>The PBcR pipeline from Celera Assembler version 8.3rc2 was run without the assembly phase and with memory limited to 16 GB. PBcR was run both using only PacBio reads and also utilizing the short read data. When PBcR utilized the short read data, the PacBio reads were divided into three subsets, each of which was corrected in its own run. Proovread v2.12 was run with the sequence/fastq files chunked to 20M as per the usage manual and used 16 mapping threads. LoRDEC used an abundance threshold of 3 and a
<italic>k</italic>
-mer size of 19, similar to the experiments by
<xref rid="btw321-B20" ref-type="bibr">Salmela and Rivals (2014)</xref>
. Jabba 1.1.0 used
<italic>k</italic>
-mer size 31 and short output mode. LoRMA was run with 6 threads. The
<italic>k</italic>
-mer sizes for the LoRDEC*+LoRMA iteration steps were set to 19, 40 and 61. For proovread and LoRDEC, we present results for trimmed and split reads.</p>
<p>
<xref rid="btw321-T5" ref-type="table">Table 5</xref>
 shows that on the simulated data both PBcR and LoRDEC*+LoRMA are able to correct most of the data. Our new method achieves a lower error rate and a higher throughput. The fraction of corrected reads aligning to the correct genomic position is lower for LoRDEC*+LoRMA than for PBcR when all reads are considered, which suggests that LoRDEC*+LoRMA tends to overcorrect some reads. However, for corrected reads longer than 2000 bp this difference disappears, so we conclude that the overcorrected reads are short. Compared with the other self-correction method, PBcR, our new tool has a higher throughput and produces more accurate results on both real datasets, as shown in
<xref rid="btw321-T6" ref-type="table">Table 6</xref>
. Of the hybrid methods, Jabba has a lower error rate than LoRDEC*+LoRMA, but its throughput is also lower. Compared with the other hybrid methods, LoRDEC*+LoRMA has comparable accuracy and throughput. All hybrid methods produce corrected reads that do not cover the whole
<italic>E.coli</italic>
 reference, which could result from coverage bias in the Illumina data. On the yeast data, proovread produced few corrected reads, and thus the genome coverage of the corrected reads is very low.</p>
<p>
<xref rid="btw321-T6" ref-type="table">Table 6</xref>
 shows that our method is slower and uses more memory than PBcR in self-correction mode, but its disk usage is lower. On the
<italic>E.coli</italic>
 dataset, our new method is faster than proovread and PBcR utilizing short read data, but slower than LoRDEC, Jabba or PBcR using only PacBio data. On the yeast dataset, our method is faster than PBcR in hybrid mode but slower than the others.</p>
<p>On the
<italic>E.coli</italic>
and yeast datasets, LoRDEC*+LoRMA uses 45% and 37%, respectively, of its running time on LoRDEC* iterations. On both datasets, the error rate of the reads after LoRDEC* iterations and trimming was 0.5%.</p>
</sec>
<sec>
<title>4.4 The effect of coverage</title>
<p>Especially for larger genomes, it is of interest to know how much coverage is needed for the error correction to succeed. We investigated this by creating random subsets of the
<italic>E.coli</italic>
dataset with coverages 25×, 50×, 100× and 150×. We then ran our method and PBcR (
<xref rid="btw321-B3" ref-type="bibr">Berlin
<italic>et al.</italic>
, 2015</xref>
;
<xref rid="btw321-B10" ref-type="bibr">Koren
<italic>et al.</italic>
, 2012</xref>
) on these subsets to investigate the effect of coverage on the error correction performance.
<xref rid="btw321-T7" ref-type="table">Table 7</xref>
 shows the results of these experiments. The other tools, LoRDEC, Jabba and proovread, also use the complementary Illumina reads, so the coverage of the PacBio reads does not affect their performance. One plausible way of drawing such coverage-limited subsets is sketched after Table 7.
<table-wrap id="btw321-T7" orientation="portrait" position="float">
<label>Table 7.</label>
<caption>
<p>The effect of coverage of the PacBio read set on the quality of the correction</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th colspan="5" rowspan="1">LoRDEC*+LoRMA
<hr></hr>
</th>
<th colspan="5" rowspan="1">PBcR
<hr></hr>
</th>
</tr>
<tr>
<th rowspan="1" colspan="1">Coverage</th>
<th rowspan="1" colspan="1">25×</th>
<th rowspan="1" colspan="1">50×</th>
<th rowspan="1" colspan="1">100×</th>
<th rowspan="1" colspan="1">150×</th>
<th rowspan="1" colspan="1">208×</th>
<th rowspan="1" colspan="1">25×</th>
<th rowspan="1" colspan="1">50×</th>
<th rowspan="1" colspan="1">100×</th>
<th rowspan="1" colspan="1">150×</th>
<th rowspan="1" colspan="1">208×</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Size (%)</td>
<td align="char" char="." rowspan="1" colspan="1">3.105</td>
<td align="char" char="." rowspan="1" colspan="1">30.348</td>
<td align="char" char="." rowspan="1" colspan="1">65.739</td>
<td align="char" char="." rowspan="1" colspan="1">67.198</td>
<td align="char" char="." rowspan="1" colspan="1">66.223</td>
<td align="char" char="." rowspan="1" colspan="1">31.132</td>
<td align="char" char="." rowspan="1" colspan="1">44.190</td>
<td align="char" char="." rowspan="1" colspan="1">48.391</td>
<td align="char" char="." rowspan="1" colspan="1">50.284</td>
<td align="char" char="." rowspan="1" colspan="1">51.068</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Aligned (%)</td>
<td align="char" char="." rowspan="1" colspan="1">99.400</td>
<td align="char" char="." rowspan="1" colspan="1">99.663</td>
<td align="char" char="." rowspan="1" colspan="1">98.328</td>
<td align="char" char="." rowspan="1" colspan="1">98.748</td>
<td align="char" char="." rowspan="1" colspan="1">99.318</td>
<td align="char" char="." rowspan="1" colspan="1">99.941</td>
<td align="char" char="." rowspan="1" colspan="1">99.794</td>
<td align="char" char="." rowspan="1" colspan="1">95.966</td>
<td align="char" char="." rowspan="1" colspan="1">90.003</td>
<td align="char" char="." rowspan="1" colspan="1">86.023</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Error rate (%)</td>
<td align="char" char="." rowspan="1" colspan="1">0.329</td>
<td align="char" char="." rowspan="1" colspan="1">0.187</td>
<td align="char" char="." rowspan="1" colspan="1">0.140</td>
<td align="char" char="." rowspan="1" colspan="1">0.159</td>
<td align="char" char="." rowspan="1" colspan="1">0.257</td>
<td align="char" char="." rowspan="1" colspan="1">2.224</td>
<td align="char" char="." rowspan="1" colspan="1">1.396</td>
<td align="char" char="." rowspan="1" colspan="1">0.874</td>
<td align="char" char="." rowspan="1" colspan="1">0.757</td>
<td align="char" char="." rowspan="1" colspan="1">0.6905</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Gen. cov. (%)</td>
<td align="char" char="." rowspan="1" colspan="1">3.886</td>
<td align="char" char="." rowspan="1" colspan="1">45.763</td>
<td align="char" char="." rowspan="1" colspan="1">99.403</td>
<td align="char" char="." rowspan="1" colspan="1">99.999</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">94.638</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
<td align="char" char="." rowspan="1" colspan="1">100.000</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Time (h)</td>
<td align="char" char="." rowspan="1" colspan="1">0.10</td>
<td align="char" char="." rowspan="1" colspan="1">0.32</td>
<td align="char" char="." rowspan="1" colspan="1">3.30</td>
<td align="char" char="." rowspan="1" colspan="1">7.17</td>
<td align="char" char="." rowspan="1" colspan="1">10.40</td>
<td align="char" char="." rowspan="1" colspan="1">0.08</td>
<td align="char" char="." rowspan="1" colspan="1">0.18</td>
<td align="char" char="." rowspan="1" colspan="1">0.47</td>
<td align="char" char="." rowspan="1" colspan="1">0.93</td>
<td align="char" char="." rowspan="1" colspan="1">1.68</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Memory (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">14.165</td>
<td align="char" char="." rowspan="1" colspan="1">14.275</td>
<td align="char" char="." rowspan="1" colspan="1">14.718</td>
<td align="char" char="." rowspan="1" colspan="1">15.415</td>
<td align="char" char="." rowspan="1" colspan="1">16.984</td>
<td align="char" char="." rowspan="1" colspan="1">7.851</td>
<td align="char" char="." rowspan="1" colspan="1">9.020</td>
<td align="char" char="." rowspan="1" colspan="1">9.706</td>
<td align="char" char="." rowspan="1" colspan="1">9.931</td>
<td align="char" char="." rowspan="1" colspan="1">22.00</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Disk (GB)</td>
<td align="char" char="." rowspan="1" colspan="1">0.272</td>
<td align="char" char="." rowspan="1" colspan="1">0.655</td>
<td align="char" char="." rowspan="1" colspan="1">1.416</td>
<td align="char" char="." rowspan="1" colspan="1">2.024</td>
<td align="char" char="." rowspan="1" colspan="1">2.824</td>
<td align="char" char="." rowspan="1" colspan="1">1.232</td>
<td align="char" char="." rowspan="1" colspan="1">2.443</td>
<td align="char" char="." rowspan="1" colspan="1">3.714</td>
<td align="char" char="." rowspan="1" colspan="1">7.114</td>
<td align="char" char="." rowspan="1" colspan="1">16.070</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
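<p>The exact subsampling procedure is not spelled out above, so the sketch below shows one plausible way to draw a random coverage-limited subset under that assumption: shuffle the reads and keep them until the summed read length reaches the target coverage times the genome length.</p>
<preformat>
# One plausible way (an assumption, not necessarily the exact procedure used here)
# to draw a random subset of reads with a target coverage: accumulate shuffled
# reads until total bases reach target_coverage * genome_len.
import random

def subsample_to_coverage(reads, genome_len, target_coverage, seed=0):
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)
    picked, total = [], 0
    for i in order:
        if total >= target_coverage * genome_len:
            break
        picked.append(reads[i])
        total += len(reads[i])
    return picked
</preformat>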
<p>When the coverage is high, the new method retains a larger proportion of the reads than PBcR and is more accurate, whereas when the coverage is low, PBcR retains more of the data and a larger proportion of it can be aligned. However, the error rate remains much lower for our new tool. The reads corrected by PBcR also cover a larger part of the reference when the coverage is low.</p>
</sec>
</sec>
<sec>
<title>5 Conclusions</title>
<p>We have presented a new method for correcting long and highly erroneous sequencing reads. Our method shows that efficient alignment-free methods can be applied to highly erroneous long read data, although the current approach still needs alignments to take the global context of errors into account. Reads corrected by the new method have an error rate less than half of that of reads corrected by previous self-correction methods. Furthermore, the throughput of the new method is 20% higher than that of previous self-correction methods for read sets with coverage of at least 75×.</p>
<p>Recently several algorithms for updating the DBG instead of constructing it from scratch when
<italic>k</italic>
changes have been proposed (
<xref rid="btw321-B4" ref-type="bibr">Boucher
<italic>et al.</italic>
, 2015</xref>
;
<xref rid="btw321-B5" ref-type="bibr">Cazaux
<italic>et al.</italic>
, 2014</xref>
). However, these methods are not directly applicable in our setting because the read set itself also changes when we run LoRDEC* iteratively on the long reads.</p>
<p>Our method works solely on the long reads, whereas many previous methods also require accurate short reads produced by, e.g. Illumina sequencing, which can introduce the sequencing biases of the short reads into the corrected PacBio reads. This could have a very negative effect on sequence quality, especially since Illumina suffers from GC content bias and some context-dependent errors (
<xref rid="btw321-B17" ref-type="bibr">Nakamura
<italic>et al.</italic>
, 2011</xref>
;
<xref rid="btw321-B22" ref-type="bibr">Schirmer
<italic>et al.</italic>
, 2015</xref>
).</p>
<p>As further work, we plan to improve the method to scale up to mammalian-sized genomes. We will investigate a more compact representation of the path labels in the augmented DBG to replace the simple hash tables currently used. The construction of the multiple alignments could also be improved by exploiting partial order alignments (
<xref rid="btw321-B14" ref-type="bibr">Lee
<italic>et al.</italic>
, 2002</xref>
) which have been shown to work well with PacBio reads (
<xref rid="btw321-B7" ref-type="bibr">Chin
<italic>et al.</italic>
, 2013</xref>
).</p>
<p>Another direction of further work is to investigate the applicability of the new method on long reads produced by the Oxford NanoPore MinION platform.
<xref rid="btw321-B13" ref-type="bibr">Laver
<italic>et al.</italic>
(2015)</xref>
have reported an error rate of 38.2% for this platform, and they also observed some GC content bias. Both of these factors make the error correction problem more challenging, and it will therefore be interesting to see a comparison of the methods on this data.</p>
</sec>
<sec>
<title>Funding</title>
<p>This work was supported by the Academy of Finland (grant 267591 to L.S.), ANR Colib’read (grant ANR-12-BS02-0008), IBC (ANR-11-BINF-0002) and Défi MASTODONS to E.R., and EU FP7 project SYSCOL (grant UE7-SYSCOL-258236 to E.U.). </p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="btw321-B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Au</surname>
<given-names>K.F.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>Improving PacBio long read accuracy by short read alignment</article-title>
.
<source>PLoS ONE</source>
,
<volume>7</volume>
,
<fpage>e46679</fpage>
.
<pub-id pub-id-type="pmid">23056399</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Bankevich</surname>
<given-names>A.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing</article-title>
.
<source>J. Comput. Biol</source>
.,
<volume>19</volume>
,
<fpage>455</fpage>
<lpage>477</lpage>
.
<pub-id pub-id-type="pmid">22506599</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Berlin</surname>
<given-names>K.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
)
<article-title>Assembling large genomes with single-molecule sequencing and locality-sensitive hashing</article-title>
.
<source>Nat. Biotechnol</source>
.,
<volume>33</volume>
,
<fpage>623</fpage>
<lpage>630</lpage>
.
<pub-id pub-id-type="pmid">26006009</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B4">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Boucher</surname>
<given-names>C.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
). Variable-order de Bruijn graphs. In:
<italic>Proc. DCC 2015</italic>
, pp. 383–392.</mixed-citation>
</ref>
<ref id="btw321-B5">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Cazaux</surname>
<given-names>B.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
). From indexing data structures to de Bruijn graphs. In:
<italic>Proc. CPM 2014</italic>
, volume 8486 of
<italic>LNCS</italic>
. Springer, pp. 89–99.</mixed-citation>
</ref>
<ref id="btw321-B6">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chaisson</surname>
<given-names>M.J.</given-names>
</name>
,
<name name-style="western">
<surname>Tesler</surname>
<given-names>G.</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory</article-title>
.
<source>BMC Bioinformatics</source>
,
<volume>13</volume>
,
<fpage>238.</fpage>
<pub-id pub-id-type="pmid">22988817</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B7">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chin</surname>
<given-names>C.S.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2013</year>
)
<article-title>Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data</article-title>
.
<source>Nat. Methods</source>
,
<volume>10</volume>
,
<fpage>563</fpage>
<lpage>569</lpage>
.
<pub-id pub-id-type="pmid">23644548</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Drezen</surname>
<given-names>E.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
)
<article-title>GATB: genome assembly & analysis tool box</article-title>
.
<source>Bioinformatics</source>
,
<volume>30</volume>
,
<fpage>2959</fpage>
<lpage>2961</lpage>
.
<pub-id pub-id-type="pmid">24990603</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Hackl</surname>
<given-names>T.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
)
<article-title>proovread: large-scale high accuracy PacBio correction through iterative short read consensus</article-title>
.
<source>Bioinformatics</source>
,
<volume>30</volume>
,
<fpage>3004</fpage>
<lpage>3011</lpage>
.
<pub-id pub-id-type="pmid">25015988</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B10">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Koren</surname>
<given-names>S.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>Hybrid error correction and de novo assembly of single-molecule sequencing reads</article-title>
.
<source>Nat. Biotechnol</source>
.,
<volume>30</volume>
,
<fpage>693</fpage>
<lpage>700</lpage>
.
<pub-id pub-id-type="pmid">22750884</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Koren</surname>
<given-names>S.</given-names>
</name>
,
<name name-style="western">
<surname>Phillippy</surname>
<given-names>A.M.</given-names>
</name>
</person-group>
(
<year>2015</year>
)
<article-title>One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly</article-title>
.
<source>Curr. Opin. Microbiol</source>
.,
<volume>23</volume>
,
<fpage>110</fpage>
<lpage>120</lpage>
.
<pub-id pub-id-type="pmid">25461581</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B12">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Laehnemann</surname>
<given-names>D.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2016</year>
)
<article-title>Denoising DNA deep sequencing data – high-throughput sequencing errors and their correction</article-title>
.
<source>Brief. Bioinf</source>
.,
<volume>17</volume>
,
<fpage>154</fpage>
<lpage>179</lpage>
.</mixed-citation>
</ref>
<ref id="btw321-B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Laver</surname>
<given-names>T.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
)
<article-title>Assessing the performance of the Oxford Nanopore Technologies MinION</article-title>
.
<source>Biomol. Detect. Quant</source>
.,
<volume>3</volume>
,
<fpage>1</fpage>
<lpage>8</lpage>
.</mixed-citation>
</ref>
<ref id="btw321-B14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Lee</surname>
<given-names>C.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2002</year>
)
<article-title>Multiple sequence alignment using partial order graphs</article-title>
.
<source>Bioinformatics</source>
,
<volume>18</volume>
,
<fpage>452</fpage>
<lpage>464</lpage>
.
<pub-id pub-id-type="pmid">11934745</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Madoui</surname>
<given-names>M.A.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
)
<article-title>Genome assembly using Nanopore-guided long and error-free DNA reads</article-title>
.
<source>BMC Genomics</source>
,
<volume>16</volume>
,
<fpage>327</fpage>
.
<pub-id pub-id-type="pmid">25927464</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B16">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Miclotte</surname>
<given-names>G.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
). Jabba: Hybrid error correction for long sequencing reads using maximal exact matches. In:
<italic>Proc. WABI 2015</italic>
, volume 9289 of
<italic>LNBI</italic>
. Springer, pp. 175–188.</mixed-citation>
</ref>
<ref id="btw321-B17">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Nakamura</surname>
<given-names>K.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2011</year>
)
<article-title>Sequence-specific error profile of Illumina sequencers</article-title>
.
<source>Nucleic Acids Res</source>
.,
<volume>39</volume>
,
<fpage>e90</fpage>
.
<pub-id pub-id-type="pmid">21576222</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B18">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ono</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2013</year>
)
<article-title>PBSIM: PacBio reads simulator – toward accurate genome assembly</article-title>
.
<source>Bioinformatics</source>
,
<volume>29</volume>
,
<fpage>119</fpage>
<lpage>121</lpage>
.
<pub-id pub-id-type="pmid">23129296</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B19">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Peng</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2010</year>
). IDBA – a practical iterative de Bruijn graph de novo assembler. In:
<italic>Proc. RECOMB 2010</italic>
, volume 6044 of
<italic>LNBI</italic>
. Springer, pp. 426–440.</mixed-citation>
</ref>
<ref id="btw321-B20">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Salmela</surname>
<given-names>L.</given-names>
</name>
,
<name name-style="western">
<surname>Rivals</surname>
<given-names>E.</given-names>
</name>
</person-group>
(
<year>2014</year>
)
<article-title>LoRDEC: accurate and efficient long read error correction</article-title>
.
<source>Bioinformatics</source>
,
<volume>30</volume>
,
<fpage>3506</fpage>
<lpage>3514</lpage>
.
<pub-id pub-id-type="pmid">25165095</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B21">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Salmela</surname>
<given-names>L.</given-names>
</name>
,
<name name-style="western">
<surname>Schröder</surname>
<given-names>J.</given-names>
</name>
</person-group>
(
<year>2011</year>
)
<article-title>Correcting errors in short reads by multiple alignments</article-title>
.
<source>Bioinformatics</source>
,
<volume>27</volume>
,
<fpage>1455</fpage>
<lpage>1461</lpage>
.
<pub-id pub-id-type="pmid">21471014</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B22">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Schirmer</surname>
<given-names>M.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
)
<article-title>Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform</article-title>
.
<source>Nucleic Acids Res</source>
.,
<volume>43</volume>
,
<fpage>e37</fpage>
.
<pub-id pub-id-type="pmid">25586220</pub-id>
</mixed-citation>
</ref>
<ref id="btw321-B23">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Yang</surname>
<given-names>X.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2013</year>
)
<article-title>A survey of error-correction methods for next-generation sequencing</article-title>
.
<source>Brief. Bioinf</source>
.,
<volume>14</volume>
,
<fpage>56</fpage>
<lpage>66</lpage>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B15 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B15 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5351550
   |texte=   Accurate self-correction of errors in long reads using de Bruijn graphs
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:27273673" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 
