MERS exploration server


Internal identifier: 000B46 (Pmc/Corpus); previous: 000B459; next: 000B470


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly</title>
<author>
<name sortKey="Sameith, Katrin" sort="Sameith, Katrin" uniqKey="Sameith K" first="Katrin" last="Sameith">Katrin Sameith</name>
<affiliation>
<nlm:aff id="AFF1">Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">Max Planck Institute for the Physics of Complex Systems, Dresden, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Roscito, Juliana G" sort="Roscito, Juliana G" uniqKey="Roscito J" first="Juliana G" last="Roscito">Juliana G. Roscito</name>
<affiliation>
<nlm:aff id="AFF1">Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">Max Planck Institute for the Physics of Complex Systems, Dresden, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hiller, Michael" sort="Hiller, Michael" uniqKey="Hiller M" first="Michael" last="Hiller">Michael Hiller</name>
<affiliation>
<nlm:aff id="AFF1">Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">Max Planck Institute for the Physics of Complex Systems, Dresden, Germany</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26868358</idno>
<idno type="pmc">5221426</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5221426</idno>
<idno type="RBID">PMC:5221426</idno>
<idno type="doi">10.1093/bib/bbw003</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000B46</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B46</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly</title>
<author>
<name sortKey="Sameith, Katrin" sort="Sameith, Katrin" uniqKey="Sameith K" first="Katrin" last="Sameith">Katrin Sameith</name>
<affiliation>
<nlm:aff id="AFF1">Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">Max Planck Institute for the Physics of Complex Systems, Dresden, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Roscito, Juliana G" sort="Roscito, Juliana G" uniqKey="Roscito J" first="Juliana G" last="Roscito">Juliana G. Roscito</name>
<affiliation>
<nlm:aff id="AFF1">Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">Max Planck Institute for the Physics of Complex Systems, Dresden, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hiller, Michael" sort="Hiller, Michael" uniqKey="Hiller M" first="Michael" last="Hiller">Michael Hiller</name>
<affiliation>
<nlm:aff id="AFF1">Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">Max Planck Institute for the Physics of Complex Systems, Dresden, Germany</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Briefings in Bioinformatics</title>
<idno type="ISSN">1467-5463</idno>
<idno type="eISSN">1477-4054</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<p>Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous
<italic>k</italic>
-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of
<italic>k</italic>
-mer-based correction with an increasing
<italic>k</italic>
-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large
<italic>k</italic>
-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (
<ext-link ext-link-type="uri" xlink:href="https://github.com/hillerlab/IterativeErrorCorrection/">https://github.com/hillerlab/IterativeErrorCorrection/</ext-link>
) that implements iterative error correction by using modules from SGA.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Laehnemann, D" uniqKey="Laehnemann D">D Laehnemann</name>
</author>
<author>
<name sortKey="Borkhardt, A" uniqKey="Borkhardt A">A Borkhardt</name>
</author>
<author>
<name sortKey="Mchardy, Ac" uniqKey="Mchardy A">AC McHardy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heo, Y" uniqKey="Heo Y">Y Heo</name>
</author>
<author>
<name sortKey="Wu, Xl" uniqKey="Wu X">XL Wu</name>
</author>
<author>
<name sortKey="Chen, D" uniqKey="Chen D">D Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kao, Wc" uniqKey="Kao W">WC Kao</name>
</author>
<author>
<name sortKey="Chan, Ah" uniqKey="Chan A">AH Chan</name>
</author>
<author>
<name sortKey="Song, Ys" uniqKey="Song Y">YS Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaisson, Mj" uniqKey="Chaisson M">MJ Chaisson</name>
</author>
<author>
<name sortKey="Brinza, D" uniqKey="Brinza D">D Brinza</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schulz, Mh" uniqKey="Schulz M">MH Schulz</name>
</author>
<author>
<name sortKey="Weese, D" uniqKey="Weese D">D Weese</name>
</author>
<author>
<name sortKey="Holtgrewe, M" uniqKey="Holtgrewe M">M Holtgrewe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
<author>
<name sortKey="Scott, E" uniqKey="Scott E">E Scott</name>
</author>
<author>
<name sortKey="Kakaradov, B" uniqKey="Kakaradov B">B Kakaradov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Fazayeli, F" uniqKey="Fazayeli F">F Fazayeli</name>
</author>
<author>
<name sortKey="Ilie, S" uniqKey="Ilie S">S Ilie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Allam, A" uniqKey="Allam A">A Allam</name>
</author>
<author>
<name sortKey="Kalnis, P" uniqKey="Kalnis P">P Kalnis</name>
</author>
<author>
<name sortKey="Solovyev, V" uniqKey="Solovyev V">V Solovyev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, L" uniqKey="Song L">L Song</name>
</author>
<author>
<name sortKey="Florea, L" uniqKey="Florea L">L Florea</name>
</author>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Z" uniqKey="Zhao Z">Z Zhao</name>
</author>
<author>
<name sortKey="Yin, J" uniqKey="Yin J">J Yin</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Molnar, M" uniqKey="Molnar M">M Molnar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X Yang</name>
</author>
<author>
<name sortKey="Dorman, Ks" uniqKey="Dorman K">KS Dorman</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schroder, H" uniqKey="Schroder H">H Schröder</name>
</author>
<author>
<name sortKey="Puglisi, Sj" uniqKey="Puglisi S">SJ Puglisi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X Yang</name>
</author>
<author>
<name sortKey="Chockalingam, Sp" uniqKey="Chockalingam S">SP Chockalingam</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Molnar, M" uniqKey="Molnar M">M Molnar</name>
</author>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wan, Q H" uniqKey="Wan Q">Q-H Wan</name>
</author>
<author>
<name sortKey="Pan, S K" uniqKey="Pan S">S-K Pan</name>
</author>
<author>
<name sortKey="Hu, L" uniqKey="Hu L">L Hu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
<author>
<name sortKey="Yuan, J" uniqKey="Yuan J">J Yuan</name>
</author>
<author>
<name sortKey="Shi, Y" uniqKey="Shi Y">Y Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, W" uniqKey="Huang W">W Huang</name>
</author>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
<author>
<name sortKey="Myers, Jr" uniqKey="Myers J">JR Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosenbloom, Kr" uniqKey="Rosenbloom K">KR Rosenbloom</name>
</author>
<author>
<name sortKey="Armstrong, J" uniqKey="Armstrong J">J Armstrong</name>
</author>
<author>
<name sortKey="Barber, Gp" uniqKey="Barber G">GP Barber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simao, Fa" uniqKey="Simao F">FA Simão</name>
</author>
<author>
<name sortKey="Waterhouse, Rm" uniqKey="Waterhouse R">RM Waterhouse</name>
</author>
<author>
<name sortKey="Ioannidis, P" uniqKey="Ioannidis P">P Ioannidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Treangen, Tj" uniqKey="Treangen T">TJ Treangen</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Brief Bioinform</journal-id>
<journal-id journal-id-type="iso-abbrev">Brief. Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bib</journal-id>
<journal-title-group>
<journal-title>Briefings in Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1467-5463</issn>
<issn pub-type="epub">1477-4054</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26868358</article-id>
<article-id pub-id-type="pmc">5221426</article-id>
<article-id pub-id-type="doi">10.1093/bib/bbw003</article-id>
<article-id pub-id-type="publisher-id">bbw003</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Paper</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Sameith</surname>
<given-names>Katrin</given-names>
</name>
<xref ref-type="aff" rid="AFF1">1</xref>
<xref ref-type="aff" rid="AFF2">2</xref>
<xref ref-type="author-notes" rid="FM1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Roscito</surname>
<given-names>Juliana G</given-names>
</name>
<xref ref-type="aff" rid="AFF1">1</xref>
<xref ref-type="aff" rid="AFF2">2</xref>
<xref ref-type="author-notes" rid="FM1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hiller</surname>
<given-names>Michael</given-names>
</name>
<xref ref-type="aff" rid="AFF1">1</xref>
<xref ref-type="aff" rid="AFF2">2</xref>
<xref ref-type="corresp" rid="COR1"></xref>
<pmc-comment>hiller@mpi-cbg.de</pmc-comment>
</contrib>
</contrib-group>
<aff id="AFF1">
<label>1</label>
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany</aff>
<aff id="AFF2">
<label>2</label>
Max Planck Institute for the Physics of Complex Systems, Dresden, Germany</aff>
<author-notes>
<corresp id="COR1">Corresponding author. Michael Hiller. Max Planck Institute of Molecular Cell Biology and Genetics &amp; Max Planck Institute for the Physics of Complex Systems, 01307 Dresden, Germany. E-mail:
<email>hiller@mpi-cbg.de</email>
</corresp>
<fn id="FM1">
<p>Katrin Sameith and Juliana Roscito contributed equally to this work.</p>
</fn>
</author-notes>
<pub-date pub-type="ppub">
<month>1</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub" iso-8601-date="2016-02-10">
<day>10</day>
<month>2</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>10</day>
<month>2</month>
<year>2016</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>18</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>8</lpage>
<history>
<date date-type="received">
<day>22</day>
<month>10</month>
<year>2015</year>
</date>
<date date-type="rev-recd">
<day>02</day>
<month>1</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author 2016. Published by Oxford University Press.</copyright-statement>
<copyright-year>2016</copyright-year>
<license license-type="cc-by-nc" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<self-uri xlink:href="bbw003.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<p>Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous
<italic>k</italic>
-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of
<italic>k</italic>
-mer-based correction with an increasing
<italic>k</italic>
-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large
<italic>k</italic>
-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (
<ext-link ext-link-type="uri" xlink:href="https://github.com/hillerlab/IterativeErrorCorrection/">https://github.com/hillerlab/IterativeErrorCorrection/</ext-link>
) that implements iterative error correction by using modules from SGA.</p>
</abstract>
<kwd-group>
<kwd>sequencing errors</kwd>
<kwd>long Illumina reads</kwd>
<kwd>error correction</kwd>
<kwd>genome assembly</kwd>
</kwd-group>
<funding-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">Max Planck Society</named-content>
</funding-source>
<award-id>2012/01319-8</award-id>
</award-group>
</funding-group>
<counts>
<page-count count="8"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro">
<title>Introduction</title>
<p>The use of high-throughput sequencing techniques has increased greatly over the past decade. Different sequencing platforms are available, such as Illumina, PacBio, 454, IonTorrent and Nanopore; Illumina is the most widely used to date. Although Illumina was limited to relatively short read lengths in the past, a single run on an Illumina MiSeq machine can now produce 15 gigabases (Gb) of paired-end reads as long as 300 bp. The latest Illumina HiSeq 2500 machine can even produce up to 300 Gb of paired-end 250 bp reads. This high throughput of long reads is attractive for genome assembly. Although Illumina data already contain relatively few errors at a rate of &lt;1% [
<xref rid="bbw003-B1" ref-type="bibr">1</xref>
], the probability that reads are completely error-free is low, especially for longer 250 or 300 bp reads.</p>
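<p>As a rough illustration of why long reads are rarely error-free, the sketch below computes the probability that a read contains no error at all. It assumes independent errors at a uniform per-base rate, which is a simplification; the function name and the 1% rate, used here as an upper bound, are illustrative choices rather than values taken from the analysis.</p>
<preformat>
```python
def error_free_probability(read_length: int, per_base_error_rate: float) -> float:
    """Probability that a read has no error, assuming independent, uniform errors."""
    return (1.0 - per_base_error_rate) ** read_length


if __name__ == "__main__":
    # 1% is an assumed upper bound on the per-base error rate.
    for length in (100, 250, 300):
        print(length, round(error_free_probability(length, 0.01), 3))
```
</preformat>
<p>Under these assumptions, only about 8% of 250 bp reads and about 5% of 300 bp reads would be expected to be completely error-free.</p>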
<p>For genome assembly, it is desirable to use reads that are as accurate as possible. Therefore, correction of sequencing errors is an essential preprocessing step. Many tools for error correction of short read sequencing data exist: BFC [
<xref rid="bbw003-B2" ref-type="bibr">2</xref>
], BLESS [
<xref rid="bbw003-B3" ref-type="bibr">3</xref>
], Coral [
<xref rid="bbw003-B4" ref-type="bibr">4</xref>
], ECHO [
<xref rid="bbw003-B5" ref-type="bibr">5</xref>
], EULER [
<xref rid="bbw003-B6" ref-type="bibr">6</xref>
,
<xref rid="bbw003-B7" ref-type="bibr">7</xref>
], Fiona [
<xref rid="bbw003-B8" ref-type="bibr">8</xref>
], Hammer [
<xref rid="bbw003-B9" ref-type="bibr">9</xref>
], HiTEC [
<xref rid="bbw003-B10" ref-type="bibr">10</xref>
], Karect [
<xref rid="bbw003-B11" ref-type="bibr">11</xref>
], Lighter [
<xref rid="bbw003-B12" ref-type="bibr">12</xref>
], Musket [
<xref rid="bbw003-B13" ref-type="bibr">13</xref>
], MyHybrid [
<xref rid="bbw003-B14" ref-type="bibr">14</xref>
], Quake [
<xref rid="bbw003-B15" ref-type="bibr">15</xref>
], RACER [
<xref rid="bbw003-B16" ref-type="bibr">16</xref>
], Reptile [
<xref rid="bbw003-B17" ref-type="bibr">17</xref>
], SGA [
<xref rid="bbw003-B18" ref-type="bibr">18</xref>
] and SHREC [
<xref rid="bbw003-B19" ref-type="bibr">19</xref>
], among others. All rely on errors being infrequent and sequencing coverage being sufficiently high so that errors can be corrected using other reads covering the same genomic locus. Most of these tools can be divided into two categories according to how they approach the correction of sequencing reads:
<italic>k</italic>
-mer-based correction, which deals primarily with base substitutions, and overlap-based correction, which can also correct insertions and deletions. A detailed overview of each approach and the existing tools is given by Laehnemann
<italic>et al</italic>
. [
<xref rid="bbw003-B1" ref-type="bibr">1</xref>
].</p>
<p>The idea behind
<italic>k</italic>
-mer-based correction is that a sequencing error will result in a
<italic>k</italic>
bp long substring of the read (
<italic>k</italic>
-mer) that does not occur in the genome and thus has a low count in the input data. Such infrequent
<italic>k</italic>
-mers can be detected, and the erroneous base can be corrected if substituting it to another base results in a
<italic>k</italic>
-mer that occurs more frequently (solid
<italic>k</italic>
-mer).
<italic>K</italic>
-mer-based correction depends on a parameter choice for the
<italic>k</italic>
-mer size, as well as for the count threshold of infrequent and solid
<italic>k</italic>
-mers. The idea behind overlap-based correction is to build a multiple sequence alignment of similar reads that probably come from the same genomic locus. Then, sequencing errors are detected as rare differences in alignment columns and are corrected with the alignment column consensus. Overlap-based correction depends on parameters for the minimum read similarity, count thresholds for rare differences and the minimum number of reads supporting the consensus base.</p>
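<p>The k-mer-based idea described above can be sketched in a few lines. This is a minimal toy version, not the SGA implementation: the function names, the solid threshold of 3 and the greedy left-to-right substitution are assumptions made for illustration, and only the forward strand is considered.</p>
<preformat>
```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_read(read, counts, k, solid_threshold=3):
    """Replace single bases so that infrequent k-mers become solid k-mers."""
    bases = list(read)
    for start in range(len(bases) - k + 1):
        kmer = "".join(bases[start:start + k])
        if counts[kmer] >= solid_threshold:
            continue  # already solid, nothing to do
        best = None  # (count, offset, base) of the best solid substitute
        for offset in range(k):
            for base in "ACGT":
                if base == kmer[offset]:
                    continue
                candidate = kmer[:offset] + base + kmer[offset + 1:]
                count = counts[candidate]
                if count >= solid_threshold and (best is None or count > best[0]):
                    best = (count, offset, base)
        if best is not None:
            bases[start + best[1]] = best[2]
    return "".join(bases)
```
</preformat>
<p>For example, with five copies of a read and one copy carrying a single substitution, counting 5-mers and calling correct_read on the erroneous copy restores the original sequence, because every 5-mer that spans the error is infrequent and has exactly one solid neighbor.</p>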
<p>Previous studies have addressed the performance of some of the error correction tools [
<xref rid="bbw003-B20" ref-type="bibr">20</xref>
,
<xref rid="bbw003-B21" ref-type="bibr">21</xref>
]. Most tools perform well on the tested data sets, and almost all reads can be corrected for smaller genomes [
<xref rid="bbw003-B20" ref-type="bibr">20</xref>
]. However, for complex repeat-rich genomes, such as the human genome, a substantial proportion of the reads still have errors after correction. For example, errors remain in 15–20% of human 100 bp HiSeq reads after correcting with the top-performing tool (Table 2 in [
<xref rid="bbw003-B20" ref-type="bibr">20</xref>
]). This performance is even worse for longer reads. Errors remain in more than half of 250 bp MiSeq reads from the small
<italic>Escherichia coli</italic>
genome (Table 3 in [
<xref rid="bbw003-B20" ref-type="bibr">20</xref>
]).</p>
<p>There are several possible reasons why sequencing errors may remain uncorrected. First, reads with many errors are more difficult to correct because they are not similar to other reads from the same locus. However, such reads are easy to discard because of the many infrequent
<italic>k</italic>
-mers present in them. Second, reads coming from a genomic locus with a low sequencing coverage might not be corrected because of the lack of solid
<italic>k</italic>
-mers in other reads from the same locus. Third, sequencing errors might remain undetected if the error results in a
<italic>k</italic>
-mer found elsewhere in the genome, which frequently occurs in repeat regions. While the first two cases are harder to address computationally, correction of errors in repeats can be improved for longer 250 or 300 bp reads.</p>
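<p>The third case can be made concrete with a toy example. The sketch below uses a hypothetical genome with two repeat copies that differ at a single base, and a read from the first copy whose sequencing error happens to reproduce the second copy. Presence of a k-mer in the genome is used as a stand-in for a k-mer being frequent in the reads; all sequences and names are invented for illustration.</p>
<preformat>
```python
def absent_kmers(read, genome, k):
    """Number of k-mers in the read that do not occur anywhere in the genome."""
    return sum(
        1 for i in range(len(read) - k + 1) if read[i:i + k] not in genome
    )


# Hypothetical toy genome: two repeat copies that differ at a single base.
FLANK_A, REPEAT_A = "TTGACCAGTA", "ACGTACGGTCAT"
SPACER, REPEAT_B, FLANK_B = "GGATCCTTAG", "ACGTACGCTCAT", "CAGGTTAACC"
GENOME = FLANK_A + REPEAT_A + SPACER + REPEAT_B + FLANK_B

# Read sampled from copy A (with 4 bp of unique flank), carrying an error
# (G to C) that makes its repeat portion identical to copy B.
ERRONEOUS_READ = "AGTA" + REPEAT_B

if __name__ == "__main__":
    for k in (5, 12):
        print(k, absent_kmers(ERRONEOUS_READ, GENOME, k))
```
</preformat>
<p>With k = 5, every k-mer of the erroneous read exists somewhere in the toy genome, so the error is invisible. With k = 12, k-mers that span both the unique flank and the erroneous base no longer occur in the genome, which is what makes the error detectable with a larger k-mer size.</p>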
<p>Here, we show that many reads with uncorrected sequencing errors after standard correction overlap repeats. In these reads, short erroneous
<italic>k</italic>
-mers occur identically in another repeat copy, and are thus mistakenly considered correct. To improve error correction of long Illumina reads, we used modules from the String Graph Assembler and developed an iterative error correction pipeline that runs multiple rounds of
<italic>k</italic>
-mer-based correction with an increasing
<italic>k</italic>
-mer size, followed by a final round of overlap-based correction. We show that this iterative strategy effectively corrects errors in repeats and reduces the total amount of erroneous reads. We further show that this higher read accuracy translates into two to three times longer contig assemblies.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Simulated data sets</title>
<p>To investigate why sequencing errors remain after standard error correction and to test if iterative error correction improves read accuracy, we first simulated reads from known genomes. Simulated data have the advantage that we know the true sequence of every sampled read before errors were introduced. This allows us to accurately measure whether a read is error-free after each round of error correction.</p>
<p>Our analyses are based on the human chromosome 11 (135 Mb),
<italic>Anolis carolinensis</italic>
chromosome 4 (156 Mb) and chicken chromosome 14 (15 Mb). We first filled assembly gaps and ambiguous bases (N’s) with random sequences. We then created a second haplotype by introducing heterozygosity at the known rate for each species [single nucleotide polymorphism (SNP) rate 0.001 for human, 0.003 for lizard and 0.0006 for chicken, according to [
<xref rid="bbw003-B22" ref-type="bibr">22</xref>
]; default indel rate 0.0001; no structural variation] using pirs [
<xref rid="bbw003-B23" ref-type="bibr">23</xref>
]. Last, we used ART [
<xref rid="bbw003-B24" ref-type="bibr">24</xref>
] and a MiSeq 2 × 300 bp specific error profile to simulate reads from the three chromosomes at 30X coverage each (parameters: 15X coverage for each haplotype, 550 ± 55 bp fragment size). All simulated data sets and the error profile are available at
<ext-link ext-link-type="uri" xlink:href="http://bds.mpi-cbg.de/hillerlab/IterativeErrorCorrection/">http://bds.mpi-cbg.de/hillerlab/IterativeErrorCorrection/</ext-link>
.</p>
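<p>The coverage settings translate into read counts by simple arithmetic. The helper below is an illustrative calculation only (the read simulator derives its own counts from the coverage parameter); the function name is hypothetical and the chromosome length is the rounded 135 Mb figure quoted above.</p>
<preformat>
```python
def read_pairs_for_coverage(genome_length, coverage, read_length):
    """Number of read pairs whose total bases give the requested coverage."""
    return round(coverage * genome_length / (2 * read_length))


if __name__ == "__main__":
    # Human chromosome 11 (about 135 Mb), 15X per haplotype, 2 x 300 bp reads.
    print(read_pairs_for_coverage(135_000_000, 15, 300))
```
</preformat>
<p>Doubling the result gives the approximate number of read pairs for both haplotypes combined, that is, for the full 30X data set.</p>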
</sec>
<sec>
<title>Real data</title>
<p>To test our error correction strategy on real data sets, we downloaded 2 × 250 bp MiSeq reads for the rice strains IR64, DJ123 and Nipponbare (SRX180591, SRX186093 and SRX179262, respectively;
<ext-link ext-link-type="uri" xlink:href="http://schatzlab.cshl.edu/data/rice/">http://schatzlab.cshl.edu/data/rice/</ext-link>
). We also downloaded a human 2 × 100 bp HiSeq read data set (Library 1 from
<ext-link ext-link-type="uri" xlink:href="http://gage.cbcb.umd.edu/data/">http://gage.cbcb.umd.edu/data/</ext-link>
). To be consistent with our simulations, we down-sampled all data sets to 30X coverage.</p>
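<p>The text does not state how the down-sampling was performed. One common approach, sketched here under that caveat, is to keep each read pair independently with a probability equal to the ratio of target to current coverage. The function names and the fixed random seed are assumptions for illustration.</p>
<preformat>
```python
import random


def downsample_fraction(total_bases, genome_size, target_coverage=30.0):
    """Fraction of reads to keep so that the expected coverage hits the target."""
    current_coverage = total_bases / genome_size
    return min(1.0, target_coverage / current_coverage)


def downsample(read_pairs, fraction, seed=0):
    """Keep each read pair independently with the given probability."""
    rng = random.Random(seed)
    return [pair for pair in read_pairs if fraction > rng.random()]
```
</preformat>
<p>Sampling whole pairs rather than single reads keeps the paired-end information intact, and a data set that is already at or below the target coverage is returned unchanged because the fraction is capped at 1.</p>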
</sec>
<sec>
<title>Overlap of reads with repeats</title>
<p>We obtained coordinates of interspersed and tandem repeats for all three chromosomes from the UCSC genome browser ‘rmsk’ tables [
<xref rid="bbw003-B25" ref-type="bibr">25</xref>
]. All repeats that are at most 10% diverged from the repeat consensus were downloaded. This set covers 9.5% of human chromosome 11, 2.3% of lizard chromosome 4 and 1% of chicken chromosome 14. For completeness, we also analyzed overlap with all repeats regardless of divergence rate (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 1</xref>
).</p>
</sec>
<sec>
<title>Error correction strategy</title>
<p>In our iterative error correction pipeline, base substitutions are corrected in subsequent rounds of
<italic>k</italic>
-mer-based correction with an increasing
<italic>k</italic>
-mer size. To also correct remaining small insertions and deletions, we run a final round of overlap-based correction.</p>
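<p>The control flow of this strategy can be sketched as a small driver function. The two callables stand in for the actual correction steps, which are not reproduced here; the function names and the k-mer sizes in the demonstration are placeholders, not the values used in this work.</p>
<preformat>
```python
def iterative_correction(reads, kmer_sizes, kmer_round, overlap_round):
    """Run k-mer rounds with increasing k, then one overlap-based round.

    kmer_round(reads, k) and overlap_round(reads) are caller-supplied
    callables that return the corrected reads.
    """
    for k in sorted(kmer_sizes):
        reads = kmer_round(reads, k)
    return overlap_round(reads)


if __name__ == "__main__":
    log = []

    def fake_kmer_round(reads, k):
        log.append("kmer k=" + str(k))
        return reads

    def fake_overlap_round(reads):
        log.append("overlap")
        return reads

    # The k-mer sizes below are placeholders, not the values used in the paper.
    iterative_correction(["ACGT"], [75, 40, 125], fake_kmer_round, fake_overlap_round)
    print(log)
```
</preformat>
<p>Each round receives the output of the previous one, so errors fixed with a small k-mer size no longer produce infrequent k-mers when a larger size is applied.</p>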
<p>We tested our strategy using the String Graph Assembler [
<xref rid="bbw003-B18" ref-type="bibr">18</xref>
], because (i) the SGA code is open source; (ii) it implements both the
<italic>k</italic>
-mer-based and overlap-based correction approaches; (iii) SGA is modular, allowing us to run only selected steps in the genome assembly workflow; (iv) SGA works with reads of different lengths and uses efficient data structures to handle large amounts of data [
<xref rid="bbw003-B26" ref-type="bibr">26</xref>
]; and, most importantly, (v) it is one of the best correction methods for complex repeat-rich genomes [
<xref rid="bbw003-B21" ref-type="bibr">21</xref>
].</p>
<p>The SGA error correction modules were not designed to work with large
<italic>k</italic>
-mer sizes, where at certain loci only a few reads might overlap by
<italic>k</italic>
bp or more. To avoid mis-corrections in these regions, for example, by mis-correcting one infrequent
<italic>k</italic>
-mer to a
<italic>k</italic>
-mer that has an only slightly higher frequency, we added the following parameters to SGA:</p>
<p>
<italic>K</italic>
-mer-based correction:
<list list-type="simple">
<list-item>
<p>--count-offset=N (default 1): When correcting a
<italic>k</italic>
-mer, require the count of the new
<italic>k</italic>
-mer to be at least N higher than the count of the old
<italic>k</italic>
-mer.</p>
</list-item>
<list-item>
<p>Overlap-based correction:</p>
</list-item>
<list-item>
<p>--base-threshold=N (default 2): Attempt to correct bases in a read that are seen fewer than N times in a specific column of the multiple sequence alignment.</p>
</list-item>
<list-item>
<p>--min-count-max-base=N (default 4): When correcting a base, require the count of the new consensus base to be at least N.</p>
</list-item>
</list>
</p>
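<p>The decision rules behind these three parameters can be written down as two small predicates. This is a sketch of the rules as described in the list above, not SGA source code; the function names are hypothetical, and the alignment column is assumed to include the base of the read being corrected.</p>
<preformat>
```python
from collections import Counter


def accept_kmer_correction(old_count, new_count, count_offset=1):
    """count-offset rule: the new k-mer must be at least N more frequent."""
    return new_count >= old_count + count_offset


def correct_column(read_base, column_bases, base_threshold=2, min_count_max_base=4):
    """Return the base to keep for one column of the multiple sequence alignment."""
    counts = Counter(column_bases)
    if counts[read_base] >= base_threshold:
        return read_base  # base-threshold rule: seen often enough, keep it
    consensus, consensus_count = counts.most_common(1)[0]
    if consensus_count >= min_count_max_base:
        return consensus  # min-count-max-base rule: consensus is well supported
    return read_base
```
</preformat>
<p>With a count offset of 2, for instance, a k-mer seen three times is not replaced by one seen four times, which guards against mis-corrections at loci where only a few reads overlap by k bp or more.</p>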
<p>We used these parameters: --count-offset 2, --min-overlap 40, --error-rate 0.01 and defaults otherwise.</p>
</sec>
<sec>
<title>Measuring read accuracy for the simulated data sets</title>
<p>Because we aim to maximize the number of error-free reads, we distinguish between ‘correct’ and ‘erroneous’ reads. A read is considered correct if its sequence is identical to the genomic locus of the haplotype from which it was sampled. In contrast, a read is considered erroneous if it has at least one mismatch or insertion/deletion. We also consider a read erroneous if its sequence is identical to the alternative haplotype, which can occur when a SNP is mistaken for an error and corrected to the other allele. Our approach is thus more stringent than considering a read correct if it is found identically anywhere in the diploid genome. In addition to counting how many reads are correct, we determined (i) the total number of sequencing errors by aligning the sequence of a read to the sequence of the correct read, and (ii) the percentage of the genome covered by ≥ 10/20/30 correct reads at the genomic coordinates from which the reads were sampled.</p>
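<p>The classification described above can be sketched as follows. The function names and label strings are hypothetical; the two haplotype sequences are assumed to be the true sequences at the locus from which the read was simulated.</p>
<preformat>
```python
def classify_read(read, sampled_haplotype_seq, other_haplotype_seq):
    """Label a corrected read using the locus it was simulated from."""
    if read == sampled_haplotype_seq:
        return "correct"
    if read == other_haplotype_seq:
        return "erroneous (matches the alternative haplotype)"
    return "erroneous"


def fraction_correct(labels):
    """Share of reads labelled correct."""
    return sum(1 for label in labels if label == "correct") / len(labels)
```
</preformat>
<p>Checking the sampled haplotype first means that a read from a locus without heterozygous sites, where both haplotype sequences are identical, is still counted as correct.</p>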
</sec>
<sec>
<title>Measuring read accuracy for the real data sets</title>
<p>We determined the number of error-free reads by counting the number of reads that map exactly to the reference genome. Specifically, we ran bowtie2 [
<xref rid="bbw003-B27" ref-type="bibr">27</xref>
] with default parameters and counted the number of occurrences of the tag ‘MD:Z:’ followed by the exact read length in the resulting SAM files. It should be noted that the rice reference genome has assembly gaps, ambiguous bases (N’s) and only one haplotype, which implies that not all reads can map exactly, even if they are error-free.</p>
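<p>A minimal sketch of this counting step is shown below. It assumes standard tab-separated SAM records in which optional tags start at the twelfth column, and it treats a read as mapping exactly when its MD tag consists only of the read length. The function name is hypothetical, and this is not the script used in the study.</p>
<preformat><![CDATA[
```python
def count_exact_matches(sam_lines):
    """Count reads whose MD:Z: tag equals the full read length.

    Returns a tuple (exact, total), where total is the number of
    alignment records (header lines starting with '@' are skipped).
    """
    exact = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        total += 1
        read_length = len(fields[9])  # column 10 holds the read sequence
        # Optional tags such as MD:Z: follow the 11 mandatory columns.
        if "MD:Z:%d" % read_length in fields[11:]:
            exact += 1
    return exact, total
```
]]></preformat>
<p>With a file on disk, the function could be called as count_exact_matches(open("reads.sam")), where the file name is again only an example.</p>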
</sec>
<sec>
<title>Genome assembly</title>
<p>We used SGA to assemble error-corrected reads into contigs. In total, six contig sets were separately assembled for (i) reads after the single best
<italic>k</italic>
-mer-based correction, and (ii) reads after correction with our iterative strategy. Potentially erroneous reads were filtered out with ‘sga filter’ (parameters: --kmer-threshold 2, --homopolymer-check, --low-complexity-check; defaults otherwise), and an overlap-based string graph was generated with ‘sga overlap’ (parameters: --min-overlap 75; defaults otherwise). Contigs were assembled with ‘sga assemble’, and the minimum overlap between sequences in the string graph was optimized for each contig set (parameters: --min-overlap ranging from 75 to 200, --resolve-small 10; defaults otherwise).</p>
<p>We assessed assembly quality by testing how many of the assembled ≥400 bp contigs map back to a single continuous genomic region by using Blat with the parameters -tileSize=18 -minMatch=4 -maxIntron=10. Then, we counted every contig where at least 98% of its sequence matches with at least 98% identity to the genome and where the alignment span in the genome is at most 102% of the contig length. Manual inspection of several contigs that did not fulfill these criteria showed that these are short, &lt;1000 bp sequences and match better to the other chromosome haplotype, which was not used for Blat. To assess the coverage of complete genes, we used BUSCO [
<xref rid="bbw003-B28" ref-type="bibr">28</xref>
] in genome mode with the vertebrate gene set.</p>
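<p>The three contig criteria described above can be written as a small filter. The sketch below is one possible reading of these criteria, not the authors' script: it assumes that the aligned part of a contig is the sum of matching and mismatching bases reported for its best Blat hit, and that identity is the fraction of matches within that aligned part. All names are hypothetical.</p>
<preformat><![CDATA[
```python
def maps_to_single_region(contig_length, matches, mismatches,
                          target_start, target_end,
                          min_aligned=0.98, min_identity=0.98, max_span=1.02):
    """Check whether one alignment satisfies the three criteria.

    contig_length: length of the contig in bp
    matches, mismatches: aligned bases that match / do not match the genome
    target_start, target_end: alignment coordinates in the genome
    """
    aligned = matches + mismatches
    if aligned == 0:
        return False
    # At least 98% of the contig sequence is part of the alignment.
    enough_aligned = aligned >= min_aligned * contig_length
    # At least 98% of the aligned bases are identical to the genome.
    enough_identity = matches >= min_identity * aligned
    # The alignment spans at most 102% of the contig length in the genome.
    compact_span = (target_end - target_start) <= max_span * contig_length
    return enough_aligned and enough_identity and compact_span
```
]]></preformat>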
</sec>
<sec>
<title>Iterative error correction using other methods</title>
<p>For RACER, we set the genome size to the size of human chromosome 11 (135006516 bp). Lighter was run with -K kmer_length genome_size (-K 32 135006516). For BFC and Musket, we specified the
<italic>k</italic>
-mer size with the -k parameter. All other parameters were the default values.</p>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<sec>
<title>Most uncorrected 300 bp reads overlap repeats</title>
<p>We started by using simulated reads to investigate why error correction does not correct all errors. Before any correction, 87% of all reads were erroneous. After
<italic>k</italic>
 = 40 correction, 12.9, 9.6 and 5% of the human, lizard and chicken reads that became correct overlap repeats, whereas 64.8, 24.2 and 10.5% of the reads that still contained errors overlap repeats, which is 2- to 5-fold higher.</p>
<p>These numbers show that errors in repeat regions are particularly problematic and often remain uncorrected. Because repeats have many similar copies in the genome, a sequencing error in a repeat-overlapping read is more likely to produce an erroneous
<italic>k</italic>
-mer that occurs identically in another repeat copy.
<italic>K</italic>
-mer correction with a small
<italic>k</italic>
will fail to detect this erroneous
<italic>k</italic>
-mer as infrequent (not solid) because it is found in reads that come from the other repeat copy. Consequently, the error will not be corrected.
<xref ref-type="fig" rid="bbw003-F1">Figure 1</xref>
shows an example from our simulated data set.</p>
<fig id="bbw003-F1" orientation="portrait" position="float">
<label>Figure 1</label>
<caption>
<p>Example of a sequencing error in a repeat region that remains undetected during correction with a small
<italic>k</italic>
-mer. Forty-mers containing a sequencing error (arrow) overlap a SINE repeat and are found identically elsewhere on human chromosome 11. Consequently, this
<italic>k</italic>
-mer is considered solid, and the error is not corrected in the corresponding read. However, increasing
<italic>k</italic>
to 50 or more would recognize this region as infrequent. Thus, longer sequence contexts, which 300 bp reads provide, allow detecting and correcting such errors in highly similar repeats.</p>
</caption>
<graphic xlink:href="bbw003f1"></graphic>
</fig>
</sec>
<sec>
<title>Iterative error correction corrects repeat-overlapping reads</title>
<p>Based on the example in
<xref ref-type="fig" rid="bbw003-F1">Figure 1</xref>
, we reasoned that additional correction rounds with larger
<italic>k</italic>
-mer sizes could correct additional reads. To test this, we subjected the simulated data sets after
<italic>k</italic>
 = 40 correction to additional rounds of correction, with
<italic>k</italic>
increasing up to two-thirds of the read length (
<italic>k</italic>
 = 75/100/125/150/175/200) and measured the percentage of erroneous reads. Consistently, for all three species, additional correction rounds substantially decreased the amount of erroneous reads (lines in
<xref ref-type="fig" rid="bbw003-F2">Figure 2A</xref>
;
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 2</xref>
). No single correction round achieved the accuracy of iterative correction, regardless of whether we used
<italic>k</italic>
-mer-based correction with varying
<italic>k</italic>
s or overlap-based correction (crosses in
<xref ref-type="fig" rid="bbw003-F2">Figure 2A</xref>
). For example, while 3.4% of the human reads have errors after the single best
<italic>k</italic>
 = 75 correction, only 0.82% have errors after iterative correction until
<italic>k</italic>
 = 200. In agreement with the observed decrease in the percentage of erroneous reads, subsequent correction rounds also steadily decrease the total number of errors and increase the percentage of the genome covered by 10, 20 or 30 correct reads (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Tables 3</xref>
and
<xref ref-type="supplementary-material" rid="sup1">4</xref>
).</p>
<fig id="bbw003-F2" orientation="portrait" position="float">
<label>Figure 2</label>
<caption>
<p>Iterative error correction corrects more errors than any single correction round and particularly corrects repeat-overlapping reads. (
<bold>A</bold>
) The percentage of erroneous reads decreases when correcting errors in iterative rounds (lines). The final achieved percentage is substantially lower than any single correction round (crosses), irrespective of which
<italic>k</italic>
-mer size is used. The final overlap-based correction is shown on the right. The human, lizard and chicken data are simulated 300 bp MiSeq reads. (
<bold>B</bold>
) The Y-axis shows the percentage of newly corrected reads after each iteration of
<italic>k</italic>
-mer correction that overlap repeats. Subsequent rounds correct more repeat-overlapping reads than the first
<italic>k</italic>
 = 40 round. (
<bold>C</bold>
) The percentage of 250 bp MiSeq reads from three rice subspecies that do not map exactly to the reference genome decreases when correcting errors in iterative rounds.</p>
</caption>
<graphic xlink:href="bbw003f2"></graphic>
</fig>
<p>We next explored whether iterative error correction helps to correct errors in repeat-overlapping reads. After each correction round, we extracted newly corrected reads that were erroneous in the previous round, and checked for overlap with repeats. We found that the percentage of repeat-overlapping reads is substantially higher in each subsequent round compared with the first
<italic>k</italic>
 = 40 correction round (
<xref ref-type="fig" rid="bbw003-F2">Figure 2B</xref>
;
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 1</xref>
). For the repeat-rich human genome, this percentage reaches 93% after
<italic>k</italic>
 = 125 correction. Taken together, these results show that iterative error correction can correct substantially more reads than any single correction round, and that subsequent rounds with a higher
<italic>k</italic>
increasingly correct errors in repeat regions.</p>
</sec>
<sec>
<title>A final round of overlap-based correction corrects insertion and deletion errors</title>
<p>While iterative
<italic>k</italic>
-mer-based correction minimized substitution errors, it alone cannot correct the errors that result from small insertions and deletions. Indeed, 20, 16 and 19% of the erroneous human, lizard and chicken reads contained insertion and deletion errors after
<italic>k</italic>
 = 200 correction. We therefore ran a final round of overlap-based correction, which reduced the percentage of erroneous reads with insertions and deletions to 9, 7 and 8%. After this final step of the error correction pipeline, only 0.71, 0.93 and 0.81% of the human, lizard and chicken reads are still erroneous (right-most data points in
<xref ref-type="fig" rid="bbw003-F2">Figure 2A</xref>
;
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 2</xref>
).</p>
</sec>
<sec>
<title>Iterative error correction minimizes errors in real 250 bp reads</title>
<p>To test if iterative error correction also improves read accuracy in real sequencing reads, we applied this strategy to 2 × 250 bp MiSeq data from three different rice subspecies. After each iteration, we determined the percentage of reads that map exactly to the reference genome. As shown in
<xref ref-type="fig" rid="bbw003-F2">Figure 2C</xref>
, iterative error correction consistently improved the percentage of exactly mapped reads (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 2</xref>
). We conclude that iterative error correction using multiple
<italic>k</italic>
-mer-based and a final overlap-based correction minimizes the total number of erroneous reads in both simulated and real data.</p>
</sec>
<sec>
<title>Reducing the number of erroneous reads substantially improves contig assembly</title>
<p>Next, we tested if the reduction in the number of erroneous reads translates into improved contig assembly for the human, lizard and chicken data. We applied SGA to assemble contigs from (i) reads after the single best
<italic>k</italic>
-mer-based correction (
<italic>k</italic>
 = 75 for human and lizard;
<italic>k</italic>
 = 40 for chicken), and (ii) reads after our iterative error correction strategy.
<xref ref-type="fig" rid="bbw003-F3">Figure 3</xref>
shows the corresponding NG50 values, where half of the respective chromosome consists of contigs of at least that length. NG50 values improve 2.1-fold for human, 2.2-fold for lizard and 2.9-fold for chicken. Thus, the reduction in the number of erroneous reads, even if just from 3.4 to 0.7% as for human, results in substantially longer contigs.</p>
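<p>For reference, NG50 as defined above can be computed as in the following sketch, which also covers the general NG(x) case. The function name and its arguments are illustrative. The reference size is the length of the chromosome from which the reads were sampled, and the function returns None when the contigs do not cover the requested percentage.</p>
<preformat><![CDATA[
```python
def ng(contig_lengths, reference_size, x=50):
    """Return the NG(x) contig length.

    NG(x) is the length L such that contigs of length at least L
    together cover x percent of the reference size.
    """
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= x * reference_size:
            return length
    return None  # contigs do not reach x percent of the reference
```
]]></preformat>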
<fig id="bbw003-F3" orientation="portrait" position="float">
<label>Figure 3</label>
<caption>
<p>Contig sizes improve 2- to 3-fold after iterative error correction. NG(x)% graph shows the contig size (Y-axis), where x% of the chromosome consists of contigs of at least that size. Contigs assembled from simulated reads after iterative error correction and after the single-best correction round are shown by solid and dashed lines, respectively. The vertical dotted line depicts NG50; corresponding numbers are stated alongside.</p>
</caption>
<graphic xlink:href="bbw003f3"></graphic>
</fig>
<p>To evaluate assembly accuracy and to rule out that the longer contigs result from mis-assemblies joining nonadjacent genomic regions, we aligned the contigs to the genome from which we sampled the raw reads. We found that virtually all contigs assembled from iteratively corrected reads map to a single continuous genomic region (99.71% for human, 99.89% for lizard and 100% for chicken). We also used BUSCO [
<xref rid="bbw003-B28" ref-type="bibr">28</xref>
] to assess assembly correctness by counting the number of complete single-copy genes found in each assembly. BUSCO found more complete single-copy genes in the assemblies from iteratively corrected reads (17 versus 15 genes for human, 67 versus 44 genes for lizard and 27 versus 23 for chicken). Together, this shows that the increase in contig lengths is not caused by assembly errors and that the assembly accuracy is high.</p>
</sec>
<sec>
<title>Fewer correction rounds reduce runtime while preserving correction accuracy</title>
<p>While iterative error correction minimizes the number of erroneous reads and substantially improves contig assembly, it clearly requires additional runtime. Compared with the runtime of the best single
<italic>k</italic>
-mer correction, seven iterative
<italic>k</italic>
-mer correction rounds and a final overlap correction run 6–8 times as long (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 5</xref>
). Overlap correction, which computes multiple sequence alignments of reads, takes up to 40% of the total runtime and needs 1.5 times as much memory compared with
<italic>k</italic>
-mer correction.</p>
<p>For practical considerations, we looked for ways to reduce runtime while preserving correction performance. We tested fewer correction rounds with larger step-sizes for
<italic>k</italic>
: three rounds with
<italic>k</italic>
 = 40/125/200 instead of seven rounds with
<italic>k</italic>
 = 40/75/100/125/150/175/200, both followed by a final overlap-based correction round. We found that fewer rounds run ∼5 times as long as the best single
<italic>k</italic>
-mer correction (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 5</xref>
) and increase the percentage of erroneous reads by only 0.18, 0.27 and 0.04% for human, lizard and chicken (
<xref ref-type="fig" rid="bbw003-F4">Figure 4</xref>
,
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 2</xref>
). This shows that the final achieved read accuracy is highly similar, irrespective of the
<italic>k</italic>
-mer step-size. Omitting the computationally expensive overlap correction step further reduces the runtime to just 1.8, 1.9 and 3.4 times at the cost of an increase of 0.31, 0.42 and 0.16% erroneous reads for human, lizard and chicken.</p>
<fig id="bbw003-F4" orientation="portrait" position="float">
<label>Figure 4</label>
<caption>
<p>Fewer iterative correction rounds preserve correction accuracy. Three instead of seven rounds of
<italic>k</italic>
-mer based correction, both followed by overlap-based correction, achieve a similar percentage of erroneous reads.</p>
</caption>
<graphic xlink:href="bbw003f4"></graphic>
</fig>
</sec>
<sec>
<title>
<italic>K</italic>
-mers larger than two-thirds of the read length can correct errors if sequencing coverage is high</title>
<p>In principle, the higher the sequencing coverage, the larger the
<italic>k</italic>
-mer size that can be used for error correction. To explore if this is indeed the case, we sampled reads with 60X and 100X coverage from our human simulated genome, in addition to the 30X data set described above. Iterative error correction was run as before, with the addition of
<italic>k</italic>
 = 225 and
<italic>k</italic>
 = 250 correction rounds. After each round, we measured the ratio of correct to erroneous reads; a ratio > 1 corresponds to more corrections than mis-corrections.</p>
<p>As shown in
<xref ref-type="fig" rid="bbw003-F5">Figure 5A</xref>
, correction with
<italic>k</italic>
 = 225 and
<italic>k</italic>
 = 250 further decreases the percentage of erroneous reads in the 60X and 100X data sets (dashed and dotted lines), but not in the 30X data set (solid line; see also
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 2</xref>
). In agreement with this result, we consistently observed more corrections than mis-corrections in the 60X and 100X data sets (
<xref ref-type="fig" rid="bbw003-F5">Figure 5B</xref>
). In fact, not a single base was mis-corrected for
<italic>k</italic>
 ≤ 150 in the 100X data set. In contrast, more mis-corrections than corrections were observed for
<italic>k</italic>
 ≥ 225 in the 30X data set.</p>
<fig id="bbw003-F5" orientation="portrait" position="float">
<label>Figure 5</label>
<caption>
<p>
<italic>K</italic>
-mers larger than two-thirds of the read length can correct errors if sequencing coverage is high. (
<bold>A</bold>
) The percentage of erroneous reads in the human 30X data set (solid line) starts to increase again with
<italic>k</italic>
 = 225 and
<italic>k</italic>
 = 250 correction. In contrast, the percentage consistently decreases in the 60X (dashed line) and 100X (dotted line) data set. (
<bold>B</bold>
) In agreement with A, we start to observe more mis-corrections than corrections after
<italic>k</italic>
 = 225 and
<italic>k</italic>
 = 250 correction in the human 30X data set (ratio below 1), but not in the 60X and 100X data sets (ratio > 1 irrespective of the
<italic>k</italic>
-mer size).</p>
</caption>
<graphic xlink:href="bbw003f5"></graphic>
</fig>
<p>As mis-corrections in the 30X data set with higher
<italic>k</italic>
-mer sizes are unexpected and unwanted, we explored potential reasons and found that they almost exclusively represent ‘haplotype conversions'. Ninety-six per cent of the reads mis-corrected after
<italic>k</italic>
 = 250 correction perfectly aligned to the respective alternative haplotype, that is, the haplotype from which the read did not originate. These haplotype conversions can occur if sequencing coverage is low and uneven for the two haplotypes, such that SNPs are mistakenly considered as errors. It should be noted that we strictly consider haplotype conversions as mis-corrections, even though removing heterozygosity is likely advantageous for genome assembly. Taken together, increasing
<italic>k</italic>
beyond two-thirds of the read length is beneficial for data sets with high sequencing coverage, even if the further reduction in erroneous reads is marginal.</p>
</sec>
<sec>
<title>Iterative error correction of 120 bp reads improves read accuracy but not consistently contig assembly</title>
<p>Because many genome assemblies are based on shorter read data, we also simulated 2 × 120 bp reads from all three chromosomes at 30X coverage each, and used real human 2 × 100 bp sequencing data to test iterative error correction. Indeed, as for 300 bp reads, iterative error correction consistently reduced the percentage of erroneous reads (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Figures 1</xref>
and
<xref ref-type="supplementary-material" rid="sup1">2</xref>
). However, in contrast to 300 bp reads, iterative correction of 120 bp reads improved contig assembly only for chicken, but not for human and lizard (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Figure 1B</xref>
). Two reasons likely contribute to this. First, shorter 120 bp reads span fewer repeats. For example, given 30X coverage, only 0.6% of the human Alu repeats are spanned by at least one 120 bp read, while 14.8% are spanned by at least one 300 bp read. Second, the length of the reads limits the length of the overlap between reads during assembly. In principle, longer read overlaps can help to resolve assembly ambiguities caused by repeats. We compared the minimum exact read overlap that resulted in the best assembly with regard to NG50. For both single and iteratively corrected 120 bp reads, the best human and lizard assembly was achieved with a minimum read overlap of 80 bp. In contrast, a larger minimum read overlap resulted in the best assembly of iteratively corrected 300 bp reads (single versus iterative correction: minimum overlap of 80 versus 100 bp for human and 100 versus 130 bp for lizard). Overall, iterative error correction consistently improves contig assembly only for longer sequencing reads.</p>
</sec>
<sec>
<title>Other methods also benefit from iterative error correction</title>
<p>To test if other error correction tools correct more errors if they are used in an iterative fashion, we applied Musket [
<xref rid="bbw003-B13" ref-type="bibr">13</xref>
] and BFC [
<xref rid="bbw003-B2" ref-type="bibr">2</xref>
] to the human 30X data set. As shown in
<xref ref-type="fig" rid="bbw003-F6">Figure 6</xref>
, subsequent correction rounds decrease the percentage of erroneous reads for both methods. While Musket and BFC are faster than SGA (
<xref ref-type="supplementary-material" rid="sup1">Supplementary Table 5</xref>
), they do not achieve the accuracy of SGA. We also tested two other methods that cannot be run iteratively because their maximum
<italic>k</italic>
-mer size is restricted to small
<italic>k</italic>
s (Lighter [
<xref rid="bbw003-B12" ref-type="bibr">12</xref>
]) or because
<italic>k</italic>
is automatically determined from the genome size (RACER [
<xref rid="bbw003-B16" ref-type="bibr">16</xref>
]). We found that both methods are outperformed by iterative correction with Musket or SGA (
<xref ref-type="fig" rid="bbw003-F6">Figure 6</xref>
). These results show that iterative error correction is not specific to SGA but rather a general strategy for many error correction tools.</p>
<fig id="bbw003-F6" orientation="portrait" position="float">
<label>Figure 6</label>
<caption>
<p>The percentage of erroneous reads also decreases if other error correction methods are used with an iterative error correction strategy. Three correction rounds with Musket [
<xref rid="bbw003-B13" ref-type="bibr">13</xref>
] and
<italic>k</italic>
 = 40/125/200 and two correction rounds with BFC [
<xref rid="bbw003-B2" ref-type="bibr">2</xref>
] and
<italic>k</italic>
 = 40/63 also reduce the percentage of erroneous 300 bp reads in the human 30X data set. The SGA curve is reproduced from
<xref ref-type="fig" rid="bbw003-F4">Figure 4</xref>
for comparison. For BFC, we ran only two rounds because BFC requires
<italic>k</italic>
to be at most 63. Please note that for Musket, the percentage of erroneous reads increases from 2.5% (
<italic>k</italic>
 = 125) to 3% (
<italic>k</italic>
 = 200). For comparison, we also tested RACER [
<xref rid="bbw003-B16" ref-type="bibr">16</xref>
] and Lighter [
<xref rid="bbw003-B12" ref-type="bibr">12</xref>
]. We ran Lighter only once because it requires
<italic>k</italic>
to be at most 32. RACER was run only once because it automatically determines the optimal
<italic>k</italic>
, given the genome size (135006516bp) as input. Because RACER does not output the chosen value for
<italic>k</italic>
, we plot the performance (13.18% erroneous reads) arbitrarily at
<italic>k</italic>
 = 32.</p>
</caption>
<graphic xlink:href="bbw003f6"></graphic>
</fig>
</sec>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>As sequencing technologies advance, longer reads will likely be the first choice for genome assembly projects. Genome assembly from next-generation sequencing data requires sequencing of a short insert library for contig building. In the past, short insert libraries were often sequenced with 2 × 100 bp reads from Illumina HiSeq sequencers. Illumina technology now offers 2 × 250 or even 2 × 300 bp reads. Given a short insert library with a mean fragment size of around 450 bp, these reads are able to span some classes of repeats. For example, longer reads will span short interspersed elements, which are up to 300 bp long, make up 15% of the human genome and represent a significant challenge for assembly because they are present in 1.8 million copies [
<xref rid="bbw003-B29" ref-type="bibr">29</xref>
]. However, longer Illumina reads have the disadvantage of being less accurate. Even if the per-base sequencing error rate is the same, the longer the reads are, the lower the probability that a read is completely error-free. Given that error-free read data are desirable for genome assembly, computational error correction is essential. Therefore, we developed a strategy for improving error correction that maximizes accuracy of long sequencing reads, especially of repeat-overlapping reads.</p>
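<p>The effect of read length on the chance of an error-free read can be illustrated with a simple model. The sketch below assumes that errors occur independently at every base with the same rate, which is a simplification because real Illumina error rates vary along the read. The error rate of 0.5% used in the example is an arbitrary illustrative value, not a measurement from this study.</p>
<preformat><![CDATA[
```python
def probability_error_free(read_length, per_base_error_rate):
    """Probability that a read contains no sequencing error.

    Assumes independent errors with a constant per-base rate, so the
    probability is (1 - rate) raised to the power of the read length.
    """
    return (1.0 - per_base_error_rate) ** read_length


if __name__ == "__main__":
    for length in (100, 250, 300):
        p = probability_error_free(length, 0.005)
        print("%d bp: %.3f" % (length, p))
```
]]></preformat>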
<p>Existing error-correction tools typically use a single and relatively small
<italic>k</italic>
-mer size of &lt;30 bp to correct as many errors as possible in a single round [
<xref rid="bbw003-B1" ref-type="bibr">1</xref>
]. Some of the tools determine the optimal
<italic>k</italic>
-mer size as the one that results in the greatest number of corrections, while other tools let the user choose
<italic>k</italic>
. However, there is an inherent trade-off to the choice of a single, fixed
<italic>k</italic>
-mer size: while a small
<italic>k</italic>
allows for more corrections, because reads have to overlap only by
<italic>k</italic>
bp, it can fail to detect errors if the erroneous
<italic>k</italic>
-mer is found elsewhere in the genome [
<xref rid="bbw003-B15" ref-type="bibr">15</xref>
] (see also
<xref ref-type="fig" rid="bbw003-F1">Figure 1</xref>
). On the other hand, a large
<italic>k</italic>
allows for correcting errors in repeats (
<xref ref-type="fig" rid="bbw003-F1">Figure 1</xref>
), but fails to detect errors in low coverage regions and to correct errors in reads with many errors. Here, we show that iterative error correction combines the advantages of small and large
<italic>k</italic>
-mers to minimize the number of erroneous reads (
<xref ref-type="fig" rid="bbw003-F2">Figure 2</xref>
).</p>
<p>While iterative error correction requires longer runtimes, far more time is spent on genome assembly, subsequent genome annotation and finally the analysis of the new genome. Importantly, annotation and analysis critically depend on the quality of the genome assembly. As shown in
<xref ref-type="fig" rid="bbw003-F3">Figure 3</xref>
, improving the accuracy of long Illumina reads helps to obtain better assemblies. Hence, read accuracy should be the main consideration. Iterative error correction likely helps genome assembly in two ways. First, although assemblers like SGA try to discard erroneous reads, some of these reads will escape filtering and will be used for assembly. Indeed, SGA did not discard 34% (155 812 of 456 827) of the erroneous human reads after the single best
<italic>k</italic>
 = 75 correction, and 62% (97 006 of 155 812) of these non-discarded reads overlapped repeats. These erroneous reads add spurious branchings in assembly graphs and hamper assembly contiguity. As shown in
<xref ref-type="fig" rid="bbw003-F2">Figure 2B</xref>
, iterative error correction corrects more errors in repeats and thus alleviates the problem of adding branchings because of sequencing errors in parts of the assembly graph that are already hard to resolve. Second, iterative error correction increases the number of correct reads in our data sets by 2.7% for human, 3.4% for lizard and 0.5% for chicken. While these additional reads may have little effect for loci with high read coverage, they will help to build contigs spanning loci with low coverage.</p>
<p>We used SGA to test iterative error correction, because it is one of the best performing error correction methods for large genomes [
<xref rid="bbw003-B21" ref-type="bibr">21</xref>
] and also outperforms other methods. However, as shown in
<xref ref-type="fig" rid="bbw003-F6">Figure 6</xref>
, other tools can benefit from an iterative error correction strategy. Thus, future error correction tool development and benchmarking are likely to benefit from iterative correction in general.</p>
<sec>
<title>Practical guidance to the users</title>
<p>To facilitate application of the proposed iterative error correction by the community, we provide a wrapper script called SGA-Iteratively Correcting Errors (SGA-ICE) that implements the pipeline by repeatedly using SGA modules. Given only an input directory with the fastq files containing the sequencing reads, SGA-ICE produces an executable shell script that contains all commands for iterative error correction. By default, SGA-ICE runs three rounds of
<italic>k</italic>
-mer-based correction with a
<italic>k</italic>
between 40 and two-thirds of the read length (which SGA-ICE determines automatically), followed by a final round of overlap-based correction. Thus, SGA-ICE eliminates the need for the user to choose a single
<italic>k</italic>
-mer value. Alternatively, the user can specify which
<italic>k</italic>
s to use and whether to run a final overlap-based correction round.</p>
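<p>As an illustration of such a default schedule, the sketch below picks k values between 40 and two-thirds of the read length. The even spacing of the intermediate values is an assumption made for this example and is not taken from the SGA-ICE source. For 300 bp reads it yields 40/120/200, whereas the three-round experiments reported here used 40/125/200. The function name is hypothetical.</p>
<preformat><![CDATA[
```python
def kmer_schedule(read_length, rounds=3, min_k=40):
    """Return increasing k values from min_k to two-thirds of the read length.

    Assumption: intermediate values are evenly spaced and rounded to
    integers. If two-thirds of the read length is not larger than min_k,
    a single round with min_k is returned.
    """
    max_k = (2 * read_length) // 3
    if rounds == 1 or max_k <= min_k:
        return [min_k]
    step = (max_k - min_k) / (rounds - 1)
    return [int(round(min_k + i * step)) for i in range(rounds)]
```
]]></preformat>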
<p>For Illumina sequencing data of large genomes, such as vertebrate genomes, we recommend the SGA-ICE default strategy of three
<italic>k</italic>
-mer correction rounds, which reduces runtime while preserving most of the correction accuracy, as shown in
<xref ref-type="fig" rid="bbw003-F4">Figure 4</xref>
. Omitting overlap correction would reduce the runtime further; however, small insertion and deletion errors would remain in the reads. If runtime considerations are not important or if the genome is smaller, we recommend running more than three rounds to maximize read accuracy, for example, with
<italic>k</italic>
 = 40/75/100/125/150/175/200. As suggested by
<xref ref-type="fig" rid="bbw003-F5">Figure 5</xref>
, the sequencing coverage determines the largest
<italic>k</italic>
-mer that can be used. For a coverage of ∼30X, which should be sufficient to build contigs from long Illumina reads, we do not recommend using
<italic>k</italic>
-mers larger than two-third the read length. However, for high sequencing coverage of 60X or more, it will be advantageous to use these large
<italic>k</italic>
-mers, as they are able to correct errors in extremely similar genomic repeats that are difficult to bridge during contig assembly.</p>
<p>SGA-ICE is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/hillerlab/IterativeErrorCorrection/">https://github.com/hillerlab/IterativeErrorCorrection/</ext-link>
.</p>
<p>
<boxed-text id="bbw003-BOX1" position="float" orientation="portrait">
<sec>
<title>Key Points</title>
<p>
<list list-type="bullet">
<list-item>
<p>Sequencing errors in reads that overlap highly similar genomic repeats are hard to correct by
<italic>k</italic>
-mer-based approaches.</p>
</list-item>
<list-item>
<p>Long 250 or 300 bp Illumina reads provide an opportunity to correct such errors by using longer
<italic>k</italic>
-mers.</p>
</list-item>
<list-item>
<p>Iterative error correction running multiple correction rounds with an increasing
<italic>k</italic>
-mer size corrects more errors, particularly in repeats.</p>
</list-item>
<list-item>
<p>The reduction in the number of erroneous reads improves contig assembly for long Illumina reads.</p>
</list-item>
<list-item>
<p>Iterative correction eliminates the need for the user to choose a single
<italic>k</italic>
-mer value.</p>
</list-item>
</list>
</p>
</sec>
</boxed-text>
</p>
</sec>
</sec>
<sec>
<title>Supplementary data</title>
<p>
<xref ref-type="supplementary-material" rid="sup1">Supplementary data</xref>
are available online at
<ext-link ext-link-type="uri" xlink:href="http://bib.oxfordjournals.org/">http://bib.oxfordjournals.org/</ext-link>
.</p>
</sec>
<sec>
<title>Funding</title>
<p>This work was supported by the Max Planck Society and fellowship 2012/01319-8 from Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) (to J.G.R.).</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="sup1">
<label>Supplementary Data</label>
<media xlink:href="bbw003_supp.zip">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p>We thank Jared Simpson for his help with and discussions about SGA. We also thank the members of the Hiller lab and Holger Brandl for helpful comments on the manuscript and the Computer Service Facilities of the MPI-CBG and MPI-PKS for their support.</p>
</ack>
<notes id="Note2">
<sec sec-type="author-bio">
<title></title>
<p>
<bold>Katrin Sameith</bold>
has a PhD in Bioinformatics. She is a postdoctoral researcher with an interest in bioinformatics and functional genomics.</p>
</sec>
<sec sec-type="author-bio">
<title></title>
<p>
<bold>Juliana Roscito</bold>
has a PhD in developmental biology. She is a postdoctoral researcher interested in developmental and evolutionary biology and genomics.</p>
</sec>
<sec sec-type="author-bio">
<title></title>
<p>
<bold>Michael Hiller</bold>
has a PhD in Bioinformatics. He is the head of a research group that works on computational genomics approaches to study phenotype–genotype associations.</p>
</sec>
</notes>
<ref-list>
<title>References</title>
<ref id="bbw003-B1">
<label>1</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Laehnemann</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Borkhardt</surname>
<given-names>A</given-names>
</name>
<name name-style="western">
<surname>McHardy</surname>
<given-names>AC</given-names>
</name>
</person-group>
<article-title>Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction</article-title>
.
<source>Brief Bioinform</source>
<year>2015</year>
;
<volume>17</volume>
:
<fpage>154</fpage>
<lpage>79</lpage>
,
<comment>bbv029</comment>
.
<pub-id pub-id-type="pmid">26026159</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B2">
<label>2</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Li</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>BFC: correcting Illumina sequencing errors</article-title>
.
<source>Bioinformatics</source>
<year>2015</year>
;
<volume>31</volume>
:
<fpage>2885</fpage>
<lpage>7</lpage>
.
<pub-id pub-id-type="pmid">25953801</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B3">
<label>3</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Heo</surname>
<given-names>Y</given-names>
</name>
<name name-style="western">
<surname>Wu</surname>
<given-names>XL</given-names>
</name>
<name name-style="western">
<surname>Chen</surname>
<given-names>D</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>BLESS: bloom filter-based error correction solution for high-throughput sequencing reads</article-title>
.
<source>Bioinformatics</source>
<year>2014</year>
;
<volume>30</volume>
:
<fpage>1354</fpage>
<lpage>62</lpage>
,
<comment>btu030</comment>
.
<pub-id pub-id-type="pmid">24451628</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B4">
<label>4</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Salmela</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Correcting errors in short reads by multiple alignments</article-title>
.
<source>Bioinformatics</source>
<year>2011</year>
;
<volume>27</volume>
:
<fpage>1455</fpage>
<lpage>61</lpage>
.
<pub-id pub-id-type="pmid">21471014</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B5">
<label>5</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kao</surname>
<given-names>WC</given-names>
</name>
<name name-style="western">
<surname>Chan</surname>
<given-names>AH</given-names>
</name>
<name name-style="western">
<surname>Song</surname>
<given-names>YS</given-names>
</name>
</person-group>
<article-title>ECHO: a reference-free short-read error correction algorithm</article-title>
.
<source>Genome Res</source>
<year>2011</year>
;
<volume>21</volume>
:
<fpage>1181</fpage>
<lpage>92</lpage>
.
<pub-id pub-id-type="pmid">21482625</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B6">
<label>6</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name name-style="western">
<surname>Tang</surname>
<given-names>H</given-names>
</name>
<name name-style="western">
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
.
<source>Proc Natl Acad Sci USA</source>
<year>2001</year>
;
<volume>98</volume>
:
<fpage>9748</fpage>
<lpage>53</lpage>
.
<pub-id pub-id-type="pmid">11504945</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B7">
<label>7</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chaisson</surname>
<given-names>MJ</given-names>
</name>
<name name-style="western">
<surname>Brinza</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
</person-group>
<article-title>De novo fragment assembly with short mate-paired reads: does the read length matter?</article-title>
<source>Genome Res</source>
<year>2009</year>
;
<volume>19</volume>
:
<fpage>336</fpage>
<lpage>46</lpage>
.
<pub-id pub-id-type="pmid">19056694</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B8">
<label>8</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Schulz</surname>
<given-names>MH</given-names>
</name>
<name name-style="western">
<surname>Weese</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Holtgrewe</surname>
<given-names>M</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>Fiona: a parallel and automatic strategy for read error correction</article-title>
.
<source>Bioinformatics</source>
<year>2014</year>
;
<volume>30</volume>
:
<fpage>i356</fpage>
<lpage>63</lpage>
.
<pub-id pub-id-type="pmid">25161220</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B9">
<label>9</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<name name-style="western">
<surname>Scott</surname>
<given-names>E</given-names>
</name>
<name name-style="western">
<surname>Kakaradov</surname>
<given-names>B</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>Error correction of high-throughput sequencing datasets with non-uniform coverage</article-title>
.
<source>Bioinformatics</source>
<year>2011</year>
;
<volume>27</volume>
:
<fpage>i137</fpage>
<lpage>41</lpage>
.
<pub-id pub-id-type="pmid">21685062</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B10">
<label>10</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Fazayeli</surname>
<given-names>F</given-names>
</name>
<name name-style="western">
<surname>Ilie</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>HiTEC: accurate error correction in high-throughput sequencing data</article-title>
.
<source>Bioinformatics</source>
<year>2011</year>
;
<volume>27</volume>
:
<fpage>295</fpage>
<lpage>302</lpage>
.
<pub-id pub-id-type="pmid">21115437</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B11">
<label>11</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Allam</surname>
<given-names>A</given-names>
</name>
<name name-style="western">
<surname>Kalnis</surname>
<given-names>P</given-names>
</name>
<name name-style="western">
<surname>Solovyev</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data</article-title>
.
<source>Bioinformatics</source>
<year>2015</year>
;
<volume>31</volume>
:
<fpage>3421</fpage>
<lpage>8</lpage>
.
<pub-id pub-id-type="pmid">26177965</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B12">
<label>12</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Song</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Florea</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Lighter: fast and memory-efficient sequencing error correction without counting</article-title>
.
<source>Genome Biol</source>
<year>2014</year>
;
<volume>15</volume>
:
<fpage>509</fpage>
.
<pub-id pub-id-type="pmid">25398208</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B13">
<label>13</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name name-style="western">
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data</article-title>
.
<source>Bioinformatics</source>
<year>2013</year>
;
<volume>29</volume>
:
<fpage>308</fpage>
<lpage>15</lpage>
.
<pub-id pub-id-type="pmid">23202746</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B14">
<label>14</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Zhao</surname>
<given-names>Z</given-names>
</name>
<name name-style="western">
<surname>Yin</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Li</surname>
<given-names>Y</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>An efficient hybrid approach to correcting errors in short reads</article-title>
.
<source>Model Decis Artif Intell</source>
<year>2011</year>
;
<volume>6820</volume>
:
<fpage>198</fpage>
<lpage>210</lpage>
.</mixed-citation>
</ref>
<ref id="bbw003-B15">
<label>15</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<name name-style="western">
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name name-style="western">
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
.
<source>Genome Biol</source>
<year>2010</year>
;
<volume>11</volume>
:
<fpage>1</fpage>
<lpage>13</lpage>
.</mixed-citation>
</ref>
<ref id="bbw003-B16">
<label>16</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Molnar</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>RACER: rapid and accurate correction of errors in reads</article-title>
.
<source>Bioinformatics</source>
<year>2013</year>
;
<volume>29</volume>
:
<fpage>2490</fpage>
<lpage>3</lpage>
.
<pub-id pub-id-type="pmid">23853064</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B17">
<label>17</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Yang</surname>
<given-names>X</given-names>
</name>
<name name-style="western">
<surname>Dorman</surname>
<given-names>KS</given-names>
</name>
<name name-style="western">
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Reptile: representative tiling for short read error correction</article-title>
.
<source>Bioinformatics</source>
<year>2010</year>
;
<volume>26</volume>
:
<fpage>2526</fpage>
<lpage>33</lpage>
.
<pub-id pub-id-type="pmid">20834037</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B18">
<label>18</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name name-style="western">
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Efficient de novo assembly of large genomes using compressed data structures</article-title>
.
<source>Genome Res</source>
<year>2012</year>
;
<volume>22</volume>
:
<fpage>549</fpage>
<lpage>56</lpage>
.
<pub-id pub-id-type="pmid">22156294</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B19">
<label>19</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Schröder</surname>
<given-names>H</given-names>
</name>
<name name-style="western">
<surname>Puglisi</surname>
<given-names>SJ</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>SHREC: a short-read error correction method</article-title>
.
<source>Bioinformatics</source>
<year>2009</year>
;
<volume>25</volume>
:
<fpage>2157</fpage>
<lpage>63</lpage>
.
<pub-id pub-id-type="pmid">19542152</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B20">
<label>20</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Yang</surname>
<given-names>X</given-names>
</name>
<name name-style="western">
<surname>Chockalingam</surname>
<given-names>SP</given-names>
</name>
<name name-style="western">
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A survey of error-correction methods for next-generation sequencing</article-title>
.
<source>Brief Bioinform</source>
<year>2012</year>
;
<volume>14</volume>
:
<fpage>56</fpage>
<lpage>66</lpage>
,
<comment>bbs015</comment>
.
<pub-id pub-id-type="pmid">22492192</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B21">
<label>21</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Molnar</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Correcting Illumina data</article-title>
.
<source>Brief Bioinform</source>
<year>2014</year>
;
<volume>16</volume>
:
<fpage>588</fpage>
<lpage>99</lpage>
,
<comment>bbu029</comment>
.
<pub-id pub-id-type="pmid">25183248</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B22">
<label>22</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Wan</surname>
<given-names>Q-H</given-names>
</name>
<name name-style="western">
<surname>Pan</surname>
<given-names>S-K</given-names>
</name>
<name name-style="western">
<surname>Hu</surname>
<given-names>L</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>Genome analysis and signature discovery for diving and sensory properties of the endangered Chinese alligator</article-title>
.
<source>Cell Res</source>
<year>2013</year>
;
<volume>23</volume>
:
<fpage>1091</fpage>
<lpage>105</lpage>
.
<pub-id pub-id-type="pmid">23917531</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B23">
<label>23</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Hu</surname>
<given-names>X</given-names>
</name>
<name name-style="western">
<surname>Yuan</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Shi</surname>
<given-names>Y</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>pIRS: profile-based Illumina pair-end reads simulator</article-title>
.
<source>Bioinformatics</source>
<year>2012</year>
;
<volume>28</volume>
:
<fpage>1533</fpage>
<lpage>5</lpage>
.
<pub-id pub-id-type="pmid">22508794</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B24">
<label>24</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Huang</surname>
<given-names>W</given-names>
</name>
<name name-style="western">
<surname>Li</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Myers</surname>
<given-names>JR</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>ART: a next-generation sequencing read simulator</article-title>
.
<source>Bioinformatics</source>
<year>2012</year>
;
<volume>28</volume>
:
<fpage>593</fpage>
<lpage>4</lpage>
.
<pub-id pub-id-type="pmid">22199392</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B25">
<label>25</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Rosenbloom</surname>
<given-names>KR</given-names>
</name>
<name name-style="western">
<surname>Armstrong</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Barber</surname>
<given-names>GP</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>The UCSC genome browser database: 2015 update</article-title>
.
<source>Nucleic Acids Res</source>
<year>2015</year>
;
<volume>43</volume>
:
<fpage>D670</fpage>
<lpage>81</lpage>
.
<pub-id pub-id-type="pmid">25428374</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B26">
<label>26</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name name-style="western">
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Efficient construction of an assembly string graph using the FM-index</article-title>
.
<source>Bioinformatics</source>
<year>2010</year>
;
<volume>26</volume>
:
<fpage>i367</fpage>
<lpage>73</lpage>
.
<pub-id pub-id-type="pmid">20529929</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B27">
<label>27</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name name-style="western">
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Fast gapped-read alignment with Bowtie 2</article-title>
.
<source>Nat Methods</source>
<year>2012</year>
;
<volume>9</volume>
:
<fpage>357</fpage>
<lpage>9</lpage>
.
<pub-id pub-id-type="pmid">22388286</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B28">
<label>28</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Simão</surname>
<given-names>FA</given-names>
</name>
<name name-style="western">
<surname>Waterhouse</surname>
<given-names>RM</given-names>
</name>
<name name-style="western">
<surname>Ioannidis</surname>
<given-names>P</given-names>
</name>
</person-group>
<etal>et al</etal>
<article-title>BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs</article-title>
.
<source>Bioinformatics</source>
<year>2015</year>
;
<volume>31</volume>
:
<fpage>3210</fpage>
<lpage>2</lpage>
,
<comment>btv351</comment>
.
<pub-id pub-id-type="pmid">26059717</pub-id>
</mixed-citation>
</ref>
<ref id="bbw003-B29">
<label>29</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Treangen</surname>
<given-names>TJ</given-names>
</name>
<name name-style="western">
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Repetitive DNA and next-generation sequencing: computational challenges and solutions</article-title>
.
<source>Nat Rev Genet</source>
<year>2012</year>
;
<volume>13</volume>
:
<fpage>36</fpage>
<lpage>46</lpage>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B46  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B46  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021