Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

SEQuel: improving the accuracy of genome assemblies

Identifieur interne : 000B07 ( Pmc/Corpus ); précédent : 000B06; suivant : 000B08

SEQuel: improving the accuracy of genome assemblies

Auteurs : Roy Ronen ; Christina Boucher ; Hamidreza Chitsaz ; Pavel Pevzner

Source :

RBID : PMC:3371851

Abstract

Motivation: Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model.

Results: SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels, and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium SAR324 genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.

Availability: SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at http://bix.ucsd.edu/SEQuel/.

Contact:ppevzner@cs.ucsd.edu


Url:
DOI: 10.1093/bioinformatics/bts219
PubMed: 22689760
PubMed Central: 3371851

Links to Exploration step

PMC:3371851

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">SEQuel: improving the accuracy of genome assemblies</title>
<author>
<name sortKey="Ronen, Roy" sort="Ronen, Roy" uniqKey="Ronen R" first="Roy" last="Ronen">Roy Ronen</name>
<affiliation>
<nlm:aff id="AFF1">Bioinformatics Graduate Program,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Boucher, Christina" sort="Boucher, Christina" uniqKey="Boucher C" first="Christina" last="Boucher">Christina Boucher</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chitsaz, Hamidreza" sort="Chitsaz, Hamidreza" uniqKey="Chitsaz H" first="Hamidreza" last="Chitsaz">Hamidreza Chitsaz</name>
<affiliation>
<nlm:aff id="AFF1">Department of Computer Science, Wayne State University, Detroit, MI 48202, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pevzner, Pavel" sort="Pevzner, Pavel" uniqKey="Pevzner P" first="Pavel" last="Pevzner">Pavel Pevzner</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22689760</idno>
<idno type="pmc">3371851</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3371851</idno>
<idno type="RBID">PMC:3371851</idno>
<idno type="doi">10.1093/bioinformatics/bts219</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000B07</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B07</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">SEQuel: improving the accuracy of genome assemblies</title>
<author>
<name sortKey="Ronen, Roy" sort="Ronen, Roy" uniqKey="Ronen R" first="Roy" last="Ronen">Roy Ronen</name>
<affiliation>
<nlm:aff id="AFF1">Bioinformatics Graduate Program,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Boucher, Christina" sort="Boucher, Christina" uniqKey="Boucher C" first="Christina" last="Boucher">Christina Boucher</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chitsaz, Hamidreza" sort="Chitsaz, Hamidreza" uniqKey="Chitsaz H" first="Hamidreza" last="Chitsaz">Hamidreza Chitsaz</name>
<affiliation>
<nlm:aff id="AFF1">Department of Computer Science, Wayne State University, Detroit, MI 48202, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pevzner, Pavel" sort="Pevzner, Pavel" uniqKey="Pevzner P" first="Pavel" last="Pevzner">Pavel Pevzner</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>
<bold>Motivation:</bold>
Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop
<italic>SEQuel</italic>
, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the
<italic>positional de Bruijn graph</italic>
, a graph structure that models
<italic>k</italic>
-mers within reads while incorporating the approximate positions of reads into the model.</p>
<p>
<bold>Results:</bold>
SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell
<italic>Escherichia coli</italic>
data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels, and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium
<italic>SAR324</italic>
genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.</p>
<p>
<bold>Availability:</bold>
SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at
<ext-link ext-link-type="uri" xlink:href="http://bix.ucsd.edu/SEQuel/">http://bix.ucsd.edu/SEQuel/</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>ppevzner@cs.ucsd.edu</email>
</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Alkan, S" uniqKey="Alkan S">S. Alkan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bankevich, A" uniqKey="Bankevich A">A. Bankevich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bentley, D R" uniqKey="Bentley D">D.R. Bentley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Butler, J" uniqKey="Butler J">J. Butler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chitsaz, H" uniqKey="Chitsaz H">H. Chitsaz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Compeau, P E C" uniqKey="Compeau P">P.E.C. Compeau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Depristo, M" uniqKey="Depristo M">M. DePristo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Donmez, N" uniqKey="Donmez N">N. Donmez</name>
</author>
<author>
<name sortKey="Brudno, M" uniqKey="Brudno M">M. Brudno</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ewing, B" uniqKey="Ewing B">B. Ewing</name>
</author>
<author>
<name sortKey="Green, P" uniqKey="Green P">P. Green</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ewing, B" uniqKey="Ewing B">B. Ewing</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mckenna, A" uniqKey="Mckenna A">A. McKenna</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hannenhalli, S" uniqKey="Hannenhalli S">S. Hannenhalli</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hirschberg, D S" uniqKey="Hirschberg D">D.S. Hirschberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, S" uniqKey="Huang S">S. Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Idury, R M" uniqKey="Idury R">R.M. Idury</name>
</author>
<author>
<name sortKey="Waterman, M S" uniqKey="Waterman M">M.S. Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kececioglu, J" uniqKey="Kececioglu J">J. Kececioglu</name>
</author>
<author>
<name sortKey="Yu, J" uniqKey="Yu J">J. Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, D R" uniqKey="Kelley D">D.R. Kelley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, W J" uniqKey="Kent W">W.J. Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Klein, J D" uniqKey="Klein J">J.D. Klein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R. Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, G" uniqKey="Myers G">G. Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P. Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Genome, 10k Community Of Scientists" uniqKey="Genome 1">10K Community of Scientists. Genome</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P. Pevzner</name>
</author>
<author>
<name sortKey="Chaisson, M" uniqKey="Chaisson M">M. Chaisson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P. Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, P A" uniqKey="Pevzner P">P.A. Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raghunathan, A" uniqKey="Raghunathan A">A. Raghunathan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Robinson, G E" uniqKey="Robinson G">G.E. Robinson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rodrigue, S" uniqKey="Rodrigue S">S. Rodrigue</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, J T" uniqKey="Simpson J">J.T. Simpson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tammi, M T" uniqKey="Tammi M">M.T. Tammi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wheeler, D A" uniqKey="Wheeler D">D.A. Wheeler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, D R" uniqKey="Zerbino D">D.R. Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E. Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhi, D" uniqKey="Zhi D">D. Zhi</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22689760</article-id>
<article-id pub-id-type="pmc">3371851</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/bts219</article-id>
<article-id pub-id-type="publisher-id">bts219</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Ismb 2012 Proceedings Papers Committee July 15 to July 19, 2012, Long Beach, Ca, Usa</subject>
</subj-group>
<subj-group>
<subject>Original Papers</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>SEQuel: improving the accuracy of genome assemblies</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Ronen</surname>
<given-names>Roy</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="author-notes" rid="FN1">
<sup></sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Boucher</surname>
<given-names>Christina</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>2</sup>
</xref>
<xref ref-type="author-notes" rid="FN1">
<sup></sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chitsaz</surname>
<given-names>Hamidreza</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pevzner</surname>
<given-names>Pavel</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="COR1">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="AFF1">
<sup>1</sup>
Bioinformatics Graduate Program,
<sup>2</sup>
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093 and
<sup>3</sup>
Department of Computer Science, Wayne State University, Detroit, MI 48202, USA</aff>
<author-notes>
<corresp id="COR1">* To whom correspondence should be addressed.</corresp>
<fn id="FN1">
<p>
<sup></sup>
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.</p>
</fn>
</author-notes>
<pub-date pub-type="ppub">
<day>15</day>
<month>6</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>9</day>
<month>6</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>9</day>
<month>6</month>
<year>2012</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>28</volume>
<issue>12</issue>
<fpage>i188</fpage>
<lpage>i196</lpage>
<permissions>
<copyright-statement>© The Author(s) 2012. Published by Oxford University Press.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">http://creativecommons.org/licenses/by-nc/3.0</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>
<bold>Motivation:</bold>
Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop
<italic>SEQuel</italic>
, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the
<italic>positional de Bruijn graph</italic>
, a graph structure that models
<italic>k</italic>
-mers within reads while incorporating the approximate positions of reads into the model.</p>
<p>
<bold>Results:</bold>
SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell
<italic>Escherichia coli</italic>
data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels, and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium
<italic>SAR324</italic>
genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.</p>
<p>
<bold>Availability:</bold>
SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at
<ext-link ext-link-type="uri" xlink:href="http://bix.ucsd.edu/SEQuel/">http://bix.ucsd.edu/SEQuel/</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>ppevzner@cs.ucsd.edu</email>
</p>
</abstract>
<counts>
<page-count count="9"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="SEC1">
<title>1 INTRODUCTION</title>
<p>The advent of next-generation sequencing (NGS) technologies, along with the development of new assembly algorithms has enabled the production of genome assemblies for a multitude of organisms at ever-decreasing costs (
<xref ref-type="bibr" rid="B14">Huang
<italic>et al</italic>
., 2009</xref>
;
<xref ref-type="bibr" rid="B21">Li
<italic>et al</italic>
., 2010</xref>
;
<xref ref-type="bibr" rid="B22">Li
<italic>et al</italic>
., 2010</xref>
). Robust assembly methods are imperative to the success of large
<italic>de novo</italic>
sequencing initiatives, such as the
<italic>Genome 10K</italic>
project that aims to sequence the genomes of 10 000 vertebrate species (
<xref ref-type="bibr" rid="B25">Genome 10K Community of Scientists, 2009</xref>
) and the
<italic>iK5</italic>
project where the objective is to sequence the genomes of 5 000 arthropods (
<xref ref-type="bibr" rid="B30">Robinson
<italic>et al</italic>
., 2011</xref>
). NGS technologies produce short sequence reads [approximately 100–150 base pairs (bp) for Illumina technology] at increasingly high throughput, permitting assembly methods suited to these technologies to exploit the redundancy in the data in order to produce high-quality contigs (
<xref ref-type="bibr" rid="B3">Bankevich
<italic>et al</italic>
., 2008</xref>
;
<xref ref-type="bibr" rid="B34">Wheeler
<italic>et al</italic>
., 2008</xref>
). Although these platforms have much higher throughput than Sanger sequencing platforms, assessment of short-read assemblies have shown them to be less accurate than the finished genomes assembled using the previous technologies (
<xref ref-type="bibr" rid="B1">Alkan,
<italic>et al</italic>
., 2011</xref>
). Earlier assembly algorithms developed for Sanger sequencing follow an ‘overlap - layout - consensus’ paradigm, where consensus refers to fixing errors in the contigs (
<xref ref-type="bibr" rid="B9">Ewing and Green, 1998</xref>
;
<xref ref-type="bibr" rid="B10">Ewing
<italic>et al</italic>
., 1998</xref>
). Since this paradigm faces difficulties in short-read assembly, most NGS assemblers employ a de Bruijn graph approach that effectively deals with large amounts of data. However, most NGS assemblers neglect the consensus step, i.e. there exists no post-processing of the contigs in Velvet (
<xref ref-type="bibr" rid="B35">Zerbino
<italic>et al</italic>
., 2008</xref>
) and many other popular assemblers. Relying on high and uniform coverage, NGS assembly algorithms push the burden of producing high-quality assemblies onto the construction of the de Bruijn graph. We argue that NGS assemblers can benefit from the use of a consensus step, particularly in the case of single-cell data that suffers from high error rates and non-uniform coverage (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
).</p>
<p>In the spirit of the consensus step, we propose an additional step to the NGS assembly process: refinement. We develop
<italic>SEQuel</italic>
, a tool that refines an initial assembly of short-read data by using approximate positions of reads in contigs. SEQuel takes as input an assembled contig, the paired-end reads that align to that contig and the approximate positions where they aligned, and returns a refined contig. We refer to the process implemented by SEQuel as
<italic>positional reassembly</italic>
.</p>
<p>In Eulerian assembly (
<xref ref-type="bibr" rid="B15">Idury and Waterman, 1995</xref>
;
<xref ref-type="bibr" rid="B28">Pevzner
<italic>et al</italic>
., 2001</xref>
), a de Bruijn graph is constructed with a vertex
<italic>v</italic>
for every (
<italic>k</italic>
− 1)-mer present in a set of reads, and an edge (
<italic>v</italic>
,
<italic>v</italic>
′) for every observed
<italic>k</italic>
-mer in the reads with (
<italic>k</italic>
− 1)-mer prefix
<italic>v</italic>
and (
<italic>k</italic>
− 1)-mer suffix
<italic>v</italic>
′. A contig corresponds to a non-branching path through this graph. See
<xref ref-type="bibr" rid="B6">Compeau
<italic>et al</italic>
. (2011</xref>
) for a more thorough explanation of de Bruijn graphs and their use in assembly. Euler-SR (
<xref ref-type="bibr" rid="B26">Pevzner
<italic>et al</italic>
., 2008</xref>
), Velvet (
<xref ref-type="bibr" rid="B35">Zerbino
<italic>et al</italic>
., 2008</xref>
), SOAPdenovo (
<xref ref-type="bibr" rid="B22">Li
<italic>et al</italic>
., 2010</xref>
), ABySS (
<xref ref-type="bibr" rid="B32">Simpson
<italic>et al</italic>
., 2009</xref>
) and ALLPATHS (
<xref ref-type="bibr" rid="B4">Butler
<italic>et al</italic>
., 2008</xref>
) all use this paradigm for assembly. Most existing NGS assemblers follow the same general outline: break the (possibly error corrected) reads into
<italic>k</italic>
-mers, construct the de Bruijn graph on the set of resulting
<italic>k</italic>
-mers, simplify the de Bruijn graph, resolve repeats by using mate–pair information and construct contigs. Although the implementation of these steps varies widely between different assemblers, existing NGS assemblers return contigs recovered from the de Bruijn graph with little refinement.</p>
<p>If every position in the genome was uniformly covered by error-free reads, and the genome had few repeats, this would result in a simple de Bruijn graph. However, sequencing errors and repeats lead to highly complex graphs and force assemblers to rely on graph simplification. It is during this simplification process that errors in the assembly are introduced. Substitution errors and indels in the reads create undirected cycles called
<italic>bulges</italic>
and short tandem repeats lead to directed cycles called
<italic>whirls</italic>
(
<xref ref-type="bibr" rid="B27">Pevzner
<italic>et al</italic>
., 2004</xref>
). There exist numerous methods for removing bulges and whirls but unfortunately, these methods potentially introduce errors in the contigs.
<xref ref-type="fig" rid="F1">Figure 1</xref>
illustrates a scenario where a bulge in the de Bruijn graph is caused due to a sequencing error.
<fig id="F1" position="float">
<label>Fig. 1.</label>
<caption>
<p>An example of a bulge on eight vertices in a de Bruijn graph (
<italic>k</italic>
=4) resulting from a sequencing error. During the process of bulge removal, the correct path (top: CCT-CTA-TAG-AGG-GGA) may be discarded, thus creating a substitution error in the final contig. This may occur if, for example, coverage is taken as a consideration, since the bottom path (CCT-CTT-TTG-TGG-GGA), erroneous in this case, may have higher coverage due to
<italic>k</italic>
-mers originating from other parts of the genome</p>
</caption>
<graphic xlink:href="bts219f1"></graphic>
</fig>
</p>
<p>Error correction of the reads prior to assembly can greatly simplify the assembly process by implicitly eliminating bulges from the de Bruijn graph (
<xref ref-type="bibr" rid="B17">Kelley
<italic>et al</italic>
., 2010</xref>
;
<xref ref-type="bibr" rid="B24">Medvedev
<italic>et al</italic>
., 2011</xref>
;
<xref ref-type="bibr" rid="B26">Pevzner
<italic>et al</italic>
., 2008</xref>
;
<xref ref-type="bibr" rid="B28">Pevzner
<italic>et al</italic>
., 2001</xref>
). It is now established as a common pre-processing step before assembly, and used by several NGS assemblers, including Euler-SR (
<xref ref-type="bibr" rid="B26">Pevzner
<italic>et al</italic>
., 2008</xref>
), ABySS (
<xref ref-type="bibr" rid="B32">Simpson
<italic>et al</italic>
., 2009</xref>
) and ALLPATHS (
<xref ref-type="bibr" rid="B4">Butler
<italic>et al</italic>
., 2008</xref>
). Although error correction eliminates the majority of errors in reads, in ~1% of the cases it introduces, rather than corrects, errors (
<xref ref-type="bibr" rid="B26">Pevzner
<italic>et al</italic>
., 2008</xref>
). This leads to the following trade-off in fragment assembly: either error correction of reads is performed, which may lead to errors in the contigs, or error correction is not performed and the complex de Bruijn graph has to undergo aggressive simplification that may lead to errors in the contigs. It is clear, however, that in both cases subtle and complex errors will arise.</p>
<p>The accuracy of different assemblers varies widely. For example, our tests of Velvet produced contigs with 1–2 errors per 100 kb (for
<italic>k</italic>
-mer size 55), while SOAPdenovo produced contigs with 20–30 errors per 100 kb. Even the accuracy of Velvet deteriorates greatly when used in default mode with
<italic>k</italic>
-mer size 31. However, SOAPdenovo has some advantages over Velvet (e.g. the ability to handle larger genomes). Therefore, it is beneficial to design a refinement program that can be used in combination with any assembler. Decoupling the contig refinement problem from the assembly process removes the burden of re-implementing a positional reassembly process for each assembler.</p>
<p>We show that NGS assemblies suffer from indels and substitution errors that are somewhat masked by common metrics for assessing assembly quality. Many of these errors can be corrected using SEQuel. We give an analysis of the types of complex errors that occur in NGS assembly, and that can be fixed by SEQuel, offering some insight into why positional reassembly is a necessity for obtaining accurate assemblies.</p>
<p>We give a computational problem formulation for correcting errors in contigs, and present an algorithm for reassembly based on a graph structure referred to as the
<italic>positional de Bruijn graph</italic>
. We demonstrate the ability of SEQuel to improve the accuracy of assemblies generated from three assemblers: Euler-SR (
<xref ref-type="bibr" rid="B26">Pevzner
<italic>et al</italic>
., 2008</xref>
), Velvet (
<xref ref-type="bibr" rid="B35">Zerbino
<italic>et al</italic>
., 2008</xref>
) and Velvet-SC (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
). Euler-SR performs error correction on the sequence reads prior to assembly (
<xref ref-type="bibr" rid="B26">Pevzner
<italic>et al</italic>
., 2008</xref>
), whereas Velvet does not perform error correction (
<xref ref-type="bibr" rid="B35">Zerbino
<italic>et al</italic>
., 2008</xref>
). Velvet-SC is a specialized assembler tailored to handle the dramatic fluctuations in coverage that are characteristic to single-cell data (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
). Single-cell amplified DNA has been shown to suffer from amplification bias and low template quality (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
;
<xref ref-type="bibr" rid="B29">Raghunathan
<italic>et al</italic>
., 2005</xref>
;
<xref ref-type="bibr" rid="B31">Rodrigue
<italic>et al</italic>
., 2009</xref>
), resulting in sequence data with highly non-uniform coverage by error-prone reads (
<xref ref-type="bibr" rid="B29">Raghunathan
<italic>et al</italic>
., 2005</xref>
). Thus, assembly of such data is inherently more challenging and error-prone. Our experiments demonstrate that SEQuel is able to substantially reduce the number of errors in single-cell, and standard (multi-cell) assembly. Although we demonstrate the use of SEQuel with Euler-SR, Velvet and Velvet-SC, its use is not limited to these assemblers.</p>
</sec>
<sec id="SEC2">
<title>2 THE CONTIG REFINEMENT PROBLEM</title>
<p>We formalize the contig refinement problem and present an algorithm for positional reassembly that relies on the positional de Bruijn graph. A similar graph was previously proposed by
<xref ref-type="bibr" rid="B12">Hannenhalli
<italic>et al</italic>
. (1996</xref>
) for Sequencing By Hybridization. Contrary to the de Bruijn graph where edges correspond to
<italic>k</italic>
-mers, the edges of the positional de Bruijn graph correspond to
<italic>k</italic>
-mers and their inferred positions on the contigs.</p>
<p>Fragment assembly in the de Bruijn graph framework is often abstracted as a problem of finding a shortest string that explains the set of all
<italic>k</italic>
-mers from the reads, i.e. the
<italic>Shortest Common Superstring Problem</italic>
. We formulate the contig refinement problem in a similar manner; that is, as finding a string that explains all occuring
<italic>positional</italic>
<italic>k</italic>
-mers. While both abstractions are limited in that they do not adequately address the assembly of repeat regions, they prove to be conceptually useful. The input to the contig refinement problem is the set of
<italic>k</italic>
-mers used to assemble a contig, and for each
<italic>k</italic>
-mer a position (or positions) where it is presumably contained in the contig, i.e. a multiset of pairs (
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
,
<italic>p</italic>
), where
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
is a
<italic>k</italic>
-mer and
<italic>p</italic>
is the approximate position. We refer to these
<italic>k</italic>
-mer and position pairs (
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
,
<italic>p</italic>
) as
<italic>positional k-mers</italic>
. Given a parameter Δ, we call a positional
<italic>k</italic>
-mer (
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
,
<italic>p</italic>
)
<italic>valid</italic>
with respect to a string
<italic>S</italic>
if
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
appears in
<italic>S</italic>
at a position that is within Δ of
<italic>p</italic>
.</p>
<p>
<italic>The contig refinement problem</italic>
: given a multiset of positional \hbox{
<italic>k</italic>
-mers} and a parameter Δ, find a shortest string
<italic>S</italic>
that maximizes the total number of valid positional
<italic>k</italic>
-mers.</p>
<p>Due to indels in the contigs (relative to the genome), we expect the positions of
<italic>k</italic>
-mers in the assembled contig to differ from their positions in the correct contig. Therefore, since our goal is to assemble the correct contig, we assume the positions are an approximation to the position in the correct contig. Next, we demonstrate how positional information may be incorporated into the de Bruijn graph and used to refine contigs by positional reassembly.</p>
</sec>
<sec id="SEC3">
<title>3 ALGORITHM AND METHODS</title>
<p>Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph. The input to SEQuel is a multiset of positional
<italic>k</italic>
-mers, and a set of contigs; the output is a set of refined contigs. While the description of the method below is for a single contig, in practice it is applied to all contigs of an assembly.</p>
<sec id="SEC3.1">
<title>3.1 Recruitment of reads to contigs</title>
<p>We refer to a read-pair as
<italic>permissively aligned</italic>
to a contig if either one or both of the reads in the read-pair uniquely align to a single contig in the assembly. If only one read aligns, the approximate position of the unaligned read is deduced using the expected insert size. After assembly, we extract the set of reads that permissively align to the contig and their approximate positions, which is then used in the construction of the positional de Bruijn graph. This is performed using BWA (version 0.5.9) in paired-end mode with default parameters (
<xref ref-type="bibr" rid="B20">Li
<italic>et al</italic>
., 2009</xref>
), allowing detection of alignments with small indels as well as read-pairs where only one read is aligned. Although some assemblers (e.g. Velvet) output the assignment of reads to contigs, we obtain better refinement results using BWA. We note that using only the set of permissively aligned reads will lead to reduced coverage in certain regions, however, we see only a modest decline in
<italic>average</italic>
coverage and very few regions where coverage was significantly reduced (for bacterial genomes). Thus, the accuracy of SEQuel is improved at the expense of correcting fewer errors in these regions.</p>
</sec>
<sec id="SEC3.2">
<title>3.2 Construction of the positional de Bruijn graph</title>
<p>From the set of reads that permissively aligned to the contig, we construct the set of positional
<italic>k</italic>
-mers. If a read
<italic>r</italic>
=[
<italic>r</italic>
<sub>1</sub>
<italic>r</italic>
<sub>
<italic>n</italic>
</sub>
] of length
<italic>n</italic>
is aligned to a contig at position
<italic>i</italic>
, we extract
<italic>n</italic>
<italic>k</italic>
+1 positional
<italic>k</italic>
-mers from
<italic>r</italic>
: ([
<italic>r</italic>
<sub>1</sub>
<italic>r</italic>
<sub>
<italic>k</italic>
</sub>
],
<italic>i</italic>
), ([
<italic>r</italic>
<sub>2</sub>
<italic>r</italic>
<sub>
<italic>k</italic>
+1</sub>
],
<italic>i</italic>
+1),…, ([
<italic>r</italic>
<sub>
<italic>n</italic>
<italic>k</italic>
+1</sub>
<italic>r</italic>
<sub>
<italic>n</italic>
</sub>
],
<italic>i</italic>
+
<italic>n</italic>
<italic>k</italic>
). We emphasize that different reads may give rise to the same
<italic>k</italic>
-mer with different inferred positions. Consequently, we cluster by position all positional
<italic>k</italic>
-mers that have the same
<italic>k</italic>
-mer sequence, and use the cluster centers when constructing the positional de Bruijn graph. This is a one-dimensional clustering problem, where single-linkage clustering performs well.</p>
<p>We refer to the
<italic>multiplicity</italic>
of a positional
<italic>k</italic>
-mer (
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
,
<italic>p</italic>
) as the number of occurrences where
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
clustered at position
<italic>p</italic>
. For example, if the
<italic>k</italic>
-mer ACTA aligned to positions (42, 43, 43, 44), and the corresponding positional
<italic>k</italic>
-mers all cluster to (ACTA,43), then the multiplicity of this positional
<italic>k</italic>
-mer equals 4. SEQuel removes clusters of low multiplicity since they are likely to represent erroneous
<italic>k</italic>
-mers. Lastly, the positional de Bruijn graph is constructed from the positional
<italic>k</italic>
-mers, as described below.</p>
<p>The positional de Bruijn graph
<italic>G</italic>
<sub>
<italic>k</italic>
</sub>
is defined for a multiset of positional
<italic>k</italic>
-mers and parameter Δ, and is constructed in a similar manner to the traditional de Bruijn graph using an A-Bruijn graph framework from (
<xref ref-type="bibr" rid="B27">Pevzner
<italic>et al</italic>
., 2004</xref>
). Given a
<italic>k</italic>
-mer
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
, let prefix(
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
) be the first
<italic>k</italic>
−1 nucleotides of
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
, and suffix(
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
) be the last
<italic>k</italic>
−1 nucleotides of
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
. Each positional
<italic>k</italic>
−mer (
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
,
<italic>p</italic>
) in the input multiset corresponds to a directed edge in the graph between two positional (
<italic>k</italic>
− 1)-mers, (prefix(
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
),
<italic>p</italic>
) and (suffix(
<italic>s</italic>
<sub>
<italic>k</italic>
</sub>
),
<italic>p</italic>
+1). After all edges are formed, the graph undergoes a gluing operation. A pair of positional (
<italic>k</italic>
−1)-mers, (
<italic>s</italic>
<sub>
<italic>k</italic>
−1</sub>
,
<italic>p</italic>
) and (
<italic>s</italic>
<sub>
<italic>k</italic>
−1</sub>
′,
<italic>p</italic>
′), are glued together into a single vertex if
<italic>s</italic>
<sub>
<italic>k</italic>
−1</sub>
=
<italic>s</italic>
<sub>
<italic>k</italic>
−1</sub>
′ and
<italic>p</italic>
∈[
<italic>p</italic>
′ − Δ,
<italic>p</italic>
′+Δ]. Hence, two positional (
<italic>k</italic>
− 1)-mers are glued together if their sequences are the same and their positions are within Δ from each other. Finally, each edge is weighted by the multiplicity of the corresponding positional
<italic>k</italic>
-mer. In
<xref ref-type="fig" rid="F2">Figure 2</xref>
, we show an example of a positional de Bruijn graph and a de Bruijn graph constructed from the same dataset.
<fig id="F2" position="float">
<label>Fig. 2.</label>
<caption>
<p>An example illustrating the positional de Bruijn graph (
<italic>k</italic>
=4, Δ=1) and de Bruijn graph on a set of aligned reads, with their corresponding sets of
<italic>k</italic>
-mers and positional
<italic>k</italic>
-mers. There exists a single sequencing error in the reads (shown in red). In the de Bruijn graph, the (
<italic>k</italic>
− 1)-mer GCC appears as a single vertex, whereas, the positional de Bruijn graph separates the occurrence of GCC into two vertices. This additional information incorporated into the graph further constraints the gluing process and reduces complexity. Further, the positional
<italic>k</italic>
-mers (GCCT, 111) and (GCCT,975) have multiplicity 1 and 4, respectively, but the
<italic>k</italic>
-mer GCCT has multiplicity 5. This increases the weight of the incorrect path, and thus the likelihood of an error in the contig produced by the de Bruijn graph. Lastly, we note that in this example no vertex gluing operations occur but in more complex instances, vertex gluing will occur when equal
<italic>k</italic>
-mers align at adjacent positions</p>
</caption>
<graphic xlink:href="bts219f2"></graphic>
</fig>
</p>
</sec>
<sec id="SEC3.3">
<title>3.3 Whirl removal in the positional de Bruijn graph</title>
<p>Due to the fact that the positional de Bruijn graph is built on a substantially smaller dataset (i.e. only reads that permissively align to a certain contig) and incorporates the approximate position of each
<italic>k</italic>
-mer in the contig, it is significantly less complex than the standard de Bruijn graph for the same contig and far less complex than the de Bruijn graph of an entire assembly. The relative simplicity of the graph is important since it decreases the ambiguities associated with bulge and whirl removal. While whirls are a substantial problem in the de Bruijn graph, the occurrence of whirls in the
<italic>positional</italic>
de Bruijn graph is extremely rare; the number of whirls encountered by SEQuel while refining the assemblies of
<italic>Escherichia coli</italic>
with Euler-SR, Velvet and Velvet-SC was less than five for each assembly. Nonetheless, the occurrence of whirls is possible. We follow the whirl processing algorithm from
<xref ref-type="bibr" rid="B27">Pevzner
<italic>et al</italic>
. (2004</xref>
), since the logic of whirl processing in the positional de Bruijn graph is similar to that in the de Bruijn graph. Bulges are implicitly removed in the next step of the SEQuel algorithm.</p>
</sec>
<sec id="SEC3.4">
<title>3.4 Refined contig construction</title>
<p>The final step of SEQuel is to refine the original contig using the positional de Bruijn graph. We note that since only permissively aligned reads were used to construct the positional de Bruijn graph (rather than the complete set of reads) the graph may be disconnected, leading to the construction of several contiguous sequences. These smaller contiguous sequences may not cover the original contig entirely. However, this does not pose a problem because we are using these sequences to refine the original contig. We find the heaviest path in each of the connected components and construct the contiguous sequences corresponding to these paths. As previously mentioned, the weight of each edge is equal to the multiplicity of the corresponding positional
<italic>k</italic>
-mer and therefore, the heaviest path ‘explains’ the largest number of positional
<italic>k</italic>
-mers. We refer to the sequences constructed from the positional de Bruijn graph as
<italic>partial contigs</italic>
and denote the set of partial contigs as {
<italic>c</italic>
<sub>1</sub>
,
<italic>c</italic>
<sub>2</sub>
,…,
<italic>c</italic>
<sub>
<italic>n</italic>
</sub>
}. The
<italic>average weight</italic>
of a partial contig is defined as the mean edge weight of the corresponding path. The partial contigs will be used to refine the original contig, denoted as
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
. Lastly, we denote the refined contig that is ultimately output as
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
<sub>
<italic>r</italic>
</sub>
.</p>
<p>In order to describe how
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
is refined using {
<italic>c</italic>
<sub>1</sub>
,
<italic>c</italic>
<sub>2</sub>
,…,
<italic>c</italic>
<sub>
<italic>n</italic>
</sub>
} we need some additional notation. We denote
<italic>u</italic>
<italic>v</italic>
as the concatenation of strings
<italic>u</italic>
and
<italic>v</italic>
. Given a local alignment of strings
<italic>u</italic>
=[
<italic>u</italic>
<sub>1</sub>
<italic>u</italic>
<sub>
<italic>n</italic>
</sub>
] and
<italic>v</italic>
=[
<italic>v</italic>
<sub>1</sub>
<italic>v</italic>
<sub>
<italic>m</italic>
</sub>
], where (
<italic>i</italic>
,
<italic>i</italic>
′) and (
<italic>j</italic>
,
<italic>j</italic>
′) are the respective start and end positions of the alignment on
<italic>u</italic>
and
<italic>v</italic>
, we denote
<italic>u</italic>
<italic>v</italic>
as
<italic>u</italic>
[1… (
<italic>i</italic>
−1)]○
<italic>v</italic>
[
<italic>j</italic>
<italic>j</italic>
′]○
<italic>u</italic>
[(
<italic>i</italic>
′+1)…
<italic>n</italic>
]. The refinement process starts by setting
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
<sub>
<italic>r</italic>
</sub>
equal to
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
. Next, for each partial contig
<italic>c</italic>
<sub>
<italic>i</italic>
</sub>
∈{
<italic>c</italic>
<sub>1</sub>
,
<italic>c</italic>
<sub>2</sub>
,…,
<italic>c</italic>
<sub>
<italic>n</italic>
</sub>
}, we let
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
<sub>
<italic>r</italic>
</sub>
be equal to
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
<sub>
<italic>r</italic>
</sub>
<italic>c</italic>
<sub>
<italic>i</italic>
</sub>
.</p>
<p>The order in that partial contigs are used to refine
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
<sub>
<italic>r</italic>
</sub>
is important because the alignments of several partial contigs to
<inline-graphic xlink:href="bts219i1.jpg"></inline-graphic>
<sub>
<italic>r</italic>
</sub>
may overlap. In positions where such an overlap occurs, any changes from previously used partial contigs will be overwritten by the last. When coverage is uniform we process the partial contigs in order of increasing length, however, when it is highly non-uniform we process them in order of increasing average weight. In both cases, ties are broken arbitrarily and alignments below a certain length are not considered. Thus, SEQuel has two user-defined modes corresponding to the described scenarios: standard and single-cell mode.</p>
</sec>
<sec id="SEC3.5">
<title>3.5 Software implementation</title>
<p>SEQuel is implemented in Java 6.0, and can optionally be run as a multi-threaded application. All tests were performed on a PC with 32 cores (64-bit, 2.27 GHz) and 512 GB of RAM running Linux. Although benchmarking was performed on this computer, SEQuel can be run on a standard desktop; see
<xref ref-type="sec" rid="SEC4.5">Section 4.5</xref>
.</p>
</sec>
</sec>
<sec id="SEC4">
<title>4 RESULTS</title>
<sec id="SEC4.1">
<title>4.1 Datasets</title>
<p>In order to evaluate the performance of SEQuel, we use three different datasets described in
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
. (2011</xref>
). All datasets consist of paired-end 100 bp reads from
<italic>E.coli</italic>
, generated by Illumina, Inc. on the Genome Analyzer (GA) IIx platform. The first dataset consists of approximately 27 million paired-end reads, and was obtained from the NCBI Short Read Archive (accession ERA000206, EMBL-EBI Sequence Read Archive). As a measure of quality assurance, we aligned the reads to the
<italic>E.coli</italic>
genome using BWA version 0.5.9 (
<xref ref-type="bibr" rid="B20">Li
<italic>et al</italic>
., 2009</xref>
) with default parameters. We call a read
<italic>mapped</italic>
if BWA outputs an alignment for it and
<italic>unmapped</italic>
otherwise. Analysis of the alignments revealed that 98% of the reads mapped to the reference genome, representing an average depth of approximately 600×; BLAST analysis against known contaminants revealed that the unmapped reads are attributed to minor contamination of the sample (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
).</p>
<p>The second dataset is a single-cell dataset consisting of approximately 29 million paired-end reads. Again, we aligned the reads to the
<italic>E.coli</italic>
reference genome and observed that 92 of the reads mapped to the reference genome, representing an average coverage of approximately 600×; unmapped reads have been attributed to contamination of the data (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
;
<xref ref-type="bibr" rid="B31">Rodrigue
<italic>et al</italic>
., 2009</xref>
). The coverage in this dataset is non-uniformly distributed across the genome and fluctuates greatly.</p>
<p>Lastly, we used SEQuel to improve the genome assembly of an uncultured Deltaproteobacterium
<italic>SAR324</italic>
. This assembly represents a first draft of the genome of this marine bacterium (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
). Sequence reads and assembly are both available for download at
<ext-link ext-link-type="uri" xlink:href="http://bix.ucsd.edu/singlecell/">http://bix.ucsd.edu/singlecell/</ext-link>
and a detailed description of the assembly is given in (
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
., 2011</xref>
).</p>
</sec>
<sec id="SEC4.2">
<title>4.2 Performance on standard
<italic>E.coli</italic>
Data</title>
<p>We generated the assemblies with Euler-SR v2.0 and Velvet v1.1 in paired-end mode with suggested parameters (i.e.
<italic>k</italic>
=55 for Euler-SR and
<italic>k</italic>
=31 for Velvet). We generated an additional assembly with Velvet using
<italic>k</italic>
=55.
<xref ref-type="author-notes" rid="FN1">
<sup>1</sup>
</xref>
All assemblers accurately assembled the majority of the
<italic>E.coli</italic>
genome, however, there was a range in terms of the accuracy of the contigs. We limited our attention to contigs of length ≥ 250 bp and ran SEQuel with default parameters of
<italic>k</italic>
=50 and Δ=15. Contigs were aligned to the
<italic>E.coli</italic>
reference genome (NCBI Accession NC_000913.2) using BLAT (
<xref ref-type="bibr" rid="B18">Kent, 2002</xref>
) and both mismatches and indels were counted and compared across contigs before and after the use of SEQuel.</p>
<p>Both the Euler-SR and Velvet assemblies showed a decrease in the number of substitution errors and indels relative to the reference genome after the use of SEQuel. The Euler-SR assembly with SEQuel showed a significant decrease both in the number of positions mismatching the reference genome and in the number of indels. The number of indels of size ≤!50 bp was reduced from 145 to 79, and the number of substitution errors was reduced from 155 to 104 with the use of SEQuel.
<xref ref-type="fig" rid="F3">Figure 3</xref>
and
<xref ref-type="table" rid="T1">Table 1</xref>
illustrate these changes to the assembly.
<fig id="F3" position="float">
<label>Fig. 3.</label>
<caption>
<p>Illustration of the change in the number of short (≤50 bp) indels (
<bold>a</bold>
) and substitution errors (
<bold>b</bold>
) relative to the reference genome before and after the use of SEQuel. Standard reads were assembled using Euler-SR and Velvet. The assembly without SEQuel and with SEQuel is shown in blue and red, respectively</p>
</caption>
<graphic xlink:href="bts219f3"></graphic>
</fig>
<table-wrap id="T1" position="float">
<label>Table 1.</label>
<caption>
<p>Refinement statistics of assemblies from the standard
<italic>E.coli</italic>
dataset</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Refinement statistics</th>
<th rowspan="1" colspan="1">Euler-SR</th>
<th rowspan="1" colspan="1">Velvet (
<italic>k</italic>
=31)</th>
<th rowspan="1" colspan="1">Velvet (
<italic>k</italic>
=55)</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Matches added (lost)</td>
<td rowspan="1" colspan="1">754 (39)</td>
<td rowspan="1" colspan="1">17 010 (25)</td>
<td rowspan="1" colspan="1">45 (2)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Substitution errors removed (created)</td>
<td rowspan="1" colspan="1">69 (18)</td>
<td rowspan="1" colspan="1">8490 (5)</td>
<td rowspan="1" colspan="1">44 (2)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Added matches may originate from two sources: correcting substitution errors or correcting contig deletions. Lost matches may originate from the two sources: creating substitution errors or correcting contig insertions.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>Similar results were obtained when applying SEQuel to the Velvet (
<italic>k</italic>
=31) assembly (
<xref ref-type="fig" rid="F3">Fig. 3</xref>
). The most significant change to this assembly was the sharp decrease in the number of substitution errors relative to the reference genome. Using SEQuel, the number of substitution errors went from 8937 to 452, correcting 94% of the total number of substitution errors in the assembly. The number of short indels in this assembly went from 979 to 342. The Velvet assembly with
<italic>k</italic>
=55 showed a substantially smaller number of indels and substitution errors (e.g. 50 substitution errors in the complete assembly), leaving less room for improvement. Nevertheless, using SEQuel, the number of short indels in the Velvet assembly with
<italic>k</italic>
=55 went from 3 to 2 and the number of substitution errors went from 50 to 8.
<xref ref-type="fn" rid="FN2">
<sup>2</sup>
</xref>
</p>
</sec>
<sec id="SEC4.3">
<title>4.3 Performance on single-cell
<italic>E.coli</italic>
data</title>
<p>Assembly of single-cell data is inherently more challenging due to the highly non-uniform coverage. Consequently, the contigs obtained from assembly of single-cell data using standard assembly tools are more error prone. In 2011,
<xref ref-type="bibr" rid="B5">Chitsaz
<italic>et al</italic>
. (2011</xref>
) developed Velvet-SC, a specialized assembler tailored to handle the dramatic fluctuations in coverage that are characteristic to single-cell data. In our single-cell experiments we used Velvet-SC and Euler-SR. We performed the assemblies in paired-end mode with
<italic>k</italic>
=55. SEQuel was used in single−cell mode with
<italic>k</italic>
=50 and Δ=15. Contigs were aligned to the
<italic>E.coli</italic>
reference genome (NCBI Accession NC_000913.2) using BLAT (
<xref ref-type="bibr" rid="B18">Kent, 2002</xref>
) and both mismatches and indels were counted and compared across contigs before and after the use of SEQuel. Again, we restrict interest to contigs with length ≥250 bp.</p>
<p>As expected, the quality of the Euler-SR assembly is substantially lower than the quality of the Velvet-SC assembly since the former is not equipped to deal with single-cell data. However, both the Euler-SR and Velvet-SC assemblies have a dramatic decrease in the number of substitution errors and the number of indels with the application of SEQuel. In the Euler-SR assembly, the number of indels of size ≤50 bp went from 908 to 379, and the number of substitution errors went from 221 to 109 with the use of SEQuel.
<xref ref-type="fig" rid="F4">Figure 4</xref>
illustrates these changes to the assembly.
<xref ref-type="table" rid="T2">Table 2</xref>
gives the total number of matches added and lost, and the total number of substitution errors removed and created by SEQuel. Similar results to Euler-SR were obtained with Velvet-SC. The most significant change in the Velvet-SC assembly was the decrease in the number of indels in the assembly: the number of indels went from 47 to 18, and the number of substitution errors went from 152 bp to 73 bp.
<fig id="F4" position="float">
<label>Fig. 4.</label>
<caption>
<p>Illustration of the change in the total number of short (≤50 bp) indels (
<bold>a</bold>
) and substitution errors (
<bold>b</bold>
) in assemblies before and after the use of SEQuel. Paired-end reads from a single-cell sample were assembled using Euler-SR and Velvet-SC. The assembly without SEQuel and with SEQuel is shown in blue and red, respectively</p>
</caption>
<graphic xlink:href="bts219f4"></graphic>
</fig>
<table-wrap id="T2" position="float">
<label>Table 2.</label>
<caption>
<p>Refinement statistics of assemblies from single-cell
<italic>E.coli</italic>
reads</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Refinement statistics</th>
<th rowspan="1" colspan="1">Euler-SR</th>
<th rowspan="1" colspan="1">Velvet-SC</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Matches added (lost)</td>
<td rowspan="1" colspan="1">3100 (200)</td>
<td rowspan="1" colspan="1">93 (18)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Substitution errors removed (created)</td>
<td rowspan="1" colspan="1">126 (14)</td>
<td rowspan="1" colspan="1">80 (1)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Added matches may originate from two sources: correcting substitution errors or correcting contig deletions. Lost matches may originate from the two sources: creating substitution errors or correcting contig insertions.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="SEC4.4">
<title>4.4 Refinement of deltaproteobacterium SAR324 single-cell assembly</title>
<p>We ran SEQuel in single-cell mode with
<italic>k</italic>
=50 and Δ=15. SEQuel made changes to 335 of the 602 contigs. Approximately 800 bp were changed overall: 191 substitutions, 584 insertions and 42 deletions. We can only report changes made to the assembly since there exists no reference genome for the bacteria. However, based on the results obtained with SEQuel on a single-cell
<italic>E.coli</italic>
assembly, we extrapolate that the majority of these changes represent a (positive) refinement to the assembly.</p>
</sec>
<sec id="SEC4.5">
<title>4.5 Practical considerations: memory and time</title>
<p>We evaluated the memory and time requirements of SEQuel. Since SEQuel is a multi-threaded application, its wall-time depends on the computing resources available to the user. For evaluation purposes we used four threads, a setting that is suitable for most current desktops. The memory requirements depend on a number of factors, including contig length, coverage depth and the level of parallelization. For the
<italic>E.coli</italic>
and Deltaproteobacterium
<italic>SAR324</italic>
assemblies, SEQuel required a maximum of 6 GB and 1.5 h to complete (see
<xref ref-type="table" rid="T3">Table 3</xref>
).
<table-wrap id="T3" position="float">
<label>Table 3.</label>
<caption>
<p>Running time and memory usage of SEQuel on bacterial genomes</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Genome</th>
<th rowspan="1" colspan="1">Genome size</th>
<th rowspan="1" colspan="1">Time</th>
<th rowspan="1" colspan="1">Memory</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">
<italic>Escherichia coli</italic>
</td>
<td rowspan="1" colspan="1">4.6 Mb</td>
<td rowspan="1" colspan="1">55 min</td>
<td rowspan="1" colspan="1">6 GB</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<italic>SAR324</italic>
</td>
<td rowspan="1" colspan="1">4.9–6.4 Mb</td>
<td rowspan="1" colspan="1">87 min</td>
<td rowspan="1" colspan="1">6 GB</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>SAR324</italic>
has not been finished, thus the exact size of the genome remains unknown.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>From a practical perspective, the time and memory requirements of SEQuel do not significantly increase upon those of most NGS assemblers, making it an easy and practical post-processing step. Further, due to the fact that it is performed contig-wise rather than genome-wise, our method should easily scale to larger genomes (i.e. mammalian genomes).</p>
</sec>
</sec>
<sec id="SEC5">
<title>5 DISCUSSION</title>
<sec id="SEC5.1">
<title>5.1 Comparison between SEQuel and variation discovery tools</title>
<p>We compared SEQuel with a simpler method of error correction provided by extensions of variation discovery tools. We aligned the reads to the contigs using BWA, and then applied GATK base quality score recalibration, variation discovery (i.e. SNP and indel detection) and variant quality score recalibration (
<xref ref-type="bibr" rid="B7">DePristo
<italic>et al</italic>
., 2011</xref>
;
<xref ref-type="bibr" rid="B11">McKenna
<italic>et al</italic>
., 2010</xref>
). We then incorporated these variants into the original contigs. We performed this process on both the single-cell and standard multi-cell
<italic>E.coli</italic>
assemblies considered in this article. Although this method is conceptually simpler and does not require the development of another tool, the performance was poor in comparison to that of SEQuel. This method corrected approximately one-third of the errors corrected by SEQuel,
<xref ref-type="fn" rid="FN3">
<sup>3</sup>
</xref>
and created 40–50% more errors than SEQuel.</p>
</sec>
<sec id="SEC5.2">
<title>5.2 Assessment of errors corrected by SEQuel</title>
<p>We ran several experiments to better characterize the errors that SEQuel corrects. The assemblies produced by running Velvet (
<italic>k</italic>
=31 and
<italic>k</italic>
=55) on simulated,
<italic>perfect</italic>
reads generated from the reference
<italic>E.coli</italic>
genome (200 × coverage) had only a negligible number of errors (less than 10 for
<italic>k</italic>
=31 and less than 5 for
<italic>k</italic>
=55). Thus, almost all errors in the assembly of the
<italic>E.coli</italic>
genome originate from sequencing errors. These sequencing errors give rise to erroneous
<italic>k</italic>
-mers that correspond to spurious edges in the de Bruijn graph. Such spurious edges may cause whirls and bulges, which cannot always be reconciled correctly by the assembler.</p>
<p>In this section, we show two examples of assembly errors that are caused by erroneous
<italic>k</italic>
-mers, and are corrected by SEQuel. Before going into these examples in detail, we give a general description of how erroneous
<italic>k</italic>
-mers can lead to complexity in the de Bruijn graph. Consider a read
<italic>r</italic>
sampled from a position
<italic>i</italic>
of the genome. Assuming
<italic>i</italic>
does not lie within a repeated region,
<italic>k</italic>
-mers from
<italic>r</italic>
should contribute only to contigs from the region proximal to
<italic>i</italic>
. In practice,
<italic>k</italic>
-mers from erroneous reads originating in one position may contribute to contigs from completely unrelated regions of the genome. For example, an erroneous ‘read’ AATACCC sampled from the genomic region AATGCCC (single substitution, from G to A) will contribute a
<italic>k</italic>
-mer (
<italic>k</italic>
=4, ATAC) to the contig …GG
<bold>ATAC</bold>
TT…., rather than the correct contig …A
<bold>ATGC</bold>
CC. This reallocation of
<italic>k</italic>
-mers may occur either within different regions of the same contig (second example below), or across different contigs (first example below). Both the following examples are from the Velvet (
<italic>k</italic>
=31) assembly of standard
<italic>E.coli</italic>
reads.</p>
<p>First, we focus our attention to a single contig (contig 170157) that illustrates how erroneous
<italic>k</italic>
-mers from reads that permissively align to a single contig can be responsible for complexities that the assembler cannot rectify. This contig has two insertions in alignment to the
<italic>E.coli</italic>
genome, both of which were corrected by SEQuel. We constructed the de Bruijn graph from only the set of reads that permissively aligned to this contig. As shown in
<xref ref-type="fig" rid="F5">Figure 5</xref>
, this graph contains bulges and whirls in the exact locations corresponding to the gaps in alignment. In contrast, the positional de Bruijn graph on the same set of reads is significantly simpler (contains a single long path); the positional information constraints graph construction and eliminates many of the spurious edges. SEQuel corrected both insertions in contig 170157.
<fig id="F5" position="float">
<label>Fig. 5.</label>
<caption>
<p>The first illustration of the connection between assembly errors, and whirls and bulges in the de Bruijn graph. The alignment of a 1975 bp contig from the assembly with Velvet and
<italic>k</italic>
=31 (contig number 170157), showing two insertions in the alignment, having respective lengths 1 bp and 15 bp. The de Bruijn graph constructed from the set of permissively aligned reads to this contig contains bulges and whirls at regions corresponding to the insertions in the contigs</p>
</caption>
<graphic xlink:href="bts219f5"></graphic>
</fig>
</p>
<p>Our second example illustrates how erroneous
<italic>k</italic>
-mers are recruited from distant regions of the de Bruijn graph and form spurious edges that eventually lead to assembly errors. We constructed the de Bruijn graph from the set of reads used to assemble contig 10362 and two other contigs from the same assembly. The de Bruijn graph built from
<italic>only</italic>
the reads that permissively aligned to contig 10362 is simple enough that it led to an error-free contig, but the de Bruijn graph built from the larger set of reads has whirls and bulges at the exact locations where deletions in the alignment occur (
<xref ref-type="fig" rid="F6">Fig. 6</xref>
). The deletions likely occurred due to
<italic>k</italic>
-mers originating in erroneous reads associated with other contigs that were recruited to an incorrect region (that of contig 10362) in the de Bruijn graph. Since SEQuel builds the positional de Bruijn graph on only the set of reads that permissively align to a single contig (and not the complete set of reads), the complexity caused by erroneous
<italic>k</italic>
-mers being recruited from other regions in the graph is eliminated.
<fig id="F6" position="float">
<label>Fig. 6.</label>
<caption>
<p>The second illustration of the connection between assembly errors, and whirls and bulges in the de Bruijn graph. The alignment of a 725 bp contig from the assembly with Velvet and
<italic>k</italic>
=31 (contig number 10362) shows two deletions in the contig, having respective lengths 20 bp and 7 bp. The regions in the de Bruijn graph corresponding to the deletions in alignment are complex and contain bulges and whirls that likely lead to assembly errors</p>
</caption>
<graphic xlink:href="bts219f6"></graphic>
</fig>
</p>
<p>The graphs shown in
<xref ref-type="fig" rid="F5">Figures 5</xref>
and
<xref ref-type="fig" rid="F6">6</xref>
are only estimates of the relevant regions in the de Bruijn graph constructed by the assembler; the actual graph includes
<italic>k</italic>
-mers from the entire dataset (not just
<italic>k</italic>
-mers that aligned to a single or several contigs). Both examples reveal the complexity inherent to the de Bruijn graph even on a small set of reads and provide insight into possible causes of assembly errors. The positional de Bruijn graph constructed from the same reads is substantially less complex. The recruitment of
<italic>k</italic>
-mers from one contig to an unrelated contig (as shown in
<xref ref-type="fig" rid="F6">Fig. 6</xref>
) presents a severe problem to NGS assemblers, since the position of
<italic>k</italic>
-mers cannot be ascertained prior to assembly. A post-processing step is needed to define relative position and reconcile the complexity of the original de Bruijn graph, thus the second example serves to illustrate the necessity of a post-processing step to assembly.</p>
</sec>
</sec>
<sec id="SEC6">
<title>6 CONCLUSIONS</title>
<p>To the best of our knowledge, a post-processing step for obtaining more accurate NGS assemblies has not been previously considered. While
<xref ref-type="bibr" rid="B8">Donmez and Brudno (2011</xref>
) state ‘The consensus sequence for each contig is generated using a greedy multiple sequence alignment’, no further detail is given, and the effects of this process are not evaluated. Furthermore, it cannot be used in combination with another assembler. Additionally, there is some similarity between SEQuel and the approach implemented by the LOCAS re-assembly tool (
<xref ref-type="bibr" rid="B19">Klein
<italic>et al</italic>
., 2011</xref>
), where positional information is used to partition reads into groups that will be assembled together. The novelty of our approach lies in the fact that we explicitly incorporate the positions of
<italic>k</italic>
-mers into the graph theoretical model.</p>
<p>While SEQuel significantly reduces the amount of assembly errors, it does not address the notoriously difficult problem of
<italic>repeat separation</italic>
in fragment assembly. Although significant efforts have been made to resolve this problem in Sanger sequencing (Kececioglu and Yu, 2001; (
<xref ref-type="bibr" rid="B23">Myers, 2001</xref>
;
<xref ref-type="bibr" rid="B33">Tammi
<italic>et al</italic>
., 2002</xref>
;
<xref ref-type="bibr" rid="B36">Zhi
<italic>et al</italic>
., 2007</xref>
)), it remains poorly addressed in the context of NGS. The positional de Bruijn graph offers an opportunity to revisit this problem in future research.</p>
<p>Our results demonstrate that a substantial number of small indels and substitution errors can be corrected in both single-cell and standard (multi-cell) assemblies. In particular, our results on single-cell assemblies further illustrate that high and uniform coverage is not a requirement for SEQuel. Lastly, we show that the corrections made by SEQuel can likely be accomplished only in post-processing of assembly, since the positional information cannot be inferred prior to assembly.</p>
</sec>
</body>
<back>
<fn-group>
<fn id="FN2">
<p>
<sup>1</sup>
Velvet generates assemblies that are substantially more accurate using
<italic>k</italic>
=55, however this is a more advanced parameterization since it requires recompilation of the source code with a slight modification. Thus, typical use of Velvet by non-expert users would be with
<italic>k</italic>
=31.</p>
</fn>
<fn id="FN3">
<p>
<sup>2</sup>
To illustrate that SEQuel can be used with any assembler, we further tested it with SOAPdenovo (
<xref ref-type="bibr" rid="B22">Li
<italic>et al</italic>
., 2010</xref>
) and SPAdes (
<xref ref-type="bibr" rid="B2">Bankevich
<italic>et al</italic>
., 2012</xref>
), a new NGS assembler (
<ext-link ext-link-type="uri" xlink:href="http://bioinf.spbau.ru/spades">http://bioinf.spbau.ru/spades</ext-link>
). On the multi-cell
<italic>E.coli</italic>
data, the SOAPdenovo (SPAdes) assembly had 1230 (199) substitution errors, and 27 (14) small indels totaling 590 (94) bp. SEQuel reduced the number of substitution errors in SOAPdenovo (SPAdes) by 95% (26%), and the total indel size by 80% (91%). The number of substitution errors in the SOAPdenovo assembly was dramatically reduced, illustrating that the application of SEQuel can improve this assembly so that it is nearly as accurate as that of Velvet, which is currently the most accurate assembler.</p>
</fn>
<fn id="FN4">
<p>
<sup>3</sup>
The largest number of corrections made was in the Velvet (
<italic>k</italic>
=31) assembly of standard multi-cell data, and the number of corrections was 65% less than that of SEQuel.</p>
</fn>
</fn-group>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>The authors would like to thank Glenn Tesler from the University of California, San Diego, for many insightful discussions. The authors wish to thank the reviewers for their insightful comments.</p>
<p>
<italic>Funding</italic>
:
<funding-source>CB and PP</funding-source>
were funded by
<funding-source>National Institutes of Health</funding-source>
(grant
<award-id>3P41RR024851-02S1</award-id>
) for the duration of this research. RR was funded by
<funding-source>National Science Foundation</funding-source>
(grant
<award-id>CCF-1115206</award-id>
). Finally, CB was partially funded by
<funding-source>Natural Sciences and Engineering Research Council of Canada</funding-source>
,
<funding-source>Postdoctoral Fellowship Program</funding-source>
.</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="B1">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alkan</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Limitations of next-generation genome sequence assembly</article-title>
<source>Nature Meth.</source>
<year>2011</year>
<volume>8</volume>
<fpage>61</fpage>
<lpage>65</lpage>
</element-citation>
</ref>
<ref id="B2">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bankevich</surname>
<given-names>A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SPAdes: a New Genome Assembly Algorithm and its Applications to Single-Cell Sequencing</article-title>
<source>J. Comp. Bio.</source>
<year>2012</year>
<volume>19</volume>
<fpage>455</fpage>
<lpage>477</lpage>
</element-citation>
</ref>
<ref id="B3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bentley</surname>
<given-names>D.R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Accurate whole genome sequencing using reversible terminator chemistry</article-title>
<source>Nature</source>
<year>2008</year>
<volume>456</volume>
<fpage>53</fpage>
<lpage>59</lpage>
<pub-id pub-id-type="pmid">18987734</pub-id>
</element-citation>
</ref>
<ref id="B4">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Butler</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ALLPATHS: De novo assembly of whole-genome shotgun microreads</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>810</fpage>
<lpage>820</lpage>
<pub-id pub-id-type="pmid">18340039</pub-id>
</element-citation>
</ref>
<ref id="B5">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chitsaz</surname>
<given-names>H.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Efficient de novo assembly of single-cell bacterial genomes from short-read datasets</article-title>
<source>Nature Biotech.</source>
<year>2011</year>
<volume>29</volume>
<fpage>915</fpage>
<lpage>921</lpage>
</element-citation>
</ref>
<ref id="B6">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Compeau</surname>
<given-names>P.E.C.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>How to apply de Bruijn graphs to genome assembly</article-title>
<source>Nature Biotech.</source>
<year>2011</year>
<volume>29</volume>
<fpage>987</fpage>
<lpage>991</lpage>
</element-citation>
</ref>
<ref id="B7">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>DePristo</surname>
<given-names>M.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A framework for variation discovery and genotyping using next-generation DNA sequencing data</article-title>
<source>Nature Gene.</source>
<year>2011</year>
<volume>43</volume>
<fpage>491</fpage>
<lpage>498</lpage>
</element-citation>
</ref>
<ref id="B8">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Donmez</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Brudno</surname>
<given-names>M.</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Bafna</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Cenk</surname>
<given-names>S.</given-names>
</name>
</person-group>
<article-title>Hapsembler: as assembler for highly polymorphic genomes</article-title>
<source>RECOMB 2011</source>
<year>2011</year>
<volume>6577</volume>
<publisher-loc>Heidelberg</publisher-loc>
<publisher-name>Springer</publisher-name>
<fpage>38</fpage>
<lpage>51</lpage>
<comment>LNCS</comment>
</element-citation>
</ref>
<ref id="B9">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ewing</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>P.</given-names>
</name>
</person-group>
<article-title>Base-calling of automated sequencer traces using Phred.II. ErrorProbabilities</article-title>
<source>Genome Res.</source>
<year>1998</year>
<volume>8</volume>
<fpage>186</fpage>
<lpage>194</lpage>
<pub-id pub-id-type="pmid">9521922</pub-id>
</element-citation>
</ref>
<ref id="B10">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ewing</surname>
<given-names>B.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Base-calling of automated sequencer traces using Phred.I.Accuracy assessment</article-title>
<source>Genome Res.</source>
<year>1998</year>
<volume>8</volume>
<fpage>175</fpage>
<lpage>185</lpage>
<pub-id pub-id-type="pmid">9521921</pub-id>
</element-citation>
</ref>
<ref id="B11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McKenna</surname>
<given-names>A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data</article-title>
<source>Genome Res.</source>
<year>2010</year>
<volume>20</volume>
<fpage>1297</fpage>
<lpage>303</lpage>
<pub-id pub-id-type="pmid">20644199</pub-id>
</element-citation>
</ref>
<ref id="B12">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hannenhalli</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Positional sequencing by hybridization</article-title>
<source>CABIOS</source>
<year>1996</year>
<volume>12</volume>
<fpage>19</fpage>
<lpage>24</lpage>
<pub-id pub-id-type="pmid">8670615</pub-id>
</element-citation>
</ref>
<ref id="B13">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hirschberg</surname>
<given-names>D.S.</given-names>
</name>
</person-group>
<article-title>A linear space algorithm for computing maximal common subsequences</article-title>
<source>Comm. A.C.M.</source>
<year>1975</year>
<volume>18</volume>
<fpage>341</fpage>
<lpage>343</lpage>
</element-citation>
</ref>
<ref id="B14">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The genome of the cucumber</article-title>
<source>Cucumis sativus L. Nature Gen.</source>
<year>2009</year>
<volume>41</volume>
<fpage>1275</fpage>
<lpage>1281</lpage>
</element-citation>
</ref>
<ref id="B15">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Idury</surname>
<given-names>R.M.</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>M.S.</given-names>
</name>
</person-group>
<article-title>A new algorithm for dna sequence assembly</article-title>
<source>J. Comput. Biol.</source>
<year>1995</year>
<volume>2</volume>
<fpage>291</fpage>
<lpage>306</lpage>
<pub-id pub-id-type="pmid">7497130</pub-id>
</element-citation>
</ref>
<ref id="B16">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kececioglu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Separating Repeats in DNA Sequence Assembly</article-title>
<source>RECOMB 2001</source>
<year>2001</year>
<publisher-loc>New York, NY</publisher-loc>
<publisher-name>ACM</publisher-name>
<fpage>176</fpage>
<lpage>183</lpage>
</element-citation>
</ref>
<ref id="B17">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelley</surname>
<given-names>D.R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol.</source>
<year>2010</year>
<volume>11</volume>
<fpage>R116</fpage>
<pub-id pub-id-type="pmid">21114842</pub-id>
</element-citation>
</ref>
<ref id="B18">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>W.J.</given-names>
</name>
</person-group>
<article-title>BLAT – the BLAST-like alignment tool</article-title>
<source>Genome Res.</source>
<year>2002</year>
<volume>12</volume>
<fpage>656</fpage>
<lpage>664</lpage>
<pub-id pub-id-type="pmid">11932250</pub-id>
</element-citation>
</ref>
<ref id="B19">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klein</surname>
<given-names>J.D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>LOCAS–a low coverage assembly tool for resequencing projects</article-title>
<source>PLoS One</source>
<year>2011</year>
<volume>6</volume>
<fpage>e23455</fpage>
<pub-id pub-id-type="pmid">21858125</pub-id>
</element-citation>
</ref>
<ref id="B20">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R.</given-names>
</name>
</person-group>
<article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1754</fpage>
<lpage>1760</lpage>
<pub-id pub-id-type="pmid">19451168</pub-id>
</element-citation>
</ref>
<ref id="B21">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The sequence and de novo assembly of the giant panda genome</article-title>
<source>Nature</source>
<year>2010</year>
<volume>463</volume>
<fpage>311</fpage>
<lpage>317</lpage>
<pub-id pub-id-type="pmid">20010809</pub-id>
</element-citation>
</ref>
<ref id="B22">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
<source>Genome Res.</source>
<year>2010</year>
<volume>20</volume>
<fpage>265</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="pmid">20019144</pub-id>
</element-citation>
</ref>
<ref id="B23">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Myers</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>Optimally separating sequences</article-title>
<source>Genome Inform.</source>
<year>2001</year>
<volume>12</volume>
<fpage>165</fpage>
<lpage>174</lpage>
<pub-id pub-id-type="pmid">11791235</pub-id>
</element-citation>
</ref>
<ref id="B24">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medvedev</surname>
<given-names>P.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Error correction of high-throughput sequencing datasets with non-uniform coverage</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>27</volume>
<fpage>i137</fpage>
<lpage>i141</lpage>
<pub-id pub-id-type="pmid">21685062</pub-id>
</element-citation>
</ref>
<ref id="B25">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Genome</surname>
<given-names>10K Community of Scientists.</given-names>
</name>
</person-group>
<article-title>Genome 10k: a proposal to obtain whole-genome sequence for 10000 vertebrate species</article-title>
<source>J. Hered.</source>
<year>2009</year>
<volume>100</volume>
<fpage>659</fpage>
<lpage>674</lpage>
<pub-id pub-id-type="pmid">19892720</pub-id>
</element-citation>
</ref>
<ref id="B26">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pevzner</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Chaisson</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Short read fragment assembly of bacterial genomes</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>324</fpage>
<lpage>330</lpage>
<pub-id pub-id-type="pmid">18083777</pub-id>
</element-citation>
</ref>
<ref id="B27">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pevzner</surname>
<given-names>P.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo repeat classification and fragment assembly</article-title>
<source>Genome Res.</source>
<year>2004</year>
<volume>14</volume>
<fpage>1786</fpage>
<lpage>1796</lpage>
<pub-id pub-id-type="pmid">15342561</pub-id>
</element-citation>
</ref>
<ref id="B28">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pevzner</surname>
<given-names>P.A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>An eulerian path approach to DNA fragment assembly</article-title>
<source>Proc. Natl. Acad. Sci.</source>
<year>2001</year>
<volume>98</volume>
<fpage>9748</fpage>
<lpage>9753</lpage>
<pub-id pub-id-type="pmid">11504945</pub-id>
</element-citation>
</ref>
<ref id="B29">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raghunathan</surname>
<given-names>A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Genomic DNA amplification from a single bacterium</article-title>
<source>Appl. Environ. Microbiol.</source>
<year>2005</year>
<volume>71</volume>
<fpage>3342</fpage>
<lpage>3347</lpage>
<pub-id pub-id-type="pmid">15933038</pub-id>
</element-citation>
</ref>
<ref id="B30">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Robinson</surname>
<given-names>G.E.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Creating a buzz about insect genomes</article-title>
<source>Science</source>
<year>2011</year>
<volume>331</volume>
<fpage>1386</fpage>
<pub-id pub-id-type="pmid">21415334</pub-id>
</element-citation>
</ref>
<ref id="B31">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodrigue</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Whole genome amplification and de novo assembly of single bacterial cells</article-title>
<source>PLoS One</source>
<year>2009</year>
<volume>4</volume>
<fpage>e6864</fpage>
<pub-id pub-id-type="pmid">19724646</pub-id>
</element-citation>
</ref>
<ref id="B32">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simpson</surname>
<given-names>J.T.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ABySS: a parallel assembler for short read sequence data</article-title>
<source>Genome Res.</source>
<year>2009</year>
<volume>19</volume>
<fpage>1117</fpage>
<lpage>1123</lpage>
<pub-id pub-id-type="pmid">19251739</pub-id>
</element-citation>
</ref>
<ref id="B33">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tammi</surname>
<given-names>M.T.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Separation of nearly identical repeats in shotgun assemblies using defined nucloetide positions, DNPs</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>379</fpage>
<lpage>388</lpage>
<pub-id pub-id-type="pmid">11934736</pub-id>
</element-citation>
</ref>
<ref id="B34">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wheeler</surname>
<given-names>D.A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The complete genome of an individual by massively parallel DNA sequencing</article-title>
<source>Nature</source>
<year>2008</year>
<volume>452</volume>
<fpage>872</fpage>
<lpage>876</lpage>
<pub-id pub-id-type="pmid">18421352</pub-id>
</element-citation>
</ref>
<ref id="B35">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>D.R.</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E.</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de bruijn graphs</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="B36">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhi</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Correcting base-assignment errors in repeat regions of shotgun assembly</article-title>
<source>IEEE/ACM Trans. Comput. Biol. Bioinform.</source>
<year>2007</year>
<volume>4</volume>
<fpage>54</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="pmid">17277413</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B07 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B07 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3371851
   |texte=   SEQuel: improving the accuracy of genome assemblies
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:22689760" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021