MERS Exploration Server


Internal identifier: 000371 (Pmc/Corpus); previous: 0003709; next: 0003720


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads</title>
<author>
<name sortKey="Song, Li" sort="Song, Li" uniqKey="Song L" first="Li" last="Song">Li Song</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science, Johns Hopkins University, Baltimore, 21218 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Florea, Liliana" sort="Florea, Liliana" uniqKey="Florea L" first="Liliana" last="Florea">Liliana Florea</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science, Johns Hopkins University, Baltimore, 21218 USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, 21205 USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26500767</idno>
<idno type="pmc">4615873</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4615873</idno>
<idno type="RBID">PMC:4615873</idno>
<idno type="doi">10.1186/s13742-015-0089-y</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000371</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000371</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads</title>
<author>
<name sortKey="Song, Li" sort="Song, Li" uniqKey="Song L" first="Li" last="Song">Li Song</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science, Johns Hopkins University, Baltimore, 21218 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Florea, Liliana" sort="Florea, Liliana" uniqKey="Florea L" first="Liliana" last="Florea">Liliana Florea</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science, Johns Hopkins University, Baltimore, 21218 USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, 21205 USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">GigaScience</title>
<idno type="eISSN">2047-217X</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing.</p>
</sec>
<sec>
<title>Findings</title>
<p>We developed a
<italic>k</italic>
-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted
<italic>k</italic>
-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted
<italic>k</italic>
-mers, Rcorrector computes a local threshold at every position in a read.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from
<ext-link ext-link-type="uri" xlink:href="https://github.com/mourisl/Rcorrector/">https://github.com/mourisl/Rcorrector/</ext-link>
.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13742-015-0089-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Heo, Y" uniqKey="Heo Y">Y Heo</name>
</author>
<author>
<name sortKey="Wu, Xl" uniqKey="Wu X">XL Wu</name>
</author>
<author>
<name sortKey="Chen, D" uniqKey="Chen D">D Chen</name>
</author>
<author>
<name sortKey="Ma, J" uniqKey="Ma J">J Ma</name>
</author>
<author>
<name sortKey="Hwu, Wm" uniqKey="Hwu W">WM Hwu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, L" uniqKey="Song L">L Song</name>
</author>
<author>
<name sortKey="Florea, L" uniqKey="Florea L">L Florea</name>
</author>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X Yang</name>
</author>
<author>
<name sortKey="Chockalingam, Sp" uniqKey="Chockalingam S">SP Chockalingam</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, D" uniqKey="Kelley D">D Kelley</name>
</author>
<author>
<name sortKey="Schatz, M" uniqKey="Schatz M">M Schatz</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
<author>
<name sortKey="Scott, E" uniqKey="Scott E">E Scott</name>
</author>
<author>
<name sortKey="Kakaradov, B" uniqKey="Kakaradov B">B Kakaradov</name>
</author>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schroder, H" uniqKey="Schroder H">H Schröder</name>
</author>
<author>
<name sortKey="Puglisi, Sj" uniqKey="Puglisi S">SJ Puglisi</name>
</author>
<author>
<name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Fazayeli, F" uniqKey="Fazayeli F">F Fazayeli</name>
</author>
<author>
<name sortKey="Ilie, S" uniqKey="Ilie S">S Ilie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le, Hs" uniqKey="Le H">HS Le</name>
</author>
<author>
<name sortKey="Schulz, Mh" uniqKey="Schulz M">MH Schulz</name>
</author>
<author>
<name sortKey="Mccauley, Bm" uniqKey="Mccauley B">BM McCauley</name>
</author>
<author>
<name sortKey="Hinman, Vf" uniqKey="Hinman V">VF Hinman</name>
</author>
<author>
<name sortKey="Bar Joseph, Z" uniqKey="Bar Joseph Z">Z Bar-Joseph</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marçais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Griebel, T" uniqKey="Griebel T">T Griebel</name>
</author>
<author>
<name sortKey="Zacher, B" uniqKey="Zacher B">B Zacher</name>
</author>
<author>
<name sortKey="Ribeca, P" uniqKey="Ribeca P">P Ribeca</name>
</author>
<author>
<name sortKey="Raineri, E" uniqKey="Raineri E">E Raineri</name>
</author>
<author>
<name sortKey="Lacroix, V" uniqKey="Lacroix V">V Lacroix</name>
</author>
<author>
<name sortKey="Guig, R" uniqKey="Guig R">R Guigó</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Doring, A" uniqKey="Doring A">A Döring</name>
</author>
<author>
<name sortKey="Weese, D" uniqKey="Weese D">D Weese</name>
</author>
<author>
<name sortKey="Rausch, T" uniqKey="Rausch T">T Rausch</name>
</author>
<author>
<name sortKey="Reinert, K" uniqKey="Reinert K">K Reinert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author>
<name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pimentel, H" uniqKey="Pimentel H">H Pimentel</name>
</author>
<author>
<name sortKey="Kelley, R" uniqKey="Kelley R">R Kelley</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Walenz, B" uniqKey="Walenz B">B Walenz</name>
</author>
<author>
<name sortKey="Florea, L" uniqKey="Florea L">L Florea</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haas, Bj" uniqKey="Haas B">BJ Haas</name>
</author>
<author>
<name sortKey="Papanicolaou, A" uniqKey="Papanicolaou A">A Papanicolaou</name>
</author>
<author>
<name sortKey="Yassour, M" uniqKey="Yassour M">M Yassour</name>
</author>
<author>
<name sortKey="Grabherr, M" uniqKey="Grabherr M">M Grabherr</name>
</author>
<author>
<name sortKey="Blood, Pd" uniqKey="Blood P">PD Blood</name>
</author>
<author>
<name sortKey="Bowden, J" uniqKey="Bowden J">J Bowden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bankevich, A" uniqKey="Bankevich A">A Bankevich</name>
</author>
<author>
<name sortKey="Nurk, S" uniqKey="Nurk S">S Nurk</name>
</author>
<author>
<name sortKey="Antipov, D" uniqKey="Antipov D">D Antipov</name>
</author>
<author>
<name sortKey="Gurevich, Aa" uniqKey="Gurevich A">AA Gurevich</name>
</author>
<author>
<name sortKey="Dvorkin, M" uniqKey="Dvorkin M">M Dvorkin</name>
</author>
<author>
<name sortKey="Kulikov, As" uniqKey="Kulikov A">AS Kulikov</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gurevich, A" uniqKey="Gurevich A">A Gurevich</name>
</author>
<author>
<name sortKey="Saveliev, V" uniqKey="Saveliev V">V Saveliev</name>
</author>
<author>
<name sortKey="Vyahhi, N" uniqKey="Vyahhi N">N Vyahhi</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Gigascience</journal-id>
<journal-id journal-id-type="iso-abbrev">Gigascience</journal-id>
<journal-title-group>
<journal-title>GigaScience</journal-title>
</journal-title-group>
<issn pub-type="epub">2047-217X</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26500767</article-id>
<article-id pub-id-type="pmc">4615873</article-id>
<article-id pub-id-type="publisher-id">89</article-id>
<article-id pub-id-type="doi">10.1186/s13742-015-0089-y</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Technical Note</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Song</surname>
<given-names>Li</given-names>
</name>
<address>
<email>lsong10@jhu.edu</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Florea</surname>
<given-names>Liliana</given-names>
</name>
<address>
<email>florea@jhu.edu</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
Department of Computer Science, Johns Hopkins University, Baltimore, 21218 USA</aff>
<aff id="Aff2">
<label>2</label>
McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, 21205 USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>19</day>
<month>10</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>19</day>
<month>10</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>4</volume>
<elocation-id>48</elocation-id>
<history>
<date date-type="received">
<day>1</day>
<month>6</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>9</day>
<month>10</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© Song and Florea. 2015</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing.</p>
</sec>
<sec>
<title>Findings</title>
<p>We developed a
<italic>k</italic>
-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted
<italic>k</italic>
-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted
<italic>k</italic>
-mers, Rcorrector computes a local threshold at every position in a read.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from
<ext-link ext-link-type="uri" xlink:href="https://github.com/mourisl/Rcorrector/">https://github.com/mourisl/Rcorrector/</ext-link>
.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13742-015-0089-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Next-generation sequencing</kwd>
<kwd>RNA-seq</kwd>
<kwd>Error correction</kwd>
<kwd>
<italic>k</italic>
-mers</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2015</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1" sec-type="introduction">
<title>Introduction</title>
<p>Next-generation sequencing of cellular RNA (RNA-seq) has become the foundation of virtually every transcriptomic analysis. The large number of reads generated from a single sample allows researchers to study the genes being expressed and estimate their expression levels, and to discover alternative splicing and other sequence variations. However, biases and errors introduced at various stages during the experiment, in particular sequencing errors, can have a significant impact on bioinformatics analyses.</p>
<p>Systematic error correction of whole-genome sequencing (WGS) reads has been shown to improve the quality of alignment and assembly [
<xref ref-type="bibr" rid="CR1">1</xref>
–
<xref ref-type="bibr" rid="CR3">3</xref>
], two critical steps in analyzing next-generation sequencing data. There are currently several error correction methods for WGS reads, classified into three categories [
<xref ref-type="bibr" rid="CR4">4</xref>
].
<italic>K</italic>
-spectrum based methods, which are the most popular of the three, classify a
<italic>k</italic>
-mer as trusted or untrusted depending on whether the number of occurrences in the input reads exceeds a given threshold. Then, for each read, low-frequency (untrusted)
<italic>k</italic>
-mers are converted into high-frequency (trusted) ones. Candidate
<italic>k</italic>
-mers are stored in a data structure such as a Hamming graph, which connects
<italic>k</italic>
-mers within a fixed distance, or a Bloom filter. Methods in this category include Quake [
<xref ref-type="bibr" rid="CR5">5</xref>
], Hammer [
<xref ref-type="bibr" rid="CR6">6</xref>
], Musket [
<xref ref-type="bibr" rid="CR7">7</xref>
], Bless [
<xref ref-type="bibr" rid="CR1">1</xref>
], BFC [
<xref ref-type="bibr" rid="CR2">2</xref>
], and Lighter [
<xref ref-type="bibr" rid="CR3">3</xref>
]. Suffix tree and suffix array based methods build a data structure from the input reads, and replace a substring in a read if its number of occurrences falls below that expected given a probabilistic model. These methods, which include Shrec [
<xref ref-type="bibr" rid="CR8">8</xref>
], Hybrid-Shrec [
<xref ref-type="bibr" rid="CR9">9</xref>
] and HiTEC [
<xref ref-type="bibr" rid="CR10">10</xref>
], can handle multiple
<italic>k</italic>
-mer sizes. Lastly, multiple sequence alignment (MSA) based methods such as Coral [
<xref ref-type="bibr" rid="CR11">11</xref>
] and SEECER [
<xref ref-type="bibr" rid="CR12">12</xref>
] cluster reads that share
<italic>k</italic>
-mers to create a local vicinity and a multiple alignment, and use the consensus sequence as a guide to correct the reads.</p>
<p>RNA-seq sequence data differ from WGS data in several critical ways. First, while read coverage in WGS data is largely uniform across the genome, genes and transcripts in an RNA-seq experiment have different expression levels. Consequently, even low-frequency
<italic>k</italic>
-mers may be correct, belonging to a homolog or a splice isoform. Second, alternative splicing events can create multiple correct
<italic>k</italic>
-mers at the event boundaries, a phenomenon that occurs only at repeat regions for WGS reads. In both of these cases, the reads would be erroneously converted by a WGS correction method. Hence, error correctors for WGS reads are generally not well suited for RNA-seq sequences [
<xref ref-type="bibr" rid="CR13">13</xref>
].</p>
<p>There is so far only one other tool designed specifically for RNA-seq error correction, called SEECER [
<xref ref-type="bibr" rid="CR12">12</xref>
], based on the MSA approach. Given a read, SEECER attempts to determine its context (overlapping reads from the same transcript), characterized by a hidden Markov model, and to use this to identify and correct errors. One significant drawback, however, is the large amount of memory needed to index the reads. Herein we propose a novel
<italic>k</italic>
-spectrum based method, Rcorrector (RNA-seq error CORRECTOR), for RNA-seq data. Rcorrector uses a flexible
<italic>k</italic>
-mer count threshold, computing a different threshold for a
<italic>k</italic>
-mer within each read, to account for different transcript and gene expression levels. It also allows for multiple
<italic>k</italic>
-mer choices at any position in the read. Rcorrector only stores
<italic>k</italic>
-mers that appear more than once in the read set, which allows it to scale to large datasets. Accurate and efficient, Rcorrector is uniquely suited to datasets from species with large and complex genomes and transcriptomes, such as human, without requiring significant hardware resources. Rcorrector can also be applied to other types of data with non-uniform coverage, such as single-cell sequencing, as we show later. In the following sections we first present the algorithm and then evaluate this and other methods on both simulated and real data. In particular, we illustrate and compare the impact of several error correctors on two popular bioinformatics applications, namely alignment and assembly of reads.</p>
</sec>
<sec id="Sec2">
<title>Algorithm</title>
<sec id="Sec3">
<title>De Bruijn graph</title>
<p>In a first preprocessing stage, Rcorrector builds a De Bruijn graph of all
<italic>k</italic>
-mers that appear more than once in the input reads, together with their counts. To do so, Rcorrector uses Jellyfish2 [
<xref ref-type="bibr" rid="CR14">14</xref>
] to build a Bloom counter that detects
<italic>k</italic>
-mers occurring multiple times, and then stores these in a hash table. Intuitively, the graph encodes all transcripts (full or partial) that can be assembled from the input reads. At run time, for each read the algorithm finds the closest path in the graph, corresponding to its transcript of origin, which it then uses to correct the read.</p>
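<p>The preprocessing step can be illustrated with a minimal sketch. It assumes an in-memory counter in place of Jellyfish2 and its Bloom counter, ignores reverse complements, and uses hypothetical function names and toy reads; it is not Rcorrector's implementation. The edges of the De Bruijn graph are left implicit, since two stored k-mers are adjacent whenever they overlap by k-1 bases.</p>

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph_table(reads, k):
    """Keep only k-mers seen more than once, with their multiplicities."""
    return {kmer: m for kmer, m in count_kmers(reads, k).items() if m > 1}


# Toy reads, chosen only for illustration.
reads = ["AAGTCATAA", "AAGTCGTTA", "AGTCGTTAC"]
table = build_graph_table(reads, k=4)
for kmer in sorted(table):
    print(kmer, table[kmer])
```

<p>In this toy input, k-mers that occur only once (such as GTCA) are never stored, which is what keeps the table small on real datasets.</p>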
</sec>
<sec id="Sec4">
<title>Read error correction: the path search algorithm</title>
<p>As with any
<italic>k</italic>
-spectrum method, Rcorrector distinguishes between solid and non-solid
<italic>k</italic>
-mers as the basis for its correction algorithm. A solid
<italic>k</italic>
-mer is one that passes a given count threshold and therefore can be trusted to be correct. Rcorrector uses a flexible threshold for solid
<italic>k</italic>
-mers, which is calculated for each
<italic>k</italic>
-mer within each read sequence. At run time, Rcorrector scans the read sequence and, at each position, decides whether the next
<italic>k</italic>
-mer and each of its alternatives are solid and therefore represent valid continuations of the path. The path with the smallest number of differences from the read sequence, representing the likely transcript of origin, is then used to correct
<italic>k</italic>
-mers in the original read.</p>
<p>More formally, let
<italic>u</italic>
be a
<italic>k</italic>
-mer in read
<italic>r</italic>
and
<italic>S</italic>
(
<italic>u</italic>
,
<italic>c</italic>
) denote the successor
<italic>k</italic>
-mer for
<italic>u</italic>
when appending nucleotide
<italic>c</italic>
, with
<italic>c</italic>
∈{A,C,G,T}. For example, in Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
,
<italic>S</italic>
(AAGT,C)=AGTC,
<italic>k</italic>
=4. Let
<italic>M</italic>
(
<italic>u</italic>
) denote the multiplicity of
<italic>k</italic>
-mer
<italic>u</italic>
. To find a start node in the graph from which to search for a valid path, Rcorrector scans the read to identify a stretch of two or more consecutive solid
<italic>k</italic>
-mers, and marks these bases as solid. Starting from the longest stretch of solid bases, it proceeds in both directions, one base at a time as described below. By symmetry, we only illustrate the search in the 5
<sup>′</sup>
→3
<sup>′</sup>
direction.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Path extension in Rcorrector. Four possible path continuations at the AGTC
<italic>k</italic>
-mer (
<italic>k</italic>
=4) in the De Bruijn graph for the
<italic>r</italic>
= AAGTCATAA read sequence. Numbers in the vertices represent
<italic>k</italic>
-mer counts. The first (
<italic>top</italic>
) path corresponds to the original read’s representation in the De Bruijn graph. The extension is pruned after the first step, AGTC →GTCA, as the count
<italic>M</italic>
(
<italic>GTCA</italic>
)=4 falls below the local cutoff (determined based on the maximum
<italic>k</italic>
-mer count (494) of the four possible successors of AGTC). The second path (
<italic>yellow</italic>
) has higher
<italic>k</italic>
-mer counts but it introduces four corrections, changing the read into AAGTCCGTC. The third path (
<italic>blue</italic>
) introduces only two corrections, to change the sequence into AAGTCGTTA, and is therefore chosen to correct the read. The fourth (
<italic>bottom</italic>
) path is pruned as the
<italic>k</italic>
-mer count for GTCT does not pass the threshold. Paths 2 and 3 are likely to indicate paralogs and/or splice variants of this gene.</p>
</caption>
<graphic xlink:href="13742_2015_89_Fig1_HTML" id="d30e500"></graphic>
</fig>
</p>
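<p>The notation above can be made concrete with a short sketch. The function names and the toy counts are assumptions introduced for illustration; only S(AAGT,C)=AGTC and the rule of starting from the longest stretch of solid k-mers come from the text.</p>

```python
def successor(u, c):
    """S(u, c): drop the first base of k-mer u and append nucleotide c."""
    return u[1:] + c


def successor_counts(u, counts):
    """Multiplicity M(S(u, c)) for each c in A, C, G, T (0 if not stored)."""
    return {c: counts.get(successor(u, c), 0) for c in "ACGT"}


def longest_solid_stretch(solid_flags):
    """Return (start, length) of the longest run of True values."""
    best_start, best_len, run_start = 0, 0, None
    # A trailing False closes a run that reaches the end of the read.
    for i, flag in enumerate(list(solid_flags) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_len


# Hypothetical multiplicities for two successors of AGTC.
toy_counts = {"GTCA": 4, "GTCG": 120}
print(successor("AAGT", "C"))
print(successor_counts("AGTC", toy_counts))
print(longest_solid_stretch([True, False, True, True, True, False]))
```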
<p>Suppose
<italic>u</italic>
=
<italic>r</italic>
<sub>
<italic>i</italic>
</sub>
<italic>r</italic>
<sub>
<italic>i</italic>
+1</sub>
…
<italic>r</italic>
<sub>
<italic>i</italic>
+
<italic>k</italic>
−1</sub>
is the
<italic>k</italic>
-mer starting at position
<italic>i</italic>
in read
<italic>r</italic>
. Rcorrector considers all possible successors
<italic>S</italic>
(
<italic>u</italic>
,
<italic>c</italic>
),
<italic>c</italic>
∈{A,C,G,T}, and their multiplicities
<italic>M</italic>
(
<italic>S</italic>
(
<italic>u</italic>
,
<italic>c</italic>
)) and determines which ones are solid based on a locally defined threshold (see below). Rcorrector tests all the possible nucleotides for position
<italic>i</italic>
+
<italic>k</italic>
and retains those that lead to solid
<italic>k</italic>
-mers, and then follows the paths in the De Bruijn graph from these
<italic>k</italic>
-mers. Multiple
<italic>k</italic>
-mer choices are considered in order to allow for splice variants. If the nucleotide in the current path is different from
<italic>r</italic>
<sub>
<italic>i</italic>
+
<italic>k</italic>
</sub>
, then it is marked as a correction. When the number of corrections in the path exceeds an
<italic>a priori</italic>
defined threshold, Rcorrector terminates the current search path and starts a new one. In the end, Rcorrector selects the path with the minimum number of changes and uses the path’s sequence to correct the read. To improve speed, Rcorrector does not attempt to correct solid positions, and gradually decreases the allowable number of corrections if the number of searched paths becomes large.</p>
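<p>The path search described above can be sketched as a depth-first enumeration with pruning. This is a simplified illustration under stated assumptions: the solidity test is a placeholder (a fixed fraction of the highest successor count) rather than Rcorrector's local threshold, all k-mer counts are hypothetical, and the speed-ups mentioned in the text are omitted. The read and the candidate paths follow the Fig. 1 example.</p>

```python
def search_paths(start, read_tail, counts, is_solid, max_corrections):
    """Enumerate graph paths from k-mer `start` spelling len(read_tail) bases.

    Returns (corrections, bases) for the path with the fewest changes,
    or None if every path was pruned.
    """
    best = None
    stack = [(start, 0, "")]  # (current k-mer, corrections so far, bases chosen)
    while stack:
        u, n_corr, chosen = stack.pop()
        pos = len(chosen)
        if pos == len(read_tail):
            if best is None or best[0] > n_corr:
                best = (n_corr, chosen)
            continue
        succ = {c: counts.get(u[1:] + c, 0) for c in "ACGT"}
        t = max(succ.values())
        for c, m in succ.items():
            if m == 0 or not is_solid(m, t):
                continue  # non-solid continuation: prune
            cost = n_corr + (c != read_tail[pos])
            if cost > max_corrections:
                continue  # too many corrections on this path: prune
            stack.append((u[1:] + c, cost, chosen + c))
    return best


def is_solid(m, t):
    """Placeholder rule (an assumption, not Rcorrector's threshold)."""
    return m >= 0.1 * t


# Hypothetical counts for the four continuations of AGTC in Fig. 1.
toy_counts = {
    "GTCA": 4, "TCAT": 4, "CATA": 4, "ATAA": 4,          # original read
    "GTCC": 494, "TCCG": 480, "CCGT": 470, "CGTC": 460,  # second path
    "GTCG": 150, "TCGT": 140, "CGTT": 145, "GTTA": 150,  # third path
    "GTCT": 3,                                           # fourth path
}
read = "AAGTCATAA"
best = search_paths("AGTC", read[5:], toy_counts, is_solid, max_corrections=4)
corrected = read[:5] + best[1]
print(best, corrected)
```

<p>With these toy counts the search keeps the path that needs two corrections and rewrites the read as AAGTCGTTA, in line with the choice described in the Fig. 1 caption.</p>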
</sec>
<sec id="Sec5">
<title>A flexible local threshold for solid
<italic>k</italic>
-mers</title>
<p>Let
<italic>u</italic>
be the
<italic>k</italic>
-mer starting at position
<italic>i</italic>
in the read, as before. Unlike with WGS reads, even if the multiplicity
<italic>M</italic>
(
<italic>S</italic>
(
<italic>u</italic>
,
<italic>r</italic>
<sub>
<italic>i</italic>
+
<italic>k</italic>
</sub>
)) of its successor
<italic>k</italic>
-mer is very low, the base
<italic>r</italic>
<sub>
<italic>i</italic>
+
<italic>k</italic>
</sub>
may still be correct, for instance sampled from a low-expression transcript. Therefore, an RNA-seq read error corrector cannot simply use a global
<italic>k</italic>
-mer count threshold. Rcorrector uses a locally defined threshold as follows. Let
<italic>t</italic>
= max
<sub>
<italic>c</italic>
</sub>
<italic>M</italic>
(
<italic>S</italic>
(
<italic>u</italic>
,
<italic>c</italic>
)), calculated over all possible successors of
<italic>k</italic>
-mer
<italic>u</italic>
encoded in the De Bruijn graph. Rcorrector defines the local threshold at run time,
<italic>f</italic>
(
<italic>t</italic>
,
<italic>r</italic>
), as the smaller of two values, a
<italic>k</italic>
-mer-level threshold and a read-level one:
<italic>f</italic>
(
<italic>t</italic>
,
<italic>r</italic>
)= min(
<italic>g</italic>
(
<italic>t</italic>
),
<italic>h</italic>
(
<italic>r</italic>
)).</p>
<p>The
<italic>k</italic>
-mer-level threshold is defined as
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$g(t)=\alpha t + 6\sqrt {\alpha t}$\end{document}</tex-math>
<mml:math id="M2">
<mml:mi>g</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>αt</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>6</mml:mn>
<mml:msqrt>
<mml:mrow>
<mml:mi>αt</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:math>
<inline-graphic xlink:href="13742_2015_Article_89_TeX2GIF_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where
<italic>α</italic>
is a global variation coefficient. Specifically,
<italic>α</italic>
is determined for each dataset from a sample of 1 million high-count
<italic>k</italic>
-mers (multiplicities over 1,000), as follows. Given the four (or fewer) possible continuations of a
<italic>k</italic>
-mer, Rcorrector calculates a value equal to the ratio between the second highest and the highest multiplicities. Then,
<italic>α</italic>
is chosen as the smallest such value larger than 95 % of those in the sample. This criterion ensures that only
<italic>k</italic>
-mers that can be unambiguously distinguished from their alternates will be chosen; lowering this parameter value will reduce the stringency. Note that the
<italic>k</italic>
-mer-level threshold is the same for a
<italic>k</italic>
-mer in all read contexts, but differs by
<italic>k</italic>
-mer.</p>
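<p>As a hedged sketch of the two definitions above, the following code evaluates g(t) and estimates the coefficient from a sample of successor multiplicities. The function names, the integer percentile rule and the sample values are assumptions made for illustration; the text only specifies the formula for g(t) and that the coefficient is the smallest ratio larger than 95 % of the sampled ratios.</p>

```python
import math


def kmer_threshold(t, alpha):
    """g(t) = alpha * t + 6 * sqrt(alpha * t)."""
    return alpha * t + 6.0 * math.sqrt(alpha * t)


def estimate_alpha(successor_count_lists, percent=95):
    """Ratio (second highest / highest multiplicity) at the given percentile.

    Each element lists the multiplicities of the (up to four) continuations
    of one sampled high-count k-mer; the highest count is assumed positive.
    """
    ratios = []
    for counts in successor_count_lists:
        ordered = sorted(counts, reverse=True)
        second = ordered[1] if len(ordered) > 1 else 0
        ratios.append(second / ordered[0])
    ratios.sort()
    index = min((percent * len(ratios)) // 100, len(ratios) - 1)
    return ratios[index]


# Hypothetical sample: 20 k-mers whose best continuation has count 1000.
sample = [[1000, 10 * i] for i in range(1, 21)]
alpha = estimate_alpha(sample)
print(alpha, kmer_threshold(100, 0.04))
```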
<p>To calculate the read-level threshold, Rcorrector orders all
<italic>k</italic>
-mers in the read by decreasing multiplicities. Let
<italic>x</italic>
be the multiplicity before the first sharp drop (> 2-fold) in this curve. Rcorrector then uses
<italic>h</italic>
(
<italic>r</italic>
)=
<italic>g</italic>
(
<italic>x</italic>
) as the read-level threshold. Refinements to this step to accommodate additional lower-count paths are described below.</p>
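<p>The read-level threshold and its combination with the k-mer-level one can be sketched as follows. The function names and example values are hypothetical, and the fallback used when a read shows no sharp drop is an assumption, since the text does not describe that case.</p>

```python
import math


def g(t, alpha):
    """k-mer-level threshold g(t) = alpha * t + 6 * sqrt(alpha * t)."""
    return alpha * t + 6.0 * math.sqrt(alpha * t)


def multiplicity_before_first_drop(multiplicities):
    """x: value just before the first more-than-2-fold drop in the
    multiplicities sorted in decreasing order."""
    ordered = sorted(multiplicities, reverse=True)
    for prev, curr in zip(ordered, ordered[1:]):
        if prev > 2 * curr:
            return prev
    return ordered[-1]  # assumption: no sharp drop, fall back to the lowest


def local_threshold(t, read_multiplicities, alpha):
    """f(t, r) = min(g(t), h(r)), with h(r) = g(x)."""
    x = multiplicity_before_first_drop(read_multiplicities)
    return min(g(t, alpha), g(x, alpha))


# Hypothetical multiplicities of the k-mers in one read.
read_counts = [48, 50, 45, 10, 9, 2]
print(multiplicity_before_first_drop(read_counts))
print(local_threshold(494, read_counts, alpha=0.04))
```

<p>Taking the smaller of the two values means that a read from a low-expression transcript is not held to the threshold implied by a highly expressed neighbor in the graph.</p>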
</sec>
<sec id="Sec6">
<title>Refinements</title>
<sec id="Sec7">
<title>Clustered corrections</title>
<p>Once a set of corrections has been determined for a read, Rcorrector scans the read and selectively refines those at nearby positions. The rationale for this step is that the likelihood of two or more clustered errors is very low under the assumed model of random sequencing errors, and the read may instead originate from a paralog. More specifically, let
<italic>u</italic>
<sub>
<italic>i</italic>
</sub>
and
<italic>u</italic>
<sub>
<italic>j</italic>
</sub>
be the
<italic>k</italic>
-mers ending at two positions
<italic>i</italic>
and
<italic>j</italic>
, with
<italic>j</italic>
−
<italic>i</italic>
<
<italic>k</italic>
, and
<italic>M</italic>
(
<italic>u</italic>
<sub>
<italic>i</italic>
</sub>
) and
<italic>M</italic>
(
<italic>u</italic>
<sub>
<italic>j</italic>
</sub>
) their multiplicities. To infer the source for the
<italic>k</italic>
-mer, Rcorrector uses the local read context and tests for the difference in the multiplicities of
<italic>k</italic>
-mers before correction. If the difference is significant, then it is a strong indication for a cluster of sequencing errors. Otherwise (i.e., if 0.5<
<italic>M</italic>
(
<italic>u</italic>
<sub>
<italic>i</italic>
</sub>
)/
<italic>M</italic>
(
<italic>u</italic>
<sub>
<italic>j</italic>
</sub>
)<2), the
<italic>k</italic>
-mers are likely to have originated from the same path in the graph, corresponding to a low-expression paralog, and the read is deemed to be correct. Rcorrector will revert corrections at positions
<italic>i</italic>
and
<italic>j</italic>
and then iteratively revisit all corrections within distance
<italic>k</italic>
from those previously reverted.</p>
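<p>The ratio test for clustered corrections can be illustrated as follows. This sketch covers only the first pass (finding pairs of nearby corrections whose pre-correction multiplicities are within a factor of two); the subsequent iterative revisiting is omitted. The function names and the dictionary layout are assumptions made for the example.</p>
<preformat>
```python
def same_path(m_i, m_j, fold=2.0):
    """True when M(u_i)/M(u_j) lies strictly between 1/fold and fold."""
    return fold * m_i > m_j and fold * m_j > m_i


def clustered_positions_to_revert(corrections, k, fold=2.0):
    """First-pass selection of corrections to revert.

    `corrections` maps a corrected read position to the multiplicity, before
    correction, of the k-mer ending at that position. Two corrections closer
    than k whose multiplicities pass the ratio test are both marked.
    """
    positions = sorted(corrections)
    reverted = set()
    for index, i in enumerate(positions):
        for j in positions[index + 1:]:
            if j - i >= k:
                break  # positions are sorted, so later ones are farther away
            if same_path(corrections[i], corrections[j], fold):
                reverted.update((i, j))
    return reverted


if __name__ == "__main__":
    # Hypothetical read with four corrected positions and k = 23.
    print(sorted(clustered_positions_to_revert({10: 4, 15: 5, 60: 3, 62: 100}, 23)))
```
</preformat>
<p>In this hypothetical example, positions 10 and 15 have similar multiplicities and are marked for reversion, whereas positions 60 and 62 differ more than two-fold and are kept as corrected sequencing errors.</p>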
</sec>
<sec id="Sec8">
<title>Unfixable reads</title>
<p>Rcorrector builds multiple possible paths for a read and in the end chooses the path with the minimum number of base changes. If the number of changes over the entire read or within any window of size
<italic>k</italic>
exceeds an
<italic>a priori</italic>
determined threshold, the read is deemed ‘unfixable’. There are two likely explanations for unfixable reads:
<italic>i</italic>
) the read is correct, and originates from a low-expression transcript for which there is a higher-expression homolog present in the sample; and
<italic>ii</italic>
) the read contains too many errors to be rescued.</p>
<p>In the first case, Rcorrector never entered the true path in the graph during the extension, and hence the read was incorrectly converted to the high-expression homolog. To alleviate this problem, Rcorrector uses an iterative procedure to lower the read-level threshold
<italic>h</italic>
(
<italic>r</italic>
) and allow lower count
<italic>k</italic>
-mers in the path.</p>
<p>Specifically, Rcorrector looks for the next sharp drop in the
<italic>k</italic>
-mer multiplicity plot to define a new and reduced
<italic>h</italic>
(
<italic>r</italic>
), until there is no such drop or the number of corrections is within the set limits.</p>
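<p>The iterative relaxation can be sketched as a loop over successive sharp drops. The callback <italic>count_corrections</italic> is a hypothetical stand-in for re-running the path search with the threshold derived from a given multiplicity; it is not part of Rcorrector's interface, and the other names are likewise assumptions.</p>
<preformat>
```python
def anchor_multiplicities(multiplicities, fold=2.0):
    """Multiplicity just before each sharp drop, from highest to lowest."""
    ordered = sorted(multiplicities, reverse=True)
    return [cur for cur, nxt in zip(ordered, ordered[1:]) if cur > fold * nxt]


def relax_until_fixable(multiplicities, count_corrections, max_corrections, fold=2.0):
    """Lower the anchor multiplicity drop by drop until the read is fixable.

    Returns the last anchor tried, or None when the curve has no sharp drop.
    `count_corrections(x)` is a hypothetical callback giving the number of
    base changes needed when the read-level threshold is derived from x.
    """
    chosen = None
    for x in anchor_multiplicities(multiplicities, fold):
        chosen = x
        if max_corrections >= count_corrections(x):
            break  # number of corrections is within the set limit
    return chosen


if __name__ == "__main__":
    counts = [50, 48, 45, 10, 9, 2]
    # Hypothetical behaviour: 7 changes at the first anchor, 1 at the second.
    print(relax_until_fixable(counts, lambda x: 7 if x == 45 else 1, 3))
```
</preformat>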
</sec>
<sec id="Sec9">
<title>PolyA tail reads</title>
<p>The presence of polyA tail sequences in the sample will lead to
<italic>k</italic>
-mers with mostly A or T bases. Because their multiplicities are derived from a mixture distribution from a large number of transcripts, these
<italic>k</italic>
-mers are ignored during the correction process. Rcorrector will consequently not attempt to correct such
<italic>k</italic>
-mers.</p>
</sec>
<sec id="Sec10">
<title>Paired-end reads</title>
<p>With paired-end reads, Rcorrector leverages the
<italic>k</italic>
-mer count information across the two reads to improve the correction accuracy. In particular, it chooses the smaller of the two read-level thresholds as the common threshold for the two reads. In doing so, it models the scenario where the fragment comes from a low-expression isoform of the gene, with one of the reads specific to this isoform and the other shared among multiple, higher-expression isoforms. In this case, the lower of the two read-level thresholds better represents the originating transcript.</p>
</sec>
</sec>
</sec>
<sec id="Sec11">
<title>Findings</title>
<p>We evaluate Rcorrector for its ability to correct Illumina sequencing reads, both simulated and real. We include in the evaluation five other error correctors: SEECER (v0.1.3), which is the only other tool specifically designed for RNA-seq reads, as well as at least one representative method for each of the three classes of WGS error correction methods. These include Musket (v1.1) and BFC (r181) for
<italic>k</italic>
-spectrum, Hybrid-Shrec (Hshrec) for suffix tree and suffix array, and Coral (v1.4) for MSA-based methods. Since many tools are sensitive to the
<italic>k</italic>
-mer size
<italic>k</italic>
, we test different
<italic>k</italic>
-mer sizes for each tool where applicable and report the result that produces the best performance. We assess the impact of all programs on two representative bioinformatics applications, read alignment and read assembly. Lastly, we show that Rcorrector can be successfully applied to other types of data exhibiting non-uniform read coverage, such as single-cell sequencing reads.</p>
<sec id="Sec12">
<title>Evaluation on simulated data</title>
<p>In a first test, we evaluated all programs on a simulated dataset containing 100 million 100 bp long paired-end reads. Reads were generated with FluxSimulator [
<xref ref-type="bibr" rid="CR15">15</xref>
] starting from the human GENCODE v.17 gene annotations. Errors were subsequently introduced with Mason [
<xref ref-type="bibr" rid="CR16">16</xref>
]; error rates were extracted from alignments of same-length Illumina Human Body Map reads (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
, Section S1). As in [
<xref ref-type="bibr" rid="CR4">4</xref>
], we evaluate the accuracy of error corrections by inspecting how each base was corrected. Let true positives (TP) be the number of error bases that are converted into the correct nucleotide; false positives (FP) the number of error-free bases that are falsely corrected; and false negatives (FN) the number of error bases that are not converted or where the converted base is still an error. We use the standard measures of
<italic>R</italic>
<italic>e</italic>
<italic>c</italic>
<italic>a</italic>
<italic>l</italic>
<italic>l</italic>
=
<italic>T</italic>
<italic>P</italic>
/(
<italic>T</italic>
<italic>P</italic>
+
<italic>F</italic>
<italic>N</italic>
),
<italic>P</italic>
<italic>r</italic>
<italic>e</italic>
<italic>c</italic>
<italic>i</italic>
<italic>s</italic>
<italic>i</italic>
<italic>o</italic>
<italic>n</italic>
=
<italic>T</italic>
<italic>P</italic>
/(
<italic>T</italic>
<italic>P</italic>
+
<italic>F</italic>
<italic>P</italic>
), and
<italic>F</italic>
_
<italic>s</italic>
<italic>c</italic>
<italic>o</italic>
<italic>r</italic>
<italic>e</italic>
=2∗
<italic>R</italic>
<italic>e</italic>
<italic>c</italic>
<italic>a</italic>
<italic>l</italic>
<italic>l</italic>
∗
<italic>P</italic>
<italic>r</italic>
<italic>e</italic>
<italic>c</italic>
<italic>i</italic>
<italic>s</italic>
<italic>i</italic>
<italic>o</italic>
<italic>n</italic>
/(
<italic>R</italic>
<italic>e</italic>
<italic>c</italic>
<italic>a</italic>
<italic>l</italic>
<italic>l</italic>
+
<italic>P</italic>
<italic>r</italic>
<italic>e</italic>
<italic>c</italic>
<italic>i</italic>
<italic>s</italic>
<italic>i</italic>
<italic>o</italic>
<italic>n</italic>
) to evaluate all methods. For each tool we test different
<italic>k</italic>
-mer sizes and report the result with the best
<italic>F</italic>
_
<italic>s</italic>
<italic>c</italic>
<italic>o</italic>
<italic>r</italic>
<italic>e</italic>
.</p>
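<p>For concreteness, the three measures can be computed as in the short Python sketch below. The function names and the base counts in the example are hypothetical; only the recall and precision values for SEECER are taken from Table 1, and they reproduce its reported F-score up to rounding.</p>
<preformat>
```python
def recall(tp, fn):
    """Fraction of error bases converted into the correct nucleotide."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of corrected bases that were true errors fixed correctly."""
    return tp / (tp + fp)


def f_score(recall_value, precision_value):
    """Harmonic mean of recall and precision (fractions or percentages)."""
    return 2 * recall_value * precision_value / (recall_value + precision_value)


if __name__ == "__main__":
    # Hypothetical base counts, for illustration only.
    tp, fp, fn = 900, 10, 100
    print(recall(tp, fn), precision(tp, fp))
    # Recall and precision reported for SEECER in Table 1 (in percent).
    print(round(f_score(87.13, 96.93), 2))
```
</preformat>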
<p>Accuracy values and performance measurements for the six error correctors are shown in Table
<xref rid="Tab1" ref-type="table">1</xref>
. All programs were run on a 256 GB RAM machine with a 48-core 2.1 GHz AMD Opteron(
<sc>TM</sc>
) processor, with 8 threads. Here and throughout the manuscript, all measures are expressed in percentages. The overall sensitivity is below 90 % for all methods due to the large number of polyA reads generated by FluxSimulator, which are left unchanged. Rcorrector has the best overall accuracy by all three measures, with nearly 89 % sensitivity and greater than 99 % precision, followed closely by SEECER. Rcorrector is also virtually tied with BFC for the fastest method, and is among the most memory efficient. In particular, at 5 GB RAM for analyzing 100 million reads, it required about 12 times less memory than SEECER and can easily fit in the memory of most desktop computers (Table
<xref rid="Tab1" ref-type="table">1</xref>
).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Accuracy of the six error correction methods on the 100 million simulated reads</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Program</th>
<th align="left">
<italic>k</italic>
</th>
<th align="left">Recall</th>
<th align="left">Precision</th>
<th align="left">F-score</th>
<th align="left">Run time</th>
<th align="left">Memory</th>
</tr>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left"></th>
<th align="left"></th>
<th align="left"></th>
<th align="left">(min)</th>
<th align="left">(GB)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SEECER</td>
<td align="left">31</td>
<td align="left">87.13</td>
<td align="left">96.93</td>
<td align="left">91.77</td>
<td align="left">177</td>
<td align="left">61</td>
</tr>
<tr>
<td align="left">HShrec</td>
<td align="left">-</td>
<td align="left">69.53</td>
<td align="left">31.74</td>
<td align="left">43.58</td>
<td align="left">13641</td>
<td align="left">30</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">31</td>
<td align="left">58.35</td>
<td align="left">85.14</td>
<td align="left">69.25</td>
<td align="left">1391</td>
<td align="left">81</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">27</td>
<td align="left">78.24</td>
<td align="left">96.90</td>
<td align="left">86.58</td>
<td align="left">152</td>
<td align="left">
<italic>4</italic>
</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">27</td>
<td align="left">80.45</td>
<td align="left">97.91</td>
<td align="left">88.32</td>
<td align="left">
<italic>111</italic>
</td>
<td align="left">6</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">27</td>
<td align="left">
<italic>88.94</italic>
</td>
<td align="left">
<italic>99.84</italic>
</td>
<td align="left">
<italic>94.07</italic>
</td>
<td align="left">118</td>
<td align="left">5</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best performers in each category are highlighted in italic. All programs were run multithreaded, with eight threads</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>The difficulty of error correction is expected to vary with the expression level of transcripts. Correcting reads from low-expression transcripts is particularly challenging because the error-containing
<italic>k</italic>
-mers cannot be easily distinguished on the basis of frequency. To assess the performance of the various tools with transcript expression levels, we divide the simulated transcripts into low-, medium-, and high-expression groups based on their relative abundance
<italic>A</italic>
assigned by FluxSimulator (low,
<italic>A</italic>
<5
<italic>e</italic>
<sup>−7</sup>
; medium, 5
<italic>e</italic>
<sup>−7</sup>
<
<italic>A</italic>
<0.0001; and high,
<italic>A</italic>
>0.0001). The results of each tool on the three subclasses are shown in Table
<xref rid="Tab2" ref-type="table">2</xref>
. Most tools perform well on the high-expression dataset, with the exception of Coral (low sensitivity) and Hshrec (low precision). However, the performance for all methods, especially sensitivity, drops for reads from low-expression transcripts. Rcorrector has the best or comparable sensitivity and precision for each of the three classes of transcripts. Both Rcorrector and SEECER are significantly more precise (>86 % in all categories) and more sensitive than methods designed for DNA reads, especially for reads from low-expression transcripts.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Accuracy of six error correction methods on 100 million simulated reads, by expression level of transcripts.
<italic>k</italic>
-mer sizes used are those in Table
<xref rid="Tab1" ref-type="table">1</xref>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Program</th>
<th align="left">Recall</th>
<th align="left">Precision</th>
<th align="left">F-score</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Low expression</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">32.78</td>
<td align="left">
<italic>90.54</italic>
</td>
<td align="left">48.14</td>
</tr>
<tr>
<td align="left">HShrec</td>
<td align="left">24.77</td>
<td align="left">0.81</td>
<td align="left">1.56</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">31.88</td>
<td align="left">64.60</td>
<td align="left">42.69</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">13.88</td>
<td align="left">33.94</td>
<td align="left">19.71</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">25.18</td>
<td align="left">58.37</td>
<td align="left">35.19</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">
<italic>39.40</italic>
</td>
<td align="left">86.62</td>
<td align="left">
<italic>54.16</italic>
</td>
</tr>
<tr>
<td align="left">Medium expression</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">86.58</td>
<td align="left">97.05</td>
<td align="left">91.51</td>
</tr>
<tr>
<td align="left">HShrec</td>
<td align="left">70.57</td>
<td align="left">19.57</td>
<td align="left">30.64</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">89.07</td>
<td align="left">85.12</td>
<td align="left">87.05</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">72.02</td>
<td align="left">92.16</td>
<td align="left">80.86</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">
<italic>89.12</italic>
</td>
<td align="left">96.88</td>
<td align="left">92.84</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">87.73</td>
<td align="left">
<italic>99.66</italic>
</td>
<td align="left">
<italic>93.31</italic>
</td>
</tr>
<tr>
<td align="left">High expression</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">87.39</td>
<td align="left">96.90</td>
<td align="left">91.90</td>
</tr>
<tr>
<td align="left">HShrec</td>
<td align="left">69.22</td>
<td align="left">41.67</td>
<td align="left">52.02</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">47.59</td>
<td align="left">85.17</td>
<td align="left">61.06</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">80.50</td>
<td align="left">98.53</td>
<td align="left">88.61</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">77.47</td>
<td align="left">98.35</td>
<td align="left">86.67</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">
<italic>89.42</italic>
</td>
<td align="left">
<italic>99.91</italic>
</td>
<td align="left">
<italic>94.37</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best performers are highlighted in italic</p>
</table-wrap-foot>
</table-wrap>
</p>
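<p>The grouping of transcripts by expression level amounts to two cutoffs on the relative abundance A, as in the sketch below. The function and constant names are hypothetical, and the assignment of values lying exactly on a cutoff is an assumption, because the text uses strict inequalities on both sides.</p>
<preformat>
```python
LOW_CUTOFF = 5e-7    # upper bound of the low-expression group
HIGH_CUTOFF = 1e-4   # lower bound of the high-expression group


def expression_class(abundance):
    """Assign a transcript to the low, medium or high expression group.

    Values exactly equal to a cutoff fall into the lower group here; this
    boundary handling is an assumption made for the sketch.
    """
    if abundance > HIGH_CUTOFF:
        return "high"
    if abundance > LOW_CUTOFF:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Hypothetical relative abundances.
    for a in (1e-8, 1e-5, 1e-3):
        print(a, expression_class(a))
```
</preformat>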
</sec>
<sec id="Sec13">
<title>Real datasets</title>
<p>For a more realistic assessment, we applied the tools to three real datasets that vary in their sequencing depth, read length, amount of sequence variation, and application area (Table
<xref rid="Tab3" ref-type="table">3</xref>
and Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Section S2). These include a plant RNA-seq dataset (peach embryos and cotyledons; SRA accession SRR531865), a lung cancer cell line (SRA accession SRR1062943), and a lymphoblastoid cell line sequenced as part of the GEUVADIS population variation project (SRA accession ERR188021). We use these three sets to evaluate the performance of programs on real data, as well as to illustrate the effects of error correction on the alignment and assembly of RNA-seq reads. Summary statistics for all datasets are shown in Table
<xref rid="Tab3" ref-type="table">3</xref>
.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Summary of datasets included in the evaluation</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Name</th>
<th align="left">Reads</th>
<th align="left">Read length</th>
<th align="left">Aligned</th>
<th align="left">Perfectly</th>
</tr>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left">(bp)</th>
<th align="left"></th>
<th align="left">aligned</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Simulated</td>
<td align="left">99,338,716</td>
<td align="left">100</td>
<td align="left">81,994,413</td>
<td align="left">21,070,024</td>
</tr>
<tr>
<td align="left">Peach</td>
<td align="left">38,883,238</td>
<td align="left">75</td>
<td align="left">24,775,386</td>
<td align="left">5,617,514</td>
</tr>
<tr>
<td align="left">Lung</td>
<td align="left">113,313,254</td>
<td align="left">50</td>
<td align="left">110,771,941</td>
<td align="left">85,160,322</td>
</tr>
<tr>
<td align="left">Geuvadis</td>
<td align="left">65,015,656</td>
<td align="left">75</td>
<td align="left">59,130,806</td>
<td align="left">26,468,128</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Unlike for simulated data, the ground truth for each base is unknown, making it impossible to judge performance directly and in an unbiased way. Instead, we use alignment rates to estimate the accuracy of error correction. We tested different
<italic>k</italic>
-mer sizes for each tool, and chose the one maximizing the total number of matching bases. Statistics for alignments generated with Tophat2 (v2.0.13) [
<xref ref-type="bibr" rid="CR17">17</xref>
] are summarized in Table
<xref rid="Tab4" ref-type="table">4</xref>
. Lacking a true measure of sensitivity, the number and percentage of aligned reads as well as the per base match rate, as introduced in [
<xref ref-type="bibr" rid="CR3">3</xref>
], are used to estimate sensitivity at the read and base level, respectively. The per-base match rate is computed as the ratio of the total number of matching bases to the total number of bases in the aligned reads. Likewise, we introduce an alternate measure of specificity, defined as
<italic>T</italic>
<italic>N</italic>
/(
<italic>T</italic>
<italic>N</italic>
+
<italic>F</italic>
<italic>P</italic>
), based on a high-confidence subset of the original reads (Table
<xref rid="Tab4" ref-type="table">4</xref>
). We extracted those reads that have perfect alignments on the genome, i.e., those with exact sequence matches and a concordant alignment of the two reads in a pair. These reads are expected to be predominantly error-free; therefore, the proportion of these reads that are left uncorrected represents a measure of specificity. As a caveat, these measures will falsely include those reads that are incorrectly converted to a paralog and aligned at the wrong location in the genome.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>Tophat2 alignments of simulated and real reads</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left"></th>
<th align="left">Simulated reads</th>
<th align="left"></th>
<th align="left"></th>
</tr>
<tr>
<th align="left"></th>
<th align="left">
<italic>k</italic>
</th>
<th align="left">Aligned</th>
<th align="left">Observed</th>
<th align="left">Base</th>
<th align="left">Specificity</th>
</tr>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left"></th>
<th align="left">rate</th>
<th align="left">match rate</th>
<th align="left"></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Original</td>
<td align="left">-</td>
<td align="left">81,994,413</td>
<td align="left">82.540</td>
<td align="left">99.391</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">31</td>
<td align="left">85,374,347</td>
<td align="left">
<italic>85.943</italic>
</td>
<td align="left">
<italic>99.988</italic>
</td>
<td align="left">99.619</td>
</tr>
<tr>
<td align="left">Hshrec</td>
<td align="left">-</td>
<td align="left">77,488,558</td>
<td align="left">78.004</td>
<td align="left">99.888</td>
<td align="left">97.886</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">31</td>
<td align="left">84,662,510</td>
<td align="left">85.226</td>
<td align="left">99.745</td>
<td align="left">99.494</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">27</td>
<td align="left">84,892,466</td>
<td align="left">85.458</td>
<td align="left">99.906</td>
<td align="left">99.739</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">27</td>
<td align="left">84,844,168</td>
<td align="left">85.409</td>
<td align="left">99.918</td>
<td align="left">99.889</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">27</td>
<td align="left">85,033,277</td>
<td align="left">85.599</td>
<td align="left">99.986</td>
<td align="left">
<italic>99.970</italic>
</td>
</tr>
<tr>
<td align="left">Peach</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Original</td>
<td align="left">-</td>
<td align="left">24,775,386</td>
<td align="left">63.717</td>
<td align="left">99.198</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">27</td>
<td align="left">29,056,747</td>
<td align="left">74.728</td>
<td align="left">
<italic>99.879</italic>
</td>
<td align="left">99.199</td>
</tr>
<tr>
<td align="left">Hshrec</td>
<td align="left">-</td>
<td align="left">24,496,308</td>
<td align="left">63.000</td>
<td align="left">99.265</td>
<td align="left">96.027</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">23</td>
<td align="left">28,974,141</td>
<td align="left">74.516</td>
<td align="left">99.316</td>
<td align="left">99.027</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">27</td>
<td align="left">28,345,203</td>
<td align="left">72.898</td>
<td align="left">99.256</td>
<td align="left">99.677</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">31</td>
<td align="left">26,553,943</td>
<td align="left">68.291</td>
<td align="left">99.278</td>
<td align="left">
<italic>99.777</italic>
</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">23</td>
<td align="left">30,563,388</td>
<td align="left">
<italic>78.603</italic>
</td>
<td align="left">99.833</td>
<td align="left">99.628</td>
</tr>
<tr>
<td align="left">Lung</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Original</td>
<td align="left">-</td>
<td align="left">110,771,941</td>
<td align="left">97.757</td>
<td align="left">99.717</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">23</td>
<td align="left">111,261,651</td>
<td align="left">98.189</td>
<td align="left">
<italic>99.855</italic>
</td>
<td align="left">98.239</td>
</tr>
<tr>
<td align="left">Hshrec</td>
<td align="left">-</td>
<td align="left">102,121,932</td>
<td align="left">90.124</td>
<td align="left">99.781</td>
<td align="left">89.786</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">23</td>
<td align="left">111,107,133</td>
<td align="left">98.053</td>
<td align="left">99.809</td>
<td align="left">98.330</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">27</td>
<td align="left">110,907,828</td>
<td align="left">97.877</td>
<td align="left">99.781</td>
<td align="left">98.698</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">23</td>
<td align="left">111,427,773</td>
<td align="left">
<italic>98.336</italic>
</td>
<td align="left">99.824</td>
<td align="left">99.359</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">23</td>
<td align="left">111,198,587</td>
<td align="left">98.134</td>
<td align="left">99.830</td>
<td align="left">
<italic>99.599</italic>
</td>
</tr>
<tr>
<td align="left">Geuvadis</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Original</td>
<td align="left">-</td>
<td align="left">59,130,806</td>
<td align="left">90.949</td>
<td align="left">99.477</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">23</td>
<td align="left">61,514,024</td>
<td align="left">94.614</td>
<td align="left">
<italic>99.837</italic>
</td>
<td align="left">98.530</td>
</tr>
<tr>
<td align="left">Hshrec</td>
<td align="left">23</td>
<td align="left">51,669,686</td>
<td align="left">79.473</td>
<td align="left">99.709</td>
<td align="left">87.924</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">23</td>
<td align="left">61,399,007</td>
<td align="left">94.437</td>
<td align="left">99.717</td>
<td align="left">98.049</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">23</td>
<td align="left">60,450,316</td>
<td align="left">92.978</td>
<td align="left">99.652</td>
<td align="left">97.900</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">23</td>
<td align="left">61,870,897</td>
<td align="left">
<italic>95.163</italic>
</td>
<td align="left">99.775</td>
<td align="left">98.790</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">23</td>
<td align="left">61,641,866</td>
<td align="left">94.811</td>
<td align="left">99.814</td>
<td align="left">
<italic>99.227</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best performers are highlighted in italic</p>
</table-wrap-foot>
</table-wrap>
</p>
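<p>The read-level measures in Table 4 can be reproduced from simple counts, as the sketch below illustrates. The read totals and aligned-read counts are taken from Tables 3 and 4; the function names and the counts passed to the specificity function are hypothetical.</p>
<preformat>
```python
def observed_alignment_rate(aligned_reads, total_reads):
    """Percentage of reads in a dataset that were aligned."""
    return 100.0 * aligned_reads / total_reads


def specificity(unchanged_perfect_reads, perfect_reads):
    """TN / (TN + FP), in percent, over the perfectly aligned original reads.

    Reads left unchanged by the corrector count as true negatives and
    modified reads as false positives.
    """
    return 100.0 * unchanged_perfect_reads / perfect_reads


if __name__ == "__main__":
    # Simulated data, original reads (Tables 3 and 4).
    print(round(observed_alignment_rate(81_994_413, 99_338_716), 3))
    # Peach data after Rcorrector (Tables 3 and 4).
    print(round(observed_alignment_rate(30_563_388, 38_883_238), 3))
    # Hypothetical counts, for illustration only.
    print(specificity(999, 1000))
```
</preformat>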
<p>Error correction improves alignment rates by 1–11 %, depending on the dataset (Table
<xref rid="Tab4" ref-type="table">4</xref>
). Note that alignment rates themselves differ with the amount of sequence variation and quality of the data. Rcorrector, SEECER, and BFC take turns in being the most sensitive across the four datasets. However, only Rcorrector and SEECER are consistently ranked among the top results in each category. Rcorrector has the highest or comparable specificity, greater than 99.2 %, in all cases.</p>
<p>We further assess the impact of error correction on improving
<italic>de novo</italic>
assembly of RNA-seq reads. We used the transcript assembler Oases [
<xref ref-type="bibr" rid="CR18">18</xref>
] to assemble the reads
<italic>a priori</italic>
corrected with each of the methods. To evaluate the quality of the assembled transcripts, we aligned them to the reference genome with the spliced alignment program ESTmapper/sim4db [
<xref ref-type="bibr" rid="CR19">19</xref>
], retaining only the best match for each transcript. We use conventional methods and measures to evaluate the performance in reconstructing full-length transcripts [
<xref ref-type="bibr" rid="CR20">20</xref>
]. Specifically, we define a match between a reference annotation transcript and the spliced alignment of an assembled transcript if and only if they have identical intron chains, whereas their endpoints may differ. As the gold reference, we used the GENCODE v.17 annotations for the human datasets (Lung and Geuvadis), the peach gene annotations (v1.1) obtained from the Genome Database for Rosaceae for the peach dataset, and the subset of GENCODE transcripts sampled by FluxSimulator for the simulated data. The results, shown in Table
<xref rid="Tab5" ref-type="table">5</xref>
, again indicate that SEECER, Rcorrector, and BFC have the most impact on improving the accuracy and quality of the assembled transcripts, and show comparable performance. Results were similar, showing Rcorrector and SEECER predominantly producing the top results, when using an alternative assembler, Trinity [
<xref ref-type="bibr" rid="CR21">21</xref>
] (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Section S3). Of note, these measures only capture full transcripts, whereas many of the transcripts in the sample will not have enough reads to be assembled fully.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Oases assembly of simulated and real reads</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Program</th>
<th align="left" colspan="2">Simulated</th>
<th align="left" colspan="2">Peach</th>
<th align="left" colspan="2">Lung</th>
<th align="left" colspan="2">Geuvadis</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">Recall</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">Precision</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Original</td>
<td align="left">30.575</td>
<td align="left">48.862</td>
<td align="left">28.879</td>
<td align="left">
<italic>16.410</italic>
</td>
<td align="left">4.957</td>
<td align="left">10.475</td>
<td align="left">5.997</td>
<td align="left">16.749</td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">36.698</td>
<td align="left">
<italic>52.181</italic>
</td>
<td align="left">
<italic>29.752</italic>
</td>
<td align="left">16.116</td>
<td align="left">4.944</td>
<td align="left">10.174</td>
<td align="left">6.162</td>
<td align="left">16.639</td>
</tr>
<tr>
<td align="left">Hshrec</td>
<td align="left">23.334</td>
<td align="left">47.417</td>
<td align="left">26.132</td>
<td align="left">13.850</td>
<td align="left">3.608</td>
<td align="left">
<italic>11.459</italic>
</td>
<td align="left">4.266</td>
<td align="left">19.101</td>
</tr>
<tr>
<td align="left">Coral</td>
<td align="left">35.039</td>
<td align="left">51.942</td>
<td align="left">29.784</td>
<td align="left">15.881</td>
<td align="left">4.934</td>
<td align="left">10.174</td>
<td align="left">6.170</td>
<td align="left">16.372</td>
</tr>
<tr>
<td align="left">Musket</td>
<td align="left">33.845</td>
<td align="left">47.769</td>
<td align="left">28.760</td>
<td align="left">15.991</td>
<td align="left">4.920</td>
<td align="left">10.577</td>
<td align="left">5.846</td>
<td align="left">16.901</td>
</tr>
<tr>
<td align="left">BFC</td>
<td align="left">34.789</td>
<td align="left">50.579</td>
<td align="left">29.633</td>
<td align="left">16.211</td>
<td align="left">
<italic>5.018</italic>
</td>
<td align="left">10.498</td>
<td align="left">6.166</td>
<td align="left">16.509</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">
<italic>36.763</italic>
</td>
<td align="left">52.144</td>
<td align="left">29.355</td>
<td align="left">15.951</td>
<td align="left">5.012</td>
<td align="left">10.478</td>
<td align="left">
<italic>6.222</italic>
</td>
<td align="left">16.375</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best performers are highlighted in italic</p>
</table-wrap-foot>
</table-wrap>
</p>
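<p>The intron-chain criterion used to match assembled and reference transcripts can be illustrated with the sketch below. The exon representation as (start, end) pairs and the function names are assumptions; chromosome and strand checks are omitted, and treating single-exon transcripts as non-matching is a simplification made for the example.</p>
<preformat>
```python
def intron_chain(exons):
    """Introns implied by a list of (start, end) exon intervals.

    Each intron is reported as (end of left exon, start of right exon), so
    the outer transcript endpoints do not enter the chain.
    """
    ordered = sorted(exons)
    return tuple((left[1], right[0]) for left, right in zip(ordered, ordered[1:]))


def is_match(reference_exons, assembled_exons):
    """True when both transcripts have the same non-empty intron chain."""
    reference_chain = intron_chain(reference_exons)
    return len(reference_chain) > 0 and reference_chain == intron_chain(assembled_exons)


if __name__ == "__main__":
    # Hypothetical coordinates: same introns, different transcript endpoints.
    reference = [(100, 200), (300, 400), (500, 650)]
    assembled = [(150, 200), (300, 400), (500, 600)]
    print(is_match(reference, assembled))
```
</preformat>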
<p>Figure
<xref rid="Fig2" ref-type="fig">2</xref>
illustrates the spliced alignments of a 13-exon transcript at the MTMR11 (myotubularin related protein) gene locus (chr1:149,900,543-149,908,791) assembled with Oases from the simulated reads before and after correction. All methods missed the first intron, which was supported by six error-containing reads, but produced partial reconstructions of the transcript, consisting of multiple contigs (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Section S4). While all error correctors improved upon the original reads, Rcorrector produced the most complete and compact assembly, with only three contigs, including one containing the full reconstruction of exons 1–12.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Transcripts assembled from the original and error-corrected reads at the MTMR11 gene locus. Rcorrector (
<italic>bottom panel</italic>
) improves upon the original reads and leads to the most complete reconstruction of the transcript</p>
</caption>
<graphic xlink:href="13742_2015_89_Fig2_HTML" id="d30e2848"></graphic>
</fig>
</p>
</sec>
<sec id="Sec14">
<title>Single-cell sequencing</title>
<p>While Rcorrector was designed to correct RNA-seq reads, the method is also applicable to a wider range of problems where read coverage is non-uniform.</p>
<p>Single-cell sequencing has recently emerged as a powerful technique to survey the content and variation within an individual cell. However, PCR amplification of the input DNA introduces biases in read coverage across the genome. We compared Rcorrector with SEECER and the error correction module built into the assembly package SPAdes (3.1.0) [
<xref ref-type="bibr" rid="CR22">22</xref>
]. The latter is based on the error corrector BayesHammer [
<xref ref-type="bibr" rid="CR23">23</xref>
], which accounts for variable depth of coverage. We applied all three methods to correct 29,124,078
<italic>E. coli</italic>
K-12 MG1655 Illumina reads [
<xref ref-type="bibr" rid="CR22">22</xref>
], then aligned the corrected reads to the
<italic>E. coli</italic>
K-12 genome with Bowtie2 [
<xref ref-type="bibr" rid="CR24">24</xref>
] and assembled them with SPAdes. We evaluated the alignment outcome as described earlier and separately used the package QUAST [
<xref ref-type="bibr" rid="CR25">25</xref>
] to assess the quality of the resulting genome assemblies.</p>
<p>As seen in Table
<xref rid="Tab6" ref-type="table">6</xref>
, Rcorrector results in the largest number of aligned reads, and is also the most specific among the methods. Surprisingly, the built-in SPAdes error corrector shows very low specificity (41.5 %), primarily arising from BayesHammer’s trimming of end sequences for some reads. In contrast, SEECER has very high specificity but relatively low sensitivity, as the number of mapped reads was actually reduced after correction. Rcorrector shows both the highest sensitivity and the highest precision, and is therefore the best choice for this dataset.
<table-wrap id="Tab6">
<label>Table 6</label>
<caption>
<p>Bowtie2 alignment of single-cell sequencing reads</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">
<italic>k</italic>
</th>
<th align="left">Aligned reads</th>
<th align="left">Alignment rate (%)</th>
<th align="left">Base match rate (%)</th>
<th align="left">Specificity (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Original</td>
<td align="left">-</td>
<td align="left">27,002,682</td>
<td align="left">92.716</td>
<td align="left">98.863</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">SPAdes</td>
<td align="left">-</td>
<td align="left">27,104,190</td>
<td align="left">93.065</td>
<td align="left">99.675</td>
<td align="left">41.482</td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">27</td>
<td align="left">26,937,652</td>
<td align="left">92.493</td>
<td align="left">99.507</td>
<td align="left">99.553</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">19</td>
<td align="left">27,227,855</td>
<td align="left">
<italic>93.489</italic>
</td>
<td align="left">
<italic>99.711</italic>
</td>
<td align="left">
<italic>99.998</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best performers are highlighted in italic</p>
</table-wrap-foot>
</table-wrap>
</p>
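<p>The alignment rates in Table 6 are consistent with dividing each aligned-read count by the 29,124,078 input reads. The short Python check below recomputes them from the counts in the table. The function name alignment_rate is hypothetical, and the choice of denominator is an assumption inferred from the reported figures rather than something the table states explicitly.</p>
<preformat>
```python
TOTAL_READS = 29_124_078  # input E. coli reads, as stated in the text

# Aligned read counts copied from Table 6.
ALIGNED = {
    "Original": 27_002_682,
    "SPAdes": 27_104_190,
    "SEECER": 26_937_652,
    "Rcorrector": 27_227_855,
}


def alignment_rate(aligned, total=TOTAL_READS):
    """Return the percentage of input reads that were aligned.

    ASSUMPTION: the denominator is the number of input reads, not the
    number of reads remaining after correction.
    """
    return 100.0 * aligned / total


if __name__ == "__main__":
    for method, aligned in ALIGNED.items():
        print(f"{method}: {alignment_rate(aligned):.3f}")
```
</preformat>
<p>Under this assumption the recomputed values (92.716, 93.065, 92.493 and 93.489) agree with the rate column of Table 6 to three decimal places.</p>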
<p>For assembly, both Rcorrector and SEECER lead to longer contigs and better genome coverage compared to the built-in corrector in SPAdes, while Rcorrector additionally produces the smallest number of misassemblies (Table
<xref rid="Tab7" ref-type="table">7</xref>
). To conclude, Rcorrector can be effectively applied to correct single-cell DNA sequencing reads.
<table-wrap id="Tab7">
<label>Table 7</label>
<caption>
<p>SPAdes assembly of single-cell sequencing reads. NG50 is the largest contig length such that the total number of bases in contigs of this size or longer represents more than half of the length of the reference genome</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">NG50</th>
<th align="left">Misassemblies</th>
<th align="left">Edits/100 kbp</th>
<th align="left">Genome coverage (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Original</td>
<td align="left">105,623</td>
<td align="left">
<italic>1</italic>
</td>
<td align="left">
<italic>6.57</italic>
</td>
<td align="left">95.054</td>
</tr>
<tr>
<td align="left">SPAdes</td>
<td align="left">109,876</td>
<td align="left">2</td>
<td align="left">7.52</td>
<td align="left">94.903</td>
</tr>
<tr>
<td align="left">SEECER</td>
<td align="left">
<italic>110,103</italic>
</td>
<td align="left">2</td>
<td align="left">7.26</td>
<td align="left">95.059</td>
</tr>
<tr>
<td align="left">Rcorrector</td>
<td align="left">
<italic>110,103</italic>
</td>
<td align="left">
<italic>1</italic>
</td>
<td align="left">10.02</td>
<td align="left">
<italic>95.094</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best performers are highlighted in italic</p>
</table-wrap-foot>
</table-wrap>
</p>
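<p>The NG50 statistic reported in Table 7 can be computed by sorting the contigs from longest to shortest and accumulating their lengths until the running total passes half of the reference genome length. The Python sketch below shows one way to do this. The function name ng50 is hypothetical, the strict comparison follows the wording of the table caption ("more than half"), and QUAST's own implementation may differ in detail.</p>
<preformat>
```python
def ng50(contig_lengths, genome_length):
    """Return the NG50 of an assembly, or None if it is undefined.

    Contigs are visited from longest to shortest. The NG50 is the length
    of the contig at which the running total of bases first exceeds half
    of the reference genome length. If the contigs never cover more than
    half of the genome, the statistic is undefined and None is returned.
    """
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered > genome_length:
            return length
    return None
```
</preformat>
<p>For example, ng50([50, 40, 30, 20, 10], 200) returns 30: the running totals are 50, 90 and 120, and 120 is the first to exceed half of 200.</p>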
</sec>
</sec>
<sec id="Sec15" sec-type="conclusion">
<title>Conclusions</title>
<p>Rcorrector is the first
<italic>k</italic>
-spectrum based method designed specifically for correcting RNA-seq reads, and addresses several limitations in existing methods. It implements a flexible
<italic>k</italic>
-mer count threshold to account for different gene and transcript expression levels, and simultaneously explores multiple correction paths for a read to accommodate isoforms of a gene. In comparisons with similar tools, Rcorrector showed the highest or near-highest accuracy on all datasets, which varied in the number of sequencing errors and polymorphisms they contained. With a 5 GB memory footprint for a 100-million-read dataset, it also required an order of magnitude less memory than SEECER, the only other tool designed specifically for RNA-seq reads. Lastly, Rcorrector was the fastest of all methods tested, taking less than two hours to correct the simulated dataset. Rcorrector is therefore an excellent choice for large-scale and affordable transcriptomic studies in both model and non-model organisms.</p>
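<p>To illustrate why a k-mer count threshold that adapts to local coverage matters for RNA-seq data, the Python sketch below counts k-mers and flags as weak those whose count falls below a fraction of the median count among the k-mers of the same read. The function names, the alpha parameter and the median-based rule are all assumptions made for this illustration. This is not Rcorrector's actual thresholding procedure.</p>
<preformat>
```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmer_positions(read, counts, k, alpha=0.5):
    """Return start positions of k-mers in `read` that look like errors.

    ASSUMPTION (illustrative rule, not Rcorrector's): the cutoff is
    alpha times the median count of the read's own k-mers, so it scales
    with the local coverage instead of being one global constant.
    """
    kmer_counts = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    if not kmer_counts:
        return []  # read shorter than k
    cutoff = alpha * median(kmer_counts)
    return [i for i, c in enumerate(kmer_counts) if cutoff > c]
```
</preformat>
<p>As a toy example, take ten copies of the read ACGTAC, one erroneous copy ACGTAG and two copies of a low-coverage read TTGCAA. With k = 3, weak_kmer_positions flags only the final k-mer of ACGTAG (position 3) and nothing in the other two reads. A fixed global cutoff of, say, 5 would also have flagged every k-mer of the low-coverage read.</p>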
</sec>
<sec id="Sec16">
<title>Availability and requirements</title>
<p>
<bold>Project name:</bold>
Rcorrector
<bold>Project home page:</bold>
<ext-link ext-link-type="uri" xlink:href="http://github.com/mourisl/Rcorrector">http://github.com/mourisl/Rcorrector</ext-link>
<bold>Operating system(s):</bold>
Unix, Linux
<bold>Programming language:</bold>
C, C++, Perl
<bold>License:</bold>
GNU General Public License version 3.0 (GPLv3)
<bold>Any restrictions to use by non-academics:</bold>
none</p>
</sec>
<sec id="Sec17">
<title>Availability of supporting data</title>
<p>All data sets supporting the analyses are available from the GigaScience GigaDB repository [
<xref ref-type="bibr" rid="CR26">26</xref>
].</p>
</sec>
<sec sec-type="supplementary-material">
<title>Additional file</title>
<sec id="Sec18">
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="13742_2015_89_MOESM1_ESM.docx">
<label>Additional file 1</label>
<caption>
<p>
<bold>Supplementary material.</bold>
Section
<italic>S1 -</italic>
Command line and error rate parameters for Mason. Section
<italic>S2 -</italic>
Variation coefficient (
<italic>α</italic>
) for the four datasets. Section
<italic>S3 -</italic>
Trinity assembly of simulated and real reads. Section
<italic>S4 -</italic>
Sim4db spliced alignments of Oases transcripts assembled from original and error-corrected reads. (DOCX 130 kb)</p>
</caption>
</media>
</supplementary-material>
</sec>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>
<bold>Competing interests</bold>
</p>
<p>The authors declare that they have no competing interests.</p>
</fn>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>LS and LF conceived the project and evaluation. LS developed the method. Both authors prepared the manuscript. Both authors read and approved the final manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>This work was supported in part by NSF awards ABI-1159078 and ABI-1356078 to LF.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heo</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>XL</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hwu</surname>
<given-names>WM</given-names>
</name>
</person-group>
<article-title>BLESS: Bloom-filter-based Error Correction Solution for High-throughput Sequencing Reads</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>10</issue>
<fpage>1354</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu030</pub-id>
<pub-id pub-id-type="pmid">24451628</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>BFC: correcting Illumina sequencing errors</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>17</issue>
<fpage>2885</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv290</pub-id>
<pub-id pub-id-type="pmid">25953801</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Florea</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Lighter: fast and memory-efficient sequencing error correction without counting</article-title>
<source>Genome Biol</source>
<year>2014</year>
<volume>15</volume>
<issue>11</issue>
<fpage>509</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-014-0509-9</pub-id>
<pub-id pub-id-type="pmid">25398208</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Chockalingam</surname>
<given-names>SP</given-names>
</name>
<name>
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A survey of error-correction methods for next-generation sequencing</article-title>
<source>Brief Bioinformatics</source>
<year>2013</year>
<volume>14</volume>
<issue>1</issue>
<fpage>56</fpage>
<lpage>66</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbs015</pub-id>
<pub-id pub-id-type="pmid">22492192</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelley</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol</source>
<year>2010</year>
<volume>11</volume>
<issue>11</issue>
<fpage>R116</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2010-11-11-r116</pub-id>
<pub-id pub-id-type="pmid">21114842</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Scott</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Kakaradov</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Error correction of high-throughput sequencing datasets with non-uniform coverage</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>13</issue>
<fpage>i137</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr208</pub-id>
<pub-id pub-id-type="pmid">21685062</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>3</issue>
<fpage>308</fpage>
<lpage>15</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts690</pub-id>
<pub-id pub-id-type="pmid">23202746</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Puglisi</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>SHREC: a short-read error correction method</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>17</issue>
<fpage>2157</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp379</pub-id>
<pub-id pub-id-type="pmid">19542152</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salmela</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Correction of sequencing errors in a mixed set of reads</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>10</issue>
<fpage>1284</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq151</pub-id>
<pub-id pub-id-type="pmid">20378555</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Fazayeli</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Ilie</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>HiTEC: accurate error correction in high-throughput sequencing data</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>3</issue>
<fpage>295</fpage>
<lpage>302</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq653</pub-id>
<pub-id pub-id-type="pmid">21115437</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salmela</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Correcting Errors in Short Reads by Multiple Alignments</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>11</issue>
<fpage>1455</fpage>
<lpage>61</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr170</pub-id>
<pub-id pub-id-type="pmid">21471014</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Le</surname>
<given-names>HS</given-names>
</name>
<name>
<surname>Schulz</surname>
<given-names>MH</given-names>
</name>
<name>
<surname>McCauley</surname>
<given-names>BM</given-names>
</name>
<name>
<surname>Hinman</surname>
<given-names>VF</given-names>
</name>
<name>
<surname>Bar-Joseph</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Probabilistic error correction for RNA sequencing</article-title>
<source>Nucleic Acids Res</source>
<year>2013</year>
<volume>41</volume>
<issue>10</issue>
<fpage>e109</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkt215</pub-id>
<pub-id pub-id-type="pmid">23558750</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<mixed-citation publication-type="other">MacManes MD. Optimizing error correction of RNAseq reads. bioRxiv. 2015. doi:10.1101/020123.</mixed-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marçais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Griebel</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Zacher</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ribeca</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Raineri</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lacroix</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Guigó</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Modelling and simulating generic RNA-Seq experiments with the flux simulator</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
<volume>40</volume>
<issue>20</issue>
<fpage>10073</fpage>
<lpage>083</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gks666</pub-id>
<pub-id pub-id-type="pmid">22962361</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Döring</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Weese</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Rausch</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>SeqAn: An efficient, generic C++ library for sequence analysis</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>11</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-11</pub-id>
<pub-id pub-id-type="pmid">18184432</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pimentel</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions</article-title>
<source>Genome Biol</source>
<year>2013</year>
<volume>14</volume>
<issue>4</issue>
<fpage>R36</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2013-14-4-r36</pub-id>
<pub-id pub-id-type="pmid">23618408</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<mixed-citation publication-type="other">Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012; 28(8):1086–92.</mixed-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Walenz</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Florea</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Sim4db and Leaff: utilities for fast batch spliced alignment and sequence indexing</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>13</issue>
<fpage>1869</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr285</pub-id>
<pub-id pub-id-type="pmid">21551146</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<mixed-citation publication-type="other">Li W, Feng J, Jiang T. IsoLasso: A LASSO regression approach to RNA-seq based transcriptome assembly. J Comput Biol. 2011; 18(11):1693–707.</mixed-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Haas</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Papanicolaou</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Yassour</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Grabherr</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Blood</surname>
<given-names>PD</given-names>
</name>
<name>
<surname>Bowden</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis</article-title>
<source>Nat Protocols</source>
<year>2013</year>
<volume>8</volume>
<fpage>1494</fpage>
<lpage>512</lpage>
<pub-id pub-id-type="doi">10.1038/nprot.2013.084</pub-id>
<pub-id pub-id-type="pmid">23845962</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bankevich</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nurk</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Antipov</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Gurevich</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Dvorkin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kulikov</surname>
<given-names>AS</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing</article-title>
<source>J Comput Biol</source>
<year>2012</year>
<volume>19</volume>
<issue>4</issue>
<fpage>455</fpage>
<lpage>77</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2012.0021</pub-id>
<pub-id pub-id-type="pmid">22506599</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<mixed-citation publication-type="other">Nikolenko SI, Korobeynikov A, Alekseyev MA. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics. 2013; 14(Suppl 1):S7.</mixed-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Fast gapped-read alignment with Bowtie 2</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>4</issue>
<fpage>357</fpage>
<lpage>59</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id>
<pub-id pub-id-type="pmid">22388286</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gurevich</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Saveliev</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Vyahhi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Tesler</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>QUAST: quality assessment tool for genome assemblies</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>8</issue>
<fpage>1072</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt086</pub-id>
<pub-id pub-id-type="pmid">23422339</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<mixed-citation publication-type="other">Song L, Florea L. Software and exemplar data for Rcorrector. GigaScience Database. 2015. doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5524/100171">10.5524/100171</ext-link>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
