The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">An improved filtering algorithm for big read datasets and its application to single-cell assembly</title>
<author><name sortKey="Wedemeyer, Axel" sort="Wedemeyer, Axel" uniqKey="Wedemeyer A" first="Axel" last="Wedemeyer">Axel Wedemeyer</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kliemann, Lasse" sort="Kliemann, Lasse" uniqKey="Kliemann L" first="Lasse" last="Kliemann">Lasse Kliemann</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Srivastav, Anand" sort="Srivastav, Anand" uniqKey="Srivastav A" first="Anand" last="Srivastav">Anand Srivastav</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Schielke, Christian" sort="Schielke, Christian" uniqKey="Schielke C" first="Christian" last="Schielke">Christian Schielke</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Reusch, Thorsten B" sort="Reusch, Thorsten B" uniqKey="Reusch T" first="Thorsten B." last="Reusch">Thorsten B. Reusch</name>
<affiliation><nlm:aff id="Aff2"><institution-wrap><institution-id institution-id-type="ISNI">0000 0000 9056 9663</institution-id>
<institution-id institution-id-type="GRID">grid.15649.3f</institution-id>
<institution>Marine Ecology,</institution>
<institution>GEOMAR Helmholtz Centre for Ocean Research Kiel,</institution>
</institution-wrap>
Düsternbrooker Weg 20, Kiel, 24105 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Rosenstiel, Philip" sort="Rosenstiel, Philip" uniqKey="Rosenstiel P" first="Philip" last="Rosenstiel">Philip Rosenstiel</name>
<affiliation><nlm:aff id="Aff3"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Institute of Clinical Molecular Biology,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Schittenhelmstr. 12, Kiel, 24105 Germany</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">28673253</idno>
<idno type="pmc">5496428</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5496428</idno>
<idno type="RBID">PMC:5496428</idno>
<idno type="doi">10.1186/s12859-017-1724-7</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000267</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000267</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">An improved filtering algorithm for big read datasets and its application to single-cell assembly</title>
<author><name sortKey="Wedemeyer, Axel" sort="Wedemeyer, Axel" uniqKey="Wedemeyer A" first="Axel" last="Wedemeyer">Axel Wedemeyer</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kliemann, Lasse" sort="Kliemann, Lasse" uniqKey="Kliemann L" first="Lasse" last="Kliemann">Lasse Kliemann</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Srivastav, Anand" sort="Srivastav, Anand" uniqKey="Srivastav A" first="Anand" last="Srivastav">Anand Srivastav</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Schielke, Christian" sort="Schielke, Christian" uniqKey="Schielke C" first="Christian" last="Schielke">Christian Schielke</name>
<affiliation><nlm:aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Reusch, Thorsten B" sort="Reusch, Thorsten B" uniqKey="Reusch T" first="Thorsten B." last="Reusch">Thorsten B. Reusch</name>
<affiliation><nlm:aff id="Aff2"><institution-wrap><institution-id institution-id-type="ISNI">0000 0000 9056 9663</institution-id>
<institution-id institution-id-type="GRID">grid.15649.3f</institution-id>
<institution>Marine Ecology,</institution>
<institution>GEOMAR Helmholtz Centre for Ocean Research Kiel,</institution>
</institution-wrap>
Düsternbrooker Weg 20, Kiel, 24105 Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Rosenstiel, Philip" sort="Rosenstiel, Philip" uniqKey="Rosenstiel P" first="Philip" last="Rosenstiel">Philip Rosenstiel</name>
<affiliation><nlm:aff id="Aff3"><institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Institute of Clinical Molecular Biology,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Schittenhelmstr. 12, Kiel, 24105 Germany</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their <italic>k</italic>
-mers.</p>
</sec>
<sec><title>Methods</title>
<p>We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new algorithmic feature is the use of phred quality scores together with a detailed analysis of the <italic>k</italic>-mer counts to decide which reads to keep.</p>
</sec>
<sec><title>Results</title>
<p>We qualify and recommend parameters for our new read filtering algorithm. Guided by these parameters, we remove a median of 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm.</p>
</sec>
<sec><title>Conclusions</title>
<p>We conclude that read filtering is a practical and efficient method for reducing read data and for speeding up the assembly process. This applies not only to single-cell assembly, as shown in this paper, but also to other projects with high mean coverage datasets, such as metagenomic sequencing projects.</p>
<p>Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at <ext-link ext-link-type="uri" xlink:href="https://git.informatik.uni-kiel.de/axw/Bignorm">https://git.informatik.uni-kiel.de/axw/Bignorm</ext-link>
.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-017-1724-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Del Fabbro, C" uniqKey="Del Fabbro C">C Del Fabbro</name>
</author>
<author><name sortKey="Scalabrin, S" uniqKey="Scalabrin S">S Scalabrin</name>
</author>
<author><name sortKey="Morgante, M" uniqKey="Morgante M">M Morgante</name>
</author>
<author><name sortKey="Giorgi, Fm" uniqKey="Giorgi F">FM Giorgi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Martin, M" uniqKey="Martin M">M Martin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Prezza, N" uniqKey="Prezza N">N Prezza</name>
</author>
<author><name sortKey="Del Fabbro, C" uniqKey="Del Fabbro C">C Del Fabbro</name>
</author>
<author><name sortKey="Vezzi, F" uniqKey="Vezzi F">F Vezzi</name>
</author>
<author><name sortKey="De Paoli, E" uniqKey="De Paoli E">E De Paoli</name>
</author>
<author><name sortKey="Policriti, A" uniqKey="Policriti A">A Policriti</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cox, Mp" uniqKey="Cox M">MP Cox</name>
</author>
<author><name sortKey="Peterson, Da" uniqKey="Peterson D">DA Peterson</name>
</author>
<author><name sortKey="Biggs, Pj" uniqKey="Biggs P">PJ Biggs</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Smeds, L" uniqKey="Smeds L">L Smeds</name>
</author>
<author><name sortKey="Kunstner, A" uniqKey="Kunstner A">A Künstner</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Schmieder, R" uniqKey="Schmieder R">R Schmieder</name>
</author>
<author><name sortKey="Edwards, R" uniqKey="Edwards R">R Edwards</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Bolger, Am" uniqKey="Bolger A">AM Bolger</name>
</author>
<author><name sortKey="Lohse, M" uniqKey="Lohse M">M Lohse</name>
</author>
<author><name sortKey="Usadel, B" uniqKey="Usadel B">B Usadel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Alic, As" uniqKey="Alic A">AS Alic</name>
</author>
<author><name sortKey="Ruzafa, D" uniqKey="Ruzafa D">D Ruzafa</name>
</author>
<author><name sortKey="Dopazo, J" uniqKey="Dopazo J">J Dopazo</name>
</author>
<author><name sortKey="Blanquer, I" uniqKey="Blanquer I">I Blanquer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author><name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
<author><name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author><name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author><name sortKey="Howe, Ac" uniqKey="Howe A">AC Howe</name>
</author>
<author><name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cormode, G" uniqKey="Cormode G">G Cormode</name>
</author>
<author><name sortKey="Muthukrishnan, S" uniqKey="Muthukrishnan S">S Muthukrishnan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dietzfelbinger, M" uniqKey="Dietzfelbinger M">M Dietzfelbinger</name>
</author>
<author><name sortKey="Hagerup, T" uniqKey="Hagerup T">T Hagerup</name>
</author>
<author><name sortKey="Katajainen, J" uniqKey="Katajainen J">J Katajainen</name>
</author>
<author><name sortKey="Penttonen, M" uniqKey="Penttonen M">M Penttonen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bankevich, A" uniqKey="Bankevich A">A Bankevich</name>
</author>
<author><name sortKey="Nurk, S" uniqKey="Nurk S">S Nurk</name>
</author>
<author><name sortKey="Antipov, D" uniqKey="Antipov D">D Antipov</name>
</author>
<author><name sortKey="Gurevich, Aa" uniqKey="Gurevich A">AA Gurevich</name>
</author>
<author><name sortKey="Dvorkin, M" uniqKey="Dvorkin M">M Dvorkin</name>
</author>
<author><name sortKey="Kulikov, As" uniqKey="Kulikov A">AS Kulikov</name>
</author>
<author><name sortKey="Lesin, Vm" uniqKey="Lesin V">VM Lesin</name>
</author>
<author><name sortKey="Nikolenko, Si" uniqKey="Nikolenko S">SI Nikolenko</name>
</author>
<author><name sortKey="Pham, S" uniqKey="Pham S">S Pham</name>
</author>
<author><name sortKey="Prjibelski, Ad" uniqKey="Prjibelski A">AD Prjibelski</name>
</author>
<author><name sortKey="Pyshkin, Av" uniqKey="Pyshkin A">AV Pyshkin</name>
</author>
<author><name sortKey="Sirotkin, Av" uniqKey="Sirotkin A">AV Sirotkin</name>
</author>
<author><name sortKey="Vyahhi, N" uniqKey="Vyahhi N">N Vyahhi</name>
</author>
<author><name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
<author><name sortKey="Alekseyev, Ma" uniqKey="Alekseyev M">MA Alekseyev</name>
</author>
<author><name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author><name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
<author><name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author><name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chitsaz, H" uniqKey="Chitsaz H">H Chitsaz</name>
</author>
<author><name sortKey="Yee Greenbaum Joyclyn, L" uniqKey="Yee Greenbaum Joyclyn L">L Yee-Greenbaum Joyclyn</name>
</author>
<author><name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
<author><name sortKey="Dupont, Cl" uniqKey="Dupont C">CL Dupont</name>
</author>
<author><name sortKey="Badger, Jh" uniqKey="Badger J">JH Badger</name>
</author>
<author><name sortKey="Novotny, M" uniqKey="Novotny M">M Novotny</name>
</author>
<author><name sortKey="Rusch, Db" uniqKey="Rusch D">DB Rusch</name>
</author>
<author><name sortKey="Fraser, Lj" uniqKey="Fraser L">LJ Fraser</name>
</author>
<author><name sortKey="Gormley, Na" uniqKey="Gormley N">NA Gormley</name>
</author>
<author><name sortKey="Schulz Trieglaff, O" uniqKey="Schulz Trieglaff O">O Schulz-Trieglaff</name>
</author>
<author><name sortKey="Smith, Gp" uniqKey="Smith G">GP Smith</name>
</author>
<author><name sortKey="Evers, Dj" uniqKey="Evers D">DJ Evers</name>
</author>
<author><name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author><name sortKey="Lasken, Rs" uniqKey="Lasken R">RS Lasken</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Quinlan, Ar" uniqKey="Quinlan A">AR Quinlan</name>
</author>
<author><name sortKey="Hall, Im" uniqKey="Hall I">IM Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Gurevich, A" uniqKey="Gurevich A">A Gurevich</name>
</author>
<author><name sortKey="Saveliev, V" uniqKey="Saveliev V">V Saveliev</name>
</author>
<author><name sortKey="Vyahhi, N" uniqKey="Vyahhi N">N Vyahhi</name>
</author>
<author><name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Earl, D" uniqKey="Earl D">D Earl</name>
</author>
<author><name sortKey="Bradnam, K" uniqKey="Bradnam K">K Bradnam</name>
</author>
<author><name sortKey="John, Js" uniqKey="John J">JS John</name>
</author>
<author><name sortKey="Darling, A" uniqKey="Darling A">A Darling</name>
</author>
<author><name sortKey="Lin, D" uniqKey="Lin D">D Lin</name>
</author>
<author><name sortKey="Fass, J" uniqKey="Fass J">J Fass</name>
</author>
<author><name sortKey="Yu, Hok" uniqKey="Yu H">HOK Yu</name>
</author>
<author><name sortKey="Buffalo, V" uniqKey="Buffalo V">V Buffalo</name>
</author>
<author><name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author><name sortKey="Diekhans, M" uniqKey="Diekhans M">M Diekhans</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Kamke, J" uniqKey="Kamke J">J Kamke</name>
</author>
<author><name sortKey="Sczyrba, A" uniqKey="Sczyrba A">A Sczyrba</name>
</author>
<author><name sortKey="Ivanova, N" uniqKey="Ivanova N">N Ivanova</name>
</author>
<author><name sortKey="Schwientek, P" uniqKey="Schwientek P">P Schwientek</name>
</author>
<author><name sortKey="Rinke, C" uniqKey="Rinke C">C Rinke</name>
</author>
<author><name sortKey="Mavromatis, K" uniqKey="Mavromatis K">K Mavromatis</name>
</author>
<author><name sortKey="Woyke, T" uniqKey="Woyke T">T Woyke</name>
</author>
<author><name sortKey="Hentschel, U" uniqKey="Hentschel U">U Hentschel</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group><journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">28673253</article-id>
<article-id pub-id-type="pmc">5496428</article-id>
<article-id pub-id-type="publisher-id">1724</article-id>
<article-id pub-id-type="doi">10.1186/s12859-017-1724-7</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>An improved filtering algorithm for big read datasets and its application to single-cell assembly</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">http://orcid.org/0000-0002-3934-3858</contrib-id>
<name><surname>Wedemeyer</surname>
<given-names>Axel</given-names>
</name>
<address><email>axw@informatik.uni-kiel.de</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Kliemann</surname>
<given-names>Lasse</given-names>
</name>
<address><email>lki@informatik.uni-kiel.de</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Srivastav</surname>
<given-names>Anand</given-names>
</name>
<address><email>asr@informatik.uni-kiel.de</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Schielke</surname>
<given-names>Christian</given-names>
</name>
<address><email>csch@informatik.uni-kiel.de</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Reusch</surname>
<given-names>Thorsten B.</given-names>
</name>
<address><email>treusch@geomar.de</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Rosenstiel</surname>
<given-names>Philip</given-names>
</name>
<address><email>admin1@ikmb.uni-kiel.de</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1"><label>1</label>
<institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Christian-Albrechts-Platz 4, Kiel, 24118 Germany</aff>
<aff id="Aff2"><label>2</label>
<institution-wrap><institution-id institution-id-type="ISNI">0000 0000 9056 9663</institution-id>
<institution-id institution-id-type="GRID">grid.15649.3f</institution-id>
<institution>Marine Ecology,</institution>
<institution>GEOMAR Helmholtz Centre for Ocean Research Kiel,</institution>
</institution-wrap>
Düsternbrooker Weg 20, Kiel, 24105 Germany</aff>
<aff id="Aff3"><label>3</label>
<institution-wrap><institution-id institution-id-type="ISNI">0000 0001 2153 9986</institution-id>
<institution-id institution-id-type="GRID">grid.9764.c</institution-id>
<institution>Institute of Clinical Molecular Biology,</institution>
<institution>Kiel University,</institution>
</institution-wrap>
Schittenhelmstr. 12, Kiel, 24105 Germany</aff>
</contrib-group>
<pub-date pub-type="epub"><day>3</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release"><day>3</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection"><year>2017</year>
</pub-date>
<volume>18</volume>
<elocation-id>324</elocation-id>
<history><date date-type="received"><day>19</day>
<month>10</month>
<year>2016</year>
</date>
<date date-type="accepted"><day>12</day>
<month>6</month>
<year>2017</year>
</date>
</history>
<permissions><copyright-statement>© The Author(s) 2017</copyright-statement>
<license license-type="OpenAccess"><license-p><bold>Open Access</bold>
 This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1"><sec><title>Background</title>
<p>For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their <italic>k</italic>
-mers.</p>
</sec>
<sec><title>Methods</title>
<p>We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new algorithmic feature is the use of phred quality scores together with a detailed analysis of the <italic>k</italic>-mer counts to decide which reads to keep.</p>
</sec>
<sec><title>Results</title>
<p>We qualify and recommend parameters for our new read filtering algorithm. Guided by these parameters, we remove a median of 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm.</p>
</sec>
<sec><title>Conclusions</title>
<p>We conclude that read filtering is a practical and efficient method for reducing read data and for speeding up the assembly process. This applies not only to single-cell assembly, as shown in this paper, but also to other projects with high mean coverage datasets, such as metagenomic sequencing projects.</p>
<p>Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at <ext-link ext-link-type="uri" xlink:href="https://git.informatik.uni-kiel.de/axw/Bignorm">https://git.informatik.uni-kiel.de/axw/Bignorm</ext-link>
.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-017-1724-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en"><title>Keywords</title>
<kwd>Read filtering</kwd>
<kwd>Read normalization</kwd>
<kwd>Bignorm</kwd>
<kwd>Diginorm</kwd>
<kwd>Single cell sequencing</kwd>
<kwd>Coverage</kwd>
</kwd-group>
<funding-group><award-group><funding-source><institution>German Research Foundation (DFG)</institution>
</funding-source>
<award-id>SR7/15-1</award-id>
</award-group>
</funding-group>
<custom-meta-group><custom-meta><meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2017</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body><sec id="Sec1"><title>Background</title>
<p>Next generation sequencing systems (such as the Illumina platform) tend to produce an enormous amount of data — especially when used for single-cell or metagenomic protocols — of which only a small fraction is essential for the assembly of the genome. It is thus advisable to filter that data prior to assembly.</p>
<p>A coverage of about 20 for each position of the genome has been empirically determined as optimal for a successful assembly of the genome [<xref ref-type="bibr" rid="CR1">1</xref>
]. On the other hand, in many setups, the coverage for a large number of loci is much higher than 20, often rising up to tens or hundreds of thousands, especially for single-cell or metagenomic protocols (see Table <xref rid="Tab1" ref-type="table">1</xref>
, “max” column for the maximal coverage of the datasets that we use in our experiments). In order to speed up the assembly process (or, in extreme cases, to make it possible in the first place, given certain restrictions on available RAM and/or time), a sub-dataset of the sequencing dataset is to be determined such that an assembly based on this sub-dataset works as well as possible. For a formal description of the problem, see Additional file <xref rid="MOESM1" ref-type="media">1</xref>: Section S1.
<table-wrap id="Tab1"><label>Table 1</label>
<caption><p>Coverage statistics for Bignorm with <italic>Q</italic>
<sub>0</sub>
=20, Diginorm, and the raw datasets</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Dataset</th>
<th align="left">Algorithm</th>
<th align="left"><inline-formula id="IEq1"><alternatives><tex-math id="M1">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {P}10$\end{document}</tex-math>
<mml:math id="M2"><mml:mi mathvariant="script">P</mml:mi>
<mml:mn>10</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">Mean</th>
<th align="left"><inline-formula id="IEq2"><alternatives><tex-math id="M3">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {P}90$\end{document}</tex-math>
<mml:math id="M4"><mml:mi mathvariant="script">P</mml:mi>
<mml:mn>90</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">Max</th>
</tr>
</thead>
<tbody><tr><td align="left">Aceto</td>
<td align="left">Bignorm</td>
<td align="left">6</td>
<td align="left">132</td>
<td align="left">216</td>
<td align="left">6801</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">7</td>
<td align="left">171</td>
<td align="left">295</td>
<td align="left">12,020</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">15</td>
<td align="left">9562</td>
<td align="left">17,227</td>
<td align="left">551,000</td>
</tr>
<tr><td align="left">Alphaproteo</td>
<td align="left">Bignorm</td>
<td align="left">10</td>
<td align="left">43</td>
<td align="left">92</td>
<td align="left">884</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">7</td>
<td align="left">173</td>
<td align="left">481</td>
<td align="left">6681</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">25</td>
<td align="left">5302</td>
<td align="left">14,070</td>
<td align="left">303,200</td>
</tr>
<tr><td align="left">Arco</td>
<td align="left">Bignorm</td>
<td align="left">1</td>
<td align="left">98</td>
<td align="left">54</td>
<td align="left">2103</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">1</td>
<td align="left">362</td>
<td align="left">200</td>
<td align="left">6114</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">3</td>
<td align="left">10,850</td>
<td align="left">4091</td>
<td align="left">220,600</td>
</tr>
<tr><td align="left">Arma</td>
<td align="left">Bignorm</td>
<td align="left">8</td>
<td align="left">23</td>
<td align="left">32</td>
<td align="left">358</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">8</td>
<td align="left">79</td>
<td align="left">141</td>
<td align="left">5000</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">17</td>
<td align="left">629</td>
<td align="left">1118</td>
<td align="left">31,260</td>
</tr>
<tr><td align="left">ASZN2</td>
<td align="left">Bignorm</td>
<td align="left">40</td>
<td align="left">70</td>
<td align="left">83</td>
<td align="left">2012</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">23</td>
<td align="left">143</td>
<td align="left">354</td>
<td align="left">3437</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">50</td>
<td align="left">1738</td>
<td align="left">4784</td>
<td align="left">43,840</td>
</tr>
<tr><td align="left">Bacteroides</td>
<td align="left">Bignorm</td>
<td align="left">3</td>
<td align="left">74</td>
<td align="left">90</td>
<td align="left">6768</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">3</td>
<td align="left">123</td>
<td align="left">205</td>
<td align="left">7933</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">7</td>
<td align="left">6051</td>
<td align="left">8127</td>
<td align="left">570,900</td>
</tr>
<tr><td align="left">Caldi</td>
<td align="left">Bignorm</td>
<td align="left">25</td>
<td align="left">63</td>
<td align="left">110</td>
<td align="left">786</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">15</td>
<td align="left">67</td>
<td align="left">135</td>
<td align="left">3584</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">27</td>
<td align="left">1556</td>
<td align="left">3643</td>
<td align="left">33,530</td>
</tr>
<tr><td align="left">Caulo</td>
<td align="left">Bignorm</td>
<td align="left">7</td>
<td align="left">228</td>
<td align="left">216</td>
<td align="left">10,400</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">8</td>
<td align="left">362</td>
<td align="left">491</td>
<td align="left">35,520</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">8</td>
<td align="left">10,220</td>
<td align="left">9737</td>
<td align="left">464,300</td>
</tr>
<tr><td align="left">Chloroflexi</td>
<td align="left">Bignorm</td>
<td align="left">8</td>
<td align="left">72</td>
<td align="left">101</td>
<td align="left">2822</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">9</td>
<td align="left">412</td>
<td align="left">878</td>
<td align="left">20,850</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">9</td>
<td align="left">5612</td>
<td align="left">7741</td>
<td align="left">316,900</td>
</tr>
<tr><td align="left">Crenarch</td>
<td align="left">Bignorm</td>
<td align="left">8</td>
<td align="left">104</td>
<td align="left">159</td>
<td align="left">3770</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">10</td>
<td align="left">560</td>
<td align="left">1285</td>
<td align="left">29,720</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">10</td>
<td align="left">8086</td>
<td align="left">14,987</td>
<td align="left">316,700</td>
</tr>
<tr><td align="left">Cyanobact</td>
<td align="left">Bignorm</td>
<td align="left">9</td>
<td align="left">144</td>
<td align="left">153</td>
<td align="left">5234</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">10</td>
<td align="left">756</td>
<td align="left">1450</td>
<td align="left">26,980</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">10</td>
<td align="left">9478</td>
<td align="left">11,076</td>
<td align="left">356,600</td>
</tr>
<tr><td align="left">E.coli</td>
<td align="left">Bignorm</td>
<td align="left">37</td>
<td align="left">45</td>
<td align="left">56</td>
<td align="left">234</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">50</td>
<td align="left">382</td>
<td align="left">922</td>
<td align="left">7864</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">112</td>
<td align="left">2522</td>
<td align="left">6378</td>
<td align="left">56,520</td>
</tr>
<tr><td align="left">SAR324</td>
<td align="left">Bignorm</td>
<td align="left">24</td>
<td align="left">49</td>
<td align="left">71</td>
<td align="left">1410</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">18</td>
<td align="left">53</td>
<td align="left">107</td>
<td align="left">2473</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">26</td>
<td align="left">1086</td>
<td align="left">2761</td>
<td align="left">106,000</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<sec id="Sec2"><title>Previous work</title>
<p>We briefly survey two prior approaches for read pre-processing, namely <italic>trimming</italic>
 and <italic>error correction</italic>
. Read trimming programs (see [<xref ref-type="bibr" rid="CR2">2</xref>
] for a recent review) try to cut away the low quality parts of a read (or drop reads whose overall quality is low). These algorithms can be classified into two groups: <italic>running sum</italic>
 (Cutadapt, ERNE, SolexaQA with -bwa option [<xref ref-type="bibr" rid="CR3">3</xref>
–<xref ref-type="bibr" rid="CR5">5</xref>
]) and <italic>window based</italic>
 (ConDeTri, FASTX, PRINSEQ, Sickle, SolexaQA, and Trimmomatic [<xref ref-type="bibr" rid="CR5">5</xref>
–<xref ref-type="bibr" rid="CR10">10</xref>
]). The running sum algorithms take a quality threshold <italic>Q</italic>
 as input, which is subtracted from the phred score of each base of the read. The algorithms vary with respect to the functions applied to these differences to determine the quality of a read, the direction in which the read is processed, the function’s quality threshold upon which the cutoff point is determined, and the minimum length of a read after the cutoff to be accepted.</p>
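<p>The running sum idea can be illustrated with a minimal sketch. It assumes one particular variant, modeled on the BWA-style rule: the read is scanned from its 3’ end, the difference between the threshold and each phred score is accumulated, and the read is cut where this sum is largest. The function name, the sign convention, and the stopping rule are illustrative assumptions; the individual tools differ in exactly these details.</p>

```python
def running_sum_trim(quals, threshold):
    """Return how many leading bases to keep (illustrative running-sum rule).

    quals     -- phred scores of one read, in 5' to 3' order
    threshold -- the quality threshold Q
    """
    running = 0          # accumulated (Q - phred) over the scanned 3' suffix
    best = 0             # largest running sum seen so far
    keep = len(quals)    # default: keep the whole read
    for i in range(len(quals) - 1, -1, -1):
        running += threshold - quals[i]
        if 0 > running:
            break        # the scanned suffix is good enough on balance; stop
        if running > best:
            best = running
            keep = i     # cutting here removes the worst-scoring suffix
    return keep


print(running_sum_trim([30, 30, 30, 10, 5], 20))  # 3
```

<p>A minimum-length filter, as mentioned above, would then simply reject reads for which the returned length is too small.</p>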
<p>The window based algorithms, on the other hand, first cut away the read’s 3’ or 5’ end (depending on the algorithm) where the quality is below a specified minimum quality parameter, and then determine a contiguous sequence of high quality using techniques similar to those used in the running sum algorithms.</p>
<p>All of these trimming algorithms work on a per-read basis: they read the input once and process a single read at a time. The drawback of this approach is that low-quality sequences within a read are dropped even when they are not covered by any other read of high quality. Conversely, sequences of high quality and high abundance are accepted over and over although their coverage is already sufficient, which yields higher memory usage than necessary.</p>
<p>Most of the error correction programs (see [<xref ref-type="bibr" rid="CR11">11</xref>
] for a recent review) read the input twice: a first pass gathers statistics about the data (often <italic>k</italic>
-mer counts) which in a second pass are used to identify and correct errors. Some programs trim reads which cannot be corrected. Again, coverage is not a concern: reads which seem to be correct or which can be corrected are always accepted. According to [<xref ref-type="bibr" rid="CR11">11</xref>
], currently the best known and most used error correction program is Quake [<xref ref-type="bibr" rid="CR12">12</xref>]. Its algorithm is based on two assumptions: 
<list list-type="bullet"><list-item><p>“For sufficiently large <italic>k</italic>
, almost all single-base errors alter <italic>k</italic>
-mers overlapping the error to versions that do not exist in the genome. Therefore, <italic>k</italic>
-mers with low coverage, particularly those occurring just once or twice, usually represent sequencing errors.”</p>
</list-item>
<list-item><p>Errors follow a Gamma distribution, whereas true <italic>k</italic>
-mers are distributed as per a combination of the Normal and the Zeta distribution.</p>
</list-item>
</list>
</p>
<p>In the first pass of the program, a score based on the phred quality scores of the individual nucleotides is computed for each <italic>k</italic>
-mer. After this, Quake computes a <italic>coverage cutoff</italic>
 value, that is, the local minimum of the <italic>k</italic>
-mer spectrum between the Gamma and the Normal maxima. All <italic>k</italic>
-mers having a score higher than the coverage cutoff are considered to be correct (<italic>trusted</italic>
 or <italic>solid</italic>
 in error correction terminology), the others are assumed to be erroneous. In a second pass, Quake reads the input again and tries to replace erroneous <italic>k</italic>
-mers by trusted ones using a maximum likelihood approach. Reads which cannot be corrected are optionally trimmed or dumped.</p>
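<p>The notion of a coverage cutoff as a local minimum of the spectrum can be made concrete with a deliberately simplified sketch. It assumes the spectrum is given as a plain histogram and merely walks down the initial error peak until the counts stop decreasing; it does not fit any distributions, and the function name and input layout are hypothetical.</p>

```python
def coverage_cutoff(spectrum):
    """Return the first local minimum of a k-mer spectrum.

    spectrum[c] is the number of distinct k-mers with coverage (or score) c;
    index 0 is unused. This is a simplification for illustration only.
    """
    c = 1
    # Walk down the error peak while the histogram keeps decreasing.
    while len(spectrum) - 1 > c and spectrum[c] > spectrum[c + 1]:
        c += 1
    return c


# Error peak at coverage 1, valley at 4, genomic peak around 7.
spectrum = [0, 900, 300, 80, 20, 35, 90, 160, 120, 40]
cutoff = coverage_cutoff(spectrum)
print(cutoff)  # 4
```

<p>In this toy spectrum, entries above the returned value would be treated as trusted and the remaining ones as candidates for correction.</p>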
<p>However, reducing the data volume is not the main goal of error correctors (in particular, they do not pay attention to excessive coverage), so they cannot replace the filtering approaches described next.</p>
<p>Brown et al. invented an algorithm named <italic>Diginorm</italic>
 [<xref ref-type="bibr" rid="CR1">1</xref>
, <xref ref-type="bibr" rid="CR13">13</xref>
] for read filtering that rejects or accepts reads based on the abundance of their <italic>k</italic>
-mers. The name <italic>Diginorm</italic>
 is a short form for <italic>digital normalization</italic>
: the goal is to normalize the coverage over all loci, using a computer algorithm after sequencing. The idea is to remove those reads from the input which mainly consist of <italic>k</italic>
-mers that have already been observed many times in other reads. Diginorm processes reads one by one, splits them into <italic>k</italic>
-mers, and counts these <italic>k</italic>
-mers.</p>
<p>To save RAM, Diginorm does not track these counts exactly, but instead keeps estimates using the count-min sketch (CMS [<xref ref-type="bibr" rid="CR14">14</xref>
], see Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Section S1.2 for a formal description). A read is accepted if the median of its <italic>k</italic>
-mer counts is below a fixed threshold, usually 20. It was demonstrated that successful assemblies are still possible after Diginorm removed the majority of the data.</p>
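<p>The median-based acceptance rule on top of a count-min sketch can be summarized in a small, self-contained sketch. Everything in it is illustrative: the class and function names, the BLAKE2-based hashing, and the tiny default sizes are assumptions and are not taken from the Diginorm implementation. It also assumes that counts are updated only for accepted reads and it ignores reverse complements.</p>

```python
import hashlib
from statistics import median


class CountMinSketch:
    """depth rows of width counters; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted hash per row (hypothetical choice of hash function).
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._columns(item):
            self.rows[row][col] += 1

    def estimate(self, item):
        # The minimum over all rows is the count-min estimate.
        return min(self.rows[row][col] for row, col in self._columns(item))


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def median_filter(reads, k=4, threshold=2, width=1024, depth=4):
    """Accept a read if the median estimated count of its k-mers is below threshold."""
    sketch = CountMinSketch(width, depth)
    accepted = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if threshold > median(sketch.estimate(w) for w in words):
            accepted.append(read)
            for w in words:
                sketch.add(w)
    return accepted


reads = ["ACGTACGTAC"] * 5 + ["TTTTGGGGCC"]
print(median_filter(reads))
```

<p>In this toy input, the repeated read is accepted once and its later copies are rejected because their <italic>k</italic>-mers are already abundant, while the read with unseen <italic>k</italic>-mers is still accepted.</p>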
</sec>
<sec id="Sec3"><title>Our algorithm — Bignorm</title>
<p>Diginorm is a pioneering work. However, it does not cover the following points, which are important from a biological or computational point of view. We consider them the algorithmic innovations of our work: 
<list list-type="simple"><list-item><label>(i)</label>
<p>We incorporate the important phred quality score into the decision whether to accept or to reject a read, using a quality threshold. This allows a tuning of the filtering process towards high-quality assemblies by using different thresholds.</p>
</list-item>
<list-item><label>(ii)</label>
<p>When deciding whether to accept or to reject a read, we do a detailed analysis of the numbers in the count vectors. Diginorm merely considers their medians.</p>
</list-item>
<list-item><label>(iii)</label>
<p>We offer a better handling of the N case, that is, when the sequencing machine could not determine a particular nucleotide. Diginorm simply converts all N to A, which can lead to false <italic>k</italic>
-mer counts.</p>
</list-item>
<list-item><label>(iv)</label>
<p>We provide a substantially faster implementation. For example, we include fast hashing functions (see [<xref ref-type="bibr" rid="CR15">15</xref>
, <xref ref-type="bibr" rid="CR16">16</xref>
]) for counting <italic>k</italic>
-mers through the count-min sketch data structure (CMS), and we use the C programming language and OpenMP.</p>
</list-item>
</list>
</p>
<p>A technical description of our algorithm, called <italic>Bignorm</italic>
, is given in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Section S1.3, which may be of particular interest to computer scientists and mathematicians working in this area.</p>
</sec>
</sec>
<sec id="Sec4"><title>Methods</title>
<sec id="Sec5"><title>Experimental setup</title>
<p>For the experimental evaluation, we collected the following datasets. We use two single-cell datasets from UC San Diego, one from the group of Ute Hentschel (now GEOMAR Kiel), and 10 datasets from the JGI Genome Portal. The datasets from JGI were selected as follows. On the JGI Genome Portal [<xref ref-type="bibr" rid="CR17">17</xref>], we used “single cell” as search term. We narrowed the results down to datasets with all of the following characteristics: 
<list list-type="bullet"><list-item><p>status “complete”;</p>
</list-item>
<list-item><p>containing read data <italic>and</italic>
 an assembly in the download section;</p>
</list-item>
<list-item><p>aligning the reads to the assembly using Bowtie 2 [<xref ref-type="bibr" rid="CR18">18</xref>
] yields an “overall alignment rate” of more than 70<italic>%</italic>
.</p>
</list-item>
</list>
</p>
<p>From those datasets, we arbitrarily selected one per species, until we had a collection of 10 datasets. We refer to each combination of species and selected dataset as a <italic>case</italic>
 in the following. In total, we have 13 cases; the details are given in Table <xref rid="Tab2" ref-type="table">2</xref>.
<table-wrap id="Tab2"><label>Table 2</label>
<caption><p>Selected species and datasets (Cases)</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Short name</th>
<th align="left">Species/Description</th>
<th align="left">Source</th>
<th align="left">URL</th>
</tr>
</thead>
<tbody><tr><td align="left">ASZN2</td>
<td align="left">Candidatus Poribacteria sp. WGA-4E_FD</td>
<td align="left">Hentschel Group [<xref ref-type="bibr" rid="CR27">27</xref>
]</td>
<td align="left">[<xref ref-type="bibr" rid="CR28">28</xref>
]</td>
</tr>
<tr><td align="left">Aceto</td>
<td align="left">Acetothermia bacterium JGI MDM2 LHC4sed-1-H19</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR29">29</xref>
]</td>
</tr>
<tr><td align="left">Alphaproteo</td>
<td align="left">Alphaproteobacteria bacterium SCGC AC-312_D23v2</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR30">30</xref>
]</td>
</tr>
<tr><td align="left">Arco</td>
<td align="left">Arcobacter sp. SCGC AAA036-D18</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR31">31</xref>
]</td>
</tr>
<tr><td align="left">Arma</td>
<td align="left">Armatimonadetes bacterium JGI 0000077-K19</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR32">32</xref>
]</td>
</tr>
<tr><td align="left">Bacteroides</td>
<td align="left">Bacteroidetes bacVI JGI MCM14ME016</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR33">33</xref>
]</td>
</tr>
<tr><td align="left">Caldi</td>
<td align="left">Calescamantes bacterium JGI MDM2 SSWTFF-3-M19</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR34">34</xref>
]</td>
</tr>
<tr><td align="left">Caulo</td>
<td align="left">Caulobacter bacterium JGI SC39-H11</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR35">35</xref>
]</td>
</tr>
<tr><td align="left">Chloroflexi</td>
<td align="left">Chloroflexi bacterium SCGC AAA257-O03</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR36">36</xref>
]</td>
</tr>
<tr><td align="left">Crenarch</td>
<td align="left">Crenarchaeota archaeon SCGC AAA261-F05</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR37">37</xref>
]</td>
</tr>
<tr><td align="left">Cyanobact</td>
<td align="left">Cyanobacteria bacterium SCGC JGI 014-E08</td>
<td align="left">JGI Genome Portal</td>
<td align="left">[<xref ref-type="bibr" rid="CR38">38</xref>
]</td>
</tr>
<tr><td align="left">E.coli</td>
<td align="left">E.coli K-12, strain MG1655, single cell MDA, Cell one</td>
<td align="left">UC San Diego</td>
<td align="left">[<xref ref-type="bibr" rid="CR39">39</xref>
]</td>
</tr>
<tr><td align="left">SAR324</td>
<td align="left">SAR324 (Deltaproteobacteria)</td>
<td align="left">UC San Diego</td>
<td align="left">[<xref ref-type="bibr" rid="CR39">39</xref>
]</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>For each case, we analyze the results obtained with Diginorm and with Bignorm using quality parameters <italic>Q</italic>
<sub>0</sub>
∈{5,8,10,12,15,18,20,…,45}. On the one hand, we analyze data reduction, quality, and coverage. On the other hand, we study actual assemblies computed with SPAdes [<xref ref-type="bibr" rid="CR19">19</xref>
] based on the raw and filtered datasets. For comparison, we also did assemblies using IDBA_UD [<xref ref-type="bibr" rid="CR20">20</xref>
] and Velvet-SC [<xref ref-type="bibr" rid="CR21">21</xref>
] (for <italic>Q</italic>
<sub>0</sub>
=20 only). All the details are given in the next section.</p>
<p>The dimensions of the count-min sketch are fixed to <italic>m</italic>
=1,024<sup>3</sup> and <italic>t</italic>
=10, so the sketch occupied 10 GB of RAM.</p>
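<p>The stated memory footprint can be checked with a short calculation. The sketch below assumes one-byte counters, which is not stated in this section; under that assumption, 10 GB spread over <italic>t</italic>=10 rows corresponds to 1,024<sup>3</sup> counters per row.</p>

```python
def sketch_bytes(width, depth, bytes_per_counter=1):
    """Memory of a count-min sketch with depth rows of width counters.

    bytes_per_counter=1 is an assumption made for this plausibility check.
    """
    return width * depth * bytes_per_counter


GIB = 1024 ** 3
print(sketch_bytes(1024 ** 3, 10) / GIB)  # 10.0
```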
</sec>
</sec>
<sec id="Sec6" sec-type="results"><title>Results</title>
<p>For our analysis, we mainly considered percentiles and quartiles of measured parameters. The <italic>i</italic>
th quartile is denoted by <inline-formula id="IEq3"><alternatives><tex-math id="M5">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}i$\end{document}</tex-math>
<mml:math id="M6"><mml:mi mathvariant="script">Qi</mml:mi>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where we use <inline-formula id="IEq4"><alternatives><tex-math id="M7">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}0$\end{document}</tex-math>
<mml:math id="M8"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>0</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
 for the minimum, <inline-formula id="IEq5"><alternatives><tex-math id="M9">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}2$\end{document}</tex-math>
<mml:math id="M10"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>2</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
 for the median, and <inline-formula id="IEq6"><alternatives><tex-math id="M11">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}4$\end{document}</tex-math>
<mml:math id="M12"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>4</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
 for the maximum. The <italic>i</italic>
th percentile is denoted by <inline-formula id="IEq7"><alternatives><tex-math id="M13">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {P}i$\end{document}</tex-math>
<mml:math id="M14"><mml:mi mathvariant="script">Pi</mml:mi>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
; we often use the 10th percentile <inline-formula id="IEq8"><alternatives><tex-math id="M15">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {P}10$\end{document}</tex-math>
<mml:math id="M16"><mml:mi mathvariant="script">P</mml:mi>
<mml:mn>10</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
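<p>For readers who want to reproduce such statistics, the following sketch computes percentiles and quartiles with linear interpolation between order statistics. The interpolation convention and the function names are assumptions made here for illustration; the text does not state which convention was used.</p>

```python
def percentile(values, p):
    """p-th percentile (p between 0 and 100) using linear interpolation."""
    data = sorted(values)
    if not data:
        raise ValueError("percentile of empty data")
    rank = (len(data) - 1) * p / 100      # fractional index into sorted data
    lower = int(rank)
    upper = min(lower + 1, len(data) - 1)
    frac = rank - lower
    return data[lower] + (data[upper] - data[lower]) * frac


def quartile(values, i):
    """i-th quartile: 0 is the minimum, 2 the median, 4 the maximum."""
    return percentile(values, 25 * i)


sample = [1, 2, 3, 4, 5]
print(quartile(sample, 0), quartile(sample, 2), quartile(sample, 4))
```

<p>With this convention, the 0th and 4th quartiles coincide with the minimum and maximum of the sample, matching the notation introduced above.</p>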
<sec id="Sec7"><title>Number of accepted reads</title>
<p>Statistics for the number of accepted reads are given as a box plot in Fig. <xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">a</xref>
. This plot is constructed as follows. Each of the blue boxes corresponds to Bignorm with a particular <italic>Q</italic>
<sub>0</sub>
, while Diginorm is represented as the wide orange box in the background (recall that Diginorm does not consider quality values). Note that the “whiskers” of Diginorm’s box are shown as light-orange areas. For each box, for each case the raw dataset is filtered using the algorithm and algorithmic parameters corresponding to that box, and the percentage of the accepted reads is taken into consideration. For example, if the top of a box (which corresponds to the 3rd quartile, also denoted <inline-formula id="IEq9"><alternatives><tex-math id="M17">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}3$\end{document}</tex-math>
<mml:math id="M18"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>3</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
) gives the value <italic>x</italic>
<italic>%</italic>
, then we know that for 75<italic>%</italic>
 of the cases, <italic>x</italic>
<italic>%</italic> or less of the reads were accepted using the algorithm and algorithmic parameters corresponding to this box.
<fig id="Fig1"><label>Fig. 1</label>
<caption><p>Box plots showing reduction and quality statistics. <bold>a</bold>
 Percentage of accepted reads (i.e. reads kept) over all datasets. <bold>b</bold>
 Mean quality values of the accepted reads over all datasets</p>
</caption>
<graphic xlink:href="12859_2017_1724_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>There are two prominent outliers: one for Diginorm with value ≈29<italic>%</italic>
 (shown as the red line at the top) and one for Bignorm for <italic>Q</italic>
<sub>0</sub>
=5 with value ≈26<italic>%</italic>
. In both cases, the Arma dataset is responsible, which is the dataset with the worst mean phred score and the strongest decline of the phred score over the read length (see Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Section S4 for more information and per base sequence quality plots). This suggests that the high rate of reads kept is caused by a high error rate of the dataset. For 15≤<italic>Q</italic>
<sub>0</sub>
, even Bignorm’s outliers fall below Diginorm’s median, and for 18≤<italic>Q</italic>
<sub>0</sub>
 Bignorm keeps less than 5<italic>%</italic>
 of the reads for at least 75<italic>%</italic>
 of the datasets. In the range 20≤<italic>Q</italic>
<sub>0</sub>
≤25, Bignorm delivers similar results for the different values of <italic>Q</italic>
<sub>0</sub>
, and the gain in reduction for larger <italic>Q</italic>
<sub>0</sub>
 is small up to <italic>Q</italic>
<sub>0</sub>
=32. For even larger <italic>Q</italic>
<sub>0</sub>
, there is another jump in reduction, but we will see that coverage and the quality of the assembly suffer too much in that range. We conjecture that in the range 18≤<italic>Q</italic>
<sub>0</sub>
≤32, we remove most of the actual errors, whereas for larger <italic>Q</italic>
<sub>0</sub>
, we also remove useful information.</p>
</sec>
<sec id="Sec8"><title>Quality values</title>
<p>Statistics for phred quality scores in the filtered datasets are given in Fig. <xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">b</xref>
. The data was obtained using fastx_quality_stats from the FASTX Toolkit [<xref ref-type="bibr" rid="CR7">7</xref>
] on the filtered fastq files and calculating the mean phred quality scores over all read positions for each dataset. Looking at the statistics for these overall means, for 15≤<italic>Q</italic>
<sub>0</sub>
, Bignorm’s median is better than Diginorm’s maximum. For 20≤<italic>Q</italic>
<sub>0</sub>
, this effect becomes even stronger. For all values of <italic>Q</italic>
<sub>0</sub>
, Bignorm’s minimum is clearly above Diginorm’s median. Note that an increase of 10 units in the phred score means reducing the error probability by a factor of 10.</p>
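<p>The relation between phred scores and error probabilities follows from the standard definition of the phred score, <italic>Q</italic>=−10 log<sub>10</sub> <italic>p</italic>. A minimal sketch (the function name is chosen here for illustration):</p>

```python
def phred_to_error_probability(q):
    """Error probability p for a phred score q, from q = -10 * log10(p)."""
    return 10 ** (-q / 10)


p20 = phred_to_error_probability(20)   # about 1 error in 100 bases
p30 = phred_to_error_probability(30)   # about 1 error in 1000 bases
print(p20 / p30)                       # close to 10
```

<p>Note that this conversion applies to individual scores; a mean of phred scores does not translate directly into a mean error probability.</p>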
<p>In Table <xref rid="Tab3" ref-type="table">3</xref>
, we give quartiles of mean quality values for the raw datasets and Bignorm’s datasets produced with <italic>Q</italic>
<sub>0</sub>=20. Bignorm improves slightly on the raw dataset in all five quartiles.
<table-wrap id="Tab3"><label>Table 3</label>
<caption><p>Comparing quality values for the raw dataset and Bignorm with <italic>Q</italic>
<sub>0</sub>
=20</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Quartile</th>
<th align="left">Bignorm</th>
<th align="left">Raw</th>
</tr>
</thead>
<tbody><tr><td align="left"><inline-formula id="IEq10"><alternatives><tex-math id="M19">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}4$\end{document}</tex-math>
<mml:math id="M20"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>4</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
 (max)</td>
<td align="left">37.82</td>
<td align="left">37.37</td>
</tr>
<tr><td align="left"><inline-formula id="IEq11"><alternatives><tex-math id="M21">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}3$\end{document}</tex-math>
<mml:math id="M22"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>3</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq11.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">37.33</td>
<td align="left">36.52</td>
</tr>
<tr><td align="left"><inline-formula id="IEq12"><alternatives><tex-math id="M23">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}2$\end{document}</tex-math>
<mml:math id="M24"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>2</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq12.gif"></inline-graphic>
</alternatives>
</inline-formula>
 (median)</td>
<td align="left">33.77</td>
<td align="left">32.52</td>
</tr>
<tr><td align="left"><inline-formula id="IEq13"><alternatives><tex-math id="M25">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}1$\end{document}</tex-math>
<mml:math id="M26"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>1</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq13.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">31.91</td>
<td align="left">30.50</td>
</tr>
<tr><td align="left"><inline-formula id="IEq14"><alternatives><tex-math id="M27">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}0$\end{document}</tex-math>
<mml:math id="M28"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>0</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq14.gif"></inline-graphic>
</alternatives>
</inline-formula>
 (min)</td>
<td align="left">26.14</td>
<td align="left">24.34</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Of course, all this could be explained by Bignorm simply cutting away any low-quality reads. However, the data in the next section suggests that Bignorm may in fact be more careful than this.</p>
</sec>
<sec id="Sec9"><title>Coverage</title>
<p>In Fig. <xref rid="Fig2" ref-type="fig">2</xref>
, we see statistics for the coverage. The data was obtained by remapping the filtered reads onto the assembly from the JGI using Bowtie 2 and then using coverageBed from the bedtools [<xref ref-type="bibr" rid="CR22">22</xref>
] and R [<xref ref-type="bibr" rid="CR23">23</xref>
] for the statistics. In Fig. <xref rid="Fig2" ref-type="fig">2</xref>
<xref rid="Fig2" ref-type="fig">a</xref>
, the mean is considered. For 15≤<italic>Q</italic>
<sub>0</sub>
, Bignorm reduces the coverage heavily. For 20≤<italic>Q</italic>
<sub>0</sub>
, Bignorm’s <inline-formula id="IEq15"><alternatives><tex-math id="M29">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}3$\end{document}</tex-math>
<mml:math id="M30"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>3</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq15.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is below Diginorm’s <inline-formula id="IEq16"><alternatives><tex-math id="M31">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}1$\end{document}</tex-math>
<mml:math id="M32"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>1</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq16.gif"></inline-graphic>
</alternatives>
</inline-formula>
. This may raise the concern that Bignorm could create areas with insufficient coverage. However, in Fig. <xref rid="Fig2" ref-type="fig">2</xref>
<xref rid="Fig2" ref-type="fig">b</xref>
, we look at the 10th percentile (<inline-formula id="IEq17"><alternatives><tex-math id="M33">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {P}10$\end{document}</tex-math>
<mml:math id="M34"><mml:mi mathvariant="script">P</mml:mi>
<mml:mn>10</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq17.gif"></inline-graphic>
</alternatives>
</inline-formula>
) of the coverage instead of the mean. We consider this statistic an indicator of the impact of filtering on areas with low coverage. For <italic>Q</italic>
<sub>0</sub>
≤25, Bignorm’s <inline-formula id="IEq18"><alternatives><tex-math id="M35">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}3$\end{document}</tex-math>
<mml:math id="M36"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>3</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq18.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is at or above Diginorm’s maximum, and Bignorm’s minimum coincides with Diginorm’s (except for <italic>Q</italic>
<sub>0</sub>
=10, where we are slightly below). In terms of the median, both algorithms are very similar for <italic>Q</italic>
<sub>0</sub>≤25. We take all this as a strong indication that Bignorm removes reads in the right places.
<fig id="Fig2"><label>Fig. 2</label>
<caption><p>Box plots showing coverage statistics. <bold>a</bold>
 Mean coverage over all datasets. <bold>b</bold>
 10th percentile of the coverage over all datasets</p>
</caption>
<graphic xlink:href="12859_2017_1724_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
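<p>The per-dataset coverage statistics discussed above (mean, 10th percentile, 90th percentile, maximum) can be illustrated with a minimal Python sketch. It is not the pipeline used here, which relied on coverageBed and R; the helper name coverage_summary, the toy depths, and the choice of the "inclusive" percentile method are assumptions made for illustration only.</p>
<preformat>
```python
import statistics


def coverage_summary(depths):
    """Summarize per-base coverage depths (hypothetical helper, not from the paper)."""
    # statistics.quantiles with n=100 returns the 99 cut points P1..P99.
    # method="inclusive" treats the depths as the whole population (an assumption).
    pct = statistics.quantiles(depths, n=100, method="inclusive")
    return {
        "mean": statistics.mean(depths),
        "P10": pct[9],    # 10th percentile: sensitive to low-coverage areas
        "P90": pct[89],   # 90th percentile
        "max": max(depths),
    }


# Toy depths 1..101, purely illustrative (not data from the paper).
toy_depths = list(range(1, 102))
summary = coverage_summary(toy_depths)
print(summary)
```
</preformat>
<p>Computing one such summary per dataset and per filter setting would give the values that box plots like those in Fig. 2 aggregate over all datasets.</p>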
<p>For 28≤<italic>Q</italic>
<sub>0</sub>
, there is a clear drop in coverage, so we do not recommend such <italic>Q</italic>
<sub>0</sub>
 values.</p>
<p>In Table <xref rid="Tab1" ref-type="table">1</xref>
, we give coverage statistics for each dataset. The reduction compared to the raw dataset in terms of mean, <inline-formula id="IEq19"><alternatives><tex-math id="M37">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {P}90$\end{document}</tex-math>
<mml:math id="M38"><mml:mi mathvariant="script">P</mml:mi>
<mml:mn>90</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq19.gif"></inline-graphic>
</alternatives>
</inline-formula>
, and maximum is substantial. Moreover, the improvement of Bignorm over Diginorm in mean, <inline-formula id="IEq20"><alternatives><tex-math id="M39">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {P}90$\end{document}</tex-math>
<mml:math id="M40"><mml:mi mathvariant="script">P</mml:mi>
<mml:mn>90</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq20.gif"></inline-graphic>
</alternatives>
</inline-formula>
, and maximum is considerable for most datasets.</p>
</sec>
<sec id="Sec10"><title>Assessment through assemblies</title>
<p>The quality and significance of read filtering are ultimately judged by complete assemblies, which are the final “road test” for these algorithms. For each case, we run an assembly with SPAdes on the raw dataset and on the datasets filtered with Diginorm and with Bignorm for a selection of <italic>Q</italic>
<sub>0</sub>
 values. The assemblies are then analyzed using quast [<xref ref-type="bibr" rid="CR24">24</xref>
], with the assembly from the JGI as reference. Statistics for four cases are shown in Fig. <xref rid="Fig3" ref-type="fig">3</xref>. We give the quality measures N50, genomic fraction, and largest contig, and in addition the overall running time (pre-processing plus assembler wall time). Each measure is given as a percentage relative to the raw dataset.
<fig id="Fig3"><label>Fig. 3</label>
<caption><p>Assembly statistics for four selected datasets; measurements of assemblies performed on the datasets with prior filtering using Diginorm and Bignorm, relative to the results of assemblies performed on the unfiltered datasets</p>
</caption>
<graphic xlink:href="12859_2017_1724_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p>Generally, our biggest improvements are for N50 and running time. For 15≤<italic>Q</italic>
<sub>0</sub>
, Bignorm is always faster than Diginorm, for three of the four cases by a large margin. In terms of N50, for 15≤<italic>Q</italic>
<sub>0</sub>
, we observe improvements for three cases. For E. coli, Diginorm’s N50 is 100<italic>%</italic>
, which we also attain for <italic>Q</italic>
<sub>0</sub>
=20. In terms of genomic fraction and largest contig, we cannot always attain the same quality as Diginorm; the biggest deviation at <italic>Q</italic>
<sub>0</sub>
=20 is 10 percentage points for the ASZN2 case. The N50 is generally accepted as one of the most important measures, as long as the assembly represents the genome well (as measured by the genomic fraction here) [<xref ref-type="bibr" rid="CR25">25</xref>
].</p>
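<p>For readers unfamiliar with the N50 measure, the following minimal Python sketch uses the common definition: the length of the contig at which the cumulative length, taken from the longest contig downwards, first reaches half of the total assembly length. The function name n50 and the toy contig lengths are illustrative assumptions; assessment tools such as quast may additionally discard short contigs before computing the value.</p>
<preformat>
```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (illustrative helper)."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the cumulative length to at least
        # half of the assembly defines the N50.
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths (not data from the paper): total 290, half 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # prints 70, since 80 + 70 = 150 is at least 145
```
</preformat>
<p>Because a few long contigs can dominate this measure, it is read here together with the genomic fraction, as noted above.</p>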
<p>In Tables <xref rid="Tab4" ref-type="table">4</xref>
 and <xref rid="Tab5" ref-type="table">5</xref>
, we give statistics for <italic>Q</italic>
<sub>0</sub>
=20 and each dataset. In terms of genomic fraction, Bignorm is generally not as good as Diginorm. However, excluding the Aceto and Arco cases, Bignorm’s genomic fraction is still always at least 95<italic>%</italic>
. For Aceto and Arco, Bignorm misses 3.21<italic>%</italic>
 and 3.48<italic>%</italic>, respectively, of the genome in comparison to Diginorm. In 8 cases, Bignorm’s N50 is at least as good as Diginorm’s. The 5 cases where we achieved a smaller N50 are Arco, Caldi, Caulo, Crenarch, and Cyanobact.
<table-wrap id="Tab4"><label>Table 4</label>
<caption><p>Filter and assembly statistics for Bignorm with <italic>Q</italic>
<sub>0</sub>
=20, Diginorm, and the raw datasets (Part I)</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Dataset</th>
<th align="left">Algorithm</th>
<th align="left">Reads kept</th>
<th align="left">Mean phred</th>
<th align="left">Contigs</th>
<th align="left">Filter time</th>
<th align="left">SPAdes time</th>
</tr>
<tr><th align="left"></th>
<th align="left"></th>
<th align="left">in %</th>
<th align="left">score</th>
<th align="left">≥10,000</th>
<th align="left">in sec</th>
<th align="left">in sec</th>
</tr>
</thead>
<tbody><tr><td align="left">Aceto</td>
<td align="left">Bignorm</td>
<td align="left">3.16</td>
<td align="left">37.33</td>
<td align="left">1</td>
<td align="left">906</td>
<td align="left">1708</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">3.95</td>
<td align="left">27.28</td>
<td align="left">1</td>
<td align="left">3290</td>
<td align="left">4363</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">36.52</td>
<td align="left">3</td>
<td align="left"></td>
<td align="left">47,813</td>
</tr>
<tr><td align="left">Alphaproteo</td>
<td align="left">Bignorm</td>
<td align="left">3.13</td>
<td align="left">34.65</td>
<td align="left">18</td>
<td align="left">623</td>
<td align="left">420</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">7.81</td>
<td align="left">28.73</td>
<td align="left">17</td>
<td align="left">1629</td>
<td align="left">11,844</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">33.64</td>
<td align="left">17</td>
<td align="left"></td>
<td align="left">29,057</td>
</tr>
<tr><td align="left">Arco</td>
<td align="left">Bignorm</td>
<td align="left">2.20</td>
<td align="left">33.77</td>
<td align="left">4</td>
<td align="left">429</td>
<td align="left">207</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">8.76</td>
<td align="left">21.39</td>
<td align="left">6</td>
<td align="left">1410</td>
<td align="left">1385</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">32.27</td>
<td align="left">6</td>
<td align="left"></td>
<td align="left">15,776</td>
</tr>
<tr><td align="left">Arma</td>
<td align="left">Bignorm</td>
<td align="left">7.90</td>
<td align="left">28.21</td>
<td align="left">44</td>
<td align="left">240</td>
<td align="left">135</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">29.30</td>
<td align="left">21.19</td>
<td align="left">50</td>
<td align="left">588</td>
<td align="left">1743</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">26.96</td>
<td align="left">44</td>
<td align="left"></td>
<td align="left">5371</td>
</tr>
<tr><td align="left">ASZN2</td>
<td align="left">Bignorm</td>
<td align="left">5.66</td>
<td align="left">37.66</td>
<td align="left">118</td>
<td align="left">1224</td>
<td align="left">1537</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">12.62</td>
<td align="left">32.73</td>
<td align="left">130</td>
<td align="left">5125</td>
<td align="left">21,626</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">36.85</td>
<td align="left">112</td>
<td align="left"></td>
<td align="left">47,859</td>
</tr>
<tr><td align="left">Bacteroides</td>
<td align="left">Bignorm</td>
<td align="left">2.85</td>
<td align="left">37.47</td>
<td align="left">6</td>
<td align="left">653</td>
<td align="left">3217</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">4.94</td>
<td align="left">27.64</td>
<td align="left">5</td>
<td align="left">2124</td>
<td align="left">3668</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">37.25</td>
<td align="left">9</td>
<td align="left"></td>
<td align="left">32,409</td>
</tr>
<tr><td align="left">Caldi</td>
<td align="left">Bignorm</td>
<td align="left">3.97</td>
<td align="left">37.82</td>
<td align="left">41</td>
<td align="left">842</td>
<td align="left">455</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">5.61</td>
<td align="left">30.67</td>
<td align="left">36</td>
<td align="left">1838</td>
<td align="left">793</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">37.37</td>
<td align="left">38</td>
<td align="left"></td>
<td align="left">7563</td>
</tr>
<tr><td align="left">Caulo</td>
<td align="left">Bignorm</td>
<td align="left">2.40</td>
<td align="left">36.95</td>
<td align="left">10</td>
<td align="left">679</td>
<td align="left">712</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">4.70</td>
<td align="left">25.16</td>
<td align="left">9</td>
<td align="left">2584</td>
<td align="left">765</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">36.01</td>
<td align="left">13</td>
<td align="left"></td>
<td align="left">18,497</td>
</tr>
<tr><td align="left">Chloroflexi</td>
<td align="left">Bignorm</td>
<td align="left">1.40</td>
<td align="left">31.91</td>
<td align="left">32</td>
<td align="left">694</td>
<td align="left">134</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">9.70</td>
<td align="left">18.91</td>
<td align="left">33</td>
<td align="left">2304</td>
<td align="left">1852</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">30.50</td>
<td align="left">34</td>
<td align="left"></td>
<td align="left">15,108</td>
</tr>
<tr><td align="left">Crenarch</td>
<td align="left">Bignorm</td>
<td align="left">1.46</td>
<td align="left">33.18</td>
<td align="left">19</td>
<td align="left">1107</td>
<td align="left">790</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">9.72</td>
<td align="left">19.80</td>
<td align="left">18</td>
<td align="left">2931</td>
<td align="left">3754</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">31.49</td>
<td align="left">26</td>
<td align="left"></td>
<td align="left">20,590</td>
</tr>
<tr><td align="left">Cyanobact</td>
<td align="left">Bignorm</td>
<td align="left">1.65</td>
<td align="left">30.45</td>
<td align="left">12</td>
<td align="left">679</td>
<td align="left">450</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">11.30</td>
<td align="left">17.58</td>
<td align="left">13</td>
<td align="left">1487</td>
<td align="left">1343</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">28.49</td>
<td align="left">13</td>
<td align="left"></td>
<td align="left">9417</td>
</tr>
<tr><td align="left">E. coli</td>
<td align="left">Bignorm</td>
<td align="left">1.91</td>
<td align="left">26.14</td>
<td align="left">67</td>
<td align="left">2279</td>
<td align="left">598</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">17.03</td>
<td align="left">19.34</td>
<td align="left">63</td>
<td align="left">9105</td>
<td align="left">3995</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">24.34</td>
<td align="left">64</td>
<td align="left"></td>
<td align="left">16,706</td>
</tr>
<tr><td align="left">SAR324</td>
<td align="left">Bignorm</td>
<td align="left">4.34</td>
<td align="left">33.05</td>
<td align="left">55</td>
<td align="left">1222</td>
<td align="left">708</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">4.69</td>
<td align="left">23.58</td>
<td align="left">52</td>
<td align="left">3706</td>
<td align="left">3085</td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left"></td>
<td align="left">32.52</td>
<td align="left">51</td>
<td align="left"></td>
<td align="left">26,237</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab5"><label>Table 5</label>
<caption><p>Filter and assembly statistics for Bignorm with <italic>Q</italic>
<sub>0</sub>
=20, Diginorm, and the raw datasets (Part II)</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Dataset</th>
<th align="left">Algorithm</th>
<th align="left" colspan="3">N50</th>
<th align="left" colspan="3">Longest contig length</th>
<th align="left" colspan="3">Genomic fraction</th>
<th align="left" colspan="3">Misassembled contig length</th>
</tr>
<tr><th align="left"></th>
<th align="left"></th>
<th align="left">abs</th>
<th align="left">% of raw</th>
<th align="left">% of Diginorm</th>
<th align="left">abs</th>
<th align="left">% of raw</th>
<th align="left">% of Diginorm</th>
<th align="left">abs</th>
<th align="left">% of raw</th>
<th align="left">% of Diginorm</th>
<th align="left">abs</th>
<th align="left">% of raw</th>
<th align="left">% of Diginorm</th>
</tr>
</thead>
<tbody><tr><td align="left">Aceto</td>
<td align="left">Bignorm</td>
<td align="left">2324</td>
<td align="left">79</td>
<td align="left">105</td>
<td align="left">11,525</td>
<td align="left">98</td>
<td align="left">100</td>
<td align="left">91</td>
<td align="left">97</td>
<td align="left">97</td>
<td align="left">52,487</td>
<td align="left">148</td>
<td align="left">178</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">2216</td>
<td align="left">76</td>
<td align="left"></td>
<td align="left">11,525</td>
<td align="left">98</td>
<td align="left"></td>
<td align="left">94</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">29,539</td>
<td align="left">84</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">2935</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">11,772</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">94</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">35,351</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Alphaproteo</td>
<td align="left">Bignorm</td>
<td align="left">11,750</td>
<td align="left">94</td>
<td align="left">115</td>
<td align="left">43,977</td>
<td align="left">91</td>
<td align="left">95</td>
<td align="left">98</td>
<td align="left">101</td>
<td align="left">105</td>
<td align="left">52,001</td>
<td align="left">120</td>
<td align="left">89</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">10,213</td>
<td align="left">82</td>
<td align="left"></td>
<td align="left">46,295</td>
<td align="left">95</td>
<td align="left"></td>
<td align="left">93</td>
<td align="left">95</td>
<td align="left"></td>
<td align="left">58,184</td>
<td align="left">134</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">12,446</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">48,586</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">98</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">43,388</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Arco</td>
<td align="left">Bignorm</td>
<td align="left">3320</td>
<td align="left">81</td>
<td align="left">97</td>
<td align="left">12,808</td>
<td align="left">57</td>
<td align="left">57</td>
<td align="left">85</td>
<td align="left">100</td>
<td align="left">97</td>
<td align="left">76,797</td>
<td align="left">99</td>
<td align="left">91</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">3434</td>
<td align="left">84</td>
<td align="left"></td>
<td align="left">22,463</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">88</td>
<td align="left">103</td>
<td align="left"></td>
<td align="left">84,613</td>
<td align="left">109</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">4092</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">22,439</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">85</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">77,888</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Arma</td>
<td align="left">Bignorm</td>
<td align="left">18,432</td>
<td align="left">102</td>
<td align="left">107</td>
<td align="left">108,140</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">98</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">774,291</td>
<td align="left">91</td>
<td align="left">103</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">17,288</td>
<td align="left">96</td>
<td align="left"></td>
<td align="left">108,498</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">98</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">748,560</td>
<td align="left">88</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">18,039</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">108,498</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">98</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">849,085</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">ASZN2</td>
<td align="left">Bignorm</td>
<td align="left">19,788</td>
<td align="left">91</td>
<td align="left">88</td>
<td align="left">72,685</td>
<td align="left">71</td>
<td align="left">88</td>
<td align="left">97</td>
<td align="left">99</td>
<td align="left">99</td>
<td align="left">2,753,167</td>
<td align="left">94</td>
<td align="left">105</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">16,591</td>
<td align="left">76</td>
<td align="left"></td>
<td align="left">82,687</td>
<td align="left">81</td>
<td align="left"></td>
<td align="left">97</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">2,617,095</td>
<td align="left">89</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">21,784</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">102,287</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">97</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">2,941,524</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Bacteroides</td>
<td align="left">Bignorm</td>
<td align="left">3356</td>
<td align="left">68</td>
<td align="left">100</td>
<td align="left">25,300</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">95</td>
<td align="left">98</td>
<td align="left">99</td>
<td align="left">70,206</td>
<td align="left">105</td>
<td align="left">112</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">3356</td>
<td align="left">68</td>
<td align="left"></td>
<td align="left">25,300</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">96</td>
<td align="left">99</td>
<td align="left"></td>
<td align="left">62,882</td>
<td align="left">94</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">4930</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">25,299</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">98</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">66,626</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Caldi</td>
<td align="left">Bignorm</td>
<td align="left">50,973</td>
<td align="left">82</td>
<td align="left">83</td>
<td align="left">143,346</td>
<td align="left">89</td>
<td align="left">91</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">573,836</td>
<td align="left">94</td>
<td align="left">68</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">61,108</td>
<td align="left">98</td>
<td align="left"></td>
<td align="left">157,479</td>
<td align="left">98</td>
<td align="left"></td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">839,126</td>
<td align="left">138</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">62,429</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">160,851</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">100</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">609,604</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Caulo</td>
<td align="left">Bignorm</td>
<td align="left">4515</td>
<td align="left">69</td>
<td align="left">95</td>
<td align="left">20,255</td>
<td align="left">100</td>
<td align="left">107</td>
<td align="left">96</td>
<td align="left">98</td>
<td align="left">98</td>
<td align="left">60,362</td>
<td align="left">86</td>
<td align="left">113</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">4729</td>
<td align="left">72</td>
<td align="left"></td>
<td align="left">18,907</td>
<td align="left">93</td>
<td align="left"></td>
<td align="left">98</td>
<td align="left">101</td>
<td align="left"></td>
<td align="left">53,456</td>
<td align="left">76</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">6562</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">20,255</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">97</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">70,161</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Chloroflexi</td>
<td align="left">Bignorm</td>
<td align="left">13,418</td>
<td align="left">102</td>
<td align="left">109</td>
<td align="left">79,605</td>
<td align="left">102</td>
<td align="left">102</td>
<td align="left">99</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">666,519</td>
<td align="left">95</td>
<td align="left">93</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">12,305</td>
<td align="left">93</td>
<td align="left"></td>
<td align="left">78,276</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">716,473</td>
<td align="left">102</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">13,218</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">78,276</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">99</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">703,171</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Crenarch</td>
<td align="left">Bignorm</td>
<td align="left">6538</td>
<td align="left">77</td>
<td align="left">91</td>
<td align="left">31,401</td>
<td align="left">81</td>
<td align="left">66</td>
<td align="left">97</td>
<td align="left">99</td>
<td align="left">99</td>
<td align="left">484,354</td>
<td align="left">89</td>
<td align="left">95</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">7148</td>
<td align="left">84</td>
<td align="left"></td>
<td align="left">47,803</td>
<td align="left">124</td>
<td align="left"></td>
<td align="left">98</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">510,256</td>
<td align="left">94</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">8501</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">38,582</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">98</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">544,763</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">Cyanobact</td>
<td align="left">Bignorm</td>
<td align="left">5833</td>
<td align="left">95</td>
<td align="left">99</td>
<td align="left">33,462</td>
<td align="left">98</td>
<td align="left">100</td>
<td align="left">99</td>
<td align="left">101</td>
<td align="left">100</td>
<td align="left">236,391</td>
<td align="left">113</td>
<td align="left">110</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">5907</td>
<td align="left">96</td>
<td align="left"></td>
<td align="left">33,516</td>
<td align="left">98</td>
<td align="left"></td>
<td align="left">99</td>
<td align="left">101</td>
<td align="left"></td>
<td align="left">214,574</td>
<td align="left">103</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">6130</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">34,300</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">98</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">209,269</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">E. coli</td>
<td align="left">Bignorm</td>
<td align="left">112,393</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">268,306</td>
<td align="left">94</td>
<td align="left">94</td>
<td align="left">96</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">28,966</td>
<td align="left">65</td>
<td align="left">65</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">112,393</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">285,311</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">96</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">44,465</td>
<td align="left">100</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">112,393</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">285,528</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">96</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">44,366</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left">SAR324</td>
<td align="left">Bignorm</td>
<td align="left">135,669</td>
<td align="left">100</td>
<td align="left">114</td>
<td align="left">302,443</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">99</td>
<td align="left">100</td>
<td align="left">100</td>
<td align="left">4,259,479</td>
<td align="left">98</td>
<td align="left">100</td>
</tr>
<tr><td align="left"></td>
<td align="left">Diginorm</td>
<td align="left">119,529</td>
<td align="left">88</td>
<td align="left"></td>
<td align="left">302,443</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">99</td>
<td align="left">100</td>
<td align="left"></td>
<td align="left">4,264,234</td>
<td align="left">98</td>
<td align="left"></td>
</tr>
<tr><td align="left"></td>
<td align="left">Raw</td>
<td align="left">136,176</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">302,442</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">99</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">4,342,602</td>
<td align="left"></td>
<td align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
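<p>The relative columns in Table 5 and the overall running time can be recomputed from the absolute values with a short Python sketch, shown here for the Aceto dataset. The helper name percent_of and the rounding to whole percent are assumptions; the overall running time is taken as filter time plus SPAdes time, following the description of pre-processing plus assembler wall time.</p>
<preformat>
```python
def percent_of(value, baseline):
    """Express value as a percentage of baseline, rounded to whole percent."""
    return round(100 * value / baseline)


# Aceto values copied from Tables 4 and 5.
n50_bignorm, n50_diginorm, n50_raw = 2324, 2216, 2935
print(percent_of(n50_bignorm, n50_raw))       # prints 79  (N50, % of raw)
print(percent_of(n50_bignorm, n50_diginorm))  # prints 105 (N50, % of Diginorm)

# Overall running time in seconds: filter time + SPAdes time.
total_bignorm = 906 + 1708
total_diginorm = 3290 + 4363
total_raw = 47813  # no filtering step for the raw dataset
print(percent_of(total_bignorm, total_raw))   # prints 5
print(percent_of(total_diginorm, total_raw))  # prints 16
```
</preformat>
<p>For this dataset, the sketch reproduces the N50 entries of 79 and 105 given in Table 5.</p>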
<p>In Table <xref rid="Tab6" ref-type="table">6</xref>
, we show the total length of the assemblies for <italic>Q</italic>
<sub>0</sub>
=20, both in absolute terms and relative to the length of the reference. In most cases, all assemblies are clearly longer than the reference, with Diginorm tending to produce slightly longer and Bignorm slightly shorter assemblies than the unfiltered dataset (see Additional file <xref rid="MOESM1" ref-type="media">1</xref>: Figure S6 for a box plot).
<table-wrap id="Tab6"><label>Table 6</label>
<caption><p>Reference length and total length of assemblies for Bignorm with <italic>Q</italic>
<sub>0</sub>
=20, Diginorm, and the raw datasets</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Dataset</th>
<th align="left">Reference</th>
<th align="left" colspan="2">Raw</th>
<th align="left" colspan="2">Diginorm</th>
<th align="left" colspan="2">Bignorm</th>
</tr>
<tr><th align="left"></th>
<th align="left">Ref length</th>
<th align="left">Total length</th>
<th align="left">% of ref</th>
<th align="left">Total length</th>
<th align="left">% of ref</th>
<th align="left">Total length</th>
<th align="left">% of ref</th>
</tr>
</thead>
<tbody><tr><td align="left">Aceto</td>
<td align="left">426,710</td>
<td align="left">750,316</td>
<td align="left">175.80</td>
<td align="left">769,090</td>
<td align="left">180.20</td>
<td align="left">731,850</td>
<td align="left">171.50</td>
</tr>
<tr><td align="left">Alphaproteo</td>
<td align="left">463,456</td>
<td align="left">405,020</td>
<td align="left">87.40</td>
<td align="left">377,293</td>
<td align="left">81.40</td>
<td align="left">394,979</td>
<td align="left">85.20</td>
</tr>
<tr><td align="left">Arco</td>
<td align="left">231,937</td>
<td align="left">408,571</td>
<td align="left">176.20</td>
<td align="left">419,403</td>
<td align="left">180.80</td>
<td align="left">380,191</td>
<td align="left">163.90</td>
</tr>
<tr><td align="left">Arma</td>
<td align="left">1,364,272</td>
<td align="left">2,123,588</td>
<td align="left">155.70</td>
<td align="left">2,131,958</td>
<td align="left">156.30</td>
<td align="left">2,077,037</td>
<td align="left">152.20</td>
</tr>
<tr><td align="left">ASZN2</td>
<td align="left">3,669,182</td>
<td align="left">4,938,079</td>
<td align="left">134.60</td>
<td align="left">4,930,677</td>
<td align="left">134.40</td>
<td align="left">4,836,216</td>
<td align="left">131.80</td>
</tr>
<tr><td align="left">Bacteroides</td>
<td align="left">560,676</td>
<td align="left">826,566</td>
<td align="left">147.40</td>
<td align="left">818,799</td>
<td align="left">146.00</td>
<td align="left">792,384</td>
<td align="left">141.30</td>
</tr>
<tr><td align="left">Caldi</td>
<td align="left">1,961,164</td>
<td align="left">2,044,270</td>
<td align="left">104.20</td>
<td align="left">2,041,841</td>
<td align="left">104.10</td>
<td align="left">2,037,901</td>
<td align="left">103.90</td>
</tr>
<tr><td align="left">Caulo</td>
<td align="left">423,390</td>
<td align="left">601,709</td>
<td align="left">142.10</td>
<td align="left">616,942</td>
<td align="left">145.70</td>
<td align="left">590,319</td>
<td align="left">139.40</td>
</tr>
<tr><td align="left">Chloroflexi</td>
<td align="left">863,677</td>
<td align="left">1,317,768</td>
<td align="left">152.60</td>
<td align="left">1,326,848</td>
<td align="left">153.60</td>
<td align="left">1,186,531</td>
<td align="left">137.40</td>
</tr>
<tr><td align="left">Crenarch</td>
<td align="left">716,004</td>
<td align="left">1,009,122</td>
<td align="left">140.90</td>
<td align="left">1,016,485</td>
<td align="left">142.00</td>
<td align="left">946,606</td>
<td align="left">132.20</td>
</tr>
<tr><td align="left">Cyanobact</td>
<td align="left">343,353</td>
<td align="left">635,368</td>
<td align="left">185.00</td>
<td align="left">636,876</td>
<td align="left">185.50</td>
<td align="left">591,367</td>
<td align="left">172.20</td>
</tr>
<tr><td align="left">E. coli</td>
<td align="left">4,639,675</td>
<td align="left">4,896,992</td>
<td align="left">105.50</td>
<td align="left">4,898,422</td>
<td align="left">105.60</td>
<td align="left">4,948,739</td>
<td align="left">106.70</td>
</tr>
<tr><td align="left">SAR324</td>
<td align="left">4,255,983</td>
<td align="left">4,676,938</td>
<td align="left">109.90</td>
<td align="left">4,674,540</td>
<td align="left">109.80</td>
<td align="left">4,669,774</td>
<td align="left">109.70</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
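<p>The “% of ref” columns in Table <xref rid="Tab6" ref-type="table">6</xref> are the total assembly length divided by the reference length, expressed in percent. As a minimal sketch (the function and variable names are illustrative and not taken from the authors’ code), the Aceto row can be recomputed as follows:</p>
<preformat>
```python
def percent_of_reference(total_length: int, ref_length: int) -> float:
    """Return the assembly length as a percentage of the reference length."""
    if ref_length == 0:
        raise ValueError("reference length must be nonzero")
    return 100.0 * total_length / ref_length


# Values copied from the Aceto row of Table 6.
aceto_ref = 426_710
aceto_lengths = {"raw": 750_316, "diginorm": 769_090, "bignorm": 731_850}

# Round to one decimal place, matching the precision shown in the table.
aceto_percent = {
    name: round(percent_of_reference(length, aceto_ref), 1)
    for name, length in aceto_lengths.items()
}
```
</preformat>
<p>Rounded to one decimal place, this reproduces the values 175.8, 180.2, and 171.5 reported for Aceto.</p>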
<p>Bignorm’s mean phred score is always slightly larger than that of the raw dataset, whereas Diginorm’s is always smaller. In some cases, the difference is substantial; quartiles of the ratio of Diginorm’s mean phred score to that of the raw dataset are given in the first row of Table <xref rid="Tab7" ref-type="table">7</xref>.
<table-wrap id="Tab7"><label>Table 7</label>
<caption><p>Quartiles of the ratios of mean phred score, filter wall time, and assembler wall time (in %)</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left"></th>
<th align="left">Min</th>
<th align="left"><inline-formula id="IEq21"><alternatives><tex-math id="M41">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}1$\end{document}</tex-math>
<mml:math id="M42"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>1</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq21.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">Median</th>
<th align="left">Mean</th>
<th align="left"><inline-formula id="IEq22"><alternatives><tex-math id="M43">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$\mathcal {Q}3$\end{document}</tex-math>
<mml:math id="M44"><mml:mi mathvariant="script">Q</mml:mi>
<mml:mn>3</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2017_1724_Article_IEq22.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">Max</th>
</tr>
</thead>
<tbody><tr><td align="left"><underline>Diginorm mean phred score</underline>
</td>
<td align="left">62</td>
<td align="left">66</td>
<td align="left">74</td>
<td align="left">74</td>
<td align="left">79</td>
<td align="left">89</td>
</tr>
<tr><td align="left">raw mean phred score</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left"><underline>Bignorm filter time</underline>
</td>
<td align="left">24</td>
<td align="left">28</td>
<td align="left">31</td>
<td align="left">33</td>
<td align="left">38</td>
<td align="left">46</td>
</tr>
<tr><td align="left">Diginorm filter time</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr><td align="left"><underline>Bignorm SPAdes time</underline>
</td>
<td align="left">4</td>
<td align="left">8</td>
<td align="left">18</td>
<td align="left">26</td>
<td align="left">35</td>
<td align="left">88</td>
</tr>
<tr><td align="left">Diginorm SPAdes time</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Our biggest gain is in running time, both for filtering and for assembly. Quartiles of the corresponding improvements are given in rows two and three of Table <xref rid="Tab7" ref-type="table">7</xref>
.</p>
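<p>Each row of Table <xref rid="Tab7" ref-type="table">7</xref> summarizes per-dataset ratios by their minimum, quartiles, mean, and maximum. The following is a minimal sketch of such a summary. The timing values are hypothetical placeholders (per-dataset times are not listed in this section), and the inclusive quartile convention of Python’s statistics.quantiles is an assumption that may differ from the convention used for the table:</p>
<preformat>
```python
import statistics


def ratio_summary(numerators, denominators):
    """Summarize per-dataset ratios (in %) by min, quartiles, mean, and max."""
    ratios = sorted(100.0 * a / b for a, b in zip(numerators, denominators))
    # Assumed convention: inclusive quartiles with linear interpolation.
    q1, median, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
    return {
        "min": ratios[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(ratios),
        "q3": q3,
        "max": ratios[-1],
    }


# Hypothetical wall times in seconds for five datasets (not from the paper).
bignorm_times = [10, 20, 30, 40, 50]
diginorm_times = [100, 100, 100, 100, 100]
summary = ratio_summary(bignorm_times, diginorm_times)
```
</preformat>
<p>A median of 30 in this toy example would mean that, for the median dataset, the first filter needs 30% of the time of the second.</p>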
</sec>
<sec id="Sec11"><title>IDBA_UD and Velvet-SC</title>
<p>For a detailed presentation of the results obtained with IDBA_UD and Velvet-SC, please see the “Comparison of different assemblers” section in Additional file <xref rid="MOESM1" ref-type="media">1</xref>. We briefly summarize the results: 
<list list-type="bullet"><list-item><p>IDBA_UD does not considerably benefit from read filtering, while Velvet-SC clearly does.</p>
</list-item>
<list-item><p>Velvet-SC is clearly inferior to both SPAdes and IDBA_UD, though in some regards the combination of read filtering and Velvet-SC is as good as IDBA_UD.</p>
</list-item>
<list-item><p>SPAdes nearly always produced better results than IDBA_UD, but in terms of the median (on unfiltered datasets) IDBA_UD is about 7 times faster than SPAdes.</p>
</list-item>
<list-item><p>SPAdes running on a dataset filtered using Diginorm is approximately as fast as IDBA_UD on the unfiltered dataset, while SPAdes on a dataset filtered using Bignorm is roughly 4 times faster.</p>
</list-item>
</list>
</p>
</sec>
</sec>
<sec id="Sec12" sec-type="discussion"><title>Discussion</title>
<p>The quality parameter <italic>Q</italic>
<sub>0</sub>
 that Bignorm introduces as an innovation over Diginorm has been shown to have a strong impact on the number of reads kept, the coverage, and the quality of the assembly. A reasonable upper bound of <italic>Q</italic>
<sub>0</sub>
≤25 was obtained by considering the 10th percentile of the coverage (Fig. <xref rid="Fig2" ref-type="fig">2</xref>
<xref rid="Fig2" ref-type="fig">b</xref>
). With this constraint in mind, in order to keep a small number of reads, Fig. <xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">a</xref>
 suggests 18≤<italic>Q</italic>
<sub>0</sub>
≤25. Given that the N50 for E. coli starts to decline at <italic>Q</italic>
<sub>0</sub>
=20 (Fig. <xref rid="Fig3" ref-type="fig">3</xref>
), we chose <italic>Q</italic>
<sub>0</sub>
=20 as the recommended value. As presented in detail in Table <xref rid="Tab4" ref-type="table">4</xref>
, <italic>Q</italic>
<sub>0</sub>
=20 gives good assemblies for all 13 cases. The gain in speed is considerable: in terms of the median, we only require 31<italic>%</italic>
 and 18<italic>%</italic>
 of Diginorm’s time for filtering and assembly, respectively. This speedup generally comes at the price of a smaller genomic fraction and a shorter largest contig, although these differences are relatively small.</p>
<p>We believe that the increase of the N50 and largest contig for high values of <italic>Q</italic>
<sub>0</sub>
, which we observe for some datasets just before the breakdown of the assembly (compare for example the results for the Alphaproteo dataset in Fig. <xref rid="Fig3" ref-type="fig">3</xref>
), is due to the reduced number of branches in the assembly graph: SPAdes, like every assembler, ends a contig when it reaches an unresolvable branch in its assembly graph. As the number of reads in the input decreases with increasing <italic>Q</italic>
<sub>0</sub>
, the number of these branches also decreases and the resulting contigs get longer.</p>
</sec>
<sec id="Sec13" sec-type="conclusion"><title>Conclusions</title>
<p>For 13 bacterial single-cell datasets, we have shown that good and fast assemblies are possible based on only 5<italic>%</italic>
 of the reads in most of the cases (and on less than 10<italic>%</italic>
 of the reads in all of the cases). The filtering process, using our new algorithm Bignorm, is also fast, and much faster than Diginorm. Like Diginorm, we use a count-min sketch for counting <italic>k</italic>
-mers, so the memory requirements are relatively small and known in advance. Our algorithm Bignorm yields filtered datasets and subsequent assemblies of competitive quality in much shorter time. In particular, the combination of Bignorm and SPAdes gives results superior to those of IDBA_UD while being faster. Furthermore, the mean phred score of our filtered datasets is much higher than that of Diginorm’s.</p>
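<p>To illustrate why the memory footprint of a count-min sketch is known in advance, the following minimal sketch counts k-mers in a fixed table of depth times width counters. The class name, the table dimensions, and the use of salted BLAKE2b as the hash function are assumptions made for illustration; they do not reproduce Bignorm’s actual implementation or hash family.</p>
<preformat>
```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        # Memory is fixed up front: depth rows of width counters each.
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Illustrative hash only: BLAKE2b salted with the row number.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        """Increment one counter per row for this item."""
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += 1

    def estimate(self, item: str) -> int:
        """Return the smallest of the item's counters across all rows."""
        return min(
            self.rows[row][self._index(row, item)] for row in range(self.depth)
        )


def kmers(read: str, k: int):
    """Yield all k-mers (substrings of length k) of a read."""
    for start in range(len(read) - k + 1):
        yield read[start:start + k]


# Toy read with 7 overlapping 4-mers; ACGT, CGTA, and GTAC occur twice each.
sketch = CountMinSketch(width=1024, depth=4)
for kmer in kmers("ACGTACGTAC", k=4):
    sketch.add(kmer)
```
</preformat>
<p>Because counters are only ever incremented, an estimate can exceed the true count through hash collisions but can never fall below it; widening the table reduces such overestimates at the cost of memory.</p>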
</sec>
</body>
<back><app-group><app id="App1"><sec id="Sec14"><title>Additional file</title>
<p><media position="anchor" xlink:href="12859_2017_1724_MOESM1_ESM.pdf" id="MOESM1"><label>Additional file 1</label>
<caption><p>See file ‘supplement.pdf’ for formal definitions and details on results from different assemblers. (PDF 259 kb)</p>
</caption>
</media>
</p>
</sec>
</app>
</app-group>
<fn-group><fn><p><bold>Electronic supplementary material</bold>
</p>
<p>The online version of this article (doi:10.1186/s12859-017-1724-7) contains supplementary material, which is available to authorized users.</p>
</fn>
</fn-group>
<ack><title>Acknowledgements</title>
<p>Not applicable.</p>
<sec id="d29e5245"><title>Funding</title>
<p>This work was funded by DFG Priority Programme 1736 <italic>Algorithms for Big Data</italic>
, Grant SR7/15-1.</p>
</sec>
<sec id="d29e5253"><title>Availability of data and materials</title>
<p>The datasets analyzed in the current study can be found in the references in Table <xref rid="Tab2" ref-type="table">2</xref>
. The source code for Bignorm is available at [<xref ref-type="bibr" rid="CR26">26</xref>
].</p>
</sec>
<sec id="d29e5264"><title>Authors’ contributions</title>
<p>All authors planned and designed the study. AW implemented the software and performed the experiments. AW, LK, and CS wrote the manuscript. All authors read and approved the final manuscript.</p>
</sec>
<sec id="d29e5269"><title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec id="d29e5274"><title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec id="d29e5279"><title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec id="d29e5284"><title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</ack>
<ref-list id="Bib1"><title>References</title>
<ref id="CR1"><label>1</label>
<mixed-citation publication-type="other">Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom TH. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. ArXiv e-prints. 2012:1–18. http://arxiv.org/abs/1203.4802.</mixed-citation>
</ref>
<ref id="CR2"><label>2</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Del Fabbro</surname>
<given-names>C</given-names>
</name>
<name><surname>Scalabrin</surname>
<given-names>S</given-names>
</name>
<name><surname>Morgante</surname>
<given-names>M</given-names>
</name>
<name><surname>Giorgi</surname>
<given-names>FM</given-names>
</name>
</person-group>
<article-title>An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis</article-title>
<source>PLoS ONE</source>
<year>2013</year>
<volume>8</volume>
<issue>12</issue>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0085024</pub-id>
</element-citation>
</ref>
<ref id="CR3"><label>3</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Martin</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Cutadapt removes adapter sequences from high-throughput sequencing reads</article-title>
<source>EMBnet J</source>
<year>2011</year>
<volume>17</volume>
<issue>1</issue>
<fpage>10</fpage>
<lpage>2</lpage>
<pub-id pub-id-type="doi">10.14806/ej.17.1.200</pub-id>
</element-citation>
</ref>
<ref id="CR4"><label>4</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Prezza</surname>
<given-names>N</given-names>
</name>
<name><surname>Del Fabbro</surname>
<given-names>C</given-names>
</name>
<name><surname>Vezzi</surname>
<given-names>F</given-names>
</name>
<name><surname>De Paoli</surname>
<given-names>E</given-names>
</name>
<name><surname>Policriti</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>ERNE-BS5: Aligning BS-treated Sequences by Multiple Hits on a 5-letters Alphabet</article-title>
<source>Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. BCB ’12</source>
<year>2012</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>ACM</publisher-name>
</element-citation>
</ref>
<ref id="CR5"><label>5</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cox</surname>
<given-names>MP</given-names>
</name>
<name><surname>Peterson</surname>
<given-names>DA</given-names>
</name>
<name><surname>Biggs</surname>
<given-names>PJ</given-names>
</name>
</person-group>
<article-title>SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data</article-title>
<source>BMC Bioinforma</source>
<year>2010</year>
<volume>11</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-11-485</pub-id>
</element-citation>
</ref>
<ref id="CR6"><label>6</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Smeds</surname>
<given-names>L</given-names>
</name>
<name><surname>Künstner</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>ConDeTri - A Content Dependent Read Trimmer for Illumina Data</article-title>
<source>PLoS ONE</source>
<year>2011</year>
<volume>6</volume>
<issue>10</issue>
<fpage>1</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0026314</pub-id>
</element-citation>
</ref>
<ref id="CR7"><label>7</label>
<mixed-citation publication-type="other">FASTX-Toolkit. <ext-link ext-link-type="uri" xlink:href="http://hannonlab.cshl.edu/fastx_toolkit/">http://hannonlab.cshl.edu/fastx_toolkit/</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR8"><label>8</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Schmieder</surname>
<given-names>R</given-names>
</name>
<name><surname>Edwards</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Quality control and preprocessing of metagenomic datasets</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>863</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr026</pub-id>
<pub-id pub-id-type="pmid">21278185</pub-id>
</element-citation>
</ref>
<ref id="CR9"><label>9</label>
<mixed-citation publication-type="other">Joshi N, Fass J. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.33). 2011. <ext-link ext-link-type="uri" xlink:href="https://github.com/najoshi/sickle">https://github.com/najoshi/sickle</ext-link>
.</mixed-citation>
</ref>
<ref id="CR10"><label>10</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bolger</surname>
<given-names>AM</given-names>
</name>
<name><surname>Lohse</surname>
<given-names>M</given-names>
</name>
<name><surname>Usadel</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Trimmomatic: A flexible trimmer for Illumina Sequence Data</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>15</issue>
<fpage>2114</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu170</pub-id>
<pub-id pub-id-type="pmid">24695404</pub-id>
</element-citation>
</ref>
<ref id="CR11"><label>11</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Alic</surname>
<given-names>AS</given-names>
</name>
<name><surname>Ruzafa</surname>
<given-names>D</given-names>
</name>
<name><surname>Dopazo</surname>
<given-names>J</given-names>
</name>
<name><surname>Blanquer</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Objective review of de novo stand-alone error correction methods for NGS data</article-title>
<source>Wiley Interdiscip Rev Comput Mol Sci</source>
<year>2016</year>
<volume>6</volume>
<issue>2</issue>
<fpage>111</fpage>
<lpage>46</lpage>
<pub-id pub-id-type="doi">10.1002/wcms.1239</pub-id>
</element-citation>
</ref>
<ref id="CR12"><label>12</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<name><surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol</source>
<year>2010</year>
<volume>11</volume>
<issue>11</issue>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1186/gb-2010-11-11-r116</pub-id>
</element-citation>
</ref>
<ref id="CR13"><label>13</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname>
<given-names>Q</given-names>
</name>
<name><surname>Pell</surname>
<given-names>J</given-names>
</name>
<name><surname>Canino-Koning</surname>
<given-names>R</given-names>
</name>
<name><surname>Howe</surname>
<given-names>AC</given-names>
</name>
<name><surname>Brown</surname>
<given-names>CT</given-names>
</name>
</person-group>
<article-title>These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure</article-title>
<source>PLoS ONE</source>
<year>2014</year>
<volume>9</volume>
<issue>7</issue>
<fpage>1</fpage>
<lpage>13</lpage>
</element-citation>
</ref>
<ref id="CR14"><label>14</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cormode</surname>
<given-names>G</given-names>
</name>
<name><surname>Muthukrishnan</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>An improved data stream summary: the count-min sketch and its applications</article-title>
<source>J Algoritm</source>
<year>2005</year>
<volume>55</volume>
<issue>1</issue>
<fpage>58</fpage>
<lpage>75</lpage>
<pub-id pub-id-type="doi">10.1016/j.jalgor.2003.12.001</pub-id>
</element-citation>
</ref>
<ref id="CR15"><label>15</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dietzfelbinger</surname>
<given-names>M</given-names>
</name>
<name><surname>Hagerup</surname>
<given-names>T</given-names>
</name>
<name><surname>Katajainen</surname>
<given-names>J</given-names>
</name>
<name><surname>Penttonen</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>A Reliable Randomized Algorithm for the Closest-Pair Problem</article-title>
<source>J Algoritm</source>
<year>1997</year>
<volume>25</volume>
<issue>1</issue>
<fpage>19</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="doi">10.1006/jagm.1997.0873</pub-id>
</element-citation>
</ref>
<ref id="CR16"><label>16</label>
<mixed-citation publication-type="other">Wölfel P. Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen. PhD thesis, Universität Dortmund, Fachbereich Informatik. 2003.</mixed-citation>
</ref>
<ref id="CR17"><label>17</label>
<mixed-citation publication-type="other">JGI Genome Portal - Home. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov">http://genome.jgi.doe.gov</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR18"><label>18</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Fast gapped-read alignment with Bowtie 2</article-title>
<source>Nat Meth</source>
<year>2012</year>
<volume>9</volume>
<issue>4</issue>
<fpage>357</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id>
</element-citation>
</ref>
<ref id="CR19"><label>19</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bankevich</surname>
<given-names>A</given-names>
</name>
<name><surname>Nurk</surname>
<given-names>S</given-names>
</name>
<name><surname>Antipov</surname>
<given-names>D</given-names>
</name>
<name><surname>Gurevich</surname>
<given-names>AA</given-names>
</name>
<name><surname>Dvorkin</surname>
<given-names>M</given-names>
</name>
<name><surname>Kulikov</surname>
<given-names>AS</given-names>
</name>
<name><surname>Lesin</surname>
<given-names>VM</given-names>
</name>
<name><surname>Nikolenko</surname>
<given-names>SI</given-names>
</name>
<name><surname>Pham</surname>
<given-names>S</given-names>
</name>
<name><surname>Prjibelski</surname>
<given-names>AD</given-names>
</name>
<name><surname>Pyshkin</surname>
<given-names>AV</given-names>
</name>
<name><surname>Sirotkin</surname>
<given-names>AV</given-names>
</name>
<name><surname>Vyahhi</surname>
<given-names>N</given-names>
</name>
<name><surname>Tesler</surname>
<given-names>G</given-names>
</name>
<name><surname>Alekseyev</surname>
<given-names>MA</given-names>
</name>
<name><surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
</person-group>
<article-title>SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing</article-title>
<source>J Comput Biol</source>
<year>2012</year>
<volume>19</volume>
<issue>5</issue>
<fpage>455</fpage>
<lpage>77</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2012.0021</pub-id>
<pub-id pub-id-type="pmid">22506599</pub-id>
</element-citation>
</ref>
<ref id="CR20"><label>20</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname>
<given-names>Y</given-names>
</name>
<name><surname>Leung</surname>
<given-names>HCM</given-names>
</name>
<name><surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name><surname>Chin</surname>
<given-names>FYL</given-names>
</name>
</person-group>
<article-title>IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>11</issue>
<fpage>1420</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts174</pub-id>
<pub-id pub-id-type="pmid">22495754</pub-id>
</element-citation>
</ref>
<ref id="CR21"><label>21</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chitsaz</surname>
<given-names>H</given-names>
</name>
<name><surname>Yee-Greenbaum</surname>
<given-names>JL</given-names>
</name>
<name><surname>Tesler</surname>
<given-names>G</given-names>
</name>
<name><surname>Lombardo</surname>
<given-names>M-J</given-names>
</name>
<name><surname>Dupont</surname>
<given-names>CL</given-names>
</name>
<name><surname>Badger</surname>
<given-names>JH</given-names>
</name>
<name><surname>Novotny</surname>
<given-names>M</given-names>
</name>
<name><surname>Rusch</surname>
<given-names>DB</given-names>
</name>
<name><surname>Fraser</surname>
<given-names>LJ</given-names>
</name>
<name><surname>Gormley</surname>
<given-names>NA</given-names>
</name>
<name><surname>Schulz-Trieglaff</surname>
<given-names>O</given-names>
</name>
<name><surname>Smith</surname>
<given-names>GP</given-names>
</name>
<name><surname>Evers</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name><surname>Lasken</surname>
<given-names>RS</given-names>
</name>
</person-group>
<article-title>Efficient de novo assembly of single-cell bacterial genomes from short-read data sets</article-title>
<source>Nat Biotech</source>
<year>2011</year>
<volume>29</volume>
<issue>10</issue>
<fpage>915</fpage>
<lpage>21</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1966</pub-id>
</element-citation>
</ref>
<ref id="CR22"><label>22</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Quinlan</surname>
<given-names>AR</given-names>
</name>
<name><surname>Hall</surname>
<given-names>IM</given-names>
</name>
</person-group>
<article-title>BEDTools: a flexible suite of utilities for comparing genomic features</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>6</issue>
<fpage>841</fpage>
<lpage>2</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq033</pub-id>
<pub-id pub-id-type="pmid">20110278</pub-id>
</element-citation>
</ref>
<ref id="CR23"><label>23</label>
<element-citation publication-type="book"><person-group person-group-type="author"><collab>R Core Team</collab>
</person-group>
<source>R: A Language and Environment for Statistical Computing</source>
<year>2016</year>
<publisher-loc>Vienna</publisher-loc>
<publisher-name>R Foundation for Statistical Computing</publisher-name>
</element-citation>
</ref>
<ref id="CR24"><label>24</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gurevich</surname>
<given-names>A</given-names>
</name>
<name><surname>Saveliev</surname>
<given-names>V</given-names>
</name>
<name><surname>Vyahhi</surname>
<given-names>N</given-names>
</name>
<name><surname>Tesler</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>QUAST: quality assessment tool for genome assemblies</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>8</issue>
<fpage>1072</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt086</pub-id>
<pub-id pub-id-type="pmid">23422339</pub-id>
</element-citation>
</ref>
<ref id="CR25"><label>25</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Earl</surname>
<given-names>D</given-names>
</name>
<name><surname>Bradnam</surname>
<given-names>K</given-names>
</name>
<name><surname>John</surname>
<given-names>JS</given-names>
</name>
<name><surname>Darling</surname>
<given-names>A</given-names>
</name>
<name><surname>Lin</surname>
<given-names>D</given-names>
</name>
<name><surname>Fass</surname>
<given-names>J</given-names>
</name>
<name><surname>Yu</surname>
<given-names>HOK</given-names>
</name>
<name><surname>Buffalo</surname>
<given-names>V</given-names>
</name>
<name><surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name><surname>Diekhans</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Assemblathon 1: A competitive assessment of de novo short read assembly methods</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>21</volume>
<issue>12</issue>
<fpage>2224</fpage>
<lpage>241</lpage>
<pub-id pub-id-type="doi">10.1101/gr.126599.111</pub-id>
<pub-id pub-id-type="pmid">21926179</pub-id>
</element-citation>
</ref>
<ref id="CR26"><label>26</label>
<mixed-citation publication-type="other">Wedemeyer A. Bignorm. <ext-link ext-link-type="uri" xlink:href="https://git.informatik.uni-kiel.de/axw/Bignorm">https://git.informatik.uni-kiel.de/axw/Bignorm</ext-link>. Accessed 10 Oct 2016.</mixed-citation>
</ref>
<ref id="CR27"><label>27</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kamke</surname>
<given-names>J</given-names>
</name>
<name><surname>Sczyrba</surname>
<given-names>A</given-names>
</name>
<name><surname>Ivanova</surname>
<given-names>N</given-names>
</name>
<name><surname>Schwientek</surname>
<given-names>P</given-names>
</name>
<name><surname>Rinke</surname>
<given-names>C</given-names>
</name>
<name><surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
<name><surname>Woyke</surname>
<given-names>T</given-names>
</name>
<name><surname>Hentschel</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>Single-cell genomics reveals complex carbohydrate degradation patterns in poribacterial symbionts of marine sponges</article-title>
<source>ISME J</source>
<year>2013</year>
<volume>7</volume>
<issue>12</issue>
<fpage>2287</fpage>
<lpage>300</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2013.111</pub-id>
<pub-id pub-id-type="pmid">23842652</pub-id>
</element-citation>
</ref>
<ref id="CR28"><label>28</label>
<mixed-citation publication-type="other">Candidatus Poribacteria Sp. WGA-4E. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/CanPorspWGA4E_FD">http://genome.jgi.doe.gov/CanPorspWGA4E_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR29"><label>29</label>
<mixed-citation publication-type="other">Acetothermia Bacterium JGI MDM2 LHC4sed-1-H19. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/AcebacLHC4se1H19_FD/AcebacLHC4se1H19_FD.info.html">http://genome.jgi.doe.gov/AcebacLHC4se1H19_FD/AcebacLHC4se1H19_FD.info.html</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR30"><label>30</label>
<mixed-citation publication-type="other">Alphaproteobacteria Bacterium SCGC AC-312_D23v2. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/AlpbacA312_D23v2_FD/AlpbacA312_D23v2_FD.info.html">http://genome.jgi.doe.gov/AlpbacA312_D23v2_FD/AlpbacA312_D23v2_FD.info.html</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR31"><label>31</label>
<mixed-citation publication-type="other">Arcobacter Sp. SCGC AAA036-D18. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/ArcspSAAA036D18_FD/ArcspSAAA036D18_FD.info.html">http://genome.jgi.doe.gov/ArcspSAAA036D18_FD/ArcspSAAA036D18_FD.info.html</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR32"><label>32</label>
<mixed-citation publication-type="other">Armatimonadetes Bacterium JGI 0000077-K19. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/Armbac0000077K19_FD">http://genome.jgi.doe.gov/Armbac0000077K19_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR33"><label>33</label>
<mixed-citation publication-type="other">Bacteroidetes bacVI JGI MCM14ME016. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/BacbacMCM14ME016_FD">http://genome.jgi.doe.gov/BacbacMCM14ME016_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR34"><label>34</label>
<mixed-citation publication-type="other">Calescamantes Bacterium JGI MDM2 SSWTFF-3-M19. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/CalbacSSWTFF3M19_FD">http://genome.jgi.doe.gov/CalbacSSWTFF3M19_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR35"><label>35</label>
<mixed-citation publication-type="other">Caulobacter Bacterium JGI SC39-H11. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/CaubacJGISC39H11_FD">http://genome.jgi.doe.gov/CaubacJGISC39H11_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR36"><label>36</label>
<mixed-citation publication-type="other">Chloroflexi Bacterium SCGC AAA257-O03. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/ChlbacSAAA257O03_FD">http://genome.jgi.doe.gov/ChlbacSAAA257O03_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR37"><label>37</label>
<mixed-citation publication-type="other">Crenarchaeota Archaeon SCGC AAA261-F05. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/CrearcSAAA261F05_FD">http://genome.jgi.doe.gov/CrearcSAAA261F05_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR38"><label>38</label>
<mixed-citation publication-type="other">Cyanobacteria Bacterium SCGC JGI 014-E08. <ext-link ext-link-type="uri" xlink:href="http://genome.jgi.doe.gov/CyabacSJGI014E08_FD">http://genome.jgi.doe.gov/CyabacSJGI014E08_FD</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
<ref id="CR39"><label>39</label>
<mixed-citation publication-type="other">Single Cell Data Sets. <ext-link ext-link-type="uri" xlink:href="http://bix.ucsd.edu/projects/singlecell/nbt_data.html">http://bix.ucsd.edu/projects/singlecell/nbt_data.html</ext-link>. Accessed 18 July 2016.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0002670 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0002670 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
