Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Faucet: streaming de novo assembly graph construction

Identifieur interne : 000B20 ( Pmc/Corpus ); précédent : 000B19; suivant : 000B21

Faucet: streaming de novo assembly graph construction

Auteurs : Roye Rozov ; Gil Goldshlager ; Eran Halperin ; Ron Shamir

Source :

RBID : PMC:5870852

Abstract

AbstractMotivation

We present Faucet, a two-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased.

Results

Faucet pairs the de Bruijn graph obtained from the reads with additional meta-data derived from them. We show these metadata—coverage counts collected at junction k-mers and connections bridging between junction pairs—contain most salient information needed for assembly, and demonstrate they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Fauceted resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency—namely, Minia and LightAssembler. However, on metagenomes tested, Faucet,o outputs had 14–110% higher mean NGA50 lengths compared with Minia, and 2- to 11-fold higher mean NGA50 lengths compared with LightAssembler, the only other streaming assembler available.

Availability and implementation

Faucet is available at https://github.com/Shamir-Lab/Faucet

<xref ref-type="supplementary-material" rid="sup1">Supplementary information</xref>

Supplementary data are available at Bioinformatics online.


Url:
DOI: 10.1093/bioinformatics/btx471
PubMed: 29036597
PubMed Central: 5870852

Links to Exploration step

PMC:5870852

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Faucet: streaming
<italic>de novo</italic>
assembly graph construction</title>
<author>
<name sortKey="Rozov, Roye" sort="Rozov, Roye" uniqKey="Rozov R" first="Roye" last="Rozov">Roye Rozov</name>
<affiliation>
<nlm:aff id="btx471-aff1">Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Goldshlager, Gil" sort="Goldshlager, Gil" uniqKey="Goldshlager G" first="Gil" last="Goldshlager">Gil Goldshlager</name>
<affiliation>
<nlm:aff id="btx471-aff2">Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Halperin, Eran" sort="Halperin, Eran" uniqKey="Halperin E" first="Eran" last="Halperin">Eran Halperin</name>
<affiliation>
<nlm:aff id="btx471-aff3">Departments of Computer Science, Anesthesiology and Perioperative Medicine, University of California Los Angeles, CA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation>
<nlm:aff id="btx471-aff1">Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">29036597</idno>
<idno type="pmc">5870852</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870852</idno>
<idno type="RBID">PMC:5870852</idno>
<idno type="doi">10.1093/bioinformatics/btx471</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000B20</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B20</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Faucet: streaming
<italic>de novo</italic>
assembly graph construction</title>
<author>
<name sortKey="Rozov, Roye" sort="Rozov, Roye" uniqKey="Rozov R" first="Roye" last="Rozov">Roye Rozov</name>
<affiliation>
<nlm:aff id="btx471-aff1">Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Goldshlager, Gil" sort="Goldshlager, Gil" uniqKey="Goldshlager G" first="Gil" last="Goldshlager">Gil Goldshlager</name>
<affiliation>
<nlm:aff id="btx471-aff2">Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Halperin, Eran" sort="Halperin, Eran" uniqKey="Halperin E" first="Eran" last="Halperin">Eran Halperin</name>
<affiliation>
<nlm:aff id="btx471-aff3">Departments of Computer Science, Anesthesiology and Perioperative Medicine, University of California Los Angeles, CA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation>
<nlm:aff id="btx471-aff1">Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<sec id="s1">
<title>Motivation</title>
<p>We present Faucet, a two-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased.</p>
</sec>
<sec id="s2">
<title>Results</title>
<p>Faucet pairs the de Bruijn graph obtained from the reads with additional meta-data derived from them. We show these metadata—coverage counts collected at junction k-mers and connections bridging between junction pairs—contain most salient information needed for assembly, and demonstrate they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Fauceted resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency—namely, Minia and LightAssembler. However, on metagenomes tested, Faucet,o outputs had 14–110% higher mean NGA50 lengths compared with Minia, and 2- to 11-fold higher mean NGA50 lengths compared with LightAssembler, the only other streaming assembler available.</p>
</sec>
<sec id="s3">
<title>Availability and implementation</title>
<p>Faucet is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/Shamir-Lab/Faucet">https://github.com/Shamir-Lab/Faucet</ext-link>
</p>
</sec>
<sec id="s5">
<title>
<xref ref-type="supplementary-material" rid="sup1">Supplementary information</xref>
</title>
<p>
<xref ref-type="supplementary-material" rid="sup1">Supplementary data</xref>
are available at
<italic>Bioinformatics</italic>
online.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Bankevich, A" uniqKey="Bankevich A">A. Bankevich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, B H" uniqKey="Bloom B">B.H. Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
<author>
<name sortKey="Rizk, G" uniqKey="Rizk G">G. Rizk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="El Metwally, S" uniqKey="El Metwally S">S. El-Metwally</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gurevich, A" uniqKey="Gurevich A">A. Gurevich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z. Iqbal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, D" uniqKey="Li D">D. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P. Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Melsted, P" uniqKey="Melsted P">P. Melsted</name>
</author>
<author>
<name sortKey="Halldorsson, B V" uniqKey="Halldorsson B">B.V. Halldorsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Minkin, I" uniqKey="Minkin I">I. Minkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mohamadi, H" uniqKey="Mohamadi H">H. Mohamadi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nihalani, R" uniqKey="Nihalani R">R. Nihalani</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S. Aluru</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Novak, A M" uniqKey="Novak A">A.M. Novak</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nurk, S" uniqKey="Nurk S">S. Nurk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J. Pell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pertea, M" uniqKey="Pertea M">M. Pertea</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, P A" uniqKey="Pevzner P">P.A. Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Prjibelski, A D" uniqKey="Prjibelski A">A.D. Prjibelski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, A" uniqKey="Roberts A">A. Roberts</name>
</author>
<author>
<name sortKey="Pachter, L" uniqKey="Pachter L">L. Pachter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rozov, R" uniqKey="Rozov R">R. Rozov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shi, W" uniqKey="Shi W">W. Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, J T" uniqKey="Simpson J">J.T. Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R. Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, L" uniqKey="Song L">L. Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C. Ye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q. Zhang</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">29036597</article-id>
<article-id pub-id-type="pmc">5870852</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btx471</article-id>
<article-id pub-id-type="publisher-id">btx471</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Special Issue: Recomb-Seq/Recomb-Ccb/Recomb-Cg</subject>
<subj-group subj-group-type="category-toc-heading">
<subject>Genome Analysis</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Faucet: streaming
<italic>de novo</italic>
assembly graph construction</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Rozov</surname>
<given-names>Roye</given-names>
</name>
<xref ref-type="aff" rid="btx471-aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Goldshlager</surname>
<given-names>Gil</given-names>
</name>
<xref ref-type="aff" rid="btx471-aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Halperin</surname>
<given-names>Eran</given-names>
</name>
<xref ref-type="aff" rid="btx471-aff3">3</xref>
<xref ref-type="corresp" rid="btx471-cor1"></xref>
<pmc-comment>eranhalperin@gmail.com</pmc-comment>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
<xref ref-type="aff" rid="btx471-aff1">1</xref>
<xref ref-type="corresp" rid="btx471-cor1"></xref>
<pmc-comment>rshamir@tau.ac.il</pmc-comment>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Sahinalp</surname>
<given-names>Cenk</given-names>
</name>
<role>Associate Editor</role>
</contrib>
</contrib-group>
<aff id="btx471-aff1">
<label>1</label>
Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel</aff>
<aff id="btx471-aff2">
<label>2</label>
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA</aff>
<aff id="btx471-aff3">
<label>3</label>
Departments of Computer Science, Anesthesiology and Perioperative Medicine, University of California Los Angeles, CA, USA</aff>
<author-notes>
<corresp id="btx471-cor1">To whom correspondence should be addressed. Email:
<email>rshamir@tau.ac.il</email>
or
<email>eranhalperin@gmail.com</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>01</day>
<month>1</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="epub" iso-8601-date="2017-07-24">
<day>24</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>24</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>34</volume>
<issue>1</issue>
<fpage>147</fpage>
<lpage>154</lpage>
<history>
<date date-type="received">
<day>12</day>
<month>4</month>
<year>2017</year>
</date>
<date date-type="rev-recd">
<day>10</day>
<month>7</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>7</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author 2017. Published by Oxford University Press.</copyright-statement>
<copyright-year>2017</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by-nc/4.0/" license-type="cc-by-nc">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<self-uri xlink:href="btx471.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<sec id="s1">
<title>Motivation</title>
<p>We present Faucet, a two-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased.</p>
</sec>
<sec id="s2">
<title>Results</title>
<p>Faucet pairs the de Bruijn graph obtained from the reads with additional meta-data derived from them. We show these metadata—coverage counts collected at junction k-mers and connections bridging between junction pairs—contain most salient information needed for assembly, and demonstrate they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Fauceted resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency—namely, Minia and LightAssembler. However, on metagenomes tested, Faucet,o outputs had 14–110% higher mean NGA50 lengths compared with Minia, and 2- to 11-fold higher mean NGA50 lengths compared with LightAssembler, the only other streaming assembler available.</p>
</sec>
<sec id="s3">
<title>Availability and implementation</title>
<p>Faucet is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/Shamir-Lab/Faucet">https://github.com/Shamir-Lab/Faucet</ext-link>
</p>
</sec>
<sec id="s5">
<title>
<xref ref-type="supplementary-material" rid="sup1">Supplementary information</xref>
</title>
<p>
<xref ref-type="supplementary-material" rid="sup1">Supplementary data</xref>
are available at
<italic>Bioinformatics</italic>
online.</p>
</sec>
</abstract>
<funding-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">Israel Science Foundation</named-content>
<named-content content-type="funder-identifier">10.13039/501100003977</named-content>
</funding-source>
</award-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">Israel Science Foundation</named-content>
<named-content content-type="funder-identifier">10.13039/501100003977</named-content>
</funding-source>
<award-id>1425/13</award-id>
</award-group>
</funding-group>
<counts>
<page-count count="8"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>1 Introduction</title>
<p>Assembly graphs encode relationships among sequences from a common source: they capture sequences as well as the overlaps observed among them. When assembly graphs are indexed, their sequence contents can be queried without iterating over every sequence in the input. This functionality makes graph and index construction a prerequisite for many applications. Among these are different types of assembly—e.g.
<italic>de novo</italic>
assembly of whole genomes, transcripts, plasmids etc. (
<xref rid="btx471-B18" ref-type="bibr">Pertea
<italic>et al.</italic>
, 2015</xref>
;
<xref rid="btx471-B22" ref-type="bibr">Rozov
<italic>et al.</italic>
, 2017</xref>
)—and downstream applications—e.g. mapping reads to the graphs, variant calling, pangenome analysis etc. (
<xref rid="btx471-B8" ref-type="bibr">Iqbal
<italic>et al.</italic>
, 2012</xref>
;
<xref rid="btx471-B15" ref-type="bibr">Novak
<italic>et al.</italic>
, 2017</xref>
).</p>
<p>In recent years, much effort has been expended to reduce the amount of memory used for constructing assembly graphs and indexing them. Major advances often relied on index structures that saved memory by enabling subsets of possible queries: e.g. one could query what extensions a given substring
<italic>s</italic>
has, but not how many times
<italic>s</italic>
was seen in the input data. A great deal of success ensued in reducing the amount of memory needed to efficiently construct the central data structures used by most
<italic>de novo</italic>
assembly algorithms, namely, the de Bruijn and string graphs (
<xref rid="btx471-B3" ref-type="bibr">Chikhi and Rizk, 2012</xref>
;
<xref rid="btx471-B17" ref-type="bibr">Pell
<italic>et al.</italic>
, 2012</xref>
;
<xref rid="btx471-B24" ref-type="bibr">Simpson and Durbin, 2010</xref>
;
<xref rid="btx471-B26" ref-type="bibr">Ye
<italic>et al.</italic>
, 2012</xref>
). Furthermore, efficient conversion of de Bruijn graphs to their
<italic>compacted</italic>
form (essentially string graphs with fixed overlap size) has been demonstrated
<xref rid="btx471-B4" ref-type="bibr">Chikhi
<italic>et al.</italic>
, 2014</xref>
,
<xref rid="btx471-B5" ref-type="bibr">2016</xref>
;
<xref rid="btx471-B12" ref-type="bibr">Minkin
<italic>et al.</italic>
, 2016</xref>
).</p>
<p>In parallel to these efforts, streaming approaches were demonstrated as alternative resource-efficient means of performing analyses that had typically relied on static indices. Although appealing in terms of speed and low memory use, these approaches were initially demonstrated primarily for counting-centered applications such as estimating k-mer frequencies, error-correction of reads, and quantification of transcripts (
<xref rid="btx471-B11" ref-type="bibr">Melsted and Halldorsson, 2014</xref>
;
<xref rid="btx471-B13" ref-type="bibr">Mohamadi
<italic>et al.</italic>
, 2017</xref>
;
<xref rid="btx471-B21" ref-type="bibr">Roberts and Pachter, 2012</xref>
;
<xref rid="btx471-B25" ref-type="bibr">Song
<italic>et al.</italic>
, 2014</xref>
;
<xref rid="btx471-B27" ref-type="bibr">Zhang
<italic>et al.</italic>
, 2014</xref>
).</p>
<p>Recently, a first step towards bridging the gap between streaming approaches and those based on static index construction was taken, hinting at the potential benefits of combining the two.
<xref rid="btx471-B6" ref-type="bibr">El-Metwally
<italic>et al.</italic>
(2016)</xref>
demonstrated a streaming approach to assembly by making two passes on a set of reads. The first pass subsamples k-mers in the de Bruijn graph and inserts them into a Bloom filter, and the second uses this Bloom filter to identify ‘solid’ (likely correct) k-mers, which are then inserted into a second Bloom filter. This streaming approach resulted in very high resource efficiency in terms of memory and disk use. However, LightAssembler finds solid k-mers while disregarding paired-end and coverage information, and thus is limited in its ability to resolve repeats and to differentiate between different possible extensions in order to improve contiguity.</p>
<p>In this work, we extend this approach with the aim of providing a more complete alternative to downloading and storing reads for the sake of
<italic>de novo</italic>
assembly. We show this is achievable via online graph and index construction. We describe the Faucet algorithm, composed of an online phase and an offline phase. During the online phase, two passes are made on the reads without storing them locally to first load their k-mers into a Bloom filter, and then identify and record structural characteristics of the graph and associated metadata essential for achieving high contiguity in assembly. The offline phase uses all of this information together to iteratively clean and refine the graph structure.</p>
<p>We show that Faucet requires less disk space than the input data, in contrast with extant assemblers that require storing reads and often produce intermediate files that are larger than the input. We also show that the ratio of disk space Faucet uses to the input data improves with higher coverage levels by streaming successively larger subsets of a high coverage human genome sample. Furthermore, we introduce a new cleaning step called
<italic>disentanglement</italic>
enabled by storage of paired junction extensions in two Bloom filters—one meant for pairings inside a read, and one meant for junctions on separate paired end mates. We show the benefit of disentanglement via extensive experiments. Finally, we compared Faucet’s resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency—namely, Minia and LightAssembler. However, on metagenomes tested, Faucet’s outputs had 14–110% higher mean NGA50 lengths compared with Minia, and 2- to 11-fold higher mean NGA50 lengths compared with LightAssembler, the only other streaming assembler available.</p>
</sec>
<sec>
<title>2 Preliminaries</title>
<p>For a string
<italic>s</italic>
, we denote by
<italic>s</italic>
[
<italic>i</italic>
] the character at position
<italic>i</italic>
,
<italic>s</italic>
[
<italic>i</italic>
:
<italic>j</italic>
] the substring of
<italic>s</italic>
from position
<italic>i</italic>
to
<italic>j</italic>
(inclusive of both ends), and |
<italic>s</italic>
| the length of
<italic>s</italic>
. Let
<italic>pref</italic>
(
<italic>s</italic>
,
<italic>j</italic>
) be the prefix comprised of the first
<italic>j</italic>
characters of
<italic>s</italic>
and
<italic>suff</italic>
(
<italic>s</italic>
,
<italic>j</italic>
) be the suffix comprised of the last
<italic>j</italic>
characters of
<italic>s</italic>
. We denote concatenation of strings
<italic>s</italic>
and
<italic>t</italic>
by
<inline-formula id="IE1">
<mml:math id="IM1">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>°</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, and the reverse complement of a string
<italic>s</italic>
by
<inline-formula id="IE2">
<mml:math id="IM2">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>A
<italic>k-mer</italic>
is a string of length
<italic>k</italic>
drawn from the DNA alphabet
<inline-formula id="IE3">
<mml:math id="IM3">
<mml:mrow>
<mml:mo>Σ</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>G</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. The de Bruijn graph
<italic>G</italic>
(
<italic>S</italic>
,
<italic>k</italic>
) = (
<italic>V</italic>
,
<italic>E</italic>
) of a set of sequences
<italic>S</italic>
has nodes defined by consecutive k-mers in the sequences,
<inline-formula id="IE4">
<mml:math id="IM4">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
;
<italic>E</italic>
is the set of arcs defined by (
<italic>k</italic>
−1)−
<italic>mer</italic>
overlaps between nodes in
<italic>V</italic>
. Namely, identifying vertices with their k-mers,
<inline-formula id="IE5">
<mml:math id="IM5">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo></mml:mo>
<mml:mi>E</mml:mi>
<mml:mo></mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. Each node
<italic>v</italic>
is identified with its reverse complement
<inline-formula id="IE6">
<mml:math id="IM6">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, making the graph
<italic>G</italic>
bidirected, in that edges may represent overlaps between either orientation of each node (
<xref rid="btx471-B10" ref-type="bibr">Medvedev
<italic>et al.</italic>
, 2007</xref>
). When necessary, our explicit representation of nodes will use
<italic>canonical</italic>
node naming, i.e. the name of node
<inline-formula id="IE7">
<mml:math id="IM7">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
will be the lexicographically lesser of
<italic>v</italic>
and
<inline-formula id="IE8">
<mml:math id="IM8">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.
<italic>Junction nodes</italic>
are defined as k-mers having in-degree or out-degree > 1.
<italic>Terminal nodes</italic>
are k-mers having out-degree 1 and in-degree 0 or in-degree 1 and out-degree 0. Terminals and junctions are collectively referred to as
<italic>special nodes</italic>
. The
<italic>compacted de Bruijn graph</italic>
is obtained from a de Bruijn graph by merging all adjacent
<italic>non-branching nodes</italic>
(i.e. those having in-degree and out-degree of exactly 1). The string associated with merged adjacent nodes is the first k-mer, concatenated with the single character extensions of all following non-branching k-mers. Such merged non-branching paths are called
<italic>unitigs</italic>
.</p>
<p>Since a junction
<italic>v</italic>
having in-degree > 1 and out-degree 1 is identified with
<inline-formula id="IE9">
<mml:math id="IM9">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
having out-degree > 1 and in-degree 1, we speak of junction directions relative to the reading direction of the junction’s k-mer. Therefore, a
<italic>forward junction</italic>
has out-degree > 1, and a
<italic>back junction</italic>
has in-degree > 1. We refer to outbound k-mers beginning paths in the direction having out-degree > 1 as
<italic>heads</italic>
, and the sole outbound k-mer in the opposite direction as the junction’s
<italic>tail</italic>
. It is possible that a junction may have no tail.</p>
<p>A Bloom filter
<italic>B</italic>
is a space-efficient probabilistic hash table enabling insertion and approximate membership query operations (
<xref rid="btx471-B2" ref-type="bibr">Bloom, 1970</xref>
). The filter consists of a bit array of size
<italic>m</italic>
, and an element
<italic>x</italic>
is inserted to
<italic>B</italic>
by applying
<italic>h</italic>
hash functions,
<italic>f</italic>
<sub>0</sub>
, … ,
<italic>f
<sub>h</sub>
</italic>
<sub></sub>
<sub>1</sub>
such that
<inline-formula id="IE10">
<mml:math id="IM10">
<mml:mrow>
<mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>h</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo></mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, and setting values of the filter to 1 at the positions returned. For a Bloom filter
<italic>B</italic>
and string
<italic>s</italic>
, by
<inline-formula id="IE11">
<mml:math id="IM11">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
or the term ’s in B’ we refer to
<italic>B</italic>
[
<italic>s</italic>
] = 1, i.e. when the
<italic>h</italic>
hash functions used to load
<italic>B</italic>
are applied to
<italic>s</italic>
, only 1 values are returned. Similarly,
<inline-formula id="IE12">
<mml:math id="IM12">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
or ‘s not in B’ means that at least one of the
<italic>h</italic>
hash functions of
<italic>B</italic>
returned 0 when applied to
<italic>s</italic>
. For any
<italic>s</italic>
that has been inserted to
<italic>B</italic>
,
<italic>B</italic>
[
<italic>s</italic>
] = 1 by definition (i.e. there are no false negatives). However, false positives are possible, with a probability that can tuned by adjusting
<italic>m</italic>
or
<italic>h</italic>
appropriately.</p>
</sec>
<sec>
<title>3 Materials and methods</title>
<p>We developed an algorithm called Faucet for streaming
<italic>de novo</italic>
assembly graph construction. A bird’s eye view of its entire work-flow is provided in
<xref ref-type="fig" rid="btx471-F1">Figure 1</xref>
. Below we detail individual steps. </p>
<fig id="btx471-F1" orientation="portrait" position="float">
<label>Fig. 1.</label>
<caption>
<p>Faucet work-flow.
<bold>(A)</bold>
The online stage involves a first round of processing all reads in order to load Bloom filters
<italic>B</italic>
<sub>1</sub>
and
<italic>B</italic>
<sub>2</sub>
, and a second round in order to build the junction map M and load additional Bloom filters
<italic>B</italic>
<sub>3</sub>
and
<italic>B</italic>
<sub>4</sub>
.
<italic>M</italic>
stores the set of all junctions and extension counts for each junction, while
<italic>B</italic>
<sub>3</sub>
and
<italic>B</italic>
<sub>4</sub>
capture connections between junction pairs. The two online rounds capture information from and perform processing on each read, and the processing performed always depends on the current state of data structures being loaded.
<bold>(B)</bold>
The offline stage uses
<italic>B</italic>
<sub>2</sub>
and
<italic>M</italic>
, constructed during the online stage, in order to build the compacted de Bruijn graph by extending between special nodes using Bloom filter queries. ContigNodes (not shown) take the place of junctions and are stored in
<inline-formula id="IE13">
<mml:math id="IM13">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, allowing access (via stored pointers) to Contigs out of each junction, and coverage information. An additional vector of coverage values at fake or past junctions is also maintained for each Contig. Then,
<italic>B</italic>
<sub>3</sub>
,
<italic>B</italic>
<sub>4</sub>
, and this coverage information are used together to perform simplifications on and cleaning of the graph</p>
</caption>
<graphic xlink:href="btx471f1"></graphic>
</fig>
<sec>
<title>3.1 Online Bloom filter loading</title>
<p>Faucet begins by loading two Bloom filters,
<italic>B</italic>
<sub>1</sub>
and
<italic>B</italic>
<sub>2</sub>
, as it iterates through the reads, using the following procedure: all k-mers are inserted to
<italic>B</italic>
<sub>1</sub>
, and only k-mers already in
<italic>B</italic>
<sub>1</sub>
(i.e. those for which all hash queries return 1 from
<italic>B</italic>
<sub>1</sub>
) are inserted to
<italic>B</italic>
<sub>2</sub>
. Namely, for each k-mer
<italic>s</italic>
, if
<italic>B</italic>
<sub>1</sub>
[
<italic>s</italic>
] = 1 then we insert
<italic>s</italic>
into
<italic>B</italic>
<sub>2</sub>
, otherwise we insert into
<italic>B</italic>
<sub>1</sub>
. After iterating through all reads,
<italic>B</italic>
<sub>1</sub>
is discarded and only
<italic>B</italic>
<sub>2</sub>
is used for later stages. This procedure imposes a coverage threshold on the vast majority of k-mers so that primarily ‘solid k-mers’ (
<xref rid="btx471-B19" ref-type="bibr">Pevzner
<italic>et al.</italic>
, 2001</xref>
) observed at least twice are kept. This process is depicted in Round 1 of
<xref ref-type="fig" rid="btx471-F1">Figure 1A</xref>
. We note that a small proportion of singleton or false positive k-mers may evade this filtration. No count information is associated with k-mers at this round.</p>
</sec>
<sec>
<title>3.2 Online graph construction</title>
<p>
<italic>B</italic>
<sub>2</sub>
, loaded at the first round, enables Faucet to query possible forward extensions of each k-mer. Faucet iterates through all reads a second time to collect information necessary for avoiding false positive extensions, building the compacted de Bruijn graph, and later, cleaning the graph. The second round consists of finding junctions and terminal k-mers, recording their true extension counts, and recording k-mer pairs (Round 2 of
<xref ref-type="fig" rid="btx471-F1">Fig. 1A</xref>
).</p>
<p>Faucet’s Online stage has one main routine—Algorithm 1—that calls upon two subroutines—Algorithms 2 and 3. First, junction k-mers and their start positions are derived from a call to Algorithm 2. To find junctions, Algorithm 2 makes all possible alternate extension queries (Lines 3–5) to
<italic>B</italic>
<sub>2</sub>
for each k-mer in the read sequence
<italic>r</italic>
. A junction k-mer
<italic>j</italic>
may have multiple extensions in
<italic>B</italic>
<sub>2</sub>
—either because there are multiple extensions of
<italic>j</italic>
in
<italic>G</italic>
that are all real (i.e. present on some read), or because there is at least one real extension in
<italic>G</italic>
and some others in
<italic>B</italic>
<sub>2</sub>
that are false positives. Accordingly, each k-mer possessing at least one extension that differs from the next base on the read is identified as a junction. Whenever one is found, its sequence along with its start position are recorded (Line 4), and the list of such tuples is returned. We note that each k-mer in the read is also queried for junctions in the reverse complement direction, but this is not shown in Algorithm 2.</p>
<p>
<boxed-text id="btx471-BOX1" position="float" orientation="portrait">
<label>Algorithm 1</label>
<caption>
<p>
<italic>scanReads</italic>
(
<italic>R</italic>
,
<italic>B</italic>
<sub>2</sub>
)</p>
</caption>
<p>
<bold>Input:</bold>
read set
<italic>R</italic>
, Bloom filter
<italic>B</italic>
<sub>2</sub>
loaded from round 1, an empty Bloom filter
<italic>B</italic>
<sub>3</sub>
</p>
<p>
<bold>Output:</bold>
1. a junction Map
<italic>M</italic>
comprised of (
<italic>key</italic>
,
<italic>value</italic>
) pairs. Each
<italic>key</italic>
is a junction k-mer, and each
<inline-formula id="IE14">
<mml:math id="IM14">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
<mml:msup>
<mml:mo></mml:mo>
<mml:mn>4</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
is a vector
<inline-formula id="IE15">
<mml:math id="IM15">
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>A</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>C</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>G</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of counts representing the number of times each possible extension of
<italic>key</italic>
was observed in
<italic>R</italic>
; 2.
<italic>B</italic>
<sub>3</sub>
is loaded with linked k-mer pairs (i.e. specific 2k-mers—see text—are hashed in).</p>
<p>1:
<inline-formula id="IE16">
<mml:math id="IM16">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>2:
<bold>for</bold>
<inline-formula id="IE17">
<mml:math id="IM17">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>do</bold>
</p>
<p>3:  
<inline-formula id="IE18">
<mml:math id="IM18">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>J</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
▹ call to Algorithm 2</p>
<p>4:  
<bold>for</bold>
<inline-formula id="IE19">
<mml:math id="IM19">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>do</bold>
</p>
<p>5:   
<bold>if</bold>
<inline-formula id="IE20">
<mml:math id="IM20">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo></mml:mo>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>then</bold>
</p>
<p>6:    
<inline-formula id="IE21">
<mml:math id="IM21">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo></mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>7:   
<italic>increment counter in</italic>
<inline-formula id="IE22">
<mml:math id="IM22">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo> </mml:mo>
<mml:mi>f</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo> </mml:mo>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
▹ call to Algorithm 3</p>
<p>8:
<bold>return</bold>
<italic>M</italic>
,
<italic>B</italic>
<sub>3</sub>
</p>
</boxed-text>
</p>
<p>Algorithm 1 then uses this set of junctions to perform accounting (Lines 4–8). All junctions are inserted into a hash map
<italic>M</italic>
that maps junction k-mers to vectors maintaining counts for each extension. For each junction of
<italic>r</italic>
, a count of 0 is initialized for each possible extension. These counters are only incremented based on extensions observed on reads—i.e. extensions due to Bloom filter outputs alone are not counted. As every real extension out of each junction must be observed on some read, and we scan the entire set of reads, an extension will have non-zero count only if it is real. This mechanism allows Faucet to maintain coverage counts for all real extensions out of junctions. In later stages, only extensions having non-zero counts will be visited, but counts are stored for real extensions of false junctions as well. These latter counts are used to sample coverage distributions on unitig sequences at more points than just their ends. Proportions of real junctions versus the totals stored after accounting are described in the section ‘Solid junction counts’ in the
<xref ref-type="supplementary-material" rid="sup1">Supplementary Appendix</xref>
.
<boxed-text id="btx471-BOX2" position="float" orientation="portrait">
<label>Algorithm 2</label>
<caption>
<p>
<italic>findJunctions</italic>
(
<italic>r</italic>
,
<italic>B</italic>
<sub>2</sub>
)</p>
</caption>
<p>
<bold>Input:</bold>
read
<italic>r</italic>
and Bloom filter
<italic>B</italic>
<sub>2</sub>
</p>
<p>
<bold>Output:</bold>
juncTuples, a list of tuples (
<italic>seq</italic>
,
<italic>p</italic>
), where
<italic>p</italic>
is the start position of junction k-mer
<italic>seq</italic>
in
<italic>r</italic>
, in order of appearance on
<italic>r</italic>
</p>
<p>1:
<inline-formula id="IE23">
<mml:math id="IM23">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>2:
<bold>for</bold>
<inline-formula id="IE24">
<mml:math id="IM24">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>do</bold>
<inline-formula id="IE25">
<mml:math id="IM25">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo></mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>3:  
<bold>for</bold>
<inline-formula id="IE26">
<mml:math id="IM26">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo></mml:mo>
<mml:mo>Σ</mml:mo>
<mml:mi mathvariant="normal"></mml:mi>
<mml:mo>{</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
}
<bold>do</bold>
</p>
<p>4:   
<bold>if</bold>
<inline-formula id="IE27">
<mml:math id="IM27">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>°</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>then</bold>
  
<inline-formula id="IE28">
<mml:math id="IM28">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>5:
<bold>return</bold>
<italic>juncTuples</italic>
</p>
</boxed-text>
</p>
<p>Following the accounting performed on observed junctions, Faucet records adjacencies between pairs of junctions using additional Bloom filters—
<italic>B</italic>
<sub>3</sub>
and
<italic>B</italic>
<sub>4</sub>
. These adjacencies are needed for disentanglement—a cleaning step applied in Faucet’s offline stage. Disentanglement, depicted in
<xref ref-type="fig" rid="btx471-F2">Figure 2</xref>
, is a means of repeat resolution. Its purpose is to split paths that have been merged due to the presence of a shared segment—the repeat—in both paths. In order to ‘disentangle’, or resolve the tangled region into its underlying latent paths, we seek to store sequences that flank opposite ends of the repeat. Pairs of heads observed on reads provide a means of ‘reading out’ such latent paths by indicating which heads co-occur on sequenced DNA fragments. The application of disentanglement is presented in the section ‘Offline graph simplification and cleaning’, while we now focus on the mechanism of pair collection and its rationale. To capture short and long range information separately, Bloom filter
<italic>B</italic>
<sub>3</sub>
holds head pairs on the same read, while
<italic>B</italic>
<sub>4</sub>
holds heads chosen such that each head is on a different mate of a paired-end read. Algorithm 3 is the process by which pairs are inserted into
<italic>B</italic>
<sub>3</sub>
, and insertion into
<italic>B</italic>
<sub>4</sub>
is described in the
<xref ref-type="supplementary-material" rid="sup1">Supplementary Appendix</xref>
. </p>
<fig id="btx471-F2" orientation="portrait" position="float">
<label>Fig. 2.</label>
<caption>
<p>Disentanglement.
<bold>(A)</bold>
A
<italic>tangle</italic>
characterized by two opposite facing junctions
<italic>j</italic>
<sub>1</sub>
and
<italic>j</italic>
<sub>2</sub>
, each with out-degree 2.
<bold>(B)</bold>
Junction pairs linking extensions on
<italic>s
<sub>a</sub>
</italic>
with
<italic>s
<sub>c</sub>
</italic>
and
<italic>s
<sub>b</sub>
</italic>
with
<italic>s
<sub>d</sub>
</italic>
. Since no pairs link extensions on
<italic>s
<sub>a</sub>
</italic>
with
<italic>s
<sub>d</sub>
</italic>
or
<italic>s
<sub>b</sub>
</italic>
with
<italic>s
<sub>c</sub>
</italic>
, only one orientation is supported.
<bold>(C)</bold>
the result of disentanglement: paths [
<italic>s
<sub>a</sub>
</italic>
,
<italic>s</italic>
,
<italic>s
<sub>c</sub>
</italic>
] and [
<italic>s
<sub>b</sub>
</italic>
,
<italic>s</italic>
,
<italic>s
<sub>d</sub>
</italic>
] are each merged into individual sequences, and junctions
<italic>j</italic>
<sub>1</sub>
and
<italic>j</italic>
<sub>2</sub>
are removed from
<italic>M</italic>
</p>
</caption>
<graphic xlink:href="btx471f2"></graphic>
</fig>
<p>In Algorithm 3, we aim to pair heads that are maximally informative. Informative pairs are those that allow us to ‘read out’ pairs of unitigs that belong to the same latent path. We specifically choose to insert heads because during the offline stage when disentanglement takes place, adjacencies between each unitig starting at an edge to a head and the unitig starting at the edge from the junction to its tail of are known and accessible via pointers to their sequences. Therefore, extension pairs capturing information of direct adjacencies provide no new information. The closest indirect adjacency that may be informative when captured from a read is that between two junctions that either face in the same direction, or when the first faces back and the second faces forward, as shown in
<xref ref-type="fig" rid="btx471-F3">Figure 3A</xref>
. Thus, when there are only two junctions on a read, their pair of heads is inserted as long as the two junctions are not facing each other. When there are at least three junctions on a read, every other junction out of every consecutive triplet is paired, as shown for a single triplet in
<xref ref-type="fig" rid="btx471-F3">Figure 3B</xref>
. This figure demonstrates that selecting every other head is preferable to selecting consecutive heads out of a triplet. This type of insertion is executed in Lines 1–6 of Algorithm 3 and ensures all unitigs flanking some triplet are potentially inferable. For reads having more than three junctions, applying the triplet rule for every consecutive window of size 3 similarly allows for all unitigs on the read to be included in some hashed pair. </p>
<fig id="btx471-F3" orientation="portrait" position="float">
<label>Fig. 3.</label>
<caption>
<p>Rationale for
<italic>B</italic>
<sub>3</sub>
insertions. Narrow blue arrows indicate unitigs observed on a read, green circles are junctions and thick arrows are junction heads. Among red arrows, solids are those inserted to
<italic>B</italic>
<sub>3</sub>
. For simplicity, we provide a direction to each arrow. The opposite direction is equally valid, hence in this view heads can also enter a junction and not only exit from it. In each case, a pair of red heads is inserted from a read. They will be inserted if they provide additional information to infer a path on the graph. Black lines indicate a subset of possible paths; out of these the solid path is that observed on a read.
<bold>(A)</bold>
Two junctions observed on a read. I, II: The two heads together imply the solid paths and rule out alternatives, so the pair is inserted to B3. III: The two heads lie on the ends of the same unitig and thus add no information.
<bold>(B)</bold>
Three junctions observed on a read, comparing insertions of consecutive heads against non-consecutive heads. Four possible arrangements are shown; there are four more that are symmetrical reflections and are not shown to save space. In each case, we compare the unitigs covered (i.e. either having a head on them or being a sole extension at a junctionve heads against non-consecconsecutive (top) and non-consecutive (bottom) junctions are chosen. Note that in Cases I–III the right-most unitig is not covered under consecutive heads (Color version of this figure is available at
<italic>Bioinformatics</italic>
online.)</p>
</caption>
<graphic xlink:href="btx471f3"></graphic>
</fig>
<p>
<boxed-text id="btx471-BOX3" position="float" orientation="portrait">
<label>Algorithm 3</label>
<caption>
<p>
<italic>recordPairs</italic>
(
<italic>r</italic>
,
<italic>juncs</italic>
,
<italic>B</italic>
<sub>3</sub>
)</p>
</caption>
<p>
<bold>Input:</bold>
read
<italic>r</italic>
,
<italic>juncs—</italic>
a list of pairs (
<italic>j</italic>
,
<italic>p</italic>
), where
<italic>p</italic>
is the start position of junction
<italic>j</italic>
in
<italic>r</italic>
, and Bloom filter
<italic>B</italic>
<sub>3</sub>
. We also make use of a subroutine
<italic>getOutExt</italic>
(
<italic>j
<sub>i</sub>
</italic>
,
<italic>p
<sub>i</sub>
</italic>
,
<italic>r</italic>
) that for a junction
<italic>j
<sub>i</sub>
</italic>
returns
<inline-formula id="IE29">
<mml:math id="IM29">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>°</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
if
<italic>j
<sub>i</sub>
</italic>
is a back junction, and
<inline-formula id="IE30">
<mml:math id="IM30">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>°</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
otherwise.</p>
<p>
<bold>Output:</bold>
Bloom filter
<italic>B</italic>
<sub>3</sub>
, loaded with select
<italic>linked k-mer pairs</italic>
</p>
<p>1:
<bold>if</bold>
<italic>len</italic>
(
<italic>juncs</italic>
) > 2
<bold>then</bold>
</p>
<p>2:  
<bold>for</bold>
<inline-formula id="IE31">
<mml:math id="IM31">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
<bold>do</bold>
</p>
<p>3:   
<inline-formula id="IE32">
<mml:math id="IM32">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>E</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>4:   
<inline-formula id="IE33">
<mml:math id="IM33">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo></mml:mo>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>E</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>5:   
<inline-formula id="IE34">
<mml:math id="IM34">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo>°</mml:mo>
<mml:mi>f</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
▹ insert the concatenation into
<italic>B</italic>
<sub>3</sub>
</p>
<p>6:
<bold>else if</bold>
(
<inline-formula id="IE35">
<mml:math id="IM35">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo></mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>¬</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo> </mml:mo>
<mml:mtext>is</mml:mtext>
<mml:mo> </mml:mo>
<mml:mi mathvariant="normal">a</mml:mi>
<mml:mo> </mml:mo>
<mml:mtext>forward</mml:mtext>
<mml:mo> </mml:mo>
<mml:mtext>junction</mml:mtext>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo> </mml:mo>
<mml:mtext>is</mml:mtext>
<mml:mo> </mml:mo>
<mml:mi mathvariant="normal">a</mml:mi>
<mml:mo> </mml:mo>
<mml:mtext>back</mml:mtext>
<mml:mo> </mml:mo>
<mml:mtext>junction</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
)
<bold>then</bold>
</p>
<p>7:  
<inline-formula id="IE36">
<mml:math id="IM36">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>E</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>8:  
<inline-formula id="IE37">
<mml:math id="IM37">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo></mml:mo>
<mml:mi>g</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>E</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>j</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>9:  
<inline-formula id="IE38">
<mml:math id="IM38">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo>°</mml:mo>
<mml:mi>f</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
<p>10:
<bold>return</bold>
<italic>B</italic>
<sub>3</sub>
</p>
</boxed-text>
</p>
</sec>
<sec>
<title>3.3 Offline graph simplification and cleaning</title>
<p>Given
<italic>B</italic>
<sub>2</sub>
,
<italic>B</italic>
<sub>3</sub>
,
<italic>B</italic>
<sub>4</sub>
and
<italic>M</italic>
resulting from the online stage, the compacted de Bruijn graph is generated by traversing each forward extension out of every special k-mer, as well as traversing backwards in the reverse complement direction when the node has not been reached before by a traversal starting from another node. This is done by querying
<italic>B</italic>
<sub>2</sub>
for extensions and continuing until the next special node is reached. During each such traversal from special node
<italic>u</italic>
to special node
<italic>v</italic>
, a unitig sequence
<italic>s
<sub>uv</sub>
</italic>
is constructed.
<italic>s
<sub>uv</sub>
</italic>
is initialized to the sequence of
<italic>u</italic>
, and a base is added at each extension until
<italic>v</italic>
is reached.</p>
<p>New data structures are constructed in the course of traversals in order to aid later queries and updates. A
<italic>ContigNode</italic>
structure is used to represent a junction that points to
<italic>Contigs</italic>
. ContigNodes are structures possessing a pointer to a Contig at each forward extension, as well as one backwards pointer. This backwards pointer connects the junction to the sequence beginning with the reverse complement of the junction’s k-mer. Contigs initially store unitig sequences, but these may later be concatenated or duplicated. They also point to one ContigNode at each end. To efficiently query Contigs and ContigNodes, a new hashmap
<inline-formula id="IE39">
<mml:math id="IM39">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
is constructed having junction k-mers as keys, and ContigNodes that represent those junctions as values. Isolated contigs formed by unitigs that extend between terminal nodes are stored in a separate set data structure.</p>
<p>Once the raw graph is obtained, cleaning steps commence, incorporating tip removal, chimera removal, collapsing of bulges, and disentanglement. Coverage information and paired-junction links are crucial to these steps. Briefly, tip removal involves deletion of Contigs shorter than the input read length that lead to a terminal node. Chimera and bulge removal steps involve heuristics designed to remove low coverage Contigs when a more credible alternative (higher coverage, or involved in more sub-paths) is identified. These first three steps proceed as described in (
<xref rid="btx471-B1" ref-type="bibr">Bankevich
<italic>et al.</italic>
, 2012</xref>
), thus we omit their full description here.</p>
<p>Disentanglement relies on paired junction links inserted into
<italic>B</italic>
<sub>3</sub>
and
<italic>B</italic>
<sub>4</sub>
. We iterate through the set of ContigNodes to look for ‘tangles’—pairs of opposite-facing junctions joined by a repeat sequence—as shown in
<xref ref-type="fig" rid="btx471-F2">Figure 2</xref>
. Tangles are characterized by tuples (
<italic>j</italic>
<sub>1</sub>
,
<italic>j</italic>
<sub>2</sub>
,
<italic>s</italic>
) where
<italic>j</italic>
<sub>1</sub>
is a back junction,
<italic>j</italic>
<sub>2</sub>
is a forward junction (or vice-versa), and there is a common Contig
<italic>s</italic>
pointed to by the back pointers of both
<italic>j</italic>
<sub>1</sub>
and
<italic>j</italic>
<sub>2</sub>
. Junctions
<italic>j</italic>
<sub>1</sub>
and
<italic>j</italic>
<sub>2</sub>
each have at least two outward extensions. We restrict cleaning to tangles having exactly two extensions at each end. Let
<italic>s
<sub>a</sub>
</italic>
and
<italic>s
<sub>b</sub>
</italic>
be the Contigs starting at heads of
<italic>j</italic>
<sub>1</sub>
, and
<italic>s
<sub>c</sub>
</italic>
and
<italic>s
<sub>d</sub>
</italic>
be the Contigs starting at heads of
<italic>j</italic>
<sub>2</sub>
. By disentangling, we seek to pair extensions at each side of
<italic>s</italic>
to form two paths. The possible outputs are paths [
<italic>s
<sub>a</sub>
</italic>
,
<italic>s</italic>
,
<italic>s
<sub>c</sub>
</italic>
] together with [
<italic>s
<sub>b</sub>
</italic>
,
<italic>s</italic>
,
<italic>s
<sub>d</sub>
</italic>
] or [
<italic>s
<sub>a</sub>
</italic>
,
<italic>s</italic>
,
<italic>s
<sub>d</sub>
</italic>
] together with [
<italic>s
<sub>b</sub>
</italic>
,
<italic>s</italic>
,
<italic>s
<sub>c</sub>
</italic>
].</p>
<p>Thus, each such pair straddling the tangle—e.g. having one head on
<italic>s
<sub>a</sub>
</italic>
and the other on
<italic>s
<sub>c</sub>
</italic>
—lends some support to the hypothesis that the correct split is that which pairs the two. To decide between the two possible split orientations, we count the number of pairs supporting each by querying
<italic>B</italic>
<sub>3</sub>
or
<italic>B</italic>
<sub>4</sub>
for all possible junction pairings that are separated by a characteristic length associated with the pairs inserted to each. For example,
<italic>B</italic>
<sub>3</sub>
stores heads out of non-consecutive junction pairs on the same read. Therefore, for each junction on
<italic>s
<sub>a</sub>
</italic>
we count each pairing accepted by
<italic>B</italic>
<sub>3</sub>
with a junction on
<italic>s
<sub>c</sub>
</italic>
that is at most one read length away. Specifically for
<italic>B</italic>
<sub>3</sub>
, we also know that inserted pairs are always one or two junctions away from the starting junction, based on the scheme presented in
<xref ref-type="fig" rid="btx471-F3">Figure 3</xref>
. To decide when a tangle should be split, we apply XOR logic to arrive at a decision: if the count of pairs supporting both paths in one orientation is > 0, and the count of both paths in the other orientation is 0, we disentangle according to the first, as shown in
<xref ref-type="fig" rid="btx471-F2">Figure 2</xref>
. Similar yet more involved reasoning is used for junction links in
<italic>B</italic>
<sub>4</sub>
, using the insert size between read pairs (see
<xref ref-type="supplementary-material" rid="sup1">Supplementary Appendix</xref>
). Once we arrive at a decision, we add a new sequence to the set of Contigs that is the concatenation of the sequences involved in the original paths. We note one of the consequences of this simplification step is that the graph no longer represents a de Bruijn graph, in that each k-mer is no longer guaranteed to appear at most once in the graph. Furthermore, the XOR case presented is the most frequently applied form of disentanglement out of a few alternatives. We discuss these alternatives in the
<xref ref-type="supplementary-material" rid="sup1">Supplementary Appendix</xref>
.</p>
</sec>
<sec>
<title>3.4 Optimizations and technical details</title>
<p>Here we discuss some details omitted from the above descriptions for the sake of completeness. Based on the description of Algorithms 1 and 2, it is possible that false positive extensions out of terminal nodes will ensue. This is possible because the mechanism described for removing false positive junctions can differentiate between one or multiple extensions existing in
<italic>G</italic>
for a given node, but cannot differentiate between one or none. This may lead to assembly errors at sink nodes.</p>
<p>To overcome such effects, we store distances between junctions seen on the same read with the distance recorded being assigned to the extension of each junction observed on the read. When an outermost junction on a read has not been previously linked to another junction, we record its distance from the nearest read end—this solves the problem mentioned previously as long as paths to sinks are shorter than read length. To obtain accurate measurements of distances on longer non-branching paths, we also introduce artificial ‘dummy’ junctions whenever a pre-defined length threshold is surpassed. In effect, this means that reads with no real junctions are assigned dummy junctions.</p>
<p>Once distances and dummy junctions are introduced, an additional benefit is gained: the speed of the read-scan can be improved by skipping between junctions that have been seen before. Once distances are known, if we see a particular extension out of a junction, and then a sequence of length
<italic></italic>
without any junctions, then, wherever else we see that junction and extension, it must be followed by the exact same
<italic></italic>
next bases. Otherwise, there would be a junction earlier. So we store
<italic></italic>
when we see it, and skip subsequent occurrences.</p>
<p>Finally, we note that Faucet can benefit from precise Bloom filter sizing. When a good estimate of dataset parameters is known, the algorithm can do the two-pass process above. Otherwise, to determine the numbers of distinct k-mers and the number of singletons in the dataset in a streaming manner, we have used the tool ntCard (
<xref rid="btx471-B13" ref-type="bibr">Mohamadi
<italic>et al.</italic>
, 2017</xref>
). This requires an additional pass over the reads (for a total of three passes). The added pass does not increase RAM or disk use. In fact, in tests on locally stored data, we found it only adds negligible time.</p>
</sec>
</sec>
<sec>
<title>4 Results</title>
<sec>
<title>4.1 Assembling while downloading</title>
<p>As a demonstration of streaming assembly, we ran Faucet on publicly available human data, SRR034939, used for benchmarking in (
<xref rid="btx471-B3" ref-type="bibr">Chikhi and Rizk, 2012</xref>
). To assess resource use at different data volumes, we ran Faucet on 10, 20 and 37 paired-end files out of 37 total. Streaming was enabled using standard Linux command line tools: wget was used for commencing a download from a supplied URL, and streamed reading from the compressed data was enabled by the bzip2 utility. Downloads were initiated separately for each run. The streaming results are shown in
<xref rid="btx471-T1" ref-type="table">Table 1</xref>
.
<table-wrap id="btx471-T1" orientation="portrait" position="float">
<label>Table 1.</label>
<caption>
<p>Resource use and data compression observed as data volume increases</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col valign="top" align="left" span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1">No. of files</th>
<th rowspan="1" colspan="1">Time (h)</th>
<th rowspan="1" colspan="1">RAM (GB)</th>
<th rowspan="1" colspan="1">Disk (GB)</th>
<th rowspan="1" colspan="1">Data size (GB)</th>
<th rowspan="1" colspan="1">Comp. ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">26.3</td>
<td rowspan="1" colspan="1">48.3</td>
<td rowspan="1" colspan="1">19.0</td>
<td rowspan="1" colspan="1">29.6</td>
<td rowspan="1" colspan="1">0.64</td>
</tr>
<tr>
<td rowspan="1" colspan="1">20</td>
<td rowspan="1" colspan="1">47.7</td>
<td rowspan="1" colspan="1">84.3</td>
<td rowspan="1" colspan="1">34.3</td>
<td rowspan="1" colspan="1">59.2</td>
<td rowspan="1" colspan="1">0.58</td>
</tr>
<tr>
<td rowspan="1" colspan="1">37</td>
<td rowspan="1" colspan="1">98.2</td>
<td rowspan="1" colspan="1">144.7</td>
<td rowspan="1" colspan="1">50.0</td>
<td rowspan="1" colspan="1">108.4</td>
<td rowspan="1" colspan="1">0.46</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>We emphasize that Faucet required less space than the size of the input data in order to assemble it, while most assemblers generate files during the course of their processing that are larger than the input data. Also, the ratio of input data to disk used by Faucet decreased as data volume increased, reflecting the tendency of sequences to be seen repeatedly with high coverage. We also note that Faucet’s outputs effectively create a lossy compression of the read data, in that the choice of k value inherently creates some ambiguity for read substrings larger than k. This compression format is also queryable, in that given a k-mer in the graph, its extensions can be found: indeed, this is the basis of Faucet’s graph construction and cleaning.</p>
</sec>
<sec>
<title>4.2 Disentanglement assessment</title>
<p>To gauge the benefits of disentanglement on assembly quality, we compared Faucet’s outputs with and without each of short- and long-range pairing information, provided by Bloom filters
<italic>B</italic>
<sub>3</sub>
and
<italic>B</italic>
<sub>4</sub>
, on SYN 64—a synthetic metagenome produced to provide a dataset for which the ground truth is known comprised of 64 species (dataset sizes and additional characteristics are provided in the
<xref ref-type="supplementary-material" rid="sup1">Supplementary Appendix</xref>
). The results of this assessment are presented in
<xref rid="btx471-T2" ref-type="table">Table 2</xref>
. We measured assembly contiguity by the NGA50 measure. NG50 is defined as ‘the contig length such that using equal or longer length contigs produces x% of the length of the reference genome, rather than x% of the assembly length’ in (
<xref rid="btx471-B7" ref-type="bibr">Gurevich
<italic>et al.</italic>
, 2013</xref>
). NGA50 is an adjustment of the NG50 measure designed to penalize contigs composed of misassembled parts by breaking contigs into aligned blocks after alignment to the reference. We found that disentanglement more than doubled contiguity measured by mean NGA50 values, with greater gains as more kinds of disentanglement were enabled. This was also reflected by corresponding gains in the genome fractions, and in the number of species for which at least 50% of the genome was aligned to, allowing NGA50 scores to be reported. More applications of disentanglement also increased the number of misassemblies reported and the duplication ratio; however, two-thirds of the maximum misassembly count is already seen without any disentanglement applied.
<table-wrap id="btx471-T2" orientation="portrait" position="float">
<label>Table 2.</label>
<caption>
<p>The effect of increasing levels of disentanglement on contiguity and accuracy</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col valign="top" align="left" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1">Measure</th>
<th rowspan="1" colspan="1">No disent.</th>
<th rowspan="1" colspan="1">
<italic>B</italic>
<sub>3</sub>
only</th>
<th rowspan="1" colspan="1">
<italic>B</italic>
<sub>4</sub>
only</th>
<th rowspan="1" colspan="1">both
<italic>B</italic>
<sub>3</sub>
,
<italic>B</italic>
<sub>4</sub>
</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Genome fraction (%)</td>
<td rowspan="1" colspan="1">76.4</td>
<td rowspan="1" colspan="1">79.9</td>
<td rowspan="1" colspan="1">80.3</td>
<td rowspan="1" colspan="1">82.3</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Dup. ratio</td>
<td rowspan="1" colspan="1">1.00</td>
<td rowspan="1" colspan="1">1.01</td>
<td rowspan="1" colspan="1">1.02</td>
<td rowspan="1" colspan="1">1.02</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Mean NGA50</td>
<td align="right" rowspan="1" colspan="1">13048</td>
<td align="right" rowspan="1" colspan="1">21703</td>
<td align="right" rowspan="1" colspan="1">26356</td>
<td align="right" rowspan="1" colspan="1">29066</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Misassemblies</td>
<td align="right" rowspan="1" colspan="1">388</td>
<td align="right" rowspan="1" colspan="1">480</td>
<td align="right" rowspan="1" colspan="1">521</td>
<td align="right" rowspan="1" colspan="1">572</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Species reported</td>
<td align="right" rowspan="1" colspan="1">54</td>
<td align="right" rowspan="1" colspan="1">56</td>
<td align="right" rowspan="1" colspan="1">56</td>
<td align="right" rowspan="1" colspan="1">56</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec>
<title>4.3 Tools comparison</title>
<p>We sought to assess Faucet’s effectiveness in assembling metagenomes, and its resource efficiency. For the former, we compared Faucet to MetaSPAdes (
<xref rid="btx471-B16" ref-type="bibr">Nurk
<italic>et al.</italic>
, 2017</xref>
) and Megahit (
<xref rid="btx471-B9" ref-type="bibr">Li
<italic>et al.</italic>
, 2014</xref>
), state of the art metagenome assemblers in terms of contiguity and accuracy that require substantial resources. To address resource efficiency, we also compared Faucet to two leading resource efficient assemblers, Minia 3 (Beta) (
<xref rid="btx471-B3" ref-type="bibr">Chikhi and Rizk, 2012</xref>
) and LightAssembler (
<xref rid="btx471-B6" ref-type="bibr">El-Metwally
<italic>et al.</italic>
, 2016</xref>
). We note these last two were not designed as metagenome assemblers, but they perform operations similar to what Faucet does—both in the course of their graph construction steps, and in their cleaning steps. They differ from Faucet in that neither is capable of disentanglement, as they do not utilize paired-end information, but counter this advantage with more sophisticated traversal schemes. All tools were run on two metagenome datasets—SYN64 and HMP—a female tongue dorsum sample sequenced as part of the Human Microbiome Project. Both datasets were used for testing in (
<xref rid="btx471-B16" ref-type="bibr">Nurk
<italic>et al.</italic>
, 2017</xref>
). To achieve a fair comparison, runs were performed with a single thread on the same machine, as Faucet does not currently support multi-threaded execution. Full details of the comparison, including versions, parameters, and data accessions, are presented in the
<xref ref-type="supplementary-material" rid="sup1">Supplementary Appendix</xref>
.</p>
<p>
<xref rid="btx471-T3" ref-type="table">Table 3</xref>
presents the full results for the tools comparison. There was a strong advantage to Megahit and MetaSPAdes over the three lightweight assemblers (Minia, LightAssembler, and Faucet) in terms of contiguity achieved (shown by NGA50 statistics), but this came at a large cost in terms of memory, disk space, and time, particularly in the case of MetaSPAdes. Among the lightweight assemblers, Minia used by far the most disk space, and differences in other resource measures were less pronounced. Among these three, Faucet had a large advantage in NGA50 statistics relative to the other two. This is highlighted by the trend of
<xref rid="btx471-T3" ref-type="table">Table 3</xref>
, and shown by its 14–110% advantage in the mean of NGA50 relative to Minia, and 2- to 11-fold advantage relative to LightAssembler.
<table-wrap id="btx471-T3" orientation="portrait" position="float">
<label>Table 3.</label>
<caption>
<p>Tool comparison on two metagenomes</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col valign="top" align="left" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
<col valign="top" align="right" span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th colspan="5" align="center" rowspan="1">SYN64
<hr></hr>
</th>
<th colspan="5" align="center" rowspan="1">HMP
<hr></hr>
</th>
</tr>
<tr>
<th rowspan="1" colspan="1">Measure</th>
<th rowspan="1" colspan="1">Metaspades</th>
<th rowspan="1" colspan="1">Megahit</th>
<th rowspan="1" colspan="1">LightAssembler</th>
<th rowspan="1" colspan="1">Minia</th>
<th rowspan="1" colspan="1">Faucet</th>
<th rowspan="1" colspan="1">Metaspades</th>
<th rowspan="1" colspan="1">Megahit</th>
<th rowspan="1" colspan="1">LightAssembler</th>
<th rowspan="1" colspan="1">Minia</th>
<th rowspan="1" colspan="1">Faucet</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Genome fraction (%)</td>
<td rowspan="1" colspan="1">89.1</td>
<td rowspan="1" colspan="1">90.1</td>
<td rowspan="1" colspan="1">75.6</td>
<td rowspan="1" colspan="1">76.5</td>
<td rowspan="1" colspan="1">82.3</td>
<td rowspan="1" colspan="1">46.9</td>
<td rowspan="1" colspan="1">48.6</td>
<td rowspan="1" colspan="1">23.4</td>
<td rowspan="1" colspan="1">27.8</td>
<td rowspan="1" colspan="1">27.9</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Duplication ratio</td>
<td rowspan="1" colspan="1">1.02</td>
<td rowspan="1" colspan="1">1.02</td>
<td rowspan="1" colspan="1">1.01</td>
<td rowspan="1" colspan="1">1.00</td>
<td rowspan="1" colspan="1">1.02</td>
<td rowspan="1" colspan="1">1.05</td>
<td rowspan="1" colspan="1">1.12</td>
<td rowspan="1" colspan="1">1.02</td>
<td rowspan="1" colspan="1">1.01</td>
<td rowspan="1" colspan="1">1.05</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Mean NGA50 (kb)</td>
<td rowspan="1" colspan="1">167</td>
<td rowspan="1" colspan="1">99.0</td>
<td rowspan="1" colspan="1">2.60</td>
<td rowspan="1" colspan="1">14.6</td>
<td rowspan="1" colspan="1">30.7</td>
<td rowspan="1" colspan="1">28.3</td>
<td rowspan="1" colspan="1">36.8</td>
<td rowspan="1" colspan="1">3.18</td>
<td rowspan="1" colspan="1">6.25</td>
<td rowspan="1" colspan="1">7.12</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Median NGA50 (kb)</td>
<td rowspan="1" colspan="1">71.1</td>
<td rowspan="1" colspan="1">57.6</td>
<td rowspan="1" colspan="1">2.30</td>
<td rowspan="1" colspan="1">10.5</td>
<td rowspan="1" colspan="1">23.7</td>
<td rowspan="1" colspan="1">28.3</td>
<td rowspan="1" colspan="1">36.8</td>
<td rowspan="1" colspan="1">3.18</td>
<td rowspan="1" colspan="1">6.25</td>
<td rowspan="1" colspan="1">7.12</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Misassemblies</td>
<td rowspan="1" colspan="1">785</td>
<td rowspan="1" colspan="1">949</td>
<td rowspan="1" colspan="1">314</td>
<td rowspan="1" colspan="1">395</td>
<td rowspan="1" colspan="1">572</td>
<td rowspan="1" colspan="1">504</td>
<td rowspan="1" colspan="1">602</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">184</td>
<td rowspan="1" colspan="1">202</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Species reported</td>
<td rowspan="1" colspan="1">59</td>
<td rowspan="1" colspan="1">61</td>
<td rowspan="1" colspan="1">55</td>
<td rowspan="1" colspan="1">52</td>
<td rowspan="1" colspan="1">56</td>
<td rowspan="1" colspan="1">12</td>
<td rowspan="1" colspan="1">12</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Time (h)</td>
<td rowspan="1" colspan="1">41.2</td>
<td rowspan="1" colspan="1">10.9</td>
<td rowspan="1" colspan="1">1.63</td>
<td rowspan="1" colspan="1">0.97</td>
<td rowspan="1" colspan="1">2.61</td>
<td rowspan="1" colspan="1">30.5</td>
<td rowspan="1" colspan="1">13.0</td>
<td rowspan="1" colspan="1">3.35</td>
<td rowspan="1" colspan="1">0.99</td>
<td rowspan="1" colspan="1">2.30</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Memory (GB)</td>
<td rowspan="1" colspan="1">26</td>
<td rowspan="1" colspan="1">9.1</td>
<td rowspan="1" colspan="1">2.7</td>
<td rowspan="1" colspan="1">4.8</td>
<td rowspan="1" colspan="1">6.0</td>
<td rowspan="1" colspan="1">14</td>
<td rowspan="1" colspan="1">8.3</td>
<td rowspan="1" colspan="1">3.4</td>
<td rowspan="1" colspan="1">3.7</td>
<td rowspan="1" colspan="1">7.3</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Disk (GB)</td>
<td rowspan="1" colspan="1">43.1</td>
<td rowspan="1" colspan="1">14.3</td>
<td rowspan="1" colspan="1">1.84</td>
<td rowspan="1" colspan="1">28.2</td>
<td rowspan="1" colspan="1">1.59</td>
<td rowspan="1" colspan="1">53.2</td>
<td rowspan="1" colspan="1">11.5</td>
<td rowspan="1" colspan="1">1.30</td>
<td rowspan="1" colspan="1">23.5</td>
<td rowspan="1" colspan="1">1.61</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="tblfn1">
<p>Top values in each cell are for SYN 64 data, and bottom values are for HMP. Duplication ratio is the ratio between the total aligned length to the combined length of all references aligned to. The mean and median NGA50 values are calculated on based on species sufficiently covered by all assemblers to yield an NGA50 value (i.e. 50% of the genome is covered). Species reported are those for which an NGA50 value is reported. In the HMP data, only two species were reported for all, making the mean and median NGA50 values equal. Disk and memory use are those reported by the Linux time utility, and Disk use is the total amount written to disk during the course of a run.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
</sec>
<sec>
<title>5 Discussion</title>
<p>Streaming
<italic>de novo</italic>
assembly presents an opportunity to significantly ease some of the burdens introduced by the recent deluge of second generation sequencing data. We posit the main applications of streaming assembly will be
<italic>de novo</italic>
assembly of very large individual datasets (e.g. metagenomes from highly diverse environments) and re-assembly of pangenomes derived from many samples. In both cases, very large volumes of data must be digested in order to address the relevant biological questions behind these assays. Therefore, streaming graph assembly presents an attractive alternative to data compression: instead of attempting to reduce the size of data, the aim is to keep locally only relevant information in a manner that is queryable and that allows for future re-analysis.</p>
<p>Here, we have demonstrated a mechanism for performing streaming graph assembly and described some of its characteristics. First, we showed that assembly can be achieved without ever storing raw reads locally. By assembling the graph, an intermediate by-product of many assemblers, we show this technique is generally applicable. By refining the graph and showing better assembly contiguity than competing resource efficient tools on metagenome assembly, we showed this method can also be applied in the setting when sensitive recovery of rare sequences is crucial.</p>
<p>In future work, we aim to expand the capabilities of Faucet in a number of ways. Multi-threaded processing will reduce run times and make the tool more applicable to large data volumes. We believe further refinements of cleaning and contig generation can be achieved by adopting a statistical approach to making assembly decisions. In addition, beyond graph cleaning, we aim to apply Faucet’s data structures to path generation, as done with paired end reads in (
<xref rid="btx471-B14" ref-type="bibr">Nihalani and Aluru, 2016</xref>
;
<xref rid="btx471-B20" ref-type="bibr">Prjibelski
<italic>et al.</italic>
, 2014</xref>
;
<xref rid="btx471-B23" ref-type="bibr">Shi
<italic>et al.</italic>
, 2017</xref>
). Both have the potential to greatly improve contiguity and accuracy.</p>
<p>Beyond this, this work raises several remaining challenges pertaining to what one may expect of streaming assembly. For instance, it is immediately appealing to ask if streaming assembly can be achieved with a just a single pass on the reads, and if so, what inherent limitations exist. In
<xref rid="btx471-B25" ref-type="bibr">Song
<italic>et al.</italic>
(2014</xref>
), a simple solution is proposed wherein the first 1 M reads are processed to provide a succinct summary for the rest, but such an approach is more suited to high coverage or low entropy data, and thus unlikely to perform well on diverse metagenomes or when rare events are of particular interest. Another issue raised by the performance comparison herein is that of capturing the added value that iterative (multi-k value) graph generation provides. We have given a partial solution by capturing subsets of junction pairs within each read, and between mates of paired-end reads. Although it is possible to iteratively refine the graph with more passes on the reads, each time for the collection of k-mers at different lengths, this becomes unwieldy with large data volumes. Identifying the contexts for which such information would be useful in the graph and indexing the reads to allow for querying of such contexts may provide more efficient means of extracting such information.</p>
</sec>
<sec>
<title>Funding</title>
<p>This work was supported in part by the Israel Science Foundation as part of the ISF-NSFC joint program to R.S. R.S. was supported in part by the Raymond and Beverley Chair in Bioinformatics at Tel Aviv University. E.H. was supported in part by the USA–Israel Binational Science Foundation [Grant 2012304] and E.H. and R.R. were supported in part by the Israel Science Foundation [Grant 1425/13]. R.R. was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University, an IBM PhD fellowship, and by the Center for Absorption in Science, the Israel Ministry of Immigrant Absorption. E.H. is a Faculty Fellow of the Edmond J. Safra Center for Bioinformatics at Tel Aviv University. G.G. was supported by the MISTI MIT-Israel program at MIT and Tel Aviv University.</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="sup1">
<label>Supplementary Appendix</label>
<media xlink:href="faucet_appendix_btx471.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="btx471-B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Bankevich</surname>
<given-names>A.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing</article-title>
.
<source>J. Comput. Biol</source>
.,
<volume>19</volume>
,
<fpage>455</fpage>
<lpage>477</lpage>
.
<pub-id pub-id-type="pmid">22506599</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Bloom</surname>
<given-names>B.H.</given-names>
</name>
</person-group>
(
<year>1970</year>
)
<article-title>Space/time trade-offs in hash coding with allowable errors</article-title>
.
<source>Commun. ACM</source>
,
<volume>13</volume>
,
<fpage>422</fpage>
<lpage>426</lpage>
.</mixed-citation>
</ref>
<ref id="btx471-B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
,
<name name-style="western">
<surname>Rizk</surname>
<given-names>G.</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Space-efficient and exact de Bruijn graph representation based on a Bloom filter</article-title>
.
<source>Algorithms Bioinformatics</source>
,
<volume>8</volume>
,
<fpage>236</fpage>
<lpage>248</lpage>
.</mixed-citation>
</ref>
<ref id="btx471-B4">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
).
<chapter-title>On the representation of de bruijn graphs</chapter-title>
In:
<source>Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
, vol.
<volume>8394</volume>
<publisher-name>LNBI</publisher-name>
, pp
<fpage>35</fpage>
<lpage>55</lpage>
.</mixed-citation>
</ref>
<ref id="btx471-B5">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2016</year>
)
<article-title>Compacting de Bruijn graphs from sequencing data quickly and in low memory</article-title>
.
<source>Bioinformatics</source>
,
<volume>32</volume>
,
<fpage>i201</fpage>
<lpage>i208</lpage>
.
<pub-id pub-id-type="pmid">27307618</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B6">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>El-Metwally</surname>
<given-names>S.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2016</year>
)
<article-title>LightAssembler: Fast and memory-efficient assembly algorithm for high-throughput sequencing reads</article-title>
.
<source>Bioinformatics</source>
,
<volume>32</volume>
,
<fpage>3215</fpage>
<lpage>3223</lpage>
.
<pub-id pub-id-type="pmid">27412092</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B7">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Gurevich</surname>
<given-names>A.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2013</year>
)
<article-title>QUAST: quality assessment tool for genome assemblies</article-title>
.
<source>Bioinformatics</source>
,
<volume>29</volume>
,
<fpage>1072</fpage>
<lpage>1075</lpage>
.
<pub-id pub-id-type="pmid">23422339</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Iqbal</surname>
<given-names>Z.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>De novo assembly and genotyping of variants using colored de Bruijn graphs</article-title>
.
<source>Nat. Genet</source>
.,
<volume>44</volume>
,
<fpage>226</fpage>
<lpage>232</lpage>
.
<pub-id pub-id-type="pmid">22231483</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Li</surname>
<given-names>D.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
)
<article-title>MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph</article-title>
.
<source>Bioinformatics</source>
,
<volume>31</volume>
,
<fpage>1674</fpage>
<lpage>1676</lpage>
.</mixed-citation>
</ref>
<ref id="btx471-B10">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Medvedev</surname>
<given-names>P.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2007</year>
).
<chapter-title>Computability of models for sequence assembly</chapter-title>
In:
<source>Algorithms in Bioinformatics</source>
.
<publisher-name>Springer, Berlin Heidelberg</publisher-name>
, pp.
<fpage>289</fpage>
<lpage>301</lpage>
.</mixed-citation>
</ref>
<ref id="btx471-B11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Melsted</surname>
<given-names>P.</given-names>
</name>
,
<name name-style="western">
<surname>Halldorsson</surname>
<given-names>B.V.</given-names>
</name>
</person-group>
(
<year>2014</year>
)
<article-title>KmerStream: streaming algorithms for k-mer abundance estimation</article-title>
.
<source>Bioinformatics</source>
,
<volume>30</volume>
,
<fpage>3541</fpage>
<lpage>3547</lpage>
.
<pub-id pub-id-type="pmid">25355787</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B12">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Minkin</surname>
<given-names>I.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2017</year>
)
<article-title>TwoPaCo: An efficient algorithm to build the compacted de Bruijn graph from many complete genomes</article-title>
.
<source>Bioinformatics</source>
,
<volume>33</volume>
,
<fpage>4024</fpage>
<lpage>4032</lpage>
.
<pub-id pub-id-type="pmid">27659452</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Mohamadi</surname>
<given-names>H.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2017</year>
)
<article-title>ntCard: a streaming algorithm for cardinality estimation in genomics data</article-title>
.
<source>Bioinformatics</source>
,
<volume>33</volume>
,
<fpage>1324</fpage>
<lpage>1330</lpage>
.
<pub-id pub-id-type="pmid">28453674</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B14">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Nihalani</surname>
<given-names>R.</given-names>
</name>
,
<name name-style="western">
<surname>Aluru</surname>
<given-names>S.</given-names>
</name>
</person-group>
(
<year>2016</year>
). Effective Utilization of Paired Reads to Improve Length and Accuracy of Contigs in Genome Assembly. In:
<italic>Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics</italic>
, Seattle, WA, USA — October 02 - 05, 2016, pp. 355–363.</mixed-citation>
</ref>
<ref id="btx471-B15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Novak</surname>
<given-names>A.M.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2017</year>
)
<article-title>Genome graphs</article-title>
.
<source>bioRxiv</source>
. doi: 10.1101/101378.</mixed-citation>
</ref>
<ref id="btx471-B16">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Nurk</surname>
<given-names>S.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2017</year>
)
<article-title>metaSPAdes: a new versatile de novo metagenomics assembler</article-title>
.
<source>Genome Res.</source>
,
<volume>27</volume>
,
<fpage>824</fpage>
<lpage>834</lpage>
.
<pub-id pub-id-type="pmid">28298430</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B17">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Pell</surname>
<given-names>J.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>Scaling metagenome sequence assembly with probabilistic de Bruijn graphs</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
,
<volume>109</volume>
,
<fpage>13272</fpage>
<lpage>13277</lpage>
.
<pub-id pub-id-type="pmid">22847406</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B18">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Pertea</surname>
<given-names>M.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
)
<article-title>StringTie enables improved reconstruction of a transcriptome from RNA-seq reads</article-title>
.
<source>Nat. Biotechnol</source>
.,
<volume>33</volume>
,
<fpage>290</fpage>
<lpage>295</lpage>
.
<pub-id pub-id-type="pmid">25690850</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B19">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Pevzner</surname>
<given-names>P.A.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2001</year>
)
<article-title>An Eulerian path approach to DNA fragment assembly</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
,
<volume>98</volume>
,
<fpage>9748</fpage>
<lpage>9753</lpage>
.
<pub-id pub-id-type="pmid">11504945</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B20">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Prjibelski</surname>
<given-names>A.D.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
)
<article-title>ExSPAnder: a universal repeat resolver for DNA fragment assembly</article-title>
.
<source>Bioinformatics</source>
,
<volume>30</volume>
,</mixed-citation>
</ref>
<ref id="btx471-B21">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Roberts</surname>
<given-names>A.</given-names>
</name>
,
<name name-style="western">
<surname>Pachter</surname>
<given-names>L.</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Streaming fragment assignment for real-time analysis of sequencing experiments</article-title>
.
<source>Nat. Methods</source>
,
<volume>10</volume>
,
<fpage>71</fpage>
<lpage>73</lpage>
.
<pub-id pub-id-type="pmid">23160280</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B22">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Rozov</surname>
<given-names>R.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2017</year>
)
<article-title>Recycler: an algorithm for detecting plasmids from
<italic>de novo</italic>
assembly graphs</article-title>
.
<source>Bioinformatics</source>
,
<volume>33</volume>
,
<fpage>475</fpage>
<lpage>482</lpage>
.
<pub-id pub-id-type="pmid">28003256</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B23">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Shi</surname>
<given-names>W.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2017</year>
)
<article-title>The combination of direct and paired link graphs can boost repetitive genome assembly</article-title>
.
<source>Nucleic Acids Res</source>
.,
<volume>45</volume>
,
<fpage>e43</fpage>
.
<pub-id pub-id-type="pmid">27924003</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B24">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Simpson</surname>
<given-names>J.T.</given-names>
</name>
,
<name name-style="western">
<surname>Durbin</surname>
<given-names>R.</given-names>
</name>
</person-group>
(
<year>2010</year>
)
<article-title>Efficient construction of an assembly string graph using the FM-index</article-title>
.
<source>Bioinformatics</source>
,
<volume>26</volume>
,</mixed-citation>
</ref>
<ref id="btx471-B25">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Song</surname>
<given-names>L.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
)
<article-title>Lighter: fast and memory-efficient sequencing error correction without counting</article-title>
.
<source>Genome Biol</source>
.,
<volume>15</volume>
,
<fpage>509.</fpage>
<pub-id pub-id-type="pmid">25398208</pub-id>
</mixed-citation>
</ref>
<ref id="btx471-B26">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ye</surname>
<given-names>C.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>Exploiting sparseness in de novo genome assembly</article-title>
.
<source>BMC Bioinformatics</source>
,
<volume>13(Suppl. 6)</volume>
,
<fpage>S1.</fpage>
</mixed-citation>
</ref>
<ref id="btx471-B27">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Zhang</surname>
<given-names>Q.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2014</year>
)
<article-title>These are not the K-mers you are looking for: Efficient online K-mer counting using a probabilistic data structure</article-title>
.
<source>PLoS One</source>
,
<volume>9</volume>
,
<fpage>e101271.</fpage>
<pub-id pub-id-type="pmid">25062443</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B20 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B20 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5870852
   |texte=   Faucet: streaming de novo assembly graph construction
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:29036597" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021