Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 0002450 ( Pmc/Corpus ); précédent : 0002449; suivant : 0002451 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage</title>
<author>
<name sortKey="Holley, Guillaume" sort="Holley, Guillaume" uniqKey="Holley G" first="Guillaume" last="Holley">Guillaume Holley</name>
<affiliation>
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wittler, Roland" sort="Wittler, Roland" uniqKey="Wittler R" first="Roland" last="Wittler">Roland Wittler</name>
<affiliation>
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stoye, Jens" sort="Stoye, Jens" uniqKey="Stoye J" first="Jens" last="Stoye">Jens Stoye</name>
<affiliation>
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27087830</idno>
<idno type="pmc">4832552</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4832552</idno>
<idno type="RBID">PMC:4832552</idno>
<idno type="doi">10.1186/s13015-016-0066-8</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000245</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000245</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage</title>
<author>
<name sortKey="Holley, Guillaume" sort="Holley, Guillaume" uniqKey="Holley G" first="Guillaume" last="Holley">Guillaume Holley</name>
<affiliation>
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wittler, Roland" sort="Wittler, Roland" uniqKey="Wittler R" first="Roland" last="Wittler">Roland Wittler</name>
<affiliation>
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stoye, Jens" sort="Stoye, Jens" uniqKey="Stoye J" first="Jens" last="Stoye">Jens Stoye</name>
<affiliation>
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Algorithms for Molecular Biology : AMB</title>
<idno type="eISSN">1748-7188</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored
<italic>k</italic>
-mers, strings of length
<italic>k</italic>
, and stores them in vertices.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored
<italic>k</italic>
-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying
<italic>k</italic>
-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores
<italic>k</italic>
-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure.</p>
</sec>
<sec>
<title>Availability</title>
<p>
<ext-link ext-link-type="uri" xlink:href="https://www.github.com/GuillaumeHolley/BloomFilterTrie">https://www.github.com/GuillaumeHolley/BloomFilterTrie</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z Iqbal</name>
</author>
<author>
<name sortKey="Caccamo, M" uniqKey="Caccamo M">M Caccamo</name>
</author>
<author>
<name sortKey="Turner, I" uniqKey="Turner I">I Turner</name>
</author>
<author>
<name sortKey="Flicek, P" uniqKey="Flicek P">P Flicek</name>
</author>
<author>
<name sortKey="Mcvean, G" uniqKey="Mcvean G">G McVean</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cox, Aj" uniqKey="Cox A">AJ Cox</name>
</author>
<author>
<name sortKey="Bauer, Mj" uniqKey="Bauer M">MJ Bauer</name>
</author>
<author>
<name sortKey="Jakobi, T" uniqKey="Jakobi T">T Jakobi</name>
</author>
<author>
<name sortKey="Rosone, G" uniqKey="Rosone G">G Rosone</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, Bh" uniqKey="Bloom B">BH Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fredking, E" uniqKey="Fredking E">E Fredking</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heinz, S" uniqKey="Heinz S">S Heinz</name>
</author>
<author>
<name sortKey="Zobel, J" uniqKey="Zobel J">J Zobel</name>
</author>
<author>
<name sortKey="Williams, He" uniqKey="Williams H">HE Williams</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
<author>
<name sortKey="Banks, E" uniqKey="Banks E">E Banks</name>
</author>
<author>
<name sortKey="Poplin, R" uniqKey="Poplin R">R Poplin</name>
</author>
<author>
<name sortKey="Garimella, Kv" uniqKey="Garimella K">KV Garimella</name>
</author>
<author>
<name sortKey="Maguire, Jr" uniqKey="Maguire J">JR Maguire</name>
</author>
<author>
<name sortKey="Hartl, C" uniqKey="Hartl C">C Hartl</name>
</author>
<author>
<name sortKey="Philippakis, Aa" uniqKey="Philippakis A">AA Philippakis</name>
</author>
<author>
<name sortKey="Del Angel, G" uniqKey="Del Angel G">G del Angel</name>
</author>
<author>
<name sortKey="Rivas, Ma" uniqKey="Rivas M">MA Rivas</name>
</author>
<author>
<name sortKey="Hanna, M" uniqKey="Hanna M">M Hanna</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ernst, C" uniqKey="Ernst C">C Ernst</name>
</author>
<author>
<name sortKey="Rahmann, S" uniqKey="Rahmann S">S Rahmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nguyen, N" uniqKey="Nguyen N">N Nguyen</name>
</author>
<author>
<name sortKey="Hickey, G" uniqKey="Hickey G">G Hickey</name>
</author>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Raney, B" uniqKey="Raney B">B Raney</name>
</author>
<author>
<name sortKey="Earl, D" uniqKey="Earl D">D Earl</name>
</author>
<author>
<name sortKey="Armstrong, J" uniqKey="Armstrong J">J Armstrong</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
<author>
<name sortKey="Paten, B" uniqKey="Paten B">B Paten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Paten, B" uniqKey="Paten B">B Paten</name>
</author>
<author>
<name sortKey="Diekhans, M" uniqKey="Diekhans M">M Diekhans</name>
</author>
<author>
<name sortKey="Earl, D" uniqKey="Earl D">D Earl</name>
</author>
<author>
<name sortKey="John, Js" uniqKey="John J">JS John</name>
</author>
<author>
<name sortKey="Ma, J" uniqKey="Ma J">J Ma</name>
</author>
<author>
<name sortKey="Suh, B" uniqKey="Suh B">B Suh</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, L" uniqKey="Huang L">L Huang</name>
</author>
<author>
<name sortKey="Popic, V" uniqKey="Popic V">V Popic</name>
</author>
<author>
<name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wandelt, S" uniqKey="Wandelt S">S Wandelt</name>
</author>
<author>
<name sortKey="Starlinger, J" uniqKey="Starlinger J">J Starlinger</name>
</author>
<author>
<name sortKey="Bux, M" uniqKey="Bux M">M Bux</name>
</author>
<author>
<name sortKey="Leser, U" uniqKey="Leser U">U Leser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wandelt, S" uniqKey="Wandelt S">S Wandelt</name>
</author>
<author>
<name sortKey="Leser, U" uniqKey="Leser U">U Leser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcus, S" uniqKey="Marcus S">S Marcus</name>
</author>
<author>
<name sortKey="Lee, H" uniqKey="Lee H">H Lee</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Baier, U" uniqKey="Baier U">U Baier</name>
</author>
<author>
<name sortKey="Beller, T" uniqKey="Beller T">T Beller</name>
</author>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author>
<name sortKey="Hintze, A" uniqKey="Hintze A">A Hintze</name>
</author>
<author>
<name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author>
<name sortKey="Howe, A" uniqKey="Howe A">A Howe</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author>
<name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Algorithms Mol Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">Algorithms Mol Biol</journal-id>
<journal-title-group>
<journal-title>Algorithms for Molecular Biology : AMB</journal-title>
</journal-title-group>
<issn pub-type="epub">1748-7188</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27087830</article-id>
<article-id pub-id-type="pmc">4832552</article-id>
<article-id pub-id-type="publisher-id">66</article-id>
<article-id pub-id-type="doi">10.1186/s13015-016-0066-8</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Holley</surname>
<given-names>Guillaume</given-names>
</name>
<address>
<email>gholley@cebitec.uni-bielefeld.de</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wittler</surname>
<given-names>Roland</given-names>
</name>
<address>
<email>roland.wittler@uni-bielefeld.de</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Stoye</surname>
<given-names>Jens</given-names>
</name>
<address>
<email>jens.stoye@uni-bielefeld.de</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<aff id="Aff1">
<label></label>
Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</aff>
<aff id="Aff2">
<label></label>
Center for Biotechnology, Bielefeld University, Bielefeld, Germany</aff>
<aff id="Aff3">
<label></label>
International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>14</day>
<month>4</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>14</day>
<month>4</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection">
<year>2016</year>
</pub-date>
<volume>11</volume>
<elocation-id>3</elocation-id>
<history>
<date date-type="received">
<day>8</day>
<month>12</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>3</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>© Holley et al. 2016</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored
<italic>k</italic>
-mers, strings of length
<italic>k</italic>
, and stores them in vertices.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored
<italic>k</italic>
-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying
<italic>k</italic>
-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores
<italic>k</italic>
-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure.</p>
</sec>
<sec>
<title>Availability</title>
<p>
<ext-link ext-link-type="uri" xlink:href="https://www.github.com/GuillaumeHolley/BloomFilterTrie">https://www.github.com/GuillaumeHolley/BloomFilterTrie</ext-link>
.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Pan-genome</kwd>
<kwd>Similar genomes</kwd>
<kwd>Population genomics</kwd>
<kwd>Colored de bruijn graph</kwd>
<kwd>Bloom filter</kwd>
<kwd>Compression</kwd>
<kwd>Trie</kwd>
<kwd>Index</kwd>
<kwd>Succinct data structure</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution>International DFG Research Training Group GRK 1906/1</institution>
</funding-source>
<award-id>ID0EUJAC90</award-id>
<principal-award-recipient>
<name>
<surname>Holley</surname>
<given-names>Guillaume</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2016</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>A
<italic>string</italic>
<italic>x</italic>
is a sequence of characters drawn from a finite, non-empty set, called the
<italic>alphabet</italic>
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {A}$$\end{document}</tex-math>
<mml:math id="M2">
<mml:mi mathvariant="script">A</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Its
<italic>length</italic>
is denoted by |
<italic>x</italic>
|. The character at position 
<italic>i</italic>
is denoted by
<italic>x</italic>
[
<italic>i</italic>
], the substring starting at position 
<italic>i</italic>
and ending at position 
<italic>j</italic>
by
<italic>x</italic>
[
<italic>i</italic>
..
<italic>j</italic>
]. Strings are concatenated by juxtaposition. If
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x = ps$$\end{document}</tex-math>
<mml:math id="M4">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
for (potentially empty) strings
<italic>p</italic>
and
<italic>s</italic>
, then
<italic>p</italic>
is a
<italic>prefix</italic>
and
<italic>s</italic>
is a
<italic>suffix</italic>
of
<italic>x</italic>
.</p>
<p>A
<italic>genome</italic>
is the collection of all inheritable material of a cell. Ideally it can be represented as a single string over the DNA alphabet
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {A} = \{{a},{c},{g},{t}\}$$\end{document}</tex-math>
<mml:math id="M6">
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
(or as a few strings in case of species with multiple chromosomes). In practice, however, genomes in databases are often less perfect, either left unchanged in form of the raw data as produced by sequencing machines (millions of short sequences called
<italic>reads</italic>
), or after some incomplete assembly procedure in form of contiguous chromosome regions (hundreds of
<italic>contigs</italic>
of various lengths). We are interested in the problem of storing and compressing a set of multiple highly similar genomes, e.g. the pan-genome of a bacterial species, comprising hundreds, or even thousands of strains that share large sequence parts, but differ by individual mutations from one another. An abstract structure that has been proposed for this task is the
<italic>colored de Bruijn graph</italic>
(C-DBG) [
<xref ref-type="bibr" rid="CR1">1</xref>
]. It is a directed graph
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$G = (V_G,E_G)$$\end{document}</tex-math>
<mml:math id="M8">
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>G</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>G</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
in which each vertex
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$v \in V_G$$\end{document}</tex-math>
<mml:math id="M10">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>G</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
represents a
<italic>k</italic>
-mer, a string of length 
<italic>k</italic>
over
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {A}$$\end{document}</tex-math>
<mml:math id="M12">
<mml:mi mathvariant="script">A</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
, associated with a set of colors representing the genomes in which the
<italic>k</italic>
-mer occurs. A directed edge
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$e \in E_G$$\end{document}</tex-math>
<mml:math id="M14">
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>G</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
from vertex
<italic>v</italic>
to vertex
<inline-formula id="IEq8">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$v'$$\end{document}</tex-math>
<mml:math id="M16">
<mml:msup>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
, respectively from
<italic>k</italic>
-mer
<italic>x</italic>
to
<italic>k</italic>
-mer
<inline-formula id="IEq9">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x'$$\end{document}</tex-math>
<mml:math id="M18">
<mml:msup>
<mml:mi>x</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
, exists if
<inline-formula id="IEq10">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x[2..k] = x'[1..k-1]$$\end{document}</tex-math>
<mml:math id="M20">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mi>x</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Each
<italic>k</italic>
-mer
<italic>x</italic>
has
<inline-formula id="IEq11">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|\mathcal {A}|$$\end{document}</tex-math>
<mml:math id="M22">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq11.gif"></inline-graphic>
</alternatives>
</inline-formula>
 possible successors
<italic>x</italic>
[2..
<italic>k</italic>
]
<italic>c</italic>
and
<inline-formula id="IEq12">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|\mathcal {A}|$$\end{document}</tex-math>
<mml:math id="M24">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq12.gif"></inline-graphic>
</alternatives>
</inline-formula>
 possible predecessors
<inline-formula id="IEq13">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c x[1..k-1]$$\end{document}</tex-math>
<mml:math id="M26">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq13.gif"></inline-graphic>
</alternatives>
</inline-formula>
with
<inline-formula id="IEq14">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c\in \mathcal {A}$$\end{document}</tex-math>
<mml:math id="M28">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq14.gif"></inline-graphic>
</alternatives>
</inline-formula>
. An implementation of such a graph does not have to store edges since they are implicitly given by vertices overlapping on
<inline-formula id="IEq15">
<alternatives>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k-1$$\end{document}</tex-math>
<mml:math id="M30">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq15.gif"></inline-graphic>
</alternatives>
</inline-formula>
characters.</p>
<p>In this paper, we propose a new data structure for indexing and compressing a pan-genome as a C-DBG, the Bloom Filter Trie (BFT). It is alignment-free, reference-free and incremental, i.e., it does not need to be entirely rebuilt when a new genome is inserted. BFTs provide insertion and look-up operations for strings of fixed length associated with an annotation. This paper is an extended version of the preliminary work presented in [
<xref ref-type="bibr" rid="CR2">2</xref>
].</p>
<p>In the next section, existing data structures and software for pan-genome representation are reviewed. The BFT and the operations it supports are then described, followed by the traversal method of a C-DBG stored as a BFT. Finally, experimental results showing the performance of the data structure are provided.</p>
</sec>
<sec id="Sec2">
<title>Existing approaches</title>
<p>The BFT, as well as existing tools for pan-genome storage, uses a variety of basic data structures reviewed in the following. Existing tools for pan-genome storage will then be discussed.</p>
<sec id="Sec3">
<title>Data structures</title>
<p>One common way to index and compress a set of strings is the
<italic>Burrows-Wheeler Transform</italic>
(BWT) [
<xref ref-type="bibr" rid="CR3">3</xref>
] that rearranges the input data to enable better compression by aggregating characters with similar context. For multiple sets of strings, a disk-based approach [
<xref ref-type="bibr" rid="CR4">4</xref>
] or different terminator characters must be used. The
<italic>FM-Index</italic>
 [
<xref ref-type="bibr" rid="CR5">5</xref>
] allows to count and locate the occurrences of a substring in the BWT.</p>
<p>Introduced by Bloom [
<xref ref-type="bibr" rid="CR6">6</xref>
], a
<italic>Bloom filter</italic>
(BF) records the presence of elements in a set. Based on the hash table principle, look-up and insertion times are constant. The BF is composed of a bit array
<italic>B</italic>
[1..
<italic>m</italic>
], initialized with 0s, in which the presence of
<italic>n</italic>
elements is recorded. A set of
<italic>f</italic>
independent hash functions
<inline-formula id="IEq16">
<alternatives>
<tex-math id="M31">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_1, ..., h_f$$\end{document}</tex-math>
<mml:math id="M32">
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq16.gif"></inline-graphic>
</alternatives>
</inline-formula>
is used such that each hash function maps an element to an integer from one to
<italic>m</italic>
. Inserting an element
<italic>e</italic>
into
<italic>B</italic>
and testing for its presence are then
<disp-formula id="Equ1">
<alternatives>
<tex-math id="M33">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \textsf {Insert}(e,B): B[h_i(e)] \leftarrow 1 \quad \text { for all } i = 1,...,f \end{aligned}$$\end{document}</tex-math>
<mml:math id="M34" display="block">
<mml:mrow>
<mml:mtable columnspacing="0.5ex">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi mathvariant="sans-serif">Insert</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:mi>B</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo stretchy="false"></mml:mo>
<mml:mn>1</mml:mn>
<mml:mspace width="1em"></mml:mspace>
<mml:mspace width="0.333333em"></mml:mspace>
<mml:mtext>for all</mml:mtext>
<mml:mspace width="0.333333em"></mml:mspace>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2016_66_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
and
<disp-formula id="Equ2">
<alternatives>
<tex-math id="M35">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \textsf {MayContain}(e,B) : \bigwedge \limits _{i=1}^{f}B[h_i(e)], \end{aligned}$$\end{document}</tex-math>
<mml:math id="M36" display="block">
<mml:mrow>
<mml:mtable columnspacing="0.5ex">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi mathvariant="sans-serif">MayContain</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>f</mml:mi>
</mml:munderover>
<mml:mi>B</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2016_66_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
respectively, where
<inline-formula id="IEq17">
<alternatives>
<tex-math id="M37">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\bigwedge$$\end{document}</tex-math>
<mml:math id="M38">
<mml:mo></mml:mo>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq17.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the logical conjunction operator. The BF does not generate false negatives but may generate false positives, as
<inline-formula id="IEq18">
<alternatives>
<tex-math id="M39">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {MayContain}$$\end{document}</tex-math>
<mml:math id="M40">
<mml:mi mathvariant="sans-serif">MayContain</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq18.gif"></inline-graphic>
</alternatives>
</inline-formula>
can report the presence of elements which are not present but a result of independent insertions.</p>
<p>The
<italic>Sequence Bloom Tree</italic>
(SBT) [
<xref ref-type="bibr" rid="CR7">7</xref>
] is a binary tree with BFs as vertices. An internal vertex is the union of its two children BFs, i.e., a BF where a slot is set to 1 if the slot at the same position in at least one of the two children is 1.</p>
<p>A
<italic>trie</italic>
 [
<xref ref-type="bibr" rid="CR8">8</xref>
] is a rooted edge-labeled tree
<inline-formula id="IEq19">
<alternatives>
<tex-math id="M41">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T = (V_T,E_T)$$\end{document}</tex-math>
<mml:math id="M42">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq19.gif"></inline-graphic>
</alternatives>
</inline-formula>
storing a set of strings. Each edge
<inline-formula id="IEq20">
<alternatives>
<tex-math id="M43">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$e \in E_T$$\end{document}</tex-math>
<mml:math id="M44">
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq20.gif"></inline-graphic>
</alternatives>
</inline-formula>
is labeled with a character and no two edges starting at the same vertex can have the same character. A path from the root to a leaf represents the string obtained by concatenating all the characters on this path. The depth of a vertex
<italic>v</italic>
in
<italic>T</italic>
is denoted by
<italic>depth</italic>
(
<italic>v</italic>
<italic>T</italic>
) and is the number of edges between the root of
<italic>T</italic>
and
<italic>v</italic>
. The height of
<italic>T</italic>
, denoted by
<italic>height</italic>
(
<italic>T</italic>
), is the number of edges on the longest path from the root of
<italic>T</italic>
to a leaf. The
<italic>burst trie</italic>
 [
<xref ref-type="bibr" rid="CR9">9</xref>
] is an efficient implementation of a trie which reduces its number of branches by compressing sub-tries into leaves. Its internal vertices are labeled with multiple prefixes of length 1, linked to children. The leaves are labeled with multiple suffixes of arbitrary length. A leaf has a limited capacity of suffixes and is
<italic>burst</italic>
when this capacity is exceeded. A burst splits suffixes of a leaf into prefixes of length 1, linked to new leaves representing the remaining suffixes.</p>
</sec>
<sec id="Sec4">
<title>Software for pan-genome storage</title>
<p>Existing tools for pan-genome storage are mostly alignment-based or reference-based and take a set of assembled genomes as input. Alignments naturally exhibit shared and unique regions of the pan-genome but are computationally expensive to obtain. In addition, misalignments can lead to an inaccurate estimation of the pan-genome regions [
<xref ref-type="bibr" rid="CR10">10</xref>
]. PanCake [
<xref ref-type="bibr" rid="CR11">11</xref>
] is an extension of string graphs, known from genome assembly [
<xref ref-type="bibr" rid="CR12">12</xref>
], which achieves compression based on pairwise alignments. Experiments showed compression ratios of 3:1 to 5:1. Nguyen et al. [
<xref ref-type="bibr" rid="CR13">13</xref>
] formulated the pan-genome construction problem as an optimization problem of arranging alignment blocks for a set of genomes partitioned by homology. The complexity of the problem has been shown to be NP-hard, and a heuristic using Cactus graphs [
<xref ref-type="bibr" rid="CR14">14</xref>
] was provided. However, a multiple sequence alignment is required for creating the blocks, another NP-hard problem.</p>
<p>Among the reference-based tools, Huang et al. [
<xref ref-type="bibr" rid="CR15">15</xref>
] proposed to build a pan-genome by annotating a reference genome with all the variants detected between a set of genomes and the reference. The BWT of the augmented reference is then computed and can be used by an aligner based on the FM-Index. While being more accurate with the augmented reference genome than BWA [
<xref ref-type="bibr" rid="CR16">16</xref>
] with the reference alone, the aligner is between 10 to 100 times slower, uses significantly more memory and can introduce false positive alignments. RCSI [
<xref ref-type="bibr" rid="CR17">17</xref>
] (Referentially Compressed Search Index) uses referential compression with a compressed suffix tree to store a pan-genome and to search for exact or inexact matches. The inexact matching allows a limited number of edit distance operations. 1092 human genomes totaling 3.09 TB of data were compressed into an index of 115 GB, offering a compression ratio of about 28:1. Yet, the index is built for a maximum length query and a maximum number of edit operations. MRCSI [
<xref ref-type="bibr" rid="CR18">18</xref>
] improves on RCSI by proposing a compressed search index based on multiple references.</p>
<p>Closer to our approach is SplitMEM [
<xref ref-type="bibr" rid="CR19">19</xref>
], which uses a C-DBG to build a pan-genome from assembled genomes and extract the shared regions. The C-DBG is directly constructed in a compressed way, where a non-branching path is stored in a single vertex, using an augmented suffix tree. Baier et al. [
<xref ref-type="bibr" rid="CR20">20</xref>
] improved SplitMEM in theory and practice with two algorithms that use the BWT and a compressed suffix tree. Unfortunately, both tools use more memory than the original size of the input sequences.</p>
<p>The tool Khmer [
<xref ref-type="bibr" rid="CR21">21</xref>
] provides a lightweight representation of de Bruijn graphs [
<xref ref-type="bibr" rid="CR22">22</xref>
] based on Bloom filters and a graph labeling method based on graph partitioning. Unfortunately, the graph labeling method does not offer yet enough flexibility to reproduce the experiments presented in this paper.</p>
<p>The SBT data structure has been implemented in an alignment-free, reference-free and incremental software [
<xref ref-type="bibr" rid="CR7">7</xref>
] to label raw sequences with their colors. The proposed tool is designed to index and compress data from sequencing experiments for effective query of full-length genes or transcripts by separation into
<italic>k</italic>
-mers. A leaf of an SBT is used to represent a sequencing experiment by extracting all its
<italic>k</italic>
-mers and storing them in the BF of the leaf. The SBT software does not represent exactly the set of
<italic>k</italic>
-mers of the sequencing experiments they contain, though, due to the inexact nature of BFs.</p>
</sec>
</sec>
<sec id="Sec5">
<title>The Bloom Filter Trie</title>
<p>The Bloom Filter Trie (BFT) that we propose in this paper is an implementation of a C-DBG. It is based on a burst trie and is used to store
<italic>k</italic>
-mers associated with a set of colors. For the moment we may assume that colors are represented by a bit array
<italic>color</italic>
initialized with 0s. Each color has an index
<inline-formula id="IEq21">
<alternatives>
<tex-math id="M45">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i_{color}$$\end{document}</tex-math>
<mml:math id="M46">
<mml:msub>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq21.gif"></inline-graphic>
</alternatives>
</inline-formula>
such that
<inline-formula id="IEq22">
<alternatives>
<tex-math id="M47">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${color}_x[i_{color}] = 1$$\end{document}</tex-math>
<mml:math id="M48">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mi>x</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq22.gif"></inline-graphic>
</alternatives>
</inline-formula>
records that
<italic>k</italic>
-mer
<italic>x</italic>
has color
<inline-formula id="IEq23">
<alternatives>
<tex-math id="M49">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i_{color}$$\end{document}</tex-math>
<mml:math id="M50">
<mml:msub>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq23.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Sets of colors will later be compressed. All arrays in a BFT are dynamic: An insertion at position
<italic>pos</italic>
in an array
<italic>A</italic>
reallocates it and shifts every slot having an index
<inline-formula id="IEq24">
<alternatives>
<tex-math id="M51">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge {pos}$$\end{document}</tex-math>
<mml:math id="M52">
<mml:mrow>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq24.gif"></inline-graphic>
</alternatives>
</inline-formula>
by one position in
<inline-formula id="IEq25">
<alternatives>
<tex-math id="M53">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(|A|)$$\end{document}</tex-math>
<mml:math id="M54">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq25.gif"></inline-graphic>
</alternatives>
</inline-formula>
time.</p>
<p>In the following, let
<inline-formula id="IEq26">
<alternatives>
<tex-math id="M55">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t = (V_t,E_t)$$\end{document}</tex-math>
<mml:math id="M56">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq26.gif"></inline-graphic>
</alternatives>
</inline-formula>
be a BFT created for a certain value of
<italic>k</italic>
where we assume that
<italic>k</italic>
is a multiple of an integer
<italic>l</italic>
such that
<italic>k</italic>
-mers can be split into
<inline-formula id="IEq27">
<alternatives>
<tex-math id="M57">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{k}{l}$$\end{document}</tex-math>
<mml:math id="M58">
<mml:mfrac>
<mml:mi>k</mml:mi>
<mml:mi>l</mml:mi>
</mml:mfrac>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq27.gif"></inline-graphic>
</alternatives>
</inline-formula>
equal-length substrings. The maximum height of
<italic>t</italic>
is
<inline-formula id="IEq28">
<alternatives>
<tex-math id="M59">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${height}_{max}(t) = \frac{k}{l}-1$$\end{document}</tex-math>
<mml:math id="M60">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>k</mml:mi>
<mml:mi>l</mml:mi>
</mml:mfrac>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq28.gif"></inline-graphic>
</alternatives>
</inline-formula>
. The alphabet we consider is the DNA alphabet
<inline-formula id="IEq29">
<alternatives>
<tex-math id="M61">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {A} = \{\textit{a},\textit{c},\textit{g},\textit{t}\}$$\end{document}</tex-math>
<mml:math id="M62">
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq29.gif"></inline-graphic>
</alternatives>
</inline-formula>
, and because
<inline-formula id="IEq30">
<alternatives>
<tex-math id="M63">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|\mathcal {A}| = 4$$\end{document}</tex-math>
<mml:math id="M64">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq30.gif"></inline-graphic>
</alternatives>
</inline-formula>
, each character can be stored using two bits. A vertex in a BFT is a list of containers, zero or more of which are
<italic>compressed</italic>
, plus zero or one
<italic>uncompressed</italic>
container. In the following, we will explain how the containers are represented and how an uncompressed container is burst when its capacity is exceeded.</p>
<sec id="Sec6">
<title>Uncompressed container</title>
<p>An uncompressed container of a vertex
<italic>v</italic>
in a BFT is a limited capacity set of tuples
<inline-formula id="IEq31">
<alternatives>
<tex-math id="M65">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${<}s,{color}_{ps}{>}$$\end{document}</tex-math>
<mml:math id="M66">
<mml:mrow>
<mml:mo><</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>></mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq31.gif"></inline-graphic>
</alternatives>
</inline-formula>
where
<italic>s</italic>
is a suffix and
<italic>p</italic>
is the prefix represented by the path from the root to
<italic>v</italic>
such that
<inline-formula id="IEq32">
<alternatives>
<tex-math id="M67">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|p|+|s| = k$$\end{document}</tex-math>
<mml:math id="M68">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq32.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Tuples are lexicographically ordered in the set according to their suffixes. Uncompressed containers are burst when the number of suffixes stored exceeds their capacity
<inline-formula id="IEq33">
<alternatives>
<tex-math id="M69">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c > 0$$\end{document}</tex-math>
<mml:math id="M70">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>></mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq33.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Then, each suffix
<italic>s</italic>
of the uncompressed container is split into a prefix
<inline-formula id="IEq34">
<alternatives>
<tex-math id="M71">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref}$$\end{document}</tex-math>
<mml:math id="M72">
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq34.gif"></inline-graphic>
</alternatives>
</inline-formula>
of length
<italic>l</italic>
and a suffix
<inline-formula id="IEq35">
<alternatives>
<tex-math id="M73">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{suf}$$\end{document}</tex-math>
<mml:math id="M74">
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq35.gif"></inline-graphic>
</alternatives>
</inline-formula>
of length
<inline-formula id="IEq36">
<alternatives>
<tex-math id="M75">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|s|-l$$\end{document}</tex-math>
<mml:math id="M76">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>-</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq36.gif"></inline-graphic>
</alternatives>
</inline-formula>
such that
<inline-formula id="IEq37">
<alternatives>
<tex-math id="M77">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s = s_{pref}s_{suf}$$\end{document}</tex-math>
<mml:math id="M78">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq37.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Prefixes are stored in a new compressed container. Suffixes, attached with their colors, are stored in new uncompressed containers, themselves stored in the children of the compressed container. An example of a BFT and a bursting is given in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Insertion of six suffixes (that are here complete
<italic>k</italic>
-mers) with different colors (
<italic>boxes</italic>
with
<italic>diagonal lines</italic>
) into a BFT with
<inline-formula id="IEq38">
<alternatives>
<tex-math id="M79">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=12$$\end{document}</tex-math>
<mml:math id="M80">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq38.gif"></inline-graphic>
</alternatives>
</inline-formula>
,
<inline-formula id="IEq39">
<alternatives>
<tex-math id="M81">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l=4$$\end{document}</tex-math>
<mml:math id="M82">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq39.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq40">
<alternatives>
<tex-math id="M83">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c=5$$\end{document}</tex-math>
<mml:math id="M84">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq40.gif"></inline-graphic>
</alternatives>
</inline-formula>
. In
<bold>a</bold>
, the first five suffixes are inserted at the root into an uncompressed container. When a sixth suffix
<italic>gcgccaggaatc</italic>
is inserted, the uncompressed container exceeds its capacity and is burst, resulting in the BFT structure shown in 
<bold>b</bold>
</p>
</caption>
<graphic xlink:href="13015_2016_66_Fig1_HTML" id="MO3"></graphic>
</fig>
</p>
</sec>
<sec id="Sec7">
<title>Compressed container</title>
<p>A bursting replaces an uncompressed container by a compressed one, used to:
<list list-type="bullet">
<list-item>
<p>store
<inline-formula id="IEq41">
<alternatives>
<tex-math id="M85">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q \le c$$\end{document}</tex-math>
<mml:math id="M86">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo></mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq41.gif"></inline-graphic>
</alternatives>
</inline-formula>
suffix prefixes in compressed form (in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
b,
<inline-formula id="IEq42">
<alternatives>
<tex-math id="M87">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q=4$$\end{document}</tex-math>
<mml:math id="M88">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq42.gif"></inline-graphic>
</alternatives>
</inline-formula>
),</p>
</list-item>
<list-item>
<p>store links to children containing the suffixes, and</p>
</list-item>
<list-item>
<p>reconstruct suffix prefixes and find the corresponding children.</p>
</list-item>
</list>
To store a suffix prefix
<inline-formula id="IEq43">
<alternatives>
<tex-math id="M89">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref}$$\end{document}</tex-math>
<mml:math id="M90">
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq43.gif"></inline-graphic>
</alternatives>
</inline-formula>
efficiently, it is split into a prefix
<italic>a</italic>
and a suffix
<italic>b</italic>
with respective binary representations
<inline-formula id="IEq44">
<alternatives>
<tex-math id="M91">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha$$\end{document}</tex-math>
<mml:math id="M92">
<mml:mi mathvariant="italic">α</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq44.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq45">
<alternatives>
<tex-math id="M93">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta$$\end{document}</tex-math>
<mml:math id="M94">
<mml:mi mathvariant="italic">β</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq45.gif"></inline-graphic>
</alternatives>
</inline-formula>
of length
<inline-formula id="IEq46">
<alternatives>
<tex-math id="M95">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document}</tex-math>
<mml:math id="M96">
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq46.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq47">
<alternatives>
<tex-math id="M97">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document}</tex-math>
<mml:math id="M98">
<mml:mi mathvariant="italic">μ</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq47.gif"></inline-graphic>
</alternatives>
</inline-formula>
bits. A compressed container is composed of four structures
<italic>quer</italic>
,
<italic>pref</italic>
,
<italic>suf</italic>
and
<italic>clust</italic>
, where:
<list list-type="bullet">
<list-item>
<p>
<italic>quer</italic>
is a BF represented as a bit array of length
<italic>m</italic>
and
<italic>f</italic>
hash functions, used to record and filter for the presence of
<italic>q</italic>
suffix prefixes;</p>
</list-item>
<list-item>
<p>
<italic>pref</italic>
is a bit array of
<inline-formula id="IEq48">
<alternatives>
<tex-math id="M99">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2^{\lambda }$$\end{document}</tex-math>
<mml:math id="M100">
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq48.gif"></inline-graphic>
</alternatives>
</inline-formula>
bits initialized with 0s and used to record prefix presence exactly. Here the binary representation
<inline-formula id="IEq49">
<alternatives>
<tex-math id="M101">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha$$\end{document}</tex-math>
<mml:math id="M102">
<mml:mi mathvariant="italic">α</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq49.gif"></inline-graphic>
</alternatives>
</inline-formula>
of a prefix
<italic>a</italic>
is interpreted as an integer such that
<inline-formula id="IEq50">
<alternatives>
<tex-math id="M103">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pref}[\alpha ]$$\end{document}</tex-math>
<mml:math id="M104">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq50.gif"></inline-graphic>
</alternatives>
</inline-formula>
set to 1 records the presence of
<italic>a</italic>
;</p>
</list-item>
<list-item>
<p>
<italic>suf</italic>
is an array of
<italic>q</italic>
suffixes
<italic>b</italic>
sorted in ascending lexicographic order of the original suffix prefixes they belong to;</p>
</list-item>
<list-item>
<p>
<italic>clust</italic>
is an array of
<italic>q</italic>
bits, one per suffix of array
<italic>suf</italic>
, that represents cluster starting points. A cluster is a list of consecutive suffixes in array
<italic>suf</italic>
that share the same prefix. It has an index
<inline-formula id="IEq51">
<alternatives>
<tex-math id="M105">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i_{cluster}$$\end{document}</tex-math>
<mml:math id="M106">
<mml:msub>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq51.gif"></inline-graphic>
</alternatives>
</inline-formula>
with
<inline-formula id="IEq52">
<alternatives>
<tex-math id="M107">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1 \le i_{cluster} \le 2^{\lambda }$$\end{document}</tex-math>
<mml:math id="M108">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq52.gif"></inline-graphic>
</alternatives>
</inline-formula>
and a start position
<inline-formula id="IEq53">
<alternatives>
<tex-math id="M109">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pos}_{cluster}$$\end{document}</tex-math>
<mml:math id="M110">
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq53.gif"></inline-graphic>
</alternatives>
</inline-formula>
in the array
<italic>suf</italic>
with
<inline-formula id="IEq54">
<alternatives>
<tex-math id="M111">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i_{cluster} \le {pos}_{cluster} \le q$$\end{document}</tex-math>
<mml:math id="M112">
<mml:mrow>
<mml:msub>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq54.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Position
<italic>pos</italic>
in array
<italic>clust</italic>
is set to 1 to indicate that the suffix in
<italic>suf</italic>
[
<italic>pos</italic>
] starts a cluster because it is the lexicographically smallest suffix of its cluster. A cluster contains
<inline-formula id="IEq55">
<alternatives>
<tex-math id="M113">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n \ge 1$$\end{document}</tex-math>
<mml:math id="M114">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq55.gif"></inline-graphic>
</alternatives>
</inline-formula>
suffixes and, therefore, position
<italic>i</italic>
in array
<italic>clust</italic>
is set to 0 for
<inline-formula id="IEq56">
<alternatives>
<tex-math id="M115">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pos} < i < {pos}+n$$\end{document}</tex-math>
<mml:math id="M116">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo><</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo><</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq56.gif"></inline-graphic>
</alternatives>
</inline-formula>
. The end of a cluster is indicated by the beginning of the next cluster or if
<inline-formula id="IEq57">
<alternatives>
<tex-math id="M117">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pos} \ge q$$\end{document}</tex-math>
<mml:math id="M118">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq57.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
</list-item>
</list>
For example, the internal representation of the compressed container shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
b with
<inline-formula id="IEq58">
<alternatives>
<tex-math id="M119">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|a|=2$$\end{document}</tex-math>
<mml:math id="M120">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq58.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq59">
<alternatives>
<tex-math id="M121">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|b|=2$$\end{document}</tex-math>
<mml:math id="M122">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq59.gif"></inline-graphic>
</alternatives>
</inline-formula>
would be:
<graphic position="anchor" xlink:href="13015_2016_66_Figa_HTML" id="MO80"></graphic>
</p>
<p>The size of
<italic>q</italic>
substrings in a compressed container is
<inline-formula id="IEq60">
<alternatives>
<tex-math id="M123">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$m + 2^{\lambda } + q \cdot (\mu + 1)$$\end{document}</tex-math>
<mml:math id="M124">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>·</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">μ</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq60.gif"></inline-graphic>
</alternatives>
</inline-formula>
bits. A bursting minimizes this size by choosing a prefix length |
<italic>a</italic>
| and a BF size
<italic>m</italic>
such that the set of substrings stored in a compressed container does not occupy more memory than their original representation in an uncompressed container, i.e., 
<inline-formula id="IEq61">
<alternatives>
<tex-math id="M125">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$m + 2^{\lambda } \le q \cdot (\lambda - 1)$$\end{document}</tex-math>
<mml:math id="M126">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo></mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>·</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq61.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Each suffix prefix inserted after a bursting costs only
<inline-formula id="IEq62">
<alternatives>
<tex-math id="M127">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu + 1$$\end{document}</tex-math>
<mml:math id="M128">
<mml:mrow>
<mml:mi mathvariant="italic">μ</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq62.gif"></inline-graphic>
</alternatives>
</inline-formula>
 bits. When the average size per suffix prefix stored is close to
<inline-formula id="IEq63">
<alternatives>
<tex-math id="M129">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu + 1$$\end{document}</tex-math>
<mml:math id="M130">
<mml:mrow>
<mml:mi mathvariant="italic">μ</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq63.gif"></inline-graphic>
</alternatives>
</inline-formula>
bits, arrays
<italic>pref</italic>
,
<italic>suf</italic>
and
<italic>clust</italic>
can be recomputed by increasing |
<italic>a</italic>
| and decreasing |
<italic>b</italic>
|, such that
<inline-formula id="IEq64">
<alternatives>
<tex-math id="M131">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2^{\lambda '} + q \cdot \mu ' < 2^{\lambda } + q \cdot \mu$$\end{document}</tex-math>
<mml:math id="M132">
<mml:mrow>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:msup>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>·</mml:mo>
<mml:msup>
<mml:mi mathvariant="italic">μ</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo><</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>·</mml:mo>
<mml:mi mathvariant="italic">μ</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq64.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where
<inline-formula id="IEq65">
<alternatives>
<tex-math id="M133">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda '$$\end{document}</tex-math>
<mml:math id="M134">
<mml:msup>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq65.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq66">
<alternatives>
<tex-math id="M135">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu '$$\end{document}</tex-math>
<mml:math id="M136">
<mml:msup>
<mml:mi mathvariant="italic">μ</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq66.gif"></inline-graphic>
</alternatives>
</inline-formula>
are the values of
<inline-formula id="IEq67">
<alternatives>
<tex-math id="M137">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document}</tex-math>
<mml:math id="M138">
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq67.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq68">
<alternatives>
<tex-math id="M139">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document}</tex-math>
<mml:math id="M140">
<mml:mi mathvariant="italic">μ</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq68.gif"></inline-graphic>
</alternatives>
</inline-formula>
, respectively, after resizing.</p>
</sec>
</sec>
<sec id="Sec8">
<title>Operations supported by the Bloom Filter Trie</title>
<p>The BFT supports all operations necessary for storing, traversing and searching a pan-genome, as well as to extract the relevant information of the contained genomes and subsets thereof. Here we describe the most basic ones of them, Look-up and Insertion, as well as how the sets of colors are compressed. The traversal of the graph is discussed in the next section.</p>
<p>The algorithms use three auxiliary functions.
<inline-formula id="IEq69">
<alternatives>
<tex-math id="M141">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {HammingWeight}(\alpha ,{pref})$$\end{document}</tex-math>
<mml:math id="M142">
<mml:mrow>
<mml:mi mathvariant="sans-serif">HammingWeight</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq69.gif"></inline-graphic>
</alternatives>
</inline-formula>
counts the number of 1s in
<inline-formula id="IEq70">
<alternatives>
<tex-math id="M143">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pref}[1..\alpha ]$$\end{document}</tex-math>
<mml:math id="M144">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq70.gif"></inline-graphic>
</alternatives>
</inline-formula>
and corresponds to how many prefixes represented in array
<italic>pref</italic>
are lexicographically smaller than or equal to an inserted prefix
<italic>a</italic>
with binary representation
<inline-formula id="IEq71">
<alternatives>
<tex-math id="M145">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha$$\end{document}</tex-math>
<mml:math id="M146">
<mml:mi mathvariant="italic">α</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq71.gif"></inline-graphic>
</alternatives>
</inline-formula>
of length
<inline-formula id="IEq72">
<alternatives>
<tex-math id="M147">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document}</tex-math>
<mml:math id="M148">
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq72.gif"></inline-graphic>
</alternatives>
</inline-formula>
bits. This requires
<inline-formula id="IEq73">
<alternatives>
<tex-math id="M149">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(2^{\lambda })$$\end{document}</tex-math>
<mml:math id="M150">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq73.gif"></inline-graphic>
</alternatives>
</inline-formula>
time. The second function,
<inline-formula id="IEq74">
<alternatives>
<tex-math id="M151">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {Rank}(i,{clust})$$\end{document}</tex-math>
<mml:math id="M152">
<mml:mrow>
<mml:mi mathvariant="sans-serif">Rank</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq74.gif"></inline-graphic>
</alternatives>
</inline-formula>
, iterates over array
<italic>clust</italic>
from its first position until the
<italic>i</italic>
th entry 1 is found and returns the position of this entry. It corresponds to the start position of cluster 
<italic>i</italic>
in array
<italic>clust</italic>
. If the entry is not found, the function returns
<inline-formula id="IEq75">
<alternatives>
<tex-math id="M153">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|{clust}|+1$$\end{document}</tex-math>
<mml:math id="M154">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq75.gif"></inline-graphic>
</alternatives>
</inline-formula>
as a position. While
<inline-formula id="IEq76">
<alternatives>
<tex-math id="M155">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {Rank}$$\end{document}</tex-math>
<mml:math id="M156">
<mml:mi mathvariant="sans-serif">Rank</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq76.gif"></inline-graphic>
</alternatives>
</inline-formula>
could be implemented in
<inline-formula id="IEq77">
<alternatives>
<tex-math id="M157">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(1)$$\end{document}</tex-math>
<mml:math id="M158">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq77.gif"></inline-graphic>
</alternatives>
</inline-formula>
time [
<xref ref-type="bibr" rid="CR5">5</xref>
], we use a more naive but space efficient
<inline-formula id="IEq78">
<alternatives>
<tex-math id="M159">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(q)$$\end{document}</tex-math>
<mml:math id="M160">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq78.gif"></inline-graphic>
</alternatives>
</inline-formula>
time implementation, where
<italic>q</italic>
is the number of suffix prefixes in a compressed container. Finally,
<inline-formula id="IEq79">
<alternatives>
<tex-math id="M161">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {BinarySearch}({uc},{s})$$\end{document}</tex-math>
<mml:math id="M162">
<mml:mrow>
<mml:mi mathvariant="sans-serif">BinarySearch</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq79.gif"></inline-graphic>
</alternatives>
</inline-formula>
searches for the suffix 
<italic>s</italic>
in the uncompressed container
<italic>uc</italic>
in
<inline-formula id="IEq80">
<alternatives>
<tex-math id="M163">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(\log _2 c)$$\end{document}</tex-math>
<mml:math id="M164">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>log</mml:mo>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq80.gif"></inline-graphic>
</alternatives>
</inline-formula>
time, where
<italic>c</italic>
is the capacity of 
<italic>uc</italic>
.</p>
<sec id="Sec9">
<title>Look-up</title>
<p>The function that tests whether a suffix prefix
<inline-formula id="IEq81">
<alternatives>
<tex-math id="M165">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref} = ab$$\end{document}</tex-math>
<mml:math id="M166">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>a</mml:mi>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq81.gif"></inline-graphic>
</alternatives>
</inline-formula>
with binary representation
<inline-formula id="IEq82">
<alternatives>
<tex-math id="M167">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha \beta$$\end{document}</tex-math>
<mml:math id="M168">
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq82.gif"></inline-graphic>
</alternatives>
</inline-formula>
is stored in a compressed container
<italic>cc</italic>
is given in Algorithm 1. Line 1 verifies the presence of prefix
<italic>a</italic>
in the array
<italic>pref</italic>
in
<inline-formula id="IEq83">
<alternatives>
<tex-math id="M169">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(1)$$\end{document}</tex-math>
<mml:math id="M170">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq83.gif"></inline-graphic>
</alternatives>
</inline-formula>
time. If
<italic>a</italic>
is present, line 2 computes in
<inline-formula id="IEq84">
<alternatives>
<tex-math id="M171">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(2^{\lambda })$$\end{document}</tex-math>
<mml:math id="M172">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq84.gif"></inline-graphic>
</alternatives>
</inline-formula>
time the Hamming weight
<italic>i</italic>
of
<italic>a</italic>
, i.e., the index of the cluster in which suffix
<italic>b</italic>
is possibly situated. Line 3 locates the rank of
<italic>i</italic>
, i.e., the start position of the cluster, and lines 4–7 compare the suffixes of the cluster to 
<italic>b</italic>
. Lines 3–7 are computed in
<inline-formula id="IEq85">
<alternatives>
<tex-math id="M173">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(q)$$\end{document}</tex-math>
<mml:math id="M174">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq85.gif"></inline-graphic>
</alternatives>
</inline-formula>
time. Algorithm 1 has therefore a worst case running time of 
<inline-formula id="IEq86">
<alternatives>
<tex-math id="M175">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(2^{\lambda } + q)$$\end{document}</tex-math>
<mml:math id="M176">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq86.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
<graphic position="anchor" xlink:href="13015_2016_66_Figb_HTML" id="MO5"></graphic>
<p>The function that tests whether a
<italic>k</italic>
-mer
<italic>x</italic>
is present in a BFT
<inline-formula id="IEq87">
<alternatives>
<tex-math id="M177">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t = (V_t,E_t)$$\end{document}</tex-math>
<mml:math id="M178">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq87.gif"></inline-graphic>
</alternatives>
</inline-formula>
is given in Algorithm 2. Each vertex
<inline-formula id="IEq88">
<alternatives>
<tex-math id="M179">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$v \in V_t$$\end{document}</tex-math>
<mml:math id="M180">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq88.gif"></inline-graphic>
</alternatives>
</inline-formula>
represents
<italic>k</italic>
-mer suffixes possibly stored in its uncompressed container or rooted from its compressed containers. The look-up traverses
<italic>t</italic>
from the root and, for a vertex
<italic>v</italic>
, queries its containers one after the other for suffix
<inline-formula id="IEq89">
<alternatives>
<tex-math id="M181">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{suf} = {x[l \cdot {depth}(v,t) + 1..k]}$$\end{document}</tex-math>
<mml:math id="M182">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>·</mml:mo>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq89.gif"></inline-graphic>
</alternatives>
</inline-formula>
. If the queried container is compressed, its BF
<italic>quer</italic>
is queried for
<inline-formula id="IEq90">
<alternatives>
<tex-math id="M183">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{suf}[1..l]$$\end{document}</tex-math>
<mml:math id="M184">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq90.gif"></inline-graphic>
</alternatives>
</inline-formula>
using the function
<inline-formula id="IEq91">
<alternatives>
<tex-math id="M185">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {MayContain}$$\end{document}</tex-math>
<mml:math id="M186">
<mml:mi mathvariant="sans-serif">MayContain</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq91.gif"></inline-graphic>
</alternatives>
</inline-formula>
in
<inline-formula id="IEq92">
<alternatives>
<tex-math id="M187">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(f)$$\end{document}</tex-math>
<mml:math id="M188">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq92.gif"></inline-graphic>
</alternatives>
</inline-formula>
 time where, as above,
<italic>f</italic>
is the number of hash functions used by the BF. In case of a positive answer, the function
<inline-formula id="IEq93">
<alternatives>
<tex-math id="M189">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {Contains}$$\end{document}</tex-math>
<mml:math id="M190">
<mml:mi mathvariant="sans-serif">Contains</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq93.gif"></inline-graphic>
</alternatives>
</inline-formula>
is used for an exact membership of
<inline-formula id="IEq94">
<alternatives>
<tex-math id="M191">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{suf}[1..l]$$\end{document}</tex-math>
<mml:math id="M192">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq94.gif"></inline-graphic>
</alternatives>
</inline-formula>
. If it is found, the traversing procedure continues recursively on the corresponding child. The absence of
<inline-formula id="IEq95">
<alternatives>
<tex-math id="M193">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{suf}[1..l]$$\end{document}</tex-math>
<mml:math id="M194">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq95.gif"></inline-graphic>
</alternatives>
</inline-formula>
indicates the absence of
<italic>x</italic>
in
<italic>t</italic>
since
<inline-formula id="IEq96">
<alternatives>
<tex-math id="M195">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{suf}[1..l]$$\end{document}</tex-math>
<mml:math id="M196">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq96.gif"></inline-graphic>
</alternatives>
</inline-formula>
cannot be in another container of
<italic>v</italic>
because of the insertion process explained later in this paper. If the container is uncompressed, the presence of
<inline-formula id="IEq97">
<alternatives>
<tex-math id="M197">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{suf}$$\end{document}</tex-math>
<mml:math id="M198">
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq97.gif"></inline-graphic>
</alternatives>
</inline-formula>
is detected using the function 
<inline-formula id="IEq98">
<alternatives>
<tex-math id="M199">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {BinarySearch}$$\end{document}</tex-math>
<mml:math id="M200">
<mml:mi mathvariant="sans-serif">BinarySearch</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq98.gif"></inline-graphic>
</alternatives>
</inline-formula>
. As an uncompressed container has no children, a match indicates the presence of the
<italic>k</italic>
-mer. Algorithm 2 is initially called as
<inline-formula id="IEq99">
<alternatives>
<tex-math id="M201">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {TreeContains}(x, 1, l, {root})$$\end{document}</tex-math>
<mml:math id="M202">
<mml:mrow>
<mml:mi mathvariant="sans-serif">TreeContains</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq99.gif"></inline-graphic>
</alternatives>
</inline-formula>
. In the worst case, all vertices on a traversed path represent all possible suffix prefixes and the BFs
<italic>quer</italic>
have a false positive ratio of 0. In such case, each traversed vertex contains
<inline-formula id="IEq100">
<alternatives>
<tex-math id="M203">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left\lceil \frac{|\mathcal {A}|^l}{c}\right\rceil$$\end{document}</tex-math>
<mml:math id="M204">
<mml:mfenced close="⌉" open="⌈" separators="">
<mml:mfrac>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mi>l</mml:mi>
</mml:msup>
<mml:mi>c</mml:mi>
</mml:mfrac>
</mml:mfenced>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq100.gif"></inline-graphic>
</alternatives>
</inline-formula>
containers. The longest path of a BFT has
<inline-formula id="IEq101">
<alternatives>
<tex-math id="M205">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{k}{l}$$\end{document}</tex-math>
<mml:math id="M206">
<mml:mfrac>
<mml:mi>k</mml:mi>
<mml:mi>l</mml:mi>
</mml:mfrac>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq101.gif"></inline-graphic>
</alternatives>
</inline-formula>
vertices. Therefore, the worst case time of
<inline-formula id="IEq102">
<alternatives>
<tex-math id="M207">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {TreeContains}$$\end{document}</tex-math>
<mml:math id="M208">
<mml:mi mathvariant="sans-serif">TreeContains</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq102.gif"></inline-graphic>
</alternatives>
</inline-formula>
is
<inline-formula id="IEq103">
<alternatives>
<tex-math id="M209">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}\left (\frac{k}{l} \cdot \left (\left\lceil \frac{|\mathcal {A}|^l}{c}\right\rceil \cdot f + 2^{\lambda } + q\right )\right)$$\end{document}</tex-math>
<mml:math id="M210">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mfenced close=")" open="(" separators="">
<mml:mfrac>
<mml:mi>k</mml:mi>
<mml:mi>l</mml:mi>
</mml:mfrac>
<mml:mo>·</mml:mo>
<mml:mfenced close=")" open="(" separators="">
<mml:mfenced close="⌉" open="⌈" separators="">
<mml:mfrac>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mi>l</mml:mi>
</mml:msup>
<mml:mi>c</mml:mi>
</mml:mfrac>
</mml:mfenced>
<mml:mo>·</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi>
</mml:mfenced>
</mml:mfenced>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq103.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
<graphic position="anchor" xlink:href="13015_2016_66_Figc_HTML" id="MO6"></graphic>
</sec>
<sec id="Sec10">
<title>Insertion</title>
<p>Prior to any
<italic>k</italic>
-mer insertion into a BFT
<italic>t</italic>
, a look-up verifies if the
<italic>k</italic>
-mer is already present. If it is, only its set of colors is modified. Otherwise, the look-up stops the trie traversal on a container
<italic>cont</italic>
of a vertex
<italic>v</italic>
where the searched suffix prefix or
<italic>k</italic>
-mer suffix is not present. If
<italic>cont</italic>
is uncompressed, the insertion of the
<italic>k</italic>
-mer suffix and its color is a simple
<inline-formula id="IEq104">
<alternatives>
<tex-math id="M211">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(\log _2 c)$$\end{document}</tex-math>
<mml:math id="M212">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>log</mml:mo>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq104.gif"></inline-graphic>
</alternatives>
</inline-formula>
time process. If
<italic>cont</italic>
is compressed, the insertion of suffix prefix
<inline-formula id="IEq105">
<alternatives>
<tex-math id="M213">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref} = ab$$\end{document}</tex-math>
<mml:math id="M214">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>a</mml:mi>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq105.gif"></inline-graphic>
</alternatives>
</inline-formula>
is a bit more intricate. In fact, it will only be triggered if
<italic>cont</italic>
is the first compressed container of
<italic>v</italic>
to have
<inline-formula id="IEq106">
<alternatives>
<tex-math id="M215">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref}$$\end{document}</tex-math>
<mml:math id="M216">
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq106.gif"></inline-graphic>
</alternatives>
</inline-formula>
as a false positive (
<inline-formula id="IEq107">
<alternatives>
<tex-math id="M217">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {MayContain}(s_{pref},{cont}.{quer}) = {true}$$\end{document}</tex-math>
<mml:math id="M218">
<mml:mrow>
<mml:mi mathvariant="sans-serif">MayContain</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo>.</mml:mo>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq107.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq108">
<alternatives>
<tex-math id="M219">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {Contains}(s_{pref},{cont}) = {false}$$\end{document}</tex-math>
<mml:math id="M220">
<mml:mrow>
<mml:mi mathvariant="sans-serif">Contains</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq108.gif"></inline-graphic>
</alternatives>
</inline-formula>
). False positives are therefore “recycled”, which is a nice property of BFTs: The BF
<italic>quer</italic>
remains unchanged, and only
<italic>pref</italic>
,
<italic>suf</italic>
and
<italic>clust</italic>
need to be updated in a way similar to Algorithm 1. The presence of prefix 
<italic>a</italic>
must be first verified by testing the value of
<inline-formula id="IEq109">
<alternatives>
<tex-math id="M221">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pref}[\alpha ]$$\end{document}</tex-math>
<mml:math id="M222">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq109.gif"></inline-graphic>
</alternatives>
</inline-formula>
where
<inline-formula id="IEq110">
<alternatives>
<tex-math id="M223">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha$$\end{document}</tex-math>
<mml:math id="M224">
<mml:mi mathvariant="italic">α</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq110.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the binary representation of
<italic>a</italic>
. If
<inline-formula id="IEq111">
<alternatives>
<tex-math id="M225">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pref}[\alpha ] = 0$$\end{document}</tex-math>
<mml:math id="M226">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq111.gif"></inline-graphic>
</alternatives>
</inline-formula>
, prefix
<italic>a</italic>
is not present and is recorded by setting
<inline-formula id="IEq112">
<alternatives>
<tex-math id="M227">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pref}[\alpha ]$$\end{document}</tex-math>
<mml:math id="M228">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq112.gif"></inline-graphic>
</alternatives>
</inline-formula>
to 1. Then, the index
<inline-formula id="IEq113">
<alternatives>
<tex-math id="M229">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${id}_{cluster}$$\end{document}</tex-math>
<mml:math id="M230">
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq113.gif"></inline-graphic>
</alternatives>
</inline-formula>
and start position
<inline-formula id="IEq114">
<alternatives>
<tex-math id="M231">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pos}_{cluster}$$\end{document}</tex-math>
<mml:math id="M232">
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq114.gif"></inline-graphic>
</alternatives>
</inline-formula>
of the new cluster are computed using
<inline-formula id="IEq115">
<alternatives>
<tex-math id="M233">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {HammingWeight}$$\end{document}</tex-math>
<mml:math id="M234">
<mml:mi mathvariant="sans-serif">HammingWeight</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq115.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq116">
<alternatives>
<tex-math id="M235">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {Rank}$$\end{document}</tex-math>
<mml:math id="M236">
<mml:mi mathvariant="sans-serif">Rank</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq116.gif"></inline-graphic>
</alternatives>
</inline-formula>
. The suffix
<italic>b</italic>
is inserted into
<inline-formula id="IEq117">
<alternatives>
<tex-math id="M237">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${suf}[{pos}_{cluster}]$$\end{document}</tex-math>
<mml:math id="M238">
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq117.gif"></inline-graphic>
</alternatives>
</inline-formula>
and a 1 into
<inline-formula id="IEq118">
<alternatives>
<tex-math id="M239">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${clust}[{pos}_{cluster}]$$\end{document}</tex-math>
<mml:math id="M240">
<mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq118.gif"></inline-graphic>
</alternatives>
</inline-formula>
. This takes
<inline-formula id="IEq119">
<alternatives>
<tex-math id="M241">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(2^{\lambda } + 2q)$$\end{document}</tex-math>
<mml:math id="M242">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq119.gif"></inline-graphic>
</alternatives>
</inline-formula>
time. If
<inline-formula id="IEq120">
<alternatives>
<tex-math id="M243">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pref}[\alpha ] = 1$$\end{document}</tex-math>
<mml:math id="M244">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq120.gif"></inline-graphic>
</alternatives>
</inline-formula>
prior to insertion, prefix
<italic>a</italic>
is already present, and
<inline-formula id="IEq121">
<alternatives>
<tex-math id="M245">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${id}_{cluster}$$\end{document}</tex-math>
<mml:math id="M246">
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq121.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq122">
<alternatives>
<tex-math id="M247">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pos}_{cluster}$$\end{document}</tex-math>
<mml:math id="M248">
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq122.gif"></inline-graphic>
</alternatives>
</inline-formula>
have already been computed by
<inline-formula id="IEq123">
<alternatives>
<tex-math id="M249">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\textsf {Contains}(s_{pref},{cont})$$\end{document}</tex-math>
<mml:math id="M250">
<mml:mrow>
<mml:mi mathvariant="sans-serif">Contains</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq123.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Let
<italic>n</italic>
be the number of suffixes in cluster
<inline-formula id="IEq124">
<alternatives>
<tex-math id="M251">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${id}_{cluster}$$\end{document}</tex-math>
<mml:math id="M252">
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq124.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Suffix
<italic>b</italic>
is inserted into
<italic>suf</italic>
[
<italic>pos</italic>
] such that
<inline-formula id="IEq125">
<alternatives>
<tex-math id="M253">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pos}_{cluster} \le {pos} \le {{pos}_{cluster}+n}$$\end{document}</tex-math>
<mml:math id="M254">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq125.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq126">
<alternatives>
<tex-math id="M255">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${suf}[{pos}-1] < {suf}[{pos}]$$\end{document}</tex-math>
<mml:math id="M256">
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo><</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq126.gif"></inline-graphic>
</alternatives>
</inline-formula>
. If
<inline-formula id="IEq127">
<alternatives>
<tex-math id="M257">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${pos} = {pos}_{cluster}$$\end{document}</tex-math>
<mml:math id="M258">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq127.gif"></inline-graphic>
</alternatives>
</inline-formula>
,
<italic>b</italic>
starts its cluster: A 1 is inserted into
<italic>clust</italic>
[
<italic>pos</italic>
] and
<inline-formula id="IEq128">
<alternatives>
<tex-math id="M259">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${clust}[{pos}+1]$$\end{document}</tex-math>
<mml:math id="M260">
<mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq128.gif"></inline-graphic>
</alternatives>
</inline-formula>
is set to 0. Otherwise, a 0 is inserted into
<italic>clust</italic>
[
<italic>pos</italic>
]. This takes
<inline-formula id="IEq129">
<alternatives>
<tex-math id="M261">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(2q)$$\end{document}</tex-math>
<mml:math id="M262">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq129.gif"></inline-graphic>
</alternatives>
</inline-formula>
time. The worst case insertion time of a
<italic>k</italic>
-mer is
<inline-formula id="IEq130">
<alternatives>
<tex-math id="M263">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {O}(d + 2^{\lambda } + 2q)$$\end{document}</tex-math>
<mml:math id="M264">
<mml:mrow>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq130.gif"></inline-graphic>
</alternatives>
</inline-formula>
with
<italic>d</italic>
being the worst case time look-up.</p>
<p>The internal representation of the compressed container shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
b after insertion of the suffix prefix
<italic>gtat</italic>
is given below (inserted parts are highlighted). The presence of prefix
<italic>gt</italic>
is recorded in
<italic>pref</italic>
[12]. Then, its cluster index and start position are computed as 4 and 5, respectively. Consequently, after reallocation of arrays
<italic>suf</italic>
and
<italic>clust</italic>
, suffix
<italic>at</italic>
is inserted in
<italic>suf</italic>
[5] and
<italic>clust</italic>
[5] is set to 1 to indicate that
<italic>suf</italic>
[5] starts a new cluster.
<graphic position="anchor" xlink:href="13015_2016_66_Figd_HTML" id="MO64"></graphic>
</p>
</sec>
<sec id="Sec11">
<title>Color compression</title>
<p>Remember from the BFT description that color sets associated with
<italic>k</italic>
-mers in a C-DBG are initially stored as bit arrays in BFTs. However, these can be compressed by storing sets of colors that are identical for multiple
<italic>k</italic>
-mers once. To this end, a list of all color sets occurring in the BFT is built and sorted in decreasing order of total size, i.e., the number of
<italic>k</italic>
-mers sharing a color set multiplied by its size. Then, by iterating over the list, each color set is added incrementally to an external array if the integer encoding its position in the array uses less space than the size of the color set itself. Finally, each color set present in the external array is replaced in the BFT by its position in the external array.</p>
</sec>
</sec>
<sec id="Sec12">
<title>Traversing successors and predecessors</title>
<p>Let
<italic>t</italic>
be a BFT that represents a C-DBG
<italic>G</italic>
. For a
<italic>k</italic>
-mer
<italic>x</italic>
, visiting all its predecessors or successors in
<italic>G</italic>
, denoted
<italic>pred</italic>
(
<italic>x</italic>
<italic>G</italic>
) and
<italic>succ</italic>
(
<italic>x</italic>
<italic>G</italic>
), respectively, implies the look-up of
<inline-formula id="IEq131">
<alternatives>
<tex-math id="M265">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|\mathcal {A}|$$\end{document}</tex-math>
<mml:math id="M266">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq131.gif"></inline-graphic>
</alternatives>
</inline-formula>
different
<italic>k</italic>
-mers in
<italic>t</italic>
. Such a look-up would visit in the worst case
<inline-formula id="IEq132">
<alternatives>
<tex-math id="M267">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|\mathcal {A}| \cdot ({height}_{max}(t)+1)$$\end{document}</tex-math>
<mml:math id="M268">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>·</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq132.gif"></inline-graphic>
</alternatives>
</inline-formula>
vertices in
<italic>t</italic>
. This section describes how to reduce the number of vertices and containers visited in
<italic>t</italic>
during the traversal of a vertex in
<italic>G</italic>
.</p>
<sec id="FPar1">
<title>
<bold>Observation 1</bold>
</title>
<p>Let
<italic>G</italic>
be a C-DBG represented by a BFT
<italic>t</italic>
and
<italic>x</italic>
a
<italic>k</italic>
-mer corresponding to a vertex of
<italic>G</italic>
. All
<italic>k</italic>
-mers of
<italic>succ</italic>
(
<italic>x</italic>
<italic>G</italic>
) share
<italic>x</italic>
[2..
<italic>k</italic>
] as a common prefix and therefore share a common subpath in
<italic>t</italic>
starting at the root. On the other hand,
<italic>k</italic>
-mers of
<italic>pred</italic>
(
<italic>x</italic>
<italic>G</italic>
) have different first characters and, therefore, except for the root of
<italic>t</italic>
do not share a common subpath. Hence, the maximum number of visited vertices in
<italic>t</italic>
for all
<italic>k</italic>
-mers of
<italic>succ</italic>
(
<italic>x</italic>
<italic>G</italic>
) is
<inline-formula id="IEq133">
<alternatives>
<tex-math id="M269">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1 + {height}_{max}(t)$$\end{document}</tex-math>
<mml:math id="M270">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq133.gif"></inline-graphic>
</alternatives>
</inline-formula>
and for all
<italic>k</italic>
-mers of
<italic>pred</italic>
(
<italic>x</italic>
<italic>G</italic>
) is
<inline-formula id="IEq134">
<alternatives>
<tex-math id="M271">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1 + |\mathcal {A}| \cdot {height}_{max}(t)$$\end{document}</tex-math>
<mml:math id="M272">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq134.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
</sec>
<sec id="FPar2">
<title>
<bold>Lemma 1</bold>
</title>
<p>Let
<italic>G</italic>
be a C-DBG represented by a BFT
<italic>t</italic>
,
<italic>x</italic>
a
<italic>k</italic>
-mer in
<italic>t</italic>
and
<italic>v</italic>
a vertex of
<italic>t</italic>
that terminates the shared subpath of the
<italic>k</italic>
-mers in
<italic>succ</italic>
(
<italic>x</italic>
<italic>G</italic>
). If
<inline-formula id="IEq135">
<alternatives>
<tex-math id="M273">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${depth}(v,t) = {height}_{max}(t)$$\end{document}</tex-math>
<mml:math id="M274">
<mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq135.gif"></inline-graphic>
</alternatives>
</inline-formula>
,
<italic>succ</italic>
(
<italic>x</italic>
<italic>t</italic>
) suffixes may be stored in any container of
<italic>v</italic>
. If not, they are stored in the uncompressed container of
<italic>v</italic>
.</p>
</sec>
<sec id="FPar3">
<title>
<italic>Proof</italic>
</title>
<p>A vertex
<italic>v</italic>
is the root of a sub-trie storing
<italic>k</italic>
-mer suffixes of length
<inline-formula id="IEq136">
<alternatives>
<tex-math id="M275">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${l \cdot ({height}_{max}(t) - {depth}(v,t) + 1)}$$\end{document}</tex-math>
<mml:math id="M276">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>·</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq136.gif"></inline-graphic>
</alternatives>
</inline-formula>
with
<inline-formula id="IEq137">
<alternatives>
<tex-math id="M277">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l = \frac{k}{{height}_{max}(t)+1}$$\end{document}</tex-math>
<mml:math id="M278">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq137.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Let
<italic>s</italic>
be a
<italic>k</italic>
-mer suffix of
<italic>succ</italic>
(
<italic>x</italic>
<italic>t</italic>
) rooted at a vertex
<inline-formula id="IEq138">
<alternatives>
<tex-math id="M279">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$v \in V_t$$\end{document}</tex-math>
<mml:math id="M280">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq138.gif"></inline-graphic>
</alternatives>
</inline-formula>
. If
<inline-formula id="IEq139">
<alternatives>
<tex-math id="M281">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${depth}(v,t) \ne {height}_{max}(t)$$\end{document}</tex-math>
<mml:math id="M282">
<mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq139.gif"></inline-graphic>
</alternatives>
</inline-formula>
but
<italic>s</italic>
is rooted at a compressed container in
<italic>v</italic>
, then this compressed container stores
<italic>s</italic>
[1..
<italic>l</italic>
], and
<inline-formula id="IEq140">
<alternatives>
<tex-math id="M283">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s[l+1..|s|]$$\end{document}</tex-math>
<mml:math id="M284">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq140.gif"></inline-graphic>
</alternatives>
</inline-formula>
is rooted in one of its children. As the divergent character between the
<italic>k</italic>
-mer suffixes of
<italic>succ</italic>
(
<italic>x</italic>
) is in position
<inline-formula id="IEq141">
<alternatives>
<tex-math id="M285">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$|s| - 1$$\end{document}</tex-math>
<mml:math id="M286">
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq141.gif"></inline-graphic>
</alternatives>
</inline-formula>
, this character is in
<inline-formula id="IEq142">
<alternatives>
<tex-math id="M287">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s[l+1..|s|]$$\end{document}</tex-math>
<mml:math id="M288">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq142.gif"></inline-graphic>
</alternatives>
</inline-formula>
, rooted at one child of this compressed container. Therefore
<italic>v</italic>
does not terminate the common subpath shared by
<italic>succ</italic>
(
<italic>x</italic>
<italic>t</italic>
)
<italic>k</italic>
-mers.
<inline-formula id="IEq143">
<alternatives>
<tex-math id="M289">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\square$$\end{document}</tex-math>
<mml:math id="M290">
<mml:mo></mml:mo>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq143.gif"></inline-graphic>
</alternatives>
</inline-formula>
</p>
<p>Lemma 1 proves that the only two cases where a look-up of
<italic>pred</italic>
(
<italic>x</italic>
<italic>G</italic>
) or
<italic>succ</italic>
(
<italic>x</italic>
<italic>G</italic>
) must search in different containers of a vertex are:
<list list-type="bullet">
<list-item>
<p>searching at the root of
<italic>t</italic>
for
<italic>k</italic>
-mers of
<italic>pred</italic>
(
<italic>x</italic>
<italic>G</italic>
),</p>
</list-item>
<list-item>
<p>if
<inline-formula id="IEq144">
<alternatives>
<tex-math id="M291">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${depth}(v,t) = {height}_{max}(t)$$\end{document}</tex-math>
<mml:math id="M292">
<mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq144.gif"></inline-graphic>
</alternatives>
</inline-formula>
, searching at vertex
<italic>v</italic>
for suffixes of
<italic>succ</italic>
(
<italic>x</italic>
<italic>G</italic>
).</p>
</list-item>
</list>
</p>
<p>Restricting the hash functions used in the compressed containers to take only positions 2 through
<inline-formula id="IEq145">
<alternatives>
<tex-math id="M293">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l-1$$\end{document}</tex-math>
<mml:math id="M294">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq145.gif"></inline-graphic>
</alternatives>
</inline-formula>
into account, allows to limit the search space.</p>
</sec>
<sec id="FPar4">
<title>
<bold>Lemma 2</bold>
</title>
<p>Let
<italic>t</italic>
be a BFT where the
<italic>f</italic>
hash functions
<inline-formula id="IEq146">
<alternatives>
<tex-math id="M295">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_i$$\end{document}</tex-math>
<mml:math id="M296">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq146.gif"></inline-graphic>
</alternatives>
</inline-formula>
of
<italic>quer</italic>
have the form
<inline-formula id="IEq147">
<alternatives>
<tex-math id="M297">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_i(s_{pref}):s_{pref}[2..l-1]\rightarrow \{1,..,m\}$$\end{document}</tex-math>
<mml:math id="M298">
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo stretchy="false"></mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq147.gif"></inline-graphic>
</alternatives>
</inline-formula>
for
<inline-formula id="IEq148">
<alternatives>
<tex-math id="M299">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i=1,...,f$$\end{document}</tex-math>
<mml:math id="M300">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq148.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Then, for a vertex
<italic>v</italic>
of
<italic>t</italic>
and a suffix prefix
<inline-formula id="IEq149">
<alternatives>
<tex-math id="M301">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref}$$\end{document}</tex-math>
<mml:math id="M302">
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq149.gif"></inline-graphic>
</alternatives>
</inline-formula>
, all possible substrings
<inline-formula id="IEq150">
<alternatives>
<tex-math id="M303">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s'_{pref} = c_1 s_{pref}[2..l-1] c_2$$\end{document}</tex-math>
<mml:math id="M304">
<mml:mrow>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq150.gif"></inline-graphic>
</alternatives>
</inline-formula>
are contained in the same container of
<italic>v</italic>
.</p>
</sec>
<sec id="FPar5">
<title>
<italic>Proof</italic>
</title>
<p>Assume a
<italic>k</italic>
-mer suffix
<italic>s</italic>
inserted in a vertex
<italic>v</italic>
of
<italic>t</italic>
. A look-up for
<italic>s</italic>
analyzes the containers of
<italic>v</italic>
from the head to the tail of the container list. In the worst case,
<italic>s</italic>
can be rooted, according to BFs
<italic>quer</italic>
, in all compressed containers as a true positive or as a false positive. However, a look-up stops either on the first compressed container claiming to contain the suffix prefix
<inline-formula id="IEq151">
<alternatives>
<tex-math id="M305">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref} = s[1..l]$$\end{document}</tex-math>
<mml:math id="M306">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq151.gif"></inline-graphic>
</alternatives>
</inline-formula>
, or on the uncompressed container. As the hash functions of
<italic>quer</italic>
consider only
<inline-formula id="IEq152">
<alternatives>
<tex-math id="M307">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref}[2..l-1]$$\end{document}</tex-math>
<mml:math id="M308">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq152.gif"></inline-graphic>
</alternatives>
</inline-formula>
, a look-up will therefore stop on the same container for any substring
<inline-formula id="IEq153">
<alternatives>
<tex-math id="M309">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s'_{pref} = c_1 s_{pref}[2..l-1] c_2$$\end{document}</tex-math>
<mml:math id="M310">
<mml:mrow>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq153.gif"></inline-graphic>
</alternatives>
</inline-formula>
.
<inline-formula id="IEq154">
<alternatives>
<tex-math id="M311">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\square$$\end{document}</tex-math>
<mml:math id="M312">
<mml:mo></mml:mo>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq154.gif"></inline-graphic>
</alternatives>
</inline-formula>
</p>
<p>As a consequence of Lemma 2, each suffix prefix
<inline-formula id="IEq155">
<alternatives>
<tex-math id="M313">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s_{pref}$$\end{document}</tex-math>
<mml:math id="M314">
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq155.gif"></inline-graphic>
</alternatives>
</inline-formula>
stored or to store in arrays
<italic>pref</italic>
,
<italic>suf</italic>
and
<italic>clust</italic>
is modified such that
<inline-formula id="IEq156">
<alternatives>
<tex-math id="M315">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s'_{pref} = s_{pref}[2..l] s_{pref}[1]$$\end{document}</tex-math>
<mml:math id="M316">
<mml:mrow>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq156.gif"></inline-graphic>
</alternatives>
</inline-formula>
, which guarantees that all
<inline-formula id="IEq157">
<alternatives>
<tex-math id="M317">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s''_{pref} = s'_{pref}[1..l-2] c_2 c_1$$\end{document}</tex-math>
<mml:math id="M318">
<mml:mrow>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
</mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq157.gif"></inline-graphic>
</alternatives>
</inline-formula>
are in the same container. Furthermore, suffixes stored in array
<italic>suf</italic>
are required to have a minimum length of two characters to ensure that characters
<inline-formula id="IEq158">
<alternatives>
<tex-math id="M319">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c_1$$\end{document}</tex-math>
<mml:math id="M320">
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq158.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq159">
<alternatives>
<tex-math id="M321">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c_2$$\end{document}</tex-math>
<mml:math id="M322">
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq159.gif"></inline-graphic>
</alternatives>
</inline-formula>
, the variable parts between the different
<inline-formula id="IEq160">
<alternatives>
<tex-math id="M323">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s''_{pref}$$\end{document}</tex-math>
<mml:math id="M324">
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq160.gif"></inline-graphic>
</alternatives>
</inline-formula>
, are stored in array
<italic>suf</italic>
. Hence, as all
<inline-formula id="IEq161">
<alternatives>
<tex-math id="M325">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s''_{pref}$$\end{document}</tex-math>
<mml:math id="M326">
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq161.gif"></inline-graphic>
</alternatives>
</inline-formula>
share
<inline-formula id="IEq162">
<alternatives>
<tex-math id="M327">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s'_{pref}[1..l-2]$$\end{document}</tex-math>
<mml:math id="M328">
<mml:mrow>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
</mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq162.gif"></inline-graphic>
</alternatives>
</inline-formula>
as a prefix, they share the same cluster in arrays
<italic>suf</italic>
and
<italic>clust</italic>
. Suffix prefixes
<inline-formula id="IEq163">
<alternatives>
<tex-math id="M329">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s''_{pref} = s'_{pref}[1..l-1] c_1$$\end{document}</tex-math>
<mml:math id="M330">
<mml:mrow>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
</mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>.</mml:mo>
<mml:mo>.</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq163.gif"></inline-graphic>
</alternatives>
</inline-formula>
also have consecutive suffixes in their cluser.</p>
</sec>
</sec>
<sec id="Sec13">
<title>Evaluation</title>
<p>We compared BFT, version 0.5, to SBT [
<xref ref-type="bibr" rid="CR7">7</xref>
], version 0.3.5, on a mid-class laptop with an SSD hard drive and an Intel Core i5-4300M processor cadenced at 2.6 GHz, using a single thread. It was not possible to include Khmer in this evaluation as it does not support
<inline-formula id="IEq164">
<alternatives>
<tex-math id="M331">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k > 32$$\end{document}</tex-math>
<mml:math id="M332">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>></mml:mo>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq164.gif"></inline-graphic>
</alternatives>
</inline-formula>
and building a labeled de Bruijn graph with it requires concatenated raw sequence files as input, where it is not possible to specify a minimum number of occurences per
<italic>k</italic>
-mer. Results provided in this section can differ from those reported in the preliminary version of this paper [
<xref ref-type="bibr" rid="CR2">2</xref>
] as evaluated software versions are different and computational cost of
<italic>k</italic>
-mer counting is now excluded. Also main memory usage is now provided in addition to the disk space usage. BFT and SBT were used to represent one real and one simulated pan-genome dataset. The real dataset consists of raw sequencing data from 473 clinical isolates of
<italic>Pseudomonas aeruginosa</italic>
sampled from 34 patients (NCBI BioProject PRJEB5438), resulting in 820.76 GB of FASTQ files. The simulated dataset corresponds to 154 isolates generated from 19 strains of
<italic>Yersinia pestis</italic>
. For each isolate, we used Wgsim [
<xref ref-type="bibr" rid="CR23">23</xref>
] to create 6,000,000 reads of length 100 with a substitution sequencing error rate of 0.5 %, resulting in 233.84 GB of FASTQ files. We first used KmerGenie [
<xref ref-type="bibr" rid="CR24">24</xref>
] on a subsample of the files for each dataset to estimate the best
<italic>k</italic>
-mer length and the minimum number of occurrences for considering a
<italic>k</italic>
-mer valid (not resulting from a sequencing error). A length of
<inline-formula id="IEq165">
<alternatives>
<tex-math id="M333">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=63$$\end{document}</tex-math>
<mml:math id="M334">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>63</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq165.gif"></inline-graphic>
</alternatives>
</inline-formula>
with a minimum number of 3 occurrences was selected for the real and a length of
<inline-formula id="IEq166">
<alternatives>
<tex-math id="M335">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=54$$\end{document}</tex-math>
<mml:math id="M336">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>54</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq166.gif"></inline-graphic>
</alternatives>
</inline-formula>
with a minimum of 15 occurrences for the simulated data set. The capacity
<italic>c</italic>
influences the compression ratio as well as the time for insertion and look-up. We chose a value of
<inline-formula id="IEq167">
<alternatives>
<tex-math id="M337">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c=248$$\end{document}</tex-math>
<mml:math id="M338">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>248</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq167.gif"></inline-graphic>
</alternatives>
</inline-formula>
, as it showed a good compromise in practice. The prefix length
<italic>l</italic>
determines the size of several internal structures of the BFT and how efficiently they can be stored. We selected
<inline-formula id="IEq168">
<alternatives>
<tex-math id="M339">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l=9$$\end{document}</tex-math>
<mml:math id="M340">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>9</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq168.gif"></inline-graphic>
</alternatives>
</inline-formula>
, as this limits the internal fragmentation of the memory. As the size of BFs used by the SBT software must be specified prior to the
<italic>k</italic>
-mer insertion and should be the same for all vertices, the authors of SBT suggested to estimate the number of unique
<italic>k</italic>
-mers in each dataset to design the size of BFs, at the price of an extra computation time (personal communication). Since we knew the exact number of unique
<italic>k</italic>
-mers from the BFT construction, we used this instead: 93,201,551 63-mers for the real dataset and 37,334,323 54-mers for the simulated dataset, resulting in BF sizes of 11.11 MB and 4.45 MB, respectively. We also reused unique
<italic>k</italic>
-mer counts computed by the BFT to estimate the number of hash functions to use in SBT: One hash function for the real dataset and two hash functions for the simulated dataset. Insertion time and memory usage are shown in Table 
<xref rid="Tab1" ref-type="table">1</xref>
. Insertion time and peaks of memory include the compression steps proposed by both tools, i.e., color compression and RRR compression [
<xref ref-type="bibr" rid="CR25">25</xref>
], respectively. The SBT disk sizes are given for the leaves first, since the internal vertices can be reconstructed from them, and then for the complete tree. The compressed disk size corresponds to the size of both data structures on disk, compressed using 7z [
<xref ref-type="bibr" rid="CR26">26</xref>
] with the highest compression level and LZMA2 [
<xref ref-type="bibr" rid="CR26">26</xref>
] as compression method.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Insertion time and memory usage for the real (
<italic>P. aeruginosa</italic>
) and simulated (
<italic>Y. pestis</italic>
) dataset. The compression ratio is given w.r.t. the original file sizes. Disk sizes for the SBT are given for the leaves first and then for the complete tree</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="2">
<italic>P. aeruginosa</italic>
</th>
<th align="left" colspan="2">
<italic>Y. pestis</italic>
</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">BFT</th>
<th align="left">SBT</th>
<th align="left">BFT</th>
<th align="left">SBT</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Insertion time (min)</td>
<td align="left">168.52</td>
<td align="left">371.45</td>
<td align="left">29.88</td>
<td align="left">32.67</td>
</tr>
<tr>
<td align="left">Peak of main memory (MB/compr. ratio)</td>
<td align="left">7487/112:1</td>
<td align="left">7356/114:1</td>
<td align="left">1313/182:1</td>
<td align="left">1586/151:1</td>
</tr>
<tr>
<td align="left">Disk size (MB/compr. ratio)</td>
<td align="left">1644/511:1</td>
<td align="left">2076–4572/405:1–184:1</td>
<td align="left">484/495:1 </td>
<td align="left">538–1117/445:1–214:1</td>
</tr>
<tr>
<td align="left">Compressed disk size (MB/compr. ratio)</td>
<td align="left">833/1009:1</td>
<td align="left">1906–4280/441:1–196:1</td>
<td align="left">225/1064:1</td>
<td align="left">528–1099/454:1–218:1</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>We suspect that 7z delivers such a high compression ratio for the BFT because it takes advantage of the data redundancy among the uncompressed containers, particularly among the sets of colors. Indeed, the color compression step used by the BFT is rather simple but keeps the sets of colors fully indexed and, therefore, does not penalize insertion time. In contrast, SBT uses a more advanced compression method, RRR, which explains the lower compression ratio offered by 7z. The final size of the BFT in main memory and on disk for all pan-genomes made of one up to all isolates for the real and simulated dataset is shown in Figs.
<xref rid="Fig2" ref-type="fig">2</xref>
and
<xref rid="Fig3" ref-type="fig">3</xref>
, respectively. As shown, the memory growth of the BFT is largely sub-linear with respect to the size of the input data.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>BFT main memory and disk size for pan-genomes made of one up to all
<italic>P. aeruginosa</italic>
isolates</p>
</caption>
<graphic xlink:href="13015_2016_66_Fig2_HTML" id="MO8"></graphic>
</fig>
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>BFT main memory and disk size for pan-genomes made of one up to all simulated
<italic>Y. pestis</italic>
isolates</p>
</caption>
<graphic xlink:href="13015_2016_66_Fig3_HTML" id="MO9"></graphic>
</fig>
</p>
<p>For each dataset, a set of randomly selected
<italic>k</italic>
-mers of the BFT was written to disk and reused as a batch query for both data structures. Real and simulated dataset batch queries contain ten million 63-mers and 54-mers, respectively. Query times are shown in Table 
<xref rid="Tab2" ref-type="table">2</xref>
.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Total and per
<italic>k</italic>
-mer query times for the real (
<italic>P. aeruginosa</italic>
) and simulated (
<italic>Y. pestis</italic>
) dataset with peaks of main memory</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="2">
<italic>P. aeruginosa</italic>
</th>
<th align="left" colspan="2">
<italic>Y. pestis</italic>
</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">BFT</th>
<th align="left">SBT</th>
<th align="left">BFT</th>
<th align="left">SBT</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Total query time (min)</td>
<td align="left">1.19 </td>
<td align="left">61.86</td>
<td align="left">0.57</td>
<td align="left">37.42</td>
</tr>
<tr>
<td align="left">Query time per
<italic>k</italic>
-mer (μs)</td>
<td align="left">7.14</td>
<td align="left">371.16</td>
<td align="left">3.42</td>
<td align="left">224.52</td>
</tr>
<tr>
<td align="left">Peak of main memory (MB)</td>
<td align="left">2076</td>
<td align="left">11,678</td>
<td align="left">544 </td>
<td align="left">7775</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>A third experiment gives an estimation of the time required to traverse the graph represented by a BFT: It verifies for each
<italic>k</italic>
-mer of the batch queries whether its corresponding vertex in the graph is branching. This experiment first computes information about the root in a negligible amount of time and memory. Then, the BFT is queried for its branching vertices. For the real dataset, this experiment took 55.52 s (average time of 5.55 
<inline-formula id="IEq173">
<alternatives>
<tex-math id="M341">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document}</tex-math>
<mml:math id="M342">
<mml:mi mathvariant="italic">μ</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq173.gif"></inline-graphic>
</alternatives>
</inline-formula>
s per 63-mer), resulting in 1,574,198 branching vertices. For the simulated dataset, this experiment took 38.79 s (average time of 3.88 
<inline-formula id="IEq174">
<alternatives>
<tex-math id="M343">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document}</tex-math>
<mml:math id="M344">
<mml:mi mathvariant="italic">μ</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2016_66_Article_IEq174.gif"></inline-graphic>
</alternatives>
</inline-formula>
s per 54-mer), resulting in 141,802 branching vertices.</p>
<p>In summary, in our experiments the BFT was up to two times faster to build than the SBT while using about the same amount of main memory. When written on disk, the BFT used less memory than SBT for both datasets and when compressed with 7z, the BFT was about two times smaller than the SBT. For querying
<italic>k</italic>
-mers, the BFT was about 52 to 66 times faster than the SBT while using about 5.5 to 14.3 times less memory.</p>
</sec>
<sec id="Sec14">
<title>Conclusions</title>
<p>We proposed a novel data structure called the Bloom Filter Trie for storing a pan-genome as a colored de Bruijn graph. The trie stores
<italic>k</italic>
-mers and their colors. A new representation of vertices is proposed to compress and index shared substrings. It uses four basic data structures that allow to quickly verify the presence of substrings. In the worst case, the compressed strings have a memory footprint close to their binary representation. However, we observe in practice substantial memory savings. Future work concerns the possiblity to compress non-branching paths that share the same colors [
<xref ref-type="bibr" rid="CR19">19</xref>
] and also the extraction of the different pan-genome regions.</p>
</sec>
</body>
<back>
<ack>
<title>Authors’ contributions</title>
<p>GH developed, implemented and evaluated the method. All authors wrote the paper. All authors read and approved the final manuscript.</p>
<sec id="d30e6511">
<title>Acknowledgements</title>
<p>The authors wish to thank the authors of SBT for helpful comments. GH and RW are funded by the International DFG Research Training Group GRK 1906/1.</p>
</sec>
<sec id="d30e6516">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Iqbal</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Caccamo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Turner</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Flicek</surname>
<given-names>P</given-names>
</name>
<name>
<surname>McVean</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>De novo assembly and genotyping of variants using colored de Bruijn graphs</article-title>
<source>Nat Genet</source>
<year>2012</year>
<volume>44</volume>
<issue>2</issue>
<fpage>226</fpage>
<lpage>232</lpage>
<pub-id pub-id-type="doi">10.1038/ng.1028</pub-id>
<pub-id pub-id-type="pmid">22231483</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<mixed-citation publication-type="other">Holley G, Wittler R, Stoye J. Bloom Filter Trie-a data structure for Pan-genome storage. In: Proc. of 15th Workshop on Algorithms in Bioinformatics, vol. 9289. 2015. p. 217–30.</mixed-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<mixed-citation publication-type="other">Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Digital SRC Research Report 124. 1994.</mixed-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cox</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Jakobi</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Rosone</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>11</issue>
<fpage>1415</fpage>
<lpage>1419</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts173</pub-id>
<pub-id pub-id-type="pmid">22556365</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>An experimental study of an opportunistic index</article-title>
<source>Proc. of the 12th ACM-SIAM Symposium on Discrete Algorithms</source>
<year>2001</year>
<volume>1</volume>
<fpage>269</fpage>
<lpage>278</lpage>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bloom</surname>
<given-names>BH</given-names>
</name>
</person-group>
<article-title>Space/time trade-offs in hash coding with allowable errors</article-title>
<source>Comm ACM</source>
<year>1970</year>
<volume>13</volume>
<issue>7</issue>
<fpage>422</fpage>
<lpage>426</lpage>
<pub-id pub-id-type="doi">10.1145/362686.362692</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<mixed-citation publication-type="other">Solomon B, Kingsford C. Large-scale search of transcriptomic read sets with sequence Bloom Trees. bioRxiv 017087. 2015.</mixed-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fredking</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Trie Memory</article-title>
<source>Comm ACM</source>
<year>1960</year>
<volume>3</volume>
<issue>9</issue>
<fpage>490</fpage>
<lpage>499</lpage>
<pub-id pub-id-type="doi">10.1145/367390.367400</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heinz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zobel</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>HE</given-names>
</name>
</person-group>
<article-title>Burst tries: a fast, efficient data structure for string keys</article-title>
<source>ACM Trans Inf Syst</source>
<year>2002</year>
<volume>20</volume>
<issue>2</issue>
<fpage>192</fpage>
<lpage>223</lpage>
<pub-id pub-id-type="doi">10.1145/506309.506312</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>DePristo</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Banks</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Poplin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Garimella</surname>
<given-names>KV</given-names>
</name>
<name>
<surname>Maguire</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Hartl</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Philippakis</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>del Angel</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Rivas</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Hanna</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>A framework for variation discovery and genotyping using next-generation DNA sequencing data</article-title>
<source>Nat Genet</source>
<year>2011</year>
<volume>43</volume>
<issue>5</issue>
<fpage>491</fpage>
<lpage>498</lpage>
<pub-id pub-id-type="doi">10.1038/ng.806</pub-id>
<pub-id pub-id-type="pmid">21478889</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ernst</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Rahmann</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>PanCake: a data structure for pangenomes</article-title>
<source>Proc German Conf Bioinform</source>
<year>2013</year>
<volume>34</volume>
<fpage>35</fpage>
<lpage>45</lpage>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
</person-group>
<article-title>The fragment assembly string graph</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>79</fpage>
<lpage>85</lpage>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nguyen</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Hickey</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Raney</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Earl</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Armstrong</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Paten</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Building a pangenome reference for a population</article-title>
<source>J Comput Biol</source>
<year>2015</year>
<volume>22</volume>
<issue>5</issue>
<fpage>387</fpage>
<lpage>401</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2014.0146</pub-id>
<pub-id pub-id-type="pmid">25565268</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Paten</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Diekhans</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Earl</surname>
<given-names>D</given-names>
</name>
<name>
<surname>John</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Suh</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Cactus graphs for genome comparisons</article-title>
<source>J Comput Biol</source>
<year>2011</year>
<volume>18</volume>
<issue>3</issue>
<fpage>469</fpage>
<lpage>481</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2010.0252</pub-id>
<pub-id pub-id-type="pmid">21385048</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Popic</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Short read alignment with populations of genomes</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>13</issue>
<fpage>361</fpage>
<lpage>370</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt215</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Fast and accurate short read alignment with Burrows–Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>14</issue>
<fpage>1754</fpage>
<lpage>1760</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp324</pub-id>
<pub-id pub-id-type="pmid">19451168</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wandelt</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Starlinger</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bux</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Leser</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>RCSI: Scalable similarity search in thousand(s) of genomes</article-title>
<source>Proc VLDB Endowment</source>
<year>2013</year>
<volume>6</volume>
<issue>13</issue>
<fpage>1534</fpage>
<lpage>1545</lpage>
<pub-id pub-id-type="doi">10.14778/2536258.2536265</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wandelt</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Leser</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>MRCSI: compressing and searching string collections with multiple references</article-title>
<source>Proc VLDB Endowment</source>
<year>2015</year>
<volume>8</volume>
<issue>5</issue>
<fpage>461</fpage>
<lpage>472</lpage>
<pub-id pub-id-type="doi">10.14778/2735479.2735480</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marcus</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
</person-group>
<article-title>SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>24</issue>
<fpage>3476</fpage>
<lpage>3483</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu756</pub-id>
<pub-id pub-id-type="pmid">25398610</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baier</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Beller</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Graphical pan-genome analysis with compressed suffix trees and the Burrows–Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>4</issue>
<fpage>497</fpage>
<lpage>504</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv603</pub-id>
<pub-id pub-id-type="pmid">26504144</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<mixed-citation publication-type="other">Crusoe MR, Alameldin HF, Awad S, Bucher E, Caldwell A, Cartwright R, Charbonneau A, Constantinides B, Edvenson G, Fay S, Fenton J, Fenzl T, Fish J, Garcia-Gutierrez L, Garland P, Gluck J, Gonzalez I, Guermond S, Guo J, Gupta A, Herr JR, Howe A, Hyer A, Harpfer A, Irber L, Kidd R, Lin D, Lippi J, Mansour T, McA’Nulty P, McDonald E, Mizzi J, Murray K, Nahum JR, Nanlohy K, Nederbragt AJ, Ortiz-Zuazaga H, Ory J, Pell J, Pepe-Ranney C, Russ ZN, Schwarz E, Scott C, Seaman J, Sievert S, Simpson J, Skennerton CT, Spencer J, Srinivasan R, Standage D, Stapleton JA, Steinman SR, Stein J, Taylor B, Trimble W, Wiencko HL, Wright M, Wyss B, Zhang Q, Zyme E, Brown CT. The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research 4. 2015.</mixed-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hintze</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Canino-Koning</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Howe</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>CT</given-names>
</name>
</person-group>
<article-title>Scaling metagenome sequence assembly with probabilistic de Bruijn graphs</article-title>
<source>Proc Natl Acad Sci</source>
<year>2012</year>
<volume>109</volume>
<issue>33</issue>
<fpage>13272</fpage>
<lpage>13277</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1121464109</pub-id>
<pub-id pub-id-type="pmid">22847406</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<mixed-citation publication-type="other">Li H. Wgsim. 2011.
<ext-link ext-link-type="uri" xlink:href="https://github.com/lh3/wgsim">https://github.com/lh3/wgsim</ext-link>
(Accessed 26 Nov 2015).</mixed-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Informed and automated k-mer size selection for genome assembly</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>1</issue>
<fpage>31</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt310</pub-id>
<pub-id pub-id-type="pmid">23732276</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<mixed-citation publication-type="other">Raman R, Raman V, Rao SS. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proc. of the 13h Annual ACM-SIAM Symposium on Discrete Algorithms. 2002. p. 233–242.</mixed-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<mixed-citation publication-type="other">Pavlov I. 7-Zip compressor. 1999.
<ext-link ext-link-type="uri" xlink:href="http://www.7-zip.org/">http://www.7-zip.org/</ext-link>
(Accessed 26 Nov 2015).</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0002450 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0002450 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021