Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Identifieur interne : 000245 ( Pmc/Curation ); précédent : 000244; suivant : 000246

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Auteurs : Guillaume Holley [Allemagne] ; Roland Wittler [Allemagne] ; Jens Stoye [Allemagne]

Source :

RBID : PMC:4832552

Abstract

Background

High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices.

Results

In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.

Conclusion

We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure.

Availability

https://www.github.com/GuillaumeHolley/BloomFilterTrie.


Url:
DOI: 10.1186/s13015-016-0066-8
PubMed: 27087830
PubMed Central: 4832552

Links toward previous steps (curation, corpus...)


Links to Exploration step

PMC:4832552

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage</title>
<author>
<name sortKey="Holley, Guillaume" sort="Holley, Guillaume" uniqKey="Holley G" first="Guillaume" last="Holley">Guillaume Holley</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Biotechnology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>International Research Training Group 1906, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wittler, Roland" sort="Wittler, Roland" uniqKey="Wittler R" first="Roland" last="Wittler">Roland Wittler</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Biotechnology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>International Research Training Group 1906, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Stoye, Jens" sort="Stoye, Jens" uniqKey="Stoye J" first="Jens" last="Stoye">Jens Stoye</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Biotechnology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>International Research Training Group 1906, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27087830</idno>
<idno type="pmc">4832552</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4832552</idno>
<idno type="RBID">PMC:4832552</idno>
<idno type="doi">10.1186/s13015-016-0066-8</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000245</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000245</idno>
<idno type="wicri:Area/Pmc/Curation">000245</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000245</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage</title>
<author>
<name sortKey="Holley, Guillaume" sort="Holley, Guillaume" uniqKey="Holley G" first="Guillaume" last="Holley">Guillaume Holley</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Biotechnology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>International Research Training Group 1906, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wittler, Roland" sort="Wittler, Roland" uniqKey="Wittler R" first="Roland" last="Wittler">Roland Wittler</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Biotechnology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>International Research Training Group 1906, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Stoye, Jens" sort="Stoye, Jens" uniqKey="Stoye J" first="Jens" last="Stoye">Jens Stoye</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Center for Biotechnology, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Biotechnology, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>International Research Training Group 1906, Bielefeld University, Bielefeld</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Algorithms for Molecular Biology : AMB</title>
<idno type="eISSN">1748-7188</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored
<italic>k</italic>
-mers, strings of length
<italic>k</italic>
, and stores them in vertices.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored
<italic>k</italic>
-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying
<italic>k</italic>
-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores
<italic>k</italic>
-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure.</p>
</sec>
<sec>
<title>Availability</title>
<p>
<ext-link ext-link-type="uri" xlink:href="https://www.github.com/GuillaumeHolley/BloomFilterTrie">https://www.github.com/GuillaumeHolley/BloomFilterTrie</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z Iqbal</name>
</author>
<author>
<name sortKey="Caccamo, M" uniqKey="Caccamo M">M Caccamo</name>
</author>
<author>
<name sortKey="Turner, I" uniqKey="Turner I">I Turner</name>
</author>
<author>
<name sortKey="Flicek, P" uniqKey="Flicek P">P Flicek</name>
</author>
<author>
<name sortKey="Mcvean, G" uniqKey="Mcvean G">G McVean</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cox, Aj" uniqKey="Cox A">AJ Cox</name>
</author>
<author>
<name sortKey="Bauer, Mj" uniqKey="Bauer M">MJ Bauer</name>
</author>
<author>
<name sortKey="Jakobi, T" uniqKey="Jakobi T">T Jakobi</name>
</author>
<author>
<name sortKey="Rosone, G" uniqKey="Rosone G">G Rosone</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, Bh" uniqKey="Bloom B">BH Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fredking, E" uniqKey="Fredking E">E Fredking</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heinz, S" uniqKey="Heinz S">S Heinz</name>
</author>
<author>
<name sortKey="Zobel, J" uniqKey="Zobel J">J Zobel</name>
</author>
<author>
<name sortKey="Williams, He" uniqKey="Williams H">HE Williams</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
<author>
<name sortKey="Banks, E" uniqKey="Banks E">E Banks</name>
</author>
<author>
<name sortKey="Poplin, R" uniqKey="Poplin R">R Poplin</name>
</author>
<author>
<name sortKey="Garimella, Kv" uniqKey="Garimella K">KV Garimella</name>
</author>
<author>
<name sortKey="Maguire, Jr" uniqKey="Maguire J">JR Maguire</name>
</author>
<author>
<name sortKey="Hartl, C" uniqKey="Hartl C">C Hartl</name>
</author>
<author>
<name sortKey="Philippakis, Aa" uniqKey="Philippakis A">AA Philippakis</name>
</author>
<author>
<name sortKey="Del Angel, G" uniqKey="Del Angel G">G del Angel</name>
</author>
<author>
<name sortKey="Rivas, Ma" uniqKey="Rivas M">MA Rivas</name>
</author>
<author>
<name sortKey="Hanna, M" uniqKey="Hanna M">M Hanna</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ernst, C" uniqKey="Ernst C">C Ernst</name>
</author>
<author>
<name sortKey="Rahmann, S" uniqKey="Rahmann S">S Rahmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nguyen, N" uniqKey="Nguyen N">N Nguyen</name>
</author>
<author>
<name sortKey="Hickey, G" uniqKey="Hickey G">G Hickey</name>
</author>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Raney, B" uniqKey="Raney B">B Raney</name>
</author>
<author>
<name sortKey="Earl, D" uniqKey="Earl D">D Earl</name>
</author>
<author>
<name sortKey="Armstrong, J" uniqKey="Armstrong J">J Armstrong</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
<author>
<name sortKey="Paten, B" uniqKey="Paten B">B Paten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Paten, B" uniqKey="Paten B">B Paten</name>
</author>
<author>
<name sortKey="Diekhans, M" uniqKey="Diekhans M">M Diekhans</name>
</author>
<author>
<name sortKey="Earl, D" uniqKey="Earl D">D Earl</name>
</author>
<author>
<name sortKey="John, Js" uniqKey="John J">JS John</name>
</author>
<author>
<name sortKey="Ma, J" uniqKey="Ma J">J Ma</name>
</author>
<author>
<name sortKey="Suh, B" uniqKey="Suh B">B Suh</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, L" uniqKey="Huang L">L Huang</name>
</author>
<author>
<name sortKey="Popic, V" uniqKey="Popic V">V Popic</name>
</author>
<author>
<name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wandelt, S" uniqKey="Wandelt S">S Wandelt</name>
</author>
<author>
<name sortKey="Starlinger, J" uniqKey="Starlinger J">J Starlinger</name>
</author>
<author>
<name sortKey="Bux, M" uniqKey="Bux M">M Bux</name>
</author>
<author>
<name sortKey="Leser, U" uniqKey="Leser U">U Leser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wandelt, S" uniqKey="Wandelt S">S Wandelt</name>
</author>
<author>
<name sortKey="Leser, U" uniqKey="Leser U">U Leser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcus, S" uniqKey="Marcus S">S Marcus</name>
</author>
<author>
<name sortKey="Lee, H" uniqKey="Lee H">H Lee</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Baier, U" uniqKey="Baier U">U Baier</name>
</author>
<author>
<name sortKey="Beller, T" uniqKey="Beller T">T Beller</name>
</author>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author>
<name sortKey="Hintze, A" uniqKey="Hintze A">A Hintze</name>
</author>
<author>
<name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author>
<name sortKey="Howe, A" uniqKey="Howe A">A Howe</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author>
<name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Algorithms Mol Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">Algorithms Mol Biol</journal-id>
<journal-title-group>
<journal-title>Algorithms for Molecular Biology : AMB</journal-title>
</journal-title-group>
<issn pub-type="epub">1748-7188</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27087830</article-id>
<article-id pub-id-type="pmc">4832552</article-id>
<article-id pub-id-type="publisher-id">66</article-id>
<article-id pub-id-type="doi">10.1186/s13015-016-0066-8</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Holley</surname>
<given-names>Guillaume</given-names>
</name>
<address>
<email>gholley@cebitec.uni-bielefeld.de</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wittler</surname>
<given-names>Roland</given-names>
</name>
<address>
<email>roland.wittler@uni-bielefeld.de</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Stoye</surname>
<given-names>Jens</given-names>
</name>
<address>
<email>jens.stoye@uni-bielefeld.de</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<aff id="Aff1">
<label></label>
Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany</aff>
<aff id="Aff2">
<label></label>
Center for Biotechnology, Bielefeld University, Bielefeld, Germany</aff>
<aff id="Aff3">
<label></label>
International Research Training Group 1906, Bielefeld University, Bielefeld, Germany</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>14</day>
<month>4</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>14</day>
<month>4</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection">
<year>2016</year>
</pub-date>
<volume>11</volume>
<elocation-id>3</elocation-id>
<history>
<date date-type="received">
<day>8</day>
<month>12</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>3</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>© Holley et al. 2016</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored
<italic>k</italic>
-mers, strings of length
<italic>k</italic>
, and stores them in vertices.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored
<italic>k</italic>
-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying
<italic>k</italic>
-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores
<italic>k</italic>
-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure.</p>
</sec>
<sec>
<title>Availability</title>
<p>
<ext-link ext-link-type="uri" xlink:href="https://www.github.com/GuillaumeHolley/BloomFilterTrie">https://www.github.com/GuillaumeHolley/BloomFilterTrie</ext-link>
.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Pan-genome</kwd>
<kwd>Similar genomes</kwd>
<kwd>Population genomics</kwd>
<kwd>Colored de bruijn graph</kwd>
<kwd>Bloom filter</kwd>
<kwd>Compression</kwd>
<kwd>Trie</kwd>
<kwd>Index</kwd>
<kwd>Succinct data structure</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution>International DFG Research Training Group GRK 1906/1</institution>
</funding-source>
<award-id>ID0EUJAC90</award-id>
<principal-award-recipient>
<name>
<surname>Holley</surname>
<given-names>Guillaume</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2016</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000245 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000245 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:4832552
   |texte=   Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:27087830" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021