Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000277 ( Pmc/Corpus ); précédent : 0002769; suivant : 0002780 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Efficient computation of spaced seed hashing with block indexing</title>
<author>
<name sortKey="Girotto, Samuele" sort="Girotto, Samuele" uniqKey="Girotto S" first="Samuele" last="Girotto">Samuele Girotto</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pizzi, Cinzia" sort="Pizzi, Cinzia" uniqKey="Pizzi C" first="Cinzia" last="Pizzi">Cinzia Pizzi</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30497364</idno>
<idno type="pmc">6266934</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6266934</idno>
<idno type="RBID">PMC:6266934</idno>
<idno type="doi">10.1186/s12859-018-2415-8</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000277</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000277</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Efficient computation of spaced seed hashing with block indexing</title>
<author>
<name sortKey="Girotto, Samuele" sort="Girotto, Samuele" uniqKey="Girotto S" first="Samuele" last="Girotto">Samuele Girotto</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pizzi, Cinzia" sort="Pizzi, Cinzia" uniqKey="Pizzi C" first="Cinzia" last="Pizzi">Cinzia Pizzi</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Spaced-seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substrings counting and indexing, by often providing better sensitivity with respect to
<italic>k</italic>
-mers based approaches. K-mers based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive
<italic>k</italic>
-mers. Spaced-seeds hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation.</p>
</sec>
<sec>
<title>Results</title>
<p>In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced-seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1 in the spaced seeds, that basically correspond to
<italic>k</italic>
-mer of the length of the run.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>We run several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.9x to 6.03x, depending on the structure of the spaced seeds.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2415-8) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zielezinski, A" uniqKey="Zielezinski A">A Zielezinski</name>
</author>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
<author>
<name sortKey="Karlowski, Wm" uniqKey="Karlowski W">WM Karlowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G Reinert</name>
</author>
<author>
<name sortKey="Chew, D" uniqKey="Chew D">D Chew</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
<author>
<name sortKey="Waterman, M" uniqKey="Waterman M">M Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, K" uniqKey="Song K">K Song</name>
</author>
<author>
<name sortKey="Ren, J" uniqKey="Ren J">J Ren</name>
</author>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G Reinert</name>
</author>
<author>
<name sortKey="Deng, M" uniqKey="Deng M">M Deng</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Antonello, M" uniqKey="Antonello M">M Antonello</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
<author>
<name sortKey="Ornamenti, M" uniqKey="Ornamenti M">M Ornamenti</name>
</author>
<author>
<name sortKey="Spangaro, S" uniqKey="Spangaro S">S Spangaro</name>
</author>
<author>
<name sortKey="Rombo, Se" uniqKey="Rombo S">SE Rombo</name>
</author>
<author>
<name sortKey="Parida, L" uniqKey="Parida L">L Parida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Leoni, A" uniqKey="Leoni A">A Leoni</name>
</author>
<author>
<name sortKey="Schimd, M" uniqKey="Schimd M">M Schimd</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leslie, C" uniqKey="Leslie C">C Leslie</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author>
<name sortKey="Noble, W" uniqKey="Noble W">W Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Girotto, S" uniqKey="Girotto S">S Girotto</name>
</author>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ounit, R" uniqKey="Ounit R">R Ounit</name>
</author>
<author>
<name sortKey="Wanamaker, S" uniqKey="Wanamaker S">S Wanamaker</name>
</author>
<author>
<name sortKey="Close, Tj" uniqKey="Close T">TJ Close</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
<author>
<name sortKey="Rastas, P" uniqKey="Rastas P">P Rastas</name>
</author>
<author>
<name sortKey="Ukkonen, E" uniqKey="Ukkonen E">E Ukkonen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Parida, L" uniqKey="Parida L">L Parida</name>
</author>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
<author>
<name sortKey="Rombo, Se" uniqKey="Rombo S">SE Rombo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shajii, A" uniqKey="Shajii A">A Shajii</name>
</author>
<author>
<name sortKey="Yorukoglu, D" uniqKey="Yorukoglu D">D Yorukoglu</name>
</author>
<author>
<name sortKey="William Yu, Y" uniqKey="William Yu Y">Y William Yu</name>
</author>
<author>
<name sortKey="Berger, B" uniqKey="Berger B">B Berger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S Van Dongen</name>
</author>
<author>
<name sortKey="Abreu Goodger, C" uniqKey="Abreu Goodger C">C Abreu-Goodger</name>
</author>
<author>
<name sortKey="Enright, A" uniqKey="Enright A">A Enright</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author>
<name sortKey="Kokot, M" uniqKey="Kokot M">M Kokot</name>
</author>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
<author>
<name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A Debudaj-Grabysz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Belazzougui, D" uniqKey="Belazzougui D">D Belazzougui</name>
</author>
<author>
<name sortKey="Cunial, F" uniqKey="Cunial F">F Cunial</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buhler, J" uniqKey="Buhler J">J Buhler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author>
<name sortKey="Tromp, J" uniqKey="Tromp J">J Tromp</name>
</author>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hahn, L" uniqKey="Hahn L">L Hahn</name>
</author>
<author>
<name sortKey="Leimeister, C A" uniqKey="Leimeister C">C-A Leimeister</name>
</author>
<author>
<name sortKey="Ounit, R" uniqKey="Ounit R">R Ounit</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
<author>
<name sortKey="Morgenstern, B" uniqKey="Morgenstern B">B Morgenstern</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Ilie, S" uniqKey="Ilie S">S Ilie</name>
</author>
<author>
<name sortKey="Mansouri Bigvand, A" uniqKey="Mansouri Bigvand A">A Mansouri Bigvand</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Noe, L" uniqKey="Noe L">L Noé</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leimeister, C A" uniqKey="Leimeister C">C-A Leimeister</name>
</author>
<author>
<name sortKey="Boden, M" uniqKey="Boden M">M Boden</name>
</author>
<author>
<name sortKey="Horwege, S" uniqKey="Horwege S">S Horwege</name>
</author>
<author>
<name sortKey="Lindner, S" uniqKey="Lindner S">S Lindner</name>
</author>
<author>
<name sortKey="Morgenstern, B" uniqKey="Morgenstern B">B Morgenstern</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Onodera, T" uniqKey="Onodera T">T Onodera</name>
</author>
<author>
<name sortKey="Shibuya, T" uniqKey="Shibuya T">T Shibuya</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rumble, Sm" uniqKey="Rumble S">SM Rumble</name>
</author>
<author>
<name sortKey="Lacroute, P" uniqKey="Lacroute P">P Lacroute</name>
</author>
<author>
<name sortKey="Dalca, Av" uniqKey="Dalca A">AV Dalca</name>
</author>
<author>
<name sortKey="Fiume, M" uniqKey="Fiume M">M Fiume</name>
</author>
<author>
<name sortKey="Sidow, A" uniqKey="Sidow A">A Sidow</name>
</author>
<author>
<name sortKey="Brudno, M" uniqKey="Brudno M">M Brudno</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bucher, P" uniqKey="Bucher P">P Bücher</name>
</author>
<author>
<name sortKey="Moret, Bme" uniqKey="Moret B">BME Moret</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="B Inda, K" uniqKey="B Inda K">K Břinda</name>
</author>
<author>
<name sortKey="Sykulski, M" uniqKey="Sykulski M">M Sykulski</name>
</author>
<author>
<name sortKey="Kucherov, G" uniqKey="Kucherov G">G Kucherov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Girotto, S" uniqKey="Girotto S">S Girotto</name>
</author>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ounit, R" uniqKey="Ounit R">R Ounit</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brown, Dg" uniqKey="Brown D">DG Brown</name>
</author>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mohamadi, H" uniqKey="Mohamadi H">H Mohamadi</name>
</author>
<author>
<name sortKey="Chu, J" uniqKey="Chu J">J Chu</name>
</author>
<author>
<name sortKey="Vandervalk, Bp" uniqKey="Vandervalk B">BP Vandervalk</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Girotto, S" uniqKey="Girotto S">S Girotto</name>
</author>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Girotto, S" uniqKey="Girotto S">S Girotto</name>
</author>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Keich, U" uniqKey="Keich U">U Keich</name>
</author>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author>
<name sortKey="Tromp, J" uniqKey="Tromp J">J Tromp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Girotto, S" uniqKey="Girotto S">S Girotto</name>
</author>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, De" uniqKey="Wood D">DE Wood</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leimeister, C A" uniqKey="Leimeister C">C-A Leimeister</name>
</author>
<author>
<name sortKey="Sohrabi Jahromi, S" uniqKey="Sohrabi Jahromi S">S Sohrabi-Jahromi</name>
</author>
<author>
<name sortKey="Morgenstern, B" uniqKey="Morgenstern B">B Morgenstern</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30497364</article-id>
<article-id pub-id-type="pmc">6266934</article-id>
<article-id pub-id-type="publisher-id">2415</article-id>
<article-id pub-id-type="doi">10.1186/s12859-018-2415-8</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Efficient computation of spaced seed hashing with block indexing</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Girotto</surname>
<given-names>Samuele</given-names>
</name>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Comin</surname>
<given-names>Matteo</given-names>
</name>
<address>
<email>comin@dei.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Pizzi</surname>
<given-names>Cinzia</given-names>
</name>
<address>
<email>cinzia.pizzi@dei.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1757 3470</institution-id>
<institution-id institution-id-type="GRID">grid.5608.b</institution-id>
<institution>Department of Information Engineering, University of Padova,</institution>
</institution-wrap>
via Gradenigo 6/A, Padova, Italy</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>30</day>
<month>11</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>30</day>
<month>11</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>19</volume>
<issue>Suppl 15</issue>
<issue-sponsor>Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</issue-sponsor>
<elocation-id>441</elocation-id>
<permissions>
<copyright-statement>© The Author(s) 2018</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Spaced-seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substrings counting and indexing, by often providing better sensitivity with respect to
<italic>k</italic>
-mers based approaches. K-mers based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive
<italic>k</italic>
-mers. Spaced-seeds hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation.</p>
</sec>
<sec>
<title>Results</title>
<p>In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced-seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1 in the spaced seeds, that basically correspond to
<italic>k</italic>
-mer of the length of the run.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>We run several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.9x to 6.03x, depending on the structure of the spaced seeds.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2415-8) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Spaced seeds</kwd>
<kwd>k-mers</kwd>
<kwd>Efficient computation of hashing</kwd>
</kwd-group>
<conference xlink:href="https://www.bbcc-meetings.it/">
<conf-name>BBCC Conference 2017</conf-name>
<conf-acronym>BBCC 2017</conf-acronym>
<conf-loc>Naples, Italy</conf-loc>
<conf-date>18 - 20 December 2017</conf-date>
</conference>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>
<italic>k</italic>
-mers counting, indexing and searching are fundamental operations at the very basis of many bioinformatics tools. A most notable example is their exploitation on sequence similarity search for which the “hit-and-extend” method introduced by BLAST [
<xref ref-type="bibr" rid="CR1">1</xref>
] led to a revolutionary fast and sensitive approach for local alignment. In the “hit” step exact matches of
<italic>k</italic>
-mers (
<italic>k</italic>
=11 for DNA) between two sequences are detected. Next, potential candidates are extended to obtain a local alignment with high statistical significance. BLAST has long been one of the most used tools for the analysis of omics sequences.</p>
<p>
<italic>k</italic>
-mers profiles are also widely used in alignment-free techniques [
<xref ref-type="bibr" rid="CR2">2</xref>
] for the definition of statistical scores for sequence comparison [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR4">4</xref>
], finding application on a broad range of bioinformatics problems (e.g. [
<xref ref-type="bibr" rid="CR5">5</xref>
<xref ref-type="bibr" rid="CR13">13</xref>
]), and pushing the development and usage of time and space efficient algorithms and data structures for
<italic>k</italic>
-mer counting and indexing (e.g. [
<xref ref-type="bibr" rid="CR14">14</xref>
<xref ref-type="bibr" rid="CR18">18</xref>
]).</p>
<p>Although the matching of contiguous
<italic>k</italic>
-mers is largely used in sequence analysis, the use of not consecutive matches, i.e.
<italic>spaced seeds</italic>
, can lead in principle to more sensitive results [
<xref ref-type="bibr" rid="CR19">19</xref>
]. This is because spaced seeds offer the advantage, with respect to
<italic>k</italic>
-mers, of considering positions that are not consecutive, hence statistically less dependent. On the other side, the problem of maximizing the spaced seeds sensitivity is known to be NP-hard [
<xref ref-type="bibr" rid="CR20">20</xref>
]. The design of effective spaced seeds has been addressed in several studies [
<xref ref-type="bibr" rid="CR21">21</xref>
<xref ref-type="bibr" rid="CR24">24</xref>
]. Nowadays, spaced seeds have replaced traditional
<italic>k</italic>
-mers based approaches in the design of state-of-the-art solutions to several problems that involve sequence comparison. Among others we can enlist: phylogenetic tree reconstruction [
<xref ref-type="bibr" rid="CR25">25</xref>
], protein classification [
<xref ref-type="bibr" rid="CR26">26</xref>
], mapping of reads [
<xref ref-type="bibr" rid="CR27">27</xref>
], multiple sequence alignment [
<xref ref-type="bibr" rid="CR28">28</xref>
], metagenomics binning and classification [
<xref ref-type="bibr" rid="CR29">29</xref>
<xref ref-type="bibr" rid="CR31">31</xref>
]. The literature on spaced seeds is vast, and we refer the interest reader to [
<xref ref-type="bibr" rid="CR32">32</xref>
] for a survey.</p>
<p>Several routine operations on large scale sequence analysis, including building and querying indexes, and searching for similarity among sequences, are based on
<italic>k</italic>
-mers counting. In order to speed-up
<italic>k</italic>
-mers counting, hashing is often used. In fact, hashing consecutive
<italic>k</italic>
-mers is fast and simple, since the hash of a
<italic>k</italic>
-mer starting at position
<italic>i</italic>
can be computed from the hash of the
<italic>k</italic>
-mer at position
<italic>i</italic>
−1 with few operations, since they share
<italic>k</italic>
−1 symbols [
<xref ref-type="bibr" rid="CR33">33</xref>
].</p>
<p>Unfortunately, this property no longer holds for spaced seeds, due to the presence of “don’t care” positions, leading to a slowdown of the whole analysis. A good example of this effect is the metagenomic read classifier Clark [
<xref ref-type="bibr" rid="CR10">10</xref>
]. Its spaced seed counterpart, Clark-S [
<xref ref-type="bibr" rid="CR31">31</xref>
], has a better classification quality, but a drop from 3.5M to 200k reads per minute on classification rate with respect to Clark. Slow downs when using spaced seeds has also been shown in [
<xref ref-type="bibr" rid="CR26">26</xref>
,
<xref ref-type="bibr" rid="CR27">27</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
].</p>
<p>The problem of speeding up the computation of spaced seed hashing for each position in a given sequence was recently addressed in [
<xref ref-type="bibr" rid="CR34">34</xref>
,
<xref ref-type="bibr" rid="CR35">35</xref>
] where FSH, an approach based on spaced seed self-correlation, was proposed reporting a speed-up of 1.5x, on average, with respect to the standard way to compute spaced seed hashing. In this paper we address the same problem, considering the Rabin-Karp rolling hash.</p>
<p>The novel approach we present here, FISH, is based on the decomposition of the spaced seed mask into blocks of consecutive 1s. These blocks represent contiguous matches, i.e.
<italic>k</italic>
-mers of the specified length. Since the hashing of
<italic>k</italic>
-mers is a very fast operation, we reduced the problem of spaced seed hashing to the problem of hashing its
<italic>k</italic>
-mer components and then combined them in order to obtain the hashing of the complete spaced seed. We performed a wide set of experiments, using several spaced seeds, varying in terms of length and weight, and NGS datasets with different read lengths. Our approach proved to be faster than the standard approach, and also of FSH. We extended our algorithm and experiments also to the multiple spaced seed hashing framework, obtaining an average speed-up with respect to standard indexing of 6x.</p>
<p>In the next sections we will present our approach and the results of our experiments, discussing the performances of our approach under different settings.</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<p>In this section we start by recalling some formal definitions about spaced seeds through the notation introduced in [
<xref ref-type="bibr" rid="CR36">36</xref>
], and then we will describe our algorithm to compute the spaced seed hashing of each position in a given input string, a fundamental step in many applications [
<xref ref-type="bibr" rid="CR25">25</xref>
<xref ref-type="bibr" rid="CR29">29</xref>
,
<xref ref-type="bibr" rid="CR31">31</xref>
].</p>
<sec id="Sec3">
<title>Fundamental concepts on spaced seeds</title>
<sec id="d29e441">
<title>
<bold>Definition 1</bold>
</title>
<p>(Spaced seed.) A
<italic>spaced-seed S</italic>
(or just a seed) is a binary string of length
<italic>k</italic>
, where the symbol ‘
<italic>1</italic>
’ requires a match in that position, while a symbol ‘
<italic>0</italic>
’ allows for “don’t care”. A spaced seed is characterized by its length
<italic>k</italic>
and by its weight
<italic>W</italic>
<
<italic>k</italic>
, which is the number of 1s in the string. A spaced seed always begins and ends with a
<italic>1</italic>
.</p>
</sec>
<sec id="d29e472">
<title>
<bold>Definition 2</bold>
</title>
<p>(The shape Q of a spaced seed.) The
<italic>shape Q</italic>
of a spaced seed is the set of non negative integers that correspond to the positions of the spaced seed where there is a
<italic>1</italic>
. The shape
<italic>Q</italic>
can describe a spaced seed completely: the weight
<italic>W</italic>
is equal to |
<italic>Q</italic>
|, and its span (or length)
<italic>s</italic>
(
<italic>Q</italic>
) is given by max
<italic>Q</italic>
+1.</p>
</sec>
<sec id="d29e503">
<title>
<bold>Definition 3</bold>
</title>
<p>(The positioned shape
<italic>i</italic>
+
<italic>Q</italic>
.) Given any integer
<italic>i</italic>
and shape
<italic>Q</italic>
, we define the positioned shape
<italic>i</italic>
+
<italic>Q</italic>
as the set {
<italic>i</italic>
+
<italic>k</italic>
,
<italic>k</italic>
<italic>Q</italic>
}.</p>
</sec>
<sec id="d29e540">
<title>
<bold>Definition 4</bold>
</title>
<p>(
<italic>Q</italic>
-gram.) For any position
<italic>i</italic>
in the string
<italic>x</italic>
=
<italic>x</italic>
<sub>0</sub>
<italic>x</italic>
<sub>1</sub>
<italic>x</italic>
<sub>
<italic>n</italic>
−1</sub>
, with 0≤
<italic>i</italic>
<italic>n</italic>
<italic>s</italic>
(
<italic>Q</italic>
), let us consider the positioned shape
<italic>i</italic>
+
<italic>Q</italic>
={
<italic>i</italic>
<sub>0</sub>
,
<italic>i</italic>
<sub>1</sub>
,…,
<italic>i</italic>
<sub>
<italic>W</italic>
−1</sub>
}, where
<italic>i</italic>
<sub>0</sub>
<
<italic>i</italic>
<sub>1</sub>
<...<
<italic>i</italic>
<sub>
<italic>W</italic>
−1</sub>
. The
<italic>Q</italic>
-gram
<italic>x</italic>
[
<italic>i</italic>
+
<italic>Q</italic>
], starting at position
<italic>i</italic>
in
<italic>x</italic>
, is the string of length |
<italic>Q</italic>
| described by
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$x_{i_{0}} x_{i_{1}} \dots x_{i_{W-1}}$\end{document}</tex-math>
<mml:math id="M2">
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
<p>
<bold>Example</bold>
Let us consider the string
<italic>x</italic>
=
<italic>ACTGACTGGATTGAC</italic>
, and a spaced seed 1101110011111. Then the shape of the spaced seed is
<italic>Q</italic>
={0,1,3,4,5,8,9,10,11,12}, its weight is |
<italic>Q</italic>
|=10 and its span is
<italic>s</italic>
(
<italic>Q</italic>
)=13. The
<italic>Q</italic>
-gram
<italic>x</italic>
[0+
<italic>Q</italic>
] is given by the concatenation of the symbols that occur at positions 0+
<italic>Q</italic>
={0,1,3,4,5,8,9,10,11,12},
<italic>x</italic>
[ 0+
<italic>Q</italic>
]=
<italic>ACGACGATTG</italic>
:</p>
<p>
<graphic position="anchor" xlink:href="12859_2018_2415_Figa_HTML" id="MO1"></graphic>
</p>
<p>Similarly the other
<italic>Q</italic>
-grams are given by the concatenations of the symbols at positions 1+
<italic>Q</italic>
={1,2,4,5,6,9,10,11,12,13}:
<italic>x</italic>
[1+
<italic>Q</italic>
]=
<italic>CTACTATTGA</italic>
; and 2+
<italic>Q</italic>
={2,3,5,6,7,10,11,12,13,14}:
<italic>x</italic>
[2+
<italic>Q</italic>
]=
<italic>TGCTGTTGAC</italic>
.</p>
<p>Now we can formally state our problem such as:</p>
</sec>
<sec id="d29e781">
<title>
<bold>Problem 1</bold>
</title>
<p>Let
<italic>x</italic>
=
<italic>x</italic>
<sub>0</sub>
<italic>x</italic>
<sub>1</sub>
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
<italic>x</italic>
<sub>
<italic>n</italic>
−1</sub>
be a string of length
<italic>n</italic>
,
<italic>Q</italic>
be a spaced seed, and
<italic>h</italic>
be a hash function that maps a string into a binary codeword. Compute the hash
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {H}(x,Q)$\end{document}</tex-math>
<mml:math id="M4">
<mml:mi mathvariant="script"></mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
for each
<italic>Q</italic>
-gram of the string
<italic>x</italic>
, following in the natural order from the first position 0 to the last position
<italic>n</italic>
<italic>s</italic>
(
<italic>Q</italic>
).
<disp-formula id="Equa">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${}\mathcal{H}(x,Q) = \langle h(x[0+Q]), h(x[1+Q]), \dots h(x[n-s(Q)]) \rangle $$ \end{document}</tex-math>
<mml:math id="M6">
<mml:mrow>
<mml:mi mathvariant="script"></mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo></mml:mo>
<mml:mi>h</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>h</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mi>h</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>[</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo></mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12859_2018_2415_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
</sec>
</sec>
<sec id="Sec4">
<title>Spaced seed hashing</title>
<p>The first step when computing the hash of a string defined over an alphabet
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {A}$\end{document}</tex-math>
<mml:math id="M8">
<mml:mi mathvariant="script">A</mml:mi>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
is to encode it into a binary string. For genomic sequences the simplest encoding consists in the definition of a function
<italic>encode</italic>
which maps the four nucleotides as follows:
<italic>encode</italic>
(
<italic>A</italic>
)=00,
<italic>encode</italic>
(
<italic>C</italic>
)=01,
<italic>encode</italic>
(
<italic>G</italic>
)=10,
<italic>encode</italic>
(
<italic>T</italic>
)=11. Given this mapping, we can compute the encodings of all symbols of the
<italic>Q</italic>
-gram
<italic>x</italic>
[0+
<italic>Q</italic>
]:
<disp-formula id="Equb">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{ccccccccccc} x[0+Q] &A&C&G&A&C&G&A&T&T&G\\ {encodings}&00&01&10&00&01&10&00&11&11&10 \end{array} $$ \end{document}</tex-math>
<mml:math id="M10">
<mml:mrow>
<mml:mtable class="array" columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>x</mml:mi>
<mml:mo>[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>]</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mi>A</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>C</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>G</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>A</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>C</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>G</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>A</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>T</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>T</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mi>G</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">encodings</mml:mtext>
</mml:mtd>
<mml:mtd>
<mml:mn>00</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>01</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>10</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>00</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>01</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>10</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>00</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>11</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>11</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>10</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="12859_2018_2415_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Here we focus on the efficient computation of the Rabin-Karp rolling hash. In the case of DNA sequences since
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$|\mathcal {A}|=4$\end{document}</tex-math>
<mml:math id="M12">
<mml:mo>|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>4</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
is a power of 2, the multiplications can be implemented with a shift operation. More formally, for any given position
<italic>i</italic>
of the string
<italic>x</italic>
=
<italic>x</italic>
<sub>0</sub>
<italic>x</italic>
<sub>1</sub>
<italic>x</italic>
<sub>
<italic>n</italic>
−1</sub>
, we define the hashing
<italic>h</italic>
(
<italic>x</italic>
[
<italic>i</italic>
+
<italic>Q</italic>
]) of the
<italic>Q</italic>
-gram
<italic>x</italic>
[
<italic>i</italic>
+
<italic>Q</italic>
] as:
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ h(x[\!i+Q]) = \bigvee_{k \in Q} \left[(encode(x_{i+k}) \ll (m(k)*{log}_{2}|\mathcal{A}|)\right] $$ \end{document}</tex-math>
<mml:math id="M14">
<mml:mi>h</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>Q</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mfenced close="]" open="[" separators="">
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mtext mathvariant="italic">encode</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">log</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:math>
<graphic xlink:href="12859_2018_2415_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>m</italic>
(
<italic>k</italic>
)=|{
<italic>i</italic>
<italic>Q</italic>
, such that
<italic>i</italic>
<
<italic>k</italic>
}|, i.e. given a position
<italic>k</italic>
in the spaced seed,
<italic>m</italic>
(
<italic>k</italic>
) holds the number of 1s to the left of
<italic>k</italic>
. Since each symbol is encoded with 2 bits,
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$m(k)*{log}_{2}|\mathcal {A}|$\end{document}</tex-math>
<mml:math id="M16">
<mml:mi>m</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">log</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>|</mml:mo>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
gives the number of shifts to set the encoding of the
<italic>k</italic>
-th symbol in the right position.</p>
<p>In Table 
<xref rid="Tab1" ref-type="table">1</xref>
we report a step-by-step computation of hashing value for the
<italic>Q</italic>
-gram
<italic>x</italic>
[0+
<italic>Q</italic>
] (up to length 6 just for page width limits constrains). With respect to the above example, the hashing value associated to the
<italic>Q</italic>
-gram
<italic>ACGACGATTG</italic>
simply corresponds to the list of encodings in Little-endian: 10111100100100100100. The hashing values for the others
<italic>Q</italic>
-grams can be determined through the function
<italic>h</italic>
(
<italic>x</italic>
[
<italic>i</italic>
+
<italic>Q</italic>
]) with a similar procedure. Following the above example the hashing values for the
<italic>Q</italic>
-grams
<italic>x</italic>
[ 1+
<italic>Q</italic>
]=
<italic>CTACTATTGA</italic>
and
<italic>x</italic>
[ 2+
<italic>Q</italic>
]=
<italic>TGCTGTTGAC</italic>
are, respectively, 00101111001101001101 and 10001011111011011011.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Step-by-step computation of the encoding of the prefix of length 6 of the
<italic>Q</italic>
-gram
<italic>x</italic>
[ 0+
<italic>Q</italic>
] in little-endian notation using Eq. (
<xref rid="Equ1" ref-type="">1</xref>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">0</th>
<th align="left">1</th>
<th align="left">2</th>
<th align="left">3</th>
<th align="left">4</th>
<th align="left">5</th>
<th align="left">6</th>
<th align="left">7</th>
<th align="left">8</th>
<th align="left">9</th>
</tr>
</thead>
<tbody>
<tr>
<td align="justify">x</td>
<td align="left">A</td>
<td align="left">C</td>
<td align="left">T</td>
<td align="left">G</td>
<td align="left">A</td>
<td align="left">C</td>
<td align="left">T</td>
<td align="left">G</td>
<td align="left">G</td>
<td align="left">A</td>
</tr>
<tr>
<td align="justify">Q</td>
<td align="left">1</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">1</td>
<td align="left">1</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">1</td>
<td align="left">1</td>
</tr>
<tr>
<td align="justify">m</td>
<td align="left">0</td>
<td align="left">1</td>
<td align="left">2</td>
<td align="left">2</td>
<td align="left">3</td>
<td align="left">4</td>
<td align="left">4</td>
<td align="left">5</td>
<td align="left">5</td>
<td align="left">6</td>
</tr>
<tr>
<td align="justify">Shifted-encodings</td>
<td align="left">00</td>
<td align="left">01 ≪ 2</td>
<td align="left"></td>
<td align="left">10 ≪ 4</td>
<td align="left">00 ≪ 6</td>
<td align="left">01 ≪ 8</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">10 ≪ 10</td>
<td align="left"></td>
</tr>
<tr>
<td align="justify"></td>
<td align="left"></td>
<td align="left" colspan="2">0100</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="justify"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left" colspan="2">100100</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="justify"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left" colspan="2">00100100</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="justify"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left" colspan="2">0100100100</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="justify"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left" colspan="2">100100100100</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>The Rabin-Karp rolling hash is very intuitive. However, other hashing functions, that can be more appropriate because they have some properties such as universality, uniform distribution in the output space, and higher-order independence [
<xref ref-type="bibr" rid="CR33">33</xref>
], can be computed in a similar way. For example, one could use the cyclic polynomial rolling hash by replacing: shifts with rotations, OR with XOR, and the function
<italic>e</italic>
<italic>n</italic>
<italic>c</italic>
<italic>o</italic>
<italic>d</italic>
<italic>e</italic>
(·) in Eq. (
<xref rid="Equ1" ref-type="">1</xref>
) with a seed table where the letters of the DNA alphabet are assigned different random 64-bit integers.</p>
<p>Equation (
<xref rid="Equ1" ref-type="">1</xref>
) can be directly used to address Problem
<xref rid="Sec3" ref-type="sec">1</xref>
by applying it at each position in
<italic>x</italic>
. However, for each position the computation of the hashing function
<italic>h</italic>
(
<italic>x</italic>
[
<italic>i</italic>
+
<italic>Q</italic>
]) requires to extract and encode a number of symbols that is equal to the weight of the seed |
<italic>Q</italic>
| or, in other words, each symbol of
<italic>x</italic>
is read and encoded into the hash |
<italic>Q</italic>
| times. Therefore this solution can be very time consuming.</p>
</sec>
<sec id="Sec5">
<title>Computing spaced seed hashing with block indexing</title>
<p>In the following we describe our contribution for the computation of hashing values through Fast Indexing of Spaced seeds Hashings (FISH). Let
<italic>Q</italic>
={
<italic>i</italic>
<sub>1</sub>
,
<italic>i</italic>
<sub>2</sub>
,…
<italic>i</italic>
<sub>
<italic>k</italic>
</sub>
} be a spaced seed. It can be viewed as a series of runs of 1s, or unit blocks, interspersed with runs of 0s. First, we disassemble
<italic>Q</italic>
into its constituents unit blocks and we define the set
<italic>B</italic>
of starting positions of the unit blocks as:
<disp-formula id="Equc">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$B = \{0\} \cup \left\{i_{j} \in Q\setminus\{0\} \text{ such that } i_{j}-i_{j-1} >1\right\} $$ \end{document}</tex-math>
<mml:math id="M18">
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>}</mml:mo>
<mml:mo></mml:mo>
<mml:mfenced close="}" open="{" separators="">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo></mml:mo>
<mml:mo>{</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>}</mml:mo>
<mml:mtext>such that</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<graphic xlink:href="12859_2018_2415_Article_Equc.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Given
<italic>B</italic>
={
<italic>b</italic>
<sub>1</sub>
,
<italic>b</italic>
<sub>2</sub>
,…,
<italic>b</italic>
<sub>
<italic>t</italic>
</sub>
}, let
<italic>B</italic>
<sub>
<italic>L</italic>
</sub>
={
<italic>l</italic>
<sub>1</sub>
,
<italic>l</italic>
<sub>2</sub>
,…,
<italic>l</italic>
<sub>
<italic>t</italic>
</sub>
} be the (ordered) set of the lengths corresponding to each unit block. To compute the hashing of a spaced seed on a sequence
<italic>x</italic>
of length
<italic>n</italic>
, the FISH algorithm will scan
<italic>x</italic>
for fast hashing of
<italic>l</italic>
-mers whose lengths are in
<italic>B</italic>
<sub>
<italic>L</italic>
</sub>
. For each length
<italic>l</italic>
<italic>B</italic>
<sub>
<italic>L</italic>
</sub>
an array
<italic>T</italic>
<sub>
<italic>l</italic>
</sub>
of length
<italic>n</italic>
<italic>l</italic>
+1 is built where at position
<italic>i</italic>
the hash of the l-mer
<italic>x</italic>
[
<italic>i</italic>
,
<italic>i</italic>
+
<italic>l</italic>
−1] is stored. This pre-processing is very fast, as it can exploit the large overlap (
<italic>l</italic>
−1 symbols) between consecutive
<italic>l</italic>
-mers in order to compute the hashing of consecutive positions in constant time.</p>
<p>Then, to compute the hash of the Q-gram identified by the position shape
<italic>i</italic>
+
<italic>Q</italic>
, we proceed as follows. For each unit block
<italic>b</italic>
<sub>
<italic>j</italic>
</sub>
of length
<italic>l</italic>
<sub>
<italic>j</italic>
</sub>
we look up at the array
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$T_{l_{j}}$\end{document}</tex-math>
<mml:math id="M20">
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
, and specifically to the value stored at position
<italic>i</italic>
+
<italic>b</italic>
<sub>
<italic>j</italic>
</sub>
. Let
<italic>h</italic>
<sub>
<italic>j</italic>
</sub>
be such value. The hashing of the Q-gram is then computed by shifting
<italic>h</italic>
<sub>
<italic>j</italic>
</sub>
of 2×
<italic>m</italic>
(
<italic>b</italic>
<sub>
<italic>j</italic>
</sub>
) positions to the left. This process is repeated for all unit blocks and the contributions of each block are summed (bitwise OR).</p>
<sec id="d29e1974">
<title>
<bold>Example 1</bold>
</title>
<p>Let us consider again the string
<italic>x</italic>
=
<italic>A</italic>
<italic>C</italic>
<italic>TGACTGGATTGACTCC</italic>
and the spaced seed
<italic>S</italic>
=1101 110011111, with associated shape
<italic>Q</italic>
={0,1,3,4,5,8,9,10,11,12},
<italic>m</italic>
={0,1,2,2,3,4,4,5,5,6,7,8,9,10}, and blocks with starting positions
<italic>B</italic>
={0,3,8}, and lengths
<italic>B</italic>
<sub>
<italic>L</italic>
</sub>
={2,3,5}. To compute the hashing of the
<italic>Q</italic>
-gram
<italic>x</italic>
[0+
<italic>Q</italic>
] we must look up at
<italic>T</italic>
<sub>2</sub>
[0] to retrieve the value of
<italic>h</italic>
<sub>1</sub>
=0100, at
<italic>T</italic>
<sub>3</sub>
[3] to retrieve the value of
<italic>h</italic>
<sub>2</sub>
=010010, and at
<italic>T</italic>
<sub>5</sub>
to retrieve
<italic>h</italic>
<sub>3</sub>
=1011110010 (see Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
).
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>The hashing of each unit block in the spaced seed is looked up in the corresponding length
<italic>k</italic>
-mer table</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig1_HTML" id="MO2"></graphic>
</fig>
</p>
<p>Then the hashings need to be combined to obtain the final hash value of
<italic>x</italic>
[ 0+
<italic>Q</italic>
]:
<disp-formula id="Equd">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{lll} {}H(ACGACGATTG) &=& (h_{1} \! \ll \! 2 \cdot m(b_{1}))\! \vee \! (h_{2} \! \ll \! 2 \cdot m(b_{2})) \! \vee \! (h_{3} \! \ll \! 2 \! \cdot \! m(\!b_{3}\!))\\ &=& (0100 \ll 0) \vee (010010 \ll 4) \vee (1011110010 \ll 10)\\ &=& 10111100100100100100\\ \end{array} $$ \end{document}</tex-math>
<mml:math id="M22">
<mml:mrow>
<mml:mtable class="array" columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>H</mml:mi>
<mml:mo>(</mml:mo>
<mml:mtext mathvariant="italic">ACGACGATTG</mml:mtext>
<mml:mo>)</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo></mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>2</mml:mn>
<mml:mo>·</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>)</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo></mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo></mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>2</mml:mn>
<mml:mo>·</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>)</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo></mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo></mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>2</mml:mn>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>·</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mi>m</mml:mi>
<mml:mo>(</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>)</mml:mo>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mo>(</mml:mo>
<mml:mn>0100</mml:mn>
<mml:mo></mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:mo>(</mml:mo>
<mml:mn>010010</mml:mn>
<mml:mo></mml:mo>
<mml:mn>4</mml:mn>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:mo>(</mml:mo>
<mml:mn>1011110010</mml:mn>
<mml:mo></mml:mo>
<mml:mn>10</mml:mn>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mn>10111100100100100100</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr></mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="12859_2018_2415_Article_Equd.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
</sec>
</sec>
<sec id="Sec6">
<title>Computing multiple spaced seed hashing with block indexing</title>
<p>In some applications (for example [
<xref ref-type="bibr" rid="CR25">25</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
<xref ref-type="bibr" rid="CR31">31</xref>
,
<xref ref-type="bibr" rid="CR37">37</xref>
]) using several spaced seeds increases the sensitivity of the results. In such a context, the FISH algorithm can be further exploited to improve the speed up with respect to the computation of the
<italic>Q</italic>
-grams hashing of each spaced seed separately. In fact, if two spaced seeds share a unit block of the same length
<italic>l</italic>
, we will need to compute the hashing of the
<italic>l</italic>
-mers of the input string just once, and then access the corresponding array
<italic>T</italic>
<sub>
<italic>l</italic>
</sub>
when computing the full hash of
<italic>Q</italic>
-grams for the two different spaced seeds.</p>
<p>More formally, let
<italic>Q</italic>
<sub>1</sub>
,
<italic>Q</italic>
<sub>2</sub>
,…
<italic>Q</italic>
<sub>
<italic>n</italic>
</sub>
be
<italic>n</italic>
spaced seeds. Let
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$B_{L}^{Q_{i}} =\{l_{1}^{Q_{i}}, l_{2}^{Q_{i}}, {,} \dots, l_{t_{i}}^{Q_{i}} \}$\end{document}</tex-math>
<mml:math id="M24">
<mml:msubsup>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:mo>}</mml:mo>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
be the set of lengths of the unit blocks of the spaced seed with shape
<italic>Q</italic>
<sub>
<italic>i</italic>
</sub>
, for
<italic>i</italic>
=1,…,
<italic>n</italic>
. Let
<inline-formula id="IEq8">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\tilde {B}_{L} =\cup _{i=1}^{n} B_{L}^{Q_{i}}$\end{document}</tex-math>
<mml:math id="M26">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
be the superset of all different unit block lengths among the spaced seeds we are considering. We will compute the hashing tables of each
<italic>l</italic>
-mer, with
<inline-formula id="IEq9">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$l \in \tilde {B}_{L}$\end{document}</tex-math>
<mml:math id="M28">
<mml:mi>l</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="12859_2018_2415_Article_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
, in the input string
<italic>x</italic>
just once. These tables will be used for all spaced seeds so that if two spaced seeds share a unit block, the corresponding table will be computed only once. When we need to reconstruct the hash for the
<italic>Q</italic>
-gram intercepted by the spaced seed
<italic>Q</italic>
<sub>
<italic>i</italic>
</sub>
at position
<italic>j</italic>
in
<italic>x</italic>
, i.e.
<italic>x</italic>
[
<italic>j</italic>
+
<italic>Q</italic>
<sub>
<italic>i</italic>
</sub>
], FISH will proceed as before by looking up at the
<italic>T</italic>
<sub>
<italic>l</italic>
</sub>
corresponding to the lengths of the blocks in the spaced seed
<italic>Q</italic>
<sub>
<italic>i</italic>
</sub>
.</p>
</sec>
</sec>
<sec id="Sec7" sec-type="results">
<title>Results</title>
<p>In this section we will discuss the time performance of the block indexing based approach FISH, presented here, and the FSH approach [
<xref ref-type="bibr" rid="CR35">35</xref>
]. The speed ups are computed with respect to the time needed for the standard computation of spaced seeds hashing, where the hashing of each
<italic>k</italic>
-mer intercepted by the spaced seed is computed separately for each position in the input string as in Eq. (
<xref rid="Equ1" ref-type="">1</xref>
).</p>
<sec id="Sec8">
<title>Spaced seeds and datasets description</title>
<p>In order to evaluate the performance of FISH we design a series of tests with different type of spaced seeds and various reads datasets. For our experiments we used the same spaced seeds and datasets used in [
<xref ref-type="bibr" rid="CR34">34</xref>
] covering three types of spaced seeds: i) maximizing the hit probability [
<xref ref-type="bibr" rid="CR31">31</xref>
]; ii) minimizing the overlap complexity [
<xref ref-type="bibr" rid="CR23">23</xref>
]; and iii) maximizing the sensitivity [
<xref ref-type="bibr" rid="CR21">21</xref>
].</p>
<p>In line with previous studies, we evaluate nine spaced seeds, three for each category. The spaced seeds used for this test are shown in Table 
<xref rid="Tab2" ref-type="table">2</xref>
. All spaced seeds
<italic>Q</italic>
1−
<italic>Q</italic>
9 (see Table 
<xref rid="Tab2" ref-type="table">2</xref>
) have the same weight |
<italic>Q</italic>
<italic>i</italic>
|=22 and length
<italic>L</italic>
=31.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>The nine spaced seeds used in the experiments grouped according to their type</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" colspan="2">Spaced seeds maximizing the hit probability [
<xref ref-type="bibr" rid="CR31">31</xref>
]</td>
</tr>
<tr>
<td align="left">Q1</td>
<td align="left">1111011101110010111001011011111</td>
</tr>
<tr>
<td align="left">Q2</td>
<td align="left">1111101011100101101110011011111</td>
</tr>
<tr>
<td align="left">Q3</td>
<td align="left">1111101001110101101100111011111</td>
</tr>
<tr>
<td align="left" colspan="2">Spaced seeds minimizing the overlap complexity [
<xref ref-type="bibr" rid="CR23">23</xref>
]</td>
</tr>
<tr>
<td align="left">Q4</td>
<td align="left">1111010111010011001110111110111</td>
</tr>
<tr>
<td align="left">Q5</td>
<td align="left">1110111011101111010010110011111</td>
</tr>
<tr>
<td align="left">Q6</td>
<td align="left">1111101001011100111110101101111</td>
</tr>
<tr>
<td align="left" colspan="2">Spaced seeds maximizing the sensitivity [
<xref ref-type="bibr" rid="CR21">21</xref>
]</td>
</tr>
<tr>
<td align="left">Q7</td>
<td align="left">1111011110011010111110101011011</td>
</tr>
<tr>
<td align="left">Q8</td>
<td align="left">1110101011101100110100111111111</td>
</tr>
<tr>
<td align="left">Q9</td>
<td align="left">1111110101101011100111011001111</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>In order to evaluate FISH under different conditions, we build several sets of spaced seeds with rashbari, with different lengths from 16 to 45 and weights from 11 to 32. A complete list of spaced seeds is reported in the Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Tables S1–S5.</p>
<p>As for the reads data to be scanned and hashed, we consider a series of datasets of metagenomic reads already used for classification and binning [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR38">38</xref>
]. We use synthetic metagenomic datasets (MiSeq, HiSeq, MK_a1, MK_a2, and simBA5) as well as simulated metagenomic datasets (S,L,R). The datasets (
<italic>R</italic>
<sub>
<italic>x</italic>
</sub>
) simulate single-end long reads from Roche 454, with length 700 bp, and sequencing error of 1%. While the datasets (
<italic>S</italic>
<sub>
<italic>x</italic>
</sub>
and
<italic>L</italic>
<sub>
<italic>x</italic>
</sub>
) are paired-end reads of short length (80 bp) following Illumina error profile. The synthetic metagenomic datasets are built from real shotgun reads of different species to mimic various microbiome communities. Furthermore, for the comparison of spaced seeds with different weights and lengths, we generated datasets of increasing read length of 100, 200, and 400 bp with Mason simulator [
<xref ref-type="bibr" rid="CR39">39</xref>
] according to Illumina error profile. A summary of the datasets used in this study is reported in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. All methods have been tested on a laptop with 16 GB RAM and Intel i74510U cpu at 2GHz.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Number of reads and average lengths for each of the dataset used in our experiments</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Datasets</th>
<th align="left">Number of reads</th>
<th align="left">Avg. read length</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">S6</td>
<td align="left">1426457</td>
<td align="left">80</td>
</tr>
<tr>
<td align="left">S7</td>
<td align="left">3307100</td>
<td align="left">80</td>
</tr>
<tr>
<td align="left">S9</td>
<td align="left">4468336</td>
<td align="left">80</td>
</tr>
<tr>
<td align="left">S10</td>
<td align="left">9981172</td>
<td align="left">80</td>
</tr>
<tr>
<td align="left">L5</td>
<td align="left">1016418</td>
<td align="left">80</td>
</tr>
<tr>
<td align="left">L6</td>
<td align="left">1182178</td>
<td align="left">80</td>
</tr>
<tr>
<td align="left">HiSeq</td>
<td align="left">9989713</td>
<td align="left">91</td>
</tr>
<tr>
<td align="left">simBA5</td>
<td align="left">5439738</td>
<td align="left">100</td>
</tr>
<tr>
<td align="left">MixK1</td>
<td align="left">9629886</td>
<td align="left">101</td>
</tr>
<tr>
<td align="left">MixK2</td>
<td align="left">7149900</td>
<td align="left">101</td>
</tr>
<tr>
<td align="left">MiSeq</td>
<td align="left">9933556</td>
<td align="left">131</td>
</tr>
<tr>
<td align="left">R7</td>
<td align="left">290473</td>
<td align="left">702</td>
</tr>
<tr>
<td align="left">R8</td>
<td align="left">374576</td>
<td align="left">715</td>
</tr>
<tr>
<td align="left">R9</td>
<td align="left">588256</td>
<td align="left">715</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec9">
<title>Analysis of speed up</title>
<p>In the first test we compare the performance of FISH with FSH in terms of speed up with respect to the standard hashing computation. In Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
we report the average speed ups on all datasets, for each spaced seed, obtainable with FISH and FSH approaches.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>The speedup of FISH and FSH with respect to the standard hashing computation, as a function of the spaced seeds used in our experiments</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig2_HTML" id="MO3"></graphic>
</fig>
</p>
<p>We can observe that FISH is faster than FSH independently on the spaced seed considered. As a reference, the standard approach (Eq. (
<xref rid="Equ1" ref-type="">1</xref>
)), requires about 17 minutes to perform the hashing of a seed on all datasets. The two methods FISH and FSH can compute the hashings in 8.5 and 12 minutes respectively, with a speed up of 2 (FISH) and 1.46 (FSH). We noticed that the speed up can vary between spaced seeds, in fact FSH obtains speed ups in the range [1.18-1.58] and FISH in the interval [1.89-2.16]. As expected, the speed up depends on the structure of spaced seed to be hashed, however FSH seems to be highly dependent on the structure with a variation of 0.4 between minimum and maximum speed up, instead FISH variation is only 0.27. In summary, in this first experiments FISH in not only faster, but also less dependent of the spaced seed.</p>
<p>To have a better understanding of the behavior of FISH on all datasets, Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
reports the performance of FISH for each datasets.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Details of the speedup of FISH on each of the considered datasets, ordered by reads length</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig3_HTML" id="MO4"></graphic>
</fig>
</p>
<p>We noticed that the seeds with the best performance are Q2 and Q3, the top two lines in Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
. However, all spaced seeds show a similar behavior across different datasets. The maximum difference between the best seed, top line, and the worse seed, bottom line, remains constant for each datasets confirming the robustness of FISH. Another interesting observation is that the speed up tends to increase with the reads length and it reaches the maximum performance on the long read (see R7, R8 and R9). A possible reason for this behavior is that these datasets contain long reads, and the impact of the initial transient is reduced.</p>
<p>In Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
we report the performance of FISH and FSH for spaced seed Q7 in details over all datasets.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Details of the speedup of FISH and FSH on the spaced seed Q7 for each of the considered datasets, ordered by reads length</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig4_HTML" id="MO5"></graphic>
</fig>
</p>
<p>The results are in line with the above observations and FISH has better speed up across all datasets. However, for FISH the improvement on long reads datasets is substantial with respect to FSH.</p>
</sec>
<sec id="Sec10">
<title>Multiple spaced seed hashing</title>
<p>Several tools exploit the power of spaced seeds by using a combination of such patterns, in order to further improve their performances in terms of quality. Therefore, the simultaneous computation of the hashing of several spaced seeds at once can come very useful in such contexts.</p>
<p>Figure 
<xref rid="Fig5" ref-type="fig">5</xref>
reports the speed up of FISH and FSH when computing the hash of spaced seed independently (light blu and light green), and simultaneously as multiple spaced seeds (dark blu and dark green).
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>Speedup of FSH and FISH with the multiple spaced seeds hashing (dark green and dark blu) and with each spaced seed hashed independently (light green and light blu)</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig5_HTML" id="MO6"></graphic>
</fig>
</p>
<p>The use of multiple spaced seeds simultaneously increases the speed up of both methods. However, FSH improves from 1.45 to 1.49 whereas FISH from 2.48 to 6.03. On this experiment the advantage of FISH is gain substantial, where it can hash multiple spaced seeds 4 times faster than FSH. A detailed analysis of the performance on different datasets can be found in Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
. Similarly to Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
we can observe that the speed up increases on long reads datasets.
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Details of the time speedup of FISH and FSH for the multiple spaced seeds hashing on different datasets</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig6_HTML" id="MO7"></graphic>
</fig>
</p>
</sec>
<sec id="Sec11">
<title>The impact of reads length and spaced seeds weight</title>
<p>These experiments aim at posing in evidence the impact on the speed up of reads length and spaced seeds density. We generated with rasbhari [
<xref ref-type="bibr" rid="CR22">22</xref>
] different sets of nine spaced seeds with lengths from 16 to 45 and weights in the range from 11 to 32, see the Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Tables S1-S5.</p>
<p>In Fig. 
<xref rid="Fig7" ref-type="fig">7</xref>
we compare the speedup of FISH and FSH on spaced seeds with the same length
<italic>L</italic>
=31, while varying the weight
<italic>W</italic>
. It can be observed that the speed up of both FISH and FSH increases as the weight
<italic>W</italic>
increases. A possible explanation is the following. If a spaced seed has an higher weight, then the ability of FISH to use the partial hashes computed in the k-mers tables increases, and this will results in a better speed up. This behavior is consistent for both FISH and FSH, with the only exception of the speedup of FISH on multiple spaced seeds with
<italic>W</italic>
=22 and
<italic>L</italic>
=31. These are the seeds used in the first experiments and reported in Table 
<xref rid="Tab2" ref-type="table">2</xref>
. As opposed to the other set of seeds that have been created all with same tool and minimizing overlap complexity, these seeds have been created with different methods and thus they might expose more overlaps, allowing for a better speedup. On the other hand if the density
<italic>W</italic>
/
<italic>L</italic>
of spaced seeds weight with respect to the length is low, than both FISH and FSH will have poor performance. For example, if
<italic>W</italic>
/
<italic>L</italic>
is below 0.3 than the standard hashing computation is in general faster. On extreme cases, like the spaced seeds reported in [
<xref ref-type="bibr" rid="CR40">40</xref>
], with
<italic>W</italic>
=12 and
<italic>L</italic>
=112 FISH and FSH might not be of help.
<fig id="Fig7">
<label>Fig. 7</label>
<caption>
<p>The speedup of FISH and FSH as a function of the spaced seeds density (
<italic>L</italic>
=31 and W varies)</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig7_HTML" id="MO8"></graphic>
</fig>
</p>
<p>In Fig. 
<xref rid="Fig8" ref-type="fig">8</xref>
we compare the speedup of FISH while varying the reads length, as a function of spaced seeds density (fixed lenght
<italic>L</italic>
=31). We can note that the speedup grows with the reads length, a behavior observed also in Figs. 
<xref rid="Fig3" ref-type="fig">3</xref>
and
<xref rid="Fig4" ref-type="fig">4</xref>
.
<fig id="Fig8">
<label>Fig. 8</label>
<caption>
<p>The speedup of our approach with respect to the standard hashing computation as a function of reads length (100, 200, 400) and the spaced seeds weight
<italic>W</italic>
(all with the same density)</p>
</caption>
<graphic xlink:href="12859_2018_2415_Fig8_HTML" id="MO9"></graphic>
</fig>
</p>
</sec>
</sec>
<sec id="Sec12" sec-type="discussion">
<title>Discussion</title>
<p>In this paper, we address the problem of hashing genomic sequences through the lens of spaced seeds. Spaced seeds are widely used in many tasks related to sequence alignment and comparison. In fact, on the problem of sequence similarity detection spaced seeds have shown better performance than contiguous matches [
<xref ref-type="bibr" rid="CR19">19</xref>
]. While the hashing of contiguous matches can be efficiently performed, for spaced seed this was not the case.</p>
<p>We have already propose a method, called FSH [
<xref ref-type="bibr" rid="CR35">35</xref>
], to address this problem, but in this paper we introduce a new tool, FISH, based on different strategies. FSH is based on spaced seed auto-correlation and dynamic programming, while FISH builds an index of partial common hashings that can be reused multiple times.</p>
<p>In the results section, we have shown that FISH can improve substantially the performance in terms of speed up w.r.t. to FSH and the traditional hashing of spaced seeds. This advantage is demonstrated on a number of different settings, varying spaced seeds density and reads length.</p>
<p>The speed up of FISH increases as the length of the reads grows. This is a desirable property if we consider that modern sequencing technologies can produce longer reads. Also, if spaced seeds with high density are required, FISH indexing strategy outperforms the other methods. One interesting direction of investigation is the use of long and sparse spaced seed, i.e. with very low density, for which FISH and FSH are not suited. It remains an open problem if an alternative hashing method can further improve the hashing computation, closing the gap with the fast hashing of k-mers.</p>
</sec>
<sec id="Sec13" sec-type="conclusion">
<title>Conclusions</title>
<p>In this study we presented FISH, an indexing-based approach for speeding up the computation of rolling hash for spaced seeds. In our experiments FISH was able to compute the hashing values of spaced seeds with a speedup, on average and with respect to the traditional approach, between 1.9× (single) to 6.03× (multi), depending on the structure of the spaced seeds and on the reads length.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Additional file</title>
<sec id="Sec14">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12859_2018_2415_MOESM1_ESM.pdf">
<label>Additional file 1</label>
<caption>
<p>Supplementary Tables. (PDF 45.9 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>NGS</term>
<def>
<p>Next generation sequencing</p>
</def>
</def-item>
<def-item>
<term>FSH</term>
<def>
<p>Fast spaced seed hashing</p>
</def>
</def-item>
<def-item>
<term>FISH</term>
<def>
<p>Fast indexing for spaced seed hashing</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group>
<fn>
<p>
<bold>Authors’ information</bold>
</p>
<p>Dipartimento di Ingegneria dell’Informazione - Università degli Studi di Padova</p>
<p>via Gradenigo 6/A, 35131 Padova - Italy</p>
<p>Email addresses: SG (samuele.girotto@gmail.com), MC (comin@dei.unipd.it), CP (cinzia.pizzi@dei.unipd.it)</p>
</fn>
</fn-group>
<ack>
<sec id="d29e3122">
<title>Funding</title>
<p>Publication costs for this article were sponsored by the Italian MIUR project “Compositional Approaches for the Characterization and Mining of Omics Data” (PRIN20122F87B2).</p>
</sec>
<sec id="d29e3127">
<title>Availability of data and materials</title>
<p>The software is freely available for academic use at:
<ext-link ext-link-type="uri" xlink:href="https://bitbucket.org/samu661/fish/overview">https://bitbucket.org/samu661/fish/overview</ext-link>
.</p>
</sec>
<sec id="d29e3137">
<title>About this supplement</title>
<p>This article has been published as part of
<italic>BMC Bioinformatics Volume 19 Supplement 15, 2018: Proceedings of the 12th International BBCC conference</italic>
. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-15">https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-15</ext-link>
.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>All authors contributed to the design of the approach and to the analysis of the results. SG implemented the FISH software tool and performed the experiments. CP and MC conceived the study and drafted the manuscript. CP coordinated and supervised the work. All authors have read and approved the manuscript for publication.</p>
</notes>
<notes notes-type="COI-statement">
<sec>
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<issue>3</issue>
<fpage>403</fpage>
<lpage>10</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zielezinski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Karlowski</surname>
<given-names>WM</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison: benefits, applications, and tools</article-title>
<source>Genome Biol</source>
<year>2017</year>
<volume>18</volume>
<fpage>186</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-017-1319-7</pub-id>
<pub-id pub-id-type="pmid">28974235</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reinert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chew</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison (i): Statistics and power</article-title>
<source>J Comput Biol</source>
<year>2009</year>
<volume>16</volume>
<issue>12</issue>
<fpage>1615</fpage>
<lpage>34</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2009.0198</pub-id>
<pub-id pub-id-type="pmid">20001252</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing</article-title>
<source>Brief Bioinform</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>343</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbt067</pub-id>
<pub-id pub-id-type="pmid">24064230</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonello</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Fast entropic profiler: An information theoretic approach for the discovery of patterns in genomes</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinforma</source>
<year>2014</year>
<volume>11</volume>
<issue>3</issue>
<fpage>500</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2013.2297924</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Ornamenti</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Spangaro</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rombo</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Parida</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Efficient algorithms for sequence analysis with entropic profiles</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinforma</source>
<year>2018</year>
<volume>15</volume>
<issue>1</issue>
<fpage>117</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2016.2620143</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Leoni</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Schimd</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Clustering of reads with alignment-free measures and quality values</article-title>
<source>Algorithm Mol Biol</source>
<year>2015</year>
<volume>10</volume>
<fpage>4</fpage>
<pub-id pub-id-type="doi">10.1186/s13015-014-0029-x</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Leslie</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>The spectrum kernel: a string kernel for SVM protein classification</article-title>
<source>Proceedings of Pac Symp Biocomput.</source>
<year>2002</year>
<publisher-loc>Singapore</publisher-loc>
<publisher-name>World Scientific Publishing</publisher-name>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Girotto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>17</issue>
<fpage>567</fpage>
<lpage>75</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw466</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wanamaker</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Close</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers</article-title>
<source>BMC Genomics</source>
<year>2015</year>
<volume>16</volume>
<fpage>236</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-015-1419-2</pub-id>
<pub-id pub-id-type="pmid">25879410</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Rastas</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ukkonen</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Finding significant matches of position weight matrices in linear time</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinforma</source>
<year>2011</year>
<volume>8</volume>
<issue>1</issue>
<fpage>69</fpage>
<lpage>79</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2009.35</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Parida</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Rombo</surname>
<given-names>SE</given-names>
</name>
</person-group>
<article-title>Irredundant tandem motifs</article-title>
<source>Theor Comput Sci</source>
<year>2014</year>
<volume>525</volume>
<fpage>89</fpage>
<lpage>102</lpage>
<pub-id pub-id-type="doi">10.1016/j.tcs.2013.08.012</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shajii</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Yorukoglu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>William Yu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Berger</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Fast genotyping of known SNPs through approximate k -mer matching</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>17</issue>
<fpage>538</fpage>
<lpage>44</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw460</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Van Dongen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Abreu-Goodger</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Enright</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Detecting microrna binding and sirna off-target effects from expression data</article-title>
<source>Nat Methods</source>
<year>2008</year>
<volume>5</volume>
<issue>12</issue>
<fpage>1023</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1267</pub-id>
<pub-id pub-id-type="pmid">18978784</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kokot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Grabowski</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Debudaj-Grabysz</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Kmc 2: fast and resource-frugal k-mer counting</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>10</issue>
<fpage>1569</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv022</pub-id>
<pub-id pub-id-type="pmid">25609798</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Opportunistic data structures with applications</article-title>
<source>Proceedings 41st Annual Symposium on Foundations of Computer Science</source>
<year>2000</year>
<publisher-loc>Piscataway</publisher-loc>
<publisher-name>IEEE</publisher-name>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Belazzougui</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Cunial</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>A framework for space-efficient string kernels</article-title>
<source>Algorithmica</source>
<year>2017</year>
<volume>79</volume>
<issue>3</issue>
<fpage>857</fpage>
<lpage>83</lpage>
<pub-id pub-id-type="doi">10.1007/s00453-017-0286-4</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buhler</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Efficient large-scale sequence comparison by locality-sensitive hashing</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<issue>5</issue>
<fpage>419</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.5.419</pub-id>
<pub-id pub-id-type="pmid">11331236</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>On the complexity of the spaced seeds</article-title>
<source>J Comput Syst Sci</source>
<year>2007</year>
<volume>73</volume>
<issue>7</issue>
<fpage>1024</fpage>
<lpage>34</lpage>
<pub-id pub-id-type="doi">10.1016/j.jcss.2007.03.008</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Tromp</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Patternhunter: faster and more sensitive homology search</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>3</issue>
<fpage>440</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.3.440</pub-id>
<pub-id pub-id-type="pmid">11934743</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hahn</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Leimeister</surname>
<given-names>C-A</given-names>
</name>
<name>
<surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Morgenstern</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Rasbhari: Optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison</article-title>
<source>PLoS Comput Biol</source>
<year>2016</year>
<volume>12</volume>
<issue>10</issue>
<fpage>1005107</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1005107</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ilie</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mansouri Bigvand</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>SpEED: fast computation of sensitive spaced seeds</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>17</issue>
<fpage>2433</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr368</pub-id>
<pub-id pub-id-type="pmid">21690104</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Noé</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds</article-title>
<source>Algorithm Mol Biol</source>
<year>2017</year>
<volume>12</volume>
<fpage>1</fpage>
<pub-id pub-id-type="doi">10.1186/s13015-017-0092-1</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leimeister</surname>
<given-names>C-A</given-names>
</name>
<name>
<surname>Boden</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Horwege</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lindner</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Morgenstern</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Fast alignment-free sequence comparison using spaced-word frequencies</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>14</issue>
<fpage>1991</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu177</pub-id>
<pub-id pub-id-type="pmid">24700317</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Onodera</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Shibuya</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>The gapped spectrum kernel for support vector machines</article-title>
<source>Proceedings of the 9th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM’13</source>
<year>2013</year>
<publisher-loc>Berlin, Heidelberg</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rumble</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Lacroute</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Dalca</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Fiume</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sidow</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Brudno</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>SHRiMP: Accurate mapping of short color-space reads</article-title>
<source>PLOS Comput Biol</source>
<year>2009</year>
<volume>5</volume>
<issue>5</issue>
<fpage>1000386</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1000386</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="book">
<person-group person-group-type="editor">
<name>
<surname>Bücher</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Moret</surname>
<given-names>BME</given-names>
</name>
</person-group>
<source>Procrastination Leads to Efficient Filtration for Local Multiple Alignment</source>
<year>2006</year>
<publisher-loc>Berlin, Heidelberg</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Břinda</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Sykulski</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kucherov</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Spaced seeds improve k-mer-based metagenomic classification</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>22</issue>
<fpage>3584</fpage>
<lpage>92</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv419</pub-id>
<pub-id pub-id-type="pmid">26209798</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Girotto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Metagenomic reads binning with spaced seeds</article-title>
<source>Theor Comput Sci</source>
<year>2017</year>
<volume>698</volume>
<fpage>88</fpage>
<lpage>99</lpage>
<pub-id pub-id-type="doi">10.1016/j.tcs.2017.05.023</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Higher classification sensitivity of short metagenomic reads with CLARK-S</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>24</issue>
<fpage>3823</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw542</pub-id>
<pub-id pub-id-type="pmid">27540266</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>DG</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>A tutorial of recent developments in the seeding of local alignment</article-title>
<source>J Bioinforma Comput Biol</source>
<year>2004</year>
<volume>02</volume>
<issue>04</issue>
<fpage>819</fpage>
<lpage>42</lpage>
<pub-id pub-id-type="doi">10.1142/S0219720004000983</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mohamadi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Vandervalk</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>ntHash: recursive nucleotide hashing</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>22</issue>
<fpage>3492</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="pmid">27423894</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Girotto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Schwartz</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Fast Spaced Seed Hashing</article-title>
<source>17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), vol. 88</source>
<year>2017</year>
<publisher-loc>Dagstuhl, Germany</publisher-loc>
<publisher-name>Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik</publisher-name>
</element-citation>
</ref>
<ref id="CR35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Girotto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>FSH: fast spaced seed hashing exploiting adjacent hashes</article-title>
<source>Algorithm Mol Biol</source>
<year>2018</year>
<volume>13</volume>
<fpage>8</fpage>
<pub-id pub-id-type="doi">10.1186/s13015-018-0125-4</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Keich</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Tromp</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>On spaced seeds for similarity search</article-title>
<source>Discret Appl Math</source>
<year>2004</year>
<volume>138</volume>
<issue>3</issue>
<fpage>253</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="doi">10.1016/S0166-218X(03)00382-2</pub-id>
</element-citation>
</ref>
<ref id="CR37">
<label>37</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Girotto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Binning metagenomic reads with probabilistic sequence signatures based on spaced seeds</article-title>
<source>2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB).</source>
<year>2017</year>
<publisher-loc>Piscataway</publisher-loc>
<publisher-name>IEEE</publisher-name>
</element-citation>
</ref>
<ref id="CR38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
<source>Genome Biol</source>
<year>2014</year>
<volume>15</volume>
<fpage>46</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
</element-citation>
</ref>
<ref id="CR39">
<label>39</label>
<mixed-citation publication-type="other">M H. Mason: a read simulator for second generation sequencing data. Technical report, FU Berlin. 2010.
<ext-link ext-link-type="uri" xlink:href="http://publications.mi.fu-berlin.de/962">http://publications.mi.fu-berlin.de/962</ext-link>
Accessed 09 Jan 2017.</mixed-citation>
</ref>
<ref id="CR40">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leimeister</surname>
<given-names>C-A</given-names>
</name>
<name>
<surname>Sohrabi-Jahromi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Morgenstern</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Fast and accurate phylogeny reconstruction using filtered spaced-word matches</article-title>
<source>Bioinformatics</source>
<year>2017</year>
<volume>33</volume>
<issue>7</issue>
<fpage>971</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">28073754</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000277  | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000277  | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021