Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Improving the performance of minimizers and winnowing schemes

Identifieur interne : 000B19 ( Pmc/Corpus ); précédent : 000B18; suivant : 000B20

Improving the performance of minimizers and winnowing schemes

Auteurs : Guillaume Marçais ; David Pellow ; Daniel Bork ; Yaron Orenstein ; Ron Shamir ; Carl Kingsford

Source :

RBID : PMC:5870760

Abstract

AbstractMotivation

The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.

Results

We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence.

Availability and Implementation

The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.


Url:
DOI: 10.1093/bioinformatics/btx235
PubMed: 28881970
PubMed Central: 5870760

Links to Exploration step

PMC:5870760

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Improving the performance of minimizers and winnowing schemes</title>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation>
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation>
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bork, Daniel" sort="Bork, Daniel" uniqKey="Bork D" first="Daniel" last="Bork">Daniel Bork</name>
<affiliation>
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation>
<nlm:aff id="btx235-aff3">Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation>
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation>
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28881970</idno>
<idno type="pmc">5870760</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870760</idno>
<idno type="RBID">PMC:5870760</idno>
<idno type="doi">10.1093/bioinformatics/btx235</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000B19</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B19</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Improving the performance of minimizers and winnowing schemes</title>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation>
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation>
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bork, Daniel" sort="Bork, Daniel" uniqKey="Bork D" first="Daniel" last="Bork">Daniel Bork</name>
<affiliation>
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation>
<nlm:aff id="btx235-aff3">Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation>
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation>
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>The minimizers scheme is a method for selecting
<italic>k</italic>
-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many
<italic>k</italic>
-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of
<italic>k</italic>
-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>We provide an in-depth analysis of the effect of
<italic>k</italic>
-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer
<italic>et al.</italic>
) on the expected density of minimizers in a random sequence.</p>
</sec>
<sec id="SA4">
<title>Availability and Implementation</title>
<p>The software used for this analysis is available on GitHub:
<ext-link ext-link-type="uri" xlink:href="https://github.com/gmarcais/minimizers.git">https://github.com/gmarcais/minimizers.git</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Bruijn, N G" uniqKey="De Bruijn N">N.G. de Bruijn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S. Deorowicz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S. Grabowski</name>
</author>
<author>
<name sortKey="Raniszewski, M" uniqKey="Raniszewski M">M. Raniszewski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y. Li</name>
</author>
<author>
<name sortKey="Yan, X" uniqKey="Yan X">X. Yan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Orenstein, Y" uniqKey="Orenstein Y">Y. Orenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Orenstein, Y" uniqKey="Orenstein Y">Y. Orenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M. Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M. Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schleimer, S" uniqKey="Schleimer S">S. Schleimer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, D E" uniqKey="Wood D">D.E. Wood</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C. Ye</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28881970</article-id>
<article-id pub-id-type="pmc">5870760</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btx235</article-id>
<article-id pub-id-type="publisher-id">btx235</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017</subject>
<subj-group subj-group-type="category-toc-heading">
<subject>Hitseq</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Improving the performance of minimizers and winnowing schemes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Marçais</surname>
<given-names>Guillaume</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff1">1</xref>
<xref ref-type="corresp" rid="btx235-cor1"></xref>
<pmc-comment>gmarcais@cs.cmu.edu</pmc-comment>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pellow</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bork</surname>
<given-names>Daniel</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Orenstein</surname>
<given-names>Yaron</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff1">1</xref>
<xref ref-type="corresp" rid="btx235-cor1"></xref>
<pmc-comment>carlk@cs.cmu.edu</pmc-comment>
</contrib>
</contrib-group>
<aff id="btx235-aff1">
<label>1</label>
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</aff>
<aff id="btx235-aff2">
<label>2</label>
Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</aff>
<aff id="btx235-aff3">
<label>3</label>
Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA</aff>
<author-notes>
<corresp id="btx235-cor1">To whom correspondence should be addressed. Email:
<email>gmarcais@cs.cmu.edu</email>
or
<email>carlk@cs.cmu.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>15</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub" iso-8601-date="2017-07-12">
<day>12</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>12</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>33</volume>
<issue>14</issue>
<fpage>i110</fpage>
<lpage>i117</lpage>
<permissions>
<copyright-statement>© The Author 2017. Published by Oxford University Press.</copyright-statement>
<copyright-year>2017</copyright-year>
<license license-type="cc-by-nc" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<self-uri xlink:href="btx235.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>The minimizers scheme is a method for selecting
<italic>k</italic>
-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many
<italic>k</italic>
-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of
<italic>k</italic>
-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>We provide an in-depth analysis of the effect of
<italic>k</italic>
-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer
<italic>et al.</italic>
) on the expected density of minimizers in a random sequence.</p>
</sec>
<sec id="SA4">
<title>Availability and Implementation</title>
<p>The software used for this analysis is available on GitHub:
<ext-link ext-link-type="uri" xlink:href="https://github.com/gmarcais/minimizers.git">https://github.com/gmarcais/minimizers.git</ext-link>
.</p>
</sec>
</abstract>
<funding-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">Israel Science Foundation</named-content>
<named-content content-type="funder-identifier">10.13039/501100003977</named-content>
</funding-source>
</award-group>
</funding-group>
<counts>
<page-count count="8"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>1 Introduction</title>
<p>The winnowing scheme was introduced by
<xref rid="btx235-B12" ref-type="bibr">Schleimer
<italic>et al.</italic>
(2003)</xref>
to fingerprint documents for plagiarism detection. Independently, the minimizers algorithm was introduced by
<xref rid="btx235-B11" ref-type="bibr">Roberts
<italic>et al.</italic>
(2004b</xref>
) to compute overlaps between sequencing reads. Even though these algorithms were designed for different purposes, they are essentially identical. The original minimizers scheme compares nucleotide
<italic>k</italic>
-mers using the lexicographic ordering, while the winnowing scheme hashes the
<italic>k</italic>
-grams of letters in a document, effectively randomizing the order.</p>
<p>In the bioinformatics field, minimizers have since been used in many different settings, such as binning input sequences (
<xref rid="btx235-B4" ref-type="bibr">Deorowicz
<italic>et al.</italic>
, 2015</xref>
;
<xref rid="btx235-B7" ref-type="bibr">Li and Yan, 2015</xref>
;
<xref rid="btx235-B10" ref-type="bibr">Roberts
<italic>et al.</italic>
, 2004a,b</xref>
), generating sparse data structures (
<xref rid="btx235-B5" ref-type="bibr">Grabowski and Raniszewski, 2015</xref>
;
<xref rid="btx235-B14" ref-type="bibr">Ye
<italic>et al.</italic>
, 2012</xref>
), and sequence classification (
<xref rid="btx235-B13" ref-type="bibr">Wood and Salzberg, 2014</xref>
). All these applications share the need for a small signature, or fingerprint, in order to recognize longer exact matches between sequences.</p>
<p>The winnowing scheme is defined as follows: given an input sequence
<italic>S</italic>
and parameters
<italic>k</italic>
and
<italic>w</italic>
, select the smallest
<italic>k</italic>
-mer (according to a predefined ordering) in each window of
<italic>w</italic>
consecutive
<italic>k</italic>
-mers in
<italic>S</italic>
(such a window contains
<inline-formula id="IE1">
<mml:math id="IEQ1">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
bases). In this paper, we will interchangeably use the terms ‘minimizers scheme’ or ‘winnowing scheme’. Either term does not imply a particular ordering as we study the effect of various orderings on these schemes. The minimizers (or fingerprints), are, depending on the application, either the set of positions in
<italic>S</italic>
of the selected
<italic>k</italic>
-mers, or the set of selected
<italic>k</italic>
-mers itself. The terms fingerprint and minimizer are interchangeable in the remainder of this paper.</p>
<p>These schemes are designed to select a set of
<italic>k</italic>
-mers that is as sparse as possible while satisfying the following two properties. First, the sequence is approximately uniformly sampled; that is, the distance between two selected
<italic>k</italic>
-mers is always less than
<italic>w</italic>
. Second, if two sequences share an exact subsequence of length at least
<inline-formula id="IE2">
<mml:math id="IEQ2">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, then those two sequences have at least one minimizer in common.</p>
<p>Many of the tools mentioned above do not use lexicographic ordering of the
<italic>k</italic>
-mers, as originally proposed in
<xref rid="btx235-B11" ref-type="bibr">Roberts
<italic>et al.</italic>
(2004b</xref>
). It was recognized early on that homo-polymer runs, particularly repeated
<italic>A</italic>
s, can cause a lexicographic minimizer algorithm to select many consecutive
<italic>k</italic>
-mers as minimizers in genomics applications, leading to a high minimizer density. Because
<italic>A</italic>
is the smallest letter, if the 5-mer
<italic>AAAAC</italic>
is selected as a minimizer in a window, it is likely that the shifted 5-mer
<italic>AAACx</italic>
(for any base
<italic>x</italic>
) will also be selected as a minimizer in a subsequent window, and
<italic>AACxy</italic>
in the following window, and so on.</p>
<p>Another problem with the lexicographic ordering is that any one of the
<inline-formula id="IE3">
<mml:math id="IEQ3">
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
possible
<italic>k</italic>
-mers can theoretically be chosen as a minimizer in some window. In particular, when
<italic>k</italic>
is short, many or even all possible
<italic>k</italic>
-mers could appear as a minimizer. If the minimizers are used for binning sequences, then potentially a very large number of bins will be necessary.</p>
<p>Schleimer
<italic>et al.</italic>
defined the
<italic>density</italic>
of a
<italic>k</italic>
-mer selection scheme for a given sequence
<italic>S</italic>
as the fraction of the
<italic>k</italic>
-mers that are selected. Formally, let
<inline-formula id="IE4">
<mml:math id="IEQ4">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
denote the density of procedure
<italic>A</italic>
and
<italic>A</italic>
(
<italic>S</italic>
,
<italic>k</italic>
) as the set of selected
<italic>k</italic>
-mers positions by
<italic>A</italic>
on sequence
<italic>S</italic>
, then
<disp-formula id="E1">
<label>(1)</label>
<mml:math id="EQ1">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
The
<italic>expected density</italic>
of procedure
<italic>A</italic>
,
<italic>d</italic>
(
<italic>A</italic>
,
<italic>k</italic>
), is calculated by taking the expectation over all possible sequences
<italic>S</italic>
, with every base chosen independently with equal probability.</p>
<p>Schleimer
<italic>et al.</italic>
show that the winnowing scheme with a random ordering has expected density of
<inline-formula id="IE5">
<mml:math id="IEQ5">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. In practice, the minimizers scheme with lexicographic ordering has density greater than
<inline-formula id="IE6">
<mml:math id="IEQ6">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. In addition, they define a
<italic>local scheme</italic>
as a procedure that only has access to the sequence within a given window when selecting a
<italic>k</italic>
-mer. In other words, the
<italic>k</italic>
-mer selected from a window can be expressed as a function of only the identity and relative ordering of the bases within the window itself, and does not depend on the content of any other window or on the position of the window within the overall sequence. They prove that the expected density of a local scheme has a lower bound of
<inline-formula id="IE7">
<mml:math id="IEQ7">
<mml:mrow>
<mml:mn>1.5</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, but they conjecture that this bound cannot be achieved in practice, and that the true lower bound is in fact
<inline-formula id="IE8">
<mml:math id="IEQ8">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. This conjecture would therefore imply that the winnowing scheme with random order is an optimal local scheme.</p>
<p>In this paper, we investigate the effect of different orderings of the
<italic>k</italic>
-mers on the density of the winnowing scheme. Here we only consider local schemes, and exclude global schemes, such as the counting-based orderings used in
<xref rid="btx235-B1" ref-type="bibr">Chikhi
<italic>et al.</italic>
(2015</xref>
,
<xref rid="btx235-B2" ref-type="bibr">2016</xref>
).</p>
<p>By using
<italic>universal k-mer hitting sets</italic>
, as defined in
<xref rid="btx235-B8" ref-type="bibr">Orenstein
<italic>et al.</italic>
(2016a</xref>
), we show how to create orderings that lead to densities smaller than with the lexicographic or randomized ordering. The small universal hitting sets created by the DOCKS algorithm developed in
<xref rid="btx235-B8" ref-type="bibr">Orenstein
<italic>et al.</italic>
(2016a</xref>
) achieve densities below
<inline-formula id="IE9">
<mml:math id="IEQ9">
<mml:mrow>
<mml:mn>1.8</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. As a consequence, the above conjecture of Schleimer
<italic>et al.</italic>
does not hold and densities below
<inline-formula id="IE10">
<mml:math id="IEQ10">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
are achievable with a local scheme. Using some properties of the universal
<italic>k</italic>
-mer hitting sets, we also show that, surprisingly, the winnowing scheme with random ordering can itself have a density slightly below
<inline-formula id="IE11">
<mml:math id="IEQ11">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. We also explain why the original minimizers procedure, using the lexicographic order, performs worse than random ordering.</p>
<p>Finally, we look at the potential effect of using universal hitting set orderings on bioinformatics applications. In the case of DNA sequence binning, such as performed by the
<italic>k</italic>
-mer counters KMC2 (
<xref rid="btx235-B4" ref-type="bibr">Deorowicz
<italic>et al.</italic>
, 2015</xref>
) and MSPKmerCounter (
<xref rid="btx235-B7" ref-type="bibr">Li and Yan, 2015</xref>
), we compare the distribution of the number and sizes of bins created by different orderings proposed in the literature. The universal hitting sets ordering perform better than the other ordering in couple of ways. First, the number of bins created has a known bound, unlike other orderings that can create as many bins as there are
<italic>k</italic>
-mers (
<inline-formula id="IE12">
<mml:math id="IEQ12">
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
). In practice, this bound is also much lower than the actual number of bins created by other orderings. Second, the sizes of the bins are more uniform than with other orderings.</p>
<p>Based on this analysis of the performance of the minimizers algorithm, we advise bioinformatics tool authors to use the winnowing scheme with an appropriate universal hitting set in their application if possible, and a random ordering otherwise, in lieu of the default lexicographic ordering.</p>
</sec>
<sec>
<title>2 Approach</title>
<p>We will first give an overview of the results in this paper. Formal proofs for the results mentioned in this section are presented in subsequent sections.</p>
<p>In the original papers on minimizers and the winnowing schemes, the density is computed by considering any window of
<italic>w</italic>
 + 1 consecutive
<italic>k</italic>
-mers. They make use of the following hypothesis:</p>
<p>
<sc>Hypothesis</sc>
1.
<italic>Every k-mer in a</italic>
<inline-formula id="IE13">
<mml:math id="IEQ13">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
<italic>-long window has an equal probability of</italic>
<inline-formula id="IE14">
<mml:math id="IEQ14">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
<italic>of being the smallest k-mer.</italic>
</p>
<p>Although not strictly true in practice, this hypothesis is reasonable and reflects reality accurately when using a randomized ordering. We define the
<italic>density factor d
<sub>f</sub>
</italic>
of a
<italic>k</italic>
-mer selection scheme
<italic>A</italic>
as the density times
<inline-formula id="IE15">
<mml:math id="IEQ15">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, i.e.
<disp-formula id="E2">
<label>(2)</label>
<mml:math id="EQ2">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>·</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
Assuming that Hypothesis 1 holds, we can rephrase the theorems of Schleimer
<italic>et al.</italic>
and Roberts
<italic>et al.</italic>
as:</p>
<statement id="mthst1">
<p>
<sc>Theorem</sc>
1 (
<xref rid="btx235-B11" ref-type="bibr">Roberts
<italic>et al.</italic>
, 2004b</xref>
;
<xref rid="btx235-B12" ref-type="bibr">Schleimer
<italic>et al.</italic>
, 2003</xref>
). Under Hypothesis 1, the density factor of the minimizers is
<inline-formula id="IE500">
<mml:math id="IEQ500">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo>2</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
</statement>
<p>A
<italic>universal k-mer hitting set</italic>
for given
<italic>k</italic>
and
<italic>w</italic>
is a set
<inline-formula id="IE16">
<mml:math id="IEQ16">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
of
<italic>k</italic>
-mers such that every window of
<italic>w</italic>
consecutive
<italic>k</italic>
-mers must contain an element from
<inline-formula id="IE17">
<mml:math id="IEQ17">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. In
<xref rid="btx235-B8" ref-type="bibr">Orenstein
<italic>et al.</italic>
(2016a</xref>
), we introduced universal hitting sets, and provided an efficient algorithm for constructing a compact universal set. These universal hitting sets do not need to contain all possible
<italic>k</italic>
-mers, and typically contain only about
<inline-formula id="IE18">
<mml:math id="IEQ18">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
of all the
<italic>k</italic>
-mers. Here, we connect the ideas of universal
<italic>k</italic>
-mer hitting sets with minimizers by creating an ordering where the elements of
<inline-formula id="IE19">
<mml:math id="IEQ19">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
are the smallest elements. Because every window contains an element of
<inline-formula id="IE20">
<mml:math id="IEQ20">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and these elements are smaller than other
<italic>k</italic>
-mers, only elements of
<inline-formula id="IE21">
<mml:math id="IEQ21">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
can be selected as minimizers. Consequently,
<italic>k</italic>
-mers that are not in
<inline-formula id="IE22">
<mml:math id="IEQ22">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
have probability 0 of being selected as minimizers and Hypothesis 1 does not hold anymore.</p>
<p>We now consider a refined hypothesis.</p>
<p>
<sc>Hypothesis</sc>
2.
<italic>If a</italic>
<inline-formula id="IE23">
<mml:math id="IEQ23">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
<italic>-long window contains j k-mers from a universal hitting set, each of these k-mers has an equal probability of</italic>
<inline-formula id="IE24">
<mml:math id="IEQ24">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
<italic>of being the smallest k-mer.</italic>
</p>
<p>Motivated by the proof of Theorem 3 (see below), we define:</p>
<p>
<sc>Definition</sc>
2. The sparsity of a universal hitting set
<inline-formula id="IE25">
<mml:math id="IEQ25">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, is the proportion of
<inline-formula id="IE26">
<mml:math id="IEQ26">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long windows containing only one k-mer from
<inline-formula id="IE27">
<mml:math id="IEQ27">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>Assuming that Hypothesis 2 holds, we obtain a new estimation of the density factor.</p>
<p>
<sc>Theorem</sc>
3. Under Hypothesis 2, the density factor for the minimizers scheme with universal hitting sets
<inline-formula id="IE28">
<mml:math id="IEQ28">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is
<disp-formula id="E3">
<label>(3)</label>
<mml:math id="EQ3">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="true">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
Note that when using the scheme we propose below, Hypothesis 2 provides no practical constraint on the use of minimizers in applications. Since any window will contain an element of
<inline-formula id="IE29">
<mml:math id="IEQ29">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
by definition, any application assuming Hypothesis 1 can be made to work under Hypothesis 2 through the use of the appropriate
<italic>k</italic>
-mer ordering.</p>
<p>The expected density of the minimizers scheme for a given ordering is defined by an expectation over the set of all random sequences. So the density of minimizers computed for a given sequence might differ from the expected density. On the other hand, we will show how the expected density can be computed exactly by computing the density of minimizers on a particular sequence.</p>
<statement id="mthst2">
<p>
<sc>Lemma</sc>
4. The expected density of a minimizers scheme with parameters
<italic>k</italic>
and
<italic>w</italic>
is equal to the density of minimizers on any de Bruijn sequence of order
<inline-formula id="IE501">
<mml:math id="IEQ501">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
</statement>
<p>We also show that Theorem 3 provides a good approximation to the density for minimizers with universal hitting sets and exhibit counterexamples to the Schleimer
<italic>et al.</italic>
conjecture.</p>
<p>
<sc>Theorem</sc>
5. The density factor of a local scheme can be less than 2.</p>
<p>The original intent of the minimizers is to select
<italic>k</italic>
-mers from a sequence as uniformly and sparsely as possible. Theorem 5 shows that schemes based on universal hitting set get closer to that goal. Using these schemes can greatly improve the performance of bioinformatics applications (See Section 4).</p>
</sec>
<sec>
<title>3 Proofs and mathematical details</title>
<sec>
<title>3.1 Density with random ordering</title>
<p>First, we succinctly reproduce the proof from
<xref rid="btx235-B12" ref-type="bibr">Schleimer
<italic>et al.</italic>
(2003)</xref>
and
<xref rid="btx235-B11" ref-type="bibr">Roberts
<italic>et al.</italic>
(2004b</xref>
) of Theorem 1.</p>
<p>
<sc>Theorem</sc>
1. (
<xref rid="btx235-B11" ref-type="bibr">Roberts
<italic>et al.</italic>
, 2004b</xref>
;
<xref rid="btx235-B12" ref-type="bibr">Schleimer
<italic>et al.</italic>
, 2003</xref>
). Under Hypothesis 1, the density factor of the minimizers is
<inline-formula id="IE502">
<mml:math id="IEQ502">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<statement id="mthst3">
<p>
<sc>Proof</sc>
: Define a
<italic>charged</italic>
window
<italic>W
<sub>i</sub>
</italic>
of
<italic>w</italic>
consecutive
<italic>k</italic>
-mers starting at position
<italic>i</italic>
in a sequence
<italic>S</italic>
as a window such that the smallest
<italic>k</italic>
-mer is different in
<italic>W
<sub>i</sub>
</italic>
than in
<inline-formula id="IE30">
<mml:math id="IEQ30">
<mml:mrow>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. That is, as we sweep through the sequence from left to right, we charge the left-most window in which a given fingerprint is first selected. The number of fingerprints selected by the winnowing scheme is equal to the number of windows that are charged. Define the random variable
<italic>X
<sub>i</sub>
</italic>
to be 1 if
<italic>W
<sub>i</sub>
</italic>
is charged and 0 if not. Then the expected density can be written as
<inline-formula id="IE31">
<mml:math id="IEQ31">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">E</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mi mathvariant="bold">E</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, where
<inline-formula id="IE32">
<mml:math id="IEQ32">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
is the number of
<italic>k</italic>
-mers in the sequence.</p>
</statement>
<p>Consider the larger window
<inline-formula id="IE33">
<mml:math id="IEQ33">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
of
<italic>w</italic>
 + 1
<italic>k</italic>
-mers starting at position
<italic>i</italic>
– 1 (
<inline-formula id="IE34">
<mml:math id="IEQ34">
<mml:mrow>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
), and the smallest
<italic>k</italic>
-mer
<italic>m</italic>
in
<inline-formula id="IE35">
<mml:math id="IEQ35">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
(see
<xref ref-type="fig" rid="btx235-F1">Fig. 1</xref>
). If
<italic>m</italic>
starts at position
<italic>i</italic>
– 1, then
<italic>W
<sub>i</sub>
</italic>
must be charged, as the previous fingerprint is now outside of the window
<italic>W
<sub>i</sub>
</italic>
. If
<italic>m</italic>
starts at position
<inline-formula id="IE36">
<mml:math id="IEQ36">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, then
<italic>W
<sub>i</sub>
</italic>
must be charged, as the new smallest
<italic>k</italic>
-mer just arrived in
<italic>W
<sub>i</sub>
</italic>
. In all other cases,
<italic>W
<sub>i</sub>
</italic>
and
<inline-formula id="IE37">
<mml:math id="IEQ37">
<mml:mrow>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
selected the same fingerprint and
<italic>W
<sub>i</sub>
</italic>
is not charged. Assuming that the minimum
<italic>m</italic>
has equal probability to be in any of the positions
<inline-formula id="IE38">
<mml:math id="IEQ38">
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, then
<inline-formula id="IE39">
<mml:math id="IEQ39">
<mml:mrow>
<mml:mi mathvariant="bold">E</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. So
<inline-formula id="IE40">
<mml:math id="IEQ40">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
and
<italic>d
<sub>f</sub>
</italic>
 = 2. □</p>
<fig id="btx235-F1" orientation="portrait" position="float">
<label>Fig. 1</label>
<caption>
<p>Windows
<italic>W
<sub>i</sub>
</italic>
, starting at position
<italic>i</italic>
, and window
<inline-formula id="IE41">
<mml:math id="IEQ41">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
starting at position
<italic>i</italic>
– 1. There are 3 different qualitative cases for the start position of the smallest
<italic>k</italic>
-mer
<italic>m</italic>
:
<italic>i</italic>
– 1 (left dot),
<inline-formula id="IE42">
<mml:math id="IEQ42">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
(right dot) or in the range
<inline-formula id="IE43">
<mml:math id="IEQ43">
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</caption>
<graphic xlink:href="btx235f1"></graphic>
</fig>
</sec>
<sec>
<title>3.2 Computing the density</title>
<p>Even though the density is defined as an expectation over all possible random sequences where the bases are uniformly and independently chosen, for small values of
<italic>k</italic>
and
<italic>w</italic>
, it is possible in practice to compute the exact value of the expected density.</p>
<p>A
<italic>de Bruijn sequence</italic>
(
<xref rid="btx235-B3" ref-type="bibr">de Bruijn, 1946</xref>
) of order
<italic>k</italic>
on the alphabet Σ is a cyclic sequence that contains every possible
<italic>k</italic>
-mer as a substring exactly once and has length
<italic>σ
<sup>k</sup>
</italic>
, where
<italic>σ</italic>
is the number of symbols in the alphabet.</p>
<statement id="mthst5">
<p>
<sc>Lemma</sc>
4. The expected density of a minimizers scheme with parameters
<italic>k</italic>
and
<italic>w</italic>
is equal to the density of minimizers on any de Bruijn sequence of order
<inline-formula id="IE503">
<mml:math id="IEQ503">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
</statement>
<statement id="mthst7">
<p>
<sc>Proof</sc>
: As seen in the proof of Section 3.1, whether a window
<italic>W
<sub>i</sub>
</italic>
is charged depends only on the sequence of the
<inline-formula id="IE44">
<mml:math id="IEQ44">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long window
<inline-formula id="IE45">
<mml:math id="IEQ45">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Under the assumption that the input sequence is random, each
<inline-formula id="IE46">
<mml:math id="IEQ46">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long window has a probability of
<inline-formula id="IE47">
<mml:math id="IEQ47">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mo>σ</mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
. Hence, the density is the number of
<inline-formula id="IE48">
<mml:math id="IEQ48">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long windows
<inline-formula id="IE49">
<mml:math id="IEQ49">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
that cause the window
<italic>W</italic>
to be charged, divided by
<inline-formula id="IE50">
<mml:math id="IEQ50">
<mml:mrow>
<mml:msup>
<mml:mo>σ</mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
. By computing the density on increasingly long random input sequences, the density would converge to the expected density as the proportion of the
<inline-formula id="IE51">
<mml:math id="IEQ51">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-windows converge to
<inline-formula id="IE52">
<mml:math id="IEQ52">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mo>σ</mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>A de Bruijn sequence of order
<italic>k </italic>
+
<italic> w</italic>
contains every one of the
<inline-formula id="IE53">
<mml:math id="IEQ53">
<mml:mrow>
<mml:msup>
<mml:mo>σ</mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
possible
<inline-formula id="IE54">
<mml:math id="IEQ54">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-windows of
<italic>k</italic>
-mers exactly once (since it contains every
<inline-formula id="IE55">
<mml:math id="IEQ55">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long sequence exactly once). In other words, this de Bruijn sequence has all the
<inline-formula id="IE56">
<mml:math id="IEQ56">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-windows in exactly the desired proportion. Therefore, we can compute the expected density exactly by computing the density on the de Bruijn sequence of order
<italic>w </italic>
+
<italic> k</italic>
.                          □</p>
</statement>
</sec>
<sec>
<title>3.3 Universal hitting set ordering</title>
<p>The
<italic>de Bruijn graph G
<sub>k</sub>
</italic>
on a fixed alphabet Σ of size
<italic>σ</italic>
is a directed graph whose vertices are all the
<italic>k</italic>
-mers and a directed edge (
<italic>u</italic>
,
<italic>v</italic>
) represents a
<inline-formula id="IE57">
<mml:math id="IEQ57">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-mer with
<italic>u</italic>
as its prefix and
<italic>v</italic>
as its suffix. A
<italic>universal hitting set</italic>
for the parameters
<italic>k</italic>
and
<italic>w</italic>
is a set of
<italic>k</italic>
-mers,
<inline-formula id="IE58">
<mml:math id="IEQ58">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, so that any string of length
<inline-formula id="IE59">
<mml:math id="IEQ59">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
over Σ contains a substring from
<inline-formula id="IE60">
<mml:math id="IEQ60">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Equivalently, every walk of length
<italic>w</italic>
over the nodes in the de Bruijn graph
<italic>G
<sub>k</sub>
</italic>
contains at least one
<italic>k</italic>
-mer from
<inline-formula id="IE61">
<mml:math id="IEQ61">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. In other words, it is a set of nodes that covers (has non-empty intersection with) all the length-
<italic>w</italic>
walks. (Here and throughout, the
<italic>length</italic>
of a walk is the number of vertices along it.) The set
<inline-formula id="IE62">
<mml:math id="IEQ62">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of all the nodes of
<italic>G
<sub>k</sub>
</italic>
is trivially a universal hitting set, showing that the concept is well defined. In
<xref rid="btx235-B8" ref-type="bibr">Orenstein
<italic>et al.</italic>
(2016a</xref>
), a heuristic is given to generate smaller universal hitting sets. The existence of such sets with certain desired properties is further discussed in Section 3.5.</p>
<p>Given a universal hitting set
<inline-formula id="IE63">
<mml:math id="IEQ63">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, we define an ordering
<inline-formula id="IE64">
<mml:math id="IEQ64">
<mml:mrow>
<mml:msub>
<mml:mo><</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
on the
<italic>k</italic>
-mers as follows:
<inline-formula id="IE65">
<mml:math id="IEQ65">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:msub>
<mml:mo><</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
if
<inline-formula id="IE66">
<mml:math id="IEQ66">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula id="IE67">
<mml:math id="IEQ67">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
; otherwise,
<inline-formula id="IE68">
<mml:math id="IEQ68">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:msub>
<mml:mo><</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
if
<italic>u</italic>
is less than
<italic>v</italic>
in lexicographical order. In other words, the ordering makes the elements of the universal hitting set the smallest elements, and uses lexicographic order within the universal hitting set and its complement.</p>
<p>The most important property of this ordering is that, in each window of
<italic>w</italic>
consecutive
<italic>k</italic>
-mers, the smallest element must be an element of
<inline-formula id="IE69">
<mml:math id="IEQ69">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
</sec>
<sec>
<title>3.4 Density with universal hitting sets</title>
<p>With the ordering defined by a universal hitting set, we use the more appropriate Hypothesis 2 to prove the following theorem.</p>
<statement id="mthst9">
<p>
<sc>Theorem</sc>
3. Under Hypothesis 2, the density factor for the minimizers scheme with universal hitting sets
<inline-formula id="IE70">
<mml:math id="IEQ70">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is
<disp-formula id="E4">
<label>(3)</label>
<mml:math id="EQ4">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="true">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
</statement>
<statement id="mthst11">
<p>
<sc>Proof</sc>
: We adapt the proof of Section 3.1 to the ordering
<inline-formula id="IE71">
<mml:math id="IEQ71">
<mml:mrow>
<mml:msub>
<mml:mo><</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. With this ordering, only the elements of
<inline-formula id="IE72">
<mml:math id="IEQ72">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
are selected as fingerprints, and the number of such elements in the
<inline-formula id="IE73">
<mml:math id="IEQ73">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long window varies between 1 and
<italic>w</italic>
 + 1. We can compute the probability that a window is charged depending on how many elements from
<inline-formula id="IE74">
<mml:math id="IEQ74">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
it contains. Let’s define the event
<inline-formula id="IE75">
<mml:math id="IEQ75">
<mml:mrow>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
that the window
<italic>W
<sub>i</sub>
</italic>
is charged, given that there are
<italic>j</italic>
elements of
<inline-formula id="IE76">
<mml:math id="IEQ76">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
in the longer window
<inline-formula id="IE77">
<mml:math id="IEQ77">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
that contains it. Then:
<disp-formula id="E5">
<label>(4)</label>
<mml:math id="EQ5">
<mml:mrow>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="true">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo> </mml:mo>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mo stretchy="true">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E6">
<label>(5)</label>
<mml:math id="EQ6">
<mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
</statement>
<p>We can further condition the event
<inline-formula id="IE78">
<mml:math id="IEQ78">
<mml:mrow>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
on whether the
<italic>k</italic>
-mers
<inline-formula id="IE79">
<mml:math id="IEQ79">
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula id="IE80">
<mml:math id="IEQ80">
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, respectively at positions
<italic>i</italic>
– 1 and
<inline-formula id="IE81">
<mml:math id="IEQ81">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, are in the universal hitting set. For
<inline-formula id="IE82">
<mml:math id="IEQ82">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
<mml:mo><</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
:
<disp-formula id="E7">
<label>(6)</label>
<mml:math id="EQ7">
<mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E8">
<label>(7)</label>
<mml:math id="EQ8">
<mml:mrow>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo> </mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E9">
<label>(8)</label>
<mml:math id="EQ9">
<mml:mrow>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo> </mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E10">
<label>(9)</label>
<mml:math id="EQ10">
<mml:mrow>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo> </mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
The factor
<inline-formula id="IE83">
<mml:math id="IEQ83">
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
in (6) expresses the number of ways to place the
<italic>j</italic>
– 1
<italic>k</italic>
-mers of
<inline-formula id="IE84">
<mml:math id="IEQ84">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
in the
<italic>w</italic>
– 1 positions available (from
<italic>i</italic>
to
<inline-formula id="IE85">
<mml:math id="IEQ85">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
), given that
<inline-formula id="IE86">
<mml:math id="IEQ86">
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is in
<inline-formula id="IE87">
<mml:math id="IEQ87">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula id="IE88">
<mml:math id="IEQ88">
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is not in
<inline-formula id="IE89">
<mml:math id="IEQ89">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. The other factors are analogous.</p>
<p>For
<inline-formula id="IE90">
<mml:math id="IEQ90">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
<mml:mo><</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, we assume, based on Hypothesis 2, that all of the
<italic>j k</italic>
-mers from
<inline-formula id="IE91">
<mml:math id="IEQ91">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
have equal probability to be the smallest for the order
<inline-formula id="IE92">
<mml:math id="IEQ92">
<mml:mrow>
<mml:msub>
<mml:mo><</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Therefore the conditional probabilities equal
<inline-formula id="IE93">
<mml:math id="IEQ93">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
for (6) and (7),
<inline-formula id="IE94">
<mml:math id="IEQ94">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
for (8) and 0 for (9). Hence
<disp-formula id="E11">
<mml:math id="EQ11">
<mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>j</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
For
<italic>j </italic>
=
<italic> w</italic>
, term (9) is omitted, as all of but one the
<italic>k</italic>
-mers in
<inline-formula id="IE95">
<mml:math id="IEQ95">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
are in
<inline-formula id="IE96">
<mml:math id="IEQ96">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. For
<inline-formula id="IE97">
<mml:math id="IEQ97">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, which corresponds to the case in the proof of Section 3.1, only term (8) is present, as all the
<italic>k</italic>
-mers in
<inline-formula id="IE98">
<mml:math id="IEQ98">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
are in
<inline-formula id="IE99">
<mml:math id="IEQ99">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Both these cases lead to the same result as the general case, hence
<inline-formula id="IE100">
<mml:math id="IEQ100">
<mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mtext>for</mml:mtext>
<mml:mn>2</mml:mn>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>On the other hand, for
<italic>j</italic>
 = 1,
<inline-formula id="IE101">
<mml:math id="IEQ101">
<mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
. (This distinction motivates our definition of
<inline-formula id="IE102">
<mml:math id="IEQ102">
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.) This corresponds to the case where there is only one element from the universal hitting set in the
<inline-formula id="IE103">
<mml:math id="IEQ103">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long window
<inline-formula id="IE104">
<mml:math id="IEQ104">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. In that case, this element cannot be located at position
<italic>i</italic>
– 1 or
<inline-formula id="IE105">
<mml:math id="IEQ105">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
as that would leave a window of size
<italic>w</italic>
without any
<italic>k</italic>
-mer from
<inline-formula id="IE106">
<mml:math id="IEQ106">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, which is impossible by construction. Hence, when
<italic>j</italic>
 = 1,
<italic>W
<sub>i</sub>
</italic>
cannot be charged and
<inline-formula id="IE107">
<mml:math id="IEQ107">
<mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
has to be 0.</p>
<p>Finally, because
<inline-formula id="IE108">
<mml:math id="IEQ108">
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, we get from equation (5) that
<disp-formula id="E12">
<label>(10)</label>
<mml:math id="EQ12">
<mml:mrow>
<mml:mi mathvariant="bold">E</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mrow>
<mml:mo stretchy="true">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="true">]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo stretchy="true">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">]</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo> </mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
Under the assumption of random input sequence, the sparsity
<inline-formula id="IE109">
<mml:math id="IEQ109">
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
is precisely
<inline-formula id="IE110">
<mml:math id="IEQ110">
<mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">[</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
and we have the main result
<disp-formula id="E13">
<mml:math id="EQ13">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="true">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<xref ref-type="disp-formula" rid="E3">Equation (3)</xref>
implies that if the universal hitting set
<inline-formula id="IE111">
<mml:math id="IEQ111">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is such that there exist
<inline-formula id="IE112">
<mml:math id="IEQ112">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long walks in
<italic>G
<sub>k</sub>
</italic>
with only one element from
<inline-formula id="IE113">
<mml:math id="IEQ113">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, then the expected density factor is less than 2.</p>
<p>A universal hitting set
<inline-formula id="IE114">
<mml:math id="IEQ114">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
that satisfies this condition, that is,
<inline-formula id="IE115">
<mml:math id="IEQ115">
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>></mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, will be called
<italic>w-sparse</italic>
. A
<italic>w</italic>
-sparse universal hitting set has density factor less than 2. See Section 3.7 for an example of
<italic>w</italic>
-sparse universal hitting set with low density.</p>
<p>Similarly to Section 3.2, the sparsity of a universal hitting set
<inline-formula id="IE116">
<mml:math id="IEQ116">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
can be computed exactly as the proportion of
<inline-formula id="IE117">
<mml:math id="IEQ117">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
-long windows with only one
<italic>k</italic>
-mer from
<inline-formula id="IE118">
<mml:math id="IEQ118">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
in the de Bruijn sequence of order
<italic>k </italic>
+
<italic> w</italic>
.</p>
</sec>
<sec>
<title>3.5 Existence of a sparse universal hitting set</title>
<p>We propose here a simple construction of
<italic>w</italic>
-sparse universal hitting sets to show their existence. This construction is not immediately useful in practice as the universal hitting sets it generates are large and have small sparsity (even though they are
<italic>w</italic>
-sparse). Different, more practical constructions are used when we discuss applications of this technique, in Sections 3.7 and 4.1.</p>
<p>Let
<inline-formula id="IE119">
<mml:math id="IEQ119">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
be a simple cycle of length
<inline-formula id="IE120">
<mml:math id="IEQ120">
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo>></mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
in
<inline-formula id="IE121">
<mml:math id="IEQ121">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, the de Bruijn graph of order
<italic>k</italic>
– 1.
<italic>C</italic>
is not necessarily an induced cycle, that is, it may have chords. Because
<italic>G
<sub>k</sub>
</italic>
is the line graph of
<inline-formula id="IE122">
<mml:math id="IEQ122">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, the edges of cycle
<italic>C</italic>
induce a simple cycle
<inline-formula id="IE123">
<mml:math id="IEQ123">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of the same length in
<italic>G
<sub>k</sub>
</italic>
. That is, using indices modulo
<inline-formula id="IE124">
<mml:math id="IEQ124">
<mml:mo></mml:mo>
</mml:math>
</inline-formula>
, node
<inline-formula id="IE125">
<mml:math id="IEQ125">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
of
<italic>G
<sub>k</sub>
</italic>
represents the edge
<inline-formula id="IE126">
<mml:math id="IEQ126">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of
<inline-formula id="IE127">
<mml:math id="IEQ127">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula id="IE128">
<mml:math id="IEQ128">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo></mml:mo>
<mml:mo>=</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
.</p>
<p>We claim that the cycle
<inline-formula id="IE129">
<mml:math id="IEQ129">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
must be chordless. Suppose it is not the case, and there is an edge
<inline-formula id="IE130">
<mml:math id="IEQ130">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
where
<inline-formula id="IE131">
<mml:math id="IEQ131">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
. This implies that in
<inline-formula id="IE132">
<mml:math id="IEQ132">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
the head of edge
<inline-formula id="IE133">
<mml:math id="IEQ133">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
is equal to the tail of
<inline-formula id="IE134">
<mml:math id="IEQ134">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, that is
<inline-formula id="IE135">
<mml:math id="IEQ135">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Because
<inline-formula id="IE136">
<mml:math id="IEQ136">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
and the cycle is simple, this is a contradiction.</p>
<p>Let
<inline-formula id="IE137">
<mml:math id="IEQ137">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
be all the
<italic>k</italic>
-mers, minus the nodes of
<inline-formula id="IE138">
<mml:math id="IEQ138">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
unless their index is a multiple of
<italic>w</italic>
. That is
<inline-formula id="IE139">
<mml:math id="IEQ139">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>V</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
<mml:mi mathvariant="normal"></mml:mi>
<mml:mi>C</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Because
<inline-formula id="IE140">
<mml:math id="IEQ140">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
is chordless, there is no cycle using only nodes not in
<inline-formula id="IE141">
<mml:math id="IEQ141">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Also, by construction, there is no path of length
<italic>w</italic>
in
<inline-formula id="IE142">
<mml:math id="IEQ142">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mi mathvariant="normal"></mml:mi>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. Hence
<inline-formula id="IE143">
<mml:math id="IEQ143">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is a universal hitting set (though not a minimally sized one). By construction, for every
<italic>i</italic>
that is not a multiple of
<italic>w</italic>
,
<inline-formula id="IE144">
<mml:math id="IEQ144">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
contains only one element from
<inline-formula id="IE145">
<mml:math id="IEQ145">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, which implies that
<inline-formula id="IE146">
<mml:math id="IEQ146">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is
<italic>w</italic>
-sparse.</p>
<p>So any cycle of length
<inline-formula id="IE147">
<mml:math id="IEQ147">
<mml:mo></mml:mo>
</mml:math>
</inline-formula>
in
<inline-formula id="IE148">
<mml:math id="IEQ148">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
induces a
<italic>w</italic>
-sparse hitting set for any
<inline-formula id="IE149">
<mml:math id="IEQ149">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="true">(</mml:mo>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. In particular, any Hamiltonian cycle of
<inline-formula id="IE150">
<mml:math id="IEQ150">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
induces a chordless cycle of length
<inline-formula id="IE151">
<mml:math id="IEQ151">
<mml:mrow>
<mml:msup>
<mml:mo>σ</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
in
<italic>G
<sub>k</sub>
</italic>
(these are the longest possible chordless cycles of
<italic>G
<sub>k</sub>
</italic>
). Therefore, for any given
<italic>k</italic>
, there are
<italic>w</italic>
-sparse universal hitting sets for all
<inline-formula id="IE152">
<mml:math id="IEQ152">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="true">(</mml:mo>
<mml:msup>
<mml:mo>σ</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. Hence, there are sets for which
<inline-formula id="IE153">
<mml:math id="IEQ153">
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
is strictly greater than 0, indicating that sets with density factor < 2 exist.</p>
<p>Moreover, for small value of
<italic>w</italic>
, given that every cycle of length
<inline-formula id="IE154">
<mml:math id="IEQ154">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
in
<inline-formula id="IE155">
<mml:math id="IEQ155">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
induces a
<italic>w</italic>
-sparse universal hitting set in
<italic>G
<sub>k</sub>
</italic>
, there exists a large number of
<italic>w</italic>
-sparse universal hitting sets for the parameters
<italic>k</italic>
and
<italic>w</italic>
.</p>
</sec>
<sec>
<title>3.6 Universal hitting sets in random orderings</title>
<p>Consider again the winnowing scheme with random ordering, as we did in Section 3.1. In Section 3.1, it was assumed that all
<italic>w</italic>
 + 1
<italic>k</italic>
-mers in the window
<inline-formula id="IE156">
<mml:math id="IEQ156">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
can potentially be selected as a fingerprint (Hypothesis 1). That assumption is likely not valid because of the existence and abundance of universal hitting sets. That is, even though the ordering of the
<italic>k</italic>
-mers is randomized, not all
<italic>k</italic>
-mers can be selected as minimizers with this ordering.</p>
<p>The following gives hints on why this is true. Consider a random ordering, or permutation, of the
<italic>k</italic>
-mers. Let
<italic>h</italic>
(
<italic>m</italic>
) be the index of the
<italic>k</italic>
-mer
<italic>m</italic>
in this permutation. In other words,
<inline-formula id="IE157">
<mml:math id="IEQ157">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:msub>
<mml:mo><</mml:mo>
<mml:mi>h</mml:mi>
</mml:msub>
<mml:mi>m</mml:mi>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mi>h</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo><</mml:mo>
<mml:mi>h</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo></mml:mo>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. More precisely, consider a
<italic>k</italic>
-mer
<inline-formula id="IE158">
<mml:math id="IEQ158">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>m</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
that has a high index value in the permutation (
<inline-formula id="IE159">
<mml:math id="IEQ159">
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>m</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
close to
<italic>σ
<sup>k</sup>
</italic>
). If there exists a universal hitting set
<inline-formula id="IE160">
<mml:math id="IEQ160">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
such that
<inline-formula id="IE161">
<mml:math id="IEQ161">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:msub>
<mml:mo><</mml:mo>
<mml:mi>h</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>m</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
for all
<inline-formula id="IE162">
<mml:math id="IEQ162">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, then the
<italic>k</italic>
-mer
<inline-formula id="IE163">
<mml:math id="IEQ163">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>m</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
cannot ever be selected by the winnowing scheme with that random ordering. This holds because every window of size
<italic>w</italic>
must contain a
<italic>k</italic>
-mer from
<inline-formula id="IE164">
<mml:math id="IEQ164">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, which is less than
<inline-formula id="IE165">
<mml:math id="IEQ165">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>m</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
in the ordering. Let
<inline-formula id="IE166">
<mml:math id="IEQ166">
<mml:mi mathvariant="script">U</mml:mi>
</mml:math>
</inline-formula>
be the set of all universal hitting sets of minimum size and let
<inline-formula id="IE167">
<mml:math id="IEQ167">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>min</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>U</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">U</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>max</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo></mml:mo>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>h</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. Then any
<italic>k</italic>
-mer with an index greater than
<italic>M</italic>
can never be selected.</p>
<p>As a test, we ran the minimizers scheme (
<italic>k</italic>
 = 10,
<italic>w</italic>
 = 10, binary alphabet) for 1000 different random orderings on a de Bruijn sequence of order
<italic>w </italic>
+
<italic> k</italic>
. For each ordering
<italic>R
<sub>i</sub>
</italic>
, we obtain a set of minimizers
<italic>M
<sub>i</sub>
</italic>
,
<inline-formula id="IE168">
<mml:math id="IEQ168">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1000</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
. Because we used a de Bruijn sequence of order
<italic>w </italic>
+
<italic> k</italic>
, the set
<italic>M
<sub>i</sub>
</italic>
contains a
<italic>k</italic>
-mer from every possible
<italic>w</italic>
-window. Hence
<italic>M
<sub>i</sub>
</italic>
is a, possibly trivial, universal hitting set for parameters
<italic>k</italic>
and
<italic>w</italic>
. Even though it is possible for
<italic>M
<sub>i</sub>
</italic>
to be the set of all
<italic>k</italic>
-mers, this was never the case in our 1000 random orderings, and the sets
<italic>M
<sub>i</sub>
</italic>
contain on average 51% of the
<italic>k</italic>
-mers (see
<xref rid="btx235-T1" ref-type="table">Table 1</xref>
).
<table-wrap id="btx235-T1" orientation="portrait" position="float">
<label>Table 1</label>
<caption>
<p>Statistics on the sparsity and density factor of the universal hitting sets generated by random ordering, the DOCKS universal hitting set and the lexicographic ordering</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col valign="top" align="left" span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1">Ordering</th>
<th rowspan="1" colspan="1">
<inline-formula id="IE169">
<mml:math id="IEQ169">
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</th>
<th rowspan="1" colspan="1">
<inline-formula id="IE170">
<mml:math id="IEQ170">
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mo>σ</mml:mo>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</th>
<th rowspan="1" colspan="1">
<italic>d
<sub>f</sub>
</italic>
</th>
<th rowspan="1" colspan="1">
<inline-formula id="IE171">
<mml:math id="IEQ171">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th align="center" rowspan="1" colspan="1">%</th>
<th align="center" rowspan="1" colspan="1">%</th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">random</td>
<td rowspan="1" colspan="1">0.07</td>
<td rowspan="1" colspan="1">51</td>
<td rowspan="1" colspan="1">1.999</td>
<td rowspan="1" colspan="1">1.998</td>
</tr>
<tr>
<td rowspan="1" colspan="1">DOCKS</td>
<td rowspan="1" colspan="1">13.3</td>
<td rowspan="1" colspan="1">21</td>
<td rowspan="1" colspan="1">1.737</td>
<td rowspan="1" colspan="1">1.733</td>
</tr>
<tr>
<td rowspan="1" colspan="1">lexicographic</td>
<td rowspan="1" colspan="1">0.00</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">2.236</td>
<td rowspan="1" colspan="1">2.000</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="tblfn1">
<p>The computation is done with
<italic>k</italic>
 = 10,
<italic>w</italic>
 = 10 on a binary alphabet. The values for the random ordering are averages over 1000 different randomized orderings.
<italic>d
<sub>f</sub>
</italic>
is the density factor and
<inline-formula id="IE172">
<mml:math id="IEQ172">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is the density factor estimated given the sparsity of the set by
<xref ref-type="disp-formula" rid="E3">equation 3</xref>
. The difference between these numbers is due to the imperfect nature of Hypotheses 2.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>Furthermore, we computed the sparsity
<inline-formula id="IE173">
<mml:math id="IEQ173">
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mo stretchy="true">(</mml:mo>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
of the sets
<italic>M
<sub>i</sub>
</italic>
. In every case, the sparsity was small (average of 0.07%, see
<xref rid="btx235-T1" ref-type="table">Table 1</xref>
), but always greater than 0. Therefore, every observed universal hitting set
<italic>M
<sub>i</sub>
</italic>
is
<italic>w</italic>
-sparse. In other words, a random ordering implicitly defines a universal hitting set, and empirically this set is
<italic>w</italic>
-sparse in all randomized orderings we tested.</p>
</sec>
<sec>
<title>3.7 Density factor of various ordering schemes</title>
<p>
<xref rid="btx235-B8" ref-type="bibr">Orenstein
<italic>et al.</italic>
(2016a</xref>
) introduces the concept of universal hitting sets and provides a heuristic algorithm, called
<italic>DOCKS</italic>
(Design Of Compact
<italic>K</italic>
-mer Sets), to generate small (although not necessarily optimal) universal hitting sets.
<xref rid="btx235-T1" ref-type="table">Table 1</xref>
reports the actual density factor, computed as explained in section 3.2, of the winnowing scheme using three different orderings: randomized, DOCKS universal hitting sets and lexicographic. Because the densities are computed on a de Bruijn sequence of order
<italic>k </italic>
+
<italic> w</italic>
(binary alphabet,
<italic>k</italic>
 = 10 and
<italic>w</italic>
 = 10), the density computed is equal to the expected density.</p>
<p>In
<xref rid="btx235-B12" ref-type="bibr">Schleimer
<italic>et al.</italic>
(2003)</xref>
, the authors prove a lower bound of 1.5, as well as a slight improvement of
<inline-formula id="IE174">
<mml:math id="IEQ174">
<mml:mrow>
<mml:mn>1.5</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, on the density factor for any local scheme. In addition, they conjecture that 2 is in fact the lowest possible density factor. The density factor of 1.737 obtained for DOCKS disproves this conjecture, but is not equal to the lower bound of
<inline-formula id="IE175">
<mml:math id="IEQ175">
<mml:mrow>
<mml:mn>1.5</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>w</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1.55</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, leaving a gap between the empirical results and the best known lower-bound.</p>
<p>The density factor estimation is obtained from the sparsity using
<xref ref-type="disp-formula" rid="E3">equation 3</xref>
. For the random and DOCKS orderings, this formula provides a good estimation of the density factor. The average sparsity of the random ordering is very low but not 0.</p>
<p>The DOCKS ordering, on the other hand, has a relatively high sparsity, which accounts for the low density factor. Also, the number of
<italic>k</italic>
-mers that are minimizers picked by the random and DOCKS orderings is significantly less than the total number of 10-mers.</p>
<p>The situation is very different for the lexicographic ordering, which selected all possible 10-mers and whose sparsity is 0. The estimation of its density factor from its sparsity is erroneous. This means that in addition to not being
<italic>w</italic>
-sparse, the lexicographic ordering does not satisfy Hypothesis 1. This is most likely due to the homo-polymer runs.</p>
</sec>
</sec>
<sec>
<title>4 Application: binning</title>
<sec>
<title>4.1 Binning DNA sequences</title>
<p>We consider the application of binning subsequences of a large DNA sequence, for example, as is done in
<italic>k</italic>
-mer counters such as KMC2 (
<xref rid="btx235-B4" ref-type="bibr">Deorowicz
<italic>et al.</italic>
, 2015</xref>
) and MSPKmerCounter (
<xref rid="btx235-B7" ref-type="bibr">Li and Yan, 2015</xref>
). These applications compute the number of occurrences of each
<italic>k</italic>
-mer in a DNA sequence. They are disk-based programs: first the input sequence is split into bins written to files on disk, then
<italic>k</italic>
-mer occurrences are computed in each bin independently. The number and size of bins considerably affects the running time of these programs.</p>
<p>For example, suppose we count
<italic>L</italic>
-mers in the human genome. Typically, for this application,
<inline-formula id="IE176">
<mml:math id="IEQ176">
<mml:mrow>
<mml:mn>16</mml:mn>
<mml:mo></mml:mo>
<mml:mi>L</mml:mi>
<mml:mo></mml:mo>
<mml:mn>30</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
. Following KMC2, we set
<italic>k</italic>
 = 7 and
<inline-formula id="IE177">
<mml:math id="IEQ177">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, and we bin together all the
<italic>w</italic>
-long subsequences that have the same minimizer for parameters
<italic>k</italic>
and
<italic>w</italic>
. Then, it is possible to count the
<italic>L</italic>
-mers in each bin independently and in parallel to obtain the desired counts.</p>
<p>Ideally we would like a large number of bins, to allow for good parallelism, but not too large, to avoid extra overhead of creating and writing to many small files. Say at least a few hundred bins, but no more than a few thousand. For example, KMC2 uses a heuristic to merge in the same file the smaller bins so as to create at most 2000 files. Also, we would like the amount of data in each bin to be roughly the same. The lexicographic ordering for example generates a much larger bin corresponding to the homo-polymer of
<italic>A</italic>
. In particular, we would like the size of the largest bin to not be too large compared to the average bin size.</p>
<p>We next look at different orderings used in bioinformatics applications and their impact on the number of bins generated and the distribution of the data between the bins. In every test below, the same algorithm is used to generate the bins, and only the selection of the minimizers changes.</p>
</sec>
<sec>
<title>4.2 Summary of heuristic orderings used in practice</title>
<p>The following ordering was proposed by the authors of the minimizers technique in the implementation of the UMD Overlapper (
<xref rid="btx235-B10" ref-type="bibr">Roberts
<italic>et al.</italic>
, 2004a</xref>
). They ‘assign the values 0, 1, 2, 3 to C, A, T, G, respectively, for the odd numbered bases of k-mers and assign 0, 1, 2, 3 to G, T, A, C, respectively, for the even numbered bases.’</p>
<p>In the
<italic>k</italic>
-mer counter KMC2 (
<xref rid="btx235-B4" ref-type="bibr">Deorowicz
<italic>et al.</italic>
, 2015</xref>
), the authors use the lexicographic order but ban a subset of
<italic>k</italic>
-mers from being minimizers. They use as minimizers
<italic>k</italic>
-mers that ‘do not start with
<italic>AAA</italic>
, neither start with
<italic>ACA</italic>
, neither contain
<italic>AA</italic>
anywhere except at their beginning’. This effectively creates a (non-optimally sized) universal hitting set (provided that the homopolymer
<inline-formula id="IE178">
<mml:math id="IEQ178">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo></mml:mo>
<mml:mi>A</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
is preserved).</p>
<p>In Kraken (
<xref rid="btx235-B13" ref-type="bibr">Wood and Salzberg, 2014</xref>
), the authors XOR the
<italic>k</italic>
-mer with a random value, before using lexicographic comparison. It is a form of randomization.</p>
<p>Minimap (
<xref rid="btx235-B6" ref-type="bibr">Li, 2016</xref>
) uses randomization as well, by employing a particular invertible hashing function.</p>
<p>The DOCKS ordering is based on the small universal
<italic>k</italic>
-mer hitting sets created by the DOCKS heuristic. Creating the DOCKS set
<inline-formula id="IE179">
<mml:math id="IEQ179">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
for parameter
<italic>k</italic>
and
<italic>w</italic>
takes
<inline-formula id="IE180">
<mml:math id="IEQ180">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo></mml:mo>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
time and
<inline-formula id="IE181">
<mml:math id="IEQ181">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
memory. We can build DOCKS sets for
<inline-formula id="IE182">
<mml:math id="IEQ182">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>13</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula id="IE183">
<mml:math id="IEQ183">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mn>8</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
within five days and requiring at most 12 GB of memory. In practice, these sets are precomputed once for each combination of
<italic>k</italic>
and
<italic>w</italic>
and then can be used by any algorithm. Precomputed DOCKS sets are available from the DOCKS website (
<xref rid="btx235-B9" ref-type="bibr">Orenstein
<italic>et al.</italic>
, 2016b</xref>
).</p>
</sec>
<sec>
<title>4.3 Densities using various heuristics including a universal
<italic>k</italic>
-mer-based scheme</title>
<p>
<xref ref-type="fig" rid="btx235-F2">Figure 2</xref>
shows the distribution of the distance between consecutive minimizers, for
<italic>k</italic>
 = 7 and
<italic>w</italic>
 = 11, and
<xref rid="btx235-T2" ref-type="table">Table 2</xref>
has statistics on these distributions. The distribution is computed for a de Bruijn sequence of order
<inline-formula id="IE184">
<mml:math id="IEQ184">
<mml:mrow>
<mml:mn>18</mml:mn>
<mml:mo>=</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
, as explained in Section 3.2, and on the full sequence of the human genome (with all ambiguous bases and
<italic>N</italic>
s removed). Ideally, we would like the selection of
<italic>k</italic>
-mers to be as sparse as possible and, therefore, the distribution to be skewed toward larger distances.</p>
<fig id="btx235-F2" orientation="portrait" position="float">
<label>Fig. 2</label>
<caption>
<p>Distribution of the separation between minimizers for
<italic>k</italic>
 = 7 and
<italic>w</italic>
 = 11 on DNA sequences. (A) Results on a de Bruijn sequence of order
<italic>w </italic>
+
<italic> k</italic>
. (B) Results computed on the human reference genome (hg19). Each line represents a different minimizer scheme using a different ordering. Note that previous heuristic orderings all behave like the randomized orderings (uniform distribution) except for separation of 1 and 2. The universal
<italic>k</italic>
-mer ordering computed by DOCKS has a noticeably different distribution, with a mode and a higher mean</p>
</caption>
<graphic xlink:href="btx235f2"></graphic>
</fig>
<p>
<table-wrap id="btx235-T2" orientation="portrait" position="float">
<label>Table 2</label>
<caption>
<p>Statistics on the distribution of the distances between minimizers in Figure 2</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col valign="top" align="left" span="1"></col>
<col valign="top" align="left" span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="center" span="1"></col>
<col valign="top" align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th align="center" rowspan="1" colspan="1">Ordering</th>
<th rowspan="1" colspan="1">
<italic>d
<sub>f</sub>
</italic>
</th>
<th rowspan="1" colspan="1">distance</th>
<th rowspan="1" colspan="1">low sep.</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">mean ± stdev</th>
<th rowspan="1" colspan="1">%</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">dbg</td>
<td rowspan="1" colspan="1">lexico.</td>
<td rowspan="1" colspan="1">2.18</td>
<td rowspan="1" colspan="1">5.5(34)</td>
<td rowspan="1" colspan="1">27</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">random</td>
<td rowspan="1" colspan="1">2.00</td>
<td rowspan="1" colspan="1">6.0(32)</td>
<td rowspan="1" colspan="1">18</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Minimap</td>
<td rowspan="1" colspan="1">2.05</td>
<td rowspan="1" colspan="1">5.9(32)</td>
<td rowspan="1" colspan="1">21</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">KMC2</td>
<td rowspan="1" colspan="1">1.97</td>
<td rowspan="1" colspan="1">6.1(32)</td>
<td rowspan="1" colspan="1">18</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">UMD Ovl</td>
<td rowspan="1" colspan="1">1.91</td>
<td rowspan="1" colspan="1">6.3(30)</td>
<td rowspan="1" colspan="1">14</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Kraken</td>
<td rowspan="1" colspan="1">1.88</td>
<td rowspan="1" colspan="1">6.4(29)</td>
<td rowspan="1" colspan="1">11</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">DOCKS</td>
<td rowspan="1" colspan="1">1.75</td>
<td rowspan="1" colspan="1">6.9(25)</td>
<td rowspan="1" colspan="1">4.6</td>
</tr>
<tr>
<td rowspan="1" colspan="1">human</td>
<td rowspan="1" colspan="1">lexico.</td>
<td rowspan="1" colspan="1">2.34</td>
<td rowspan="1" colspan="1">5.1(34)</td>
<td rowspan="1" colspan="1">33</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">random</td>
<td rowspan="1" colspan="1">2.02</td>
<td rowspan="1" colspan="1">6.0(32)</td>
<td rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Minimap</td>
<td rowspan="1" colspan="1">2.09</td>
<td rowspan="1" colspan="1">5.8(33)</td>
<td rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">KMC2</td>
<td rowspan="1" colspan="1">2.02</td>
<td rowspan="1" colspan="1">5.9(33)</td>
<td rowspan="1" colspan="1">19</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">UMD Ovl</td>
<td rowspan="1" colspan="1">1.97</td>
<td rowspan="1" colspan="1">6.1(31)</td>
<td rowspan="1" colspan="1">17</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Kraken</td>
<td rowspan="1" colspan="1">1.93</td>
<td rowspan="1" colspan="1">6.2(30)</td>
<td rowspan="1" colspan="1">13</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">DOCKS</td>
<td rowspan="1" colspan="1">1.77</td>
<td rowspan="1" colspan="1">6.7(26)</td>
<td rowspan="1" colspan="1">6.2</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="tblfn2">
<p>The table reports the density factor (
<italic>d
<sub>f</sub>
</italic>
), the mean distance between minimizers (mean ± stdev) and the percentage of selected
<italic>k</italic>
-mers that are consecutive or separated by one base (low sep.). These were computed on a de Bruijn sequence (dbg) and on the human genome sequence (human).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>The overall behavior for the different orderings does not change quantitatively between the de Bruijn sequence and the DNA sequence, showing that in the limit, the expected density computed on a de Bruijn sequence is a good proxy for performance on a biological sequence. The distribution of the randomized ordering is close to a uniform distribution. All the orderings succeeded in their stated goal of reducing the number of cases where minimizers have a low separation (e.g.
<italic>k</italic>
-mers that are consecutive or separated by one base) compared to the lexicographic ordering. For larger separation between minimizers, the distribution for all orderings except DOCKS are very similar to that of the randomized ordering.</p>
<p>The distribution for the DOCKS ordering is markedly different, with a mode at 6 (which is
<inline-formula id="IE185">
<mml:math id="IEQ185">
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="true">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
) and is generally skewed toward larger separation. The percentage of minimizers with low separation is the lowest among all orderings. This provides strong evidence that using a universal
<italic>k</italic>
-mer-based ordering can reduce the number of minimizers selected.</p>
</sec>
<sec>
<title>4.4 Distribution of bins created by heuristic and universal
<italic>k</italic>
-mer based schemes</title>
<p>
<xref rid="btx235-T3" ref-type="table">Table 3</xref>
reports statistics on the number and sizes of the bins created using various orderings. The DOCKS-based ordering creates far fewer bins than the other orderings, and the ratio between the largest bin and the average size is smaller. The DOCKS distribution of bin sizes is closer to the stated goals of not having too many bins, or any bins that are too large, and according to the Kullback-Leibler divergence, the distribution for DOCKS is closer to uniform than any of the other orderings. These results indicate that if these tools adopted a universal
<italic>k</italic>
-mer ordering, their runtime and memory usage performance would be significantly improved.
<table-wrap id="btx235-T3" orientation="portrait" position="float">
<label>Table 3</label>
<caption>
<p>Statistics on the distribution of the bin sizes</p>
</caption>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col valign="top" align="left" span="1"></col>
<col valign="top" align="left" span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
<col valign="top" align="center" span="1"></col>
<col valign="top" align="char" char="." span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">Ordering</th>
<th rowspan="1" colspan="1"># bins</th>
<th rowspan="1" colspan="1">avg size</th>
<th rowspan="1" colspan="1">max ratio</th>
<th rowspan="1" colspan="1">
<inline-formula id="IE186">
<mml:math id="IEQ186">
<mml:mrow>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext>KL</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">(mega bases)</th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">dbg</td>
<td rowspan="1" colspan="1">lexico.</td>
<td rowspan="1" colspan="1">16384</td>
<td rowspan="1" colspan="1">4.19</td>
<td rowspan="1" colspan="1">367</td>
<td rowspan="1" colspan="1">1.37</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">random</td>
<td rowspan="1" colspan="1">12003</td>
<td rowspan="1" colspan="1">5.73</td>
<td rowspan="1" colspan="1">9.32</td>
<td rowspan="1" colspan="1">0.23</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Minimap</td>
<td rowspan="1" colspan="1">13267</td>
<td rowspan="1" colspan="1">5.18</td>
<td rowspan="1" colspan="1">10.4</td>
<td rowspan="1" colspan="1">0.27</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">KMC2</td>
<td rowspan="1" colspan="1">12370</td>
<td rowspan="1" colspan="1">5.56</td>
<td rowspan="1" colspan="1">166</td>
<td rowspan="1" colspan="1">1.13</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">UMD Ovl</td>
<td rowspan="1" colspan="1">13108</td>
<td rowspan="1" colspan="1">5.24</td>
<td rowspan="1" colspan="1">274</td>
<td rowspan="1" colspan="1">1.37</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Kraken</td>
<td rowspan="1" colspan="1">12502</td>
<td rowspan="1" colspan="1">5.5</td>
<td rowspan="1" colspan="1">210</td>
<td rowspan="1" colspan="1">1.31</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">DOCKS</td>
<td rowspan="1" colspan="1">4063</td>
<td rowspan="1" colspan="1">16.9</td>
<td rowspan="1" colspan="1">7.44</td>
<td rowspan="1" colspan="1">0.05</td>
</tr>
<tr>
<td rowspan="1" colspan="1">human</td>
<td rowspan="1" colspan="1">lexico.</td>
<td rowspan="1" colspan="1">16285</td>
<td rowspan="1" colspan="1">0.175</td>
<td rowspan="1" colspan="1">96.4</td>
<td rowspan="1" colspan="1">0.91</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">random</td>
<td rowspan="1" colspan="1">11388</td>
<td rowspan="1" colspan="1">0.251</td>
<td rowspan="1" colspan="1">38.3</td>
<td rowspan="1" colspan="1">0.62</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Minimap</td>
<td rowspan="1" colspan="1">12280</td>
<td rowspan="1" colspan="1">0.233</td>
<td rowspan="1" colspan="1">69.5</td>
<td rowspan="1" colspan="1">0.65</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">KMC2</td>
<td rowspan="1" colspan="1">12287</td>
<td rowspan="1" colspan="1">0.233</td>
<td rowspan="1" colspan="1">74.5</td>
<td rowspan="1" colspan="1">0.77</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">UMD Ovl</td>
<td rowspan="1" colspan="1">11015</td>
<td rowspan="1" colspan="1">0.259</td>
<td rowspan="1" colspan="1">26.1</td>
<td rowspan="1" colspan="1">0.61</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">Kraken</td>
<td rowspan="1" colspan="1">10389</td>
<td rowspan="1" colspan="1">0.275</td>
<td rowspan="1" colspan="1">26.3</td>
<td rowspan="1" colspan="1">0.57</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">DOCKS</td>
<td rowspan="1" colspan="1">4046</td>
<td rowspan="1" colspan="1">0.706</td>
<td rowspan="1" colspan="1">23.5</td>
<td rowspan="1" colspan="1">0.12</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="tblfn3">
<p>The table reports the number of bins created (# bins), the average bin size (avg size, in million of bases), the ratio of the largest bin size to the average (max ratio), and the Kullback-Leibler divergence,
<inline-formula id="IE187">
<mml:math id="IEQ187">
<mml:mrow>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext>KL</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mi>U</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, between
<italic>B</italic>
, the distribution of the bin sizes, and
<italic>U</italic>
, the uniform distribution (a smaller divergence implies that the distribution
<italic>B</italic>
is closer to the uniform distribution). These were computed on a de Bruijn sequence (dbg) and the human genome sequence (human).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
</sec>
<sec>
<title>5 Discussion</title>
<p>We introduced a new analysis of the performance of the winnowing scheme based on the sparsity of an ordering. In addition, we showed how to construct orderings from small universal hitting sets that outperform randomized and previously designed orderings. While these new orderings perform well on uniformly distributed sequences and actual DNA sequences, many questions of theoretical and practical interest remain.</p>
<p>First, several orderings based on universal hitting sets beat the conjectured lower bound on the density factor of 2, but did not achieve
<inline-formula id="IE188">
<mml:math id="IEQ188">
<mml:mrow>
<mml:mn>1.5</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
. The question of the actual lower bound is therefore open. Second, every winnowing local scheme considered here is based on an ordering of the
<italic>k</italic>
-mers. But there exists local schemes that are not based on an ordering. Can the minimum density always be achieved with an ordering? Third, although the performance was similar between the uniformly distributed input sequence and DNA sequence, can we improve the winnowing scheme by taking into account the actual distribution of the bases in the input?</p>
<p>Fourth, the construction of small universal hitting sets with DOCKS is based on a heuristic and is not guaranteed to be optimal. As it stands, it is also intractable for large
<italic>k</italic>
. Can we design an efficient algorithm to construct some (or all) of the optimal universal hitting sets? Or, can we design optimal orderings, even though the actual corresponding universal hitting set is never explicitly constructed?</p>
<p>All of the work in this paper considers a de Bruijn graph. In many situations, it makes sense to consider a
<italic>k</italic>
-mer and its reverse complement to be equal. The above schemes can be used by replacing every
<italic>k</italic>
-mer with its canonical representation (i.e. the smallest of a
<italic>k</italic>
-mer and its reverse complement). But this is likely not optimal. Define the
<italic>RC de Bruijn graph</italic>
from a de Bruijn graph where nodes representing reverse complements are identified and parallel edges are collapsed into a single edge. A fifth question arises in this setting: can an equivalent to universal hitting sets, and orders, be defined so as to create better winnowing schemes when reversed complemented
<italic>k</italic>
-mers are identified? How much can we transfer of what we already know on de Bruijn graphs to RC de Bruijn graphs?</p>
</sec>
<sec>
<title>6 Conclusion</title>
<p>We provided a novel theoretical framework to estimate the density of minimizer procedures. This framework explains why it is possible for
<italic>w</italic>
-sparse local winnowing schemes to have lower density factor than 2. In addition, we provided a practical method to create from small universal hitting sets, such as those generated by DOCKS,
<italic>w</italic>
-sparse winnowing schemes that achieve density factors below 2. These two results combined answer negatively a standing conjecture of Schleimer
<italic>et al.</italic>
</p>
<p>We showed that the lexicographic ordering has the worst behavior while the different orderings used in current software tools perform similarly to randomized orderings, which is slightly better. We showed that universal hitting set based schemes can perform substantially better than random orderings in practice. Therefore, for the development of bioinformatics software using minimizers, we suggest using an ordering based on small universal hitting sets such as those created by the DOCKS algorithm, or, if not practical, randomized orderings. We showed that for binning applications, doing so would lead to a significant improvement in computational efficiency.</p>
</sec>
<sec>
<title>Funding</title>
<p>This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R01HG007104). D.P. was supported in part by a Ph.D. fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. R.S. was supported in part by the Israel Science Foundation as part of the ISF-NSFC joint program 2015-2018.</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="btx235-B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
)
<article-title>On the representation of De Bruijn graphs</article-title>
.
<source>J. Comput. Biol</source>
.,
<volume>22</volume>
,
<fpage>336</fpage>
<lpage>352</lpage>
.
<pub-id pub-id-type="pmid">25629448</pub-id>
</mixed-citation>
</ref>
<ref id="btx235-B2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2016</year>
)
<article-title>Compacting de Bruijn graphs from sequencing data quickly and in low memory</article-title>
.
<source>Bioinformatics</source>
,
<volume>32</volume>
,
<fpage>i201</fpage>
<lpage>i208</lpage>
.
<pub-id pub-id-type="pmid">27307618</pub-id>
</mixed-citation>
</ref>
<ref id="btx235-B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>de Bruijn</surname>
<given-names>N.G.</given-names>
</name>
</person-group>
(
<year>1946</year>
)
<article-title>A combinatorial problem</article-title>
.
<source>Proceedings of the Section of Sciences of the Koninklijke Nederlandse Akademie Van Wetenschappen Te Amsterdam</source>
,
<volume>49</volume>
,
<fpage>758</fpage>
<lpage>764</lpage>
.</mixed-citation>
</ref>
<ref id="btx235-B4">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Deorowicz</surname>
<given-names>S.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2015</year>
)
<article-title>KMC 2: fast and resource-frugal k-mer counting</article-title>
.
<source>Bioinformatics</source>
,
<volume>31</volume>
,
<fpage>1569</fpage>
<lpage>1576</lpage>
.
<pub-id pub-id-type="pmid">25609798</pub-id>
</mixed-citation>
</ref>
<ref id="btx235-B5">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Grabowski</surname>
<given-names>S.</given-names>
</name>
,
<name name-style="western">
<surname>Raniszewski</surname>
<given-names>M.</given-names>
</name>
</person-group>
(
<year>2015</year>
)
<chapter-title>Sampling the suffix array with minimizers</chapter-title>
In:
<person-group person-group-type="editor">
<name name-style="western">
<surname>Iliopoulos</surname>
<given-names>C.</given-names>
</name>
</person-group>
<etal>et al</etal>
(eds.)
<source>String Processing and Information Retrieval: 22nd International Symposium</source>
, SPIRE 2015, London, UK, September 1-4, 2015, Proceedings.
<publisher-name>Springer International Publishing</publisher-name>
,
<publisher-loc>Berlin</publisher-loc>
, pp. 287–298.</mixed-citation>
</ref>
<ref id="btx235-B6">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group>
(
<year>2016</year>
)
<article-title>Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences</article-title>
.
<source>Bioinformatics</source>
,
<volume>32</volume>
,
<fpage>2103</fpage>
<lpage>2110</lpage>
.
<pub-id pub-id-type="pmid">27153593</pub-id>
</mixed-citation>
</ref>
<ref id="btx235-B7">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
,
<name name-style="western">
<surname>Yan</surname>
<given-names>X.</given-names>
</name>
</person-group>
(
<year>2015</year>
) MSPKmerCounter: a fast and memory efficient approach for K-mer counting.
<italic>arXiv:1505.06550 [cs, q-bio].</italic>
arXiv: 1505.06550.</mixed-citation>
</ref>
<ref id="btx235-B8">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Orenstein</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2016a</year>
)
<chapter-title>Compact universal k-mer hitting sets</chapter-title>
In:
<person-group person-group-type="editor">
<name name-style="western">
<surname>Frith</surname>
<given-names>M.</given-names>
</name>
,
<name name-style="western">
<surname>Pedersen</surname>
<given-names>C.N.S.</given-names>
</name>
</person-group>
(eds.)
<source>Algorithms in Bioinformatics</source>
, number 9838 in Lecture Notes in Computer Science.
<publisher-name>Springer International Publishing</publisher-name>
,
<publisher-loc>Berlin</publisher-loc>
, pp.
<fpage>257</fpage>
<lpage>268</lpage>
.</mixed-citation>
</ref>
<ref id="btx235-B9">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Orenstein</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2016b</year>
) DOCKS website.
<ext-link ext-link-type="uri" xlink:href="http://acgt.cs.tau.ac.il/docks/">http://acgt.cs.tau.ac.il/docks/</ext-link>
.</mixed-citation>
</ref>
<ref id="btx235-B10">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Roberts</surname>
<given-names>M.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2004a</year>
)
<article-title>A preprocessor for shotgun assembly of large genomes</article-title>
.
<source>J. Comput. Biol</source>
.,
<volume>11</volume>
,
<fpage>734</fpage>
<lpage>752</lpage>
.
<pub-id pub-id-type="pmid">15579242</pub-id>
</mixed-citation>
</ref>
<ref id="btx235-B11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Roberts</surname>
<given-names>M.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2004b</year>
)
<article-title>Reducing storage requirements for biological sequence comparison</article-title>
.
<source>Bioinformatics</source>
,
<volume>20</volume>
,
<fpage>3363</fpage>
<lpage>3369</lpage>
.
<pub-id pub-id-type="pmid">15256412</pub-id>
</mixed-citation>
</ref>
<ref id="btx235-B12">
<mixed-citation publication-type="other">
<person-group person-group-type="author">
<name name-style="western">
<surname>Schleimer</surname>
<given-names>S.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2003</year>
) Winnowing: Local Algorithms for Document Fingerprinting. In:
<italic>Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data</italic>
, SIGMOD ’03, New York, NY, USA. ACM, pp. 76–85.</mixed-citation>
</ref>
<ref id="btx235-B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Wood</surname>
<given-names>D.E.</given-names>
</name>
,
<name name-style="western">
<surname>Salzberg</surname>
<given-names>S.L.</given-names>
</name>
</person-group>
(
<year>2014</year>
)
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
.
<source>Genome Biol</source>
.,
<volume>15</volume>
,
<fpage>R46.</fpage>
<pub-id pub-id-type="pmid">24580807</pub-id>
</mixed-citation>
</ref>
<ref id="btx235-B14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ye</surname>
<given-names>C.</given-names>
</name>
</person-group>
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>Exploiting sparseness in de novo genome assembly</article-title>
.
<source>BMC Bioinformatics</source>
,
<volume>13</volume>
,
<fpage>S1.</fpage>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B19 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B19 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5870760
   |texte=   Improving the performance of minimizers and winnowing schemes
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:28881970" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021