Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000F86 ( Pmc/Corpus ); précédent : 000F859; suivant : 000F870 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28968408</idno>
<idno type="pmc">5645146</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5645146</idno>
<idno type="RBID">PMC:5645146</idno>
<idno type="doi">10.1371/journal.pcbi.1005777</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000F86</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F86</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS Computational Biology</title>
<idno type="ISSN">1553-734X</idno>
<idno type="eISSN">1553-7358</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers
<italic>k</italic>
and
<italic>L</italic>
>
<italic>k</italic>
, we say that a set of
<italic>k</italic>
-mers is a
<italic>universal hitting set</italic>
(UHS) if every possible
<italic>L</italic>
-long sequence must contain a
<italic>k</italic>
-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of
<italic>k</italic>
and
<italic>L</italic>
and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Hayes, W" uniqKey="Hayes W">W Hayes</name>
</author>
<author>
<name sortKey="Hunt, Br" uniqKey="Hunt B">BR Hunt</name>
</author>
<author>
<name sortKey="Mount, Sm" uniqKey="Mount S">SM Mount</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solomon, B" uniqKey="Solomon B">B Solomon</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solomon, B" uniqKey="Solomon B">B Solomon</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author>
<name sortKey="Kokot, M" uniqKey="Kokot M">M Kokot</name>
</author>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
<author>
<name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A Debudaj-Grabysz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Limasset, A" uniqKey="Limasset A">A Limasset</name>
</author>
<author>
<name sortKey="Jackman, S" uniqKey="Jackman S">S Jackman</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C Ye</name>
</author>
<author>
<name sortKey="Ma, Zs" uniqKey="Ma Z">ZS Ma</name>
</author>
<author>
<name sortKey="Cannon, Ch" uniqKey="Cannon C">CH Cannon</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Douglas, Wy" uniqKey="Douglas W">WY Douglas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, De" uniqKey="Wood D">DE Wood</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hach, F" uniqKey="Hach F">F Hach</name>
</author>
<author>
<name sortKey="Numanagi, I" uniqKey="Numanagi I">I Numanagić</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Sahinalp, Sc" uniqKey="Sahinalp S">SC Sahinalp</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Champarnaud, Jm" uniqKey="Champarnaud J">JM Champarnaud</name>
</author>
<author>
<name sortKey="Hansel, G" uniqKey="Hansel G">G Hansel</name>
</author>
<author>
<name sortKey="Perrin, D" uniqKey="Perrin D">D Perrin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mykkeltveit, J" uniqKey="Mykkeltveit J">J Mykkeltveit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chvatal, V" uniqKey="Chvatal V">V Chvatal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rhoads, A" uniqKey="Rhoads A">A Rhoads</name>
</author>
<author>
<name sortKey="Au, Kf" uniqKey="Au K">KF Au</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Branton, D" uniqKey="Branton D">D Branton</name>
</author>
<author>
<name sortKey="Deamer, Dw" uniqKey="Deamer D">DW Deamer</name>
</author>
<author>
<name sortKey="Marziali, A" uniqKey="Marziali A">A Marziali</name>
</author>
<author>
<name sortKey="Bayley, H" uniqKey="Bayley H">H Bayley</name>
</author>
<author>
<name sortKey="Benner, Sa" uniqKey="Benner S">SA Benner</name>
</author>
<author>
<name sortKey="Butler, T" uniqKey="Butler T">T Butler</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">ploscomp</journal-id>
<journal-title-group>
<journal-title>PLoS Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1553-734X</issn>
<issn pub-type="epub">1553-7358</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28968408</article-id>
<article-id pub-id-type="pmc">5645146</article-id>
<article-id pub-id-type="publisher-id">PCOMPBIOL-D-17-00243</article-id>
<article-id pub-id-type="doi">10.1371/journal.pcbi.1005777</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Human Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Computer and Information Sciences</subject>
<subj-group>
<subject>Graph Theory</subject>
<subj-group>
<subject>Directed Graphs</subject>
<subj-group>
<subject>Directed Acyclic Graphs</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical Sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Graph Theory</subject>
<subj-group>
<subject>Directed Graphs</subject>
<subj-group>
<subject>Directed Acyclic Graphs</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Animal Genomics</subject>
<subj-group>
<subject>Invertebrate Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Microbiology</subject>
<subj-group>
<subject>Bacteriology</subject>
<subj-group>
<subject>Bacterial Genetics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Microbial Genetics</subject>
<subj-group>
<subject>Bacterial Genetics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Microbial Genomics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Microbiology</subject>
<subj-group>
<subject>Microbial Genomics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical Sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Applied Mathematics</subject>
<subj-group>
<subject>Algorithms</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Simulation and Modeling</subject>
<subj-group>
<subject>Algorithms</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Molecular biology</subject>
<subj-group>
<subject>Molecular biology techniques</subject>
<subj-group>
<subject>Sequencing techniques</subject>
<subj-group>
<subject>DNA sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Molecular biology techniques</subject>
<subj-group>
<subject>Sequencing techniques</subject>
<subj-group>
<subject>DNA sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical Sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Algebra</subject>
<subj-group>
<subject>Polynomials</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</article-title>
<alt-title alt-title-type="running-head">Compact universal
<italic>k</italic>
-mer hitting sets</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Orenstein</surname>
<given-names>Yaron</given-names>
</name>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Software</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Visualization</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Pellow</surname>
<given-names>David</given-names>
</name>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0002-5083-5925</contrib-id>
<name>
<surname>Marçais</surname>
<given-names>Guillaume</given-names>
</name>
<role content-type="http://credit.casrai.org/">Conceptualization</role>
<role content-type="http://credit.casrai.org/">Data curation</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-1889-9870</contrib-id>
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
<role content-type="http://credit.casrai.org/">Funding acquisition</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Resources</role>
<role content-type="http://credit.casrai.org/">Supervision</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
<role content-type="http://credit.casrai.org/">Funding acquisition</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Resources</role>
<role content-type="http://credit.casrai.org/">Supervision</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</aff>
<aff id="aff003">
<label>3</label>
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Raphael</surname>
<given-names>Benjamin J.</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>Princeton University, UNITED STATES</addr-line>
</aff>
<author-notes>
<fn fn-type="COI-statement" id="coi001">
<p>The authors have declared that no competing interests exist.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>carlk@cs.cmu.edu</email>
(CK);
<email>rshamir@cs.tau.ac.il</email>
(RS)</corresp>
</author-notes>
<pub-date pub-type="collection">
<month>10</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub">
<day>2</day>
<month>10</month>
<year>2017</year>
</pub-date>
<volume>13</volume>
<issue>10</issue>
<elocation-id>e1005777</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>2</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>9</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© 2017 Orenstein et al</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Orenstein et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="pcbi.1005777.pdf"></self-uri>
<abstract>
<p>With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers
<italic>k</italic>
and
<italic>L</italic>
>
<italic>k</italic>
, we say that a set of
<italic>k</italic>
-mers is a
<italic>universal hitting set</italic>
(UHS) if every possible
<italic>L</italic>
-long sequence must contain a
<italic>k</italic>
-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of
<italic>k</italic>
and
<italic>L</italic>
and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.</p>
</abstract>
<abstract abstract-type="summary">
<title>Author summary</title>
<p>High-throughput sequencing data has been accumulating at an extreme pace. The need to efficiently analyze and process it has become a critical challenge of the field. Many of the data structures and algorithms for this task rely on
<italic>k</italic>
-mer sets (DNA words of length
<italic>k</italic>
) to represent the sequences in a dataset. The runtime and memory usage of these highly depend on the size of the
<italic>k</italic>
-mer sets used. Thus, a minimum-size
<italic>k</italic>
-mer hitting set, namely, a set of
<italic>k</italic>
-mers that hit (have non-empty overlap with) all sequences, is desirable. In this work, we create universal
<italic>k</italic>
-mer hitting sets that hit any
<italic>L</italic>
-long sequence. We present several heuristic approaches for constructing such small sets; the approaches vary in the trade-off between the size of the produced set and runtime and memory usage. We show the benefit in practice of using the produced universal
<italic>k</italic>
-mer hitting sets compared to minimizers and randomly created hitting sets on the human genome.</p>
</abstract>
<funding-group>
<award-group id="award001">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/501100003977</institution-id>
<institution>Israel Science Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>ISF-NSFC joint program 2015-2018</award-id>
<principal-award-recipient>
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-1889-9870</contrib-id>
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award002">
<funding-source>
<institution>Edmond J. Safra Center for Bioinformatics at Tel Aviv University</institution>
</funding-source>
<award-id>Ph.D. fellowship</award-id>
<principal-award-recipient>
<name>
<surname>Pellow</surname>
<given-names>David</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award003">
<funding-source>
<institution>Gordon and Betty Moore Foundation (US)</institution>
</funding-source>
<award-id>Data-Driven Discovery Initiative through Grant GBMF4554</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award004">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000001</institution-id>
<institution>National Science Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>CCF-1256087</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award005">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000001</institution-id>
<institution>National Science Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>CCF-1319998</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award006">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000002</institution-id>
<institution>National Institutes of Health</institution>
</institution-wrap>
</funding-source>
<award-id>R01HG007104</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award007">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000879</institution-id>
<institution>Alfred P. Sloan Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>Fellow</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<funding-statement>RS was supported in part by the Israel Science Foundation as part of the ISF-NSFC joint program 2015-2018. DP was supported in part by a Ph.D. fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University and in part by an Israel Ministry of Immigrant Absorption fellowship. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to CK, by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R01HG007104). CK received support as an Alfred P. Sloan Research Fellow. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<fig-count count="3"></fig-count>
<table-count count="2"></table-count>
<page-count count="15"></page-count>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>PLOS Publication Stage</meta-name>
<meta-value>vor-update-to-uncorrected-proof</meta-value>
</custom-meta>
<custom-meta>
<meta-name>Publication Update</meta-name>
<meta-value>2017-10-17</meta-value>
</custom-meta>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>All relevant data are within the paper and its Supporting Information files. The software is available from github.com/Shamir-Lab/DOCKS and k-mer sets are available from acgt.cs.tau.ac.il/docks website.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>All relevant data are within the paper and its Supporting Information files. The software is available from github.com/Shamir-Lab/DOCKS and k-mer sets are available from acgt.cs.tau.ac.il/docks website.</p>
</notes>
</front>
<body>
<sec sec-type="intro" id="sec001">
<title>Introduction</title>
<p>The pace of high-throughput sequencing keeps accelerating as it becomes cheaper and faster and with it the need for faster and more memory efficient genomic analysis methods grows. The NIH Sequence Read Archive, for example, currently contains over 12 petabases of sequence data and is growing at a fast pace. Increased use of sequence-based assays (DNA sequencing, RNA-seq, numerous other “*-seq”s) in research and in clinical settings creates high computational processing burdens. Metagenomic studies generate even larger sequencing datasets. New fundamental computational ideas are essential to manage and analyze these data.</p>
<p>The minimizer approach has been extremely successful in increasing the efficiency of several sequence analysis challenges. Given a sequence of length
<italic>L</italic>
, its
<italic>minimizer</italic>
is the lexicographically smallest
<italic>k</italic>
-mer in it [
<xref rid="pcbi.1005777.ref001" ref-type="bibr">1</xref>
,
<xref rid="pcbi.1005777.ref002" ref-type="bibr">2</xref>
]. For a sequence
<italic>S</italic>
of any length its
<italic>minimizer set</italic>
is the set of minimizers of every
<italic>L</italic>
-long subsequence in
<italic>S</italic>
. Hence, every window of length
<italic>L</italic>
in
<italic>S</italic>
is represented in the set, and a minimizer set for a sequence
<italic>S</italic>
constitutes a succinct representation for it. As we discuss below, minimizers have had numerous applications in sequence analysis.</p>
<p>Here, we generalize and improve on the minimizer idea. To avoid dependence on a particular sequence
<italic>S</italic>
, we introduce the notion of a universal hitting set. For integers
<italic>k</italic>
,
<italic>L</italic>
, a set
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
is called a
<italic>universal hitting set</italic>
of
<italic>k</italic>
-mers (UHS) if every possible sequence of length
<italic>L</italic>
must contain at least one
<italic>k</italic>
-mer from
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
. The set of all
<italic>k</italic>
-mers is a trivial UHS, but it does not provide any useful reduction in computational resources needed. Hence, our main computational problem is:</p>
<p>
<italic>Problem 1</italic>
. Given
<italic>k</italic>
and
<italic>L</italic>
, find a smallest universal hitting set of
<italic>k</italic>
-mers.</p>
<p>A small UHS has a variety of applications in speeding up genomic analyses since it can be used where minimizers have been used in the past. For example:</p>
<list list-type="order">
<list-item>
<p>
<bold>Hashing for read overlapping.</bold>
A naïve read overlapper must test
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
) pairs of reads to see whether they overlap (where
<italic>n</italic>
is the number of reads). If we require an overlap of length
<italic>L</italic>
, any pair of reads with such an overlap must share a
<italic>k</italic>
-mer from set
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
in this overlapped region. By bucketing reads into bins according to the universal
<italic>k</italic>
-mers they contain, we need only test pairs of reads in the same bucket. The number of buckets is limited by |
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
|.</p>
</list-item>
<list-item>
<p>
<bold>Sparse suffix arrays.</bold>
A sparse suffix array of a string
<italic>S</italic>
saves memory by storing an index for only every
<italic>s</italic>
th position in
<italic>S</italic>
[
<xref rid="pcbi.1005777.ref003" ref-type="bibr">3</xref>
]. To query a sparse suffix array for string
<italic>q</italic>
, we perform at most
<italic>s</italic>
queries starting from indices 0, …,
<italic>s</italic>
− 1 in
<italic>q</italic>
; one of these queries will intersect a position stored in the suffix array. Using
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
, we can instead store only positions in
<italic>S</italic>
that start with a
<italic>k</italic>
-mer in
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
. Any query with |
<italic>q</italic>
| ≥
<italic>L</italic>
must contain one of these selected
<italic>k</italic>
-mers and will be matched when searching the suffix array. This approach has been applied with minimizers [
<xref rid="pcbi.1005777.ref004" ref-type="bibr">4</xref>
] to good effect.</p>
</list-item>
<list-item>
<p>
<bold>Bloom filters to speed up sequence search.</bold>
Bloom filters have been used to speed up sequence search by storing
<italic>k</italic>
-mers present in a read set for quick testing [
<xref rid="pcbi.1005777.ref005" ref-type="bibr">5</xref>
,
<xref rid="pcbi.1005777.ref006" ref-type="bibr">6</xref>
]. In current implementations, all
<italic>k</italic>
-mers present in a read set are stored in these filters. If, instead, only the set of
<italic>k</italic>
-mers in
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
is stored, any window of length ≥
<italic>L</italic>
is still guaranteed to contain one of these representative queries, potentially reducing the size of Bloom filters that must be maintained.</p>
</list-item>
</list>
<p>Minimizers have been used for some of these and similar applications [
<xref rid="pcbi.1005777.ref004" ref-type="bibr">4</xref>
,
<xref rid="pcbi.1005777.ref007" ref-type="bibr">7</xref>
<xref rid="pcbi.1005777.ref009" ref-type="bibr">9</xref>
]. They were originally introduced by Roberts
<italic>et al</italic>
. [
<xref rid="pcbi.1005777.ref001" ref-type="bibr">1</xref>
] for genome assembly. The same idea was introduced independently for plagiarism detection in Schleimer
<italic>et al</italic>
. [
<xref rid="pcbi.1005777.ref002" ref-type="bibr">2</xref>
]. For example, MSP [
<xref rid="pcbi.1005777.ref010" ref-type="bibr">10</xref>
] compresses
<italic>k</italic>
-mers by hashing them to their 4-mer minimizer to efficiently construct a de Bruijn graph for assembly. SparseAssembler [
<xref rid="pcbi.1005777.ref011" ref-type="bibr">11</xref>
] represents the de Bruijn graph using only every
<italic>g</italic>
-th
<italic>k</italic>
-mer in the sequence (and has also been implemented using minimizers). Kraken [
<xref rid="pcbi.1005777.ref012" ref-type="bibr">12</xref>
] uses minimizers to speed up database queries for
<italic>k</italic>
-mers during metagenome sequence classification. KMC 2 [
<xref rid="pcbi.1005777.ref008" ref-type="bibr">8</xref>
] uses minimizers to cluster subsequences for counting
<italic>k</italic>
-mer occurrences. The Locally Consistent Parsing (LCP) [
<xref rid="pcbi.1005777.ref013" ref-type="bibr">13</xref>
] algorithm provides the concept of “core substrings” which, like minimizers, are guaranteed to be shared by long enough identical strings. SCALCE [
<xref rid="pcbi.1005777.ref014" ref-type="bibr">14</xref>
] uses core substrings to compress DNA sequences.</p>
<p>A small UHS, if it can be found, has a number of advantages over minimizers for these applications:</p>
<list list-type="order">
<list-item>
<p>The set of minimizers for a given collection of reads may be as dense as the complete set of
<italic>k</italic>
-mers (size |Σ|
<sup>
<italic>k</italic>
</sup>
for an alphabet Σ), whereas we show that we can often generate UHSs smaller by a factor of nearly
<italic>k</italic>
. We also demonstrate on real genomic sequences that the number of UHS
<italic>k</italic>
-mers needed to process them is substantially smaller.</p>
</list-item>
<list-item>
<p>For any
<italic>k</italic>
and
<italic>L</italic>
, a set of universal
<italic>k</italic>
-mers needs to be computed only once and not recomputed for every dataset.</p>
</list-item>
<list-item>
<p>The hash buckets, sparse suffix arrays, and Bloom filters created for different datasets will contain a comparable set of
<italic>k</italic>
-mers if they are sampled according to a UHS. This will enable easier comparison and integration of the datasets.</p>
</list-item>
<list-item>
<p>One does not need to look at the reads or to build a dataset-specific de Bruijn graph in order to decide which
<italic>k</italic>
-mers to use.</p>
</list-item>
</list>
<p>Problem 1 can be rephrased as a problem on the complete de Bruijn graph of order
<italic>k</italic>
(see Definition 1 below). This is the viewpoint we take for most of this study:</p>
<p>
<italic>Problem 2</italic>
. Given a de Bruijn graph
<italic>D</italic>
<sub>
<italic>k</italic>
</sub>
of order
<italic>k</italic>
and an integer
<italic>L</italic>
, find a smallest set of vertices
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
such that any path in
<italic>D</italic>
<sub>
<italic>k</italic>
</sub>
of length ℓ =
<italic>L</italic>
<italic>k</italic>
passes through at least one vertex of
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
.</p>
<p>Here and throughout, the length of a path is the number of
<italic>edges</italic>
in it. We show that the related problem of finding a minimum-size
<italic>k</italic>
-mer set that hits every string in a given set
<inline-formula id="pcbi.1005777.e001">
<alternatives>
<graphic xlink:href="pcbi.1005777.e001.jpg" id="pcbi.1005777.e001g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M1">
<mml:mover accent="true">
<mml:mi>S</mml:mi>
<mml:mo>˜</mml:mo>
</mml:mover>
</mml:math>
</alternatives>
</inline-formula>
of
<italic>L</italic>
-long strings is NP-hard. This problem differs from ours, in that the set
<inline-formula id="pcbi.1005777.e002">
<alternatives>
<graphic xlink:href="pcbi.1005777.e002.jpg" id="pcbi.1005777.e002g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M2">
<mml:mover accent="true">
<mml:mi>S</mml:mi>
<mml:mo>˜</mml:mo>
</mml:mover>
</mml:math>
</alternatives>
</inline-formula>
is part of the input. However, the fact that finding a small set of
<italic>k</italic>
-mers that hits every sequence in a particular data set is hard further motivates the need for a universal set that can be computed once for any input sequence. Our main contribution is an algorithm called DOCKS that finds a compact set of
<italic>k</italic>
-mers that hits any
<italic>L</italic>
-long sequence. We also provide several variants of the algorithm, trading-off some solution quality for speed. We show empirically that the produced sets are often close to a theoretical lower bound, implying their near-optimality. Our use of a greedy heuristic is motivated by the fact that finding a minimum-size ℓ-long path cover in a graph
<italic>G</italic>
is NP-hard when
<italic>G</italic>
is a directed acyclic graph (DAG). We report on the size of the universal
<italic>k</italic>
-mer hitting set produced by DOCKS and demonstrate on genomic datasets that we can more uniformly cover sequences with a smaller set of
<italic>k</italic>
-mers than is possible using minimizers. For example, we show that the number of
<italic>k</italic>
-mers needed to cover the human genome using a UHS is less than one third of that required by minimizers.</p>
<p>The software to compute small UHSs is freely available at
<ext-link ext-link-type="uri" xlink:href="http://github.com/Shamir-Lab/DOCKS/">github.com/Shamir-Lab/DOCKS/</ext-link>
. Universal sets of
<italic>k</italic>
-mers computed by DOCKS for a range of values of
<italic>L</italic>
and
<italic>k</italic>
are freely available at
<ext-link ext-link-type="uri" xlink:href="http://acgt.cs.tau.ac.il/docks/">acgt.cs.tau.ac.il/docks/</ext-link>
. A preliminary version of this study appeared in [
<xref rid="pcbi.1005777.ref015" ref-type="bibr">15</xref>
].</p>
<sec id="sec002">
<title>Preliminaries</title>
<p>Throughout this paper,
<italic>k</italic>
denotes the length of a
<italic>k</italic>
-mer word, while
<italic>L</italic>
denotes the length of the long sequences.</p>
<p>
<bold>Definition 1 (de Bruijn Graph).</bold>
A
<italic>de Bruijn graph</italic>
of order
<italic>k</italic>
over alphabet Σ is a directed graph in which every vertex has an associated label (a string over Σ) of length
<italic>k</italic>
(
<italic>k</italic>
-mer) and every edge has an associated label of length
<italic>k</italic>
+ 1. There are exactly |Σ|
<sup>
<italic>k</italic>
</sup>
vertices in a de Bruijn graph, each representing a unique
<italic>k</italic>
-mer. If an edge (
<italic>u</italic>
,
<italic>v</italic>
) has label
<italic>l</italic>
, then the label of
<italic>u</italic>
must be the
<italic>k</italic>
-prefix (prefix of length
<italic>k</italic>
) of
<italic>l</italic>
and the label of
<italic>v</italic>
must be the
<italic>k</italic>
-suffix (suffix of length
<italic>k</italic>
) of
<italic>l</italic>
. A
<italic>complete</italic>
de Bruijn graph contains all possible edges of this type, which represent together all (
<italic>k</italic>
+ 1)-mers over Σ.</p>
<p>Every path in a de Bruijn graph represents a sequence. A path
<italic>v</italic>
<sub>0</sub>
,
<italic>e</italic>
<sub>0</sub>
,
<italic>v</italic>
<sub>1</sub>
,
<italic>e</italic>
<sub>1</sub>
,
<italic>v</italic>
<sub>2</sub>
, …,
<italic>v</italic>
<sub>
<italic>n</italic>
</sub>
of length
<italic>n</italic>
spells a sequence
<italic>s</italic>
of length
<italic>n</italic>
+
<italic>k</italic>
such that the label of
<italic>v</italic>
<sub>
<italic>i</italic>
</sub>
occurs in
<italic>s</italic>
starting at position
<italic>i</italic>
for all 0 ≤
<italic>i</italic>
<italic>n</italic>
, and the label of
<italic>e</italic>
<sub>
<italic>i</italic>
</sub>
occurs in
<italic>s</italic>
starting at position
<italic>i</italic>
for all 0 ≤
<italic>i</italic>
<italic>n</italic>
− 1. Note that vertices and edges may repeat in a path.</p>
<p>We define terminology for
<italic>k</italic>
-mers intersecting sequences over an alphabet Σ:</p>
<p>
<bold>Definition 2 (hits).</bold>
We say that
<italic>k</italic>
-mer
<italic>w</italic>
<italic>hits</italic>
string
<italic>S</italic>
, denoted
<italic>w</italic>
<italic>S</italic>
, if
<italic>w</italic>
appears as a contiguous substring in
<italic>S</italic>
.
<italic>k</italic>
-mer set
<italic>X</italic>
<italic>hits</italic>
string
<italic>S</italic>
if there exists
<italic>w</italic>
<italic>X</italic>
s.t.
<italic>w</italic>
<italic>S</italic>
. Define
<italic>hit</italic>
(
<italic>w</italic>
,
<italic>L</italic>
) = {
<italic>S</italic>
∈ Σ
<sup>
<italic>L</italic>
</sup>
<italic>w</italic>
<italic>S</italic>
} for
<italic>k</italic>
-mer
<italic>w</italic>
and length
<italic>L</italic>
, where Σ
<sup>
<italic>L</italic>
</sup>
is the set of all
<italic>L</italic>
-long substrings over alphabet Σ. Define
<inline-formula id="pcbi.1005777.e003">
<alternatives>
<graphic xlink:href="pcbi.1005777.e003.jpg" id="pcbi.1005777.e003g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M3">
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
.</p>
<p>The universal set of hitting
<italic>k</italic>
-mers from Problem 1 is then a set
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
which satisfies
<italic>hit</italic>
(
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
,
<italic>L</italic>
) = Σ
<sup>
<italic>L</italic>
</sup>
.</p>
</sec>
</sec>
<sec sec-type="materials|methods" id="sec003">
<title>Materials and methods</title>
<p>It is not known how to efficiently find a minimum universal (
<italic>k</italic>
,
<italic>L</italic>
)-hitting set. As we prove in the Appendix, the problem of finding a minimum (non-universal)
<italic>k</italic>
-mer set that hits a given set of input sequences is NP-hard (see Appendix, Subsection NP-hardness of MINIMUM (
<italic>k</italic>
,
<italic>L</italic>
)-HITTING SET in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
). In the face of the hardness result for this related problem, we give below a practical heuristic to find a compact (near-optimal) universal
<italic>k</italic>
-mer set. This algorithm works on the de Bruijn graph of order
<italic>k</italic>
in two steps: first it finds and removes a minimum-size
<italic>k</italic>
-mer set hitting all infinite sequences, and then it finds and removes additional
<italic>k</italic>
-mers in order to hit all remaining
<italic>L</italic>
-long sequences. We now describe these two steps in detail.</p>
<sec id="sec004">
<title>Finding a minimum
<italic>k</italic>
-mer set hitting all infinite sequences</title>
<p>The problem of finding a minimum-size
<italic>k</italic>
-mer set hitting all infinite sequences is known in the literature as finding an unavoidable set of constant length [
<xref rid="pcbi.1005777.ref016" ref-type="bibr">16</xref>
]. Note that finite words may avoid the set. Finding a minimum-size unavoidable set for a given
<italic>k</italic>
can be solved in time polynomial in the output size [
<xref rid="pcbi.1005777.ref016" ref-type="bibr">16</xref>
]. The original algorithm is due to Mykkeltveit [
<xref rid="pcbi.1005777.ref017" ref-type="bibr">17</xref>
]. Its running time is
<italic>O</italic>
(
<italic>kM</italic>
(
<italic>k</italic>
)), where
<italic>M</italic>
(
<italic>k</italic>
) is the size of the minimum unavoidable set.
<italic>M</italic>
(
<italic>k</italic>
) converges to |Σ|
<sup>
<italic>k</italic>
</sup>
/
<italic>k</italic>
(an exact formula is given in
<xref ref-type="disp-formula" rid="pcbi.1005777.e016">Eq 11</xref>
), so the running time is
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
</sup>
).</p>
<p>An unavoidable set of constant length
<italic>k</italic>
is equivalent to a set of vertices in a complete de Bruijn graph of order
<italic>k</italic>
whose removal turns it into a directed acyclic graph (DAG). Each
<italic>k</italic>
-mer in the set corresponds to a vertex, and the removal of vertices from every cycle guarantees that no infinite sequence is represented as a path in the graph. This set is known as a
<italic>decycling set</italic>
.</p>
</sec>
<sec id="sec005">
<title>Hitting remaining length
<italic>L</italic>
sequences</title>
<p>Unfortunately, finding an unavoidable set is not enough, as there may be
<italic>L</italic>
-long sequences that avoid that set. Thus, we need additional
<italic>k</italic>
-mers to hit those. If we consider the graph formulation, after removal of a decycling set from the graph we are left with a DAG, which may contain (
<italic>L</italic>
<italic>k</italic>
)-long paths representing
<italic>L</italic>
-long sequences. We need to remove additional vertices, so that there is no path of length ℓ =
<italic>L</italic>
<italic>k</italic>
. The problem of finding a minimum-size set of vertices that hit all ℓ-long paths in a general directed acyclic graph is known to be NP-hard, as we review in the Appendix (see Appendix, Subsection NP-hardness of MINIMUM ℓ-PATH COVER IN A DAG in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
). Therefore, we give a heuristic solution.</p>
<p>Our initial algorithm is based on the greedy algorithm for the minimum hitting set [
<xref rid="pcbi.1005777.ref018" ref-type="bibr">18</xref>
]. We define the
<italic>hitting number</italic>
<italic>T</italic>
(
<italic>v</italic>
,
<italic></italic>
) of a vertex
<italic>v</italic>
to be the number of paths of length
<italic></italic>
that contain
<italic>v</italic>
. The main observation is that we can calculate the hitting number of each vertex efficiently using dynamic programming. The solution is based on calculating the number of paths of length
<italic>i</italic>
that terminate at vertex
<italic>v</italic>
, and the number of paths of length
<italic>i</italic>
that start at vertex
<italic>v</italic>
, for all
<italic>v</italic>
<italic>V</italic>
and 0 ≤
<italic>i</italic>
<italic></italic>
. Then, the number of
<italic></italic>
-long paths through
<italic>v</italic>
is directly computable from these values by breaking any path into an
<italic>i</italic>
-long path ending at
<italic>v</italic>
and an (
<italic></italic>
<italic>i</italic>
)-long path starting at
<italic>v</italic>
, for all possible values of
<italic>i</italic>
. We set
<italic></italic>
=
<italic>L</italic>
<italic>k</italic>
to get the desired hitting number of each vertex.</p>
<p>Specifically, let
<italic>G</italic>
′ = (
<italic>V</italic>
′,
<italic>E</italic>
′) be the directed acyclic graph, after removal of the decycling set. Denote by
<italic>D</italic>
and
<italic>F</italic>
matrices of size |
<italic>V</italic>
′| × (
<italic></italic>
+ 1) where
<italic>D</italic>
(
<italic>v</italic>
,
<italic>i</italic>
) is the number of
<italic>i</italic>
-long paths in
<italic>G</italic>
′ starting at vertex
<italic>v</italic>
and
<italic>F</italic>
(
<italic>v</italic>
,
<italic>i</italic>
) is the number of
<italic>i</italic>
-long paths ending at vertex
<italic>v</italic>
.</p>
<p>The calculation of
<italic>D</italic>
and
<italic>F</italic>
is done recursively as follows:
<disp-formula id="pcbi.1005777.e004">
<alternatives>
<graphic xlink:href="pcbi.1005777.e004.jpg" id="pcbi.1005777.e004g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M4">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mtext>for</mml:mtext>
<mml:mspace width="4.pt"></mml:mspace>
<mml:mtext>all</mml:mtext>
<mml:mspace width="4.pt"></mml:mspace>
<mml:mi>v</mml:mi>
<mml:mo></mml:mo>
<mml:msup>
<mml:mi>V</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
</mml:math>
</alternatives>
<label>(1)</label>
</disp-formula>
<disp-formula id="pcbi.1005777.e005">
<alternatives>
<graphic xlink:href="pcbi.1005777.e005.jpg" id="pcbi.1005777.e005g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M5">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo></mml:mo>
<mml:msup>
<mml:mi>E</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
<label>(2)</label>
</disp-formula>
<disp-formula id="pcbi.1005777.e006">
<alternatives>
<graphic xlink:href="pcbi.1005777.e006.jpg" id="pcbi.1005777.e006g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M6">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo></mml:mo>
<mml:msup>
<mml:mi>E</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
<label>(3)</label>
</disp-formula>
To get the number of
<italic></italic>
-long paths that vertex
<italic>v</italic>
participates in, we sum:
<disp-formula id="pcbi.1005777.e007">
<alternatives>
<graphic xlink:href="pcbi.1005777.e007.jpg" id="pcbi.1005777.e007g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M7">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi></mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi></mml:mi>
</mml:munderover>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>·</mml:mo>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi></mml:mi>
<mml:mo>-</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
<label>(4)</label>
</disp-formula>
The running time is proportional to the sum of all vertex degrees (which is Θ(|
<italic>E</italic>
′|)) times
<italic></italic>
, giving a running time of
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
+1</sup>
<italic></italic>
) for
<italic></italic>
=
<italic>L</italic>
<italic>k</italic>
.</p>
</sec>
<sec id="sec006">
<title>The DOCKS algorithm</title>
<p>The full algorithm combines the two steps. First, we find a decycling set in a complete de Bruijn graph of order
<italic>k</italic>
and remove it from the graph, obtaining a DAG. Then, we repeatedly remove a vertex
<italic>v</italic>
with the largest hitting number
<italic>T</italic>
(
<italic>v</italic>
,
<italic></italic>
) until there are no
<italic></italic>
-long paths, recomputing
<italic>T</italic>
(
<italic>u</italic>
,
<italic></italic>
) for all remaining vertices
<italic>u</italic>
after each removal. This is summarized below (Algorithm 1).</p>
<boxed-text id="pcbi.1005777.box001" position="float" orientation="portrait">
<sec id="sec007">
<title>Algorithm 1 DOCKS: Find a compact
<italic>k</italic>
-mer set hitting all
<italic>L</italic>
-long sequences</title>
<p>1: Generate a complete de Bruijn graph
<italic>G</italic>
of order
<italic>k</italic>
, set
<italic></italic>
=
<italic>L</italic>
<italic>k</italic>
.</p>
<p>2: Find a decycling vertex set
<italic>X</italic>
using Mykkeltveit’s algorithm.</p>
<p>3: Remove all vertices in
<italic>X</italic>
from graph
<italic>G</italic>
, resulting in
<italic>G</italic>
′.</p>
<p>4:
<bold>while</bold>
there are still paths of length
<italic></italic>
<bold>do</bold>
</p>
<p>5:  Calculate
<italic>D</italic>
(
<italic>v</italic>
,
<italic>i</italic>
) and
<italic>F</italic>
(
<italic>v</italic>
,
<italic>i</italic>
) for each vertex
<italic>v</italic>
and 0 ≤
<italic>i</italic>
<italic></italic>
.</p>
<p>6:  Calculate
<italic>T</italic>
(
<italic>v</italic>
,
<italic></italic>
) for each vertex
<italic>v</italic>
.</p>
<p>7:  Remove a vertex with maximum hitting number from
<italic>G</italic>
′, and add it to set
<italic>X</italic>
.</p>
<p>8:
<bold>end while</bold>
</p>
<p>9: Output set
<italic>X</italic>
.</p>
</sec>
</boxed-text>
<p>Finding the decycling set takes
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
</sup>
). In the second phase, each iteration calculates the hitting number of all vertices in time
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
+1</sup>
<italic></italic>
). The number of iterations is 1 +
<italic>p</italic>
, where
<italic>p</italic>
is the number of vertices removed. Thus, the total running time is dominated by steps 4–8 and is
<italic>O</italic>
((1 +
<italic>p</italic>
)|Σ|
<sup>
<italic>k</italic>
+1</sup>
<italic></italic>
).</p>
<p>The exponential dependence of DOCKS on
<italic>k</italic>
limits the range of
<italic>k</italic>
to which it can be applied (see
<xref ref-type="sec" rid="sec012">Results</xref>
, Subsection DOCKS). This motivates us to develop two variants that trade larger solution sizes for faster running times in the different heuristics described next.</p>
</sec>
<sec id="sec008">
<title>The DOCKSany algorithms</title>
<p>In order to extend the range of
<italic>k</italic>
,
<italic>L</italic>
values beyond what DOCKS can compute in reasonable times, we develop a faster heuristic that may produce cruder solutions. Instead of calculating the number of
<italic></italic>
-long paths through each vertex, we consider
<italic>all</italic>
paths through each vertex. This number, denoted by
<italic>T</italic>
(
<italic>v</italic>
), can be calculated more quickly and serve as an estimate of
<italic>T</italic>
(
<italic>v</italic>
,
<italic></italic>
). We call this heuristic DOCKSany (Algorithm 2).</p>
<p>DOCKSany has the same structure as DOCKS, but with one difference: it removes a node
<italic>v</italic>
with maximum
<italic>T</italic>
(
<italic>v</italic>
) in each iteration. To compute
<italic>T</italic>
(
<italic>v</italic>
) for all
<italic>v</italic>
, the vertices in the current graph
<italic>G</italic>
′ = (
<italic>V</italic>
′,
<italic>E</italic>
′) are first sorted in topological order
<italic>v</italic>
<sub>1</sub>
≤ … ≤
<italic>v</italic>
<sub>
<italic>n</italic>
</sub>
. Define
<italic>F</italic>
(
<italic>v</italic>
) as the number of paths ending at
<italic>v</italic>
. The vertices are visited in topological order and the incoming edges into
<italic>v</italic>
are used to compute:
<disp-formula id="pcbi.1005777.e008">
<alternatives>
<graphic xlink:href="pcbi.1005777.e008.jpg" id="pcbi.1005777.e008g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M8">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:munder>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo></mml:mo>
<mml:msup>
<mml:mi>E</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
<label>(5)</label>
</disp-formula>
Similarly,
<italic>D</italic>
(
<italic>v</italic>
), the number of paths starting at
<italic>v</italic>
is computed by visiting the vertices in reverse topological order and computing.
<disp-formula id="pcbi.1005777.e009">
<alternatives>
<graphic xlink:href="pcbi.1005777.e009.jpg" id="pcbi.1005777.e009g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M9">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:munder>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo></mml:mo>
<mml:msup>
<mml:mi>E</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
<label>(6)</label>
</disp-formula>
<italic>T</italic>
(
<italic>v</italic>
) is then calculated for all vertices as:
<disp-formula id="pcbi.1005777.e010">
<alternatives>
<graphic xlink:href="pcbi.1005777.e010.jpg" id="pcbi.1005777.e010g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M10">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>·</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</alternatives>
<label>(7)</label>
</disp-formula>
</p>
<p>A vertex
<italic>v</italic>
with maximum
<italic>T</italic>
(
<italic>v</italic>
) is removed,
<italic>G</italic>
′ is updated, and the process is repeated until there are no paths of length
<italic></italic>
in the graph.</p>
<boxed-text id="pcbi.1005777.box002" position="float" orientation="portrait">
<sec id="sec009">
<title>Algorithm 2 DOCKSany: A faster heuristic for a compact
<italic>k</italic>
-mer set hitting all
<italic>L</italic>
-long sequences</title>
<p>1: Generate a complete de Bruijn graph
<italic>G</italic>
of order
<italic>k</italic>
, set
<italic></italic>
=
<italic>L</italic>
<italic>k</italic>
.</p>
<p>2: Find a decycling vertex set
<italic>X</italic>
using Mykkeltveit’s algorithm.</p>
<p>3: Remove all vertices in
<italic>X</italic>
from graph
<italic>G</italic>
, resulting in
<italic>G</italic>
′.</p>
<p>4:
<bold>while</bold>
there are still paths of length
<italic></italic>
<bold>do</bold>
</p>
<p>5:  Calculate
<italic>D</italic>
(
<italic>v</italic>
) and
<italic>F</italic>
(
<italic>v</italic>
) at each vertex
<italic>v</italic>
.</p>
<p>6:  Calculate the number
<italic>T</italic>
(
<italic>v</italic>
) of paths passing through each vertex
<italic>v</italic>
.</p>
<p>7:  Remove a vertex
<italic>v</italic>
with maximum
<italic>T</italic>
(
<italic>v</italic>
) from
<italic>G</italic>
′, and add it to set
<italic>X</italic>
.</p>
<p>8:
<bold>end while</bold>
</p>
<p>9: Output set
<italic>X</italic>
.</p>
</sec>
</boxed-text>
<p>Computing
<italic>D</italic>
(
<italic>v</italic>
) for all
<italic>v</italic>
requires visiting each edge in the graph once, and hence takes
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
+1</sup>
). The time for computing
<italic>F</italic>
(
<italic>v</italic>
) for all
<italic>v</italic>
is the same. Hence,
<italic>T</italic>
is computable in
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
+1</sup>
) time. Computing the longest path in a DAG (step 4) also requires
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
+1</sup>
). If
<italic>p</italic>
vertices are removed, then the total runtime for this algorithm is
<italic>O</italic>
((1 +
<italic>p</italic>
)|Σ|
<sup>
<italic>k</italic>
+1</sup>
), a factor of Θ(
<italic></italic>
) faster than the DOCKS algorithm. The space complexity is also smaller,
<italic>O</italic>
(|Σ|
<sup>
<italic>k</italic>
+1</sup>
) vs.
<italic>O</italic>
(
<italic></italic>
|Σ|
<sup>
<italic>k</italic>
+1</sup>
) for DOCKS.</p>
<p>In addition to shorter runtimes and decreased memory usage, this heuristic offers one more advantage over the original DOCKS algorithm. The vertex removal choice is independent of
<italic>L</italic>
. The value of
<italic>L</italic>
only determines when the algorithm terminates. Thus, hitting sets for all values of
<italic>L</italic>
or larger can be computed in one run. This is in contrast with DOCKS, in which the hitting number of each vertex depends on
<italic>L</italic>
, and so DOCKS must be run for each desired value of
<italic>L</italic>
.</p>
<p>Finally, in order to calculate the hitting set for even larger
<italic>k</italic>
, we can further speed up DOCKSany as follows. In the DOCKSanyX heuristic, the top
<italic>X</italic>
vertices, ranked by the hitting number
<italic>T</italic>
(
<italic>v</italic>
), are removed (in step 7) in each iteration. This can shorten the running time of each iteration by a factor of
<italic>X</italic>
, but may produce larger hitting set solutions.</p>
</sec>
<sec id="sec010">
<title>An integer linear programming (ILP) formulation</title>
<p>To investigate whether optimal solutions can be found practically, we formulate the problem of the minimal universal
<italic>k</italic>
-mer hitting set as an integer linear program (ILP). In the ILP formulation there are |Σ|
<sup>
<italic>k</italic>
</sup>
binary variables
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
representing whether vertex
<italic>i</italic>
is in the solution hitting set. There are also |Σ|
<sup>
<italic>k</italic>
</sup>
variables
<italic>L</italic>
<sub>
<italic>i</italic>
</sub>
representing an upper bound on the number of edges in the longest path ending at vertex
<italic>i</italic>
. The constraints on
<italic>L</italic>
<sub>
<italic>i</italic>
</sub>
guarantee that the vertices chosen remove all
<italic></italic>
-long paths (
<italic></italic>
=
<italic>L</italic>
<italic>k</italic>
) from the graph. The ILP is defined as follows:
<disp-formula id="pcbi.1005777.e011">
<alternatives>
<graphic xlink:href="pcbi.1005777.e011.jpg" id="pcbi.1005777.e011g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M11">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mtext>minimize:</mml:mtext>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mstyle displaystyle="true" mathsize="140%">
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow></mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mtext>subject</mml:mtext>
<mml:mspace width="1pt"></mml:mspace>
<mml:mtext>to:</mml:mtext>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow></mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:msup>
<mml:mo>|</mml:mo>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</alternatives>
<label>(8)</label>
</disp-formula>
<disp-formula id="pcbi.1005777.e012">
<alternatives>
<graphic xlink:href="pcbi.1005777.e012.jpg" id="pcbi.1005777.e012g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M12">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">l</mml:mi>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
<mml:mspace width="20pt"></mml:mspace>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo></mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
<label>(9)</label>
</disp-formula>
</p>
<p>Here
<italic>E</italic>
contains all |Σ|
<sup>
<italic>k</italic>
+1</sup>
possible edges. The constraint on edge (
<italic>u</italic>
,
<italic>v</italic>
) requires that if
<italic>v</italic>
is not in the set then
<italic>L</italic>
<sub>
<italic>v</italic>
</sub>
≥ 1 +
<italic>L</italic>
<sub>
<italic>u</italic>
</sub>
. The validity of this formulation is proven in the Appendix (see Appendix, Subsection Validity of the ILP formulation in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
).</p>
<p>The number of variables and constraints grows exponentially in
<italic>k</italic>
, making it hard to use for
<italic>k</italic>
> 7. However, the ILP solver can start from a feasible solution produced by one of the DOCKS algorithms and improve that solution for a limited set time.</p>
</sec>
<sec id="sec011">
<title>Handling larger
<italic>k</italic>
</title>
<p>The DOCKS variants described above have exponential dependence in
<italic>k</italic>
in both runtime and memory usage. Hence, the range of
<italic>k</italic>
values to which they can be applied is limited. To extend this range, we present below a procedure to construct a universal
<italic>k</italic>
-mer hitting set by extending UHSs computed for smaller
<italic>k</italic>
values. Given a set
<italic>U</italic>
<sub>
<italic>K</italic>
,
<italic>L</italic>
</sub>
and integer
<italic>j</italic>
, we can construct set
<italic>U</italic>
<sub>
<italic>k</italic>
+
<italic>j</italic>
,
<italic>L</italic>
+
<italic>j</italic>
</sub>
by concatenating all possible
<italic>j</italic>
-mers over Σ to each
<italic>k</italic>
-mer in
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
. Formally,
<disp-formula id="pcbi.1005777.e013">
<alternatives>
<graphic xlink:href="pcbi.1005777.e013.jpg" id="pcbi.1005777.e013g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M13">
<mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>·</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo></mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo></mml:mo>
<mml:msup>
<mml:mo>Σ</mml:mo>
<mml:mi>j</mml:mi>
</mml:msup>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
<label>(10)</label>
</disp-formula>
To see that
<italic>U</italic>
<sub>
<italic>k</italic>
+
<italic>j</italic>
,
<italic>L</italic>
+
<italic>j</italic>
</sub>
is a universal (
<italic>k</italic>
+
<italic>j</italic>
)-mer hitting set, denote by
<italic>S</italic>
an (
<italic>L</italic>
+
<italic>j</italic>
)-long sequence. By definition, there must be at least one
<italic>k</italic>
-mer
<italic>w</italic>
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
that hits
<italic>S</italic>
’s
<italic>L</italic>
-long prefix.
<italic>U</italic>
<sub>
<italic>k</italic>
+
<italic>j</italic>
,
<italic>L</italic>
+
<italic>j</italic>
</sub>
contains all (
<italic>k</italic>
+
<italic>j</italic>
)-mers
<italic>w</italic>
<italic>x</italic>
, where
<italic>x</italic>
is any
<italic>j</italic>
-mer. Thus, it must contain a (
<italic>k</italic>
+
<italic>j</italic>
)-mer that hits
<italic>S</italic>
.</p>
<p>For example, by appending all possible 10-mers to each 10-mer in
<italic>U</italic>
<sub>10,20</sub>
we obtain
<italic>U</italic>
<sub>20,30</sub>
. The size of the set
<italic>U</italic>
<sub>10,20</sub>
is |
<italic>U</italic>
<sub>10,20</sub>
| =
<italic>c</italic>
<italic>dec</italic>
<sub>10</sub>
, where
<inline-formula id="pcbi.1005777.e014">
<alternatives>
<graphic xlink:href="pcbi.1005777.e014.jpg" id="pcbi.1005777.e014g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M14">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mn>10</mml:mn>
</mml:msub>
<mml:mo></mml:mo>
<mml:mfrac>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mn>10</mml:mn>
</mml:msup>
<mml:mn>10</mml:mn>
</mml:mfrac>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
is the size of a minimum decycling set for
<italic>k</italic>
= 10 (
<xref ref-type="disp-formula" rid="pcbi.1005777.e016">Eq 11</xref>
). Here
<italic>c</italic>
≥ 1 is the approximation factor obtained by the UHS. Then, the size of
<italic>U</italic>
<sub>20,30</sub>
is
<inline-formula id="pcbi.1005777.e015">
<alternatives>
<graphic xlink:href="pcbi.1005777.e015.jpg" id="pcbi.1005777.e015g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M15">
<mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mn>20</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>30</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mn>10</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mo>·</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mn>10</mml:mn>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mn>10</mml:mn>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mn>10</mml:mn>
</mml:msup>
<mml:mo></mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>·</mml:mo>
<mml:mfrac>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mn>20</mml:mn>
</mml:msup>
<mml:mn>10</mml:mn>
</mml:mfrac>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>c</mml:mi>
<mml:mo>·</mml:mo>
<mml:mfrac>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mn>20</mml:mn>
</mml:msup>
<mml:mn>20</mml:mn>
</mml:mfrac>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
by this construction. This is approximately |
<italic>U</italic>
<sub>20,30</sub>
| ≈ 2
<italic>c</italic>
<italic>dec</italic>
<sub>20</sub>
, i.e. the approximation factor doubled.</p>
</sec>
</sec>
<sec sec-type="results" id="sec012">
<title>Results</title>
<sec id="sec013">
<title>A theoretical lower bound for |
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
|</title>
<p>For a given
<italic>k</italic>
-mer
<italic>w</italic>
, its
<italic>conjugacy class</italic>
is the set of
<italic>k</italic>
-mers obtained by rotation of
<italic>w</italic>
. Conjugacy classes form cycles in the de Bruijn graph and form a partition of the
<italic>k</italic>
-mers. The number of conjugacy classes over all
<italic>k</italic>
-mers is given by [
<xref rid="pcbi.1005777.ref016" ref-type="bibr">16</xref>
]:
<disp-formula id="pcbi.1005777.e016">
<alternatives>
<graphic xlink:href="pcbi.1005777.e016.jpg" id="pcbi.1005777.e016g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M16">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:mo>|</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo form="prefix" movablelimits="true">gcd</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</alternatives>
<label>(11)</label>
</disp-formula>
</p>
<p>A decycling set necessarily contains a
<italic>k</italic>
-mer from each conjugacy class. Golomb’s conjecture, proved by Mykkeltveit [
<xref rid="pcbi.1005777.ref017" ref-type="bibr">17</xref>
], states that the smallest decycling set has cardinality
<italic>C</italic>
(|Σ|,
<italic>k</italic>
). Consequently, a minimum hitting set
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
has a size ≥
<italic>C</italic>
(|Σ|,
<italic>k</italic>
) ≥ |Σ|
<sup>
<italic>k</italic>
</sup>
/
<italic>k</italic>
.</p>
<p>
<xref rid="pcbi.1005777.t001" ref-type="table">Table 1</xref>
reports
<italic>L</italic>
<sub>
<italic>max</italic>
</sub>
, the length of the longest sequence in a complete de Bruijn graph after a minimum decycling set computed using Mykkeltveit’s algorithm is removed, for
<italic>k</italic>
= 2 to 14. For this range of
<italic>k</italic>
, the length of sequences avoiding the decycling set can theoretically be appropriate for long-read sequencing technologies, such as PacBio [
<xref rid="pcbi.1005777.ref019" ref-type="bibr">19</xref>
] and Nanopore [
<xref rid="pcbi.1005777.ref020" ref-type="bibr">20</xref>
], which produce reads of length
<italic>L</italic>
> 1000. Such long reads are all hit by a decycling set according to
<xref rid="pcbi.1005777.t001" ref-type="table">Table 1</xref>
for
<italic>k</italic>
≤ 14 (although a shorter window size may be needed to overcome sequencing errors). However, many short reads can avoid the decycling set. Additional
<italic>k</italic>
-mers must be selected to obtain a hitting set for shorter sequences. Note that different minimum decycling sets may result in different lengths of the longest path in the remaining DAG. Mykkeltveit’s approach is different from that of Champarnaud et al. (2004), and the former has an advantage in producing solutions with shorter longest paths [
<xref rid="pcbi.1005777.ref016" ref-type="bibr">16</xref>
].</p>
<table-wrap id="pcbi.1005777.t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1005777.t001</object-id>
<label>Table 1</label>
<caption>
<title>Length of longest sequence avoiding an unavoidable set for different values of
<italic>k</italic>
.</title>
<p>For each value
<italic>k</italic>
, a minimum decycling set was removed from a complete de Bruijn graph, and the length
<italic>L</italic>
<sub>
<italic>max</italic>
</sub>
of the longest sequence, represented as a longest path, was calculated.</p>
</caption>
<alternatives>
<graphic id="pcbi.1005777.t001g" xlink:href="pcbi.1005777.t001"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<tbody>
<tr>
<td align="center" rowspan="1" colspan="1">
<italic>k</italic>
</td>
<td align="center" rowspan="1" colspan="1">2</td>
<td align="center" rowspan="1" colspan="1">3</td>
<td align="center" rowspan="1" colspan="1">4</td>
<td align="center" rowspan="1" colspan="1">5</td>
<td align="center" rowspan="1" colspan="1">6</td>
<td align="center" rowspan="1" colspan="1">7</td>
<td align="center" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">10</td>
<td align="center" rowspan="1" colspan="1">11</td>
<td align="center" rowspan="1" colspan="1">12</td>
<td align="center" rowspan="1" colspan="1">13</td>
<td align="center" rowspan="1" colspan="1">14</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">
<italic>L</italic>
<sub>
<italic>max</italic>
</sub>
</td>
<td align="center" rowspan="1" colspan="1">5</td>
<td align="center" rowspan="1" colspan="1">11</td>
<td align="center" rowspan="1" colspan="1">20</td>
<td align="center" rowspan="1" colspan="1">45</td>
<td align="center" rowspan="1" colspan="1">70</td>
<td align="center" rowspan="1" colspan="1">117</td>
<td align="center" rowspan="1" colspan="1">148</td>
<td align="center" rowspan="1" colspan="1">239</td>
<td align="center" rowspan="1" colspan="1">311</td>
<td align="center" rowspan="1" colspan="1">413</td>
<td align="center" rowspan="1" colspan="1">570</td>
<td align="center" rowspan="1" colspan="1">697</td>
<td align="center" rowspan="1" colspan="1">931</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
</sec>
<sec id="sec014">
<title>DOCKS</title>
<p>We implemented and ran DOCKS over a range of
<italic>k</italic>
and
<italic>L</italic>
: 5 ≤
<italic>k</italic>
≤ 10 and 20 ≤
<italic>L</italic>
≤ 200, in increments of 10. The values of
<italic>k</italic>
are typical lengths for minimizers, and the
<italic>L</italic>
values are typical lengths of short reads. Note that in some applications, like KMC 2 [
<xref rid="pcbi.1005777.ref008" ref-type="bibr">8</xref>
] and Kraken [
<xref rid="pcbi.1005777.ref012" ref-type="bibr">12</xref>
], the length of the window used (denoted by
<italic>k</italic>
there) corresponds to our
<italic>L</italic>
parameter, and the length of the minimizers (
<italic>m</italic>
in KMC 2) corresponds to our
<italic>k</italic>
parameter.</p>
<p>The results are summarized in
<xref ref-type="fig" rid="pcbi.1005777.g001">Fig 1</xref>
. As expected, the fraction of
<italic>k</italic>
-mers included in the solution set decreases with
<italic>L</italic>
. It is easier to hit longer sequences as they contain more
<italic>k</italic>
-mers. In addition, running times and memory usage increase exponentially with
<italic>k</italic>
. For
<italic>k</italic>
= 10, DOCKS terminated after more than 2.5 hours and used more than 1 GB of memory. For
<italic>k</italic>
= 11 and
<italic>L</italic>
= 20 running time was 128 hours. Hence, DOCKS runtime would be prohibitively long for larger values of
<italic>k</italic>
. Running times were benchmarked on a single CPU of a 20-CPU Intel Xeon E5-2650 (2.3GHz) machine with 384GB 2133MHz RAM.</p>
<fig id="pcbi.1005777.g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1005777.g001</object-id>
<label>Fig 1</label>
<caption>
<title>Performance of DOCKS.</title>
<p>For different combinations of
<italic>k</italic>
and
<italic>L</italic>
we ran DOCKS over the DNA alphabet. (A) Set sizes. The results are shown as a fraction of the total number of
<italic>k</italic>
-mers |Σ|
<sup>
<italic>k</italic>
</sup>
. The broken lines show the decycling set size for each
<italic>k</italic>
. (B) Running time in seconds. Note that y-axis is in log scale. (C) Maximum memory usage in megabytes. Note that y-axis is in log scale.</p>
</caption>
<graphic xlink:href="pcbi.1005777.g001"></graphic>
</fig>
<p>
<xref ref-type="fig" rid="pcbi.1005777.g001">Fig 1A</xref>
also shows the size of the decycling set for each
<italic>k</italic>
. For
<italic>k</italic>
= 10 and
<italic>L</italic>
= 20 the number of added
<italic>k</italic>
-mers roughly equals the size of the decycling set, while for
<italic>k</italic>
= 5 and
<italic>L</italic>
= 20 it is only 20% larger. For all values of
<italic>k</italic>
, the ratio improves as
<italic>L</italic>
grows. We also compared DOCKS to a pure greedy algorithm that repeatedly removes a vertex with a maximum hitting number, without removing a decycling set first. For almost all combinations of (
<italic>k</italic>
,
<italic>L</italic>
) the size of the produced set, runtime and memory of the greedy algorithm were far greater than those of DOCKS (see Fig A in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
). In particular, the greedy algorithm’s runtime was greater by a factor of more than 1000 for
<italic>k</italic>
= 8 (taking days compared to minutes), and it increased with
<italic>L</italic>
, as opposed to DOCKS’s runtime, which decreased with
<italic>L</italic>
.</p>
</sec>
<sec id="sec015">
<title>DOCKSany</title>
<p>We ran DOCKSany for 5 ≤
<italic>k</italic>
≤ 11 and 20 ≤
<italic>L</italic>
≤ 200. The results for
<italic>k</italic>
= 10 are shown in
<xref ref-type="fig" rid="pcbi.1005777.g002">Fig 2</xref>
and the full results are in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s001">S1 Table</xref>
and visualized in Fig B in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
. In comparison to DOCKS (see Fig C in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
), the produced sets are larger, especially for smaller values of
<italic>L</italic>
, and that gap grows with
<italic>k</italic>
: from 10% for
<italic>k</italic>
= 5 to 60% larger for
<italic>k</italic>
= 10. Set sizes of DOCKS and DOCKSany are closer as
<italic>L</italic>
increases and both approach the size of the decycling set. In terms of running time, on the other hand, we see a great benefit in using DOCKSany as runtimes decrease to a small fraction of the DOCKS running times for the larger values of
<italic>k</italic>
. We also see reduced memory usage for larger values of
<italic>k</italic>
and
<italic>L</italic>
(see the table in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s001">S1 Table</xref>
). Still, DOCKSany becomes impractical for
<italic>k</italic>
≥ 13 (runtime for
<italic>k</italic>
= 12,
<italic>L</italic>
= 20 was 45 days), so we turn to another heuristic to increase runtime on the expense of larger set sizes.</p>
<fig id="pcbi.1005777.g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1005777.g002</object-id>
<label>Fig 2</label>
<caption>
<title>Comparison of the sizes of the universal sets generated by the different heuristics.</title>
<p>The histogram shows the size of the universal sets generated by DOCKS, DOCKSany, and DOCKSanyX with
<italic>X</italic>
= 625. The results are for
<italic>k</italic>
= 10 and 20 ≤
<italic>L</italic>
≤ 200. The size of the decycling set is provided as a lower bound for comparison.</p>
</caption>
<graphic xlink:href="pcbi.1005777.g002"></graphic>
</fig>
</sec>
<sec id="sec016">
<title>DOCKSanyX</title>
<p>We tested the performance of DOCKSanyX for
<italic>k</italic>
= 10, 20 ≤
<italic>L</italic>
≤ 200 and
<italic>X</italic>
= 5
<sup>
<italic>i</italic>
</sup>
for 0 ≤
<italic>i</italic>
≤ 5 (Fig D in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
). As expected, the generated set sizes increase with
<italic>X</italic>
, but the differences are very small for
<italic>X</italic>
≤ 125. On the other hand, the running time improves dramatically as
<italic>X</italic>
increases and the memory usage also improves with
<italic>X</italic>
, albeit not as dramatically (see
<xref ref-type="supplementary-material" rid="pcbi.1005777.s001">S1 Table</xref>
).
<xref ref-type="fig" rid="pcbi.1005777.g002">Fig 2</xref>
compares the sizes of the sets generated by DOCKS, DOCKSany, and DOCKSanyX (for
<italic>X</italic>
= 625). Remarkably, for
<italic>k</italic>
= 10, the size of the solution is similar to that of DOCKSany while there is a factor of > 100 × speedup. The results, runtime and memory usage of DOCKSanyX are in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s001">S1 Table</xref>
and visualized in Fig E in
<xref ref-type="supplementary-material" rid="pcbi.1005777.s002">S1 Text</xref>
.</p>
</sec>
<sec id="sec017">
<title>ILP solutions</title>
<p>We solved the ILP using Gurobi 6.5.2 [
<xref rid="pcbi.1005777.ref021" ref-type="bibr">21</xref>
] for 5 ≤
<italic>k</italic>
≤ 10 with 20 ≤
<italic>L</italic>
≤ 200. To save time, we set the starting feasible solution to be the DOCKS solution. We let the solver run for up to one day for each
<italic>k</italic>
and
<italic>L</italic>
. This did not necessarily produce an optimal solution to the ILP, although the solver was often able to improve on the starting DOCKS solution. In
<xref ref-type="fig" rid="pcbi.1005777.g003">Fig 3</xref>
, we show the improvement in the solution set size obtained by the ILP over the DOCKS solution. We can see that using the ILP solver leads to minor improvements over the DOCKS solution (0-4%), especially for small
<italic>k</italic>
. Improvements diminish as
<italic>L</italic>
increases, since the set sizes approach the theoretical lower bound, i.e., the size of the minimum decycling set. Letting the ILP solver run for longer times may provide further improvements for small values of
<italic>L</italic>
.</p>
<fig id="pcbi.1005777.g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1005777.g003</object-id>
<label>Fig 3</label>
<caption>
<title>Performance of ILP solver compared to DOCKS.</title>
<p>For each combination of 5 ≤
<italic>k</italic>
≤ 10 and 20 ≤
<italic>L</italic>
≤ 200 we ran the ILP solver for up to 24 hours starting from a DOCKS feasible solution. The histograms show the percent improvement of the
<italic>k</italic>
-mer set size generated by the ILP solver compared to DOCKS. For
<italic>L</italic>
> 60 and all tested values of
<italic>k</italic>
, the improvement was <1%.</p>
</caption>
<graphic xlink:href="pcbi.1005777.g003"></graphic>
</fig>
</sec>
<sec id="sec018">
<title>Comparison to minimizers on several genomes</title>
<p>The minimizer algorithm [
<xref rid="pcbi.1005777.ref001" ref-type="bibr">1</xref>
] selects the lexicographically smallest
<italic>k</italic>
-mer in each window of
<italic>w</italic>
consecutive
<italic>k</italic>
-mers in order to reduce storage size for sequence comparison. We can improve the minimizers algorithm by choosing the lexicographically smallest
<italic>k</italic>
-mer that is in the DOCKS set for the corresponding
<italic>k</italic>
and
<italic>L</italic>
parameters (i.e.
<italic>L</italic>
=
<italic>k</italic>
+
<italic>w</italic>
− 1). Such a
<italic>k</italic>
-mer is guaranteed to exist, as by construction, every window of length
<italic>w</italic>
contains a
<italic>k</italic>
-mer in the UHS. We ran the minimizer selection algorithm and DOCKS-based selection on four different genomes, using
<italic>k</italic>
= 10 and
<italic>L</italic>
= 30: the entire human reference genome (GRCh38), the bacteria
<italic>A. tropicalis</italic>
strain NBRC 16470, and
<italic>C. crescentus</italic>
strain CB15, the worm C. elegans assembly WBcel235. For comparison, we also included the results when using the minimizer according to a random ordering of the
<italic>k</italic>
-mers, instead of lexicographic. This random ordering typically improves over minimizers since it avoids the problem of always selecting the common poly-A homopolymer.</p>
<p>
<xref rid="pcbi.1005777.t002" ref-type="table">Table 2</xref>
shows that DOCKS selects far fewer
<italic>k</italic>
-mers and those
<italic>k</italic>
-mers are more widely spread apart in the sequence. The advantage of DOCKS grows as the sequence length increases, having a size ≈ 85% of the next-best method for the small bacterial genome, ≈ 50% for the larger
<italic>C. elegans</italic>
genome, and only ≈ 40% for the human genome.</p>
<table-wrap id="pcbi.1005777.t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1005777.t002</object-id>
<label>Table 2</label>
<caption>
<title>The number of 10-mers needed to hit all 30-long sequences in four genomes: Two bacterial genomes
<italic>A. tropicalis</italic>
,
<italic>C. crescentus</italic>
, the worm
<italic>C. elegans</italic>
and a mammal genome,
<italic>H. sapiens</italic>
.</title>
<p>The genome sizes are quoted after removing all
<italic>N</italic>
s and ambiguous codes. We tested three algorithms: minimizers picking the lexicographically smallest 10-mer, minimizer picking the first in a random
<italic>k</italic>
-mer ordering, and selection using the set produced by DOCKS. In case of multiple DOCKS-selected 10-mers in the 30-long window, the lexicographically smallest was chosen.
<italic># mers</italic>
is the number of distinct 10-mers selected, and
<italic>avg. dist.</italic>
is the average distance between two selected 10-mers.</p>
</caption>
<alternatives>
<graphic id="pcbi.1005777.t002g" xlink:href="pcbi.1005777.t002"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Species</th>
<th align="center" rowspan="1" colspan="1">Genome size (Mbp)</th>
<th align="left" rowspan="1" colspan="1">Method</th>
<th align="center" rowspan="1" colspan="1"># mers (thousands)</th>
<th align="center" rowspan="1" colspan="1">avg. dist.</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="3" colspan="1">
<italic>A. tropicalis</italic>
</td>
<td align="center" rowspan="3" colspan="1">0.393</td>
<td align="left" rowspan="1" colspan="1">lexicographic</td>
<td align="char" char="." rowspan="1" colspan="1">32.9</td>
<td align="char" char="." rowspan="1" colspan="1">9.48</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">randomized</td>
<td align="char" char="." rowspan="1" colspan="1">28.0</td>
<td align="char" char="." rowspan="1" colspan="1">11.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">DOCKS</td>
<td align="char" char="." rowspan="1" colspan="1">23.7</td>
<td align="char" char="." rowspan="1" colspan="1">12.4</td>
</tr>
<tr>
<td align="left" rowspan="3" colspan="1">
<italic>C. crescentus</italic>
</td>
<td align="center" rowspan="3" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">lexicographic</td>
<td align="char" char="." rowspan="1" colspan="1">114.0</td>
<td align="char" char="." rowspan="1" colspan="1">10.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">randomized</td>
<td align="char" char="." rowspan="1" colspan="1">89.6</td>
<td align="char" char="." rowspan="1" colspan="1">11.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">DOCKS</td>
<td align="char" char="." rowspan="1" colspan="1">66.0</td>
<td align="char" char="." rowspan="1" colspan="1">12.4</td>
</tr>
<tr>
<td align="left" rowspan="3" colspan="1">
<italic>C. elegans</italic>
</td>
<td align="center" rowspan="3" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">lexicographic</td>
<td align="char" char="." rowspan="1" colspan="1">286.0</td>
<td align="char" char="." rowspan="1" colspan="1">8.83</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">randomized</td>
<td align="char" char="." rowspan="1" colspan="1">277.0</td>
<td align="char" char="." rowspan="1" colspan="1">11.0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">DOCKS</td>
<td align="char" char="." rowspan="1" colspan="1">145.0</td>
<td align="char" char="." rowspan="1" colspan="1">12.4</td>
</tr>
<tr>
<td align="left" rowspan="3" colspan="1">
<italic>H. sapiens</italic>
</td>
<td align="center" rowspan="3" colspan="1">2900</td>
<td align="left" rowspan="1" colspan="1">lexicographic</td>
<td align="char" char="." rowspan="1" colspan="1">543.0</td>
<td align="char" char="." rowspan="1" colspan="1">9.13</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">randomized</td>
<td align="char" char="." rowspan="1" colspan="1">389.0</td>
<td align="char" char="." rowspan="1" colspan="1">10.9</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">DOCKS</td>
<td align="char" char="." rowspan="1" colspan="1">154.0</td>
<td align="char" char="." rowspan="1" colspan="1">12.1</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions" id="sec019">
<title>Discussion</title>
<p>We presented the DOCKS algorithm, which generates a compact set of
<italic>k</italic>
-mers that together hit all
<italic>L</italic>
-long DNA sequences. Such compact sets have many applications in sequence analysis, including space efficient data structures and large-scale sequence analysis. We tested the sets produced by our algorithm in an application that requires finding a small set of 10-mers hitting all 30-long words in the input genomes. Compared to minimizers, the current state of the art, our sets were almost 2.5-3.5 times smaller for the human genome. We could produce sets for the range of
<italic>k</italic>
= 5 to 10 and
<italic>L</italic>
= 20 to 200, and the results show that for
<italic>L</italic>
> 100 the size of the solution is quite close to the theoretical lower bound. We expect the sets produced by our approach to be useful and improving a variety of biological applications that require complex analysis of numerous sequences.</p>
<p>We see the benefit of our compact UHSs in many data structures and algorithms that analyze high-throughput sequencing data. For example, we expect that binning-based
<italic>k</italic>
-mer counting applications, such as KMC 2 [
<xref rid="pcbi.1005777.ref008" ref-type="bibr">8</xref>
], can reduce the number of bins, and thus the number of disk accesses, using universal
<italic>k</italic>
-mer hitting sets. Analyses that rely on
<italic>k</italic>
-mer counting, such as metagenomic binning as implemented in Kraken [
<xref rid="pcbi.1005777.ref012" ref-type="bibr">12</xref>
], will also see improved computational resource usage. The minimizer idea has been widely deployed, and universal hitting
<italic>k</italic>
-mers can typically be used as a drop-in replacement, improving computational performance.</p>
<p>The good performance of the algorithms can be attributed to their two phase approach. In the first phase we optimally and rapidly remove a minimum-size set that hits all infinite sequences, which also takes care of many
<italic>L</italic>
-long sequences. In the second phase we greedily remove
<italic>k</italic>
-mers that hit remaining
<italic>L</italic>
-long sequences. Overall efficiency is primarily due to the first phase, which runs in time
<italic>O</italic>
(
<italic>k</italic>
) times the size of the output. In the second phase dynamic programming is used, providing running time polynomial in the output size.</p>
<p>We developed two additional variants of DOCKS that reduce the runtime and memory usage at the price of increasing the size of the set created. DOCKS can provide a solution for
<italic>k</italic>
= 10, DOCKSany for
<italic>k</italic>
= 11, and the fastest variant, DOCKSanyX for
<italic>k</italic>
= 13 (with
<italic>X</italic>
= 10000) with
<italic>L</italic>
= 200, within a day. Note that all heuristics are bound to hit a limit since their runtime depends exponentially on
<italic>k</italic>
. This is an inherent property of the problem and its output size. Still, we manage to increase
<italic>k</italic>
by one or two using each heuristic. In partial remedy, we also proposed a construction that can push that limit further at the expense of solution size.</p>
<p>Our approaches are heuristic in nature. This is not surprising, since as we show, the problem of finding a minimum (
<italic>k</italic>
,
<italic>L</italic>
)-hitting set for a given set of sequences is NP-hard. Moreover, even after removing an optimal decycling set, one needs to solve the problem of finding a minimum vertex set that hits all
<italic>L</italic>
-long sequences in a directed acyclic graph, which is NP-hard. Hence, DOCKS usually produces sub-optimal solutions. For example, for
<italic>k</italic>
= 4 and
<italic>L</italic>
= 10 the optimal solution obtained by solving an ILP formulation had size 89, compared to 91 produced by DOCKS. In fact, our tests show that if further reduction to the hitting set size is needed, starting from the DOCKS solution and improving it using ILP is a good strategy, at least for small values of
<italic>k</italic>
.</p>
<p>Our study raises several open problems. First, is there a characterization for a minimum universal (
<italic>k</italic>
,
<italic>L</italic>
)-hitting set similar to the characterization of decycling sets by Mykkeltveit [
<xref rid="pcbi.1005777.ref017" ref-type="bibr">17</xref>
]? That is, does there exist an algorithm polynomial in
<italic>k</italic>
and
<italic>L</italic>
that can check if a
<italic>k</italic>
-mer belongs to a particular universal (
<italic>k</italic>
,
<italic>L</italic>
)-hitting set. The fact that MINIMUM (
<italic>k</italic>
,
<italic>L</italic>
)-HITTING SET on a given set of input sequences is NP-hard still leaves the universal case open. A related question is whether one can find an algorithm that generates an optimal (universal) (
<italic>k</italic>
,
<italic>L</italic>
)-hitting set while requiring work polynomial in the output set size. This is particularly interesting for the universal case, where the input is only the values
<italic>k</italic>
and
<italic>L</italic>
and the output size is > |Σ|
<sup>
<italic>k</italic>
</sup>
/
<italic>k</italic>
. Second, is the problem of minimum
<italic></italic>
-path cover in a DAG
<italic>G</italic>
polynomial when
<italic>G</italic>
is a subgraph of a de Bruijn graph? We know it is hard for a general DAG, but the specific structure of de Bruijn graphs may make the problem easier. Third, the bottleneck to DOCKS running time is the second phase, which currently re-calculates the vertex hitting numbers on each iteration. Can one find a dynamic algorithm that updates these numbers more efficiently after the removal of one vertex? Fourth, is there a tight upper bound on the number
<italic>p</italic>
of vertices that will be removed by the greedy heuristic? Fifth, can we give an upper bound or a tighter lower bound on the size of
<italic>U</italic>
<sub>
<italic>k</italic>
,
<italic>L</italic>
</sub>
?</p>
<sec id="sec020">
<title>Conclusion</title>
<p>We demonstrated the ability of DOCKS to generate compact sets of
<italic>k</italic>
-mers that hit all
<italic>L</italic>
-long sequences. These
<italic>k</italic>
-mer sets can be generated once for any desired value of
<italic>k</italic>
≤ 13 and
<italic>L</italic>
and then readily used for many different purposes. For example, we produced a set of only 700 6-mers out of a total of 4096 that hits every sequence longer than 70 bases—a typical read length for many sequencing experiments—enabling efficient binning of reads. Our compact sets can improve many of the applications that currently use minimizers, as we showed that they are both smaller and more sparsely distributed across genome sequences.</p>
</sec>
</sec>
<sec sec-type="supplementary-material" id="sec021">
<title>Supporting information</title>
<supplementary-material content-type="local-data" id="pcbi.1005777.s001">
<label>S1 Table</label>
<caption>
<title>Set size, running time and memory usage of DOCKS, DOCKSany, DOCKSanyX, and the greedy algorithm for the hitting set problem.</title>
<p>The table contains solution set size, time in seconds and memory in KB for DOCKS, DOCKSany, DOCKSanyX and the greedy approach algorithms. Note that the reported times are for individual runs of each (
<italic>k</italic>
,
<italic>L</italic>
) pair, but the sets for all longer
<italic>L</italic>
values are computed when computing the (
<italic>k</italic>
,
<italic>L</italic>
= 20) set with DOCKSany or DOCKSanyX and the runtime can be amortized across all of these calculations.</p>
<p>(XLSX)</p>
</caption>
<media xlink:href="pcbi.1005777.s001.xlsx">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1005777.s002">
<label>S1 Text</label>
<caption>
<title>Supplementary figures and theoretical proofs.</title>
<p>(PDF)</p>
</caption>
<media xlink:href="pcbi.1005777.s002.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>Part of this work was done while Y.O., R.S. and C.K. were visiting the Simons Institute for the Theory of Computing.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pcbi.1005777.ref001">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Hayes</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Hunt</surname>
<given-names>BR</given-names>
</name>
,
<name>
<surname>Mount</surname>
<given-names>SM</given-names>
</name>
,
<name>
<surname>Yorke</surname>
<given-names>JA</given-names>
</name>
.
<article-title>Reducing storage requirements for biological sequence comparison</article-title>
.
<source>Bioinformatics</source>
.
<year>2004</year>
;
<volume>20</volume>
(
<issue>18</issue>
):
<fpage>3363</fpage>
<lpage>3369</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth408</pub-id>
<pub-id pub-id-type="pmid">15256412</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref002">
<label>2</label>
<mixed-citation publication-type="other">Schleimer S, Wilkerson DS, Aiken A. Winnowing: Local Algorithms for Document Fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. SIGMOD’03. New York, NY, USA: ACM; 2003. p. 76–85. Available from:
<ext-link ext-link-type="uri" xlink:href="http://doi.acm.org/10.1145/872757.872770">http://doi.acm.org/10.1145/872757.872770</ext-link>
.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref003">
<label>3</label>
<mixed-citation publication-type="other">Karkkainen J, Ukkonen E. Sparse Suffix Trees. In: Computing and Combinatorics: 2nd Annual International Conference, COCOON’96. vol. 2. Springer; 1996. p. 219–230.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref004">
<label>4</label>
<mixed-citation publication-type="other">Grabowski S, Raniszewski M. Sampling the Suffix Array with Minimizers. In: Proceedings of the 22nd International Symposium on String Processing and Information Retrieval. vol. 9309. Springer-Verlag New York, Inc.; 2015. p. 287–298.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref005">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Solomon</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
.
<article-title>Fast search of thousands of short-read sequencing experiments</article-title>
.
<source>Nature Biotech</source>
.
<year>2016</year>
<month>3</month>
;
<volume>34</volume>
(
<issue>3</issue>
):
<fpage>300</fpage>
<lpage>302</lpage>
.
<pub-id pub-id-type="doi">10.1038/nbt.3442</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref006">
<label>6</label>
<mixed-citation publication-type="journal">
<name>
<surname>Solomon</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
.
<article-title>Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees</article-title>
.
<source>bioRxiv</source>
.
<year>2016</year>
; Available from:
<ext-link ext-link-type="uri" xlink:href="http://biorxiv.org/content/early/2016/12/02/086561">http://biorxiv.org/content/early/2016/12/02/086561</ext-link>
.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref007">
<label>7</label>
<mixed-citation publication-type="other">Movahedi NS, Forouzmand E, Chitsaz H. De novo co-assembly of bacterial genomes from multiple single cells. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2012. p. 1–5.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref008">
<label>8</label>
<mixed-citation publication-type="journal">
<name>
<surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Kokot</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Grabowski</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Debudaj-Grabysz</surname>
<given-names>A</given-names>
</name>
.
<article-title>KMC 2: fast and resource-frugal k-mer counting</article-title>
.
<source>Bioinformatics</source>
.
<year>2015</year>
<month>5</month>
;
<volume>31</volume>
(
<issue>10</issue>
):
<fpage>1569</fpage>
<lpage>1576</lpage>
. Available from:
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/31/10/1569">http://bioinformatics.oxfordjournals.org/content/31/10/1569</ext-link>
.
<pub-id pub-id-type="pmid">25609798</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref009">
<label>9</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Limasset</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Jackman</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
,
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
.
<article-title>On the representation of de Bruijn graphs</article-title>
.
<source>Journal of Computational Biology</source>
.
<year>2015</year>
<month>1</month>
;
<volume>22</volume>
(
<issue>5</issue>
):
<fpage>336</fpage>
<lpage>352</lpage>
. Available from:
<ext-link ext-link-type="uri" xlink:href="http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.0160">http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.0160</ext-link>
.
<pub-id pub-id-type="pmid">25629448</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref010">
<label>10</label>
<mixed-citation publication-type="other">Li Y, Kamousi P, Han F, Yang S, Yan X, Suri S. Memory efficient minimum substring partitioning. In: Proceedings of the VLDB Endowment. vol. 6. VLDB Endowment; 2013. p. 169–180.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref011">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ye</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Ma</surname>
<given-names>ZS</given-names>
</name>
,
<name>
<surname>Cannon</surname>
<given-names>CH</given-names>
</name>
,
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Douglas</surname>
<given-names>WY</given-names>
</name>
.
<article-title>Exploiting sparseness in de novo genome assembly</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2012</year>
;
<volume>13</volume>
(
<issue>6</issue>
):
<fpage>S1</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-13-S6-S1</pub-id>
<pub-id pub-id-type="pmid">22537038</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref012">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wood</surname>
<given-names>DE</given-names>
</name>
,
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
.
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
.
<source>Genome Biology</source>
.
<year>2014</year>
;
<volume>15</volume>
(
<issue>3</issue>
):
<fpage>R46</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
<pub-id pub-id-type="pmid">24580807</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref013">
<label>13</label>
<mixed-citation publication-type="other">Sahinalp SC, Vishkin U. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In: 37th Annual Symposium on Foundations of Computer Science, Proceedings; 1996. p. 320–328.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref014">
<label>14</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hach</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Numanagić</surname>
<given-names>I</given-names>
</name>
,
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Sahinalp</surname>
<given-names>SC</given-names>
</name>
.
<article-title>SCALCE: boosting sequence compression algorithms using locally consistent encoding</article-title>
.
<source>Bioinformatics</source>
.
<year>2012</year>
<month>12</month>
;
<volume>28</volume>
(
<issue>23</issue>
):
<fpage>3051</fpage>
<lpage>3057</lpage>
. Available from:
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/28/23/3051">http://bioinformatics.oxfordjournals.org/content/28/23/3051</ext-link>
<pub-id pub-id-type="pmid">23047557</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref015">
<label>15</label>
<mixed-citation publication-type="other">Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C. Compact universal k-mer hitting sets. In: International Workshop on Algorithms in Bioinformatics. vol. 9838. Springer; 2016. p. 257–268.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref016">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Champarnaud</surname>
<given-names>JM</given-names>
</name>
,
<name>
<surname>Hansel</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Perrin</surname>
<given-names>D</given-names>
</name>
.
<article-title>Unavoidable sets of constant length</article-title>
.
<source>International Journal of Algebra and Computation</source>
.
<year>2004</year>
;
<volume>14</volume>
(
<issue>02</issue>
):
<fpage>241</fpage>
<lpage>251</lpage>
.
<pub-id pub-id-type="doi">10.1142/S0218196704001700</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref017">
<label>17</label>
<mixed-citation publication-type="journal">
<name>
<surname>Mykkeltveit</surname>
<given-names>J</given-names>
</name>
.
<article-title>A proof of Golomb’s conjecture for the de Bruijn graph</article-title>
.
<source>Journal of Combinatorial Theory, Series B</source>
.
<year>1972</year>
<month>8</month>
;
<volume>13</volume>
(
<issue>1</issue>
):
<fpage>40</fpage>
<lpage>45</lpage>
. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.sciencedirect.com/science/article/pii/0095895672900068">http://www.sciencedirect.com/science/article/pii/0095895672900068</ext-link>
.</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref018">
<label>18</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chvatal</surname>
<given-names>V</given-names>
</name>
.
<article-title>A greedy heuristic for the set-covering problem</article-title>
.
<source>Mathematics of Operations Research</source>
.
<year>1979</year>
;
<volume>4</volume>
(
<issue>3</issue>
):
<fpage>233</fpage>
<lpage>235</lpage>
.
<pub-id pub-id-type="doi">10.1287/moor.4.3.233</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref019">
<label>19</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rhoads</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Au</surname>
<given-names>KF</given-names>
</name>
.
<article-title>PacBio sequencing and its applications</article-title>
.
<source>Genomics, proteomics & bioinformatics</source>
.
<year>2015</year>
;
<volume>13</volume>
(
<issue>5</issue>
):
<fpage>278</fpage>
<lpage>289</lpage>
.
<pub-id pub-id-type="doi">10.1016/j.gpb.2015.08.002</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref020">
<label>20</label>
<mixed-citation publication-type="journal">
<name>
<surname>Branton</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Deamer</surname>
<given-names>DW</given-names>
</name>
,
<name>
<surname>Marziali</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Bayley</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Benner</surname>
<given-names>SA</given-names>
</name>
,
<name>
<surname>Butler</surname>
<given-names>T</given-names>
</name>
,
<etal>et al</etal>
<article-title>The potential and challenges of nanopore sequencing</article-title>
.
<source>Nature biotechnology</source>
.
<year>2008</year>
;
<volume>26</volume>
(
<issue>10</issue>
):
<fpage>1146</fpage>
<lpage>1153</lpage>
.
<pub-id pub-id-type="doi">10.1038/nbt.1495</pub-id>
<pub-id pub-id-type="pmid">18846088</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1005777.ref021">
<label>21</label>
<mixed-citation publication-type="other">Gurobi Optimization, Inc. Gurobi Optimizer Reference Manual; 2016. Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.gurobi.com">http://www.gurobi.com</ext-link>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F86  | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000F86  | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021