Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Identifieur interne : 000F86 ( Pmc/Curation ); précédent : 000F85; suivant : 000F87

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Auteurs : Yaron Orenstein [États-Unis] ; David Pellow [Israël] ; Guillaume Marçais [États-Unis] ; Ron Shamir [Israël] ; Carl Kingsford [États-Unis]

Source :

RBID : PMC:5645146

Abstract

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.


Url:
DOI: 10.1371/journal.pcbi.1005777
PubMed: 28968408
PubMed Central: 5645146

Links toward previous steps (curation, corpus...)


Links to Exploration step

PMC:5645146

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="1">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="1">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28968408</idno>
<idno type="pmc">5645146</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5645146</idno>
<idno type="RBID">PMC:5645146</idno>
<idno type="doi">10.1371/journal.pcbi.1005777</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000F86</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F86</idno>
<idno type="wicri:Area/Pmc/Curation">000F86</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000F86</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="1">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="1">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS Computational Biology</title>
<idno type="ISSN">1553-734X</idno>
<idno type="eISSN">1553-7358</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers
<italic>k</italic>
and
<italic>L</italic>
>
<italic>k</italic>
, we say that a set of
<italic>k</italic>
-mers is a
<italic>universal hitting set</italic>
(UHS) if every possible
<italic>L</italic>
-long sequence must contain a
<italic>k</italic>
-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of
<italic>k</italic>
and
<italic>L</italic>
and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Hayes, W" uniqKey="Hayes W">W Hayes</name>
</author>
<author>
<name sortKey="Hunt, Br" uniqKey="Hunt B">BR Hunt</name>
</author>
<author>
<name sortKey="Mount, Sm" uniqKey="Mount S">SM Mount</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solomon, B" uniqKey="Solomon B">B Solomon</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solomon, B" uniqKey="Solomon B">B Solomon</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author>
<name sortKey="Kokot, M" uniqKey="Kokot M">M Kokot</name>
</author>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
<author>
<name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A Debudaj-Grabysz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Limasset, A" uniqKey="Limasset A">A Limasset</name>
</author>
<author>
<name sortKey="Jackman, S" uniqKey="Jackman S">S Jackman</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C Ye</name>
</author>
<author>
<name sortKey="Ma, Zs" uniqKey="Ma Z">ZS Ma</name>
</author>
<author>
<name sortKey="Cannon, Ch" uniqKey="Cannon C">CH Cannon</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Douglas, Wy" uniqKey="Douglas W">WY Douglas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, De" uniqKey="Wood D">DE Wood</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hach, F" uniqKey="Hach F">F Hach</name>
</author>
<author>
<name sortKey="Numanagi, I" uniqKey="Numanagi I">I Numanagić</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Sahinalp, Sc" uniqKey="Sahinalp S">SC Sahinalp</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Champarnaud, Jm" uniqKey="Champarnaud J">JM Champarnaud</name>
</author>
<author>
<name sortKey="Hansel, G" uniqKey="Hansel G">G Hansel</name>
</author>
<author>
<name sortKey="Perrin, D" uniqKey="Perrin D">D Perrin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mykkeltveit, J" uniqKey="Mykkeltveit J">J Mykkeltveit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chvatal, V" uniqKey="Chvatal V">V Chvatal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rhoads, A" uniqKey="Rhoads A">A Rhoads</name>
</author>
<author>
<name sortKey="Au, Kf" uniqKey="Au K">KF Au</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Branton, D" uniqKey="Branton D">D Branton</name>
</author>
<author>
<name sortKey="Deamer, Dw" uniqKey="Deamer D">DW Deamer</name>
</author>
<author>
<name sortKey="Marziali, A" uniqKey="Marziali A">A Marziali</name>
</author>
<author>
<name sortKey="Bayley, H" uniqKey="Bayley H">H Bayley</name>
</author>
<author>
<name sortKey="Benner, Sa" uniqKey="Benner S">SA Benner</name>
</author>
<author>
<name sortKey="Butler, T" uniqKey="Butler T">T Butler</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">ploscomp</journal-id>
<journal-title-group>
<journal-title>PLoS Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1553-734X</issn>
<issn pub-type="epub">1553-7358</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28968408</article-id>
<article-id pub-id-type="pmc">5645146</article-id>
<article-id pub-id-type="publisher-id">PCOMPBIOL-D-17-00243</article-id>
<article-id pub-id-type="doi">10.1371/journal.pcbi.1005777</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Human Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Computer and Information Sciences</subject>
<subj-group>
<subject>Graph Theory</subject>
<subj-group>
<subject>Directed Graphs</subject>
<subj-group>
<subject>Directed Acyclic Graphs</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical Sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Graph Theory</subject>
<subj-group>
<subject>Directed Graphs</subject>
<subj-group>
<subject>Directed Acyclic Graphs</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Animal Genomics</subject>
<subj-group>
<subject>Invertebrate Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Microbiology</subject>
<subj-group>
<subject>Bacteriology</subject>
<subj-group>
<subject>Bacterial Genetics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Microbial Genetics</subject>
<subj-group>
<subject>Bacterial Genetics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Microbial Genomics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Microbiology</subject>
<subj-group>
<subject>Microbial Genomics</subject>
<subj-group>
<subject>Bacterial Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical Sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Applied Mathematics</subject>
<subj-group>
<subject>Algorithms</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Simulation and Modeling</subject>
<subj-group>
<subject>Algorithms</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Molecular biology</subject>
<subj-group>
<subject>Molecular biology techniques</subject>
<subj-group>
<subject>Sequencing techniques</subject>
<subj-group>
<subject>DNA sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Molecular biology techniques</subject>
<subj-group>
<subject>Sequencing techniques</subject>
<subj-group>
<subject>DNA sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical Sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Algebra</subject>
<subj-group>
<subject>Polynomials</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</article-title>
<alt-title alt-title-type="running-head">Compact universal
<italic>k</italic>
-mer hitting sets</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Orenstein</surname>
<given-names>Yaron</given-names>
</name>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Software</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Visualization</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Pellow</surname>
<given-names>David</given-names>
</name>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0002-5083-5925</contrib-id>
<name>
<surname>Marçais</surname>
<given-names>Guillaume</given-names>
</name>
<role content-type="http://credit.casrai.org/">Conceptualization</role>
<role content-type="http://credit.casrai.org/">Data curation</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-1889-9870</contrib-id>
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
<role content-type="http://credit.casrai.org/">Funding acquisition</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Resources</role>
<role content-type="http://credit.casrai.org/">Supervision</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
<role content-type="http://credit.casrai.org/">Funding acquisition</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Resources</role>
<role content-type="http://credit.casrai.org/">Supervision</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review & editing</role>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</aff>
<aff id="aff003">
<label>3</label>
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Raphael</surname>
<given-names>Benjamin J.</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>Princeton University, UNITED STATES</addr-line>
</aff>
<author-notes>
<fn fn-type="COI-statement" id="coi001">
<p>The authors have declared that no competing interests exist.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>carlk@cs.cmu.edu</email>
(CK);
<email>rshamir@cs.tau.ac.il</email>
(RS)</corresp>
</author-notes>
<pub-date pub-type="collection">
<month>10</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub">
<day>2</day>
<month>10</month>
<year>2017</year>
</pub-date>
<volume>13</volume>
<issue>10</issue>
<elocation-id>e1005777</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>2</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>9</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© 2017 Orenstein et al</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Orenstein et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="pcbi.1005777.pdf"></self-uri>
<abstract>
<p>With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers
<italic>k</italic>
and
<italic>L</italic>
>
<italic>k</italic>
, we say that a set of
<italic>k</italic>
-mers is a
<italic>universal hitting set</italic>
(UHS) if every possible
<italic>L</italic>
-long sequence must contain a
<italic>k</italic>
-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of
<italic>k</italic>
and
<italic>L</italic>
and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.</p>
</abstract>
<abstract abstract-type="summary">
<title>Author summary</title>
<p>High-throughput sequencing data has been accumulating at an extreme pace. The need to efficiently analyze and process it has become a critical challenge of the field. Many of the data structures and algorithms for this task rely on
<italic>k</italic>
-mer sets (DNA words of length
<italic>k</italic>
) to represent the sequences in a dataset. The runtime and memory usage of these highly depend on the size of the
<italic>k</italic>
-mer sets used. Thus, a minimum-size
<italic>k</italic>
-mer hitting set, namely, a set of
<italic>k</italic>
-mers that hit (have non-empty overlap with) all sequences, is desirable. In this work, we create universal
<italic>k</italic>
-mer hitting sets that hit any
<italic>L</italic>
-long sequence. We present several heuristic approaches for constructing such small sets; the approaches vary in the trade-off between the size of the produced set and runtime and memory usage. We show the benefit in practice of using the produced universal
<italic>k</italic>
-mer hitting sets compared to minimizers and randomly created hitting sets on the human genome.</p>
</abstract>
<funding-group>
<award-group id="award001">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/501100003977</institution-id>
<institution>Israel Science Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>ISF-NSFC joint program 2015-2018</award-id>
<principal-award-recipient>
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-1889-9870</contrib-id>
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award002">
<funding-source>
<institution>Edmond J. Safra Center for Bioinformatics at Tel Aviv University</institution>
</funding-source>
<award-id>Ph.D. fellowship</award-id>
<principal-award-recipient>
<name>
<surname>Pellow</surname>
<given-names>David</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award003">
<funding-source>
<institution>Gordon and Betty Moore Foundation (US)</institution>
</funding-source>
<award-id>Data-Driven Discovery Initiative through Grant GBMF4554</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award004">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000001</institution-id>
<institution>National Science Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>CCF-1256087</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award005">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000001</institution-id>
<institution>National Science Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>CCF-1319998</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award006">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000002</institution-id>
<institution>National Institutes of Health</institution>
</institution-wrap>
</funding-source>
<award-id>R01HG007104</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award007">
<funding-source>
<institution-wrap>
<institution-id institution-id-type="funder-id">http://dx.doi.org/10.13039/100000879</institution-id>
<institution>Alfred P. Sloan Foundation</institution>
</institution-wrap>
</funding-source>
<award-id>Fellow</award-id>
<principal-award-recipient>
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
</principal-award-recipient>
</award-group>
<funding-statement>RS was supported in part by the Israel Science Foundation as part of the ISF-NSFC joint program 2015-2018. DP was supported in part by a Ph.D. fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University and in part by an Israel Ministry of Immigrant Absorption fellowship. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to CK, by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R01HG007104). CK received support as an Alfred P. Sloan Research Fellow. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<fig-count count="3"></fig-count>
<table-count count="2"></table-count>
<page-count count="15"></page-count>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>PLOS Publication Stage</meta-name>
<meta-value>vor-update-to-uncorrected-proof</meta-value>
</custom-meta>
<custom-meta>
<meta-name>Publication Update</meta-name>
<meta-value>2017-10-17</meta-value>
</custom-meta>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>All relevant data are within the paper and its Supporting Information files. The software is available from github.com/Shamir-Lab/DOCKS and k-mer sets are available from acgt.cs.tau.ac.il/docks website.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>All relevant data are within the paper and its Supporting Information files. The software is available from github.com/Shamir-Lab/DOCKS and k-mer sets are available from acgt.cs.tau.ac.il/docks website.</p>
</notes>
</front>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F86 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000F86 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:5645146
   |texte=   Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:28968408" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021