Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Improving the performance of minimizers and winnowing schemes

Identifieur interne : 000B19 ( Pmc/Curation ); précédent : 000B18; suivant : 000B20

Improving the performance of minimizers and winnowing schemes

Auteurs : Guillaume Marçais [États-Unis] ; David Pellow [Israël] ; Daniel Bork [États-Unis] ; Yaron Orenstein [États-Unis] ; Ron Shamir [Israël] ; Carl Kingsford [États-Unis]

Source :

RBID : PMC:5870760

Abstract

AbstractMotivation

The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.

Results

We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence.

Availability and Implementation

The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.


Url:
DOI: 10.1093/bioinformatics/btx235
PubMed: 28881970
PubMed Central: 5870760

Links toward previous steps (curation, corpus...)


Links to Exploration step

PMC:5870760

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Improving the performance of minimizers and winnowing schemes</title>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Bork, Daniel" sort="Bork, Daniel" uniqKey="Bork D" first="Daniel" last="Bork">Daniel Bork</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff3">Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28881970</idno>
<idno type="pmc">5870760</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870760</idno>
<idno type="RBID">PMC:5870760</idno>
<idno type="doi">10.1093/bioinformatics/btx235</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000B19</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B19</idno>
<idno type="wicri:Area/Pmc/Curation">000B19</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000B19</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Improving the performance of minimizers and winnowing schemes</title>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Bork, Daniel" sort="Bork, Daniel" uniqKey="Bork D" first="Daniel" last="Bork">Daniel Bork</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff3">Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff2">Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="1">
<nlm:aff id="btx235-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>The minimizers scheme is a method for selecting
<italic>k</italic>
-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many
<italic>k</italic>
-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of
<italic>k</italic>
-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>We provide an in-depth analysis of the effect of
<italic>k</italic>
-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer
<italic>et al.</italic>
) on the expected density of minimizers in a random sequence.</p>
</sec>
<sec id="SA4">
<title>Availability and Implementation</title>
<p>The software used for this analysis is available on GitHub:
<ext-link ext-link-type="uri" xlink:href="https://github.com/gmarcais/minimizers.git">https://github.com/gmarcais/minimizers.git</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Bruijn, N G" uniqKey="De Bruijn N">N.G. de Bruijn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S. Deorowicz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S. Grabowski</name>
</author>
<author>
<name sortKey="Raniszewski, M" uniqKey="Raniszewski M">M. Raniszewski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y. Li</name>
</author>
<author>
<name sortKey="Yan, X" uniqKey="Yan X">X. Yan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Orenstein, Y" uniqKey="Orenstein Y">Y. Orenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Orenstein, Y" uniqKey="Orenstein Y">Y. Orenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M. Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M. Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schleimer, S" uniqKey="Schleimer S">S. Schleimer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, D E" uniqKey="Wood D">D.E. Wood</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C. Ye</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28881970</article-id>
<article-id pub-id-type="pmc">5870760</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btx235</article-id>
<article-id pub-id-type="publisher-id">btx235</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017</subject>
<subj-group subj-group-type="category-toc-heading">
<subject>Hitseq</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Improving the performance of minimizers and winnowing schemes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Marçais</surname>
<given-names>Guillaume</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff1">1</xref>
<xref ref-type="corresp" rid="btx235-cor1"></xref>
<pmc-comment>gmarcais@cs.cmu.edu</pmc-comment>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pellow</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bork</surname>
<given-names>Daniel</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Orenstein</surname>
<given-names>Yaron</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shamir</surname>
<given-names>Ron</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
<xref ref-type="aff" rid="btx235-aff1">1</xref>
<xref ref-type="corresp" rid="btx235-cor1"></xref>
<pmc-comment>carlk@cs.cmu.edu</pmc-comment>
</contrib>
</contrib-group>
<aff id="btx235-aff1">
<label>1</label>
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</aff>
<aff id="btx235-aff2">
<label>2</label>
Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</aff>
<aff id="btx235-aff3">
<label>3</label>
Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA</aff>
<author-notes>
<corresp id="btx235-cor1">To whom correspondence should be addressed. Email:
<email>gmarcais@cs.cmu.edu</email>
or
<email>carlk@cs.cmu.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>15</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub" iso-8601-date="2017-07-12">
<day>12</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>12</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>33</volume>
<issue>14</issue>
<fpage>i110</fpage>
<lpage>i117</lpage>
<permissions>
<copyright-statement>© The Author 2017. Published by Oxford University Press.</copyright-statement>
<copyright-year>2017</copyright-year>
<license license-type="cc-by-nc" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<self-uri xlink:href="btx235.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>The minimizers scheme is a method for selecting
<italic>k</italic>
-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many
<italic>k</italic>
-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of
<italic>k</italic>
-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>We provide an in-depth analysis of the effect of
<italic>k</italic>
-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer
<italic>et al.</italic>
) on the expected density of minimizers in a random sequence.</p>
</sec>
<sec id="SA4">
<title>Availability and Implementation</title>
<p>The software used for this analysis is available on GitHub:
<ext-link ext-link-type="uri" xlink:href="https://github.com/gmarcais/minimizers.git">https://github.com/gmarcais/minimizers.git</ext-link>
.</p>
</sec>
</abstract>
<funding-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">Israel Science Foundation</named-content>
<named-content content-type="funder-identifier">10.13039/501100003977</named-content>
</funding-source>
</award-group>
</funding-group>
<counts>
<page-count count="8"></page-count>
</counts>
</article-meta>
</front>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B19 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000B19 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:5870760
   |texte=   Improving the performance of minimizers and winnowing schemes
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:28881970" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021