Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.

Identifieur interne : 000C31 ( PubMed/Curation ); précédent : 000C30; suivant : 000C32

The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.

Auteurs : Meznah Almutairy [États-Unis] ; Eric Torng [États-Unis]

Source :

RBID : pubmed:28686614

Descripteurs français

English descriptors

Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.

DOI: 10.1371/journal.pone.0179046
PubMed: 28686614

Links toward previous steps (curation, corpus...)


Links to Exploration step

pubmed:28686614

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.</title>
<author>
<name sortKey="Almutairy, Meznah" sort="Almutairy, Meznah" uniqKey="Almutairy M" first="Meznah" last="Almutairy">Meznah Almutairy</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Torng, Eric" sort="Torng, Eric" uniqKey="Torng E" first="Eric" last="Torng">Eric Torng</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2017">2017</date>
<idno type="RBID">pubmed:28686614</idno>
<idno type="pmid">28686614</idno>
<idno type="doi">10.1371/journal.pone.0179046</idno>
<idno type="wicri:Area/PubMed/Corpus">000C31</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000C31</idno>
<idno type="wicri:Area/PubMed/Curation">000C31</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000C31</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.</title>
<author>
<name sortKey="Almutairy, Meznah" sort="Almutairy, Meznah" uniqKey="Almutairy M" first="Meznah" last="Almutairy">Meznah Almutairy</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Torng, Eric" sort="Torng, Eric" uniqKey="Torng E" first="Eric" last="Torng">Eric Torng</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PloS one</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2017" type="published">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Computational Biology (methods)</term>
<term>Databases, Genetic</term>
<term>Genome, Human (genetics)</term>
<term>Humans</term>
<term>Sequence Alignment (methods)</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences ()</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Bases de données génétiques</term>
<term>Biologie informatique ()</term>
<term>Génome humain (génétique)</term>
<term>Humains</term>
<term>Logiciel</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Genome, Human</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr">
<term>Génome humain</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>Sequence Alignment</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Databases, Genetic</term>
<term>Humans</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Analyse de séquence d'ADN</term>
<term>Bases de données génétiques</term>
<term>Biologie informatique</term>
<term>Humains</term>
<term>Logiciel</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">28686614</PMID>
<DateCompleted>
<Year>2017</Year>
<Month>10</Month>
<Day>04</Day>
</DateCompleted>
<DateRevised>
<Year>2019</Year>
<Month>02</Month>
<Day>08</Day>
</DateRevised>
<Article PubModel="Electronic-eCollection">
<Journal>
<ISSN IssnType="Electronic">1932-6203</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>12</Volume>
<Issue>7</Issue>
<PubDate>
<Year>2017</Year>
</PubDate>
</JournalIssue>
<Title>PloS one</Title>
<ISOAbbreviation>PLoS ONE</ISOAbbreviation>
</Journal>
<ArticleTitle>The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.</ArticleTitle>
<Pagination>
<MedlinePgn>e0179046</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1371/journal.pone.0179046</ELocationID>
<Abstract>
<AbstractText>One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Almutairy</LastName>
<ForeName>Meznah</ForeName>
<Initials>M</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Torng</LastName>
<ForeName>Eric</ForeName>
<Initials>E</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2017</Year>
<Month>07</Month>
<Day>07</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>PLoS One</MedlineTA>
<NlmUniqueID>101285081</NlmUniqueID>
<ISSNLinking>1932-6203</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="N">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D030541" MajorTopicYN="N">Databases, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015894" MajorTopicYN="N">Genome, Human</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016415" MajorTopicYN="N">Sequence Alignment</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2016</Year>
<Month>11</Month>
<Day>18</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2017</Year>
<Month>05</Month>
<Day>23</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2017</Year>
<Month>7</Month>
<Day>8</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2017</Year>
<Month>7</Month>
<Day>8</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2017</Year>
<Month>10</Month>
<Day>5</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">28686614</ArticleId>
<ArticleId IdType="doi">10.1371/journal.pone.0179046</ArticleId>
<ArticleId IdType="pii">PONE-D-16-45928</ArticleId>
<ArticleId IdType="pmc">PMC5501444</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Genome Res. 2002 Apr;12(4):656-64</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11932250</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444-8</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">3162770</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Comput Biol. 2004;11(4):734-52</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15579242</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2009 Sep;19(9):1646-54</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19592482</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Comput Biol. 2000 Feb-Apr;7(1-2):203-14</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10890397</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2008 Aug 15;24(16):1757-64</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18567917</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2015 Feb 15;31(4):509-14</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25399029</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2001 Oct;11(10):1725-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11591649</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Genet. 2009 Oct;41(10):1061-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19718026</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2001 Feb 15;409(6822):928-33</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11237013</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2004 Dec 12;20(18):3363-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15256412</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2005 May 1;21(9):1859-75</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15728110</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Methods. 2010 Aug;7(8):576-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20676076</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2012 Mar;40(6):e41</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22199254</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS Comput Biol. 2009 May;5(5):e1000386</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19461883</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Genomics. 2013;14 Suppl 1:S13</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23369189</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2013 Jun 07;14:184</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23758764</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2013 Mar 15;29(6):802-4</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23349213</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2001 May;11(5):863-74</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11337480</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2016 Jul 15;32(14 ):2103-10</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">27153593</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2012 Apr 19;13 Suppl 6:S1</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22537038</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS One. 2014 Oct 07;9(10):e109384</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25289699</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Jul 15;27(14):1915-21</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21586516</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Genet. 2000 Oct;26(2):233-6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11017085</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 1997 Sep 1;25(17):3389-402</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">9254694</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C31 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 000C31 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:28686614
   |texte=   The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:28686614" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021