MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome

Internal identifier: 001030 (Pmc/Curation); previous: 001029; next: 001031

Authors: Meznah Almutairy; Eric Torng

Source: PLoS ONE, vol. 12, no. 7, e0179046 (2017)

RBID : PMC:5501444

Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.
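
The difference between hard and soft sampling can be illustrated with a toy index. The sketch below is not the authors' implementation, nor BLAST's indexing scheme. It assumes a simple fixed-step sampler (the functions `build_sampled_index`, `find_seed_hits` and `stored_occurrences` are hypothetical names) that stores only the k-mers starting at every `step`-th database position and looks up every k-mer of the query. Under that assumption, any exact match of length at least k + step - 1 contains at least one stored k-mer. Choosing `step` to satisfy that bound for the shortest match of interest corresponds to hard sampling for exact matches; a larger `step` gives a smaller index that may miss some matches, in the spirit of soft sampling.

```python
from collections import defaultdict


def build_sampled_index(db, k, step):
    """Map each sampled k-mer to its start positions in db.

    Only k-mers starting at positions 0, step, 2*step, ... are stored
    (fixed-step sampling; an assumption made for this illustration).
    """
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, step):
        index[db[pos:pos + k]].append(pos)
    return index


def find_seed_hits(index, query, k):
    """Look up every k-mer of the query; return (query_pos, db_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, dpos))
    return hits


def stored_occurrences(index):
    """Total number of k-mer occurrences kept in the index."""
    return sum(len(positions) for positions in index.values())


if __name__ == "__main__":
    db = "ACGTTGCAAGGCTTAACCGGTTAACGATCGATCGGATTACA"  # toy database
    k = 4
    query = db[10:16]  # an exact match of length 6 = k + 3 - 1

    for step in (1, 3, 8):  # full index, hard for length >= 6, softer
        index = build_sampled_index(db, k, step)
        hits = find_seed_hits(index, query, k)
        print(step, stored_occurrences(index), len(hits))
```

With `step = 3` and `k = 4`, every exact match of length 6 or more is guaranteed a seed hit; with `step = 8` the index is smaller still, but that guarantee only holds for matches of length 11 or more.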


Url:
DOI: 10.1371/journal.pone.0179046
PubMed: 28686614
PubMed Central: 5501444

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The effects of sampling on the efficiency and accuracy of
<italic>k</italic>
-mer indexes: Theoretical and empirical comparisons using the human genome</title>
<author>
<name sortKey="Almutairy, Meznah" sort="Almutairy, Meznah" uniqKey="Almutairy M" first="Meznah" last="Almutairy">Meznah Almutairy</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Torng, Eric" sort="Torng, Eric" uniqKey="Torng E" first="Eric" last="Torng">Eric Torng</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28686614</idno>
<idno type="pmc">5501444</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5501444</idno>
<idno type="RBID">PMC:5501444</idno>
<idno type="doi">10.1371/journal.pone.0179046</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">001030</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001030</idno>
<idno type="wicri:Area/Pmc/Curation">001030</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">001030</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">The effects of sampling on the efficiency and accuracy of
<italic>k</italic>
-mer indexes: Theoretical and empirical comparisons using the human genome</title>
<author>
<name sortKey="Almutairy, Meznah" sort="Almutairy, Meznah" uniqKey="Almutairy M" first="Meznah" last="Almutairy">Meznah Almutairy</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Torng, Eric" sort="Torng, Eric" uniqKey="Torng E" first="Eric" last="Torng">Eric Torng</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a
<italic>k</italic>
-mer index such as BLAST. A big problem with
<italic>k</italic>
-mer indexes is the space required to store the lists of all occurrences of all
<italic>k</italic>
-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some
<italic>k</italic>
-mer occurrences are stored. Most previous work uses
<italic>hard sampling</italic>
, in which enough
<italic>k</italic>
-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study
<italic>soft sampling</italic>
, which further reduces the number of stored
<italic>k</italic>
-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.</p>
</div>
</front>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28686614</article-id>
<article-id pub-id-type="pmc">5501444</article-id>
<article-id pub-id-type="publisher-id">PONE-D-16-45928</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0179046</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Biological Databases</subject>
<subj-group>
<subject>Genomic Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Computational Biology</subject>
<subj-group>
<subject>Genome Analysis</subject>
<subj-group>
<subject>Genomic Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Genome Analysis</subject>
<subj-group>
<subject>Genomic Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Database and informatics methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence analysis</subject>
<subj-group>
<subject>BLAST algorithm</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>DNA</subject>
<subj-group>
<subject>Forms of DNA</subject>
<subj-group>
<subject>Complementary DNA</subject>
<subj-group>
<subject>Expressed Sequence Tags</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Nucleic acids</subject>
<subj-group>
<subject>DNA</subject>
<subj-group>
<subject>Forms of DNA</subject>
<subj-group>
<subject>Complementary DNA</subject>
<subj-group>
<subject>Expressed Sequence Tags</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Human Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Biological Databases</subject>
<subj-group>
<subject>Sequence Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
<subj-group>
<subject>Sequence Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
<subj-group>
<subject>Sequence Alignment</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Database Searching</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Mathematical and statistical techniques</subject>
<subj-group>
<subject>Statistical methods</subject>
<subj-group>
<subject>Monte Carlo method</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Statistics (mathematics)</subject>
<subj-group>
<subject>Statistical methods</subject>
<subj-group>
<subject>Monte Carlo method</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>The effects of sampling on the efficiency and accuracy of
<italic>k</italic>
-mer indexes: Theoretical and empirical comparisons using the human genome</article-title>
<alt-title alt-title-type="running-head">The Effects of sampling on the efficiency and accuracy of
<italic>k</italic>
-mer indexes</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Almutairy</surname>
<given-names>Meznah</given-names>
</name>
<xref ref-type="aff" rid="aff001"></xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Torng</surname>
<given-names>Eric</given-names>
</name>
<xref ref-type="aff" rid="aff001"></xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Kalendar</surname>
<given-names>Ruslan</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>University of Helsinki, FINLAND</addr-line>
</aff>
<author-notes>
<fn fn-type="COI-statement" id="coi001">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con">
<p>
<list list-type="simple">
<list-item>
<p>
<bold>Conceptualization:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Data curation:</bold>
MA.</p>
</list-item>
<list-item>
<p>
<bold>Formal analysis:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Investigation:</bold>
MA.</p>
</list-item>
<list-item>
<p>
<bold>Methodology:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Project administration:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Resources:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Software:</bold>
MA.</p>
</list-item>
<list-item>
<p>
<bold>Supervision:</bold>
ET.</p>
</list-item>
<list-item>
<p>
<bold>Validation:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Visualization:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Writing – original draft:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Writing – review &amp; editing:</bold>
MA ET.</p>
</list-item>
</list>
</p>
</fn>
<corresp id="cor001">* E-mail:
<email>almutai4@msu.edu</email>
(MA);
<email>torng@msu.edu</email>
(ET)</corresp>
</author-notes>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<pub-date pub-type="epub">
<day>7</day>
<month>7</month>
<year>2017</year>
</pub-date>
<volume>12</volume>
<issue>7</issue>
<elocation-id>e0179046</elocation-id>
<history>
<date date-type="received">
<day>18</day>
<month>11</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>5</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© 2017 Almutairy, Torng</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Almutairy, Torng</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="pone.0179046.pdf"></self-uri>
<abstract>
<p>One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a
<italic>k</italic>
-mer index such as BLAST. A big problem with
<italic>k</italic>
-mer indexes is the space required to store the lists of all occurrences of all
<italic>k</italic>
-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some
<italic>k</italic>
-mer occurrences are stored. Most previous work uses
<italic>hard sampling</italic>
, in which enough
<italic>k</italic>
-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study
<italic>soft sampling</italic>
, which further reduces the number of stored
<italic>k</italic>
-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.</p>
</abstract>
<funding-group>
<funding-statement>The authors received no specific funding for this work.</funding-statement>
</funding-group>
<counts>
<fig-count count="6"></fig-count>
<table-count count="6"></table-count>
<page-count count="23"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>The program and data used in this paper are publicly available at:
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/blast">https://www.ncbi.nlm.nih.gov/blast</ext-link>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast">ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast</ext-link>
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/dbEST">https://www.ncbi.nlm.nih.gov/dbEST</ext-link>
.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>The program and data used in this paper are publicly available at:
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/blast">https://www.ncbi.nlm.nih.gov/blast</ext-link>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast">ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast</ext-link>
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/dbEST">https://www.ncbi.nlm.nih.gov/dbEST</ext-link>
.</p>
</notes>
</front>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001030 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 001030 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:5501444
   |texte=   The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:28686614" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021