Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

The effects of sampling on the efficiency and accuracy of k−mer indexes: Theoretical and empirical comparisons using the human genome

Identifieur interne : 001030 ( Pmc/Corpus ); précédent : 001029; suivant : 001031

The effects of sampling on the efficiency and accuracy of k−mer indexes: Theoretical and empirical comparisons using the human genome

Auteurs : Meznah Almutairy ; Eric Torng

Source :

RBID : PMC:5501444

Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.


Url:
DOI: 10.1371/journal.pone.0179046
PubMed: 28686614
PubMed Central: 5501444

Links to Exploration step

PMC:5501444

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The effects of sampling on the efficiency and accuracy of
<italic>k</italic>
−mer indexes: Theoretical and empirical comparisons using the human genome</title>
<author>
<name sortKey="Almutairy, Meznah" sort="Almutairy, Meznah" uniqKey="Almutairy M" first="Meznah" last="Almutairy">Meznah Almutairy</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Torng, Eric" sort="Torng, Eric" uniqKey="Torng E" first="Eric" last="Torng">Eric Torng</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28686614</idno>
<idno type="pmc">5501444</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5501444</idno>
<idno type="RBID">PMC:5501444</idno>
<idno type="doi">10.1371/journal.pone.0179046</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">001030</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001030</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">The effects of sampling on the efficiency and accuracy of
<italic>k</italic>
−mer indexes: Theoretical and empirical comparisons using the human genome</title>
<author>
<name sortKey="Almutairy, Meznah" sort="Almutairy, Meznah" uniqKey="Almutairy M" first="Meznah" last="Almutairy">Meznah Almutairy</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Torng, Eric" sort="Torng, Eric" uniqKey="Torng E" first="Eric" last="Torng">Eric Torng</name>
<affiliation>
<nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a
<italic>k</italic>
-mer index such as BLAST. A big problem with
<italic>k</italic>
-mer indexes is the space required to store the lists of all occurrences of all
<italic>k</italic>
-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some
<italic>k</italic>
-mer occurrences are stored. Most previous work uses
<italic>hard sampling</italic>
, in which enough
<italic>k</italic>
-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study
<italic>soft sampling</italic>
, which further reduces the number of stored
<italic>k</italic>
-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, Wr" uniqKey="Pearson W">WR Pearson</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Sch Ffer, Aa" uniqKey="Sch Ffer A">AA Schäffer</name>
</author>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Schwartz, S" uniqKey="Schwartz S">S Schwartz</name>
</author>
<author>
<name sortKey="Wagner, L" uniqKey="Wagner L">L Wagner</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morgulis, A" uniqKey="Morgulis A">A Morgulis</name>
</author>
<author>
<name sortKey="Coulouris, G" uniqKey="Coulouris G">G Coulouris</name>
</author>
<author>
<name sortKey="Raytselis, Y" uniqKey="Raytselis Y">Y Raytselis</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Agarwala, R" uniqKey="Agarwala R">R Agarwala</name>
</author>
<author>
<name sortKey="Sch Ffer, Aa" uniqKey="Sch Ffer A">AA Schäffer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Irizarry, K" uniqKey="Irizarry K">K Irizarry</name>
</author>
<author>
<name sortKey="Kustanovich, V" uniqKey="Kustanovich V">V Kustanovich</name>
</author>
<author>
<name sortKey="Li, C" uniqKey="Li C">C Li</name>
</author>
<author>
<name sortKey="Brown, N" uniqKey="Brown N">N Brown</name>
</author>
<author>
<name sortKey="Nelson, S" uniqKey="Nelson S">S Nelson</name>
</author>
<author>
<name sortKey="Wong, W" uniqKey="Wong W">W Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sachidanandam, R" uniqKey="Sachidanandam R">R Sachidanandam</name>
</author>
<author>
<name sortKey="Weissman, D" uniqKey="Weissman D">D Weissman</name>
</author>
<author>
<name sortKey="Schmidt, Sc" uniqKey="Schmidt S">SC Schmidt</name>
</author>
<author>
<name sortKey="Kakol, Jm" uniqKey="Kakol J">JM Kakol</name>
</author>
<author>
<name sortKey="Stein, Ld" uniqKey="Stein L">LD Stein</name>
</author>
<author>
<name sortKey="Marth, G" uniqKey="Marth G">G Marth</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ng, Pc" uniqKey="Ng P">PC Ng</name>
</author>
<author>
<name sortKey="Henikoff, S" uniqKey="Henikoff S">S Henikoff</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, Wj" uniqKey="Kent W">WJ Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ning, Z" uniqKey="Ning Z">Z Ning</name>
</author>
<author>
<name sortKey="Cox, Aj" uniqKey="Cox A">AJ Cox</name>
</author>
<author>
<name sortKey="Mullikin, Jc" uniqKey="Mullikin J">JC Mullikin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, Td" uniqKey="Wu T">TD Wu</name>
</author>
<author>
<name sortKey="Watanabe, Ck" uniqKey="Watanabe C">CK Watanabe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wandelt, S" uniqKey="Wandelt S">S Wandelt</name>
</author>
<author>
<name sortKey="Leser, U" uniqKey="Leser U">U Leser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wandelt, S" uniqKey="Wandelt S">S Wandelt</name>
</author>
<author>
<name sortKey="Starlinger, J" uniqKey="Starlinger J">J Starlinger</name>
</author>
<author>
<name sortKey="Bux, M" uniqKey="Bux M">M Bux</name>
</author>
<author>
<name sortKey="Leser, U" uniqKey="Leser U">U Leser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Danek, A" uniqKey="Danek A">A Danek</name>
</author>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hatem, A" uniqKey="Hatem A">A Hatem</name>
</author>
<author>
<name sortKey="Bozda, D" uniqKey="Bozda D">D Bozdağ</name>
</author>
<author>
<name sortKey="Toland, Ae" uniqKey="Toland A">AE Toland</name>
</author>
<author>
<name sortKey="Catalyurek, Uv" uniqKey="Catalyurek U">ÜV Çatalyürek</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hach, F" uniqKey="Hach F">F Hach</name>
</author>
<author>
<name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Eichler, Ee" uniqKey="Eichler E">EE Eichler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Kidd, Jm" uniqKey="Kidd J">JM Kidd</name>
</author>
<author>
<name sortKey="Marques Bonet, T" uniqKey="Marques Bonet T">T Marques-Bonet</name>
</author>
<author>
<name sortKey="Aksay, G" uniqKey="Aksay G">G Aksay</name>
</author>
<author>
<name sortKey="Antonacci, F" uniqKey="Antonacci F">F Antonacci</name>
</author>
<author>
<name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rumble, Sm" uniqKey="Rumble S">SM Rumble</name>
</author>
<author>
<name sortKey="Lacroute, P" uniqKey="Lacroute P">P Lacroute</name>
</author>
<author>
<name sortKey="Dalca, Av" uniqKey="Dalca A">AV Dalca</name>
</author>
<author>
<name sortKey="Fiume, M" uniqKey="Fiume M">M Fiume</name>
</author>
<author>
<name sortKey="Sidow, A" uniqKey="Sidow A">A Sidow</name>
</author>
<author>
<name sortKey="Brudno, M" uniqKey="Brudno M">M Brudno</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ahmadi, A" uniqKey="Ahmadi A">A Ahmadi</name>
</author>
<author>
<name sortKey="Behm, A" uniqKey="Behm A">A Behm</name>
</author>
<author>
<name sortKey="Honnalli, N" uniqKey="Honnalli N">N Honnalli</name>
</author>
<author>
<name sortKey="Li, C" uniqKey="Li C">C Li</name>
</author>
<author>
<name sortKey="Weng, L" uniqKey="Weng L">L Weng</name>
</author>
<author>
<name sortKey="Xie, X" uniqKey="Xie X">X Xie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
<author>
<name sortKey="Hach, F" uniqKey="Hach F">F Hach</name>
</author>
<author>
<name sortKey="Sahinalp, Sc" uniqKey="Sahinalp S">SC Sahinalp</name>
</author>
<author>
<name sortKey="Eichler, Ee" uniqKey="Eichler E">EE Eichler</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weese, D" uniqKey="Weese D">D Weese</name>
</author>
<author>
<name sortKey="Emde, Ak" uniqKey="Emde A">AK Emde</name>
</author>
<author>
<name sortKey="Rausch, T" uniqKey="Rausch T">T Rausch</name>
</author>
<author>
<name sortKey="Doring, A" uniqKey="Doring A">A Döring</name>
</author>
<author>
<name sortKey="Reinert, K" uniqKey="Reinert K">K Reinert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Hayes, W" uniqKey="Hayes W">W Hayes</name>
</author>
<author>
<name sortKey="Hunt, Br" uniqKey="Hunt B">BR Hunt</name>
</author>
<author>
<name sortKey="Mount, Sm" uniqKey="Mount S">SM Mount</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Hunt, Br" uniqKey="Hunt B">BR Hunt</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
<author>
<name sortKey="Bolanos, Ra" uniqKey="Bolanos R">RA Bolanos</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C Ye</name>
</author>
<author>
<name sortKey="Ma, Zs" uniqKey="Ma Z">ZS Ma</name>
</author>
<author>
<name sortKey="Cannon, Ch" uniqKey="Cannon C">CH Cannon</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Douglas, Wy" uniqKey="Douglas W">WY Douglas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Limasset, A" uniqKey="Limasset A">A Limasset</name>
</author>
<author>
<name sortKey="Jackman, S" uniqKey="Jackman S">S Jackman</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abouelhoda, Mi" uniqKey="Abouelhoda M">MI Abouelhoda</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vyverman, M" uniqKey="Vyverman M">M Vyverman</name>
</author>
<author>
<name sortKey="De Baets, B" uniqKey="De Baets B">B De Baets</name>
</author>
<author>
<name sortKey="Fack, V" uniqKey="Fack V">V Fack</name>
</author>
<author>
<name sortKey="Dawyndt, P" uniqKey="Dawyndt P">P Dawyndt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Khiste, N" uniqKey="Khiste N">N Khiste</name>
</author>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xin, H" uniqKey="Xin H">H Xin</name>
</author>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
<author>
<name sortKey="Yedkar, S" uniqKey="Yedkar S">S Yedkar</name>
</author>
<author>
<name sortKey="Mutlu, O" uniqKey="Mutlu O">O Mutlu</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28686614</article-id>
<article-id pub-id-type="pmc">5501444</article-id>
<article-id pub-id-type="publisher-id">PONE-D-16-45928</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0179046</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Biological Databases</subject>
<subj-group>
<subject>Genomic Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Computational Biology</subject>
<subj-group>
<subject>Genome Analysis</subject>
<subj-group>
<subject>Genomic Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Genome Analysis</subject>
<subj-group>
<subject>Genomic Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Database and informatics methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence analysis</subject>
<subj-group>
<subject>BLAST algorithm</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>DNA</subject>
<subj-group>
<subject>Forms of DNA</subject>
<subj-group>
<subject>Complementary DNA</subject>
<subj-group>
<subject>Expressed Sequence Tags</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Nucleic acids</subject>
<subj-group>
<subject>DNA</subject>
<subj-group>
<subject>Forms of DNA</subject>
<subj-group>
<subject>Complementary DNA</subject>
<subj-group>
<subject>Expressed Sequence Tags</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
<subj-group>
<subject>Human Genomics</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Biological Databases</subject>
<subj-group>
<subject>Sequence Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
<subj-group>
<subject>Sequence Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
<subj-group>
<subject>Sequence Alignment</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Database Searching</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Mathematical and statistical techniques</subject>
<subj-group>
<subject>Statistical methods</subject>
<subj-group>
<subject>Monte Carlo method</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Physical sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Statistics (mathematics)</subject>
<subj-group>
<subject>Statistical methods</subject>
<subj-group>
<subject>Monte Carlo method</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>The effects of sampling on the efficiency and accuracy of
<italic>k</italic>
−mer indexes: Theoretical and empirical comparisons using the human genome</article-title>
<alt-title alt-title-type="running-head">The Effects of sampling on the efficiency and accuracy of
<italic>k</italic>
−mer indexes</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Almutairy</surname>
<given-names>Meznah</given-names>
</name>
<xref ref-type="aff" rid="aff001"></xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Torng</surname>
<given-names>Eric</given-names>
</name>
<xref ref-type="aff" rid="aff001"></xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Kalendar</surname>
<given-names>Ruslan</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>University of Helsinki, FINLAND</addr-line>
</aff>
<author-notes>
<fn fn-type="COI-statement" id="coi001">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con">
<p>
<list list-type="simple">
<list-item>
<p>
<bold>Conceptualization:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Data curation:</bold>
MA.</p>
</list-item>
<list-item>
<p>
<bold>Formal analysis:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Investigation:</bold>
MA.</p>
</list-item>
<list-item>
<p>
<bold>Methodology:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Project administration:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Resources:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Software:</bold>
MA.</p>
</list-item>
<list-item>
<p>
<bold>Supervision:</bold>
ET.</p>
</list-item>
<list-item>
<p>
<bold>Validation:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Visualization:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Writing – original draft:</bold>
MA ET.</p>
</list-item>
<list-item>
<p>
<bold>Writing – review & editing:</bold>
MA ET.</p>
</list-item>
</list>
</p>
</fn>
<corresp id="cor001">* E-mail:
<email>almutai4@msu.edu</email>
(MA);
<email>torng@msu.edu</email>
(ET)</corresp>
</author-notes>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<pub-date pub-type="epub">
<day>7</day>
<month>7</month>
<year>2017</year>
</pub-date>
<volume>12</volume>
<issue>7</issue>
<elocation-id>e0179046</elocation-id>
<history>
<date date-type="received">
<day>18</day>
<month>11</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>5</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© 2017 Almutairy, Torng</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Almutairy, Torng</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="pone.0179046.pdf"></self-uri>
<abstract>
<p>One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a
<italic>k</italic>
-mer index such as BLAST. A big problem with
<italic>k</italic>
-mer indexes is the space required to store the lists of all occurrences of all
<italic>k</italic>
-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some
<italic>k</italic>
-mer occurrences are stored. Most previous work uses
<italic>hard sampling</italic>
, in which enough
<italic>k</italic>
-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study
<italic>soft sampling</italic>
, which further reduces the number of stored
<italic>k</italic>
-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.</p>
</abstract>
<funding-group>
<funding-statement>The authors received no specific funding for this work.</funding-statement>
</funding-group>
<counts>
<fig-count count="6"></fig-count>
<table-count count="6"></table-count>
<page-count count="23"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>Program and Data in this paper are publicly available at:
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/blast">https://www.ncbi.nlm.nih.gov/blast</ext-link>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast">ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast</ext-link>
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/dbEST">https://www.ncbi.nlm.nih.gov/dbEST</ext-link>
.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>Program and Data in this paper are publicly available at:
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/blast">https://www.ncbi.nlm.nih.gov/blast</ext-link>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast">ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast</ext-link>
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/dbEST">https://www.ncbi.nlm.nih.gov/dbEST</ext-link>
.</p>
</notes>
</front>
<body>
<sec sec-type="intro" id="sec001">
<title>Introduction</title>
<p>We study the problem of trying to find the best sampling strategy to create simultaneously efficient and accurate
<italic>k</italic>
-mer indexes. These
<italic>k</italic>
-mer indexes have been widely used to accelerate the process of searching for all
<italic>highly similar local alignments</italic>
(HSLAs) between a query sequence and a database of sequences. This is a fundamental operation for a wide variety of biological applications including homologous search [
<xref rid="pone.0179046.ref001" ref-type="bibr">1</xref>
<xref rid="pone.0179046.ref004" ref-type="bibr">4</xref>
], detection of single nucleotide polymorphisms (SNP) [
<xref rid="pone.0179046.ref005" ref-type="bibr">5</xref>
<xref rid="pone.0179046.ref007" ref-type="bibr">7</xref>
], and mapping cDNA sequences against the corresponding genome [
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
<xref rid="pone.0179046.ref010" ref-type="bibr">10</xref>
]. We focus on finding HSLAs over nucleotide sequences where nucleotides are represented by A, C, G, and T. The HSLAs are commonly used in applications that compare sequences within the same species or closely related species, and we restrict our study’s database to the human genome.</p>
<p>One of the biggest problems with using
<italic>k</italic>
-mer indexes is that the size of the index is significantly larger than the underlying database. As biological databases/data sets rapidly increase in size, the size of the resulting
<italic>k</italic>
-mer indexes make using a
<italic>k</italic>
-mer index infeasible in many applications. Furthermore, query time increases rapidly as the database’s size and/or the number of queries increases. To ensure
<italic>k</italic>
-mer indexes remain viable, we must mitigate
<italic>k</italic>
-mer index size and query time.</p>
<p>One of the most effective and widely used ways of mitigating
<italic>k</italic>
-mer index size and query time is to perform sampling, in which we omit some
<italic>k</italic>
-mer occurrences from the index. In this paper, we study how best to sample a
<italic>k</italic>
-mer index to manage index size, query time, and accuracy where accuracy refers to finding all desired HSLAs. We evaluate a wide range of sampling rates that includes existing sampling practices as well as many new sampling rates. In particular, we study
<italic>soft sampling</italic>
, or sampling especially sparsely to further reduce index size and query time at the risk of missing some HSLAs. We show that using soft sampling, which has largely been ignored in previous studies, significantly reduces index size and computation times with very little loss in accuracy.</p>
<sec id="sec002">
<title>Application 1: Highly similar local alignments</title>
<p>We study
<italic>k</italic>
-mer indexes in the context of two motivating applications. The first and primary application is finding HLSAs between a query sequence
<italic>q</italic>
and a database of sequences
<italic>DB</italic>
. The second application is to map ESTs to a genome, first finding HSLAs between the ESTs and the genome. We now formally define the first application, finding HSLAs.</p>
<p>We start by defining a local alignment
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
) between two sequences
<italic>s</italic>
and
<italic>q</italic>
. For simplicity, we denote
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
) as just
<italic>A</italic>
.</p>
<p>
<bold>Definition 1 (Local alignment)</bold>
<italic>A local alignment</italic>
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
)
<italic>between any two sequences</italic>
<italic>s</italic>
<italic>and</italic>
<italic>q</italic>
<italic>is a triple</italic>
(
<italic>x</italic>
,
<italic>y</italic>
,
<italic>m</italic>
)
<italic>where</italic>
<italic>x</italic>
<italic>is a contiguous subsequence of</italic>
<italic>s</italic>
,
<italic>y</italic>
<italic>is a contiguous subsequence of</italic>
<italic>q</italic>
,
<italic>and</italic>
<italic>m</italic>
<italic>is an injective and monotonically increasing mapping from positions in</italic>
<italic>x</italic>
<italic>to positions in</italic>
<italic>y</italic>
.</p>
<p>Within an alignment
<italic>A</italic>
= (
<italic>x</italic>
,
<italic>y</italic>
,
<italic>m</italic>
), some positions in
<italic>x</italic>
may map to no positions in
<italic>y</italic>
and vice versa. Let
<italic>map</italic>
(
<italic>A</italic>
) denote the number of positions in
<italic>x</italic>
that map to positions in
<italic>y</italic>
, and let
<italic>match</italic>
(
<italic>A</italic>
) denote the number of mapped positions that are identical. We then define the length of
<italic>A</italic>
to be |
<italic>A</italic>
| = |
<italic>x</italic>
| + |
<italic>y</italic>
| −
<italic>map</italic>
(
<italic>A</italic>
), and we define the match percentage to be
<italic>mp</italic>
(
<italic>A</italic>
) =
<italic>match</italic>
(
<italic>A</italic>
)/|
<italic>A</italic>
|. Finally, we define
<italic>E</italic>
(
<italic>A</italic>
) = |
<italic>A</italic>
| −
<italic>match</italic>
(
<italic>A</italic>
) to be the number of errors in alignment
<italic>A</italic>
.</p>
<p>To illustrate these definitions, consider the example in
<xref ref-type="fig" rid="pone.0179046.g001">Fig 1</xref>
with two local alignments
<italic>A</italic>
1 and
<italic>A</italic>
2. We have
<italic>map</italic>
(
<italic>A</italic>
1) = 6,
<italic>match</italic>
(
<italic>A</italic>
1) = 6, |
<italic>A</italic>
1| = 7,
<italic>mp</italic>
(
<italic>A</italic>
1) = 85.7%, and
<italic>E</italic>
(
<italic>A</italic>
) = 1 whereas
<italic>map</italic>
(
<italic>A</italic>
2) = 11,
<italic>match</italic>
(
<italic>A</italic>
2) = 10, |
<italic>A</italic>
2| = 11,
<italic>mp</italic>
(
<italic>A</italic>
2) = 91%, and
<italic>E</italic>
(
<italic>A</italic>
) = 1.</p>
<fig id="pone.0179046.g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.g001</object-id>
<label>Fig 1</label>
<caption>
<title>Example database sequence
<italic>s</italic>
and query sequence
<italic>q</italic>
and two local alignments
<italic>A</italic>
<sub>1</sub>
and
<italic>A</italic>
<sub>2</sub>
, where (|) identifies to two mapped identical positions and (*) is an inserted gap position in one of the two sequences.</title>
</caption>
<graphic xlink:href="pone.0179046.g001"></graphic>
</fig>
<p>When searching for local alignments, our goal is to find all HSLAs that have a minimum length and match percentage. We formally define our targeted HSLAs, which we also refer to as true matches, as follows:</p>
<p>
<bold>Definition 2 (True match or HSLA)</bold>
<italic>For a database of sequences</italic>
<italic>DB</italic>
,
<italic>a query sequence</italic>
<italic>q</italic>
,
<italic>an alignment length threshold</italic>
<italic>l</italic>
,
<italic>and a match threshold</italic>
<italic>t</italic>
,
<italic>we define</italic>
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) = {
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
) ∣
<italic>s</italic>
<italic>DB</italic>
,|
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
)| ≥
<italic>l</italic>
<italic>and</italic>
<italic>mp</italic>
(
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
)) ≥
<italic>t</italic>
}.</p>
<p>We use
<italic>HSLA</italic>
(
<italic>q</italic>
,
<italic>t</italic>
) when
<italic>DB</italic>
and
<italic>l</italic>
are clear from context. We also define short HSLAs to represent HSLAs that are barely in
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) and are the hardest to find.</p>
<p>
<bold>Definition 3 (Short HSLA)</bold>
<italic>For a database of sequences</italic>
<italic>DB</italic>
,
<italic>a query sequence</italic>
<italic>q</italic>
,
<italic>an alignment length threshold</italic>
<italic>l</italic>
,
<italic>and a match threshold</italic>
<italic>t</italic>
,
<italic>we define</italic>
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) = {
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
) ∣
<italic>s</italic>
<italic>DB</italic>
,
<italic>l</italic>
≤ |
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
)| ≤ (2 −
<italic>t</italic>
)
<italic>l</italic>
<italic>and</italic>
<italic>mp</italic>
(
<italic>A</italic>
(
<italic>s</italic>
,
<italic>q</italic>
)) ≥
<italic>t</italic>
}.</p>
<p>For example,
<italic>HSLA</italic>
({
<italic>s</italic>
},
<italic>q</italic>
, 6, 85%) = {
<italic>A</italic>
1,
<italic>A</italic>
2} whereas
<italic>HSLA</italic>
({
<italic>s</italic>
},
<italic>q</italic>
, 11, 85%) = {
<italic>A</italic>
2};
<italic>A</italic>
1 is omitted because it does not meet the length threshold of 11. Likewise,
<italic>HSLA</italic>
({
<italic>s</italic>
},
<italic>q</italic>
, 6, 90%) = {
<italic>A</italic>
2};
<italic>A</italic>
1 is dropped because it does not meet the match percentage threshold. Focusing on short HSLAs,
<italic>HSLA</italic>
<sub>short</sub>
({
<italic>s</italic>
},
<italic>q</italic>
, 6, 85%) = {
<italic>A</italic>
1}.
<italic>A</italic>
2 is dropped because it is too long. Note that
<italic>HSLA</italic>
({
<italic>s</italic>
},
<italic>q</italic>
, 6, 90%) actually includes several alignments that overlap significantly with
<italic>A</italic>
1; we follow standard practice and only include the longest alignment with highest match percentage from any group of highly overlapping alignments in
<italic>HSLA</italic>
(
<italic>s</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
).</p>
</sec>
<sec id="sec003">
<title>Finding HSLAs using indexed BLAST</title>
<p>We now describe how indexed BLAST [
<xref rid="pone.0179046.ref004" ref-type="bibr">4</xref>
] is typically used to find HSLAs in
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). Specifically, indexed BLAST uses a seed-and-extend search process where we have one
<italic>seed</italic>
phase and two
<italic>extension</italic>
phases. In the seed phase, for a given
<italic>k</italic>
value
<italic>k</italic>
′, indexed BLAST uses a
<italic>k</italic>
′-mer index to find shared
<italic>k</italic>
′-mers, where a shared
<italic>k</italic>
′-mer is a substring formed by
<italic>k</italic>
′ consecutive letters that appear in both a database sequence
<italic>s</italic>
<italic>DB</italic>
and in the query
<italic>q</italic>
. More specifically, indexed BLAST identifies the locations or occurrences of these shared
<italic>k</italic>
′-mers. Once shared
<italic>k</italic>
′-mer occurrences are found, BLAST performs the first extension phase. In this phase, each occurrence is extended in both directions to find a maximal exact match (MEM), which is an exact match that cannot be extended in either direction without introducing mismatches. If a found MEM has length at least some threshold
<italic>k</italic>
* (defined below), BLAST performs the second extension phase where it tries to extend the MEM into an HSLA. BLAST’s extension process in this second phase is slightly more complex than the process from its first phase since BLAST must allow some mismatches and gaps in this second phase.</p>
<p>To illustrate this process, consider our example from
<xref ref-type="fig" rid="pone.0179046.g001">Fig 1</xref>
and suppose we use BLAST to search for
<italic>HSLA</italic>
({
<italic>s</italic>
},
<italic>q</italic>
, 11, 90%) with
<italic>k</italic>
′ = 4 and
<italic>k</italic>
* = 5. Suppose the seed phase returns the shared 4-mers AACG, TTTT, and TGCG. When BLAST performs the first extension phase, it would find the MEMs AACGA, TTTTT, and TGCGT. Since
<italic>k</italic>
* = 5, BLAST would then try to extend the three MEMs to HSLAs. The latter two would extend to
<italic>A</italic>
2 whereas the MEM AACGA cannot be extended into an HSLA.</p>
<p>We now describe BLAST’s first two phases in more detail starting with the seed phase. BLAST constructs a
<italic>k</italic>
′-mer index as follows. The
<italic>k</italic>
′-mer index saves a list of database
<italic>k</italic>
′-mers in a lookup table of all possible
<italic>k</italic>
′-mers, which is 4
<sup>
<italic>k</italic>
</sup>
entries. We refer to this lookup table as a dictionary. For each
<italic>k</italic>
′-mer in the dictionary, BLAST saves some of its occurrences in an inverted list (also known as an offset list). A
<italic>k</italic>
′-mer occurrence is an ordered pair (
<italic>s</italic>
,
<italic>i</italic>
) where
<italic>s</italic>
is the string containing this occurrence and
<italic>i</italic>
is the position of the last character in this occurrence.</p>
<p>BLAST then finds shared
<italic>k</italic>
′-mers as follows. BLAST extracts all
<italic>k</italic>
′-mers from query sequence
<italic>q</italic>
. BLAST then searches for each extracted
<italic>k</italic>
′-mer in the dictionary. If the extracted
<italic>k</italic>
′-mer is in the dictionary, it represents a shared
<italic>k</italic>
′-mer for
<italic>q</italic>
and some
<italic>s</italic>
<italic>DB</italic>
. BLAST uses that
<italic>k</italic>
′-mer’s inverted list to find occurrences of that
<italic>k</italic>
′-mer in
<italic>DB</italic>
.</p>
<p>A key choice is what value of
<italic>k</italic>
′ should be used. Typically,
<italic>k</italic>
′ is chosen to be at most 16 so that the list of all possible
<italic>k</italic>
′-mers, which has 4
<sup>
<italic>k</italic>
</sup>
entries, can be stored as an array in RAM. Since we use BLAST to perform our experiments, we use BLAST’s default value of
<italic>k</italic>
′ = 12.</p>
<p>We next describe the first extension phase where BLAST searches for MEM
<sub>k</sub>
*s, which are MEMs of length at least
<italic>k</italic>
*. The extension itself is straightforward since mismatches and gaps are not allowed. The key issue for this phase is what
<italic>k</italic>
* should be. We want
<italic>k</italic>
* to be as large as possible to reduce the number of false positives, which are MEM
<sub>k</sub>
*s that cannot be extended into HSLAs. It is well known how to compute
<italic>k</italic>
* given a target alignment length
<italic>L</italic>
and a maximum number of errors
<italic>E</italic>
[
<xref rid="pone.0179046.ref011" ref-type="bibr">11</xref>
<xref rid="pone.0179046.ref013" ref-type="bibr">13</xref>
]. Specifically,
<italic>k</italic>
* = ⌊
<italic>L</italic>
/(1 +
<italic>E</italic>
)⌋. The basic idea is that the worst case is when the errors are evenly spaced. The question then is what value of
<italic>L</italic>
and
<italic>E</italic>
should be used. The hardest HSLAs to find are the short HSLAs defined in Definition 3; basically those of length exactly
<italic>l</italic>
and match percentage
<italic>t</italic>
. Thus, we use
<italic>L</italic>
=
<italic>l</italic>
and
<italic>E</italic>
= ⌊(1 −
<italic>t</italic>
)
<italic>l</italic>
⌋, which leads to
<italic>k</italic>
* =
<italic>l</italic>
/(1 + ⌊(1 −
<italic>t</italic>
)
<italic>l</italic>
⌋).</p>
</sec>
<sec id="sec004">
<title>Hard versus Soft sampling</title>
<p>The fundamental issue with using
<italic>k</italic>
′-mer indexes to search for HSLAs is that the
<italic>k</italic>
′-mer index can be very large. Most systems including BLAST control dictionary size by limiting
<italic>k</italic>
′ to a small value such as 12. With this choice of
<italic>k</italic>
′, the problem is that there are too many
<italic>k</italic>
′-mer occurrences because the total number of
<italic>k</italic>
′-mer occurrences is roughly the total length of all the sequences in the database
<italic>DB</italic>
. The human genome is roughly 3 billion base pairs, so this would mean roughly 3 billion
<italic>k</italic>
′-mer occurrences.</p>
<p>For this reason,
<italic>k</italic>
′-mer indexes are typically sampled where we only save some
<italic>k</italic>
′-mer occurrences rather than all of them. We focus on
<italic>fixed sampling</italic>
where for a given sampling step
<italic>w</italic>
≥ 1, a
<italic>k</italic>
-mer that occurs at every
<italic>w</italic>
th position is saved. We distinguish between two types of fixed sampling:
<italic>hard sampling</italic>
[
<xref rid="pone.0179046.ref004" ref-type="bibr">4</xref>
,
<xref rid="pone.0179046.ref009" ref-type="bibr">9</xref>
] and
<italic>soft sampling</italic>
[
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
].</p>
<p>In
<bold>hard sampling</bold>
, we choose
<italic>w</italic>
<italic>k</italic>
* −
<italic>k</italic>
′ + 1 so that we are guaranteed to find a
<italic>k</italic>
′-mer within every MEM
<sub>k</sub>
*. Thus, when we apply the first extension step, we will find the resulting MEM
<sub>k</sub>
*. Since we find all MEM
<sub>k</sub>
*s after the first extension step, we are guaranteed to find all HSLAs after the second extension step. Without loss of generality, for hard sampling, we assume
<italic>w</italic>
=
<italic>k</italic>
* −
<italic>k</italic>
′ + 1 since this maximizes the space savings with no loss in accuracy. We refer to this
<italic>w</italic>
value as
<italic>w</italic>
<sub>0</sub>
(
<italic>w</italic>
<sub>0</sub>
=
<italic>k</italic>
* −
<italic>k</italic>
′ + 1).</p>
<p>In
<bold>soft sampling</bold>
, we consider
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
. Because we no longer are guaranteed to choose a
<italic>k</italic>
′-mer from every MEM
<sub>k</sub>
*, when we apply the first extension phase, we may miss some MEM
<sub>k</sub>
*s, which may lead to missing some HSLAs in the next extension phase. Thus, if we use soft sampling, we risk missing some HSLAs.</p>
</sec>
<sec id="sec005">
<title>Retention rates and false positives</title>
<p>Recall our goal is to find
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). We denote the HSLAs and short HSLAs found by using indexed BLAST with parameter values
<italic>k</italic>
′ and
<italic>w</italic>
to be
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
) and
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
), respectively. With hard sampling (
<italic>w</italic>
=
<italic>w</italic>
<sub>0</sub>
), we know
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
<sub>0</sub>
) =
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). With soft sampling (
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
),
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) −
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
) ≠ ∅ is possible. We define the retention rate of HSLAs as a function of
<italic>w</italic>
as follows.</p>
<p>
<bold>Definition 4 (Retention Rate)</bold>
<italic>For a</italic>
<italic>k</italic>
′-
<italic>mer index with a sampling step</italic>
<italic>w</italic>
,
<italic>the retention rate for</italic>
<italic>HSLA</italic>
<italic>is</italic>
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) = |
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
)|/|
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
)|,
<italic>and the retention rate for</italic>
<italic>HSLA</italic>
<sub>short</sub>
<italic>is</italic>
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) = |
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
)|/|
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
)|.</p>
<p>We typically express these ratios as percentages. We will study how
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) change as a function of
<italic>w</italic>
. Because short HSLAs are the hardest true matches to find, we expect
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) >
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) in most cases.</p>
<p>We present a new analytical model to compute the expected retention rate of HSLAs in
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). The new model is an extension to Kent’s analytical model [
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
] where he essentially assumed
<italic>w</italic>
=
<italic>k</italic>
′ =
<italic>k</italic>
*. On the other hand, we propose a new model where we assume
<italic>k</italic>
′ <
<italic>k</italic>
* and
<italic>w</italic>
≥ 1. We refer to the new model as the BLAST model since it accounts for typical parameters used in BLAST searches.</p>
<p>Searching with sampled
<italic>k</italic>
′-mer index produces two intermediate results: shared
<italic>k</italic>
′-mers and MEM
<sub>k</sub>
*s. The second extension process, extending MEM
<sub>k</sub>
*s into HSLAs, is more complex and costly than the first extension process since we are allowing some mismatches and gaps. We thus define MEM
<sub>k</sub>
*s that do not extend into HSLAs to be
<bold>false positives</bold>
. We will also study how the number of false positives changes as a function of
<italic>w</italic>
.</p>
</sec>
<sec id="sec006">
<title>Application 2: Using
<italic>k</italic>
-mer indexes in EST mapping</title>
<p>Our second motivating application, which builds upon the first, is mapping ESTs on a genome, a fundamental procedure in genome research. These mappings are used to discover the intron-exon structure of genes, SNPs, and cDNA insertions and deletions, to name just a few applications. Many different mapping tools are available, each with their own advantages [
<xref rid="pone.0179046.ref014" ref-type="bibr">14</xref>
]. We focus on hash table–based, seed-and-extend mappers such as mrFAST/mrsFAST [
<xref rid="pone.0179046.ref015" ref-type="bibr">15</xref>
,
<xref rid="pone.0179046.ref016" ref-type="bibr">16</xref>
], SHRiMP [
<xref rid="pone.0179046.ref017" ref-type="bibr">17</xref>
], Hobbes [
<xref rid="pone.0179046.ref018" ref-type="bibr">18</xref>
], drFAST [
<xref rid="pone.0179046.ref019" ref-type="bibr">19</xref>
], and RazerS [
<xref rid="pone.0179046.ref020" ref-type="bibr">20</xref>
]. These mappers are typically fully sensitive mappers that “can detect reads missed by other tools” [
<xref rid="pone.0179046.ref014" ref-type="bibr">14</xref>
] but may be relatively slow.</p>
<p>We study whether soft sampling
<italic>k</italic>
-mer indexes might increase the speed of these mappers with relatively little loss in sensitivity when working with the human genome as our database. These methods work in two stages. First, they find the set of all HSLAs between an EST and a genome. Then they map the EST to the genome by selecting and linking these HSLAs. The mappers usually differ in how to modify, evaluate, and use the resulting HSLAs to assess the final mapping process.
<xref ref-type="fig" rid="pone.0179046.g002">Fig 2</xref>
illustrates the mapping procedure.</p>
<fig id="pone.0179046.g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.g002</object-id>
<label>Fig 2</label>
<caption>
<title>Illustration of EST mapping process.</title>
<p>The
<italic>HSLA</italic>
s (
<italic>A</italic>
,
<italic>A</italic>
*), (
<italic>B</italic>
,
<italic>B</italic>
*), (
<italic>C</italic>
,
<italic>C</italic>
*), and (
<italic>D</italic>
,
<italic>D</italic>
*) are used to report the final mapping.</p>
</caption>
<graphic xlink:href="pone.0179046.g002"></graphic>
</fig>
<p>In this paper, we assess the effectiveness of soft sampling in mapping human ESTs on a human genome. Specifically, we assess whether the correct mapping is retained when we use soft-sampled
<italic>k</italic>
′-mer indexes to complete the first stage of finding HSLAs. We measure the effect of sampling on both the index size and the query time. We only simulate the mapping process because we want our results to be general and independent from the details of the final mapping process of a mapper. We hope our findings encourage more developers to allow the use of a wider range of
<italic>k</italic>
′ and
<italic>w</italic>
values in their mappers.</p>
</sec>
<sec id="sec007">
<title>Related work</title>
<p>Most previous studies of sampled
<italic>k</italic>
′-mer indexes have focused on hard sampling with limited study of soft sampling and thus have not studied the effect of choosing a large value of
<italic>w</italic>
on retention rate, query time, or false positive rate. For example, Morgulis
<italic>et al</italic>
. [
<xref rid="pone.0179046.ref004" ref-type="bibr">4</xref>
] built Indexed BLAST, which uses
<italic>w</italic>
=
<italic>k</italic>
* −
<italic>k</italic>
′ + 1 and supports
<italic>k</italic>
′ values up to 15. Ning
<italic>et al</italic>
. [
<xref rid="pone.0179046.ref009" ref-type="bibr">9</xref>
] built the index SSAHA with
<italic>k</italic>
′ = 1/2(
<italic>k</italic>
* + 1) and
<italic>w</italic>
=
<italic>k</italic>
′.</p>
<p>Kent [
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
] has performed the main previous study of soft sampling. Instead of selecting
<italic>k</italic>
* as defined above, Kent developed an analytical model for estimating the likelihood of retaining matches and creating false positives for a variety of indexed search strategies. These include searching with one
<italic>k</italic>
′-mer, two nearby small
<italic>k</italic>
′-mers, and one large
<italic>k</italic>
′-mer with one allowed error. In all cases, he built a soft sampled
<italic>k</italic>
′-mer index where
<italic>w</italic>
=
<italic>k</italic>
* =
<italic>k</italic>
′. Kent computed the best choice of
<italic>k</italic>
′ such that the expected accuracy to find all HSLAs was above a given threshold and the number of shared
<italic>k</italic>
′-mers that did not lead to HSLAs was as small as possible.</p>
<p>Kent’s work differs from ours in several key ways. First, we consider only
<italic>k</italic>
′ = 12 so that we can use BLAST to perform our searches, whereas Kent considered multiple
<italic>k</italic>
′ values. Second, we consider a wide range of
<italic>w</italic>
values, whereas Kent only considered
<italic>w</italic>
=
<italic>k</italic>
′. Thus, Kent’s work does not allow a true study of the effect of
<italic>w</italic>
on index performance since, in his work,
<italic>k</italic>
′ is always changing in addition to
<italic>w</italic>
.</p>
<p>We extend Kent’s analytical model to work with our choices of
<italic>k</italic>
′ <
<italic>k</italic>
* and
<italic>w</italic>
and we call this new model BLAST model. We compare our empirical accuracy with both Kent’s original model and BLAST model predicted accuracies. Our results show that BLAST model is reasonable accurate in predicting HSLAs retention rate. On the other hand, Kent’s model significantly underestimates the retention rate in our experiments with the human genome. This is expected since Kent’s model is not designed to handle the case when
<italic>k</italic>
′ <
<italic>k</italic>
*.</p>
<p>We focus on fixed sampling where we take every
<italic>w</italic>
th
<italic>k</italic>
′-mer occurrence; this is the sampling option supported by NCBI BLAST. An alternative sampling technique is minimizer sampling where we choose a “minimum”
<italic>k</italic>
′-mer within a given window [
<xref rid="pone.0179046.ref021" ref-type="bibr">21</xref>
<xref rid="pone.0179046.ref026" ref-type="bibr">26</xref>
]. More specifically, for a window of
<italic>w</italic>
adjacent
<italic>k</italic>
′-mers, the
<italic>k</italic>
′-mer that is alphabetically minimum is selected. The next window then starts one position to right from the previous window.</p>
<p>We focus on fixed sampling for four main reasons. First, it is supported by NCBI BLAST, whereas minimizer sampling is not. Second, constructing a sampled index using fixed sampling is much simpler computationally. Third, fixed sampling reduces the
<italic>k</italic>
′-mer index size more than minimizer sampling does. Finally, our results indicate that fixed soft sampling achieves good accuracy in finding true matches, so the added complexity of minimizer sampling seems unnecessary when searching for HSLAs.</p>
<p>Our use of
<italic>k</italic>
′-mers fits within the seed-and-extend searching method. In general, seed-and-extend searching uses two types of seeds: continuous and spaced. We focus on continuous seeds, which are equivalent to
<italic>k</italic>
-mer seeds. Spaced seeds, seeds that allow mismatches in some predefined positions, increase sensitivity, usually at the cost of greater complexity. We focus on continuous seeds because they minimize the number of false positives without compromising retention rate when searching for HSLAs.</p>
<p>We use
<italic>k</italic>
′-mer indexes to search for MEM
<sub>k</sub>
*s. We could use a suffix-tree or a suffix-array index to search for MEM
<sub>k</sub>
*s instead. But we focus on
<italic>k</italic>
′-mer indexes because they require less memory than these other indexes, despite the development of many new compressed and sparse suffix array data structures [
<xref rid="pone.0179046.ref027" ref-type="bibr">27</xref>
,
<xref rid="pone.0179046.ref028" ref-type="bibr">28</xref>
]. In particular, a recent paper [
<xref rid="pone.0179046.ref029" ref-type="bibr">29</xref>
] shows that efficient implementations of a
<italic>k</italic>
′-mer indexes can find MEM
<sub>k</sub>
*s more efficiently than compressed sparse suffix arrays, especially when a very large database such as the human genome is used.</p>
<p>Recently, a complementary approach of space-efficient referentially compressed search indexes has been proposed to support similarity searches on genome data sets [
<xref rid="pone.0179046.ref011" ref-type="bibr">11</xref>
,
<xref rid="pone.0179046.ref012" ref-type="bibr">12</xref>
]. In this method, genomes are compressed against some reference genome(s). Given a query, the index then searches two parts: the reference and all genome-specific individual differences. Both parts are saved in compressed suffix trees. Danek
<italic>et al</italic>
. [
<xref rid="pone.0179046.ref013" ref-type="bibr">13</xref>
] extend reference-based compression with the use of a
<italic>k</italic>
′-mer index. We think employing the complementary approach of reference compression in unison with sampled
<italic>k</italic>
′-mer indexes may be fruitful.</p>
<p>Finally, with respect to mapping ESTs on the human genome, Xin
<italic>et al</italic>
. [
<xref rid="pone.0179046.ref030" ref-type="bibr">30</xref>
] proposed two general techniques to accelerate
<italic>k</italic>
′-mer based mappers. The first technique is to use the set of adjacent
<italic>k</italic>
′-mers as supporting evidence for the existence of a true match. The second is to use shared infrequent
<italic>k</italic>
′-mers to select the best mapping location. Similar to other studies, they only used
<italic>w</italic>
=
<italic>k</italic>
while evaluating these techniques. In contrast, we test a broader range of
<italic>w</italic>
values and demonstrate that using a larger
<italic>w</italic>
greatly reduces query time and index size while suffering only a small loss of sensitivity.</p>
</sec>
<sec id="sec008">
<title>Problem statement and overall aims</title>
<p>We study the effects of using BLAST with soft sampling when searching for HSLAs and mapping ESTs onto the human genome. Our work is unique in that there is little prior work that has considered soft sampling, and the little work that has considered it has not systematically studied how
<italic>w</italic>
affects accuracy. We specifically study the effect of sampling parameter
<italic>w</italic>
on the size, accuracy, and query time of the
<italic>k</italic>
′-mer index. We also extend previous analytical models to work with our chosen parameters of
<italic>k</italic>
′ = 12 and
<italic>w</italic>
, and then compare our empirical results with predictions from both the original and the extended analytical models. We summarize our major contributions.</p>
<list list-type="order">
<list-item>
<p>We systematically assess how well BLAST can find
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) when using soft sampling where
<italic>w</italic>
>
<italic>k</italic>
* −
<italic>k</italic>
′ + 1 when working with the human genome as our database. In particular, we study the retention rates
<italic>RR</italic>
(
<italic>w</italic>
) and
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
). We show that both
<italic>RR</italic>
(
<italic>w</italic>
) and
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
) are high, even for large choices of
<italic>w</italic>
. Furthermore, the false positive rate, in the form of MEM
<sub>k</sub>
*s that do not extend into
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
), is significantly reduced, leading to a corresponding significant reduction in query time. This demonstrates that soft sampling is a simple but effective method to increase index efficiency with surprisingly little loss in accuracy.</p>
</list-item>
<list-item>
<p>We design a new analytical model that we call the BLAST model by extending previously developed analytical models to work with our values of
<italic>k</italic>
′ <
<italic>k</italic>
* and
<italic>w</italic>
. We compare the theoretical predictions from our new BLAST model and old models with our empirical results. We show that the new model is reasonably accurate whereas other analytical models are not accurate in our context. We also highlight some possible shortcomings of our new BLAST model.</p>
</list-item>
<list-item>
<p>Finally, we study the effects of using soft sampling for the problem of mapping human ESTs against the human genome. We conservatively simulate the process because either existing mapping tools do not support soft sampling or do not allow us to replace the first phase aligner. We show that we are able to map more than 98% of the query ESTs perfectly while reducing index size by 3-5 times and query time by 23.3% when compared to hard sampling.</p>
</list-item>
</list>
</sec>
</sec>
<sec sec-type="materials|methods" id="sec009">
<title>Materials and methods</title>
<p>We evaluate the effect of soft sampling on using BLAST to (i) find HSLAs and (ii) map ESTs to the human genome. For both applications, we describe our database and how we create sampled indexes. We then describe our query sets and how we perform queries. We next describe our evaluation metrics. Finally, we describe how we extend Kent’s analytical model to work with our choices of
<italic>k</italic>
′ = 12 and
<italic>w</italic>
.</p>
<sec id="sec010">
<title>Experimental settings</title>
<sec id="sec011">
<title>Database</title>
<p>For both applications of finding HSLAs and EST mapping, we use the human genome database provided by Morgulis
<italic>et al</italic>
. from their MegaBLAST paper [
<xref rid="pone.0179046.ref004" ref-type="bibr">4</xref>
] as our database. Morgulis
<italic>et al</italic>
. note that the human genome database was the most frequently searched database in NCBI in 2007 with 10,000 submitted queries per weekday. They partitioned the human genome database into volumes, each volume is roughly 1 GB in size and available at (
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/fasta/human">ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/fasta/human</ext-link>
). We summarize key characteristics of each volume in
<xref rid="pone.0179046.t001" ref-type="table">Table 1</xref>
. We did experiments with both masked and unmasked data but report results only for the unmasked data since the results were similar. As in [
<xref rid="pone.0179046.ref004" ref-type="bibr">4</xref>
], we treat each volume as a separate database. That is, we create an index for each volume separately and search each volume’s index separately. To obtain results for the human database, we then simply union the results found for each volume.</p>
<table-wrap id="pone.0179046.t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.t001</object-id>
<label>Table 1</label>
<caption>
<title>Human genome volume characteristics.</title>
</caption>
<alternatives>
<graphic id="pone.0179046.t001g" xlink:href="pone.0179046.t001"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" style="border-bottom:thick;border-right:thick" rowspan="1" colspan="1">Name</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">Size(Mbytes)</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">Size(bp)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">Chr. 1-5, unmasked</td>
<td align="char" char="." rowspan="1" colspan="1">1039.86</td>
<td align="left" rowspan="1" colspan="1">1,025,201,451</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">Chr. 6-13, unmasked</td>
<td align="char" char="." rowspan="1" colspan="1">1093.27</td>
<td align="left" rowspan="1" colspan="1">1,077,856,590</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">Chr. 14-Y, unmasked</td>
<td align="char" char="." rowspan="1" colspan="1">778.75</td>
<td align="left" rowspan="1" colspan="1">767,769,314</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">Chr. 1-8, masked</td>
<td align="char" char="." rowspan="1" colspan="1">1517.93</td>
<td align="left" rowspan="1" colspan="1">1,493,033,824</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">Chr. 9-Y, masked</td>
<td align="char" char="." rowspan="1" colspan="1">1400.78</td>
<td align="left" rowspan="1" colspan="1">1,377,793,531</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="t001fn001">
<p>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/fasta/human">ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/fasta/human</ext-link>
.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="sec012">
<title>Sampled index construction</title>
<p>For finding HSLAs, we use four different minimum alignment lengths
<italic>l</italic>
: 50, 100, 200, and 400 and a match threshold
<italic>t</italic>
= 96% or
<italic>t</italic>
= 97%. For each of our four choices of
<italic>l</italic>
, we use
<italic>k</italic>
* =
<italic>l</italic>
/(1 + ⌊(1 −
<italic>t</italic>
)
<italic>l</italic>
⌋). For mapping ESTs, similar to Kent’s design of BLAT [
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
], we use the same choices except we omit
<italic>l</italic>
= 400. Specifically, Kent used
<italic>l</italic>
= 100; we also include
<italic>l</italic>
= 50 and
<italic>l</italic>
= 200 to study EST mapping under a wider set of possible choices. We use a geometric progression with base
<inline-formula id="pone.0179046.e001">
<alternatives>
<graphic xlink:href="pone.0179046.e001.jpg" id="pone.0179046.e001g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M1">
<mml:msqrt>
<mml:mn>2</mml:mn>
</mml:msqrt>
</mml:math>
</alternatives>
</inline-formula>
to choose
<italic>w</italic>
values for soft sampling indexes. Specifically, we consider
<inline-formula id="pone.0179046.e002">
<alternatives>
<graphic xlink:href="pone.0179046.e002.jpg" id="pone.0179046.e002g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M2">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:msqrt>
<mml:mn>2</mml:mn>
</mml:msqrt>
<mml:mi>i</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
. For each
<italic>l</italic>
, we ignore
<italic>w</italic>
less than
<italic>w</italic>
<sub>0</sub>
=
<italic>k</italic>
* −
<italic>k</italic>
′ + 1 since
<italic>w</italic>
<sub>0</sub>
is the largest hard sampling value. Likewise, we ignore
<italic>w</italic>
<italic>l</italic>
as these can completely skip over a potential alignment of length
<italic>l</italic>
. This results in a total of eleven choices ranging from
<italic>w</italic>
= 8 to
<italic>w</italic>
= 256. Combined with four choices of
<italic>w</italic>
<sub>0</sub>
for hard sampling and three volumes, we create a total of 15 × 3 = 45 sampled indexes.</p>
<p>We use
<italic>SI</italic>
(
<italic>w</italic>
) to denote a sampled index created with sampling parameter
<italic>w</italic>
; note
<italic>SI</italic>
(
<italic>w</italic>
<sub>0</sub>
) denotes a hard sampling index. These choices are summarized in
<xref rid="pone.0179046.t002" ref-type="table">Table 2</xref>
. Note some sampled indexes are used with multiple
<italic>l</italic>
values. For example, the sampled indexes
<italic>SI</italic>
(22) and
<italic>SI</italic>
(32) are used for each choice of
<italic>l</italic>
.</p>
<table-wrap id="pone.0179046.t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.t002</object-id>
<label>Table 2</label>
<caption>
<title>A summary of the parameters used in our experiments for (1) finding HSLAs and (2) EST mappings.</title>
<p>For HSLA, we consider all four choices of
<italic>l</italic>
. For EST, we only consider the first three choices of
<italic>l</italic>
.</p>
</caption>
<alternatives>
<graphic id="pone.0179046.t002g" xlink:href="pone.0179046.t002"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" colspan="6" rowspan="1">Sampling parameters</th>
</tr>
<tr>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>l</italic>
</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>t</italic>
</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>k</italic>
*</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>k</italic>
</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>w</italic>
<sub>0</sub>
</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">50</td>
<td align="left" rowspan="1" colspan="1">96%</td>
<td align="left" rowspan="1" colspan="1">16</td>
<td align="left" rowspan="1" colspan="1">12</td>
<td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">8, 11, 16, 22, 32</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">97%</td>
<td align="left" rowspan="1" colspan="1">25</td>
<td align="left" rowspan="1" colspan="1">12</td>
<td align="left" rowspan="1" colspan="1">14</td>
<td align="left" rowspan="1" colspan="1">16, 22, 32, 45, 64</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">200</td>
<td align="left" rowspan="1" colspan="1">97%</td>
<td align="left" rowspan="1" colspan="1">28</td>
<td align="left" rowspan="1" colspan="1">12</td>
<td align="left" rowspan="1" colspan="1">17</td>
<td align="left" rowspan="1" colspan="1">22, 32, 45, 64, 90, 128</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">400</td>
<td align="left" rowspan="1" colspan="1">97%</td>
<td align="left" rowspan="1" colspan="1">30</td>
<td align="left" rowspan="1" colspan="1">12</td>
<td align="left" rowspan="1" colspan="1">19</td>
<td align="left" rowspan="1" colspan="1">22, 32, 45, 64, 90, 128, 181, 256</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="t002fn001">
<p>The
<italic>k</italic>
′-mer indexes are built using BLAST with sampling steps
<italic>w</italic>
<sub>0</sub>
and
<italic>w</italic>
. True matches HSLAs are of length ≥
<italic>l</italic>
and a match percentage ≥
<italic>t</italic>
. Only HSLAs that have shared
<italic>k</italic>
*-mers are reported by BLAST.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>We build our sampled indexes using the BLAST program
<monospace>makembindex</monospace>
for the three volumes of the unmasked human genome database using BLAST’s default value of
<italic>k</italic>
′ = 12.</p>
</sec>
<sec id="sec013">
<title>Query sets and mappable queries</title>
<p>For HSLA, we use the same query sets that Morgulis
<italic>et al</italic>
. used to evaluate Indexed BLAST [
<xref rid="pone.0179046.ref004" ref-type="bibr">4</xref>
]. Morgulis
<italic>et al</italic>
. organized the queries into three sets based on the average query length:
<italic>qsmall</italic>
(average length 500),
<italic>qmedium</italic>
(average length 10,000), and
<italic>qlarge</italic>
(average length 100,000). Each set has 100 queries for 300 total queries. We group all the queries into a single set of 300 queries and report all results using this single query set. The query sets are available at the following url:
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/queries/human">ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/queries/human</ext-link>
. For EST mapping, we form our query set
<italic>Q</italic>
by randomly selecting 1000 human ESTs (average length 490) from Expressed Sequence Tags database from NCBI
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/dbEST">https://www.ncbi.nlm.nih.gov/dbEST</ext-link>
. For each length
<italic>l</italic>
, we define
<italic>Q</italic>
(
<italic>l</italic>
) to be the subset of
<italic>Q</italic>
that has a non-empty
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) and refer to these as the
<italic>mappable queries</italic>
for length
<italic>l</italic>
.</p>
</sec>
<sec id="sec014">
<title>Query processing</title>
<p>For every query
<italic>q</italic>
in the query set, we run BLAST using the
<monospace>blastn</monospace>
program with the
<monospace>-task megablast</monospace>
option using its default settings except we select MEM
<sub>k</sub>
* value using
<monospace>-word-size</monospace>
<italic>k</italic>
*, we use multiple values of
<italic>w</italic>
, and we set the matching threshold, also known as identity percentage, using
<monospace>-perc-identity 96% for</monospace>
<italic>l</italic>
= 50
<monospace>and -perc-identity 97% for all other</monospace>
<italic>l</italic>
. This will return
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>k</italic>
*,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
) where every alignment must have an MEM
<sub>k</sub>
*. That is, the match percentage
<italic>t</italic>
will be satisfied, but the lengths are only guaranteed to be at least
<italic>k</italic>
*, not
<italic>l</italic>
. We filter out any HSLAs that are too short to produce
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
).</p>
</sec>
<sec id="sec015">
<title>False positives</title>
<p>For any query
<italic>q</italic>
and any
<italic>w</italic>
, we report the number of false positives
<italic>FP</italic>
(
<italic>q</italic>
,
<italic>w</italic>
) as the number of alignments in
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>k</italic>
*,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
) −
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). This should be very close to the number of MEM
<sub>k</sub>
*s that do not extend to alignments in
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
); the two numbers might differ if multiple MEM
<sub>k</sub>
*s are part of the same alignment in
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>k</italic>
*,
<italic>t</italic>
) −
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
).</p>
</sec>
<sec id="sec016">
<title>Experimental system</title>
<p>We run the experiments on a cluster that runs the Community Enterprise Operating System (CentOS) 6.6. The cluster has 24 nodes where each node has two 2.5Ghz 10-core Intel Xeon E5-2670v2 processors and 256 GB memory.</p>
</sec>
<sec id="sec017">
<title>HSLA evaluation metrics</title>
<p>We evaluate the effectiveness of a given
<italic>k</italic>
′-mer index
<italic>SI</italic>
(
<italic>w</italic>
) as a function of
<italic>w</italic>
and
<italic>w</italic>
<sub>0</sub>
using three metrics: (1) index size reduction, (2) retention rate of HSLAs, and (3) query time reduction. For retention rate, we consider retention of all HSLAs as denoted by
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) and short HSLAs as denoted by
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). To help explain query time reduction, we also measure false positive reduction. We describe each metric in more detail.</p>
<p>For each
<italic>SI</italic>
(
<italic>w</italic>
) and each choice of
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
, we define the sampled index size reduction as
<disp-formula id="pone.0179046.e003">
<alternatives>
<graphic xlink:href="pone.0179046.e003.jpg" id="pone.0179046.e003g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M3">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(1)</label>
</disp-formula>
where |
<italic>I</italic>
| is the size of index
<italic>I</italic>
. Index size is the sum of dictionary size, measured by counting the number of
<italic>k</italic>
′-mers, and inverted lists’ size, measured by counting the number of
<italic>k</italic>
′-mer occurrences in all the inverted lists. Since the human genome is split into three volumes and we create a sampled index for each volume, we compute the total index size for all indexes over all volumes. For the total dictionary size, we take the union of all three dictionaries, and then we measure the total dictionary size by counting the number of
<italic>k</italic>
′-mers in the union set. For the total inverted lists’ size, we take the sum over all three inverted lists’ sizes.</p>
<p>For each
<italic>SI</italic>
(
<italic>w</italic>
) and each choice of
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
, we report the full retention rate
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and the short retention rate
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), which we define as follows. For
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
and
<italic>q</italic>
, we define
<disp-formula id="pone.0179046.e004">
<alternatives>
<graphic xlink:href="pone.0179046.e004.jpg" id="pone.0179046.e004g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M4">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(2)</label>
</disp-formula>
and
<disp-formula id="pone.0179046.e005">
<alternatives>
<graphic xlink:href="pone.0179046.e005.jpg" id="pone.0179046.e005g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M5">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mtext>short</mml:mtext>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mtext>short</mml:mtext>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(3)</label>
</disp-formula>
We say that
<italic>RR</italic>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) or
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) is undefined if the denominator is 0. We typically report both ratios as percentages. We use all three volumes to get these percentages. We then set
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) to be the average of
<italic>RR</italic>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), respectively, where we only include query
<italic>q</italic>
in the average if
<italic>RR</italic>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) or
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), respectively, is defined. We report
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) since this is a typical user query. We specifically define
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) to fairly compare empirical retention rate with expected retention rate. Intuitively,
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) focuses on the hardest to retain HSLAs.</p>
<p>For each
<italic>SI</italic>
(
<italic>w</italic>
) and each choice of
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
, we report the average query time reduction percentage
<italic>QTR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), which we define as follows. We start by defining the query time
<italic>QT</italic>
(
<italic>q</italic>
,
<italic>w</italic>
) for a given query
<italic>q</italic>
and sampled index
<italic>SI</italic>
(
<italic>w</italic>
) (including
<italic>SI</italic>
(
<italic>w</italic>
<sub>0</sub>
)) as follows. We process each query
<italic>q</italic>
on
<italic>SI</italic>
(
<italic>w</italic>
) five times using BLAST and we set
<italic>QT</italic>
(
<italic>q</italic>
,
<italic>w</italic>
) to be the median of the five values. Since
<italic>SI</italic>
(
<italic>w</italic>
) is partitioned into three volumes, the query time for a given
<italic>q</italic>
is the sum of the query times over the three volumes. The query time reduction
<italic>QTR</italic>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) is then
<disp-formula id="pone.0179046.e006">
<alternatives>
<graphic xlink:href="pone.0179046.e006.jpg" id="pone.0179046.e006g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M6">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>Q</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>Q</mml:mi>
<mml:mi>T</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>Q</mml:mi>
<mml:mi>T</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(4)</label>
</disp-formula>
Finally, the average query time reduction
<italic>QTR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) is the average of
<italic>QTR</italic>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) over all
<italic>q</italic>
.</p>
<p>Finally, to help explain the query time reduction results, for each
<italic>SI</italic>
(
<italic>w</italic>
) and each choice of
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
, we report the average false positive reduction rate
<italic>FPR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), which we define as follows. For a given query
<italic>q</italic>
and sampled index
<italic>SI</italic>
(
<italic>w</italic>
) (including
<italic>SI</italic>
(
<italic>w</italic>
<sub>0</sub>
)), we define
<italic>FP</italic>
(
<italic>q</italic>
,
<italic>w</italic>
) to be the number of false positive; that is, HSLAs that do not lead to elements of
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) when we apply the second, more expensive, extension phase. We believe that
<italic>FP</italic>
(
<italic>q</italic>
,
<italic>w</italic>
) decreases as
<italic>w</italic>
increases, and this may help explain any reduction in query time. To test this, we define the false positive reduction rate
<italic>FPR</italic>
(
<italic>q</italic>
,
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) to be
<disp-formula id="pone.0179046.e007">
<alternatives>
<graphic xlink:href="pone.0179046.e007.jpg" id="pone.0179046.e007g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M7">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(5)</label>
</disp-formula>
</p>
<p>Finally, the average false positive reduction rate
<italic>FPR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) is the average of
<italic>FPR</italic>
(
<italic>q</italic>
,
<italic>w</italic>
) over all
<italic>q</italic>
.</p>
</sec>
<sec id="sec018">
<title>EST mapping evaluation metrics</title>
<p>For each soft sampled index
<italic>SI</italic>
(
<italic>w</italic>
) and a given length
<italic>l</italic>
, we report its retention rate,
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
(
<italic>w</italic>
,
<italic>l</italic>
), as the percentage of
<italic>Q</italic>
(
<italic>l</italic>
) such that
<italic>all</italic>
of
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
) is found using
<italic>SI</italic>
(
<italic>w</italic>
). We use this requirement because this implies that the mapping result for
<italic>SI</italic>
(
<italic>w</italic>
) for the given query
<italic>q</italic>
will be identical to the mapping result for
<italic>SI</italic>
(
<italic>w</italic>
<sub>0</sub>
) and
<italic>q</italic>
regardless of the mapping procedure used. Otherwise, at least one highly similar local alignment is lost and we pessimistically assume that the mapping result would be lost as well.</p>
<p>More formally, for a given mappable queries set
<italic>Q</italic>
(
<italic>l</italic>
) and
<italic>k</italic>
′-mer index
<italic>SI</italic>
(
<italic>w</italic>
), we define the set
<italic>Q</italic>
′(
<italic>l</italic>
) ⊂
<italic>Q</italic>
(
<italic>l</italic>
) as follows
<disp-formula id="pone.0179046.e008">
<alternatives>
<graphic xlink:href="pone.0179046.e008.jpg" id="pone.0179046.e008g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M8">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo></mml:mo>
<mml:mi>Q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo></mml:mo>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(6)</label>
</disp-formula>
</p>
<p>Then, we define the index retention rate
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
as follows:
<disp-formula id="pone.0179046.e009">
<alternatives>
<graphic xlink:href="pone.0179046.e009.jpg" id="pone.0179046.e009g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M9">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(7)</label>
</disp-formula>
</p>
<p>We also report the effect of
<italic>w</italic>
on query time using the same process as with HSLA, namely, running each query five times, taking the median time, and then reporting the average reduction in query time over all 1000 queries. Note that we use all queries rather than just the mappable queries when reporting query time.</p>
</sec>
</sec>
<sec id="sec019">
<title>Analytical modeling</title>
<p>We now describe how we analytically model two of the evaluation metrics, index size reduction and retention rate.</p>
<sec id="sec020">
<title>Predicting index size</title>
<p>We first show how we compute the expected size of a sampled index
<italic>SI</italic>
(
<italic>w</italic>
). For the dictionary, we assume the
<italic>k</italic>
′-mer dictionary is full and thus the size of a
<italic>k</italic>
′-mer dictionary is 4
<sup>
<italic>k</italic>
</sup>
entries, which in our case, is 4
<sup>12</sup>
. This may not be accurate, but since the dictionary size is typically much smaller than the inverted lists size given
<italic>k</italic>
′ = 12, this is accurate enough. The number of
<italic>k</italic>
′-mer occurrences stored in the inverted lists is simply (
<italic>D</italic>
<italic>S</italic>
(
<italic>k</italic>
′ + 1))/
<italic>w</italic>
where
<italic>D</italic>
= |
<italic>DB</italic>
|, the number of positions in
<italic>DB</italic>
, and
<italic>S</italic>
is the number of distinct sequences in
<italic>DB</italic>
. Thus, the predicted size of
<italic>SI</italic>
(
<italic>w</italic>
) is simply
<disp-formula id="pone.0179046.e010">
<alternatives>
<graphic xlink:href="pone.0179046.e010.jpg" id="pone.0179046.e010g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M10">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>z</mml:mi>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>-</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>(</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>w</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(8)</label>
</disp-formula>
</p>
</sec>
<sec id="sec021">
<title>Predicting retention rate</title>
<p>We present a new analytical model to compute the expected retention rate of HSLAs in
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). We start by presenting Kent’s analytical model [
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
] where he essentially assumed
<italic>w</italic>
=
<italic>k</italic>
′ =
<italic>k</italic>
* in his model. We then propose a new model that we refer to as the BLAST model to account for typical parameters used in BLAST searches. We refer to the expected retention rates as
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>K</italic>
</sub>
] and
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
] for Kent’s model and our model, respectively. For both retention rates, we make a few simplifying assumptions and refer to the two models generically as
<italic>E</italic>
[
<italic>RR</italic>
] when describing these common assumptions. First, we restrict our attention to HSLAs that have length exactly
<italic>l</italic>
. Second, we assume each HSLA in
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
,
<italic>k</italic>
′,
<italic>w</italic>
) is retained with the same probability, and this probability is independent of other HSLAs. This implies
<disp-formula id="pone.0179046.e011">
<alternatives>
<graphic xlink:href="pone.0179046.e011.jpg" id="pone.0179046.e011g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M11">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>R</mml:mi>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mtext>short</mml:mtext>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mtext>short</mml:mtext>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(9)</label>
</disp-formula>
which simplifies to just
<italic>p</italic>
(
<italic>A</italic>
)
<disp-formula id="pone.0179046.e012">
<alternatives>
<graphic xlink:href="pone.0179046.e012.jpg" id="pone.0179046.e012g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M12">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo>[</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>R</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(10)</label>
</disp-formula>
The
<italic>p</italic>
(
<italic>A</italic>
) represents the probability that a short HSLA
<italic>A</italic>
is retained. This allows us to focus on a single short HSLA
<italic>A</italic>
in the rest of this analysis. Finally, we assume each position in
<italic>A</italic>
is independent of other positions and the probability that any position in
<italic>A</italic>
is a match is exactly
<italic>t</italic>
.</p>
</sec>
<sec id="sec022">
<title>Kent’s original retention rate model (
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>K</italic>
</sub>
])</title>
<p>We start with Kent’s original model [
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
]. where he assumes
<italic>w</italic>
=
<italic>k</italic>
* =
<italic>k</italic>
′. The number of
<italic>k</italic>
*-mers that are guaranteed to be chosen from
<italic>x</italic>
within
<italic>A</italic>
is
<disp-formula id="pone.0179046.e013">
<alternatives>
<graphic xlink:href="pone.0179046.e013.jpg" id="pone.0179046.e013g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M13">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo>(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(11)</label>
</disp-formula>
Furthermore, these
<italic>k</italic>
*-mers will be adjacent to each other with no gaps. For
<italic>A</italic>
to be retained, at least one of these chosen
<italic>k</italic>
*-mers from
<italic>x</italic>
must exactly match the corresponding
<italic>k</italic>
*-mer in
<italic>y</italic>
from
<italic>q</italic>
. The probability of such an exact match assuming each position is independent and that the overall match percentage within
<italic>A</italic>
is
<italic>t</italic>
is then
<disp-formula id="pone.0179046.e014">
<alternatives>
<graphic xlink:href="pone.0179046.e014.jpg" id="pone.0179046.e014g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M14">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(12)</label>
</disp-formula>
Since the sampled
<italic>k</italic>
*-mers do not overlap, the probability that all fail to match is then (1 −
<italic>p</italic>
)
<sup>
<italic>T</italic>
</sup>
. Thus, the probability that at least one will match and alignment
<italic>A</italic>
will be found is
<italic>p</italic>
(
<italic>A</italic>
) = 1 − (1 −
<italic>p</italic>
)
<sup>
<italic>T</italic>
</sup>
. Since
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>K</italic>
</sub>
] =
<italic>p</italic>
(
<italic>A</italic>
), we have
<disp-formula id="pone.0179046.e015">
<alternatives>
<graphic xlink:href="pone.0179046.e015.jpg" id="pone.0179046.e015g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M15">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>K</mml:mi>
</mml:msub>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(13)</label>
</disp-formula>
</p>
</sec>
<sec id="sec023">
<title>BLAST retention rate model (
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
])</title>
<p>To extend this analysis to the typical BLAST setting with distinct
<italic>w</italic>
,
<italic>k</italic>
* and
<italic>k</italic>
′, we must modify the formula in two ways. The first key issue is that we sample
<italic>k</italic>
′-mers but then extend them to search for
<italic>k</italic>
*-mers. The sampled
<italic>k</italic>
′-mer must be an exact match, which again happens with probability
<italic>p</italic>
=
<italic>t</italic>
<sup>
<italic>k′</italic>
</sup>
The key issue after this is whether this can be extended to an MEM
<sub>k</sub>
*. Suppose this can extend exactly 0 ≤
<italic>l</italic>
<italic>k</italic>
* −
<italic>k</italic>
′ − 1 characters to the left before we get a mismatch. We then need it to extend at least
<italic>k</italic>
* −
<italic>l</italic>
<italic>k</italic>
′ characters to the right. The probability we can extend exactly
<italic>l</italic>
characters to the left is
<italic>t</italic>
<sup>
<italic>l</italic>
</sup>
(1 −
<italic>t</italic>
). The probability we can extend at least
<italic>k</italic>
* −
<italic>l</italic>
<italic>k</italic>
′ characters to the right is
<italic>t</italic>
<sup>
<italic>k</italic>
*</sup>
<sup>
<italic>k</italic>
</sup>
<sup>
<italic>l</italic>
</sup>
. Thus, the probability that we have a
<italic>k</italic>
′-mer, it extends exactly 0 ≤
<italic>l</italic>
<italic>k</italic>
* −
<italic>k</italic>
′ − 1 characters to the left, and it extends at least
<italic>k</italic>
* −
<italic>l</italic>
<italic>k</italic>
′ characters to the right is then
<italic>t</italic>
<sup>
<italic>k</italic>
</sup>
<italic>t</italic>
<sup>
<italic>l</italic>
</sup>
(1 −
<italic>t</italic>
)
<italic>t</italic>
<sup>
<italic>k</italic>
*</sup>
<sup>
<italic>k</italic>
</sup>
<sup>
<italic>l</italic>
</sup>
=
<italic>t</italic>
<sup>
<italic>k</italic>
*</sup>
(1 −
<italic>t</italic>
) There are
<italic>k</italic>
* −
<italic>k</italic>
′ − 1 choices for
<italic>l</italic>
leading to a final probability of (
<italic>k</italic>
* −
<italic>k</italic>
′ + 1)
<italic>t</italic>
<sup>
<italic>k</italic>
*</sup>
(1 −
<italic>t</italic>
). The other possibility is that it extends at least
<italic>k</italic>
* −
<italic>k</italic>
′ characters to the left, which occurs with probability
<italic>t</italic>
<sup>
<italic>k</italic>
*</sup>
giving us a total probability of
<disp-formula id="pone.0179046.e016">
<alternatives>
<graphic xlink:href="pone.0179046.e016.jpg" id="pone.0179046.e016g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M16">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:msup>
<mml:mi>p</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(14)</label>
</disp-formula>
</p>
<p>The second key issue is that in
<xref ref-type="disp-formula" rid="pone.0179046.e013">Eq 11</xref>
, we used the floor function as this is the number of
<italic>k</italic>
*-mers from
<italic>x</italic>
within
<italic>A</italic>
that are guaranteed to be chosen. Using the floor function ignores the possibility that we may have an additional
<italic>k</italic>
*-mer chosen from
<italic>x</italic>
. That is, the number of
<italic>k</italic>
*-mers from
<italic>k</italic>
that will be sampled might be either
<disp-formula id="pone.0179046.e017">
<alternatives>
<graphic xlink:href="pone.0179046.e017.jpg" id="pone.0179046.e017g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M17">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo>(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(15)</label>
</disp-formula>
<disp-formula id="pone.0179046.e018">
<alternatives>
<graphic xlink:href="pone.0179046.e018.jpg" id="pone.0179046.e018g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M18">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo>(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo></mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(16)</label>
</disp-formula>
where
<italic>T</italic>
<sub>
<italic>f</italic>
</sub>
=
<italic>T</italic>
from
<xref ref-type="disp-formula" rid="pone.0179046.e013">Eq 11</xref>
. If we assume that each possible window for
<italic>w</italic>
is equally likely, then
<disp-formula id="pone.0179046.e019">
<alternatives>
<graphic xlink:href="pone.0179046.e019.jpg" id="pone.0179046.e019g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M19">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>×</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mrow>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>w</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(17)</label>
</disp-formula>
<disp-formula id="pone.0179046.e020">
<alternatives>
<graphic xlink:href="pone.0179046.e020.jpg" id="pone.0179046.e020g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M20">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(18)</label>
</disp-formula>
For the case where ⌊(|
<italic>x</italic>
| −
<italic>k</italic>
* + 1)/
<italic>w</italic>
⌋ = (|
<italic>x</italic>
| −
<italic>k</italic>
* + 1)/
<italic>w</italic>
,
<italic>p</italic>
(
<italic>T</italic>
<sub>
<italic>f</italic>
</sub>
) = 0, which means
<italic>p</italic>
(
<italic>T</italic>
<sub>
<italic>c</italic>
</sub>
) = 1 so the result is still correct.</p>
<p>In our new BLAST model, we update
<xref ref-type="disp-formula" rid="pone.0179046.e015">Eq 13</xref>
replacing
<italic>p</italic>
with
<italic>p</italic>
′ and replacing
<italic>T</italic>
with
<italic>T</italic>
<sub>
<italic>c</italic>
</sub>
and
<italic>T</italic>
<sub>
<italic>f</italic>
</sub>
as follows.
<disp-formula id="pone.0179046.e021">
<alternatives>
<graphic xlink:href="pone.0179046.e021.jpg" id="pone.0179046.e021g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M21">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mi>R</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>B</mml:mi>
</mml:msub>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>p</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
</mml:msup>
<mml:mo>-</mml:mo>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:msup>
<mml:mi>p</mml:mi>
<mml:mo></mml:mo>
</mml:msup>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(19)</label>
</disp-formula>
We will compare both Kent’s model and our new BLAST model in our results.</p>
</sec>
</sec>
</sec>
<sec id="sec024">
<title>Results and discussion</title>
<p>We report the impact of sampling on the efficiency of a
<italic>k</italic>
′-mer index on the index size and query performance. We report both the expected and the actual impact of sampling.</p>
<sec id="sec025">
<title>Index size</title>
<p>
<italic>As expected, the index size is inversely proportional to the sampling step</italic>
<italic>w</italic>
. This means that soft sampling does lead to a significant reduction in space when compared to hard sampling. For example, when
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
is roughly 1.7 and 4.4, the index size reduces by 38% and 74% for all values of
<italic>l</italic>
we considered. The percentage of reduction increases as
<italic>l</italic>
increases. For example when
<italic>l</italic>
= 400 and
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
is almost 10, the index size reduces by 90%.</p>
<p>With hard sampling
<italic>w</italic>
<sub>0</sub>
=
<italic>k</italic>
* −
<italic>k</italic>
′ + 1, the space reduction is limited by
<italic>k</italic>
*. With soft sampling
<italic>w</italic>
>
<italic>k</italic>
* −
<italic>k</italic>
′ + 1,
<italic>w</italic>
is limited primarily by
<italic>l</italic>
, where typically
<italic>l</italic>
<italic>k</italic>
* (see
<xref rid="pone.0179046.t002" ref-type="table">Table 2</xref>
). We plot results for the percentage reduction in index size in
<xref ref-type="fig" rid="pone.0179046.g003">Fig 3</xref>
. Since the expected index size (see
<xref ref-type="disp-formula" rid="pone.0179046.e010">Eq 8</xref>
) and the actual index size are almost identical, the expected size is omitted.</p>
<fig id="pone.0179046.g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.g003</object-id>
<label>Fig 3</label>
<caption>
<title>The sampled index
<italic>SI</italic>
(
<italic>w</italic>
) size (percentage) as a function of sampling step size
<italic>w</italic>
of
<italic>SI</italic>
(
<italic>w</italic>
) versus sampled index
<italic>SI</italic>
(
<italic>w</italic>
<sub>0</sub>
).</title>
<p>The
<italic>k</italic>
′-mer indexes are built with
<italic>k</italic>
′ = 12 and
<italic>w</italic>
<italic>w</italic>
<sub>0</sub>
where
<italic>w</italic>
<sub>0</sub>
=
<italic>l</italic>
<italic>k</italic>
+ 1.</p>
</caption>
<graphic xlink:href="pone.0179046.g003"></graphic>
</fig>
<p>Sampling reduces index size because it reduces the number of sampled
<italic>k</italic>
′-mers leading to a factor of
<italic>w</italic>
reduction in inverted lists size, the dominant component of index size. On the other hand, although sampling does reduce dictionary size, the reduction is relatively small and does not greatly affect the final index size. For example when
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
is roughly 1.7 and 4.4, the dictionary is about 9% and 17% of the the index size and the average reduction in the dictionary size is 9% and 27% respectively.</p>
</sec>
<sec id="sec026">
<title>Retention rate of
<italic>HSLA</italic>
s</title>
<p>We first examine
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) to study how the overall HSLA retention rate changes as a function of
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
, and
<italic>l</italic>
. We first observe that
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) improves as
<italic>l</italic>
increases. In particular, as can be seen from our
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) results from
<xref ref-type="fig" rid="pone.0179046.g004">Fig 4</xref>
, if we look at choices for
<italic>w</italic>
and
<italic>w</italic>
<sub>0</sub>
that have a similar ratio
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
, the
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) retention result is higher for larger
<italic>l</italic>
. In particular, whereas
<italic>RR</italic>
(32, 5) for
<italic>l</italic>
= 50 falls below 80%,
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) ≥ 96.6% for
<italic>l</italic>
≥ 100 for all tested values of
<italic>w</italic>
, and
<italic>RR</italic>
(256, 30) = 97.5% for
<italic>l</italic>
= 400. Thus, for large values of
<italic>l</italic>
, we can use soft sampling where
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
approaches even 10 and still achieve retention rates of close to 100%.</p>
<fig id="pone.0179046.g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.g004</object-id>
<label>Fig 4</label>
<caption>
<title>The actual HSLA retention rate
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), the actual short HSLA retention rate
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), and the expected short HSLA retention rate using both Kent’s model
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>K</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] and BLAST model
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] for (a)
<italic>l</italic>
= 50,
<italic>w</italic>
<sub>0</sub>
= 5, (b)
<italic>l</italic>
= 100,
<italic>w</italic>
<sub>0</sub>
= 14, (c)
<italic>l</italic>
= 200,
<italic>w</italic>
<sub>0</sub>
= 17, and (d)
<italic>l</italic>
= 400,
<italic>w</italic>
<sub>0</sub>
= 30.</title>
<p>For other parameters values see
<xref rid="pone.0179046.t002" ref-type="table">Table 2</xref>
.</p>
</caption>
<graphic xlink:href="pone.0179046.g004"></graphic>
</fig>
<p>We now focus on the retention rate of short HSLAs. First we compare
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) with
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) as a function of
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
, and
<italic>l</italic>
.
<xref ref-type="fig" rid="pone.0179046.g004">Fig 4</xref>
also contains our
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) results.
<italic>We observe that if</italic>
<italic>w</italic>
/
<italic>w</italic>
0 < 4,
<italic>then the difference between</italic>
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)
<italic>and</italic>
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)
<italic>is small (less than 4%) for almost all choices of</italic>
<italic>l</italic>
; the one outlier is
<italic>l</italic>
= 100 where we see a difference of 11% for
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
= 45/14 ≈ 3.2. For example, for all values of
<italic>l</italic>
except 100, the difference between
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) is at most 1.8%. We do see a significant difference between
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) for our largest choices of
<italic>w</italic>
, which means that
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) does fall off by
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
= 10 or so. Thus, in contrast to
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), there is an upper limit to how much we can soft sample before
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) suffers.</p>
<p>Next we examine how
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) changes as a function of
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
, and
<italic>l</italic>
. Similar to
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), we observe
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) generally improves as
<italic>l</italic>
increases given roughly the same ratio of
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
. For example, when
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
is roughly 6,
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) is 65%, and 85%, and 95% for
<italic>l</italic>
equal to 50, 200, and 400, respectively. In general, we can retain 90% more short HSLAs for either small
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
ratios (less than 2 or 3) or large
<italic>l</italic>
values (200 or 400).</p>
<p>We now want to compare empirical retention rate with predicted retention rate as a function of
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
and
<italic>l</italic>
. Comparing
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) to
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] is not fair as
<italic>RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) includes many alignments significantly longer than
<italic>l</italic>
whereas
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] focuses only on alignments with length exactly
<italic>l</italic>
. To more fairly compare empirical retention rate to expected retention rate, we compare
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
), where the length of the alignment is in the range [
<italic>l</italic>
, (2 −
<italic>t</italic>
)
<italic>l</italic>
], with
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)], where an alignment is assumed to have a length
<italic>l</italic>
. We consider HSLAs with length up to (2 −
<italic>t</italic>
)
<italic>l</italic>
to ensure there are a reasonable number of HSLAs. We also note that if we assume that the (1 −
<italic>t</italic>
)
<italic>l</italic>
errors were all insertions rather than substitutions, this would increase the length of the HSLA to (2 −
<italic>t</italic>
)
<italic>l</italic>
.
<xref ref-type="fig" rid="pone.0179046.g004">Fig 4</xref>
also contains our
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] results.</p>
<p>
<italic>We observe that the BLAST model predicts actual retention rate of short HSLAs with reasonable accuracy</italic>
,
<italic>particularly for small</italic>
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
<italic>and for larger</italic>
<italic>l</italic>
. For example, the typical difference between
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] when
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
≤ 2 is less than 5% for all choices of
<italic>l</italic>
and is less than 1% for large
<italic>l</italic>
= 200 and
<italic>l</italic>
= 400. The typical difference stays below 10% for almost all choices of
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
and
<italic>l</italic>
with only a few exceptions. The difference between
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] does grow as
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
increases, but at a relatively slow rate, typically maximized at the largest choice of
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
, though this does not hold for
<italic>l</italic>
= 400. We do note that we have relatively few empirical HSLAs for the large
<italic>w</italic>
values for
<italic>l</italic>
= 400, so perhaps with more samples, the difference between
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
) and
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] might increase for these
<italic>l</italic>
and
<italic>w</italic>
choices.</p>
<p>Finally, we compare the predictions from Kent’s model
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>K</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)] and our new BLAST model
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
)].
<italic>Our new BLAST model is significantly more accurate than Kent’s model, especially as</italic>
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
<italic>increases</italic>
. For example, for
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
equal to 1.7, 3.4 and 4.7,
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>K</italic>
</sub>
] is on average less than
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>B</italic>
</sub>
] by 5%, 18%, and 23%, respectively, for all
<italic>l</italic>
values we consider. Kent’s model has several issues. First, because of the floor function used in
<xref ref-type="disp-formula" rid="pone.0179046.e013">Eq 11</xref>
, it underestimates the number of sampled
<italic>k</italic>
*-mers from a given HSLA resulting in common retention rate predictions for multiple values of
<italic>w</italic>
. For example when
<italic>l</italic>
= 100,
<italic>E</italic>
[
<italic>RR</italic>
<sub>
<italic>K</italic>
</sub>
] = 46.70% for both
<italic>w</italic>
= 45 and
<italic>w</italic>
= 64. The second flaw is that Kent’s model was not designed to handle different values for
<italic>k</italic>
′,
<italic>k</italic>
*, and
<italic>w</italic>
, which is what is typically used in BLAST. Because our BLAST model is designed to overcome both issues, it achieves better results, particularly for larger
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
and for larger
<italic>l</italic>
.</p>
</sec>
<sec id="sec027">
<title>Possible improvements for the BLAST model</title>
<p>While our new BLAST model is much more accurate than Kent’s original model, it still underestimates actual retention rates for large
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
. We now explore possible explanations for this underestimate. We believe the fundamental problem with our new BLAST model (as well as Kent’s model) is that for any
<italic>HSLA</italic>
<sub>short</sub>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
), it only assumes that each position is a match with probability
<italic>t</italic>
.</p>
<p>We demonstrate the shortcomings of this assumption in two different ways. We first show that using this assumption, we greatly underestimate the length of the maximum MEM within any HSLA; we refer to this maximum MEM length as MAX-MEM. Long MEMs are relevant because long MEMs significantly increase the likelihood of recovering an HSLA. For example, if an HSLA includes an MEM of length
<italic>w</italic>
+
<italic>k</italic>
− 1, then it is guaranteed the HSLA will be found since one
<italic>k</italic>
-mer is guaranteed to be chosen from within the MEM.</p>
<p>We perform this comparison as follows. We first obtain an empirical distribution of MAX-MEM by recording the length of the longest MEM in every HSLA in
<italic>RR</italic>
<sub>
<italic>short</italic>
</sub>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
). We then use the BLAST model’s fundamental assumption that each position in an HSLA is identical with probability
<italic>t</italic>
to create a corresponding predicted distribution of MAX-MEM. For this predicted distribution of MAX-MEM, we assume that the length of the HSLA is
<italic>l</italic>
, the number of mismatches is exactly (1 −
<italic>t</italic>
)
<italic>l</italic>
, and each position is equally likely to be a mismatch. All told, there are
<italic>l</italic>
choose (1 −
<italic>t</italic>
)
<italic>l</italic>
different combinations of errors that are equally likely. We can then compute the predicted distribution by enumerating all possibilities for
<italic>l</italic>
= 50 and
<italic>l</italic>
= 100.</p>
<p>For
<italic>l</italic>
> 100, it takes too much time to enumerate all possibilities. Thus, we use Monte Carlo simulation to compute a second predicted distribution for MAX-MEM. We create short HSLAs as follows. We start with an alignment of length
<italic>l</italic>
, a set
<italic>S</italic>
of mismatch positions, which is initialized to empty, and a count
<italic>C</italic>
of the number of mismatch positions, which is initially 0. Then, we repeatedly choose a position in the range [0,
<italic>L</italic>
− 1] uniformly at random. If the position is not in
<italic>S</italic>
, we add the position to
<italic>S</italic>
and increment
<italic>C</italic>
by one. Otherwise, we do nothing with the chosen position and choose another one. When
<italic>C</italic>
reaches (1 −
<italic>t</italic>
)
<italic>l</italic>
, we stop with a complete HSLA. We then record its longest MEM. We do this until we have recorded one million such longest MEMs.</p>
<p>For
<italic>l</italic>
= 50 and
<italic>l</italic>
= 100, Monte Carlo simulation and complete enumeration produce essentially identical distributions for MAX-MEM. Thus, we only show results from our Monte Carlo simulations since these cover all choices of
<italic>l</italic>
. We show the results for our experimental and Monte Carlo distribution of MAX-MEM for all four choices of
<italic>l</italic>
in
<xref ref-type="fig" rid="pone.0179046.g005">Fig 5</xref>
. We observe that the empirical MAX-MEM distribution is weighted more heavily towards longer MEMs than the predicted distribution. This demonstrates that the assumptions used in the BLAST model do not correctly predict the distribution for MAX-MEM length; in general, they underestimate the probability for finding longer MEMs. While the distribution of MAX-MEM is not identical to retention rate of HSLAs, this finding provides evidence that we need stronger assumptions to better predict retention rate of HSLAs.</p>
<fig id="pone.0179046.g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.g005</object-id>
<label>Fig 5</label>
<caption>
<title>Distribution of predicted and empirical MAX-MEM lengths in HSLAs.</title>
<p>The predicted MAX-MEM lengths are computed from a Monte Carlo simulation. (a)
<italic>l</italic>
= 50,
<italic>t</italic>
= 96%, (b)
<italic>l</italic>
= 100,
<italic>t</italic>
= 97%, (c)
<italic>l</italic>
= 200,
<italic>t</italic>
= 97%, and (d)
<italic>l</italic>
= 400,
<italic>t</italic>
= 97%.</p>
</caption>
<graphic xlink:href="pone.0179046.g005"></graphic>
</fig>
<p>We now show another fundamental flaw with the base assumption of the BLAST model. Consider an HSLA of length
<italic>l</italic>
, and suppose we assume that each query position is independent and matches its corresponding database position with probability
<italic>t</italic>
(similar to the BLAST model’s assumptions). Then, there is a (1 −
<italic>t</italic>
) probability that each position does not match. In this scenario, the total number of mismatches has the binomial distribution
<italic>Bin</italic>
(
<italic>l</italic>
, (1 −
<italic>t</italic>
)), which has an expected number of mismatches of exactly (1 −
<italic>t</italic>
)
<italic>l</italic>
. If the number of mismatches exceeds this expected value, we would no longer have an HSLA, but this is clearly contradicts with the first assumption that we start with an HSLA. The probability that the number of mismatches exceeds (1 −
<italic>t</italic>
)
<italic>l</italic>
is given in
<xref rid="pone.0179046.t003" ref-type="table">Table 3</xref>
. Given this weak assumption, the BLAST model essentially starts with a probability ranging from 32% to 43% that the given HSLA is not an HSLA.</p>
<table-wrap id="pone.0179046.t003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.t003</object-id>
<label>Table 3</label>
<caption>
<title>The probability that the number of mismatches exceeds (1 −
<italic>t</italic>
)
<italic>l</italic>
for various choices of
<italic>l</italic>
and
<italic>t</italic>
.</title>
</caption>
<alternatives>
<graphic id="pone.0179046.t003g" xlink:href="pone.0179046.t003"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>l</italic>
</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>t</italic>
</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1">Probablity mismatches exceeds (1 −
<italic>t</italic>
)
<italic>l</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" rowspan="1" colspan="1">50</td>
<td align="char" char="." rowspan="1" colspan="1">0.96</td>
<td align="char" char="." rowspan="1" colspan="1">32.3%</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">100</td>
<td align="char" char="." rowspan="1" colspan="1">0.97</td>
<td align="char" char="." rowspan="1" colspan="1">35.3%</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">200</td>
<td align="char" char="." rowspan="1" colspan="1">0.97</td>
<td align="char" char="." rowspan="1" colspan="1">39.4%</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">400</td>
<td align="char" char="." rowspan="1" colspan="1">0.97</td>
<td align="char" char="." rowspan="1" colspan="1">42.4%</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="t003fn001">
<p>Under the assumption that the number of mismatches in a HSLA follow
<italic>Bin</italic>
(
<italic>l</italic>
, (1 −
<italic>t</italic>
)), Kents’s model underestimates the the existence of the HSLA by 30%–40%.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>We have shown that the weak assumption used in the BLAST model (1)underestimates the probability of longer MAX-MEMs and (2) gives a significant probability for HSLAs to not be HSLAs. Taken together, we believe a new model with stronger assumptions is needed to produce more accurate predictions about retention rate of short HSLAs.</p>
</sec>
<sec id="sec028">
<title>Query time</title>
<p>We now examine how soft sampling affects query time.
<italic>The query time is approximately inversely proportional to the sampling step w for all l values without significantly reducing retention rate RR</italic>
(
<italic>w</italic>
,
<italic>w</italic>
<sub>0</sub>
). For example, for
<italic>l</italic>
= 200, the median query time for hard sampling is 26 hours. When
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
is 3.8 and 5.3, the median query time reduction percentages are 64.51% (median query time 10 hours) and 73.38% (median query time 7.4 hours), respectively, while maintaining
<italic>RR</italic>
> 99%. Similarly, for
<italic>l</italic>
= 400, the median query time for hard sampling is 18.6 hours. When
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
is 6.7 and 9.3, the query time reduction percentages are 78.36% (median query time 4.3 hours) and 83.99% (median query time 3.3 hours), respectively, while maintaining
<italic>RR</italic>
> 99%.
<xref ref-type="fig" rid="pone.0179046.g006">Fig 6</xref>
shows our full query time results represented as query time reduction (
<italic>QTR</italic>
) percentages.</p>
<fig id="pone.0179046.g006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.g006</object-id>
<label>Fig 6</label>
<caption>
<title>The average median query time reduction (QTR) percentages and the actual false positive reduction (FPR) percentages as a function of sampling step
<italic>w</italic>
.</title>
<p>(a)
<italic>l</italic>
= 50,
<italic>w</italic>
<sub>0</sub>
= 5, (b)
<italic>l</italic>
= 100,
<italic>w</italic>
<sub>0</sub>
= 14, (c)
<italic>l</italic>
= 200,
<italic>w</italic>
<sub>0</sub>
= 17, and (d)
<italic>l</italic>
= 400,
<italic>w</italic>
<sub>0</sub>
= 30. For other parameters values see
<xref rid="pone.0179046.t002" ref-type="table">Table 2</xref>
.</p>
</caption>
<graphic xlink:href="pone.0179046.g006"></graphic>
</fig>
<p>The reduction in false positive rate mostly explains the reduction in query time. Recall that false positives are alignments in
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>k</italic>
*,
<italic>t</italic>
) −
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
), which is roughly the number of MEM
<sub>k</sub>
*s that do not extend into alignments in
<italic>HSLA</italic>
(
<italic>DB</italic>
,
<italic>q</italic>
,
<italic>l</italic>
,
<italic>t</italic>
). This implies much of the query time is spent ruling out false positives and that using soft sampling not only has little affect on retention rate but also significantly reduces false positives. Our full false positive reduction rate results are shown in
<xref ref-type="fig" rid="pone.0179046.g006">Fig 6</xref>
. As can be seen from this figure, the plots for false positive reduction rate (
<italic>FPR</italic>
) and query time reduction (
<italic>QTR</italic>
) percentage are very similar.</p>
</sec>
<sec id="sec029">
<title>Mapping results</title>
<p>We report our retention rate and query time results for EST mapping in Tables
<xref rid="pone.0179046.t004" ref-type="table">4</xref>
,
<xref rid="pone.0179046.t005" ref-type="table">5</xref>
and
<xref rid="pone.0179046.t006" ref-type="table">6</xref>
.
<italic>Our results show that the number of mappable queries that retain all HSLAs is very high even when we use soft sampling</italic>
. Furthermore, we achieve significant reductions in query processing time. Recall that a query
<italic>q</italic>
is mappable is there is at least one HSLA between
<italic>q</italic>
and the reference (the human genome in our case). When an index
<italic>SI</italic>
(
<italic>w</italic>
) is used where
<italic>w</italic>
>
<italic>w</italic>
<sub>0</sub>
, a query is lost if even a single HSLA is not found by
<italic>SI</italic>
(
<italic>w</italic>
).</p>
<table-wrap id="pone.0179046.t004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.t004</object-id>
<label>Table 4</label>
<caption>
<title>The retention rate (
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
) and query time reduction (
<italic>QTR</italic>
) results for all 1000 queries when
<italic>l</italic>
= 50 and
<italic>w</italic>
<sub>0</sub>
= 5, where 879 were mappable queries.</title>
</caption>
<alternatives>
<graphic id="pone.0179046.t004g" xlink:href="pone.0179046.t004"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" style="border-bottom:thick;border-right:thick;" rowspan="1" colspan="1">
<italic>w</italic>
</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1"># retained queries</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>QTR</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">5</td>
<td align="center" rowspan="1" colspan="1">879</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">876</td>
<td align="char" char="." rowspan="1" colspan="1">99.66%</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">11</td>
<td align="center" rowspan="1" colspan="1">864</td>
<td align="char" char="." rowspan="1" colspan="1">98.29%</td>
<td align="char" char="." rowspan="1" colspan="1">89.72%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">16</td>
<td align="center" rowspan="1" colspan="1">849</td>
<td align="char" char="." rowspan="1" colspan="1">96.59%</td>
<td align="char" char="." rowspan="1" colspan="1">75.67%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">22</td>
<td align="center" rowspan="1" colspan="1">811</td>
<td align="char" char="." rowspan="1" colspan="1">92.26%</td>
<td align="char" char="." rowspan="1" colspan="1">64.19%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">32</td>
<td align="center" rowspan="1" colspan="1">677</td>
<td align="char" char="." rowspan="1" colspan="1">77.02%</td>
<td align="char" char="." rowspan="1" colspan="1">51.40%</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0179046.t005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.t005</object-id>
<label>Table 5</label>
<caption>
<title>The retention rate (
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
)and query time reduction (
<italic>QTR</italic>
) results for all 1000 queries when
<italic>l</italic>
= 100 and
<italic>w</italic>
<sub>0</sub>
= 14, where 794 were mappable queries.</title>
</caption>
<alternatives>
<graphic id="pone.0179046.t005g" xlink:href="pone.0179046.t005"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" style="border-bottom:thick;border-right:thick;" rowspan="1" colspan="1">
<italic>w</italic>
</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1"># retained queries</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>QTR</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">14</td>
<td align="center" rowspan="1" colspan="1">794</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">16</td>
<td align="center" rowspan="1" colspan="1">794</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">95.60%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">22</td>
<td align="center" rowspan="1" colspan="1">794</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">85.67%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">32</td>
<td align="center" rowspan="1" colspan="1">793</td>
<td align="char" char="." rowspan="1" colspan="1">99.87%</td>
<td align="char" char="." rowspan="1" colspan="1">83.07%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">45</td>
<td align="center" rowspan="1" colspan="1">789</td>
<td align="char" char="." rowspan="1" colspan="1">99.37%</td>
<td align="char" char="." rowspan="1" colspan="1">80.26%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">64</td>
<td align="center" rowspan="1" colspan="1">784</td>
<td align="char" char="." rowspan="1" colspan="1">98.74%</td>
<td align="char" char="." rowspan="1" colspan="1">76.68%</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0179046.t006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0179046.t006</object-id>
<label>Table 6</label>
<caption>
<title>The retention rate (
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
)and query time reduction (
<italic>QTR</italic>
) results for all 1000 queries when
<italic>l</italic>
= 200 and and
<italic>w</italic>
<sub>0</sub>
= 17, where 528 were mappable queries.</title>
</caption>
<alternatives>
<graphic id="pone.0179046.t006g" xlink:href="pone.0179046.t006"></graphic>
<table frame="box" rules="all" border="0">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" style="border-bottom:thick;border-right:thick" rowspan="1" colspan="1">
<italic>w</italic>
</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1"># retained queries</th>
<th align="center" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>RR</italic>
<sub>
<italic>map</italic>
</sub>
</th>
<th align="left" style="border-bottom:thick" rowspan="1" colspan="1">
<italic>QTR</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">17</td>
<td align="center" rowspan="1" colspan="1">528</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">22</td>
<td align="center" rowspan="1" colspan="1">528</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">93.76%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">32</td>
<td align="center" rowspan="1" colspan="1">528</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">91.91%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">45</td>
<td align="center" rowspan="1" colspan="1">525</td>
<td align="char" char="." rowspan="1" colspan="1">99.43%</td>
<td align="char" char="." rowspan="1" colspan="1">87.07%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">64</td>
<td align="center" rowspan="1" colspan="1">528</td>
<td align="char" char="." rowspan="1" colspan="1">100.00%</td>
<td align="char" char="." rowspan="1" colspan="1">84.08%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">90</td>
<td align="center" rowspan="1" colspan="1">523</td>
<td align="char" char="." rowspan="1" colspan="1">99.05%</td>
<td align="char" char="." rowspan="1" colspan="1">80.36%</td>
</tr>
<tr>
<td align="left" style="border-right:thick" rowspan="1" colspan="1">128</td>
<td align="center" rowspan="1" colspan="1">506</td>
<td align="char" char="." rowspan="1" colspan="1">95.83%</td>
<td align="char" char="." rowspan="1" colspan="1">78.76%</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>Using soft sampling, we are able to greatly reduce the index size, significantly reduce query time, and correctly map more than 95% of the mappable queries for
<italic>l</italic>
≥ 100. In fact, for
<italic>l</italic>
≥ 100, we correctly map almost 99% of the mappable queries for
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
approaching 5. For
<italic>l</italic>
= 200, we correctly map at least 95% of the mappable queries for
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
= 7.5. For
<italic>l</italic>
= 50, we still see good retention rates but the drop off is a bit faster. Specifically, for
<italic>l</italic>
= 50, for
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
approaching 3, we correctly map 96% of the mappable queries. For
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
between 4 and 5, we correctly map 92% of the mappable queries, and for
<italic>w</italic>
/
<italic>w</italic>
<sub>0</sub>
between 6 and 7, we correctly map 77% of the mappable queries. Finally, the actual retention rates may be even better than the ones reported as we required that all HSLAs be retained whereas mapping may proceed properly even if some HSLAs are lost.</p>
<p>In human EST mapping, it is common to use HSLAs of length
<italic>l</italic>
= 100 to search for the best mapping [
<xref rid="pone.0179046.ref008" ref-type="bibr">8</xref>
<xref rid="pone.0179046.ref010" ref-type="bibr">10</xref>
]. To study a broader range of possibilities, we also consider
<italic>l</italic>
= 50 (more HSLAs and thus more mappable queries) and
<italic>l</italic>
= 200 (fewer HSLAs and thus fewer mappable queries). Our results imply that soft sampling can be used with relatively small loss in sensitivity for the commonly used case of
<italic>l</italic>
= 100. Given that the index sizes are significantly reduced and query times are also reduced, soft sampling may allow for EST mapping using
<italic>k</italic>
-mer based methods for larger genomes with only a small loss in sensitivity.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="sec030">
<title>Conclusion</title>
<p>We now summarize our main conclusions. Based on our experiments with the human genome and NCBI BLAST, soft sampling achieves significant space and time savings while also retaining highly similar local alignments with much higher probabilities than predicted by analytical modeling. Even better, when applied to EST mapping, soft sampling achieves significant space and time savings while retaining 98% of all mapping results (for
<italic>l</italic>
= 100), and the retention results may be even better as we pessimistically assume that losing even a single highly similar local alignment will lead to an incorrect mapping result. Further, because we performed all of our local alignments using BLAST, these results can be easily tested and adopted by other researchers.</p>
<p>This is but a first step in studying soft sampling. We can extend this work in many directions. We would like to test the effectiveness of soft sampling using other real biological data sets and in other applications such as clustering or SNP detection. Also, we would like to combine soft sampling with reference-based compressed
<italic>k</italic>
-mer indexes, which are useful when there is high redundancy in the data sets [
<xref rid="pone.0179046.ref011" ref-type="bibr">11</xref>
<xref rid="pone.0179046.ref013" ref-type="bibr">13</xref>
]. We focused on soft fixed sampling. An alternative method is soft minimizer sampling, which uses lexicographic information during the sampling process and allows sampling of query
<italic>k</italic>
-mers. In future research we plan to compare the two methods.</p>
</sec>
</body>
<back>
<ack>
<p>We thank the anonymous reviewers for their very helpful comments that greatly improved the paper. This work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0179046.ref001">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>Pearson</surname>
<given-names>WR</given-names>
</name>
,
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
.
<article-title>Improved tools for biological sequence comparison</article-title>
.
<source>Proceedings of the National Academy of Sciences</source>
.
<year>1988</year>
;
<volume>85</volume>
(
<issue>8</issue>
):
<fpage>2444</fpage>
<lpage>2448</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.85.8.2444</pub-id>
<pub-id pub-id-type="pmid">3162770</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref002">
<label>2</label>
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
,
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
,
<name>
<surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
,
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
,
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
,
<etal>et al</etal>
<article-title>Gapped BLAST and PSI-BLAST: A new generation of protein database search programs</article-title>
.
<source>Nucleic Acids Research</source>
.
<year>1997</year>
;
<volume>25</volume>
(
<issue>17</issue>
):
<fpage>3389</fpage>
<lpage>3402</lpage>
.
<pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id>
<pub-id pub-id-type="pmid">9254694</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref003">
<label>3</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
,
<name>
<surname>Schwartz</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Wagner</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
.
<article-title>A greedy algorithm for aligning DNA sequences</article-title>
.
<source>Journal of Computational Biology</source>
.
<year>2000</year>
;
<volume>7</volume>
(
<issue>1–2</issue>
):
<fpage>203</fpage>
<lpage>214</lpage>
.
<pub-id pub-id-type="doi">10.1089/10665270050081478</pub-id>
<pub-id pub-id-type="pmid">10890397</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref004">
<label>4</label>
<mixed-citation publication-type="journal">
<name>
<surname>Morgulis</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Coulouris</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Raytselis</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
,
<name>
<surname>Agarwala</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
.
<article-title>Database indexing for production MegaBLAST searches</article-title>
.
<source>Bioinformatics</source>
.
<year>2008</year>
;
<volume>24</volume>
(
<issue>16</issue>
):
<fpage>1757</fpage>
<lpage>1764</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn322</pub-id>
<pub-id pub-id-type="pmid">18567917</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref005">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Irizarry</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Kustanovich</surname>
<given-names>V</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Brown</surname>
<given-names>N</given-names>
</name>
,
<name>
<surname>Nelson</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Wong</surname>
<given-names>W</given-names>
</name>
,
<etal>et al</etal>
<article-title>Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences</article-title>
.
<source>Nature Genetics</source>
.
<year>2000</year>
;
<volume>26</volume>
(
<issue>2</issue>
):
<fpage>233</fpage>
<lpage>236</lpage>
.
<pub-id pub-id-type="doi">10.1038/79981</pub-id>
<pub-id pub-id-type="pmid">11017085</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref006">
<label>6</label>
<mixed-citation publication-type="journal">
<name>
<surname>Sachidanandam</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Weissman</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Schmidt</surname>
<given-names>SC</given-names>
</name>
,
<name>
<surname>Kakol</surname>
<given-names>JM</given-names>
</name>
,
<name>
<surname>Stein</surname>
<given-names>LD</given-names>
</name>
,
<name>
<surname>Marth</surname>
<given-names>G</given-names>
</name>
,
<etal>et al</etal>
<article-title>A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms</article-title>
.
<source>Nature</source>
.
<year>2001</year>
;
<volume>409</volume>
(
<issue>6822</issue>
):
<fpage>928</fpage>
<lpage>933</lpage>
.
<pub-id pub-id-type="doi">10.1038/35057149</pub-id>
<pub-id pub-id-type="pmid">11237013</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref007">
<label>7</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ng</surname>
<given-names>PC</given-names>
</name>
,
<name>
<surname>Henikoff</surname>
<given-names>S</given-names>
</name>
.
<article-title>Predicting deleterious amino acid substitutions</article-title>
.
<source>Genome Research</source>
.
<year>2001</year>
;
<volume>11</volume>
(
<issue>5</issue>
):
<fpage>863</fpage>
<lpage>874</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.176601</pub-id>
<pub-id pub-id-type="pmid">11337480</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref008">
<label>8</label>
<mixed-citation publication-type="journal">
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
.
<article-title>BLAT-the BLAST-like alignment tool</article-title>
.
<source>Genome Research</source>
.
<year>2002</year>
;
<volume>12</volume>
(
<issue>4</issue>
):
<fpage>656</fpage>
<lpage>664</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.229202</pub-id>
<pub-id pub-id-type="pmid">11932250</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref009">
<label>9</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ning</surname>
<given-names>Z</given-names>
</name>
,
<name>
<surname>Cox</surname>
<given-names>AJ</given-names>
</name>
,
<name>
<surname>Mullikin</surname>
<given-names>JC</given-names>
</name>
.
<article-title>SSAHA: A fast search method for large DNA databases</article-title>
.
<source>Genome Research</source>
.
<year>2001</year>
;
<volume>11</volume>
(
<issue>10</issue>
):
<fpage>1725</fpage>
<lpage>1729</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.194201</pub-id>
<pub-id pub-id-type="pmid">11591649</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref010">
<label>10</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wu</surname>
<given-names>TD</given-names>
</name>
,
<name>
<surname>Watanabe</surname>
<given-names>CK</given-names>
</name>
.
<article-title>GMAP: a genomic mapping and alignment program for mRNA and EST sequences</article-title>
.
<source>Bioinformatics</source>
.
<year>2005</year>
;
<volume>21</volume>
(
<issue>9</issue>
):
<fpage>1859</fpage>
<lpage>1875</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti310</pub-id>
<pub-id pub-id-type="pmid">15728110</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref011">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wandelt</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Leser</surname>
<given-names>U</given-names>
</name>
.
<article-title>Mrcsi: Compressing and searching string collections with multiple references</article-title>
.
<source>Proceedings of the VLDB Endowment</source>
.
<year>2015</year>
;
<volume>8</volume>
(
<issue>5</issue>
):
<fpage>461</fpage>
<lpage>472</lpage>
.
<pub-id pub-id-type="doi">10.14778/2735479.2735480</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref012">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wandelt</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Starlinger</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Bux</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Leser</surname>
<given-names>U</given-names>
</name>
.
<article-title>RCSI: Scalable Similarity Search in Thousand(s) of Genomes</article-title>
.
<source>Proceedings of the VLDB Endowment</source>
.
<year>2013</year>
;
<volume>6</volume>
(
<issue>13</issue>
):
<fpage>1534</fpage>
<lpage>1545</lpage>
.
<pub-id pub-id-type="doi">10.14778/2536258.2536265</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref013">
<label>13</label>
<mixed-citation publication-type="journal">
<name>
<surname>Danek</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Grabowski</surname>
<given-names>S</given-names>
</name>
.
<article-title>Indexes of large genome collections on a PC</article-title>
.
<source>PLOS ONE</source>
.
<year>2014</year>
;
<volume>9</volume>
(
<issue>10</issue>
):
<fpage>e109384</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0109384</pub-id>
<pub-id pub-id-type="pmid">25289699</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref014">
<label>14</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hatem</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Bozdağ</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Toland</surname>
<given-names>AE</given-names>
</name>
,
<name>
<surname>Çatalyürek</surname>
<given-names>ÜV</given-names>
</name>
.
<article-title>Benchmarking short sequence mapping tools</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2013</year>
;
<volume>14</volume>
(
<issue>1</issue>
):
<fpage>1</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-14-184</pub-id>
<pub-id pub-id-type="pmid">23323762</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref015">
<label>15</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hach</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
,
<name>
<surname>Eichler</surname>
<given-names>EE</given-names>
</name>
,
<etal>et al</etal>
<article-title>mrsFAST: A cache-oblivious algorithm for short-read mapping</article-title>
.
<source>Nature Methods</source>
.
<year>2010</year>
;
<volume>7</volume>
(
<issue>8</issue>
):
<fpage>576</fpage>
<lpage>577</lpage>
.
<pub-id pub-id-type="doi">10.1038/nmeth0810-576</pub-id>
<pub-id pub-id-type="pmid">20676076</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref016">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Kidd</surname>
<given-names>JM</given-names>
</name>
,
<name>
<surname>Marques-Bonet</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Aksay</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Antonacci</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
,
<etal>et al</etal>
<article-title>Personalized copy number and segmental duplication maps using next-generation sequencing</article-title>
.
<source>Nature Genetics</source>
.
<year>2009</year>
;
<volume>41</volume>
(
<issue>10</issue>
):
<fpage>1061</fpage>
<lpage>1067</lpage>
.
<pub-id pub-id-type="doi">10.1038/ng.437</pub-id>
<pub-id pub-id-type="pmid">19718026</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref017">
<label>17</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rumble</surname>
<given-names>SM</given-names>
</name>
,
<name>
<surname>Lacroute</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Dalca</surname>
<given-names>AV</given-names>
</name>
,
<name>
<surname>Fiume</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Sidow</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Brudno</surname>
<given-names>M</given-names>
</name>
.
<article-title>SHRiMP: Accurate mapping of short color-space reads</article-title>
.
<source>PLOS ONE Computational Biology</source>
.
<year>2009</year>
;
<volume>5</volume>
(
<issue>5</issue>
):
<fpage>e1000386</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1000386</pub-id>
<pub-id pub-id-type="pmid">19461883</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref018">
<label>18</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ahmadi</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Behm</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Honnalli</surname>
<given-names>N</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Weng</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Xie</surname>
<given-names>X</given-names>
</name>
.
<article-title>Hobbes: Optimized gram-based methods for efficient read alignment</article-title>
.
<source>Nucleic Acids Research</source>
.
<year>2012</year>
;
<volume>40</volume>
(
<issue>6</issue>
):
<fpage>e41</fpage>
<lpage>e41</lpage>
.
<pub-id pub-id-type="doi">10.1093/nar/gkr1246</pub-id>
<pub-id pub-id-type="pmid">22199254</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref019">
<label>19</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Hach</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Sahinalp</surname>
<given-names>SC</given-names>
</name>
,
<name>
<surname>Eichler</surname>
<given-names>EE</given-names>
</name>
,
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
.
<article-title>Sensitive and fast mapping of di-base encoded reads</article-title>
.
<source>Bioinformatics</source>
.
<year>2011</year>
;
<volume>27</volume>
(
<issue>14</issue>
):
<fpage>1915</fpage>
<lpage>1921</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr303</pub-id>
<pub-id pub-id-type="pmid">21586516</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref020">
<label>20</label>
<mixed-citation publication-type="journal">
<name>
<surname>Weese</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Emde</surname>
<given-names>AK</given-names>
</name>
,
<name>
<surname>Rausch</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Döring</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Reinert</surname>
<given-names>K</given-names>
</name>
.
<article-title>RazerS: Fast read mapping with sensitivity control</article-title>
.
<source>Genome Research</source>
.
<year>2009</year>
;
<volume>19</volume>
(
<issue>9</issue>
):
<fpage>1646</fpage>
<lpage>1654</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.088823.108</pub-id>
<pub-id pub-id-type="pmid">19592482</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref021">
<label>21</label>
<mixed-citation publication-type="journal">
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Hayes</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Hunt</surname>
<given-names>BR</given-names>
</name>
,
<name>
<surname>Mount</surname>
<given-names>SM</given-names>
</name>
,
<name>
<surname>Yorke</surname>
<given-names>JA</given-names>
</name>
.
<article-title>Reducing storage requirements for biological sequence comparison</article-title>
.
<source>Bioinformatics</source>
.
<year>2004</year>
;
<volume>20</volume>
(
<issue>18</issue>
):
<fpage>3363</fpage>
<lpage>3369</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth408</pub-id>
<pub-id pub-id-type="pmid">15256412</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref022">
<label>22</label>
<mixed-citation publication-type="journal">
<name>
<surname>Roberts</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Hunt</surname>
<given-names>BR</given-names>
</name>
,
<name>
<surname>Yorke</surname>
<given-names>JA</given-names>
</name>
,
<name>
<surname>Bolanos</surname>
<given-names>RA</given-names>
</name>
,
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
.
<article-title>A preprocessor for shotgun assembly of large genomes</article-title>
.
<source>Journal of Computational Biology</source>
.
<year>2004</year>
;
<volume>11</volume>
(
<issue>4</issue>
):
<fpage>734</fpage>
<lpage>752</lpage>
.
<pub-id pub-id-type="doi">10.1089/cmb.2004.11.734</pub-id>
<pub-id pub-id-type="pmid">15579242</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref023">
<label>23</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ye</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Ma</surname>
<given-names>ZS</given-names>
</name>
,
<name>
<surname>Cannon</surname>
<given-names>CH</given-names>
</name>
,
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Douglas</surname>
<given-names>WY</given-names>
</name>
.
<article-title>Exploiting sparseness in de novo genome assembly</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2012</year>
;
<volume>13</volume>
(
<issue>6</issue>
):
<fpage>1</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-13-S6-S1</pub-id>
<pub-id pub-id-type="pmid">22214541</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref024">
<label>24</label>
<mixed-citation publication-type="book">
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Limasset</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Jackman</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
,
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
.
<chapter-title>On the representation of de Bruijn graphs</chapter-title>
In:
<source>Research in Computational Molecular Biology</source>
.
<publisher-name>Springer</publisher-name>
;
<year>2014</year>
p.
<fpage>35</fpage>
<lpage>55</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0179046.ref025">
<label>25</label>
<mixed-citation publication-type="other">Movahedi NS, Forouzmand E, Chitsaz H. De novo co-assembly of bacterial genomes from multiple single cells. In: Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. IEEE; 2012. p. 1–5.</mixed-citation>
</ref>
<ref id="pone.0179046.ref026">
<label>26</label>
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
.
<article-title>Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences</article-title>
.
<source>Bioinformatics</source>
.
<year>2016</year>
; p. btw152.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw152</pub-id>
<pub-id pub-id-type="pmid">27153593</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref027">
<label>27</label>
<mixed-citation publication-type="journal">
<name>
<surname>Abouelhoda</surname>
<given-names>MI</given-names>
</name>
,
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
.
<article-title>Replacing suffix trees with enhanced suffix arrays</article-title>
.
<source>Journal of Discrete Algorithms</source>
.
<year>2004</year>
;
<volume>2</volume>
(
<issue>1</issue>
):
<fpage>53</fpage>
<lpage>86</lpage>
.
<pub-id pub-id-type="doi">10.1016/S1570-8667(03)00065-0</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref028">
<label>28</label>
<mixed-citation publication-type="journal">
<name>
<surname>Vyverman</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>De Baets</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Fack</surname>
<given-names>V</given-names>
</name>
,
<name>
<surname>Dawyndt</surname>
<given-names>P</given-names>
</name>
.
<article-title>essaMEM: Finding maximal exact matches using enhanced sparse suffix arrays</article-title>
.
<source>Bioinformatics</source>
.
<year>2013</year>
;
<volume>29</volume>
(
<issue>6</issue>
):
<fpage>802</fpage>
<lpage>804</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt042</pub-id>
<pub-id pub-id-type="pmid">23349213</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref029">
<label>29</label>
<mixed-citation publication-type="journal">
<name>
<surname>Khiste</surname>
<given-names>N</given-names>
</name>
,
<name>
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
.
<article-title>E-MEM: Efficient computation of maximal exact matches for very large genomes</article-title>
.
<source>Bioinformatics</source>
.
<year>2015</year>
;
<volume>31</volume>
(
<issue>4</issue>
):
<fpage>509</fpage>
<lpage>514</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu687</pub-id>
<pub-id pub-id-type="pmid">25399029</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0179046.ref030">
<label>30</label>
<mixed-citation publication-type="journal">
<name>
<surname>Xin</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Yedkar</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Mutlu</surname>
<given-names>O</given-names>
</name>
,
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
.
<article-title>Accelerating read mapping with FastHASH</article-title>
.
<source>BMC Genomics</source>
.
<year>2013</year>
;
<volume>14</volume>
(
<issue>Suppl 1</issue>
):
<fpage>S13</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-14-S1-S13</pub-id>
<pub-id pub-id-type="pmid">23369189</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001030 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 001030 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5501444
   |texte=   The effects of sampling on the efficiency and accuracy of k−mer indexes: Theoretical and empirical comparisons using the human genome
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:28686614" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021