MersV1, Pmc, Corpus, bibRecord, 001009

Indexing Arbitrary-Length k-Mers in Sequencing Reads

Identifieur interne : 001009 ( Pmc/Corpus ); précédent : 001008; suivant : 001010

Indexing Arbitrary-Length k-Mers in Sequencing Reads

Auteurs : Tomasz Kowalski ; Szymon Grabowski ; Sebastian Deorowicz

Source :

PLoS ONE [ 1932-6203 ] ; 2015.

RBID : PMC:4504488

Abstract

We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4504488

DOI: 10.1371/journal.pone.0133198
PubMed: 26182400
PubMed Central: 4504488

Links to Exploration step

PMC:4504488

Le document en format XML

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Indexing Arbitrary-Length <italic>k</italic>
-Mers in Sequencing Reads</title>
<author><name sortKey="Kowalski, Tomasz" sort="Kowalski, Tomasz" uniqKey="Kowalski T" first="Tomasz" last="Kowalski">Tomasz Kowalski</name>
<affiliation><nlm:aff id="aff001"><addr-line>Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation><nlm:aff id="aff001"><addr-line>Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation><nlm:aff id="aff002"><addr-line>Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">26182400</idno>
<idno type="pmc">4504488</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4504488</idno>
<idno type="RBID">PMC:4504488</idno>
<idno type="doi">10.1371/journal.pone.0133198</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">001009</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001009</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Indexing Arbitrary-Length <italic>k</italic>
-Mers in Sequencing Reads</title>
<author><name sortKey="Kowalski, Tomasz" sort="Kowalski, Tomasz" uniqKey="Kowalski T" first="Tomasz" last="Kowalski">Tomasz Kowalski</name>
<affiliation><nlm:aff id="aff001"><addr-line>Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation><nlm:aff id="aff001"><addr-line>Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation><nlm:aff id="aff002"><addr-line>Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint><date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating <italic>k</italic>
-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Gusfield, D" uniqKey="Gusfield D">D Gusfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author><name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Danek, A" uniqKey="Danek A">A Danek</name>
</author>
<author><name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author><name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kelly, Dr" uniqKey="Kelly D">DR Kelly</name>
</author>
<author><name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author><name sortKey="Molnar, M" uniqKey="Molnar M">M Molnar</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Heo, Y" uniqKey="Heo Y">Y Heo</name>
</author>
<author><name sortKey="Wu, Xl" uniqKey="Wu X">XL Wu</name>
</author>
<author><name sortKey="Chen, D" uniqKey="Chen D">D Chen</name>
</author>
<author><name sortKey="Ma, J" uniqKey="Ma J">J Ma</name>
</author>
<author><name sortKey="Hwu, Wm" uniqKey="Hwu W">WM Hwu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Schulz, Mh" uniqKey="Schulz M">MH Schulz</name>
</author>
<author><name sortKey="Weese, D" uniqKey="Weese D">D Weese</name>
</author>
<author><name sortKey="Holtgrewe, M" uniqKey="Holtgrewe M">M Holtgrewe</name>
</author>
<author><name sortKey="Dimitrova, V" uniqKey="Dimitrova V">V Dimitrova</name>
</author>
<author><name sortKey="Niu, S" uniqKey="Niu S">S Niu</name>
</author>
<author><name sortKey="Reinert, K" uniqKey="Reinert K">K Reinert</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author><name sortKey="Kobert, K" uniqKey="Kobert K">K Kobert</name>
</author>
<author><name sortKey="Flouri, T" uniqKey="Flouri T">T Flouri</name>
</author>
<author><name sortKey="Stamatakis, A" uniqKey="Stamatakis A">A Stamatakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ames, Sk" uniqKey="Ames S">SK Ames</name>
</author>
<author><name sortKey="Hysom, Da" uniqKey="Hysom D">DA Hysom</name>
</author>
<author><name sortKey="Gardner, Sn" uniqKey="Gardner S">SN Gardner</name>
</author>
<author><name sortKey="Lloyd, Gs" uniqKey="Lloyd G">GS Lloyd</name>
</author>
<author><name sortKey="Gokhale, Mb" uniqKey="Gokhale M">MB Gokhale</name>
</author>
<author><name sortKey="Allen, Je" uniqKey="Allen J">JE Allen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wood, D" uniqKey="Wood D">D Wood</name>
</author>
<author><name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bazinet, Al" uniqKey="Bazinet A">AL Bazinet</name>
</author>
<author><name sortKey="Cummings, Mp" uniqKey="Cummings M">MP Cummings</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Philippe, N" uniqKey="Philippe N">N Philippe</name>
</author>
<author><name sortKey="Salson, M" uniqKey="Salson M">M Salson</name>
</author>
<author><name sortKey="Lecroq, T" uniqKey="Lecroq T">T Lecroq</name>
</author>
<author><name sortKey="Leonard, M" uniqKey="Leonard M">M Léonard</name>
</author>
<author><name sortKey="Commes, T" uniqKey="Commes T">T Commes</name>
</author>
<author><name sortKey="Rivals, E" uniqKey="Rivals E">E Rivals</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Philippe, N" uniqKey="Philippe N">N Philippe</name>
</author>
<author><name sortKey="Salson, M" uniqKey="Salson M">M Salson</name>
</author>
<author><name sortKey="Commes, T" uniqKey="Commes T">T Commes</name>
</author>
<author><name sortKey="Rivals, E" uniqKey="Rivals E">E Rivals</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rizk, G" uniqKey="Rizk G">G Rizk</name>
</author>
<author><name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
<author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marcais, G" uniqKey="Marcais G">G Marçais</name>
</author>
<author><name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author><name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A Debudaj-Grabysz</name>
</author>
<author><name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author><name sortKey="Schroder, H" uniqKey="Schroder H">H Schröder</name>
</author>
<author><name sortKey="Puglisi, Sj" uniqKey="Puglisi S">SJ Puglisi</name>
</author>
<author><name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
<author><name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author><name sortKey="Pertea, M" uniqKey="Pertea M">M Pertea</name>
</author>
<author><name sortKey="Fahrner, Ja" uniqKey="Fahrner J">JA Fahrner</name>
</author>
<author><name sortKey="Sobreira, N" uniqKey="Sobreira N">N Sobreira</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author><name sortKey="Phillippy, A" uniqKey="Phillippy A">A Phillippy</name>
</author>
<author><name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author><name sortKey="Smoot, M" uniqKey="Smoot M">M Smoot</name>
</author>
<author><name sortKey="Shumway, M" uniqKey="Shumway M">M Shumway</name>
</author>
<author><name sortKey="Antonescu, C" uniqKey="Antonescu C">C Antonescu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Manber, U" uniqKey="Manber U">U Manber</name>
</author>
<author><name sortKey="Myers, G" uniqKey="Myers G">G Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Maier, D" uniqKey="Maier D">D Maier</name>
</author>
<author><name sortKey="Storer, Ja" uniqKey="Storer J">JA Storer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
<author><name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author><name sortKey="Roguski, L" uniqKey="Roguski L">L Roguski</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group><journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher><publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">26182400</article-id>
<article-id pub-id-type="pmc">4504488</article-id>
<article-id pub-id-type="publisher-id">PONE-D-15-06025</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0133198</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>Indexing Arbitrary-Length <italic>k</italic>
-Mers in Sequencing Reads</article-title>
<alt-title alt-title-type="running-head">Indexing Arbitrary-Length <italic>k</italic>
-Mers in Sequencing Reads</alt-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Kowalski</surname>
<given-names>Tomasz</given-names>
</name>
<xref ref-type="aff" rid="aff001"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Grabowski</surname>
<given-names>Szymon</given-names>
</name>
<xref ref-type="aff" rid="aff001"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Deorowicz</surname>
<given-names>Sebastian</given-names>
</name>
<xref ref-type="aff" rid="aff002"><sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
</contrib-group>
<aff id="aff001"><label>1</label>
<addr-line>Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland</addr-line>
</aff>
<aff id="aff002"><label>2</label>
<addr-line>Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland</addr-line>
</aff>
<contrib-group><contrib contrib-type="editor"><name><surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1"><addr-line>University of Maryland, UNITED STATES</addr-line>
</aff>
<author-notes><fn fn-type="COI-statement" id="coi001"><p><bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con" id="contrib001"><p>Conceived and designed the experiments: TK SG SD. Performed the experiments: TK. Analyzed the data: TK SG SD. Wrote the paper: TK SG SD.</p>
</fn>
<corresp id="cor001">* E-mail: <email>sebastian.deorowicz@polsl.pl</email>
</corresp>
</author-notes>
<pub-date pub-type="collection"><year>2015</year>
</pub-date>
<pub-date pub-type="epub"><day>16</day>
<month>7</month>
<year>2015</year>
</pub-date>
<volume>10</volume>
<issue>7</issue>
<elocation-id>e0133198</elocation-id>
<history><date date-type="received"><day>13</day>
<month>2</month>
<year>2015</year>
</date>
<date date-type="accepted"><day>24</day>
<month>6</month>
<year>2015</year>
</date>
</history>
<permissions><copyright-statement>© 2015 Kowalski et al</copyright-statement>
<copyright-year>2015</copyright-year>
<copyright-holder>Kowalski et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="pone.0133198.pdf"></self-uri>
<abstract><p>We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating <italic>k</italic>
-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.</p>
</abstract>
<funding-group><funding-statement>This work was supported by The Polish National Science Centre under the project DEC-2012/05/B/ST6/03148. The infrastructure was supported by POIG.02.03.01-24-099/13 grant “GeCONiI---Upper Silesian Center for Computational Science and Engineering.” The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts><fig-count count="6"></fig-count>
<table-count count="7"></table-count>
<page-count count="16"></page-count>
</counts>
<custom-meta-group><custom-meta id="data-availability"><meta-name>Data Availability</meta-name>
<meta-value>All relevant data (including URLs to public repositories) are available within the paper.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes><title>Data Availability</title>
<p>All relevant data (including URLs to public repositories) are available within the paper.</p>
</notes>
</front>
<body><sec sec-type="intro" id="sec001"><title>Introduction</title>
<p>The genome sequencing costs dropped recently to less than 5 thousand U.S. dollars per human genome with about 30-fold coverage [<xref rid="pone.0133198.ref001" ref-type="bibr">1</xref>
]. Using the recent (and expensive) Illumina HiSeq X Ten system [<xref rid="pone.0133198.ref002" ref-type="bibr">2</xref>
], it may be even possible to reduce this cost to about 1 thousand dollars (or somewhat more) on a long run. The scale of the largest sequencing projects is amazing, e.g., the Million Veteran Program [<xref rid="pone.0133198.ref003" ref-type="bibr">3</xref>
] aims at sequencing 1M human genomes. Needless to say, all this results in enormous amounts of sequencing data.</p>
<p>These data have to be processed in some way. Usually, they are mapped onto reference genomes and then variant calling algorithms are used to identify the mutations present in sequenced genomes. Since the mapping requires fast search over reference genomes, a lot of indexing structures for genomes were adopted or invented. The obvious candidates were the suffix tree and the suffix array [<xref rid="pone.0133198.ref004" ref-type="bibr">4</xref>
], but their space requirements were often prohibitive, especially in the beginning of the 21st century. The situation changed with the advent of much more compact (compressed) index data structures. The most widely used in the read aligning software is the family of FM-indexes [<xref rid="pone.0133198.ref005" ref-type="bibr">5</xref>
], employed by the popular Bowtie [<xref rid="pone.0133198.ref006" ref-type="bibr">6</xref>
], BWA [<xref rid="pone.0133198.ref007" ref-type="bibr">7</xref>
] and many other mappers. Modern computers are more powerful, hence nowadays using a suffix array is often not a problem, especially if the array is sparsified (i.e., only a fraction of indexes is represented explicitly) [<xref rid="pone.0133198.ref008" ref-type="bibr">8</xref>
]. One of the recent successful examples is the MuGI multi-genome index [<xref rid="pone.0133198.ref009" ref-type="bibr">9</xref>
], allowing to index 1092 human genomes in less than 10 GB of memory.</p>
<p>As said above, a lot was done in the area of genome indexing, but very little for the other standard component of the input, i.e., sequencing reads. The main reason is that when the reads are simply mapped onto a reference genome, indexing them is pointless. In many situations, however, the reads are processed in some way before (or instead of) mapping. The most obvious case is read correction, which makes the mapping (or de novo assembling) easier and yields better final results, i.e., better determination of mutations. There are a number of read correctors, e.g., Quake [<xref rid="pone.0133198.ref010" ref-type="bibr">10</xref>
], RACER [<xref rid="pone.0133198.ref011" ref-type="bibr">11</xref>
], BLESS [<xref rid="pone.0133198.ref012" ref-type="bibr">12</xref>
], Fiona [<xref rid="pone.0133198.ref013" ref-type="bibr">13</xref>
]; see the recent survey [<xref rid="pone.0133198.ref014" ref-type="bibr">14</xref>
] for more examples. Sometimes the paired reads are joined if they overlap, with benefits in the mapping quality [<xref rid="pone.0133198.ref015" ref-type="bibr">15</xref>
]. In some other applications, e.g., in metagenomic studies, the goal is to assign reads to species (to identify which organisms can be found in the analyzed probe), and the reads are not mapped at all [<xref rid="pone.0133198.ref016" ref-type="bibr">16</xref>
–<xref rid="pone.0133198.ref018" ref-type="bibr">18</xref>
].</p>
<p>In such cases no reference sequence is used (or it is used only implicitly) and all the available knowledge can be retrieved only from the reads. The simplest approach is to calculate the statistics of <italic>k</italic>
-mers (i.e., all <italic>k</italic>
-symbol long substrings of reads), but some programs use more sophisticated knowledge. Therefore, the necessity of indexing reads was identified recently [<xref rid="pone.0133198.ref019" ref-type="bibr">19</xref>
]. Philippe et al. defined therein the index supporting the following queries. Given a query string <italic>f</italic>:
<list list-type="simple"><list-item><p><italic>Q</italic>
1 In which reads does <italic>f</italic>
 occur?</p>
</list-item>
<list-item><p><italic>Q</italic>
2 In how many reads does <italic>f</italic>
 occur?</p>
</list-item>
<list-item><p><italic>Q</italic>
3 What are the occurrence positions of <italic>f</italic>
?</p>
</list-item>
<list-item><p><italic>Q</italic>
4 What is the number of occurrences of <italic>f</italic>
?</p>
</list-item>
<list-item><p><italic>Q</italic>
5 In which reads does <italic>f</italic>
 occur only once?</p>
</list-item>
<list-item><p><italic>Q</italic>
6 In how many reads does <italic>f</italic>
 occur only once?</p>
</list-item>
<list-item><p><italic>Q</italic>
7 What are the occurrence positions of <italic>f</italic>
 in the reads where it occurs only once?</p>
</list-item>
</list>
</p>
<p>There are two ways in which <italic>f</italic>
 can be given in those queries, which may lead to different time complexities and actual timing results. In one, <italic>f</italic>
 is given as a sequence of DNA symbols. In the other, <italic>f</italic>
 is represented as a read ID followed with the start position of <italic>f</italic>
 in this read (and optionally, <italic>f</italic>
’s length, if it is not fixed).</p>
<p>There are a number of potential applications of this index. Philippe et al. [<xref rid="pone.0133198.ref019" ref-type="bibr">19</xref>
] described the following. The queries <italic>Q</italic>
1 and <italic>Q</italic>
2 can be used for mutation (both SNPs and short indels) detection. The query <italic>Q</italic>
2 can be used to calculate a “local coverage” of a <italic>k</italic>
-mer, i.e., the number of reads sharing it. This was used in the work [<xref rid="pone.0133198.ref020" ref-type="bibr">20</xref>
] for calculation of “support profile” of each <italic>k</italic>
-mer in a large package for analyzing reads from RNA-seq experiments. One more potential usage of index queries <italic>Q</italic>
3 and <italic>Q</italic>
4 can be in clustering and assembly without a reference genome.</p>
<p>One of the successful techniques in read correctors, e.g., BLESS, RACER, is to preprocess the reads to collect the <italic>k</italic>
-mer frequencies (i.e., allow to answer the <italic>Q</italic>
4 queries), which can be obtained with specialized software [<xref rid="pone.0133198.ref021" ref-type="bibr">21</xref>
–<xref rid="pone.0133198.ref023" ref-type="bibr">23</xref>
]. In some other tools, like Fiona [<xref rid="pone.0133198.ref013" ref-type="bibr">13</xref>
], Shrec [<xref rid="pone.0133198.ref024" ref-type="bibr">24</xref>
], HybridShrec [<xref rid="pone.0133198.ref025" ref-type="bibr">25</xref>
], it is necessary also to obtain the list of reads containing the <italic>k</italic>
-mer (i.e., they need <italic>Q</italic>
1 queries). The solution used in Fiona is to construct the generalized suffix array, i.e., suffix array containing all suffixes from all reads. Unfortunately, this implies huge memory requirements, e.g., for reads of human sequencing with 10-fold coverage, the memory occupation is 224 GB.</p>
<p>The recent paper by Salzberg et al. [<xref rid="pone.0133198.ref026" ref-type="bibr">26</xref>
] deals with mutation detection. One of the main difficulties in this problem is a large amount of candidate mutations that must be filtered out. Salzberg et al. propose an innovative approach in their Diamund software. At the first stages, they collect the statistics of <italic>k</italic>
-mers in the sequencing results of a trio (mother, father, proband). Then the statistics are reduced by a huge factor in some way. More precisely, Diamund attempts to identify all <italic>k</italic>
-mers unique to an affected proband and missing from both unaffected parents. The proband data are filtered to remove the <italic>k</italic>
-mers likely to contain sequencing errors, based on the assumption that any <italic>k</italic>
-mer occurring just a few times (in a dataset with a high coverage) represents an error. Intersecting all three sets identifies <italic>k</italic>
-mers that are unique to the proband. Finally, when the number of different <italic>k</italic>
-mers is counted in (tens of) thousands, they need to identify the reads containing these <italic>k</italic>
-mers. Diamund uses Kraken [<xref rid="pone.0133198.ref017" ref-type="bibr">17</xref>
] or MUMmer [<xref rid="pone.0133198.ref027" ref-type="bibr">27</xref>
] for this task. Nevertheless, this is an obvious potential application of an index for sequencing reads.</p>
<p>Currently, only a few indexing structures supporting the mentioned list of queries are known. Historically, the first one is Gk arrays (GkA) [<xref rid="pone.0133198.ref019" ref-type="bibr">19</xref>
]. This scheme works for a single length of <italic>f</italic>
 only (set at construction time). The main GkA idea is to order lexicographically all substrings of length <italic>k</italic>
 = ∣<italic>f</italic>
∣ of the reads. Let us denote the cardinality of the reads collection with <italic>q</italic>
. Assume that the reads are of equal length <italic>m</italic>
. As the number of reads substrings is <italic>q</italic>
(<italic>m</italic>
 − <italic>k</italic>
 + 1), the binary search for sequences with a given <italic>k</italic>
-long prefix, like in a suffix array [<xref rid="pone.0133198.ref028" ref-type="bibr">28</xref>
], has time complexity of <italic>O</italic>
(<italic>k</italic>
 log((<italic>m</italic>
 − <italic>k</italic>
+1)<italic>q</italic>
)). In the following we use the symbol <italic>n</italic>
 = <italic>q</italic>
(<italic>m</italic>
 − <italic>k</italic>
+1) to simplify notation. This operation answers query <italic>Q</italic>
4 with <italic>f</italic>
 given as a sequence of symbols. If, however, the query position is given, then <italic>Q</italic>
4 is handled in constant time. GkA is based on three arrays: one for storing the start position of each <italic>k</italic>
-mer, one inverted array telling the lexicographic rank of a <italic>k</italic>
-mer given its position in a read, and finally an array associating to a <italic>k</italic>
-mer’s rank its number of occurrences. The proposed data structure was found to be both more memory efficient and (in most cases) faster than two alternatives, a hash table and a suffix array augmented with some helper tables.</p>
<p>Välimäki and Rivals [<xref rid="pone.0133198.ref029" ref-type="bibr">29</xref>
] proposed a compressed variant of Gk arrays, based on the compressed suffix array (CSA) [<xref rid="pone.0133198.ref030" ref-type="bibr">30</xref>
]. Their index, CGkA, reduces the size of its predecessor by about 40% to 90%, handling most queries with similar time complexity. There is a sampling rate parameter in the CGkA index telling how many, evenly sampled, suffix array and inverted suffix array entries are stored directly. Like GkA, this solution also supports a single value of <italic>k</italic>
.</p>
<p>The index presented in this paper is based on two ideas: building a pseudogenome by finding overlapping reads in the collection, and using the sparse suffix array [<xref rid="pone.0133198.ref008" ref-type="bibr">8</xref>
] as the search engine in the resulting sequence. We performed a number of experiments to compare the proposed PgSA (pseudogenome suffix array) and the existing GkA and CGkA indexes for the supported queries. Then, to see how PgSA would work in a real environment, we replaced the GkA in CRAC [<xref rid="pone.0133198.ref020" ref-type="bibr">20</xref>
] by our index to check its overall memory consumption and processing time.</p>
</sec>
<sec sec-type="materials|methods" id="sec002"><title>Materials and Methods</title>
<p>We assume that the input alphabet contains 4 (<monospace>ACGT</monospace>
) or 5 symbols (<monospace>ACGTN</monospace>
). The actual number of symbols in the input data implies some design choices in the internal representation of our index. By a <italic>pseudogenome</italic>
 we mean a sequence obtained by concatenation of all (possibly reverse-complemented) reads with overlaps. More formally, let us have a read array 𝓡 = [<italic>R</italic>
<sub>1</sub>
, …, <italic>R</italic>
<sub><italic>q</italic>
</sub>
], where <italic>R</italic>
<sub><italic>i</italic>
</sub>
 = <italic>R</italic>
<sub><italic>i</italic>
</sub>
[1…<italic>m</italic>
] for all <italic>i</italic>
 ∈ {1, …, <italic>q</italic>
}. A pseudogenome is a sequence <italic>PG</italic>
[1…<italic>p</italic>] for which
<list list-type="bullet"><list-item><p>there exists a sequence <italic>j</italic>
<sub>1</sub>
, <italic>j</italic>
<sub>2</sub>
, …, <italic>j</italic>
<sub><italic>q</italic>
</sub>
 such that <italic>j</italic>
<sub>1</sub>
 = 1, <italic>j</italic>
<sub><italic>i</italic>
+1</sub>
 − <italic>j</italic>
<sub><italic>i</italic>
</sub>
 ∈ {0, 1, …, <italic>m</italic>
} for all <italic>i</italic>
 ∈ {2, …, <italic>q</italic>
} and <italic>j</italic>
<sub><italic>q</italic>
</sub>
 = <italic>p</italic>
 − <italic>m</italic>
 + 1,</p>
</list-item>
<list-item><p>for each <italic>j</italic>
<sub><italic>i</italic>
</sub>
 we have <italic>PG</italic>
[<italic>j</italic>
<sub><italic>i</italic>
</sub>
…<italic>j</italic>
<sub><italic>i</italic>
</sub>
 + <italic>m</italic>
 − 1] = <italic>R</italic>
<sub><italic>u</italic>
<sub><italic>i</italic>
</sub>
</sub>
 or <italic>PG</italic>
[<italic>j</italic>
<sub><italic>i</italic>
</sub>
…<italic>j</italic>
<sub><italic>i</italic>
</sub>
+<italic>m</italic>
 − 1] = <italic>rc</italic>
(<italic>R</italic>
<sub><italic>u</italic>
<sub><italic>i</italic>
</sub>
</sub>
), where <italic>rc</italic>
(⋅) is the reverse-complement operation on a DNA sequence,</p>
</list-item>
<list-item><p>[<italic>u</italic>
<sub>1</sub>
, <italic>u</italic>
<sub>2</sub>
, …, <italic>u</italic>
<sub><italic>q</italic>
</sub>
] is a permutation of {1, 2, …, <italic>q</italic>
}.</p>
</list-item>
</list>
We attempt to minimize the pseudogenome length <italic>p</italic>
. In further considerations we usually deal with the permuted read array, hence we define it as <inline-formula id="pone.0133198.e001"><alternatives><graphic xlink:href="pone.0133198.e001.jpg" id="pone.0133198.e001g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M1"><mml:mrow><mml:msup><mml:mi mathvariant="script">R</mml:mi>
<mml:mo>′</mml:mo>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>[</mml:mo>
<mml:msub><mml:mi>R</mml:mi>
<mml:msub><mml:mi>u</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub><mml:mi>R</mml:mi>
<mml:msub><mml:mi>u</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
</mml:msub>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
, where the indices <italic>u</italic>
<sub><italic>i</italic>
</sub>
 are described just above. Additionally, two symbols, + and ∘, will be useful. <italic>S</italic>
+<italic>T</italic>
 is a plain concatenation of strings <italic>S</italic>
 and <italic>T</italic>
. <italic>S</italic>
 ∘ <italic>T</italic>
 denotes a concatenation of strings <italic>S</italic>
 and <italic>T</italic>
 with a non-zero overlap of maximal length.</p>
<p>While a sequence approximating a <italic>real</italic>
 genome may be obtained by a de novo assembly procedure, we refrain from it because of two reasons. First, our procedure is lightweight, at least in conceptual and programming sense, while the problem of de novo assembly is known to be hard. Second, removing sequencing errors during the assembly is obviously beneficial for the output accuracy, but we aim at indexing original reads, and mapping the reads onto a “corrected” genomic sequence would complicate the index representation and would possibly be detrimental to query handling effectiveness.</p>
<p>Note that the minimal pseudogenome problem, without allowing the reverse-complement operations on the reads, is known in string matching literature under the name of the shortest common superstring (SCS) problem. SCS is NP-hard, as shown by Maier and Storer [<xref rid="pone.0133198.ref031" ref-type="bibr">31</xref>
].</p>
<p>The pseudogenome is generated in the following way. (<xref ref-type="fig" rid="pone.0133198.g001">Fig 1</xref>
 illustrates the main idea of the construction algorithm.) We keep the reads packed, having 3 symbols (when <italic>σ</italic>
 = 5) or 4 symbols (when <italic>σ</italic>
 = 4) per byte. The alphabet size is found in a preliminary pass over the data. We will say that a read has a prefix (suffix) overlap if it is already preceded (followed) with another read with a non-empty overlap. During the main phase of the algorithm we maintain five main arrays: <italic>P</italic>
, <italic>Q</italic>
, <italic>Q</italic>
′, <italic>S</italic>
, and <italic>S</italic>
′. The main loop of the algorithm is run <italic>m</italic> − 1 times. In each loop iteration, the following invariants are held:
<list list-type="bullet"><list-item><p>The elements of array <italic>P</italic>
 have two fields, the information if the current read (i.e., with the ID given by the current index in <italic>P</italic>
) has a suffix overlap and if so, the ID of the suffix-overlapping read and the overlap length.</p>
</list-item>
<list-item><p>Array <italic>Q</italic>
 always stores the IDs of the reads which are not suffix-overlapping any other reads. The items in <italic>Q</italic>
 are arranged by the lexicographical order of the reads.</p>
</list-item>
<list-item><p>Array <italic>S</italic>
 always stores the IDs of the reads which are not prefix-overlapping any other reads. In <italic>i</italic>
-th loop iteration, <italic>i</italic>
 ≥ 1, they are arranged by the lexicographical order of the suffix of the read starting at the position <italic>i</italic>
.</p>
</list-item>
</list>
</p>
<fig id="pone.0133198.g001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.g001</object-id>
<label>Fig 1</label>
<caption><title>Pseudogenome generation example.</title>
<p>The input read collection 𝓡 contains 6 reads of length 6.</p>
</caption>
<graphic xlink:href="pone.0133198.g001"></graphic>
</fig>
<p>At the start array <italic>S</italic>
′ contains IDs of lexicographically sorted reads. (To obtain sorted reads, we use C++ std::sort, working in <italic>O</italic>
(<italic>mq</italic>
 log <italic>q</italic>
) time. Replacing it with radix sort we could reduce this time complexity to <italic>O</italic>
(<italic>mq</italic>
), yet it was not implemented.) In the initial phase we use array <italic>S</italic>
′ to find reads with an overlap of length <italic>m</italic>
, i.e., duplicates. If consecutive reads in array <italic>S</italic>
′ are identical, then we mark in array <italic>P</italic>
 that the second one is suffix-overlapping the first. While traversing <italic>S</italic>
′ we copy the reads without a suffix overlap to array <italic>S</italic>
 and the reads without a prefix overlap to array <italic>Q</italic>
. The array <italic>Q</italic>
′ is initially empty and <italic>S</italic>
′ is flushed before the main loop. In each loop iteration we traverse the reads from array <italic>S</italic>
, but in the order of their suffixes starting from position <italic>i</italic>
+1. To this end, we need to store <italic>σ</italic>
 pointers for the current suffix in each group defined by the symbol at position <italic>i</italic>
, which allows to find the minimal of the <italic>σ</italic>
 suffixes starting at the next symbol in <italic>O</italic>
(<italic>σ</italic>
) string comparisons. From now on, we assume that <italic>σ</italic>
 = <italic>O</italic>
(1) for DNA, which allows to neglect the <italic>σ</italic>
 factor in the complexities. Note that finding the next read in <italic>S</italic>
 takes <italic>O</italic>
(<italic>σm</italic>
) = <italic>O</italic>
(<italic>m</italic>
) time, which gives <italic>O</italic>
(<italic>qm</italic>
) time for traversing once the whole array. “In parallel”, we also traverse the reads from array <italic>Q</italic>
 in their natural order. This resembles merging two sorted arrays (as used, e.g., in the textbook merge sort), with the difference that we do not sort the strings, but rather look for matches (overlaps) of length exactly <italic>m</italic>
 − <italic>i</italic>
, in the <italic>i</italic>
th iteration. Each check for an overlap takes <italic>O</italic>
(<italic>m</italic>
) time, hence a pass over the arrays of <italic>S</italic>
 and <italic>Q</italic>
 takes <italic>O</italic>
(<italic>qm</italic>
) time. Now, if for a read <italic>x</italic>
 ∈ <italic>S</italic>
 we find a suffix-overlapping read <italic>y</italic>
 ∈ <italic>Q</italic>
, <italic>y</italic>
 ≠ <italic>x</italic>
, we store this information in <italic>P</italic>
 together with the length of the overlap (i.e., <italic>m</italic>
 − <italic>i</italic>
). If there is no overlap (of length <italic>m</italic>
 − <italic>i</italic>
) for <italic>x</italic>
, we copy the ID of <italic>x</italic>
 to array <italic>S</italic>
′. Similarly, if while traversing <italic>Q</italic>
 we have not found any prefix-overlapping read for <italic>y</italic>
 ∈ <italic>Q</italic>
, then we copy its ID to array <italic>Q</italic>
′. When looking for overlaps we have to take care that the overlapped reads do not form a cycle. It is done by storing (in a separate auxiliary array) for each read that is not suffix-overlapped, the ID of the non-prefix-overlapped read in a chain of overlapped reads. For example, if there is a chain of overlapped reads <italic>R</italic>
<sub>1</sub>
 ∘ <italic>R</italic>
<sub>2</sub>
 ∘ <italic>R</italic>
<sub>3</sub>
 ∘ <italic>R</italic>
<sub>4</sub>
, we store for <italic>R</italic>
<sub>4</sub>
 that the “head” of the chain is <italic>R</italic>
<sub>1</sub>
. Then, when we look for a candidate for suffix overlap of <italic>R</italic>
<sub>4</sub>
, we can exclude <italic>R</italic>
<sub>1</sub>
. These data are easily updated in <italic>O</italic>
(1) for each newly found overlap.</p>
<p>After a pass, <italic>S</italic>
′ contains the IDs of only those reads which are not suffix-overlapped yet, sorted by their suffix starting at position <italic>i</italic>
+1 and <italic>Q</italic>
′ contains the IDs of only those reads which are not prefix-overlapped yet (in lexicographical order). The content of <italic>S</italic>
′ and <italic>Q</italic>
′ is then copied to <italic>S</italic>
 and <italic>Q</italic>
, respectively. <italic>S</italic>
′ and <italic>Q</italic>
′ are flushed before the next iteration. (Of course, in a real implementation, the pointers to arrays are simply swapped, without physical array copying.) It can be easily noticed that the time complexity of the construction algorithm is <italic>O</italic>
(<italic>qm</italic>
(<italic>m</italic>
 + log <italic>q</italic>
)). Using radix sort to initially sort the reads in the array <italic>Q</italic>
 would reduce the time complexity to <italic>O</italic>
(<italic>qm</italic>
<sup>2</sup>
).</p>
<p>Note that our current pseudogenome implementation does not handle reverse-complemented reads. Yet, our preliminary experiments with adding reverse-complemented reads to the generated sequence resulted in rather moderate improvement in the pseudogenome length (e.g., shorter by about 15%), while handling the queries requires significant changes in the used data structures (and possibly more space needed for them). For this reason, we leave this harder problem version as future work.</p>
<p>We note that this procedure is only a heuristic and does not guarantee to produce an optimal (shortest possible) pseudogenome. To see this, consider an example of three reads: <italic>R</italic>
<sub>1</sub>
 = <italic>ACAT</italic>
, <italic>R</italic>
<sub>2</sub>
 = <italic>CATG</italic>
 and <italic>R</italic>
<sub>3</sub>
 = <italic>ATCA</italic>
. According to the presented algorithm, we obtain the assembly (<italic>R</italic>
<sub>1</sub>
 ∘ <italic>R</italic>
<sub>2</sub>
)+<italic>R</italic>
<sub>3</sub>
 → <italic>ACATGATCA</italic>
 of length 9. Yet, the assembly (<italic>R</italic>
<sub>1</sub>
 ∘ <italic>R</italic>
<sub>3</sub>
) ∘ <italic>R</italic>
<sub>2</sub>
 → <italic>ACATCATG</italic>
 produces a sequence of length 8.</p>
<p>The actual pseudogenome representation depends on the given data (number of reads, read length etc.). In general it contains the <italic>PG</italic>
 string and the read array 𝓡<sup>PG</sup> consisting of either 9- or 13-byte records. Consecutive records correspond to consecutive reads in the pseudogenome and contain the following fields:
<list list-type="bullet"><list-item><p>read offset in the pseudogenome (4 or 8 bytes, depending on the pseudogenome length),</p>
</list-item>
<list-item><p>flag data occupying 1 byte (repetitive read flag, occurrence flag, single-occurrence flag, to be described later; several bits of this byte are not used),</p>
</list-item>
<list-item><p>read index in the original read array 𝓡 (4 bytes).</p>
</list-item>
</list>
</p>
<p>Over the pseudogenome a search structure is built. Our basic solution is based on the classic suffix array (SA) [<xref rid="pone.0133198.ref028" ref-type="bibr">28</xref>
], as a simple and fast full-text index. The <italic>SA</italic>
<sup>PG</sup> elements require from 4 to 6 bytes. One element, associated with one pseudogenome suffix, stores the following fields:
<list list-type="bullet"><list-item><p>a read array index of the furthest read (of 𝓡<sup>PG</sup>
) containing starting symbols of the given suffix (3 or 4 bytes, depending on the number of reads in the collection),</p>
</list-item>
<list-item><p>start position of the suffix with regard to the beginning of the read (1 or 2 bytes, depending on the read length).</p>
</list-item>
</list>
</p>
<p>In order to access a suffix one has to obtain from the read array 𝓡<sup>PG</sup>
 the offset of the specified read and add an offset of the suffix. Such organization enables straightforward identification of reads containing the sought prefix of the suffix.</p>
<p>Packing DNA symbols into bytes is a standard idea in compact data structures. We adopt this solution for the pseudogenome, in order to reduce the space use, minimize the rate of cache misses during searches and boost string comparisons (due to a lesser number of compared bytes on average). When the alphabet contains 4 symbols we handle the following compaction variants: (<italic>i</italic>
) 2, 3 or 4 symbols per byte, (<italic>ii</italic>
) 5 or 6 symbols per 2-byte unit (“short”). For the 5-symbol alphabet we pack either (<italic>i</italic>
) 2 or 3 symbols per byte, or (<italic>ii</italic>
) 4, 5 or 6 symbols per 2-byte unit.</p>
<p>Apart from the standard variant, we also implement a sparse suffix array (SpaSA) [<xref rid="pone.0133198.ref008" ref-type="bibr">8</xref>
], which samples the suffixes in regular distances from the SA. The distances between sampled suffixes are specified by input parameter <italic>s</italic>
. More precisely, if the pseudogenome is represented with <italic>PG</italic>
[1…<italic>p</italic>
] (w.l.o.g. assume that <italic>s</italic>
 divides <italic>p</italic>
), the SpaSA index contains <italic>p</italic>
/<italic>s</italic>
 suffix offsets: <italic>s</italic>
, 2<italic>s</italic>
, …, <italic>p</italic>
. The data stored for a sampled suffix are like described above, plus <italic>s</italic>
 − 1 preceding symbols, in packed form. We set the <italic>s</italic>
 ≤ 6 limitation. Storing these <italic>s</italic>
 − 1 symbols allows not to access the pseudogenome sequence during a scan over the suffix array (cf. the <italic>Q</italic>
3 query, described later) and is thus cache friendly. More precisely, the idea of storing the <italic>s</italic>
 − 1 symbols directly preceding a given suffix together with the corresponding offset in the sparse suffix array with sparsity <italic>s</italic>
 is to avoid verifying these symbols (of which some or all may belong to the query’s prefix of length at most <italic>s</italic>
 − 1) with an access into the pseudogenome, which resides in another array. In this way we have more local memory accesses. To make the current implementation easier and faster (due to less conditions necessary to check in the search procedure) the sparsity of the suffix array determines the packing of symbols, e.g., <italic>s</italic>
 = 5 means that 5 symbols are packed into 2-byte unit. Note that the <italic>s</italic>
 − 1 packed symbols require up to 2 bytes, hence the whole element for a suffix requires from 5 to 8 bytes.</p>
<p>For small values of <italic>k</italic>
 it is feasible to precompute all answers for the counting queries (<italic>Q</italic>
2, <italic>Q</italic>
4, and <italic>Q</italic>
6). We assume the query is over the 4-symbol alphabet (<monospace>ACGT</monospace>
). When the pseudogenome is small (up to 300 Mbases) we cache the answers for all <italic>k</italic>
 ≤ 10, and for larges pseudogenomes for all <italic>k</italic>
 ≤ 11. The <italic>Q</italic>
2 and <italic>Q</italic>
6 results occupy 4 bytes each and <italic>Q</italic>
4 results 8 bytes. (Handling <italic>Q</italic>
4 needs more space since <italic>f</italic>
 may appear in a single read several times.)</p>
<p>We note that the queries <italic>Q</italic>
2, <italic>Q</italic>
4, and <italic>Q</italic>
6 are related. For example, the number of reads in which string <italic>f</italic>
 occurs only once (<italic>Q</italic>
6) is often not much smaller than the total number of occurrences of <italic>f</italic>
 (<italic>Q</italic>
4), and sometimes these values may be even equal; the equality of <italic>Q</italic>
4 and <italic>Q</italic>
6 also implies the same value of <italic>Q</italic>
2. We make use of this fact and store answers also for <italic>some</italic>
 longer <italic>k</italic>
-mers: up to <italic>k</italic>
 = 12 using 2-byte units and single bytes for <italic>k</italic>
 = 13. The precomputed answers are stored only if <italic>Q</italic>
2 = <italic>Q</italic>
4 = <italic>Q</italic>
6, and <italic>Q</italic>
2 less than 2<sup>16</sup>
 − 1 or 2<sup>8</sup>
 − 1, depending on the used variant. The opposite case is signaled on the 1- or 2-byte field with an unused value.</p>
<p>We call the main variant as variable-<italic>k</italic>
 PgSA. Still, our tool also has a fixed-<italic>k</italic>
 mode, in which the worst case complexities (although not significantly the performance on real data) improve. In this mode, after building the suffix array over the pseudogenome, the suffixes whose prefix of length <italic>k</italic>
 is not a substring of any read are removed from the SA. Such a check is performed for each suffix with a reference to 𝓡<sup>PG</sup>
. Note that the removed suffixes may start only in reads which are overlapped by at most <italic>k</italic>
 − 2 symbols or are not overlapped at all. As each suffix in the found SA range contains at least one occurrence of the query <italic>f</italic>
, the SA range width does not exceed ∣<italic>Q</italic>
3∣.</p>
<p><xref rid="pone.0133198.t001" ref-type="table">Table 1</xref>
 compares the worst-case time complexities for the queries <italic>Q</italic>
1–<italic>Q</italic>
7 of the existing algorithms. We use the notation ∣<italic>Qx</italic>
∣ to represent the number of occurrences reported by query <italic>Qx</italic>
 (for <italic>x</italic>
 = 1, 3, 5, 7). In the following paragraphs we describe how the seven queries are performed in an order dictated by exposition clarity.</p>
<table-wrap id="pone.0133198.t001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.t001</object-id>
<label>Table 1</label>
<caption><title>Worst-case time complexities.</title>
<p>To save space, the <italic>O</italic>
(.) symbols around each formula were omitted. Note that <italic>n</italic>
 = <italic>q</italic>
(<italic>m</italic>
 − <italic>k</italic>
+1). The time complexities for PgSA are given for the fixed-<italic>k</italic>
 mode with SA sparsity set to 1. In the variable-<italic>k</italic>
 mode or when SA sparsity larger than 1 is used, the number of visited <italic>SA</italic>
<sup>PG</sup>
 locations should be added to the PgSA complexities.</p>
</caption>
<alternatives><graphic id="pone.0133198.t001g" xlink:href="pone.0133198.t001"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">query</th>
<th align="left" rowspan="1" colspan="1">GkA (pos)</th>
<th align="left" rowspan="1" colspan="1">CGkA (pos)</th>
<th align="left" rowspan="1" colspan="1">GkA (seq)</th>
<th align="left" rowspan="1" colspan="1">CGkA (seq)</th>
<th align="left" rowspan="1" colspan="1">PgSA (pos/seq)</th>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">Q1</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
1∣ log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>n</italic>
 + ∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log σ + polylog <italic>n</italic>
 + |Q1| log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>p</italic>
 + ∣<italic>Q</italic>
3∣</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Q2</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1">log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>n</italic>
 + ∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log σ + polylog <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>p</italic>
 + ∣<italic>Q</italic>
3∣</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Q3</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
3∣ log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>n</italic>
 + ∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log σ + polylog <italic>n</italic>
 + |Q3| log log</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>p</italic>
 + ∣<italic>Q</italic>
3∣</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Q4</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log σ + polylog <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>p</italic>
 + ∣<italic>Q</italic>
3∣</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Q5</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
5∣ log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>n</italic>
 + ∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log σ + polylog <italic>n</italic>
 + |Q5| log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>p</italic>
 + ∣<italic>Q</italic>
3∣</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Q6</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1">log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>n</italic>
 + ∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log σ + polylog <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>p</italic>
 + ∣<italic>Q</italic>
3∣</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Q7</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1">∣<italic>Q</italic>
7∣ log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>n</italic>
 + ∣<italic>Q</italic>
3∣</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log σ + polylog <italic>n</italic>
 + |Q7| log log <italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
 log <italic>p</italic>
 + ∣<italic>Q</italic>
3∣</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p><italic>Q</italic>
3 We binary search the suffix array <italic>SA</italic>
<sup>PG</sup>
 for the string <italic>f</italic>
, and for each potential match in the found range, pointing to some position in the pseudogenome <italic>PG</italic>
 (represented as a pair: read ID in the read array 𝓡<sup>PG</sup>
 and the suffix offset with regard to the beginning of the read), we check in how many (0 or more) reads <italic>f</italic>
 really occurs. To this end, we check if the suffix offset shifted by <italic>k</italic>
 symbols does not exceed the read length <italic>m</italic>
. If this is the case, we add its position to the output list, otherwise we terminate. Then, we scan over the read array 𝓡<sup>PG</sup>
 backward, adding a position as long as the suffix offset plus <italic>k</italic>
 still does not exceed <italic>m</italic>
. To speed up the binary search over <italic>SA</italic>
<sup>PG</sup>
, we make use of a lookup table (LUT) storing the ranges of suffixes of all possible prefixes of length 11 (note that the number of LUT entries is, depending on the alphabet in a given dataset, 4<sup>11</sup>
 or 5<sup>11</sup>
, which is less than 50M).</p>
<p><italic>Q</italic>
4 We follow the procedure for <italic>Q</italic>
3, but simply count the matches.</p>
<p><italic>Q</italic>
1 This query is related to <italic>Q</italic>
3, but requires filtering, as <italic>f</italic>
 may occur in a read more than once. To this end, “occurrence flags” (stored in flag fields of 𝓡<sup>PG</sup>
) are used. Initially, all these flags are set to false. During the iteration over reads (like in the <italic>Q</italic>
3 query) only non-visited yet reads are added to the output list and for each new read the corresponding flag is set to true. The flag locations are also put on a stack, to remove them in <italic>O</italic>
(∣<italic>Q</italic>
1∣) time at the end, leaving all “occurrence flags” set to false in 𝓡<sup>PG</sup>
. In general ∣<italic>Q</italic>
1∣ ≤ ∣<italic>Q</italic>
3∣, but since the equality often holds, we implemented some optimization. The array 𝓡<sup>PG</sup>
 stores “repetitive read flag” for each read. This flag is true if the read contains at least one 11-mer at least twice. When we process the reads answering the <italic>Q</italic>
1 query we verify the flag. If it is false we are sure that no <italic>f</italic>
 (of length at least 11) can appear in the read more than one time.</p>
<p><italic>Q</italic>
2 This query is to <italic>Q</italic>
1 exactly like <italic>Q</italic>
4 to <italic>Q</italic>
3.</p>
<p><italic>Q</italic>
5 Again, this query is related to <italic>Q</italic>
3, with extra filtration needed. Now “single-occurrence” flags in 𝓡<sup>PG</sup>
 are used. The one-visit only mechanism for reads and unsetting the flags with aid of a stack is identical as in <italic>Q</italic>
1. The operations on the stack take <italic>O</italic>
(∣<italic>Q</italic>
5∣) time, where ∣<italic>Q</italic>
5∣ ≤ ∣<italic>Q</italic>
3∣. Also here the “repetitive read” flags are used as a helpful heuristic.</p>
<p><italic>Q</italic>
6 This query is to <italic>Q</italic>
5 exactly like <italic>Q</italic>
4 to <italic>Q</italic>
3, or <italic>Q</italic>
2 to <italic>Q</italic>
1.</p>
<p><italic>Q</italic>
7 We follow the procedure for <italic>Q</italic>
5, only with replacing read IDs with the match positions.</p>
<p>As a final note, we admit that the flag fields stored in 𝓡<sup>PG</sup>
 prevent multiple threads from querying the data structure concurrently, so the algorithm must be single-threaded. We are going to address this issue in a future version of the algorithm.</p>
</sec>
<sec sec-type="results" id="sec003"><title>Results</title>
<p>We ran experiments to confirm validity of our algorithm in practice. The testbed machine was equipped with an Intel i7 4930K 3.4 GHz CPU and 64 GB of RAM (DDR3-1600, CL11), running Linux 3.13.0-43-generic x86_64 (Ubuntu 14.04.1 LTS). <xref rid="pone.0133198.t002" ref-type="table">Table 2</xref> presents the datasets used in the tests. All these datasets are available at public repositories:
<list list-type="bullet"><list-item><p>E. coli (11.5M reads of 151 bp)—ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R1.fastq.gz, ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R2.fastq.gz, this dataset was used in the CGkA paper [<xref rid="pone.0133198.ref029" ref-type="bibr">29</xref>
],</p>
</list-item>
<list-item><p>GRCh37 (42.4M reads of 75 bp; no N symbols in the data)—<ext-link ext-link-type="uri" xlink:href="http://crac.gforge.inria.fr/index.php?id=genomes-reads">http://crac.gforge.inria.fr/index.php?id = genomes-reads</ext-link>
, this dataset was used in the CRAC paper [<xref rid="pone.0133198.ref020" ref-type="bibr">20</xref>
],</p>
</list-item>
<list-item><p>C. elegans (67.6M reads of 100 bp)—<ext-link ext-link-type="uri" xlink:href="http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/">http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/</ext-link>
.</p>
</list-item>
</list>
The command lines of the examined programs can be found in the PgSA package available at project homepage <ext-link ext-link-type="uri" xlink:href="http://sun.aei.polsl.pl/pgsa">http://sun.aei.polsl.pl/pgsa</ext-link>
.</p>
<table-wrap id="pone.0133198.t002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.t002</object-id>
<label>Table 2</label>
<caption><title>Dataset characteristics.</title>
</caption>
<alternatives><graphic id="pone.0133198.t002g" xlink:href="pone.0133198.t002"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">Dataset</th>
<th align="left" rowspan="1" colspan="1">No. reads [M]</th>
<th align="left" rowspan="1" colspan="1">Read length</th>
<th align="left" rowspan="1" colspan="1">Alphabet size</th>
<th align="left" rowspan="1" colspan="1">PG length [MB]</th>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">E. coli</td>
<td align="char" char="." rowspan="1" colspan="1">11.5</td>
<td align="left" rowspan="1" colspan="1">151</td>
<td align="left" rowspan="1" colspan="1">5</td>
<td align="char" char="." rowspan="1" colspan="1">551.4</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GRCh37</td>
<td align="char" char="." rowspan="1" colspan="1">42.4</td>
<td align="left" rowspan="1" colspan="1">75</td>
<td align="left" rowspan="1" colspan="1">4</td>
<td align="char" char="." rowspan="1" colspan="1">567.9</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">C. elegans</td>
<td align="char" char="." rowspan="1" colspan="1">67.6</td>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">5</td>
<td align="char" char="." rowspan="1" colspan="1">1603.1</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>In the first experiments, we compare PgSA versus GkA (version 2.1.0) and CGkA (version cgka_2013_08_21) on two datasets, E. coli and GRCh37-75bp-simulated reads (Figs <xref ref-type="fig" rid="pone.0133198.g002">2</xref>
, <xref ref-type="fig" rid="pone.0133198.g003">3</xref>
, <xref ref-type="fig" rid="pone.0133198.g004">4</xref>
, <xref ref-type="fig" rid="pone.0133198.g005">5</xref>
). We can see that in <italic>Q</italic>
1 and <italic>Q</italic>
3 queries PgSA is by more than an order of magnitude faster than CGkA at comparable or better compression rate. As expected, GkA is faster than CGkA (and sometimes faster, although not very significantly, than PgSA), yet requiring at least 3 times more space. The speed relation is different for <italic>Q</italic>
2 and <italic>Q</italic>
4 queries. Here CGkA defeats PgSA, sometimes by an order of magnitude. In the <italic>Q</italic>
4 query, given by position, GkA is a clear winner in speed. We note that the tested (latest) GkA version (v2.1.0) does not support <italic>Q</italic>
1, <italic>Q</italic>
2 and <italic>Q</italic>
4 when the query is given as a sequence rather than a position. Overall, we believe that PgSA offers attractive space-time tradeoffs for most queries, and in contrast to its competitors it handles arbitrary values of <italic>k</italic>
 (rather than a fixed one). Additionally, we note that the latest GkA and CGkA versions do not support the <italic>Q</italic>
5–<italic>Q</italic>
7 queries.</p>
<fig id="pone.0133198.g002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.g002</object-id>
<label>Fig 2</label>
<caption><title><italic>Q</italic>
1 query results on E. coli (top row) and H. sapiens (bottom row) data.</title>
<p>The three points in PgSA series correspond to sparsity <italic>s</italic>
 = 6 for the leftmost point, <italic>s</italic>
 = 3 (E. coli) or <italic>s</italic>
 = 4 (H. sapiens) for the middle point and <italic>s</italic>
 = 1 for the rightmost point. The three points in CGk series correspond to sampling rates <italic>sr</italic>
 of 512, 25 and 6 (E. coli), and 512, 22 and 6 (H. sapiens), respectively. On the left figures the query is given as a position in the read list, while on the right ones as a string.</p>
</caption>
<graphic xlink:href="pone.0133198.g002"></graphic>
</fig>
<fig id="pone.0133198.g003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.g003</object-id>
<label>Fig 3</label>
<caption><title><italic>Q</italic>
2 query results on E. coli (top row) and H. sapiens (bottom row) datasets.</title>
<p>The three points in PgSA series correspond to sparsity <italic>s</italic>
 = 6 for the leftmost point, <italic>s</italic>
 = 3 (E. coli) or <italic>s</italic>
 = 4 (H. sapiens) for the middle point and <italic>s</italic>
 = 1 for the rightmost point. The three points in CGk series correspond to sampling rates <italic>sr</italic>
 of 512, 25 and 6 (E. coli), and 512, 22 and 6 (H. sapiens), respectively. On the left figures the query is given as a position in the read list, while on the right ones as a string.</p>
</caption>
<graphic xlink:href="pone.0133198.g003"></graphic>
</fig>
<fig id="pone.0133198.g004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.g004</object-id>
<label>Fig 4</label>
<caption><title><italic>Q</italic>
3 query results on E. coli (top row) and H. sapiens (bottom row) datasets.</title>
<p>The three points in PgSA series correspond to sparsity <italic>s</italic>
 = 6 for the leftmost point, <italic>s</italic>
 = 3 (E. coli) or <italic>s</italic>
 = 4 (H. sapiens) for the middle point and <italic>s</italic>
 = 1 for the rightmost point. The three points in CGk series correspond to sampling rates <italic>sr</italic>
 of 512, 25 and 6 (E. coli), and 512, 22 and 6 (H. sapiens), respectively. On the left figures the query is given as a position in the read list, while on the right ones as a string.</p>
</caption>
<graphic xlink:href="pone.0133198.g004"></graphic>
</fig>
<fig id="pone.0133198.g005" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.g005</object-id>
<label>Fig 5</label>
<caption><title><italic>Q</italic>
4 query results on E. coli (top row) and H. sapiens (bottom row) datasets.</title>
<p>The three points in PgSA series correspond to sparsity <italic>s</italic>
 = 6 for the leftmost point, <italic>s</italic>
 = 3 (E. coli) or <italic>s</italic>
 = 4 (H. sapiens) for the middle point and <italic>s</italic>
 = 1 for the rightmost point. The three points in CGk series correspond to sampling rates <italic>sr</italic>
 of 512, 25 and 6 (E. coli), and 512, 22 and 6 (H. sapiens), respectively. On the left figures the query is given as a position in the read list, while on the right ones as a string.</p>
</caption>
<graphic xlink:href="pone.0133198.g005"></graphic>
</fig>
<p>In the next experiment we ran only PgSA and GkA on C. elegans dataset (<xref ref-type="fig" rid="pone.0133198.g006">Fig 6</xref>
). We were not able to run CGkA on this dataset. The PgSA lines on the figures are for the queries <italic>Q</italic>
1–<italic>Q</italic>
7 given as a sequence (the time differences with regard to queries given as a position are up to 1 percent), and the left and right figure corresponds to the query length <italic>k</italic>
 = 11 and <italic>k</italic>
 = 16, respectively. Note that the results for the queries <italic>Q</italic>
2, <italic>Q</italic>
4, and <italic>Q</italic>
6 are precomputed (cached) for <italic>k</italic>
 = 11.</p>
<fig id="pone.0133198.g006" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.g006</object-id>
<label>Fig 6</label>
<caption><title><italic>Q</italic>
1–<italic>Q</italic>
7 query results of PgSA and GkA on C. elegans dataset.</title>
<p>The six points in the series correspond (from left to right) to sparsities <italic>s</italic>
 = 6, …, 1. The letter ‘p’ appended to some query names means that the query is given as a position in the read list.</p>
</caption>
<graphic xlink:href="pone.0133198.g006"></graphic>
</fig>
<p>In Tables <xref rid="pone.0133198.t003" ref-type="table">3</xref>
 and <xref rid="pone.0133198.t004" ref-type="table">4</xref>
 we detail out how much space is consumed by the components of the PgSA solution.</p>
<table-wrap id="pone.0133198.t003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.t003</object-id>
<label>Table 3</label>
<caption><title>E. coli dataset, space consumption.</title>
<p>All sizes in megabytes.</p>
</caption>
<alternatives><graphic id="pone.0133198.t003g" xlink:href="pone.0133198.t003"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">SA sparsity</th>
<th align="left" rowspan="1" colspan="1"><italic>PG</italic>
</th>
<th align="left" rowspan="1" colspan="1">𝓡<sup>PG</sup>
</th>
<th align="left" rowspan="1" colspan="1"><italic>SA</italic>
<sup>PG</sup>
</th>
<th align="left" rowspan="1" colspan="1"><italic>LUT</italic>
</th>
<th align="left" rowspan="1" colspan="1">total</th>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">551</td>
<td align="left" rowspan="1" colspan="1">149</td>
<td align="left" rowspan="1" colspan="1">2205</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">3101</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">276</td>
<td align="left" rowspan="1" colspan="1">149</td>
<td align="left" rowspan="1" colspan="1">1378</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">1999</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">3</td>
<td align="left" rowspan="1" colspan="1">184</td>
<td align="left" rowspan="1" colspan="1">149</td>
<td align="left" rowspan="1" colspan="1">919</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">1447</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">276</td>
<td align="left" rowspan="1" colspan="1">149</td>
<td align="left" rowspan="1" colspan="1">689</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">1309</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">221</td>
<td align="left" rowspan="1" colspan="1">149</td>
<td align="left" rowspan="1" colspan="1">662</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">1227</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">184</td>
<td align="left" rowspan="1" colspan="1">149</td>
<td align="left" rowspan="1" colspan="1">551</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">1080</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0133198.t004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.t004</object-id>
<label>Table 4</label>
<caption><title>C. elegans dataset, space consumption.</title>
<p>All sizes in megabytes.</p>
</caption>
<alternatives><graphic id="pone.0133198.t004g" xlink:href="pone.0133198.t004"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">SA sparsity</th>
<th align="left" rowspan="1" colspan="1"><italic>PG</italic>
</th>
<th align="left" rowspan="1" colspan="1">𝓡<sup>PG</sup>
</th>
<th align="left" rowspan="1" colspan="1"><italic>SA</italic>
<sup>PG</sup>
</th>
<th align="left" rowspan="1" colspan="1"><italic>LUT</italic>
</th>
<th align="left" rowspan="1" colspan="1">total</th>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1603</td>
<td align="left" rowspan="1" colspan="1">879</td>
<td align="left" rowspan="1" colspan="1">8016</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">10693</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">802</td>
<td align="left" rowspan="1" colspan="1">879</td>
<td align="left" rowspan="1" colspan="1">4809</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">6685</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">3</td>
<td align="left" rowspan="1" colspan="1">534</td>
<td align="left" rowspan="1" colspan="1">879</td>
<td align="left" rowspan="1" colspan="1">3206</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">4814</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">802</td>
<td align="left" rowspan="1" colspan="1">879</td>
<td align="left" rowspan="1" colspan="1">2405</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">4280</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">641</td>
<td align="left" rowspan="1" colspan="1">879</td>
<td align="left" rowspan="1" colspan="1">2244</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">3959</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">534</td>
<td align="left" rowspan="1" colspan="1">879</td>
<td align="left" rowspan="1" colspan="1">1870</td>
<td align="left" rowspan="1" colspan="1">195</td>
<td align="left" rowspan="1" colspan="1">3479</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>It may be informative to show the times and maximum memory usages for particular phases of the PgSA index construction. They are revealed in <xref rid="pone.0133198.t005" ref-type="table">Table 5</xref>
, for the variant based on the plain suffix array (i.e., sparsity <italic>s</italic>
 = 1). Morover, <xref rid="pone.0133198.t006" ref-type="table">Table 6</xref>
 contains index construction time, peak constructiontime memory usages and index spaces for the three solutions: GkArrays, CGk, and PgSA.</p>
<table-wrap id="pone.0133198.t005" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.t005</object-id>
<label>Table 5</label>
<caption><title>Times and maximum memory usages for the PgSA index construction phases.</title>
</caption>
<alternatives><graphic id="pone.0133198.t005g" xlink:href="pone.0133198.t005"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th rowspan="2" align="left" colspan="1"></th>
<th colspan="3" align="center" rowspan="1">Dataset</th>
</tr>
<tr><th align="left" rowspan="1" colspan="1">E. coli</th>
<th align="left" rowspan="1" colspan="1">GRCh37</th>
<th align="left" rowspan="1" colspan="1">C. elegans</th>
</tr>
</thead>
<tbody><tr><td colspan="4" align="center" rowspan="1"><bold>Maximal space usage [MB]</bold>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Pseudogenome (RAM)</td>
<td align="left" rowspan="1" colspan="1">1,361</td>
<td align="left" rowspan="1" colspan="1">2,193</td>
<td align="left" rowspan="1" colspan="1">5,258</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Suffix array (HDD)</td>
<td align="left" rowspan="1" colspan="1">2,206</td>
<td align="left" rowspan="1" colspan="1">2,272</td>
<td align="left" rowspan="1" colspan="1">6,413</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Total (RAM)</td>
<td align="left" rowspan="1" colspan="1">3,028</td>
<td align="left" rowspan="1" colspan="1">3,787</td>
<td align="left" rowspan="1" colspan="1">10,278</td>
</tr>
<tr><td colspan="4" align="center" rowspan="1"><bold>Time [s]</bold>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Pseudogenome construction</td>
<td align="left" rowspan="1" colspan="1">189</td>
<td align="left" rowspan="1" colspan="1">219</td>
<td align="left" rowspan="1" colspan="1">603</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Repetitive reads filter calculation</td>
<td align="left" rowspan="1" colspan="1">9</td>
<td align="left" rowspan="1" colspan="1">7</td>
<td align="left" rowspan="1" colspan="1">23</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Suffix array construction</td>
<td align="left" rowspan="1" colspan="1">221</td>
<td align="left" rowspan="1" colspan="1">236</td>
<td align="left" rowspan="1" colspan="1">733</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">SA lookup construction</td>
<td align="left" rowspan="1" colspan="1">13</td>
<td align="left" rowspan="1" colspan="1">9</td>
<td align="left" rowspan="1" colspan="1">33</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Total (including I/O)</td>
<td align="left" rowspan="1" colspan="1">452</td>
<td align="left" rowspan="1" colspan="1">509</td>
<td align="left" rowspan="1" colspan="1">1,476</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0133198.t006" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.t006</object-id>
<label>Table 6</label>
<caption><title>Index construction times and memory usages for the GkArrays, CGk and PgSA algorithms. For the CGkA algorithm <italic>sr</italic>
 denotes the sampling rate parameter, being a space-time tradeoff. CGkA crashed on the C. elegans dataset, in all tested configurations. GkA index is not written to disk, as opposed to the other two tools.</title>
</caption>
<alternatives><graphic id="pone.0133198.t006g" xlink:href="pone.0133198.t006"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">Index</th>
<th align="left" rowspan="1" colspan="1">Index space [MB]</th>
<th align="left" rowspan="1" colspan="1">Max. working space [MB]</th>
<th align="left" rowspan="1" colspan="1">User + system time [s]</th>
</tr>
</thead>
<tbody><tr><td colspan="4" align="center" rowspan="1"><bold>E. coli</bold>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GkA, <italic>k</italic>
 = 11</td>
<td align="left" rowspan="1" colspan="1">12,500</td>
<td align="left" rowspan="1" colspan="1">19,358</td>
<td align="left" rowspan="1" colspan="1">494</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GkA, <italic>k</italic>
 = 22</td>
<td align="left" rowspan="1" colspan="1">12,400</td>
<td align="left" rowspan="1" colspan="1">17,881</td>
<td align="left" rowspan="1" colspan="1">439</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 8), <italic>k</italic>
 = 11</td>
<td align="left" rowspan="1" colspan="1">3,120</td>
<td align="left" rowspan="1" colspan="1">3,858</td>
<td align="left" rowspan="1" colspan="1">1,171</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 8), <italic>k</italic>
 = 22</td>
<td align="left" rowspan="1" colspan="1">3,120</td>
<td align="left" rowspan="1" colspan="1">3,859</td>
<td align="left" rowspan="1" colspan="1">1,235</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 128), <italic>k</italic>
 = 11</td>
<td align="left" rowspan="1" colspan="1">1,538</td>
<td align="left" rowspan="1" colspan="1">3,857</td>
<td align="left" rowspan="1" colspan="1">1,128</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 128), <italic>k</italic>
 = 22</td>
<td align="left" rowspan="1" colspan="1">1,538</td>
<td align="left" rowspan="1" colspan="1">3,859</td>
<td align="left" rowspan="1" colspan="1">1,181</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 1), var-k</td>
<td align="left" rowspan="1" colspan="1">3,101</td>
<td align="left" rowspan="1" colspan="1">3,028</td>
<td align="left" rowspan="1" colspan="1">452</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 2), var-k</td>
<td align="left" rowspan="1" colspan="1">1,999</td>
<td align="left" rowspan="1" colspan="1">1,951</td>
<td align="left" rowspan="1" colspan="1">394</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 3), var-k</td>
<td align="left" rowspan="1" colspan="1">1,447</td>
<td align="left" rowspan="1" colspan="1">1,411</td>
<td align="left" rowspan="1" colspan="1">344</td>
</tr>
<tr><td colspan="4" align="center" rowspan="1"><bold>GRCh37</bold>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GkA, <italic>k</italic>
 = 11</td>
<td align="left" rowspan="1" colspan="1">21,300</td>
<td align="left" rowspan="1" colspan="1">32,887</td>
<td align="left" rowspan="1" colspan="1">844</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GkA, <italic>k</italic>
 = 22</td>
<td align="left" rowspan="1" colspan="1">19,360</td>
<td align="left" rowspan="1" colspan="1">27,615</td>
<td align="left" rowspan="1" colspan="1">695</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 8), <italic>k</italic>
 = 11</td>
<td align="left" rowspan="1" colspan="1">3,983</td>
<td align="left" rowspan="1" colspan="1">7,930</td>
<td align="left" rowspan="1" colspan="1">1,313</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 8), <italic>k</italic>
 = 22</td>
<td align="left" rowspan="1" colspan="1">3,983</td>
<td align="left" rowspan="1" colspan="1">7,930</td>
<td align="left" rowspan="1" colspan="1">1,400</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 128), <italic>k</italic>
 = 11</td>
<td align="left" rowspan="1" colspan="1">2,957</td>
<td align="left" rowspan="1" colspan="1">7,930</td>
<td align="left" rowspan="1" colspan="1">1,280</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CGkA (<italic>sr</italic>
 = 128), <italic>k</italic>
 = 22</td>
<td align="left" rowspan="1" colspan="1">2,957</td>
<td align="left" rowspan="1" colspan="1">7,930</td>
<td align="left" rowspan="1" colspan="1">1,395</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 1), var-k</td>
<td align="left" rowspan="1" colspan="1">3,975</td>
<td align="left" rowspan="1" colspan="1">3,787</td>
<td align="left" rowspan="1" colspan="1">509</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 2), var-k</td>
<td align="left" rowspan="1" colspan="1">2,556</td>
<td align="left" rowspan="1" colspan="1">2,401</td>
<td align="left" rowspan="1" colspan="1">421</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 3), var-k</td>
<td align="left" rowspan="1" colspan="1">1,893</td>
<td align="left" rowspan="1" colspan="1">2,181</td>
<td align="left" rowspan="1" colspan="1">378</td>
</tr>
<tr><td colspan="4" align="center" rowspan="1"><bold>C. elegans</bold>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GkA, <italic>k</italic>
 = 11</td>
<td align="left" rowspan="1" colspan="1">44,500</td>
<td align="left" rowspan="1" colspan="1">62,728</td>
<td align="left" rowspan="1" colspan="1">2,486</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GkA, <italic>k</italic>
 = 22</td>
<td align="left" rowspan="1" colspan="1">39,300</td>
<td align="left" rowspan="1" colspan="1">62,740</td>
<td align="left" rowspan="1" colspan="1">2,295</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 1), var-k</td>
<td align="left" rowspan="1" colspan="1">10,693</td>
<td align="left" rowspan="1" colspan="1">10,278</td>
<td align="left" rowspan="1" colspan="1">1,476</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 2), var-k</td>
<td align="left" rowspan="1" colspan="1">6,685</td>
<td align="left" rowspan="1" colspan="1">6,364</td>
<td align="left" rowspan="1" colspan="1">1,275</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA (<italic>s</italic>
 = 3), var-k</td>
<td align="left" rowspan="1" colspan="1">4,814</td>
<td align="left" rowspan="1" colspan="1">5,134</td>
<td align="left" rowspan="1" colspan="1">1,065</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>Finally, we checked how replacing GkA with PgSA affects the CRAC performance (<xref rid="pone.0133198.t007" ref-type="table">Table 7</xref>
). We used CRAC v1.3.2 (<ext-link ext-link-type="uri" xlink:href="http://crac.gforge.inria.fr">http://crac.gforge.inria.fr</ext-link>
) and the dataset GRCh37. Unfortunately, the build time grows several times (and even including the CRAC processing time the difference is at least by factor 2), yet the memory requirements of the PgSA-based variant are significantly lower, which may be a crucial benefit if a low-end workstation is only available.</p>
<table-wrap id="pone.0133198.t007" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0133198.t007</object-id>
<label>Table 7</label>
<caption><title>CRAC, <italic>k</italic>
 = 22, on the dataset GRCh37. Times in minutes, sizes in gigabytes.</title>
</caption>
<alternatives><graphic id="pone.0133198.t007g" xlink:href="pone.0133198.t007"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">Type</th>
<th align="left" rowspan="1" colspan="1">Build time</th>
<th align="left" rowspan="1" colspan="1">Build+CRAC time</th>
<th align="left" rowspan="1" colspan="1">Index size</th>
<th align="left" rowspan="1" colspan="1">Max mem. (build)</th>
<th align="left" rowspan="1" colspan="1">Max mem. (CRAC)</th>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">PgSA, <italic>s</italic>
 = 1</td>
<td align="char" char="." rowspan="1" colspan="1">8.48</td>
<td align="char" char="." rowspan="1" colspan="1">418.98</td>
<td align="char" char="." rowspan="1" colspan="1">3.98</td>
<td align="char" char="." rowspan="1" colspan="1">3.79</td>
<td align="char" char="." rowspan="1" colspan="1">6.40</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PgSA, <italic>s</italic>
 = 4</td>
<td align="char" char="." rowspan="1" colspan="1">7.02</td>
<td align="char" char="." rowspan="1" colspan="1">517.04</td>
<td align="char" char="." rowspan="1" colspan="1">1.56</td>
<td align="char" char="." rowspan="1" colspan="1">2.40</td>
<td align="char" char="." rowspan="1" colspan="1">3.48</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">GkA</td>
<td align="char" char="." rowspan="1" colspan="1">11.57</td>
<td align="char" char="." rowspan="1" colspan="1">220.88</td>
<td align="char" char="." rowspan="1" colspan="1">20.30</td>
<td align="char" char="." rowspan="1" colspan="1">27.60</td>
<td align="char" char="." rowspan="1" colspan="1">21.98</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
</sec>
<sec sec-type="conclusions" id="sec004"><title>Discussion</title>
<p>We proposed a new indexing structure for read collections. The experiments proved that this structure is much more compact than the existing solutions, GkA and CGkA. The running times of the counting queries are worse than of the CGkA, but in the listing queries PgSA is usually faster.</p>
<p>Several aspects of the presented scheme can be improved. We have noticed that using both direct and reverse-complemented reads in our construction of the pseudogenome reduces its size by about 15%. Still, this easy change for the construction is not equally easy to handle during the search, therefore the current implementation refrains from it. Additionally, our recent progress with read compression [<xref rid="pone.0133198.ref032" ref-type="bibr">32</xref>
] suggests to build the pseudogenome from large datasets on disk (disk-based SA construction algorithms also exist, see, e.g., [<xref rid="pone.0133198.ref033" ref-type="bibr">33</xref>
] and references therein). Finally, the sparse suffix array may be replaced by a recent sparse index, SamSAMi (sampled suffix array with minimizers) [<xref rid="pone.0133198.ref034" ref-type="bibr">34</xref>
], with hopefully better performance.</p>
</sec>
</body>
<back><ack><p>The Polish National Science Centre under the project DEC-2012/05/B/ST6/03148. The infrastructure supported by POIG.02.03.01-24-099/13 grant: ‘GeCONiI—Upper Silesian Center for Computational Science and Engineering’. We thank the anonymous reviewers for constructive comments helping to improve the manuscript.</p>
</ack>
<ref-list><title>References</title>
<ref id="pone.0133198.ref001"><label>1</label>
<mixed-citation publication-type="other">National Human Genome Research Institute. DNA Sequencing Costs; 2015. <ext-link ext-link-type="uri" xlink:href="http://www.genome.gov/sequencingcosts/">http://www.genome.gov/sequencingcosts/</ext-link>
.</mixed-citation>
</ref>
<ref id="pone.0133198.ref002"><label>2</label>
<mixed-citation publication-type="other">Hayden EC. Is the $1,000 genome for real?; 2014. Nature News.</mixed-citation>
</ref>
<ref id="pone.0133198.ref003"><label>3</label>
<mixed-citation publication-type="other">U S Department of Veteran Affairs. Million Veteran Program; 2015. <ext-link ext-link-type="uri" xlink:href="http://www.research.va.gov/mvp/">http://www.research.va.gov/mvp/</ext-link>
.</mixed-citation>
</ref>
<ref id="pone.0133198.ref004"><label>4</label>
<mixed-citation publication-type="book"><name><surname>Gusfield</surname>
<given-names>D</given-names>
</name>
. <source>Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology</source>
. <publisher-name>Cambridge University Press</publisher-name>
; <year>1997</year>
.</mixed-citation>
</ref>
<ref id="pone.0133198.ref005"><label>5</label>
<mixed-citation publication-type="other">Ferragina P, Manzini G. Opportunistic data structures with applications. In: Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on. IEEE; 2000. p. 390–398.</mixed-citation>
</ref>
<ref id="pone.0133198.ref006"><label>6</label>
<mixed-citation publication-type="journal"><name><surname>Langmead</surname>
<given-names>B</given-names>
</name>
, <name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
. <article-title>Fast gapped-read alignment with Bowtie</article-title>
. <source>Nature Methods</source>
. <year>2012</year>
;<volume>9</volume>
:<fpage>357</fpage>
–<lpage>359</lpage>
. <pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id>
<pub-id pub-id-type="pmid">22388286</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref007"><label>7</label>
<mixed-citation publication-type="journal"><name><surname>Li</surname>
<given-names>H</given-names>
</name>
, <name><surname>Durbin</surname>
<given-names>R</given-names>
</name>
. <article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
. <source>Bioinformatics</source>
. <year>2009</year>
;<volume>25</volume>
(<issue>14</issue>
):<fpage>1754</fpage>
–<lpage>1760</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btp324</pub-id>
<pub-id pub-id-type="pmid">19451168</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref008"><label>8</label>
<mixed-citation publication-type="other">Kärkkäinen J, Ukkonen E. Sparse suffix trees. In: Proceedings of the 2nd Annual International Conference on Computing and Combinatorics; 1996. p. 219–230.</mixed-citation>
</ref>
<ref id="pone.0133198.ref009"><label>9</label>
<mixed-citation publication-type="journal"><name><surname>Danek</surname>
<given-names>A</given-names>
</name>
, <name><surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
, <name><surname>Grabowski</surname>
<given-names>S</given-names>
</name>
. <article-title>Indexes of large genome collections on a PC</article-title>
. <source>PLoS ONE</source>
. <year>2014</year>
;<volume>9</volume>
(<issue>10</issue>
):<fpage>e109384</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0109384</pub-id>
<pub-id pub-id-type="pmid">25289699</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref010"><label>10</label>
<mixed-citation publication-type="journal"><name><surname>Kelly</surname>
<given-names>DR</given-names>
</name>
, <name><surname>Schatz</surname>
<given-names>MC</given-names>
</name>
, <name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
. <article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
. <source>Genome Biology</source>
. <year>2010</year>
;<volume>11</volume>
(<issue>R116</issue>
).</mixed-citation>
</ref>
<ref id="pone.0133198.ref011"><label>11</label>
<mixed-citation publication-type="journal"><name><surname>Ilie</surname>
<given-names>L</given-names>
</name>
, <name><surname>Molnar</surname>
<given-names>M</given-names>
</name>
. <article-title>RACER: Rapid and accurate correction of errors in reads</article-title>
. <source>Bioinformatics</source>
. <year>2013</year>
;<volume>29</volume>
(<issue>19</issue>
):<fpage>2490</fpage>
–<lpage>2493</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt407</pub-id>
<pub-id pub-id-type="pmid">23853064</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref012"><label>12</label>
<mixed-citation publication-type="journal"><name><surname>Heo</surname>
<given-names>Y</given-names>
</name>
, <name><surname>Wu</surname>
<given-names>XL</given-names>
</name>
, <name><surname>Chen</surname>
<given-names>D</given-names>
</name>
, <name><surname>Ma</surname>
<given-names>J</given-names>
</name>
, <name><surname>Hwu</surname>
<given-names>WM</given-names>
</name>
. <article-title>BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads</article-title>
. <source>Bioinformatics</source>
. <year>2014</year>
;<volume>30</volume>
(<issue>10</issue>
):<fpage>1354</fpage>
–<lpage>1362</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btu030</pub-id>
<pub-id pub-id-type="pmid">24451628</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref013"><label>13</label>
<mixed-citation publication-type="journal"><name><surname>Schulz</surname>
<given-names>MH</given-names>
</name>
, <name><surname>Weese</surname>
<given-names>D</given-names>
</name>
, <name><surname>Holtgrewe</surname>
<given-names>M</given-names>
</name>
, <name><surname>Dimitrova</surname>
<given-names>V</given-names>
</name>
, <name><surname>Niu</surname>
<given-names>S</given-names>
</name>
, <name><surname>Reinert</surname>
<given-names>K</given-names>
</name>
, <etal>et al</etal>
<article-title>Fiona: a parallel and automatic strategy for read error correction</article-title>
. <source>Bioinformatics</source>
. <year>2014</year>
;<volume>30</volume>
(<issue>17</issue>
):<fpage>i356</fpage>
–<lpage>i363</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btu440</pub-id>
<pub-id pub-id-type="pmid">25161220</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref014"><label>14</label>
<mixed-citation publication-type="other">Molnar M, Ilie L. Correcting Illumina data. Briefings in Bioinformatics. 2014;p.</mixed-citation>
</ref>
<ref id="pone.0133198.ref015"><label>15</label>
<mixed-citation publication-type="journal"><name><surname>Zhang</surname>
<given-names>J</given-names>
</name>
, <name><surname>Kobert</surname>
<given-names>K</given-names>
</name>
, <name><surname>Flouri</surname>
<given-names>T</given-names>
</name>
, <name><surname>Stamatakis</surname>
<given-names>A</given-names>
</name>
. <article-title>PEAR: a fast and accurate Illumina Paired-End reAd mergeR</article-title>
. <source>Bioinformatics</source>
. <year>2014</year>
;<volume>30</volume>
(<issue>5</issue>
):<fpage>614</fpage>
–<lpage>620</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt593</pub-id>
<pub-id pub-id-type="pmid">24142950</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref016"><label>16</label>
<mixed-citation publication-type="journal"><name><surname>Ames</surname>
<given-names>SK</given-names>
</name>
, <name><surname>Hysom</surname>
<given-names>DA</given-names>
</name>
, <name><surname>Gardner</surname>
<given-names>SN</given-names>
</name>
, <name><surname>Lloyd</surname>
<given-names>GS</given-names>
</name>
, <name><surname>Gokhale</surname>
<given-names>MB</given-names>
</name>
, <name><surname>Allen</surname>
<given-names>JE</given-names>
</name>
. <article-title>Scalable metagenomic taxonomy classification using a reference genome database</article-title>
. <source>Bioinformatics</source>
. <year>2013</year>
;<volume>29</volume>
(<issue>18</issue>
):<fpage>2253</fpage>
–<lpage>2260</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt389</pub-id>
<pub-id pub-id-type="pmid">23828782</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref017"><label>17</label>
<mixed-citation publication-type="journal"><name><surname>Wood</surname>
<given-names>D</given-names>
</name>
, <name><surname>Salzberg</surname>
<given-names>S</given-names>
</name>
. <article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
. <source>Genome Biology</source>
. <year>2014</year>
;<volume>15</volume>
(<issue>3</issue>
):<fpage>R46</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
<pub-id pub-id-type="pmid">24580807</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref018"><label>18</label>
<mixed-citation publication-type="journal"><name><surname>Bazinet</surname>
<given-names>AL</given-names>
</name>
, <name><surname>Cummings</surname>
<given-names>MP</given-names>
</name>
. <article-title>A comparative evaluation of sequence classification programs</article-title>
. <source>BMC Bioinformatics</source>
. <year>2012</year>
;<volume>13</volume>
:<fpage>1</fpage>
–<lpage>13</lpage>
. <pub-id pub-id-type="doi">10.1186/1471-2105-13-92</pub-id>
<pub-id pub-id-type="pmid">22214541</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref019"><label>19</label>
<mixed-citation publication-type="journal"><name><surname>Philippe</surname>
<given-names>N</given-names>
</name>
, <name><surname>Salson</surname>
<given-names>M</given-names>
</name>
, <name><surname>Lecroq</surname>
<given-names>T</given-names>
</name>
, <name><surname>Léonard</surname>
<given-names>M</given-names>
</name>
, <name><surname>Commes</surname>
<given-names>T</given-names>
</name>
, <name><surname>Rivals</surname>
<given-names>E</given-names>
</name>
. <article-title>Querying large read collections in main memory: a versatile data structure</article-title>
. <source>BMC Bioinformatics</source>
. <year>2011</year>
;<volume>12</volume>
:Paper no. 242. <pub-id pub-id-type="doi">10.1186/1471-2105-12-242</pub-id>
<pub-id pub-id-type="pmid">21682852</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref020"><label>20</label>
<mixed-citation publication-type="journal"><name><surname>Philippe</surname>
<given-names>N</given-names>
</name>
, <name><surname>Salson</surname>
<given-names>M</given-names>
</name>
, <name><surname>Commes</surname>
<given-names>T</given-names>
</name>
, <name><surname>Rivals</surname>
<given-names>E</given-names>
</name>
. <article-title>CRAC: an integrated approach to the analysis of RNA-seq reads</article-title>
. <source>Genome Biology</source>
. <year>2013</year>
;<volume>14</volume>
(<issue>3</issue>
):<fpage>R30</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2013-14-3-r30</pub-id>
<pub-id pub-id-type="pmid">23537109</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref021"><label>21</label>
<mixed-citation publication-type="journal"><name><surname>Rizk</surname>
<given-names>G</given-names>
</name>
, <name><surname>Lavenier</surname>
<given-names>D</given-names>
</name>
, <name><surname>Chikhi</surname>
<given-names>R</given-names>
</name>
. <article-title>DSK: k-mer counting with very low memory usage</article-title>
. <source>Bioinformatics</source>
. <year>2013</year>
;<volume>29</volume>
(<issue>5</issue>
):<fpage>652</fpage>
–<lpage>653</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt020</pub-id>
<pub-id pub-id-type="pmid">23325618</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref022"><label>22</label>
<mixed-citation publication-type="journal"><name><surname>Marçais</surname>
<given-names>G</given-names>
</name>
, <name><surname>Kingsford</surname>
<given-names>C</given-names>
</name>
. <article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
. <source>Bioinformatics</source>
. <year>2011</year>
;<volume>27</volume>
(<issue>6</issue>
):<fpage>764</fpage>
–<lpage>770</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref023"><label>23</label>
<mixed-citation publication-type="journal"><name><surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
, <name><surname>Debudaj-Grabysz</surname>
<given-names>A</given-names>
</name>
, <name><surname>Grabowski</surname>
<given-names>S</given-names>
</name>
. <article-title>Disk-based k-mer counting on a PC</article-title>
. <source>BMC Bioinformatics</source>
. <year>2013</year>
;<volume>14</volume>
:<fpage>160</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-14-160</pub-id>
<pub-id pub-id-type="pmid">23679007</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref024"><label>24</label>
<mixed-citation publication-type="journal"><name><surname>Schröder</surname>
<given-names>J</given-names>
</name>
, <name><surname>Schröder</surname>
<given-names>H</given-names>
</name>
, <name><surname>Puglisi</surname>
<given-names>SJ</given-names>
</name>
, <name><surname>Sinha</surname>
<given-names>R</given-names>
</name>
, <name><surname>Schmidt</surname>
<given-names>B</given-names>
</name>
. <article-title>SHREC: a short-read error correction method</article-title>
. <source>Bioinformatics</source>
. <year>2009</year>
;<volume>25</volume>
(<issue>17</issue>
):<fpage>2157</fpage>
–<lpage>2163</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btp379</pub-id>
<pub-id pub-id-type="pmid">19542152</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref025"><label>25</label>
<mixed-citation publication-type="journal"><name><surname>Salmela</surname>
<given-names>L</given-names>
</name>
. <article-title>Correction of sequencing errors in a mixed set of reads</article-title>
. <source>Bioinformatics</source>
. <year>2010</year>
;<volume>26</volume>
(<issue>10</issue>
):<fpage>1284</fpage>
–<lpage>1290</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btq151</pub-id>
<pub-id pub-id-type="pmid">20378555</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref026"><label>26</label>
<mixed-citation publication-type="journal"><name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
, <name><surname>Pertea</surname>
<given-names>M</given-names>
</name>
, <name><surname>Fahrner</surname>
<given-names>JA</given-names>
</name>
, <name><surname>Sobreira</surname>
<given-names>N</given-names>
</name>
. <article-title>DIAMUND: Direct Comparison of Genomes to Detect Mutations</article-title>
. <source>Human Mutation</source>
. <year>2014</year>
;<volume>35</volume>
(<issue>3</issue>
):<fpage>283</fpage>
–<lpage>288</lpage>
. <pub-id pub-id-type="doi">10.1002/humu.22503</pub-id>
<pub-id pub-id-type="pmid">24375697</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref027"><label>27</label>
<mixed-citation publication-type="journal"><name><surname>Kurtz</surname>
<given-names>S</given-names>
</name>
, <name><surname>Phillippy</surname>
<given-names>A</given-names>
</name>
, <name><surname>Delcher</surname>
<given-names>AL</given-names>
</name>
, <name><surname>Smoot</surname>
<given-names>M</given-names>
</name>
, <name><surname>Shumway</surname>
<given-names>M</given-names>
</name>
, <name><surname>Antonescu</surname>
<given-names>C</given-names>
</name>
, <etal>et al</etal>
<article-title>Versatile and open software for comparing large genomes</article-title>
. <source>Genome Biology</source>
. <year>2004</year>
;<volume>5</volume>
(<issue>2</issue>
):<fpage>R12</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2004-5-2-r12</pub-id>
<pub-id pub-id-type="pmid">14759262</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref028"><label>28</label>
<mixed-citation publication-type="journal"><name><surname>Manber</surname>
<given-names>U</given-names>
</name>
, <name><surname>Myers</surname>
<given-names>G</given-names>
</name>
. <article-title>Suffix arrays: a new method for on-line string searches</article-title>
. <source>SIAM Journal on Computing</source>
. <year>1993</year>
;<volume>22</volume>
(<issue>5</issue>
):<fpage>935</fpage>
–<lpage>948</lpage>
. <pub-id pub-id-type="doi">10.1137/0222058</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref029"><label>29</label>
<mixed-citation publication-type="other">Välimäki N, Rivals E. Scalable and versatile k-mer indexing for high-throughput sequencing data. In: Proceedings of the 9th International Symposium on Bioinformatics Research and Applications; 2013. p. 237–248.</mixed-citation>
</ref>
<ref id="pone.0133198.ref030"><label>30</label>
<mixed-citation publication-type="other">Grossi R, Gupta A, Vitter JS. High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete algorithms; 2003. p. 841–850.</mixed-citation>
</ref>
<ref id="pone.0133198.ref031"><label>31</label>
<mixed-citation publication-type="book"><name><surname>Maier</surname>
<given-names>D</given-names>
</name>
, <name><surname>Storer</surname>
<given-names>JA</given-names>
</name>
. <source>A Note on the Complexity of the Superstring Problem</source>
. <publisher-name>Princeton University</publisher-name>
; <year>1997</year>
<fpage>233</fpage>
.</mixed-citation>
</ref>
<ref id="pone.0133198.ref032"><label>32</label>
<mixed-citation publication-type="journal"><name><surname>Grabowski</surname>
<given-names>S</given-names>
</name>
, <name><surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
, <name><surname>Roguski</surname>
<given-names>L</given-names>
</name>
. <article-title>Disk-based compression of data from genome sequencing</article-title>
. <source>Bioinformatics</source>
. <year>2015</year>
;<volume>31</volume>
(<issue>9</issue>
):<fpage>1389</fpage>
–<lpage>1395</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btu844</pub-id>
<pub-id pub-id-type="pmid">25536966</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0133198.ref033"><label>33</label>
<mixed-citation publication-type="other">Bingmann T, Fischer J, Osipov V. Inducing Suffix and Lcp Arrays in External Memory. In: Proceedings of the 15th Meeting on Algorithm Engineering and Experiments (ALENEX); 2013. p. 88–102.</mixed-citation>
</ref>
<ref id="pone.0133198.ref034"><label>34</label>
<mixed-citation publication-type="other">Grabowski S, Raniszewski M. Sampling the suffix array with minimizers; 2014. Publicly available preprint arXiv:1406.2348v2.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001009 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 001009 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4504488
   |texte=   Indexing Arbitrary-Length k-Mers in Sequencing Reads
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26182400" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	Serveur d'exploration MERS
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration MERS

Indexing Arbitrary-Length k-Mers in Sequencing Reads

Indexing Arbitrary-Length k-Mers in Sequencing Reads

Source :

Abstract

Links to Exploration step

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri

Pour générer des pages wiki