Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Querying large read collections in main memory: a versatile data structure

Identifieur interne : 000A91 ( Pmc/Corpus ); précédent : 000A90; suivant : 000A92

Querying large read collections in main memory: a versatile data structure

Auteurs : Nicolas Philippe ; Mikaël Salson ; Thierry Lecroq ; Martine Léonard ; Thérèse Commes ; Eric Rivals

Source :

RBID : PMC:3163563

Abstract

Background

High Throughput Sequencing (HTS) is now heavily exploited for genome (re-) sequencing, metagenomics, epigenomics, and transcriptomics and requires different, but computer intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome, and the k-mers of the reads using hash-tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires to determine the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput urges for new algorithmic solutions to query large read collections efficiently.

Results

Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).

Conclusions

Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible brick to design innovative programs that mine efficiently genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under Cecill (GPL compliant) license from http://www.atgc-montpellier.fr/ngs/.


Url:
DOI: 10.1186/1471-2105-12-242
PubMed: 21682852
PubMed Central: 3163563

Links to Exploration step

PMC:3163563

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Querying large read collections in main memory: a versatile data structure</title>
<author>
<name sortKey="Philippe, Nicolas" sort="Philippe, Nicolas" uniqKey="Philippe N" first="Nicolas" last="Philippe">Nicolas Philippe</name>
<affiliation>
<nlm:aff id="I1">LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">CRBM, UMR 5237 CNRS, 1919 Route de Mende, 34293 Montpellier cedex 5, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Salson, Mikael" sort="Salson, Mikael" uniqKey="Salson M" first="Mikaël" last="Salson">Mikaël Salson</name>
<affiliation>
<nlm:aff id="I3">LITIS EA 4108, Université de Rouen, 1 rue Thomas Becket, 76821 Mont-Saint-Aignan Cedex, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I4">LIFL, UMR 8022 CNRS and Université Lille 1 and INRIA Lille-Nord-Europe, Bât. M3 - UFR IEEA, 59655 Villeneuve d'Ascq Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lecroq, Thierry" sort="Lecroq, Thierry" uniqKey="Lecroq T" first="Thierry" last="Lecroq">Thierry Lecroq</name>
<affiliation>
<nlm:aff id="I3">LITIS EA 4108, Université de Rouen, 1 rue Thomas Becket, 76821 Mont-Saint-Aignan Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leonard, Martine" sort="Leonard, Martine" uniqKey="Leonard M" first="Martine" last="Léonard">Martine Léonard</name>
<affiliation>
<nlm:aff id="I3">LITIS EA 4108, Université de Rouen, 1 rue Thomas Becket, 76821 Mont-Saint-Aignan Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Commes, Therese" sort="Commes, Therese" uniqKey="Commes T" first="Thérèse" last="Commes">Thérèse Commes</name>
<affiliation>
<nlm:aff id="I2">CRBM, UMR 5237 CNRS, 1919 Route de Mende, 34293 Montpellier cedex 5, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rivals, Eric" sort="Rivals, Eric" uniqKey="Rivals E" first="Eric" last="Rivals">Eric Rivals</name>
<affiliation>
<nlm:aff id="I1">LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21682852</idno>
<idno type="pmc">3163563</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163563</idno>
<idno type="RBID">PMC:3163563</idno>
<idno type="doi">10.1186/1471-2105-12-242</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000A91</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000A91</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Querying large read collections in main memory: a versatile data structure</title>
<author>
<name sortKey="Philippe, Nicolas" sort="Philippe, Nicolas" uniqKey="Philippe N" first="Nicolas" last="Philippe">Nicolas Philippe</name>
<affiliation>
<nlm:aff id="I1">LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">CRBM, UMR 5237 CNRS, 1919 Route de Mende, 34293 Montpellier cedex 5, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Salson, Mikael" sort="Salson, Mikael" uniqKey="Salson M" first="Mikaël" last="Salson">Mikaël Salson</name>
<affiliation>
<nlm:aff id="I3">LITIS EA 4108, Université de Rouen, 1 rue Thomas Becket, 76821 Mont-Saint-Aignan Cedex, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I4">LIFL, UMR 8022 CNRS and Université Lille 1 and INRIA Lille-Nord-Europe, Bât. M3 - UFR IEEA, 59655 Villeneuve d'Ascq Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lecroq, Thierry" sort="Lecroq, Thierry" uniqKey="Lecroq T" first="Thierry" last="Lecroq">Thierry Lecroq</name>
<affiliation>
<nlm:aff id="I3">LITIS EA 4108, Université de Rouen, 1 rue Thomas Becket, 76821 Mont-Saint-Aignan Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leonard, Martine" sort="Leonard, Martine" uniqKey="Leonard M" first="Martine" last="Léonard">Martine Léonard</name>
<affiliation>
<nlm:aff id="I3">LITIS EA 4108, Université de Rouen, 1 rue Thomas Becket, 76821 Mont-Saint-Aignan Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Commes, Therese" sort="Commes, Therese" uniqKey="Commes T" first="Thérèse" last="Commes">Thérèse Commes</name>
<affiliation>
<nlm:aff id="I2">CRBM, UMR 5237 CNRS, 1919 Route de Mende, 34293 Montpellier cedex 5, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rivals, Eric" sort="Rivals, Eric" uniqKey="Rivals E" first="Eric" last="Rivals">Eric Rivals</name>
<affiliation>
<nlm:aff id="I1">LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>High Throughput Sequencing (HTS) is now heavily exploited for genome (re-) sequencing, metagenomics, epigenomics, and transcriptomics and requires different, but computer intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome, and the
<italic>k</italic>
-mers of the reads using hash-tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires to determine the sub-collection of reads that are related to a given sequence, which is done by searching for some
<italic>k</italic>
-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput urges for new algorithmic solutions to query large read collections efficiently.</p>
</sec>
<sec>
<title>Results</title>
<p>Here, we present a solution, named
<italic>Gk </italic>
arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a
<italic>k</italic>
-mer, get the reads containing this
<italic>k</italic>
-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).</p>
</sec>
<sec>
<title>Conclusions</title>
<p>
<italic>Gk </italic>
arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The
<italic>Gk </italic>
arrays provide a flexible brick to design innovative programs that mine efficiently genomics, epigenomics, metagenomics, or transcriptomics reads. The
<italic>Gk </italic>
arrays library is available under Cecill (GPL compliant) license from
<ext-link ext-link-type="uri" xlink:href="http://www.atgc-montpellier.fr/ngs/">http://www.atgc-montpellier.fr/ngs/</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Maher, C" uniqKey="Maher C">C Maher</name>
</author>
<author>
<name sortKey="Kumar Sinha, C" uniqKey="Kumar Sinha C">C Kumar-Sinha</name>
</author>
<author>
<name sortKey="Cao, X" uniqKey="Cao X">X Cao</name>
</author>
<author>
<name sortKey="Kalyana Sundaram, S" uniqKey="Kalyana Sundaram S">S Kalyana-Sundaram</name>
</author>
<author>
<name sortKey="Han, B" uniqKey="Han B">B Han</name>
</author>
<author>
<name sortKey="Jing, X" uniqKey="Jing X">X Jing</name>
</author>
<author>
<name sortKey="Sam, L" uniqKey="Sam L">L Sam</name>
</author>
<author>
<name sortKey="Barrette, T" uniqKey="Barrette T">T Barrette</name>
</author>
<author>
<name sortKey="Palanisamy, N" uniqKey="Palanisamy N">N Palanisamy</name>
</author>
<author>
<name sortKey="Chinnaiyan, A" uniqKey="Chinnaiyan A">A Chinnaiyan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blow, N" uniqKey="Blow N">N Blow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, S" uniqKey="Altschul S">S Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, E" uniqKey="Myers E">E Myers</name>
</author>
<author>
<name sortKey="Lipman, D" uniqKey="Lipman D">D Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burkhardt, S" uniqKey="Burkhardt S">S Burkhardt</name>
</author>
<author>
<name sortKey="Crauser, A" uniqKey="Crauser A">A Crauser</name>
</author>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Lenhof, Hp" uniqKey="Lenhof H">HP Lenhof</name>
</author>
<author>
<name sortKey="Rivals, E" uniqKey="Rivals E">E Rivals</name>
</author>
<author>
<name sortKey="Vingron, M" uniqKey="Vingron M">M Vingron</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hach, F" uniqKey="Hach F">F Hach</name>
</author>
<author>
<name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Eichler, E" uniqKey="Eichler E">E Eichler</name>
</author>
<author>
<name sortKey="Sahinalp, S" uniqKey="Sahinalp S">S Sahinalp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weiner, P" uniqKey="Weiner P">P Weiner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manber, U" uniqKey="Manber U">U Manber</name>
</author>
<author>
<name sortKey="Myers, Gw" uniqKey="Myers G">GW Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gusfield, D" uniqKey="Gusfield D">D Gusfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shi, F" uniqKey="Shi F">F Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Gonzalez, R" uniqKey="Gonzalez R">R González</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Venturini, R" uniqKey="Venturini R">R Venturini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Homann, R" uniqKey="Homann R">R Homann</name>
</author>
<author>
<name sortKey="Fleer, D" uniqKey="Fleer D">D Fleer</name>
</author>
<author>
<name sortKey="Giegerich, R" uniqKey="Giegerich R">R Giegerich</name>
</author>
<author>
<name sortKey="Rehmsmeier, M" uniqKey="Rehmsmeier M">M Rehmsmeier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Philippe, N" uniqKey="Philippe N">N Philippe</name>
</author>
<author>
<name sortKey="Boureux, A" uniqKey="Boureux A">A Boureux</name>
</author>
<author>
<name sortKey="Tarhio, J" uniqKey="Tarhio J">J Tarhio</name>
</author>
<author>
<name sortKey="Brehelin, L" uniqKey="Brehelin L">L Bréhélin</name>
</author>
<author>
<name sortKey="Commes, T" uniqKey="Commes T">T Commes</name>
</author>
<author>
<name sortKey="Rivals, E" uniqKey="Rivals E">E Rivals</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Denoeud, F" uniqKey="Denoeud F">F Denoeud</name>
</author>
<author>
<name sortKey="Aury, Jm" uniqKey="Aury J">JM Aury</name>
</author>
<author>
<name sortKey="Da Silva, C" uniqKey="Da Silva C">C Da Silva</name>
</author>
<author>
<name sortKey="Noel, B" uniqKey="Noel B">B Noel</name>
</author>
<author>
<name sortKey="Rogier, O" uniqKey="Rogier O">O Rogier</name>
</author>
<author>
<name sortKey="Delledonne, M" uniqKey="Delledonne M">M Delledonne</name>
</author>
<author>
<name sortKey="Morgante, M" uniqKey="Morgante M">M Morgante</name>
</author>
<author>
<name sortKey="Valle, G" uniqKey="Valle G">G Valle</name>
</author>
<author>
<name sortKey="Wincker, P" uniqKey="Wincker P">P Wincker</name>
</author>
<author>
<name sortKey="Scarpelli, C" uniqKey="Scarpelli C">C Scarpelli</name>
</author>
<author>
<name sortKey="Jaillon, O" uniqKey="Jaillon O">O Jaillon</name>
</author>
<author>
<name sortKey="Artiguenave, F" uniqKey="Artiguenave F">F Artiguenave</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Williams, Ba" uniqKey="Williams B">BA Williams</name>
</author>
<author>
<name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author>
<name sortKey="Mortazavi, A" uniqKey="Mortazavi A">A Mortazavi</name>
</author>
<author>
<name sortKey="Kwan, G" uniqKey="Kwan G">G Kwan</name>
</author>
<author>
<name sortKey="Van Baren, Mj" uniqKey="Van Baren M">MJ van Baren</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Wold, Bj" uniqKey="Wold B">BJ Wold</name>
</author>
<author>
<name sortKey="Pachter, L" uniqKey="Pachter L">L Pachter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Jr" uniqKey="Miller J">JR Miller</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Conway, Tc" uniqKey="Conway T">TC Conway</name>
</author>
<author>
<name sortKey="Bromage, Aj" uniqKey="Bromage A">AJ Bromage</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cormen, Th" uniqKey="Cormen T">TH Cormen</name>
</author>
<author>
<name sortKey="Leiserson, Ce" uniqKey="Leiserson C">CE Leiserson</name>
</author>
<author>
<name sortKey="Rivest, Rl" uniqKey="Rivest R">RL Rivest</name>
</author>
<author>
<name sortKey="Stein, C" uniqKey="Stein C">C Stein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Munro, I" uniqKey="Munro I">I Munro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raman, R" uniqKey="Raman R">R Raman</name>
</author>
<author>
<name sortKey="Raman, V" uniqKey="Raman V">V Raman</name>
</author>
<author>
<name sortKey="Rao, S" uniqKey="Rao S">S Rao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Phillippy, A" uniqKey="Phillippy A">A Phillippy</name>
</author>
<author>
<name sortKey="Delcher, A" uniqKey="Delcher A">A Delcher</name>
</author>
<author>
<name sortKey="Smoot, M" uniqKey="Smoot M">M Smoot</name>
</author>
<author>
<name sortKey="Shumway, M" uniqKey="Shumway M">M Shumway</name>
</author>
<author>
<name sortKey="Antonescu, C" uniqKey="Antonescu C">C Antonescu</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kasai, T" uniqKey="Kasai T">T Kasai</name>
</author>
<author>
<name sortKey="Lee, G" uniqKey="Lee G">G Lee</name>
</author>
<author>
<name sortKey="Arimura, H" uniqKey="Arimura H">H Arimura</name>
</author>
<author>
<name sortKey="Arikawa, S" uniqKey="Arikawa S">S Arikawa</name>
</author>
<author>
<name sortKey="Park, K" uniqKey="Park K">K Park</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Puglisi, Sj" uniqKey="Puglisi S">SJ Puglisi</name>
</author>
<author>
<name sortKey="Smyth, Wf" uniqKey="Smyth W">WF Smyth</name>
</author>
<author>
<name sortKey="Turpin, A" uniqKey="Turpin A">A Turpin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Ratan, A" uniqKey="Ratan A">A Ratan</name>
</author>
<author>
<name sortKey="Tomsho, Lp" uniqKey="Tomsho L">LP Tomsho</name>
</author>
<author>
<name sortKey="Giardine, B" uniqKey="Giardine B">B Giardine</name>
</author>
<author>
<name sortKey="Kasson, Lr" uniqKey="Kasson L">LR Kasson</name>
</author>
<author>
<name sortKey="Harris, Rs" uniqKey="Harris R">RS Harris</name>
</author>
<author>
<name sortKey="Petersen, Dc" uniqKey="Petersen D">DC Petersen</name>
</author>
<author>
<name sortKey="Zhao, F" uniqKey="Zhao F">F Zhao</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Kidd, Jm" uniqKey="Kidd J">JM Kidd</name>
</author>
<author>
<name sortKey="Sun, Y" uniqKey="Sun Y">Y Sun</name>
</author>
<author>
<name sortKey="Drautz, Di" uniqKey="Drautz D">DI Drautz</name>
</author>
<author>
<name sortKey="Bouffard, P" uniqKey="Bouffard P">P Bouffard</name>
</author>
<author>
<name sortKey="Muzny, Dm" uniqKey="Muzny D">DM Muzny</name>
</author>
<author>
<name sortKey="Reid, Jg" uniqKey="Reid J">JG Reid</name>
</author>
<author>
<name sortKey="Nazareth, Lv" uniqKey="Nazareth L">LV Nazareth</name>
</author>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="Burhans, R" uniqKey="Burhans R">R Burhans</name>
</author>
<author>
<name sortKey="Riemer, C" uniqKey="Riemer C">C Riemer</name>
</author>
<author>
<name sortKey="Wittekindt, Ne" uniqKey="Wittekindt N">NE Wittekindt</name>
</author>
<author>
<name sortKey="Moorjani, P" uniqKey="Moorjani P">P Moorjani</name>
</author>
<author>
<name sortKey="Tindall, Ea" uniqKey="Tindall E">EA Tindall</name>
</author>
<author>
<name sortKey="Danko, Cg" uniqKey="Danko C">CG Danko</name>
</author>
<author>
<name sortKey="Teo, Ws" uniqKey="Teo W">WS Teo</name>
</author>
<author>
<name sortKey="Buboltz, Am" uniqKey="Buboltz A">AM Buboltz</name>
</author>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Ma, Q" uniqKey="Ma Q">Q Ma</name>
</author>
<author>
<name sortKey="Oosthuysen, A" uniqKey="Oosthuysen A">A Oosthuysen</name>
</author>
<author>
<name sortKey="Steenkamp, Aw" uniqKey="Steenkamp A">AW Steenkamp</name>
</author>
<author>
<name sortKey="Oostuisen, H" uniqKey="Oostuisen H">H Oostuisen</name>
</author>
<author>
<name sortKey="Venter, P" uniqKey="Venter P">P Venter</name>
</author>
<author>
<name sortKey="Gajewski, J" uniqKey="Gajewski J">J Gajewski</name>
</author>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author>
<name sortKey="Pugh, Bf" uniqKey="Pugh B">BF Pugh</name>
</author>
<author>
<name sortKey="Makova, Kd" uniqKey="Makova K">KD Makova</name>
</author>
<author>
<name sortKey="Nekrutenko, A" uniqKey="Nekrutenko A">A Nekrutenko</name>
</author>
<author>
<name sortKey="Mardis, Er" uniqKey="Mardis E">ER Mardis</name>
</author>
<author>
<name sortKey="Patterson, N" uniqKey="Patterson N">N Patterson</name>
</author>
<author>
<name sortKey="Pringle, Th" uniqKey="Pringle T">TH Pringle</name>
</author>
<author>
<name sortKey="Chiaromonte, F" uniqKey="Chiaromonte F">F Chiaromonte</name>
</author>
<author>
<name sortKey="Mullikin, Jc" uniqKey="Mullikin J">JC Mullikin</name>
</author>
<author>
<name sortKey="Eichler, Ee" uniqKey="Eichler E">EE Eichler</name>
</author>
<author>
<name sortKey="Hardison, Rc" uniqKey="Hardison R">RC Hardison</name>
</author>
<author>
<name sortKey="Gibbs, Ra" uniqKey="Gibbs R">RA Gibbs</name>
</author>
<author>
<name sortKey="Harkins, Tt" uniqKey="Harkins T">TT Harkins</name>
</author>
<author>
<name sortKey="Hayes, Vm" uniqKey="Hayes V">VM Hayes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Homer, N" uniqKey="Homer N">N Homer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schroder, H" uniqKey="Schroder H">H Schröder</name>
</author>
<author>
<name sortKey="Puglisi, Sj" uniqKey="Puglisi S">SJ Puglisi</name>
</author>
<author>
<name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ilie, L" uniqKey="Ilie L">L Ilie</name>
</author>
<author>
<name sortKey="Fazayeli, F" uniqKey="Fazayeli F">F Fazayeli</name>
</author>
<author>
<name sortKey="Ilie, S" uniqKey="Ilie S">S Ilie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salmela, L" uniqKey="Salmela L">L Salmela</name>
</author>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21682852</article-id>
<article-id pub-id-type="pmc">3163563</article-id>
<article-id pub-id-type="publisher-id">1471-2105-12-242</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-12-242</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Querying large read collections in main memory: a versatile data structure</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Philippe</surname>
<given-names>Nicolas</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>nphilippe@lirmm.fr</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Salson</surname>
<given-names>Mikaël</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<xref ref-type="aff" rid="I4">4</xref>
<email>Mikael.Salson@lifl.fr</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Lecroq</surname>
<given-names>Thierry</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>Thierry.Lecroq@univ-rouen.fr</email>
</contrib>
<contrib contrib-type="author" id="A4">
<name>
<surname>Léonard</surname>
<given-names>Martine</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>Martine.Leonard@univ-rouen.fr</email>
</contrib>
<contrib contrib-type="author" id="A5">
<name>
<surname>Commes</surname>
<given-names>Thérèse</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>commes@univ-montp2.fr</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A6">
<name>
<surname>Rivals</surname>
<given-names>Eric</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>rivals@lirmm.fr</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France</aff>
<aff id="I2">
<label>2</label>
CRBM, UMR 5237 CNRS, 1919 Route de Mende, 34293 Montpellier cedex 5, France</aff>
<aff id="I3">
<label>3</label>
LITIS EA 4108, Université de Rouen, 1 rue Thomas Becket, 76821 Mont-Saint-Aignan Cedex, France</aff>
<aff id="I4">
<label>4</label>
LIFL, UMR 8022 CNRS and Université Lille 1 and INRIA Lille-Nord-Europe, Bât. M3 - UFR IEEA, 59655 Villeneuve d'Ascq Cedex, France</aff>
<pub-date pub-type="collection">
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>17</day>
<month>6</month>
<year>2011</year>
</pub-date>
<volume>12</volume>
<fpage>242</fpage>
<lpage>242</lpage>
<history>
<date date-type="received">
<day>26</day>
<month>11</month>
<year>2010</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>6</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2011 Philippe et al; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2011</copyright-year>
<copyright-holder>Philippe et al; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/12/242"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>High Throughput Sequencing (HTS) is now heavily exploited for genome (re-) sequencing, metagenomics, epigenomics, and transcriptomics and requires different, but computer intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome, and the
<italic>k</italic>
-mers of the reads using hash-tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires to determine the sub-collection of reads that are related to a given sequence, which is done by searching for some
<italic>k</italic>
-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput urges for new algorithmic solutions to query large read collections efficiently.</p>
</sec>
<sec>
<title>Results</title>
<p>Here, we present a solution, named
<italic>Gk </italic>
arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a
<italic>k</italic>
-mer, get the reads containing this
<italic>k</italic>
-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).</p>
</sec>
<sec>
<title>Conclusions</title>
<p>
<italic>Gk </italic>
arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The
<italic>Gk </italic>
arrays provide a flexible brick to design innovative programs that mine efficiently genomics, epigenomics, metagenomics, or transcriptomics reads. The
<italic>Gk </italic>
arrays library is available under Cecill (GPL compliant) license from
<ext-link ext-link-type="uri" xlink:href="http://www.atgc-montpellier.fr/ngs/">http://www.atgc-montpellier.fr/ngs/</ext-link>
.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Next-generation sequencing technologies are presently being used to answer key biological questions at the scale of the entire genome and with unprecedented depth. Whether determining genetic or genomic variations, cataloging transcripts and assessing their expression levels, identifying DNA-protein interactions or chromatin modifications, surveying the species diversity in an environmental sample, all these tasks are now tackled with High Throughput Sequencing (HTS) and require different, but computer intensive bioinformatic analyses. Typically, a recent RNA sequencing experiment (RNA-Seq) produces about 8 million reads of 75 base pairs each [
<xref ref-type="bibr" rid="B1">1</xref>
], but both the yield and read length will increase [
<xref ref-type="bibr" rid="B2">2</xref>
].</p>
<p>Mapping the reads against a reference genome provides the genomic positions of mapped reads. For instance with RNA-Seq reads, these positions allow to know whether a gene is expressed in the studied condition. The set of mapped positions represents only part of the information needed to analyze the reads, and it can be obtained only if a genome is available. Indeed, other important information are contained in the read collection itself. For instance, to determine the frequency of haplotypes at a SNP position, one needs to align the reads related to this position. These can be obtained by considering for some length
<italic>k</italic>
, the
<italic>k</italic>
-mers overlapping the SNP and searching for the reads sharing this
<italic>k</italic>
-mer. This procedure is applicable even in the absence of a reference genome, and similar ones can be designed to search for a binding motif in ChIP-Seq reads, to determine with RNA-Seq data whether different regions of a messenger RNA sequence are susceptible to be differentially expressed, etc.</p>
<p>For tasks like assembly or read clustering, one needs to determine reads overlapping each other or that align partly one to another. Numerous works on similarity search algorithms have developed seed-and-extend strategies and shown that it can be performed efficiently by searching common
<italic>k</italic>
-mers between two sequences [
<xref ref-type="bibr" rid="B3">3</xref>
,
<xref ref-type="bibr" rid="B4">4</xref>
].</p>
<p>Surely, now and even more in the near future, we will need efficient indexing data structures to store and query large collections of reads in main memory. Up to now, a lot of computational research has been devoted to read mapping, and the most efficient tools owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform [
<xref ref-type="bibr" rid="B5">5</xref>
]. On the other hand, the question of read indexing remains quite unexplored, although the improvements in sequencing throughput suggest that such structures will become a compulsory part of future read analysis programs. A sign supporting this view: even mapping programs now start to index both the genome and the
<italic>k</italic>
-mers of the reads to boost efficiency and accuracy [
<xref ref-type="bibr" rid="B6">6</xref>
].</p>
<p>Numerous works have presented data structures to index a single text, like the well known Suffix Tree (ST) or the Suffix Array (SA) [
<xref ref-type="bibr" rid="B7">7</xref>
,
<xref ref-type="bibr" rid="B8">8</xref>
]. These enable the so-called
<italic>locate </italic>
query, that is to locate all occurrences of a pattern
<italic>P </italic>
either from its sequence or from a position
<italic>j </italic>
of occurrence in the text, as well as
<italic>count </italic>
query to obtain the number of occurrences of
<italic>P</italic>
. These structures can be adapted to index a
<italic>set of texts</italic>
, where each text differ from each other; the structures are then called
<italic>generalized </italic>
Suffix Tree [
<xref ref-type="bibr" rid="B9">9</xref>
], or
<italic>generalized </italic>
Suffix Array (gSA) [
<xref ref-type="bibr" rid="B10">10</xref>
]. This is done by concatenating all texts and adding a separator symbol that does not belong to the alphabet (
<italic>e.g</italic>
., a $ for the DNA alphabet) after each text [
<xref ref-type="bibr" rid="B9">9</xref>
], or directly [
<xref ref-type="bibr" rid="B10">10</xref>
]. Then it requires to store the length of each text in an additional array to correctly answer locate queries. Such algorithms have not been adapted to
<italic>collections </italic>
of texts, where two texts may be equal in sequence but differ in their identifier. The reads obtained from sequencers form a collection, not a set.</p>
<p>When the total text is too large,
<italic>compressed indexes </italic>
reduce the memory needed by storing not all, but only a certain proportion of the text positions. Compression is obtained by sampling the positions to be stored, while non sampled positions need to be recomputed at run time. This enables the user to control the balance between amount of memory and query time. Hence, compression has an impact on the time needed to compute a query. Ferragina
<italic>et al</italic>
. report in a large practical evaluation of compressed text indexes, that the query time of all tested compressed indexes are between 100 and 1,000 times slower than with a plain SA for an index that is 5 times smaller [
<xref ref-type="bibr" rid="B11">11</xref>
]. The FM-index [
<xref ref-type="bibr" rid="B5">5</xref>
] is used to index all chromosomes in mapping applications [
<xref ref-type="bibr" rid="B12">12</xref>
]. However, the scalability of neither plain nor compressed indexes to collections of millions of texts has not been investigated so far. We thus address the question of indexing large collections of reads with an uncompressed index and compare its performance to a generalized suffix array and a hash table. Our structure aims to save space compared to those indexes while globally retaining queries as fast. Thus we avoid the pitfall of compressed indexes which are less space consuming but slower by orders of magnitude.</p>
<p>In this work, we propose a new data structure to index reads, an algorithm to build the structure, and procedures to query it. Our structure, named
<italic>Gk </italic>
arrays, is kept in main memory once built and repeatedly accessed to answer different kinds of queries like "given a
<italic>k</italic>
-mer, get the reads containing this
<italic>k</italic>
-mer (once/at least once)". One can ask both for the
<italic>k</italic>
-mer positions or simply for the reads containing it, which can prove useful in different applications. We focus on cases where millions of queries need to be computed; clearly, memory usage will be the key issue. An alternative solution is to adapt some uncompressed indexing structures designed for long texts (suffix tree or suffix array [
<xref ref-type="bibr" rid="B9">9</xref>
,
<xref ref-type="bibr" rid="B13">13</xref>
]). We compare
<italic>Gk </italic>
arrays to such an alternative and show experimentally that they process queries fast, while requiring much less memory (between 2/3 and 1/3 of a suffix array solution). We also perform experimental comparisons against a method using hash table: it shows that while the hash table method can answer quickly to queries it does not scale to large collections of reads.</p>
<p>If in biology the term
<italic>k</italic>
-mer is preferred, computer scientists rather use the equivalent words of
<italic>k</italic>
-factor or
<italic>k</italic>
-substring; we will stick to the term
<italic>k</italic>
-mer. The
<italic>Gk </italic>
arrays allow to answer queries related to an input
<italic>k</italic>
-mer; let us call these
<italic>k</italic>
-mer queries. Before entering the algorithms description, we list below the applications of
<italic>k</italic>
-mer queries in the analysis of High Throughput Sequencing data. The Results section will first present our data structure, its construction algorithm and the procedures to answer
<italic>k</italic>
-mer queries, then detail the experimental comparisons.</p>
<p>Finally, we discuss the advantages of our structure and conclude with future developments.</p>
<p>Note that this study does not tackle the question of read mapping, it focuses on read indexing.</p>
<sec>
<title>Queries and Applications</title>
<p>Let us give an informal presentation of the problem. We are given a collection of
<italic>q </italic>
reads of length
<italic>m </italic>
and a length of substring
<italic>k </italic>
such that
<italic>k </italic>
<italic>m</italic>
.</p>
<p>Suppose one is given a string
<italic>f </italic>
of length
<italic>k</italic>
; one does not know whether it appears in some of the reads or not (
<italic>i</italic>
.
<italic>e</italic>
., whether
<italic>f </italic>
is a substring of some read). In the Algorithm section, we describe a data structure in which all substrings of length
<italic>k </italic>
of the reads are ordered lexicographically. Hence, one can search for
<italic>f </italic>
using a dichotomic search in
<italic>O</italic>
(
<italic>k </italic>
log((
<italic>m </italic>
-
<italic>k </italic>
+ 1)
<italic>q</italic>
))) worst case time in this structure (the dichotomic search is the standard procedure in this context [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B9">9</xref>
]), and determine whether at least one read contains
<italic>f </italic>
as a substring and at which position. If not, the answers to the queries below, which are all related to a sub-collection of reads containing
<italic>f</italic>
, are trivially the empty set or zero. Otherwise, one knows that
<italic>f </italic>
occurs in some read
<italic>r </italic>
of the collection at position
<italic>j</italic>
, and wishes to get some information on the other reads where
<italic>f </italic>
occurs. One wants to answer the following questions:</p>
<p>
<bold>Q1: </bold>
In which reads does
<italic>f </italic>
occur?</p>
<p>
<bold>Q2: </bold>
In how many reads does
<italic>f </italic>
occur?</p>
<p>
<bold>Q3: </bold>
What are the occurrence positions of
<italic>f </italic>
in the reads?</p>
<p>
<bold>Q4: </bold>
What is the number of occurrences of
<italic>f </italic>
in the reads?</p>
<p>
<bold>Q5: </bold>
In which reads does
<italic>f </italic>
occur only once?</p>
<p>
<bold>Q6: </bold>
In how many reads does
<italic>f </italic>
occur only once?</p>
<p>
<bold>Q7: </bold>
What are the occurrence positions of
<italic>f </italic>
in each read where
<italic>f </italic>
occurs only once?</p>
<p>
<bold>Q8: </bold>
What is the number of occurrences of
<italic>f </italic>
in the reads where it occurs only once?</p>
<p>We state several remarks about the queries before dwelling on applications.</p>
<p>1. The queries go by pairs: the first one computes a set of positions or read indices, while the second computes the cardinality of that set.</p>
<p>2. Note the clear semantic difference between Q1/Q2 and Q3/Q4. The answer to Q1 yields the identifiers of the reads in which
<italic>f </italic>
occurs, while that to Q3 gives
<bold>also </bold>
all its positions in the read. This clearly differs since
<italic>f </italic>
may occur several times in a read (
<italic>e</italic>
.
<italic>g</italic>
., if
<italic>f </italic>
is a poly-
<italic>A </italic>
sequence). Sometimes the positions are needed, sometimes only the reads (see below).</p>
<p>3. Queries Q5-Q8 are versions of Q1-Q4 constrained to a single occurrence of
<italic>f </italic>
in the reads. Of course other variants can also be computed,
<italic>e</italic>
.
<italic>g</italic>
. where the number of occurrences is limited by a user defined threshold. Since
<italic>f </italic>
is constrained to occur only once in each read, Q6 and Q8 are equivalent, and we will mention only Q6 in the sequel.</p>
<p>4. The data structure we propose is intended to be kept in memory and used for multiple queries.</p>
<p>Although this paper focuses on the data structure, its efficiency, and on the algorithms to solve these type of queries, it is important to list applications of these queries. In which context of read analysis, can one use such queries? Note that in such context,
<italic>k </italic>
is smaller than the read length. Theoretical and empirical investigations show that for instance, with
<italic>k </italic>
≥ 19 or 20,
<italic>k</italic>
-mers indicate in average a single genomic location in the human genome [
<xref ref-type="bibr" rid="B14">14</xref>
]. Such values of
<italic>k </italic>
can be computed depending on the genome length. Translated to reads or sequences: it is unlikely that two reads sharing a
<italic>k</italic>
-mer were not sequenced from the same part of the DNA. In other words, sharing a
<italic>k</italic>
-mer is a witness for having a common genomic origin.</p>
<sec>
<title>Mutation detection</title>
<p>Putative mutations (SNP, somatic mutations, small indels) are indicated by differences between a read and a reference genome. Once the reads have been mapped to the reference genome, one analyzes the sub-collection of reads that covers a genomic position to count how many reads support the variation observed in the read or that observed in the genome. If one considers the two substrings of length
<italic>k </italic>
centered on this mutation position, one in the read and one in the genome, answering Q2 for these substrings will give an approximate count of these two haplotypes. If one needs the corresponding reads, then Q1 is the appropriate query. If only a single, or a few reads, share this
<italic>k</italic>
-mer, then a sequence error might be suspected [
<xref ref-type="bibr" rid="B15">15</xref>
].</p>
</sec>
<sec>
<title>Local coverage</title>
<p>Suppose one is given a target sequence, which can be a read or an external sequence. For each of its
<italic>k</italic>
-mer, let us call the
<italic>local coverage</italic>
, the number of reads sharing this
<italic>k</italic>
-mer (this requires a dichotomic search). The local coverage profile (
<italic>i</italic>
.
<italic>e</italic>
. a histogram of the local coverage) along the target sequence provides useful information in various contexts. For a known mRNA and an RNA-Seq experiment, the average local coverage on all
<italic>k</italic>
-mers is a proxy for the expression level of the target, while the profile enables one to distinguish the target's sub-regions expressed at different levels [
<xref ref-type="bibr" rid="B16">16</xref>
,
<xref ref-type="bibr" rid="B17">17</xref>
]. In another context, with a genomic library, taking reads as queries and looking at their local coverage profile may help to detect those overlapping the extremity of a repeated or transposable element. This may prove useful to study the distribution and evolution of these elements in the genome.</p>
</sec>
<sec>
<title>Clustering and assembly without a reference genome</title>
<p>As for Expressed Sequence Tags, it is suitable to cluster and assemble RNA-Seq reads to compute the various transcripts expressed in the assayed library [
<xref ref-type="bibr" rid="B16">16</xref>
,
<xref ref-type="bibr" rid="B17">17</xref>
]. It is necessary to detect near exact alignment between pair of reads, and this is usually performed efficiently by filtration using seeds. In such case, very efficient and sensible seeds are exact shared
<italic>k</italic>
-mers [
<xref ref-type="bibr" rid="B4">4</xref>
]. Here, the sub-collection of reads sharing a
<italic>k</italic>
-mer with a given read, as well as the
<italic>k</italic>
-mer positions, can be obtained using query Q3. The answer to Q4 can help guiding the clustering process.</p>
<p>Similar needs of query occur in the assembly of genomic reads [
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B19">19</xref>
]. To know with which reads one can assemble a given read without ambiguity, one may perform query Q7 using
<italic>k</italic>
-mers at the 5' or 3' extremities of the read. The obtained occurrences together with their positions will indicate the matching reads and the relative positions of read pairs for assembly.</p>
<p>Our application list provides examples and is by no means exhaustive. We could also mention for instance the estimation of the target genome length in assembly context, which uses
<italic>k</italic>
-mer counting [
<xref ref-type="bibr" rid="B20">20</xref>
]. Clearly, these applications are beyond the scope of this paper. However, these paragraphs underline that the proposed data structure suits the needs of read processing in various application contexts, and will provide a unified framework for building read analysis programs.</p>
</sec>
</sec>
</sec>
<sec>
<title>Results and Discussion</title>
<p>This section contains the main contribution: a data structure to index large read collections, the
<italic>Gk </italic>
arrays. To describe it, we first introduce the notation, formalize the queries, exhibit the index data structure, give its construction algorithm, and the procedures for answering all queries. This makes the content of the Algorithms section. Then, in the Comparison section we investigate its practical usability compared to two alternatives: one based on a generalized Suffix Array (SA) and another based on a hash table. This includes theoretical and practical comparisons.</p>
<sec>
<title>Algorithms</title>
<p>Here, we detail the algorithms to build the
<italic>Gk </italic>
arrays and to answer the queries. We start by defining more formally the queries we want to answer and introduce the necessary notation.</p>
<sec>
<title>Notation and definition of the queries</title>
<p>Let Σ be an alphabet of size σ. Σ* denotes the set of
<italic>words</italic>
,
<italic>strings </italic>
or
<italic>sequences </italic>
over Σ and, for any integer
<italic>n</italic>
, Σ
<italic>
<sup>n </sup>
</italic>
denotes the set of words of length
<italic>n </italic>
over Σ. For a word
<italic>x</italic>
, |
<italic>x</italic>
| denotes the
<italic>length </italic>
of
<italic>x</italic>
. Given two words
<italic>x </italic>
and
<italic>y</italic>
, we denote by
<italic>xy </italic>
the
<italic>concatenation </italic>
of
<italic>x </italic>
and
<italic>y</italic>
. For every 0 ≤
<italic>i </italic>
<italic>j </italic>
≤ |
<italic>x</italic>
| - 1,
<italic>x</italic>
[
<italic>i</italic>
] denotes the (
<italic>i </italic>
+ 1)
<sup>th </sup>
element of
<italic>x</italic>
, and
<italic>x</italic>
[
<italic>i</italic>
..
<italic>j</italic>
] denotes the
<italic>substring x</italic>
[
<italic>i</italic>
]
<italic>x</italic>
[
<italic>i </italic>
+ 1] . . .
<italic>x</italic>
[
<italic>j</italic>
]. Let ≤
<italic>
<sub>L </sub>
</italic>
denotes the comparison operator for the lexicographic order on words. Lexicographic ranks start from zero and all arrays are indexed from zero. For any finite set
<italic>A</italic>
, we denote its cardinality by #
<italic>A</italic>
.</p>
<p>The
<bold>input </bold>
consists a list
<italic>R </italic>
= (
<italic>r</italic>
<sub>0</sub>
, . . .,
<italic>r</italic>
<sub>
<italic>q</italic>
-1</sub>
) of
<italic>q </italic>
short sequences of length
<italic>m</italic>
, called
<italic>reads</italic>
, which are not necessarily distinct. We know that
<italic>m</italic>
,
<italic>k</italic>
,
<italic>q </italic>
∈ ℕ satisfy
<italic>m </italic>
<italic>k </italic>
> 0.</p>
<p>A
<italic>k</italic>
-long substring of a word is called a
<italic>k-mer</italic>
. For any
<italic>u </italic>
∈ Σ*, we denote by
<italic>F
<sub>k</sub>
</italic>
(
<italic>u</italic>
) the set of
<italic>k</italic>
-mers in
<italic>u</italic>
:
<italic>F
<sub>k</sub>
</italic>
(
<italic>u</italic>
) = {
<italic>v </italic>
∈ Σ
<italic>
<sup>k </sup>
</italic>
| ∃
<italic>p </italic>
∈ [0, |
<italic>u</italic>
| -
<italic>k</italic>
] such that
<italic>v </italic>
=
<italic>u</italic>
[
<italic>p</italic>
. .
<italic>p </italic>
+
<italic>k </italic>
- 1]}. Let
<italic>f </italic>
∈ Σ
<italic>
<sup>k </sup>
</italic>
and let us denote the set of indexes of the reads in which
<italic>f </italic>
occurs by
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
) = {
<italic>j </italic>
∈ [0,
<italic>q</italic>
[|
<italic>f </italic>
<italic>F
<sub>k</sub>
</italic>
(
<italic>r
<sub>j</sub>
</italic>
)}, and the set of
<italic>positioned occurrences </italic>
of
<italic>f </italic>
in all reads by
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
) = {(
<italic>j</italic>
,
<italic>
<roman></roman>
</italic>
) |
<italic>r
<sub>j</sub>
</italic>
[
<italic></italic>
. .
<italic></italic>
+
<italic>k </italic>
- 1] =
<italic>f</italic>
}, where a positioned occurrence is given by the pair made of the read index in
<italic>R </italic>
and the beginning position of
<italic>f </italic>
in this read. Let us denote the restriction of
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
) (resp.
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
)) to subset of read indexes where
<italic>f </italic>
occurs only once by
<italic>UInd
<sub>k</sub>
</italic>
(
<italic>f</italic>
) (resp.
<italic>UPos
<sub>k</sub>
</italic>
(
<italic>f</italic>
)). Formally,
<italic>UPos
<sub>k</sub>
</italic>
(
<italic>f</italic>
) = {(
<italic>j</italic>
,
<italic></italic>
) |
<italic>r
<sub>j</sub>
</italic>
[
<italic></italic>
. .
<italic></italic>
+
<italic>k </italic>
- 1] =
<italic>f </italic>
and ∀
<italic>i </italic>
<italic></italic>
,
<italic>r
<sub>j</sub>
</italic>
[
<italic>i</italic>
. .
<italic>i </italic>
+
<italic>k </italic>
- 1] ≠
<italic>f</italic>
}, and
<italic>UInd
<sub>k</sub>
</italic>
(
<italic>f</italic>
) = {
<italic>j </italic>
| (
<italic>j</italic>
,
<italic></italic>
) ∈
<italic>UPos
<sub>k</sub>
</italic>
(
<italic>f</italic>
)}. Let
<italic>i </italic>
∈ [0,
<italic>q</italic>
[,
<italic>j″ </italic>
∈ [0,
<italic>m </italic>
-
<italic>k </italic>
+ 1[, and let
<italic>f </italic>
be the
<italic>k</italic>
-mer starting at
<italic>j″ </italic>
in read
<italic>r
<sub>i</sub>
</italic>
. Note that here we require the knowledge of the pair (
<italic>i</italic>
,
<italic>j″</italic>
), which defines the
<italic>k</italic>
-mer
<italic>f</italic>
. Now, the seven
<italic>k</italic>
-mer queries can be formally defined as computing
<disp-formula>
<graphic xlink:href="1471-2105-12-242-i1.gif"></graphic>
</disp-formula>
</p>
<p>Clearly, it appears (see Additional File
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Proof and queries' algorithms) that the algorithms to compute
<italic>UInd
<sub>k</sub>
</italic>
(
<italic>f</italic>
), resp.
<italic>UPos
<sub>k</sub>
</italic>
(
<italic>f</italic>
), for answering Q5/Q7, simply filter
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
), resp.
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
), on the fly, and are thus similar to the algorithms for Q1/Q3. For place sake, we will only detail the solutions for Q1-Q4 in the sequel.</p>
</sec>
<sec>
<title>The index structure</title>
<p>Our algorithm relies on four arrays that allow to query the
<italic>k</italic>
-mers of all reads. Hence, we define a word made of the concatenation of all reads:
<italic>C
<sub>R </sub>
</italic>
=
<italic>r</italic>
<sub>0</sub>
<italic>r</italic>
<sub>1 </sub>
<italic>r</italic>
<sub>
<italic>q</italic>
-1</sub>
. Of course, a
<italic>k</italic>
-mer that overlaps two reads in
<italic>C
<sub>R </sub>
</italic>
is not necessarily a
<italic>k</italic>
-mer of some read. Hence, we introduce a system to renumber the positions of interest in
<italic>C
<sub>R</sub>
</italic>
. The rationale behind is to save place in the
<italic>Gk </italic>
arrays by discarding the positions of overlapping
<italic>k</italic>
-mers in
<italic>C
<sub>R</sub>
</italic>
. Let us denote by
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i2.gif"></inline-graphic>
</inline-formula>
the number of distinct
<italic>k</italic>
-mers of all reads, and for the sake of legibility we set
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i3.gif"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i4.gif"></inline-graphic>
</inline-formula>
(the number of interesting positions in a read and in
<italic>C
<sub>R</sub>
</italic>
, respectively). We call:</p>
<p>
<italic>P</italic>
-
<italic>position</italic>
, a starting position in
<italic>C
<sub>R </sub>
</italic>
of a
<italic>k</italic>
-mer that is not overlapping two reads,
<italic>i</italic>
.
<italic>e</italic>
. an element of
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i5.gif"></inline-graphic>
</inline-formula>
.</p>
<p>
<italic>g</italic>
, the function that renumbers
<italic>P</italic>
-positions
<italic>in order </italic>
such that their index are consecutive;
<italic>g </italic>
is defined by:
<disp-formula>
<graphic xlink:href="1471-2105-12-242-i6.gif"></graphic>
</disp-formula>
</p>
<p>
<italic>Q</italic>
-
<italic>position</italic>
, an image of a
<italic>P</italic>
-position by g(.),
<italic>i</italic>
.
<italic>e</italic>
. an element of
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i7.gif"></inline-graphic>
</inline-formula>
. Note that the set
<italic>Q</italic>
<sub>pos </sub>
is not a query.</p>
<p>Clearly,
<italic>P</italic>
<sub>pos </sub>
and
<italic>Q</italic>
<sub>pos </sub>
have the same cardinality
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i8.gif"></inline-graphic>
</inline-formula>
, and as (
<italic>j </italic>
<italic>j</italic>
') implies
<italic>g</italic>
(
<italic>j</italic>
) ≠
<italic>g</italic>
(
<italic>j</italic>
'),
<italic>g </italic>
is bijective. Hence,
<italic>g</italic>
<sup>-1 </sup>
exists and maps a
<italic>Q</italic>
-position back to its corresponding
<italic>P</italic>
-position in
<italic>C
<sub>R</sub>
</italic>
. Proposition 1 explicits the conversion between a positioned occurrence and a
<italic>P</italic>
-position.</p>
<p>
<bold>Proposition 1</bold>
.
<italic>Let </italic>
(
<italic>j</italic>
,
<italic></italic>
)
<italic>with j </italic>
∈ [0,
<italic>q</italic>
[,
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i9.gif"></inline-graphic>
</inline-formula>
<italic> be a positioned occurrence of a k</italic>
-
<italic>mer in a read</italic>
.
<italic>The corresponding P</italic>
-
<italic>position in C
<sub>R </sub>
is jm</italic>
+
<italic></italic>
.
<italic>Conversely</italic>
,
<italic>let j</italic>
'
<italic>be a P</italic>
-
<italic>position</italic>
,
<italic>the corresponding positioned occurrence in a read is </italic>
(⌊
<italic>j</italic>
'/
<italic>m</italic>
⌋,
<italic>j</italic>
' mod
<italic>m</italic>
).</p>
<p>This numbering system is important for it allows us to go back and forth between a positioned occurrence in a read, its corresponding
<italic>P</italic>
-position in
<italic>C
<sub>R</sub>
</italic>
, and its
<italic>Q</italic>
-position that will be stored in our arrays.</p>
<p>Let
<italic>j </italic>
be a
<italic>Q</italic>
-position. We denote by
<italic>s
<sub>Q</sub>
</italic>
(
<italic>j</italic>
), resp.
<italic>f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
), the suffix, resp. the
<italic>k</italic>
-mer, of
<italic>C
<sub>R </sub>
</italic>
beginning at the
<italic>P</italic>
-position
<italic>g</italic>
<sup>-1</sup>
(
<italic>j</italic>
),
<italic>i</italic>
.
<italic>e</italic>
.
<italic>s
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) =
<italic>C
<sub>R</sub>
</italic>
[
<italic>g</italic>
<sup>-1</sup>
(
<italic>j</italic>
) . .
<italic>qm </italic>
- 1] and
<italic>f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) =
<italic>C
<sub>R</sub>
</italic>
[
<italic>g</italic>
<sup>-1</sup>
(
<italic>j</italic>
) . .
<italic>g</italic>
<sup>-1</sup>
(
<italic>j</italic>
) +
<italic>k </italic>
- 1]. We call
<italic>s
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) a
<italic>P</italic>
-suffix. Note that all suffixes beginning at
<italic>P</italic>
-positions have different length and are pairwise distinct; thus, there are
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i8.gif"></inline-graphic>
</inline-formula>
such suffixes and they all have a different lexicographic rank. However, this may, and in real data applications will, not be the case for the
<italic>k</italic>
-mers,
<italic>i</italic>
.
<italic>e</italic>
. the
<italic>f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
). We call the set {
<italic>f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) |
<italic>j </italic>
<italic>Q</italic>
<sub>pos</sub>
} the set of
<italic>P
<sub>k</sub>
</italic>
-
<italic>factors</italic>
, whose cardinality is
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i2.gif"></inline-graphic>
</inline-formula>
with our notation.</p>
<p>Now, we define the
<italic>Gk </italic>
arrays:</p>
<p>
<bold>
<italic>GkSA </italic>
</bold>
(Generalized
<italic>k </italic>
Suffix Array) is a modified Suffix Array of
<italic>C
<sub>R </sub>
</italic>
that lexicographically sorts only the
<italic>P</italic>
-suffixes,</p>
<p>
<bold>
<italic>GkIFA </italic>
</bold>
(Generalized
<italic>k </italic>
Inverse Factor Array) is a modified Inverse Suffix Array (ISA) that stores for each
<italic>Q</italic>
-position, in position order in
<italic>C
<sub>R</sub>
</italic>
, the lexicographic rank of the
<italic>P
<sub>k</sub>
</italic>
-factors starting at the corresponding
<italic>P</italic>
-position,</p>
<p>
<bold>
<italic>GkCFA </italic>
</bold>
(Generalized
<italic>k </italic>
Counting Factor Array) is an array that associates to a
<italic>k</italic>
-mer (actually, to its rank) its number of occurrences at
<italic>P</italic>
-positions in
<italic>C
<sub>R</sub>
</italic>
,</p>
<p>
<bold>
<italic>GkCFPS </italic>
</bold>
(Generalized
<italic>k </italic>
Counting Factor Prefix Sum) stores the prefix sums of
<italic>GkCFA</italic>
. Since
<italic>GkCFA </italic>
and
<italic>GkCFPS </italic>
are equivalent only one of them is necessary at a time.</p>
<p>Formally, the definitions are (see Figure
<xref ref-type="fig" rid="F1">1</xref>
for an example and Figure
<xref ref-type="fig" rid="F2">2</xref>
):</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Example of the read index data structure: the
<italic>Gk</italic>
-arrays</bold>
. Example of the read index data structure: the
<italic>Gk </italic>
arrays. Example for a collection
<italic>R </italic>
= (
<monospace>aacaact, caattca, aacaagc</monospace>
) of
<italic>q </italic>
= 3 reads of length
<italic>m </italic>
= 7 and considering 3-mers (
<italic>k </italic>
= 3). The index is composed of three tables and uses a fourth one during construction (
<italic>GkSA</italic>
,
<italic>GkIFA</italic>
,
<italic>GkCFA</italic>
, and
<italic>GkCFPS</italic>
). The first table shows the starting indices of
<italic>k</italic>
-mers in the text made by the concatenation of all reads,
<italic>C
<sub>R</sub>
</italic>
, the SA built on
<italic>C
<sub>R</sub>
</italic>
, and the function
<italic>g </italic>
that renumbers
<italic>P</italic>
-positions of
<italic>C
<sub>R </sub>
</italic>
to make them consecutive.
<italic>P</italic>
-positions are {0,1,2,3,4,7,8,9,10,11,14,15,16,17,18}; all other positions, those starting positions where the
<italic>k</italic>
-mer overlaps two reads, are displayed with a gray background (lines
<italic>j </italic>
and SA[
<italic>j</italic>
]). Line SA refers to the usual Suffix Array of
<italic>C
<sub>R</sub>
</italic>
. The
<italic>k</italic>
-mer
<monospace>caa</monospace>
occurs 4 times in
<italic>C
<sub>R </sub>
</italic>
at positions 2, 7, 12 and 16. Among those, only 2, 7, and 16 are
<italic>P</italic>
-positions. The lexicographic rank of the
<italic>P
<sub>k</sub>
</italic>
-factor starting at position 16 is given by
<italic>GkIFA</italic>
[
<italic>g</italic>
(16)] =
<italic>GkIFA</italic>
12 = 7, and the number of occurrences of the
<italic>P
<sub>k</sub>
</italic>
-factor
<monospace>caa</monospace>
is given by
<italic>GkCFA</italic>
7, which equals 3. The positions of these occurrences are thus obtained by the set {
<italic>g</italic>
<sup>-1</sup>
(
<italic>GkSA</italic>
[
<italic>j</italic>
])
<italic>| GkCFPS</italic>
[7 - 1] ≤
<italic>j < GkCFPS</italic>
7} = {2,7,16}. See also Figure 2.</p>
</caption>
<graphic xlink:href="1471-2105-12-242-1"></graphic>
</fig>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Accessing the occurrences of a
<italic>k</italic>
-mer in the index</bold>
. Accessing the index to get the occurrences of a
<italic>k</italic>
-mer starting at position
<italic>j </italic>
in the concatenation of the reads (
<italic>i.e</italic>
.,
<italic>C
<sub>R</sub>
</italic>
). Accessing
<italic>GkIFA</italic>
,
<italic>GkCFPS </italic>
and
<italic>GkSA</italic>
: (a) From
<italic>C
<sub>R </sub>
</italic>
to
<italic>GkIFA</italic>
:
<italic>g</italic>
(
<italic>j</italic>
) is the renumbered position of the
<italic>P</italic>
-position
<italic>j</italic>
. (b) From
<italic>GkIFA </italic>
to
<italic>GkCFPS</italic>
:
<italic>GkIFA</italic>
[
<italic>g</italic>
(
<italic>j</italic>
)] is the lexicographic rank of the
<italic>P
<sub>k</sub>
</italic>
-factor starting at
<italic>P</italic>
-position
<italic>j </italic>
in
<italic>C
<sub>R</sub>
</italic>
, and
<italic>GkCFPS</italic>
[
<italic>GkIFA</italic>
[
<italic>g</italic>
(
<italic>j</italic>
)]] is the number of occurrences in
<italic>C
<sub>R </sub>
</italic>
of the
<italic>P
<sub>k</sub>
</italic>
-factors of rank less than
<italic>GkIFA</italic>
[
<italic>g</italic>
(
<italic>j</italic>
)]. (c) From
<italic>GkCFPS </italic>
to
<italic>GkSA</italic>
: The positions of the occurrences of the
<italic>P
<sub>k</sub>
</italic>
-factor starting at position
<italic>j </italic>
are in
<italic>GkSA </italic>
in the range [
<italic>GkCFPS</italic>
[
<italic>GkIFA</italic>
[
<italic>g</italic>
(
<italic>j</italic>
)] -1],
<italic>GkCFPS</italic>
[
<italic>GkIFA</italic>
[
<italic>g</italic>
(
<italic>j</italic>
)]]].</p>
</caption>
<graphic xlink:href="1471-2105-12-242-2"></graphic>
</fig>
<p>• For
<italic>i </italic>
a suffix lexicographic rank and
<italic>j </italic>
a
<italic>Q</italic>
-position (
<italic>i.e. i, j </italic>
<italic>Q</italic>
<sub>pos</sub>
),</p>
<p>
<italic>   GkSA</italic>
[
<italic>i</italic>
] =
<italic>j </italic>
iff
<italic>s
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) has lexicographic rank
<italic>i </italic>
among the
<italic>P</italic>
-suffixes.</p>
<p>• For
<italic>i </italic>
a
<italic>k</italic>
-mer lexicographic rank and
<italic>j </italic>
a
<italic>Q</italic>
-position (
<italic>i.e</italic>
.
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i10.gif"></inline-graphic>
</inline-formula>
and
<italic>j </italic>
<italic>Q</italic>
<sub>pos</sub>
),</p>
<p>
<italic>   GkIFA</italic>
[
<italic>j</italic>
] =
<italic>i </italic>
iff
<italic>f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) has lexicographic rank
<italic>i </italic>
among the
<italic>P
<sub>k</sub>
</italic>
-factors.</p>
<p>• For
<italic>i </italic>
a
<italic>k</italic>
-mer lexicographic rank (
<italic>i.e</italic>
.
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i10.gif"></inline-graphic>
</inline-formula>
),</p>
<p>
<italic>   GkCFA</italic>
[
<italic>i</italic>
] = #{
<italic>j </italic>
<italic>Q</italic>
<sub>pos </sub>
<italic>| f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) =
<italic>f
<sub>Q </sub>
</italic>
(
<italic>GkSA</italic>
[
<italic>i</italic>
])},</p>
<p>• For
<italic>i </italic>
a
<italic>k</italic>
-mer lexicographic rank (
<italic>i.e</italic>
.
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i10.gif"></inline-graphic>
</inline-formula>
), the definition of the prefix sum is
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i11.gif"></inline-graphic>
</inline-formula>
and </p>
<p>
<italic>GkCFPS</italic>
[-1] = 0.</p>
<p>
<bold>Remark 1</bold>
.
<italic>The array GkCFPS is not essential to the algorithm: it is solely there to avoid multiple, time consuming computations of prefix sums over GkCFA (see GkCFPS definition above). Moreover, any value of GkCFA can also be accessed in constant time using GkCFA</italic>
[
<italic>i</italic>
] =
<italic>GkCFPS</italic>
[
<italic>i</italic>
] -
<italic>GkCFPS</italic>
[
<italic>i </italic>
-1].
<italic>Thus, GkCFPS will be kept in memory to replace GkCFA</italic>
.</p>
<p>We give some useful properties of
<italic>Gk </italic>
arrays.</p>
<p>
<bold>Proposition 2</bold>
.
<italic>For </italic>
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i10.gif"></inline-graphic>
</inline-formula>
,
<italic>GkCFPS</italic>
[
<italic>i</italic>
] = #{
<italic>j </italic>
<italic>Q
<sub>pos </sub>
| f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
)
<italic>
<sub>L </sub>
f
<sub>Q</sub>
</italic>
(
<italic>GkSA</italic>
[
<italic>i</italic>
])}
<italic>(Proof by induction)</italic>
.</p>
<p>In other words,
<italic>GkCFPS</italic>
[
<italic>i</italic>
] is the number of
<italic>P
<sub>k</sub>
</italic>
-factors having lexicographic rank less than or equal to
<italic>i</italic>
. Since
<italic>GkSA </italic>
is sorted on the lexicographic order of the
<italic>P</italic>
-suffixes, it is also sorted on the lexicographic order of the
<italic>P
<sub>k</sub>
</italic>
-factors. Hence, we get:</p>
<p>
<bold>Proposition 3</bold>
.
<italic>Let f </italic>
∈ Σ
<italic>
<sup>k </sup>
such that Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
) ≠ ∅.
<italic>All occurrences of f have the same rank among the P
<sub>k</sub>
-factors, and are stored consecutively in GkSA</italic>
.</p>
</sec>
<sec>
<title>Construction algorithm</title>
<p>First, we detail the algorithm for building
<italic>GkSA</italic>
, and then the one computing
<italic>GkIFA </italic>
and
<italic>GkCFA</italic>
.</p>
<sec>
<title>Computation of
<italic>GkSA</italic>
</title>
<p>We first build the full Suffix Array (SA) of
<italic>C
<sub>R </sub>
</italic>
using a linear time and space algorithm. Since
<italic>|C
<sub>R</sub>
| </italic>
=
<italic>mq </italic>
this first step can be done in
<italic>O</italic>
(
<italic>mq</italic>
). Then
<italic>GkSA </italic>
is obtained from SA by selecting only the
<italic>P</italic>
-positions and by renumbering them to
<italic>Q</italic>
-positions using function
<italic>g</italic>
. This second step is performed in
<italic>O</italic>
(
<italic>mq</italic>
) time and space. Moreover,
<italic>GkSA </italic>
is built in place of the Suffix Array: our algorithm allocates only the memory for the SA table. When answering Q1/Q2, each read where a given
<italic>P
<sub>k</sub>
</italic>
-factor occurs should be counted only once (even if the
<italic>P
<sub>k</sub>
</italic>
-factor occurs more than once in the read). Similarly, for Q5/Q6, we count only reads where a given
<italic>P
<sub>k</sub>
</italic>
-factor occurs exactly once. To avoid using masks on the reads, we sort in increasing order the values of
<italic>GkSA </italic>
corresponding to
<italic>P
<sub>k</sub>
</italic>
-factors sharing the same lexicographic rank (see Table in Additional File
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Proof and queries' algorithms). The values that have to be sorted are
<italic>Q</italic>
-positions,
<italic>i.e</italic>
. integers, thus the sort can be performed in linear time on values of
<italic>GkSA </italic>
using
<italic>e.g</italic>
. radix sort [
<xref ref-type="bibr" rid="B21">21</xref>
]. The whole process takes
<italic>O</italic>
(
<italic>mq</italic>
) time and space.</p>
</sec>
<sec>
<title>Computation of
<italic>GkIFA </italic>
and
<italic>GkCFA</italic>
</title>
<p>Algorithm 1 shows how to compute jointly
<italic>GkIFA </italic>
and
<italic>GkCFA</italic>
. Its correctness proof is given in Additional File
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Proof and queries' algorithms.</p>
<p>
<bold>Algorithm 1: </bold>
Computation of
<italic>GkIFA </italic>
and
<italic>GkCFA</italic>
.</p>
<p>
<bold>Data</bold>
:
<italic>GkSA</italic>
,
<italic>C
<sub>R</sub>
</italic>
,
<italic>k</italic>
,
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i8.gif"></inline-graphic>
</inline-formula>
</p>
<p>
<bold>Result</bold>
:
<italic>GkIFA </italic>
and
<italic>GkCFA</italic>
</p>
<p>1 begin</p>
<p>
<bold>2 </bold>
   
<italic>GkIFA</italic>
[
<italic>GkSA</italic>
[0]] ← 0;</p>
<p>
<bold>3 </bold>
   
<italic>GkCFA</italic>
[0] ← 1;</p>
<p>
<bold>4 </bold>
   
<italic>t </italic>
← 0;</p>
<p>
<bold>5    foreach </bold>
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i12.gif"></inline-graphic>
</inline-formula>
<bold> do</bold>
</p>
<p>
<bold>6 </bold>
      
<italic>j </italic>
<italic>GkSA</italic>
[
<italic>i</italic>
];</p>
<p>
<bold>7 </bold>
      
<italic>j</italic>
' ←
<italic>GkSA</italic>
[
<italic>i </italic>
- 1];</p>
<p>
<bold>8       if </bold>
<italic>f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
) ≠
<italic>f
<sub>Q</sub>
</italic>
(
<italic>j</italic>
')
<bold>then</bold>
</p>
<p>
<bold>9 </bold>
            
<italic>t </italic>
<italic>t </italic>
+ 1;</p>
<p>
<bold>10 </bold>
          
<italic>GkCFA</italic>
[
<italic>t</italic>
] ← 0;</p>
<p>
<bold>11 </bold>
    
<italic>GkIFA</italic>
[
<italic>j</italic>
] ←
<italic>t </italic>
;</p>
<p>
<bold>12 </bold>
    
<italic>GkCFA</italic>
[
<italic>t</italic>
] ←
<italic>GkCFA</italic>
[
<italic>t</italic>
] + 1;</p>
<p>
<bold>13   return </bold>
(
<italic>GkIFA </italic>
and
<italic>GkCFA</italic>
);</p>
<p>
<bold>Theorem 1</bold>
.
<italic>Algorithm 1 correctly computes the arrays GkIFA and GkCFA. (Proof in </italic>
Additional File
<xref ref-type="supplementary-material" rid="S1">1</xref>
<italic>: Proof and queries' algorithms)</italic>
.</p>
<p>The comparison between two
<italic>P
<sub>k</sub>
</italic>
-factors (line 8) is naively performed in
<italic>O</italic>
(
<italic>k</italic>
) time, and is the only instruction of the inner loop that takes more than constant time. Hence, the computation of both
<italic>GkIFA </italic>
and
<italic>GkCFA </italic>
is performed in
<italic>O</italic>
((
<italic>m - k</italic>
)
<italic>qk</italic>
) time. Let us emphasize the simplicity of the algorithm, which explains the fast construction times obtained in practice.</p>
<p>
<bold>Remark 2</bold>
.
<italic>Once the values of GkCFA have been calculated, one can compute the values of GkCFPS in-place in O</italic>
((
<italic>m </italic>
-
<italic>k</italic>
)
<italic>q</italic>
)
<italic>time (see Remark 1)</italic>
.</p>
</sec>
</sec>
<sec>
<title>Answering the queries</title>
<p>Assume the
<italic>Gk </italic>
arrays have been built in a preprocessing step (see section Construction algorithm); we show how to answer the first four queries, starting with Q4 and Q3. Let
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i13.gif"></inline-graphic>
</inline-formula>
, and let
<italic>f </italic>
be the
<italic>k</italic>
-mer starting at
<italic>j</italic>
" in read
<italic>r
<sub>i</sub>
</italic>
. This occurrence of
<italic>f </italic>
in
<italic>C
<sub>R </sub>
</italic>
is found at
<italic>P</italic>
-position
<italic>j</italic>
':=
<italic>im </italic>
+
<italic>j</italic>
" and the corresponding
<italic>Q</italic>
-position is
<italic>j </italic>
:=
<italic>g</italic>
(
<italic>j</italic>
').</p>
<sec>
<title>Q4: Computing the cardinality of
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
)</title>
<p>First, we need to find the lexicographic rank of
<italic>f </italic>
among the
<italic>P
<sub>k</sub>
</italic>
-factors, which we obtain directly by setting
<italic>t </italic>
:=
<italic>GkIFA</italic>
[
<italic>j</italic>
] (by definition of
<italic>GkIFA</italic>
). The cardinality of
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
) is simply the number of occurrences starting at
<italic>P</italic>
-positions in
<italic>C
<sub>R</sub>
</italic>
, which is given by
<italic>GkCFA</italic>
[
<italic>t</italic>
] (by definition of
<italic>GkCFA</italic>
). By Remark 1,
<italic>GkCFA</italic>
[
<italic>t</italic>
] =
<italic>GkCFPS</italic>
[
<italic>t</italic>
] -
<italic>GkCFPS</italic>
[
<italic>t - </italic>
1].</p>
</sec>
<sec>
<title>Q3: Computing
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
)</title>
<p>By Proposition 3, all occurrences of
<italic>f </italic>
starting at
<italic>P</italic>
-positions are stored consecutively in
<italic>GkSA</italic>
. It suffices to find the lower and upper indices, denoted by
<italic>
<sub>f </sub>
</italic>
and
<italic>u
<sub>f </sub>
</italic>
respectively. By the ordering of
<italic>GkSA </italic>
all occurrences of factors smaller than
<italic>f </italic>
in the lexicographic order are stored before its occurrences in
<italic>GkSA</italic>
. Hence, by definition of
<italic>GkCFPS </italic>
and Proposition 2, we have
<italic>u
<sub>f </sub>
</italic>
=
<italic>GkCFPS</italic>
[
<italic>t</italic>
] and
<italic>
<sub>f </sub>
</italic>
=
<italic>GkCFPS</italic>
[
<italic>t </italic>
- 1]. Since
<italic>GkSA </italic>
is indexed from 0, the starting
<italic>Q</italic>
-positions of occurrences of
<italic>f </italic>
are comprised in the range [
<italic>
<sub>f</sub>
, u
<sub>f </sub>
</italic>
] in
<italic>GkSA</italic>
. The corresponding
<italic>P</italic>
-positions are obtained using
<italic>g</italic>
<sup>-1</sup>
(.) and are then transformed into positioned occurrences with Proposition 1. This proves Theorem 2.</p>
<p>
<bold>Theorem 2</bold>
.
<italic>Let f be a k-mer of a read occurring at Q-position j in C
<sub>R</sub>
. Then, its lexicographic rank among the P
<sub>k</sub>
-factors is t </italic>
:=
<italic>GkIFA</italic>
[
<italic>j</italic>
].
<italic>If we set u
<sub>f </sub>
</italic>
:=
<italic>GkCFPS</italic>
[
<italic>t</italic>
]
<italic>and ℓ
<sub>f </sub>
</italic>
:=
<italic>GkCFPS</italic>
[
<italic>t - </italic>
1]
<italic>then</italic>
</p>
<p>
<italic>1. the starting P-positions of f's occurrences in C
<sub>R </sub>
are </italic>
{
<italic>g</italic>
<sup>-1</sup>
(
<italic>GkSA</italic>
[
<italic>
<roman></roman>
</italic>
])
<italic>| ℓ </italic>
∈ [
<italic>
<sub>f</sub>
, u
<sub>f </sub>
</italic>
[},</p>
<p>
<italic>2</italic>
.
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
) = {(
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i22.gif"></inline-graphic>
</inline-formula>
,
<italic>g</italic>
<sup>-1</sup>
(
<italic>GkSA</italic>
[
<italic>
<roman></roman>
</italic>
]) mod
<italic>m</italic>
)
<italic>| ℓ </italic>
∈ [
<italic>
<roman></roman>
<sub>f</sub>
, u
<sub>f </sub>
</italic>
[},</p>
<p>
<italic>3</italic>
. #
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
) =
<italic>u
<sub>f </sub>
</italic>
-
<italic>
<sub>f</sub>
</italic>
.</p>
<p>Given Theorem 2, the queries regarding
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
) can be answered as follows:</p>
<p>
<bold>Q1: </bold>
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
): = {
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i22.gif"></inline-graphic>
</inline-formula>
| ℓ ∈ [
<italic>
<sub>f</sub>
, u
<sub>f</sub>
</italic>
[},</p>
<p>
<bold>Q2: </bold>
by counting the elements of
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
) while computing it.</p>
<p>The algorithms for Q1, Q3, and Q4 are given extensively in Algorithms 2, 3, and 4. The algorithms for all other queries are included in Additional File
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Proof and queries' algorithms.</p>
<p>To answer Q7, one computes
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
) and scans it on the fly to remove reads (or the positioned occurrences) having strictly more than one occurrence of
<italic>f</italic>
. A similar approach solves Q8, and Q5. Variants of these queries where the number of allowed occurrences is constrained by a parameter can be answered similarly.</p>
</sec>
<sec>
<title>Complexity</title>
<p>Answering Q1-Q3 or Q5-Q8 requires to scan the values in
<italic>GkSA </italic>
inside the range corresponding to the
<italic>k</italic>
-mer
<italic>f</italic>
, which can be performed in
<italic>O</italic>
(
<italic>occ_Reads(f)</italic>
) time, where
<italic>occ_Reads(f) </italic>
denotes the occurrence number of
<italic>f </italic>
in the reads. Query Q4 is computed in constant time using
<italic>GkCFPS</italic>
.</p>
<p>
<bold>Algorithm 2: </bold>
Q1 (
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
))</p>
<p>
<bold>Data</bold>
:
<italic>f </italic>
∈ ∑
<italic>
<sup>k</sup>
</italic>
,
<italic>j </italic>
<italic>P</italic>
<sub>pos </sub>
such that
<italic>C
<sub>R</sub>
</italic>
[
<italic>j .. j </italic>
+
<italic>k </italic>
- 1] =
<italic>f</italic>
</p>
<p>
<bold>Result</bold>
: The set
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
)</p>
<p>1 begin</p>
<p>
<bold>2 </bold>
   
<italic>Ind
<sub>k </sub>
</italic>
← empty set;</p>
<p>
<bold>3 </bold>
   
<italic>t </italic>
<italic>GkIFA</italic>
[
<italic>j</italic>
];</p>
<p>
<bold>4 </bold>
   
<italic>
<sub>f </sub>
</italic>
<italic>GkCFPS</italic>
[
<italic>t - </italic>
1];</p>
<p>
<bold>5 </bold>
   
<italic>u
<sub>f </sub>
</italic>
<italic>GkCFPS</italic>
[
<italic>t</italic>
];</p>
<p>
<bold>6 </bold>
   prev ←
<italic>- </italic>
1;</p>
<p>
<bold>7    foreach </bold>
<italic>i </italic>
∈ [
<italic>
<sub>f</sub>
, u
<sub>f </sub>
</italic>
[  
<bold>do</bold>
</p>
<p>
<bold>8 </bold>
      readIndex ←
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i23.gif"></inline-graphic>
</inline-formula>
;</p>
<p>
<bold>9       if </bold>
<italic>readIndex </italic>
<italic>prev </italic>
<bold>then</bold>
</p>
<p>
<bold>10 </bold>
         Add readIndex to
<italic>Ind
<sub>k</sub>
</italic>
; prev ← readIndex;</p>
<p>
<bold>11  return </bold>
(
<italic>Ind
<sub>k</sub>
</italic>
);</p>
<p>
<bold>Algorithm 3: </bold>
Q3 (
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
))</p>
<p>
<bold>Data</bold>
:
<italic>f </italic>
∈ ∑
<italic>
<sup>k</sup>
</italic>
,
<italic>j </italic>
<italic>P</italic>
<sub>pos </sub>
such that
<italic>C
<sub>R</sub>
</italic>
[
<italic>j . . j </italic>
+
<italic>k - </italic>
1] =
<italic>f</italic>
</p>
<p>
<bold>Result</bold>
: The set
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
)</p>
<p>1 begin</p>
<p>
<bold>2 </bold>
   
<italic>Pos
<sub>k </sub>
</italic>
← empty set;</p>
<p>
<bold>3 </bold>
   
<italic>t </italic>
<italic>GkIFA</italic>
[
<italic>j</italic>
];</p>
<p>
<bold>4 </bold>
   
<italic>
<sub>f </sub>
</italic>
<italic>GkCFPS</italic>
[
<italic>t - </italic>
1];</p>
<p>
<bold>5 </bold>
   
<italic>u
<sub>f </sub>
</italic>
<italic>GkCFPS</italic>
[
<italic>t</italic>
];</p>
<p>
<bold>6    foreach </bold>
<italic>i </italic>
∈ [
<italic>
<sub>f</sub>
, u
<sub>f </sub>
</italic>
[
<bold>  do</bold>
</p>
<p>
<bold>7 </bold>
      readIndex ←
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i23.gif"></inline-graphic>
</inline-formula>
;</p>
<p>
<bold>8 </bold>
      posInRead ←
<italic>g</italic>
<sup>-1</sup>
(
<italic>GkSA</italic>
[
<italic>i</italic>
]) mod
<italic>m</italic>
;</p>
<p>
<bold>9 </bold>
      Add the pair (readIndex, posInRead) to
<italic>Pos
<sub>k</sub>
</italic>
;</p>
<p>
<bold>10  return </bold>
(
<italic>Pos
<sub>k</sub>
</italic>
);</p>
<p>
<bold>Algorithm 4: </bold>
Q4 (The cardinality of
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
))</p>
<p>
<bold>Data</bold>
:
<italic>f </italic>
∈ ∑
<italic>
<sup>k</sup>
</italic>
,
<italic>j </italic>
<italic>P</italic>
<sub>pos </sub>
such that
<italic>C
<sub>R</sub>
</italic>
[
<italic>j . . j </italic>
+
<italic>k </italic>
- 1] =
<italic>f</italic>
</p>
<p>
<bold>Result</bold>
: The cardinality of
<italic>Pos
<sub>k</sub>
</italic>
(
<italic>f</italic>
)</p>
<p>
<bold>1 begin </bold>
//
<monospace>
<italic>GkCFA</italic>
</monospace>
[
<italic>t</italic>
] =
<monospace>
<italic>GkCFPS</italic>
</monospace>
[
<italic>t</italic>
] -
<monospace>
<italic>GkCFPS</italic>
</monospace>
[
<italic>t - </italic>
1]</p>
<p>
<bold>2 </bold>
   
<italic>t </italic>
<italic>GkIFA</italic>
[
<italic>j</italic>
];</p>
<p>
<bold>3    return </bold>
(
<italic>GkCFA</italic>
[
<italic>t</italic>
]);</p>
</sec>
</sec>
<sec>
<title>Practical considerations: implementation and variable read length</title>
<p>The value of
<italic>k</italic>
, which determines the length of
<italic>k</italic>
-mers used for querying the collection of reads, is a parameter of our index. However, the
<italic>Gk </italic>
arrays remain flexible. If for the simplicity of the presentation we have assumed until now that all reads have the same length, the whole structure can be adapted to a collection of reads having variable length. Indeed, since some sequencing technologies produce variable-length reads (
<italic>e.g</italic>
. Roche 454
<sup>®</sup>
), this adaptation is an important issue of versatility.</p>
<sec>
<title>Indexing variable-length reads</title>
<p>We show how our method can be slightly adapted to tackle this problem. Remind that the
<italic>Gk </italic>
arrays consider the string
<italic>C
<sub>R</sub>
</italic>
, the concatenation of all reads, and save place by discarding positions at which a
<italic>k</italic>
-mer overlaps two reads. This was done efficiently by converting any read position, or
<italic>P</italic>
-position, into a
<italic>Q</italic>
-position, and conversely, using function
<italic>g</italic>
. Up to now, this function relies on the fact that the read length is fixed. Thus, we need to modify its definition to accommodate different read lengths. For this, we use a bit vector
<italic>F</italic>
, as long as
<italic>C
<sub>R</sub>
</italic>
, to record which positions in
<italic>C
<sub>R </sub>
</italic>
are
<italic>P</italic>
-positions:
<italic>j </italic>
is a
<italic>P</italic>
-position iff
<italic>F</italic>
[
<italic>j</italic>
] = 1. We implement it as a vector having rank and select capabilities [
<xref ref-type="bibr" rid="B22">22</xref>
,
<xref ref-type="bibr" rid="B23">23</xref>
]. We define these operations as</p>
<p>• rank
<sub>1</sub>
(
<italic>F, i</italic>
) is the number of ones in
<italic>F</italic>
[0
<italic>..i</italic>
].</p>
<p>• select
<sub>1</sub>
(
<italic>F, i</italic>
) is the position of the
<italic>i</italic>
-th one in
<italic>F </italic>
(or
<italic>|F| </italic>
if there is less than
<italic>i </italic>
ones in
<italic>F</italic>
).</p>
<p>These operations can be performed in constant time, and
<italic>F </italic>
can be stored in a compressed form needing only
<italic>|F|H</italic>
<sub>0</sub>
(
<italic>F</italic>
) +
<italic>o</italic>
(
<italic>|F|</italic>
) bits, where
<italic>H</italic>
<sub>0 </sub>
is the zero-th order empirical entropy of
<italic>F</italic>
. Then computing
<italic>g</italic>
(
<italic>j</italic>
) and
<italic>g</italic>
<sup>-1</sup>
(
<italic>j</italic>
) can be easily performed with a single rank or select query. Indeed, we have
<italic>g</italic>
(
<italic>j</italic>
) = rank
<sub>1</sub>
(
<italic>F, j</italic>
), and
<italic>g</italic>
<sup>-1</sup>
(
<italic>j</italic>
) = select
<sub>1</sub>
(
<italic>F, j</italic>
). Finally, using little extra memory,
<italic>Gk </italic>
arrays can also handle variable-length reads.</p>
</sec>
<sec>
<title>Implementation</title>
<p>
<italic>Gk </italic>
arrays are available as a reusable C++ library under a Cecill C licence (GPL compliant). It accepts standard formats for the input read collection (FASTA, FASTQ). Depending on the number of
<italic>k</italic>
-mer positions, the user should turn on the 64 bit encoding at compilation. It allows to process data sets of more than 2
<sup>31 </sup>
positions. Default is set to 32 bit encoding. Another compilation option can be activated to handle variable-length reads (typically Roche 454
<sup>® </sup>
datasets), otherwise by default
<italic>Gk </italic>
arrays process fixed length reads.</p>
<p>The data structure construction and queries algorithms are coded in standard C and C++. To reduce memory consumption, the full SA of
<italic>C
<sub>R </sub>
</italic>
is built using libdivsufsort library
<ext-link ext-link-type="uri" xlink:href="https://code.google.com/p/libdivsufsort/">https://code.google.com/p/libdivsufsort/</ext-link>
, which was chosen for its efficiency and low memory usage (see
<ext-link ext-link-type="uri" xlink:href="https://code.google.com/p/libdivsufsort/wiki/SACA_Benchmarks">https://code.google.com/p/libdivsufsort/wiki/SACA_Benchmarks</ext-link>
for a benchmark of up-to-date SA construction algorithms). However, its worst case time complexity is not linear in the length of the input sequence. Also the sort of values in
<italic>GkSA </italic>
inside each range corresponding to one
<italic>P
<sub>k</sub>
</italic>
-factor is performed with the quicksort algorithm. A linear time construction of the array
<italic>GkIFA </italic>
is possible by using an LCP array (array storing the length of the Longest Common Prefixes between two consecutive suffixes in the lexicographic order). However, building this array would need at least 9
<italic>mq </italic>
bytes with Manzini's algorithm [
<xref ref-type="bibr" rid="B24">24</xref>
].</p>
<p>We implemented two versions of the
<italic>Gk </italic>
arrays: one which indexes only fixed-length reads, and another for variable-length reads. When not stated otherwise,
<italic>Gk </italic>
arrays refers to the implementation for fixed-length reads. For managing variable-length reads we used Sux
<ext-link ext-link-type="uri" xlink:href="http://sux.dsi.unimi.it/">http://sux.dsi.unimi.it/</ext-link>
, an implementation of bit vectors with rank and select operations.</p>
</sec>
</sec>
</sec>
<sec>
<title>Theoretical and experimental comparisons</title>
<p>The sequencing capacity of new technologies continues to improve. Managing ever increasing read collections will be a major bottleneck in the bioinformatic analysis of High Throughput Sequencing data. The
<italic>Gk </italic>
arrays implement one solution to read indexing. If plain, as well as compressed, indexing data structures have been described in the litterature (cf. Introduction), their ability to handle large read collections have not been investigated. As we seek to optimise in practice the memory consumption, the construction time, and query running time, we will compare
<italic>Gk </italic>
arrays to two other uncompressed indexes: a generalized SA (gSA) and hash tables. We choose these two alternatives for they represent different approaches to read indexing. Among the uncompressed text indexes that have been generalized to handle a set of texts, the gSA is reckoned to be one of the most memory efficient and has been preferred to hash tables or the suffix tree in other contexts [
<xref ref-type="bibr" rid="B9">9</xref>
,
<xref ref-type="bibr" rid="B25">25</xref>
]. On the other side, the optimisation of web search engines have triggered recent development of highly efficient hash tables, like Google sparse hash
<ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/google-sparsehash">http://code.google.com/p/google-sparsehash</ext-link>
or the hash tables from SGI extension of the C++ Standard Library
<ext-link ext-link-type="uri" xlink:href="http://www.sgi.com/tech/stl">http://www.sgi.com/tech/stl</ext-link>
. It is thus instructive to also compare
<italic>Gk </italic>
arrays to state of the art hash tables. As explained in Introduction, compressed indexes save memory but induce much longer running times to answer queries compared to plain indexes, and have been excluded from this comparison. Nevertheless, designing efficient compressed read indexes is a challenging future research avenue, which could be addressed by compressing the
<italic>Gk </italic>
arrays.</p>
<sec>
<title>A generalized Suffix Array (gSA) solution</title>
<p>We detail here the solution based on a generalized Suffix Array (gSA) to index a collection of reads, all reads having the same length. We call it the
<italic>gSA solution</italic>
. In fact it indexes the string made of the concatenation of all reads,
<italic>C
<sub>R</sub>
</italic>
. The preprocessing consists in building the generalized Suffix Array (gSA), the Inverse Suffix Array (ISA), and the Longest Common Prefixes (LCP) array of
<italic>C
<sub>R</sub>
</italic>
. The gSA is built using the same algorithm than for
<italic>Gk </italic>
arrays (libdivsufsort). The
<italic>ISA </italic>
is built by scanning the gSA in
<italic>mq </italic>
time, while the
<italic>LCP </italic>
array is also constructed in linear time using an efficient algorithm [
<xref ref-type="bibr" rid="B26">26</xref>
]. The tables are built in this order and add up in term of memory footprint.</p>
<p>In Figure
<xref ref-type="fig" rid="F3">3(a)</xref>
and
<xref ref-type="fig" rid="F3">3(b)</xref>
, we compare the time and space complexities of gSA and
<italic>Gk </italic>
arrays solutions. Since both start by building
<italic>gSA</italic>
(
<italic>C
<sub>R</sub>
</italic>
) and this is the dominant term of the time complexity, we obtain
<italic>O</italic>
(
<italic>mq</italic>
) time complexity: the space occupied during the construction of that table alone is 4.02
<italic>mq</italic>
, while it amounts to 4
<italic>mq </italic>
once built [
<xref ref-type="bibr" rid="B27">27</xref>
]. The last three columns of these tables show how the cumulated memory footprint evolves after each step during construction. We also monitored the memory footprint evolution during the construction of gSA and of
<italic>Gk </italic>
arrays and illustrate these graphically in Figures
<xref ref-type="fig" rid="F4">4(a)</xref>
and
<xref ref-type="fig" rid="F4">4(b)</xref>
, respectively. For the gSA the three tables add up in memory and each takes 4
<italic>mq </italic>
space. With
<italic>Gk </italic>
arrays</p>
<p>1. the
<italic>GkSA </italic>
table replaces
<italic>gSA</italic>
(
<italic>C
<sub>R</sub>
</italic>
) in memory and takes only
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i14.gif"></inline-graphic>
</inline-formula>
,</p>
<p>2.
<italic>GkIFA </italic>
takes an additional
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i14.gif"></inline-graphic>
</inline-formula>
while
<italic>GkCFA </italic>
occupies
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i15.gif"></inline-graphic>
</inline-formula>
with
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i2.gif"></inline-graphic>
</inline-formula>
denoting the number of distinct
<italic>P
<sub>k</sub>
</italic>
-factors, and</p>
<p>3. finally the
<italic>GkCFPS </italic>
replaces
<italic>GkCFA </italic>
and takes exactly the same space.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Comparing the complexities of the
<italic>Gk </italic>
arrays and generalized Suffix Array based solutions</bold>
. Comparing the complexities of
<italic>Gk </italic>
arrays and of the generalized Suffix Array solutions. A complexity is an expression that evaluates the running time or memory usage in function of parameters describing the input size. The construction time and space complexities of the index for
<italic>q </italic>
reads of length
<italic>m </italic>
having
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i2.gif"></inline-graphic>
</inline-formula>
distinct
<italic>k</italic>
-mers are given for the generalized SA in (a), and for the
<italic>Gk </italic>
arrays in (b). We detail the cumulative space complexity during the construction of the gSA, and after the main steps of the construction algorithms. I.e.: once the gSA, the ISA, and the LCP arrays are built in (a), and once
<italic>GkSA</italic>
,
<italic>GkIFA</italic>
, and
<italic>GkCFPS </italic>
are built in (b). In (c) we give the time complexities for answering queries Q1-Q7 with a
<italic>k</italic>
-mer denoted by
<italic>f</italic>
. The procedures for the gSA depends on
<italic>occ_C
<sub>R</sub>
(f)</italic>
, the occurrence number of
<italic>f </italic>
in the text made by the concatenation of all reads (
<italic>i.e</italic>
. in
<italic>C
<sub>R</sub>
</italic>
), while those for the
<italic>Gk </italic>
arrays depends on
<italic>occ_Reads(f)</italic>
, the occurrence number of
<italic>f </italic>
in all reads, and we know that
<italic>occ_Reads(f) </italic>
<italic>occ_C
<sub>R</sub>
(f)</italic>
.</p>
</caption>
<graphic xlink:href="1471-2105-12-242-3"></graphic>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Evolution of memory footprint during the construction of the generalized Suffix Array and of
<italic>Gk </italic>
arrays</bold>
. Evolution of memory footprint during the construction of the generalized Suffix Array (a) and of
<italic>Gk </italic>
arrays (b) when indexing 15 million 75 bp reads with
<italic>k </italic>
= 25.</p>
</caption>
<graphic xlink:href="1471-2105-12-242-4"></graphic>
</fig>
<p>In total, gSA takes 12
<italic>mq </italic>
bytes of memory, while
<italic>Gk </italic>
arrays occupy
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i16.gif"></inline-graphic>
</inline-formula>
bytes (with 32-bit integers), and
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i17.gif"></inline-graphic>
</inline-formula>
is smaller than
<italic>m</italic>
. This explains why the memory footprint of
<italic>Gk </italic>
arrays remains smaller in practice than that of gSA (Figures
<xref ref-type="fig" rid="F4">4(a)</xref>
and
<xref ref-type="fig" rid="F4">4(b)</xref>
), even for varying
<italic>k </italic>
values (see Figures
<xref ref-type="fig" rid="F5">5(a)</xref>
,
<xref ref-type="fig" rid="F4">4(a)</xref>
and
<xref ref-type="fig" rid="F4">4(b)</xref>
). Indeed, the gain of memory provided by
<italic>Gk </italic>
arrays increases with both
<italic>k </italic>
and
<italic>q</italic>
. If
<italic>k </italic>
is small, each
<italic>k</italic>
-mer tends to occur more in average, and thus
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i18.gif"></inline-graphic>
</inline-formula>
, meaning that
<italic>GkCFPS </italic>
is much smaller than the LCP array. If
<italic>k </italic>
is large then 4(
<italic>m</italic>
-
<italic>k</italic>
+1)
<italic>q </italic>
≪ 4
<italic>mq </italic>
and thus,
<italic>GkSA </italic>
plus
<italic>GkIFA </italic>
tables occupy much less place than the gSA and ISA tables. This constitutes, in almost all cases, a saving of at least 12(
<italic>k </italic>
- 1)
<italic>q </italic>
bytes.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Memory and construction time comparison between the Suffix Array solution, the hash table and the
<italic>Gk </italic>
arrays</bold>
. Memory and construction time comparison between the generalized Suffix Array solution (gSA), the hash tables (HT) and the
<italic>Gk </italic>
arrays. K562 dataset is used for that experiment, with 5 million to 25 million reads. The length of
<italic>k</italic>
-mers ranges from 15 to 30. gSA plots have been shifted left and HT plots have been shifted right for easing the reading. (a) Maximal memory usage while constructing the index and querying it. The error bars represent the space consumption depending on the value of
<italic>k</italic>
. (b) Construction time for the three indexes on the same data as for the maximal memory consumption. The levels of gray on the plots represent the value of
<italic>k</italic>
.</p>
</caption>
<graphic xlink:href="1471-2105-12-242-5"></graphic>
</fig>
<p>Locating a
<italic>k</italic>
-mer in the reads can be done with a binary search in
<italic>O</italic>
(
<italic>k </italic>
+ log
<italic>qm</italic>
) worst case time with gSA using SA and LCP arrays and
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i19.gif"></inline-graphic>
</inline-formula>
worst case time with
<italic>Gk </italic>
arrays using
<italic>GkSA</italic>
. (We recall that the binary search is the standard procedure in this context [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B9">9</xref>
]).</p>
<p>However, Manber and Myers [
<xref ref-type="bibr" rid="B8">8</xref>
] mentioned that a simple improvement over the classical binary search (namely remembering the minimum length between the longest common prefix of the left and middle elements and the longest common prefix of the right and middle elements at each step of the binary search) permits to run in practice as fast as a
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i20.gif"></inline-graphic>
</inline-formula>
worst case method (see also [
<xref ref-type="bibr" rid="B9">9</xref>
] Section 7.14.3 page 152).</p>
<p>Thus, starting from a
<italic>k</italic>
-mer, rather than from a position, when answering the queries will bring an overhead similar in practice for the gSA and
<italic>Gk </italic>
arrays.</p>
<p>
<bold>Algorithm 5: </bold>
Q1 (
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
)) with the generalised Suffix Array solution</p>
<p>
<bold>Data</bold>
:
<italic>f </italic>
∈ ∑
<italic>
<sup>k</sup>
</italic>
,
<italic>j </italic>
<italic>P</italic>
<sub>pos </sub>
such that
<italic>C
<sub>R</sub>
</italic>
[
<italic>j .. j </italic>
+
<italic>k - </italic>
1] =
<italic>f</italic>
</p>
<p>
<bold>Result</bold>
: The set
<italic>Ind
<sub>k</sub>
</italic>
(
<italic>f</italic>
)</p>
<p>1 begin</p>
<p>
<bold>2 </bold>
   
<italic>Ind
<sub>k </sub>
</italic>
← empty set;</p>
<p>
<bold>3 </bold>
   Initialize the whole bit vector,
<italic>D</italic>
, to zero;</p>
<p>
<bold>4 </bold>
   
<italic>i </italic>
<italic>ISA</italic>
[
<italic>j</italic>
];//
<monospace>starting position of</monospace>
<italic> f </italic>
<monospace>occurrences in SA</monospace>
</p>
<p>5    repeat</p>
<p>
<bold>6       if </bold>
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i21.gif"></inline-graphic>
</inline-formula>
<bold> then</bold>
</p>
<p>         //
<monospace>the occurrence position does not overlap two reads</monospace>
</p>
<p>
<bold>7 </bold>
         readIndex ←
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i24.gif"></inline-graphic>
</inline-formula>
;</p>
<p>
<bold>8          if </bold>
<italic>D</italic>
[
<italic>readIndex</italic>
] ≠ 1
<bold>then</bold>
</p>
<p>            //
<monospace>we have not found an occurrence in this read yet</monospace>
</p>
<p>
<bold>9 </bold>
            Add readIndex to
<italic>Ind
<sub>k</sub>
</italic>
;</p>
<p>
<bold>10 </bold>
          
<italic>D</italic>
[readIndex] ← 1;</p>
<p>
<bold>11 </bold>
    
<italic>i </italic>
<italic>i </italic>
+ 1;</p>
<p>
<bold>12   until </bold>
(
<italic>i </italic>
<italic>qm</italic>
)
<italic>or </italic>
(
<italic>LCP</italic>
[
<italic>SA</italic>
[
<italic>i</italic>
],
<italic>SA</italic>
[
<italic>i </italic>
+ 1]]
<italic>< k</italic>
);</p>
<p>
<bold>13   return </bold>
(
<italic>Ind
<sub>k</sub>
</italic>
);</p>
<p>Nevertheless although we consider the same input, a position
<italic>j </italic>
of occurrence of the
<italic>k</italic>
-mer in a read, answering queries differ between the
<italic>Gk </italic>
arrays and gSA solutions. Indeed, since the gSA stores all positions in
<italic>C
<sub>R</sub>
</italic>
, we need to filter out positions of
<italic>k</italic>
-mers that overlap two reads in
<italic>C
<sub>R </sub>
</italic>
to keep only
<italic>P</italic>
-positions. This adds instructions to the procedure compared to that for the
<italic>Gk </italic>
arrays: see line 6 in Algorithm 5, which gives the algorithm for query Q1 with the gSA. For answering queries Q1 and Q2, we must perform another slight modification: we use a binary mask for dealing with duplicate
<italic>k</italic>
-mers in a same read. This mask is stored in a binary vector
<italic>B </italic>
having
<italic>q </italic>
bits, one bit per read. The bit corresponding to a read is set to one whenever the
<italic>k</italic>
-mer has been found to occur in that read, and subsequent occurrence positions in that read will be filtered out if the corresponding bit is set (lines 8 and 10 in Algorithm 5).</p>
<p>Assume we query on a
<italic>k</italic>
-mer
<italic>f </italic>
from one of its occurrence position
<italic>j</italic>
. Let us denote by
<italic>occ_C
<sub>R</sub>
(f) </italic>
the number of occurrences of
<italic>f </italic>
in
<italic>C
<sub>R</sub>
</italic>
, including those overlapping two reads (
<italic>i.e</italic>
., starting at non
<italic>P</italic>
-positions), and by
<italic>occ_Reads(f) </italic>
the number of its occurrences that are totally included in a read (
<italic>i.e</italic>
., those starting at
<italic>P</italic>
-positions). For Q1/Q2, Q5-Q7, we obtain with gSA a complexity of
<italic>O</italic>
(
<italic>q </italic>
+
<italic>occ_C
<sub>R</sub>
(f)</italic>
) since one initializes the bit vector
<italic>B </italic>
of size
<italic>q </italic>
and scan all
<italic>occ_C
<sub>R</sub>
(f) </italic>
occurrences. While with
<italic>Gk </italic>
arrays, the complexity depends linearly on
<italic>occ_Reads(f) </italic>
and we know that
<italic>occ_Reads(f) </italic>
<italic>occ_C
<sub>R</sub>
(f)</italic>
.</p>
<p>For Q3/Q4, there is no need of a bit vector with the gSA method, hence their complexity is
<italic>O</italic>
(
<italic>occ_C
<sub>R</sub>
(f)</italic>
), for one needs to scan positions in the gSA using the ISA and the LCP arrays. However,
<italic>Gk </italic>
arrays offer a complexity of
<italic>O</italic>
(
<italic>occ_Reads(f)</italic>
) for Q3 and
<italic>O</italic>
(1) for Q4. We summarize all queries time complexities in Figure
<xref ref-type="fig" rid="F3">3(c)</xref>
.</p>
<p>
<bold>Remark 3</bold>
.
<italic>To avoid scanning occ_C
<sub>R</sub>
(f) entries, an alternative solution consists in delimiting reads inside C
<sub>R </sub>
using a separator. This solution would lead to a space overhead of q bytes for lowering the time complexity to occ_Reads(f). However we did not retain this solution since our goal is to diminish the space complexity and this solution would not improve much the time complexity</italic>
.</p>
</sec>
<sec>
<title>A solution based on a hash table</title>
<p>An alternative solution is to index all
<italic>k</italic>
-mers in a hash table and to store for each read the list of its occurrence positions in the read collection. This list will contain pairs of integers: the read index in the collection, and the starting position of the
<italic>k</italic>
-mer in that read. The read index can be stored on a 32-bit integer, while a 16-bit integer suffices for the starting position. In such a case, storing the text is not necessary. The number of entries is the number of distinct
<italic>k</italic>
-mers in the read collection,
<italic>i.e</italic>
. our parameter
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i2.gif"></inline-graphic>
</inline-formula>
. Generally,
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i2.gif"></inline-graphic>
</inline-formula>
is small compared to 4
<italic>
<sup>k </sup>
</italic>
for values of
<italic>k </italic>
in [15,60]. Hence the hash table will be sparsely populated. We tried several implementation of state of the art hash tables: the Google sparse and dense hash arrays, and that from SGI extension of the C++ Standard Library (called hash map).</p>
<p>Preliminary experiments have shown that Google sparse requires significantly much longer to build than SGI hash map, while having a lower memory footprint. With 20 million 75 bp reads, Google sparse hash occupies one third of the memory needed by the SGI hash map, but it takes thrice more time to build. On the contrary Google dense hash tables takes twice more memory, and offers only similar construction time. Hence, SGI extension hash map exhibited the best compromise in term of memory consumption and construction time compared to Google implementations. Thus, we choose SGI extension implementation for the comparison with
<italic>Gk </italic>
arrays.</p>
</sec>
<sec>
<title>Experimental settings</title>
<p>We tested index structures on three datasets.</p>
<p>1. We used a collection of 40 million Illumina
<sup>® </sup>
RNA-Seq reads of length 75 from a human K562 library taken from the RGASP data (Accession number
<monospace>GM12878</monospace>
at
<ext-link ext-link-type="uri" xlink:href="http://www.gencodegenes.org/rgasp">http://www.gencodegenes.org/rgasp</ext-link>
with permission from B. Wold). We call it the K562 dataset.</p>
<p>2. We compiled several lanes of Roche 454
<sup>® </sup>
genomic sequencing to obtain a collection of 2.8 million reads ranging from [40,3000] bp with an average read length of 523 bp. These were sequenced on a Roche 454
<sup>® </sup>
GS FLX platform with Titanium chemistry for the Khoisan genome project [
<xref ref-type="bibr" rid="B28">28</xref>
]. We call it the Khoisan dataset.</p>
<p>3. As much longer fixed length reads are not yet available, we constructed a collection of fixed length reads by slicing the Khoisan reads in non-overlapping pieces of 150 bp. We obtained 25 millions of 150 bp reads, a read length that will soon be generated on High Throughput Sequencing platforms.</p>
<p>In the first and third collections, reads have a fixed length, while in the second their length varies. The experiments were performed on an Intel Xeon 2.27 GHz equipped with 48 GB of main memory, and running Linux 2.6.18 with C++ compiled using
<monospace>gcc</monospace>
version 3.4.6 and
<monospace>-02 -funroll-loops</monospace>
options.</p>
</sec>
<sec>
<title>Experimental comparison</title>
<p>The use of read indexing raises three questions: how much computing resources does the index demand? Is it scalable? How fast can it answer large number of queries? Clearly the resources will depend on the number of reads (parameter
<italic>q</italic>
), their lengths and on the length of
<italic>k</italic>
-mers (parameter
<italic>k</italic>
). We compare three solutions: a hash table (HT), a generalized Suffix Array (gSA), and
<italic>Gk </italic>
arrays.</p>
<sec>
<title>Scalability</title>
<p>We measured the construction time and amount of memory taken by all solutions for various numbers of reads and
<italic>k</italic>
-mer lengths. Figure
<xref ref-type="fig" rid="F5">5(a)</xref>
plots the maximal memory footprint on K562 data. At this scale, the value of
<italic>k </italic>
impacts only the hash table size; its influence on the gSA and
<italic>Gk </italic>
arrays is not visible on that graph. Second, the solutions can be ordered as follows:
<italic>Gk </italic>
arrays take the less memory, followed by the gSA, and then the hash table. This order is irrespective of the read number. For
<italic>k </italic>
= 20
<italic>e.g</italic>
.,
<italic>Gk </italic>
arrays use 10 GB, the gSA uses 20, and the HT 44, and the curves clearly indicate that these differences increase with the number of reads. Whatever the value of
<italic>q</italic>
, the hash table requires twice as much memory as the gSA, which itself takes at least 70% more memory than
<italic>Gk </italic>
arrays. With 25 million reads the hash table saturates the memory, with 30 million the gSA also does, while the
<italic>Gk </italic>
arrays constitute the only solution able to index the whole collection, 40 million reads, on that computer. Note that in both cases, the 64-bit implementation of gSA and
<italic>Gk </italic>
arrays have to be used to index that amount of reads. For the whole read collection,
<italic>Gk </italic>
arrays needs at most 43 GB (
<italic>k </italic>
= 15) and at least 36 GB (
<italic>k </italic>
= 30).</p>
<p>For all solutions, construction times increase linearly with the number of reads as expected (Figure
<xref ref-type="fig" rid="F5">5(b)</xref>
). It remains very similar between the gSA and
<italic>Gk </italic>
arrays, which both takes
<italic>e.g. <</italic>
1000 s. for 25 million reads. The influence of
<italic>k </italic>
is clearly visible on the hash table for 20 million reads: its construction time decreases with
<italic>k </italic>
because the parameter
<inline-formula>
<inline-graphic xlink:href="1471-2105-12-242-i2.gif"></inline-graphic>
</inline-formula>
also does (for a given number of reads). As long as they fit in memory, all compared solutions offer practical construction times.</p>
<p>We examined the behavior of
<italic>Gk </italic>
arrays on much longer reads, 150 bp, when variable-length read option is activated and when it is not. Figure
<xref ref-type="fig" rid="F6">6(a)</xref>
plots space consumption, while Figure
<xref ref-type="fig" rid="F6">6(b)</xref>
records the construction time for both options.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Comparing
<italic>Gk </italic>
arrays with fixed and variable length reads</bold>
. Experiments on the sliced Khoisan dataset (150 bp-reads) with
<italic>Gk </italic>
arrays (fixed-length reads and variable-length reads). (a) Maximal space consumption of each index for several read numbers and values of
<italic>k </italic>
(the error bars represent the variation in space usage, depending on
<italic>k</italic>
). (b) Index construction time (the lighter gray corresponds to the smaller
<italic>k</italic>
). (c) Query computation time for 13,000,000 reads depending on different
<italic>k </italic>
values.</p>
</caption>
<graphic xlink:href="1471-2105-12-242-6"></graphic>
</fig>
<p>We see that adding a bit vector is not space consuming since there is little difference between the two methods (Figure
<xref ref-type="fig" rid="F6">6(a)</xref>
). For 13 million reads, the difference is, at most, of 300 MB between the two methods. In Figure
<xref ref-type="fig" rid="F6">6(b)</xref>
, we plotted the construction time for both indexes. The variable length read implementation becomes slower when the number of reads grows, compared to the fixed length
<italic>Gk </italic>
arrays. This shows that despite a constant-time theoretical complexity for rank and select operations; there is a dependency on the length of the bit vector in practice. However, the construction time remains reasonable in the variable case.</p>
<p>Figures
<xref ref-type="fig" rid="F7">7(a)</xref>
and
<xref ref-type="fig" rid="F7">7(b)</xref>
plot space and time measured for the hash table and
<italic>Gk </italic>
arrays (with variable length reads option) on the Khoisan read collection. The gSA has not been implemented to handle variable length reads; note that the relative cost would have been similar to that observed with
<italic>Gk </italic>
arrays between fixed and variable read length options. Here for one million reads, variable length
<italic>Gk </italic>
arrays require 470 s. to build vs 428 s. for the hash table, but 8 times less memory (5.5 vs 46 GB). The difference increases strongly with the read number. Above one million reads, the memory footprint of the hash table exceeds the computer memory (which is 48 GB), while
<italic>Gk </italic>
arrays index the complete collection of 2.8 million reads on the same hardware with
<italic><</italic>
15.6 GB. Hash tables appear to be more space consuming on the Khoisan dataset than on the K562 dataset. This can be explained by the nature of the data. Roche 454
<sup>® </sup>
sequencers offer a coverage depth much lower than Illumina's. Hence the number of distinct
<italic>k</italic>
-factors in the reads is likely to be greater with the Khoisan dataset.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>Comparing hash table with
<italic>Gk </italic>
arrays on variable-length reads</bold>
. Experiments on the Khoisan Roche 454
<sup>® </sup>
dataset (variable length reads) with a hash table and
<italic>Gk </italic>
arrays (with variable-length read option). (a) Maximal space consumption of each index for several read numbers and values of
<italic>k </italic>
(the error bars represent the variation in space usage, depending on
<italic>k</italic>
). The lower space consumption corresponds to a lower
<italic>k</italic>
. (b) Index construction time (the lighter gray corresponds to the smaller
<italic>k</italic>
). (c) Query computation time for 600,000 reads depending on the value of
<italic>k</italic>
.</p>
</caption>
<graphic xlink:href="1471-2105-12-242-7"></graphic>
</fig>
</sec>
<sec>
<title>Answering queries</title>
<p>We measured the mean time needed to answer 100,000 random queries of Q1-Q4. Since Q5-Q7 are slight variations of Q1-Q3 we do not report on these queries.</p>
<p>Figure
<xref ref-type="fig" rid="F8">8</xref>
shows how the mean time for each solution vary with the number of indexed reads (
<italic>q</italic>
) and
<italic>k </italic>
on the K562 collection. Clearly, the influence of
<italic>q </italic>
is similar for all solutions, and small compared to the differences between solutions. Generally, gSA takes always longer than the hash table irrespective of the query type, and it also takes longer than
<italic>Gk </italic>
arrays for Q1-Q2 and Q4, and a similar time for Q3. The order between the hash table and
<italic>Gk </italic>
arrays depends on the query type. They are equally fast on Q1, the hash table does slightly better on Q2, clearly better on Q3, while
<italic>Gk </italic>
arrays is much faster on Q4. Anyway, for both the hash table and
<italic>Gk </italic>
arrays, the mean running time is in the order of or less than 10 microseconds for Q1-Q3, and around 0.1 microsecond for Q4 with the
<italic>Gk </italic>
arrays, meaning reasonable practical times.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>
<bold>Queries' running time comparison between the Suffix Array solution, the hash table and the
<italic>Gk </italic>
arrays</bold>
. Queries' running time comparison between the Suffix Array solution, the hash table and the
<italic>Gk </italic>
arrays. Answering the queries Q1/Q2/Q3/Q4 on K562 dataset (75 bp reads). The plots represent the average time in
<italic>μ</italic>
s over the same 100,000 queries of the corresponding type (
<italic>i.e</italic>
., Q1, Q2, Q3, or Q4). In all cases, the running time decreases when
<italic>k </italic>
increases for there are less occurrences in the reads of a say 30-mer than of a 15-mer. The
<italic>Gk </italic>
arrays are always faster than the Suffix Array; they even compute Q4 in constant time.</p>
</caption>
<graphic xlink:href="1471-2105-12-242-8"></graphic>
</fig>
<p>For our comparison of
<italic>Gk </italic>
arrays with fixed or variable length read options, we see that the latter is becoming slower than the former (up to 7 times slower) when
<italic>k </italic>
is small,
<italic>i.e</italic>
. when the number of occurrences of
<italic>k</italic>
-mers is large. With larger
<italic>k</italic>
, the query time of the latter diminishes and becomes 2 to 3 times slower than with fixed
<italic>Gk </italic>
arrays.</p>
<p>With variable length reads (Figure
<xref ref-type="fig" rid="F7">7(c)</xref>
) the query times remain practical, but the hash table needs between 1 and 32 fold less time than
<italic>Gk </italic>
arrays depending on the query.</p>
<p>In summary, under various conditions
<italic>Gk </italic>
arrays are equivalent in construction time to a generalized Suffix Array or to a hash table. Compared to these solutions, they also offer reasonable query times under all circumstances; however,
<italic>Gk </italic>
arrays clearly outperform them in terms of memory footprint, the main bottleneck for processing High Throughput Sequencing data.</p>
</sec>
</sec>
</sec>
</sec>
<sec>
<title>Conclusions</title>
<p>As High Throughput Sequencing becomes widespread, computational biology will face the challenge of managing astronomical quantities of short sequences. Mining such amount of sequences is feasible if the sequences are indexed in a preprocessing step. An index is a data structure that, like a telephone book, enables one to find easily a piece of information. For some value
<italic>k</italic>
, it records the positions of all
<italic>k</italic>
-mers in the reads in an organized fashion to minimize the memory usage. Then finding the reads related to some
<italic>k</italic>
-mer takes as long as reading the
<italic>k</italic>
-mer and listing the corresponding reads, but not as long as scanning all the reads. In other words, read indexing factorizes the results of searches, which later speeds up the numerous queries made while the index is kept in memory. Our main contribution is to propose such an index: the
<italic>Gk </italic>
arrays. They are fast to build, require less space than alternative uncompressed solutions, and can thus handle larger read collections: 40 million vs 20 million reads for the hash tables with a memory limited to 48 GB. It is a key issue in practice.</p>
<p>While being comparable to hash tables in terms of time efficiency, only the
<italic>Gk </italic>
arrays can completely index a large read collection (like the K562 dataset) with a memory size available on nowadays computing servers. Moreover, our index remains fast for a wide range of values of parameter
<italic>k </italic>
(the length of
<italic>k</italic>
-mers). We have also shown that
<italic>Gk </italic>
arrays are both faster and smaller than an alternative generalized Suffix Array approach. Similarly, on variable-length reads like a Roche 454
<sup>® </sup>
dataset,
<italic>Gk </italic>
arrays can handle the whole read collection using less than 16 GB while hash tables are limited to a smaller sub-collection (about 1 million reads) on a 48 GB machine.</p>
<p>The
<italic>Gk </italic>
arrays answer efficiently different types of queries, but they have been optimised for queries where the searched
<italic>k</italic>
-mer is extracted from an indexed read. Sometimes one wishes to know for a given
<italic>k</italic>
-mer the reads in which it occurs and its positions inside those (
<italic>e.g</italic>
. assembly), while in other contexts one only wants the number of reads sharing this
<italic>k</italic>
-mer (
<italic>e.g</italic>
. estimation of expression level). Moreover,
<italic>Gk </italic>
arrays adapt well to variable length reads. Their scalability and versatility are key advantages, which allows to envisage multiple applications as mentioned in Introduction. However, scaling up to gigantic datasets (terabytes of data), as the ones obtained in large metagenomic projects, will require compressed read indexes. The simplicity of use of our index, and its implementation as a C++ library make it a software brick that can be easily exploited in future programs or further developed by the community.</p>
<p>For mapping reads on a reference sequence, solutions exist that index reads with hash tables [
<xref ref-type="bibr" rid="B6">6</xref>
,
<xref ref-type="bibr" rid="B29">29</xref>
]. For the error correction problem, other works have indexed reads with classical text indexing solutions: with a generalized suffix trie [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
], a suffix array [
<xref ref-type="bibr" rid="B31">31</xref>
], or hash tables [
<xref ref-type="bibr" rid="B32">32</xref>
].
<italic>Gk </italic>
arrays represent a first, attractive read indexing solution; it is specialised for this question and should suit different applications. Nevertheless, one can envisage several research perspectives. Indexing approximate
<italic>k</italic>
-mers or spaced seeds will authorize more types of queries, but will certainly increase the construction time and space requirements. Designing a dynamic construction algorithm for
<italic>Gk </italic>
arrays would futher enlarge their range of applications. Another challenge is to compress
<italic>Gk </italic>
arrays by storing sampled positions and recomputing other positions at run time, as done with the Burrows Wheeler transform [
<xref ref-type="bibr" rid="B5">5</xref>
]. This would enable the user to adapt the index to its computer memory, while sacrificing some of its performance.</p>
</sec>
<sec>
<title>List of abbreviations used</title>
<p>High Throughput Sequencing: HTS; RNA: ribonucleic acid; mRNA: messenger RNA; RNA-Seq: RNA sequencing; ChIP-Seq: Chromatin ImmunoPrecipitation and sequencing; SA: Suffix Array; gSA: generalized SA; LCP: Longest Common Prefix; SNP: Single Nucleotide Polymorphism; bp: base pairs; iff: if and only if.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>All authors have designed the algorithm and contributed to the writing of the manuscript. NP and MS have developed the code. NP, MS, TL, ER have performed the experiments. ER supervised the manuscript redaction and submission. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional File 1</title>
<p>
<bold>Proof and queries' algorithms</bold>
.</p>
</caption>
<media xlink:href="1471-2105-12-242-S1.PDF" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements and Funding</title>
<p>This work is supported by a CNRS PEPS grant "Bioinformatique, séquençage haut-débit et transcrits chimères en cancérologie", a CNRS PICS grant, a BioStic grant, the Region Languedoc Roussillon, and the ATGC bioinformatics platform. MS and NP were supported by fellowships from the French Ministry of Research, and NP benefits from a fellowship from the Ligue contre le cancer. ER thanks Dortmund University for the Dortmunder Gambrinus Fellowship. We gratefully thanks A. Mancheron for packaging the
<italic>Gk </italic>
arrays library.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Maher</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kumar-Sinha</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Kalyana-Sundaram</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Jing</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Sam</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Barrette</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Palanisamy</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Chinnaiyan</surname>
<given-names>A</given-names>
</name>
<article-title>Transcriptome sequencing to detect gene fusions in cancer</article-title>
<source>Nature</source>
<year>2009</year>
<volume>458</volume>
<issue>7234</issue>
<fpage>97</fpage>
<lpage>101</lpage>
<pub-id pub-id-type="doi">10.1038/nature07638</pub-id>
<pub-id pub-id-type="pmid">19136943</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Blow</surname>
<given-names>N</given-names>
</name>
<article-title>Transcriptomics: The digital generation</article-title>
<source>Nature</source>
<year>2009</year>
<volume>458</volume>
<fpage>239</fpage>
<lpage>242</lpage>
<pub-id pub-id-type="doi">10.1038/458239a</pub-id>
<pub-id pub-id-type="pmid">19279641</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>D</given-names>
</name>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="book">
<name>
<surname>Burkhardt</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Crauser</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Lenhof</surname>
<given-names>HP</given-names>
</name>
<name>
<surname>Rivals</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Vingron</surname>
<given-names>M</given-names>
</name>
<article-title>
<italic>q</italic>
-gram Based Database Searching Using a Suffix Array (QUASAR)</article-title>
<source>3rd Annual Int Conf on Computational Molecular Biology</source>
<year>1999</year>
<publisher-name>ACM Press</publisher-name>
<fpage>77</fpage>
<lpage>83</lpage>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="other">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
<article-title>Opportunistic data structures with applications</article-title>
<source>Proc of FOCS</source>
<year>2000</year>
<fpage>390</fpage>
<lpage>398</lpage>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Hach</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Eichler</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Sahinalp</surname>
<given-names>S</given-names>
</name>
<article-title>mrsFAST: a cache-oblivious algorithm for short-read mapping</article-title>
<source>Nat Methods</source>
<year>2010</year>
<volume>7</volume>
<issue>8</issue>
<fpage>576</fpage>
<lpage>577</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth0810-576</pub-id>
<pub-id pub-id-type="pmid">20676076</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="other">
<name>
<surname>Weiner</surname>
<given-names>P</given-names>
</name>
<article-title>Linear Pattern Matching Algorithms</article-title>
<source>Conf Record of the 14th Annual Symposium on Swithcing and Automata Theory</source>
<year>1973</year>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="book">
<name>
<surname>Manber</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>GW</given-names>
</name>
<article-title>Suffix Arrays: A New Method for On-Line String Searches</article-title>
<source>Proceedings of the first annual ACM-SIAM Symposium on Discrete Algorithms</source>
<year>1990</year>
<publisher-name>San-Francisco: SIAM</publisher-name>
<fpage>319</fpage>
<lpage>327</lpage>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="book">
<name>
<surname>Gusfield</surname>
<given-names>D</given-names>
</name>
<source>Algorithms on Strings, Trees and Sequences</source>
<year>1997</year>
<publisher-name>Cambridge University Press</publisher-name>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="book">
<name>
<surname>Shi</surname>
<given-names>F</given-names>
</name>
<person-group person-group-type="editor">Jaffar J, Yap RHC</person-group>
<article-title>Suffix Arrays for Multiple Strings: A Method for On-Line Multiple String Searches</article-title>
<source>ASIAN, Volume 1179 of Lecture Notes in Computer Science</source>
<year>1996</year>
<publisher-name>Springer</publisher-name>
<fpage>11</fpage>
<lpage>22</lpage>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>González</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Venturini</surname>
<given-names>R</given-names>
</name>
<article-title>Compressed text indexes: From theory to practice</article-title>
<source>J Experimental Algorithmics</source>
<year>2009</year>
<volume>13</volume>
<fpage>12:1.12</fpage>
<lpage>12:1.31</lpage>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>14</issue>
<fpage>1754</fpage>
<lpage>1760</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp324</pub-id>
<pub-id pub-id-type="pmid">19451168</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Homann</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Fleer</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Giegerich</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rehmsmeier</surname>
<given-names>M</given-names>
</name>
<article-title>mkESA: enhanced suffix array construction tool</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>8</issue>
<fpage>1084</fpage>
<lpage>1085</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp112</pub-id>
<pub-id pub-id-type="pmid">19246510</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Philippe</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Boureux</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tarhio</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bréhélin</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Commes</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Rivals</surname>
<given-names>E</given-names>
</name>
<article-title>Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity</article-title>
<source>Nucleic Acids Res</source>
<year>2009</year>
<volume>37</volume>
<issue>15</issue>
<fpage>e104</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkp492</pub-id>
<pub-id pub-id-type="pmid">19531739</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Salmela</surname>
<given-names>L</given-names>
</name>
<article-title>Correction of sequencing errors in a mixed set of reads</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>10</issue>
<fpage>1284</fpage>
<lpage>1290</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq151</pub-id>
<pub-id pub-id-type="pmid">20378555</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Denoeud</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Aury</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Da Silva</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Noel</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Rogier</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Delledonne</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Morgante</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Valle</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wincker</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Scarpelli</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Jaillon</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Artiguenave</surname>
<given-names>F</given-names>
</name>
<article-title>Annotating genomes with massive-scale RNA sequencing</article-title>
<source>Genome Biol</source>
<year>2008</year>
<volume>9</volume>
<issue>12</issue>
<fpage>R175</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2008-9-12-r175</pub-id>
<pub-id pub-id-type="pmid">19087247</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mortazavi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kwan</surname>
<given-names>G</given-names>
</name>
<name>
<surname>van Baren</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Wold</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Pachter</surname>
<given-names>L</given-names>
</name>
<article-title>Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation</article-title>
<source>Nat Biotech</source>
<year>2010</year>
<volume>28</volume>
<issue>5</issue>
<fpage>511</fpage>
<lpage>515</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1621</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Miller</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G</given-names>
</name>
<article-title>Assembly algorithms for next-generation sequencing data</article-title>
<source>Genomics</source>
<year>2010</year>
<volume>95</volume>
<issue>6</issue>
<fpage>315</fpage>
<lpage>327</lpage>
<pub-id pub-id-type="doi">10.1016/j.ygeno.2010.03.001</pub-id>
<pub-id pub-id-type="pmid">20211242</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Conway</surname>
<given-names>TC</given-names>
</name>
<name>
<surname>Bromage</surname>
<given-names>AJ</given-names>
</name>
<article-title>Succinct Data Structures for Assembling Large Genomes</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>4</issue>
<fpage>479</fpage>
<lpage>486</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq697</pub-id>
<pub-id pub-id-type="pmid">21245053</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>770</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="book">
<name>
<surname>Cormen</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Leiserson</surname>
<given-names>CE</given-names>
</name>
<name>
<surname>Rivest</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>C</given-names>
</name>
<source>Introduction to Algorithms</source>
<year>2001</year>
<edition>2</edition>
<publisher-name>MIT Press</publisher-name>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="other">
<name>
<surname>Munro</surname>
<given-names>I</given-names>
</name>
<article-title>Tables</article-title>
<source>Proc. of Foundations of Software Technology and Theoretical Computer Science (FSTTCS), Volume 1180 of Lecture Notes in Computer Science, Springer</source>
<year>1996</year>
<fpage>37</fpage>
<lpage>42</lpage>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="other">
<name>
<surname>Raman</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Raman</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Rao</surname>
<given-names>S</given-names>
</name>
<article-title>Succinct indexable dictionaries with applications to encoding
<italic>k</italic>
-ary trees and multisets</article-title>
<source>Proc of Symposium on Discrete Algorithms (SODA)</source>
<year>2002</year>
<fpage>233</fpage>
<lpage>242</lpage>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
<article-title>Two Space Saving Tricks for Linear Time LCP Array Computation</article-title>
<source>Proc 9th Scandinavian Workshop on Algorithm Theory</source>
<year>2004</year>
<volume>3111</volume>
<fpage>372</fpage>
<lpage>383</lpage>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Smoot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shumway</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonescu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
<article-title>Versatile and open software for comparing large genomes</article-title>
<source>Genome Biol</source>
<year>2004</year>
<volume>5</volume>
<issue>2</issue>
<fpage>R12</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2004-5-2-r12</pub-id>
<pub-id pub-id-type="pmid">14759262</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="other">
<name>
<surname>Kasai</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Arimura</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Arikawa</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>K</given-names>
</name>
<article-title>Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications</article-title>
<source>Proc of the 12th Symposium on Combinatorial Pattern Matching, Volume 2089 of Lecture Notes in Computer Science, Springer</source>
<year>2001</year>
<fpage>181</fpage>
<lpage>192</lpage>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Puglisi</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Smyth</surname>
<given-names>WF</given-names>
</name>
<name>
<surname>Turpin</surname>
<given-names>A</given-names>
</name>
<article-title>A taxonomy of suffix array construction algorithms</article-title>
<source>ACM Comp Surv</source>
<year>2007</year>
<volume>39</volume>
<issue>2</issue>
<fpage>1</fpage>
<lpage>31</lpage>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Ratan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tomsho</surname>
<given-names>LP</given-names>
</name>
<name>
<surname>Giardine</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Kasson</surname>
<given-names>LR</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Petersen</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Alkan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kidd</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Drautz</surname>
<given-names>DI</given-names>
</name>
<name>
<surname>Bouffard</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Muzny</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Reid</surname>
<given-names>JG</given-names>
</name>
<name>
<surname>Nazareth</surname>
<given-names>LV</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Burhans</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Riemer</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Wittekindt</surname>
<given-names>NE</given-names>
</name>
<name>
<surname>Moorjani</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Tindall</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Danko</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Teo</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Buboltz</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Oosthuysen</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Steenkamp</surname>
<given-names>AW</given-names>
</name>
<name>
<surname>Oostuisen</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Venter</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gajewski</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Pugh</surname>
<given-names>BF</given-names>
</name>
<name>
<surname>Makova</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Nekrutenko</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mardis</surname>
<given-names>ER</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Pringle</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Chiaromonte</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Mullikin</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Eichler</surname>
<given-names>EE</given-names>
</name>
<name>
<surname>Hardison</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Gibbs</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Harkins</surname>
<given-names>TT</given-names>
</name>
<name>
<surname>Hayes</surname>
<given-names>VM</given-names>
</name>
<article-title>Complete Khoisan and Bantu genomes from southern Africa</article-title>
<source>Nature</source>
<year>2010</year>
<volume>463</volume>
<issue>7283</issue>
<fpage>943</fpage>
<lpage>947</lpage>
<pub-id pub-id-type="doi">10.1038/nature08795</pub-id>
<pub-id pub-id-type="pmid">20164927</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Homer</surname>
<given-names>N</given-names>
</name>
<article-title>A survey of sequence alignment algorithms for next-generation sequencing</article-title>
<source>Brief Bioinf</source>
<year>2010</year>
<volume>11</volume>
<issue>5</issue>
<fpage>473</fpage>
<lpage>483</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbq015</pub-id>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Puglisi</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
<article-title>SHREC: a short-read error correction method</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>17</issue>
<fpage>2157</fpage>
<lpage>2163</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp379</pub-id>
<pub-id pub-id-type="pmid">19542152</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<name>
<surname>Ilie</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Fazayeli</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Ilie</surname>
<given-names>S</given-names>
</name>
<article-title>HiTEC: accurate error correction in high-throughput sequencing data</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>3</issue>
<fpage>295</fpage>
<lpage>302</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq653</pub-id>
<pub-id pub-id-type="pmid">21115437</pub-id>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Salmela</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<article-title>Correcting errors in short reads by multiple alignments</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>11</issue>
<fpage>1455</fpage>
<lpage>1461</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr170</pub-id>
<pub-id pub-id-type="pmid">21471014</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A91 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000A91 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3163563
   |texte=   Querying large read collections in main memory: a versatile data structure
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:21682852" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021