Exploration server for open access in Belgium

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 000406 (Pmc/Corpus); previous: 0004059; next: 0004070

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Prospects and limitations of full-text index structures in genome analysis</title>
<author>
<name sortKey="Vyverman, Michael" sort="Vyverman, Michael" uniqKey="Vyverman M" first="Michaël" last="Vyverman">Michaël Vyverman</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gks408-AFF1">Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="De Baets, Bernard" sort="De Baets, Bernard" uniqKey="De Baets B" first="Bernard" last="De Baets">Bernard De Baets</name>
<affiliation>
<nlm:aff id="gks408-AFF1">Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, 653 Coupure links, Ghent, B-9000, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fack, Veerle" sort="Fack, Veerle" uniqKey="Fack V" first="Veerle" last="Fack">Veerle Fack</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gks408-AFF1">Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Dawyndt, Peter" sort="Dawyndt, Peter" uniqKey="Dawyndt P" first="Peter" last="Dawyndt">Peter Dawyndt</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gks408-AFF1">Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22584621</idno>
<idno type="pmc">3424560</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3424560</idno>
<idno type="RBID">PMC:3424560</idno>
<idno type="doi">10.1093/nar/gks408</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000406</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Prospects and limitations of full-text index structures in genome analysis</title>
<author>
<name sortKey="Vyverman, Michael" sort="Vyverman, Michael" uniqKey="Vyverman M" first="Michaël" last="Vyverman">Michaël Vyverman</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gks408-AFF1">Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="De Baets, Bernard" sort="De Baets, Bernard" uniqKey="De Baets B" first="Bernard" last="De Baets">Bernard De Baets</name>
<affiliation>
<nlm:aff id="gks408-AFF1">Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, 653 Coupure links, Ghent, B-9000, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fack, Veerle" sort="Fack, Veerle" uniqKey="Fack V" first="Veerle" last="Fack">Veerle Fack</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gks408-AFF1">Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Dawyndt, Peter" sort="Dawyndt, Peter" uniqKey="Dawyndt P" first="Peter" last="Dawyndt">Peter Dawyndt</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gks408-AFF1">Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Nucleic Acids Research</title>
<idno type="ISSN">0305-1048</idno>
<idno type="eISSN">1362-4962</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Felsenstein, J" uniqKey="Felsenstein J">J Felsenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, S" uniqKey="Altschul S">S Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, E" uniqKey="Myers E">E Myers</name>
</author>
<author>
<name sortKey="Lipman, D" uniqKey="Lipman D">D Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flicek, P" uniqKey="Flicek P">P Flicek</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hoffmann, S" uniqKey="Hoffmann S">S Hoffmann</name>
</author>
<author>
<name sortKey="Otto, C" uniqKey="Otto C">C Otto</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Sharma, C" uniqKey="Sharma C">C Sharma</name>
</author>
<author>
<name sortKey="Khaitovich, P" uniqKey="Khaitovich P">P Khaitovich</name>
</author>
<author>
<name sortKey="Vogel, J" uniqKey="Vogel J">J Vogel</name>
</author>
<author>
<name sortKey="Stadler, P" uniqKey="Stadler P">P Stadler</name>
</author>
<author>
<name sortKey="Hackermuller, J" uniqKey="Hackermuller J">J Hackermüller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lam, T" uniqKey="Lam T">T Lam</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Tam, A" uniqKey="Tam A">A Tam</name>
</author>
<author>
<name sortKey="Wong, S" uniqKey="Wong S">S Wong</name>
</author>
<author>
<name sortKey="Wu, E" uniqKey="Wu E">E Wu</name>
</author>
<author>
<name sortKey="Yiu, S" uniqKey="Yiu S">S Yiu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Yu, C" uniqKey="Yu C">C Yu</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Lam, T" uniqKey="Lam T">T Lam</name>
</author>
<author>
<name sortKey="Yiu, S" uniqKey="Yiu S">S Yiu</name>
</author>
<author>
<name sortKey="Kristiansen, K" uniqKey="Kristiansen K">K Kristiansen</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Phillippy, A" uniqKey="Phillippy A">A Phillippy</name>
</author>
<author>
<name sortKey="Delcher, A" uniqKey="Delcher A">A Delcher</name>
</author>
<author>
<name sortKey="Smoot, M" uniqKey="Smoot M">M Smoot</name>
</author>
<author>
<name sortKey="Shumway, M" uniqKey="Shumway M">M Shumway</name>
</author>
<author>
<name sortKey="Antonescu, C" uniqKey="Antonescu C">C Antonescu</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schatz, M" uniqKey="Schatz M">M Schatz</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Delcher, A" uniqKey="Delcher A">A Delcher</name>
</author>
<author>
<name sortKey="Varshney, A" uniqKey="Varshney A">A Varshney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Askitis, N" uniqKey="Askitis N">N Askitis</name>
</author>
<author>
<name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schroder, J" uniqKey="Schroder J">J Schröder</name>
</author>
<author>
<name sortKey="Schroder, H" uniqKey="Schroder H">H Schröder</name>
</author>
<author>
<name sortKey="Puglisi, S" uniqKey="Puglisi S">S Puglisi</name>
</author>
<author>
<name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
<author>
<name sortKey="Schmidt, B" uniqKey="Schmidt B">B Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Z" uniqKey="Zhao Z">Z Zhao</name>
</author>
<author>
<name sortKey="Yin, J" uniqKey="Yin J">J Yin</name>
</author>
<author>
<name sortKey="Zhan, Y" uniqKey="Zhan Y">Y Zhan</name>
</author>
<author>
<name sortKey="Xiong, W" uniqKey="Xiong W">W Xiong</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Liu, F" uniqKey="Liu F">F Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Conway, T" uniqKey="Conway T">T Conway</name>
</author>
<author>
<name sortKey="Bromage, A" uniqKey="Bromage A">A Bromage</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hernandez, D" uniqKey="Hernandez D">D Hernandez</name>
</author>
<author>
<name sortKey="Francois, P" uniqKey="Francois P">P Francois</name>
</author>
<author>
<name sortKey="Farinelli, L" uniqKey="Farinelli L">L Farinelli</name>
</author>
<author>
<name sortKey="Osteras, M" uniqKey="Osteras M">M Osteras</name>
</author>
<author>
<name sortKey="Schrenzel, J" uniqKey="Schrenzel J">J Schrenzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, J" uniqKey="Simpson J">J Simpson</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author>
<name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abouelhoda, M" uniqKey="Abouelhoda M">M Abouelhoda</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Meyer, F" uniqKey="Meyer F">F Meyer</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Backofen, R" uniqKey="Backofen R">R Backofen</name>
</author>
<author>
<name sortKey="Will, S" uniqKey="Will S">S Will</name>
</author>
<author>
<name sortKey="Beckstette, M" uniqKey="Beckstette M">M Beckstette</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Iliopoulos, C" uniqKey="Iliopoulos C">C Iliopoulos</name>
</author>
<author>
<name sortKey="Makris, C" uniqKey="Makris C">C Makris</name>
</author>
<author>
<name sortKey="Panagis, Y" uniqKey="Panagis Y">Y Panagis</name>
</author>
<author>
<name sortKey="Perdikuri, K" uniqKey="Perdikuri K">K Perdikuri</name>
</author>
<author>
<name sortKey="Theodoridis, E" uniqKey="Theodoridis E">E Theodoridis</name>
</author>
<author>
<name sortKey="Tsakalidis, A" uniqKey="Tsakalidis A">A Tsakalidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shibuya, T" uniqKey="Shibuya T">T Shibuya</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hon, W" uniqKey="Hon W">W Hon</name>
</author>
<author>
<name sortKey="Patil, M" uniqKey="Patil M">M Patil</name>
</author>
<author>
<name sortKey="Shah, R" uniqKey="Shah R">R Shah</name>
</author>
<author>
<name sortKey="Thankachan, S" uniqKey="Thankachan S">S Thankachan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jacobs, A" uniqKey="Jacobs A">A Jacobs</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blumer, A" uniqKey="Blumer A">A Blumer</name>
</author>
<author>
<name sortKey="Blumer, J" uniqKey="Blumer J">J Blumer</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
<author>
<name sortKey="Mcconnell, R" uniqKey="Mcconnell R">R McConnell</name>
</author>
<author>
<name sortKey="Ehrenfeucht, A" uniqKey="Ehrenfeucht A">A Ehrenfeucht</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bayer, R" uniqKey="Bayer R">R Bayer</name>
</author>
<author>
<name sortKey="Mccreight, E" uniqKey="Mccreight E">E McCreight</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weiner, P" uniqKey="Weiner P">P Weiner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gusfield, D" uniqKey="Gusfield D">D Gusfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morrison, D" uniqKey="Morrison D">D Morrison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boyer, R" uniqKey="Boyer R">R Boyer</name>
</author>
<author>
<name sortKey="Moore, J" uniqKey="Moore J">J Moore</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Knuth, D" uniqKey="Knuth D">D Knuth</name>
</author>
<author>
<name sortKey="Morris, J" uniqKey="Morris J">J Morris</name>
</author>
<author>
<name sortKey="Pratt, V" uniqKey="Pratt V">V Pratt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Giegerich, R" uniqKey="Giegerich R">R Giegerich</name>
</author>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mccreight, E" uniqKey="Mccreight E">E McCreight</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manber, U" uniqKey="Manber U">U Manber</name>
</author>
<author>
<name sortKey="Myers, E" uniqKey="Myers E">E Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
<author>
<name sortKey="Gog, S" uniqKey="Gog S">S Gog</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kasai, T" uniqKey="Kasai T">T Kasai</name>
</author>
<author>
<name sortKey="Lee, G" uniqKey="Lee G">G Lee</name>
</author>
<author>
<name sortKey="Arimura, H" uniqKey="Arimura H">H Arimura</name>
</author>
<author>
<name sortKey="Arikawa, S" uniqKey="Arikawa S">S Arikawa</name>
</author>
<author>
<name sortKey="Park, K" uniqKey="Park K">K Park</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grimsmo, N" uniqKey="Grimsmo N">N Grimsmo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fischer, J" uniqKey="Fischer J">J Fischer</name>
</author>
<author>
<name sortKey="Heun, V" uniqKey="Heun V">V Heun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author>
<name sortKey="Kim, M" uniqKey="Kim M">M Kim</name>
</author>
<author>
<name sortKey="Park, H" uniqKey="Park H">H Park</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Makinen, V" uniqKey="Makinen V">V Mäkinen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burrows, M" uniqKey="Burrows M">M Burrows</name>
</author>
<author>
<name sortKey="Wheeler, D" uniqKey="Wheeler D">D Wheeler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Makinen, V" uniqKey="Makinen V">V Mäkinen</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
<author>
<name sortKey="Makinen, V" uniqKey="Makinen V">V Mäkinen</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Przywarski, R" uniqKey="Przywarski R">R Przywarski</name>
</author>
<author>
<name sortKey="Salinger, A" uniqKey="Salinger A">A Salinger</name>
</author>
<author>
<name sortKey="Makinen, V" uniqKey="Makinen V">V Mäkinen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Adjeroh, D" uniqKey="Adjeroh D">D Adjeroh</name>
</author>
<author>
<name sortKey="Bell, T" uniqKey="Bell T">T Bell</name>
</author>
<author>
<name sortKey="Mukherjee, A" uniqKey="Mukherjee A">A Mukherjee</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hon, W" uniqKey="Hon W">W Hon</name>
</author>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
<author>
<name sortKey="Sung, W" uniqKey="Sung W">W Sung</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arlazarov, V" uniqKey="Arlazarov V">V Arlazarov</name>
</author>
<author>
<name sortKey="Dinic, E" uniqKey="Dinic E">E Dinic</name>
</author>
<author>
<name sortKey="Kronrod, M" uniqKey="Kronrod M">M Kronrod</name>
</author>
<author>
<name sortKey="Faradzev, I" uniqKey="Faradzev I">I Faradzev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Gonzalez, R" uniqKey="Gonzalez R">R González</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Venturini, R" uniqKey="Venturini R">R Venturini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Russo, L" uniqKey="Russo L">L Russo</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Oliveira, A" uniqKey="Oliveira A">A Oliveira</name>
</author>
<author>
<name sortKey="Morales, P" uniqKey="Morales P">P Morales</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arroyuelo, D" uniqKey="Arroyuelo D">D Arroyuelo</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Valimaki, N" uniqKey="Valimaki N">N Välimäki</name>
</author>
<author>
<name sortKey="Makinen, V" uniqKey="Makinen V">V Mäkinen</name>
</author>
<author>
<name sortKey="Gerlach, W" uniqKey="Gerlach W">W Gerlach</name>
</author>
<author>
<name sortKey="Dixit, K" uniqKey="Dixit K">K Dixit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
<author>
<name sortKey="Gupta, A" uniqKey="Gupta A">A Gupta</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foschini, L" uniqKey="Foschini L">L Foschini</name>
</author>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
<author>
<name sortKey="Gupta, A" uniqKey="Gupta A">A Gupta</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Russo, L" uniqKey="Russo L">L Russo</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Oliveira, A" uniqKey="Oliveira A">A Oliveira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Canovas, R" uniqKey="Canovas R">R Cánovas</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
<author>
<name sortKey="Shibuya, T" uniqKey="Shibuya T">T Shibuya</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karkkainen, J" uniqKey="Karkkainen J">J Kärkkäinen</name>
</author>
<author>
<name sortKey="Ukkonen, E" uniqKey="Ukkonen E">E Ukkonen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Fischer, J" uniqKey="Fischer J">J Fischer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Khan, Z" uniqKey="Khan Z">Z Khan</name>
</author>
<author>
<name sortKey="Bloom, J" uniqKey="Bloom J">J Bloom</name>
</author>
<author>
<name sortKey="Kruglyak, L" uniqKey="Kruglyak L">L Kruglyak</name>
</author>
<author>
<name sortKey="Singh, M" uniqKey="Singh M">M Singh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kulekci, M" uniqKey="Kulekci M">M Külekci</name>
</author>
<author>
<name sortKey="Hon, W" uniqKey="Hon W">W Hon</name>
</author>
<author>
<name sortKey="Shah, R" uniqKey="Shah R">R Shah</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
<author>
<name sortKey="Xu, B" uniqKey="Xu B">B Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barsky, M" uniqKey="Barsky M">M Barsky</name>
</author>
<author>
<name sortKey="Stege, U" uniqKey="Stege U">U Stege</name>
</author>
<author>
<name sortKey="Thomo, A" uniqKey="Thomo A">A Thomo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marin, M" uniqKey="Marin M">M Marín</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Andersson, A" uniqKey="Andersson A">A Andersson</name>
</author>
<author>
<name sortKey="Larsson, N" uniqKey="Larsson N">N Larsson</name>
</author>
<author>
<name sortKey="Swanson, K" uniqKey="Swanson K">K Swanson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Inenaga, S" uniqKey="Inenaga S">S Inenaga</name>
</author>
<author>
<name sortKey="Takeda, M" uniqKey="Takeda M">M Takeda</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Transier, F" uniqKey="Transier F">F Transier</name>
</author>
<author>
<name sortKey="Sanders, P" uniqKey="Sanders P">P Sanders</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Puglisi, S" uniqKey="Puglisi S">S Puglisi</name>
</author>
<author>
<name sortKey="Smyth, W" uniqKey="Smyth W">W Smyth</name>
</author>
<author>
<name sortKey="Turpin, A" uniqKey="Turpin A">A Turpin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manzini, M" uniqKey="Manzini M">M Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raman, R" uniqKey="Raman R">R Raman</name>
</author>
<author>
<name sortKey="Raman, V" uniqKey="Raman V">V Raman</name>
</author>
<author>
<name sortKey="Rao, S" uniqKey="Rao S">S Rao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Claude, F" uniqKey="Claude F">F Claude</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Okanohara, D" uniqKey="Okanohara D">D Okanohara</name>
</author>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kulekci, M" uniqKey="Kulekci M">M Külekci</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
<author>
<name sortKey="Xu, B" uniqKey="Xu B">B Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jacobson, G" uniqKey="Jacobson G">G Jacobson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arroyuelo, D" uniqKey="Arroyuelo D">D Arroyuelo</name>
</author>
<author>
<name sortKey="Canovas, R" uniqKey="Canovas R">R Cánovas</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Archie, J" uniqKey="Archie J">J Archie</name>
</author>
<author>
<name sortKey="Day, W" uniqKey="Day W">W Day</name>
</author>
<author>
<name sortKey="Felsenstein, J" uniqKey="Felsenstein J">J Felsenstein</name>
</author>
<author>
<name sortKey="Maddison, W" uniqKey="Maddison W">W Maddison</name>
</author>
<author>
<name sortKey="Meacham, C" uniqKey="Meacham C">C Meacham</name>
</author>
<author>
<name sortKey="Rohlf, F" uniqKey="Rohlf F">F Rohlf</name>
</author>
<author>
<name sortKey="Swofford, D" uniqKey="Swofford D">D Swofford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gog, S" uniqKey="Gog S">S Gog</name>
</author>
<author>
<name sortKey="Fischer, J" uniqKey="Fischer J">J Fischer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Berkman, O" uniqKey="Berkman O">O Berkman</name>
</author>
<author>
<name sortKey="Vishkin, U" uniqKey="Vishkin U">U Vishkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Makinen, V" uniqKey="Makinen V">V Mäkinen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Makinen, V" uniqKey="Makinen V">V Mäkinen</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gonzalez, R" uniqKey="Gonzalez R">R González</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Elias, P" uniqKey="Elias P">P Elias</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karkkainen, J" uniqKey="Karkkainen J">J Kärkkäinen</name>
</author>
<author>
<name sortKey="Ukkonen, E" uniqKey="Ukkonen E">E Ukkonen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lempel, A" uniqKey="Lempel A">A Lempel</name>
</author>
<author>
<name sortKey="Ziv, J" uniqKey="Ziv J">J Ziv</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ziv, J" uniqKey="Ziv J">J Ziv</name>
</author>
<author>
<name sortKey="Lempel, A" uniqKey="Lempel A">A Lempel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arroyuelo, D" uniqKey="Arroyuelo D">D Arroyuelo</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Russo, L" uniqKey="Russo L">L Russo</name>
</author>
<author>
<name sortKey="Oliveira, A" uniqKey="Oliveira A">A Oliveira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
<author>
<name sortKey="Gog, S" uniqKey="Gog S">S Gog</name>
</author>
<author>
<name sortKey="Kugel, A" uniqKey="Kugel A">A Kügel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
<author>
<name sortKey="Gog, S" uniqKey="Gog S">S Gog</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ohlebusch, E" uniqKey="Ohlebusch E">E Ohlebusch</name>
</author>
<author>
<name sortKey="Fischer, J" uniqKey="Fischer J">J Fischer</name>
</author>
<author>
<name sortKey="Gog, S" uniqKey="Gog S">S Gog</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fischer, J" uniqKey="Fischer J">J Fischer</name>
</author>
<author>
<name sortKey="M Kinen, V" uniqKey="M Kinen V">V Mäkinen</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hon, W" uniqKey="Hon W">W Hon</name>
</author>
<author>
<name sortKey="Shah, R" uniqKey="Shah R">R Shah</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Baeza Yates, R" uniqKey="Baeza Yates R">R Baeza-Yates</name>
</author>
<author>
<name sortKey="Barbosa, E" uniqKey="Barbosa E">E Barbosa</name>
</author>
<author>
<name sortKey="Ziviani, N" uniqKey="Ziviani N">N Ziviani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sinha, R" uniqKey="Sinha R">R Sinha</name>
</author>
<author>
<name sortKey="Puglisi, S" uniqKey="Puglisi S">S Puglisi</name>
</author>
<author>
<name sortKey="Moffat, A" uniqKey="Moffat A">A Moffat</name>
</author>
<author>
<name sortKey="Turpin, A" uniqKey="Turpin A">A Turpin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Grossi, R" uniqKey="Grossi R">R Grossi</name>
</author>
<author>
<name sortKey="Gupta, A" uniqKey="Gupta A">A Gupta</name>
</author>
<author>
<name sortKey="Shah, R" uniqKey="Shah R">R Shah</name>
</author>
<author>
<name sortKey="Vitter, Js" uniqKey="Vitter J">JS Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hunt, E" uniqKey="Hunt E">E Hunt</name>
</author>
<author>
<name sortKey="Atkinson, M" uniqKey="Atkinson M">M Atkinson</name>
</author>
<author>
<name sortKey="Irving, R" uniqKey="Irving R">R Irving</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bedathur, S" uniqKey="Bedathur S">S Bedathur</name>
</author>
<author>
<name sortKey="Haritsa, J" uniqKey="Haritsa J">J Haritsa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Clark, D" uniqKey="Clark D">D Clark</name>
</author>
<author>
<name sortKey="Munro, J" uniqKey="Munro J">J Munro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tian, Y" uniqKey="Tian Y">Y Tian</name>
</author>
<author>
<name sortKey="Tata, S" uniqKey="Tata S">S Tata</name>
</author>
<author>
<name sortKey="Hankins, R" uniqKey="Hankins R">R Hankins</name>
</author>
<author>
<name sortKey="Patel, J" uniqKey="Patel J">J Patel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Phoophakdee, B" uniqKey="Phoophakdee B">B Phoophakdee</name>
</author>
<author>
<name sortKey="Zaki, M" uniqKey="Zaki M">M Zaki</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ghoting, A" uniqKey="Ghoting A">A Ghoting</name>
</author>
<author>
<name sortKey="Makarychev, K" uniqKey="Makarychev K">K Makarychev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Clifford, R" uniqKey="Clifford R">R Clifford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bedathur, S" uniqKey="Bedathur S">S Bedathur</name>
</author>
<author>
<name sortKey="Haritsa, J" uniqKey="Haritsa J">J Haritsa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brodal, G" uniqKey="Brodal G">G Brodal</name>
</author>
<author>
<name sortKey="Fagerberg, R" uniqKey="Fagerberg R">R Fagerberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barsky, M" uniqKey="Barsky M">M Barsky</name>
</author>
<author>
<name sortKey="Stege, U" uniqKey="Stege U">U Stege</name>
</author>
<author>
<name sortKey="Thomo, A" uniqKey="Thomo A">A Thomo</name>
</author>
<author>
<name sortKey="Upton, C" uniqKey="Upton C">C Upton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barsky, M" uniqKey="Barsky M">M Barsky</name>
</author>
<author>
<name sortKey="Stege, U" uniqKey="Stege U">U Stege</name>
</author>
<author>
<name sortKey="Thomo, A" uniqKey="Thomo A">A Thomo</name>
</author>
<author>
<name sortKey="Upton, C" uniqKey="Upton C">C Upton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Halachev, M" uniqKey="Halachev M">M Halachev</name>
</author>
<author>
<name sortKey="Shiri, N" uniqKey="Shiri N">N Shiri</name>
</author>
<author>
<name sortKey="Thamildurai, A" uniqKey="Thamildurai A">A Thamildurai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="M Kinen, V" uniqKey="M Kinen V">V Mäkinen</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Sadakane, K" uniqKey="Sadakane K">K Sadakane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gonzalez, R" uniqKey="Gonzalez R">R González</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arroyuelo, D" uniqKey="Arroyuelo D">D Arroyuelo</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Russo, L" uniqKey="Russo L">L Russo</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Oliveira, A" uniqKey="Oliveira A">A Oliveira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chien, Y" uniqKey="Chien Y">Y Chien</name>
</author>
<author>
<name sortKey="Hon, W" uniqKey="Hon W">W Hon</name>
</author>
<author>
<name sortKey="Shah, R" uniqKey="Shah R">R Shah</name>
</author>
<author>
<name sortKey="Vitter, J" uniqKey="Vitter J">J Vitter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ukkonen, E" uniqKey="Ukkonen E">E Ukkonen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Maa, M" uniqKey="Maa M">M Maaß</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Puglisi, S" uniqKey="Puglisi S">S Puglisi</name>
</author>
<author>
<name sortKey="Smyth, W" uniqKey="Smyth W">W Smyth</name>
</author>
<author>
<name sortKey="Turpin, A" uniqKey="Turpin A">A Turpin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="K Rkk Inen, J" uniqKey="K Rkk Inen J">J Kärkkäinen</name>
</author>
<author>
<name sortKey="Sanders, P" uniqKey="Sanders P">P Sanders</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="K Rkk Inen, J" uniqKey="K Rkk Inen J">J Kärkkäinen</name>
</author>
<author>
<name sortKey="Sanders, P" uniqKey="Sanders P">P Sanders</name>
</author>
<author>
<name sortKey="Burkhardt, S" uniqKey="Burkhardt S">S Burkhardt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kulla, F" uniqKey="Kulla F">F Kulla</name>
</author>
<author>
<name sortKey="Sanders, P" uniqKey="Sanders P">P Sanders</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
<author>
<name sortKey="Gagie, T" uniqKey="Gagie T">T Gagie</name>
</author>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manzini, G" uniqKey="Manzini G">G Manzini</name>
</author>
<author>
<name sortKey="Ferragina, P" uniqKey="Ferragina P">P Ferragina</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nong, G" uniqKey="Nong G">G Nong</name>
</author>
<author>
<name sortKey="Zhang, S" uniqKey="Zhang S">S Zhang</name>
</author>
<author>
<name sortKey="Chan, W" uniqKey="Chan W">W Chan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="K Rkk Inen, J" uniqKey="K Rkk Inen J">J Kärkkäinen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Siren, J" uniqKey="Siren J">J Sirén</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Menon, R" uniqKey="Menon R">R Menon</name>
</author>
<author>
<name sortKey="Bhat, G" uniqKey="Bhat G">G Bhat</name>
</author>
<author>
<name sortKey="Schatz, M" uniqKey="Schatz M">M Schatz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arroyuelo, D" uniqKey="Arroyuelo D">D Arroyuelo</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Doring, A" uniqKey="Doring A">A Döring</name>
</author>
<author>
<name sortKey="Weese, D" uniqKey="Weese D">D Weese</name>
</author>
<author>
<name sortKey="Rausch, T" uniqKey="Rausch T">T Rausch</name>
</author>
<author>
<name sortKey="Reinert, K" uniqKey="Reinert K">K Reinert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Loh, W" uniqKey="Loh W">W Loh</name>
</author>
<author>
<name sortKey="Moon, Y" uniqKey="Moon Y">Y Moon</name>
</author>
<author>
<name sortKey="Lee, W" uniqKey="Lee W">W Lee</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsirogiannis, D" uniqKey="Tsirogiannis D">D Tsirogiannis</name>
</author>
<author>
<name sortKey="Koudas, N" uniqKey="Koudas N">N Koudas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fernandez, E" uniqKey="Fernandez E">E Fernandez</name>
</author>
<author>
<name sortKey="Najjar, W" uniqKey="Najjar W">W Najjar</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Schatz, M" uniqKey="Schatz M">M Schatz</name>
</author>
<author>
<name sortKey="Lin, J" uniqKey="Lin J">J Lin</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salson, M" uniqKey="Salson M">M Salson</name>
</author>
<author>
<name sortKey="Lecroq, T" uniqKey="Lecroq T">T Lecroq</name>
</author>
<author>
<name sortKey="Leonard, M" uniqKey="Leonard M">M Léonard</name>
</author>
<author>
<name sortKey="Mouchard, L" uniqKey="Mouchard L">L Mouchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salson, M" uniqKey="Salson M">M Salson</name>
</author>
<author>
<name sortKey="Lecroq, T" uniqKey="Lecroq T">T Lecroq</name>
</author>
<author>
<name sortKey="Leonard, M" uniqKey="Leonard M">M Léonard</name>
</author>
<author>
<name sortKey="Mouchard, L" uniqKey="Mouchard L">L Mouchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="M Kinen, V" uniqKey="M Kinen V">V Mäkinen</name>
</author>
<author>
<name sortKey="Navarro, G" uniqKey="Navarro G">G Navarro</name>
</author>
<author>
<name sortKey="Siren, J" uniqKey="Siren J">J Sirén</name>
</author>
<author>
<name sortKey="V Lim Ki, N" uniqKey="V Lim Ki N">N Välimäki</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuruppu, S" uniqKey="Kuruppu S">S Kuruppu</name>
</author>
<author>
<name sortKey="Puglisi, S" uniqKey="Puglisi S">S Puglisi</name>
</author>
<author>
<name sortKey="Zobel, J" uniqKey="Zobel J">J Zobel</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="iso-abbrev">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="publisher-id">nar</journal-id>
<journal-id journal-id-type="hwp">nar</journal-id>
<journal-title-group>
<journal-title>Nucleic Acids Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0305-1048</issn>
<issn pub-type="epub">1362-4962</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22584621</article-id>
<article-id pub-id-type="pmc">3424560</article-id>
<article-id pub-id-type="doi">10.1093/nar/gks408</article-id>
<article-id pub-id-type="publisher-id">gks408</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Survey and Summary</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Prospects and limitations of full-text index structures in genome analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Vyverman</surname>
<given-names>Michaël</given-names>
</name>
<xref ref-type="aff" rid="gks408-AFF1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="gks408-COR1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>De Baets</surname>
<given-names>Bernard</given-names>
</name>
<xref ref-type="aff" rid="gks408-AFF1">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Fack</surname>
<given-names>Veerle</given-names>
</name>
<xref ref-type="aff" rid="gks408-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Dawyndt</surname>
<given-names>Peter</given-names>
</name>
<xref ref-type="aff" rid="gks408-AFF1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group>
<aff id="gks408-AFF1">
<sup>1</sup>
Department of Applied Mathematics and Computer Science, Ghent University, Building S9, 281 Krijgslaan and
<sup>2</sup>
Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, 653 Coupure links, Ghent, B-9000, Belgium</aff>
<author-notes>
<corresp id="gks408-COR1">*To whom correspondence should be addressed. Tel:
<phone>+32 9264 47 66</phone>
; Fax:
<fax>+32 9264 49 95</fax>
; Email:
<email>michael.vyverman@ugent.be</email>
</corresp>
</author-notes>
<pub-date pub-type="collection">
<month>8</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="ppub">
<month>8</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>12</day>
<month>5</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>12</day>
<month>5</month>
<year>2012</year>
</pub-date>
<volume>40</volume>
<issue>15</issue>
<fpage>6993</fpage>
<lpage>7015</lpage>
<history>
<date date-type="received">
<day>30</day>
<month>1</month>
<year>2012</year>
</date>
<date date-type="rev-recd">
<day>16</day>
<month>4</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>19</day>
<month>4</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2012. Published by Oxford University Press.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">http://creativecommons.org/licenses/by-nc/3.0</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article presents a comprehensive, state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, their interrelationships, the trade-offs they impose and their practical limitations are explained and compared.</p>
</abstract>
<counts>
<page-count count="23"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>INTRODUCTION</title>
<p>Developments in sequencing technology continue to produce data at higher speed and lower cost. The resulting sequence data form a large fraction of the data processed in life sciences research. For example,
<italic>de novo</italic>
genome assembly joins relatively short DNA fragments together into longer contigs based on overlapping regions, whereas in RNA-seq experiments, cDNA is mapped to a reference genome or transcriptome. Further down the analysis pipeline, DNA and protein sequences are aligned to one another and similarity between aligned sequences is estimated to infer phylogenies (
<xref ref-type="bibr" rid="gks408-B1">1</xref>
). Although the type of sequences and applications varies widely, they all require basic string operations, most notably search operations. Given the sheer number and size of the sequences under consideration and the number of search operations required, efficient search algorithms are important components of genome analysis pipelines. For this reason, specialized data structures, generally bundled under the term ‘index structures’, are required to speed up string searching.</p>
<p>The use of specialized algorithms and data structures is motivated by the fact that the data flow has already surpassed the flow of advances in computer hardware and storage capabilities. However, although index structures are already widely used to speed up bioinformatics applications, they too are challenged by the recent data flood. Index structures require an initial construction phase and impose extra storage requirements. In return, they provide a wide variety of efficient string searching algorithms. Traditionally, this has led to a dichotomy between search efficiency and reduced memory consumption. However, recent advances in index structures have shown that compression and fast string searching can be achieved simultaneously using a combination of compression and indexing, thus solving this dichotomy (
<xref ref-type="bibr" rid="gks408-B2">2</xref>
).</p>
<p>There are many types of index structures. The most commonly known index structures are inverted indexes and lookup tables. These work in a similar way to the indexes found at the back of books. However, biological sequences generally lack a clear division in words or phrases, a prerequisite for inverted indexes to function properly. Two alternative index structures are used in bioinformatics applications.
<italic>k</italic>
-mer indexes divide sequences into substrings of fixed length
<italic>k</italic>
and are used, among others, in the BLAST (
<xref ref-type="bibr" rid="gks408-B3">3</xref>
) alignment tool. ‘Full-text indexes’, on the other hand, allow fast access to substrings of any length. Full-text indexes come at a greater memory and construction cost compared with
<italic>k</italic>
-mer indexes and are also far more complex. However, they contain much more information and allow for faster and more flexible string searching algorithms (
<xref ref-type="bibr" rid="gks408-B4">4</xref>
).</p>
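<p>As a rough illustration of the k-mer index idea described above, the following Python sketch records every start position of each substring of length k in a dictionary. It is not taken from BLAST or any other tool; the function names build_kmer_index and lookup and the example string are hypothetical. It also makes the limitation visible: only patterns of exactly length k can be looked up directly, whereas a full-text index supports substrings of any length.</p>
<preformat>
```python
from collections import defaultdict


def build_kmer_index(text, k):
    """Map every length-k substring of text to the list of its start positions."""
    index = defaultdict(list)
    # There are len(text) - k + 1 windows of length k (none if the text is too short).
    for start in range(len(text) - k + 1):
        index[text[start:start + k]].append(start)
    return dict(index)


def lookup(index, kmer):
    """Return all start positions of kmer, or an empty list if it does not occur."""
    return index.get(kmer, [])


if __name__ == "__main__":
    index = build_kmer_index("ACGTACGA", 3)
    print(lookup(index, "ACG"))  # start positions of ACG
    print(lookup(index, "TTT"))  # a k-mer that does not occur
```
</preformat>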
<p>Full-text index structures are widely used and crucial black-box components of many bioinformatics applications. Their success is illustrated by the number of bioinformatics tools that currently use them. Examples are tools for short read mapping (
<xref ref-type="bibr" rid="gks408-B5">5</xref>
–
<xref ref-type="bibr" rid="gks408-B9">9</xref>
), alignment (
<xref ref-type="bibr" rid="gks408-B10">10</xref>
,
<xref ref-type="bibr" rid="gks408-B11">11</xref>
), repeat detection (
<xref ref-type="bibr" rid="gks408-B12">12</xref>
), error correction (
<xref ref-type="bibr" rid="gks408-B13">13</xref>
,
<xref ref-type="bibr" rid="gks408-B14">14</xref>
) and genome assembly (
<xref ref-type="bibr" rid="gks408-B15">15</xref>
–
<xref ref-type="bibr" rid="gks408-B17">17</xref>
). The memory and time performance of many of these tools is directly affected by the type and implementation of the index structure used. The choice of a tool impacts the choice of index structure and vice versa. However, the description of these tools in the scientific literature often omits a detailed description of the specifications of the index structures used. Concepts such as suffix trees, suffix arrays or FM-indexes are introduced in general terms in bioinformatics courses, but most of the time, these index structures are applied as black boxes that have certain properties and allow certain operations on strings within a given time. This does an injustice to the vast and rich literature available on index structures and does not present their complex design, possibilities and limitations. Moreover, most tools are designed using basic implementations of these index structures, without taking full advantage of the latest advances in indexing technology.</p>
<p>The goal of this article is twofold. On the one hand, we offer a comprehensive review of the basic ideas behind classical index structures, including suffix trees, suffix arrays and Burrows–Wheeler-based index structures such as FM-indexes. No prior knowledge about index structures is required. On the other hand, we give an overview of the limitations of these structures as well as the research done in the last decade to overcome these limitations. Furthermore, in light of recent advances in both sequencing technology and computing technology, we give prospects on future developments in index structure research.</p>
<sec sec-type="intro">
<title>Overview</title>
<p>This article is structured according to the following outline. The first main section introduces basic concepts and notations which are used throughout the article. This section also clarifies the relationship between computer science string algorithms and sequence analysis applications. Furthermore, it explains some algorithmic performance measures which have to be taken into account when dealing with advanced data structures. Readers well acquainted with data structures and algorithms may easily skip this section.</p>
<p>The second section reviews some of the most popular index structures currently in use. These include suffix trees, enhanced and compressed suffix arrays and FM-indexes, which are based on the Burrows–Wheeler transform. Both representational and algorithmic aspects of basic search operations of these index structures are discussed using a running example. Furthermore, the features of these different index structures are compared on an abstract level and their interrelation is made clear.</p>
<p>The next section gives an overview of current state-of-the-art main (RAM) memory index structures, with a focus on memory-time trade-offs. Several memory saving techniques are discussed, including compression techniques utilized in ‘compressed index structures’. The aim of this section is to provide insight into the complexity of the design of these compressed index structures, rather than to give their full details. It is shown how their design is composed of auxiliary data structures that govern the performance of the main index structure. On a larger scale, practical results from the bioinformatics literature illustrate the performance gain and limitations of search algorithms. Furthermore, a comparison between index structures, together with an extensive literature list, acts as a taxonomy for the currently known main memory full-text index structures.</p>
<p>While main memory index structures are the main focus of the second section, the design, limitations and improvements of external memory index structures are also discussed. The difference between index structures for internal and external memory is most prominent in their use of compression techniques, which are (still) less important in external memory. However, because hard disk access is much slower than main memory access, data structure layout and access patterns are much more important there.</p>
<p>The second biggest bottleneck of index structure usage is the initial construction phase, which is covered in the final section. Both main memory and secondary memory construction algorithms are reviewed. The main conceptual ideas used for construction of the index structures discussed in previous sections are provided, together with examples of the best results of construction algorithms found in the literature.</p>
<p>Finally, we summarize the findings presented in this article and give some prospects on future directions of research on index structures and its impact on bioinformatics applications. These prospects include variants and extensions of classical index structures, designed to answer specific biological queries, such as the search for structural RNA patterns, but also the use of new computing paradigms, such as the Google MapReduce framework (
<xref ref-type="bibr" rid="gks408-B18">18</xref>
).</p>
</sec>
</sec>
<sec>
<title>IMPORTANT CONCEPTS</title>
<p>Index structures originate from the field of theoretical computer science. This section introduces some important concepts for readers not familiar with the field. Readers with a background in data structures and algorithms may skip this section, except for the notations introduced at the end of this section.</p>
<sec>
<title>Strings versus sequences</title>
<p>The term ‘sequence’ is used for different concepts in the field of computer science and biology. What is called a sequence in biology is usually a ‘string’ in standard computer science parlance. The distinction between strings and sequences becomes especially prominent in computer science when introducing the concepts of substrings and subsequences. The former refer to contiguous intervals from larger strings, whereas the latter do not necessarily need to be contiguous intervals from the original string. As index structures work with substrings and to avoid ambiguity, we will stick to the standard computer science term string throughout this article, unless we explicitly want to stress the biological origin of the sequence.</p>
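<p>The difference between substrings and subsequences can be made concrete with a small Python sketch. The function names is_substring and is_subsequence and the example string are chosen for illustration only and do not come from any particular library.</p>
<preformat>
```python
def is_substring(candidate, text):
    """True if candidate occurs in text as one contiguous block."""
    return candidate in text


def is_subsequence(candidate, text):
    """True if the characters of candidate occur in text in order, gaps allowed."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to the matched character,
    # so later characters can only match further to the right.
    return all(symbol in remaining for symbol in candidate)


if __name__ == "__main__":
    text = "GATTACA"
    # TTA is contiguous in GATTACA, so it is both a substring and a subsequence.
    print(is_substring("TTA", text), is_subsequence("TTA", text))
    # GTC appears only with gaps, so it is a subsequence but not a substring.
    print(is_substring("GTC", text), is_subsequence("GTC", text))
```
</preformat>
<p>Every substring is also a subsequence, but not the other way around, which is why the index structures discussed here, being built on substrings, answer substring queries only.</p>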
</sec>
<sec>
<title>String matching</title>
<p>Key components of genome analysis include statistical methods for scoring and comparing string hypotheses and string matching algorithms for efficient string comparisons. However, the former component falls beyond the scope of this review as our main focus lies on string matching algorithms studied in the field of computer science. This again gives rise to a terminology barrier between the two research fields. For nearly all index structures discussed in this review, efficient algorithms for exact and inexact string matching exist. These algorithms allow fast queries into sequence databases, similarity searches between sequences and DNA/RNA mapping. Inexact string matching is usually implemented using a backtracking algorithm on the suffix tree or a seed-and-extend approach. The latter approach may use maximal exact matches or other types of shared substrings. Maximal exact matches are examples of identical substrings shared between multiple strings and are frequently used as seeds in sequence alignment or in tools that determine sequence similarity (
<xref ref-type="bibr" rid="gks408-B10">10</xref>
). Searching for all maximal exact matches in an efficient way requires strong index structures that are fully expressive, i.e. allow for all suffix tree operations in constant time (
<xref ref-type="bibr" rid="gks408-B19">19</xref>
).</p>
<p>Index structures reaching full expressiveness are able to handle a multitude of string searching problems such as locating several types of repeats, finding overlapping strings and finding the longest common substring. These string matching algorithms are, among others, used in genome assembly (finding repeats and overlaps), error correction of sequencing reads (repeats), fast identification of DNA contaminants (longest common substring) and genealogical DNA testing (short tandem repeats).</p>
<p>In addition, some index structures are geared toward specific applications. ‘Affix index structures’, for example, allow bidirectional string searching. As a result, they can be used for searching RNA structure patterns (
<xref ref-type="bibr" rid="gks408-B20">20</xref>
) and for short read mapping (
<xref ref-type="bibr" rid="gks408-B6">6</xref>
). ‘Weighted suffix trees’ (
<xref ref-type="bibr" rid="gks408-B21">21</xref>
) can be used to find patterns in biological sequences that contain weights such as base probabilities, but are also applied in error correction (
<xref ref-type="bibr" rid="gks408-B13">13</xref>
). ‘Geometric suffix trees’ (
<xref ref-type="bibr" rid="gks408-B22">22</xref>
) have been used to index 3D protein structures. ‘Property suffix trees’ have additional data structures to efficiently answer property matching queries. This can be useful, for example, in retrieving all occurrences of patterns that appear in a repetitive genomic structure (
<xref ref-type="bibr" rid="gks408-B23">23</xref>
).</p>
</sec>
<sec>
<title>Theoretical complexity</title>
<p>As is the case for other data structures, the performance of algorithms working on index structures is usually expressed in terms of their theoretical complexity, indicated by the ‘big-
<inline-formula>
<inline-graphic xlink:href="gks408i6.jpg"></inline-graphic>
</inline-formula>
notation’. Although it is a theoretical measure of the worst-case scenario, it contains valuable practical information about the qualitative and quantitative performance of algorithms and data structures. For example, the performance of some index structures depends on the alphabet size, whereas that of others does not. Thus, alphabet-independent index structures theoretically perform string searches equally well on DNA sequences (4 different characters) as on protein sequences (20 different characters). The qualitative information of the theoretical complexity usually categorizes the dependency on input parameters as logarithmic, linear, quasilinear, quadratic or exponential. Intuitively, this means that even if several algorithms have nearly the same execution time or memory requirements for a given input sequence, the execution time and memory requirements of some algorithms will grow much faster than those of others when the input size increases. In practice, quasilinear algorithms [complexity
<inline-formula>
<inline-graphic xlink:href="gks408i7.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
log
<italic>n</italic>
)] are sometimes much faster than linear algorithms [complexity
<inline-formula>
<inline-graphic xlink:href="gks408i8.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
)], because of the lower order terms and constants involved. These are usually omitted in the big-
<inline-formula>
<inline-graphic xlink:href="gks408i9.jpg"></inline-graphic>
</inline-formula>
notation. In general, however, the big-
<inline-formula>
<inline-graphic xlink:href="gks408i10.jpg"></inline-graphic>
</inline-formula>
notation is a good guideline for algorithm and data structure performance. Furthermore, this measure of algorithm and data structure efficiency is timeless and is not dependent on hardware, implementation and data specifications, as opposed to benchmark test results which can be misleading and may quickly become obsolete over time.</p>
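<p>As a rough worked example of the role of constants, assume a quasilinear algorithm with running time c_1 n log n and a linear algorithm with running time c_2 n. The constants c_1 and c_2 are hypothetical and only serve the illustration; the input size is taken to be roughly that of a human genome.</p>
<preformat>
```latex
c_2\, n \;>\; c_1\, n \log_2 n
\quad\Longleftrightarrow\quad
\frac{c_2}{c_1} \;>\; \log_2 n ,
\qquad
n = 3 \times 10^{9} \;\Rightarrow\; \log_2 n \approx 31.5
```
</preformat>
<p>Under these assumptions, the quasilinear algorithm is the faster one on such an input whenever the constant of the linear algorithm is more than about 32 times larger.</p>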
</sec>
<sec>
<title>Computer memory</title>
<p>The practical performance of index structures is governed not only by their algorithmic design, but also by the hardware that holds the data structure. Computer memory is in essence a hierarchy of layers, ordered from small, expensive but fast memory to large, cheap and slow memory. The hierarchy can roughly be divided into ‘main memory’, most notably RAM and caches, and secondary or ‘external memory’, which usually consists of hard disks or, in the near future, solid-state disks. Most index structures and applications are designed to run in main memory, because this allows for fast ‘random access’ to the data, whereas hard disks are usually 10
<sup>5</sup>
–10
<sup>6</sup>
times slower for random access (
<xref ref-type="bibr" rid="gks408-B24">24</xref>
). As the cost of generating biological data currently decreases much faster than the price of RAM, and bioinformatics projects are becoming much larger, comparing more data than ever before, algorithms and data structures designed for cheaper external memory are becoming more important (
<xref ref-type="bibr" rid="gks408-B25">25</xref>
). These external memory algorithms usually read data from external memory, process the information in main memory and write the result back to disk. As mentioned above, these ‘input/output’ (I/O) operations are very expensive. As a result, the algorithmic design needs to minimize these operations as much as possible, for example by keeping frequently needed key information in main memory. This technique, known as ‘caching’, is also used by file systems. File systems usually load more data into main memory than requested, because data that is physically located close to the requested data is likely to be needed in the near future. The physical ‘locality’ of data organized by index structures is thus of great importance. Moreover, data that is often logically requested in sequential order should also be physically ordered sequentially, because sequential disk access is almost as fast as random access in main memory. More information about index structure design for the different memory settings is found in the ‘Popular index structures’, ‘Time-memory trade-offs’ and ‘Index structures in external memory’ sections.</p>
</sec>
<sec>
<title>Notations</title>
<p>The following notations are used throughout the rest of the text. Let the finite, totally ordered alphabet Σ be an array of size |Σ| (|·| will be used to denote the size of a string, set or array). The DNA-alphabet, for example, has size four and is given by Σ = {
<monospace>A,C,G,T</monospace>
}. Furthermore, let Σ
<sup>
<italic>k</italic>
</sup>
and Σ*, respectively, be the set of all strings composed of
<italic>k</italic>
characters from Σ and the set of all strings composed of zero or more characters from Σ. The empty string will be denoted as ε. Let
<italic>S</italic>
 ∈ Σ
<sup>
<italic>n</italic>
</sup>
. All indexes in this article are zero-based. For every 0 ≤ 
<italic>i</italic>
 ≤ 
<italic>j</italic>
 < 
<italic>n</italic>
,
<italic>S</italic>
[
<italic>i</italic>
] denotes the character at position
<italic>i</italic>
in
<italic>S</italic>
,
<italic>S</italic>
[
<italic>i</italic>
 .. 
<italic>j</italic>
] denotes a substring that starts at position
<italic>i</italic>
and ends at position
<italic>j</italic>
and
<italic>S</italic>
[
<italic>i</italic>
 .. 
<italic>j</italic>
 ] = ε for
<italic>i</italic>
 > 
<italic>j</italic>
.
<italic>S</italic>
[
<italic>i</italic>
..] is the
<italic>i</italic>
-th
<italic>suffix</italic>
of
<italic>S</italic>
and
<italic>S</italic>
[..
<italic>i</italic>
] is the
<italic>i</italic>
-th
<italic>prefix</italic>
of
<italic>S</italic>
and
<italic>S</italic>
[
<italic>n</italic>
..] = 
<italic>S</italic>
[..−1] = ε. Likewise,
<italic>A</italic>
[
<italic>i</italic>
 .. 
<italic>j</italic>
] denotes an interval in an array
<italic>A</italic>
and the comma separator is used in 2D arrays, e.g.
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
] denotes the matrix element of
<italic>M</italic>
at the
<italic>i</italic>
-th row and
<italic>j</italic>
-th column.</p>
<p>
<italic>S</italic>
represents the indexed string which is usually very large, i.e. a chromosome or complete genome. Another string
<italic>P</italic>
denotes a pattern, which is searched in
<italic>S</italic>
. The length of
<italic>P</italic>
is
<italic>m</italic>
and usually
<italic>m</italic>
 ≪ 
<italic>n</italic>
holds, unless stated otherwise. For example,
<italic>P</italic>
can be a certain pattern, a sequencing read or a gene. The lexicographical order relation between two elements of Σ* is represented as <. The ‘longest common prefix’ LCP(
<italic>S</italic>
,
<italic>P</italic>
) of two strings
<italic>S</italic>
and
<italic>P</italic>
is the prefix
<italic>S</italic>
[..
<italic>k</italic>
], such that
<italic>S</italic>
[..
<italic>k</italic>
] = 
<italic>P</italic>
[..
<italic>k</italic>
] and
<italic>S</italic>
[
<italic>k</italic>
 + 1] ≠ 
<italic>P</italic>
[
<italic>k</italic>
 + 1].</p>
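<p>The notation above can be mirrored in a few lines of Python. This is only a sketch: the helper names (substring, suffix, prefix, lcp) are illustrative and do not come from the article, and the bounds are inclusive and zero-based as in the text.</p>
<preformat>
```python
# Zero-based, inclusive-bounds helpers mirroring the notation of the text.
# All function names are illustrative assumptions.

def substring(s, i, j):
    """S[i..j]: characters i through j inclusive; empty when i > j."""
    return s[i:j + 1] if j >= i else ""

def suffix(s, i):
    """S[i..]: the i-th suffix of s."""
    return s[i:]

def prefix(s, i):
    """S[..i]: the i-th prefix of s, ending at position i inclusive."""
    return s[:i + 1]

def lcp(s, p):
    """Longest common prefix of s and p, returned as a string."""
    k = 0
    while min(len(s), len(p)) > k and s[k] == p[k]:
        k += 1
    return s[:k]

S = "ACATACAGATG$"
print(substring(S, 0, 2))               # ACA
print(suffix(S, 8))                     # ATG$
print(prefix(S, 3))                     # ACAT
print(lcp(suffix(S, 0), suffix(S, 4)))  # ACA
```
</preformat>
<p>Returning the common prefix as a string rather than as a length keeps the sketch close to the definition of LCP given above.</p>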
<p>As a final remark, note that all logarithms in this article have base two, unless stated otherwise.</p>
</sec>
</sec>
<sec>
<title>POPULAR INDEX STRUCTURES</title>
<p>Index structures are data structures used to preprocess one or more strings to speed up string searches. As the examples in this section will illustrate, the types of searches can be quite diverse, yet some index structures manage to achieve an optimal performance for a broad class of search problems. The ultimate goal of index structures is to quickly capture maximal information about the string to be queried and to represent this information in a compact form. It turns out that both requirements often conflict in practice, with different types of index structures providing alternative trade-offs between speed and memory consumption. However, the speedup achieved over classical string searching algorithms often makes up for the extra construction and memory costs.</p>
<p>The index structures discussed here are ‘full-text index structures’. Unlike natural language, biological sequences do not show a clear structure of words and phrases, making popular ‘word-based’ index structures such as inverted files (
<xref ref-type="bibr" rid="gks408-B26">26</xref>
) and B-trees (
<xref ref-type="bibr" rid="gks408-B27">27</xref>
) less suited for indexing genomic sequences. Instead, full-text indexes that store information about all variable length substrings are better suited to analyze the complex nature of genome sequences.</p>
<p>The three most commonly used full-text index structures in bioinformatics today are suffix trees, suffix arrays and FM-indexes. The raison d'être of the latter two is the high memory requirements of suffix trees. This section shows that those smaller indexes are in fact reduced suffix trees and that they can be enhanced with auxiliary information to achieve complete suffix tree functionality.</p>
<sec>
<title>Suffix trees</title>
<p>Suffix trees have become the archetypical index structure used in bioinformatics. Introduced by Weiner (
<xref ref-type="bibr" rid="gks408-B28">28</xref>
), who also gave a linear time construction algorithm, they are said to efficiently solve a myriad of string processing problems (
<xref ref-type="bibr" rid="gks408-B29">29</xref>
). Complex string problems such as finding the longest common substring can be solved in linear time using suffix trees. The suffix tree of a string
<italic>S</italic>
contains information about all suffixes of that string and gives access to all prefixes of those suffixes, thus effectively allowing fast access to all substrings of the string
<italic>S</italic>
.</p>
<p>The suffix tree ST(
<italic>S</italic>
) is formally defined as the radix tree (
<xref ref-type="bibr" rid="gks408-B30">30</xref>
), i.e. a compact string search tree data structure, built from all suffixes of
<italic>S</italic>
. The edges of ST(
<italic>S</italic>
) are labeled with substrings of
<italic>S</italic>
and the leaves are numbered 0 to
<italic>n</italic>
 − 1. The one-to-one correspondence between leaf
<italic>i</italic>
of ST(
<italic>S</italic>
) and suffix
<italic>i</italic>
of
<italic>S</italic>
is found by concatenating all edge labels on the path from the root to the leaf: the concatenated string ending in leaf
<italic>i</italic>
equals suffix
<italic>S</italic>
[
<italic>i</italic>
..]. Moreover, internal nodes correspond to the LCP of suffixes of
<italic>S</italic>
, such that the labels of all outgoing edges from an internal node start with different characters and every internal node has at least two children. This last property distinguishes suffix trees from non-compact suffix ‘tries’, whose nodes can have a single child because all edge labels have length one. In order for the above properties to hold for a string
<italic>S</italic>
, the last character of
<italic>S</italic>
has to uniquely appear in
<italic>S</italic>
. In practice, this problem is solved by appending a special end-character $ to the end of string
<italic>S</italic>
, with $∉Σ and $ < 
<italic>c</italic>
, ∀
<italic>c</italic>
 ∈ Σ. This special end-character plays the same role as the virtual end-of-string symbol used in regular expressions (also represented as $ in that context). Hereafter, for every indexed string
<italic>S</italic>
it is assumed
<italic>S</italic>
[
<italic>n</italic>
 − 1] = $ or, equivalently,
<italic>S</italic>
 ∈ Σ*$ holds. As a running example, the suffix tree ST(
<italic>S</italic>
) for the string
<italic>S</italic>
 = 
<monospace>ACATACAGATG</monospace>
$ is given in
<xref ref-type="fig" rid="gks408-F1">Figure 1</xref>
.
<fig id="gks408-F1" position="float">
<label>Figure 1.</label>
<caption>
<p>Suffix tree for string
<italic>S</italic>
 = 
<monospace>ACATACAGATG</monospace>
$, where $ is the special end-character. Each number
<italic>i</italic>
inside a leaf represents suffix
<italic>S</italic>
[
<italic>i</italic>
..] of the string
<italic>S</italic>
. Dashed arrows correspond to suffix links. Edges are arranged in lexicographical order. For the sake of brevity, only the first characters followed by two dots and the special end-character $ are shown for edge labels that spell out the rest of the suffix corresponding to the leaf the edge is connected with.</p>
</caption>
<graphic xlink:href="gks408f1"></graphic>
</fig>
</p>
<p>The ‘label’ ℓ(
<italic>v</italic>
) of a node
<italic>v</italic>
of ST(
<italic>S</italic>
) is defined as the concatenation of edge labels on the path from the root to the node. From this definition it follows that ℓ(root) = ε. The ‘string depth’ of
<italic>v</italic>
is defined as |ℓ(
<italic>v</italic>
)|. The ‘suffix link’ sl(
<italic>v</italic>
) of an internal node
<italic>v</italic>
with label
<italic>cw</italic>
(
<italic>c</italic>
 ∈ Σ and
<italic>w</italic>
 ∈ Σ*) is the unique internal node with label
<italic>w</italic>
. Suffix links are represented as dashed lines in
<xref ref-type="fig" rid="gks408-F1">Figure 1</xref>
.</p>
<p>Most suffix tree algorithms boil down to (partial or full) top-down or bottom-up traversals of the tree, or the following of suffix links (
<xref ref-type="bibr" rid="gks408-B19">19</xref>
). These different types of traversals are further illustrated using some classical string algorithms.</p>
<p>In the exact string matching problem, all positions of a substring
<italic>P</italic>
have to be found in string
<italic>S</italic>
. Exact string matching is an important problem on its own and is also used as a basis for more complex string matching problems. Since
<italic>P</italic>
is a substring of
<italic>S</italic>
if and only if
<italic>P</italic>
is a prefix of some suffix of
<italic>S</italic>
, it follows that matching every character of
<italic>P</italic>
along a path in ST(
<italic>S</italic>
) (starting at the root) answers the existence question, i.e. whether the pattern occurs in the string at all. This algorithm thus requires a partial top-down traversal of ST(
<italic>S</italic>
) and has a time complexity of
<inline-formula>
<inline-graphic xlink:href="gks408i11.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
). Since suffixes of
<italic>S</italic>
are grouped by common prefixes in ST(
<italic>S</italic>
), the set of leaves in the subtree below the path that spells out
<italic>P</italic>
represents all locations where
<italic>P</italic>
occurs in
<italic>S</italic>
. This set is denoted as occ(
<italic>P</italic>
,
<italic>S</italic>
) and can be obtained in
<inline-formula>
<inline-graphic xlink:href="gks408i12.jpg"></inline-graphic>
</inline-formula>
(|occ(
<italic>P</italic>
,
<italic>S</italic>
)|) time. As an example, consider matching pattern
<italic>P</italic>
 = 
<monospace>AC</monospace>
to the running example in
<xref ref-type="fig" rid="gks408-F1">Figure 1</xref>
. The algorithm first finds the edge with label
<monospace>A</monospace>
going down from the root and then continues down the tree along the edge labeled
<monospace>CA</monospace>
. After matching the character
<monospace>C</monospace>
, the algorithm decides that
<italic>P</italic>
is a substring of
<italic>S</italic>
. Furthermore, occ(
<italic>P</italic>
,
<italic>S</italic>
) = {0, 4} and thus
<italic>P</italic>
 = 
<italic>S</italic>
[0..1] = 
<italic>S</italic>
[4..5]. This classical example already demonstrates the true power of suffix trees: the time complexity for matching
<italic>k</italic>
patterns of length
<italic>m</italic>
to a string of length
<italic>n</italic>
is
<inline-formula>
<inline-graphic xlink:href="gks408i13.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
 + 
<italic>km</italic>
). String matching algorithms that preprocess pattern
<italic>P</italic>
instead of string
<italic>S</italic>
[Boyer–Moore (
<xref ref-type="bibr" rid="gks408-B31">31</xref>
) and Knuth–Morris–Pratt (
<xref ref-type="bibr" rid="gks408-B32">32</xref>
), among others] require
<inline-formula>
<inline-graphic xlink:href="gks408i14.jpg"></inline-graphic>
</inline-formula>
(
<italic>k</italic>
(
<italic>n</italic>
 + 
<italic>m</italic>
)) time to solve the same problem. Since
<italic>k</italic>
and
<italic>n</italic>
are very large in most bioinformatics applications, for example when mapping millions (=
<italic>k</italic>
) of short (=
<italic>m</italic>
) reads to the human genome (=
<italic>n</italic>
), this speedup is significant.</p>
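<p>The top-down matching described above can be sketched in Python. For brevity, the sketch builds a naive, non-compact suffix trie in quadratic time and space instead of a linear-size suffix tree, so it illustrates only the traversal, not the complexity bounds. The names build_suffix_trie and occurrences, as well as the "leaf" marker key, are illustrative assumptions.</p>
<preformat>
```python
def build_suffix_trie(s):
    """Naive suffix trie of s: quadratic time and space, illustration only."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
        node["leaf"] = i              # suffix S[i..] ends in this node
    return root

def occurrences(trie, p):
    """Sorted start positions of pattern p in the indexed string: occ(P, S)."""
    node = trie
    for c in p:                       # partial top-down traversal, m steps
        if c not in node:
            return []                 # p is not a substring
        node = node[c]
    found, stack = [], [node]         # collect every leaf below the match
    while stack:
        node = stack.pop()
        for key, child in node.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("ACATACAGATG$")
print(occurrences(trie, "AC"))        # [0, 4]
```
</preformat>
<p>Because single-child paths are not compacted in a trie, collecting the leaves here costs time proportional to the size of the subtree rather than to the number of occurrences; a real suffix tree avoids this overhead.</p>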
<p>Bottom-up traversals through suffix trees are mainly required for the detection of highly similar patterns, such as common substrings or (approximate) repeats. This follows from the fact that internal nodes of ST(
<italic>S</italic>
) represent the LCP of suffixes in their subtree. Internal nodes with maximal string depth correspond to suffixes with the largest LCP, which makes it easy to find maximal repeats and LCPs using a full bottom-up search of ST(
<italic>S</italic>
). In detail, the longest common substring of two strings
<italic>S</italic>
<sub>1</sub>
and
<italic>S</italic>
<sub>2</sub>
of lengths
<italic>n</italic>
<sub>1</sub>
and
<italic>n</italic>
<sub>2</sub>
is found by first building a suffix tree for the concatenated string
<italic>S</italic>
<sub>1</sub>
<italic>S</italic>
<sub>2</sub>
, called a ‘generalized suffix tree’ (GST), and then traversing the GST twice. During an initial top-down traversal, string depths are stored at the internal nodes [if this information is gathered during construction of ST(
<italic>S</italic>
<sub>1</sub>
<italic>S</italic>
<sub>2</sub>
), the top-down traversal can be skipped]. A subsequent bottom-up traversal determines whether the leaves in the subtree of an internal node all originate from
<italic>S</italic>
<sub>1</sub>
,
<italic>S</italic>
<sub>2</sub>
or both. This information can percolate up to parent nodes. In case leaves from both
<italic>S</italic>
<sub>1</sub>
and
<italic>S</italic>
<sub>2</sub>
have the current node as their ancestor, the corresponding suffixes have a common prefix, and the longest common substring is the label of the deepest such node. Since every internal node is visited at most once during each traversal, and calculations at every internal node can be done in constant time, this algorithm requires
<inline-formula>
<inline-graphic xlink:href="gks408i15.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
<sub>1</sub>
 + 
<italic>n</italic>
<sub>2</sub>
) time. The details of the algorithm can be found in (
<xref ref-type="bibr" rid="gks408-B29">29</xref>
). Maximal repeats, as computed by Vmatch (
<ext-link ext-link-type="uri" xlink:href="http://www.vmatch.de/">http://www.vmatch.de/</ext-link>
), are found in a similar fashion. A maximal repeat is a substring of length
<italic>l</italic>
 > 0 that occurs at least at two positions
<italic>i</italic>
<sub>1</sub>
 < 
<italic>i</italic>
<sub>2</sub>
in
<italic>S</italic>
and that is both left-maximal (
<italic>S</italic>
[
<italic>i</italic>
<sub>1</sub>
 − 1] ≠ 
<italic>S</italic>
[
<italic>i</italic>
<sub>2</sub>
 − 1]) and right-maximal (
<italic>S</italic>
[
<italic>i</italic>
<sub>1</sub>
 + 
<italic>l</italic>
] ≠ 
<italic>S</italic>
[
<italic>i</italic>
<sub>2</sub>
 + 
<italic>l</italic>
]). Labels of the internal nodes of ST(
<italic>S</italic>
) represent all repeated substrings that are right-maximal. There are, however, node labels that correspond to repeats that are not left-maximal. Similar to finding the longest common substring, a bottom-up traversal of ST(
<italic>S</italic>
) uses information in the leaves to check left-maximality and forwards this information to parent nodes. As an example, the maximal repeats in the running example (
<xref ref-type="fig" rid="gks408-F1">Figure 1</xref>
) are
<monospace>ACA</monospace>
,
<monospace>AT</monospace>
,
<monospace>A</monospace>
and
<monospace>T</monospace>
. The first internal node
<italic>v</italic>
visited by a bottom-up traversal has ℓ(
<italic>v</italic>
) = 
<monospace>ACA</monospace>
and
<italic>v</italic>
has two leaves: 0 and 4. Since leaf 0 is a child of
<italic>v</italic>
, left-maximality is guaranteed for
<italic>v</italic>
and every ancestor of
<italic>v</italic>
. The internal node
<italic>w</italic>
with label ℓ(
<italic>w</italic>
) = 
<monospace>CA</monospace>
has leaves 5 and 1 as children, but because
<italic>S</italic>
[5 − 1] = 
<italic>S</italic>
[1 − 1] = 
<monospace>A</monospace>
, ℓ(
<italic>w</italic>
) = 
<monospace>CA</monospace>
is not a maximal repeat.</p>
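<p>The pairwise definition of a maximal repeat can be checked directly with a brute-force sketch. It enumerates all substrings and is therefore far slower than the bottom-up suffix tree traversal described above; it is meant only as a reference for small examples. The function name maximal_repeats and the convention that a position outside the string differs from every character are assumptions of this sketch.</p>
<preformat>
```python
from collections import defaultdict

def maximal_repeats(s):
    """Brute-force maximal repeats of s, following the pairwise definition."""
    n = len(s)
    positions = defaultdict(list)     # substring -> all of its start positions
    for i in range(n):
        for j in range(i + 1, n + 1):
            positions[s[i:j]].append(i)

    def char_at(k):
        # None stands for "outside the string" and equals no real character
        return s[k] if k in range(n) else None

    result = []
    for sub, occ in positions.items():
        l = len(sub)
        # maximal if some pair of occurrences differs on both sides
        is_maximal = any(
            char_at(i1 - 1) != char_at(i2 - 1)
            and char_at(i1 + l) != char_at(i2 + l)
            for a, i1 in enumerate(occ)
            for i2 in occ[a + 1:]
        )
        if is_maximal:
            result.append(sub)
    return sorted(result)

print(maximal_repeats("ACATACAGATG$"))
```
</preformat>
<p>Substrings that occur only once have no pair of occurrences to compare and are therefore never reported.</p>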
<p>A final way of traversing suffix trees is by following suffix links. Suffix links can be used both in suffix tree construction and in algorithms that search for maximal exact matches or compute matching statistics. Intuitively, suffix links maintain a sliding window when matching a pattern to the suffix tree. Furthermore, suffix links act as a memory-efficient alternative to GSTs. As constructing, storing and updating suffix trees is costly, the use of suffix links offers an important trade-off. The following algorithm demonstrates how suffix links enable a quick comparison between all suffixes of string
<italic>S</italic>
<sub>1</sub>
and the suffix tree ST(
<italic>S</italic>
<sub>2</sub>
) of another string
<italic>S</italic>
<sub>2</sub>
. Suppose the first suffix
<italic>S</italic>
<sub>1</sub>
[0..] has been compared up to a node
<italic>v</italic>
with ℓ(
<italic>v</italic>
) = 
<italic>S</italic>
<sub>1</sub>
[0..
<italic>i</italic>
]. After following sl(
<italic>v</italic>
) = 
<italic>w</italic>
, the second suffix
<italic>S</italic>
<sub>1</sub>
[1..] is already matched to ST(
<italic>S</italic>
<sub>2</sub>
) up to
<italic>w</italic>
, with ℓ(
<italic>w</italic>
) = 
<italic>S</italic>
<sub>1</sub>
[1..
<italic>i</italic>
]. In this way,
<italic>i</italic>
 = |ℓ(
<italic>w</italic>
)| characters do not have to be matched again for this suffix. This process can be repeated until all suffixes of
<italic>S</italic>
<sub>1</sub>
are matched to ST(
<italic>S</italic>
<sub>2</sub>
). Hence, the maximal exact matches between
<italic>S</italic>
<sub>1</sub>
and
<italic>S</italic>
<sub>2</sub>
can be found again in
<inline-formula>
<inline-graphic xlink:href="gks408i16.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
<sub>1</sub>
 + 
<italic>n</italic>
<sub>2</sub>
) time, while memory is needed only for the suffix tree of
<italic>S</italic>
<sub>2</sub>
plus its suffix links.</p>
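<p>To make concrete what this procedure computes, the following sketch derives matching statistics by brute force, without a suffix tree or suffix links. It is a slow reference only; the function name matching_statistics and the two example strings are illustrative assumptions. The final check shows the property that suffix links exploit: after dropping the first character of a match, all but one of the matched characters are still known to match.</p>
<preformat>
```python
def matching_statistics(s1, s2):
    """ms[i] = length of the longest prefix of s1[i:] that occurs in s2.

    Brute force for illustration; a suffix tree of s2 with suffix links
    yields the same values in linear time.
    """
    ms = []
    for i in range(len(s1)):
        length = 0
        # extend the match while a full-length slice exists and occurs in s2
        while len(s1) >= i + length + 1 and s1[i:i + length + 1] in s2:
            length += 1
        ms.append(length)
    return ms

ms = matching_statistics("GATTACA", "ACATACAGATG")
print(ms)                             # [3, 2, 1, 4, 3, 2, 1]

# Each value can drop by at most one from one suffix to the next.
print(all(ms[i + 1] >= ms[i] - 1 for i in range(len(ms) - 1)))   # True
```
</preformat>
<p>The maximal exact matches discussed in the text can be derived from positions where such a match cannot be extended in either direction.</p>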
<p>Given enough fast memory, suffix trees are probably the best data structure ever invented to support string algorithms. For large-scale bioinformatics applications, however, memory consumption really becomes a bottleneck. Although the memory requirements of suffix trees are asymptotically linear, the constant factor involved is quite high, i.e. up to 10 (
<xref ref-type="bibr" rid="gks408-B33">33</xref>
) to 20 times (
<xref ref-type="bibr" rid="gks408-B34">34</xref>
) higher than the amount of memory required to store the input string. However, state-of-the-art suffix tree implementations are able to handle sequences of human chromosome size (
<xref ref-type="bibr" rid="gks408-B10">10</xref>
). During the last decade, a lot of research focused on tackling this memory bottleneck, resulting in many suffix tree variants that show interesting memory versus time trade-offs.</p>
</sec>
<sec>
<title>Suffix arrays</title>
<p>The most successful and well-known variants of suffix trees are the so-called suffix arrays (
<xref ref-type="bibr" rid="gks408-B35">35</xref>
). They are made up of a single array containing a permutation of the indexes of string
<italic>S</italic>
, making them extremely simple and elegant. In terms of performance, expressiveness is traded for lower memory footprint and improved locality. Suffix arrays in general only require four times the amount of storage needed for the input string, can be constructed in linear time and can exactly match all occurrences of pattern
<italic>P</italic>
in string
<italic>S</italic>
in
<inline-formula>
<inline-graphic xlink:href="gks408i17.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
log
<italic>n</italic>
+|occ(
<italic>P</italic>
,
<italic>S</italic>
)|) time using a binary search.</p>
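<p>A minimal Python sketch of a suffix array and the binary search over it is given below. The construction simply sorts the suffixes, which is easy to read but not linear time. The function names (suffix_array, sa_interval, occurrences) are illustrative assumptions, and the sketch relies on the end-character $ sorting before the letters A, C, G and T, which holds for Python's default string comparison.</p>
<preformat>
```python
def suffix_array(s):
    """Suffix array by plain sorting of all suffixes (not linear time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_interval(s, sa, p):
    """Half-open interval [left, right) of rows whose suffix starts with p."""
    m = len(p)
    lo, hi = 0, len(sa)
    while hi > lo:                    # first row whose m-prefix is >= p
        mid = (lo + hi) // 2
        if p > s[sa[mid]:sa[mid] + m]:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while hi > lo:                    # first row whose m-prefix is > p
        mid = (lo + hi) // 2
        if p >= s[sa[mid]:sa[mid] + m]:
            lo = mid + 1
        else:
            hi = mid
    return left, lo

def occurrences(s, sa, p):
    """occ(P, S): all start positions of p, read from the interval."""
    left, right = sa_interval(s, sa, p)
    return sorted(sa[left:right])

S = "ACATACAGATG$"
sa = suffix_array(S)
print(sa)                             # [11, 4, 0, 6, 2, 8, 5, 1, 10, 7, 3, 9]
print(sa_interval(S, sa, "AC"))       # (1, 3)
print(occurrences(S, sa, "AC"))       # [0, 4]
```
</preformat>
<p>Each comparison inspects at most m characters and each binary search takes a logarithmic number of steps, which matches the bound stated above; reading off the interval adds time proportional to the number of occurrences.</p>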
<p>Suffix array SA(
<italic>S</italic>
) stores the lexicographical ordering of all suffixes of string
<italic>S</italic>
as a permutation of its index positions:
<italic>S</italic>
[SA[
<italic>i</italic>
 − 1]..] < 
<italic>S</italic>
[SA[
<italic>i</italic>
]..], 0 < 
<italic>i</italic>
 < 
<italic>n</italic>
. The last column of
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
shows the lexicographical ordering for the running example. SA(
<italic>S</italic>
) itself can be found in the second column. The uniqueness of the lexicographical order is determined by the fact that all suffixes have different lengths, and the use of the special end-character $ < 
<italic>c</italic>
,
<italic>c</italic>
 ∈ Σ. By definition,
<italic>S</italic>
[SA[0]..] always equals the string $. The relationship between suffix trees and suffix arrays becomes clear when traversing suffix trees depth-first and giving priority to edges with lexicographically smaller labels. Leaf numbers encountered in this order spell out the suffix array. All edges were lexicographically ordered on purpose in
<xref ref-type="fig" rid="gks408-F1">Figure 1</xref>
, so that leaf numbers, read from left to right, form SA(
<italic>S</italic>
) as found in
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
. Exact matching of substring
<italic>P</italic>
is done using two binary searches on SA(
<italic>S</italic>
). These binary searches locate
<italic>P</italic>
<sub>
<italic>L</italic>
</sub>
 = min{
<italic>k</italic>
|
<italic>P</italic>
 ≤ 
<italic>S</italic>
[SA[
<italic>k</italic>
] .. SA[<italic>k</italic>] + <italic>m</italic> − 1]} and
<italic>P</italic>
<sub>
<italic>R</italic>
</sub>
 = max{
<italic>k</italic>
|
<italic>P</italic>
 ≥ 
<italic>S</italic>
[SA[
<italic>k</italic>
] .. SA[<italic>k</italic>] + <italic>m</italic> − 1]}, which form the boundaries of the interval in SA(
<italic>S</italic>
) where occ(
<italic>P</italic>
,
<italic>S</italic>
) is found. Note that counting the occurrences requires
<inline-formula>
<inline-graphic xlink:href="gks408i18.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
log
<italic>n</italic>
) time, but finding occ(
<italic>P</italic>
,
<italic>S</italic>
) only requires an additional
<inline-formula>
<inline-graphic xlink:href="gks408i19.jpg"></inline-graphic>
</inline-formula>
(|occ(
<italic>P</italic>
,
<italic>S</italic>
)|) time.
<table-wrap id="gks408-T1" position="float">
<label>Table 1.</label>
<caption>
<p>Arrays used by enhanced suffix arrays (columns 2–5), compressed suffix arrays (columns 2, 6 and 7) and FM-indexes (columns 8–14) for string
<italic>S</italic>
 = 
<monospace>ACATACAGATG$</monospace>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th colspan="3" align="center" rowspan="1">ESA
<hr></hr>
</th>
<th colspan="2" align="center" rowspan="1">CSA
<hr></hr>
</th>
<th colspan="7" align="center" rowspan="1">FM-index ‘rank’
<hr></hr>
</th>
<th rowspan="1" colspan="1"></th>
</tr>
<tr>
<th rowspan="1" colspan="1">
<italic>i</italic>
</th>
<th rowspan="1" colspan="1">SA</th>
<th rowspan="1" colspan="1">LCP</th>
<th rowspan="1" colspan="1">
<italic>child</italic>
</th>
<th rowspan="1" colspan="1">
<italic>sl</italic>
</th>
<th rowspan="1" colspan="1">SA
<sup>−1</sup>
</th>
<th rowspan="1" colspan="1">Ψ</th>
<th rowspan="1" colspan="1">BWT</th>
<th rowspan="1" colspan="1">
<monospace>$</monospace>
</th>
<th rowspan="1" colspan="1">
<monospace>A</monospace>
</th>
<th rowspan="1" colspan="1">
<monospace>C</monospace>
</th>
<th rowspan="1" colspan="1">
<monospace>G</monospace>
</th>
<th rowspan="1" colspan="1">
<monospace>T</monospace>
</th>
<th rowspan="1" colspan="1">LF</th>
<th align="left" rowspan="1" colspan="1">
<italic>S</italic>
[SA[
<italic>i</italic>
]..]</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">−1</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">
<monospace>G</monospace>
</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">
<monospace>$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">[0..11]</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">
<monospace>T</monospace>
</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">
<monospace>ACAGATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">[6..7]</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">
<monospace>$</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">
<monospace>ACATACAGATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">[0..11]</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">
<monospace>C</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">
<monospace>AGATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">
<monospace>C</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">
<monospace>ATACAGATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">[10..11]</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">
<monospace>G</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">
<monospace>ATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">
<monospace>CAGATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">[1..5]</td>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">
<monospace>CATACAGATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">
<monospace>T</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">
<monospace>G$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">[0..11]</td>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">
<monospace>GATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">
<monospace>TACAGATG$</monospace>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">[0..11]</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">
<monospace>TG$</monospace>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>From left to right: index position, suffix array, LCP array, child array, suffix link array, inverse suffix array, Ψ-array, BWT text, ‘rank’ array, LF-mapping array and suffixes of string
<italic>S</italic>
. FM-indexes also require an array
<italic>C</italic>
(
<italic>S</italic>
).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>Although conceptually simple, suffix arrays are not just reduced versions of suffix trees (
<xref ref-type="bibr" rid="gks408-B36">36</xref>
,
<xref ref-type="bibr" rid="gks408-B37">37</xref>
). Optimal solutions for complex string processing problems can be achieved by algorithms on suffix arrays without simulating suffix tree traversals. An example is the all pairs suffix–prefix problem in which the maximal suffix–prefix overlap between all ordered pairs of
<italic>k</italic>
strings of total length
<italic>n</italic>
can be determined by both suffix trees (
<xref ref-type="bibr" rid="gks408-B29">29</xref>
) and suffix arrays (
<xref ref-type="bibr" rid="gks408-B37">37</xref>
) in
<inline-formula>
<inline-graphic xlink:href="gks408i20.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
 + 
<italic>k</italic>
<sup>2</sup>
) time.</p>
</sec>
<sec>
<title>Enhanced suffix arrays</title>
<p>Suffix arrays carry less information than suffix trees, but require far less memory. They lack LCP information and constant-time access to children and suffix links, which makes them less suited to more complex string matching problems. Abouelhoda
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B19">19</xref>
) demonstrated how suffix arrays can be embellished with additional arrays to recover the full expressivity of suffix trees. These so-called ‘enhanced suffix arrays’ consist of three extra arrays that, together with a suffix array, form a more compact representation of suffix trees that can also be constructed in
<inline-formula>
<inline-graphic xlink:href="gks408i21.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) time. Furthermore, the next paragraphs demonstrate how the extra arrays of enhanced suffix arrays enable efficient simulation of all traversal types of suffix trees (
<xref ref-type="bibr" rid="gks408-B19">19</xref>
).</p>
<p>A first array LCP(
<italic>S</italic>
) supports bottom-up traversals on suffix array SA(
<italic>S</italic>
). It stores LCP lengths of consecutive suffixes from the suffix array, i.e. LCP[
<italic>i</italic>
] = | LCP(
<italic>S</italic>
[SA[
<italic>i</italic>
 − 1]..],
<italic>S</italic>
[SA[
<italic>i</italic>
]..])|, 0 < 
<italic>i</italic>
 < 
<italic>n</italic>
. By definition, LCP[0] = −1. An example LCP array for the running example is shown in the third column of
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
. Originally, Manber and Myers (
<xref ref-type="bibr" rid="gks408-B35">35</xref>
) utilized LCP arrays to speed up exact substring matching on suffix arrays to achieve an
<inline-formula>
<inline-graphic xlink:href="gks408i22.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
+log
<italic>n</italic>
+|occ(
<italic>P</italic>
,
<italic>S</italic>
)|) time bound. Recently, Grossi (
<xref ref-type="bibr" rid="gks408-B36">36</xref>
) proved that the
<inline-formula>
<inline-graphic xlink:href="gks408i23.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
 + log
<italic>n</italic>
 + |occ(
<italic>P</italic>
,
<italic>S</italic>
)|) time bound for exact substring matching can be reached by using only
<italic>S</italic>
, SA(
<italic>S</italic>
) and
<inline-formula>
<inline-graphic xlink:href="gks408i1.jpg"></inline-graphic>
</inline-formula>
sampled LCP array entries. Furthermore, it is possible to encode those sampled LCP array entries inside a modified version of SA(
<italic>S</italic>
) itself. However, the details of this technique are rather involved and fall beyond the scope of this review. Kasai
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B38">38</xref>
) showed how all bottom-up traversals of suffix trees can be mimicked on suffix arrays in linear time by traversing LCP arrays. In fact, LCP(
<italic>S</italic>
) represents the tree topology of ST(
<italic>S</italic>
). Recall that internal nodes of suffix trees group suffixes by their LCPs. In enhanced suffix arrays, internal nodes are represented by ‘LCP intervals’ ℓ-[
<italic>i</italic>
 .. 
<italic>j</italic>
]. Formally, an interval ℓ -[
<italic>i</italic>
 .. 
<italic>j</italic>
], 0 ≤ 
<italic>i</italic>
 < 
<italic>j</italic>
 < 
<italic>n</italic>
is an LCP interval with ‘LCP value’ ℓ if for every
<italic>i</italic>
 < 
<italic>k</italic>
 ≤ 
<italic>j</italic>
: LCP[
<italic>k</italic>
] ≥ ℓ and there exists
<italic>i</italic>
 < 
<italic>k</italic>
 ≤ 
<italic>j</italic>
: LCP[
<italic>k</italic>
] = ℓ and LCP[
<italic>i</italic>
] < ℓ and LCP[
<italic>j</italic>
 + 1] < ℓ. The LCP interval 0-[0..
<italic>n</italic>
 − 1] is defined to correspond to the root of ST(
<italic>S</italic>
). Intuitively, an LCP interval is a maximal interval of minimal LCP length that corresponds to an internal node of ST(
<italic>S</italic>
). As an illustration, LCP interval 1-[1..5] with LCP value 1 of the example string
<italic>S</italic>
in
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
corresponds to internal node
<italic>v</italic>
with label ℓ(
<italic>v</italic>
) = 
<monospace>A</monospace>
in
<xref ref-type="fig" rid="gks408-F1">Figure 1</xref>
. Similarly, subinterval relations among LCP intervals relate to parent–child relationships in suffix trees. Abouelhoda
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B19">19</xref>
) have shown that the boundaries between LCP subintervals of LCP interval ℓ-[
<italic>i</italic>
 .. 
<italic>j</italic>
] are given by the ‘ℓ-indexes’ for which it holds that LCP[
<italic>k</italic>
] = ℓ,
<italic>i</italic>
 < 
<italic>k</italic>
 ≤ 
<italic>j</italic>
. Singleton intervals correspond to leaves in the suffix tree and non-singleton intervals correspond to internal nodes. Consider, for example, the LCP interval 1-[1..5] in the running example. Its ℓ-indexes are 3 and 4. The resulting subintervals are LCP intervals 3-[1..2] and 2-[4..5] and singleton interval [3..3]. The above definitions thus generate a virtual suffix tree called the ‘LCP interval tree’. Note that the topology of this tree is not stored in memory, but is traversed using the arrays SA(
<italic>S</italic>
) and LCP(
<italic>S</italic>
).</p>
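<p>The definitions above can be checked on the running example with a minimal Python sketch. It is an illustration only: the function names are hypothetical, and both constructions are naive and quadratic rather than the linear-time algorithms cited in the text.</p>
<preformat>
```python
# Illustrative sketch; function names are hypothetical and the constructions
# are naive (not the linear-time algorithms cited in the text).

def suffix_array(s):
    """Sort suffix start positions lexicographically.
    In ASCII, '$' sorts before 'A', 'C', 'G' and 'T', as the text assumes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """LCP[i] is the length of the longest common prefix of the suffixes
    starting at SA[i-1] and SA[i]; LCP[0] is -1 by definition."""
    def common_prefix_length(a, b):
        k = 0
        while k != min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k

    return [-1] + [
        common_prefix_length(s[sa[i - 1]:], s[sa[i]:])
        for i in range(1, len(sa))
    ]


def child_intervals(lcp, l, i, j):
    """Split LCP interval l-[i..j] at its l-indexes, i.e. the positions k
    in i+1..j for which LCP[k] equals l."""
    l_indexes = [k for k in range(i + 1, j + 1) if lcp[k] == l]
    bounds = [i] + l_indexes + [j + 1]
    return l_indexes, [(a, b - 1) for a, b in zip(bounds, bounds[1:])]


s = "ACATACAGATG$"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
print(sa)
print(lcp)
print(child_intervals(lcp, 1, 1, 5))
```
</preformat>
<p>For LCP interval 1-[1..5] the sketch reports the ℓ-indexes 3 and 4 and the subintervals [1..2], [3..3] and [4..5], in agreement with the example above.</p>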
<p>Fast top-down searches of suffix trees not only require their tree topology, but also constant time access to child nodes. For an LCP interval ℓ-[
<italic>i</italic>
 .. 
<italic>j</italic>
], this means constant time access to its ℓ-indexes. This information can be precomputed in linear time for the entire LCP interval tree and stored in another array of enhanced suffix arrays, the ‘child array’. The first ℓ-index is stored at position
<italic>i</italic>
or
<italic>j</italic>
[the exact location can be determined in constant time (
<xref ref-type="bibr" rid="gks408-B19">19</xref>
)] and the next ℓ-index is stored at the location of the previous ℓ-index. The child array for the running example is given in the fourth column of
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
. As an example, again consider LCP interval 1-[1..5]. The first ℓ-index (3) is stored at position 5 and the second ℓ-index (4) is stored at position 3. Since child[4] = 5 and LCP[5] = 2 ≠ ℓ, position 5 is not an ℓ-index of this interval, so 4 is the last ℓ-index. The child array allows enhanced suffix arrays to simulate top-down suffix tree traversals.</p>
<p>As a final step towards complete suffix tree expressiveness, suffix arrays can be enhanced with ‘suffix link arrays’ that store suffix links as pointers to other LCP intervals. These pointers are stored at the position of the first ℓ-index of an LCP interval because no two LCP intervals share the same position as their first ℓ-index (
<xref ref-type="bibr" rid="gks408-B19">19</xref>
). This property and the suffix link array for the running example can be checked in
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
.</p>
<p>With three extra arrays added, enhanced suffix arrays support all operations and traversals on suffix trees with the same time complexity. However, the simple modular structure allows memory savings if not all traversals are required for an application. Furthermore, array representations generally show better locality than most standard suffix tree representations, which is important when storing the index on disk, but also improves cache usage in memory (
<xref ref-type="bibr" rid="gks408-B39">39</xref>
). Practical implementation improvements have further reduced memory consumption (
<xref ref-type="bibr" rid="gks408-B40">40</xref>
) of enhanced suffix arrays and have sped up substring matching for larger alphabets (
<xref ref-type="bibr" rid="gks408-B41">41</xref>
). In practice, several state-of-the-art bioinformatics tools make use of enhanced suffix arrays for finding repeated structures in genomes (Vmatch), short read mapping (
<xref ref-type="bibr" rid="gks408-B5">5</xref>
) and genome assembly (
<xref ref-type="bibr" rid="gks408-B16">16</xref>
). If memory is a concern, enhanced suffix arrays occupy about the same amount of memory as regular suffix trees and are thus equally inapplicable for large strings. Suffix arrays (without enhancement) are preferred for exact substring matching in very large strings.</p>
</sec>
<sec>
<title>Compressed suffix arrays</title>
<p>Although suffix arrays are much more compact than suffix trees, their memory footprint is still too high for extremely large strings. The main reason stems from the fact that suffix arrays (and suffix trees) store pointers to string positions. The largest pointer takes
<inline-formula>
<inline-graphic xlink:href="gks408i24.jpg"></inline-graphic>
</inline-formula>
(log
<italic>n</italic>
) bits, which means that suffix arrays require
<inline-formula>
<inline-graphic xlink:href="gks408i25.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
log
<italic>n</italic>
) bits of storage. This is large compared with
<inline-formula>
<inline-graphic xlink:href="gks408i26.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
log|Σ|) bits needed for storing uncompressed strings. A demand for smaller indexes that remain efficient gave rise to the development of ‘succinct indexes’ and ‘compressed indexes’. Succinct indexes require
<inline-formula>
<inline-graphic xlink:href="gks408i27.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) bits of space, whereas the memory requirements of compressed indexes are of the same order of magnitude as the size of the compressed string (
<xref ref-type="bibr" rid="gks408-B42">42</xref>
).</p>
<p>Many types of compressed suffix arrays (
<xref ref-type="bibr" rid="gks408-B43">43</xref>
) have already been proposed [see Navarro and Mäkinen for a recent review (
<xref ref-type="bibr" rid="gks408-B42">42</xref>
)]. They are usually centered around the idea of storing ‘suffix array samples’, complemented with a highly compressible ‘neighbor array’ Ψ(
<italic>S</italic>
). To understand the role of the array Ψ(
<italic>S</italic>
), the concept of ‘inverse suffix arrays’ SA
<sup>−1</sup>
(
<italic>S</italic>
) is introduced, for which it holds that SA
<sup>−1</sup>
[SA[
<italic>i</italic>
]] ≡ SA[SA
<sup>−1</sup>
[
<italic>i</italic>
]] = 
<italic>i</italic>
. Ψ(
<italic>S</italic>
) can then be defined as Ψ[
<italic>i</italic>
] ≡ SA
<sup>−1</sup>
[(SA[
<italic>i</italic>
] + 1) mod 
<italic>n</italic>
] for 0 ≤ 
<italic>i</italic>
 < 
<italic>n</italic>
. This definition closely resembles that of suffix links and it will thus come as no surprise that in practice Ψ can be used to recover suffix links (
<xref ref-type="bibr" rid="gks408-B44">44</xref>
). Likewise, the array Ψ can be used to recover unsampled suffix array values from a sparse representation of SA(
<italic>S</italic>
). This is illustrated using the running example string from
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
. Assume that only SA[0], SA[6] and SA[11] are stored and that the value of SA[10] is unknown. Note that Ψ[10] = 1 and SA[1] = 4 = 3 + 1, i.e. the requested value plus one. A sampled value of SA(
<italic>S</italic>
) is reached by repeatedly calculating Ψ[Ψ[..Ψ[10]]] = Ψ
<sup>
<italic>k</italic>
</sup>
[10]. In the example
<italic>k</italic>
 = 2, because Ψ[Ψ[10]] = 6. Consequently, SA[10] = SA[6] − 
<italic>k</italic>
 = 5 − 2 = 3. A more detailed discussion about compressed suffix arrays is given in the next section.</p>
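<p>The recovery of an unsampled suffix array value can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the dictionary of samples are hypothetical, Ψ is stored uncompressed, and the suffix array is built naively. It relies on the relation SA[Ψ[<italic>i</italic>]] = (SA[<italic>i</italic>] + 1) mod <italic>n</italic>.</p>
<preformat>
```python
# Illustrative sketch; names are hypothetical and nothing is compressed here.

def suffix_array(s):
    """Naive construction by sorting suffix start positions."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def psi_array(sa):
    """Psi[i] = SA^-1[(SA[i] + 1) mod n]."""
    n = len(sa)
    inverse = [0] * n
    for index, position in enumerate(sa):
        inverse[position] = index
    return [inverse[(sa[i] + 1) % n] for i in range(n)]


def lookup_sa(i, psi, samples):
    """Recover SA[i] from a dict of sampled entries {index: SA[index]}.
    Each application of Psi moves to the suffix that starts one position
    further in the text, so k steps must be subtracted at the end."""
    n = len(psi)
    k = 0
    while i not in samples:
        i = psi[i]
        k += 1
    return (samples[i] - k) % n


s = "ACATACAGATG$"
sa = suffix_array(s)
psi = psi_array(sa)
samples = {0: sa[0], 6: sa[6], 11: sa[11]}
print(psi[10], psi[psi[10]])
print(lookup_sa(10, psi, samples))
```
</preformat>
<p>With the samples SA[0], SA[6] and SA[11], the lookup for index 10 follows Ψ twice (10, 1, 6) and returns 5 − 2 = 3, as in the example above.</p>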
</sec>
<sec>
<title>The Burrows–Wheeler transform</title>
<p>Several compressed index structures, most notably the FM-index (
<xref ref-type="bibr" rid="gks408-B45">45</xref>
), are based on the Burrows–Wheeler transform (
<xref ref-type="bibr" rid="gks408-B46">46</xref>
) BWT(
<italic>S</italic>
). This reversible permutation of the string
<italic>S</italic>
also lies at the core of compression tools such as ‘bzip2’.</p>
<p>The Burrows–Wheeler transform does not compress a string itself, rather it enables an easier and stronger compression of the original string by exploiting regularities found in the string. Unlike SA(
<italic>S</italic>
) that is a permutation of the index positions of
<italic>S</italic>
, BWT(
<italic>S</italic>
) is a permutation of the characters of
<italic>S</italic>
. As a result, BWT(
<italic>S</italic>
) only occupies
<inline-formula>
<inline-graphic xlink:href="gks408i28.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
log|Σ|) bits of memory in contrast to
<inline-formula>
<inline-graphic xlink:href="gks408i29.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
log
<italic>n</italic>
) bits needed for storing SA(
<italic>S</italic>
). As it contains the original string itself, the Burrows–Wheeler transform does not require an additional copy of
<italic>S</italic>
for string searching algorithms. Index structures having this property are called ‘self-indexes’.</p>
<p>Intuitively, the Burrows–Wheeler transformation orders the characters of
<italic>S</italic>
by the context following the characters. Thus, characters followed by similar substrings will be close together. A simple way to formally define BWT(
<italic>S</italic>
) uses a conceptual
<italic>n</italic>
 × 
<italic>n</italic>
matrix
<italic>M</italic>
whose rows are formed by the characters of the lexicographically sorted
<italic>n</italic>
cyclic shifts of
<italic>S</italic>
. BWT(
<italic>S</italic>
) is the string represented by the last column of
<italic>M</italic>
, or BWT[
<italic>i</italic>
] ≡ 
<italic>M</italic>
[
<italic>i</italic>
,
<italic>n</italic>
 − 1], 0 ≤ 
<italic>i</italic>
 < 
<italic>n</italic>
. Note that the rows of
<italic>M</italic>
up to the character $ also represent the suffixes in lexicographical order, or, equivalently, in suffix array order. Thus, the first column of
<italic>M</italic>
contains the first characters of the suffixes in suffix array order and the last column contains the characters that cyclically precede them, from which it follows that BWT(
<italic>S</italic>
) can also be defined as BWT[
<italic>i</italic>
] ≡ 
<italic>S</italic>
[(SA[
<italic>i</italic>
] − 1) mod
<italic>n</italic>
], 0 ≤ 
<italic>i</italic>
 < 
<italic>n</italic>
, where the modulo operator is used for the case SA[
<italic>i</italic>
] = 0. From this definition it immediately follows that BWT(
<italic>S</italic>
) can be constructed in linear time using SA(
<italic>S</italic>
). BWT(
<italic>S</italic>
) for the running example can be found in
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
, column 8, whereas the complete matrix
<italic>M</italic>
is given in
<xref ref-type="table" rid="gks408-T2">Table 2</xref>
.
<table-wrap id="gks408-T2" position="float">
<label>Table 2.</label>
<caption>
<p>Conceptual matrix
<italic>M</italic>
containing the lexicographically ordered
<italic>n</italic>
cyclic shifts of
<italic>S</italic>
 = 
<monospace>ACATACAGATG$</monospace>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">
<italic>i</italic>
</th>
<th rowspan="1" colspan="1">
<italic>S</italic>
[SA[
<italic>i</italic>
]]</th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">BWT[
<italic>i</italic>
]</th>
<th rowspan="1" colspan="1">offset[
<italic>i</italic>
]</th>
<th rowspan="1" colspan="1">LF[
<italic>i</italic>
]</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">
<monospace>$</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>ACATACAGAT</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>G</monospace>
</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">8</td>
</tr>
<tr>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>CAGATG$ACA</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>T</monospace>
</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">10</td>
</tr>
<tr>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>CATACAGATG</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>$</monospace>
</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
</tr>
<tr>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>GATG$ACATA</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>C</monospace>
</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>TACAGATG$A</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>C</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">7</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>TG$ACATACA</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>G</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">9</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">
<monospace>C</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>AGATG$ACAT</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">
<monospace>C</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>ATACAGATG$</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">2</td>
</tr>
<tr>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">
<monospace>G</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>$ACATACAGA</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>T</monospace>
</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">11</td>
</tr>
<tr>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">
<monospace>G</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>ATG$ACATAC</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">3</td>
</tr>
<tr>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">
<monospace>T</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>ACAGATG$AC</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">4</td>
</tr>
<tr>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">
<monospace>T</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>G$ACATACAG</monospace>
</td>
<td rowspan="1" colspan="1">
<monospace>A</monospace>
</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">5</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>M</italic>
[0..11,0] contains the lexicographically ordered characters of
<italic>S</italic>
and
<italic>M</italic>
[0..11,11] equals BWT(
<italic>S</italic>
). The last two columns are required for the inverse transformation. offset[
<italic>i</italic>
] stores the number of times BWT[
<italic>i</italic>
] has appeared earlier in BWT(
<italic>S</italic>
). The last column LF[
<italic>i</italic>
] contains pointers used during the inverse transformation algorithm: if
<italic>S</italic>
[
<italic>i</italic>
] = BWT[
<italic>j</italic>
], then BWT[LF[
<italic>j</italic>
]] = 
<italic>S</italic>
[
<italic>i</italic>
 − 1].</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
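<p>Both definitions of BWT(<italic>S</italic>) can be compared in a short Python sketch. It is illustrative only: the function names are hypothetical, and the matrix <italic>M</italic> is materialized explicitly, which a practical implementation would avoid.</p>
<preformat>
```python
# Illustrative sketch; function names are hypothetical.

def suffix_array(s):
    """Naive construction by sorting suffix start positions."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_rotations(s):
    """Last column of the matrix M of sorted cyclic shifts of s."""
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)


def bwt_from_sa(s, sa):
    """BWT[i] = S[(SA[i] - 1) mod n]; the modulo covers SA[i] = 0."""
    n = len(s)
    return "".join(s[(position - 1) % n] for position in sa)


s = "ACATACAGATG$"
print(bwt_from_rotations(s))
print(bwt_from_sa(s, suffix_array(s)))
```
</preformat>
<p>Both functions return the same string for the running example, matching the BWT column of the tables.</p>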
<p>The inverse transformation that reconstructs
<italic>S</italic>
from BWT(
<italic>S</italic>
) is key to decompression algorithms and to the string matching algorithm utilized in compressed index structures. It recovers
<italic>S</italic>
back-to-front and is based on a few simple observations. First, although BWT(
<italic>S</italic>
) only stores the last column of
<italic>M</italic>
, the first column of
<italic>M</italic>
is easily retrieved from BWT(
<italic>S</italic>
) because it is the lexicographical ordering of the characters of
<italic>S</italic>
[and thus also BWT(
<italic>S</italic>
)]. Moreover, the first column of
<italic>M</italic>
can be represented in compact form as an array
<italic>C</italic>
(
<italic>S</italic>
) that stores, for each character
<italic>c</italic>
 ∈ Σ, the number of characters in
<italic>S</italic>
that are lexicographically smaller than
<italic>c</italic>
. More precisely:
<inline-formula>
<inline-graphic xlink:href="gks408i2.jpg"></inline-graphic>
</inline-formula>
,
<italic>c</italic>
<sub>
<italic>i</italic>
</sub>
 ∈ Σ. For the running example,
<italic>C</italic>
(
<italic>S</italic>
) = [0,1,6,8,10] can be retrieved from
<xref ref-type="table" rid="gks408-T2">Table 2</xref>
. A second observation is that BWT(
<italic>S</italic>
) stores the order of characters preceding the suffixes in suffix array order. As a result, if the character at position
<italic>i</italic>
(
<italic>S</italic>
[
<italic>i</italic>
]) has been decoded and the lexicographical order of suffix
<italic>S</italic>
[
<italic>i</italic>
..] is known to be
<italic>j</italic>
, character
<italic>S</italic>
[
<italic>i</italic>
 − 1] is found in BWT[
<italic>j</italic>
]. Finally, the most important observation that allows for the retrieval of
<italic>S</italic>
from BWT(
<italic>S</italic>
) is that identical characters preserve their relative order in the first and last columns of
<italic>M</italic>
. To see the correctness of this observation, let BWT[
<italic>i</italic>
] = BWT[
<italic>j</italic>
] = 
<italic>c</italic>
for
<italic>i</italic>
 < 
<italic>j</italic>
. The lexicographical ordering of the cyclic permutations means that the suffix in row
<italic>i</italic>
of
<italic>M</italic>
corresponding to SA[
<italic>i</italic>
] is lexicographically smaller than the suffix in row
<italic>j</italic>
corresponding to SA[
<italic>j</italic>
]. From
<italic>cS</italic>
[SA[
<italic>i</italic>
]..] < 
<italic>cS</italic>
[SA[
<italic>j</italic>
]..] it then follows that the location of character
<italic>c</italic>
corresponding to BWT[
<italic>i</italic>
] precedes the location of character
<italic>c</italic>
corresponding to BWT[
<italic>j</italic>
] in the first column of
<italic>M</italic>
. The relative order of identical characters in BWT(
<italic>S</italic>
) is captured in the array offset(
<italic>S</italic>
): offset[
<italic>i</italic>
] stores the number of times that character BWT[
<italic>i</italic>
] occurs in BWT(
<italic>S</italic>
) before position
<italic>i</italic>
, i.e. offset[
<italic>i</italic>
] ≡ |occ(BWT[
<italic>i</italic>
],BWT[..
<italic>i</italic>
 − 1])|, 0 ≤ 
<italic>i</italic>
 < 
<italic>n</italic>
. Given a position
<italic>i</italic>
in BWT(
<italic>S</italic>
), the corresponding character in the first column of
<italic>M</italic>
can then be found at position LF[
<italic>i</italic>
] = 
<italic>C</italic>
[BWT[
<italic>i</italic>
]] + offset[
<italic>i</italic>
]. The array LF(
<italic>S</italic>
) is called the ‘last-to-first column mapping’.</p>
<p>The above observations allow the back-to-front recovery of
<italic>S</italic>
from BWT(
<italic>S</italic>
) utilizing a zig-zag algorithm. Starting in row
<italic>i</italic>
<sub>0</sub>
of BWT(
<italic>S</italic>
) containing character $, the position of the previous character of
<italic>S</italic>
is found in row LF[
<italic>i</italic>
<sub>0</sub>
] = 
<italic>i</italic>
<sub>1</sub>
. The next preceding character is found on row
<italic>i</italic>
<sub>2</sub>
 = LF[
<italic>i</italic>
<sub>1</sub>
] in BWT(
<italic>S</italic>
), and so on. Thus, to find the row of the next preceding character, the algorithm looks horizontally in
<xref ref-type="table" rid="gks408-T2">Table 2</xref>
and the actual character is retrieved from the BWT column on that row in
<xref ref-type="table" rid="gks408-T2">Table 2</xref>
. Note that neither
<italic>M</italic>
nor its first column are ever used explicitly during the algorithm. They only serve to understand the procedure for the inverse transformation. In practice,
<italic>C</italic>
(
<italic>S</italic>
) and offset(
<italic>S</italic>
) are first constructed from BWT(
<italic>S</italic>
). During each step, LF[
<italic>i</italic>
<sub>
<italic>k</italic>
</sub>
] is calculated using
<italic>C</italic>
(
<italic>S</italic>
) and offset(
<italic>S</italic>
) and BWT[LF[
<italic>i</italic>
<sub>
<italic>k</italic>
</sub>
]] is returned as the preceding character. As an example,
<italic>M</italic>
, offset(
<italic>S</italic>
) and LF(
<italic>S</italic>
) for the running example can be found in
<xref ref-type="table" rid="gks408-T2">Table 2</xref>
and
<italic>C</italic>
(
<italic>S</italic>
) is given above.
<italic>S</italic>
[SA[0]] = $ is preceded by the character BWT[
<italic>i</italic>
<sub>1</sub>
] = BWT[0] = 
<monospace>G</monospace>
in the running example, where
<italic>i</italic>
<sub>1</sub>
 = LF[
<italic>i</italic>
<sub>0</sub>
] = LF[2] = 0. Consequently,
<monospace>G$</monospace>
is the lexicographically first suffix that starts with
<monospace>G</monospace>
, which translates into offset[
<italic>i</italic>
<sub>1</sub>
] = 0. The first row of
<italic>M</italic>
whose corresponding suffix starts with
<monospace>G</monospace>
has row number
<italic>C</italic>
[
<monospace>G</monospace>
] = 8. Adding the number of suffixes that also start with
<monospace>G</monospace>
, but are lexicographically smaller than
<monospace>G$</monospace>
(=0), returns the position in BWT(
<italic>S</italic>
) of the next character that will be decoded. BWT[8 + 0] = BWT[LF[0]] = 
<monospace>T</monospace>
 = 
<italic>S</italic>
[9]. In the next step,
<italic>S</italic>
[8] is retrieved by computing LF[8] = 11 and BWT[11] = 
<monospace>A</monospace>
. Eventually,
<italic>S</italic>
is retrieved in
<inline-formula>
<inline-graphic xlink:href="gks408i30.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) time using the LF-mapping.</p>
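<p>The inverse transformation described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the arrays <italic>C</italic>, offset and LF are stored as plain Python containers.</p>
<preformat>
```python
# Illustrative sketch; function names are hypothetical.

def lf_mapping(bwt):
    """LF[i] = C[BWT[i]] + offset[i]."""
    counts = {}
    for char in bwt:
        counts[char] = counts.get(char, 0) + 1
    # C[c]: number of characters in the text that are smaller than c.
    c_array, total = {}, 0
    for char in sorted(counts):
        c_array[char] = total
        total += counts[char]
    # offset[i]: number of earlier occurrences of BWT[i] in the BWT.
    seen, offset = {}, []
    for char in bwt:
        offset.append(seen.get(char, 0))
        seen[char] = seen.get(char, 0) + 1
    return [c_array[char] + offset[i] for i, char in enumerate(bwt)]


def inverse_bwt(bwt, sentinel="$"):
    """Recover the text back-to-front by following the LF-mapping,
    starting in the row whose BWT character is the sentinel."""
    lf = lf_mapping(bwt)
    row = bwt.index(sentinel)
    decoded = [sentinel]
    for _ in range(len(bwt) - 1):
        row = lf[row]
        decoded.append(bwt[row])
    return "".join(reversed(decoded))


bwt = "GT$CCGAATAAA"
print(lf_mapping(bwt))
print(inverse_bwt(bwt))
```
</preformat>
<p>For the running example the walk starts in row 2, which holds $ in the BWT, then visits rows 0, 8 and 11, decoding <monospace>G</monospace>, <monospace>T</monospace> and <monospace>A</monospace> in that order.</p>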
<p>The Burrows–Wheeler transform by itself only permutes strings without compressing them. It is, however, easier to compress BWT(
<italic>S</italic>
) than the original string
<italic>S</italic>
, as the order of the characters in BWT(
<italic>S</italic>
) is determined by similar contexts following the characters, analogous to the way suffixes are grouped by LCPs in suffix trees. An immediate consequence is that run-length encoding, which encodes runs of identical characters by their length, shows good compression results for BWT(
<italic>S</italic>
). Apart from run-length encoding (
<xref ref-type="bibr" rid="gks408-B45">45</xref>
,
<xref ref-type="bibr" rid="gks408-B47">47</xref>
), move-to-front lists (
<xref ref-type="bibr" rid="gks408-B45">45</xref>
), wavelet trees (
<xref ref-type="bibr" rid="gks408-B42">42</xref>
,
<xref ref-type="bibr" rid="gks408-B47">47</xref>
,
<xref ref-type="bibr" rid="gks408-B48">48</xref>
) and several entropy encoders, such as Huffman codes (
<xref ref-type="bibr" rid="gks408-B49">49</xref>
,
<xref ref-type="bibr" rid="gks408-B50">50</xref>
), have also been used successfully to compress BWT(
<italic>S</italic>
). For a complete overview of compression techniques based on the Burrows–Wheeler transform, we refer to the book of Adjeroh
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B51">51</xref>
).</p>
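<p>As a small illustration of run-length encoding, the Python sketch below encodes the BWT of the running example as (character, run length) pairs. The function names are hypothetical, and a string of 12 characters is far too short to show a real compression gain. The sketch only shows the encoding itself.</p>
<preformat>
```python
# Illustrative sketch; function names are hypothetical.
from itertools import groupby


def run_length_encode(text):
    """Encode each run of identical characters as (character, length)."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def run_length_decode(pairs):
    """Expand (character, length) pairs back into the original text."""
    return "".join(char * count for char, count in pairs)


bwt = "GT$CCGAATAAA"
encoded = run_length_encode(bwt)
print(encoded)
print(run_length_decode(encoded) == bwt)
```
</preformat>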
<p>Analogous to suffix arrays, BWT(
<italic>S</italic>
) can be used to find exact matches of substrings by applying binary search. Similar to compressed suffix arrays, binary searching BWT(
<italic>S</italic>
) requires auxiliary data structures, including Ψ(
<italic>S</italic>
) and (sampled) SA(
<italic>S</italic>
) (
<xref ref-type="bibr" rid="gks408-B51">51</xref>
), resulting in compressed suffix arrays. Given the relation between BWT(
<italic>S</italic>
) and SA(
<italic>S</italic>
), BWT(
<italic>S</italic>
) can also be utilized for constructing other compressed suffix arrays (
<xref ref-type="bibr" rid="gks408-B52">52</xref>
). Moreover, suffix trees, suffix arrays and other non-self-indexes require a copy of the indexed string
<italic>S</italic>
, which can be replaced by a compressed form of BWT(
<italic>S</italic>
) to reduce space.</p>
</sec>
<sec>
<title>FM-indexes</title>
<p>Another search method for exact string matching can be applied to Burrows–Wheeler transformed strings, using ideas from the inverse transformation algorithm. This method is referred to as ‘backward searching’ and forms the basic search mechanism of ‘FM-indexes’ (
<xref ref-type="bibr" rid="gks408-B45">45</xref>
). FM-index is the short name given by Ferragina and Manzini to their full-text self-indexes that require a ‘minute amount of space’. Their space requirement is proportional to, and sometimes even smaller than, that of the indexed string. FM-indexes can be constructed in
<inline-formula>
<inline-graphic xlink:href="gks408i31.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) time and all occurrences of pattern
<italic>P</italic>
can be located in
<inline-formula>
<inline-graphic xlink:href="gks408i32.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
 + |occ(
<italic>P</italic>
,
<italic>S</italic>
)|log
<italic>n</italic>
) time. Note that finding |occ(
<italic>P</italic>
,
<italic>S</italic>
)| only requires
<inline-formula>
<inline-graphic xlink:href="gks408i33.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
) time, so FM-indexes have theoretically optimal time and space requirements for counting the number of occurrences of a pattern in a string.</p>
<p>The backward search algorithm employed by FM-indexes requires BWT(
<italic>S</italic>
),
<italic>C</italic>
(
<italic>S</italic>
) and a 2D
<italic>n</italic>
 × |Σ| array rank(
<italic>S</italic>
) [In many papers, rank(
<italic>S</italic>
) is referred to as Occ(
<italic>S</italic>
), but to avoid confusion with occ(
<italic>P</italic>
,
<italic>S</italic>
), the name ‘rank’ is used]. This array is defined as rank[
<italic>i</italic>
,
<italic>c</italic>
] ≡ |occ(
<italic>c</italic>
, BWT[..
<italic>i</italic>
])|, 0 ≤ 
<italic>i</italic>
 < 
<italic>n</italic>
,
<italic>c</italic>
 ∈ Σ. For the running example, rank(
<italic>S</italic>
) is shown as columns 9 – 13 in
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
. The role of rank(
<italic>S</italic>
) is similar to the role offset(
<italic>S</italic>
) plays in the inverse transformation of BWT(
<italic>S</italic>
). However, while offset(
<italic>S</italic>
) only stores information on the number of occurrences of one character for each index position, rank(
<italic>S</italic>
) contains this information for all the characters in the alphabet in all index positions. The extra information contained in rank(
<italic>S</italic>
) compared with offset(
<italic>S</italic>
) gives it the advantage of granting random access to LF(
<italic>S</italic>
). Furthermore, rank(
<italic>S</italic>
) is easier to compress than offset(
<italic>S</italic>
) or LF(
<italic>S</italic>
) (
<xref ref-type="bibr" rid="gks408-B51">51</xref>
).</p>
<p>During the course of the search algorithm,
<italic>P</italic>
is matched from right to left. For every step
<italic>i</italic>
, 0 ≤ 
<italic>i</italic>
 < 
<italic>m</italic>
, an interval BWT[
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
 .. 
<italic>e</italic>
<sub>
<italic>i</italic>
</sub>
] is maintained that contains all occurrences of
<italic>P</italic>
[
<italic>m</italic>
 − 
<italic>i</italic>
..]. Initially, [
<italic>s</italic>
<sub>0</sub>
 .. 
<italic>e</italic>
<sub>0</sub>
] ≡ [0..
<italic>n</italic>
 − 1], and after
<italic>m</italic>
steps [
<italic>s</italic>
<sub>
<italic>m</italic>
</sub>
 .. 
<italic>e</italic>
<sub>
<italic>m</italic>
</sub>
] contains the suffix array interval corresponding to occ(
<italic>P</italic>
,
<italic>S</italic>
). Given [
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
 .. 
<italic>e</italic>
<sub>
<italic>i</italic>
</sub>
] and
<italic>c</italic>
 = 
<italic>P</italic>
[
<italic>m</italic>
 − 
<italic>i</italic>
 − 1], the next interval is found using the formulas
<italic>s</italic>
<sub>
<italic>i</italic>
+1</sub>
 = 
<italic>C</italic>
[
<italic>c</italic>
] + rank[
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
 − 1,
<italic>c</italic>
] and
<italic>e</italic>
<sub>
<italic>i</italic>
+1</sub>
 = 
<italic>C</italic>
[
<italic>c</italic>
] + rank[
<italic>e</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>c</italic>
] − 1. Here, array
<italic>C</italic>
(
<italic>S</italic>
) is used to locate the interval of suffixes starting with
<italic>c</italic>
in SA(
<italic>S</italic>
) and array rank(
<italic>S</italic>
) is used to find the number of suffixes starting with
<italic>c</italic>
that are lexicographically smaller and larger than the ones prefixed by
<italic>cP</italic>
[
<italic>m</italic>
 − 
<italic>i</italic>
..]. As an example of backward searching, again consider matching
<italic>P</italic>
 = 
<monospace>CA</monospace>
to the running example in
<xref ref-type="table" rid="gks408-T1">Table 1</xref>
. Initially, the backward search interval is [0..11]. Since
<italic>C</italic>
[
<monospace>A</monospace>
] = 1 and
<italic>C</italic>
[
<monospace>C</monospace>
] = 6, the backward search interval narrows down to [
<italic>s</italic>
<sub>1</sub>
..
<italic>e</italic>
<sub>1</sub>
] = [1..5] in the next step, which corresponds to the suffix array interval containing suffixes starting with
<monospace>A</monospace>
. Note that BWT[3] = BWT[4] = 
<monospace>C</monospace>
, so there are two suffixes starting with
<monospace>A</monospace>
that are preceded by
<monospace>C</monospace>
. Consequently,
<italic>s</italic>
<sub>2</sub>
 = 
<italic>C</italic>
[
<monospace>C</monospace>
] + rank[0,
<monospace>C</monospace>
] = 6 + 0 = 6 and
<italic>e</italic>
<sub>2</sub>
 = 
<italic>C</italic>
[
<monospace>C</monospace>
] + rank[5,
<monospace>C</monospace>
] − 1 = 6 + 2 − 1 = 7. The answer |occ(
<italic>P</italic>
,
<italic>S</italic>
)| = 7 − 6 + 1 = 2 is found in
<inline-formula>
<inline-graphic xlink:href="gks408i34.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
) time. rank[0,
<monospace>C</monospace>
] = 0 means that there are no suffixes starting with
<monospace>C</monospace>
located in SA[0..0] and rank[5,
<monospace>C</monospace>
] = 2 means that there are 2 suffixes starting with
<monospace>C</monospace>
located in SA[0..5]. Also note the resemblance between LF-mapping and backward search:
<italic>s</italic>
<sub>2</sub>
could also have been found as the first occurrence of
<monospace>C</monospace>
in BWT[1..5], which is 3: LF[3] = 6 = 
<italic>s</italic>
<sub>2</sub>
. Likewise,
<italic>e</italic>
<sub>2</sub>
could have been found as the last occurrence of
<monospace>C</monospace>
in BWT[1..5]. However, instead of locating these occurrences, note that offset[3] = rank[3,
<monospace>C</monospace>
] − 1 = rank[1,
<monospace>C</monospace>
] − 1. Thus, the offset(
<italic>S</italic>
) values are stored in rank(
<italic>S</italic>
) at the boundaries of every interval, allowing search intervals to be narrowed down in constant time. As a result, the backward search algorithm of the FM-index simulates a top-down search in a suffix ‘trie’, i.e. a suffix tree where every edge label contains only a single character.</p>
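<p>The backward search described above can be sketched as follows. This is a minimal, uncompressed illustration rather than a real FM-index: the names <monospace>build_fm_tables</monospace> and <monospace>backward_search</monospace> are hypothetical, rank is stored as a plain n × |Σ| table, rank[−1, c] is taken to be 0, and the BWT string GT$CCGAATAAA is an assumed running example that is consistent with the values C[A] = 1, C[C] = 6 and the interval [6..7] given in the text.</p>
<preformat>
```python
def build_fm_tables(bwt):
    """Build C[c] (characters smaller than c) and rank[i][c] (count of c in bwt[0..i])."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += bwt.count(ch)
    rank, running = [], dict.fromkeys(alphabet, 0)
    for ch in bwt:
        running[ch] += 1
        rank.append(dict(running))  # uncompressed table, for illustration only
    return C, rank


def backward_search(pattern, bwt):
    """Return the suffix array interval (s, e) of pattern, or None if it is absent."""
    C, rank = build_fm_tables(bwt)

    def occ(i, c):
        # rank[-1, c] is taken to be 0: nothing precedes the first position
        return rank[i][c] if i >= 0 else 0

    s, e = 0, len(bwt) - 1
    for c in reversed(pattern):  # the pattern is matched from right to left
        if c not in C:
            return None
        s, e = C[c] + occ(s - 1, c), C[c] + occ(e, c) - 1
        if s > e:
            return None
    return s, e


# Assumed running example; "CA" should narrow [0..11] to [1..5] and then to [6..7].
print(backward_search("CA", "GT$CCGAATAAA"))
```
</preformat>
<p>The number of occurrences is e − s + 1. In this sketch the tables are rebuilt on every call for brevity; an actual index would build them once and store rank in compressed form.</p>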
<p>After backward searching has terminated, occ(
<italic>P</italic>
,
<italic>S</italic>
) is still unknown. Using LF-mapping, this set can be retrieved from the interval BWT[
<italic>s</italic>
<sub>
<italic>m</italic>
</sub>
 .. 
<italic>e</italic>
<sub>
<italic>m</italic>
</sub>
]. One possibility is to count the number of backward searches it takes to reach character $ for every
<italic>s</italic>
<sub>
<italic>m</italic>
</sub>
 ≤ 
<italic>i</italic>
 ≤ 
<italic>e</italic>
<sub>
<italic>m</italic>
</sub>
. However, this can take up to n LF-mapping steps per occurrence, which is too slow in practice. To achieve better performance, FM-indexes mark additional positions with suffix array values in BWT(
<italic>S</italic>
). The number of suffix array values stored constitutes a time-space trade-off. Recall that LF[
<italic>i</italic>
] returns the position in SA(
<italic>S</italic>
) of suffix
<italic>S</italic>
[SA[
<italic>i</italic>
] − 1..]. Thus SA[LF[
<italic>i</italic>
]] = SA[
<italic>i</italic>
] − 1, such that LF(
<italic>S</italic>
) can be used to find the next smaller suffix array value. The ability of LF(
<italic>S</italic>
) to find smaller suffix array values is used as an argument to classify FM-indexes as compressed suffix arrays (
<xref ref-type="bibr" rid="gks408-B45">45</xref>
). Moreover, LF(
<italic>S</italic>
) and Ψ(
<italic>S</italic>
) are each other's inverses: SA[LF[
<italic>i</italic>
]] = SA[
<italic>i</italic>
] − 1 and SA[Ψ[
<italic>i</italic>
]] = SA[
<italic>i</italic>
] + 1, hence LF[Ψ[
<italic>i</italic>
]] = Ψ[LF[
<italic>i</italic>
]] = 
<italic>i</italic>
.</p>
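<p>The locating step can be illustrated with a sampled suffix array. The sketch below is an assumption-laden toy: the index is built naively by sorting suffixes, the names <monospace>build_index</monospace> and <monospace>locate</monospace> are hypothetical, and the sampling rule (keep SA values that are multiples of <monospace>sample_rate</monospace>) is only one of several possible choices.</p>
<preformat>
```python
def build_index(text, sample_rate):
    """Naive construction: LF-mapping and sampled suffix array of text (ending in '$')."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' for the row with SA value 0
    C, total = {}, 0
    for ch in sorted(set(bwt)):
        C[ch] = total
        total += bwt.count(ch)
    seen, lf = {}, []
    for ch in bwt:
        lf.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # keep SA[i] only where the text position is a multiple of sample_rate
    samples = {i: pos for i, pos in enumerate(sa) if pos % sample_rate == 0}
    return lf, samples


def locate(i, lf, samples):
    """Recover SA[i] by following LF until a sampled row is reached."""
    steps = 0
    while i not in samples:
        i = lf[i]  # SA[LF[i]] = SA[i] - 1
        steps += 1
    return samples[i] + steps


# Assumed running example; rows 6 and 7 are the interval found for P = CA.
lf, samples = build_index("ACATACAGATG$", 4)
print(sorted(locate(i, lf, samples) for i in (6, 7)))
```
</preformat>
<p>Because text position 0 is always sampled, the walk stops after fewer than <monospace>sample_rate</monospace> steps; a larger sampling rate saves memory at the cost of longer walks, which is the time-space trade-off mentioned above.</p>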
<p>FM-indexes combine fast string matching with low memory requirements. Their original design (
<xref ref-type="bibr" rid="gks408-B45">45</xref>
) compresses BWT(
<italic>S</italic>
) using move-to-front lists, run-length encoding and a variable-length prefix code. In the original paper, rank(
<italic>S</italic>
) was compressed using the ‘Four-Russians’ technique (
<xref ref-type="bibr" rid="gks408-B53">53</xref>
). Roughly speaking, this technique comes down to subdividing the problem into small enough subproblems and indexing all solutions to these small problems in a global table. The subdivision into smaller subproblems is done by recursively splitting arrays into equally sized blocks and storing answers to queries relative to the larger parent block. Other compression methods have been proposed that show better performance in practice (
<xref ref-type="bibr" rid="gks408-B49">49</xref>
) or that give different space-time trade-offs (
<xref ref-type="bibr" rid="gks408-B47">47</xref>
,
<xref ref-type="bibr" rid="gks408-B48">48</xref>
,
<xref ref-type="bibr" rid="gks408-B50">50</xref>
,
<xref ref-type="bibr" rid="gks408-B54">54</xref>
,
<xref ref-type="bibr" rid="gks408-B55">55</xref>
).</p>
<p>Since they allow fast pattern matching while having small memory requirements, FM-indexes have become a very popular tool for different types of genome analyses. Compressed full-text index structures are mainly used for exact string matching, but algorithms for inexact string matching exist (
<xref ref-type="bibr" rid="gks408-B51">51</xref>
,
<xref ref-type="bibr" rid="gks408-B56">56</xref>
). FM-indexes are increasingly used as part of
<italic>de novo</italic>
genome assembly algorithms (
<xref ref-type="bibr" rid="gks408-B17">17</xref>
) and underpin popular tools for mapping reads to reference sequences such as Bowtie (
<xref ref-type="bibr" rid="gks408-B7">7</xref>
), BWA (
<xref ref-type="bibr" rid="gks408-B8">8</xref>
) and SOAP2 (
<xref ref-type="bibr" rid="gks408-B9">9</xref>
).</p>
</sec>
</sec>
<sec>
<title>TIME-MEMORY TRADE-OFFS</title>
<p>The growing volume of sequencing data requires efficient algorithms and data structures to form the backbone of computational tools for storing, processing and analyzing these sequences. Without index structures, many algorithms that rely on string searching would become infeasible because of their long execution times. However, index structures also add a memory overhead to sequence analysis.</p>
<p>Over the last decade, much energy has been put into decreasing the memory consumption of index structures. The proposals differ in the performance overhead incurred by lowering the memory footprint. Some index structures suffer from a logarithmic slowdown, while others allow for the tuning of the space-time trade-off. There are indexes that have been especially designed for certain types of data, whereas others are tweaked for particular hardware architectures. An example of a data-specific property influencing index structure performance is the alphabet size of the sequences. Another major factor that allows classifying index structures is their expressiveness. Suffix trees are considered to have full expressiveness (
<xref ref-type="bibr" rid="gks408-B29">29</xref>
), supporting a large variety of string algorithms. Conversely, the bulk of recent compressed self-index structures are limited to performing mainly (in)exact string matching. These string matching self-indexes are often compared on the basis of four criteria: their performance in extracting a random substring of
<italic>S</italic>
, calculating |occ(
<italic>P</italic>
,
<italic>S</italic>
)| and occ(
<italic>P</italic>
,
<italic>S</italic>
) and their size. An overview of the memory taken by several index structures discussed in this section can be found in
<xref ref-type="table" rid="gks408-T3">Table 3</xref>
. This table expresses memory requirements both as the number of bits required per indexed character and as the estimated index size for full genomes. Note, however, that the list of index structures in
<xref ref-type="table" rid="gks408-T3">Table 3</xref>
is neither complete nor a full overview of the memory-time trade-offs. For example, external memory index structures were omitted, but are covered in the ‘Index structures in external memory’ section. Additionally, peak memory requirements during construction can be much higher than the figures reported here (see the ‘Construction’ section). Furthermore, index structures have parameters that allow manual tuning of the memory-time trade-off. Finally, because expressiveness differs greatly between index structures,
<xref ref-type="table" rid="gks408-T3">Table 3</xref>
does not include any time-related results. Partial results for some algorithms can be found elsewhere (
<xref ref-type="bibr" rid="gks408-B39">39</xref>
,
<xref ref-type="bibr" rid="gks408-B54">54</xref>
,
<xref ref-type="bibr" rid="gks408-B57">57</xref>
,
<xref ref-type="bibr" rid="gks408-B58">58</xref>
).
<table-wrap id="gks408-T3" position="float">
<label>Table 3.</label>
<caption>
<p>Representative memory requirements for different index structure implementations, expressed both as bits per indexed character (column 2) and estimated size in megabytes for several known genomes (columns 3–5)</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th align="left" rowspan="1" colspan="1">Index structure</th>
<th rowspan="1" colspan="1">Bits/char</th>
<th colspan="3" align="center" rowspan="1">Index size per genome (MB)
<hr></hr>
</th>
<th rowspan="1" colspan="1">Reference</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">Yeast</th>
<th rowspan="1" colspan="1">Fruit fly</th>
<th rowspan="1" colspan="1">Human</th>
<th rowspan="1" colspan="1"></th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">
<bold>2-bit encoded string</bold>
</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">35</td>
<td rowspan="1" colspan="1">775</td>
<td rowspan="1" colspan="1">NCBI
<xref ref-type="table-fn" rid="TF1">
<sup>a</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    CSA Grossi
<italic>et al.</italic>
</td>
<td rowspan="1" colspan="1">2.4</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">42</td>
<td rowspan="1" colspan="1">931</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B59">59</xref>
,
<xref ref-type="bibr" rid="gks408-B60">60</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    FM-index</td>
<td rowspan="1" colspan="1">3.36</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">59</td>
<td rowspan="1" colspan="1">1302</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B45">45</xref>
,
<xref ref-type="bibr" rid="gks408-B39">39</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    SSA (best)</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">70</td>
<td rowspan="1" colspan="1">1551</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B47">47</xref>
,
<xref ref-type="bibr" rid="gks408-B57">57</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    CST Russo
<italic>et al.</italic>
<xref ref-type="table-fn" rid="TF2">
<sup>b</sup>
</xref>
</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">87</td>
<td rowspan="1" colspan="1">1939</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B61">61</xref>
,
<xref ref-type="bibr" rid="gks408-B62">62</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    CSA Sadakane (best)</td>
<td rowspan="1" colspan="1">5.6</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">98</td>
<td rowspan="1" colspan="1">2171</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B63">63</xref>
,
<xref ref-type="bibr" rid="gks408-B64">64</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    LZ-index (best)</td>
<td rowspan="1" colspan="1">6.64</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">116</td>
<td rowspan="1" colspan="1">2574</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B57">57</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">
<bold>byte encoded string</bold>
</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">12</td>
<td rowspan="1" colspan="1">139</td>
<td rowspan="1" colspan="1">3102</td>
<td rowspan="1" colspan="1">NCBI
<xref ref-type="table-fn" rid="TF1">
<sup>a</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    CST Navarro
<xref ref-type="table-fn" rid="TF2">
<sup>b</sup>
</xref>
</td>
<td rowspan="1" colspan="1">12</td>
<td rowspan="1" colspan="1">18</td>
<td rowspan="1" colspan="1">209</td>
<td rowspan="1" colspan="1">4653</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B62">62</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    SSA (worst)</td>
<td rowspan="1" colspan="1">20</td>
<td rowspan="1" colspan="1">30</td>
<td rowspan="1" colspan="1">349</td>
<td rowspan="1" colspan="1">7754</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B47">47</xref>
,
<xref ref-type="bibr" rid="gks408-B57">57</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    CST Sadakane
<xref ref-type="table-fn" rid="TF2">
<sup>b</sup>
</xref>
</td>
<td rowspan="1" colspan="1">30</td>
<td rowspan="1" colspan="1">45</td>
<td rowspan="1" colspan="1">523</td>
<td rowspan="1" colspan="1">11 632</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B44">44</xref>
,
<xref ref-type="bibr" rid="gks408-B62">62</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    LZ-index (worst)</td>
<td rowspan="1" colspan="1">35.2</td>
<td rowspan="1" colspan="1">53</td>
<td rowspan="1" colspan="1">614</td>
<td rowspan="1" colspan="1">13 648</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B65">65</xref>
,
<xref ref-type="bibr" rid="gks408-B39">39</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Suffix array</td>
<td rowspan="1" colspan="1">40</td>
<td rowspan="1" colspan="1">60</td>
<td rowspan="1" colspan="1">697</td>
<td rowspan="1" colspan="1">15 509</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B35">35</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Enhanced SA</td>
<td rowspan="1" colspan="1">72</td>
<td rowspan="1" colspan="1">109</td>
<td rowspan="1" colspan="1">1255</td>
<td rowspan="1" colspan="1">27 916</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B19">19</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    WOTD suffix tree</td>
<td rowspan="1" colspan="1">76</td>
<td rowspan="1" colspan="1">115</td>
<td rowspan="1" colspan="1">1325</td>
<td rowspan="1" colspan="1">29 467</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B33">33</xref>
)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    ST McCreight</td>
<td rowspan="1" colspan="1">232</td>
<td rowspan="1" colspan="1">350</td>
<td rowspan="1" colspan="1">4045</td>
<td rowspan="1" colspan="1">89 952</td>
<td rowspan="1" colspan="1">(
<xref ref-type="bibr" rid="gks408-B34">34</xref>
,
<xref ref-type="bibr" rid="gks408-B33">33</xref>
)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Column 6 contains references to the original theoretical proposals and an additional reference to the articles from which these practical estimates originate. For ease of comparison, the index structures are sorted by increasing memory requirements. As a reference, the original (non-indexed) sequence is also included (bold), stored using both 2-bit encoding and byte encoding.</p>
</fn>
<fn id="TF1" fn-type="other">
<p>
<sup>a</sup>
Genome sizes were taken from the NCBI genome information pages
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/genome">http://www.ncbi.nlm.nih.gov/genome</ext-link>
of
<italic>Saccharomyces cerevisiae</italic>
(yeast),
<italic>Drosophila melanogaster</italic>
(fruit fly) and
<italic>Homo sapiens</italic>
(human).</p>
</fn>
<fn id="TF2" fn-type="other">
<p>
<sup>b</sup>
Mean of the interval of possible memory requirements given in (
<xref ref-type="bibr" rid="gks408-B62">62</xref>
).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
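<p>Since every size in Table 3 is the bits-per-character figure multiplied by the genome length, any entry can be cross-checked against the byte-encoded row. The snippet below is a small sketch of that arithmetic; the function name <monospace>scaled_size_mb</monospace> is hypothetical, and the only input taken from the table is the 3102 MB byte-encoded human genome.</p>
<preformat>
```python
def scaled_size_mb(bits_per_char, byte_encoded_mb):
    """Scale the byte-encoded (8 bits per character) size to another bits/char figure."""
    return bits_per_char / 8 * byte_encoded_mb


# Byte-encoded human genome size in MB, as listed in Table 3.
HUMAN_BYTE_ENCODED_MB = 3102

for name, bits in [("FM-index", 3.36), ("Suffix array", 40), ("ST McCreight", 232)]:
    print(name, round(scaled_size_mb(bits, HUMAN_BYTE_ENCODED_MB)))
```
</preformat>
<p>The results agree with the listed 1302, 15 509 and 89 952 MB up to rounding of the byte-encoded figure.</p>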
<p>The remainder of this section focuses on the basic principles behind these index structures and the memory-time trade-offs induced by design choices and confounding factors such as application and data types.</p>
<sec>
<title>Uncompressed index structures</title>
<p>Choosing appropriate data structures for implementing the different components of suffix trees forms a basic step in lowering their memory requirements. These components include nodes, edges, edge labels, leaf numbers and suffix links. The topological information of ST(
<italic>S</italic>
) and the edge labels are traditionally stored as pointers, resulting in suffix trees that require
<inline-formula>
<inline-graphic xlink:href="gks408i35.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) words of usually 32 bits. Note that for very large strings (
<italic>n</italic>
 > 2
<sup>32</sup>
 ≈ 4·10
<sup>9</sup>
) 32 bits is insufficient for storing the pointers, thus larger representations are required. This factor is often overlooked when presenting theoretical results.</p>
<p>There is only one major
<inline-formula>
<inline-graphic xlink:href="gks408i36.jpg"></inline-graphic>
</inline-formula>
(|Σ|)-sized memory-time trade-off in this traditional representation. This trade-off comes from the data structure that handles access to child vertices. Most implementations use one of the following, roughly ordered from high memory requirements and low access time to low memory requirements and high access time: static arrays, dynamic arrays (
<xref ref-type="bibr" rid="gks408-B39">39</xref>
), hash tables, linked lists and layouts with only pointers toward the first child and next sibling. Furthermore, mixed data structures that represent vertices with different numbers of children have also been proposed (
<xref ref-type="bibr" rid="gks408-B66">66</xref>
). Note that for DNA sequences, |Σ| is very small, turning array implementations into a workable solution. Also note that algorithms that perform full suffix tree traversals, such as repeat finding and many other string problems (
<xref ref-type="bibr" rid="gks408-B29">29</xref>
), do not suffer from a performance loss when implemented with more memory-efficient data structures.</p>
<p>In practice, suffix trees and suffix arrays require between 34
<italic>n</italic>
and 152
<italic>n</italic>
bits of memory. The suffix tree implementations described by Kurtz (
<xref ref-type="bibr" rid="gks408-B66">66</xref>
) perform very well and are implemented in the latest release of MUMmer (
<xref ref-type="bibr" rid="gks408-B10">10</xref>
), an open-source sequence analysis tool. The implementation in MUMmer allows indexing DNA sequences up to 250 Mbp on a computer with 4 GB of memory. Single human chromosomes are thus well within reach of standard suffix trees. Another implementation by Giegerich
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B33">33</xref>
) is even smaller, but lacks suffix links. Enhanced suffix arrays (
<xref ref-type="bibr" rid="gks408-B19">19</xref>
) also reach full expressiveness of suffix trees, as described in the previous section. When carefully implemented, they require anything between 40
<italic>n</italic>
and 72
<italic>n</italic>
bits. Enhanced suffix arrays use a linked list to represent the vertices of the tree. However, the
<inline-formula>
<inline-graphic xlink:href="gks408i37.jpg"></inline-graphic>
</inline-formula>
(|Σ|) performance penalty for string matching can be reduced to
<inline-formula>
<inline-graphic xlink:href="gks408i38.jpg"></inline-graphic>
</inline-formula>
(log|Σ|) (
<xref ref-type="bibr" rid="gks408-B41">41</xref>
). Furthermore, enhanced suffix arrays form the basis of the Vmatch program that finds different types of exact and approximate repeats in sequences of several hundreds of Mbp in a few seconds. Moreover, according to a comparison between several implementations of suffix trees and enhanced suffix arrays (
<xref ref-type="bibr" rid="gks408-B39">39</xref>
), enhanced suffix arrays show the best overall performance for both the memory footprint and the traversal times. Finally, their modular design allows replacing some arrays by a compressed counterpart to further reduce space.</p>
</sec>
<sec>
<title>Sparse indexes</title>
<p>An intuitive solution for decreasing index structure memory requirements is sparsification or sampling of suffixes or array indexes. ‘Sparse suffix trees’ (
<xref ref-type="bibr" rid="gks408-B67">67</xref>
) and ‘sparse suffix arrays’ (
<xref ref-type="bibr" rid="gks408-B68">68</xref>
) adopt the idea of utilizing a sparse set of suffixes, whereas compressed suffix arrays and trees sample values in Ψ(
<italic>S</italic>
),
<italic>C</italic>
(
<italic>S</italic>
), rank(
<italic>S</italic>
) and other arrays involved in their design. As a consequence of sparsification, more string comparisons and sequential string searches are required. In return, the size of the index structure can be tuned to the available memory. Although compressed index structures have received more attention in bioinformatics applications, sparse suffix arrays have been successfully used for exact pattern matching, retrieval of maximal exact matches (
<xref ref-type="bibr" rid="gks408-B69">69</xref>
) and read alignment (
<xref ref-type="bibr" rid="gks408-B70">70</xref>
). Furthermore, splitting indexes over multiple sparse index structures has been used for index structures that reside on disk (
<xref ref-type="bibr" rid="gks408-B71">71</xref>
) and for distributed query processing (
<xref ref-type="bibr" rid="gks408-B72">72</xref>
).</p>
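<p>The extra work caused by sparsification can be made concrete with a toy sparse suffix array. The sketch below is one possible scheme, not the method of any cited tool: only suffixes starting at multiples of <monospace>k</monospace> are indexed, the pattern is assumed to be at least <monospace>k</monospace> characters long, and the names <monospace>build_sparse_sa</monospace> and <monospace>sparse_search</monospace> are hypothetical.</p>
<preformat>
```python
from bisect import bisect_left


def build_sparse_sa(text, k):
    """Suffix array restricted to suffixes starting at multiples of k (naive construction)."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])


def sparse_search(pattern, text, ssa, k):
    """Return all start positions of pattern; assumes len(pattern) is at least k."""
    suffixes = [text[i:] for i in ssa]  # materialised only to keep the sketch short
    hits = []
    for shift in range(min(k, len(pattern))):
        head, tail = pattern[:shift], pattern[shift:]
        j = bisect_left(suffixes, tail)  # first sampled suffix not smaller than tail
        while j != len(suffixes) and suffixes[j].startswith(tail):
            start = ssa[j] - shift
            # the skipped prefix of the pattern is verified by direct comparison
            if start >= 0 and text[start:ssa[j]] == head:
                hits.append(start)
            j += 1
    return sorted(hits)


# Assumed example string; with k = 3 only positions 0, 3, 6 and 9 are indexed.
TEXT = "ACATACAGATG$"
print(sparse_search("ACA", TEXT, build_sparse_sa(TEXT, 3), 3))
```
</preformat>
<p>The index holds roughly n / k entries, but each query needs up to k binary searches plus a direct comparison of the skipped prefix, which is the kind of additional string comparison mentioned above.</p>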
<p>Word-based index structures are special cases of sparse index structures which only sample one suffix per word. Although word-based index structures are most popular in the form of inverted files, word-based suffix trees (
<xref ref-type="bibr" rid="gks408-B73">73</xref>
,
<xref ref-type="bibr" rid="gks408-B74">74</xref>
) and suffix arrays (
<xref ref-type="bibr" rid="gks408-B68">68</xref>
) also exist. Although it is possible to divide biological sequences into ‘words’, word-based index structures are generally designed to answer pattern matching queries on natural language data. On natural language data, Transier and Sanders (
<xref ref-type="bibr" rid="gks408-B75">75</xref>
) found that inverted files outperformed full-text indexes by a wide margin. Unfortunately, the inverted files were not compared against word-based implementations of suffix trees and suffix arrays. A somewhat dual approach was taken by Puglisi
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B76">76</xref>
), who adapted inverted files to become full-text indexes able to perform substring queries. They found compressed suffix arrays to generally outperform inverted files for DNA sequences, but the opposite conclusion was drawn for protein sequences. It turns out that compressed suffix arrays perform relatively better compared with inverted files when searching for patterns having fewer occurrences. Note that both comparative studies were performed in primary memory.</p>
</sec>
<sec>
<title>Compressed index structures</title>
<p>Compressed and succinct index structures are currently the most popular forms of index structures used in bioinformatics. Index structures such as compressed suffix arrays and FM-indexes are gradually being built into state-of-the-art read mapping tools and other bioinformatics applications. Whereas traditional index structures require
<inline-formula>
<inline-graphic xlink:href="gks408i39.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
log
<italic>n</italic>
) bits of storage, succinct index structures require
<inline-formula>
<inline-graphic xlink:href="gks408i40.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) bits and the memory footprint of compressed index structures is defined relative to the ‘empirical entropy’ (
<xref ref-type="bibr" rid="gks408-B77">77</xref>
) of a string. Furthermore, these self-indexes contain
<italic>S</italic>
itself, thus saving another
<inline-formula>
<inline-graphic xlink:href="gks408i41.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) bits. Theoretically, this means that the size of compressed index structures can become a fraction of
<italic>S</italic>
itself. In practice, however, DNA and protein sequences do not compress very well (
<xref ref-type="bibr" rid="gks408-B2">2</xref>
,
<xref ref-type="bibr" rid="gks408-B70">70</xref>
). For this reason, the size of compressed index structures is roughly similar to that of
<italic>S</italic>
stored in a compact bit representation. The major disadvantage of compressed index structures is the logarithmic increase in computation time for many string algorithms. This is, however, not the case for all string algorithms. For example, calculating |occ(
<italic>P</italic>
,
<italic>S</italic>
)| can still be done in
<inline-formula>
<inline-graphic xlink:href="gks408i42.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
) time for some compressed indexes. These internal differences between compressed index structures result from their complex nature, as they combine ideas from classical index structures, compression algorithms, coding strategies and other research fields. In the following paragraphs, the conceptual differences of state-of-the-art compressed index structures are surveyed, illustrated with theoretical and practical comparisons wherever possible. A more technical review is found in (
<xref ref-type="bibr" rid="gks408-B42">42</xref>
).</p>
</sec>
<sec>
<title>Auxiliary data structures</title>
<p>Understanding the organization details and properties of compressed index structures requires prior knowledge of the auxiliary data structures involved in their design. Compressed indexes consist of many auxiliary structures that influence their memory-time trade-off, and have properties that dictate their expressiveness and performance for certain types of data. Representation of these auxiliary structures forms an active field of research. What follows is a brief summary of several commonly used auxiliary structures, not including the rather technical implementation details.</p>
<p>Almost all compressed index structures make use of bit vectors
<italic>B</italic>
to support random access and rank(
<italic>B</italic>
) and select(
<italic>B</italic>
) queries. Intuitively, rank(
<italic>B</italic>
) queries count the number of zeroes or ones before a certain index in the vector. Dual to this, select(
<italic>B</italic>
) queries return the position in
<italic>B</italic>
of the
<italic>i</italic>
-th zero or one. They often play a role in granting random access to a compressed or permuted string. Their usefulness, however, goes further than being mere building blocks of compressed index structures. For example, they can also be used to succinctly represent de Bruijn graphs (
<xref ref-type="bibr" rid="gks408-B15">15</xref>
), a typical data structure used in
<italic>de novo</italic>
genome assembly. Formally, rank(
<italic>B</italic>
) is represented as a 2D array defined by rank[
<italic>i</italic>
,
<italic>c</italic>
] ≡ |occ(
<italic>c</italic>
,
<italic>B</italic>
[..
<italic>i</italic>
])|, 0 ≤ 
<italic>i</italic>
 < |
<italic>B</italic>
|,
<italic>c</italic>
 ∈ {0, 1}, similar to rank(
<italic>S</italic>
) for FM-indexes. select(
<italic>B</italic>
) is defined as select[
<italic>i</italic>
,
<italic>c</italic>
] ≡ 
<italic>j</italic>
iff
<italic>i</italic>
 = rank[
<italic>j</italic>
,
<italic>c</italic>
], 0 ≤ 
<italic>i</italic>
 < |occ(
<italic>c</italic>
,
<italic>B</italic>
)|,
<italic>c</italic>
 ∈ {0, 1}. These data structures and their generalizations to non-binary strings strongly influence the memory-time trade-off of compressed index structures (
<xref ref-type="bibr" rid="gks408-B48">48</xref>
). As an example, the array rank(
<italic>S</italic>
) used in FM-indexes can account for up to half of the size of the index. As is the case for other data structures, there is no single optimal implementation for every application, but many proposals exist (
<xref ref-type="bibr" rid="gks408-B78">78</xref>
–
<xref ref-type="bibr" rid="gks408-B80">80</xref>
). The performance also depends on the restrictions imposed by the compressed index structure or the properties of the data, such as the sparsity of the original bit vector. From extremely sparse to more balanced, the best implementations require 0.2
<italic>n</italic>
bits (
<xref ref-type="bibr" rid="gks408-B80">80</xref>
) (1% ones), 0.8
<italic>n</italic>
bits (
<xref ref-type="bibr" rid="gks408-B79">79</xref>
) (20% ones) and 1.4
<italic>n</italic>
bits (
<xref ref-type="bibr" rid="gks408-B80">80</xref>
) (50% ones).</p>
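<p>As a minimal illustrative sketch (not taken from the cited implementations), rank and select on a bit vector can be supported by storing cumulative counts per block and scanning inside a single block. The class name, the block size and the convention that rank counts only the bits strictly before position i are assumptions of this example.</p>
<preformat>
```python
class RankSelectBitVector:
    """Bit vector with sampled counts; a teaching sketch, not a succinct structure."""

    def __init__(self, bits, block=4):
        self.bits = list(bits)
        self.block = block
        # ones_before[k] = number of ones in bits[0 : k * block]
        self.ones_before = [0]
        for start in range(0, len(self.bits), block):
            in_block = sum(self.bits[start:start + block])
            self.ones_before.append(self.ones_before[-1] + in_block)

    def rank(self, i, c):
        """Number of bits equal to c in bits[0:i] (position i itself excluded)."""
        k = i // self.block
        ones = self.ones_before[k] + sum(self.bits[k * self.block:i])
        return ones if c == 1 else i - ones

    def select(self, i, c):
        """Position of the i-th (zero-based) bit equal to c."""
        lo, hi = 0, len(self.bits)
        # Binary search for the smallest position j with rank(j + 1, c) > i.
        while hi > lo:
            mid = (lo + hi) // 2
            if self.rank(mid + 1, c) > i:
                hi = mid
            else:
                lo = mid + 1
        if lo == len(self.bits):
            raise IndexError("fewer than i + 1 occurrences of c")
        return lo
```
</preformat>
<p>With these conventions, rank(select(i, c), c) equals i for every valid i. Larger blocks lower the memory overhead but lengthen the scan, which mirrors the memory-time trade-off discussed above.</p>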
<p>The above results for bit vectors have been generalized to non-binary strings (
<xref ref-type="bibr" rid="gks408-B48">48</xref>
,
<xref ref-type="bibr" rid="gks408-B79">79</xref>
), which are used in many applications, including FM-indexes. A simple idea toward such a generalization is to create |Σ| bit vectors
<italic>B</italic>
<sub>
<italic>c</italic>
</sub>
, with
<italic>B</italic>
<sub>
<italic>c</italic>
</sub>
[
<italic>j</italic>
] = 1 iff
<italic>S</italic>
[
<italic>j</italic>
] = 
<italic>c</italic>
. However, this entails an overhead both in time (random access to
<italic>S</italic>
) and memory. A careful implementation allows eliminating this overhead (
<xref ref-type="bibr" rid="gks408-B48">48</xref>
), but ‘wavelet trees’ (
<xref ref-type="bibr" rid="gks408-B59">59</xref>
) form an even more elegant solution.</p>
<p>Wavelet trees are balanced binary trees with |Σ| leaves. Every node
<italic>v</italic>
in the tree represents a subsequence
<italic>S</italic>
′ of
<italic>S</italic>
formed by the concatenation of all characters that belong to some interval Σ[
<italic>i</italic>
 .. 
<italic>j</italic>
]. The two children of
<italic>v</italic>
are the subsequences formed by the concatenation of all characters of
<italic>S</italic>
′ that belong to
<inline-formula>
<inline-graphic xlink:href="gks408i3.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="gks408i4.jpg"></inline-graphic>
</inline-formula>
respectively. Vertex
<italic>v</italic>
itself is represented by a bit vector
<italic>B</italic>
of size |
<italic>S</italic>
′| that is defined as
<italic>B</italic>
[
<italic>i</italic>
] = 0 iff
<inline-formula>
<inline-graphic xlink:href="gks408i5.jpg"></inline-graphic>
</inline-formula>
. Furthermore,
<italic>B</italic>
is preprocessed so as to resolve rank(
<italic>B</italic>
) and select(
<italic>B</italic>
) in constant time. The wavelet tree for BWT(
<italic>S</italic>
) of the running example is shown in
<xref ref-type="fig" rid="gks408-F2">Figure 2</xref>
, and has the same functionality as BWT(
<italic>S</italic>
) and rank(
<italic>S</italic>
). From this figure, BWT[9] can be found as follows. The root bit vector shows that
<italic>B</italic>
<sub>Σ</sub>
[9] = 0, meaning that BWT[9] is a character from the first half of the alphabet. Since
<italic>B</italic>
<sub>Σ</sub>
[9] is the sixth occurrence of 0 in
<italic>B</italic>
<sub>Σ</sub>
(rank[9,0] = 5), it corresponds to
<italic>B</italic>
<sub>
<monospace>$AC</monospace>
</sub>
[5] (zero-based index). Repetition of this procedure for the vertices corresponding to
<italic>S</italic>
<sub>
<monospace>$AC</monospace>
</sub>
 = 
<monospace>$CCAAAAA</monospace>
and
<italic>S</italic>
<sub>
<monospace>$A</monospace>
</sub>
 = 
<monospace>$AAAAA</monospace>
yields BWT[9] = 
<monospace>A</monospace>
. rank(
<italic>S</italic>
) queries can be resolved in a similar way. Further research on wavelet trees gave rise to Huffman-shaped wavelet trees (
<xref ref-type="bibr" rid="gks408-B60">60</xref>
) and non-binary wavelet trees (
<xref ref-type="bibr" rid="gks408-B48">48</xref>
). This elegant but somewhat complex data structure has become very popular in index structure design. As an example result, all maximal repeats occurring in the complete human genome could be found in less than 17 h on a desktop PC (
<xref ref-type="bibr" rid="gks408-B81">81</xref>
) with 8 GB internal memory using an index structure based on the Burrows–Wheeler transform combined with a sparse wavelet tree implementation of the LCP array. Similar tests using suffix trees or enhanced suffix arrays failed due to the memory bottleneck.
<fig id="gks408-F2" position="float">
<label>Figure 2.</label>
<caption>
<p>Wavelet tree for indexing string
<italic>S</italic>
 = 
<monospace>GT$CCGAATAAA</monospace>
. Only the binary strings are stored in practice. Subsequences of
<italic>S</italic>
are shown only to ease the interpretation. This figure does not include data structures for resolving rank and select queries for every bit vector. For this small example, however, the answer to these queries is straightforward.</p>
</caption>
<graphic xlink:href="gks408f2"></graphic>
</fig>
</p>
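<p>The lookup described above can be sketched as follows. This is a simplified illustration under stated assumptions: the alphabet is split into a lower half (rounded up) and an upper half, rank on the node bit vectors is answered by a naive linear scan rather than in constant time, and the class and method names are hypothetical.</p>
<preformat>
```python
class WaveletTree:
    """Balanced wavelet tree over a sorted alphabet (naive rank on bit vectors)."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) > 1:
            half = (len(self.alphabet) + 1) // 2
            lower = set(self.alphabet[:half])
            # Bit 0 marks characters of the lower half of the alphabet.
            self.bits = [0 if ch in lower else 1 for ch in text]
            self.left = WaveletTree([ch for ch in text if ch in lower],
                                    self.alphabet[:half])
            self.right = WaveletTree([ch for ch in text if ch not in lower],
                                     self.alphabet[half:])

    def _rank_bit(self, i, bit):
        # Linear scan; practical implementations answer this in constant time.
        return self.bits[:i].count(bit)

    def access(self, i):
        """Return the character at position i of the indexed string."""
        node = self
        while node.left is not None:
            bit = node.bits[i]
            i = node._rank_bit(i, bit)
            node = node.left if bit == 0 else node.right
        return node.alphabet[0]

    def rank(self, i, ch):
        """Occurrences of ch (assumed to be in the alphabet) before position i."""
        node = self
        while node.left is not None:
            bit = 0 if ch in node.left.alphabet else 1
            i = node._rank_bit(i, bit)
            node = node.left if bit == 0 else node.right
        return i
```
</preformat>
<p>For the string GT$CCGAATAAA of Figure 2, WaveletTree("GT$CCGAATAAA").access(9) follows the path through the vertices for $AC and $A and returns A, as in the walk-through above.</p>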
<p>Other important index structure building blocks are auxiliary tree representations. Index structures use various types of trees, but a common design problem is the representation of their topology. As an example, suffix tree topology is traditionally implemented using pointers, requiring
<inline-formula>
<inline-graphic xlink:href="gks408i43.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
log
<italic>n</italic>
) bits of memory. In contrast, the popular approach of succinctly representing tree topology by a sequence of balanced parentheses (
<xref ref-type="bibr" rid="gks408-B82">82</xref>
) only requires 2
<italic>n</italic>
 + 
<italic>o</italic>
(
<italic>n</italic>
) bits of memory. This implementation represents nodes in the tree as a pair of parentheses ‘()’. The nested structure of the parentheses then represents the tree (
<xref ref-type="bibr" rid="gks408-B83">83</xref>
), similar to a reduced form of the known Newick Tree Format (
<xref ref-type="bibr" rid="gks408-B84">84</xref>
). Succinct tree topology representations generally support more tree operations in constant or near-constant time than classical pointer-based representations, which only support top-down traversals in constant time. Node depth, subtree size and the lowest common ancestor of two nodes (
<xref ref-type="bibr" rid="gks408-B85">85</xref>
) are examples of properties that can be retrieved in constant time from succinct representations where pointer-based representations require additional data structures to achieve the same performance. In theory, this means that a highly expressive suffix tree topology can be stored using 4
<italic>n</italic>
 + 
<italic>o</italic>
(
<italic>n</italic>
) bits instead of 64
<italic>n</italic>
bits using 32-bit pointers. Note, however, that the
<italic>o</italic>
(
<italic>n</italic>
)-term may become large in practice and even surpass the higher order term. For biological sequences, tests (
<xref ref-type="bibr" rid="gks408-B83">83</xref>
) show that these representations require between 2.1 and 4.84 bits per node, which has to be multiplied by 2
<italic>n</italic>
nodes in the worst case.</p>
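<p>The balanced parentheses encoding can be illustrated with a small sketch. The nested-list tree format and the function names are assumptions of this example, and the queries use linear scans, whereas the succinct structures cited above answer them in constant or near-constant time with auxiliary data of sublinear size.</p>
<preformat>
```python
def encode(tree):
    """Encode a tree given as nested lists (a node is the list of its children)."""
    return "(" + "".join(encode(child) for child in tree) + ")"


def find_close(bp, i):
    """Position of the parenthesis closing the one opened at position i."""
    excess = 0
    for j in range(i, len(bp)):
        excess += 1 if bp[j] == "(" else -1
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")


def subtree_size(bp, i):
    """Number of nodes in the subtree of the node opened at position i."""
    return (find_close(bp, i) - i + 1) // 2


def depth(bp, i):
    """Depth of the node opened at position i (the root has depth 0)."""
    return bp[:i].count("(") - bp[:i].count(")")
```
</preformat>
<p>A tree with five nodes, such as [[], [[], []]], is encoded as the ten-character sequence (()(()())), that is, two bits per node when each parenthesis is stored as one bit.</p>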
<p>Retrieval of the lowest common ancestor of two nodes
<italic>v</italic>
<sub>1</sub>
and
<italic>v</italic>
<sub>2</sub>
, mentioned in the previous paragraph, is a fundamental operation for inexact string matching algorithms (
<xref ref-type="bibr" rid="gks408-B29">29</xref>
). Denoted by LCA(
<italic>v</italic>
<sub>1</sub>
,
<italic>v</italic>
<sub>2</sub>
), it is defined as the unique node
<italic>v</italic>
<sub>3</sub>
for which it holds that ℓ(
<italic>v</italic>
<sub>3</sub>
) = ℓ(LCA(
<italic>v</italic>
<sub>1</sub>
,
<italic>v</italic>
<sub>2</sub>
)) ≡ LCP(ℓ(
<italic>v</italic>
<sub>1</sub>
), ℓ(
<italic>v</italic>
<sub>2</sub>
)). This operation is supported by a combination of LCP arrays and data structures for resolving range minimum queries (
<xref ref-type="bibr" rid="gks408-B86">86</xref>
). This directly follows from the definition of LCA(
<italic>v</italic>
<sub>1</sub>
,
<italic>v</italic>
<sub>2</sub>
). Range minimum query data structures return the positions of the smallest values in any interval of an array. In LCP arrays, they return the length of the LCP of any two suffixes. Furthermore, range minimum query data structures can replace the child array in enhanced suffix arrays (
<xref ref-type="bibr" rid="gks408-B40">40</xref>
), because ℓ-indexes are the positions of minimal values in LCP intervals.</p>
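<p>The relation between range minimum queries and the LCP array can be sketched as follows. The example assumes a naive suffix array construction, the convention that lcp[k] holds the LCP length of the suffixes at ranks k − 1 and k, and a sparse table as the range minimum structure; the succinct structures cited above use far less memory. All function names are hypothetical.</p>
<preformat>
```python
def suffix_array(s):
    """Naive suffix array construction by sorting all suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """lcp[k] = LCP length of the suffixes at ranks k - 1 and k; lcp[0] = 0."""
    def common(a, b):
        n = 0
        while len(s) > max(a, b) + n and s[a + n] == s[b + n]:
            n += 1
        return n
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def build_sparse_table(values):
    """table[k][i] = minimum of values[i : i + 2**k]."""
    table = [list(values)]
    width = 1
    while len(values) >= 2 * width:
        prev = table[-1]
        table.append([min(prev[i], prev[i + width])
                      for i in range(len(values) - 2 * width + 1)])
        width *= 2
    return table


def range_min(table, lo, hi):
    """Minimum of values[lo..hi] (inclusive) using two overlapping blocks."""
    level = (hi - lo + 1).bit_length() - 1
    return min(table[level][lo], table[level][hi - 2 ** level + 1])


def lcp_of_ranks(table, i, j):
    """LCP length of the suffixes at two different ranks i and j."""
    lo, hi = min(i, j), max(i, j)
    return range_min(table, lo + 1, hi)
```
</preformat>
<p>For the hypothetical string banana$, the suffixes a$ (rank 1) and anana$ (rank 3) share the prefix a, and the query returns the minimum of lcp[2..3], which is 1.</p>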
</sec>
<sec>
<title>Compressed suffix arrays</title>
<p>Compression of suffix arrays is based on storing a sparse representation of SA(
<italic>S</italic>
) and storing Ψ(
<italic>S</italic>
) in compressed form. Ψ(
<italic>S</italic>
) has the property that it is increasing in areas of SA(
<italic>S</italic>
) that point to suffixes starting with the same character (
<xref ref-type="bibr" rid="gks408-B87">87</xref>
), which makes it compressible. The first real compressed suffix array was designed by Grossi and Vitter (
<xref ref-type="bibr" rid="gks408-B43">43</xref>
,
<xref ref-type="bibr" rid="gks408-B87">87</xref>
). They built on a hierarchical decomposition of SA(
<italic>S</italic>
) that halves the size of SA(
<italic>S</italic>
) in every level by removing values pointing to odd suffixes and dividing even suffix array values by two. Ψ(
<italic>S</italic>
) is stored in every level for odd suffix array values. rank(
<italic>S</italic>
) and select(
<italic>S</italic>
) data structures are used to retrieve the parity of suffixes on every level of the hierarchy and in an encoding of Ψ(
<italic>S</italic>
) (
<xref ref-type="bibr" rid="gks408-B43">43</xref>
,
<xref ref-type="bibr" rid="gks408-B88">88</xref>
). The number of levels stored in this representation is a parameter that tunes the memory-time trade-off.</p>
<p>Sadakane (
<xref ref-type="bibr" rid="gks408-B63">63</xref>
) further improved the above implementation by incorporating the compressed string into the index structure. A basic version of this self-index does not allow direct access to SA[
<italic>i</italic>
], but instead allows access to
<italic>S</italic>
[SA[
<italic>i</italic>
]], which is sufficient for pattern matching and finding |occ(
<italic>P</italic>
,
<italic>S</italic>
)|. Direct access to SA[
<italic>i</italic>
] and SA
<sup>−1</sup>
[
<italic>i</italic>
] and random access to
<italic>S</italic>
is achieved by incorporating the hierarchical structure by Grossi and Vitter. Sadakane's compressed suffix array was implemented (
<xref ref-type="bibr" rid="gks408-B64">64</xref>
) and constructed for the human genome. The index required ∼5.6
<italic>n</italic>
bits of memory, resulting in an overall memory footprint of <2 GB. Additionally, Sadakane designed a backward search algorithm, similar to that used by FM-indexes, for counting patterns (
<xref ref-type="bibr" rid="gks408-B89">89</xref>
). This strategy is much faster than the traditional binary search used by suffix arrays.</p>
<p>Other compressed suffix array designs incorporated wavelet trees (
<xref ref-type="bibr" rid="gks408-B59">59</xref>
). In practice, an example implementation (
<xref ref-type="bibr" rid="gks408-B60">60</xref>
) required 2.4
<italic>n</italic>
bits of memory for real DNA sequences.</p>
<p>A different solution to lower the memory requirements of suffix arrays was used for ‘compact suffix arrays’ (
<xref ref-type="bibr" rid="gks408-B90">90</xref>
). Here, the compression is based on self-repetitions, so-called runs, in SA(
<italic>S</italic>
). These are suffix array intervals [
<italic>i</italic>
 .. 
<italic>i</italic>
 + ℓ] for which another interval [
<italic>j</italic>
 .. 
<italic>j</italic>
 + ℓ] exists such that SA[
<italic>i</italic>
 + 
<italic>k</italic>
] = SA[
<italic>j</italic>
 + 
<italic>k</italic>
] + 1 for 0 ≤ 
<italic>k</italic>
 ≤ ℓ. In practice, compact suffix arrays take up more memory than existing compressed suffix arrays, but are also faster. It was shown that the number of self-repetitions in SA(
<italic>S</italic>
) is related to the number of equal-character runs in BWT(
<italic>S</italic>
) (
<xref ref-type="bibr" rid="gks408-B91">91</xref>
), which can be compressed by run-length encoding. In terms of compression, however, this technique was superseded by other FM-indexes (
<xref ref-type="bibr" rid="gks408-B47">47</xref>
).</p>
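<p>Run-length encoding of equal-character runs, as mentioned above for BWT(S), can be sketched in a few lines. The function names are hypothetical, and the sketch only shows the encoding itself, not the rank structures an index would need on top of it.</p>
<preformat>
```python
from itertools import groupby


def run_length_encode(text):
    """Collapse every maximal run of equal characters into (character, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def run_length_decode(runs):
    """Inverse of run_length_encode."""
    return "".join(ch * length for ch, length in runs)
```
</preformat>
<p>For the string GT$CCGAATAAA of Figure 2, this yields 8 runs for 12 characters; the gain grows with the repetitiveness of the indexed sequence.</p>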
<p>The above compressed suffix arrays are especially geared toward pattern matching. Some compressed index structures (
<xref ref-type="bibr" rid="gks408-B42">42</xref>
) are able to find |occ(
<italic>P</italic>
,
<italic>S</italic>
)| in
<inline-formula>
<inline-graphic xlink:href="gks408i44.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
) time, but in practice they all require at least
<inline-formula>
<inline-graphic xlink:href="gks408i45.jpg"></inline-graphic>
</inline-formula>
(|occ(
<italic>P</italic>
,
<italic>S</italic>
)|log
<italic>n</italic>
) time for retrieving the actual occurrences of pattern
<italic>P</italic>
. Furthermore, locating the patterns requires many random accesses to the index structures, which degrades performance due to cache misses. This becomes even more severe when the index is ported to secondary memory (
<xref ref-type="bibr" rid="gks408-B92">92</xref>
). This also holds for FM-indexes, as discussed further. González and Navarro (
<xref ref-type="bibr" rid="gks408-B92">92</xref>
) designed locally compressed suffix arrays to cope with this problem. Their index structures are based on sampling exact suffix array values, differentially encoding SA(
<italic>S</italic>
) and encoding this array using dictionaries. However, these index structures are not self-indexes and have to be incorporated into existing compressed suffix arrays or FM-indexes. In practice, the speed for locating patterns is indeed much faster, even compared with the Lempel–Ziv index structures described further. However, their compression rate is not that high, as it requires up to 85% of the size of regular suffix arrays for DNA sequences and 70% for protein sequences.</p>
<p>A practical performance comparison between compressed suffix arrays and plain suffix arrays was made by Sadakane and Shibuya (
<xref ref-type="bibr" rid="gks408-B64">64</xref>
). Both were tested on approximate string matching. Compressed suffix arrays required one sixth of the memory typically needed by plain suffix arrays, but were 2–20 times slower.</p>
</sec>
<sec>
<title>FM-indexes</title>
<p>As previously stated, FM-indexes are compressed full-text indexes based on the Burrows–Wheeler transform. Different memory-time trade-offs are reached for FM-indexes by using different techniques for compressing BWT(
<italic>S</italic>
) and rank(
<italic>S</italic>
). As a reminder, in the original proposal (
<xref ref-type="bibr" rid="gks408-B45">45</xref>
,
<xref ref-type="bibr" rid="gks408-B55">55</xref>
), BWT(
<italic>S</italic>
) is compressed by applying move-to-front transformation, run-length compression and a version of Elias-γ prefix codes (
<xref ref-type="bibr" rid="gks408-B93">93</xref>
). rank(
<italic>S</italic>
) is encoded by cutting the array in blocks and using the Four-Russians technique. In the original practical implementation (
<xref ref-type="bibr" rid="gks408-B49">49</xref>
), the dictionary used for the Four-Russians technique is replaced by a linear scan of a bit vector.</p>
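<p>How BWT(S) and rank(S) cooperate during counting can be illustrated with a backward search sketch. It is a simplification under stated assumptions: BWT(S) is stored uncompressed, rank is answered by a linear scan instead of the encodings described above, S ends with a unique sentinel $ that is smaller than all other characters, and the function names are hypothetical.</p>
<preformat>
```python
def bwt_from_text(s):
    """Burrows-Wheeler transform via a naive suffix array; s must end with '$'."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)


def count_occurrences(bwt, pattern):
    """Number of occurrences of pattern, found by backward search."""
    # first[c] = number of characters in the text that are smaller than c
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total
        total += bwt.count(c)
    lo, hi = 0, len(bwt)  # half-open suffix array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        # rank(c, lo) and rank(c, hi), computed here by scanning the BWT
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```
</preformat>
<p>Each pattern character costs two rank queries, so the running time depends on the pattern length and on the cost of rank, which is why the encoding of rank(S) dominates the memory-time trade-off.</p>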
<p>The above representation of FM-indexes is heavily dependent on the alphabet size. A simple way to reduce this dependence is to use a wavelet tree over BWT(
<italic>S</italic>
) and use any representation of rank(
<italic>S</italic>
) for bit vectors in every internal node (
<xref ref-type="bibr" rid="gks408-B42">42</xref>
). Huffman-shaped wavelet trees are used by ‘succinct suffix arrays’ (
<xref ref-type="bibr" rid="gks408-B47">47</xref>
). In a recent practical survey (
<xref ref-type="bibr" rid="gks408-B54">54</xref>
), this implementation shows the best known practical time-memory trade-offs for the most used basic operations on compressed index structures when applied to DNA and protein sequences. Although its memory footprint is somewhat higher (4
<italic>n</italic>
–20
<italic>n</italic>
bits) than that of the standard FM-index, it is 20 times faster than its classical counterpart (
<xref ref-type="bibr" rid="gks408-B39">39</xref>
). Compared to suffix trees, however, it is 20 times slower. There exist even smaller FM-indexes, such as ‘run-length FM-indexes’ (
<xref ref-type="bibr" rid="gks408-B47">47</xref>
) that apply run-length compression to BWT(
<italic>S</italic>
) prior to building a wavelet tree. A more recent proposal by Ferragina
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B48">48</xref>
), the ‘alphabet-friendly FM-indexes’, theoretically supersedes all previous FM-index implementations. In practice (
<xref ref-type="bibr" rid="gks408-B54">54</xref>
), however, the alphabet-friendly FM-index is superseded by the succinct suffix array for biological sequences. Only for strings with a large alphabet and small high-order entropy (making them highly compressible), such as natural language strings or XML files, do alphabet-friendly FM-indexes outperform other FM-indexes.</p>
<p>Another possibility for lowering the memory dependence of FM-indexes was explored by Grabowski
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B50">50</xref>
). They first Huffman-encoded
<italic>S</italic>
and then applied the Burrows–Wheeler transform. They require sampling some characters from
<italic>S</italic>
in addition to the sampling of SA(
<italic>S</italic>
). Their best implementation slightly outperforms succinct suffix arrays on biological sequences and requires 3.28
<italic>n</italic>
bits of memory on average.</p>
<p>Note that locating patterns using FM-indexes is done by sampling suffix array values, which turns out to be rather slow in practice. A memory-time trade-off is imposed by the sampling rate. Improvements on the pattern locating performance can be made by using more complex sampling strategies, different from basic evenly spaced sampling (
<xref ref-type="bibr" rid="gks408-B49">49</xref>
,
<xref ref-type="bibr" rid="gks408-B54">54</xref>
). An alternative is to incorporate another index structure that supports fast locating of patterns (
<xref ref-type="bibr" rid="gks408-B55">55</xref>
,
<xref ref-type="bibr" rid="gks408-B92">92</xref>
).</p>
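<p>Locating with sampled suffix array values can be sketched as follows, assuming basic evenly spaced sampling of text positions, an uncompressed BWT with linear-scan rank, and a text that ends with a unique smallest sentinel $. The class name and the parameter rate are hypothetical.</p>
<preformat>
```python
class SampledLocator:
    """Recover SA[row] from a sparse sample by walking backwards through the text."""

    def __init__(self, text, rate=3):
        sa = sorted(range(len(text)), key=lambda i: text[i:])
        self.bwt = "".join(text[i - 1] for i in sa)
        # Keep SA[row] only when the text position is a multiple of rate.
        self.samples = {row: pos for row, pos in enumerate(sa) if pos % rate == 0}
        self.first, total = {}, 0
        for ch in sorted(set(self.bwt)):
            self.first[ch] = total
            total += self.bwt.count(ch)

    def lf(self, row):
        """Row of the suffix that starts one text position earlier."""
        ch = self.bwt[row]
        return self.first[ch] + self.bwt[:row].count(ch)

    def locate(self, row):
        """Text position SA[row], using at most rate - 1 LF steps."""
        steps = 0
        while row not in self.samples:
            row = self.lf(row)
            steps += 1
        return self.samples[row] + steps
```
</preformat>
<p>A larger rate stores fewer samples but requires more LF steps per located occurrence, which is the memory-time trade-off imposed by the sampling rate.</p>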
</sec>
<sec>
<title>Lempel-Ziv index structures</title>
<p>Similar to the above compressed full-text index structures, Lempel–Ziv indexes (
<xref ref-type="bibr" rid="gks408-B94">94</xref>
) are mainly designed for pattern matching. Unlike the above compressed index structures, however, Lempel–Ziv indexes are not based on suffix arrays or the Burrows–Wheeler transform. Instead, they build on the dictionary-based Lempel–Ziv (
<xref ref-type="bibr" rid="gks408-B95">95</xref>
) compression technique. Briefly, the LZ78 (
<xref ref-type="bibr" rid="gks408-B96">96</xref>
) compression is achieved by traversing
<italic>S</italic>
and replacing substrings of
<italic>S</italic>
with tuples (
<italic>w</italic>
,
<italic>c</italic>
), where
<italic>w</italic>
is a word from the dictionary and
<italic>c</italic>
 ∈ Σ. Assume that at some point,
<italic>S</italic>
[..
<italic>i</italic>
 − 1] has been compressed and the next tuple in the compressed string is (
<italic>w</italic>
,
<italic>c</italic>
).
<italic>w</italic>
equals the code word for the longest prefix of
<italic>S</italic>
[
<italic>i</italic>
..], say
<italic>S</italic>
[
<italic>i</italic>
 .. 
<italic>j</italic>
], that is already part of the dictionary and
<italic>c</italic>
 = 
<italic>S</italic>
[
<italic>j</italic>
 + 1]. Furthermore,
<italic>S</italic>
[
<italic>i</italic>
 .. 
<italic>j</italic>
 + 1] is added to the dictionary. Note that there are other variants of Lempel–Ziv compression, similar to the technique described here, which are omitted for the sake of brevity.</p>
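<p>The LZ78 parsing described above can be sketched as follows. Code word 0 is assumed to denote the empty word, and a final tuple with an empty character is emitted when the input ends inside a known dictionary word; both conventions, as well as the function names, are assumptions of this example.</p>
<preformat>
```python
def lz78_parse(s):
    """Return the list of (code word, character) tuples of the LZ78 parsing of s."""
    dictionary = {"": 0}
    tuples = []
    phrase = ""
    for ch in s:
        if phrase + ch in dictionary:
            # Keep extending the longest prefix that is already in the dictionary.
            phrase += ch
        else:
            tuples.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        tuples.append((dictionary[phrase], ""))
    return tuples


def lz78_decode(tuples):
    """Rebuild the original string from its LZ78 tuples."""
    phrases = [""]
    for w, c in tuples:
        phrases.append(phrases[w] + c)
    return "".join(phrases[1:])
```
</preformat>
<p>For the hypothetical input AACAACG, the parsing yields the phrases A, AC, AA, C and G, encoded as (0, A), (1, C), (1, A), (0, C) and (0, G).</p>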
<p>Due to space limitations, details on the structure and search algorithms of Lempel–Ziv indexes are omitted, but can be found elsewhere (
<xref ref-type="bibr" rid="gks408-B42">42</xref>
). What is important to note about their structure, however, is that Lempel–Ziv indexes contain many building blocks: compressed or sparse (suffix) tree data structures to compactly represent the dictionaries of forward and reverse code words, data structures for linking those trees and several other auxiliary data structures that answer rank(
<italic>S</italic>
) queries and data structures to answer orthogonal range queries. As a direct consequence, further improvements in these building blocks will improve the performance of Lempel–Ziv indexes. Compared with other compressed index structures, Lempel–Ziv index structures require more memory than other self-indexes on average and they are not competitive for counting occurrences of patterns [
<inline-formula>
<inline-graphic xlink:href="gks408i46.jpg"></inline-graphic>
</inline-formula>
(
<italic>m</italic>
<sup>2</sup>
) time]. They, however, excel at retrieving the exact set of all occurrences occ(
<italic>P</italic>
,
<italic>S</italic>
).</p>
<p>Lempel–Ziv indexes have been turned into self-indexes by Navarro (
<xref ref-type="bibr" rid="gks408-B65">65</xref>
), who also designed an efficient implementation (
<xref ref-type="bibr" rid="gks408-B97">97</xref>
). Further improvements in counting occurrences were made by Ferragina and Manzini (
<xref ref-type="bibr" rid="gks408-B55">55</xref>
), who attached FM-indexes to Lempel–Ziv indexes. Other approaches (
<xref ref-type="bibr" rid="gks408-B98">98</xref>
,
<xref ref-type="bibr" rid="gks408-B99">99</xref>
) have minimized the redundancy caused by an overload of building blocks and have experimented with new auxiliary data structures. Recent tests (
<xref ref-type="bibr" rid="gks408-B54">54</xref>
,
<xref ref-type="bibr" rid="gks408-B57">57</xref>
,
<xref ref-type="bibr" rid="gks408-B99">99</xref>
) show that those new implementations have made Lempel–Ziv indexes more competitive compared with compressed suffix arrays and FM-indexes, but succinct suffix arrays are still reported to have better memory-time trade-offs. In the near future, however, Lempel–Ziv indexes could outperform other indexes for highly compressible strings because all building blocks of Lempel–Ziv index structures can be compressed, while other compressed indexes contain sampled suffix array values, which are incompressible (
<xref ref-type="bibr" rid="gks408-B98">98</xref>
).</p>
</sec>
<sec>
<title>Compressed suffix trees</title>
<p>The above compressed index structures were mainly designed for exact string matching. As such, they do not reach the full expressiveness of suffix trees. Examples of this expressiveness have been previously given as illustration of the different traversal types of suffix trees. In recent years, efforts have been made to increase the flexibility of compressed index structures either by designing index-specific algorithms or by implementing additional auxiliary data structures. Analogous to enhanced suffix arrays, the main auxiliary data structures used for augmenting compressed suffix arrays are succinct representations of LCP arrays (
<xref ref-type="bibr" rid="gks408-B89">89</xref>
), data structures for top-down tree traversals and suffix link support. As an example, the combination of Burrows–Wheeler index structures and wavelet trees for succinct LCP arrays was used for locating all maximal repeats in the whole human genome (
<xref ref-type="bibr" rid="gks408-B81">81</xref>
). Ohlebusch
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B100">100</xref>
), among others, noted that the backward search mechanism mimics top-down suffix ‘trie’ traversal. Using additional data structures to simulate suffix links, they calculated maximal exact matches between DNA sequences, using less memory than, for example, MUMmer (
<xref ref-type="bibr" rid="gks408-B10">10</xref>
).</p>
<p>Instead of developing application-specific compressed index structures, several ‘compressed suffix trees’ (
<xref ref-type="bibr" rid="gks408-B44">44</xref>
) or ‘compressed enhanced suffix arrays’ (
<xref ref-type="bibr" rid="gks408-B101">101</xref>
) have been designed that even surpass the expressiveness of classical suffix trees. Furthermore, because compressed suffix trees extend compressed self-indexes, they are self-indexes themselves. The difference between these structures and the compressed suffix arrays and FM-indexes on which they are built, is their ability to directly implement suffix tree algorithms using these structures. Although the extra data structures increase their memory footprint, compressed suffix trees are still smaller than classical suffix arrays. Furthermore, space-time trade-offs can be tuned to a certain extent, similar to the sparsification parameter in compressed suffix arrays and FM-indexes.</p>
<p>Over the last years, several compressed suffix tree designs have been proposed. These can be classified by their choice of auxiliary data structures, especially the representation of the suffix tree topology (
<xref ref-type="bibr" rid="gks408-B102">102</xref>
). They either use sequences of balanced parentheses or implicit representation by LCP intervals. Additional building blocks are succinct representations of LCP arrays and data structures for performing lowest common ancestor queries, which in turn support suffix links. As an example, the first compressed suffix tree reaching full expressiveness was given by Sadakane (
<xref ref-type="bibr" rid="gks408-B44">44</xref>
). It consists of a compressed suffix array, succinct LCP array, balanced parentheses representation for suffix tree topology and additional data structures for solving range minimum queries. In practice, an engineered version (
<xref ref-type="bibr" rid="gks408-B58">58</xref>
) of this compressed suffix tree required 25
<italic>n</italic>
–35
<italic>n</italic>
bits of memory and was able to index the complete human genome using only 8.5 GB. Compared with classical suffix trees, this compressed variant is two orders of magnitude slower on average. Nevertheless, compressed suffix trees are still much faster than brute force algorithms. Furthermore, many auxiliary data structures used in the design offer a memory-time trade-off which can be optimized for the available memory. Advancements made in representing auxiliary data structures have led to index structures with even smaller memory requirements (
<xref ref-type="bibr" rid="gks408-B85">85</xref>
). The smallest compressed suffix tree we know of (
<xref ref-type="bibr" rid="gks408-B61">61</xref>
) requires only 4
<italic>n</italic>
–6
<italic>n</italic>
bits of memory and is based on sampling the suffix tree. This low memory footprint, however, is paid for by giving up performance, and it is several orders of magnitude slower than Sadakane's compressed suffix tree (
<xref ref-type="bibr" rid="gks408-B62">62</xref>
). Another compressed suffix tree proposed by Fischer
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B103">103</xref>
) has a memory-time trade-off which lies between the two previously mentioned compressed suffix trees. Cánovas and Navarro (
<xref ref-type="bibr" rid="gks408-B62">62</xref>
) engineered an implementation of this compressed suffix tree and compared the impact of different LCP array implementations on the compressed suffix tree. Depending on the implementation of the LCP arrays used, the compressed suffix tree requires between 8
<italic>n</italic>
and 16
<italic>n</italic>
bits of memory. A compressed enhanced suffix array reaching full expressiveness is given by Ohlebusch and Gog (
<xref ref-type="bibr" rid="gks408-B101">101</xref>
). However, it does not support lowest common ancestor queries. Prospects are that space-time trade-offs of compressed index structures will keep improving due to improvements in auxiliary data structures, especially improvements in compressed suffix arrays and compressed LCP arrays.</p>
</sec>
</sec>
<sec>
<title>Index structures in external memory</title>
<p>The solution to the memory bottleneck suffered by (main memory) index structures is to use index structures in external or secondary memory, such as hard disks. This paradigm shift is necessary when even the smallest compressed index structures cannot be stored in main memory. This limit is usually reached when even a compressed form of
<italic>S</italic>
cannot be stored in main memory. Secondary or external memory has the advantages of low cost, abundance and the persistence given to index structures. However, random access to secondary memory (disk) is much slower than random access to primary memory (RAM). In practice, this difference can be up to five orders of magnitude (
<xref ref-type="bibr" rid="gks408-B24">24</xref>
). Since index structures, such as suffix trees, intrinsically access data structures and input strings in a random manner, this leads to the so-called ‘I/O bottleneck’. Several techniques are used to minimize the effect of this bottleneck, both in hardware and in algorithm and data structure design. Solid-state disks, for example, are one order of magnitude faster than classical hard disks. Also, sequential disk access is almost as fast as random access to RAM. Another solution is to limit the number of I/O operations altogether by, for example, decreasing the size of the index structure. Buffering is another strategy commonly employed, as well as improving locality of information that is closely connected. To achieve this locality, redundancy is often introduced in the data structure, which is the opposite of the space-saving techniques seen in main memory indexes. These techniques are not only applied for designing the spatial layout of index structures, but also for their traversal algorithms. In this section, existing index structures for external memory are reviewed with an emphasis on the high-level strategies employed. Other, more technical, reviews on this topic can be found elsewhere (
<xref ref-type="bibr" rid="gks408-B25">25</xref>
,
<xref ref-type="bibr" rid="gks408-B71">71</xref>
,
<xref ref-type="bibr" rid="gks408-B104">104</xref>
).</p>
<sec>
<title>Suffix arrays</title>
<p>Both suffix trees and suffix arrays perform poorly when naively implemented in secondary memory. Because of their simple design, however, suffix arrays are easier to implement on disk. The basic idea is to use levels of sparse suffix arrays in faster memory to guide searches in the full suffix array stored on disk. Baeza-Yates
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B105">105</xref>
) proposed a two-level index structure. They also augmented the sparse suffix array, stored in RAM, with exact prefixes of the suffixes represented in the sparse suffix array. This has the advantage that no random access to
<italic>S</italic>
is needed for matching in the sparse suffix array. Tests revealed that this implementation is five times faster than a naive implementation (
<xref ref-type="bibr" rid="gks408-B106">106</xref>
) of a single-level suffix array on disk. Later, Sinha
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B106">106</xref>
) replaced sparse suffix arrays by pruned suffix ‘tries’ for the first level of the hierarchy. Again, labels on the pruned suffix ‘trie’ are explicitly stored instead of pointers to
<italic>S</italic>
. Sinha
<italic>et al.</italic>
also improved the second level of the hierarchy by storing SA(
<italic>S</italic>
), LCP(
<italic>S</italic>
) and substrings of
<italic>S</italic>
, to minimize random access to
<italic>S</italic>
. Note that in primary memory, redundancy is eliminated, whereas in secondary memory it is introduced to increase performance. Tests showed that this method is five times faster than the two-level method of Baeza-Yates
<italic>et al.</italic>
and requires ∼10 times fewer non-sequential I/O operations for pattern matching.</p>
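<p>The two-level idea described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited authors: the class name, the parameters step and prefix_len, and the use of an in-memory list as a stand-in for the suffix array on disk are all assumptions made for illustration. The sparse level keeps explicit prefixes, so narrowing the search to a few blocks requires no access to the text itself.</p>
<preformat><![CDATA[
```python
from bisect import bisect_left, bisect_right


def build_suffix_array(s):
    """Naive construction: sort suffix start positions by suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])


class TwoLevelSuffixArray:
    """Sparse level in RAM guiding searches in a full suffix array 'on disk'.

    Hypothetical illustration: disk_sa is an ordinary list that stands in
    for a file, and block_reads counts the simulated disk accesses.
    """

    def __init__(self, s, step=4, prefix_len=3):
        self.s = s
        self.step = step
        self.prefix_len = prefix_len
        # Level 2: the full suffix array (would live on disk).
        self.disk_sa = build_suffix_array(s)
        # Level 1: every step-th suffix, stored as an explicit prefix in RAM.
        self.ram_prefixes = [s[p:p + prefix_len] for p in self.disk_sa[::step]]
        self.block_reads = 0

    def _read_block(self, b):
        """Fetch one block of suffix array entries (one simulated disk read)."""
        self.block_reads += 1
        return self.disk_sa[b * self.step:(b + 1) * self.step]

    def find(self, pattern):
        """Return all start positions of pattern in S, in increasing order."""
        q = pattern[:self.prefix_len]
        # Truncating sorted strings keeps them sorted, so bisect is valid here.
        keys = [p[:len(q)] for p in self.ram_prefixes]
        first = max(0, bisect_left(keys, q) - 1)
        last = bisect_right(keys, q) - 1
        hits = []
        for b in range(first, last + 1):
            for pos in self._read_block(b):
                if self.s.startswith(pattern, pos):
                    hits.append(pos)
        return sorted(hits)


if __name__ == "__main__":
    index = TwoLevelSuffixArray("mississippi", step=3, prefix_len=2)
    print(index.find("ssi"), "blocks read:", index.block_reads)
```
]]></preformat>
<p>In this sketch only the blocks whose sampled prefixes bracket the pattern are read, which is the effect the sparse level is meant to achieve. A real implementation would also store text fragments next to the suffix array entries to avoid the final verification against the text.</p>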
<p>A larger number of levels is used in the design of ‘string B-trees’ (
<xref ref-type="bibr" rid="gks408-B107">107</xref>
). These index structures act as conceptual B-trees (
<xref ref-type="bibr" rid="gks408-B27">27</xref>
) over suffix arrays. Similar to B-trees, internal nodes are B-ary and the final suffix array values are found in the leaves. To speed up the search through the B-tree, each internal node
<italic>v</italic>
contains a ‘Patricia tree or blind tree’ for the suffixes in
<italic>v</italic>
. Blind trees are suffix tree variants for which edge labels are stored as the first character of the label and its length. Pattern matching in blind trees consists of two phases. A first phase, similar to pattern matching in suffix trees, finds candidate positions according to the matched characters on the edges of the tree. A second phase explicitly compares the pattern to the candidate substrings in
<italic>S</italic>
. This type of edge labeling followed by a blind search can also be applied to all external memory suffix tree implementations to minimize random access to
<italic>S</italic>
. This data structure has the advantage that pattern matching is theoretically I/O optimal and updates are supported due to its B-tree nature. Furthermore, succinct cache-oblivious string B-trees have been developed (
<xref ref-type="bibr" rid="gks408-B108">108</xref>
). Note that string B-trees are not suffix trees and thus do not reach full expressiveness. Another disadvantage is that the blind search method used is impractical for inexact string matching (
<xref ref-type="bibr" rid="gks408-B109">109</xref>
).</p>
<p>Distribution of suffix arrays has also been proposed (
<xref ref-type="bibr" rid="gks408-B72">72</xref>
). This allows processing batches of queries in parallel by dividing SA(
<italic>S</italic>
) into intervals or by interleaving suffix array values. This interleaving can be done by assigning every
<italic>k</italic>
-th suffix to a single computing unit or by grouping the suffixes of a substring of
<italic>S</italic>
together in one node, thus minimizing access to
<italic>S</italic>
. Although these designs look promising, we have no knowledge of any recent performance results for string matching algorithms on biological data using any of the above external memory suffix arrays.</p>
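<p>The three distribution schemes mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the computing units are simulated by a plain loop, and the local search rebuilds its comparison keys on every query, which a real system would avoid.</p>
<preformat><![CDATA[
```python
from bisect import bisect_left, bisect_right
from math import ceil


def build_suffix_array(s):
    """Naive construction: sort suffix start positions by suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def split_into_intervals(sa, units):
    """Each unit receives one contiguous interval of the suffix array."""
    size = max(1, ceil(len(sa) / units))
    return [sa[u * size:(u + 1) * size] for u in range(units)]


def split_interleaved(sa, units):
    """Unit u receives suffix array entries u, u + units, u + 2*units, ..."""
    return [sa[u::units] for u in range(units)]


def split_by_text_region(sa, text_len, units):
    """Each unit receives the suffixes that start in its own substring of S."""
    chunk = max(1, ceil(text_len / units))
    return [[p for p in sa if p // chunk == u] for u in range(units)]


def local_search(s, local_sa, pattern):
    """Binary search inside one unit; local_sa is still sorted by suffix."""
    keys = [s[p:p + len(pattern)] for p in local_sa]
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern)
    return local_sa[lo:hi]


def distributed_search(s, parts, pattern):
    """Query every unit and merge the answers (the loop stands in for parallelism)."""
    hits = []
    for part in parts:
        hits.extend(local_search(s, part, pattern))
    return sorted(hits)


if __name__ == "__main__":
    text = "acgtacgtacca"
    sa = build_suffix_array(text)
    for parts in (split_into_intervals(sa, 3),
                  split_interleaved(sa, 3),
                  split_by_text_region(sa, len(text), 3)):
        print(distributed_search(text, parts, "ac"))
```
]]></preformat>
<p>All three schemes return the same answers. They differ in which unit holds which part of the work: intervals concentrate a query on few units, interleaving spreads every query over all units, and grouping by text region lets each unit keep only its own substring of the text.</p>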
</sec>
<sec>
<title>Suffix trees</title>
<p>Because of the underlying tree data structure, efficient implementation on disk is more difficult for suffix trees than for suffix arrays. Although many papers about external memory suffix trees exist, most of them focus on construction in external memory. Less attention has been given to optimizing suffix tree layout for traversals and even fewer performance tests are available for algorithms that make use of external memory suffix trees. The most important factor in designing external memory representations of suffix trees is the grouping of nodes into blocks and the layout of these blocks onto disk. Other important aspects are node and edge label representations. For locality reasons, array-based representations are superior to other implementations (
<xref ref-type="bibr" rid="gks408-B110">110</xref>
) and nodes contain more information than their primary memory counterparts, while edge labels can be compactly represented by their first character and length as in blind trees. An example of this strategy is one of the earliest external suffix trees, the ‘compact Patricia tree’ (
<xref ref-type="bibr" rid="gks408-B111">111</xref>
), which uses a topology representation similar to the balanced parentheses representation.</p>
<p>A very intuitive external memory suffix tree layout is that of partitioning by prefixes. The suffix tree is split into an upper root-block and blocks containing the subtrees of a given prefix. This layout is similar to the two-level hierarchical layout for suffix arrays. For top-down traversals of the suffix tree, it works well in practice. Furthermore, this layout is created naturally during construction (
<xref ref-type="bibr" rid="gks408-B109">109</xref>
,
<xref ref-type="bibr" rid="gks408-B110">110</xref>
). A disadvantage, however, is its scalability. Although these indexes can be constructed for the human genome (
<xref ref-type="bibr" rid="gks408-B112">112</xref>
), larger sequences or data sets suffer from either a large growth in the size of the partitions or an exponential growth in the number of partitions. Moreover, data skewness results in decreasing performance, as some partitions are much larger than others. In theory, a multi-level hierarchical structure could alleviate the scalability problem and data skewness has already been tackled by using variable length prefixes (
<xref ref-type="bibr" rid="gks408-B113">113</xref>
,
<xref ref-type="bibr" rid="gks408-B114">114</xref>
). Suffix links are another weakness of external memory suffix trees. These links imply a lot of random access and are thus optional (
<xref ref-type="bibr" rid="gks408-B113">113</xref>
,
<xref ref-type="bibr" rid="gks408-B114">114</xref>
) or completely omitted (
<xref ref-type="bibr" rid="gks408-B109">109</xref>
,
<xref ref-type="bibr" rid="gks408-B112">112</xref>
). On the other hand, some authors (
<xref ref-type="bibr" rid="gks408-B113">113</xref>
) claim that the use of suffix links in external memory improves performance of some search algorithms, such as finding maximal exact matches. Clifford (
<xref ref-type="bibr" rid="gks408-B115">115</xref>
) designed ‘distributed suffix trees’, which contain a local version of suffix links, called ‘sparse suffix links’. These links point to the local root if the normal suffix link would point to a node in a different partition. Clifford points out that prefix partitioning allows traversals on the suffix tree to be run in parallel on the distributed subtrees. Furthermore, he claims that most bioinformatics applications do not need traversals that require communication between the different prefix-partitioned parts. Thus, prefix partitioning enables the parallelization of most search algorithms on suffix trees.</p>
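<p>Partitioning by prefixes, and the data skew it can suffer from, can be made concrete with a small sketch. The code below does not build suffix trees: each partition is simply the list of suffix positions that would end up in the subtree of a given prefix, and all function names are hypothetical. A fixed prefix length k is assumed, whereas the variable-length prefixes mentioned above would split the largest partitions further.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


def prefix_partitions(s, k):
    """Group suffix start positions by their first k characters.

    Each group corresponds to the subtree below one prefix; suffixes
    shorter than k form small partitions keyed by the whole suffix.
    """
    parts = defaultdict(list)
    for i in range(len(s)):
        parts[s[i:i + k]].append(i)
    return dict(parts)


def find(parts, s, pattern, k):
    """Exact matching: only partitions compatible with the pattern are visited."""
    if len(pattern) >= k:
        candidates = parts.get(pattern[:k], [])
    else:
        candidates = [p for key, positions in parts.items()
                      if key.startswith(pattern) for p in positions]
    return sorted(p for p in candidates if s.startswith(pattern, p))


def skew(parts):
    """Return (number of partitions, largest size, mean size)."""
    sizes = [len(positions) for positions in parts.values()]
    return len(sizes), max(sizes), sum(sizes) / len(sizes)


if __name__ == "__main__":
    text = "acgtacgtaaaaaaaaaaaacgt"
    partitions = prefix_partitions(text, 2)
    print(skew(partitions))
    print(find(partitions, text, "acg", 2))
```
]]></preformat>
<p>On a repetitive input such as the one in the usage example, the partition for the prefix aa is much larger than the others, which is the skew problem described above. Increasing k reduces the largest partition, but the number of possible partitions grows exponentially in k.</p>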
<p>For exact pattern matching, prefix-partitioned suffix trees work well. For other queries, however, it has been proposed to transform the layout of an already constructed prefix-partitioned suffix tree. The goal of changing layouts is to increase scalability and improve the locality of the nodes, although the new layout could increase the number of I/O operations for pattern matching. Different techniques have been proposed to achieve this goal. Clark and Munro (
<xref ref-type="bibr" rid="gks408-B111">111</xref>
) focused on minimizing the number of blocks required to store suffix trees using a greedy bottom-up algorithm. ‘STELLAR’ (
<xref ref-type="bibr" rid="gks408-B116">116</xref>
), on the other hand, focused on improving locality of nodes for both parent–child links as well as suffix links. Other layouts introduce redundancy of data by having the subtrees stored in blocks on disk overlap (
<xref ref-type="bibr" rid="gks408-B117">117</xref>
,
<xref ref-type="bibr" rid="gks408-B118">118</xref>
). Although the redundancy introduced increases the memory footprint of the index structures, it improves locality of the nodes and improves the scalability of the index structures. Care has to be taken, however, not to destroy some of the expressiveness of suffix trees, including LCP values and suffix links.</p>
<p>In practice, the largest indexed single DNA sequence found in the literature contains 12 billion base pairs (
<xref ref-type="bibr" rid="gks408-B119">119</xref>
). Although no extensive performance results for string algorithms on this index were given, disk-based index structures are known to be several times faster than non-indexed methods for string matching on the scale of the human genome. Compared with string B-trees, disk-based suffix trees require a similar number of I/O operations (
<xref ref-type="bibr" rid="gks408-B104">104</xref>
) for pattern matching. Furthermore, Halachev
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B120">120</xref>
) showed that for protein data, pattern matching on disk-based suffix trees can be almost as fast as pattern matching on enhanced suffix arrays. As an example of other applications, a disk-based enhanced suffix array has been used to locate repeats in human chromosomes (
<xref ref-type="bibr" rid="gks408-B12">12</xref>
).</p>
</sec>
<sec>
<title>Compressed index structures</title>
<p>Data compression and indexing are very important in computational biology, although they seem to be opposites at first sight. With the rise of compressed index structures, this dichotomy can be considered solved (
<xref ref-type="bibr" rid="gks408-B2">2</xref>
) for the RAM model. However, designing a disk-based version of these indexes is non-trivial, because compressed suffix arrays and FM-indexes perform many random accesses and show a poor locality (
<xref ref-type="bibr" rid="gks408-B92">92</xref>
). Nevertheless, some compressed index structures for external memory do exist.</p>
<p>Mäkinen
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B121">121</xref>
) designed a secondary memory version of the compressed suffix array by Sadakane (
<xref ref-type="bibr" rid="gks408-B88">88</xref>
) using a multi-level hierarchical structure. They also designed a distributed compressed suffix array. External memory variants of FM-indexes have been developed by González and Navarro (
<xref ref-type="bibr" rid="gks408-B122">122</xref>
). They proposed external memory versions for auxiliary data structures for calculating rank(
<italic>B</italic>
) and select(
<italic>B</italic>
) and proposed a two-level hierarchy for storing rank(
<italic>S</italic>
). Different structures were designed for representing BWT(
<italic>S</italic>
) on disk, each with different trade-offs depending on the size of the available main memory. For fast locating, they adopted the locally compressed suffix array (
<xref ref-type="bibr" rid="gks408-B92">92</xref>
). Arroyuelo and Navarro (
<xref ref-type="bibr" rid="gks408-B123">123</xref>
) designed an external memory Lempel–Ziv index based on the Lempel–Ziv index structure proposed by Navarro (
<xref ref-type="bibr" rid="gks408-B65">65</xref>
). A recent article by Russo
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B124">124</xref>
) shows how parallel and distributed compressed suffix arrays can efficiently answer more advanced queries such as longest common substrings. Furthermore, they designed parallel and distributed compressed suffix trees.</p>
<p>Although the idea of reducing space in external memory to reduce the number of I/O operations is interesting, it is not known how this affects performance in practice. Some tests on natural language data suggest that compressed index structures are competitive in practice, although they are somewhat slower than string B-trees (
<xref ref-type="bibr" rid="gks408-B122">122</xref>
).</p>
<p>Recently, Chien
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B125">125</xref>
) proposed a new transformation, called the ‘geometric Burrows–Wheeler transform’, which connects index structures with range searching. It translates characters of a string into 2D points and vice versa and uses the vast research on 2D range queries to answer pattern matching queries. To achieve a succinct representation, sparsification is used by grouping substrings in meta characters. For external memory purposes it uses a string B-tree to find ranges in the sparse suffix array, while 2D search can be done using a wavelet tree. Tests (
<xref ref-type="bibr" rid="gks408-B104">104</xref>
) show that these compressed index structures are smaller compared with other external memory index structures, but they require more I/O operations. Another application opened by these index structures is the possibility to answer relevance queries (
<xref ref-type="bibr" rid="gks408-B104">104</xref>
). As an example, it would be possible to retrieve only the top
<italic>k</italic>
most similar sequences in a database.</p>
</sec>
</sec>
<sec>
<title>CONSTRUCTION</title>
<p>Before index structures can be used, they first have to be constructed. Although construction is fast in theory, this is not always the case in practice. The current bottlenecks in constructing disk-based index structures for very large strings are memory limitations in the working space, cache misses and a high number of random accesses to secondary memory. The working space is the amount of memory required by the construction algorithm, which is usually higher than the memory required by the final index. Apart from dealing with these issues, some research has focused on parallelizing construction algorithms. In this section, an overview of existing construction algorithms for various index structures is given, illustrated with practical results found in the literature. Note that the figures in this section represent some of the historical breakthroughs in index structure construction, and are not meant as a comparison between the cited implementations. As a general reference, reported index structure construction times for the human genome, or for sequences in the same order of magnitude, were in the range of a few hours on desktop computers and in the range of minutes on clusters and specialized hardware.</p>
<sec>
<title>Suffix trees</title>
<p>Historically, suffix tree construction goes back to Weiner (
<xref ref-type="bibr" rid="gks408-B28">28</xref>
), who gave a first
<inline-formula>
<inline-graphic xlink:href="gks408i47.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) algorithm. Later, Ukkonen (
<xref ref-type="bibr" rid="gks408-B126">126</xref>
) gave a simpler
<inline-formula>
<inline-graphic xlink:href="gks408i48.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) algorithm, which has the nice property of being online, i.e. a new string can be added to the suffix tree by appending it to the back of the previous strings. The WOTD suffix tree by Giegerich
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B33">33</xref>
) comes with a lazy construction algorithm, in the sense that suffix tree nodes are added the first time that a traversal algorithm requires these nodes. Thus, suffix trees can also be efficiently used for smaller applications that do not require information about the whole tree. The suffix links that are a by-product of Ukkonen's algorithm have very nice features, as discussed in the ‘Popular index structures’ section, but they are omitted in other construction algorithms. To retrieve these suffix links, some post-processing algorithms exist (
<xref ref-type="bibr" rid="gks408-B127">127</xref>
). Although the above-mentioned suffix tree construction algorithms only scale up to chromosome level, they form the basis for many external memory construction algorithms. While a main memory suffix tree for the whole human genome was constructed by Kurtz (
<xref ref-type="bibr" rid="gks408-B66">66</xref>
), most main memory index structure construction algorithms focus on suffix arrays and compressed index structures.</p>
</sec>
<sec>
<title>Suffix arrays</title>
<p>Originally, linear time suffix array construction required the construction of the suffix tree (
<xref ref-type="bibr" rid="gks408-B29">29</xref>
). During the last decade, however, many direct suffix array construction algorithms have been proposed. A taxonomy of existing suffix array construction algorithms is given by Puglisi
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B128">128</xref>
). Since suffix array construction consists of sorting all suffixes of
<italic>S</italic>
, many algorithms are based on known sorting algorithms. One of the most popular algorithms is the recursive
<inline-formula>
<inline-graphic xlink:href="gks408i49.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) KS3 algorithm of Kärkkäinen and Sanders (
<xref ref-type="bibr" rid="gks408-B129">129</xref>
). It can be modified to a parallel and external memory version, called DC3 (
<xref ref-type="bibr" rid="gks408-B130">130</xref>
), which can construct SA(
<italic>S</italic>
) for the whole human genome using only 1 GB RAM and for which a Message Passing Interface (MPI) version exists that has indexed the human genome in only a few minutes (on specialized hardware) (
<xref ref-type="bibr" rid="gks408-B131">131</xref>
). However, it was noted elsewhere that DC3 is unable to index strings longer than 4 Gb (
<xref ref-type="bibr" rid="gks408-B132">132</xref>
). Other algorithms try to minimize the working space in internal memory. So-called ‘lightweight’ (
<xref ref-type="bibr" rid="gks408-B133">133</xref>
,
<xref ref-type="bibr" rid="gks408-B134">134</xref>
) construction algorithms have a working space that approaches the theoretical minimum. Furthermore, according to extensive tests on biological sequences made by Mori (among others,
<ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/libdivsufsort/">http://code.google.com/p/libdivsufsort/</ext-link>
), they are the fastest construction algorithms in practice. Another trick utilized is to only sort suffixes up to a certain LCP value, leading to ‘partial suffix arrays’. Although the expressiveness of partial suffix arrays is unclear, they have already been applied for error correction of sequencing reads (
<xref ref-type="bibr" rid="gks408-B14">14</xref>
). For the construction of enhanced suffix arrays, efficient LCP array construction algorithms have been developed (
<xref ref-type="bibr" rid="gks408-B38">38</xref>
) and
<inline-formula>
<inline-graphic xlink:href="gks408i50.jpg"></inline-graphic>
</inline-formula>
(
<italic>n</italic>
) algorithms exist for the construction of the other tables (
<xref ref-type="bibr" rid="gks408-B19">19</xref>
,
<xref ref-type="bibr" rid="gks408-B127">127</xref>
).</p>
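<p>As a point of reference for the algorithms discussed above, the following sketch constructs a suffix array by directly sorting all suffixes and then derives the LCP array in linear time. The sorting step is the naive approach, not one of the cited lightweight or recursive algorithms, and the LCP step follows the well-known scheme commonly attributed to Kasai and co-workers; whether this is the algorithm intended by the citation above is an assumption. The function names are illustrative.</p>
<preformat><![CDATA[
```python
def suffix_array(s):
    """Naive suffix array: sort start positions by the suffixes themselves.

    Simple but slow and memory hungry (each comparison may scan long
    suffixes); practical algorithms avoid materialising the suffixes.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """LCP array in O(n) time given the suffix array.

    lcp[r] is the length of the longest common prefix of the suffixes
    at ranks r - 1 and r; lcp[0] is defined as 0.
    """
    n = len(s)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    lcp = [0] * n
    h = 0  # current match length, reused from one text position to the next
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)                   # [5, 3, 1, 0, 4, 2]
    print(lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]
```
]]></preformat>
<p>The suffix array together with the LCP array forms the core of an enhanced suffix array. The working space of this sketch is far above what the lightweight algorithms need, which is exactly the gap that those algorithms address.</p>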
</sec>
<sec>
<title>Compressed index structures</title>
<p>Working space is even more important for compressed full-text index structures. Compressed suffix arrays, FM-indexes and regular suffix arrays can easily be obtained from one another. However, suffix array construction requires 40
<italic>n</italic>
– 48
<italic>n</italic>
bits of memory, whereas FM-indexes can be stored in only 2
<italic>n</italic>
bits. Despite this, lightweight suffix array construction algorithms (
<xref ref-type="bibr" rid="gks408-B134">134</xref>
) are used by Burrows–Wheeler-based read mapping tools, such as BWA (
<xref ref-type="bibr" rid="gks408-B8">8</xref>
). Direct and lightweight construction of compressed index structures is therefore an important issue. A gap between theory and practice existed for several years, but several practical results have been reported recently. For example, a lightweight Burrows–Wheeler construction algorithm by Kärkkäinen (
<xref ref-type="bibr" rid="gks408-B135">135</xref>
) requires only 8
<italic>n</italic>
bits of working space for DNA sequences (which is equal to the size of a normal text string) and was implemented in the short read mapping tool Bowtie (
<xref ref-type="bibr" rid="gks408-B7">7</xref>
). Other direct construction algorithms include the parallel algorithm of Sirén (
<xref ref-type="bibr" rid="gks408-B136">136</xref>
) and the lightweight construction algorithms in both internal and external memory settings of Ferragina
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B132">132</xref>
). The former has the added value of being able to merge existing compressed suffix arrays, and the latter have a very low working space. Moreover, a parallel BWT(
<italic>S</italic>
) construction algorithm (
<xref ref-type="bibr" rid="gks408-B137">137</xref>
) based on the Google MapReduce (
<xref ref-type="bibr" rid="gks408-B18">18</xref>
) framework has recently indexed the human genome in ∼10 min on the Amazon Elastic Compute Cloud. Finally, a lightweight construction algorithm for Lempel–Ziv indexes (
<xref ref-type="bibr" rid="gks408-B138">138</xref>
) has been reported that is competitive with construction algorithms for other compressed full-text indexes.</p>
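<p>The relation between suffix arrays and the Burrows–Wheeler transform, and the memory figures quoted above, can be made concrete with a short sketch. The BWT is read off an already constructed suffix array, which is the indirect route whose working space the direct algorithms try to avoid. The genome length of 3 billion characters is a rounded assumption used only for the arithmetic, and the function names are illustrative.</p>
<preformat><![CDATA[
```python
def suffix_array(s):
    """Naive suffix array by sorting suffixes (for illustration only)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_suffix_array(s, sa):
    """BWT(S): the character preceding each suffix in sorted order.

    S is assumed to end with a unique sentinel that is smaller than all
    other characters; for the suffix at position 0, s[-1] is the sentinel.
    """
    return "".join(s[pos - 1] for pos in sa)


def gigabytes(bits):
    """Convert a number of bits to gigabytes (10**9 bytes)."""
    return bits / 8 / 1e9


if __name__ == "__main__":
    text = "banana$"
    print(bwt_from_suffix_array(text, suffix_array(text)))  # annb$aa

    n = 3_000_000_000  # assumed, rounded length of the human genome
    print("suffix array construction:", gigabytes(40 * n), "to", gigabytes(48 * n), "GB")
    print("lightweight BWT working space (8n bits):", gigabytes(8 * n), "GB")
    print("FM-index (2n bits):", gigabytes(2 * n), "GB")
```
]]></preformat>
<p>Under the assumed length, 40n to 48n bits correspond to 15 to 18 GB, whereas 2n bits correspond to 0.75 GB. The difference between these figures is why construction, rather than the final index, is often the limiting step.</p>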
</sec>
<sec>
<title>External memory suffix tree construction</title>
<p>Most work on external memory index structures has been done on construction algorithms, which have been extensively reviewed by Barsky
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="gks408-B71">71</xref>
). To summarize their results, external memory allows for larger sequences to be indexed, but the scalability of the algorithms is limited by the number of random accesses to
<italic>S</italic>
and the suffix tree under construction. This means that the practical performance of many construction algorithms is limited to sequences which are smaller than the size of the available main memory. As an exception, the B2ST algorithm (
<xref ref-type="bibr" rid="gks408-B119">119</xref>
) was able to index DNA sequences of 12 Gb in less than 8 h, making this algorithm the first to partially overcome the above-mentioned bottlenecks. Furthermore, the authors believe the algorithm will scale up to sequences of 60 Gb.</p>
</sec>
</sec>
<sec sec-type="conclusions">
<title>CONCLUSION</title>
<p>In this review, we have shown the importance of data structures for processing and searching in strings, known as index structures. Many current sequence analysis tools heavily rely upon index structures for handling large amounts of data, which is currently a major concern to bioinformaticians. In the first main section, details concerning the most commonly used index structures were presented. The details given in this review are often omitted in articles describing tools and applications. However, we believe that these details are important to fully grasp the possibilities and limitations of these sequence analysis tools.</p>
<p>We have made a basic classification of existing index structures and explained the memory-time trade-offs related to these data structures. Since the number of available index structures is vast, we were only able to skim over the technical details involved in the design of these data structures. However, the interested reader was guided to more in-depth work in the literature. Note that the index structures discussed in this review are mainly all-purpose full-text index structures, although some focused on exact pattern matching. There are, however, other index structures specially designed for specific applications, as discussed in the first section of this review.</p>
<p>Furthermore, both all-purpose full-text index structures and specialized index structures will always be hampered by space-time trade-offs. Several index structures allow tuning this trade-off by setting a sparsification parameter. This optimization of the available main memory is required because of the large difference in speed between internal and external memory. In some cases, the available main memory does not suffice and external memory index structures have to be used. Moreover, we saw that the performance of external memory index structures highly depends on the application for which the index structure is used. There is still a lot of work to be done on increasing the performance of disk-based index structures.</p>
<p>Construction of index structures in external memory has seen more investigation and clearly shows that the use of current index structures is limited to sequences that fit in main memory. Main memory construction algorithms are limited by the available work space for which the demand is several times higher than the memory required for the final index structure.</p>
<p>In the future, algorithms and data structures will have to be improved further to keep up with the rapidly evolving sequencing technology and the growing amount of data in general. To tackle the bottlenecks related to index structures mentioned here, new directions for their design have to be investigated (
<xref ref-type="bibr" rid="gks408-B2">2</xref>
). As a final note, we give some prospects for research on index structures for bioinformatics applications. Currently, the biggest issue in index structure research is closing the gap between theory and practice, which is illustrated by the fact that many theoretically superior index structures do not outperform simpler designs in practice. More engineering work has to be done to improve the practical performance of these index structures. These implementations should be grouped under a common interface in libraries and benchmarked using different types of (biological) sequences. One such library project is the ‘Pizza&amp;Chili website’ [two mirrors at
<ext-link ext-link-type="uri" xlink:href="http://pizzachili.di.unipi.it">http://pizzachili.di.unipi.it</ext-link>
and
<ext-link ext-link-type="uri" xlink:href="http://pizzachili.dcc.uchile.cl">http://pizzachili.dcc.uchile.cl</ext-link>
], which bundles full-text compressed index structures for use in exact pattern matching. Another library containing several index structures, but also focusing on biological applications, is the SeqAn library (
<xref ref-type="bibr" rid="gks408-B139">139</xref>
).</p>
<p>Another significant topic for further research is the adaptation of index structures to modern hardware, such as multi-core CPUs (
<xref ref-type="bibr" rid="gks408-B140">140</xref>
,
<xref ref-type="bibr" rid="gks408-B141">141</xref>
) and solid-state disks. Recently, even more specialized hardware has been considered, including Graphics Processing Units (GPUs) (
<xref ref-type="bibr" rid="gks408-B11">11</xref>
) and FPGAs (
<xref ref-type="bibr" rid="gks408-B142">142</xref>
). Alternatively, large computer clusters, local or on the cloud, could allow for massive parallelization of index structures. Some applications have already been ported to these new platforms, including read mapping and SNP finding (
<xref ref-type="bibr" rid="gks408-B143">143</xref>
) using cloud computing, sequence alignment (
<xref ref-type="bibr" rid="gks408-B11">11</xref>
) on GPUs and suffix array construction (
<xref ref-type="bibr" rid="gks408-B137">137</xref>
) using Google's MapReduce (
<xref ref-type="bibr" rid="gks408-B18">18</xref>
). However, these techniques and implementations are very novel and further research will have to indicate their scope and potential.</p>
<p>For applications which require maintenance of the index structure, such as sequence databases or updating an existing index of the human genome, dynamic index structures are required. Historically, this has been challenging due to the intrinsic interrelationship of suffixes, where insertion of a single character in a string can change the lexicographical order of many suffixes. However, some index structures that allow addition and removal of whole strings (
<xref ref-type="bibr" rid="gks408-B61">61</xref>
,
<xref ref-type="bibr" rid="gks408-B107">107</xref>
) and single characters (
<xref ref-type="bibr" rid="gks408-B144">144</xref>
,
<xref ref-type="bibr" rid="gks408-B145">145</xref>
) can be found in the literature. Moreover, several index structures were recently proposed for processing a set of very similar strings (
<xref ref-type="bibr" rid="gks408-B146">146</xref>
,
<xref ref-type="bibr" rid="gks408-B147">147</xref>
), where the size of the index structure only depends on a single reference sequence in the collection, rather than the combined size of all sequences in it.</p>
<p>Given these developments, index structures will continue to increase the performance of bioinformatics applications while coping with the continuous growth in sequence sizes.</p>
</sec>
<sec>
<title>FUNDING</title>
<p>The work of M.V. was supported by
<funding-source>the Agency for Innovation by Science and Technology</funding-source>
,
<funding-source>Flemish Government, Belgium</funding-source>
[
<award-id>SB-101609</award-id>
]. All authors acknowledge the support of
<funding-source>Ghent University</funding-source>
(Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks”). Funding for open access charge:
<funding-source>Agency for Innovation by Science and Technology, Flemish Government, Belgium</funding-source>
.</p>
<p>
<italic>Conflict of interest statement</italic>
. None declared.</p>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>The authors wish to thank Martijn Devisscher, Ken Heyndrickx and Joachim De Schrijver whose feedback has been invaluable in opening up this article for the largest possible audience. The authors would also like to acknowledge the members of the Nucleotides to Networks next-generation sequencing discussion group, in particular Yao-Cheng Lin and Lieven Sterck, for their helpful comments in improving the readability of the article. In addition, the authors would like to thank the editor and the anonymous reviewers for their valuable comments and suggestions to improve the quality of the manuscript. Due to space constraints, it was impossible to cite all relevant publications in this review; our sincere apologies and appreciation to all colleagues whose important work is not cited.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="gks408-B1">
<label>1</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Felsenstein</surname>
<given-names>J</given-names>
</name>
</person-group>
<source>Inferring Phylogenies</source>
<year>2004</year>
<publisher-loc>Sunderland, Mass</publisher-loc>
<publisher-name>Sinauer Associates</publisher-name>
</element-citation>
</ref>
<ref id="gks408-B2">
<label>2</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Data structures: time, I/Os, entropy, joules!</article-title>
<source>Proceedings of the 18th Annual European Symposium on Algorithms</source>
<year>2010</year>
<publisher-loc>Liverpool, UK</publisher-loc>
<fpage>1</fpage>
<lpage>16</lpage>
</element-citation>
</ref>
<ref id="gks408-B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J. Mol. Biol.</source>
<year>1990</year>
<volume>215</volume>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="gks408-B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Flicek</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Sense from sequence reads: methods for alignment and assembly</article-title>
<source>Nat. Meth.</source>
<year>2009</year>
<volume>6</volume>
<fpage>S6</fpage>
<lpage>S12</lpage>
</element-citation>
</ref>
<ref id="gks408-B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hoffmann</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Otto</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Khaitovich</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Vogel</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Stadler</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hackermüller</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Fast mapping of short sequences with mismatches, insertions and deletions using index structures</article-title>
<source>PLoS Comput. Biol.</source>
<year>2009</year>
<volume>5</volume>
<fpage>e1000502</fpage>
<pub-id pub-id-type="pmid">19750212</pub-id>
</element-citation>
</ref>
<ref id="gks408-B6">
<label>6</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Lam</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Tam</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>High throughput short read alignment via bi-directional BWT</article-title>
<source>2009 IEEE International Conference on Bioinformatics and Biomedicine</source>
<year>2009</year>
<publisher-loc>Washington, DC, USA</publisher-loc>
<fpage>31</fpage>
<lpage>36</lpage>
</element-citation>
</ref>
<ref id="gks408-B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</article-title>
<source>Genome Biol.</source>
<year>2009</year>
<volume>10</volume>
<fpage>R25</fpage>
<pub-id pub-id-type="pmid">19261174</pub-id>
</element-citation>
</ref>
<ref id="gks408-B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1754</fpage>
<lpage>1760</lpage>
<pub-id pub-id-type="pmid">19451168</pub-id>
</element-citation>
</ref>
<ref id="gks408-B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Lam</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kristiansen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>SOAP2: an improved ultrafast tool for short read alignment</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1966</fpage>
<lpage>1967</lpage>
<pub-id pub-id-type="pmid">19497933</pub-id>
</element-citation>
</ref>
<ref id="gks408-B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Smoot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shumway</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonescu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Versatile and open software for comparing large genomes</article-title>
<source>Genome Biol.</source>
<year>2004</year>
<volume>5</volume>
<fpage>R12</fpage>
<pub-id pub-id-type="pmid">14759262</pub-id>
</element-citation>
</ref>
<ref id="gks408-B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schatz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Delcher</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Varshney</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>High-throughput sequence alignment using graphics processing units</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>474</fpage>
<pub-id pub-id-type="pmid">18070356</pub-id>
</element-citation>
</ref>
<ref id="gks408-B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Askitis</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>RepMaestro: scalable repeat detection on disk-based genome sequences</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>2368</fpage>
<lpage>2374</lpage>
<pub-id pub-id-type="pmid">20663848</pub-id>
</element-citation>
</ref>
<ref id="gks408-B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schröder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Puglisi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>SHREC: a short-read error correction method</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>2157</fpage>
<lpage>2163</lpage>
<pub-id pub-id-type="pmid">19542152</pub-id>
</element-citation>
</ref>
<ref id="gks408-B14">
<label>14</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhan</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>PSAEC: an improved algorithm for short read error correction using partial suffix arrays</article-title>
<source>Proceedings of the Joint International Conference Frontiers in Algorithmics and Algorithmic Aspects in Information and Management</source>
<year>2011</year>
<publisher-loc>Jinhua, China</publisher-loc>
<fpage>220</fpage>
<lpage>232</lpage>
</element-citation>
</ref>
<ref id="gks408-B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Conway</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Bromage</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Succinct data structures for assembling large genomes</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>479</fpage>
<lpage>486</lpage>
<pub-id pub-id-type="pmid">21245053</pub-id>
</element-citation>
</ref>
<ref id="gks408-B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hernandez</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Francois</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Farinelli</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Osteras</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schrenzel</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>802</fpage>
<lpage>809</lpage>
<pub-id pub-id-type="pmid">18332092</pub-id>
</element-citation>
</ref>
<ref id="gks408-B17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simpson</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Efficient construction of an assembly string graph using the FM-index</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>i367</fpage>
<lpage>i373</lpage>
<pub-id pub-id-type="pmid">20529929</pub-id>
</element-citation>
</ref>
<ref id="gks408-B18">
<label>18</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Dean</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ghemawat</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>MapReduce: simplified data processing on large clusters</article-title>
<source>Proceedings of the 6th Symposium on Operating System Design and Implementation</source>
<year>2004</year>
<publisher-loc>San Francisco, California, USA</publisher-loc>
<fpage>137</fpage>
<lpage>150</lpage>
</element-citation>
</ref>
<ref id="gks408-B19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abouelhoda</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Replacing suffix trees with enhanced suffix arrays</article-title>
<source>J. Discrete Algor.</source>
<year>2004</year>
<volume>2</volume>
<fpage>53</fpage>
<lpage>86</lpage>
</element-citation>
</ref>
<ref id="gks408-B20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meyer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Backofen</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Will</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Beckstette</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Structator: fast index-based search for RNA sequence-structure patterns</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>214</fpage>
<pub-id pub-id-type="pmid">21619640</pub-id>
</element-citation>
</ref>
<ref id="gks408-B21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Iliopoulos</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Makris</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Panagis</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Perdikuri</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Theodoridis</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Tsakalidis</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>The weighted suffix tree: an efficient data structure for handling molecular weighted sequences and its applications</article-title>
<source>Fund. Infor.</source>
<year>2006</year>
<volume>71</volume>
<fpage>259</fpage>
<lpage>277</lpage>
</element-citation>
</ref>
<ref id="gks408-B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shibuya</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Geometric suffix tree: Indexing protein 3-D structures</article-title>
<source>J. ACM</source>
<year>2010</year>
<volume>57</volume>
<fpage>15</fpage>
</element-citation>
</ref>
<ref id="gks408-B23">
<label>23</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Hon</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Patil</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Thankachan</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Compressed property suffix trees</article-title>
<source>Proceedings of the 2011 Data Compression Conference</source>
<year>2011</year>
<publisher-loc>Snowbird, Utah, USA</publisher-loc>
<fpage>123</fpage>
<lpage>132</lpage>
</element-citation>
</ref>
<ref id="gks408-B24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jacobs</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>The pathologies of big data</article-title>
<source>Commun. ACM</source>
<year>2009</year>
<volume>52</volume>
<fpage>36</fpage>
<lpage>44</lpage>
</element-citation>
</ref>
<ref id="gks408-B25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>External memory algorithms and data structures: dealing with massive data</article-title>
<source>ACM Comput. Surv.</source>
<year>2001</year>
<volume>33</volume>
<fpage>209</fpage>
<lpage>271</lpage>
</element-citation>
</ref>
<ref id="gks408-B26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blumer</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Blumer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>McConnell</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Ehrenfeucht</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Complete inverted files for efficient text retrieval and analysis</article-title>
<source>J. ACM</source>
<year>1987</year>
<volume>34</volume>
<fpage>578</fpage>
<lpage>595</lpage>
</element-citation>
</ref>
<ref id="gks408-B27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bayer</surname>
<given-names>R</given-names>
</name>
<name>
<surname>McCreight</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Organization and maintenance of large ordered indexes</article-title>
<source>Acta Infor.</source>
<year>1972</year>
<volume>1</volume>
<fpage>173</fpage>
<lpage>189</lpage>
</element-citation>
</ref>
<ref id="gks408-B28">
<label>28</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Weiner</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Linear pattern matching algorithms</article-title>
<source>Proceedings of the 14th IEEE Symposium on Switching and Automata Theory</source>
<year>1973</year>
<publisher-loc>Iowa City, Iowa, USA</publisher-loc>
<fpage>1</fpage>
<lpage>11</lpage>
</element-citation>
</ref>
<ref id="gks408-B29">
<label>29</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Gusfield</surname>
<given-names>D</given-names>
</name>
</person-group>
<source>Algorithms on Strings, Trees, and Sequences</source>
<year>1997</year>
<edition>11th edn</edition>
<publisher-loc>Cambridge, UK and New York, USA</publisher-loc>
<publisher-name>Cambridge University Press</publisher-name>
</element-citation>
</ref>
<ref id="gks408-B30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morrison</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>PATRICIA: practical algorithm to retrieve information coded in alphanumeric</article-title>
<source>J. ACM</source>
<year>1968</year>
<volume>15</volume>
<fpage>514</fpage>
<lpage>534</lpage>
</element-citation>
</ref>
<ref id="gks408-B31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boyer</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>A fast string searching algorithm</article-title>
<source>Commun. ACM</source>
<year>1977</year>
<volume>20</volume>
<fpage>762</fpage>
<lpage>772</lpage>
</element-citation>
</ref>
<ref id="gks408-B32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Knuth</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Morris</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Pratt</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Fast pattern matching in strings</article-title>
<source>SIAM J. Comput.</source>
<year>1977</year>
<volume>6</volume>
<fpage>323</fpage>
<lpage>350</lpage>
</element-citation>
</ref>
<ref id="gks408-B33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Giegerich</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Stoye</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Efficient implementation of lazy suffix trees</article-title>
<source>Softw. Pract. Exp.</source>
<year>2003</year>
<volume>33</volume>
<fpage>1035</fpage>
<lpage>1049</lpage>
</element-citation>
</ref>
<ref id="gks408-B34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McCreight</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>A space-economical suffix tree construction algorithm</article-title>
<source>J. ACM</source>
<year>1976</year>
<volume>23</volume>
<fpage>262</fpage>
<lpage>272</lpage>
</element-citation>
</ref>
<ref id="gks408-B35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Manber</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Suffix arrays: a new method for on-line string searches</article-title>
<source>SIAM J. Comput.</source>
<year>1993</year>
<volume>22</volume>
<fpage>935</fpage>
<lpage>948</lpage>
</element-citation>
</ref>
<ref id="gks408-B36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>A quick tour on suffix arrays and compressed suffix arrays</article-title>
<source>Theor. Comput. Sci.</source>
<year>2011</year>
<volume>412</volume>
<fpage>2964</fpage>
<lpage>2973</lpage>
</element-citation>
</ref>
<ref id="gks408-B37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Gog</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem</article-title>
<source>Inf. Process. Lett.</source>
<year>2010</year>
<volume>110</volume>
<fpage>123</fpage>
<lpage>128</lpage>
</element-citation>
</ref>
<ref id="gks408-B38">
<label>38</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kasai</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Arimura</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Arikawa</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Linear-time longest-common-prefix computation in suffix arrays and its applications</article-title>
<source>Proceedings of the 12th Symposium on Combinatorial Pattern Matching</source>
<year>2001</year>
<publisher-loc>Jerusalem, Israel</publisher-loc>
<fpage>181</fpage>
<lpage>192</lpage>
</element-citation>
</ref>
<ref id="gks408-B39">
<label>39</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Grimsmo</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>On performance and cache effects in substring indexes</article-title>
<year>2007</year>
<comment>
<italic>Report IDI-TR-2007-04</italic>
, Norwegian University of Science and Technology, Norway</comment>
</element-citation>
</ref>
<ref id="gks408-B40">
<label>40</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Fischer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Heun</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>A new succinct representation of RMQ-information and improvements in the enhanced suffix array</article-title>
<source>Proceedings of the 1st Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies</source>
<year>2007</year>
<publisher-loc>Hangzhou, China</publisher-loc>
<fpage>459</fpage>
<lpage>470</lpage>
</element-citation>
</ref>
<ref id="gks408-B41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Linearized suffix tree: an efficient index data structure with the capabilities of suffix trees and suffix arrays</article-title>
<source>Algorithmica</source>
<year>2008</year>
<volume>52</volume>
<fpage>350</fpage>
<lpage>377</lpage>
</element-citation>
</ref>
<ref id="gks408-B42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Compressed full-text indexes</article-title>
<source>ACM Comput. Surv.</source>
<year>2007</year>
<volume>39</volume>
<fpage>2:1</fpage>
<lpage>2:61</lpage>
</element-citation>
</ref>
<ref id="gks408-B43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Compressed suffix arrays and suffix trees with applications to text indexing and string matching</article-title>
<source>SIAM J. Comput.</source>
<year>2005</year>
<volume>35</volume>
<fpage>378</fpage>
<lpage>407</lpage>
</element-citation>
</ref>
<ref id="gks408-B44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Compressed suffix trees with full functionality</article-title>
<source>Theor. Comput. Syst.</source>
<year>2007</year>
<volume>41</volume>
<fpage>589</fpage>
<lpage>607</lpage>
</element-citation>
</ref>
<ref id="gks408-B45">
<label>45</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Opportunistic data structures with applications</article-title>
<source>Proceedings of the 41st IEEE Symposium on Foundations of Computer Science</source>
<year>2000</year>
<publisher-loc>Redondo Beach, California, USA</publisher-loc>
<fpage>390</fpage>
<lpage>398</lpage>
</element-citation>
</ref>
<ref id="gks408-B46">
<label>46</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Burrows</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Wheeler</surname>
<given-names>D</given-names>
</name>
</person-group>
<source>A block-sorting lossless data compression algorithm</source>
<year>1994</year>
<publisher-name>Technical Report 124. DEC SRC</publisher-name>
</element-citation>
</ref>
<ref id="gks408-B47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Succinct suffix arrays based on run-length encoding</article-title>
<source>Nordic J. Comput.</source>
<year>2005</year>
<volume>12</volume>
<fpage>40</fpage>
<lpage>66</lpage>
</element-citation>
</ref>
<ref id="gks408-B48">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Compressed representations of sequences and full-text indexes</article-title>
<source>ACM Trans. Algor.</source>
<year>2007</year>
<volume>3</volume>
<fpage>20</fpage>
</element-citation>
</ref>
<ref id="gks408-B49">
<label>49</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>An experimental study of an opportunistic index</article-title>
<source>Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms</source>
<year>2000</year>
<publisher-loc>Washington, DC, USA</publisher-loc>
<fpage>269</fpage>
<lpage>278</lpage>
</element-citation>
</ref>
<ref id="gks408-B50">
<label>50</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grabowski</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Przywarski</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Salinger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>A simple alphabet-independent FM-index</article-title>
<source>Int. J. Found. Comput. Sci.</source>
<year>2006</year>
<volume>17</volume>
<fpage>1365</fpage>
<lpage>1384</lpage>
</element-citation>
</ref>
<ref id="gks408-B51">
<label>51</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Adjeroh</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bell</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Mukherjee</surname>
<given-names>A</given-names>
</name>
</person-group>
<source>The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching</source>
<year>2008</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="gks408-B52">
<label>52</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hon</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Sung</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Breaking a time-and-space barrier in constructing full-text indices</article-title>
<source>SIAM J. Comput.</source>
<year>2009</year>
<volume>38</volume>
<fpage>2162</fpage>
<lpage>2178</lpage>
</element-citation>
</ref>
<ref id="gks408-B53">
<label>53</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arlazarov</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Dinic</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Kronrod</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Faradzev</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>On economic construction of the transitive closure of a directed graph</article-title>
<source>Dokl. Akad. Nauk SSSR</source>
<year>1970</year>
<volume>194</volume>
<fpage>487</fpage>
<lpage>488</lpage>
</element-citation>
</ref>
<ref id="gks408-B54">
<label>54</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>González</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Venturini</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Compressed text indexes: From theory to practice</article-title>
<source>ACM J. Exp. Algor.</source>
<year>2009</year>
<volume>13</volume>
<fpage>1.12</fpage>
<lpage>1.31</lpage>
</element-citation>
</ref>
<ref id="gks408-B55">
<label>55</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Indexing compressed text</article-title>
<source>J. ACM</source>
<year>2005</year>
<volume>52</volume>
<fpage>552</fpage>
<lpage>581</lpage>
</element-citation>
</ref>
<ref id="gks408-B56">
<label>56</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Russo</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Oliveira</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Morales</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Approximate string matching with compressed indexes</article-title>
<source>Algorithms</source>
<year>2009</year>
<volume>2</volume>
<fpage>1105</fpage>
<lpage>1136</lpage>
</element-citation>
</ref>
<ref id="gks408-B57">
<label>57</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arroyuelo</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Practical approaches to reduce the space requirement of Lempel-Ziv based compressed text indices</article-title>
<source>ACM J. Exp. Algor.</source>
<year>2010</year>
<volume>15</volume>
<fpage>1.5:1.1</fpage>
<lpage>1.5:1.54</lpage>
</element-citation>
</ref>
<ref id="gks408-B58">
<label>58</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Välimäki</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Gerlach</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Dixit</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Engineering a compressed suffix tree implementation</article-title>
<source>ACM J. Exp. Algor.</source>
<year>2009</year>
<volume>14</volume>
<fpage>4.2</fpage>
<lpage>4.23</lpage>
</element-citation>
</ref>
<ref id="gks408-B59">
<label>59</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gupta</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>High-order entropy-compressed text indexes</article-title>
<source>Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms</source>
<year>2003</year>
<publisher-loc>Baltimore, Maryland, USA</publisher-loc>
<fpage>841</fpage>
<lpage>850</lpage>
</element-citation>
</ref>
<ref id="gks408-B60">
<label>60</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Foschini</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gupta</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>When indexing equals compression: experiments with compressing suffix arrays and applications</article-title>
<source>ACM Trans. Algor.</source>
<year>2006</year>
<volume>2</volume>
<fpage>611</fpage>
<lpage>639</lpage>
</element-citation>
</ref>
<ref id="gks408-B61">
<label>61</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Russo</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Oliveira</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Fully-compressed suffix trees</article-title>
<source>ACM Trans. Algor.</source>
<year>2011</year>
<volume>7</volume>
<fpage>53</fpage>
</element-citation>
</ref>
<ref id="gks408-B62">
<label>62</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Cánovas</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Practical compressed suffix trees</article-title>
<source>Proceedings of the 9th Symposium on Experimental Algorithms</source>
<year>2010</year>
<publisher-loc>Ischia Island, Naples, Italy</publisher-loc>
<fpage>94</fpage>
<lpage>105</lpage>
</element-citation>
</ref>
<ref id="gks408-B63">
<label>63</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>New text indexing functionalities of the compressed suffix arrays</article-title>
<source>J. Algor.</source>
<year>2003</year>
<volume>48</volume>
<fpage>294</fpage>
<lpage>313</lpage>
</element-citation>
</ref>
<ref id="gks408-B64">
<label>64</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Shibuya</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Indexing huge genome sequences for solving various problems</article-title>
<source>Genome Inform.</source>
<year>2001</year>
<volume>12</volume>
<fpage>175</fpage>
<lpage>183</lpage>
<pub-id pub-id-type="pmid">11791236</pub-id>
</element-citation>
</ref>
<ref id="gks408-B65">
<label>65</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Indexing text using the Ziv-Lempel trie</article-title>
<source>J. Discrete Algor.</source>
<year>2004</year>
<volume>2</volume>
<fpage>87</fpage>
<lpage>114</lpage>
</element-citation>
</ref>
<ref id="gks408-B66">
<label>66</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Reducing the space requirement of suffix trees</article-title>
<source>Softw. Pract. Exp.</source>
<year>1999</year>
<volume>29</volume>
<fpage>1149</fpage>
<lpage>1171</lpage>
</element-citation>
</ref>
<ref id="gks408-B67">
<label>67</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ukkonen</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Sparse suffix trees</article-title>
<source>Proceedings of the 2nd Conference on Computing and Combinatorics</source>
<year>1996</year>
<publisher-loc>Hong Kong, China</publisher-loc>
<fpage>219</fpage>
<lpage>230</lpage>
</element-citation>
</ref>
<ref id="gks408-B68">
<label>68</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Suffix arrays on words</article-title>
<source>Proceedings of the 18th conference on Combinatorial Pattern Matching</source>
<year>2007</year>
<publisher-loc>London, Ontario, Canada</publisher-loc>
<fpage>328</fpage>
<lpage>339</lpage>
</element-citation>
</ref>
<ref id="gks408-B69">
<label>69</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Khan</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Bloom</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kruglyak</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1609</fpage>
<lpage>1616</lpage>
<pub-id pub-id-type="pmid">19389736</pub-id>
</element-citation>
</ref>
<ref id="gks408-B70">
<label>70</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Külekci</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hon</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>PSI-RA: a parallel sparse index for read alignment on genomes</article-title>
<source>Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine</source>
<year>2010</year>
<publisher-loc>Hong Kong, China</publisher-loc>
<fpage>663</fpage>
<lpage>668</lpage>
</element-citation>
</ref>
<ref id="gks408-B71">
<label>71</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Barsky</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stege</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Thomo</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>A survey of practical algorithms for suffix tree construction in external memory</article-title>
<source>Softw. Pract. Exp.</source>
<year>2010</year>
<volume>40</volume>
<fpage>965</fpage>
<lpage>988</lpage>
</element-citation>
</ref>
<ref id="gks408-B72">
<label>72</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Marín</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Distributed query processing using suffix arrays</article-title>
<source>Proceedings of the 10th conference on String Processing and Information Retrieval</source>
<year>2003</year>
<publisher-loc>Manaus, Brazil</publisher-loc>
<fpage>311</fpage>
<lpage>325</lpage>
</element-citation>
</ref>
<ref id="gks408-B73">
<label>73</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andersson</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Larsson</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Swanson</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Suffix trees on words</article-title>
<source>Algorithmica</source>
<year>1999</year>
<volume>23</volume>
<fpage>246</fpage>
<lpage>260</lpage>
</element-citation>
</ref>
<ref id="gks408-B74">
<label>74</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Inenaga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Takeda</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>On-line linear-time construction of word suffix trees</article-title>
<source>Proceedings of the 17th conference on Combinatorial Pattern Matching</source>
<year>2006</year>
<publisher-loc>Barcelona, Spain</publisher-loc>
<fpage>60</fpage>
<lpage>71</lpage>
</element-citation>
</ref>
<ref id="gks408-B75">
<label>75</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Transier</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Sanders</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Compressed inverted indexes for in-memory search engines</article-title>
<source>Proceedings of the 10th Workshop on Algorithm Engineering and Experimentation</source>
<year>2008</year>
<publisher-loc>San Francisco, California, USA</publisher-loc>
<fpage>3</fpage>
<lpage>12</lpage>
</element-citation>
</ref>
<ref id="gks408-B76">
<label>76</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Puglisi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Smyth</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Turpin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Inverted files versus suffix arrays for locating patterns in primary memory</article-title>
<source>Proceedings of the 13th conference on String Processing and Information Retrieval</source>
<year>2006</year>
<publisher-loc>Glasgow, UK</publisher-loc>
<fpage>122</fpage>
<lpage>133</lpage>
</element-citation>
</ref>
<ref id="gks408-B77">
<label>77</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>An analysis of the Burrows-Wheeler transform</article-title>
<source>J. ACM</source>
<year>2001</year>
<volume>48</volume>
<fpage>407</fpage>
<lpage>430</lpage>
</element-citation>
</ref>
<ref id="gks408-B78">
<label>78</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Raman</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Raman</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Rao</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Succinct indexable dictionaries with applications to encoding k-ary trees and multisets</article-title>
<source>Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms</source>
<year>2002</year>
<publisher-loc>San Francisco, California, USA</publisher-loc>
<fpage>233</fpage>
<lpage>242</lpage>
</element-citation>
</ref>
<ref id="gks408-B79">
<label>79</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Claude</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Practical Rank/Select queries over arbitrary sequences</article-title>
<source>Proceedings of the 15th annual Symposium on String Processing and Information Retrieval</source>
<year>2008</year>
<publisher-loc>Melbourne, Australia</publisher-loc>
<fpage>176</fpage>
<lpage>187</lpage>
</element-citation>
</ref>
<ref id="gks408-B80">
<label>80</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Okanohara</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Practical entropy-compressed rank/select dictionary</article-title>
<source>Proceedings of the 9th Workshop on Algorithm Engineering and Experiments</source>
<year>2007</year>
<publisher-loc>New Orleans, Louisiana, USA</publisher-loc>
<fpage>60</fpage>
<lpage>70</lpage>
</element-citation>
</ref>
<ref id="gks408-B81">
<label>81</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Külekci</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Time- and space-efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet trees</article-title>
<source>Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine</source>
<year>2010</year>
<publisher-loc>Hong Kong, China</publisher-loc>
<fpage>622</fpage>
<lpage>625</lpage>
</element-citation>
</ref>
<ref id="gks408-B82">
<label>82</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Jacobson</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Space-efficient static trees and graphs</article-title>
<source>Proceedings of the 30th IEEE Symposium on Foundations of Computer Science</source>
<year>1989</year>
<publisher-loc>Research Triangle Park, North Carolina, USA</publisher-loc>
<fpage>549</fpage>
<lpage>554</lpage>
</element-citation>
</ref>
<ref id="gks408-B83">
<label>83</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Arroyuelo</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Cánovas</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Succinct trees in practice</article-title>
<source>Proceedings of the 12th Workshop on Algorithm Engineering and Experiments</source>
<year>2010</year>
<publisher-loc>Austin, Texas, USA</publisher-loc>
<fpage>84</fpage>
<lpage>97</lpage>
</element-citation>
</ref>
<ref id="gks408-B84">
<label>84</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Archie</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Day</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Felsenstein</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Maddison</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Meacham</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Rohlf</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Swofford</surname>
<given-names>D</given-names>
</name>
</person-group>
<source>The Newick Tree Format</source>
<year>1986</year>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://evolution.genetics.washington.edu/phylip/newicktree.html">http://evolution.genetics.washington.edu/phylip/newicktree.html</ext-link>
(May 2012, date last accessed)</comment>
</element-citation>
</ref>
<ref id="gks408-B85">
<label>85</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Gog</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Advantages of shared data structures for sequences of balanced parentheses</article-title>
<source>Proceedings of the 2010 Data Compression Conference</source>
<year>2010</year>
<publisher-loc>Snowbird, Utah, USA</publisher-loc>
<fpage>406</fpage>
<lpage>415</lpage>
</element-citation>
</ref>
<ref id="gks408-B86">
<label>86</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Berkman</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Vishkin</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>Recursive star-tree parallel data structure</article-title>
<source>SIAM J. Comput.</source>
<year>1993</year>
<volume>22</volume>
<fpage>221</fpage>
<lpage>242</lpage>
</element-citation>
</ref>
<ref id="gks408-B87">
<label>87</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Compressed suffix arrays and suffix trees with applications to text indexing and string matching</article-title>
<source>Proceedings of the 32nd Annual ACM Symposium on Theory of Computing</source>
<year>2000</year>
<publisher-loc>Portland, Oregon, USA</publisher-loc>
<fpage>397</fpage>
<lpage>406</lpage>
</element-citation>
</ref>
<ref id="gks408-B88">
<label>88</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Compressed text databases with efficient query algorithms based on the compressed suffix array</article-title>
<source>Proceedings of the 11th International Symposium on Algorithms and Computation</source>
<year>2000</year>
<publisher-loc>Taipei, Taiwan</publisher-loc>
<fpage>410</fpage>
<lpage>421</lpage>
</element-citation>
</ref>
<ref id="gks408-B89">
<label>89</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Succinct representations of LCP information and improvements in the compressed suffix array</article-title>
<source>Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms</source>
<year>2002</year>
<publisher-loc>San Francisco, California, USA</publisher-loc>
<fpage>225</fpage>
<lpage>232</lpage>
</element-citation>
</ref>
<ref id="gks408-B90">
<label>90</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Compact suffix array - a space-efficient full-text index</article-title>
<source>Fund. Inform.</source>
<year>2003</year>
<volume>56</volume>
<fpage>191</fpage>
<lpage>210</lpage>
</element-citation>
</ref>
<ref id="gks408-B91">
<label>91</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Compressed compact suffix arrays</article-title>
<source>Proceedings of the 15th Symposium on Combinatorial Pattern Matching</source>
<year>2004</year>
<publisher-loc>Istanbul, Turkey</publisher-loc>
<fpage>420</fpage>
<lpage>433</lpage>
</element-citation>
</ref>
<ref id="gks408-B92">
<label>92</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>González</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Compressed text indexes with fast locate</article-title>
<source>Proceedings of the 18th Symposium on Combinatorial Pattern Matching</source>
<year>2007</year>
<publisher-loc>London, Ontario, Canada</publisher-loc>
<fpage>216</fpage>
<lpage>227</lpage>
</element-citation>
</ref>
<ref id="gks408-B93">
<label>93</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elias</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Universal codeword sets and representations of integers</article-title>
<source>IEEE Trans. Inf. Theory</source>
<year>1975</year>
<volume>21</volume>
<fpage>194</fpage>
<lpage>203</lpage>
</element-citation>
</ref>
<ref id="gks408-B94">
<label>94</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ukkonen</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Lempel-Ziv parsing and sublinear-size index structures for string matching</article-title>
<source>Proceedings of the 3rd South American Workshop on String Processing</source>
<year>1996</year>
<publisher-loc>Recife, Brazil</publisher-loc>
<fpage>141</fpage>
<lpage>155</lpage>
</element-citation>
</ref>
<ref id="gks408-B95">
<label>95</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lempel</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ziv</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>On the complexity of finite sequences</article-title>
<source>IEEE Trans. Inf. Theory</source>
<year>1976</year>
<volume>22</volume>
<fpage>75</fpage>
<lpage>81</lpage>
</element-citation>
</ref>
<ref id="gks408-B96">
<label>96</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ziv</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lempel</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Compression of individual sequences via variable length coding</article-title>
<source>IEEE Trans. Inf. Theory</source>
<year>1978</year>
<volume>24</volume>
<fpage>530</fpage>
<lpage>536</lpage>
</element-citation>
</ref>
<ref id="gks408-B97">
<label>97</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Implementing the LZ-index: theory versus practice</article-title>
<source>ACM J. Exp. Algor.</source>
<year>2009</year>
<volume>13</volume>
<fpage>1.2</fpage>
</element-citation>
</ref>
<ref id="gks408-B98">
<label>98</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arroyuelo</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Stronger Lempel-Ziv based compressed text indexing</article-title>
<source>Algorithmica</source>
<year>2012</year>
<volume>62</volume>
<fpage>54</fpage>
<lpage>101</lpage>
</element-citation>
</ref>
<ref id="gks408-B99">
<label>99</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Russo</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Oliveira</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>A compressed self-index using a Ziv-Lempel dictionary</article-title>
<source>Inform. Retrieval</source>
<year>2008</year>
<volume>11</volume>
<fpage>359</fpage>
<lpage>388</lpage>
</element-citation>
</ref>
<ref id="gks408-B100">
<label>100</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Gog</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kügel</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Computing matching statistics and maximal exact matches on compressed full-text indexes</article-title>
<source>Proceedings of the 17th Annual Symposium on String Processing and Information Retrieval</source>
<year>2010</year>
<publisher-loc>Los Cabos, Mexico</publisher-loc>
<fpage>347</fpage>
<lpage>358</lpage>
</element-citation>
</ref>
<ref id="gks408-B101">
<label>101</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Gog</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A compressed enhanced suffix array supporting fast string matching</article-title>
<source>Proceedings of the 16th Symposium on String Processing and Information Retrieval</source>
<year>2009</year>
<publisher-loc>Saariselkä, Finland</publisher-loc>
<fpage>51</fpage>
<lpage>62</lpage>
</element-citation>
</ref>
<ref id="gks408-B102">
<label>102</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ohlebusch</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gog</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>CST++</article-title>
<source>Proceedings of the 17th Symposium on String Processing and Information Retrieval</source>
<year>2010</year>
<publisher-loc>Los Cabos, Mexico</publisher-loc>
<fpage>322</fpage>
<lpage>333</lpage>
</element-citation>
</ref>
<ref id="gks408-B103">
<label>103</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fischer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Faster entropy-bounded compressed suffix trees</article-title>
<source>Theor. Comput. Sci.</source>
<year>2009</year>
<volume>410</volume>
<fpage>5354</fpage>
<lpage>5364</lpage>
</element-citation>
</ref>
<ref id="gks408-B104">
<label>104</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Hon</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Compression, indexing, and retrieval for massive string data</article-title>
<source>Proceedings of the 21st Symposium on Combinatorial Pattern Matching</source>
<year>2010</year>
<publisher-loc>New York, USA</publisher-loc>
<fpage>260</fpage>
<lpage>274</lpage>
</element-citation>
</ref>
<ref id="gks408-B105">
<label>105</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baeza-Yates</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Barbosa</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Ziviani</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Hierarchies of indices for text retrieval</article-title>
<source>J. Inf. Syst.</source>
<year>1996</year>
<volume>21</volume>
<fpage>497</fpage>
<lpage>514</lpage>
</element-citation>
</ref>
<ref id="gks408-B106">
<label>106</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Puglisi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Moffat</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Turpin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Improving suffix array locality for fast pattern matching on disk</article-title>
<source>Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data</source>
<year>2008</year>
<publisher-loc>Vancouver, British Columbia, Canada</publisher-loc>
<fpage>661</fpage>
<lpage>672</lpage>
</element-citation>
</ref>
<ref id="gks408-B107">
<label>107</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>The string B-tree: a new data structure for string search in external memory and its applications</article-title>
<source>J. ACM</source>
<year>1999</year>
<volume>46</volume>
<fpage>236</fpage>
<lpage>280</lpage>
</element-citation>
</ref>
<ref id="gks408-B108">
<label>108</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Grossi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gupta</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>JS</given-names>
</name>
</person-group>
<article-title>On searching compressed string collections cache-obliviously</article-title>
<source>Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems</source>
<year>2008</year>
<publisher-loc>Vancouver, British Columbia, Canada</publisher-loc>
<fpage>181</fpage>
<lpage>190</lpage>
</element-citation>
</ref>
<ref id="gks408-B109">
<label>109</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hunt</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Atkinson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Irving</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Database indexing for large DNA and protein sequence collections</article-title>
<source>VLDB J.</source>
<year>2002</year>
<volume>11</volume>
<fpage>256</fpage>
<lpage>271</lpage>
</element-citation>
</ref>
<ref id="gks408-B110">
<label>110</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Bedathur</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Haritsa</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Engineering a fast online persistent suffix tree construction</article-title>
<source>Proceedings of the 20th International Conference on Data Engineering</source>
<year>2004</year>
<publisher-loc>Boston, Massachusetts, USA</publisher-loc>
<fpage>720</fpage>
<lpage>731</lpage>
</element-citation>
</ref>
<ref id="gks408-B111">
<label>111</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Clark</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Munro</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Efficient suffix trees on secondary storage (extended abstract)</article-title>
<source>Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms</source>
<year>1996</year>
<publisher-loc>Atlanta, Georgia, USA</publisher-loc>
<fpage>383</fpage>
<lpage>391</lpage>
</element-citation>
</ref>
<ref id="gks408-B112">
<label>112</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tian</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tata</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hankins</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Practical methods for constructing suffix trees</article-title>
<source>VLDB J.</source>
<year>2005</year>
<volume>14</volume>
<fpage>281</fpage>
<lpage>299</lpage>
</element-citation>
</ref>
<ref id="gks408-B113">
<label>113</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Phoophakdee</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zaki</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Genome-scale disk-based suffix tree indexing</article-title>
<source>Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data</source>
<year>2007</year>
<publisher-loc>Beijing, China</publisher-loc>
<fpage>833</fpage>
<lpage>844</lpage>
</element-citation>
</ref>
<ref id="gks408-B114">
<label>114</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ghoting</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Makarychev</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>I/O efficient algorithms for serial and parallel suffix tree construction</article-title>
<source>ACM Trans. Database Syst.</source>
<year>2010</year>
<volume>35</volume>
<fpage>25</fpage>
</element-citation>
</ref>
<ref id="gks408-B115">
<label>115</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clifford</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Distributed suffix trees</article-title>
<source>J. Discrete Algor.</source>
<year>2005</year>
<volume>3</volume>
<fpage>176</fpage>
<lpage>197</lpage>
</element-citation>
</ref>
<ref id="gks408-B116">
<label>116</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Bedathur</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Haritsa</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Search-optimized suffix-tree storage for biological applications</article-title>
<source>Proceedings of the 12th International Conference on High Performance Computing</source>
<year>2005</year>
<publisher-loc>Goa, India</publisher-loc>
<fpage>29</fpage>
<lpage>39</lpage>
</element-citation>
</ref>
<ref id="gks408-B117">
<label>117</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Brodal</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Fagerberg</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Cache-oblivious string dictionaries</article-title>
<source>Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms</source>
<year>2006</year>
<publisher-loc>Miami, Florida, USA</publisher-loc>
<fpage>581</fpage>
<lpage>590</lpage>
</element-citation>
</ref>
<ref id="gks408-B118">
<label>118</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Barsky</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stege</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Thomo</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Upton</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A new method for indexing genomes using on-disk suffix trees</article-title>
<source>Proceedings of the 17th ACM Conference on Information and Knowledge Management</source>
<year>2008</year>
<publisher-loc>Napa Valley, California, USA</publisher-loc>
<fpage>649</fpage>
<lpage>658</lpage>
</element-citation>
</ref>
<ref id="gks408-B119">
<label>119</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Barsky</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stege</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Thomo</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Upton</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Suffix trees for inputs larger than main memory</article-title>
<source>Inf. Syst.</source>
<year>2011</year>
<volume>36</volume>
<fpage>644</fpage>
<lpage>654</lpage>
</element-citation>
</ref>
<ref id="gks408-B120">
<label>120</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Halachev</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shiri</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Thamildurai</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Efficient and scalable indexing techniques for biological sequence data</article-title>
<source>Proceedings of the 1st International Conference on Bioinformatics Research and Development</source>
<year>2007</year>
<publisher-loc>Berlin, Germany</publisher-loc>
<fpage>464</fpage>
<lpage>479</lpage>
</element-citation>
</ref>
<ref id="gks408-B121">
<label>121</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sadakane</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Advantages of backward searching - efficient secondary memory and distributed implementation of compressed suffix arrays</article-title>
<source>Proceedings of the 15th International Symposium on Algorithms and Computation</source>
<year>2004</year>
<publisher-loc>Hong Kong, China</publisher-loc>
<fpage>681</fpage>
<lpage>692</lpage>
</element-citation>
</ref>
<ref id="gks408-B122">
<label>122</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>González</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>A compressed text index on secondary memory</article-title>
<source>J. Comb. Math. Comb. Comput.</source>
<year>2009</year>
<volume>71</volume>
<fpage>127</fpage>
<lpage>154</lpage>
</element-citation>
</ref>
<ref id="gks408-B123">
<label>123</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Arroyuelo</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>A Lempel-Ziv text index on secondary storage</article-title>
<source>Proceedings of the 18th Symposium on Combinatorial Pattern Matching</source>
<year>2007</year>
<publisher-loc>London, Ontario, Canada</publisher-loc>
<fpage>83</fpage>
<lpage>94</lpage>
</element-citation>
</ref>
<ref id="gks408-B124">
<label>124</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Russo</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Oliveira</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Parallel and distributed compressed indexes</article-title>
<source>Proceedings of the 21st Conference on Combinatorial Pattern Matching</source>
<year>2010</year>
<publisher-loc>New York, USA</publisher-loc>
<fpage>348</fpage>
<lpage>360</lpage>
</element-citation>
</ref>
<ref id="gks408-B125">
<label>125</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Chien</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hon</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Geometric Burrows-Wheeler transform: linking range searching and text indexing</article-title>
<source>Proceedings of the 2008 Data Compression Conference</source>
<year>2008</year>
<publisher-loc>Snowbird, Utah, USA</publisher-loc>
<fpage>252</fpage>
<lpage>261</lpage>
</element-citation>
</ref>
<ref id="gks408-B126">
<label>126</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ukkonen</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>On-line construction of suffix trees</article-title>
<source>Algorithmica</source>
<year>1995</year>
<volume>14</volume>
<fpage>249</fpage>
<lpage>260</lpage>
</element-citation>
</ref>
<ref id="gks408-B127">
<label>127</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maaß</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Computing suffix links for suffix trees and arrays</article-title>
<source>Inf. Process. Lett.</source>
<year>2007</year>
<volume>101</volume>
<fpage>250</fpage>
<lpage>254</lpage>
</element-citation>
</ref>
<ref id="gks408-B128">
<label>128</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Puglisi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Smyth</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Turpin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>A taxonomy of suffix array construction algorithms</article-title>
<source>ACM Comput. Surv.</source>
<year>2007</year>
<volume>39</volume>
<fpage>1</fpage>
<lpage>31</lpage>
</element-citation>
</ref>
<ref id="gks408-B129">
<label>129</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sanders</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Simple linear work suffix array construction</article-title>
<source>Proceedings of the 30th International Conference on Automata Languages and Programming</source>
<year>2003</year>
<publisher-loc>Eindhoven, The Netherlands</publisher-loc>
<fpage>943</fpage>
<lpage>955</lpage>
</element-citation>
</ref>
<ref id="gks408-B130">
<label>130</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sanders</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Burkhardt</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Linear work suffix array construction</article-title>
<source>J. ACM</source>
<year>2006</year>
<volume>53</volume>
<fpage>918</fpage>
<lpage>936</lpage>
</element-citation>
</ref>
<ref id="gks408-B131">
<label>131</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kulla</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Sanders</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Scalable parallel suffix array construction</article-title>
<source>Parallel Comput.</source>
<year>2007</year>
<volume>33</volume>
<fpage>605</fpage>
<lpage>612</lpage>
</element-citation>
</ref>
<ref id="gks408-B132">
<label>132</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gagie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Lightweight data indexing and compression in external memory</article-title>
<source>Proceedings of the 9th Latin American Symposium on Theoretical Informatics</source>
<year>2010</year>
<publisher-loc>Oaxaca, Mexico</publisher-loc>
<fpage>697</fpage>
<lpage>710</lpage>
</element-citation>
</ref>
<ref id="gks408-B133">
<label>133</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Manzini</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Ferragina</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Engineering a lightweight suffix array construction algorithm</article-title>
<source>Algorithmica</source>
<year>2004</year>
<volume>40</volume>
<fpage>33</fpage>
<lpage>50</lpage>
</element-citation>
</ref>
<ref id="gks408-B134">
<label>134</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nong</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Two efficient algorithms for linear time suffix array construction</article-title>
<source>IEEE Trans. Comput.</source>
<year>2011</year>
<volume>60</volume>
<fpage>1471</fpage>
<lpage>1484</lpage>
</element-citation>
</ref>
<ref id="gks408-B135">
<label>135</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Fast BWT in small space by blockwise suffix sorting</article-title>
<source>Theor. Comput. Sci.</source>
<year>2007</year>
<volume>387</volume>
<fpage>249</fpage>
<lpage>257</lpage>
</element-citation>
</ref>
<ref id="gks408-B136">
<label>136</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sirén</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Compressed suffix arrays for massive data</article-title>
<source>Proceedings of the 16th International Symposium on String Processing and Information Retrieval</source>
<year>2009</year>
<publisher-loc>Saariselkä, Finland</publisher-loc>
<fpage>63</fpage>
<lpage>74</lpage>
</element-citation>
</ref>
<ref id="gks408-B137">
<label>137</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Menon</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bhat</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Rapid parallel genome indexing with MapReduce</article-title>
<source>Proceedings of the Second International Workshop on MapReduce and its Applications</source>
<year>2011</year>
<publisher-loc>San Jose, California, USA</publisher-loc>
<fpage>51</fpage>
<lpage>58</lpage>
</element-citation>
</ref>
<ref id="gks408-B138">
<label>138</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arroyuelo</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Space-efficient construction of Lempel-Ziv compressed text indexes</article-title>
<source>Inf. Comput.</source>
<year>2011</year>
<volume>209</volume>
<fpage>1070</fpage>
<lpage>1102</lpage>
</element-citation>
</ref>
<ref id="gks408-B139">
<label>139</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Döring</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Weese</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Rausch</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>SeqAn an efficient, generic C++ library for sequence analysis</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>11</fpage>
<pub-id pub-id-type="pmid">18184432</pub-id>
</element-citation>
</ref>
<ref id="gks408-B140">
<label>140</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Loh</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Moon</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>A fast divide-and-conquer algorithm for indexing human genome sequences</article-title>
<source>IEICE Trans. Inf. Syst.</source>
<year>2011</year>
<volume>94</volume>
<fpage>1369</fpage>
<lpage>1377</lpage>
</element-citation>
</ref>
<ref id="gks408-B141">
<label>141</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Tsirogiannis</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Koudas</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Suffix tree construction algorithms on modern hardware</article-title>
<source>Proceedings of the 13th International Conference on Extending Database Technology</source>
<year>2010</year>
<publisher-loc>Lausanne, Switzerland</publisher-loc>
<fpage>263</fpage>
<lpage>274</lpage>
</element-citation>
</ref>
<ref id="gks408-B142">
<label>142</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Fernandez</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Najjar</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>String matching in hardware using the FM-Index</article-title>
<source>Proceedings of the 19th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines</source>
<year>2011</year>
<publisher-loc>Salt Lake City, Utah, USA</publisher-loc>
<fpage>218</fpage>
<lpage>225</lpage>
</element-citation>
</ref>
<ref id="gks408-B143">
<label>143</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Searching for SNPs with cloud computing</article-title>
<source>Genome Biol.</source>
<year>2009</year>
<volume>10</volume>
<fpage>R134</fpage>
<pub-id pub-id-type="pmid">19930550</pub-id>
</element-citation>
</ref>
<ref id="gks408-B144">
<label>144</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lecroq</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Léonard</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Mouchard</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>A four-stage algorithm for updating a Burrows-Wheeler transform</article-title>
<source>Theor. Comput. Sci.</source>
<year>2009</year>
<volume>410</volume>
<fpage>4350</fpage>
<lpage>4359</lpage>
</element-citation>
</ref>
<ref id="gks408-B145">
<label>145</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salson</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lecroq</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Léonard</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Mouchard</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Dynamic extended suffix arrays</article-title>
<source>J. Discrete Algor.</source>
<year>2010</year>
<volume>8</volume>
<fpage>241</fpage>
<lpage>257</lpage>
</element-citation>
</ref>
<ref id="gks408-B146">
<label>146</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Navarro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sirén</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Välimäki</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Storage and retrieval of highly repetitive sequence collections</article-title>
<source>J. Comput. Biol.</source>
<year>2010</year>
<volume>17</volume>
<fpage>281</fpage>
<lpage>308</lpage>
<pub-id pub-id-type="pmid">20377446</pub-id>
</element-citation>
</ref>
<ref id="gks408-B147">
<label>147</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kuruppu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Puglisi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zobel</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Optimized relative Lempel-Ziv compression of genomes</article-title>
<source>Proceedings of Australasian Computer Science Conference</source>
<year>2011</year>
<publisher-loc>Perth, Australia</publisher-loc>
<fpage>91</fpage>
<lpage>98</lpage>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Belgique/explor/OpenAccessBelV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000406  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000406  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Belgique
   |area=    OpenAccessBelV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.25.
Data generation: Thu Dec 1 00:43:49 2016. Site generation: Wed Mar 6 14:51:30 2024