Serveur d'exploration sur la télématique

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

TRStalker: an efficient heuristic for finding fuzzy tandem repeats

Identifieur interne : 000433 ( Pmc/Corpus ); précédent : 000432; suivant : 000434

TRStalker: an efficient heuristic for finding fuzzy tandem repeats

Auteurs : Marco Pellegrini ; M. Elena Renda ; Alessio Vecchio

Source :

RBID : PMC:2881393

Abstract

Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events.

Results: We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the ‘generalized median string’ that is known to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences.

Availability: TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at bioalgo.iit.cnr.it.

Contact: marco.pellegrini@iit.cnr.it

Supplementary information: Supplementary data are available at Bioinformatics online.


Url:
DOI: 10.1093/bioinformatics/btq209
PubMed: 20529928
PubMed Central: 2881393

Links to Exploration step

PMC:2881393

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">TRStalker: an efficient heuristic for finding fuzzy tandem repeats</title>
<author>
<name sortKey="Pellegrini, Marco" sort="Pellegrini, Marco" uniqKey="Pellegrini M" first="Marco" last="Pellegrini">Marco Pellegrini</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1"> CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Renda, M Elena" sort="Renda, M Elena" uniqKey="Renda M" first="M. Elena" last="Renda">M. Elena Renda</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1"> CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Vecchio, Alessio" sort="Vecchio, Alessio" uniqKey="Vecchio A" first="Alessio" last="Vecchio">Alessio Vecchio</name>
<affiliation>
<nlm:aff id="AFF1"> Univ. di Pisa, Dip. Ingegneria dell'Informazione, Via Diotisalvi 2, 56122 Pisa (Italy)</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">20529928</idno>
<idno type="pmc">2881393</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881393</idno>
<idno type="RBID">PMC:2881393</idno>
<idno type="doi">10.1093/bioinformatics/btq209</idno>
<date when="2010">2010</date>
<idno type="wicri:Area/Pmc/Corpus">000433</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000433</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">TRStalker: an efficient heuristic for finding fuzzy tandem repeats</title>
<author>
<name sortKey="Pellegrini, Marco" sort="Pellegrini, Marco" uniqKey="Pellegrini M" first="Marco" last="Pellegrini">Marco Pellegrini</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1"> CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Renda, M Elena" sort="Renda, M Elena" uniqKey="Renda M" first="M. Elena" last="Renda">M. Elena Renda</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1"> CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Vecchio, Alessio" sort="Vecchio, Alessio" uniqKey="Vecchio A" first="Alessio" last="Vecchio">Alessio Vecchio</name>
<affiliation>
<nlm:aff id="AFF1"> Univ. di Pisa, Dip. Ingegneria dell'Informazione, Via Diotisalvi 2, 56122 Pisa (Italy)</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>
<bold>Motivation:</bold>
Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of
<italic>trinucleotide repeat disorders</italic>
). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events.</p>
<p>
<bold>Results:</bold>
We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the ‘generalized median string’ that is known to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences.</p>
<p>
<bold>Availability:</bold>
TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at
<ext-link ext-link-type="uri" xlink:href="bioalgo.iit.cnr.it">bioalgo.iit.cnr.it</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>marco.pellegrini@iit.cnr.it</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btq209/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Ames, D" uniqKey="Ames D">D Ames</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benson, G" uniqKey="Benson G">G Benson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benson, G" uniqKey="Benson G">G Benson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boeva, V" uniqKey="Boeva V">V Boeva</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brodzik, Ak" uniqKey="Brodzik A">AK Brodzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buchner, M" uniqKey="Buchner M">M Buchner</name>
</author>
<author>
<name sortKey="Janjarasjitt, S" uniqKey="Janjarasjitt S">S Janjarasjitt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burkhardt, S" uniqKey="Burkhardt S">S Burkhardt</name>
</author>
<author>
<name sortKey="K Rkk Inen, J" uniqKey="K Rkk Inen J">J Kärkkäinen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burkhardt, S" uniqKey="Burkhardt S">S Burkhardt</name>
</author>
<author>
<name sortKey="K Rkk Inen, J" uniqKey="K Rkk Inen J">J Kärkkäinen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bussey, H" uniqKey="Bussey H">H Bussey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Campuzano, V" uniqKey="Campuzano V">V Campuzano</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De La Higuera, C" uniqKey="De La Higuera C">C de la Higuera</name>
</author>
<author>
<name sortKey="Casacuberta, F" uniqKey="Casacuberta F">F Casacuberta</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dujon, B" uniqKey="Dujon B">B Dujon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Elemento, O" uniqKey="Elemento O">O Elemento</name>
</author>
<author>
<name sortKey="Gascuel, O" uniqKey="Gascuel O">O Gascuel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fischetti, Va" uniqKey="Fischetti V">VA Fischetti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gelfand, Y" uniqKey="Gelfand Y">Y Gelfand</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Glusman, G" uniqKey="Glusman G">G Glusman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grissa, I" uniqKey="Grissa I">I Grissa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gupta, R" uniqKey="Gupta R">R Gupta</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gusfield, D" uniqKey="Gusfield D">D Gusfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gusfield, D" uniqKey="Gusfield D">D Gusfield</name>
</author>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hauth, Am" uniqKey="Hauth A">AM Hauth</name>
</author>
<author>
<name sortKey="Joseph, D" uniqKey="Joseph D">D Joseph</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jiang, X" uniqKey="Jiang X">X Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jurka, J" uniqKey="Jurka J">J Jurka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelkar, Y Dd" uniqKey="Kelkar Y">Y.DD Kelkar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kolpakov, R" uniqKey="Kolpakov R">R Kolpakov</name>
</author>
<author>
<name sortKey="Kucherov, G" uniqKey="Kucherov G">G Kucherov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kolpakov, R" uniqKey="Kolpakov R">R Kolpakov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kolpakov, Rm" uniqKey="Kolpakov R">RM Kolpakov</name>
</author>
<author>
<name sortKey="Kucherov, G" uniqKey="Kucherov G">G Kucherov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krishnan, A" uniqKey="Krishnan A">A Krishnan</name>
</author>
<author>
<name sortKey="Tang, F" uniqKey="Tang F">F Tang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Schleiermacher, C" uniqKey="Schleiermacher C">C Schleiermacher</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Landau, Gm" uniqKey="Landau G">GM Landau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leclercq, S" uniqKey="Leclercq S">S Leclercq</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Legendre, M" uniqKey="Legendre M">M Legendre</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Motwani, R" uniqKey="Motwani R">R Motwani</name>
</author>
<author>
<name sortKey="Raghavan, P" uniqKey="Raghavan P">P Raghavan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mudunuri, Sb" uniqKey="Mudunuri S">SB Mudunuri</name>
</author>
<author>
<name sortKey="Nagarajaram, Ha" uniqKey="Nagarajaram H">HA Nagarajaram</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mulmuley, K" uniqKey="Mulmuley K">K Mulmuley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="O Dushlaine, C" uniqKey="O Dushlaine C">C O'Dushlaine</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Parisi, V" uniqKey="Parisi V">V Parisi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peterlongo, P" uniqKey="Peterlongo P">P Peterlongo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rivals, E" uniqKey="Rivals E">E Rivals</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rivals, E" uniqKey="Rivals E">E Rivals</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rowen, L" uniqKey="Rowen L">L Rowen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saha, S" uniqKey="Saha S">S Saha</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sammeth, M" uniqKey="Sammeth M">M Sammeth</name>
</author>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sharma, D" uniqKey="Sharma D">D Sharma</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sim, Js" uniqKey="Sim J">JS Sim</name>
</author>
<author>
<name sortKey="Park, K" uniqKey="Park K">K Park</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smit, A F A" uniqKey="Smit A">A.F.A Smit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sokol, D" uniqKey="Sokol D">D Sokol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stolovitzky, G" uniqKey="Stolovitzky G">G Stolovitzky</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vissers, Le" uniqKey="Vissers L">LE Vissers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vogler, A" uniqKey="Vogler A">A Vogler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warburton, P" uniqKey="Warburton P">P Warburton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wells, Rd" uniqKey="Wells R">RD Wells</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wexler, Y" uniqKey="Wexler Y">Y Wexler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wexler, Y" uniqKey="Wexler Y">Y Wexler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wooster, R" uniqKey="Wooster R">R Wooster</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">20529928</article-id>
<article-id pub-id-type="pmc">2881393</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btq209</article-id>
<article-id pub-id-type="publisher-id">btq209</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Ismb 2010 Conference Proceedings July 11 to July 13, 2010, Boston, Ma, Usa</subject>
<subj-group>
<subject>Original Papers</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>TRStalker: an efficient heuristic for finding fuzzy tandem repeats</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Pellegrini</surname>
<given-names>Marco</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="COR1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Renda</surname>
<given-names>M. Elena</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Vecchio</surname>
<given-names>Alessio</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>2</sup>
</xref>
</contrib>
</contrib-group>
<aff id="AFF1">
<sup>1</sup>
CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa and
<sup>2</sup>
Univ. di Pisa, Dip. Ingegneria dell'Informazione, Via Diotisalvi 2, 56122 Pisa (Italy)</aff>
<author-notes>
<corresp id="COR1">* To whom correspondence should be addressed.</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>15</day>
<month>6</month>
<year>2010</year>
</pub-date>
<pub-date pub-type="epub">
<day>1</day>
<month>6</month>
<year>2010</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>1</day>
<month>6</month>
<year>2010</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>26</volume>
<issue>12</issue>
<fpage>i358</fpage>
<lpage>i366</lpage>
<permissions>
<copyright-statement>© The Author(s) 2010. Published by Oxford University Press.</copyright-statement>
<copyright-year>2010</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/2.0/uk/">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/2.5">http://creativecommons.org/licenses/by-nc/2.5</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>
<bold>Motivation:</bold>
Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of
<italic>trinucleotide repeat disorders</italic>
). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events.</p>
<p>
<bold>Results:</bold>
We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the ‘generalized median string’ that is known to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences.</p>
<p>
<bold>Availability:</bold>
TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at
<ext-link ext-link-type="uri" xlink:href="bioalgo.iit.cnr.it">bioalgo.iit.cnr.it</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>marco.pellegrini@iit.cnr.it</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btq209/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</abstract>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="SEC1">
<title>1 INTRODUCTION</title>
<p>Tandem Repeats (TRs) are multiple (two or more) duplications of substrings in the DNA that occur contiguously, and may involve some base mutations (such as substitutions, insertions and deletions). TRs of several forms (satellites, microsatellites, minisatellites and others) have been studied extensively because of their role in several biological processes. In fact, TRs are privileged targets in activities such as fingerprinting or tracing the evolution of populations (Kelkar
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B24">2008</xref>
; Vogler
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B51">2006</xref>
). Several diseases, disorders and addictive behaviors are linked to specific TR loci (Wooster
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B56">1994</xref>
). The role of TRs has been studied also within coding regions (O'Dushlaine
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B37">2005</xref>
) and in relation to gene functions (Legendre
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B33">2007</xref>
). Large scale comparative studies on TRs of the human genome are described in Ames
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B1">2008</xref>
) and Warburton
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B52">2008</xref>
). Data Bases of repetitive elements such as RepBase (Jurka
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B23">2005</xref>
) and Tandem Repeats Database (TRDB) (Gelfand
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B15">2007</xref>
) are now available; and the detection of repetitive elements via library-based similarity matching, for example by using the tool Repeatmasker (Smit
<italic>et al.</italic>
, 2004), is a popular practice. However, tools for
<italic>ab initio</italic>
detection of repetitive elements that are not based on prior knowledge accumulated in data bases are still important in order to extend our comprehension of the role of TRs in biological mechanisms. Existing
<italic>ab initio</italic>
tools are successful when the TR exhibits a moderate amount of divergence and when the TR is easily validated. However, there is an emerging need for new tools that are able to cope with higher levels of sequence divergence and/or TR computationally more difficult to validate. For example, Boeva
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B4">2006</xref>
) study so called
<italic>Fuzzy TRs</italic>
and their role in gene expression. The technique in Boeva
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B4">2006</xref>
) works well for the Hamming metric (only substitutions and no insertions/deletions allowed) and for short repeat units (from 3 to 24 bp) that are common in micro- and mini-satellite families.</p>
<p>Some of the most successful
<italic>ab initio</italic>
tools, such as TRF (Benson,
<xref ref-type="bibr" rid="B3">1999</xref>
) and ATRHunter (Wexler
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B55">2005</xref>
), are based on a multi-stage filtering approach [see also (Peterlongo
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B39">2009</xref>
)]. In the first stage the input sequence is analyzed to detect, via statistical criteria, likely position and length of candidate subsequences. The final stage is the validation one in which a more expensive test is applied to candidate substrings passing the first stages, so to determine an output that matches the implicit definition of TR and the user-defined filtering parameters.</p>
<sec id="SEC1.1">
<title>1.1 Our contribution</title>
<p>Our contribution is a novel multi-stage filtering algorithm, called
<italic>TRStalker</italic>
, for finding long fuzzy TRs under the edit distance, that introduces new techniques (w.r.t. previous TR finding algorithms) in all stages. For the first stage, where over-represented distances between probes are sought, we employ
<italic>gapped q-grams</italic>
(Burkhardt and Kärkkäinen,
<xref ref-type="bibr" rid="B8">2003</xref>
) in place of the standard
<italic>ungapped q-grams</italic>
in order to collect evidence on the candidate substrings. Gapped q-grams have been used before in the context of textual and biological database searching, but less so in the area of TRs detection [with the exception of the system TEIRESIAS (Stolovitzky
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B49">1999</xref>
)]. Because of errors due to insertion/deletions, the
<italic>period</italic>
of a TR is subject to fluctuations, thus we employ a weighting scheme with exponential decay so to reinforce the signal even in presence of this smearing effect. Finally, we use
<italic>ranking</italic>
instead of
<italic>thresholds</italic>
when deciding the substrings to pass to the next phases, in order to concentrate the computational effort on the zones with candidates with higher weight. For the final validation stage we employ an NP-complete definition of TR involving the concept of
<italic>generalized median string</italic>
under edit distance (de la Higuera and Casacuberta,
<xref ref-type="bibr" rid="B11">2000</xref>
; Sim and Park,
<xref ref-type="bibr" rid="B46">2003</xref>
), together with an efficient heuristic for computing an approximation of such median string (Jiang
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B22">2003</xref>
) previously not used in a biological context.</p>
<p>By extensive experimental comparisons of
<italic>TRStalker</italic>
with two state-of-the-art tools, namely TRF and ATRHunter, we did find out that TRStalker has consistently better performance for a large range of error and length parameters for the class of fuzzy TRs under edit distance, with a recall ranging from 100 to 60%. Thus TRStalker improves the capability of TR detection for classes of TRs for which existing methods do not perform well. Tests performed on standard evolutionary TRs definitions (verifiable in polynomial time) also show recall performance close to 100%. Incidentally, this result confirms of the power of the new techniques developed for the initial filtering phase.</p>
</sec>
<sec id="SEC1.2">
<title>1.2 State of the art</title>
<p>We will briefly survey the state of the art in finding tandem repeats. First we will describe methods that for a given definition of TR are able to find all maximal substrings in the input that match the definition (
<italic>exhaustive algorithms</italic>
). Often exhaustive algorithms may not be available, or when available they may be too slow in practice. Thus, several
<italic>heuristic algorithms</italic>
have been developed which are shown experimentally to be able to detect a large fraction of TRs efficiently. Note that the time/precision trade-off is severely influenced by the allowed error thresholds. Performance often degrades quickly with increasing error levels.</p>
<sec id="SEC1.2.1">
<title>1.2.1 Exhaustive algorithms</title>
<p>When we allow no error, it is possible to find all maximal exact TRs in a string of length
<italic>n</italic>
in time
<italic>O</italic>
(
<italic>n</italic>
) (Gusfield and Stoye,
<xref ref-type="bibr" rid="B20">2004</xref>
; Kolpakov and Kucherov,
<xref ref-type="bibr" rid="B27">1999</xref>
). When we allow two consecutive repeats to differ by an amount at most
<italic>k</italic>
(either in Hamming or in edit distance) Landau
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B31">2001</xref>
) give exhaustive algorithms running in time
<italic>O</italic>
(
<italic>nk</italic>
log(
<italic>n</italic>
/
<italic>k</italic>
)) for Hamming distance, and
<italic>O</italic>
(
<italic>nk</italic>
log
<italic>k</italic>
log(
<italic>n</italic>
/
<italic>k</italic>
)) for edit distance. A simpler algorithm with the same asymptotic complexity for the edit distance is proposed by Sokol
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B48">2007</xref>
). Kolpakov and Kucherov (
<xref ref-type="bibr" rid="B25">2003</xref>
) improved the bound for the Hamming distance to
<italic>O</italic>
(
<italic>nk</italic>
log
<italic>k</italic>
+
<italic>s</italic>
) where
<italic>s</italic>
is the number of TRs found. For the Hamming distance, Krishnan and Tang (
<xref ref-type="bibr" rid="B28">2004</xref>
) give an exhaustive method running sequentially in time
<italic>O</italic>
(
<italic>n</italic>
<sup>3</sup>
), that can be easily implemented onto a parallel architecture, since every possible pattern length is searched independently.</p>
</sec>
<sec id="SEC1.2.2">
<title>1.2.2 Heuristic algorithms</title>
<p>The algorithmic techniques in Kolpakov and Kucherov (
<xref ref-type="bibr" rid="B27">1999</xref>
,
<xref ref-type="bibr" rid="B25">2003</xref>
) have been extended in the tool
<monospace>mreps</monospace>
(Kolpakov
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B26">2003</xref>
) so to be able to handle approximate TRs (ATRs) under edit distance, with some additional heuristic filtering steps.</p>
<p>The tool TRF (Tandem Repeat Finder) developed by Benson (
<xref ref-type="bibr" rid="B2">1998</xref>
,
<xref ref-type="bibr" rid="B3">1999</xref>
), based on statistical filtering of zones of DNA likely to contain TRs, is currently one of the standard heuristic methods. ATRHunter by Wexler
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B54">2004</xref>
) is also based on a statistical filtering approach, placing greater emphasis in techniques for designing thresholds for the quantities of interest. Other proposed heuristics for finding TRs are REPuter (Kurtz and Schleiermacher,
<xref ref-type="bibr" rid="B29">1999</xref>
; Kurtz
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B30">2001</xref>
), STRING (Parisi
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B38">2003</xref>
), TEIRESIAS (Stolovitzky
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B49">1999</xref>
) and TandemSWAN (Boeva
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B4">2006</xref>
). A class of papers (see e.g. Brodzik,
<xref ref-type="bibr" rid="B5">2007</xref>
; Buchner and Janjarasjitt,
<xref ref-type="bibr" rid="B6">2003</xref>
; Gupta
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B18">2007</xref>
; Sharma
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B45">2004</xref>
) tackle the problem of finding TRs as a problem in signal processing theory and usually map the input string into a time-signal in a suitable numerical domain for which several spectral techniques can be used, such as the
<italic>Periodicity Transform</italic>
or the
<italic>Fourier Transform</italic>
. Other methods use data compression techniques to detect repetitive elements (Rivals
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B41">1997</xref>
).</p>
<p>The methods cited above are rather general since they aim at treating efficiently TRs in a wide range of length values. There is also a large class of methods that are aimed at handling particular or special classes of TRs such as: microsatellites [e.g. IMEx (Mudunuri and Nagarajaram,
<xref ref-type="bibr" rid="B35">2007</xref>
)], palindromic repeats [e.g. CRISPFinder (Grissa
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B17">2007</xref>
)], Variable Length TRs (VLTR) and Multi-period TRs (MPTR) (Hauth and Joseph,
<xref ref-type="bibr" rid="B21">2002</xref>
) and Variable Number TRs (VNTR) (Sammeth and Stoye,
<xref ref-type="bibr" rid="B44">2006</xref>
). Since the focus of our research on TRs at present is on the more classical forms of TRs, we do not dwell longer on them. However, we just note that often methods for MPTR, VNTR, VLTR use standard TR finding as a subroutine, thus our proposed algorithm can increase also the ability to detect such higher order structures.</p>
<p>Systematic comparison among TR finding tools and algorithms operating
<italic>ab initio</italic>
, that is without support of specific biological data bases has been tackled in recent years (Leclercq
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B32">2007</xref>
; Saha
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B43">2008</xref>
). A survey of problems on TRs in the context of evolutionary mechanisms, such as the construction of TR Evolutionary Trees, is proposed in Rivals (
<xref ref-type="bibr" rid="B40">2004</xref>
); see also Elemento and Gascuel (
<xref ref-type="bibr" rid="B13">2002</xref>
).</p>
</sec>
</sec>
<sec id="SEC1.3">
<title>1.3 Organization of the article</title>
<p>The article is organized as follows: in
<xref ref-type="sec" rid="SEC2">Section 2</xref>
we describe at a high level the principles guiding the different phases of the TRStalker algorithm.
<xref ref-type="sec" rid="SEC3">Section 3</xref>
gives a more technical description of key ingredients of TRStalker and discusses the formal definition of fuzzy TR employed.
<xref ref-type="sec" rid="SEC4">Section 4</xref>
describes the experiments devised to demonstrate the capacity of TRStalker in detecting fuzzy TRs, and a few interesting fuzzy TRs found in sequences of biological significance.</p>
</sec>
</sec>
<sec id="SEC2">
<title>2 APPROACH</title>
<p>An example: To focus on the main ideas, let us consider the very simple case of Exact TR. Consider an alphabet Σ = {
<italic>A</italic>
,
<italic>C</italic>
,
<italic>G</italic>
,
<italic>T</italic>
} of four symbols, and a string
<italic>X</italic>
=
<italic>x</italic>
<sub>1</sub>
<italic>x</italic>
<sub>2</sub>
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
formed by the concatenation of
<italic>t</italic>
strings
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
, embedded in a random string
<italic>Y</italic>
, where
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
=
<italic>x</italic>
<sub>1</sub>
for all
<italic>i</italic>
and |
<italic>x</italic>
<sub>1</sub>
| =
<italic>k</italic>
, thus all replicas of
<italic>x</italic>
<sub>1</sub>
are of the same length. An
<italic>ungapped q-gram</italic>
is a string of
<italic>q</italic>
symbols from Σ that appears as a consecutive sequence of
<italic>q</italic>
symbols in
<italic>Y</italic>
. We aim at discovering
<italic>k</italic>
just by looking at the distances between occurrences of homologous (i.e. identical)
<italic>q</italic>
-grams in
<italic>Y</italic>
. For q-grams in
<italic>X</italic>
, the period
<italic>k</italic>
will appear at least (
<italic>k</italic>
<italic>q</italic>
+ 1)(
<italic>t</italic>
− 1) times as the distance between homologous probes. More generally the distance
<italic>hk</italic>
, an integer multiple of
<italic>k</italic>
, will appear at least (
<italic>k</italic>
<italic>q</italic>
+ 1)(
<italic>t</italic>
<italic>h</italic>
) times for each value
<italic>h</italic>
= 1 …
<italic>t</italic>
− 1. A
<italic>gapped q-gram</italic>
is a sequence of
<italic>q</italic>
characters from Σ with additional ‘don't care’ symbols, also called ‘gaps’, that appears as a consecutive sequence in
<italic>Y</italic>
. For gapped
<italic>q</italic>
-grams similar formulae hold. For values of
<italic>k</italic>
and
<italic>t</italic>
large enough, the period
<italic>k</italic>
and its integer multiples will occur more frequently than the expected number of occurrences of any distance of homologous
<italic>q</italic>
-grams in a random string, thus the empirical number of occurrences of the value
<italic>k</italic>
and its multiples will tend to be in the higher part of a ranking by frequency. This observation holds true as long as the length of the super-string
<italic>Y</italic>
is sufficiently limited so that the frequencies generated by the random portion of
<italic>Y</italic>
do not overrun the frequencies generated by
<italic>X</italic>
. An exact characterization of such a distribution in terms of the parameters
<italic>k</italic>
,
<italic>t</italic>
,
<italic>q</italic>
and |
<italic>Y</italic>
| is complex since it can be characterized as the sum of
<italic>non-independent</italic>
random variables each with a negative binomial distribution. However we avoid the issue of characterizing exactly such a distribution by: (i) splitting the input string into blocks of predefined length and limiting the analysis to each block separately, providing mechanisms to deal with TRs stranded across the block boundaries; (ii) ranking the periods by
<italic>weighted frequency</italic>
and exploring only the top
<italic>L</italic>
positions (for
<italic>L</italic>
= 50 in our experiments). Note that in most cases the top ranking periods not corresponding to TRs will be discarded quickly when the positional density is considered, thus we can be very slack in choosing
<italic>L</italic>
without incurring in a computational burden. The choice of block length could be critical too, but experimental results showed that blocks of length within a factor of up to 40 of the length of the TR do work well. For long input, string occurrences of the same
<italic>q</italic>
-gram that are too distant are unlikely to be related to a TR, thus we limit the number of pairs of homologous q-grams considered. While scanning each block of the input
<italic>Y</italic>
we record for each occurrence of a gapped q-gram in
<italic>Y</italic>
its distance to the five preceding and the five following homologous occurrences (10 in total). The high-level pseudocode of TRStalker is shown in the
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btq209/DC1">Supplementary Materials</ext-link>
while we expose next the key algorithmic choices.</p>
<p>Gapped q-grams: The presence of substitutions/insertions/deletions in
<italic>X</italic>
has the effect that many instances of q-grams will be affected by error and a match will be missed, thus reducing the frequency counts for the period
<italic>k</italic>
. To cope with this effect, we use
<italic>gapped q-grams</italic>
(Burkhardt and Kärkkäinen,
<xref ref-type="bibr" rid="B7">2002</xref>
,
<xref ref-type="bibr" rid="B8">2003</xref>
) that are more resilient to the presence of substitutions/insertions/deletions. As suggested by experiments in Burkhardt and Kärkkäinen (
<xref ref-type="bibr" rid="B7">2002</xref>
), just few gaps are sufficient to be effective, thus we will use the family of all gapped q-grams with three alphabet symbols and at most two gaps.</p>
<p>Anti-smear weighting: If
<italic>q</italic>
<sub>1</sub>
and
<italic>q</italic>
<sub>2</sub>
are occurrences of homologous
<italic>q</italic>
-grams in
<italic>X</italic>
at distance
<italic>k</italic>
, before the implant of mutations, the effect of insertion and deletions on the positions of the string
<italic>X</italic>
between
<italic>q</italic>
<sub>1</sub>
and
<italic>q</italic>
<sub>2</sub>
is to alter their distance so that a different period
<italic>k</italic>
′ is detected. The difference
<italic>k</italic>
<italic>k</italic>
′ is equal to the algebraic sum of number of insertions and deletions in the positions between
<italic>q</italic>
<sub>1</sub>
and
<italic>q</italic>
<sub>2</sub>
. Assuming that any such position can be an insertion or a deletion independently with the same probability, the random variable
<italic>k</italic>
<italic>k</italic>
′ is distributed as a sum of independent random variables with values in {+1, −1, 0} with mean value 0, thus, by a Chernoff bound argument, its tail distribution decays exponentially (Motwani and Raghavan,
<xref ref-type="bibr" rid="B34">1995</xref>
; Mulmuley,
<xref ref-type="bibr" rid="B36">1993</xref>
). Also near-by probes in
<italic>X</italic>
have small variations in the value of the shift
<italic>k</italic>
<italic>k</italic>
′. Inspired by the above observation, we devise a weighting scheme that increments the total weight of period
<italic>k</italic>
if another period of value
<inline-formula>
<inline-graphic xlink:href="btq209i1.jpg"></inline-graphic>
</inline-formula>
is discovered in a near-by position, with weights that decay exponentially with
<inline-formula>
<inline-graphic xlink:href="btq209i2.jpg"></inline-graphic>
</inline-formula>
. The final weight
<italic>w</italic>
<sub>0</sub>
(
<italic>k</italic>
) for a given period
<italic>k</italic>
is the sum of the individual anti-smear weights computed above for probes at distance
<italic>k</italic>
.</p>
<p>Multiplicity weighting: Let
<italic>w</italic>
<sub>0</sub>
(
<italic>k</italic>
) be the weight of the period
<italic>k</italic>
as assigned by the anti-smear weighting procedure. As observed before, for a TR with a large number of copies we will find also integer multiples of
<italic>k</italic>
with a relatively high frequency. We take advantage of this fact and compute new weights:
<disp-formula>
<graphic xlink:href="btq209um1"></graphic>
</disp-formula>
The candidate periods are then sorted by the weight
<italic>w</italic>
<sub>1</sub>
(.), and processed in decreasing order.</p>
<p>Positional density: We further exploit the property of TRs that the same period is detected by probes in near-by positions. We define a notion of positional
<italic>k</italic>
-density, that is the density of probes that contribute to the counter for the candidate period
<italic>k</italic>
. We search for position in
<italic>Y</italic>
of high
<italic>k</italic>
-density as candidates for the starting point of a TR.</p>
<p>Validation: In the third phase we take each candidate pair (
<italic>p</italic>
,
<italic>i</italic>
) and we test explicitly whether there is a TR of period
<italic>p</italic>
starting in position
<italic>i</italic>
according to the definition (
<xref ref-type="sec" rid="SEC3">Section 3</xref>
). In particular when using the definition of a
<italic>Steiner-STR</italic>
(
<xref ref-type="sec" rid="SEC3.2">Section 3.2</xref>
) we use a double filtering. The fist filter uses a
<italic>wraparound dynamic programming</italic>
technique (WDP; Fischetti
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B14">1993</xref>
). The second filter computes an approximation to the
<italic>generalized media string</italic>
[inspired by an algorithm proposed in Jiang
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B22">2003</xref>
)]. In this phase, besides validating the TRs, we discover the (fractional) repetition number of the TRs eventually extracted.</p>
<p>Post-processing: As a post-processing, we check for inclusion the TRs found and we filter out those TRs completely enclosed in another one. For TRs in the same position and length but different period we report the TR with shorter period. Finally we align the approximate generalized median string with the TR units so to give a graphical compact output of the TR.</p>
</sec>
<sec sec-type="methods" id="SEC3">
<title>3 METHODS</title>
<sec id="SEC3.1">
<title>3.1 Basic definitions</title>
<p>A TR in a DNA sequence is the repetition of two or more contiguous exact or approximate copies of a substring (called the
<italic>motif</italic>
) of the TR.</p>
<sec id="SEC3.1.1">
<title>3.1.1 Exact TR</title>
<p>Formally, given an alphabet Σ, and a set of strings
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
∈ Σ
<sup>*</sup>
, consider the concatenation
<italic>X</italic>
=
<italic>x</italic>
<sub>1</sub>
<italic>x</italic>
<sub>2</sub>
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
. The string
<italic>X</italic>
is an
<italic>exact TR</italic>
(ETR) of
<italic>period k</italic>
and
<italic>repeat number t</italic>
, when |
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
| =
<italic>k</italic>
and
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
=
<italic>x</italic>
<sub>1</sub>
, for each
<italic>i</italic>
∈ [1 …
<italic>t</italic>
]. In general we may suppose there is a longer string
<italic>Y</italic>
of which
<italic>X</italic>
is a substring. The string
<italic>x</italic>
<sub>1</sub>
that is repeated exactly is called the
<italic>motif</italic>
of the ETR. A TR
<italic>X</italic>
is called maximal if it cannot be extended in
<italic>Y</italic>
while still being a TR.</p>
</sec>
<sec id="SEC3.1.2">
<title>3.1.2 ATR</title>
<p>ETRs are sometimes found in biological sequences, but they tell us only part of the story, thus several notions of an
<italic>ATR</italic>
have been developed. Denote with
<italic>D</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>a</italic>
,
<italic>b</italic>
) the hamming distance of two strings with equal length. If the length of
<italic>a</italic>
and
<italic>b</italic>
is different, we consider the smallest possible mismatch in an alignment of the two strings without gaps. Denote with
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>a</italic>
,
<italic>b</italic>
) the edit distance of the two strings
<italic>a</italic>
and
<italic>b</italic>
.</p>
</sec>
</sec>
<sec id="SEC3.2">
<title>3.2 Our definitions of TR</title>
<p>We used two different definitions of TRs:
<list list-type="bullet">
<list-item>
<p>
<bold>Neighboring TR (NTR)</bold>
: a string
<italic>X</italic>
, so that for each
<italic>i</italic>
∈ [1 …
<italic>t</italic>
− 1],
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
+1</sub>
)≤μ|
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
|, for a user defined parameter 0≤μ≤1</p>
</list-item>
<list-item>
<p>
<bold>Steiner-STR with sum</bold>
: a string
<italic>X</italic>
=
<italic>x</italic>
<sub>1</sub>
<italic>x</italic>
<sub>2</sub>
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
for which two conditions hold for a user defined error parameter 0≤μ≤1, and constant
<italic>c</italic>
with 1≤
<italic>c</italic>
≤2:
<list list-type="alpha-lower">
<list-item>
<p>for each
<italic>i</italic>
∈ [1 …
<italic>t</italic>
− 1],
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
+1</sub>
)≤
<italic>c</italic>
μ|
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
|.</p>
</list-item>
<list-item>
<p>there exists a Steiner string
<inline-formula>
<inline-graphic xlink:href="btq209i3.jpg"></inline-graphic>
</inline-formula>
so that
<inline-formula>
<inline-graphic xlink:href="btq209i4.jpg"></inline-graphic>
</inline-formula>
</p>
</list-item>
</list>
</p>
</list-item>
</list>
</p>
<p>Intuitively, in a Steiner-STR the TR consists of
<italic>t</italic>
duplications of a single Steiner consensus string
<inline-formula>
<inline-graphic xlink:href="btq209i5.jpg"></inline-graphic>
</inline-formula>
with
<inline-formula>
<inline-graphic xlink:href="btq209i6.jpg"></inline-graphic>
</inline-formula>
mutations on average in each copy, such that consecutive copies do not diverge too much w.r.t. the average. Note that condition (a) is vacuous for μ ≥ 1/
<italic>c</italic>
. The choice for the constant
<italic>c</italic>
depends also on the level of divergence. For low divergence
<italic>c</italic>
= 2 is a sensible choice since two copies at distance
<inline-formula>
<inline-graphic xlink:href="btq209i7.jpg"></inline-graphic>
</inline-formula>
from
<inline-formula>
<inline-graphic xlink:href="btq209i8.jpg"></inline-graphic>
</inline-formula>
are also at distance at most
<inline-formula>
<inline-graphic xlink:href="btq209i9.jpg"></inline-graphic>
</inline-formula>
from each other by the triangular inequality. Thus (a) is a necessary condition for (b). For higher level of divergence above 30%, the value
<italic>c</italic>
= 2 is too loose and we use a lower value
<italic>c</italic>
= 1.5, so as to maintain a good filtering ability of condition (a) and to avoid having as a possible solution a TR where the consecutive pairs may have a very irregular divergence.</p>
</sec>
<sec id="SEC3.3">
<title>3.3 Output of TRStalker</title>
<p>The aim of TRStalker is to produce a ranked list of
<italic>all maximal tandem repeat sub-sequences</italic>
present in the input string that satisfy the definition above (NTR or Steiner-STR), where maximality means that it is not possible to extend the TR (as a substring of the input) to the left or to the right without violating the definition within the given user-defined parameters. We avoid producing meaningless TRs by imposing also a lower bound on the TRs length.</p>
</sec>
<sec id="SEC3.4">
<title>3.4 Other definitions</title>
<p>In Sokol
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B48">2007</xref>
) it is used the following definition:
<italic>X</italic>
is called a
<italic>k-edit ATR</italic>
when ∑
<sub>
<italic>i</italic>
=1</sub>
<sup>
<italic>t</italic>
−1</sup>
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
+1</sub>
)≤
<italic>k</italic>
, where the last repeat
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
might be incomplete so
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>x</italic>
<sub>
<italic>t</italic>
−1</sub>
,
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
) is computed as the minimum edit distance of
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
and the prefixes of
<italic>x</italic>
<sub>
<italic>t</italic>
−1</sub>
. This definition is inspired by the evolutionary model of TRs in which it is assumed that TRs are generated by duplicating the last copy of a previous TR, possibly with duplication errors that truncate it. A k-edit repeat is
<italic>maximal</italic>
if it cannot be extended either to the left or to the right without violating its definition.</p>
<p>In Wexler
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B54">2004</xref>
), for a similarity function ϕ that measures the alignment score of two sequences, it is defined a η
<italic>-Simple ATR</italic>
(η-SATR) a string
<italic>X</italic>
=
<italic>x</italic>
<sub>1</sub>
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
such that: there exists a motif
<inline-formula>
<inline-graphic xlink:href="btq209i10.jpg"></inline-graphic>
</inline-formula>
so that for every
<italic>i</italic>
∈ [1, …,
<italic>t</italic>
],
<inline-formula>
<inline-graphic xlink:href="btq209i11.jpg"></inline-graphic>
</inline-formula>
. In other words, the TR consists of
<italic>t</italic>
duplications of a single consensus string
<inline-formula>
<inline-graphic xlink:href="btq209i12.jpg"></inline-graphic>
</inline-formula>
with mutations. Such string
<inline-formula>
<inline-graphic xlink:href="btq209i13.jpg"></inline-graphic>
</inline-formula>
is also called a
<italic>Steiner motif</italic>
if
<inline-formula>
<inline-graphic xlink:href="btq209i14.jpg"></inline-graphic>
</inline-formula>
is not constrained to be equal to some repeat
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
. Often in practice
<inline-formula>
<inline-graphic xlink:href="btq209i15.jpg"></inline-graphic>
</inline-formula>
is chosen as the repeat
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
that minimizes the error function, and is called a
<italic>pivot motif</italic>
. The distinction is critical since, as mentioned before, Steiner motifs lead to NP-complete recognition problems, while pivot motifs do not.</p>
<p>The η
<italic>-Neighboring ATR</italic>
(η-NATR) is a string
<italic>X</italic>
, so that for each
<italic>i</italic>
∈ [1,..,
<italic>t</italic>
− 1], ϕ(
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
+</sub>
)≥η (Wexler
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B54">2004</xref>
). The
<italic>Pairwise ATR</italic>
(PATR) is a string
<italic>X</italic>
, such that for every pair of indices
<italic>i</italic>
,
<italic>j</italic>
∈[1,..
<italic>t</italic>
]
<sup>2</sup>
with
<italic>i</italic>
<italic>j</italic>
we have ϕ(
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
)≥η
<sub>
<italic>ij</italic>
</sub>
, where η
<sub>
<italic>ij</italic>
</sub>
is set to be a monotonically decreasing function of |
<italic>i</italic>
<italic>j</italic>
|, thus allowing more slackness when comparing distant copies of the basic motif.</p>
<p>In Krishnan and Tang (
<xref ref-type="bibr" rid="B28">2004</xref>
) it is used a definition similar to that of the NATR, except that the Hamming distance is used and that the threshold is not absolute but relative to the length. A γ-Hamming ATR (γ-HATR) is a string
<italic>X</italic>
such that: for each
<italic>i</italic>
∈ [1, …
<italic>t</italic>
− 1],
<italic>D</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
+1</sub>
)≤|
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
|γ.</p>
<p>In Stolovitzky
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B49">1999</xref>
), a more complex definition is given that takes into account the substring alignment score density function for pairs of random substrings of a given length. Here, the definition of a TR
<italic>X</italic>
depends on the properties of the longer string
<italic>Y</italic>
into which
<italic>X</italic>
is embedded. In particular, a (μ,
<italic>p</italic>
)-TR must comply to two conditions: (i) ∑
<sub>
<italic>i</italic>
=0</sub>
<sup>
<italic>t</italic>
−1</sup>
ϕ(
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
+1</sub>
)≥(
<italic>t</italic>
− 1)μ, that imposes an average high similarity score for adjacent repeats, and (ii) define α(
<italic>p</italic>
,
<italic>k</italic>
) as the value of similarity such that there is probability
<italic>p</italic>
that two random substrings of length
<italic>k</italic>
in
<italic>Y</italic>
have similarity above α(
<italic>p</italic>
,
<italic>k</italic>
). There must be an index
<italic>q</italic>
∈ [1, …
<italic>t</italic>
] such that ϕ(
<italic>x</italic>
<sub>
<italic>q</italic>
</sub>
,
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
)≥α(
<italic>p</italic>
,
<italic>k</italic>
) for all
<italic>j</italic>
∈ [1, …
<italic>t</italic>
]. Note that this condition limits the dispersion of the similarity with respect to one of the copies (called the
<italic>pivot</italic>
).</p>
<p>TRF (Benson,
<xref ref-type="bibr" rid="B3">1999</xref>
) uses as final validation algorithm the WDP that tests efficiently the alignments of a given candidate motif with the surrounding portions of the input sequence, so as to determine the maximum number of adjacent repetitions within a user-defined score bound. This implies a notion of TR akin to that of SATRs with pivot motif. Classical results on string alignments (Gusfield,
<xref ref-type="bibr" rid="B19">1997</xref>
, p. 351) ensures that, for the metric score given by the sum of motif-repeats distances, the solution found using the optimal pivot motif has a score within a factor (2 − 1/
<italic>t</italic>
) of the score induced by the optimal Steiner motif. For low levels of errors one could use a pivot-SATR definition doubling the error threshold to capture a Steiner-SATR, however for higher error levels (say, above 25%), doubling the error threshold forces the existing systems to work in a range of values (say, above 50%) where most methods do not perform well.</p>
</sec>
<sec id="SEC3.5">
<title>3.5 Gapped q-grams</title>
<p>Let
<italic>I</italic>
be a finite subset of non-negative integers. We call
<italic>I</italic>
an
<italic>index set</italic>
. The
<italic>span</italic>
of
<italic>I</italic>
is
<italic>span</italic>
(
<italic>I</italic>
) =
<italic>max</italic>
{
<italic>i</italic>
<italic>j</italic>
|
<italic>i</italic>
,
<italic>j</italic>
<italic>I</italic>
}, the position of
<italic>I</italic>
is
<italic>pos</italic>
(
<italic>I</italic>
) = min
<italic>i</italic>
<italic>I</italic>
and the
<italic>shape</italic>
of
<italic>I</italic>
is
<italic>shape</italic>
(
<italic>I</italic>
) = {
<italic>i</italic>
<italic>pos</italic>
(
<italic>I</italic>
)|
<italic>i</italic>
<italic>I</italic>
}. When set
<italic>I</italic>
has |
<italic>I</italic>
| =
<italic>q</italic>
and
<italic>span</italic>
(
<italic>I</italic>
) =
<italic>s</italic>
, its shape belongs to the class of (
<italic>q</italic>
,
<italic>s</italic>
)-shapes. Any set of non-negative integers
<italic>Q</italic>
containing 0 is a shape. For an alphabet Σ={
<italic>A</italic>
,
<italic>C</italic>
,
<italic>G</italic>
,
<italic>T</italic>
}, a string
<italic>S</italic>
∈ Σ
<sup>*</sup>
of length
<italic>n</italic>
can be seen as a function defined over [0, …,
<italic>n</italic>
− 1] with values in Σ, and for any subset
<italic>I</italic>
⊂ [0, …,
<italic>n</italic>
− 1] the restriction of
<italic>S</italic>
to
<italic>I</italic>
, denoted by
<italic>S</italic>
[
<italic>I</italic>
] a substring of
<italic>S</italic>
.</p>
<p>Given any shape
<italic>Q</italic>
in the class of (
<italic>q</italic>
,
<italic>s</italic>
)-shapes, all sets
<italic>I</italic>
⊂ [0, …
<italic>n</italic>
− 1] such that
<italic>shape</italic>
(
<italic>I</italic>
) =
<italic>Q</italic>
, form the set of
<italic>Indexes</italic>
(
<italic>Q</italic>
,
<italic>n</italic>
). We can use elements from the
<italic>Indexes</italic>
(
<italic>Q</italic>
,
<italic>n</italic>
) to generate restrictions for the string
<italic>S</italic>
. Given two index sets
<italic>I</italic>
<sub>1</sub>
,
<italic>I</italic>
<sub>2</sub>
<italic>Indexes</italic>
(
<italic>Q</italic>
,
<italic>n</italic>
), we call them
<italic>matching</italic>
(or
<italic>homologous</italic>
in
<italic>S</italic>
, if
<italic>S</italic>
[
<italic>I</italic>
<sub>1</sub>
] =
<italic>S</italic>
[
<italic>I</italic>
<sub>2</sub>
]. The value |
<italic>pos</italic>
(
<italic>I</italic>
<sub>1</sub>
) −
<italic>pos</italic>
(
<italic>I</italic>
<sub>2</sub>
)| is called the
<italic>period</italic>
of the match.</p>
<p>An index set
<italic>I</italic>
with |
<italic>I</italic>
| =
<italic>q</italic>
and
<italic>span</italic>
(
<italic>I</italic>
) =
<italic>q</italic>
− 1 is called an
<italic>ungapped q-gram</italic>
since its shape is
<italic>shape</italic>
(
<italic>I</italic>
) = [0, …
<italic>q</italic>
− 1]. If we have an index set
<italic>J</italic>
with |
<italic>J</italic>
| =
<italic>q</italic>
and
<italic>span</italic>
(
<italic>J</italic>
) =
<italic>s</italic>
<italic>q</italic>
we have a
<italic>gapped q-gram</italic>
since its shape is formed of non-consecutive integers. In order to generate a population of candidate periods we consider now all possible (
<italic>q</italic>
,
<italic>s</italic>
)-shapes with
<italic>q</italic>
= 3 and
<italic>s</italic>
= 4, 3 ,2. Denoting with – the gaps and with # symbols from Σ, (the first and last positions must be always #), we have the (3, 4)-shapes ##− −#, #−#−# and #−−##; the (3, 3)-shapes #−##, ##−#; and the (3, 2)-shape ###.</p>
<p>As noted in Burkhardt and Kärkkäinen (
<xref ref-type="bibr" rid="B8">2003</xref>
) and Burkhardt and Kärkkäinen (
<xref ref-type="bibr" rid="B7">2002</xref>
), if we fix an ungapped shape and an error level in Hamming distance, there are error patterns for which every corresponding ungapped q-gram is affected by error. In contrast with the same Hamming error level, for some gapped shapes, there are always some gapped q-grams unaffected by the injected error. Thus using a small complete family of gapped q-grams we can detect the correct period in situations where ungapped q-grams cannot.
<xref ref-type="fn" rid="FN1">
<sup>1</sup>
</xref>
</p>
</sec>
<sec id="SEC3.6">
<title>3.6 Anti-smear weighting</title>
<p>Let
<italic>P</italic>
be a
<italic>q</italic>
-gram in the input string
<italic>Y</italic>
at position
<italic>i</italic>
. Let
<italic>j</italic>
<sub>1</sub>
,…,
<italic>j</italic>
<sub>
<italic>h</italic>
</sub>
be the next
<italic>h</italic>
occurrences of
<italic>P</italic>
in
<italic>Y</italic>
following the occurrence at position
<italic>i</italic>
. The
<italic>h</italic>
corresponding detected distances are
<italic>x</italic>
<sub>
<italic>g</italic>
</sub>
=
<italic>j</italic>
<sub>
<italic>g</italic>
</sub>
<italic>i</italic>
, for
<italic>g</italic>
∈ [1, …
<italic>h</italic>
]. For the period
<italic>x</italic>
<sub>
<italic>g</italic>
</sub>
, we increment its weight:
<disp-formula>
<graphic xlink:href="btq209um2"></graphic>
</disp-formula>
where
<italic>Q</italic>
is a queue holding the last
<italic>H</italic>
detected distances in the sequential scan of the input string
<italic>Y</italic>
. After the weight update, we enqueue all
<italic>h</italic>
values
<italic>x</italic>
<sub>g</sub>
in the queue
<italic>Q</italic>
, and we dequeue an equal number
<italic>h</italic>
of items. In line with other constants fixed in TRStalker, we have chosen
<italic>h</italic>
= 5 and
<italic>H</italic>
= 20 since they do work well in our synthetic experiments for a large range of TR error and length values. A fine tuning of these parameters as a function of the characteristics of the TR sought is possible, but beyond the focus of this article.</p>
</sec>
<sec id="SEC3.7">
<title>3.7 Positional density</title>
<p>Let
<italic>k</italic>
be the period under investigation. Consider the set
<italic>K</italic>
<sub>
<italic>k</italic>
</sub>
of the positions of those
<italic>q</italic>
-grams (i.e. substrings of
<italic>Y</italic>
) that contribute to the weighting of
<italic>k</italic>
through the multiplicity weighting. In order to avoid double counting, we always take the position of the first of the two matching probes. Note that, if a position is shared by several pairs of probes it will be counted only once. Let
<italic>f</italic>
:[1, …, |
<italic>Y</italic>
|]→{0, 1} the characteristic function that for each position in
<italic>Y</italic>
denote the membership of that position to
<italic>K</italic>
<sub>
<italic>k</italic>
</sub>
. Consider the
<italic>k</italic>
-window smoothing of
<italic>f</italic>
:
<italic>F</italic>
(
<italic>i</italic>
) = ∑
<sub>
<italic>j</italic>
=
<italic>i</italic>
</sub>
<sup>
<italic>i</italic>
+
<italic>k</italic>
</sup>
<italic>f</italic>
(
<italic>j</italic>
) that computes the
<italic>k</italic>
-smoothed density of the function
<italic>f</italic>
, for
<italic>i</italic>
∈ [1, …, |
<italic>Y</italic>
| −
<italic>k</italic>
]. Finally, we define a threshold
<italic>t</italic>
(
<italic>k</italic>
) proportional to the average
<italic>k</italic>
-density by a user-defined constant, and we consider as a candidate position set
<italic>CP</italic>
(
<italic>Y</italic>
,
<italic>k</italic>
) = {
<italic>i</italic>
∈ [1, …, |
<italic>Y</italic>
| −
<italic>k</italic>
]|
<italic>F</italic>
(
<italic>i</italic>
)≥
<italic>t</italic>
(
<italic>k</italic>
)}. The output of this positional density computation is a sequence of pairs (
<italic>k</italic>
,
<italic>i</italic>
) where
<italic>k</italic>
is a candidate period and
<italic>i</italic>
a candidate position.</p>
</sec>
<sec id="SEC3.8">
<title>3.8 Validation</title>
<p>The definition of Steiner-STR is composed of two conditions that will be tested in cascade starting from the one less computationally demanding.</p>
<sec id="SEC3.8.1">
<title>3.8.1 Testing condition (a)</title>
<p>The WDP technique in Fischetti
<italic>et al.</italic>
(
<xref ref-type="bibr" rid="B14">1993</xref>
) solves the following problem. Given a string
<italic>P</italic>
of length
<italic>m</italic>
and a text
<italic>T</italic>
of length
<italic>n</italic>
, with
<italic>m</italic>
<italic>n</italic>
, find the best alignment of
<italic>P</italic>
<sup>
<italic>n</italic>
</sup>
(concatenation of
<italic>n</italic>
copies of
<italic>P</italic>
in
<italic>T</italic>
), in time and storage
<italic>O</italic>
(
<italic>nm</italic>
). Note that a naive application of the standard dynamic programming based optimal alignment of two strings would require
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
<italic>m</italic>
) time/storage. We modify the WDP approach in order to (i) work with edit distance instead of similarity matrices, (ii) take as pattern the candidate initial tandem copy in positions [
<italic>i</italic>
,
<italic>i</italic>
+
<italic>k</italic>
− 1] and as text an adjacent portion of the input string of size
<italic>O</italic>
(
<italic>m</italic>
), (iii) we iteratively expand the the text length till the termination condition is met and (iv) we stop the matching as soon as the next adjacent copy of the TR differ from the previous one by more than
<italic>c</italic>
μ
<italic>m</italic>
in edit distance.</p>
</sec>
<sec id="SEC3.8.2">
<title>3.8.2 Testing condition (b)</title>
<p>Let
<italic>x</italic>
<sub>1</sub>
, …
<italic>x</italic>
<sub>
<italic>t</italic>
</sub>
be the candidate TR to test for property (b) that passed the test for property (a). We incrementally compute an approximate generalized median
<inline-formula>
<inline-graphic xlink:href="btq209i16.jpg"></inline-graphic>
</inline-formula>
, using
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
and the previously computed approximate generalized median string
<inline-formula>
<inline-graphic xlink:href="btq209i17.jpg"></inline-graphic>
</inline-formula>
. Initially
<inline-formula>
<inline-graphic xlink:href="btq209i18.jpg"></inline-graphic>
</inline-formula>
. Let
<italic>k</italic>
and
<italic>h</italic>
be two positive integers and
<italic>K</italic>
= {
<italic>j</italic>
/
<italic>k</italic>
|
<italic>j</italic>
∈ [0,
<italic>k</italic>
]} be the set formed by
<italic>k</italic>
+ 1 equally spaced real values between 0 and 1. For each value α ∈
<italic>K</italic>
, we determine up to
<italic>h median strings</italic>
between
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
and
<inline-formula>
<inline-graphic xlink:href="btq209i19.jpg"></inline-graphic>
</inline-formula>
with weight α. This set of at most
<italic>hk</italic>
candidates is then searched for the string
<italic>a</italic>
that minimizes the function ∑
<sub>
<italic>j</italic>
=1</sub>
<sup>
<italic>i</italic>
</sup>
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>a</italic>
,
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
). So we set
<inline-formula>
<inline-graphic xlink:href="btq209i20.jpg"></inline-graphic>
</inline-formula>
and start the next iteration.</p>
<p>A
<italic>median string</italic>
of weight α ∈ [0, …, 1] of two strings
<italic>a</italic>
and
<italic>b</italic>
is obtained as follows. Compute the edit distance
<italic>e</italic>
=
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>a</italic>
,
<italic>b</italic>
) and record the set
<italic>A</italic>
(
<italic>a</italic>
,
<italic>b</italic>
) of edit operations that transform
<italic>a</italic>
into
<italic>b</italic>
. Pick any subset of size ⌊α
<italic>e</italic>
⌋ in
<italic>A</italic>
(
<italic>a</italic>
,
<italic>b</italic>
). The median weighted string
<italic>c</italic>
is obtained by applying those operations to the string
<italic>a</italic>
. It is not difficult to show that it holds that
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>a</italic>
,
<italic>c</italic>
)=α
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>a</italic>
,
<italic>b</italic>
) and
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>b</italic>
,
<italic>c</italic>
) = (1 − α)
<italic>D</italic>
<sub>
<italic>E</italic>
</sub>
(
<italic>a</italic>
,
<italic>b</italic>
). Note that depending on the value of
<italic>e</italic>
we have
<inline-formula>
<inline-graphic xlink:href="btq209i21.jpg"></inline-graphic>
</inline-formula>
different subsets of
<italic>A</italic>
(
<italic>a</italic>
,
<italic>b</italic>
) we can choose. In our algorithm we randomly select
<inline-formula>
<inline-graphic xlink:href="btq209i22.jpg"></inline-graphic>
</inline-formula>
of them.</p>
</sec>
</sec>
<sec id="SEC3.9">
<title>3.9 Evaluation of recall in synthetic sequences</title>
<p>In order to measure the quality of the TRs reported by TRSTalker and by other benchmark algorithms in our synthetic experiments, we need to give a score to a pair of TRs. The higher the similarity of the two TRs, the higher should be the score. Since perfect equality is rare we need a more flexible score function. A TR can be characterized by the triple: (
<italic>b</italic>
,
<italic>p</italic>
,
<italic>r</italic>
), where
<italic>b</italic>
is the initial position,
<italic>p</italic>
the period,
<italic>r</italic>
the repetition number. Also, the same TR covers the positions in
<italic>Y</italic>
from index
<italic>b</italic>
to
<italic>b</italic>
+
<italic>rp</italic>
− 1. We identify the TR with the set of positions
<italic>Seg</italic>
(
<italic>TR</italic>
)=[
<italic>b</italic>
,
<italic>b</italic>
+
<italic>rp</italic>
− 1]. Given two TRs
<italic>TR</italic>
<sub>1</sub>
and
<italic>TR</italic>
<sub>2</sub>
represented as sets of positions, the classical Jaccard coefficient measure of set similarity
<italic>JC</italic>
is:
<disp-formula>
<graphic xlink:href="btq209um3"></graphic>
</disp-formula>
</p>
<sec id="SEC3.9.1">
<title>3.9.1 Modified Jaccard coefficient</title>
<p>Let
<italic>t</italic>
<sub>0</sub>
be a TR embedded in
<italic>Y</italic>
. Even if
<italic>t</italic>
<sub>0</sub>
is a TR according to the definition, when we embed
<italic>t</italic>
<sub>0</sub>
in a string
<italic>Y</italic>
, it is well possible that
<italic>t</italic>
<sub>0</sub>
is not maximal in
<italic>Y</italic>
, thus if an algorithm reports correctly
<italic>t</italic>
′ ⊃
<italic>t</italic>
<sub>0</sub>
there will be a slight penalization in the JC measure. This phenomenon arose a number of times, thus we decided to use a modified version of the Jaccard coefficient, called JC2, where the denominator is changed. The resulting measure is thus more robust w.r.t. this penalization:
<disp-formula>
<graphic xlink:href="btq209um4"></graphic>
</disp-formula>
Given a TR
<italic>t</italic>
<sub>0</sub>
and a set of TRs:
<italic>T</italic>
= {
<italic>t</italic>
<sub>1</sub>
<italic>t</italic>
<sub>
<italic>s</italic>
</sub>
} we define the best-match
<italic>BM</italic>
(
<italic>t</italic>
<sub>0</sub>
,
<italic>T</italic>
):
<disp-formula>
<graphic xlink:href="btq209um5"></graphic>
</disp-formula>
and the best-match-score (BMS):
<disp-formula>
<graphic xlink:href="btq209um6"></graphic>
</disp-formula>
In our controlled experiments, the evaluation module knows the embedded TR
<italic>t</italic>
<sub>0</sub>
and receives the output of an algorithm
<italic>T</italic>
, giving back the BMS. For a series of experiments, we will report the average of the BMS. Note that BMS has values in the range [0, …, 1], and higher values correspond to better quality. At first sight one might consider this metric as overly generous. However, since we cannot rule out the existence of other TRs in
<italic>Y</italic>
besides the embedded ones, we do not want to penalize the presence in
<italic>T</italic>
of valid TRs different from
<italic>t</italic>
<sub>0</sub>
. Also, the set
<italic>T</italic>
will not contain nested TRs.</p>
</sec>
</sec>
<sec id="SEC3.10">
<title>3.10 Evaluation of recall on biological sequences</title>
<p>The evaluation has been carried out according to the following procedure. Let
<italic>T</italic>
<sub>
<italic>TRS</italic>
</sub>
,
<italic>T</italic>
<sub>
<italic>TRF</italic>
</sub>
,
<italic>T</italic>
<sub>
<italic>ATR</italic>
</sub>
be the set of TRs found by TRStalker, TRF and ATRHunter respectively. First, we removed from every set all the TRs that have a Jaccard coefficient greater than a threshold
<italic>J</italic>
when compared with another TR in the same set. In other words, we removed TR duplicates from every set of results, where two TRs are considered as duplicates when they cover the same region with an approximation
<italic>J</italic>
. Since TRF and ATRHunter have been executed with options that discard all TRs having a score lower than a given threshold, we filtered
<italic>T</italic>
<sub>
<italic>TRS</italic>
</sub>
by removing all the TRs with a score under such value (this has been done to not penalize TRF and ATRHunter with respect to TRStalker). More in detail, TRF has been executed with match, mismatch and indel score equal to 2, 3 and 3, respectively, maximum motif length equal to 2000bp
<xref ref-type="fn" rid="FN2">
<sup>2</sup>
</xref>
, and threshold equal to 30. ATRHunter has been executed with match, mismatch, gap and terminal gap score equal to 1 0 -1 0, maximum motif length equal to 500bp
<xref ref-type="fn" rid="FN3">
<sup>3</sup>
</xref>
and threshold equal to 30. For the TRs found by TRStalker, the score is computed by using the same weights used by TRF and ATRHunter then we filtered the results using the same threshold. After the filtering phase, we computed the union of the TRs found by all algorithms,
<inline-formula>
<inline-graphic xlink:href="btq209i23.jpg"></inline-graphic>
</inline-formula>
. The removal of duplicates with threshold
<italic>J</italic>
is also applied to
<italic>U</italic>
. Naturally the higher the value of
<italic>J</italic>
less filtering will be performed.</p>
</sec>
</sec>
<sec sec-type="discussion" id="SEC4">
<title>4 DISCUSSION</title>
<p>We have performed comparative experiments both with synthetic and with biological sequences. Here we describe the experimental set up, how the synthetic sequences are generated and the outcome of the comparison. For biological data, we briefly indicate the reason why that sequence has been selected, and the new TRs found by the application of TRStalker.</p>
<sec id="SEC4.1">
<title>4.1 Synthetic data</title>
<sec id="SEC4.1.1">
<title>4.1.1 Generation of synthetic data</title>
<p>We carried out a first set of experiments by using synthetic data. This allows a fine grained control on the amount of mutations introduced within the regions covered by the TRs. The sequences we gave as input to the programs have been built according to the following steps:
<list list-type="order">
<list-item>
<p>the background sequence is generated by selecting the four bases A,C,G and T with equal probability;</p>
</list-item>
<list-item>
<p>a perfect TR is embedded within the previous sequence, the TR is generated as
<italic>r</italic>
repetitions of a motif with length
<italic>l</italic>
;</p>
</list-item>
<list-item>
<p>the region covered by the TR is mutated according to substitution, insertion and deletion probabilities (
<italic>p</italic>
<sub>
<italic>s</italic>
</sub>
,
<italic>p</italic>
<sub>
<italic>i</italic>
</sub>
and
<italic>p</italic>
<sub>
<italic>d</italic>
</sub>
); the number of substitutions, insertions and deletion for every repetition of the motif is exactly equal to
<italic>lp</italic>
<sub>
<italic>s</italic>
</sub>
,
<italic>lp</italic>
<sub>
<italic>i</italic>
</sub>
and
<italic>lp</italic>
<sub>
<italic>d</italic>
</sub>
; and</p>
</list-item>
<list-item>
<p>if the TR is a Steiner-STR, mutations are introduced in every repeat with respect to the consensus motif; if the TR is a NTR, mutations are introduced with respect to the previous repeat.</p>
</list-item>
</list>
</p>
<p>The experiments have been carried out running ATRHunter with these parameters: match, mismatch, gap and terminal gap score equal to 1 0 -1 0 (the most permissive setting on the website); maximum motif length equal to 500 bp (the maximum allowed by the tool). In order to select the definition of TRs among those allowed by ATRHunter, we performed a preliminary set of experiments: the definition that gave the best results was the third one (minimum alignment score). In this case, ATRHunter reports only the TRs that have a score higher than a given threshold. The value of the threshold has been set to 30.</p>
<p>For the web-based version of TRF, all the experiments have been carried out with these parameters: match, mismatch and indel score equal to 2, 3 and 5, respectively; maximum period equal to 500; minimum score equal to 30. For the binary version, we used the following ones: match, mismatch and indel score equal to 2, 3 and 3, respectively; match and indel probability equal to 0.75 and 0.20; maximum period equal to 500; minimum score equal to 30. The parameters of the experiments have been set so as to make sure that the minimum allowed score for all the tools tested is attained on the input data. TRStalker is run with the error parameter μ = 0.3 and the constant
<italic>c</italic>
= 1.5.</p>
</sec>
<sec id="SEC4.1.2">
<title>4.1.2 Discussion of the comparative experiments</title>
<p>For the experiments on NTR (
<xref ref-type="fig" rid="F1">Fig. 1</xref>
), we tested TRs with motifs of length from 60 to 300, and a number of repeats from 2 to 8. TRStalker has recall always above 95%. TRF (binary) has always a recall above 80% except for TR with repeat number 2 for which the recall drops to 60%. ATRHunter has recall of a about 60%. These experiments confirm the effectiveness of the new techniques for the initial filtering steps.
<fig id="F1" position="float">
<label>Fig. 1.</label>
<caption>
<p>BMS as a function of copy number for NTR. Motif lengths 60 (
<bold>a</bold>
), 100 (
<bold>b</bold>
), 100 (
<bold>c</bold>
) and 300 (
<bold>d</bold>
). The total length of the input sequence is 10 000 bp; the amount of substitutions, insertions and deletions are equal to 10% of the motif length each (thus with total error allowed of 30%). Every point is the average of 30 measurements and the 95% confidence intervals are shown.</p>
</caption>
<graphic xlink:href="btq209f1"></graphic>
</fig>
</p>
<p>Results on Steiner-STR with motifs of length from 60 to 300, and a number of repeats from 2 to 8 are shown in
<xref ref-type="fig" rid="F2">Figure 2</xref>
. Here we notice that all methods have degraded performance for longer motifs (>200 bases) while TRStalker still manages to have recall above 60%. For shorter motifs (of <100 bases) TRF (binary) is able to match TRStalker only when the repeat number is above 6. Thus for a large range of values, TRStalker attains the best performance in recall, or a matching one, always above 80%.
<fig id="F2" position="float">
<label>Fig. 2.</label>
<caption>
<p>BMS as a function of copy number for Steiner-STR. Motif lengths 60 (
<bold>a</bold>
), 100 (
<bold>b</bold>
), 200 (
<bold>c</bold>
) and 300 (
<bold>d</bold>
). The total length of the input sequence is 10 000 bp; the amount of substitutions, insertions and deletions are equal to 10% of the motif length each (thus with total error allowed of 30%). Every point is the average of 30 measurements and the 95% confidence intervals are shown.</p>
</caption>
<graphic xlink:href="btq209f2"></graphic>
</fig>
</p>
<p>The time performance of TRStalker has not been yet optimized. At the moment it is within an order of magnitude of TRF and ATRHunter. More details on the running time are in the
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btq209/DC1">Supplementary Materials</ext-link>
.</p>
</sec>
</sec>
<sec id="SEC4.2">
<title>4.2 Biological sequences</title>
<p>Testing of TRStalker on biological sequences has confirmed the potential of our method for finding very fuzzy TRs not detected by TRF and ATRHunter, and, to the best of our knowledge, not reported in literature. We tested the following sequences:
<list list-type="order">
<list-item>
<p>U43748
<italic>Homo sapiens</italic>
frataxin gene, promoter region and exon—2465 bp long (FRDA).</p>
</list-item>
<list-item>
<p>L3609
<italic>Homo sapiens</italic>
germline T-cell receptor βchain, complete gene—684 973 bp long (HSBT).</p>
</list-item>
<list-item>
<p>NC_001133.8
<italic>Saccharomyces cerevisiae</italic>
Chromosome I—230 208 bp long (YCh1).</p>
</list-item>
</list>
</p>
<sec id="SEC4.2.1">
<title>4.2.1 Experimental settings</title>
<p>The three algorithms have been run with the setting used in the synthetic experiments
<xref ref-type="fn" rid="FN4">
<sup>4</sup>
</xref>
(thus with a very permissive acceptance policy). In general, none of the three algorithms generates all TRs found by the two others, and in
<xref ref-type="table" rid="T1">Table 1</xref>
we show the percentage of the TRs found by each algorithm with respect to the union of the TRs found. In
<xref ref-type="table" rid="T2">Table 2</xref>
, we report some very long TRs that were detected by TRStalker but missed by the other two methods. We check the motif/repeat alignments using the tool
<monospace>jaligner</monospace>
(
<ext-link ext-link-type="uri" xlink:href="http://jaligner.sourceforge.net/">http://jaligner.sourceforge.net/</ext-link>
) using the BLOSUM62 score matrix, that confirms the good quality of the motifs found (
<xref ref-type="table" rid="T3">Table 3</xref>
).
<table-wrap id="T1" position="float">
<label>Table 1.</label>
<caption>
<p>Evaluation of recall for the three methods under evaluation</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th align="left" rowspan="1" colspan="1">Algorithm</th>
<th align="left" rowspan="1" colspan="1">Filter 90%</th>
<th align="left" rowspan="1" colspan="1">Filter 70%</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td align="left" rowspan="1" colspan="1">Frataxin</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRStalker (TRF filter)</td>
<td align="left" rowspan="1" colspan="1">59 (56.2)</td>
<td align="left" rowspan="1" colspan="1">43 (56.5)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRStalker (ATR filter)</td>
<td align="left" rowspan="1" colspan="1">43 (41.0)</td>
<td align="left" rowspan="1" colspan="1">30 (39.4)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRF</td>
<td align="left" rowspan="1" colspan="1">24 (22.9)</td>
<td align="left" rowspan="1" colspan="1">18 (23.6)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ATRHunter</td>
<td align="left" rowspan="1" colspan="1">24 (22.9)</td>
<td align="left" rowspan="1" colspan="1">23 (30.2)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Union</td>
<td align="left" rowspan="1" colspan="1">105 (100.0)</td>
<td align="left" rowspan="1" colspan="1">76 (100.0)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Homo sapiens</italic>
T-cell receptor β chain</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRStalker (TRF Filter)</td>
<td align="left" rowspan="1" colspan="1">22 557 (59.1)</td>
<td align="left" rowspan="1" colspan="1">14 137 (60.2)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRStalker (ATR Filter)</td>
<td align="left" rowspan="1" colspan="1">18 124 (47.5)</td>
<td align="left" rowspan="1" colspan="1">11 427 (48.7)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRF</td>
<td align="left" rowspan="1" colspan="1">9977 (26.1)</td>
<td align="left" rowspan="1" colspan="1">8521 (36.0)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ATRHunter</td>
<td align="left" rowspan="1" colspan="1">7392 (19.3)</td>
<td align="left" rowspan="1" colspan="1">7034 (29.6)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Union</td>
<td align="left" rowspan="1" colspan="1">38218 (100.0)</td>
<td align="left" rowspan="1" colspan="1">23743 (100.0)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Saccharomyces cerevisiae</italic>
chromosome I</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRStalker (TRF Filter)</td>
<td align="left" rowspan="1" colspan="1">7168 (61.8)</td>
<td align="left" rowspan="1" colspan="1">4656 (63.5)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRStalker (ATR Filter)</td>
<td align="left" rowspan="1" colspan="1">5621 (48.4)</td>
<td align="left" rowspan="1" colspan="1">3655 (49.9)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">TRF</td>
<td align="left" rowspan="1" colspan="1">2892 (24.9)</td>
<td align="left" rowspan="1" colspan="1">2518 (34.1)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ATRHunter</td>
<td align="left" rowspan="1" colspan="1">2037 (17.6)</td>
<td align="left" rowspan="1" colspan="1">1958 (26.4)</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Union</td>
<td align="left" rowspan="1" colspan="1">11 616 (100.0)</td>
<td align="left" rowspan="1" colspan="1">7407 (100.0)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Each entry in the table gives the absolute number of unique TR found, and in the percentage of unique TR w.r.t the union of the three methods. For TRSTalker, we used both a TRF-like and an ATRHunter-like filtering (more restrictive) on the TRs found.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T2" position="float">
<label>Table 2.</label>
<caption>
<p>Examples of TRs found by TRStalker and missed by TRF and ATRHunter</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th align="left" rowspan="1" colspan="1">No.</th>
<th align="left" rowspan="1" colspan="1">Sequence</th>
<th align="left" rowspan="1" colspan="1">Seq. length</th>
<th align="left" rowspan="1" colspan="1">TR start</th>
<th align="left" rowspan="1" colspan="1">TR end</th>
<th align="left" rowspan="1" colspan="1">TR length</th>
<th align="left" rowspan="1" colspan="1">Consensus</th>
<th align="left" rowspan="1" colspan="1">Repetitions</th>
<th align="left" rowspan="1" colspan="1">Score</th>
<th align="left" rowspan="1" colspan="1">Norm. score</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">HSBT</td>
<td align="left" rowspan="1" colspan="1">684 973</td>
<td align="left" rowspan="1" colspan="1">411 000</td>
<td align="left" rowspan="1" colspan="1">413127</td>
<td align="left" rowspan="1" colspan="1">2127</td>
<td align="left" rowspan="1" colspan="1">1061</td>
<td align="left" rowspan="1" colspan="1">2.00</td>
<td align="left" rowspan="1" colspan="1">2868</td>
<td align="left" rowspan="1" colspan="1">1.384</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">HSBT</td>
<td align="left" rowspan="1" colspan="1">684 973</td>
<td align="left" rowspan="1" colspan="1">448 001</td>
<td align="left" rowspan="1" colspan="1">449687</td>
<td align="left" rowspan="1" colspan="1">1686</td>
<td align="left" rowspan="1" colspan="1">842</td>
<td align="left" rowspan="1" colspan="1">2.00</td>
<td align="left" rowspan="1" colspan="1">2310</td>
<td align="left" rowspan="1" colspan="1">1.370</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">3</td>
<td align="left" rowspan="1" colspan="1">HSBT</td>
<td align="left" rowspan="1" colspan="1">684 973</td>
<td align="left" rowspan="1" colspan="1">636 116</td>
<td align="left" rowspan="1" colspan="1">638622</td>
<td align="left" rowspan="1" colspan="1">2506</td>
<td align="left" rowspan="1" colspan="1">1253</td>
<td align="left" rowspan="1" colspan="1">2.00</td>
<td align="left" rowspan="1" colspan="1">3323</td>
<td align="left" rowspan="1" colspan="1">1.326</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">YCh1</td>
<td align="left" rowspan="1" colspan="1">230 208</td>
<td align="left" rowspan="1" colspan="1">186 168</td>
<td align="left" rowspan="1" colspan="1">188347</td>
<td align="left" rowspan="1" colspan="1">2179</td>
<td align="left" rowspan="1" colspan="1">1089</td>
<td align="left" rowspan="1" colspan="1">2.00</td>
<td align="left" rowspan="1" colspan="1">3053</td>
<td align="left" rowspan="1" colspan="1">1.401</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">FRDA</td>
<td align="left" rowspan="1" colspan="1">2465</td>
<td align="left" rowspan="1" colspan="1">2029</td>
<td align="left" rowspan="1" colspan="1">2407</td>
<td align="left" rowspan="1" colspan="1">378</td>
<td align="left" rowspan="1" colspan="1">188</td>
<td align="left" rowspan="1" colspan="1">2.011</td>
<td align="left" rowspan="1" colspan="1">501</td>
<td align="left" rowspan="1" colspan="1">1.325</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>We report the original sequence name and length, the TR starting and ending positions, the TR length and the TR repeating unit length and copy number. The score is computed by assigning +2 to matches and −1 to mismatches and gaps w.r.t the consensus string. The normalized score is the score divided the TR length.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T3" position="float">
<label>Table 3.</label>
<caption>
<p>Motif/repeats alignment scores computed by
<monospace>jaligner</monospace>
using the BLOSUM62 score matrix with gap open penalty set to 10.0 and gap extend penalty set to 0.5 for the TRs reported in
<xref ref-type="table" rid="T2">Table 2</xref>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th align="left" rowspan="1" colspan="1">Seq.</th>
<th align="left" rowspan="1" colspan="1">No.</th>
<th align="left" rowspan="1" colspan="1">Repeat</th>
<th align="left" rowspan="1" colspan="1">Length,
<italic>N</italic>
</th>
<th align="left" rowspan="1" colspan="1">Identity, n(%)</th>
<th align="left" rowspan="1" colspan="1">Gaps, n(%)</th>
<th align="left" rowspan="1" colspan="1">Score</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td align="left" rowspan="1" colspan="1">HSBT</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1107</td>
<td align="left" rowspan="1" colspan="1">805(72.72)</td>
<td align="left" rowspan="1" colspan="1">91 (8.22)</td>
<td align="left" rowspan="1" colspan="1">3657.00</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">1093</td>
<td align="left" rowspan="1" colspan="1">895(81.88)</td>
<td align="left" rowspan="1" colspan="1">70 (6.40)</td>
<td align="left" rowspan="1" colspan="1">4291.00</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">HSBT</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">878</td>
<td align="left" rowspan="1" colspan="1">638(72.67)</td>
<td align="left" rowspan="1" colspan="1">85 (9.68)</td>
<td align="left" rowspan="1" colspan="1">3045.50</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">866</td>
<td align="left" rowspan="1" colspan="1">716(82.68)</td>
<td align="left" rowspan="1" colspan="1">52 (6.00)</td>
<td align="left" rowspan="1" colspan="1">3568.00</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">HSBT</td>
<td align="left" rowspan="1" colspan="1">3</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1300</td>
<td align="left" rowspan="1" colspan="1">1000(76.92)</td>
<td align="left" rowspan="1" colspan="1">94 (7.23)</td>
<td align="left" rowspan="1" colspan="1">5206.00</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">3</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">1313</td>
<td align="left" rowspan="1" colspan="1">1004(76.47)</td>
<td align="left" rowspan="1" colspan="1">120 (9.14)</td>
<td align="left" rowspan="1" colspan="1">5176.50</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">YCh1</td>
<td align="left" rowspan="1" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1130</td>
<td align="left" rowspan="1" colspan="1">895(79.20)</td>
<td align="left" rowspan="1" colspan="1">83 (7.35)</td>
<td align="left" rowspan="1" colspan="1">4280.50</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">1123</td>
<td align="left" rowspan="1" colspan="1">901(80.23)</td>
<td align="left" rowspan="1" colspan="1">77 (6.86)</td>
<td align="left" rowspan="1" colspan="1">4345.50</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">FRDA</td>
<td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">193</td>
<td align="left" rowspan="1" colspan="1">149(77.20)</td>
<td align="left" rowspan="1" colspan="1">10 (5.18)</td>
<td align="left" rowspan="1" colspan="1">723.50</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">191</td>
<td align="left" rowspan="1" colspan="1">146(76.44)</td>
<td align="left" rowspan="1" colspan="1">5 (2.62)</td>
<td align="left" rowspan="1" colspan="1">765.00</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="SEC4.2.2">
<title>4.2.2 Frederich's ataxia</title>
<p>Frederich's ataxia is an autosomal recessive degenerative disease involving the central and peripheral nervous system and the heart, that roughly affects one person in 50 000 (Wells,
<xref ref-type="bibr" rid="B53">2008</xref>
). In 1996, it was shown (Campuzano
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B10">1996</xref>
) that in 98% of the cases this disease was caused by an abnormal expansion in the copy number of a triplet TR in the first intron of the Frataxin coding sequence. It belongs to the family of
<italic>trinucleotide repeat disorders</italic>
. Very recently (Vissers
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B50">2009</xref>
) it has been shown that the local repetitive structure of DNA may play a role in
<italic>variable copy number</italic>
genomic disorders. Applying TRStalker to the frataxin sequence we detected a divergent TR in positions [2036–2414], of period 188 and copy number 2, to the best of our knowledge not previously reported, that includes the breakpoint region of the repeat disorder (
<xref ref-type="table" rid="T2">Table 2</xref>
). Experimental data reported in Brodzik (
<xref ref-type="bibr" rid="B5">2007</xref>
) on the Frataxin sequence did find a number of short TRs (of period up to 10/13) that are completely covered by the longer fuzzy TR reported by TRStalker.</p>
</sec>
<sec id="SEC4.2.3">
<title>4.2.3 Human
<italic>beta</italic>
T-cell receptor locus</title>
<p>The cellular immune system detects the presence of pathogens largely through the activation of
<italic>T-cell receptor</italic>
proteins (TCR) (Glusman
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B16">2001</xref>
), which come in four different families α, β, γ and δ. The complete DNA sequence of the human β T-cell receptor locus has been determined (Rowen
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B42">1996</xref>
) and it has been found that a large fraction of the locus sequence (about 47%) is formed by locus-specific repeats (Rowen
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B42">1996</xref>
). This sequence was selected as a test case for TRStalker because of its richness in repeating elements with the aim of highlighting the ability of TRStalker in finding repeats with high divergence among adjacent copies. Here, (
<xref ref-type="table" rid="T2">Table 2</xref>
) we could find a few such repeats apparently not recorded in the GenBank: L36092.2 record, nor found by TRF and ATRHunter (still set with very loose parameters).</p>
</sec>
<sec id="SEC4.2.4">
<title>4.2.4 Yeast chromosome I</title>
<p>
<italic>Saccharomyces cerevisiae</italic>
(baker's yeast) has been the focus of intensive study as the first eukaryotic organism whose genome was completely sequenced (Dujon,
<xref ref-type="bibr" rid="B12">1996</xref>
), and serves as a model organism in basic genomic investigations. Chromosome I (Bussey
<italic>et al.</italic>
,
<xref ref-type="bibr" rid="B9">1995</xref>
) is the smallest of the 16 chromosomes present in yeast. It has been noticed that the yeast genome is remarkably poor in repeated elements (Dujon,
<xref ref-type="bibr" rid="B12">1996</xref>
), thus finding new TRs in such organism is a challenging task for any algorithm. In
<xref ref-type="table" rid="T2">Table 2</xref>
we report a TR in position [186168, 188347] of copy number 2 and motif length 1089. This TR is not reported in the TRDB database, while ATRHunter in the same region finds 15 shorter TR of length ranging from 50 to 180. This region, according to the NCBI record, is rich in genes of the DUP240 gene family (encoding membrane proteins). The presence of a fuzzy repeat in this region thus suggests a possible remote gene duplication event.</p>
</sec>
<sec id="SEC4.2.5">
<title>4.2.5 Performance on biological sequences</title>
<p>Reporting interesting single new TRs, as in
<xref ref-type="table" rid="T2">Table 2</xref>
, is useful to demonstrate that biological relevant TRs are still unknown. We give also an evaluation of the overall behavior of the three different methods on biological sequences. Thus, we compared TRStalker, TRF and ATRHunter by estimating their recall on the three biological sequences with the methodology described in
<xref ref-type="sec" rid="SEC3.10">Section 3.10</xref>
.
<xref ref-type="table" rid="T1">Table 1</xref>
reports (i) the number of unique TRs found by the different algorithms and (ii) the percentage of the union reported by a given algorithm, with two filtering thresholds at
<italic>J</italic>
= 90% and
<italic>J</italic>
= 70%. For all the three sequences, TRStalker is able to find a large number of TRs that are not discovered by using the other methods. In practice, a better overall coverage can be attained by using all three methods and merging their results. Although lower
<italic>J</italic>
values imply a more aggressive filtering, the percentage of the union attained by TRStalker is almost constant.</p>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="SEC5">
<title>5 CONCLUSION</title>
<p>TRStalker is a novel efficient heuristic algorithm for finding Fuzzy TRs in biological sequences. TRStalker aims at improving the capability of TR detection for a class of fuzzy TRs for which existing methods do not perform well. Initial testing on biological data show that fuzzy TRs not previously reported are present in biologically relevant sequences. In the case of the Frataxin sequence, the fuzzy TR reported is associated with the known variable copy number breakpoint of Frederich's ataxia. Future work will involve testing TRStalker on relevant families of repetitive elements such as centromeric α-satellites. An extension of TRStalker to handle amino acid sequences is under development.</p>
<p>
<italic>Funding</italic>
: European Community's Seventh Framework Programme FP7/2007-2013 under grant agreement N 223920 (Virtual Physiological Human Network of Excellence) partially.</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>[Supplementary Data]</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="btq209_index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="btq209_1.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<fn-group>
<fn id="FN1">
<p>
<sup>1</sup>
A precise characterization of the relative gain under different error models would be theoretically interesting but is now beyond the focus of this article. Selecting larger values of
<italic>q</italic>
and
<italic>s</italic>
, as a function of the period to be detected and the error level, may increase the filtering ability of the method at the cost of slower computations. Exploring these connections is left for future research.</p>
</fn>
<fn id="FN2">
<p>
<sup>2</sup>
Maximum possible value for TRF.</p>
</fn>
<fn id="FN3">
<p>
<sup>3</sup>
Maximum possible value for ATRHunter.</p>
</fn>
<fn id="FN4">
<p>
<sup>4</sup>
For TRF the maximum motif length has been raised to 2000 bp.</p>
</fn>
</fn-group>
<ref-list>
<title>REFERENCES</title>
<ref id="B1">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ames</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Comparative analyses of human single- and multilocus tandem repeats</article-title>
<source>Genetics</source>
<year>2008</year>
<volume>179</volume>
<fpage>1693</fpage>
<lpage>1704</lpage>
<pub-id pub-id-type="pmid">18562644</pub-id>
</element-citation>
</ref>
<ref id="B2">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Benson</surname>
<given-names>G</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Istrail</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>An algorithm for finding tandem repeats of unspecified pattern size</article-title>
<source>Proceedings of the Second Annual international Conference on Computational Molecular Biology (New York, New York, United States, March 22–25, 1998)</source>
<year>1998</year>
<publisher-loc>New York, NY</publisher-loc>
<publisher-name>RECOMB '98. ACM</publisher-name>
<fpage>20</fpage>
<lpage>29</lpage>
</element-citation>
</ref>
<ref id="B3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benson</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Tandem repeats finder: a program to analyze DNA sequences</article-title>
<source>Nucleic Acids Res.</source>
<year>1999</year>
<volume>27</volume>
<fpage>573</fpage>
<lpage>580</lpage>
<comment>Available at
<ext-link ext-link-type="uri" xlink:href="http://tandem.bu.edu/trf/trf.html">http://tandem.bu.edu/trf/trf.html</ext-link>
(last accessed date April 27, 2010)</comment>
<pub-id pub-id-type="pmid">9862982</pub-id>
</element-citation>
</ref>
<ref id="B4">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boeva</surname>
<given-names>V</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>676</fpage>
<lpage>684</lpage>
<pub-id pub-id-type="pmid">16403795</pub-id>
</element-citation>
</ref>
<ref id="B5">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brodzik</surname>
<given-names>AK</given-names>
</name>
</person-group>
<article-title>Quaternionic periodicity transform: an algebraic solution to the tandem repeat detection problem</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>694</fpage>
<lpage>700</lpage>
<pub-id pub-id-type="pmid">17237057</pub-id>
</element-citation>
</ref>
<ref id="B6">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buchner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Janjarasjitt</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Detection and visualization of tandem repeats in DNA sequences</article-title>
<source>IEEE Trans. Signal Process</source>
<year>2003</year>
<volume>51</volume>
<fpage>2280</fpage>
<lpage>2287</lpage>
</element-citation>
</ref>
<ref id="B7">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Burkhardt</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Apostolico</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Takeda</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>One-gapped q-grams filters for levenshtein distance</article-title>
<source>Combinatorial Pattern Matching, 13th Annual Symposium, CPM 2002, Fukuoka, Japan, July 3–5, 2002, Proceedings</source>
<year>2002</year>
<volume>2373</volume>
<publisher-name>Lecture Notes in Computer Science, Springer</publisher-name>
<fpage>225</fpage>
<lpage>234</lpage>
</element-citation>
</ref>
<ref id="B8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Burkhardt</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kärkkäinen</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Better filtering with gapped q-grams</article-title>
<source>Fundam. Inform.</source>
<year>2003</year>
<volume>56</volume>
<fpage>51</fpage>
<lpage>70</lpage>
</element-citation>
</ref>
<ref id="B9">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bussey</surname>
<given-names>H</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The nucleotide sequence of chromosome I from Saccharomyces cerevisiae</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1995</year>
<volume>92</volume>
<fpage>3809</fpage>
<lpage>3813</lpage>
<pub-id pub-id-type="pmid">7731988</pub-id>
</element-citation>
</ref>
<ref id="B10">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Campuzano</surname>
<given-names>V</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion</article-title>
<source>Science</source>
<year>1996</year>
<volume>271</volume>
<fpage>1423</fpage>
<lpage>1427</lpage>
<pub-id pub-id-type="pmid">8596916</pub-id>
</element-citation>
</ref>
<ref id="B11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>de la Higuera</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Casacuberta</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Topology of strings: median string is np-complete</article-title>
<source>Theor. Comput. Sci.</source>
<year>2000</year>
<volume>230</volume>
<fpage>39</fpage>
<lpage>48</lpage>
</element-citation>
</ref>
<ref id="B12">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dujon</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>The yeast genome project: what did we learn?</article-title>
<source>Trends Genet.</source>
<year>1996</year>
<volume>12</volume>
<fpage>263</fpage>
<lpage>270</lpage>
<pub-id pub-id-type="pmid">8763498</pub-id>
</element-citation>
</ref>
<ref id="B13">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Elemento</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Gascuel</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>An efficient and accurate distance based algorithm to reconstruct tandem duplication trees</article-title>
<source>Proceedings of the European Conference on Computational Biology (ECCB 2002)</source>
<year>2002</year>
<publisher-loc>Germany</publisher-loc>
<publisher-name>Saarbrücken</publisher-name>
<fpage>92</fpage>
<lpage>99</lpage>
<comment>October 6-9, 2002 (Supplement of Bioinformatics)</comment>
</element-citation>
</ref>
<ref id="B14">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fischetti</surname>
<given-names>VA</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Identifying periodic occurrences of a template with applications to protein structure</article-title>
<source>Inf. Process. Lett.</source>
<year>1993</year>
<volume>45</volume>
<fpage>11</fpage>
<lpage>18</lpage>
</element-citation>
</ref>
<ref id="B15">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gelfand</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>TRDB - the tandem repeats database</article-title>
<source>Nucleic Acids Res.</source>
<year>2007</year>
<volume>35</volume>
<issue>Database Issue</issue>
<fpage>80</fpage>
<lpage>87</lpage>
</element-citation>
</ref>
<ref id="B16">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Glusman</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Comparative genomics of the human and mouse T cell receptor loci</article-title>
<source>Immunity</source>
<year>2001</year>
<volume>15</volume>
<fpage>337</fpage>
<lpage>349</lpage>
<pub-id pub-id-type="pmid">11567625</pub-id>
</element-citation>
</ref>
<ref id="B17">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grissa</surname>
<given-names>I</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>172</fpage>
<pub-id pub-id-type="pmid">17521438</pub-id>
</element-citation>
</ref>
<ref id="B18">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gupta</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A novel signal processing measure to identify exact and inexact tandem repeat patterns in DNA sequences</article-title>
<source>EURASIP. J. Bioinform. Syst. Biol.</source>
<year>2007</year>
<comment>[Epub ahead of print, doi: 10.1155/2007/43596, March 13, 2007]</comment>
</element-citation>
</ref>
<ref id="B19">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Gusfield</surname>
<given-names>D</given-names>
</name>
</person-group>
<source>Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology.</source>
<year>1997</year>
<publisher-loc>Cambridge, U.K.</publisher-loc>
<publisher-name>Cambridge University Press</publisher-name>
</element-citation>
</ref>
<ref id="B20">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gusfield</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Stoye</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Linear time algorithms for finding and representing all the tandem repeats in a string</article-title>
<source>J. Comput. Syst. Sci.</source>
<year>2004</year>
<volume>69</volume>
<fpage>525</fpage>
<lpage>546</lpage>
</element-citation>
</ref>
<ref id="B21">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Hauth</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Joseph</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Beyond tandem repeats: complex pattern structures and distant regions of similarity</article-title>
<source>Proceedings of the Tenth International Conference on Intelligent Systems for Molecular Biology</source>
<year>2002</year>
<publisher-loc>Alberta, Canada</publisher-loc>
<publisher-name>Edmondon</publisher-name>
<fpage>31</fpage>
<lpage>37</lpage>
<comment>August 3-7, 2002</comment>
</element-citation>
</ref>
<ref id="B22">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Dynamic computation of generalised median strings</article-title>
<source>Pattern Anal. Appl.</source>
<year>2003</year>
<volume>6</volume>
<fpage>185</fpage>
<lpage>193</lpage>
</element-citation>
</ref>
<ref id="B23">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jurka</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Repbase update, a database of eukaryotic repetitive elements</article-title>
<source>Cytogenet. Genome Res.</source>
<year>2005</year>
<volume>110</volume>
<fpage>462</fpage>
<lpage>467</lpage>
<pub-id pub-id-type="pmid">16093699</pub-id>
</element-citation>
</ref>
<ref id="B24">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kelkar</surname>
<given-names>Y.DD</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The genome-wide determinants of human and chimpanzee microsatellite evolution</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<fpage>30</fpage>
<lpage>38</lpage>
<pub-id pub-id-type="pmid">18032720</pub-id>
</element-citation>
</ref>
<ref id="B25">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kolpakov</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Kucherov</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Finding approximate repetitions under Hamming distance</article-title>
<source>Theor. Comput. Sci.</source>
<year>2003</year>
<volume>303</volume>
<fpage>135</fpage>
<lpage>156</lpage>
</element-citation>
</ref>
<ref id="B26">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kolpakov</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>mreps: efficient and flexible detection of tandem repeats in DNA</article-title>
<source>Nucleic Acids Res.</source>
<year>2003</year>
<volume>31</volume>
<fpage>3672</fpage>
<lpage>3678</lpage>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://bioinfo.lifl.fr/mreps/">http://bioinfo.lifl.fr/mreps/</ext-link>
(last accessed date April 27, 2010)</comment>
<pub-id pub-id-type="pmid">12824391</pub-id>
</element-citation>
</ref>
<ref id="B27">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kolpakov</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Kucherov</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Finding maximal repetitions in a word in linear time</article-title>
<source>Proceedings of the 40th Annual Symposium on Foundations of Computer Science (October 17–18, 1999). FOCS</source>
<year>1999</year>
<publisher-loc>Washington, DC</publisher-loc>
<publisher-name>IEEE Computer Society</publisher-name>
<fpage>596</fpage>
</element-citation>
</ref>
<ref id="B28">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krishnan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Exhaustive whole-genome tandem repeats search</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>2702</fpage>
<lpage>2710</lpage>
<pub-id pub-id-type="pmid">15145809</pub-id>
</element-citation>
</ref>
<ref id="B29">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Schleiermacher</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Reputer: fast computation of maximal repeats in complete genomes</article-title>
<source>Bioinformatics</source>
<year>1999</year>
<volume>15</volume>
<fpage>426</fpage>
<lpage>427</lpage>
<pub-id pub-id-type="pmid">10366664</pub-id>
</element-citation>
</ref>
<ref id="B30">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Reputer: the manifold applications of repeat analysis on a genomic scale</article-title>
<source>Nucleic Acids Res.</source>
<year>2001</year>
<volume>29</volume>
<fpage>4633</fpage>
<lpage>4642</lpage>
<pub-id pub-id-type="pmid">11713313</pub-id>
</element-citation>
</ref>
<ref id="B31">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Landau</surname>
<given-names>GM</given-names>
</name>
<etal></etal>
</person-group>
<article-title>An algorithm for approximate tandem repeats</article-title>
<source>J. Comput. Biol.</source>
<year>2001</year>
<volume>8</volume>
<fpage>1</fpage>
<lpage>18</lpage>
<pub-id pub-id-type="pmid">11339903</pub-id>
</element-citation>
</ref>
<ref id="B32">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leclercq</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Detecting microsatellites within genomes: significant variation among algorithms</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>1</fpage>
<lpage>125</lpage>
<pub-id pub-id-type="pmid">17199892</pub-id>
</element-citation>
</ref>
<ref id="B33">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Legendre</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Sequence-based estimation of minisatellite and microsatellite repeat variability</article-title>
<source>Genome Res.</source>
<year>2007</year>
<volume>17</volume>
<fpage>1787</fpage>
<lpage>1796</lpage>
<pub-id pub-id-type="pmid">17978285</pub-id>
</element-citation>
</ref>
<ref id="B34">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Motwani</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Raghavan</surname>
<given-names>P</given-names>
</name>
</person-group>
<source>Randomized Algorithms.</source>
<year>1995</year>
<publisher-loc>Cambridge, UK</publisher-loc>
<publisher-name>Cambridge University Press</publisher-name>
</element-citation>
</ref>
<ref id="B35">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mudunuri</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Nagarajaram</surname>
<given-names>HA</given-names>
</name>
</person-group>
<article-title>Imex: Imperfect microsatellite extractor</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>1181</fpage>
<lpage>1187</lpage>
<pub-id pub-id-type="pmid">17379689</pub-id>
</element-citation>
</ref>
<ref id="B36">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Mulmuley</surname>
<given-names>K</given-names>
</name>
</person-group>
<source>Computational Geometry, an Introduction through Randomized Algorithms.</source>
<year>1993</year>
<publisher-loc>Englewood Cliffs, New Jersey</publisher-loc>
<publisher-name>Prentice Hall</publisher-name>
</element-citation>
</ref>
<ref id="B37">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>O'Dushlaine</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Tandem repeat copy-number variation in protein-coding regions of human genes</article-title>
<source>Genome Biology</source>
<year>2005</year>
<volume>6</volume>
<fpage>R69</fpage>
<pub-id pub-id-type="pmid">16086851</pub-id>
</element-citation>
</ref>
<ref id="B38">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Parisi</surname>
<given-names>V</given-names>
</name>
<etal></etal>
</person-group>
<article-title>String: finding tandem repeats in DNA sequences</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>1733</fpage>
<lpage>1738</lpage>
<pub-id pub-id-type="pmid">14512342</pub-id>
</element-citation>
</ref>
<ref id="B39">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peterlongo</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Lossless filter for multiple repeats with bounded edit distance</article-title>
<source>Algorithms Mol. Biol.</source>
<year>2009</year>
<volume>4</volume>
<fpage>1</fpage>
<lpage>3</lpage>
<pub-id pub-id-type="pmid">19128477</pub-id>
</element-citation>
</ref>
<ref id="B40">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rivals</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>A survey on algorithmic aspects of tandem repeats evolution</article-title>
<source>Int. J. Found. Comput. Sci.</source>
<year>2004</year>
<volume>15</volume>
<fpage>225</fpage>
<lpage>257</lpage>
</element-citation>
</ref>
<ref id="B41">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rivals</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences</article-title>
<source>Comput. Appl. Biosci.</source>
<year>1997</year>
<volume>13</volume>
<fpage>131</fpage>
<lpage>136</lpage>
<pub-id pub-id-type="pmid">9146959</pub-id>
</element-citation>
</ref>
<ref id="B42">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rowen</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The complete 685-kilobase DNA sequence of the human beta T Cell Receptor Locus</article-title>
<source>Science</source>
<year>1996</year>
<volume>272</volume>
<fpage>1755</fpage>
<lpage>1762</lpage>
<pub-id pub-id-type="pmid">8650574</pub-id>
</element-citation>
</ref>
<ref id="B43">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saha</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Empirical comparison of ab initio repeat finding programs</article-title>
<source>Nucleic Acids Res.</source>
<year>2008</year>
<volume>36</volume>
<fpage>2284</fpage>
<lpage>2294</lpage>
<pub-id pub-id-type="pmid">18287116</pub-id>
</element-citation>
</ref>
<ref id="B44">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sammeth</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stoye</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Comparing tandem repeats with duplications and excisions of variable degree</article-title>
<source>IEEE/ACM Trans. Comput. Biology Bioinform.</source>
<year>2006</year>
<volume>3</volume>
<fpage>395</fpage>
<lpage>407</lpage>
</element-citation>
</ref>
<ref id="B45">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sharma</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Spectral repeat finder (SRF): identification of repetitive sequences using fourier transformation</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>1405</fpage>
<lpage>1412</lpage>
<pub-id pub-id-type="pmid">14976032</pub-id>
</element-citation>
</ref>
<ref id="B46">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sim</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>The consensus string problem for a metric is np-complete</article-title>
<source>J. Discrete Algorithms</source>
<year>2003</year>
<volume>1</volume>
<fpage>111</fpage>
<lpage>117</lpage>
</element-citation>
</ref>
<ref id="B47">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Smit</surname>
<given-names>A.F.A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Repeatmasker open-3.0.</article-title>
<year>(1996–2004)</year>
<date-in-citation>last access date April 27, 2010</date-in-citation>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://www.repeatmasker.org/">http://www.repeatmasker.org/</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B48">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sokol</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Tandem repeats over the edit distance</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>30</fpage>
<lpage>35</lpage>
<pub-id pub-id-type="pmid">17130137</pub-id>
</element-citation>
</ref>
<ref id="B49">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Stolovitzky</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Tandem repeat detection using pattern discovery with applications to the identification of yeast satellites</article-title>
<source>Technical Report RC21508</source>
<year>1999</year>
<publisher-name>IBM T.J. Watson Research Labs</publisher-name>
</element-citation>
</ref>
<ref id="B50">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vissers</surname>
<given-names>LE</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Rare pathogenic microdeletions and tandem duplications are microhomology-mediated and stimulated by local genomic architecture</article-title>
<source>Hum. Mol. Genet.</source>
<year>2009</year>
<volume>18</volume>
<fpage>3579</fpage>
<lpage>3593</lpage>
<pub-id pub-id-type="pmid">19578123</pub-id>
</element-citation>
</ref>
<ref id="B51">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vogler</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Effect of repeat copy number on variable-number tandem repeat mutations in Escherichia coli O157:H7</article-title>
<source>J. Bacteriol.</source>
<year>2006</year>
<volume>188</volume>
<fpage>4253</fpage>
<lpage>4563</lpage>
<pub-id pub-id-type="pmid">16740932</pub-id>
</element-citation>
</ref>
<ref id="B52">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Warburton</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Analysis of the largest tandemly repeated DNA families in the human genome</article-title>
<source>BMC Genomics</source>
<year>2008</year>
<volume>9</volume>
<fpage>533</fpage>
<pub-id pub-id-type="pmid">18992157</pub-id>
</element-citation>
</ref>
<ref id="B53">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wells</surname>
<given-names>RD</given-names>
</name>
</person-group>
<article-title>DNA triplexes and Friedreich ataxia</article-title>
<source>FASEB J.</source>
<year>2008</year>
<volume>22</volume>
<fpage>1625</fpage>
<lpage>1634</lpage>
<pub-id pub-id-type="pmid">18211957</pub-id>
</element-citation>
</ref>
<ref id="B54">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Wexler</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Finding approximate tandem repeats in genomic sequences</article-title>
<source>Proceedings of the Eighth Annual International Conference on Resaerch in Computational Molecular Biology (RECOMB 2004)</source>
<year>2004</year>
<publisher-loc>New York, NY, USA</publisher-loc>
<publisher-name>ACM</publisher-name>
<fpage>223</fpage>
<lpage>232</lpage>
</element-citation>
</ref>
<ref id="B55">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wexler</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Finding approximate tandem repeats in genomic sequences</article-title>
<source>J. Comput. Biol.</source>
<year>2005</year>
<volume>12</volume>
<fpage>928</fpage>
<lpage>942</lpage>
<pub-id pub-id-type="pmid">16201913</pub-id>
</element-citation>
</ref>
<ref id="B56">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wooster</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Instability of short tandem repeats (microsatellites) in human cancers</article-title>
<source>Nat. Genet.</source>
<year>1994</year>
<volume>6</volume>
<fpage>152</fpage>
<lpage>156</lpage>
<pub-id pub-id-type="pmid">8162069</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/TelematiV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000433 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000433 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    TelematiV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:2881393
   |texte=   TRStalker: an efficient heuristic for finding fuzzy tandem repeats
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:20529928" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a TelematiV1 

Wicri

This area was generated with Dilib version V0.6.31.
Data generation: Thu Nov 2 16:09:04 2017. Site generation: Sun Mar 10 16:42:28 2024