MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Clustering of reads with alignment-free measures and quality values

Internal identifier: 000244 (Pmc/Corpus); previous: 000243; next: 000245

Authors: Matteo Comin; Andrea Leoni; Michele Schimd

Source:

RBID : PMC:4331138

Abstract

Background

The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads into a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mer counts, have been used to cluster reads.

Quality scores produced by NGS platforms are fundamental for various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%).

Results

In this scenario it will be fundamental to exploit quality value information within the alignment-free framework. To the best of our knowledge, this is the first study that incorporates quality value information and k-mer counts, in the context of alignment-free measures, for the comparison of read data. Based on these principles, in this paper we present a family of alignment-free measures called D^q-type. A set of experiments on simulated and real read data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. Results on de novo assembly and metagenomic read classification also show that the introduction of quality values improves over standard alignment-free measures. These statistics are implemented in software called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).


Url:
DOI: 10.1186/s13015-014-0029-x
PubMed: 25691913
PubMed Central: 4331138

Links to Exploration step

PMC:4331138

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Clustering of reads with alignment-free measures and quality values</title>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation>
<nlm:aff id="Aff1">Department of Information Engineering, University of Padova, Padova, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leoni, Andrea" sort="Leoni, Andrea" uniqKey="Leoni A" first="Andrea" last="Leoni">Andrea Leoni</name>
<affiliation>
<nlm:aff id="Aff1">Department of Information Engineering, University of Padova, Padova, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Schimd, Michele" sort="Schimd, Michele" uniqKey="Schimd M" first="Michele" last="Schimd">Michele Schimd</name>
<affiliation>
<nlm:aff id="Aff1">Department of Information Engineering, University of Padova, Padova, Italy</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25691913</idno>
<idno type="pmc">4331138</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331138</idno>
<idno type="RBID">PMC:4331138</idno>
<idno type="doi">10.1186/s13015-014-0029-x</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000244</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000244</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Clustering of reads with alignment-free measures and quality values</title>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation>
<nlm:aff id="Aff1">Department of Information Engineering, University of Padova, Padova, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leoni, Andrea" sort="Leoni, Andrea" uniqKey="Leoni A" first="Andrea" last="Leoni">Andrea Leoni</name>
<affiliation>
<nlm:aff id="Aff1">Department of Information Engineering, University of Padova, Padova, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Schimd, Michele" sort="Schimd, Michele" uniqKey="Schimd M" first="Michele" last="Schimd">Michele Schimd</name>
<affiliation>
<nlm:aff id="Aff1">Department of Information Engineering, University of Padova, Padova, Italy</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Algorithms for Molecular Biology : AMB</title>
<idno type="eISSN">1748-7188</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads into a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on
<italic>k</italic>
-mer counts, have been used to cluster reads.</p>
<p>Quality scores produced by NGS platforms are fundamental for various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15
<italic>%</italic>
).</p>
</sec>
<sec>
<title>Results</title>
<p>In this scenario it will be fundamental to exploit quality value information within the alignment-free framework. To the best of our knowledge, this is the first study that incorporates quality value information and
<italic>k</italic>
-mer counts, in the context of alignment-free measures, for the comparison of read data. Based on these principles, in this paper we present a family of alignment-free measures called
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type. A set of experiments on simulated and real read data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. Results on
<italic>de novo</italic>
assembly and metagenomic read classification also show that the introduction of quality values improves over standard alignment-free measures. These statistics are implemented in software called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Medini, D" uniqKey="Medini D">D Medini</name>
</author>
<author>
<name sortKey="Serruto, D" uniqKey="Serruto D">D Serruto</name>
</author>
<author>
<name sortKey="Parkhill, J" uniqKey="Parkhill J">J Parkhill</name>
</author>
<author>
<name sortKey="Relman, Da" uniqKey="Relman D">DA Relman</name>
</author>
<author>
<name sortKey="Donati, C" uniqKey="Donati C">C Donati</name>
</author>
<author>
<name sortKey="Moxon, R" uniqKey="Moxon R">R Moxon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jothi, R" uniqKey="Jothi R">R Jothi</name>
</author>
<author>
<name sortKey="Cuddapah, S" uniqKey="Cuddapah S">S Cuddapah</name>
</author>
<author>
<name sortKey="Barski, A" uniqKey="Barski A">A Barski</name>
</author>
<author>
<name sortKey="Cui, K" uniqKey="Cui K">K Cui</name>
</author>
<author>
<name sortKey="Zhao, K" uniqKey="Zhao K">K Zhao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sims, Ge" uniqKey="Sims G">GE Sims</name>
</author>
<author>
<name sortKey="Jun, S R" uniqKey="Jun S">S-R Jun</name>
</author>
<author>
<name sortKey="Wu, Ga" uniqKey="Wu G">GA Wu</name>
</author>
<author>
<name sortKey="Kim, S H" uniqKey="Kim S">S-H Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Verzotto, D" uniqKey="Verzotto D">D Verzotto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, K" uniqKey="Song K">K Song</name>
</author>
<author>
<name sortKey="Ren, J" uniqKey="Ren J">J Ren</name>
</author>
<author>
<name sortKey="Zhai, Z" uniqKey="Zhai Z">Z Zhai</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author>
<name sortKey="Deng, M" uniqKey="Deng M">M Deng</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Schimd, M" uniqKey="Schimd M">M Schimd</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dai, Q" uniqKey="Dai Q">Q Dai</name>
</author>
<author>
<name sortKey="Wang, T" uniqKey="Wang T">T Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gao, L" uniqKey="Gao L">L Gao</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Luo, H" uniqKey="Luo H">H Luo</name>
</author>
<author>
<name sortKey="Hao, B" uniqKey="Hao B">B Hao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goke, J" uniqKey="Goke J">J Göke</name>
</author>
<author>
<name sortKey="Schulz, Mh" uniqKey="Schulz M">MH Schulz</name>
</author>
<author>
<name sortKey="Lasserre, J" uniqKey="Lasserre J">J Lasserre</name>
</author>
<author>
<name sortKey="Vingron, M" uniqKey="Vingron M">M Vingron</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kantorovitz, Mr" uniqKey="Kantorovitz M">MR Kantorovitz</name>
</author>
<author>
<name sortKey="Robinson, Ge" uniqKey="Robinson G">GE Robinson</name>
</author>
<author>
<name sortKey="Sinha, S" uniqKey="Sinha S">S Sinha</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Verzotto, D" uniqKey="Verzotto D">D Verzotto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Antonello, M" uniqKey="Antonello M">M Antonello</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Antonello, M" uniqKey="Antonello M">M Antonello</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Verzotto, D" uniqKey="Verzotto D">D Verzotto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Verzotto, D" uniqKey="Verzotto D">D Verzotto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qu, W" uniqKey="Qu W">W Qu</name>
</author>
<author>
<name sortKey="Hashimoto, S I" uniqKey="Hashimoto S">S-i Hashimoto</name>
</author>
<author>
<name sortKey="Morishita, S" uniqKey="Morishita S">S Morishita</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bao, E" uniqKey="Bao E">E Bao</name>
</author>
<author>
<name sortKey="Jiang, T" uniqKey="Jiang T">T Jiang</name>
</author>
<author>
<name sortKey="Kaloshian, I" uniqKey="Kaloshian I">I Kaloshian</name>
</author>
<author>
<name sortKey="Girke, T" uniqKey="Girke T">T Girke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solovyov, A" uniqKey="Solovyov A">A Solovyov</name>
</author>
<author>
<name sortKey="Lipkin, W" uniqKey="Lipkin W">W Lipkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Albers, Ca" uniqKey="Albers C">CA Albers</name>
</author>
<author>
<name sortKey="Lunter, G" uniqKey="Lunter G">G Lunter</name>
</author>
<author>
<name sortKey="Macarthur, Dg" uniqKey="Macarthur D">DG MacArthur</name>
</author>
<author>
<name sortKey="Mcvean, G" uniqKey="Mcvean G">G McVean</name>
</author>
<author>
<name sortKey="Ouwehand, Wh" uniqKey="Ouwehand W">WH Ouwehand</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Carneiro, Mo" uniqKey="Carneiro M">MO Carneiro</name>
</author>
<author>
<name sortKey="Russ, C" uniqKey="Russ C">C Russ</name>
</author>
<author>
<name sortKey="Ross, Mg" uniqKey="Ross M">MG Ross</name>
</author>
<author>
<name sortKey="Gabriel, Sb" uniqKey="Gabriel S">SB Gabriel</name>
</author>
<author>
<name sortKey="Nusbaum, C" uniqKey="Nusbaum C">C Nusbaum</name>
</author>
<author>
<name sortKey="Depristo, Ma" uniqKey="Depristo M">MA DePristo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blaisdell, Be" uniqKey="Blaisdell B">BE Blaisdell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lippert, Ra" uniqKey="Lippert R">RA Lippert</name>
</author>
<author>
<name sortKey="Huang, H" uniqKey="Huang H">H Huang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G Reinert</name>
</author>
<author>
<name sortKey="Chew, D" uniqKey="Chew D">D Chew</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wan, L" uniqKey="Wan L">L Wan</name>
</author>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G Reinert</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ewing, B" uniqKey="Ewing B">B Ewing</name>
</author>
<author>
<name sortKey="Green, P" uniqKey="Green P">P Green</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Leoni, A" uniqKey="Leoni A">A Leoni</name>
</author>
<author>
<name sortKey="Schimd, M" uniqKey="Schimd M">M Schimd</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Algorithms Mol Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">Algorithms Mol Biol</journal-id>
<journal-title-group>
<journal-title>Algorithms for Molecular Biology : AMB</journal-title>
</journal-title-group>
<issn pub-type="epub">1748-7188</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25691913</article-id>
<article-id pub-id-type="pmc">4331138</article-id>
<article-id pub-id-type="publisher-id">29</article-id>
<article-id pub-id-type="doi">10.1186/s13015-014-0029-x</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Clustering of reads with alignment-free measures and quality values</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Comin</surname>
<given-names>Matteo</given-names>
</name>
<address>
<email>comin@dei.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Leoni</surname>
<given-names>Andrea</given-names>
</name>
<address>
<email>andrea.leoni@studenti.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Schimd</surname>
<given-names>Michele</given-names>
</name>
<address>
<email>schimdmi@dei.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1">Department of Information Engineering, University of Padova, Padova, Italy</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>28</day>
<month>1</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>28</day>
<month>1</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>10</volume>
<elocation-id>4</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>11</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>12</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>© Comin et al.; licensee BioMed Central. 2015</copyright-statement>
<license license-type="open-access">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads into a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on
<italic>k</italic>
-mer counts, have been used to cluster reads.</p>
<p>Quality scores produced by NGS platforms are fundamental for various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15
<italic>%</italic>
).</p>
</sec>
<sec>
<title>Results</title>
<p>In this scenario it will be fundamental to exploit quality value information within the alignment-free framework. To the best of our knowledge, this is the first study that incorporates quality value information and
<italic>k</italic>
-mer counts, in the context of alignment-free measures, for the comparison of read data. Based on these principles, in this paper we present a family of alignment-free measures called
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type. A set of experiments on simulated and real read data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. Results on
<italic>de novo</italic>
assembly and metagenomic read classification also show that the introduction of quality values improves over standard alignment-free measures. These statistics are implemented in software called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Alignment-free measures</kwd>
<kwd>Reads quality values</kwd>
<kwd>Reads clustering</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2015</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems [
<xref ref-type="bibr" rid="CR1">1</xref>
]. Current technologies produce over 500 billion bases of DNA per run, and the forthcoming sequencers promise to increase this throughput. The rapid improvement of sequencing technologies has enabled a number of different sequencing-based applications like genome resequencing, RNA-Seq, ChIP-Seq and many others [
<xref ref-type="bibr" rid="CR2">2</xref>
]. Handling and processing such large files is becoming one of the major challenges in most genome research projects.</p>
<p>Alignment-based methods have been used for quite some time to establish similarity between sequences [
<xref ref-type="bibr" rid="CR3">3</xref>
]. However, there are cases where alignment methods cannot be applied or are not well suited.</p>
<p>For example, the comparison of whole genomes is impossible to conduct with traditional alignment techniques, because events such as rearrangements cannot be captured by an alignment [
<xref ref-type="bibr" rid="CR4">4</xref>
-
<xref ref-type="bibr" rid="CR6">6</xref>
]. Although fast alignment heuristics exist, another drawback is that alignment methods are usually time consuming, so they are poorly suited to the large-scale sequence data produced by Next-Generation Sequencing (NGS) technologies [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]. For these reasons a number of alignment-free techniques have been proposed over the years [
<xref ref-type="bibr" rid="CR9">9</xref>
].</p>
<p>The use of alignment-free methods for comparing sequences has proved useful in different applications. Researchers have shown that the use of
<italic>k</italic>
-mer frequencies can improve the construction of phylogenetic trees traditionally based on a multiple-sequence alignment, especially for distantly related species [
<xref ref-type="bibr" rid="CR10">10</xref>
]. Some alignment-free measures use the distribution of patterns to study evolutionary relationships among different organisms [
<xref ref-type="bibr" rid="CR4">4</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
,
<xref ref-type="bibr" rid="CR12">12</xref>
]. The efficiency of alignment-free measures also allows the reconstruction of phylogenies for whole genomes [
<xref ref-type="bibr" rid="CR4">4</xref>
-
<xref ref-type="bibr" rid="CR6">6</xref>
]. Several alignment-free methods have been devised for the detection of enhancers in ChIP-Seq data [
<xref ref-type="bibr" rid="CR13">13</xref>
-
<xref ref-type="bibr" rid="CR15">15</xref>
] and also of entropic profiles [
<xref ref-type="bibr" rid="CR16">16</xref>
,
<xref ref-type="bibr" rid="CR17">17</xref>
]. Another application is the classification of remotely related proteins, which can be addressed with sophisticated word counting procedures [
<xref ref-type="bibr" rid="CR18">18</xref>
,
<xref ref-type="bibr" rid="CR19">19</xref>
]. The assembly-free comparison of genomes based on NGS reads has been investigated only recently [
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
]. For a comprehensive review of alignment-free measures and applications we refer the reader to [
<xref ref-type="bibr" rid="CR9">9</xref>
].</p>
<p>In this study we explore the ability of alignment-free measures to cluster read data. Clustering techniques are widely used in many different applications based on NGS data, from error correction [
<xref ref-type="bibr" rid="CR20">20</xref>
] to the discovery of groups of microRNAs [
<xref ref-type="bibr" rid="CR21">21</xref>
]. With the increasing throughput of NGS technologies, another important aspect is the reduction of data complexity by collapsing redundant reads into a single cluster to improve the run time, memory requirements, and quality of subsequent steps like assembly.</p>
<p>In [
<xref ref-type="bibr" rid="CR22">22</xref>
] Solovyov
<italic>et al.</italic>
presented one of the first comparisons of alignment-free measures applied to NGS read clustering. They focused on clustering reads coming from different genes and different species based on
<italic>k</italic>
-mer counts. They showed that
<italic>D</italic>
-type measures (see next section), in particular
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D_{2}^{*}$ \end{document}</tex-math>
<mml:math id="M2">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
, can efficiently detect and cluster reads from the same gene or species (as opposed to [
<xref ref-type="bibr" rid="CR21">21</xref>
] where the clustering is focused on errors). In this paper we extend this study by incorporating quality value information into these measures.</p>
<p>Quality scores produced by NGS platforms are fundamental for various analyses of NGS data: mapping reads to a reference genome [
<xref ref-type="bibr" rid="CR23">23</xref>
]; error correction [
<xref ref-type="bibr" rid="CR20">20</xref>
]; detection of insertions and deletions [
<xref ref-type="bibr" rid="CR24">24</xref>
] and many others. Moreover, future-generation sequencing technologies will produce longer and less biased reads with a large number of erroneous bases [
<xref ref-type="bibr" rid="CR25">25</xref>
]. The average error rate per read will grow up to 15
<italic>%</italic>
, so it will be fundamental to exploit quality value information both within the alignment-free framework and in
<italic>de novo</italic>
assembly, where longer and less biased reads could have a dramatic impact.</p>
<p>Most applications require as input a set of reads that is error-free, so they need to pre-process the data with a filter. Quality values are usually used to detect low-quality reads, which most applications discard. As error rates increase, the ability to work with erroneous reads will be fundamental. Moreover, in this scenario, quality values are used only during pre-processing to select error-free reads. Approximately half of the data produced by a sequencer are quality values, yet they are discarded after pre-processing. In this paper we pave the way to a new paradigm in which quality values also play a major role in the analysis of read data.</p>
<p>In the following section we briefly review some alignment-free measures. Then we present a new family of statistics, called
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type
<sup>a</sup>
, that take advantage of quality values. The software QCluster is discussed and relevant results on simulated and real data are presented in the results section. In the last section we summarize the findings and we discuss future directions of investigation.</p>
</sec>
<sec id="Sec2">
<title>Previous work on alignment-free measures</title>
<p>One of the first papers that introduced an alignment-free method is due to Blaisdell in 1986 [
<xref ref-type="bibr" rid="CR26">26</xref>
]. He proposed a statistic called
<italic>D</italic>
<sub>2</sub>
, to study the correlation between two sequences. The initial purpose was to speed up database searches, where alignment-based methods were too slow. The
<italic>D</italic>
<sub>2</sub>
similarity is the correlation between the number of occurrences of all
<italic>k</italic>
-mers appearing in two sequences. Let
<italic>X</italic>
and
<italic>Y</italic>
be two sequences from an alphabet
<italic>Σ</italic>
. The value
<italic>X</italic>
<sub>
<italic>w</italic>
</sub>
is the number of times
<italic>w</italic>
appears in
<italic>X</italic>
, with possible overlaps. Then the
<italic>D</italic>
<sub>2</sub>
statistic is:
<disp-formula id="Equa">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ D_{2}= \sum_{w \in \Sigma^{k}} X_{w} Y_{w}. $$ \end{document}</tex-math>
<mml:math id="M4">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
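<p>The definition above can be illustrated with a minimal Python sketch. It is not taken from QCluster: the function names kmer_counts and d2 are hypothetical, and the sketch assumes plain string sequences with overlapping occurrences counted, as stated in the text.</p>

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer of seq, overlapping occurrences included (X_w)."""
    # A sequence shorter than k yields an empty range, hence no k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """D2 = sum over all k-mers w of X_w * Y_w (inner product of counts)."""
    x_counts = kmer_counts(x, k)
    y_counts = kmer_counts(y, k)
    # k-mers absent from either sequence contribute zero, so it is
    # enough to iterate over the k-mers observed in x.
    return sum(n * y_counts[w] for w, n in x_counts.items())
```

<p>For instance, d2("ACGTACGT", "ACGT", 2) evaluates to 6: the 2-mers AC, CG and GT each occur twice in the first sequence and once in the second, while TA occurs only in the first and contributes nothing.</p>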
<p>This is the inner product of the word vectors
<italic>X</italic>
<sub>
<italic>w</italic>
</sub>
and
<italic>Y</italic>
<sub>
<italic>w</italic>
</sub>
, each one representing the number of occurrences of words of length
<italic>k</italic>
,
<italic>i.e.</italic>
<italic>k</italic>
-mers, in the two sequences. However, it was shown by Lippert
<italic>et al.</italic>
[
<xref ref-type="bibr" rid="CR27">27</xref>
] that the
<italic>D</italic>
<sub>2</sub>
statistic can be biased by the stochastic noise in each sequence. To address this issue another popular statistic, called
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${D_{2}^{z}}$ \end{document}</tex-math>
<mml:math id="M6">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
, was introduced in [
<xref ref-type="bibr" rid="CR14">14</xref>
]. This measure was proposed to standardize the
<italic>D</italic>
<sub>2</sub>
in the following manner:
<disp-formula id="Equb">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${D_{2}^{z}} = \frac{D_{2} - \mathbb{E}(D_{2})} {\mathbb{V}(D_{2})}, $$ \end{document}</tex-math>
<mml:math id="M8">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="double-struck">𝔼</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="double-struck">𝕍</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $\mathbb {E}(D_{2})$ \end{document}</tex-math>
<mml:math id="M10">
<mml:mi mathvariant="double-struck">𝔼</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $\mathbb {V}(D_{2})$ \end{document}</tex-math>
<mml:math id="M12">
<mml:mi mathvariant="double-struck">𝕍</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
are the expectation and the standard deviation of
<italic>D</italic>
<sub>2</sub>
, respectively. Although the
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${D_{2}^{z}}$ \end{document}</tex-math>
<mml:math id="M14">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
similarity improves
<italic>D</italic>
<sub>2</sub>
, it is still dominated by the specific variation of each pattern from the background [
<xref ref-type="bibr" rid="CR28">28</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
]. To account for different distributions of the
<italic>k</italic>
-mers, in [
<xref ref-type="bibr" rid="CR28">28</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
] two other new statistics are defined and named
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D_{2}^{*}$ \end{document}</tex-math>
<mml:math id="M16">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${D_{2}^{s}}$ \end{document}</tex-math>
<mml:math id="M18">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Let
<inline-formula id="IEq8">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $\tilde {X}_{w}=X_{w} - (n-k+1)*p_{w}$ \end{document}</tex-math>
<mml:math id="M20">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
<mml:mo>∗</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq9">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $\tilde {Y}_{w}=Y_{w} - (n-k+1)*p_{w}$ \end{document}</tex-math>
<mml:math id="M22">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
<mml:mo>∗</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
where
<italic>p</italic>
<sub>
<italic>w</italic>
</sub>
is the probability of
<italic>w</italic>
under the null model. Then
<inline-formula id="IEq10">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D_{2}^{*}$ \end{document}</tex-math>
<mml:math id="M24">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq11">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${D_{2}^{s}}$ \end{document}</tex-math>
<mml:math id="M26">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq11.gif"></inline-graphic>
</alternatives>
</inline-formula>
can be defined as follows:
<disp-formula id="Equc">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$D_{2}^{*} = \sum _{w \in \Sigma^{k}} \frac{\tilde{X}_{w} \tilde{Y}_{w}}{(n-k+1)p_{w}} $$ \end{document}</tex-math>
<mml:math id="M28">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equc.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
and,
<disp-formula id="Equd">
<alternatives>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$D_{2}^{s} = \sum_{w \in \Sigma^{k}} \frac{\tilde{X}_{w} \tilde{Y}_{w}}{\sqrt{\tilde{X}^{2}_{w} + \tilde{Y}^{2}_{w}}}. $$ \end{document}</tex-math>
<mml:math id="M30">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equd.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
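<p>The two centred statistics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: both reads are assumed to have the same length <italic>n</italic>, the null model is taken to be i.i.d. with uniform base probabilities unless <italic>base_prob</italic> is supplied, and words whose two centred counts are both zero are skipped in the second statistic to avoid a 0/0 term. All function and parameter names are hypothetical.</p>
<preformat>
```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every k-mer of seq, overlapping occurrences included."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_star_and_s(x, y, k, base_prob=None):
    """Return (D2*, D2s) for two reads of equal length n."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes reads of equal length")
    # Assumption: uniform i.i.d. null model when no base probabilities are given.
    base_prob = base_prob or {b: 0.25 for b in "ACGT"}
    m = len(x) - k + 1  # number of k-mer positions, n - k + 1
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    d_star = d_s = 0.0
    for letters in product(base_prob, repeat=k):
        w = "".join(letters)
        p_w = prod(base_prob[b] for b in letters)
        xt = cx[w] - m * p_w  # centred count of w in x
        yt = cy[w] - m * p_w  # centred count of w in y
        d_star += xt * yt / (m * p_w)
        norm = sqrt(xt * xt + yt * yt)
        if norm > 0:  # skip the 0/0 term
            d_s += xt * yt / norm
    return d_star, d_s
```
</preformat>
<p>Unlike the plain inner product, both sums run over every word of length <italic>k</italic>, because a word that is absent from a read still has a non-zero centred count.</p>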
<p>This latter similarity measure addresses the need to normalize
<italic>D</italic>
<sub>2</sub>
. This set of alignment-free measures is usually called
<italic>D</italic>
-type statistics. All these statistics have been studied by Reinert
<italic>et al.</italic>
[
<xref ref-type="bibr" rid="CR28">28</xref>
] and Wan
<italic>et al.</italic>
[
<xref ref-type="bibr" rid="CR29">29</xref>
] for the detection of regulatory sequences. From the word vectors
<italic>X</italic>
<sub>
<italic>w</italic>
</sub>
and
<italic>Y</italic>
<sub>
<italic>w</italic>
</sub>
several other measures can be computed, such as
<italic>L</italic>
<sub>2</sub>
, Kullback-Leibler divergence (KL), symmetrized KL [
<xref ref-type="bibr" rid="CR22">22</xref>
] etc.</p>
</sec>
<sec id="Sec3">
<title>Comparison of reads with quality values</title>
<sec id="Sec4">
<title>Background on quality values</title>
<p>Upon producing base calls for a read
<italic>x</italic>
, sequencing machines also assign a
<italic>quality score</italic>
<italic>Q</italic>
<sub>
<italic>x</italic>
</sub>
(
<italic>i</italic>
) to each base in the read. These scores are usually given as the
<italic>phred</italic>
-scaled probability [
<xref ref-type="bibr" rid="CR30">30</xref>
] of the
<italic>i</italic>
-th base being wrong
<disp-formula id="Eque">
<alternatives>
<tex-math id="M31">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${} Q_{x}(i) = -10 \log_{10}{Prob\{\text{the base \textit{i} of read \textit{x} is wrong }\}}. $$ \end{document}</tex-math>
<mml:math id="M32">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>10</mml:mn>
<mml:munder>
<mml:mrow>
<mml:mo>log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:munder>
<mml:mtext mathvariant="italic">Prob</mml:mtext>
<mml:mo>{</mml:mo>
<mml:mtext>the base</mml:mtext>
<mml:mtext mathvariant="italic">i</mml:mtext>
<mml:mtext>of read</mml:mtext>
<mml:mtext mathvariant="italic">x</mml:mtext>
<mml:mtext mathvariant="italic">is wrong</mml:mtext>
<mml:mo>}</mml:mo>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Eque.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
For example, if
<italic>Q</italic>
<sub>
<italic>x</italic>
</sub>
(
<italic>i</italic>
) = 30 then there is a 1 in 1000 chance that base
<italic>i</italic>
of read
<italic>x</italic>
is incorrect.</p>
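<p>For concreteness, the conversion between a phred score and an error probability can be sketched as below. The helper names are hypothetical, and the ASCII offset of 33 used to decode a FASTQ quality string is an assumption about the input encoding (Phred+33), not something stated in this article.</p>
<preformat>
```python
def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_fastq_quality(qual: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into integer phred scores.

    The default offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(c) - offset for c in qual]
```
</preformat>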
<p>If we assume that quality values are produced independently of each other (similarly to [
<xref ref-type="bibr" rid="CR23">23</xref>
]), we can calculate the probability of an entire read
<italic>x</italic>
being correct as:
<disp-formula id="Equf">
<alternatives>
<tex-math id="M33">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$P_{x}\left\{\text{the read \textit{x} is correct}\right\}= \prod_{j=0}^{n-1}{\left(1- 10^{- Q_{x}(j)/{10}}\right)} $$ \end{document}</tex-math>
<mml:math id="M34">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced close="}" open="{" separators="">
<mml:mrow>
<mml:mtext>the read</mml:mtext>
<mml:mtext mathvariant="italic">x</mml:mtext>
<mml:mtext>is correct</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∏</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equf.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>n</italic>
is the length of the read
<italic>x</italic>
. In the same way we define the probability of a word
<italic>w</italic>
of length
<italic>k</italic>
, occurring at position
<italic>i</italic>
of read
<italic>x</italic>
being correct as:
<disp-formula id="Equg">
<alternatives>
<tex-math id="M35">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{aligned} P_{w,i}&\left\{\text{the word \textit{w} at position \textit{i} of read \textit{x} is correct}\right\}\\ &= \prod_{j=0}^{k-1}{\left(1- 10^{-Q_{x}(i+j)/{10}}\right)}. \end{aligned} $$ \end{document}</tex-math>
<mml:math id="M36">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:mfenced close="}" open="{" separators="">
<mml:mrow>
<mml:mtext>the word</mml:mtext>
<mml:mtext mathvariant="italic">w</mml:mtext>
<mml:mtext>at position</mml:mtext>
<mml:mtext mathvariant="italic">i</mml:mtext>
<mml:mtext>of read</mml:mtext>
<mml:mtext mathvariant="italic">x</mml:mtext>
<mml:mtext>is correct</mml:mtext>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∏</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mi>.</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equg.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
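<p>Under the independence assumption, both probabilities are products over per-base terms. The following sketch, with hypothetical function names, takes a list of phred scores and uses 0-based positions as in the formulas above.</p>
<preformat>
```python
from math import prod


def prob_correct(quals) -> float:
    """Probability that every base with the given phred scores is correct."""
    return prod(1 - 10 ** (-q / 10) for q in quals)


def word_prob_correct(quals, i: int, k: int) -> float:
    """Probability that the k-mer starting at 0-based position i is correct."""
    return prob_correct(quals[i:i + k])
```
</preformat>
<p>For instance, a 2-mer whose two bases both have quality 10 receives weight 0.9 × 0.9 = 0.81.</p>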
<p>In all previous alignment-free statistics the
<italic>k</italic>
-mers are counted such that each occurrence contributes 1 irrespective of its quality. Here we instead use the quality of each occurrence, so as to account for erroneous
<italic>k</italic>
-mers. The idea is to model sequencing as the process of reading
<italic>k</italic>
-mers from the reference and assigning a probability to them. Thus the word probability defined above can be used to weight the occurrences of all
<italic>k</italic>
-mers used in the previous statistics.</p>
</sec>
<sec id="Sec5">
<title>New
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type statistics</title>
<p>We extend here
<italic>D</italic>
-type statistics [
<xref ref-type="bibr" rid="CR28">28</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
] to account for quality values. By defining
<inline-formula id="IEq12">
<alternatives>
<tex-math id="M37">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${X_{w}^{q}}$ \end{document}</tex-math>
<mml:math id="M38">
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq12.gif"></inline-graphic>
</alternatives>
</inline-formula>
as the sum of probabilities of all the occurrences of
<italic>w</italic>
in
<italic>x</italic>
:
<disp-formula id="Equh">
<alternatives>
<tex-math id="M39">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$X_{w}^{q} = \sum_{i \in \left\{i| \text{\textit{w} occurs in \textit{x} at position \textit{i}}\right\} }P_{w,i} $$ \end{document}</tex-math>
<mml:math id="M40">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mfenced close="}" open="{" separators="">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>|</mml:mo>
<mml:mtext mathvariant="italic">w</mml:mtext>
<mml:mtext>occurs in</mml:mtext>
<mml:mtext mathvariant="italic">x</mml:mtext>
<mml:mtext>at position</mml:mtext>
<mml:mtext mathvariant="italic">i</mml:mtext>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equh.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
we assign a weight (
<italic>i.e.</italic>
a probability) to each occurrence of
<italic>w</italic>
. Now
<inline-formula id="IEq13">
<alternatives>
<tex-math id="M41">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${X_{w}^{q}}$ \end{document}</tex-math>
<mml:math id="M42">
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq13.gif"></inline-graphic>
</alternatives>
</inline-formula>
can be used instead of
<italic>X</italic>
<sub>
<italic>w</italic>
</sub>
to compute the alignment-free statistics. Note that, by using
<inline-formula id="IEq14">
<alternatives>
<tex-math id="M43">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${X_{w}^{q}}$ \end{document}</tex-math>
<mml:math id="M44">
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq14.gif"></inline-graphic>
</alternatives>
</inline-formula>
, each occurrence is no longer counted as 1, but with a value in [0,1] depending on the reliability of the read. We can now define a new alignment-free statistic as:
<disp-formula id="Equi">
<alternatives>
<tex-math id="M45">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {D_{2}^{q}}= \sum_{w \in \Sigma^{k}} {X_{w}^{q}} {Y_{w}^{q}}. $$ \end{document}</tex-math>
<mml:math id="M46">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equi.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
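<p>A minimal sketch of the quality-weighted counts and of the resulting statistic is given below. It is illustrative only: the function names are hypothetical, and quality values are assumed to be supplied as a list of phred scores aligned with the read.</p>
<preformat>
```python
from collections import defaultdict
from math import prod


def weighted_kmer_counts(seq, quals, k):
    """Sum, for each k-mer, the correctness probability of every occurrence."""
    counts = defaultdict(float)
    for i in range(len(seq) - k + 1):
        # Weight of this occurrence: product of per-base correctness probabilities.
        weight = prod(1 - 10 ** (-q / 10) for q in quals[i:i + k])
        counts[seq[i:i + k]] += weight
    return counts


def d2q(x, qx, y, qy, k):
    """Quality-weighted D2: inner product of the weighted count vectors."""
    cx = weighted_kmer_counts(x, qx, k)
    cy = weighted_kmer_counts(y, qy, k)
    return sum(v * cy[w] for w, v in cx.items() if w in cy)
```
</preformat>
<p>When all quality values are very high, every weight approaches 1 and the value approaches the unweighted statistic.</p>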
<p>This is the extension of the
<italic>D</italic>
<sub>2</sub>
measure, in which occurrences are weighted based on quality scores. Following the previous section we can also define the centralized
<italic>k</italic>
-mer counts as follows:
<disp-formula id="Equj">
<alternatives>
<tex-math id="M47">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \tilde{{X_{w}^{q}}} = {X_{w}^{q}} - (n-k+1)p_{w} E(P_{w}) $$ \end{document}</tex-math>
<mml:math id="M48">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>−</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>E</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equj.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>n</italic>
=|
<italic>x</italic>
| is the length of
<italic>x</italic>
,
<italic>p</italic>
<sub>
<italic>w</italic>
</sub>
is the probability of the word
<italic>w</italic>
in the i.i.d. model and the expected number of occurrences (
<italic>n</italic>
−
<italic>k</italic>
+1)
<italic>p</italic>
<sub>
<italic>w</italic>
</sub>
is multiplied by
<italic>E</italic>
(
<italic>P</italic>
<sub>
<italic>w</italic>
</sub>
) which represents the expected probability of
<italic>k</italic>
-mer
<italic>w</italic>
based on the quality scores.</p>
<p>We can now extend two other popular alignment-free statistics:
<disp-formula id="Equk">
<alternatives>
<tex-math id="M49">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$D_{2}^{*q} = \sum_{w \in \Sigma^{k}} \frac{\tilde{{X_{w}^{q}}} \tilde{{Y_{w}^{q}}}}{(n-k+1)p_{w} E(P_{w})} $$ \end{document}</tex-math>
<mml:math id="M50">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mfrac>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
<mml:mover accent="true">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>E</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equk.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
and,
<disp-formula id="Equl">
<alternatives>
<tex-math id="M51">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$D_{2}^{sq} = \sum_{w \in \Sigma^{k}} \frac{\tilde{{X_{w}^{q}}}\tilde{{Y_{w}^{q}}}}{\sqrt{\tilde{{X_{w}^{q}}}^{2} + \tilde{{Y_{w}^{q}}}^{2}}}. $$ \end{document}</tex-math>
<mml:math id="M52">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">sq</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mfrac>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
<mml:mover accent="true">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equl.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
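<p>The last statistic above can be sketched in a few lines of Python. This is a minimal illustration and not the QCluster implementation: it assumes that the centered quality-weighted counts of the two reads have already been computed and are passed as dictionaries keyed by <italic>k</italic>-mer. The function name d2_sq, its arguments, and the choice to skip <italic>k</italic>-mers whose two centered counts are both zero (where the term would be 0/0) are assumptions made for the example.</p>
<preformat><![CDATA[
```python
import math


def d2_sq(x_centered, y_centered):
    """Sum over k-mers of x*y / sqrt(x^2 + y^2) for two centered count vectors.

    x_centered, y_centered: dicts mapping a k-mer to its centered,
    quality-weighted count (hypothetical input format).
    """
    total = 0.0
    for w in set(x_centered) | set(y_centered):
        xw = x_centered.get(w, 0.0)
        yw = y_centered.get(w, 0.0)
        norm = math.sqrt(xw * xw + yw * yw)
        if norm > 0.0:  # assumption: a 0/0 term contributes nothing
            total += xw * yw / norm
    return total


# Toy vectors, invented for illustration only.
print(d2_sq({"AC": 3.0, "CG": -4.0}, {"AC": 4.0, "CG": 3.0}))
```
]]></preformat>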
<p>We call these three alignment-free measures
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type. Now,
<italic>E</italic>
(
<italic>P</italic>
<sub>
<italic>w</italic>
</sub>
) depends on
<italic>w</italic>
and on the actual sequencing machine; it can therefore be very hard, if not impossible, to calculate precisely. However, if the set
<inline-graphic xlink:href="13015_2014_29_Figa_HTML.gif" id="d30e1999"></inline-graphic>
of all the reads is large enough, we can estimate the prior probability using the posterior relative frequency,
<italic>i.e.</italic>
the frequency observed on the actual set
<inline-graphic xlink:href="13015_2014_29_Figa_HTML.gif" id="d30e2005"></inline-graphic>
, similarly to [
<xref ref-type="bibr" rid="CR23">23</xref>
]. We assume that, given the quality values, the error probability on a base is independent of its position within the read and of all other quality values (see [
<xref ref-type="bibr" rid="CR23">23</xref>
]). We define two different approximations. The first estimates
<italic>E</italic>
(
<italic>P</italic>
<sub>
<italic>w</italic>
</sub>
) as the average error probability of the
<italic>k</italic>
-mer
<italic>w</italic>
among all reads
<inline-formula id="IEq15">
<alternatives>
<tex-math id="M53">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $x \in \mathbb {D}$ \end{document}</tex-math>
<mml:math id="M54">
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi mathvariant="double-struck">𝔻</mml:mi>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq15.gif"></inline-graphic>
</alternatives>
</inline-formula>
:
<disp-formula id="Equm">
<alternatives>
<tex-math id="M55">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$E(P_{w})\approx \frac{\sum_{x \in \mathbb{D}} {X_{w}^{q}} }{\sum_{x \in \mathbb{D}} X_{w}} $$ \end{document}</tex-math>
<mml:math id="M56">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>≈</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi mathvariant="double-struck">𝔻</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi mathvariant="double-struck">𝔻</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equm.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
while the second defines, for each base
<italic>j</italic>
of
<italic>w</italic>
, the average quality observed over all occurrences of
<italic>w</italic>
in
<inline-graphic xlink:href="13015_2014_29_Figa_HTML.gif" id="d30e2120"></inline-graphic>
:
<disp-formula id="Equn">
<alternatives>
<tex-math id="M57">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\overline{Q_{w}}[\!j]=\frac{\sum_{x \in \mathbb{D}} \sum_{i \in \{i| \text{ \textit{w} occurs in \textit{x} at position \textit{i}} \} } Q_{x}(i+j)}{\sum_{x \in \mathbb{D}} X_{w}} $$ \end{document}</tex-math>
<mml:math id="M58">
<mml:mrow>
<mml:mover accent="false">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mi>j</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi mathvariant="double-struck">𝔻</mml:mi>
</mml:mrow>
</mml:munder>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>|</mml:mo>
<mml:mtext mathvariant="italic">w</mml:mtext>
<mml:mtext>occurs in</mml:mtext>
<mml:mtext mathvariant="italic">x</mml:mtext>
<mml:mtext>at position</mml:mtext>
<mml:mtext mathvariant="italic">i</mml:mtext>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi mathvariant="double-struck">𝔻</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equn.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
and then uses these average quality values to compute the expected word probability:
<disp-formula id="Equo">
<alternatives>
<tex-math id="M59">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$E(P_{w})\approx \prod_{j=0}^{k-1}{\left(1- 10^{-\overline{Q_{w}}(j)/{10}}\right)} $$ \end{document}</tex-math>
<mml:math id="M60">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>≈</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∏</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mover accent="false">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<graphic xlink:href="13015_2014_29_Article_Equo.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>We call the first approximation
<italic>Average Word Probability (AWP)</italic>
and the second one
<italic>Average Quality Probability (AQP)</italic>
. Both these approximations are implemented within the software QCluster and tests are presented in the Experimental Results section.</p>
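<p>The two estimators can be sketched as follows. This is a minimal illustration under stated assumptions, not the QCluster code: reads are given as (sequence, list of Phred quality values) pairs, the quality-weighted count of one <italic>k</italic>-mer occurrence is taken to be the product of its per-base accuracies 1 − 10<sup>−Q/10</sup> (the same conversion used in the AQP formula), and the names phred_to_accuracy and estimate_expectations are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


def phred_to_accuracy(q):
    """Probability that a base with Phred quality q is correct."""
    return 1.0 - 10.0 ** (-q / 10.0)


def estimate_expectations(reads, k):
    """Return (awp, aqp): two estimates of E(P_w) for every observed k-mer.

    reads: list of (sequence, qualities) pairs; qualities are Phred integers.
    """
    count = defaultdict(int)                    # sum over reads of X_w
    weighted = defaultdict(float)               # sum over reads of X_w^q
    qual_sum = defaultdict(lambda: [0.0] * k)   # per-position quality totals
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            w = seq[i:i + k]
            count[w] += 1
            prob = 1.0
            for j in range(k):
                prob *= phred_to_accuracy(quals[i + j])
                qual_sum[w][j] += quals[i + j]
            weighted[w] += prob
    # AWP: average word probability over all occurrences of w.
    awp = {w: weighted[w] / count[w] for w in count}
    # AQP: word probability computed from the average quality per position.
    aqp = {}
    for w in count:
        p = 1.0
        for j in range(k):
            p *= phred_to_accuracy(qual_sum[w][j] / count[w])
        aqp[w] = p
    return awp, aqp


# Toy reads, invented for illustration only.
awp, aqp = estimate_expectations([("ACA", [10, 20, 10]), ("AC", [20, 20])], 2)
print(awp["AC"], aqp["AC"])
```
]]></preformat>
<p>In this sketch the two estimates coincide for a <italic>k</italic>-mer that occurs only once and differ when occurrences carry different quality values, because AWP averages probabilities whereas AQP averages quality values before converting them.</p>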
</sec>
<sec id="Sec6">
<title>Quality value redistribution</title>
<p>The meaning of quality values can be exploited further to extend and improve the above statistics. Suppose that the base
<italic>A</italic>
has quality 70%, meaning that the base is correct with probability 70% and incorrect with probability 30%. Ignoring insertion and deletion errors for the moment, if the four bases are equiprobable, then with probability 10% each the correct base is a
<italic>C</italic>
, or a
<italic>G</italic>
or a
<italic>T</italic>
. It is therefore possible to redistribute the “missing quality” among the other bases.</p>
<p>We can perform a more precise operation by redistributing the missing quality among other bases in proportion to their frequency in the read. For example, if the frequencies of the bases in the read are
<italic>A</italic>
=20%,
<italic>C</italic>
=30%,
<italic>G</italic>
=30%,
<italic>T</italic>
=20%, the resulting qualities, after the redistribution, will be:
<italic>A</italic>
=70%,
<italic>C</italic>
=30
<italic>%</italic>
∗30
<italic>%</italic>
/(30
<italic>%</italic>
+30
<italic>%</italic>
+20
<italic>%</italic>
)=11.25
<italic>%</italic>
,
<italic>G</italic>
=30
<italic>%</italic>
∗30
<italic>%</italic>
/(30
<italic>%</italic>
+30
<italic>%</italic>
+20
<italic>%</italic>
)=11.25
<italic>%</italic>
,
<italic>T</italic>
=30
<italic>%</italic>
∗20
<italic>%</italic>
/(30
<italic>%</italic>
+30
<italic>%</italic>
+20
<italic>%</italic>
)=7.5
<italic>%</italic>
. For an example see Table
<xref rid="Tab1" ref-type="table">1</xref>
.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>
<bold>Example of quality value redistribution of the word</bold>
<bold>
<italic>T</italic>
</bold>
<bold>
<italic>G</italic>
</bold>
<bold>
<italic>A</italic>
</bold>
<bold>
<italic>C</italic>
</bold>
<bold>
<italic>C</italic>
</bold>
<bold>
<italic>A</italic>
</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Original Word</th>
<th align="left">T</th>
<th align="left">G</th>
<th align="left">A</th>
<th align="left">C</th>
<th align="left">C</th>
<th align="left">A</th>
</tr>
<tr>
<th align="left">Accuracy</th>
<th align="left">X</th>
<th align="left">X</th>
<th align="left">70%</th>
<th align="left">X</th>
<th align="left">X</th>
<th align="left">X</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Possible Word 1</td>
<td align="center">T</td>
<td align="center">G</td>
<td align="center">C</td>
<td align="center">C</td>
<td align="center">C</td>
<td align="center">A</td>
</tr>
<tr>
<td align="left">Accuracy</td>
<td align="center">X</td>
<td align="center">X</td>
<td align="center">11.25%</td>
<td align="center">X</td>
<td align="center">X</td>
<td align="center">X</td>
</tr>
<tr>
<td align="left">Possible Word 2</td>
<td align="center">T</td>
<td align="center">G</td>
<td align="center">G</td>
<td align="center">C</td>
<td align="center">C</td>
<td align="center">A</td>
</tr>
<tr>
<td align="left">Accuracy</td>
<td align="center">X</td>
<td align="center">X</td>
<td align="center">11.25%</td>
<td align="center">X</td>
<td align="center">X</td>
<td align="center">X</td>
</tr>
<tr>
<td align="left">Possible Word 3</td>
<td align="center">T</td>
<td align="center">G</td>
<td align="center">T</td>
<td align="center">C</td>
<td align="center">C</td>
<td align="center">A</td>
</tr>
<tr>
<td align="left">Accuracy</td>
<td align="center">X</td>
<td align="center">X</td>
<td align="center">7.5%</td>
<td align="center">X</td>
<td align="center">X</td>
<td align="center">X</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>The same redistribution, with a slight approximation, can be extended to
<italic>k</italic>
-mer qualities. In more detail, we consider the case in which only one base is wrong, and thus redistribute the quality of only one base at a time. Given a
<italic>k</italic>
-mer, we generate all neighboring words that can be obtained by substitution of the wrong base. The quality of the replaced letter is calculated as in the previous example and the quality of the entire word is again given by the product of the qualities of all the bases in the new
<italic>k</italic>
-mer. We increment the corresponding entry of the vector
<inline-formula id="IEq16">
<alternatives>
<tex-math id="M61">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${X_{w}^{q}}$ \end{document}</tex-math>
<mml:math id="M62">
<mml:msubsup>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq16.gif"></inline-graphic>
</alternatives>
</inline-formula>
with the score obtained for the new
<italic>k</italic>
-mer. This process is repeated for all bases of the original
<italic>k</italic>
-mer. Thus every time we are evaluating the quality of a word, we are also scoring neighboring
<italic>k</italic>
-mers by redistributing the qualities. We do not consider the case where two or more bases are wrong simultaneously, because the computational cost would be too high and the quality of the resulting words would not appreciably affect the measures.</p>
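<p>The redistribution step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the QCluster code: per-base accuracies are given as probabilities, base frequencies of the read are passed in as a dictionary, and the function name neighbor_scores and the uniform fallback used when all other bases have zero frequency are hypothetical choices. The example reproduces the numbers of Table 1 by setting the accuracy of the unspecified positions to 1.</p>
<preformat><![CDATA[
```python
def neighbor_scores(kmer, accuracies, base_freq):
    """Score every single-substitution neighbor of kmer.

    kmer: string over the alphabet of base_freq.
    accuracies: per-base probability of being correct, same length as kmer.
    base_freq: dict mapping each base to its frequency in the read.
    Returns a dict mapping each neighboring k-mer to its redistributed score.
    """
    scores = {}
    for pos, base in enumerate(kmer):
        others = {b: f for b, f in base_freq.items() if b != base}
        total = sum(others.values())
        # Product of the accuracies of the bases that are left untouched.
        rest = 1.0
        for j, a in enumerate(accuracies):
            if j != pos:
                rest *= a
        missing = 1.0 - accuracies[pos]  # the "missing quality" of this base
        for b, f in others.items():
            # Share proportional to base frequency; uniform if all are zero.
            share = missing * f / total if total > 0 else missing / len(others)
            scores[kmer[:pos] + b + kmer[pos + 1:]] = rest * share
    return scores


# Frequencies and the 70% accuracy are taken from the worked example above.
freq = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
scores = neighbor_scores("TGACCA", [1.0, 1.0, 0.7, 1.0, 1.0, 1.0], freq)
print(scores["TGCCCA"], scores["TGGCCA"], scores["TGTCCA"])
```
]]></preformat>
<p>Each returned score would then be added to the entry of the quality-weighted count vector that corresponds to the neighboring <italic>k</italic>-mer.</p>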
</sec>
</sec>
<sec id="Sec7">
<title>QCluster: clustering of reads with
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type measures</title>
<p>Clustering is the process of partitioning a given set into
<italic>c</italic>
distinct disjoint subsets called
<italic>clusters</italic>
such that elements (
<italic>e.g.</italic>
reads) in the same cluster are at minimum distance from each other and at maximum distance from elements of other clusters. Centroid clustering associates with each cluster one point in the space of input elements, called the
<italic>centroid</italic>
, which need not be part of the input set. Each element is then assigned to the cluster whose centroid is closest under the chosen distance measure. A classical example of centroid clustering is the k-means algorithm.</p>
<p>We extend the software afcluster [
<xref ref-type="bibr" rid="CR22">22</xref>
], which uses k-means to cluster reads based on several distance measures:
<italic>L</italic>
<sub>2</sub>
which is the Euclidean norm, the Kullback-Leibler divergence and its symmetrized version, and
<italic>D</italic>
<sub>2</sub>
based measures. Starting from this software we developed QCluster by incorporating the computation of the
<inline-formula id="IEq17">
<alternatives>
<tex-math id="M63">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} ${D_{2}^{q}}$ \end{document}</tex-math>
<mml:math id="M64">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq17.gif"></inline-graphic>
</alternatives>
</inline-formula>
-type statistics described above using both
<italic>AWP</italic>
and
<italic>AQP</italic>
prior probability estimators and the redistribution of quality values.</p>
<p>The program takes as input a file in FASTQ format and performs centroid-based clustering (k-means) of the reads based on the counts and the qualities of
<italic>k</italic>
-mers. When using the
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type measures, one needs to choose the method for the computation of the expected word probability,
<italic>AWP</italic>
or
<italic>AQP</italic>
, and whether to apply quality redistribution.</p>
<p>Since some of the implemented distances (symmetrized KL,
<inline-formula id="IEq18">
<alternatives>
<tex-math id="M65">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M66">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq18.gif"></inline-graphic>
</alternatives>
</inline-formula>
) are not guaranteed to converge, we implemented a stopping criterion: the algorithm stops if the number of iterations without improvement over the best solution exceeds a given threshold, in which case the best solution found is returned. To reduce the bias due to the random initialization of the centroids, the best solution over several runs is reported. The number of runs can be set by the user; in our experiments we use 5.</p>
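<p>The stopping criterion and the multiple restarts can be sketched as follows. This is a simplified illustration, not the QCluster implementation: the function and parameter names (kmeans_best_of_runs, max_stale, runs, seed) are hypothetical, centroids are initialized by sampling input points, centroids are updated as component-wise means, and the distortion is taken to be the sum of distances of the points to their assigned centroids.</p>
<preformat><![CDATA[
```python
import random


def kmeans_best_of_runs(points, c, distance, max_stale=5, runs=5, seed=0):
    """k-means with a patience-based stop, keeping the best of several runs.

    Returns (distortion, labels) of the best clustering found.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        centroids = [list(p) for p in rng.sample(points, c)]
        best_run, stale = None, 0
        # Stop once more than max_stale iterations bring no improvement.
        while stale <= max_stale:
            # Assignment step: each point goes to its closest centroid.
            labels = [min(range(c), key=lambda j: distance(p, centroids[j]))
                      for p in points]
            distortion = sum(distance(p, centroids[l])
                             for p, l in zip(points, labels))
            if best_run is None or distortion < best_run[0]:
                best_run, stale = (distortion, labels), 0
            else:
                stale += 1
            # Update step: move each centroid to the mean of its members.
            for j in range(c):
                members = [p for p, l in zip(points, labels) if l == j]
                if members:
                    centroids[j] = [sum(col) / len(members)
                                    for col in zip(*members)]
        if best is None or best_run[0] < best[0]:
            best = best_run
    return best


def squared_euclidean(a, b):
    """Squared L2 distance, used here only as an example distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


# Toy data, invented for illustration only.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_best_of_runs(pts, 2, squared_euclidean))
```
]]></preformat>
<p>Tracking the best solution seen so far, rather than the last one, is what makes the procedure usable with distances for which the distortion does not decrease monotonically.</p>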
<p>Several other options, such as consensus clustering, reverse complement and different normalizations, are available. All implemented measures can be computed in linear time and space, which is desirable for large NGS datasets. QCluster is freely available (http://www.dei.unipd.it/~ciompin/main/qcluster.html); it is implemented in C++ and was compiled and tested with GNU GCC.</p>
</sec>
<sec id="Sec8">
<title>Experimental results</title>
<p>We performed several tests on both simulated and real datasets to assess the effectiveness of the different distances. In particular, we wanted to verify that using the additional information carried by quality values improves the clustering over that produced by the original algorithms.</p>
<p>For simulations we use the dataset of human mRNA genes downloaded from NCBI [
<xref ref-type="bibr" rid="CR31">31</xref>
], also used in [
<xref ref-type="bibr" rid="CR22">22</xref>
]. We randomly select 50 sets of 100 human mRNA sequences each, with the length of each sequence ranging between 500 and 10000 bases. From each sequence, 10000 reads of length 200 were simulated using Mason [
<xref ref-type="bibr" rid="CR32">32</xref>
,
<xref ref-type="bibr" rid="CR33">33</xref>
] with different parameters,
<italic>e.g.</italic>
percentage of mismatches and read length. We apply QCluster, using different distances, to the whole set of reads and then measure the quality of the clusters produced by evaluating the extent to which the partitioning agrees with the natural splitting of the sequences. In other words, we measure how well reads originating from the same sequence are grouped together. We calculate the recall rate as follows: for each mRNA sequence
<italic>S</italic>
we identify the set of reads originating from
<italic>S</italic>
. We look for the cluster
<italic>C</italic>
that contains most of the reads of
<italic>S</italic>
. The percentage of the
<italic>S</italic>
reads that have been grouped in
<italic>C</italic>
is the recall value for the sequence
<italic>S</italic>
. We repeat the same operation for each sequence and calculate the average value of recall rate over all sequences.</p>
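<p>The recall computation described above can be sketched as follows. This is a minimal illustration: the function name average_recall and the input format (two parallel lists giving, for each read, its source sequence and its assigned cluster) are assumptions made for the example.</p>
<preformat><![CDATA[
```python
from collections import Counter, defaultdict


def average_recall(true_source, cluster_label):
    """Average, over source sequences, of the fraction of a sequence's reads
    that fall in the cluster containing most of them."""
    per_source = defaultdict(Counter)
    for source, cluster in zip(true_source, cluster_label):
        per_source[source][cluster] += 1
    recalls = [max(counts.values()) / sum(counts.values())
               for counts in per_source.values()]
    return sum(recalls) / len(recalls)


# Toy labels, invented for illustration only.
sources = ["S1"] * 4 + ["S2"] * 4
clusters = [0, 0, 0, 1, 1, 1, 0, 0]
print(average_recall(sources, clusters))
```
]]></preformat>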
<p>Several clusterings were produced using the following distance types:
<inline-formula id="IEq19">
<alternatives>
<tex-math id="M67">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M68">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq19.gif"></inline-graphic>
</alternatives>
</inline-formula>
,
<italic>D</italic>
<sub>2</sub>
,
<italic>L</italic>
<sub>2</sub>
,
<italic>KL</italic>
, symmetrized
<italic>KL</italic>
and compared with
<inline-formula id="IEq20">
<alternatives>
<tex-math id="M69">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D_{2}^{*q}$ \end{document}</tex-math>
<mml:math id="M70">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq20.gif"></inline-graphic>
</alternatives>
</inline-formula>
in all its variants, using the expectation formula (1)
<italic>AWP</italic>
or (2)
<italic>AQP</italic>
, with and without quality redistribution (q-red). To reduce the bias due to the random initialization of the centroids, each algorithm was executed 5 times with different random seeds and the clustering with the lowest distortion was chosen.</p>
<p>Table
<xref rid="Tab2" ref-type="table">2</xref>
reports the recall while varying error rates, number of clusters and
<italic>k</italic>
. As expected, for all distances the recall rate decreases with the number of clusters. For traditional distances, if the reads do not contain errors then
<inline-formula id="IEq21">
<alternatives>
<tex-math id="M71">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M72">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq21.gif"></inline-graphic>
</alternatives>
</inline-formula>
performs consistently better than the other measures,
<italic>D</italic>
<sub>2</sub>
,
<italic>L</italic>
<sub>2</sub>
,
<italic>KL</italic>
. When the sequencing process becomes noisier, the
<italic>KL</italic>
distances appear to be less sensitive to sequencing errors. However, if quality information is used,
<inline-formula id="IEq22">
<alternatives>
<tex-math id="M73">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M74">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq22.gif"></inline-graphic>
</alternatives>
</inline-formula>
outperforms all other methods and the advantage grows with the error rate. This confirms that the use of quality values can improve clustering accuracy. When the number of clusters increases, the advantage of
<inline-formula id="IEq23">
<alternatives>
<tex-math id="M75">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M76">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq23.gif"></inline-graphic>
</alternatives>
</inline-formula>
becomes more evident. In these experiments the use of
<italic>AQP</italic>
for expectation within
<inline-formula id="IEq24">
<alternatives>
<tex-math id="M77">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M78">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq24.gif"></inline-graphic>
</alternatives>
</inline-formula>
is more stable and performs better than
<italic>AWP</italic>
. The contribution of quality redistribution (q-red) is limited, although it seems to have some positive effect with the expectation
<italic>AQP</italic>
.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>
<bold>Recall rates of clustering of mRNA simulated reads (10000 reads of length 200) for different measures, error rates, number of clusters and parameter</bold>
<bold>
<italic>k</italic>
</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="4">
<bold>
<italic>k</italic>
</bold>
<bold>=2</bold>
</th>
<th align="left"></th>
<th align="left" colspan="4">
<bold>
<italic>k</italic>
</bold>
<bold>=3</bold>
</th>
</tr>
<tr>
<th align="left"></th>
<th align="left" colspan="4">
<bold>(a)</bold>
</th>
<th align="left"></th>
<th align="left" colspan="4">
<bold>(b)</bold>
</th>
</tr>
<tr>
<th align="left">
<bold>Distance</bold>
</th>
<th align="left">
<bold>No errors</bold>
</th>
<th align="left">
<bold>3</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>5</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>10</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left"></th>
<th align="left">
<bold>No errors</bold>
</th>
<th align="left">
<bold>3</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>5</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>10</bold>
<bold>
<italic>%</italic>
</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left" colspan="4">
<bold>2 clusters</bold>
</td>
<td align="left"></td>
<td align="left" colspan="4">
<bold>2 clusters</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq25">
<alternatives>
<tex-math id="M79">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M80">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq25.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">
<bold>0,815</bold>
</td>
<td align="left">0,813</td>
<td align="left">0,810</td>
<td align="left">0,801</td>
<td align="left"></td>
<td align="left">
<bold>0,822</bold>
</td>
<td align="left">0,819</td>
<td align="left">0,814</td>
<td align="left">0,794</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq26">
<alternatives>
<tex-math id="M81">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M82">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq26.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">
<bold>0,815</bold>
</td>
<td align="left">
<bold>0,815</bold>
</td>
<td align="left">
<bold>0,813</bold>
</td>
<td align="left">
<bold>0,810</bold>
</td>
<td align="left"></td>
<td align="left">
<bold>0,822</bold>
</td>
<td align="left">
<bold>0,822</bold>
</td>
<td align="left">
<bold>0,820</bold>
</td>
<td align="left">
<bold>0,809</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq27">
<alternatives>
<tex-math id="M83">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M84">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq27.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">
<bold>0,815</bold>
</td>
<td align="left">
<bold>0,815</bold>
</td>
<td align="left">
<bold>0,813</bold>
</td>
<td align="left">
<bold>0,810</bold>
</td>
<td align="left"></td>
<td align="left">
<bold>0,822</bold>
</td>
<td align="left">
<bold>0,822</bold>
</td>
<td align="left">
<bold>0,820</bold>
</td>
<td align="left">0,807</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq28">
<alternatives>
<tex-math id="M85">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M86">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq28.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0,809</td>
<td align="left">0,806</td>
<td align="left">0,805</td>
<td align="left">0,802</td>
<td align="left"></td>
<td align="left">0,809</td>
<td align="left">0,807</td>
<td align="left">0,805</td>
<td align="left">0,802</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq29">
<alternatives>
<tex-math id="M87">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M88">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq29.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0,809</td>
<td align="left">0,806</td>
<td align="left">0,805</td>
<td align="left">0,802</td>
<td align="left"></td>
<td align="left">0,809</td>
<td align="left">0,807</td>
<td align="left">0,805</td>
<td align="left">0,802</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0,811</td>
<td align="left">0,807</td>
<td align="left">0,806</td>
<td align="left">0,801</td>
<td align="left"></td>
<td align="left">0,810</td>
<td align="left">0,806</td>
<td align="left">0,805</td>
<td align="left">0,801</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">0,812</td>
<td align="left">0,809</td>
<td align="left">0,807</td>
<td align="left">0,802</td>
<td align="left"></td>
<td align="left">0,812</td>
<td align="left">0,809</td>
<td align="left">0,807</td>
<td align="left">0,802</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0,812</td>
<td align="left">0,809</td>
<td align="left">0,807</td>
<td align="left">0,802</td>
<td align="left"></td>
<td align="left">0,812</td>
<td align="left">0,808</td>
<td align="left">0,806</td>
<td align="left">0,802</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0,811</td>
<td align="left">0,807</td>
<td align="left">0,806</td>
<td align="left">0,801</td>
<td align="left"></td>
<td align="left">0,809</td>
<td align="left">0,806</td>
<td align="left">0,805</td>
<td align="left">0,800</td>
</tr>
<tr>
<td align="left"></td>
<td align="left" colspan="4">
<bold>3 clusters</bold>
</td>
<td align="left"></td>
<td align="left" colspan="4">
<bold>3 clusters</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq30">
<alternatives>
<tex-math id="M89">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M90">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq30.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">
<bold>0,695</bold>
</td>
<td align="left">0,689</td>
<td align="left">0,683</td>
<td align="left">0,662</td>
<td align="left"></td>
<td align="left">
<bold>0,717</bold>
</td>
<td align="left">0,707</td>
<td align="left">0,697</td>
<td align="left">0,668</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq31">
<alternatives>
<tex-math id="M91">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M92">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq31.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">
<bold>0,695</bold>
</td>
<td align="left">
<bold>0,696</bold>
</td>
<td align="left">
<bold>0,696</bold>
</td>
<td align="left">0,689</td>
<td align="left"></td>
<td align="left">
<bold>0,717</bold>
</td>
<td align="left">0,711</td>
<td align="left">
<bold>0,705</bold>
</td>
<td align="left">0,679</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq32">
<alternatives>
<tex-math id="M93">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M94">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq32.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">
<bold>0,695</bold>
</td>
<td align="left">
<bold>0,696</bold>
</td>
<td align="left">
<bold>0,696</bold>
</td>
<td align="left">
<bold>0,691</bold>
</td>
<td align="left"></td>
<td align="left">
<bold>0,717</bold>
</td>
<td align="left">
<bold>0,712</bold>
</td>
<td align="left">0,704</td>
<td align="left">
<bold>0,681</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq33">
<alternatives>
<tex-math id="M95">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M96">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq33.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0,653</td>
<td align="left">0,646</td>
<td align="left">0,646</td>
<td align="left">0,638</td>
<td align="left"></td>
<td align="left">0,668</td>
<td align="left">0,662</td>
<td align="left">0,655</td>
<td align="left">0,646</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq34">
<alternatives>
<tex-math id="M97">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M98">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq34.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0,653</td>
<td align="left">0,646</td>
<td align="left">0,645</td>
<td align="left">0,637</td>
<td align="left"></td>
<td align="left">0,668</td>
<td align="left">0,662</td>
<td align="left">0,655</td>
<td align="left">0,644</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0,682</td>
<td align="left">0,673</td>
<td align="left">0,671</td>
<td align="left">0,657</td>
<td align="left"></td>
<td align="left">0,685</td>
<td align="left">0,677</td>
<td align="left">0,674</td>
<td align="left">0,663</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">0,694</td>
<td align="left">0,687</td>
<td align="left">0,685</td>
<td align="left">0,672</td>
<td align="left"></td>
<td align="left">0,696</td>
<td align="left">0,689</td>
<td align="left">0,687</td>
<td align="left">0,675</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0,693</td>
<td align="left">0,686</td>
<td align="left">0,684</td>
<td align="left">0,669</td>
<td align="left"></td>
<td align="left">0,695</td>
<td align="left">0,688</td>
<td align="left">0,685</td>
<td align="left">0,673</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0,675</td>
<td align="left">0,668</td>
<td align="left">0,662</td>
<td align="left">0,654</td>
<td align="left"></td>
<td align="left">0,675</td>
<td align="left">0,671</td>
<td align="left">0,665</td>
<td align="left">0,655</td>
</tr>
<tr>
<td align="left"></td>
<td align="left" colspan="4">
<bold>4 clusters</bold>
</td>
<td align="left"></td>
<td align="left" colspan="4">
<bold>4 clusters</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq35">
<alternatives>
<tex-math id="M99">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M100">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq35.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">
<bold>0,623</bold>
</td>
<td align="left">0,613</td>
<td align="left">0,606</td>
<td align="left">0,574</td>
<td align="left"></td>
<td align="left">0,627</td>
<td align="left">0,616</td>
<td align="left">0,591</td>
<td align="left">0,551</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq36">
<alternatives>
<tex-math id="M101">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M102">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq36.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">0,622</td>
<td align="left">0,621</td>
<td align="left">0,618</td>
<td align="left">0,602</td>
<td align="left"></td>
<td align="left">
<bold>0,628</bold>
</td>
<td align="left">
<bold>0,617</bold>
</td>
<td align="left">0,602</td>
<td align="left">0,572</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq37">
<alternatives>
<tex-math id="M103">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M104">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq37.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0,622</td>
<td align="left">
<bold>0,622</bold>
</td>
<td align="left">
<bold>0,619</bold>
</td>
<td align="left">
<bold>0,605</bold>
</td>
<td align="left"></td>
<td align="left">
<bold>0,628</bold>
</td>
<td align="left">
<bold>0,617</bold>
</td>
<td align="left">
<bold>0,603</bold>
</td>
<td align="left">
<bold>0,573</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq38">
<alternatives>
<tex-math id="M105">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M106">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq38.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0,580</td>
<td align="left">0,563</td>
<td align="left">0,566</td>
<td align="left">0,535</td>
<td align="left"></td>
<td align="left">0,582</td>
<td align="left">0,571</td>
<td align="left">0,572</td>
<td align="left">0,555</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq39">
<alternatives>
<tex-math id="M107">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M108">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq39.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0,580</td>
<td align="left">0,560</td>
<td align="left">0,565</td>
<td align="left">0,533</td>
<td align="left"></td>
<td align="left">0,582</td>
<td align="left">0,570</td>
<td align="left">0,570</td>
<td align="left">0,555</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0,554</td>
<td align="left">0,551</td>
<td align="left">0,547</td>
<td align="left">0,540</td>
<td align="left"></td>
<td align="left">0,568</td>
<td align="left">0,565</td>
<td align="left">0,553</td>
<td align="left">0,543</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">0,555</td>
<td align="left">0,548</td>
<td align="left">0,545</td>
<td align="left">0,536</td>
<td align="left"></td>
<td align="left">0,566</td>
<td align="left">0,558</td>
<td align="left">0,547</td>
<td align="left">0,537</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0,556</td>
<td align="left">0,549</td>
<td align="left">0,546</td>
<td align="left">0,538</td>
<td align="left"></td>
<td align="left">0,562</td>
<td align="left">0,554</td>
<td align="left">0,547</td>
<td align="left">0,539</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0,553</td>
<td align="left">0,547</td>
<td align="left">0,547</td>
<td align="left">0,538</td>
<td align="left"></td>
<td align="left">0,556</td>
<td align="left">0,549</td>
<td align="left">0,548</td>
<td align="left">0,540</td>
</tr>
<tr>
<td align="left"></td>
<td align="left" colspan="4">
<bold>5 clusters</bold>
</td>
<td align="left"></td>
<td align="left" colspan="4">
<bold>5 clusters</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq40">
<alternatives>
<tex-math id="M109">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M110">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq40.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0,553</td>
<td align="left">0,539</td>
<td align="left">0,532</td>
<td align="left">0,500</td>
<td align="left"></td>
<td align="left">0,560</td>
<td align="left">0,534</td>
<td align="left">0,512</td>
<td align="left">0,462</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq41">
<alternatives>
<tex-math id="M111">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M112">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq41.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">
<bold>0,554</bold>
</td>
<td align="left">
<bold>0,545</bold>
</td>
<td align="left">
<bold>0,551</bold>
</td>
<td align="left">0,532</td>
<td align="left"></td>
<td align="left">0,560</td>
<td align="left">0,544</td>
<td align="left">0,524</td>
<td align="left">
<bold>0,489</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq42">
<alternatives>
<tex-math id="M113">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M114">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq42.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0,553</td>
<td align="left">0,544</td>
<td align="left">0,550</td>
<td align="left">
<bold>0,533</bold>
</td>
<td align="left"></td>
<td align="left">
<bold>0,561</bold>
</td>
<td align="left">
<bold>0,545</bold>
</td>
<td align="left">
<bold>0,531</bold>
</td>
<td align="left">0,487</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq43">
<alternatives>
<tex-math id="M115">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M116">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq43.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0,483</td>
<td align="left">0,475</td>
<td align="left">0,470</td>
<td align="left">0,463</td>
<td align="left"></td>
<td align="left">0,509</td>
<td align="left">0,494</td>
<td align="left">0,485</td>
<td align="left">0,470</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq44">
<alternatives>
<tex-math id="M117">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M118">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq44.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0,483</td>
<td align="left">0,475</td>
<td align="left">0,470</td>
<td align="left">0,461</td>
<td align="left"></td>
<td align="left">0,509</td>
<td align="left">0,494</td>
<td align="left">0,482</td>
<td align="left">0,470</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0,478</td>
<td align="left">0,472</td>
<td align="left">0,465</td>
<td align="left">0,453</td>
<td align="left"></td>
<td align="left">0,500</td>
<td align="left">0,495</td>
<td align="left">0,486</td>
<td align="left">0,465</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">0,498</td>
<td align="left">0,488</td>
<td align="left">0,484</td>
<td align="left">0,468</td>
<td align="left"></td>
<td align="left">0,507</td>
<td align="left">0,501</td>
<td align="left">0,492</td>
<td align="left">0,476</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0,498</td>
<td align="left">0,488</td>
<td align="left">0,484</td>
<td align="left">0,468</td>
<td align="left"></td>
<td align="left">0,507</td>
<td align="left">0,500</td>
<td align="left">0,491</td>
<td align="left">0,474</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0,470</td>
<td align="left">0,464</td>
<td align="left">0,457</td>
<td align="left">0,449</td>
<td align="left"></td>
<td align="left">0,488</td>
<td align="left">0,482</td>
<td align="left">0,476</td>
<td align="left">0,455</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best results are in bold.</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>In a second series of experiments, keeping the experimental setup described above, we test how the number of reads and the type of sequencing error affect the recall rates. Table
<xref rid="Tab3" ref-type="table">3</xref>
shows the recall rates of the different methods as the number of reads and the type of sequencing error vary. The relative performance is similar to that in Table
<xref rid="Tab2" ref-type="table">2</xref>
; however, as the number of reads increases, the advantage of the quality-based measures improves slightly. Among the different types of sequencing errors, deletions seem to cause a more evident drop in recall rates than mismatches and insertions.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>
<bold>Recall rates of clustering of mRNA simulated reads (reads of length 200,</bold>
<bold>
<italic>k</italic>
</bold>
<bold>=2 and 2 clusters) for different measures, different types of errors and number of reads</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">
<bold>Distance</bold>
</th>
<th align="left">
<bold>No errors</bold>
</th>
<th align="left">
<bold>Mismatch = 10%</bold>
</th>
<th align="left">
<bold>Insertion = 10%</bold>
</th>
<th align="left">
<bold>Deletion = 10%</bold>
</th>
</tr>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left"></th>
<th align="left">
<bold>Mismatch = 10%</bold>
</th>
<th align="left">
<bold>Mismatch = 10%</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="5">
<bold>500 reads</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq45">
<alternatives>
<tex-math id="M119">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M120">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq45.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.86445887</td>
<td align="left">0.83981814</td>
<td align="left">0.79073482</td>
<td align="left">0.80640363</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq46">
<alternatives>
<tex-math id="M121">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M122">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq46.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">0.86441326</td>
<td align="left">
<bold>0.86694192</bold>
</td>
<td align="left">
<bold>0.86376933</bold>
</td>
<td align="left">
<bold>0.85925575</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq47">
<alternatives>
<tex-math id="M123">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M124">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq47.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0.86441326</td>
<td align="left">0.86375045</td>
<td align="left">0.85782736</td>
<td align="left">0.85818320</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq48">
<alternatives>
<tex-math id="M125">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M126">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq48.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0.86723257</td>
<td align="left">0.85428665</td>
<td align="left">0.84756397</td>
<td align="left">0.85088665</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq49">
<alternatives>
<tex-math id="M127">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M128">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq49.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0.86723257</td>
<td align="left">0.85613671</td>
<td align="left">0.85305013</td>
<td align="left">0.85504185</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0.86114263</td>
<td align="left">0.85504302</td>
<td align="left">0.85105192</td>
<td align="left">0.85118905</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0.86258900</td>
<td align="left">0.85247832</td>
<td align="left">0.84995366</td>
<td align="left">0.85110380</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">
<bold>0.87235487</bold>
</td>
<td align="left">0.85916040</td>
<td align="left">0.85026923</td>
<td align="left">0.85475077</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0.86712365</td>
<td align="left">0.85695963</td>
<td align="left">0.84730941</td>
<td align="left">0.85418699</td>
</tr>
<tr>
<td align="left" colspan="5">
<bold>1000 reads</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
<sup>*</sup>
</td>
<td align="left">0.86594479</td>
<td align="left">0.83906192</td>
<td align="left">0.78782226</td>
<td align="left">0.80686962</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq50">
<alternatives>
<tex-math id="M129">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M130">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq50.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">0.86599548</td>
<td align="left">
<bold>0.86400152</bold>
</td>
<td align="left">
<bold>0.86423642</bold>
</td>
<td align="left">
<bold>0.85659489</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq51">
<alternatives>
<tex-math id="M131">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M132">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq51.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0.86600096</td>
<td align="left">0.86099042</td>
<td align="left">0.85469494</td>
<td align="left">0.85441545</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq52">
<alternatives>
<tex-math id="M133">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M134">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq52.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0.86790093</td>
<td align="left">0.85433807</td>
<td align="left">0.84230775</td>
<td align="left">0.84839892</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq53">
<alternatives>
<tex-math id="M135">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M136">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq53.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0.86790093</td>
<td align="left">0.85770704</td>
<td align="left">0.85062824</td>
<td align="left">0.85104321</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0.86216987</td>
<td align="left">0.85477261</td>
<td align="left">0.84904670</td>
<td align="left">0.85024936</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0.86058645</td>
<td align="left">0.85312555</td>
<td align="left">0.84767965</td>
<td align="left">0.85043005</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">
<bold>0.87048717</bold>
</td>
<td align="left">0.85667036</td>
<td align="left">0.85002398</td>
<td align="left">0.85088847</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0.86919513</td>
<td align="left">0.85488101</td>
<td align="left">0.84896184</td>
<td align="left">0.84950072</td>
</tr>
<tr>
<td align="left" colspan="5">
<bold>2000 reads</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq54">
<alternatives>
<tex-math id="M137">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M138">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq54.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.86307749</td>
<td align="left">0.83460148</td>
<td align="left">0.78680210</td>
<td align="left">0.81273009</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq55">
<alternatives>
<tex-math id="M139">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M140">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq55.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">0.86306541</td>
<td align="left">
<bold>0.86490821</bold>
</td>
<td align="left">
<bold>0.86432381</bold>
</td>
<td align="left">
<bold>0.85783381</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq56">
<alternatives>
<tex-math id="M141">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M142">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq56.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0.86306541</td>
<td align="left">0.86129411</td>
<td align="left">0.85330127</td>
<td align="left">0.85111236</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq57">
<alternatives>
<tex-math id="M143">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M144">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq57.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0.86305839</td>
<td align="left">0.85432677</td>
<td align="left">0.84295441</td>
<td align="left">0.85043303</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq58">
<alternatives>
<tex-math id="M145">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M146">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq58.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0.86306276</td>
<td align="left">0.85799349</td>
<td align="left">0.84868427</td>
<td align="left">0.85289041</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0.86125521</td>
<td align="left">0.85265296</td>
<td align="left">0.84487856</td>
<td align="left">0.84694314</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0.85971734</td>
<td align="left">0.85283644</td>
<td align="left">0.84325115</td>
<td align="left">0.84899721</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">
<bold>0.86990625</bold>
</td>
<td align="left">0.85621086</td>
<td align="left">0.84559916</td>
<td align="left">0.85108524</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0.86827273</td>
<td align="left">0.85433859</td>
<td align="left">0.84321338</td>
<td align="left">0.85010800</td>
</tr>
<tr>
<td align="left" colspan="5">
<bold>3000 reads</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq59">
<alternatives>
<tex-math id="M147">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M148">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq59.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.86131992</td>
<td align="left">0.83027426</td>
<td align="left">0.79355066</td>
<td align="left">0.81057286</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq60">
<alternatives>
<tex-math id="M149">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M150">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq60.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">0.86134064</td>
<td align="left">
<bold>0.86519721</bold>
</td>
<td align="left">
<bold>0.86235323</bold>
</td>
<td align="left">
<bold>0.85792626</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq61">
<alternatives>
<tex-math id="M151">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M152">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq61.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0.86128705</td>
<td align="left">0.85978356</td>
<td align="left">0.85252267</td>
<td align="left">0.85262847</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq62">
<alternatives>
<tex-math id="M153">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M154">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq62.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0.86477422</td>
<td align="left">0.85334750</td>
<td align="left">0.84374378</td>
<td align="left">0.84947286</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq63">
<alternatives>
<tex-math id="M155">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M156">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq63.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0.86477422</td>
<td align="left">0.85637033</td>
<td align="left">0.84850933</td>
<td align="left">0.85162186</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0.86370337</td>
<td align="left">0.85297951</td>
<td align="left">0.84525794</td>
<td align="left">0.84901375</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0.86242736</td>
<td align="left">0.85271505</td>
<td align="left">0.84384526</td>
<td align="left">0.84832590</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">
<bold>0.86934393</bold>
</td>
<td align="left">0.85488377</td>
<td align="left">0.84531374</td>
<td align="left">0.85014251</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0.86580244</td>
<td align="left">0.85353783</td>
<td align="left">0.84308462</td>
<td align="left">0.84878825</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq64">
<alternatives>
<tex-math id="M157">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M158">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq64.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.86179886</td>
<td align="left">0.83217374</td>
<td align="left">0.79345107</td>
<td align="left">0.80917623</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq65">
<alternatives>
<tex-math id="M159">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M160">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq65.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">0.86166330</td>
<td align="left">
<bold>0.86412834</bold>
</td>
<td align="left">
<bold>0.86385592</bold>
</td>
<td align="left">
<bold>0.86064860</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq66">
<alternatives>
<tex-math id="M161">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M162">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq66.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0.86166519</td>
<td align="left">0.85559541</td>
<td align="left">0.85133437</td>
<td align="left">0.85345570</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq67">
<alternatives>
<tex-math id="M163">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M164">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq67.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0.86317541</td>
<td align="left">0.85224352</td>
<td align="left">0.84168072</td>
<td align="left">0.84837070</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq68">
<alternatives>
<tex-math id="M165">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M166">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq68.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0.86317541</td>
<td align="left">0.85543020</td>
<td align="left">0.84770910</td>
<td align="left">0.85121979</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0.86262435</td>
<td align="left">0.85243814</td>
<td align="left">0.84436053</td>
<td align="left">0.84898583</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0.86122271</td>
<td align="left">0.85167640</td>
<td align="left">0.84308556</td>
<td align="left">0.84801094</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">
<bold>0.86792997</bold>
</td>
<td align="left">0.85473650</td>
<td align="left">0.84431637</td>
<td align="left">0.84985690</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0.86488656</td>
<td align="left">0.85297623</td>
<td align="left">0.84262083</td>
<td align="left">0.84815285</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best results are in bold.</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Future-generation sequencing technologies will produce long reads with a large number of erroneous bases. We therefore study how read length affects these measures. Since the length of the sequences under investigation is limited, we keep the read length at or below 400 bases. In Table
<xref rid="Tab4" ref-type="table">4</xref>
we report some experiments for the setup with 4 clusters and
<italic>k</italic>
=3, while varying the error rate and read length. If we compare these results with Table
<xref rid="Tab2" ref-type="table">2</xref>
, where the read length is 200, we observe a similar behavior. As the error rate increases, the improvement over the other measures remains evident; in particular, the recall advantage of
<inline-formula id="IEq69">
<alternatives>
<tex-math id="M167">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M168">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq69.gif"></inline-graphic>
</alternatives>
</inline-formula>
with the expectation
<italic>AQP</italic>
grows with read length when compared with
<italic>KL</italic>
(up to 9%), and it remains constant when compared with
<inline-formula id="IEq70">
<alternatives>
<tex-math id="M169">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M170">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq70.gif"></inline-graphic>
</alternatives>
</inline-formula>
. Given the trend of sequencing technologies toward longer reads, this behavior is desirable. These results are also confirmed for other setups with larger
<italic>k</italic>
and a higher number of clusters (data not shown).
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>
<bold>Recall rates for clustering of mRNA simulated reads (10000 reads,</bold>
<bold>
<italic>k</italic>
</bold>
<bold>=3, 4 clusters) for different measures, error rates and read lengths</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="4">
<bold>read length = 300</bold>
</th>
<th align="left"></th>
<th align="left" colspan="4">
<bold>read length = 400</bold>
</th>
</tr>
<tr>
<th align="left"></th>
<th align="left" colspan="4">
<bold>(a)</bold>
</th>
<th align="left"></th>
<th align="left" colspan="4">
<bold>(b)</bold>
</th>
</tr>
<tr>
<th align="left">
<bold>Distance</bold>
</th>
<th align="left">
<bold>No errors</bold>
</th>
<th align="left">
<bold>3</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>5</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>10</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left"></th>
<th align="left">
<bold>No errors</bold>
</th>
<th align="left">
<bold>3</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>5</bold>
<bold>
<italic>%</italic>
</bold>
</th>
<th align="left">
<bold>10</bold>
<bold>
<italic>%</italic>
</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left" colspan="4">
<bold>4 clusters</bold>
</td>
<td align="left"></td>
<td align="left" colspan="4">
<bold>4 clusters</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq71">
<alternatives>
<tex-math id="M171">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M172">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq71.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">
<bold>0.680</bold>
</td>
<td align="left">0.667</td>
<td align="left">0.658</td>
<td align="left">0.625</td>
<td align="left"></td>
<td align="left">
<bold>0.713</bold>
</td>
<td align="left">0.700</td>
<td align="left">0.697</td>
<td align="left">0.672</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq72">
<alternatives>
<tex-math id="M173">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M174">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq72.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
</td>
<td align="left">
<bold>0.680</bold>
</td>
<td align="left">
<bold>0.672</bold>
</td>
<td align="left">
<bold>0.673</bold>
</td>
<td align="left">
<bold>0.650</bold>
</td>
<td align="left"></td>
<td align="left">
<bold>0.713</bold>
</td>
<td align="left">
<bold>0.712</bold>
</td>
<td align="left">0.710</td>
<td align="left">0.693</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq73">
<alternatives>
<tex-math id="M175">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M176">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq73.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">
<bold>0.680</bold>
</td>
<td align="left">0.671</td>
<td align="left">
<bold>0.673</bold>
</td>
<td align="left">
<bold>0.650</bold>
</td>
<td align="left"></td>
<td align="left">
<bold>0.713</bold>
</td>
<td align="left">0.711</td>
<td align="left">
<bold>0.711</bold>
</td>
<td align="left">
<bold>0.694</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq74">
<alternatives>
<tex-math id="M177">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M178">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq74.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
</td>
<td align="left">0.616</td>
<td align="left">0.610</td>
<td align="left">0.608</td>
<td align="left">0.601</td>
<td align="left"></td>
<td align="left">0.643</td>
<td align="left">0.636</td>
<td align="left">0.632</td>
<td align="left">0.623</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq75">
<alternatives>
<tex-math id="M179">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M180">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq75.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">0.616</td>
<td align="left">0.610</td>
<td align="left">0.607</td>
<td align="left">0.602</td>
<td align="left"></td>
<td align="left">0.643</td>
<td align="left">0.635</td>
<td align="left">0.631</td>
<td align="left">0.622</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0.610</td>
<td align="left">0.600</td>
<td align="left">0.602</td>
<td align="left">0.581</td>
<td align="left"></td>
<td align="left">0.638</td>
<td align="left">0.630</td>
<td align="left">0.624</td>
<td align="left">0.614</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">0.617</td>
<td align="left">0.604</td>
<td align="left">0.601</td>
<td align="left">0.577</td>
<td align="left"></td>
<td align="left">0.649</td>
<td align="left">0.632</td>
<td align="left">0.628</td>
<td align="left">0.618</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">0.613</td>
<td align="left">0.603</td>
<td align="left">0.599</td>
<td align="left">0.576</td>
<td align="left"></td>
<td align="left">0.647</td>
<td align="left">0.632</td>
<td align="left">0.627</td>
<td align="left">0.616</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0.601</td>
<td align="left">0.593</td>
<td align="left">0.588</td>
<td align="left">0.575</td>
<td align="left"></td>
<td align="left">0.626</td>
<td align="left">0.618</td>
<td align="left">0.615</td>
<td align="left">0.604</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Best results are in bold.</p>
</table-wrap-foot>
</table-wrap>
</p>
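<p>The comparison between measures discussed above can be made explicit by computing per-error-rate recall gaps directly from Table 4. The following is a minimal sketch, not part of QCluster: the dictionary layout and the helper name <italic>recall_gap</italic> are illustrative assumptions, and only three rows of the table are transcribed.</p>
<preformat>
```python
# Recall values transcribed from Table 4; list order follows the error
# rates "no errors", 3%, 5%, 10%. The layout is an illustrative choice.
RECALL = {
    300: {
        "D2*": [0.680, 0.667, 0.658, 0.625],
        "D2*q AQP": [0.680, 0.672, 0.673, 0.650],
        "KL": [0.617, 0.604, 0.601, 0.577],
    },
    400: {
        "D2*": [0.713, 0.700, 0.697, 0.672],
        "D2*q AQP": [0.713, 0.712, 0.710, 0.693],
        "KL": [0.649, 0.632, 0.628, 0.618],
    },
}


def recall_gap(read_length, baseline, measure="D2*q AQP"):
    """Absolute recall difference (measure minus baseline) per error rate."""
    rows = RECALL[read_length]
    return [round(m - b, 3) for m, b in zip(rows[measure], rows[baseline])]


if __name__ == "__main__":
    for length in sorted(RECALL):
        print(length, "vs KL: ", recall_gap(length, "KL"))
        print(length, "vs D2*:", recall_gap(length, "D2*"))
```
</preformat>
<p>The gaps returned here are absolute differences in recall, not relative improvements.</p>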
<sec id="Sec9">
<title>Boosting assembly</title>
<p>Assembly is one of the most challenging computational problems in the field of NGS data. It is a very time-consuming process with highly variable outcomes for different datasets [
<xref ref-type="bibr" rid="CR34">34</xref>
]. Currently, large datasets can only be assembled on high-performance computing systems with considerable CPU and memory resources. Clustering has been used as a preprocessing step, prior to assembly, to reduce memory requirements and to improve the quality of the assembled contigs [
<xref ref-type="bibr" rid="CR21">21</xref>
,
<xref ref-type="bibr" rid="CR22">22</xref>
]. Here we test whether the quality of the assembly of real read data can be improved with clustering. For the assembly component we use Velvet [
<xref ref-type="bibr" rid="CR35">35</xref>
], one of the most popular assembly tools for NGS data. We study two genomes:
<italic>Helicobacter pylori</italic>
and
<italic>Zymomonas mobilis</italic>
. We download the read datasets
<italic>SRR023794</italic>
and
<italic>SRR017901</italic>
, of about 117 and 23.5 Mbases respectively, corresponding to 10× coverage. We apply the clustering algorithms, with
<italic>k</italic>
=3, and divide the read datasets into two and three clusters. Then we produce an assembly, as a set of contigs, for each cluster using Velvet and merge the generated contigs. In order to evaluate the clustering quality, we compare this merged set with the assembly obtained without clustering from the whole set of reads. Commonly used metrics such as number of contigs,
<italic>N</italic>
50 and percentage of mapped contigs are presented in Tables
<xref rid="Tab5" ref-type="table">5</xref>
and
<xref rid="Tab6" ref-type="table">6</xref>
. When merging contigs from different clusters, some contigs may be very similar or may cover the same region of the genome, which can artificially inflate these values. We therefore also compute a less biased measure: the fraction of the genome that is covered by the contigs (last column).
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>
<bold>Comparison of assembly with and without clustering preprocessing (</bold>
<bold>
<italic>k</italic>
</bold>
<bold>=3, 2 clusters)</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">
<bold>Distance</bold>
</th>
<th align="left">
<bold>Mapped contigs</bold>
</th>
<th align="left">
<bold>N50</bold>
</th>
<th align="left">
<bold>Number of contigs</bold>
</th>
<th align="left">
<bold>Genome coverage</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">No Clustering</td>
<td align="left">93.55%</td>
<td align="left">112</td>
<td align="left">22823</td>
<td align="left">0.828</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq76">
<alternatives>
<tex-math id="M181">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M182">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq76.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">94.13%</td>
<td align="left">
<bold>141</bold>
</td>
<td align="left">
<bold>29421</bold>
</td>
<td align="left">
<bold>0.920</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq77">
<alternatives>
<tex-math id="M183">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M184">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq77.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">93.97%</td>
<td align="left">138</td>
<td align="left">28701</td>
<td align="left">0.914</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">94.24%</td>
<td align="left">135</td>
<td align="left">28297</td>
<td align="left">0.904</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">94.19%</td>
<td align="left">135</td>
<td align="left">28171</td>
<td align="left">0.903</td>
</tr>
<tr>
<td align="left">Symm, KL</td>
<td align="left">94.27%</td>
<td align="left">134</td>
<td align="left">27999</td>
<td align="left">0.902</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">
<bold>94.33%</bold>
</td>
<td align="left">134</td>
<td align="left">28019</td>
<td align="left">0.903</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The assembly with Velvet is evaluated in terms of mapped contigs, N50, number of contigs and genome coverage. The dataset used is SRR017901 (23.5 Mbases, 10× coverage), which contains reads of
<italic>Zymomonas mobilis</italic>
. Best results are in bold.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab6">
<label>Table 6</label>
<caption>
<p>
<bold>Comparison of assembly with and without clustering preprocessing (</bold>
<bold>
<italic>k</italic>
</bold>
<bold>=3, 3 clusters)</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">
<bold>Distance</bold>
</th>
<th align="left">
<bold>Mapped contigs</bold>
</th>
<th align="left">
<bold>N50</bold>
</th>
<th align="left">
<bold>Number of contigs</bold>
</th>
<th align="left">
<bold>Genome coverage</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">No Clustering</td>
<td align="left">96.97%</td>
<td align="left">122</td>
<td align="left">16724</td>
<td align="left">0.729</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq78">
<alternatives>
<tex-math id="M185">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M186">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq78.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">
<bold>98.49%</bold>
</td>
<td align="left">175</td>
<td align="left">
<bold>41086</bold>
</td>
<td align="left">
<bold>0.994</bold>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq79">
<alternatives>
<tex-math id="M187">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M188">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq79.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">98.38%</td>
<td align="left">174</td>
<td align="left">40156</td>
<td align="left">
<bold>0.994</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">98.16%</td>
<td align="left">175</td>
<td align="left">36798</td>
<td align="left">0.986</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">98.28%</td>
<td align="left">178</td>
<td align="left">37717</td>
<td align="left">0.990</td>
</tr>
<tr>
<td align="left">Simm, KL</td>
<td align="left">98.30%</td>
<td align="left">182</td>
<td align="left">37217</td>
<td align="left">0.990</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">98.22%</td>
<td align="left">
<bold>186</bold>
</td>
<td align="left">34866</td>
<td align="left">0.987</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The assembly produced by Velvet is evaluated in terms of mapped contigs, N50, number of contigs and genome coverage. The dataset used is SRR023794 (117 Mbases), which contains reads of
<italic>Helicobacter pylori</italic>. Best results are in bold.</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>In this set of experiments, introducing clustering as a preprocessing step increases both the number of contigs and the N50. More importantly, the genome coverage increases by 10% with respect to the assembly without clustering. The relative performance of the distance measures is very similar to that observed on simulated data. In fact,
<inline-formula id="IEq80">
<alternatives>
<tex-math id="M189">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M190">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq80.gif"></inline-graphic>
</alternatives>
</inline-formula>
with expectation
<italic>AQP</italic>
and quality redistribution is again the best performing measure. More experiments are needed to establish that assembly benefits from clustering as a preprocessing step. However, these preliminary tests show that, at least for some configurations, a 10% improvement in genome coverage can be obtained. The time required to perform the above experiments is in general less than a minute on a modern laptop with an Intel i7 processor and 8 GB of RAM. The introduction of quality values typically increases the running time by 4% compared to standard alignment-free methods.</p>
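<p>As an illustration of two of the assembly metrics reported above, the following Python sketch computes the N50 of a set of contigs and the fraction of a genome covered by mapped contigs. It is a minimal example under stated assumptions, not the evaluation pipeline used in the experiments: the function names, the half-open interval convention and the toy inputs are hypothetical, and how coverage was actually measured is not specified here.</p>

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that the contigs of length at
    least L together contain at least half of the assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no contigs


def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by half-open (start, end) intervals.

    Assumption of this sketch: each mapped contig is summarised by one
    interval on the reference, and overlaps are counted only once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


if __name__ == "__main__":
    print(n50([100, 200, 300, 400]))
    print(covered_fraction([(0, 50), (25, 75), (90, 100)], 100))
```

<p>With the toy inputs above, the four contigs total 1000 bases and the two longest ones already reach half of that, so the N50 is 300; the three intervals cover 85 of 100 positions.</p>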
</sec>
<sec id="Sec10">
<title>Clustering metagenomic reads</title>
<p>Another application in which clustering techniques might help is the classification of metagenomic reads. Modern sequencing machines can sequence several genomes at the same time; more precisely, the input can be a microbiome community composed of thousands of different organisms. If the reference genomes are not available, or if not all the organisms being sequenced are known, clustering techniques can be used to group together reads with the same word distribution, which presumably come from the same genome. To test our quality-based measures on this challenging task we devised a simple preliminary test. We consider the reads of the following four organisms:
<italic>Helicobacter pylori</italic>
(
<italic>SRR023794</italic>
),
<italic>Zymomonas mobilis</italic>
(
<italic>SRR017901</italic>
),
<italic>E. coli</italic>
(
<italic>FXAWNEV04</italic>
) and
<italic>Legionella pneumophila</italic>
(
<italic>ERR164429</italic>
). These datasets contain reads between 150 and 350 bases long. We created a single mixture of reads by sampling the same number of reads from each organism. We then tested how well clustering techniques can recover the original taxonomy of each genome in this artificial dataset. In Table
<xref rid="Tab7" ref-type="table">7</xref>
we report the recall rates for different alignment-free measures. Surprisingly, without knowing any reference genome, we can correctly classify about 80% of the reads. Again, quality-based methods have a small advantage over traditional alignment-free techniques. This is just a preliminary test; however, we believe that the classification of metagenomic reads with alignment-free methods deserves further investigation.
<table-wrap id="Tab7">
<label>Table 7</label>
<caption>
<p>
<bold>Metagenomic reads classification of</bold>
<bold>
<italic>Helicobacter pylori</italic>
</bold>
<bold> (</bold>
<bold>
<italic>SRR023794</italic>
</bold>
<bold>),</bold>
<bold>
<italic>Zymomonas mobilis</italic>
</bold>
<bold> (</bold>
<bold>
<italic>SRR017901</italic>
</bold>
<bold>),</bold>
<bold>
<italic>E. coli</italic>
</bold>
<bold> (</bold>
<bold>
<italic>FXAWNEV04</italic>
</bold>
<bold>) and</bold>
<bold>
<italic>Legionella pneumophila</italic>
</bold>
<bold> (</bold>
<bold>
<italic>ERR164429</italic>
</bold>
<bold>)</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">
<bold>Distance</bold>
</th>
<th align="left">
<bold>4 clusters</bold>
</th>
<th align="left">
<bold>3 clusters</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<inline-formula id="IEq81">
<alternatives>
<tex-math id="M191">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*}_{2}$ \end{document}</tex-math>
<mml:math id="M192">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq81.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.79782297</td>
<td align="left">0.79129356</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq82">
<alternatives>
<tex-math id="M193">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M194">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq82.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AQP</italic>
q-red</td>
<td align="left">0.79775189</td>
<td align="left">0.76920676</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq83">
<alternatives>
<tex-math id="M195">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M196">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq83.gif"></inline-graphic>
</alternatives>
</inline-formula>
<italic>AWP</italic>
q-red</td>
<td align="left">
<bold>0.80050234</bold>
</td>
<td align="left">
<bold>0.82603989</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<sub>2</sub>
</td>
<td align="left">0.64335292</td>
<td align="left">0.73455525</td>
</tr>
<tr>
<td align="left">KL</td>
<td align="left">0.78663484</td>
<td align="left">0.80525234</td>
</tr>
<tr>
<td align="left">Simm, KL</td>
<td align="left">0.77196713</td>
<td align="left">0.79216786</td>
</tr>
<tr>
<td align="left">
<italic>D</italic>
<sub>2</sub>
</td>
<td align="left">0.73917085</td>
<td align="left">0.77062424</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The recall rates for different measures with
<italic>k</italic>
= 4, using 3 and 4 clusters. Best results are in bold.</p>
</table-wrap-foot>
</table-wrap>
</p>
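<p>To make the notion of "reads with the same word distribution" concrete, the following Python sketch builds the k-mer count profile of a read and compares two profiles with the classical D2 statistic, that is, the inner product of the two count vectors. It is only an illustration: the function names and the toy reads are hypothetical, no normalisation by expected counts is applied (as the starred variants would require), and it is not the QCluster implementation.</p>

```python
from collections import Counter


def kmer_counts(read, k=4):
    """Count every overlapping k-mer of a read.

    Words containing symbols other than A, C, G, T (for example N)
    are skipped; this filtering is a choice made for this sketch.
    """
    counts = Counter()
    for i in range(len(read) - k + 1):
        word = read[i:i + k]
        if set(word).issubset("ACGT"):
            counts[word] += 1
    return counts


def d2(counts_x, counts_y):
    """Classical D2: inner product of two k-mer count vectors."""
    # A Counter returns 0 for words it has never seen.
    return sum(n * counts_y[word] for word, n in counts_x.items())


if __name__ == "__main__":
    x = kmer_counts("ACGTACGT")
    y = kmer_counts("ACGTTTTT")
    print(d2(x, x), d2(x, y))
```

<p>A clustering procedure would then assign each read to the cluster whose profile is most similar under the chosen measure; larger D2 values indicate more shared words.</p>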
</sec>
</sec>
<sec id="Sec11" sec-type="conclusion">
<title>Conclusions</title>
<p>The comparison of reads with quality values is essential in many genome projects. Moreover, the importance of quality values will increase in the near future with the advent of new sequencing technologies, which promise to produce long reads but with error rates of up to 15%. In this paper we presented a family of alignment-free measures, called
<italic>D</italic>
<sup>
<italic>q</italic>
</sup>
-type, that incorporate quality value information and
<italic>k</italic>
-mer counts for the comparison of reads data. A set of experiments on simulated and real reads data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. If quality information is used,
<inline-formula id="IEq84">
<alternatives>
<tex-math id="M197">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $D^{*q}_{2}$ \end{document}</tex-math>
<mml:math id="M198">
<mml:msubsup>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="13015_2014_29_Article_IEq84.gif"></inline-graphic>
</alternatives>
</inline-formula>
outperforms all other methods and the advantage grows with the error rate and with the length of reads. This confirms that the use of quality values can improve clustering accuracy.</p>
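<p>The idea of combining quality values with k-mer counts can be sketched as follows. The Python example converts Phred scores into per-base correctness probabilities using the standard relation P(error) = 10^(-Q/10), and then replaces each integer k-mer count with a sum of occurrence weights. The weighting used here, the product of the correctness probabilities of the k bases of an occurrence, is an assumption of this sketch, as are the function names; the precise definitions are those given in the Methods and may differ in detail.</p>

```python
from collections import defaultdict


def phred_to_correct_prob(q):
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - 10.0 ** (-q / 10.0)


def quality_weighted_counts(read, quals, k=4):
    """Quality-weighted k-mer counts of a single read.

    Assumption of this sketch: each occurrence of a k-mer contributes
    the product of the correctness probabilities of its k bases,
    instead of contributing 1 as in a plain count.
    """
    probs = [phred_to_correct_prob(q) for q in quals]
    counts = defaultdict(float)
    for i in range(len(read) - k + 1):
        weight = 1.0
        for p in probs[i:i + k]:
            weight *= p
        counts[read[i:i + k]] += weight
    return dict(counts)


if __name__ == "__main__":
    # Five bases, all with Phred score 10 (90% chance of being correct).
    print(quality_weighted_counts("ACGTA", [10, 10, 10, 10, 10]))
```

<p>With high quality scores the weights approach 1 and the profile approaches the plain k-mer counts, while occurrences that overlap low-quality bases contribute less.</p>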
<p>Furthermore, preliminary experiments on real reads data show that the quality of assembly can be improved by using clustering as a preprocessing step. Metagenomic reads classification can also be addressed with these statistics, especially when the reference genomes are unknown. All these measures are implemented in a software tool called QCluster. As future work we plan to investigate other applications, such as genome diversity estimation and metagenome assembly, in which the impact of reads clustering might be substantial.</p>
</sec>
<sec id="Sec12">
<title>Endnote</title>
<p>
<sup>a</sup>
A preliminary version of this work was presented at WABI 2014 [
<xref ref-type="bibr" rid="CR36">36</xref>
].</p>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>
<bold>Competing interests</bold>
</p>
<p>The authors declare that they have no competing interests.</p>
</fn>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>M. Comin conceived the study; M. Schimd and A. Leoni wrote and tested computer programs for clustering reads data. All authors drafted and approved the manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>M. Comin was partially supported by the Ateneo Project CPDA110239 and by the P.R.I.N. Project 20122F87B2.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medini</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Serruto</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Parkhill</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Relman</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Donati</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Moxon</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Microbiology in the post-genomic era</article-title>
<source>Nat Rev Microbiol.</source>
<year>2008</year>
<volume>6</volume>
<fpage>419</fpage>
<lpage>30</lpage>
<pub-id pub-id-type="pmid">18475305</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jothi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Cuddapah</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Barski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data</article-title>
<source>Nucleic Acids Res.</source>
<year>2008</year>
<volume>36</volume>
<fpage>5221</fpage>
<lpage>31</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn488</pub-id>
<pub-id pub-id-type="pmid">18684996</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol.</source>
<year>1990</year>
<volume>215</volume>
<fpage>403</fpage>
<lpage>10</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sims</surname>
<given-names>GE</given-names>
</name>
<name>
<surname>Jun</surname>
<given-names>S-R</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>GA</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>S-H</given-names>
</name>
</person-group>
<article-title>Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions</article-title>
<source>Proc Nat Acad Sci.</source>
<year>2009</year>
<volume>106</volume>
<fpage>2677</fpage>
<lpage>82</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0813249106</pub-id>
<pub-id pub-id-type="pmid">19188606</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<mixed-citation publication-type="other">Comin M, Verzotto D. Whole-genome phylogeny by virtue of unic subwords. In: 23rd international workshop on Database and EXpert systems Applications (DEXA 2012): 2012. p. 190–194.</mixed-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Verzotto</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Alignment-free phylogeny of whole genomes using underlying subwords</article-title>
<source>Algorithms Mol Biol.</source>
<year>2012</year>
<volume>7</volume>
<issue>1</issue>
<fpage>34</fpage>
<pub-id pub-id-type="doi">10.1186/1748-7188-7-34</pub-id>
<pub-id pub-id-type="pmid">23216990</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison based on next-generation sequencing reads</article-title>
<source>J Comput Biol.</source>
<year>2013</year>
<volume>20</volume>
<issue>2</issue>
<fpage>64</fpage>
<lpage>79</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2012.0228</pub-id>
<pub-id pub-id-type="pmid">23383994</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schimd</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns</article-title>
<source>BMC Bioinformatics.</source>
<year>2014</year>
<volume>15</volume>
<issue>Suppl 9</issue>
<fpage>1</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-15-S9-S1</pub-id>
<pub-id pub-id-type="pmid">24383880</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison–a review</article-title>
<source>Bioinformatics.</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dai</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'</article-title>
<source>BMC Bioinformatics.</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>394</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-394</pub-id>
<pub-id pub-id-type="pmid">18811946</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gao</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Whole genome molecular phylogeny of large dsDNA viruses using composition vector method</article-title>
<source>BMC Evol Biol.</source>
<year>2007</year>
<volume>7</volume>
<issue>1</issue>
<fpage>41</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2148-7-41</pub-id>
<pub-id pub-id-type="pmid">17359548</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Hao</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>CVTree: a phylogenetic tree reconstruction tool based on whole genomes</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<issue>suppl 2</issue>
<fpage>45</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh362</pub-id>
<pub-id pub-id-type="pmid">14704342</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Göke</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schulz</surname>
<given-names>MH</given-names>
</name>
<name>
<surname>Lasserre</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Vingron</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts</article-title>
<source>Bioinformatics.</source>
<year>2012</year>
<volume>28</volume>
<issue>5</issue>
<fpage>656</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts028</pub-id>
<pub-id pub-id-type="pmid">22247280</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kantorovitz</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Robinson</surname>
<given-names>GE</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A statistical method for alignment-free comparison of regulatory sequences</article-title>
<source>Bioinformatics.</source>
<year>2007</year>
<volume>23</volume>
<issue>13</issue>
<fpage>249</fpage>
<lpage>55</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm211</pub-id>
<pub-id pub-id-type="pmid">17032675</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Verzotto</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Beyond fixed-resolution alignment-free measures for mammalian enhancers sequence comparison</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinformatics.</source>
<year>2014</year>
<volume>11</volume>
<issue>4</issue>
<fpage>628</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2014.2306830</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonello</surname>
<given-names>M</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Ngom</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Formenti</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hao</surname>
<given-names>J-K</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>X-M</given-names>
</name>
<name>
<surname>van Laarhoven</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Fast computation of entropic profiles for the detection of conservation in genomes</article-title>
<source>Pattern recognition in Bioinformatics. vol. 7986,</source>
<year>2013</year>
<publisher-loc>Berlin Heidelberg</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonello</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Fast entropic profiler: An information theoretic approach for the discovery of patterns in genomes</article-title>
<source>Comput Biol Bioinform IEEE/ACM Trans.</source>
<year>2014</year>
<volume>11</volume>
<issue>3</issue>
<fpage>500</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2013.2297924</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Verzotto</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Classification of protein sequences by means of irredundant patterns</article-title>
<source>BMC bioinformatics.</source>
<year>2010</year>
<volume>11</volume>
<issue>Suppl 1</issue>
<fpage>16</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-11-S1-S16</pub-id>
<pub-id pub-id-type="pmid">20064218</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Verzotto</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>The irredundant class method for remote homology detection of protein sequences</article-title>
<source>J Comput Biol.</source>
<year>2011</year>
<volume>18</volume>
<issue>12</issue>
<fpage>1819</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2010.0171</pub-id>
<pub-id pub-id-type="pmid">21548811</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qu</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Hashimoto</surname>
<given-names>S-i</given-names>
</name>
<name>
<surname>Morishita</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing</article-title>
<source>Genome Res.</source>
<year>2009</year>
<volume>19</volume>
<issue>7</issue>
<fpage>1309</fpage>
<lpage>15</lpage>
<pub-id pub-id-type="doi">10.1101/gr.089151.108</pub-id>
<pub-id pub-id-type="pmid">19439514</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bao</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Kaloshian</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Girke</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Seed: efficient clustering of next-generation sequences</article-title>
<source>Bioinformatics.</source>
<year>2011</year>
<volume>27</volume>
<issue>18</issue>
<fpage>2502</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">21810899</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Solovyov</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lipkin</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Centroid based clustering of high throughput sequencing reads based on n-mer counts</article-title>
<source>BMC Bioinformatics.</source>
<year>2013</year>
<volume>14</volume>
<issue>1</issue>
<fpage>268</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-14-268</pub-id>
<pub-id pub-id-type="pmid">24011402</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Mapping short DNA sequencing reads and calling variants using mapping quality scores</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<issue>11</issue>
<fpage>1851</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1101/gr.078212.108</pub-id>
<pub-id pub-id-type="pmid">18714091</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Albers</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Lunter</surname>
<given-names>G</given-names>
</name>
<name>
<surname>MacArthur</surname>
<given-names>DG</given-names>
</name>
<name>
<surname>McVean</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Ouwehand</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Dindel: accurate indel calls from short-read data</article-title>
<source>Genome Res.</source>
<year>2011</year>
<volume>21</volume>
<issue>6</issue>
<fpage>961</fpage>
<lpage>73</lpage>
<pub-id pub-id-type="doi">10.1101/gr.112326.110</pub-id>
<pub-id pub-id-type="pmid">20980555</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carneiro</surname>
<given-names>MO</given-names>
</name>
<name>
<surname>Russ</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Ross</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Gabriel</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Nusbaum</surname>
<given-names>C</given-names>
</name>
<name>
<surname>DePristo</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>Pacific biosciences sequencing technology for genotyping and variation discovery in human data</article-title>
<source>BMC Genomics.</source>
<year>2012</year>
<volume>13</volume>
<issue>1</issue>
<fpage>375</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-13-375</pub-id>
<pub-id pub-id-type="pmid">22863213</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blaisdell</surname>
<given-names>BE</given-names>
</name>
</person-group>
<article-title>A measure of the similarity of sets of sequences not requiring sequence alignment</article-title>
<source>Proc Natl Acad Sci.</source>
<year>1986</year>
<volume>83</volume>
<issue>14</issue>
<fpage>5155</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.83.14.5155</pub-id>
<pub-id pub-id-type="pmid">3460087</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lippert</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Distributional regimes for the number of k-word matches between two random sequences</article-title>
<source>Proc Natl Acad Sci.</source>
<year>2002</year>
<volume>99</volume>
<issue>22</issue>
<fpage>13980</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.202468099</pub-id>
<pub-id pub-id-type="pmid">12374863</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reinert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chew</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison (i): statistics and power</article-title>
<source>J Comput Biol.</source>
<year>2009</year>
<volume>16</volume>
<issue>12</issue>
<fpage>1615</fpage>
<lpage>34</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2009.0198</pub-id>
<pub-id pub-id-type="pmid">20001252</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison (ii): theoretical power of comparison statistics</article-title>
<source>J Comput Biol.</source>
<year>2010</year>
<volume>17</volume>
<issue>11</issue>
<fpage>1467</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2010.0056</pub-id>
<pub-id pub-id-type="pmid">20973742</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ewing</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Base-calling of automated sequencer traces using phred. II. Error probabilities</article-title>
<source>Genome Res.</source>
<year>1998</year>
<volume>8</volume>
<issue>3</issue>
<fpage>186</fpage>
<lpage>94</lpage>
<pub-id pub-id-type="doi">10.1101/gr.8.3.186</pub-id>
<pub-id pub-id-type="pmid">9521922</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31</label>
<mixed-citation publication-type="other">NCBI dataset of human mRNA genes.
<ext-link ext-link-type="uri" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/">ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/</ext-link>
.</mixed-citation>
</ref>
<ref id="CR32">
<label>32</label>
<mixed-citation publication-type="other">Mason.
<ext-link ext-link-type="uri" xlink:href="http://seqan.de/projects/mason.html">http://seqan.de/projects/mason.html</ext-link>
.</mixed-citation>
</ref>
<ref id="CR33">
<label>33</label>
<mixed-citation publication-type="other">Holtgrewe M. Mason: a read simulator for second generation sequencing data. Technical Report FU Berlin. 2010. TR-B-10-06.</mixed-citation>
</ref>
<ref id="CR34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Assemblies: the good, the bad, the ugly</article-title>
<source>Nat Methods.</source>
<year>2011</year>
<volume>8</volume>
<issue>1</issue>
<fpage>59</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth0111-59</pub-id>
<pub-id pub-id-type="pmid">21191376</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res.</source>
<year>2008</year>
<volume>18</volume>
<issue>5</issue>
<fpage>821</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Leoni</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Schimd</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>QCluster: extending alignment-free measures with quality values for reads clustering</article-title>
<source>Algorithms Bioinform Lect Notes Comput Sci.</source>
<year>2014</year>
<volume>8701</volume>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1007/978-3-662-44753-6_1</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000244 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000244 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4331138
   |texte=   Clustering of reads with alignment-free measures and quality values
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:25691913" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021